Concepts

Open Source Components

Filebeat

Positioning: Lightweight log collector
Description: An open-source log collection component installed on container nodes, responsible for real-time monitoring of log files at specified paths. It collects log data through input modules, processes it, and forwards the logs to Kafka or directly delivers them to storage components via output modules. It supports capabilities such as multiline log aggregation and field filtering for preprocessing.
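
To illustrate the idea behind multiline log aggregation, the following Python sketch merges continuation lines into the preceding event. The function name and the continuation rule (a line starting with whitespace belongs to the previous event, as in a stack trace) are assumptions made for this example. Filebeat itself is configured declaratively and does not expose this code.

```python
from typing import Iterable, List


def aggregate_multiline(lines: Iterable[str]) -> List[str]:
    """Merge continuation lines into the preceding log event.

    Assumption for this sketch: a line that starts with whitespace
    continues the previous event (typical for stack traces).
    """
    events: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if events and line[:1].isspace():
            # Continuation line: append it to the last event.
            events[-1] += "\n" + line
        else:
            # Any other line starts a new event.
            events.append(line)
    return events


raw = [
    "2023-10-01 12:00:00 ERROR request failed",
    "  at handler.process(handler.py:42)",
    "  at server.run(server.py:10)",
    "2023-10-01 12:00:01 INFO request ok",
]
events = aggregate_multiline(raw)
print(len(events))  # two events: the error with its trace, and the info line
```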

Elasticsearch

Positioning: Distributed search and analytics engine
Description: A full-text search engine based on Lucene, storing log data in JSON document format, and providing near real-time search capabilities. It supports dynamic mapping for automatic field type recognition and achieves fast keyword searches through inverted indexing, suitable for log searches and monitoring alerts.
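
The inverted index mentioned above can be pictured as a map from each term to the set of documents that contain it. The sketch below is a simplified Python illustration, assuming whitespace tokenization and lowercase normalization. Lucene's real analyzers and on-disk structures are far more elaborate, and the function names here are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_inverted_index(docs: Dict[int, str]) -> Dict[str, Set[int]]:
    """Map each lowercase token to the IDs of the documents containing it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def search(index: Dict[str, Set[int]], keyword: str) -> List[int]:
    """Return the sorted IDs of documents containing the keyword."""
    return sorted(index.get(keyword.lower(), set()))


docs = {
    1: "connection timeout on node-a",
    2: "disk full on node-b",
    3: "connection reset by peer",
}
index = build_inverted_index(docs)
print(search(index, "connection"))  # [1, 3]
```

A keyword lookup touches only the entry for that term instead of scanning every document, which is why keyword searches stay fast as the data grows.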

ClickHouse

Positioning: Columnar analytical database
Description: A high-performance columnar database designed for OLAP scenarios that uses the MergeTree engine to store log data at PB scale. It supports high-speed aggregation queries, time-based partitioning, data TTL policies, and S3-backed storage, making it suitable for log analysis and statistical reporting in batch computation scenarios.

Kafka

Positioning: Distributed message queue
Description: Serves as the messaging middleware of the log pipeline and provides high-throughput log buffering. It receives log data sent by Filebeat through Topics. When the Elasticsearch cluster hits a processing bottleneck, this buffer smooths traffic peaks and allows asynchronous consumption, keeping the log collection side stable.
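
The buffering effect can be shown with a toy simulation: logs arrive in a burst, the downstream store drains a fixed number per tick, and the queue absorbs the difference. This is a minimal Python sketch with hypothetical names and an in-memory deque standing in for a Topic. It does not use Kafka or model its delivery semantics.

```python
from collections import deque
from typing import List


def simulate_buffer(arrivals: List[int], drain_rate: int) -> List[int]:
    """Return the backlog size after each tick.

    arrivals[i] is the number of log events produced in tick i;
    drain_rate is how many events the consumer can store per tick.
    """
    backlog: deque = deque()
    history: List[int] = []
    for count in arrivals:
        backlog.extend(range(count))  # producer appends without waiting
        for _ in range(min(drain_rate, len(backlog))):
            backlog.popleft()  # consumer drains at its own pace
        history.append(len(backlog))
    return history


print(simulate_buffer([5, 0, 0], drain_rate=2))  # [3, 1, 0]
```

The producer never blocks during the burst, and the backlog returns to zero once the consumer catches up.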

Core Functionality Concepts

Log Collection Pipeline

Description: The complete path from log data generation to storage, comprising four stages: Collection → Transmission → Buffering → Storage. The Buffering stage applies only in Buffer Mode. Two pipeline modes are supported:

  • Direct Write Mode: Filebeat → Elasticsearch/ClickHouse
  • Buffer Mode: Filebeat → Kafka → Elasticsearch

Index

Description: The logical data partitioning unit in Elasticsearch, analogous to a table in a relational database. It supports time-based rolling index creation (e.g., logstash-2023.10.01) and automated hot-warm-cold tiered storage via Index Lifecycle Management (ILM).
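
The time-based naming pattern in the example above can be sketched as follows. The helper name is hypothetical, and the snippet only shows how a daily index name is derived from a date. It does not create indices or apply ILM policies.

```python
from datetime import date, timedelta


def daily_index_name(prefix: str, day: date) -> str:
    """Build a daily index name such as logstash-2023.10.01."""
    return f"{prefix}-{day:%Y.%m.%d}"


start = date(2023, 10, 1)
names = [daily_index_name("logstash", start + timedelta(days=i)) for i in range(3)]
print(names)
```

Because each day's logs land in a separate index, an entire day of data can be moved to another tier or deleted as one unit.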

Shards and Replicas

Description:

  • Shard: The physical storage unit resulting from Elasticsearch's horizontal splitting of an index, supporting distributed scalability.
  • Replica: A copy of each shard, providing data high availability and query load balancing.
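
As a rough illustration of how documents are spread across shards, the sketch below hashes a routing key and takes the remainder modulo the number of primary shards. Elasticsearch uses its own hash function for routing. `zlib.crc32` is only a stand-in here, and both function names are hypothetical.

```python
import zlib


def pick_shard(routing_key: str, num_primary_shards: int) -> int:
    """Choose a primary shard for a document (stand-in hash, not the real one)."""
    return zlib.crc32(routing_key.encode("utf-8")) % num_primary_shards


def total_shard_copies(num_primary_shards: int, num_replicas: int) -> int:
    """Each primary shard has num_replicas additional copies."""
    return num_primary_shards * (1 + num_replicas)


shard = pick_shard("doc-42", num_primary_shards=3)
print(0 <= shard < 3)
print(total_shard_copies(num_primary_shards=3, num_replicas=1))  # 6
```

The same key always maps to the same shard, which is why the number of primary shards cannot be changed freely after an index is created.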

Columnar Storage

Description: The core storage mechanism of ClickHouse, where data is compressed and stored by column, significantly reducing I/O consumption. It supports the following features:

  • Vectorized query execution engine
  • Data partitioning and sharding
  • Materialized views for pre-aggregation
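
The I/O benefit of a columnar layout can be seen in a small Python sketch: once rows are pivoted into columns, an aggregation reads only the column it needs, and repetitive columns compress well. The data and helper names are illustrative assumptions, and run-length encoding is used only as a simple example of compression, not as a description of ClickHouse codecs.

```python
from itertools import groupby
from typing import Any, Dict, List, Tuple


def to_columns(rows: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Pivot a list of row dicts into one list per column."""
    columns: Dict[str, List[Any]] = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns


def run_length_encode(values: List[Any]) -> List[Tuple[Any, int]]:
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


rows = [
    {"ts": 1, "level": "INFO", "bytes": 120},
    {"ts": 2, "level": "INFO", "bytes": 80},
    {"ts": 3, "level": "ERROR", "bytes": 300},
    {"ts": 4, "level": "INFO", "bytes": 100},
]
columns = to_columns(rows)
total_bytes = sum(columns["bytes"])  # touches only the "bytes" column
encoded_levels = run_length_encode(columns["level"])
print(total_bytes)      # 600
print(encoded_levels)   # [('INFO', 2), ('ERROR', 1), ('INFO', 1)]
```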

Storage Policy

Description: The ClickHouse storage placement model that controls where metadata and log table data are stored. In ACP Logging, ClickHouse supports:

  • Default: Store ClickHouse data in local host paths or a StorageClass.
  • Separation of storage and compute: Store ClickHouse metadata locally or in a StorageClass, while storing log table data in remote S3 storage.

Cold and Hot Separate

Description: A tiered storage model for ClickHouse. Active data remains in the current ClickHouse storage, while cold data is transferred to S3 after the configured retention threshold is reached. This model is used when long-term retention is needed without keeping all data in active storage.

Key Technical Terms

Ingest Pipeline

Description: The data preprocessing pipeline in Elasticsearch, capable of performing ETL operations such as field renaming, Grok parsing, and conditional logic before data is written.
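
The three kinds of operations named above (pattern parsing, field renaming, and conditional logic) can be mimicked in plain Python to show what happens to a log line before it is stored. This is only a conceptual sketch: the log format, field names, and the `ingest` function are assumptions, and a real Ingest Pipeline is defined as processor configuration in Elasticsearch rather than as code.

```python
import re
from typing import Dict, Optional

# Named groups play the role of a Grok pattern in this sketch.
LINE_PATTERN = re.compile(r"(?P<timestamp>\S+) (?P<level>[A-Z]+) (?P<msg>.*)")


def ingest(raw: str) -> Optional[Dict[str, str]]:
    """Parse a raw log line into a structured document, or None if it does not match."""
    match = LINE_PATTERN.match(raw)
    if match is None:
        return None
    doc = match.groupdict()
    doc["message"] = doc.pop("msg")  # field renaming
    if doc["level"] == "ERROR":      # conditional logic
        doc["alert"] = "true"
    return doc


doc = ingest("2023-10-01T12:00:00Z ERROR disk full")
print(doc)
```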

Consumer Group

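Description: Kafka's parallel consumption mechanism, where multiple instances within the same consumer group consume messages from different partitions in parallel. Each partition is assigned to only one instance in the group at a time, so message order is preserved within a partition, but not across partitions.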
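
How partitions are divided among the instances of one consumer group can be sketched with a simple round-robin assignment. This is an illustrative Python model with hypothetical names. Kafka's built-in assignors use their own strategies and also handle rebalancing, which is not modeled here.

```python
from typing import Dict, List


def assign_partitions(partitions: List[int], members: List[str]) -> Dict[str, List[int]]:
    """Assign each partition to exactly one group member, round-robin."""
    assignment: Dict[str, List[int]] = {member: [] for member in members}
    for position, partition in enumerate(sorted(partitions)):
        owner = members[position % len(members)]
        assignment[owner].append(partition)
    return assignment


print(assign_partitions([0, 1, 2, 3, 4, 5], ["c1", "c2"]))
# {'c1': [0, 2, 4], 'c2': [1, 3, 5]}
print(assign_partitions([0, 1], ["c1", "c2", "c3"]))
# {'c1': [0], 'c2': [1], 'c3': []}
```

The second call shows that instances beyond the number of partitions stay idle, so the partition count sets the upper bound on parallelism within a group.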

TTL (Time To Live)

Description: A data retention strategy that limits how long log data is kept. It is implemented differently in each storage component:

  • Elasticsearch: Automatically deletes expired indices through ILM policies.
  • ClickHouse: Uses TTL expressions to delete expired data or move eligible data to cold storage.
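
The decision a TTL rule encodes can be expressed as a small function of data age. The thresholds, return labels, and function name below are assumptions chosen for illustration. In practice the rule is declared in an ILM policy or a ClickHouse TTL expression and evaluated by the storage engine, not by application code.

```python
from datetime import datetime, timedelta


def ttl_action(event_time: datetime, now: datetime,
               move_after: timedelta, delete_after: timedelta) -> str:
    """Decide what a TTL rule would do with data of a given age."""
    age = now - event_time
    if age >= delete_after:
        return "delete"
    if age >= move_after:
        return "move_to_cold"
    return "keep_hot"


now = datetime(2023, 10, 31)
move_after = timedelta(days=7)
delete_after = timedelta(days=30)
for day in (datetime(2023, 10, 30), datetime(2023, 10, 20), datetime(2023, 9, 1)):
    print(day.date(), ttl_action(day, now, move_after, delete_after))
```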

Replication Factor

Description: The data redundancy setting at the Kafka Topic level. It defines how many copies of each partition are kept on different Brokers, so that messages remain available if a Broker fails.
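
The effect of the replication factor can be pictured by placing the copies of each partition on distinct Brokers. The layout rule below (consecutive Brokers, wrapping around) and the function name are simplifying assumptions for this sketch. Kafka's actual placement logic differs and can also take racks into account.

```python
from typing import Dict, List


def place_replicas(num_partitions: int, brokers: List[int],
                   replication_factor: int) -> Dict[int, List[int]]:
    """Place each partition's copies on distinct brokers (simplified layout)."""
    if replication_factor > len(brokers):
        raise ValueError("replication factor cannot exceed the number of brokers")
    return {
        partition: [
            brokers[(partition + offset) % len(brokers)]
            for offset in range(replication_factor)
        ]
        for partition in range(num_partitions)
    }


layout = place_replicas(num_partitions=3, brokers=[1, 2, 3], replication_factor=2)
print(layout)  # {0: [1, 2], 1: [2, 3], 2: [3, 1]}
```

With a replication factor of 2, every partition in this layout still has one copy left if any single Broker goes down.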

Data Flow Model