Architecture

Functional Perspective

The platform's complete functionality consists of Core and Extensions, where Extensions are built on two technical stacks: Operator and Cluster Plugin.

  • Core

    The minimal deliverable unit of the platform, providing core capabilities such as cluster management, container orchestration, projects, and user administration.

    • Meets the highest security standards
    • Delivers maximum stability
    • Offers the longest support lifecycle
  • Extensions

    Extensions in both the Operator and Cluster Plugin stacks can be classified into:

    • Aligned – Lifecycle strategy consisting of multiple maintenance streams, aligned with the platform.
    • Agnostic – Lifecycle strategy consisting of multiple maintenance streams, released independently of the platform.

    For more details about extensions, see Extend.

Deployment Perspective

The platform is composed of a global cluster and one or more workload clusters.

  • global Cluster

    • The central hub for multi-cluster management
    • All clusters must be registered with the global cluster before they can be managed
    • Hosts multi-cluster and cross-cluster functionality
    • Kubernetes is deployed and managed by the platform
  • Workload Cluster

    • Hosts user workloads and services
    • Kubernetes may be deployed by the platform or provided by third parties
    • Supports Kubernetes services from major cloud providers as well as CNCF-compliant Kubernetes clusters
    • In certain scenarios, the global cluster may also host business workloads

Technical Perspective

Platform Component Runtime

  • All platform components run as containers within a Kubernetes management cluster (the global cluster)

High Availability Architecture

  • The global cluster typically consists of at least three control plane nodes and multiple worker nodes
  • High availability of etcd is central to cluster HA; see Key Component High Availability Mechanisms for details
  • Load balancing can be provided by an external load balancer or a self-built VIP inside the cluster

Request Routing

  • Client requests first pass through the load balancer or self-built VIP
  • Requests are forwarded to ALB (the platform's default Kubernetes Ingress Gateway) running on designated ingress nodes (or control-plane nodes if configured)
  • ALB routes traffic to the target component pods according to configured rules
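
The last routing step can be pictured as a rule lookup. The following is a minimal sketch, not ALB's actual rule engine: it assumes rules that match on an exact host plus a path prefix, with the longest matching prefix winning. The `Rule` type, the host name, and the backend names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Rule:
    host: str          # exact host to match (hypothetical)
    path_prefix: str   # string prefix of the request path
    backend: str       # hypothetical name of the target component service


def route(rules: List[Rule], host: str, path: str) -> Optional[str]:
    """Return the backend of the most specific matching rule, or None."""
    candidates = [
        r for r in rules
        if r.host == host and path.startswith(r.path_prefix)
    ]
    if not candidates:
        return None
    # Assumption: the rule with the longest matching prefix wins.
    return max(candidates, key=lambda r: len(r.path_prefix)).backend


RULES = [
    Rule("platform.example.com", "/", "console"),
    Rule("platform.example.com", "/api", "api-gateway"),
]

if __name__ == "__main__":
    print(route(RULES, "platform.example.com", "/api/v1/clusters"))
    print(route(RULES, "platform.example.com", "/dashboard"))
    print(route(RULES, "other.example.com", "/"))
```

A request that matches no rule returns `None` here; how a real gateway responds in that case depends on its configuration.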

Replica Strategy

  • Core components run with at least two replicas
  • Key components (such as registry, MinIO, ALB) run with three replicas

Fault Tolerance & Self-healing

  • Achieved through cooperation between kubelet, kube-controller-manager, kube-scheduler, kube-proxy, ALB, and other components
  • Includes health checks, failover, and traffic redirection

Data Storage & Recovery

  • Control-plane configuration and platform state are stored in etcd as Kubernetes resources
  • In catastrophic failures, recovery can be performed from etcd snapshots

Primary / Standby Disaster Recovery

  • Two separate global clusters: Primary Cluster and Standby Cluster
  • The disaster recovery mechanism is based on real-time synchronization of etcd data from the Primary Cluster to the Standby Cluster
  • If the Primary Cluster becomes unavailable due to a failure, services can quickly switch to the Standby Cluster

Key Component High Availability Mechanisms

etcd

  • Deployed on three (or five) control plane nodes
  • Uses the Raft protocol for leader election and data replication
  • Three-node deployments tolerate up to one node failure; five-node deployments tolerate up to two
  • Supports local and remote S3 snapshot backups
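
The failure tolerance figures above follow from Raft's majority rule: a cluster stays available only while more than half of its members are reachable. A small sketch of that arithmetic (the function names are illustrative, not part of any etcd API):

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with `members` voting nodes."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """Number of members that can fail while a majority remains."""
    return members - quorum(members)


if __name__ == "__main__":
    for n in (3, 4, 5):
        print(f"{n} members: quorum={quorum(n)}, "
              f"tolerated failures={tolerated_failures(n)}")
```

A four-member cluster tolerates the same single failure as a three-member one, which is why odd cluster sizes such as three or five are used.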

Monitoring Components

  • Prometheus: Multiple instances, deduplication with Thanos Query, and cross-region redundancy
  • VictoriaMetrics: Cluster mode with distributed VMStorage, VMInsert, and VMSelect components
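
To illustrate what deduplication across multiple Prometheus instances means, here is a conceptual sketch. It is a simplification and not Thanos Query's actual algorithm: it assumes each instance tags its series with a replica label (the name `replica` is an assumption), treats series that differ only in that label as the same series, and keeps the first value seen for each timestamp.

```python
from typing import Dict, FrozenSet, List, Tuple

Labels = Dict[str, str]
Sample = Tuple[int, float]  # (timestamp, value)
SeriesKey = FrozenSet[Tuple[str, str]]


def deduplicate(series: List[Tuple[Labels, List[Sample]]],
                replica_label: str = "replica") -> Dict[SeriesKey, List[Sample]]:
    """Merge series that are identical except for the replica label."""
    merged: Dict[SeriesKey, Dict[int, float]] = {}
    for labels, samples in series:
        # Identity of a series = all of its labels except the replica label.
        key = frozenset((k, v) for k, v in labels.items() if k != replica_label)
        bucket = merged.setdefault(key, {})
        for timestamp, value in samples:
            # The first replica to report a timestamp wins; later ones fill gaps.
            bucket.setdefault(timestamp, value)
    return {key: sorted(bucket.items()) for key, bucket in merged.items()}


if __name__ == "__main__":
    replica_a = ({"__name__": "up", "job": "node", "replica": "a"},
                 [(1, 1.0), (2, 1.0)])
    replica_b = ({"__name__": "up", "job": "node", "replica": "b"},
                 [(2, 1.0), (3, 0.0)])
    for key, samples in deduplicate([replica_a, replica_b]).items():
        print(dict(key), samples)
```

The point of the sketch is that a gap in one instance's data (timestamp 3 is missing from replica `a`) is covered by the other instance, so queries see one continuous series.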

Logging Components

  • Nevermore collects logs and audit data
  • Kafka / Elasticsearch / Razor / Lanaya are deployed in distributed and multi-replica modes

Networking Components (CNI)

  • Kube-OVN / Calico / Flannel: Achieve HA via stateless DaemonSets or triple-replica control plane components

ALB

  • Operator deployed with three replicas, leader election enabled
  • Instance-level health checks and load balancing

Self-built VIP

  • High-availability virtual IP based on Keepalived
  • Supports heartbeat detection and active-standby failover
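
Active-standby failover can be modelled as a priority election among nodes that pass their heartbeat check. The sketch below is a simplified model rather than Keepalived's implementation: it ignores timers and advertisement intervals, assumes preemption (a recovered higher-priority node takes the VIP back), and uses hypothetical node names and priorities.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str             # hypothetical node name
    priority: int         # higher priority wins the election
    healthy: bool = True  # result of the latest heartbeat check


def vip_holder(nodes: List[Node]) -> Optional[str]:
    """Return the healthy node with the highest priority, or None."""
    alive = [n for n in nodes if n.healthy]
    if not alive:
        return None
    return max(alive, key=lambda n: n.priority).name


if __name__ == "__main__":
    nodes = [Node("cp-1", 100), Node("cp-2", 90), Node("cp-3", 80)]
    print("initial holder:", vip_holder(nodes))

    nodes[0].healthy = False  # cp-1 stops answering heartbeats
    print("after cp-1 fails:", vip_holder(nodes))

    nodes[0].healthy = True   # cp-1 recovers and, under preemption, reclaims the VIP
    print("after cp-1 recovers:", vip_holder(nodes))
```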

Harbor

  • ALB-based load balancing
  • PostgreSQL with Patroni HA
  • Redis Sentinel mode
  • Stateless services deployed in multiple replicas

Registry and MinIO

  • Registry deployed with three replicas
  • MinIO in distributed mode with erasure coding, data redundancy, and automatic recovery