Architecture

Introduction

The platform provides an enterprise-grade, Kubernetes-based foundation that enables organizations to build, deploy, and manage applications consistently across hybrid and multi-cloud environments. It integrates core Kubernetes capabilities with enhanced management, observability, and security services, offering a unified control plane and flexible workload clusters.

The architecture follows a hub-and-spoke model, consisting of a global cluster and multiple workload clusters. This design provides centralized governance while allowing independent workload execution and scalability.

Core Architectural Components

Global Cluster

The global cluster serves as the centralized management and control hub of the platform. It provides platform-wide services such as authentication, policy management, cluster lifecycle operations, and observability, and acts as the central hub for multi-cluster management and cross-cluster functionality.

Key components include:

  • Gateway – Acts as the main entry point to the platform. It manages API requests from the UI, CLI (kubectl), and automation tools, routing them to the appropriate backend services.
  • Authentication and Authorization (Auth) – Integrates with external Identity Providers (IdPs) to provide Single Sign-On (SSO) and RBAC-based access control.
  • Web Console – Provides a web-based interface for the platform. It interfaces with platform APIs through the Gateway.
  • Cluster Management – Handles the registration, provisioning, and lifecycle management of workload clusters.
  • Services
  • Operator Lifecycle Manager (OLM) and Cluster Plugins – Manage the installation, updates, and lifecycle of operators and cluster extensions.
  • Internal Image Registry – Offers an out-of-the-box integrated container image repository with role-based access.
  • Observability – Provides centralized logging, metrics, and tracing for both the global and workload clusters.
  • Cluster Proxy – Enables secure communication between the global cluster and workload clusters.

Workload Cluster

Workload clusters are Kubernetes-based environments managed by the global cluster. Each workload cluster runs isolated application workloads and inherits governance and configuration from the central control plane.

External Integrations

  • Identity Provider (IdP) – Supports federated authentication via standard protocols (OIDC, SAML) for unified user management.
  • API and CLI Access – Users can interact with the platform through RESTful APIs, the web console, or command-line tools such as kubectl and ac.
  • Load Balancer (VIP/DNS/SLB) – Provides high availability and traffic distribution for the Gateway and the ingress endpoints of the global and workload clusters.

Scalability and High Availability

The platform is designed for horizontal scalability and high availability:

  • Each component can be deployed redundantly to eliminate single points of failure.
  • The global cluster supports managing dozens to hundreds of workload clusters.
  • Workload clusters can scale independently according to workload demand.
  • The use of VIP/DNS/Ingress ensures seamless routing and failover.

Functional Perspective

The platform's complete functionality consists of the Core plus extensions built on two technical stacks: Operator and Cluster Plugin.

  • Core

    The minimal deliverable unit of the platform, providing core capabilities such as cluster management, container orchestration, projects, and user administration.

    • Meets the highest security standards
    • Delivers maximum stability
    • Offers the longest support lifecycle
  • Extensions

    Extensions in both the Operator and Cluster Plugin stacks can be classified into:

    • Aligned – A lifecycle strategy with multiple maintenance streams whose releases are aligned with the platform.
    • Agnostic – A lifecycle strategy with multiple maintenance streams, released independently of the platform.

    For more details about extensions, see Extend.

Technical Perspective

Platform Component Runtime

All platform components run as containers within a Kubernetes management cluster (the global cluster).

High Availability Architecture

  • The global cluster typically consists of at least three control plane nodes and multiple worker nodes
  • High availability of etcd is central to cluster HA; see Key Component High Availability Mechanisms for details
  • Load balancing can be provided by an external load balancer or a self-built VIP inside the cluster

Request Routing

  • Client requests first pass through the load balancer or self-built VIP
  • Requests are forwarded to ALB (the platform's default Kubernetes Ingress Gateway) running on designated ingress nodes (or control-plane nodes if configured)
  • ALB routes traffic to the target component pods according to configured rules
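
The routing step above can be expressed as a standard Kubernetes Ingress resource consumed by ALB. This is an illustrative sketch only: the ingress class name, namespace, host, and service name are assumptions, not values shipped with the platform.

```yaml
# Illustrative Ingress; ingressClassName, host, and service name are
# placeholders assumed for this sketch.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-console
  namespace: platform-system
spec:
  ingressClassName: alb          # assumed name of the ALB ingress class
  rules:
    - host: console.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: web-console
                port:
                  number: 80
```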

Replica Strategy

  • Core components run with at least two replicas
  • Key components (such as registry, MinIO, ALB) run with three replicas
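
As a sketch of this replica strategy, a component Deployment can pin its replica count and pair it with a PodDisruptionBudget so that voluntary disruptions (such as node drains) never take all replicas down at once. All names below are placeholders, not the platform's actual resources.

```yaml
# Illustrative manifests for the replica strategy; names are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-core-component
spec:
  replicas: 2                    # core components: at least two replicas
  selector:
    matchLabels:
      app: example-core-component
  template:
    metadata:
      labels:
        app: example-core-component
    spec:
      containers:
        - name: server
          image: example/component:latest
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-core-component
spec:
  minAvailable: 1                # keep at least one replica during voluntary disruptions
  selector:
    matchLabels:
      app: example-core-component
```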

Fault Tolerance & Self-healing

  • Achieved through cooperation between kubelet, kube-controller-manager, kube-scheduler, kube-proxy, ALB, and other components
  • Includes health checks, failover, and traffic redirection
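
The health checks mentioned above map onto standard kubelet probes. A hedged fragment, with assumed paths and port:

```yaml
# Illustrative probe configuration; /healthz, /readyz, and port 8080
# are assumptions for this sketch.
containers:
  - name: server
    image: example/component:latest
    livenessProbe:               # kubelet restarts the container on repeated failures
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 10
      failureThreshold: 3
    readinessProbe:              # failing pods are removed from Service endpoints
      httpGet:
        path: /readyz
        port: 8080
      periodSeconds: 5
```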

Data Storage & Recovery

  • Control-plane configuration and platform state are stored in etcd as Kubernetes resources
  • In catastrophic failures, recovery can be performed from etcd snapshots
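
Snapshot-based recovery can be sketched with standard etcdctl commands; the endpoints, certificate paths, and directories below are assumptions for illustration.

```shell
# Illustrative etcd backup/restore flow; paths and endpoints are placeholders.
export ETCDCTL_API=3

# Take a snapshot of the running cluster
etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  snapshot save /var/backups/etcd-snapshot.db

# Verify the snapshot
etcdctl snapshot status /var/backups/etcd-snapshot.db -w table

# During recovery, restore into a fresh data directory on each member
etcdctl snapshot restore /var/backups/etcd-snapshot.db \
  --data-dir /var/lib/etcd-restored
```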

Primary / Standby Disaster Recovery

  • Two separate global clusters: a Primary Cluster and a Standby Cluster
  • Disaster recovery is based on real-time synchronization of etcd data from the Primary Cluster to the Standby Cluster
  • If the Primary Cluster becomes unavailable due to a failure, services can quickly switch to the Standby Cluster

Key Component High Availability Mechanisms

etcd

  • Deployed on three (or five) control plane nodes
  • Uses the RAFT protocol for leader election and data replication
  • Three-node deployments tolerate up to one node failure; five-node deployments tolerate up to two
  • Supports local and remote S3 snapshot backups
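
The tolerance figures follow directly from RAFT's majority-quorum rule; a short sketch of the arithmetic:

```python
# Quorum arithmetic behind the sizing guidance above: a RAFT cluster of
# n members needs a majority (n // 2 + 1) to commit writes, so it
# tolerates n minus that majority in member failures.
def quorum(n: int) -> int:
    """Smallest majority of an n-member cluster."""
    return n // 2 + 1

def fault_tolerance(n: int) -> int:
    """Number of members that can fail while quorum survives."""
    return n - quorum(n)

for members in (3, 5):
    print(f"{members} members: quorum {quorum(members)}, "
          f"tolerates {fault_tolerance(members)} failure(s)")
```

This is also why even cluster sizes are avoided: four members still only tolerate one failure, while adding a fifth raises tolerance to two.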

Monitoring Components

  • Prometheus: Multiple instances, deduplication with Thanos Query, and cross-region redundancy
  • VictoriaMetrics: Cluster mode with distributed VMStorage, VMInsert, and VMSelect components

Logging Components

  • Nevermore collects logs and audit data
  • Kafka / Elasticsearch / Razor / Lanaya are deployed in distributed and multi-replica modes

Networking Components (CNI)

  • Kube-OVN / Calico / Flannel: Achieve HA via stateless DaemonSets or triple-replica control plane components

ALB

  • Operator deployed with three replicas, leader election enabled
  • Instance-level health checks and load balancing

Self-built VIP

  • High-availability virtual IP based on Keepalived
  • Supports heartbeat detection and active-standby failover
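
A minimal Keepalived VRRP configuration illustrating the active-standby VIP; the interface, router ID, password, and address below are placeholders:

```
# Illustrative keepalived.conf fragment; all values are assumptions.
vrrp_instance VI_1 {
    state MASTER                 # BACKUP on standby nodes
    interface eth0
    virtual_router_id 51
    priority 100                 # lower priority on standby nodes
    advert_int 1                 # heartbeat interval in seconds
    authentication {
        auth_type PASS
        auth_pass s3cret
    }
    virtual_ipaddress {
        192.0.2.10/24            # the self-built VIP
    }
}
```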

Harbor

  • ALB-based load balancing
  • PostgreSQL with Patroni HA
  • Redis Sentinel mode
  • Stateless services deployed in multiple replicas

Registry and MinIO

  • Registry deployed with three replicas
  • MinIO in distributed mode with erasure coding, data redundancy, and automatic recovery
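
The redundancy MinIO gains from erasure coding rests on the property that any K of K+M shards suffice to reconstruct an object, so up to M drives can be lost. A toy sketch with a single XOR parity shard (real deployments use Reed-Solomon codes with multiple parity shards):

```python
# Toy erasure-coding sketch: K data shards plus one XOR parity shard,
# so any one missing data shard can be rebuilt from the rest.
def encode(data_shards: list) -> bytes:
    """Compute a single XOR parity shard over equal-length data shards."""
    parity = bytearray(len(data_shards[0]))
    for shard in data_shards:
        for i, b in enumerate(shard):
            parity[i] ^= b
    return bytes(parity)

def recover(surviving: list, parity: bytes) -> bytes:
    """Rebuild the one missing data shard from survivors plus parity."""
    missing = bytearray(parity)
    for shard in surviving:
        for i, b in enumerate(shard):
            missing[i] ^= b
    return bytes(missing)

shards = [b"abcd", b"efgh", b"ijkl"]
p = encode(shards)
# lose shards[1]; rebuild it from the remaining shards and parity
assert recover([shards[0], shards[2]], p) == b"efgh"
```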