Global Cluster Disaster Recovery
Overview
This solution is designed for disaster recovery scenarios involving the global cluster. The global cluster serves as the control plane of the platform and is responsible for managing other clusters. To ensure continuous platform service availability when the global cluster fails, this solution deploys two global clusters: a Primary Cluster and a Standby Cluster.
The disaster recovery mechanism is based on real-time synchronization of etcd data from the Primary Cluster to the Standby Cluster. If the Primary Cluster becomes unavailable due to a failure, services can quickly switch to the Standby Cluster.
Supported Disaster Scenarios
- Irrecoverable system-level failure of the Primary Cluster rendering it inoperable;
- Failure of physical or virtual machines hosting the Primary Cluster, making it inaccessible;
- Network failure at the Primary Cluster location resulting in service interruption;
Unsupported Disaster Scenarios
- Failures of applications deployed within the `global` cluster;
- Data loss caused by storage system failures (outside the scope of etcd synchronization);
The roles of Primary Cluster and Standby Cluster are relative: the cluster currently serving the platform (the one the DNS record points to) is the Primary Cluster, while the cluster held in reserve is the Standby Cluster. After a failover, these roles are swapped.
Notes
- This solution only synchronizes etcd data of the `global` cluster; it does not include data from the registry, chartmuseum, or other components;
- To facilitate troubleshooting and management, it is recommended to name nodes in a style like `standby-global-m1`, which indicates the cluster (Primary or Standby) a node belongs to;
- Disaster recovery of application data within the cluster is not supported;
- Stable network connectivity is required between the two clusters to ensure reliable etcd synchronization;
- If the clusters are based on heterogeneous architectures (e.g., x86 and ARM), use a dual-architecture installation package;
- The following namespaces are excluded from etcd synchronization. If resources are created in these namespaces, users must back them up manually:
- If both clusters are set to use built-in image registries, container images must be uploaded separately to each;
- If the Primary Cluster deploys DevOps Eventing v3 (knative-operator) and instances thereof, the same components must be pre-deployed in the Standby Cluster.
Process Overview
- Prepare a unified domain name for platform access;
- Point the domain to the Primary Cluster's VIP and install the Primary Cluster;
- Temporarily switch DNS resolution to the standby VIP to install the Standby Cluster;
- Copy the etcd encryption key of the Primary Cluster to the nodes that will later become the control plane nodes of the Standby Cluster;
- Install and enable the etcd synchronization plugin;
- Verify sync status and perform regular checks;
- In case of failure, switch DNS to the standby cluster to complete disaster recovery.
Required Resources
- A unified domain name, which will be the `Platform Access Address`, and the TLS certificate plus private key for serving HTTPS on that domain;
- A dedicated virtual IP address for each cluster: one for the Primary Cluster and another for the Standby Cluster;
- A load balancer preconfigured to route TCP traffic on ports `80`, `443`, `6443`, `2379`, and `11443` to the control plane nodes behind the corresponding VIP.
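As a quick sanity check of the load balancer configuration, something like the following can be run from any machine that can reach the VIPs (the VIP value is a placeholder):

```bash
# Verify that the load balancer accepts TCP connections on every required port.
# <cluster-VIP> is a placeholder; repeat for the Primary and Standby VIPs.
for port in 80 443 6443 2379 11443; do
  nc -vz <cluster-VIP> "$port"
done
```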
Procedure
Step 1: Install the Primary Cluster
While installing the primary cluster of the DR environment:
- First, document all parameters set in the installation web UI; some options must be kept the same when installing the standby cluster.
- A user-provisioned load balancer MUST be preconfigured to route traffic sent to the virtual IP. The `Self-built VIP` option is NOT available.
- The `Platform Access Address` field MUST be a domain name, while the `Cluster Endpoint` MUST be the virtual IP address.
- Both clusters MUST be configured to use `An Existing Certificate` (the same one on both); request a valid certificate if necessary. The `Self-signed Certificate` option is NOT available.
- When `Image Repository` is set to `Platform Deployment`, both the `Username` and `Password` fields MUST NOT be empty; the `IP/Domain` field MUST be set to the domain used as the `Platform Access Address`.
- Both the `HTTP Port` and `HTTPS Port` fields of `Platform Access Address` MUST be 80 and 443 respectively.
- On the second page of the installation guide (Step: `Advanced`), the `Other Platform Access Addresses` field MUST include the virtual IP of the current cluster.
Refer to the following documentation to complete installation:
Step 2: Install the Standby Cluster
- Temporarily point the domain name to the standby cluster's VIP;
- Log into the first control plane node of the Primary Cluster and copy the etcd encryption config to all standby cluster control plane nodes:
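A minimal sketch of this copy step, assuming the encryption configuration is stored under `/etc/kubernetes/` on the control plane nodes (verify the actual file name and path on your installation):

```bash
# Run on the first control plane node of the Primary Cluster.
# The config file path and node names below are assumptions/placeholders.
for node in <standby-cp-1> <standby-cp-2> <standby-cp-3>; do
  scp /etc/kubernetes/encryption-provider.conf root@"$node":/etc/kubernetes/
done
```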
- Install the standby cluster in the same way as the primary cluster. While installing the standby cluster of the DR environment, the following options MUST be set to the same values as on the primary cluster:
  - The `Platform Access Address` field.
  - All fields of `Certificate`.
  - All fields of `Image Repository`. Important: ensure the credentials of the image repository and the admin user match those set on the Primary Cluster.
  Also make sure you follow the notes on installing a DR (Disaster Recovery) environment from Step 1.
Refer to the following documentation to complete installation:
Step 3: Enable etcd Synchronization
- When applicable, configure the load balancer to forward port `2379` to the control plane nodes of the corresponding cluster. ONLY TCP mode is supported; forwarding at L7 is not supported.
  INFO: Port forwarding through a load balancer is not required. If the standby cluster has direct access to the active global cluster, specify the etcd addresses via Active Global Cluster ETCD Endpoints.
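Either way, the standby cluster's control plane nodes must be able to reach the active global cluster's etcd on TCP 2379. A simple reachability check (the endpoint value is a placeholder):

```bash
# Run from a standby cluster control plane node. The target is either the
# Primary Cluster VIP (when port 2379 is forwarded) or an address you plan
# to enter into Active Global Cluster ETCD Endpoints.
nc -vz <primary-VIP-or-etcd-endpoint> 2379
```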
- Access the standby global cluster Web Console using its VIP, and switch to the Administrator view;
- Navigate to Marketplace > Cluster Plugins, and select the `global` cluster;
- Find etcd Synchronizer, click Install, and configure the parameters:
  - When port `2379` is not forwarded through the load balancer, Active Global Cluster ETCD Endpoints must be configured correctly;
  - Use the default value of Data Check Interval;
  - Leave the Print detail logs switch disabled unless troubleshooting.
Verify the sync Pod is running on the standby cluster:
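One way to check this, assuming the plugin's Pods carry an `etcd-sync`-style name and run in the platform namespace (the `cpaas-system` namespace below is an assumption; adjust to where the plugin is actually installed):

```bash
# List the sync Pods on the standby cluster and confirm they are Running.
kubectl get pods -n cpaas-system -o wide | grep etcd-sync
```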
Once “Start Sync update” appears, recreate one of the pods to re-trigger sync of resources with ownerReference dependencies:
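A hedged example, assuming the message is printed in the sync Pod logs and the Pods are named as in the previous check:

```bash
# Confirm the sync has started, then delete one Pod; its replacement will
# re-sync resources that have ownerReference dependencies.
kubectl -n cpaas-system logs <etcd-sync-pod-name> | grep "Start Sync update"
kubectl -n cpaas-system delete pod <etcd-sync-pod-name>
```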
Check sync status:
Output explanation:
- `LOCAL ETCD missed keys`: Keys exist in the Primary Cluster but are missing from the standby. Often caused by garbage collection due to resource ordering during sync. Restart one etcd-sync Pod to fix;
- `LOCAL ETCD surplus keys`: Extra keys exist only in the standby cluster. Confirm with the ops team before deleting these keys from the standby.
If the following components are installed, restart their services (see the sketch after this list):
- Log Storage for Elasticsearch;
- Monitoring for VictoriaMetrics.
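A restart sketch for both components, assuming they run in the `cpaas-system` namespace; the workload names and kinds are placeholders, so list the actual Deployments/StatefulSets first and adjust accordingly:

```bash
# Locate the Elasticsearch / VictoriaMetrics workloads, then restart them.
kubectl -n cpaas-system get deployments,statefulsets | grep -Ei 'elasticsearch|victoria|vm'
kubectl -n cpaas-system rollout restart statefulset <elasticsearch-statefulset>
kubectl -n cpaas-system rollout restart deployment <victoriametrics-deployment>
```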
Disaster Recovery Process
- Restart Elasticsearch on the standby cluster if necessary:
- Verify data consistency in the standby cluster (same check as in Step 3);
- Uninstall the etcd synchronization plugin;
- Remove port forwarding for `2379` from both VIPs;
- Switch the platform domain DNS to the standby VIP, which now becomes the Primary Cluster;
- Verify DNS resolution:
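For example, using `dig` (replace the domain with your actual Platform Access Address):

```bash
# The answer should now contain the standby (new primary) cluster's VIP.
dig +short <platform-access-domain>
```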
- Clear the browser cache and access the platform page to confirm it now serves the former standby cluster;
- Restart the following services (if installed):
  - Log Storage for Elasticsearch;
  - Monitoring for VictoriaMetrics;
  - cluster-transformer;
- If workload clusters send monitoring data to the Primary Cluster, restart warlock in the workload cluster:
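An illustrative restart, assuming `warlock` runs as a Deployment in the `cpaas-system` namespace of the workload cluster (namespace and workload kind are assumptions):

```bash
# Run with the workload cluster's kubeconfig/context selected.
kubectl -n cpaas-system rollout restart deployment warlock
```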
- On the original Primary Cluster, repeat the Enable etcd Synchronization steps to convert it into the new standby cluster.
Routine Checks
Regularly check sync status on the standby cluster:
If any keys are missing or surplus, follow the instructions in the output to resolve them.
Uploading Packages
When using `violet` to upload packages to a standby cluster, the parameter `--dest-repo <VIP address of the standby cluster>` must be specified.
Otherwise, the packages will be uploaded to the image repository of the primary cluster, preventing the standby cluster from installing or upgrading extensions.
Also be aware that either the authentication info of the standby cluster's image registry or the `--no-auth` parameter MUST be provided.
For details of the violet push subcommand, please refer to Upload Packages.
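As an illustration of the points above (the package path is a placeholder, and any additional flags follow the standard `violet push` usage described in Upload Packages):

```bash
# Push to the standby cluster's registry; supply the standby registry's
# authentication options as described in Upload Packages:
violet push <package-file> --dest-repo <VIP address of the standby cluster> [registry auth options]
# ...or, if the standby registry does not require authentication:
violet push <package-file> --no-auth --dest-repo <VIP address of the standby cluster>
```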