global Cluster Disaster Recovery
Overview
This solution is designed for disaster recovery scenarios involving the global cluster. The global cluster serves as the control plane of the platform and is responsible for managing other clusters. To ensure continuous platform service availability when the global cluster fails, this solution deploys two global clusters: a Primary Cluster and a Standby Cluster.
The disaster recovery mechanism is based on real-time synchronization of etcd data from the Primary Cluster to the Standby Cluster. If the Primary Cluster becomes unavailable due to a failure, services can quickly switch to the Standby Cluster.
Supported Disaster Scenarios
- Irrecoverable system-level failure of the Primary Cluster rendering it inoperable;
- Failure of physical or virtual machines hosting the Primary Cluster, making it inaccessible;
- Network failure at the Primary Cluster location resulting in service interruption;
Unsupported Disaster Scenarios
- Failures of applications deployed within the global cluster;
- Data loss caused by storage system failures (outside the scope of etcd synchronization);
The roles of Primary Cluster and Standby Cluster are relative: the cluster currently serving the platform (the one the platform domain's DNS points to) is the Primary Cluster, and the other is the Standby Cluster. After a failover, the roles are swapped.
Notes
- This solution only synchronizes etcd data of the global cluster; it does not include data from registry, chartmuseum, or other components;
- Disaster recovery of application data within the cluster is not supported;
- Stable network connectivity is required between the two clusters to ensure reliable etcd synchronization;
- If the clusters are based on heterogeneous architectures (e.g., x86 and ARM), use a dual-architecture installation package;
- The following namespaces are excluded from etcd synchronization. If resources are created in these namespaces, users must back them up manually (see the backup sketch after this list):
  - cpaas-system
  - cert-manager
  - default
  - global-credentials
  - cpaas-system-global-credentials
  - kube-ovn
  - kube-public
  - kube-system
  - nsx-system
  - cpaas-solution
  - kube-node-lease
  - kubevirt
  - nativestor-system
  - operators
- If both clusters use built-in image registries, packages must be uploaded separately to each;
- If the Primary Cluster deploys DevOps Eventing v3 (knative-operator) and instances of it, the same components must be pre-deployed in the Standby Cluster.
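A minimal backup sketch for the excluded namespaces, assuming kubectl access to the Primary Cluster (the namespace list, resource types, and output path below are illustrative, not prescribed by this solution):

# Back up user-created resources in namespaces excluded from etcd sync.
# Adjust the namespace and resource lists to match what you actually deploy.
for ns in default kube-public
do
  kubectl get all,configmap,secret -n "$ns" -o yaml > "/root/backup-${ns}-$(date +%F).yaml"
done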
Process Overview
- Prepare a unified domain name for platform access;
- Point the domain to the Primary Cluster’s VIP and install the Primary Cluster;
- Temporarily switch DNS resolution to the standby VIP to install the Standby Cluster;
- Install and enable the etcd synchronization plugin;
- Verify sync status and perform regular checks;
- In case of failure, switch DNS to the standby cluster to complete disaster recovery.
Required Resources
- One domain name used as the platform's access entry point;
- Two VIPs, one for each cluster (Primary and Standby);
- Both VIPs must forward TCP ports 80, 443, 6443, 2379, and 11443 to the three control plane nodes of their cluster.
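A quick reachability sketch for the forwarded ports, assuming nc (netcat) is available on the machine running the check; replace the VIP value with your own:

VIP=1.2.3.4   # replace with the Primary or Standby VIP
for port in 80 443 6443 2379 11443
do
  nc -z -w 3 "$VIP" "$port" && echo "port $port reachable" || echo "port $port unreachable"
done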
Process
Step 1: Install the Primary Cluster
Refer to the following documentation to complete installation:
Configuration notes:
- Add the cluster's VIP in Other Platform Access Addresses;
- The Platform Access Address must use the unified domain.
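Before moving on, you can confirm that the unified domain resolves to the Primary Cluster's VIP (a minimal check; dig availability on the host is an assumption):

dig +short <platform access domain>   # should print the Primary Cluster's VIP
# Alternatively: nslookup <platform access domain>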
Step 2: Install the Standby Cluster
- Temporarily point the domain name to the standby cluster’s VIP;
- Log into the first master node of the Primary Cluster and copy the etcd encryption config to all standby cluster master nodes:
for i in 1.1.1.1 2.2.2.2 3.3.3.3   # Replace with the standby cluster's control plane node IPs
do
  ssh "$i" "mkdir -p /etc/kubernetes/"
  scp /etc/kubernetes/encryption-provider.conf "$i":/etc/kubernetes/encryption-provider.conf
done
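# Optional verification (an addition, not part of the original procedure):
# confirm the copied file is identical on every standby control plane node.
md5sum /etc/kubernetes/encryption-provider.conf
for i in 1.1.1.1 2.2.2.2 3.3.3.3
do
  ssh "$i" md5sum /etc/kubernetes/encryption-provider.conf
done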
- Install the standby cluster in the same way as the primary, referring to:
Configuration notes:
- Use the same domain and certificates as the primary;
- Use the same image registry and credentials;
- Add the standby cluster's VIP in Other Platform Access Addresses.
Step 3: Enable etcd Synchronization
- Ensure port 2379 is properly forwarded to the control plane nodes of both clusters;
- Access the standby global cluster Web Console using its VIP, and switch to the Administrator view;
- Navigate to Marketplace > Cluster Plugins, and select the global cluster;
- Find etcd Synchronizer, click Install, and configure the parameters:
  - Use the default sync interval;
  - Leave the log switch disabled unless troubleshooting.
Verify the sync Pod is running on the standby cluster:
kubectl get po -n cpaas-system -l app=etcd-sync
kubectl logs -n cpaas-system $(kubectl get po -n cpaas-system -l app=etcd-sync --no-headers | head -1) | grep -i "Start Sync update"
Once “Start Sync update” appears, recreate one of the Pods to re-trigger synchronization of resources with ownerReference dependencies:
kubectl delete po -n cpaas-system $(kubectl get po -n cpaas-system -l app=etcd-sync --no-headers | head -1)
Check sync status:
mirror_svc=$(kubectl get svc -n cpaas-system etcd-sync-monitor -o jsonpath='{.spec.clusterIP}')
# Wrap IPv6 addresses in brackets so curl can parse the host
ipv6_regex="^[0-9a-fA-F:]+$"
if [[ $mirror_svc =~ $ipv6_regex ]]; then
  export mirror_new_svc="[$mirror_svc]"
else
  export mirror_new_svc=$mirror_svc
fi
curl $mirror_new_svc/check
Output explanation:
- LOCAL ETCD missed keys: keys that exist in the Primary Cluster but are missing from the standby. This is often caused by garbage collection due to resource ordering during sync; restart one etcd-sync Pod to fix it;
- LOCAL ETCD surplus keys: keys that exist only in the standby cluster. Confirm with the ops team before deleting them from the standby.
If the following components are installed, restart their services:
Disaster Recovery Process
- If Log Storage for Elasticsearch is installed, run the following on the standby cluster:
# Copy installer/res/packaged-scripts/for-upgrade/ensure-asm-template.sh to /root and execute:
bash /root/ensure-asm-template.sh
# If it returns 401, restart Elasticsearch and retry
- Verify data consistency in the standby cluster (same check as in Step 3);
- Uninstall the etcd synchronization plugin;
- Remove port forwarding for 2379 from both VIPs;
- Switch the platform domain's DNS to the standby VIP; the standby cluster now becomes the Primary Cluster;
- Verify DNS resolution:
kubectl exec -it -n cpaas-system deployments/sentry -- nslookup <platform access domain>
# If not resolved correctly, restart the coredns Pods and retry until it succeeds (see the sketch after these steps)
- Clear browser cache and access the platform page to confirm it reflects the former standby cluster;
- Restart the following services (if installed):
  - If workload clusters send monitoring data to the Primary Cluster, restart warlock in the workload cluster:
    kubectl delete po -n cpaas-system -l service_name=warlock
- On the original Primary Cluster, repeat the Enable etcd Synchronization steps to convert it into the new standby cluster.
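A minimal sketch of the coredns restart mentioned above, assuming coredns runs in kube-system with the standard k8s-app=kube-dns label (verify the namespace and label on your platform first):

kubectl delete po -n kube-system -l k8s-app=kube-dns
# Re-check resolution once the Pods are back up
kubectl exec -it -n cpaas-system deployments/sentry -- nslookup <platform access domain>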
Routine Checks
Regularly check sync status on the standby cluster:
curl $(kubectl get svc -n cpaas-system etcd-sync-monitor -o jsonpath='{.spec.clusterIP}')/check
If any keys are missing or surplus, follow the instructions in the output to resolve them.
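For routine checks, a small script like the following could be scheduled via cron or a systemd timer (a sketch; the script path, log path, and schedule are illustrative, and kubectl access on a standby control plane node is assumed):

#!/usr/bin/env bash
# Periodic sync check; schedule e.g. via cron: 0 * * * * /root/etcd-sync-check.sh
set -euo pipefail
svc=$(kubectl get svc -n cpaas-system etcd-sync-monitor -o jsonpath='{.spec.clusterIP}')
if [[ $svc =~ ^[0-9a-fA-F:]+$ ]]; then svc="[$svc]"; fi   # bracket IPv6 addresses for curl
echo "$(date +%FT%T) $(curl -s "$svc/check")" >> /var/log/etcd-sync-check.log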