This solution is designed for disaster recovery scenarios involving the global cluster. The global cluster serves as the control plane of the platform and is responsible for managing other clusters. To ensure continuous platform service availability when the global cluster fails, this solution deploys two global clusters: a Primary Cluster and a Standby Cluster.
The disaster recovery mechanism is based on real-time synchronization of etcd data from the Primary Cluster to the Standby Cluster. If the Primary Cluster becomes unavailable due to a failure, services can quickly switch to the Standby Cluster.
The roles of Primary Cluster and Standby Cluster are relative: the cluster currently serving the platform is the Primary Cluster (DNS points to it), while the other cluster is the Standby Cluster. After a failover, these roles are swapped.
This solution only synchronizes etcd data of the global cluster; it does not include data from registry, chartmuseum, or other components;
To facilitate troubleshooting and management, it is recommended to name nodes in a style such as standby-global-m1, to indicate which cluster the node belongs to (Primary or Standby).
Disaster recovery of application data within the cluster is not supported;
Stable network connectivity is required between the two clusters to ensure reliable etcd synchronization;
If the clusters are based on heterogeneous architectures (e.g., x86 and ARM), use a dual-architecture installation package;
The following namespaces are excluded from etcd synchronization. If resources are created in these namespaces, users must back them up manually:
If both clusters are set to use built-in image registries, container images must be uploaded separately to each;
If the Primary Cluster deploys DevOps Eventing v3 (knative-operator) and instances thereof, the same components must be pre-deployed in the standby cluster.
A unified domain which will serve as the Platform Access Address, plus the TLS certificate and private key for serving HTTPS on that domain;
A dedicated virtual IP address for each cluster — one for the Primary Cluster and another for the Standby Cluster;
Load balancers that forward ports 80, 443, 6443, 2379, and 11443 to the control-plane nodes behind the corresponding VIP.
While installing the primary cluster of the DR Environment, note the following:
The Self-built VIP option is NOT available.
The Platform Access Address field MUST be a domain, while the Cluster Endpoint MUST be the virtual IP address.
The certificate MUST be provided as an Existing Certificate (the same one for both clusters); request a valid certificate if necessary. The Self-signed Certificate option is NOT available.
If the Image Repository is set to Platform Deployment, both the Username and Password fields MUST NOT be empty, and the IP/Domain field MUST be set to the domain used as the Platform Access Address.
The HTTP Port and HTTPS Port fields of the Platform Access Address MUST be 80 and 443.
Under Advanced, the Other Platform Access Addresses field MUST include the virtual IP of the current cluster.
Refer to the following documentation to complete installation:
Temporarily point the domain name to the standby cluster's VIP;
Log into the first control plane node of the Primary Cluster and copy the etcd encryption config to all standby cluster control plane nodes:
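For reference, a minimal sketch of this copy step using scp is shown below; the standby node addresses are placeholders, and the file path assumes the default location referenced later in this guide (/etc/kubernetes/encryption-provider.conf).

```bash
# Run on the first control plane node of the Primary Cluster.
# Replace the placeholders with the real standby control plane node addresses.
ENCRYPTION_CONF=/etc/kubernetes/encryption-provider.conf

for node in <standby-cp-1> <standby-cp-2> <standby-cp-3>; do
  scp "${ENCRYPTION_CONF}" root@"${node}":"${ENCRYPTION_CONF}"
done
```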
Install the standby cluster in the same way as the primary cluster.
While installing the standby cluster of the DR Environment, the following options MUST be set to the same values as in the primary cluster:
The Platform Access Address field.
The Certificate.
The Image Repository; also MAKE SURE you followed the notes for DR (Disaster Recovery Environment) installation in Step 1.
Refer to the following documentation to complete installation:
Configure the load balancer to forward port 2379 to the control plane nodes of the corresponding cluster. ONLY TCP mode is supported; forwarding on L7 is not supported.
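As an illustration, assuming an HAProxy-based load balancer (any L4/TCP load balancer works equally well), the forwarding rule could look like the sketch below; the VIP and control plane node addresses are placeholders.

```
# L4 (TCP) forwarding of etcd sync traffic; HAProxy is only an example choice.
frontend etcd_sync
    bind <cluster-vip>:2379
    mode tcp
    default_backend etcd_sync_nodes

backend etcd_sync_nodes
    mode tcp
    balance roundrobin
    server cp-1 <control-plane-1>:2379 check
    server cp-2 <control-plane-2>:2379 check
    server cp-3 <control-plane-3>:2379 check
```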
Access the standby global cluster Web Console using its VIP, and switch to Administrator view;
Navigate to Marketplace > Cluster Plugins, select the global cluster;
Find etcd Synchronizer, click Install, configure parameters:
Verify the sync Pod is running on the standby cluster:
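For example (a sketch; the exact Pod names and namespace depend on the plugin version):

```bash
# List the synchronizer Pods on the standby cluster and confirm they are Running.
kubectl get pods --all-namespaces | grep etcd-sync
```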
Once “Start Sync update” appears, recreate one of the pods to re-trigger sync of resources with ownerReference dependencies:
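A minimal sketch of recreating one synchronizer Pod, with the namespace and Pod name as placeholders:

```bash
# Deleting the Pod is enough; its controller recreates it and the sync restarts.
kubectl -n <plugin-namespace> delete pod <etcd-sync-pod-name>
```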
Check sync status:
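One way to inspect the status (an assumption; the plugin may also expose a dedicated status command) is to read the synchronizer Pod's output on the standby cluster:

```bash
# Look for the status lines explained below, such as
# "LOCAL ETCD missed keys" and "LOCAL ETCD surplus keys".
kubectl -n <plugin-namespace> logs <etcd-sync-pod-name> --tail=100
```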
Output explanation:
LOCAL ETCD missed keys: keys that exist in the Primary Cluster but are missing from the standby. Often caused by GC due to resource ordering during sync; restart one etcd-sync Pod to fix this.
LOCAL ETCD surplus keys: extra keys that exist only in the standby cluster. Confirm with the ops team before deleting these keys from the standby.
If the following components are installed, restart their services (a generic restart sketch follows the items below):
Log Storage for Elasticsearch:
Monitoring for VictoriaMetrics:
Restart Elasticsearch on the standby cluster if necessary:
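The exact restart commands depend on how these components were deployed; as a generic sketch (namespaces and workload names are placeholders, and kubectl rollout restart is only one possible mechanism):

```bash
# Restart Log Storage (Elasticsearch) and Monitoring (VictoriaMetrics) workloads,
# then wait for all Pods to return to Running/Ready.
kubectl -n <logging-namespace> rollout restart statefulset <elasticsearch-statefulset>
kubectl -n <monitoring-namespace> rollout restart statefulset <victoriametrics-statefulset>

kubectl -n <logging-namespace> get pods -w
```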
Verify data consistency in the standby cluster (same check as in Step 3);
Uninstall the etcd synchronization plugin;
Remove port forwarding for 2379 from both VIPs;
Switch the platform domain DNS to the standby VIP, which now becomes the Primary Cluster;
Verify DNS resolution:
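For example, using standard DNS tooling (the domain is a placeholder):

```bash
# The answer should now be the standby cluster's VIP.
dig +short <platform-access-domain>
# or
nslookup <platform-access-domain>
```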
Clear browser cache and access the platform page to confirm it reflects the former standby cluster;
Restart the following services (if installed):
Log Storage for Elasticsearch:
Monitoring for VictoriaMetrics:
cluster-transformer:
If workload clusters send monitoring data to the Primary, restart warlock in the workload cluster:
On the original Primary Cluster, repeat the Enable etcd Synchronization steps to convert it into the new standby cluster.
Regularly check sync status on the standby cluster:
If any keys are missing or surplus, follow the instructions in the output to resolve them.
When using violet to upload packages to a standby cluster, you must specify the --dest-repo parameter with the VIP of the standby cluster.
If this parameter is omitted, the package will be uploaded to the image repository of the primary cluster, preventing the standby cluster from installing or upgrading the corresponding extension.
Get the ETCD encryption key on any of the standby cluster's control plane nodes:
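A minimal sketch, assuming the file lives at the path used throughout this guide:

```bash
# Print the encryption configuration and note the key name and secret.
cat /etc/kubernetes/encryption-provider.conf
```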
It should look like:
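The file is a standard Kubernetes EncryptionConfiguration; the sketch below is representative only, and the provider, covered resources, key name, and secret on your nodes will differ.

```yaml
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
    providers:
      - aescbc:
          keys:
            - name: key1
              secret: <base64-encoded-32-byte-key>
      - identity: {}
```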
Merge that ETCD encryption key into the primary cluster's /etc/kubernetes/encryption-provider.conf file, ensuring the key names are unique. For example, if the primary cluster's key is key1, rename the standby's key to key2:
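Continuing the sketch above, the merged file on the primary cluster would then contain both keys (secrets shown as placeholders):

```yaml
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
    providers:
      - aescbc:
          keys:
            - name: key1              # primary cluster's original key
              secret: <primary-cluster-secret>
            - name: key2              # key copied from the standby cluster
              secret: <standby-cluster-secret>
      - identity: {}
```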
Make sure the new /etc/kubernetes/encryption-provider.conf file overwrites every replica of it on the control plane nodes of both clusters:
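A minimal distribution sketch; the node addresses are placeholders and should cover every control plane node of both clusters:

```bash
for node in <primary-cp-1> <primary-cp-2> <primary-cp-3> \
            <standby-cp-1> <standby-cp-2> <standby-cp-3>; do
  scp /etc/kubernetes/encryption-provider.conf \
      root@"${node}":/etc/kubernetes/encryption-provider.conf
done
```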
Restart the kube-apiserver on node 1.1.1.1
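Assuming kube-apiserver runs as a static Pod managed by the kubelet (typical for this kind of control plane, but an assumption here), one common way to force a restart is to move its manifest out of the static Pod directory and back:

```bash
# Run on the control plane node (e.g. 1.1.1.1); repeat for every node whose
# encryption-provider.conf was replaced.
mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/
sleep 20   # give the kubelet time to stop the old kube-apiserver Pod
mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/

# Verify the API server is serving again.
kubectl get --raw=/readyz
```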