This solution is designed for disaster recovery scenarios involving the global cluster. The global cluster serves as the control plane of the platform and is responsible for managing other clusters. To ensure continuous platform service availability when the global cluster fails, this solution deploys two global clusters: a Primary Cluster and a Standby Cluster.
The disaster recovery mechanism is based on real-time synchronization of etcd data from the Primary Cluster to the Standby Cluster. If the Primary Cluster becomes unavailable due to a failure, services can quickly switch to the Standby Cluster.
The roles of Primary Cluster and Standby Cluster are relative: the cluster currently serving the platform is the Primary Cluster (DNS points to it), while the other cluster is the Standby Cluster. After a failover, these roles are swapped.
This solution only synchronizes etcd data of the global cluster; it does not include data from registry, chartmuseum, or other components;
To facilitate troubleshooting and management, it is recommended to name nodes in a style like standby-global-m1, which indicates the cluster the node belongs to (Primary or Standby);
Disaster recovery of application data within the cluster is not supported;
Stable network connectivity is required between the two clusters to ensure reliable etcd synchronization;
If the clusters are based on heterogeneous architectures (e.g., x86 and ARM), use a dual-architecture installation package;
The following namespaces are excluded from etcd synchronization. If resources are created in these namespaces, users must back them up manually:
If both clusters are set to use built-in image registries, container images must be uploaded separately to each;
If DevOps Eventing v3 (knative-operator) and its instances are deployed in the Primary Cluster, the same components must be pre-deployed in the Standby Cluster.
A unified domain which will serve as the Platform Access Address, and the TLS certificate plus private key for serving HTTPS on that domain;
A dedicated virtual IP address for each cluster — one for the Primary Cluster and another for the Standby Cluster;
Forwarding of ports 80, 443, 6443, 2379, and 11443 (for example, via a load balancer) to the control-plane nodes behind the corresponding VIP.
While installing the primary cluster of the DR Environment, note the following:
The Self-built VIP option is NOT available.
The Platform Access Address field MUST be a domain, while the Cluster Endpoint MUST be the virtual IP address.
Use An Existing Certificate (it must be the same certificate for both clusters); request a valid certificate if necessary. The Self-signed Certificate option is NOT available.
When Image Repository is set to Platform Deployment, the Username and Password fields MUST NOT be empty, and the IP/Domain field MUST be set to the domain used as the Platform Access Address.
The HTTP Port and HTTPS Port fields of the Platform Access Address MUST be 80 and 443.
In the advanced settings (Advanced), the Other Platform Access Addresses field MUST include the virtual IP of the current cluster.
Refer to the following documentation to complete installation:
Temporarily point the domain name to the standby cluster's VIP;
Log into the first control plane node of the Primary Cluster and copy the etcd encryption config to all standby cluster control plane nodes:
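For example, a minimal sketch of the copy step, assuming the encryption configuration file is located at /etc/kubernetes/encryption-provider.conf (the actual path may differ in your installation) and that SSH access to the standby nodes is available:
# Run on the first control plane node of the Primary Cluster; repeat for every standby control plane node.
# The file path is an assumption; adjust it to where your installation stores the etcd encryption config.
scp /etc/kubernetes/encryption-provider.conf root@<standby-control-plane-ip>:/etc/kubernetes/encryption-provider.conf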
Install the standby cluster in the same way as the primary cluster
While installing the standby cluster of the DR Environment, the following options MUST be set to the same values as in the primary cluster:
The Platform Access Address field.
The Certificate.
The Image Repository.
Also MAKE SURE you followed the NOTES OF DR (Disaster Recovery Environment) INSTALLING in Step 1.
Refer to the following documentation to complete installation:
When applicable, configure the load balancer to forward port 2379 to the control plane nodes of the corresponding cluster. ONLY TCP mode is supported; forwarding on L7 is not supported.
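As a quick sanity check (not part of the official procedure), you can confirm from a standby control plane node that the VIP accepts TCP connections on port 2379:
# Replace <primary-vip> with the Primary Cluster's virtual IP; the command succeeds if the TCP connection is accepted.
nc -vz <primary-vip> 2379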
Port forwarding through a load balancer is not required if the standby cluster has direct access to the active global cluster; in that case, specify the etcd addresses via Active Global Cluster ETCD Endpoints.
Access the standby global cluster Web Console using its VIP, and switch to Administrator view;
Navigate to Marketplace > Cluster Plugins, select the global cluster;
Find etcd Synchronizer, click Install, configure parameters:
If port 2379 is not forwarded through the load balancer, the Active Global Cluster ETCD Endpoints parameter must be configured correctly.
Verify the sync Pod is running on the standby cluster:
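A minimal way to check this, assuming the synchronizer Pods carry etcd-sync in their names and run in the cpaas-system namespace (both are assumptions; adjust to your environment):
# List the synchronizer Pods on the standby cluster and confirm they are Running.
kubectl get pods -n cpaas-system | grep etcd-sync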
Once “Start Sync update” appears, recreate one of the pods to re-trigger sync of resources with ownerReference dependencies:
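For example, assuming the same Pod naming as above, deleting one synchronizer Pod lets its controller recreate it:
# Delete a single etcd-sync Pod on the standby cluster; its controller recreates it and the sync is re-triggered.
kubectl delete pod -n cpaas-system <etcd-sync-pod-name>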
Check sync status:
Output explanation:
LOCAL ETCD missed keys: Keys exist in the Primary Cluster but are missing from the standby. Often caused by garbage collection due to resource ordering during sync; restart one etcd-sync Pod to fix.
LOCAL ETCD surplus keys: Extra keys exist only in the standby cluster. Confirm with the operations team before deleting these keys from the standby.
If the following components are installed, restart their services:
Log Storage for Elasticsearch:
Monitoring for VictoriaMetrics:
Restart Elasticsearch on the standby cluster if necessary:
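The exact restart commands depend on how these components are deployed; as a rough sketch (workload names and namespaces are placeholders to adapt):
# Restart the log storage (Elasticsearch) and monitoring (VictoriaMetrics) workloads on the standby cluster.
kubectl -n <logging-namespace> rollout restart statefulset <elasticsearch-statefulset>
kubectl -n <monitoring-namespace> rollout restart statefulset <victoriametrics-statefulset>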
Verify data consistency in the standby cluster (same check as in Step 3);
Uninstall the etcd synchronization plugin;
Remove port forwarding for 2379
from both VIPs;
Switch the platform domain DNS to the standby VIP, which now becomes the Primary Cluster;
Verify DNS resolution:
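For example, where <platform-domain> is the Platform Access Address:
# The answer should now be the standby cluster's VIP.
dig +short <platform-domain>
nslookup <platform-domain>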
Clear browser cache and access the platform page to confirm it reflects the former standby cluster;
Restart the following services (if installed):
Log Storage for Elasticsearch:
Monitoring for VictoriaMetrics:
cluster-transformer:
If workload clusters send monitoring data to the Primary, restart warlock in the workload cluster:
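As an illustrative sketch of such restarts (namespaces are placeholders; cluster-transformer and warlock are the component names mentioned above):
# On the new Primary Cluster:
kubectl -n <platform-namespace> rollout restart deployment cluster-transformer
# On each workload cluster that reports monitoring data to the global cluster:
kubectl -n <platform-namespace> rollout restart deployment warlock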
On the original Primary Cluster, repeat the Enable etcd Synchronization steps to convert it into the new standby cluster.
Regularly check sync status on the standby cluster:
If any keys are missing or surplus, follow the instructions in the output to resolve them.
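One low-effort way to keep an eye on the synchronizer between full checks (a sketch, assuming the Pod naming used earlier) is to review the sync Pod logs periodically:
# Inspect recent log lines of one etcd-sync Pod on the standby cluster for missed/surplus key reports.
kubectl -n cpaas-system logs <etcd-sync-pod-name> --tail=100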
When using violet to upload packages to a standby cluster, the --dest-repo <VIP addr of standby cluster> parameter must be specified.
Otherwise, the packages will be uploaded to the image repository of the primary cluster, preventing the standby cluster from installing or upgrading extensions.
Also be aware that either the authentication info of the standby cluster's image registry or the --no-auth parameter MUST be provided.
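Putting these notes together, an example invocation might look like the following (the package path is a placeholder; keep --no-auth only if the standby registry requires no authentication):
# Upload a package to the standby cluster's image repository instead of the primary's.
violet push <package-file> --dest-repo <standby-vip> --no-auth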
For details of the violet push subcommand, please refer to Upload Packages.