Global Cluster Disaster Recovery
Overview
This solution is designed for disaster recovery scenarios involving the global cluster. The global cluster serves as the control plane of the platform and is responsible for managing other clusters. To ensure continuous platform service availability when the global cluster fails, this solution deploys two global clusters: a Primary Cluster and a Standby Cluster.
The disaster recovery mechanism is based on real-time synchronization of etcd data from the Primary Cluster to the Standby Cluster. If the Primary Cluster becomes unavailable due to a failure, services can quickly switch to the Standby Cluster.
Supported Disaster Scenarios
- Irrecoverable system-level failure of the Primary Cluster rendering it inoperable;
- Failure of physical or virtual machines hosting the Primary Cluster, making it inaccessible;
- Network failure at the Primary Cluster location resulting in service interruption;
Unsupported Disaster Scenarios
- Failures of applications deployed within the `global` cluster;
- Data loss caused by storage system failures (outside the scope of etcd synchronization);
The roles of Primary Cluster and Standby Cluster are relative: the cluster currently serving the platform (the one the DNS record points to) is the Primary Cluster, while the cluster held in reserve is the Standby Cluster. After a failover, these roles are swapped.
Notes
- This solution only synchronizes etcd data of the `global` cluster; it does not include data from the registry, chartmuseum, or other components;
- To facilitate troubleshooting and management, it is recommended to name nodes in a style like `standby-global-m1`, which indicates the cluster (Primary or Standby) a node belongs to;
- Disaster recovery of application data within the cluster is not supported;
- Stable network connectivity is required between the two clusters to ensure reliable etcd synchronization;
- If the clusters are based on heterogeneous architectures (e.g., x86 and ARM), use a dual-architecture installation package;
- The following namespaces are excluded from etcd synchronization. If resources are created in these namespaces, users must back them up manually:
- If both clusters are set to use built-in image registries, container images must be uploaded separately to each;
- If the Primary Cluster deploys DevOps Eventing v3 (knative-operator) and instances thereof, the same components must be pre-deployed in the Standby Cluster.
Process Overview
- Prepare a unified domain name for platform access;
- Point the domain to the Primary Cluster's VIP and install the Primary Cluster;
- Temporarily switch DNS resolution to the standby VIP to install the Standby Cluster;
- Copy the etcd encryption key of the Primary Cluster to the nodes that will later become the control plane nodes of the Standby Cluster;
- Install and enable the etcd synchronization plugin;
- Verify sync status and perform regular checks;
- In case of failure, switch DNS to the standby cluster to complete disaster recovery.
Required Resources
- A unified domain name, which will be the `Platform Access Address`, and the TLS certificate plus private key for serving HTTPS on that domain;
- A dedicated virtual IP address for each cluster: one for the Primary Cluster and another for the Standby Cluster;
- A load balancer preconfigured to route TCP traffic on ports `80`, `443`, `6443`, `2379`, and `11443` to the control plane nodes behind the corresponding VIP.
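As a quick sanity check of the load balancer configuration, something like the following can be run from any machine that can reach the VIPs (the VIP value is a placeholder):

```bash
# Verify that the load balancer accepts TCP connections on every required port.
# <cluster-VIP> is a placeholder; repeat for the Primary and Standby VIPs.
for port in 80 443 6443 2379 11443; do
  nc -vz <cluster-VIP> "$port"
done
```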
Procedure
Step 1: Install the Primary Cluster
While installing the primary cluster of the DR environment:
- First, document all parameters set in the installation web UI; some options must be kept the same when installing the standby cluster.
- A user-provisioned load balancer MUST be preconfigured to route traffic sent to the virtual IP. The `Self-built VIP` option is NOT available.
- The `Platform Access Address` field MUST be a domain name, while the `Cluster Endpoint` MUST be the virtual IP address.
- Both clusters MUST be configured to use `An Existing Certificate` (the same one on both); request a valid certificate if necessary. The `Self-signed Certificate` option is NOT available.
- When `Image Repository` is set to `Platform Deployment`, both the `Username` and `Password` fields MUST NOT be empty; the `IP/Domain` field MUST be set to the domain used as the `Platform Access Address`.
- Both the `HTTP Port` and `HTTPS Port` fields of `Platform Access Address` MUST be 80 and 443 respectively.
- On the second page of the installation guide (Step: `Advanced`), the `Other Platform Access Addresses` field MUST include the virtual IP of the current cluster.
Refer to the following documentation to complete installation:
Step 2: Install the Standby Cluster
- Temporarily point the domain name to the standby cluster's VIP;
- Log into the first control plane node of the Primary Cluster and copy the etcd encryption config to all standby cluster control plane nodes:
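A minimal sketch of this copy step, assuming the encryption configuration is stored under `/etc/kubernetes/` on the control plane nodes (verify the actual file name and path on your installation):

```bash
# Run on the first control plane node of the Primary Cluster.
# The config file path and node names below are assumptions/placeholders.
for node in <standby-cp-1> <standby-cp-2> <standby-cp-3>; do
  scp /etc/kubernetes/encryption-provider.conf root@"$node":/etc/kubernetes/
done
```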
- Install the standby cluster in the same way as the primary cluster. While installing the standby cluster of the DR environment, the following options MUST be set to the same values as on the primary cluster:
  - The `Platform Access Address` field.
  - All fields of `Certificate`.
  - All fields of `Image Repository`. Important: ensure the credentials of the image repository and the admin user match those set on the Primary Cluster.
  Also make sure you follow the notes on installing a DR (Disaster Recovery) environment from Step 1.
Refer to the following documentation to complete installation:
Step 3: Enable etcd Synchronization
- When applicable, configure the load balancer to forward port `2379` to the control plane nodes of the corresponding cluster. ONLY TCP mode is supported; forwarding at L7 is not supported.
  INFO: Port forwarding through a load balancer is not required. If the standby cluster has direct access to the active global cluster, specify the etcd addresses via Active Global Cluster ETCD Endpoints.
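Either way, the standby cluster's control plane nodes must be able to reach the active global cluster's etcd on TCP 2379. A simple reachability check (the endpoint value is a placeholder):

```bash
# Run from a standby cluster control plane node. The target is either the
# Primary Cluster VIP (when port 2379 is forwarded) or an address you plan
# to enter into Active Global Cluster ETCD Endpoints.
nc -vz <primary-VIP-or-etcd-endpoint> 2379
```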
- Access the standby global cluster Web Console using its VIP, and switch to the Administrator view;
- Navigate to Marketplace > Cluster Plugins, and select the `global` cluster;
- Find etcd Synchronizer, click Install, and configure the parameters:
  - When port `2379` is not forwarded through the load balancer, Active Global Cluster ETCD Endpoints must be configured correctly;
  - Use the default value of Data Check Interval;
  - Leave the Print detail logs switch disabled unless troubleshooting.
Verify the sync Pod is running on the standby cluster:
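One way to check this, assuming the plugin's Pods carry an `etcd-sync`-style name and run in the platform namespace (the `cpaas-system` namespace below is an assumption; adjust to where the plugin is actually installed):

```bash
# List the sync Pods on the standby cluster and confirm they are Running.
kubectl get pods -n cpaas-system -o wide | grep etcd-sync
```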
Once “Start Sync update” appears, recreate one of the pods to re-trigger sync of resources with ownerReference dependencies:
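A hedged example, assuming the message is printed in the sync Pod logs and the Pods are named as in the previous check:

```bash
# Confirm the sync has started, then delete one Pod; its replacement will
# re-sync resources that have ownerReference dependencies.
kubectl -n cpaas-system logs <etcd-sync-pod-name> | grep "Start Sync update"
kubectl -n cpaas-system delete pod <etcd-sync-pod-name>
```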
Check sync status:
Output explanation:
- `LOCAL ETCD missed keys`: Keys exist in the Primary Cluster but are missing from the standby. Often caused by garbage collection due to resource ordering during sync. Restart one etcd-sync Pod to fix;
- `LOCAL ETCD surplus keys`: Extra keys exist only in the standby cluster. Confirm with the ops team before deleting these keys from the standby.
If the following components are installed, restart their services (see the sketch after this list):
- Log Storage for Elasticsearch;
- Monitoring for VictoriaMetrics.
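A restart sketch for both components, assuming they run in the `cpaas-system` namespace; the workload names and kinds are placeholders, so list the actual Deployments/StatefulSets first and adjust accordingly:

```bash
# Locate the Elasticsearch / VictoriaMetrics workloads, then restart them.
kubectl -n cpaas-system get deployments,statefulsets | grep -Ei 'elasticsearch|victoria|vm'
kubectl -n cpaas-system rollout restart statefulset <elasticsearch-statefulset>
kubectl -n cpaas-system rollout restart deployment <victoriametrics-deployment>
```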
Disaster Recovery Process
- Restart Elasticsearch on the standby cluster if necessary:
- Verify data consistency in the standby cluster (same check as in Step 3);
- Uninstall the etcd synchronization plugin;
- Remove port forwarding for `2379` from both VIPs;
- Switch the platform domain DNS to the standby VIP, which now becomes the Primary Cluster;
- Verify DNS resolution:
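For example, using `dig` (replace the domain with your actual Platform Access Address):

```bash
# The answer should now contain the standby (new primary) cluster's VIP.
dig +short <platform-access-domain>
```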
- Clear the browser cache and access the platform page to confirm it now serves the former standby cluster;
- Restart the following services (if installed):
  - Log Storage for Elasticsearch;
  - Monitoring for VictoriaMetrics;
  - cluster-transformer;
- If workload clusters send monitoring data to the Primary Cluster, restart warlock in the workload cluster:
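An illustrative restart, assuming `warlock` runs as a Deployment in the `cpaas-system` namespace of the workload cluster (namespace and workload kind are assumptions):

```bash
# Run with the workload cluster's kubeconfig/context selected.
kubectl -n cpaas-system rollout restart deployment warlock
```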
- On the original Primary Cluster, repeat the Enable etcd Synchronization steps to convert it into the new standby cluster.
Routine Checks
Regularly check sync status on the standby cluster:
If any keys are missing or surplus, follow the instructions in the output to resolve them.
Uploading Packages
When using `violet` to upload packages to a standby cluster, the parameter `--dest-repo <VIP address of the standby cluster>` must be specified.
Otherwise, the packages will be uploaded to the image repository of the primary cluster, preventing the standby cluster from installing or upgrading extensions.
Also be aware that either the authentication info of the standby cluster's image registry or the `--no-auth` parameter MUST be provided.
For details of the violet push subcommand, please refer to Upload Packages.
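As an illustration of the points above (the package path is a placeholder, and any additional flags follow the standard `violet push` usage described in Upload Packages):

```bash
# Push to the standby cluster's registry; supply the standby registry's
# authentication options as described in Upload Packages:
violet push <package-file> --dest-repo <VIP address of the standby cluster> [registry auth options]
# ...or, if the standby registry does not require authentication:
violet push <package-file> --no-auth --dest-repo <VIP address of the standby cluster>
```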