Global Cluster Disaster Recovery


Overview

This solution is designed for disaster recovery scenarios involving the global cluster. The global cluster serves as the control plane of the platform and is responsible for managing other clusters. To ensure continuous platform service availability when the global cluster fails, this solution deploys two global clusters: a Primary Cluster and a Standby Cluster.

The disaster recovery mechanism is based on real-time synchronization of etcd data from the Primary Cluster to the Standby Cluster. If the Primary Cluster becomes unavailable due to a failure, services can quickly switch to the Standby Cluster.

Supported Disaster Scenarios

  • Irrecoverable system-level failure of the Primary Cluster rendering it inoperable;
  • Failure of physical or virtual machines hosting the Primary Cluster, making it inaccessible;
  • Network failure at the Primary Cluster location resulting in service interruption;

Unsupported Disaster Scenarios

  • Failures of applications deployed within the global cluster;
  • Data loss caused by storage system failures (outside the scope of etcd synchronization);

The roles of Primary Cluster and Standby Cluster are relative: the cluster currently serving the platform is the Primary Cluster (DNS points to it), while the standby cluster is the Standby Cluster. After a failover, these roles are swapped.

Notes

  • This solution only synchronizes etcd data of the global cluster; it does not include data from registry, chartmuseum, or other components;

  • To facilitate troubleshooting and management, it is recommended to name nodes in a style like standby-global-m1 so that the name indicates which cluster (Primary or Standby) the node belongs to.

  • Disaster recovery of application data within the cluster is not supported;

  • Stable network connectivity is required between the two clusters to ensure reliable etcd synchronization;

  • If the clusters are based on heterogeneous architectures (e.g., x86 and ARM), use a dual-architecture installation package;

  • The following namespaces are excluded from etcd synchronization. If resources are created in these namespaces, users must back them up manually (a backup sketch follows this list):

    cpaas-system
    cert-manager
    default
    global-credentials
    cpaas-system-global-credentials
    kube-ovn
    kube-public
    kube-system
    nsx-system
    cpaas-solution
    kube-node-lease
    kubevirt
    nativestor-system
    operators
  • If both clusters are set to use built-in image registries, container images must be uploaded separately to each;

  • If the Primary Cluster has DevOps Eventing v3 (knative-operator) and instances of it deployed, the same components must be pre-deployed in the Standby Cluster.
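
For resources you create in the excluded namespaces listed above, here is a minimal backup sketch. The namespaces and resource kinds below are examples, not a definitive list; adjust them to what you actually deploy there, and restore the exported YAML on the standby cluster manually.

# Export commonly used namespaced resources from the excluded namespaces
for ns in cpaas-system cert-manager operators; do   # example namespaces; extend as needed
  kubectl get deployments,statefulsets,services,configmaps,secrets \
    -n "$ns" -o yaml > "backup-${ns}.yaml"
done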

Process Overview

  1. Prepare a unified domain name for platform access;
  2. Point the domain to the Primary Cluster's VIP and install the Primary Cluster;
  3. Temporarily switch DNS resolution to the standby VIP to install the Standby Cluster;
  4. Copy the ETCD encryption key of the Primary Cluster to the nodes that will later become the control plane nodes of the Standby Cluster;
  5. Install and enable the etcd synchronization plugin;
  6. Verify sync status and perform regular checks;
  7. In case of failure, switch DNS to the standby cluster to complete disaster recovery.

Required Resources

  • A unified domain name that will serve as the Platform Access Address, along with the TLS certificate and private key for serving HTTPS on that domain;

  • A dedicated virtual IP address for each cluster — one for the Primary Cluster and another for the Standby Cluster;

    • Preconfigure the load balancer to route TCP traffic on ports 80, 443, 6443, 2379, and 11443 to the control-plane nodes behind the corresponding VIP.
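
As an illustration, a minimal load balancer sketch using HAProxy in TCP mode. HAProxy is an assumption here (any L4 load balancer works); replace the server IPs with your control plane node addresses and repeat the frontend/backend pair for each required port.

# Minimal TCP forwarding for one port; repeat the frontend/backend pair for each of
# 80, 443, 6443, 2379 and 11443, then reload HAProxy
cat >> /etc/haproxy/haproxy.cfg <<'EOF'
frontend fe_2379
    mode tcp
    bind *:2379
    default_backend be_2379

backend be_2379
    mode tcp
    balance roundrobin
    server m1 1.1.1.1:2379 check
    server m2 2.2.2.2:2379 check
    server m3 3.3.3.3:2379 check
EOF

systemctl reload haproxy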

Process

Step 1: Install the Primary Cluster

NOTES FOR INSTALLING THE DR (Disaster Recovery) ENVIRONMENT

While installing the primary cluster of the DR Environment,

  • First, document all of the parameters you set while following the installation web UI; some of these options must be kept identical when installing the standby cluster.
  • A User-provisioned Load Balancer MUST be preconfigured to route traffic sent to the virtual IP. The Self-built VIP option is NOT available.
  • The Platform Access Address field MUST be a domain, while the Cluster Endpoint MUST be the virtual IP address.
  • Both clusters MUST be configured to use An Existing Certificate (it must be the same certificate for both); obtain a valid certificate if necessary. The Self-signed Certificate option is NOT available.
  • When Image Repository is set to Platform Deployment, both Username and Password fields MUST NOT be empty; the IP/Domain field MUST be set to the domain used as the Platform Access Address.
  • The HTTP Port and HTTPS Port fields of Platform Access Address MUST be 80 and 443 respectively.
  • On the second page of the installation guide (Step: Advanced), the Other Platform Access Addresses field MUST include the virtual IP of the current cluster.

Refer to the following documentation to complete installation:

Step 2: Install the Standby Cluster

  1. Temporarily point the domain name to the standby cluster's VIP (a verification sketch follows this list);

  2. Log into the first control plane node of the Primary Cluster and copy the etcd encryption config to all standby cluster control plane nodes:

    # Assume the primary cluster control plane nodes are 1.1.1.1, 2.2.2.2 & 3.3.3.3
    # and the standby cluster control plane nodes are 4.4.4.4, 5.5.5.5 & 6.6.6.6
    for i in 4.4.4.4 5.5.5.5 6.6.6.6  # Replace with standby cluster control plane node IPs
    do
      ssh "<user>@$i" "sudo mkdir -p /etc/kubernetes/"
      scp /etc/kubernetes/encryption-provider.conf "<user>@$i:/tmp/encryption-provider.conf"
      ssh "<user>@$i" "sudo install -o root -g root -m 600 /tmp/encryption-provider.conf /etc/kubernetes/encryption-provider.conf && rm -f /tmp/encryption-provider.conf"
    done
  3. Install the standby cluster in the same way as the primary cluster.
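
Before starting the installation in step 3, you can verify that the temporary DNS change from step 1 has taken effect. A minimal check; dig availability is assumed:

dig +short <platform access domain>
# The output should show the standby cluster's VIP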

NOTES FOR INSTALLING STANDBY CLUSTER

While installing the standby cluster of the DR Environment, the following options MUST be set to the same values as in the primary cluster:

  • The Platform Access Address field.
  • All fields of Certificate.
  • All fields of Image Repository.
  • Important: ensure the credentials of image repository and the admin user match those set on the Primary Cluster.

and MAKE SURE you have followed the NOTES FOR INSTALLING THE DR (Disaster Recovery) ENVIRONMENT in Step 1.

Refer to the following documentation to complete installation:

Step 3: Enable etcd Synchronization

  1. Configure the load balancer to forward port 2379 to the control plane nodes of the corresponding cluster. ONLY TCP mode is supported; L7 forwarding is not supported. A quick connectivity check is sketched after this list.

  2. Access the standby global cluster Web Console using its VIP, and switch to Administrator view;

  3. Navigate to Marketplace > Cluster Plugins, select the global cluster;

  4. Find etcd Synchronizer, click Install, configure parameters:

    • Use the default sync interval;
    • Leave log switch disabled unless troubleshooting.
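
To confirm that port 2379 is reachable through each VIP (see item 1 above), a minimal connectivity check; netcat availability and the VIP placeholders are assumptions:

# Replace the placeholders with the actual primary and standby VIPs
for vip in "<primary VIP>" "<standby VIP>"; do
  nc -vz -w 3 "$vip" 2379
done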

Verify the sync Pod is running on the standby cluster:

kubectl get po -n cpaas-system -l app=etcd-sync
kubectl logs -n cpaas-system $(kubectl get po -n cpaas-system -l app=etcd-sync --no-headers | head -1 | awk '{print $1}') | grep -i "Start Sync update"

Once “Start Sync update” appears, recreate one of the pods to re-trigger sync of resources with ownerReference dependencies:

kubectl delete po -n cpaas-system $(kubectl get po -n cpaas-system -l app=etcd-sync --no-headers | head -1 | awk '{print $1}')

Check sync status:

mirror_svc=$(kubectl get svc -n cpaas-system etcd-sync-monitor -o jsonpath='{.spec.clusterIP}')
ipv6_regex="^[0-9a-fA-F:]+$"
if [[ $mirror_svc =~ $ipv6_regex ]]; then
  export mirror_new_svc="[$mirror_svc]"
else
  export mirror_new_svc=$mirror_svc
fi
curl $mirror_new_svc/check

Output explanation:

  • LOCAL ETCD missed keys: keys exist in the Primary Cluster but are missing from the standby. This is often caused by garbage collection due to resource ordering during sync. Restart one etcd-sync Pod to fix it (see the command below);
  • LOCAL ETCD surplus keys: extra keys exist only in the standby cluster. Confirm with the ops team before deleting these keys from the standby.
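
For example, to restart one etcd-sync Pod and re-run the check (reusing $mirror_new_svc from the status-check snippet above):

kubectl delete po -n cpaas-system $(kubectl get po -n cpaas-system -l app=etcd-sync --no-headers | head -1 | awk '{print $1}')
# Wait for the replacement Pod to become Ready, then check again
kubectl wait --for=condition=Ready po -n cpaas-system -l app=etcd-sync --timeout=300s
curl $mirror_new_svc/check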

If the following components are installed, restart their services:

  • Log Storage for Elasticsearch:

    kubectl delete po -n cpaas-system -l service_name=cpaas-elasticsearch
  • Monitoring for VictoriaMetrics:

    kubectl delete po -n cpaas-system -l 'service_name in (alertmanager,vmselect,vminsert)'

Disaster Recovery Process

  1. Restart Elasticsearch on the standby cluster if necessary:

    # Copy installer/res/packaged-scripts/for-upgrade/ensure-asm-template.sh to /root:
    # DO NOT skip this step
    
    # switch to the root user if necessary
    sudo -i
    
    # check whether the Log Storage for Elasticsearch is installed on global cluster
    _es_pods=$(kubectl get po -n cpaas-system | grep cpaas-elasticsearch | awk '{print $1}')
    if [[ -n "${_es_pods}" ]]; then
        # Run the check script; if it returns a 401 error, restart Elasticsearch
        # (next command) and then run this script again
        bash /root/ensure-asm-template.sh
    
        # Restart Elasticsearch
        xargs -r -t -- kubectl delete po -n cpaas-system <<< "${_es_pods}"
    fi
  2. Verify data consistency in the standby cluster (same check as in Step 3);

  3. Uninstall the etcd synchronization plugin;

  4. Remove port forwarding for 2379 from both VIPs;

  5. Switch the platform domain DNS to the standby VIP, which now becomes the Primary Cluster;

  6. Verify DNS resolution:

    kubectl exec -it -n cpaas-system deployments/sentry -- nslookup <platform access domain>
    # If not resolved correctly, restart coredns Pods and retry until success
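    # For example, restart CoreDNS (assuming the default kube-dns labels; adjust to your environment):
    # kubectl delete po -n kube-system -l k8s-app=kube-dns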
  7. Clear browser cache and access the platform page to confirm it reflects the former standby cluster;

  8. Restart the following services (if installed):

    • Log Storage for Elasticsearch:

      kubectl delete po -n cpaas-system -l service_name=cpaas-elasticsearch
    • Monitoring for VictoriaMetrics:

      kubectl delete po -n cpaas-system -l 'service_name in (alertmanager,vmselect,vminsert)'
    • cluster-transformer:

      kubectl delete po -n cpaas-system -l service_name=cluster-transformer
  9. If workload clusters send monitoring data to the Primary, restart warlock in the workload cluster:

    kubectl delete po -n cpaas-system -l service_name=warlock
  10. On the original Primary Cluster, repeat the Enable etcd Synchronization steps to convert it into the new standby cluster.

Routine Checks

Regularly check sync status on the standby cluster:

curl $(kubectl get svc -n cpaas-system etcd-sync-monitor -o jsonpath='{.spec.clusterIP}')/check

If any keys are missing or surplus, follow the instructions in the output to resolve them.
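
To make this routine check easier to automate, a minimal wrapper script; the grep patterns assume the wording shown in the output explanation above, and IPv6 VIPs would need the bracket handling from Step 3:

#!/usr/bin/env bash
set -euo pipefail

# Query the etcd-sync-monitor service and fail when discrepancies are reported
result=$(curl -s "$(kubectl get svc -n cpaas-system etcd-sync-monitor -o jsonpath='{.spec.clusterIP}')/check")
echo "${result}"
if grep -qiE "missed keys|surplus keys" <<< "${result}"; then
    echo "etcd sync check reported missing or surplus keys; follow the instructions in the output" >&2
    exit 1
fi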

Uploading Packages

When using violet to upload packages to a standby cluster, you must specify the --dest-repo parameter with the VIP of the standby cluster. If this parameter is omitted, the package will be uploaded to the image repository of the primary cluster, preventing the standby cluster from installing or upgrading the corresponding extension.
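
For illustration, a hedged example invocation (the exact violet subcommand may differ in your version; --dest-repo is the parameter described above):

violet push <package-file> --dest-repo <standby cluster VIP>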

FAQ

  • The following instructions apply if the ETCD encryption key of the primary cluster was not copied to the standby cluster's control plane nodes before the standby cluster was installed.
  1. Get the ETCD encryption key on any of the standby cluster's control plane nodes:

    ssh <user>@<STANDBY cluster control plane node> sudo cat /etc/kubernetes/encryption-provider.conf

It should look like:

apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
    - secrets
    providers:
    - aescbc:
        keys:
        - name: key1
          secret: MTE0NTE0MTkxOTgxMA==
  2. Merge that ETCD encryption key into the primary cluster's /etc/kubernetes/encryption-provider.conf file, ensuring the key names are unique. For example, if the primary cluster's key is key1, rename the standby's key to key2:

    apiVersion: apiserver.config.k8s.io/v1
    kind: EncryptionConfiguration
    resources:
      - resources:
        - secrets
        providers:
        - aescbc:
            keys:
            - name: key1
              secret: My4xNDE1OTI2NTM1ODk3
            - name: key2
              secret: MTE0NTE0MTkxOTgxMA==
  3. Make sure the new /etc/kubernetes/encryption-provider.conf file overwrites EVERY replica on the control plane nodes of both clusters:

    # Assume the control plane nodes of the primary cluster are 1.1.1.1, 2.2.2.2 & 3.3.3.3
    # and the control plane nodes of the standby cluster are 4.4.4.4, 5.5.5.5 & 6.6.6.6
    
    # assume node 1.1.1.1 has already been configured to use both ETCD encryption keys;
    # log into node 1.1.1.1 and issue the following commands:
    for i in \
        2.2.2.2 3.3.3.3 \
        4.4.4.4 5.5.5.5 6.6.6.6 \
    ; do
        scp /etc/kubernetes/encryption-provider.conf "<user>@${i}:/tmp/encryption-provider.conf"
        ssh "<user>@${i}" '
    #!/bin/bash
    set -euo pipefail
    
    sudo install -o root -g root -m 600 /tmp/encryption-provider.conf /etc/kubernetes/encryption-provider.conf && rm -f /tmp/encryption-provider.conf
    _pod_name="kube-apiserver"
    _pod_id=$(sudo crictl ps --name "${_pod_name}" --no-trunc --quiet)
    if [[ -z "${_pod_id}" ]]; then
        echo "FATAL: could not find pod `kube-apiserver` on node $(hostname)"
        exit 1
    fi
    sudo crictl rm --force "${_pod_id}"
    sudo systemctl restart kubelet.service
    '
    done
  4. Restart the kube-apiserver on node 1.1.1.1:

    _pod_name="kube-apiserver"
    _pod_id=$(sudo crictl ps --name "${_pod_name}" --no-trunc --quiet)
    if [[ -z "${_pod_id}" ]]; then
        echo "FATAL: could not find pod `kube-apiserver` on node $(hostname)"
        exit 1
    fi
    sudo crictl rm --force "${_pod_id}"
    sudo systemctl restart kubelet.service