Migrate Existing Huawei Cloud Stack Clusters to Pool-Managed Persistent Disks

Use this guide when you upgrade an existing Huawei Cloud Stack (HCS) cluster from the older HCSMachineTemplate data-volume layout to the pool-managed persistent-disk model.

In HCS provider v1.0.1 or later, disks that must survive node replacement are declared in HCSMachineConfigPool.spec.configs[].persistentDisks[]. This includes the platform-required /var/cpaas disk.

INFO

Version

Use this procedure when the cluster runs ACP v4.3.1 or later and the target HCS provider version is v1.0.1 or later.

Overview

Older HCS clusters commonly placed /var/cpaas in HCSMachineTemplate.spec.template.spec.dataVolumes[]. In that layout, the data volumes are created together with the ECS and share its lifecycle, so during rolling replacement the old ECS and its template-owned data volumes may be deleted together.

The pool-managed model moves upgrade-preserved disks into HCSMachineConfigPool.spec.configs[].persistentDisks[]. Each persistent disk is bound to a fixed (hostname, slot) identity. During rolling replacement, the provider:

  1. Claims the existing EVS disk from the old ECS when it matches the pool declaration.
  2. Stops the old ECS.
  3. Detaches the EVS disk and waits until it is available.
  4. Deletes the old ECS.
  5. Creates the replacement ECS with the same EVS disk attached before first boot.
  6. Boots the replacement ECS, which mounts the existing file system without reformatting it.

Before You Start

Verify all of the following before you begin:

  • The management cluster has HCS provider v1.0.1 or later.
  • The workload cluster is healthy and all nodes are Ready.
  • The cluster uses HCSMachineConfigPool to assign fixed hostnames and IP addresses.
  • The preserved disks have non-empty mount paths in the old HCSMachineTemplate.spec.template.spec.dataVolumes[].
  • The relevant rollout strategies use maxSurge: 0.
  • You have a maintenance window for one-by-one node replacement.
  • You have a verified backup of etcd and platform configuration.

WARNING

Do not declare the same mount path in both HCSMachineTemplate.spec.template.spec.dataVolumes[] and HCSMachineConfigPool.spec.configs[].persistentDisks[]. The provider rejects this configuration to prevent data loss.

Inspect the Current Disk Layout

Identify the management-cluster objects that control the cluster:

kubectl get cluster -n cpaas-system
kubectl get kubeadmcontrolplane -n cpaas-system
kubectl get machinedeployment -n cpaas-system
kubectl get hcsmachinetemplate -n cpaas-system
kubectl get hcsmachineconfigpool -n cpaas-system
kubectl get hcsmachine -n cpaas-system

Inspect the current machine templates:

kubectl get hcsmachinetemplate <template-name> -n cpaas-system -o yaml

Record every dataVolumes[] entry that must be preserved. For each disk, record:

| Field | Source |
| --- | --- |
| mountPath | HCSMachineTemplate.spec.template.spec.dataVolumes[].mountPath |
| size | HCSMachineTemplate.spec.template.spec.dataVolumes[].size |
| type | HCSMachineTemplate.spec.template.spec.dataVolumes[].type |
| format | HCSMachineTemplate.spec.template.spec.dataVolumes[].format |
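
The four fields can also be collected in one pass. The following is a minimal sketch that assumes kubectl points at the management cluster; the function name list_data_volumes is hypothetical, and only the field paths listed above are read.

```shell
#!/bin/sh
# Print mountPath, size, type, and format for each dataVolumes[] entry
# of one HCSMachineTemplate, one tab-separated row per volume.
# $1: template name
# $2: namespace (optional, defaults to cpaas-system)
list_data_volumes() {
  printf 'MOUNTPATH\tSIZE\tTYPE\tFORMAT\n'
  kubectl get hcsmachinetemplate "$1" -n "${2:-cpaas-system}" \
    -o jsonpath='{range .spec.template.spec.dataVolumes[*]}{.mountPath}{"\t"}{.size}{"\t"}{.type}{"\t"}{.format}{"\n"}{end}'
}

# Example call (replace the name with a real template):
#   list_data_volumes my-old-template
```

A row whose first column is empty marks a data volume without a mountPath, which is worth noting before you decide which disks to preserve.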

Decide Which Disks to Preserve

Move only the disks that must survive node replacement to the pool-managed model.

Use the following split:

| Path | Recommended declaration |
| --- | --- |
| /var/cpaas | HCSMachineConfigPool.spec.configs[].persistentDisks[] |
| Monitoring or log data stored under /var/cpaas | HCSMachineConfigPool.spec.configs[].persistentDisks[] |
| /var/lib/kubelet | HCSMachineTemplate.spec.template.spec.dataVolumes[] unless your operational design requires retaining it |
| /var/lib/containerd | HCSMachineTemplate.spec.template.spec.dataVolumes[] unless your operational design requires retaining it |
| /var/lib/etcd | Use etcd backup and restore procedures for disaster recovery. Do not rely only on node-local disk retention. |

For automatic migration, the provider claims an existing data volume by matching mountPath. If a preserved disk has no mountPath, automatic claim is not available. Use a supported operational procedure to record the existing EVS volumeID in status.persistentDiskStatus[], or migrate the data to a disk with a declared mount path before you trigger replacement.

Create New Machine Templates

Create new HCSMachineTemplate resources for the replacement nodes. Do not edit the existing templates in place.

Copy the current template:

kubectl get hcsmachinetemplate <current-template-name> -n cpaas-system -o yaml > new-template.yaml

Edit new-template.yaml:

  1. Set metadata.name to a new template name.
  2. Remove server-generated fields, such as metadata.resourceVersion, metadata.uid, metadata.creationTimestamp, metadata.managedFields, and status.
  3. Leave runtime identity fields unset, including spec.template.spec.providerID and spec.template.spec.serverId.
  4. Remove the preserved paths from spec.template.spec.dataVolumes[].
  5. Keep only temporary data volumes that may be recreated with each ECS.
  6. Update spec.template.spec.imageName and other upgrade fields when this migration is part of a Kubernetes or image upgrade.

For example, after /var/cpaas moves to the pool, the template keeps temporary disks only:

hcs-machine-template-temporary-data-volumes.yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: HCSMachineTemplate
metadata:
  name: <new-template-name>
  namespace: cpaas-system
spec:
  template:
    spec:
      imageName: <vm-image-name>
      flavorName: <instance-flavor>
      availabilityZone: <availability-zone>
      rootVolume:
        type: SSD
        size: 100
      configPoolRef:
        name: <pool-name>
      dataVolumes:
        - size: 100
          type: SSD
          mountPath: /var/lib/containerd
          format: xfs

Apply the new template:

kubectl apply -f new-template.yaml -n cpaas-system
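
Before the pool is updated, it can be worth confirming that the new template no longer declares any path intended for persistentDisks[]. The following is a minimal sketch and not part of the provider: the function name conflicting_mount_paths and the two input files are hypothetical, and each file is assumed to hold one mount path per line.

```shell
#!/bin/sh
# Print every mount path that appears in both files, once each.
# $1: file with the template dataVolumes[] mount paths, one per line
# $2: file with the pool persistentDisks[] mount paths, one per line
# Exit status: 1 when at least one path is shared, 0 otherwise.
conflicting_mount_paths() {
  awk '
    # First file: remember every non-empty template path.
    FILENAME == ARGV[1] { if ($0 != "") template[$0] = 1; next }
    # Second file: report each shared path the first time it is seen.
    $0 != "" && ($0 in template) && !reported[$0]++ { print; found = 1 }
    END { exit (found ? 1 : 0) }
  ' "$1" "$2"
}

# One possible way to produce the input files (names are placeholders):
#   kubectl get hcsmachinetemplate NEW_TEMPLATE -n cpaas-system \
#     -o jsonpath='{range .spec.template.spec.dataVolumes[*]}{.mountPath}{"\n"}{end}' > template-paths.txt
#   kubectl get hcsmachineconfigpool POOL -n cpaas-system \
#     -o jsonpath='{range .spec.configs[*].persistentDisks[*]}{.mountPath}{"\n"}{end}' > pool-paths.txt
#   conflicting_mount_paths template-paths.txt pool-paths.txt
```

Empty output with exit status 0 means that no path is declared in both places. If the pool has not been updated yet, the second file can be written by hand from the planned persistentDisks[] entries.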

Prepare the Rollout Strategy

Before you update the pool, confirm the rollout strategy for each controller that will use the updated pool. Skip the MachineDeployment command if the cluster has no worker pool in this migration.

kubectl get kubeadmcontrolplane <kcp-name> -n cpaas-system \
  -o jsonpath='{.spec.rolloutStrategy.rollingUpdate.maxSurge}{"\n"}'

kubectl get machinedeployment <md-name> -n cpaas-system \
  -o jsonpath='{.spec.strategy.rollingUpdate.maxSurge}{"\n"}'

Each returned value must be 0 for pools that use persistent disks.

If any returned value is not 0, patch the affected rollout strategy before you update the pool:

kubectl patch kubeadmcontrolplane <kcp-name> -n cpaas-system \
  --type='merge' \
  -p='{"spec":{"rolloutStrategy":{"type":"RollingUpdate","rollingUpdate":{"maxSurge":0}}}}'

kubectl patch machinedeployment <md-name> -n cpaas-system \
  --type='merge' \
  -p='{"spec":{"strategy":{"type":"RollingUpdate","rollingUpdate":{"maxSurge":0}}}}'
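
When the check is scripted, a small guard can stop the procedure if a strategy still allows surge. This is a sketch under stated assumptions: the function name require_zero_max_surge is hypothetical, and an empty value (field not set) is treated as a failure because the effective default is then not known to be 0.

```shell
#!/bin/sh
# Succeed only when the given maxSurge value is exactly 0.
# $1: label used in the message, for example "KubeadmControlPlane/my-kcp"
# $2: value returned by the jsonpath query (may be empty)
require_zero_max_surge() {
  if [ "$2" = "0" ]; then
    printf '%s: maxSurge is 0\n' "$1"
  else
    printf '%s: maxSurge is "%s", expected 0\n' "$1" "$2" >&2
    return 1
  fi
}

# Example (KCP_NAME is a placeholder):
#   value=$(kubectl get kubeadmcontrolplane KCP_NAME -n cpaas-system \
#     -o jsonpath='{.spec.rolloutStrategy.rollingUpdate.maxSurge}')
#   require_zero_max_surge "KubeadmControlPlane/KCP_NAME" "$value" || exit 1
```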

Update the Machine Configuration Pool

Edit the HCSMachineConfigPool that both the old and the new template reference in HCSMachineTemplate.spec.template.spec.configPoolRef.name.

Add one persistentDisks[] entry under each hostname that must preserve the disk:

hcs-pool-persistent-disk.yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: HCSMachineConfigPool
metadata:
  name: <pool-name>
  namespace: cpaas-system
  labels:
    cluster.x-k8s.io/cluster-name: <cluster-name>
spec:
  configs:
    - hostname: <node-hostname>
      networks:
        - subnetName: <subnet-name>
          ipAddress: <node-ip>
      persistentDisks:
        - slot: 0
          size: 40
          type: SSD
          mountPath: /var/cpaas
          format: xfs
          mountOptions:
            - defaults
            - noatime

Use these rules when you edit the pool:

  • Start slot at 0 for each hostname.
  • Keep slots contiguous for each hostname.
  • Set size, type, mountPath, and format to match the old data volume that you want to claim.
  • Add cluster.x-k8s.io/cluster-name: <cluster-name> if the pool does not already have it.
  • Keep the same persistent-disk declaration across all nodes that must preserve the same path.
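
The first two rules (slots start at 0 and stay contiguous per hostname) can be checked mechanically before the pool is applied. The following is an illustrative sketch only: the function name check_slots and the input format, one "hostname slot" pair per line, are assumptions, and how the pairs are extracted from the pool manifest is left open.

```shell
#!/bin/sh
# Read "<hostname> <slot>" pairs from stdin, in any order, and report
# every hostname whose slots are not exactly 0, 1, ..., n-1.
# Exit status: 1 when a gap or duplicate is found, 0 otherwise.
check_slots() {
  awk '
    { seen[$1, $2] = 1; count[$1]++ }
    END {
      bad = 0
      for (h in count) {
        # With n entries, slots 0..n-1 must all be present. A duplicate
        # or a gap always leaves at least one of them missing.
        for (i = 0; i < count[h]; i++) {
          if (!((h, i) in seen)) {
            printf "%s: missing slot %d\n", h, i
            bad = 1
          }
        }
      }
      exit bad
    }
  '
}

# Example:
#   printf 'node-1 0\nnode-1 1\nnode-2 0\n' | check_slots
```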

Apply the updated pool only after the replacement template exists, the preserved paths have been removed from its dataVolumes[], and the rollout strategy uses maxSurge: 0:

kubectl apply -f hcs-pool-persistent-disk.yaml -n cpaas-system

Trigger Rolling Replacement

After you apply the updated pool, immediately point the control plane or worker controller to the new template in the same maintenance window. Do not leave a pool that declares /var/cpaas persistent disks while the active rollout still points to an old template that also declares /var/cpaas in dataVolumes[].

For a control plane migration, point the KubeadmControlPlane to the new template:

kubectl patch kubeadmcontrolplane <kcp-name> -n cpaas-system \
  --type='merge' \
  -p='{"spec":{"machineTemplate":{"infrastructureRef":{"name":"<new-template-name>"}}}}'

For a worker migration, point the MachineDeployment to the new template:

kubectl patch machinedeployment <md-name> -n cpaas-system \
  --type='merge' \
  -p='{"spec":{"template":{"spec":{"infrastructureRef":{"name":"<new-template-name>"}}}}}'

If the template reference already points to the target template and you need to force a one-by-one replacement, set rolloutAfter:

kubectl patch kubeadmcontrolplane <kcp-name> -n cpaas-system \
  --type='merge' \
  -p='{"spec": {"rolloutAfter": "'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'"}}'

kubectl patch machinedeployment <md-name> -n cpaas-system \
  --type='merge' \
  -p='{"spec": {"rolloutAfter": "'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'"}}'

Verify the Migration

Watch the rolling replacement:

kubectl get kubeadmcontrolplane <kcp-name> -n cpaas-system -w
kubectl get machinedeployment <md-name> -n cpaas-system -w
kubectl get machine -n cpaas-system -w
kubectl get hcsmachine -n cpaas-system -w

Inspect the pool status:

kubectl get hcsmachineconfigpool <pool-name> -n cpaas-system -o yaml

Each migrated disk appears under status.persistentDiskStatus:

| Phase | Meaning |
| --- | --- |
| Creating | The provider is creating the EVS volume for this hostname and slot. |
| Available | The EVS volume is ready and is not attached to any ECS. |
| Attaching | The replacement ECS exists; the provider is confirming the attachment. |
| Attached | The EVS disk is attached to the current ECS. |
| Detaching | The disk is being detached before the old ECS is removed. |
| Deleting | Pool deletion is in progress; the EVS volume is being removed. |
| Error | The provider cannot safely continue. Inspect lastError. |
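
When you watch a rollout from a script, the phases can be reduced to a coarse next step. The mapping below is a sketch that follows the phase descriptions above: the function name phase_next_step and the labels it prints are hypothetical, and the jsonpath in the comment assumes that each status.persistentDiskStatus[] entry exposes a phase field.

```shell
#!/bin/sh
# Map one persistent-disk phase to a coarse next step.
# $1: phase string as shown in the pool status
phase_next_step() {
  case "$1" in
    Attached) echo "done" ;;
    # Transitional phases: the provider is still working on the disk.
    Creating|Available|Attaching|Detaching|Deleting) echo "wait" ;;
    Error) echo "inspect-lastError" ;;
    *) echo "unknown" ;;
  esac
}

# Example (POOL_NAME is a placeholder):
#   kubectl get hcsmachineconfigpool POOL_NAME -n cpaas-system \
#     -o jsonpath='{range .status.persistentDiskStatus[*]}{.phase}{"\n"}{end}' |
#     while read -r phase; do phase_next_step "$phase"; done
```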

Confirm the replacement node can read data from the preserved path:

kubectl --kubeconfig <workload-kubeconfig> debug node/<node-name> -it --image=busybox -- \
  chroot /host sh -c 'mount | grep /var/cpaas && ls -la /var/cpaas'

For a stronger data-retention check, write a marker before the rollout and read it after the replacement node becomes Ready:

kubectl --kubeconfig <workload-kubeconfig> debug node/<node-name> -it --image=busybox -- \
  chroot /host sh -c 'cat /var/cpaas/<marker-file>'
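
The marker itself has to be written before the rollout starts. One possible way is sketched below, using the same debug-pod approach as the read command. The function name write_marker and its arguments are hypothetical, and the marker text must not contain single quotes.

```shell
#!/bin/sh
# Write a marker file under /var/cpaas on one node through a debug pod.
# $1: workload cluster kubeconfig
# $2: node name
# $3: marker file name (created directly under /var/cpaas)
# $4: marker text (no single quotes)
write_marker() {
  kubectl --kubeconfig "$1" debug "node/$2" --image=busybox -- \
    chroot /host sh -c "printf '%s\n' '$4' > '/var/cpaas/$3'"
}

# Example (all four values are placeholders):
#   write_marker workload.kubeconfig node-1 migration-marker.txt "before-rollout"
```

Because the command does not attach to the debug pod, it may return before the file has been written. Confirm that the file exists before you trigger the replacement.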

Failure Handling

Use the pool status to decide the next action when a disk enters phase: Error.

| Symptom | Likely cause | Action |
| --- | --- | --- |
| lastError reports a mount-path conflict | The same path exists in both dataVolumes[] and persistentDisks[] | Remove the preserved path from the new HCSMachineTemplate, then retry the rollout. |
| lastError reports no mount path match | The old data volume has no matching mountPath | Use a supported operational procedure to record the existing EVS volumeID, or migrate the data to a disk with a declared mount path before retrying replacement. |
| lastError reports size or type mismatch | The pool declaration does not match the existing EVS disk | Correct the pool entry so size and type match the existing disk. |
| lastError reports an availability zone conflict | The existing EVS disk and the target ECS are in different availability zones | Use an ECS availability zone that matches the EVS disk, or migrate the disk to the target zone before retrying. |
| lastError reports that the EVS disk is attached to another ECS | The disk is attached outside the current replacement flow | Detach the disk manually only after you confirm ownership, then let the provider reconcile again. |
| lastError reports that volumeID was not found | The tracked EVS disk was deleted | Restore or recreate the expected EVS disk with the correct ownership tags before retrying. |

Do not delete the old HCSMachine or force-remove finalizers while a persistent disk is in an unresolved error state. The provider blocks deletion so that the old ECS is not removed before the disk has been safely claimed and detached.

Limitations and Recovery Notes

  • This procedure applies to clusters that use HCSMachineConfigPool with fixed hostnames and IP addresses.
  • Pool-managed persistent disks require one-by-one replacement. Keep maxSurge: 0 for each control plane or worker rollout that uses persistent disks.
  • The provider automatically claims existing data volumes by matching a non-empty mountPath.
  • dataVolumes[] that are not declared in persistentDisks[] remain template-owned and may be deleted with the old ECS.
  • After the provider accepts a persistent disk entry, treat slot, size, type, format, and mountPath as immutable.
  • mountOptions can change, but the change takes effect only on a replacement ECS.
  • Single-control-plane HCS clusters are creation-only topologies in the documented upgrade workflow. Do not use this rolling migration procedure for a single-control-plane cluster.