Upgrade

This article describes how to upgrade from GPU-manager or an older Hami version to the latest Hami release.

GPU-manager to Hami

Note

  1. GPU-manager and Hami cannot be deployed on the same node, but they can coexist in the same cluster.
  2. During the upgrade, applications must be modified one by one; modifying an application restarts its workload pods.
  3. When you have only one GPU node, you must uninstall GPU-manager from it before installing Hami. Do this by switching the node labels: first remove the nvidia-device-enable=vgpu label to delete the GPU-manager instance on that node, then add the gpu=on label to deploy the Hami plugin on it.
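
The label switch in point 3 can be performed with kubectl; `worker-gpu-1` below is a placeholder node name:

```shell
# Remove the GPU-manager label (the trailing "-" deletes the label),
# which removes the GPU-manager instance from the node.
kubectl label node worker-gpu-1 nvidia-device-enable-

# Add the Hami label so the Hami device plugin is deployed on the node.
kubectl label node worker-gpu-1 gpu=on
```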

Procedure

Modify your applications one by one. Example:

Your old GPU-manager instance:

spec:
  containers:
    - image: your-image
      imagePullPolicy: IfNotPresent
      name: gpu
      resources:
        limits:
          cpu: '2'
          memory: 4Gi
          tencent.com/vcuda-core: "50"
          tencent.com/vcuda-memory: "8000"

Migrate to Hami:

spec:
  containers:
    - image: your-image
      imagePullPolicy: IfNotPresent
      name: gpu
      resources:
        limits:
          cpu: '2'
          memory: 4Gi
          nvidia.com/gpualloc: 1     # Request 1 physical GPU (required)
          nvidia.com/gpucores: "50"  # Request 50% of the compute resources per GPU (optional)
          nvidia.com/gpumem: 8000    # Request 8000MB of video memory per GPU (optional)
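
The mapping above can be sketched as a small script. `migrate_limits` is a hypothetical helper, not part of either project; the 100-cores-equals-one-GPU assumption follows the gpucores comment above, and the gpumem value carries over unchanged as in the example.

```python
import math

# Hypothetical helper sketching the resource mapping shown above. Assumptions:
# - tencent.com/vcuda-core maps to nvidia.com/gpucores (compute share per GPU),
# - tencent.com/vcuda-memory maps to nvidia.com/gpumem (memory per GPU, in MB),
# - nvidia.com/gpualloc (required by Hami) is the physical GPU count,
#   derived from vcuda-core, where 100 cores represent one whole GPU.
def migrate_limits(limits: dict) -> dict:
    migrated = {k: v for k, v in limits.items() if not k.startswith("tencent.com/")}
    cores = int(limits.get("tencent.com/vcuda-core", "0"))
    memory = int(limits.get("tencent.com/vcuda-memory", "0"))
    migrated["nvidia.com/gpualloc"] = max(1, math.ceil(cores / 100))
    if cores % 100:  # partial-GPU share, expressed per physical GPU
        migrated["nvidia.com/gpucores"] = str(cores % 100)
    if memory:
        migrated["nvidia.com/gpumem"] = memory
    return migrated

old = {
    "cpu": "2",
    "memory": "4Gi",
    "tencent.com/vcuda-core": "50",
    "tencent.com/vcuda-memory": "8000",
}
print(migrate_limits(old))
# {'cpu': '2', 'memory': '4Gi', 'nvidia.com/gpualloc': 1,
#  'nvidia.com/gpucores': '50', 'nvidia.com/gpumem': 8000}
```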

Hami to Hami

Important Changes (v2.5.0 → v2.8.0)

| Version | Parameter Availability | Required Action After Upgrade |
| --- | --- | --- |
| Hami v2.5 | Nvidia Runtime Class Name and Create Nvidia Runtime Class are not included in the pop-up form. | N/A |
| Hami v2.6 | These parameters must be configured when deploying a plugin instance on a new node. | Update the plugin deployment parameters: set Nvidia Runtime Class Name to hami-nvidia and enable the Create Nvidia Runtime Class switch (true). |
| Hami v2.8 | The Helm value devicePlugin.nvidianodeSelector was renamed to devicePlugin.nvidiaNodeSelector (capital N). | If you had overridden this value, update the key name in your Helm values. |
| Hami v2.8 | The monitor resource config moved from devicePlugin.vgpuMonitor.resources to devicePlugin.monitor.resources. | If you had customized monitor resources, update the value path. |
| Hami v2.8 | Alauda Build of Hami-WebUI v1.5.0 is not compatible. | Upgrade Alauda Build of Hami-WebUI to v1.10.0, which is compatible with Hami v2.7 and v2.8. |

⚠️ Upgrading from v2.5 to v2.8.0 should not affect existing applications. ✅ It is recommended to restart applications with a rolling update to avoid unexpected issues.
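
For the two v2.8 Helm value changes listed above, a before/after sketch of the overridden values (the gpu: "on" selector and the memory limit are illustrative assumptions; only the key names matter):

```yaml
# Before (v2.7 and earlier):
devicePlugin:
  nvidianodeSelector:
    gpu: "on"
  vgpuMonitor:
    resources:
      limits:
        memory: 256Mi

# After (v2.8.0):
devicePlugin:
  nvidiaNodeSelector:
    gpu: "on"
  monitor:
    resources:
      limits:
        memory: 256Mi
```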


Procedure

  1. Upgrade ACP version if needed.
  2. Upload the package of Hami v2.8.0 plugin to ACP.
  3. Go to the Administrator -> Clusters -> <Target Cluster> -> Functional Components page, then click the Upgrade button; you will see that Alauda Build of HAMi can be upgraded.
  4. Update the ConfigMaps that define extended resources on ACP by running the following script in your GPU cluster:
kubectl apply -f - <<EOF
apiVersion: v1
data:
  dataType: integer
  defaultValue: "1"
  descriptionEn: Number of GPU jobs for the resource quota. When creating a workload, this declares how many physical GPUs are requested; the gpu core and gpu memory requests are the per-physical-GPU usage
  descriptionZh: 资源配额代表 GPU 任务数。创建负载时代表申请的物理 gpu 个数, 申请的算力和显存都是每个物理 GPU 的使用量
  group: hami-nvidia
  groupI18n: '{"zh": "HAMi NVIDIA", "en": "HAMi NVIDIA"}'
  key: nvidia.com/gpualloc
  labelEn: gpu number
  labelZh: gpu 个数
  limits: optional
  requests: disabled
  resourceUnit: "count"
  relatedResources: "nvidia.com/gpucores,nvidia.com/gpumem"
  excludeResources: "nvidia.com/mps-core,nvidia.com/mps-memory,tencent.com/vcuda-core,tencent.com/vcuda-memory"
  runtimeClassName: ""
kind: ConfigMap
metadata:
  labels:
    features.cpaas.io/enabled: "true"
    features.cpaas.io/group: hami-nvidia
    features.cpaas.io/type: CustomResourceLimitation
  name: cf-crl-hami-nvidia-gpualloc
  namespace: kube-public
---
apiVersion: v1
data:
  dataType: integer
  defaultValue: "20"
  descriptionEn: vgpu cores; 100 cores represent the full computing power of one physical GPU
  descriptionZh: vgpu 算力, 100 算力代表一个物理 GPU 的全部算力
  group: hami-nvidia
  groupI18n: '{"zh": "HAMi NVIDIA", "en": "HAMi NVIDIA"}'
  key: nvidia.com/gpucores
  prefix: limits
  labelEn: vgpu cores
  labelZh: vgpu 算力
  limits: optional
  requests: disabled
  relatedResources: "nvidia.com/gpualloc,nvidia.com/gpumem"
  excludeResources: "nvidia.com/mps-core,nvidia.com/mps-memory,tencent.com/vcuda-core,tencent.com/vcuda-memory"
  runtimeClassName: ""
  ignoreNodeCheck: "true"
kind: ConfigMap
metadata:
  labels:
    features.cpaas.io/enabled: "true"
    features.cpaas.io/group: hami-nvidia
    features.cpaas.io/type: CustomResourceLimitation
  name: cf-crl-hami-nvidia-gpucores
  namespace: kube-public
---
apiVersion: v1
data:
  dataType: integer
  defaultValue: "4000"
  group: hami-nvidia
  groupI18n: '{"zh": "HAMi NVIDIA", "en": "HAMi NVIDIA"}'
  key: nvidia.com/gpumem
  prefix: limits
  labelEn: vgpu memory
  labelZh: vgpu 显存
  limits: optional
  requests: disabled
  resourceUnit: "Mi"
  relatedResources: "nvidia.com/gpualloc,nvidia.com/gpucores"
  excludeResources: "nvidia.com/mps-core,nvidia.com/mps-memory,tencent.com/vcuda-core,tencent.com/vcuda-memory"
  runtimeClassName: ""
  ignoreNodeCheck: "true"
kind: ConfigMap
metadata:
  labels:
    features.cpaas.io/enabled: "true"
    features.cpaas.io/group: hami-nvidia
    features.cpaas.io/type: CustomResourceLimitation
  name: cf-crl-hami-nvidia-gpumem
  namespace: kube-public
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: cf-crl-hami-config
  namespace: kube-public
  labels:
    device-plugin.cpaas.io/config: "true"
data:
  deviceName: "HAMi"
  nodeLabelKey: "gpu"
  nodeLabelValue: "on"
EOF
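
After the script applies, you can verify that the ConfigMaps exist, using the labels and names from the manifests above:

```shell
# The three extended-resource ConfigMaps share this group label.
kubectl get configmap -n kube-public -l features.cpaas.io/group=hami-nvidia

# The device-plugin config ConfigMap.
kubectl get configmap -n kube-public cf-crl-hami-config
```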

Note

If you configured resource quotas for HAMi resources in versions prior to v2.7.1, delete and reconfigure them. If you are upgrading to Hami v2.8.0 and also use Alauda Build of Hami-WebUI, make sure the WebUI version is v1.10.0; the earlier v1.5.0 is not compatible with Hami v2.8.
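
One way to locate the resource quotas that still reference the HAMi resource names before deleting and reconfiguring them (assuming cluster access):

```shell
# Show every ResourceQuota with its hard limits; look for entries that
# reference nvidia.com/gpualloc, nvidia.com/gpucores, or nvidia.com/gpumem.
kubectl get resourcequota -A \
  -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,HARD:.spec.hard
```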