Install for NVIDIA GPU

This chapter covers the end-to-end installation steps for clusters with NVIDIA GPUs. For Huawei Ascend NPUs, see Install for Huawei Ascend NPU.

Prerequisites

  • Cluster administrator access to your ACP cluster
  • Kubernetes version: v1.16+
  • CUDA version: v10.2+
  • NVIDIA driver: v440+ for HAMi and v450+ for DCGM-Exporter
  • ACP version: v4.0+

Procedure

Installing the NVIDIA driver on your GPU node

Refer to the NVIDIA official installation guide.

Installing the NVIDIA Container Runtime

Refer to the NVIDIA Container Toolkit installation guide.

Adding the NVIDIA yum repository on the GPU node

Note: Make sure the GPU node can access nvidia.github.io

distribution=$(. /etc/os-release;echo $ID$VERSION_ID) && curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.repo | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
yum makecache -y

When the message "Metadata cache created." appears, the repository was added successfully.

Installing the NVIDIA Container Toolkit

yum install nvidia-container-toolkit -y

When the prompt "Complete!" appears, the installation succeeded.

Configure containerd to use the NVIDIA runtime and restart it:

nvidia-ctk runtime configure --runtime=containerd
systemctl restart containerd
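
To confirm the change took effect, you can check containerd's config for the new runtime entry; this assumes containerd's default config path of /etc/containerd/config.toml:

grep -A 2 'nvidia' /etc/containerd/config.toml
# Expect a runtimes entry such as [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]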

Downloading the cluster plugins

INFO

The Alauda Build of Hami, Alauda Build of DCGM-Exporter, and (optionally) Alauda Build of Hami-WebUI cluster plugins can be retrieved from the Customer Portal.

Please contact Customer Support for more information.

Note: Alauda Build of DCGM-Exporter version v4.2.3-413 deployed in the global cluster may cause the component to be continuously reinstalled. Version v4.2.3-413-1 resolves this issue, so be sure to use that version.

Uploading the Cluster plugin

For more information on uploading the cluster plugin, refer to Uploading Cluster Plugins.

Installing Alauda Build of Hami

  1. Add the label "gpu=on" to every NVIDIA GPU node so that hami-device-plugin (NVIDIA) runs only on those nodes.

    kubectl label nodes {nodeid} gpu=on
    TIP

    This label is for NVIDIA nodes only — Ascend nodes use the ascend=on label instead. See Install for Huawei Ascend NPU.
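
    To confirm the label is in place, list the labeled nodes; every NVIDIA GPU node should appear:

    kubectl get nodes -l gpu=on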

  2. Go to the Administrator -> Marketplace -> Cluster Plugin page, switch to the target cluster, and then deploy the Alauda Build of Hami Cluster plugin.

    Keep Enable NVIDIA on in the deploy form. If the cluster does not contain any Huawei Ascend NPU nodes, leave Enable Ascend off. Other parameters can be left at their defaults and adjusted later, once you are familiar with their effects.

    • Enable NVIDIA (default: Enabled): When enabled, both hami-scheduler and hami-device-plugin (NVIDIA) are deployed.
    • Enable Ascend (default: Disabled): Leave disabled for NVIDIA-only clusters. See Install for Huawei Ascend NPU if your cluster also has Huawei Ascend NPUs.
    TIP

    Enable NVIDIA and Enable Ascend are independent. You can turn either of them off, but you should keep at least one device type enabled.

  3. Verify the result. The plugin shows a status of "Installed" in the UI, or you can check the pod status:

    kubectl get pods -n kube-system | grep -E "hami-scheduler|hami-device-plugin"
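
    Pod names vary by cluster, but a healthy install shows the hami-scheduler pod and one hami-device-plugin pod per labeled node, all in Running state, along the lines of:

    hami-scheduler-xxxxxxxxxx-xxxxx   Running
    hami-device-plugin-xxxxx          Running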
  4. Create the ConfigMaps that define extended resources, which can then be set on workloads in ACP. Run the following script in your GPU cluster:

    kubectl apply -f - <<EOF
    apiVersion: v1
    data:
      dataType: integer
      defaultValue: "1"
      descriptionEn: Number of physical GPUs for the resource quota. When creating a workload, declare how many physical GPUs are needed; the gpu core and gpu memory requests are per physical GPU
      descriptionZh: 资源配额代表 GPU 任务数。创建负载时代表申请的物理 gpu 个数, 申请的算力和显存都是每个物理 GPU 的使用量
      group: hami-nvidia
      groupI18n: '{"zh": "HAMi NVIDIA", "en": "HAMi NVIDIA"}'
      key: nvidia.com/gpualloc
      labelEn: gpu number
      labelZh: gpu 个数
      limits: optional
      requests: disabled
      resourceUnit: "count"
      relatedResources: "nvidia.com/gpucores,nvidia.com/gpumem"
      excludeResources: "nvidia.com/mps-core,nvidia.com/mps-memory,tencent.com/vcuda-core,tencent.com/vcuda-memory"
      runtimeClassName: ""
    kind: ConfigMap
    metadata:
      labels:
        features.cpaas.io/enabled: "true"
        features.cpaas.io/group: hami-nvidia
        features.cpaas.io/type: CustomResourceLimitation
      name: cf-crl-hami-nvidia-gpualloc
      namespace: kube-public
    ---
    apiVersion: v1
    data:
      dataType: integer
      defaultValue: "20"
      descriptionEn: vgpu cores; 100 cores represent the full computing power of one physical GPU
      descriptionZh: vgpu 算力, 100 算力代表一个物理 GPU 的全部算力
      group: hami-nvidia
      groupI18n: '{"zh": "HAMi NVIDIA", "en": "HAMi NVIDIA"}'
      key: nvidia.com/gpucores
      prefix: limits
      labelEn: vgpu cores
      labelZh: vgpu 算力
      limits: optional
      requests: disabled
      relatedResources: "nvidia.com/gpualloc,nvidia.com/gpumem"
      excludeResources: "nvidia.com/mps-core,nvidia.com/mps-memory,tencent.com/vcuda-core,tencent.com/vcuda-memory"
      runtimeClassName: ""
      ignoreNodeCheck: "true"
    kind: ConfigMap
    metadata:
      labels:
        features.cpaas.io/enabled: "true"
        features.cpaas.io/group: hami-nvidia
        features.cpaas.io/type: CustomResourceLimitation
      name: cf-crl-hami-nvidia-gpucores
      namespace: kube-public
    ---
    apiVersion: v1
    data:
      dataType: integer
      defaultValue: "4000"
      group: hami-nvidia
      groupI18n: '{"zh": "HAMi NVIDIA", "en": "HAMi NVIDIA"}'
      key: nvidia.com/gpumem
      prefix: limits
      labelEn: vgpu memory
      labelZh: vgpu 显存
      limits: optional
      requests: disabled
      resourceUnit: "Mi"
      relatedResources: "nvidia.com/gpualloc,nvidia.com/gpucores"
      excludeResources: "nvidia.com/mps-core,nvidia.com/mps-memory,tencent.com/vcuda-core,tencent.com/vcuda-memory"
      runtimeClassName: ""
      ignoreNodeCheck: "true"
    kind: ConfigMap
    metadata:
      labels:
        features.cpaas.io/enabled: "true"
        features.cpaas.io/group: hami-nvidia
        features.cpaas.io/type: CustomResourceLimitation
      name: cf-crl-hami-nvidia-gpumem
      namespace: kube-public
    ---
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: cf-crl-hami-config
      namespace: kube-public
      labels:
        device-plugin.cpaas.io/config: "true"
    data:
      deviceName: "HAMi"
      nodeLabelKey: "gpu"
      nodeLabelValue: "on"
    EOF
    

After this, HAMi appears in the extended resource type drop-down on the resource configuration page when creating an application in the ACP business view, and you can start using it.
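
You can confirm the ConfigMaps were created; all four names defined above should be listed:

kubectl get configmap -n kube-public | grep cf-crl-hami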

Installing Alauda Build of DCGM-Exporter

  1. Go to the Administrator -> Marketplace -> Cluster Plugin page, switch to the target cluster, and then deploy the Alauda Build of DCGM-Exporter Cluster plugin. Set the node labels in the popup form:

    • Node Label Key: gpu
    • Node Label Value: on

    If you need to enable dcgm-exporter for pgpu, also add the following label (an example command follows these steps):

    • Node Label Key: nvidia-device-enable
    • Node Label Value: pgpu
  2. Verify the result. The plugin shows a status of "Installed" in the UI, or you can check the pod status:

    kubectl get pods -n kube-system | grep dcgm-exporter
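
The pgpu label mentioned in step 1 is applied the same way as gpu=on; {nodeid} is a placeholder for your node name:

kubectl label nodes {nodeid} nvidia-device-enable=pgpu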

Installing Monitor

You can use the ACP MonitorDashboard or the Alauda Build of Hami-WebUI.

Installing ACP MonitorDashboard (optional)

Create the ACP MonitorDashboard resource for the HAMi GPU monitor in the ACP dashboard. Save the hami-vgpu-metrics-dashboard-v1.0.2.yaml file to the business cluster and execute:

kubectl apply -f hami-vgpu-metrics-dashboard-v1.0.2.yaml

Installing Alauda Build of Hami-WebUI (optional)

Alauda Build of Hami-WebUI version compatibility:

  • v1.10.0 is compatible with Hami v2.7 and v2.8.
  • v1.5.0 is not compatible with Hami v2.8.
  • When deploying Hami v2.8, use Alauda Build of Hami-WebUI v1.10.0.
  1. Go to the Administrator -> Marketplace -> Cluster Plugin page, switch to the target cluster, and then deploy the Alauda Build of Hami-WebUI Cluster plugin. Fill in the Prometheus address and Prometheus authentication; enabling NodePort access is recommended. The Prometheus address and auth string can be retrieved with the following script:
    #!/bin/bash

    # Prefer the in-cluster service address; fall back to the external address.
    addr=$(kubectl get feature monitoring -o jsonpath='{.spec.accessInfo.database.service}')
    if [ -z "$addr" ]; then
      addr=$(kubectl get feature monitoring -o jsonpath='{.spec.accessInfo.database.address}')
    fi
    echo "Prometheus Address: $addr"

    # Build the Basic auth header from the monitoring secret.
    secret_name=$(kubectl get feature monitoring -o jsonpath='{.spec.accessInfo.database.basicAuth.secretName}')
    namespace="cpaas-system"

    username=$(kubectl get secret "$secret_name" -n "$namespace" -o jsonpath='{.data.username}' | base64 -d)
    password=$(kubectl get secret "$secret_name" -n "$namespace" -o jsonpath='{.data.password}' | base64 -d)

    auth="Basic $(echo -n "$username:$password" | base64 -w 0)"
    echo "Prometheus Auth   : $auth"
  2. Verify the result. The plugin shows a status of "Installed" in the UI, or you can check the pod status:
    kubectl get pods -n cpaas-system | grep "hami-webui"

Verification

This section describes how to verify that the installed Alauda Build of Hami and the related monitoring components are working.

Verify Hami

  1. Check whether there are allocatable GPU resources on the GPU node. Run the following command on a control node of the business cluster:
    kubectl get node ${nodeName} -o=jsonpath='{.status.allocatable}'
    # The output contains: "nvidia.com/gpualloc":"10" (the specific value depends on the number of GPU cards and installation parameters)
  2. Deploy a GPU demo instance and check whether there is any GPU-related resource consumption; a minimal demo Pod sketch follows the command below. Run the following command on the GPU node of the business cluster:
    nvidia-smi pmon -s u -d 1
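
    A minimal demo Pod sketch, assuming the public CUDA vectorAdd sample image; substitute any CUDA workload image available in your registry:

    kubectl apply -f - <<EOF
    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-demo
    spec:
      restartPolicy: OnFailure
      containers:
        - name: cuda-demo
          image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04  # assumed sample image
          resources:
            limits:
              nvidia.com/gpualloc: 1     # 1 physical GPU (required)
              nvidia.com/gpucores: "50"  # optional: 50% of one GPU's compute
              nvidia.com/gpumem: 4000    # optional: 4000 MB of video memory
    EOF

    The vectorAdd sample exits quickly, so run the pmon command while the pod is executing to catch its sm/mem activity.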

If both sm and mem contain data, the GPU is ready, and you can start developing GPU applications on the GPU node. Note: When deploying GPU applications, be sure to configure the resource parameters shown below (nvidia.com/gpualloc is required; gpucores and gpumem are optional):

spec:
  containers:
    - image: your-image
      imagePullPolicy: IfNotPresent
      name: gpu
      resources:
        limits:
          cpu: '2'
          memory: 4Gi
          nvidia.com/gpualloc: 1     # Request 1 physical GPU (required)
          nvidia.com/gpucores: "50"  # Request 50% of the compute resources per GPU (optional)
          nvidia.com/gpumem: 8000    # Request 8000MB of video memory per GPU (optional)
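
Once the pod is scheduled, the allocated amounts also show up in the node's resource accounting; a quick check using standard kubectl:

kubectl describe node ${nodeName} | grep -A 6 "Allocated resources"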

Verify MonitorDashboard

After the HAMi vgpu service has been running for a while, navigate to Administrator -> Operations Center -> Monitor -> Dashboards page and switch to the HAMi GPU Monitoring panel under Hami. You will see the relevant chart data.

Verify Hami-WebUI

After the HAMi-WebUI components have been running for a while, access http://{business cluster node IP}:{NodePort} in your browser.