Install for NVIDIA GPU

This chapter covers the end-to-end installation steps for clusters with NVIDIA GPUs. For Huawei Ascend NPUs, see Install for Huawei Ascend NPU.

Prerequisites

  • Cluster administrator access to your ACP cluster
  • Kubernetes version: v1.16+
  • CUDA version: v10.2+
  • NVIDIA driver: v440+ for HAMi and v450+ for DCGM-Exporter
  • ACP version: v4.0+

Procedure

Installing the NVIDIA driver on your GPU node

Refer to the NVIDIA official installation guide.

Installing the NVIDIA Container Runtime

Refer to the NVIDIA Container Toolkit installation guide.

Adding the NVIDIA yum repository on the GPU node

Note: Make sure the GPU node can access nvidia.github.io

distribution=$(. /etc/os-release;echo $ID$VERSION_ID) && curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.repo | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
yum makecache -y

When the message "Metadata cache created." appears, the repository was added successfully.

Installing the NVIDIA Container Toolkit

yum install nvidia-container-toolkit -y

When the prompt "Complete!" appears, the installation succeeded.

Configure containerd to use the NVIDIA runtime and restart it:

nvidia-ctk runtime configure --runtime=containerd
systemctl restart containerd
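
To confirm the change took effect, you can check containerd's config for the new runtime entry; this assumes containerd's default config path of /etc/containerd/config.toml:

grep -A 2 'nvidia' /etc/containerd/config.toml
# Expect a runtimes entry such as [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]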

Downloading the cluster plugins

INFO

The Alauda Build of Hami, Alauda Build of DCGM-Exporter, and (optionally) Alauda Build of Hami-WebUI cluster plugins can be retrieved from the Customer Portal.

Please contact Customer Support for more information.

Note: Alauda Build of DCGM-Exporter version v4.2.3-413 deployed in the global cluster may cause the component to be continuously reinstalled. Version v4.2.3-413-1 resolves this issue, so be sure to use that version.

Uploading the Cluster plugin

For more information on uploading the cluster plugin, refer to Uploading Cluster Plugins.

Installing Alauda Build of Hami

  1. Add the label "gpu=on" to every NVIDIA GPU node so that hami-device-plugin (NVIDIA) runs only on those nodes.

    kubectl label nodes {nodeid} gpu=on
    TIP

    This label is for NVIDIA nodes only — Ascend nodes use the ascend=on label instead. See Install for Huawei Ascend NPU.
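
    To confirm the label is in place, list the labeled nodes; every NVIDIA GPU node should appear:

    kubectl get nodes -l gpu=on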

  2. Go to the Administrator -> Marketplace -> Cluster Plugin page, switch to the target cluster, and then deploy the Alauda Build of Hami Cluster plugin.

    Keep Enable NVIDIA on in the deploy form. If the cluster does not contain any Huawei Ascend NPU nodes, leave Enable Ascend off. Other parameters can be left at their defaults and adjusted later, once you are familiar with their effects.

    • Enable NVIDIA (default: Enabled): When enabled, both hami-scheduler and hami-device-plugin (NVIDIA) are deployed.
    • Enable Ascend (default: Disabled): Leave disabled for NVIDIA-only clusters. See Install for Huawei Ascend NPU if your cluster also has Huawei Ascend NPUs.
    TIP

    Enable NVIDIA and Enable Ascend are independent. You can turn either of them off, but you should keep at least one device type enabled.

  3. Verify the result. The plugin shows a status of "Installed" in the UI, or you can check the pod status:

    kubectl get pods -n kube-system | grep -E "hami-scheduler|hami-device-plugin"
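
    Pod names vary by cluster, but a healthy install shows the hami-scheduler pod and one hami-device-plugin pod per labeled node, all in Running state, along the lines of:

    hami-scheduler-xxxxxxxxxx-xxxxx   Running
    hami-device-plugin-xxxxx          Running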
  4. Create the ConfigMaps that define extended resources, which can then be set on workloads in ACP. Run the following script in your GPU cluster:

    kubectl apply -f - <<EOF
    apiVersion: v1
    data:
      dataType: integer
      defaultValue: "1"
      descriptionEn: Number of physical GPUs for the resource quota. When creating a workload, declare how many physical GPUs are needed; the gpu core and gpu memory requests are per physical GPU
      descriptionZh: 资源配额代表 GPU 任务数。创建负载时代表申请的物理 gpu 个数, 申请的算力和显存都是每个物理 GPU 的使用量
      group: hami-nvidia
      groupI18n: '{"zh": "HAMi NVIDIA", "en": "HAMi NVIDIA"}'
      key: nvidia.com/gpualloc
      labelEn: gpu number
      labelZh: gpu 个数
      limits: optional
      requests: disabled
      resourceUnit: "count"
      relatedResources: "nvidia.com/gpucores,nvidia.com/gpumem"
      excludeResources: "nvidia.com/mps-core,nvidia.com/mps-memory,tencent.com/vcuda-core,tencent.com/vcuda-memory"
      runtimeClassName: ""
    kind: ConfigMap
    metadata:
      labels:
        features.cpaas.io/enabled: "true"
        features.cpaas.io/group: hami-nvidia
        features.cpaas.io/type: CustomResourceLimitation
      name: cf-crl-hami-nvidia-gpualloc
      namespace: kube-public
    ---
    apiVersion: v1
    data:
      dataType: integer
      defaultValue: "20"
      descriptionEn: vgpu cores; 100 cores represent the full computing power of one physical GPU
      descriptionZh: vgpu 算力, 100 算力代表一个物理 GPU 的全部算力
      group: hami-nvidia
      groupI18n: '{"zh": "HAMi NVIDIA", "en": "HAMi NVIDIA"}'
      key: nvidia.com/gpucores
      prefix: limits
      labelEn: vgpu cores
      labelZh: vgpu 算力
      limits: optional
      requests: disabled
      relatedResources: "nvidia.com/gpualloc,nvidia.com/gpumem"
      excludeResources: "nvidia.com/mps-core,nvidia.com/mps-memory,tencent.com/vcuda-core,tencent.com/vcuda-memory"
      runtimeClassName: ""
      ignoreNodeCheck: "true"
    kind: ConfigMap
    metadata:
      labels:
        features.cpaas.io/enabled: "true"
        features.cpaas.io/group: hami-nvidia
        features.cpaas.io/type: CustomResourceLimitation
      name: cf-crl-hami-nvidia-gpucores
      namespace: kube-public
    ---
    apiVersion: v1
    data:
      dataType: integer
      defaultValue: "4000"
      group: hami-nvidia
      groupI18n: '{"zh": "HAMi NVIDIA", "en": "HAMi NVIDIA"}'
      key: nvidia.com/gpumem
      prefix: limits
      labelEn: vgpu memory
      labelZh: vgpu 显存
      limits: optional
      requests: disabled
      resourceUnit: "Mi"
      relatedResources: "nvidia.com/gpualloc,nvidia.com/gpucores"
      excludeResources: "nvidia.com/mps-core,nvidia.com/mps-memory,tencent.com/vcuda-core,tencent.com/vcuda-memory"
      runtimeClassName: ""
      ignoreNodeCheck: "true"
    kind: ConfigMap
    metadata:
      labels:
        features.cpaas.io/enabled: "true"
        features.cpaas.io/group: hami-nvidia
        features.cpaas.io/type: CustomResourceLimitation
      name: cf-crl-hami-nvidia-gpumem
      namespace: kube-public
    ---
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: cf-crl-hami-config
      namespace: kube-public
      labels:
        device-plugin.cpaas.io/config: "true"
    data:
      deviceName: "HAMi"
      nodeLabelKey: "gpu"
      nodeLabelValue: "on"
    EOF
    

After this, HAMi appears in the extended resource type drop-down on the resource configuration page when creating an application in the ACP business view, and you can start using it.
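
You can confirm the ConfigMaps were created; all four names defined above should be listed:

kubectl get configmap -n kube-public | grep cf-crl-hami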

Installing Alauda Build of DCGM-Exporter

  1. Go to the Administrator -> Marketplace -> Cluster Plugin page, switch to the target cluster, and then deploy the Alauda Build of DCGM-Exporter Cluster plugin. Set the node labels in the popup form:

    • Node Label Key: gpu
    • Node Label Value: on

    If you need to enable dcgm-exporter for pgpu, also add the following label (an example command follows these steps):

    • Node Label Key: nvidia-device-enable
    • Node Label Value: pgpu
  2. Verify the result. The plugin shows a status of "Installed" in the UI, or you can check the pod status:

    kubectl get pods -n kube-system | grep dcgm-exporter
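
The pgpu label mentioned in step 1 is applied the same way as gpu=on; {nodeid} is a placeholder for your node name:

kubectl label nodes {nodeid} nvidia-device-enable=pgpu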

Installing Monitor

You can use the ACP MonitorDashboard or the Alauda Build of Hami-WebUI.

Installing ACP MonitorDashboard (optional)

Create the ACP MonitorDashboard resource for the HAMi GPU monitor in the ACP dashboard. Save the hami-vgpu-metrics-dashboard-v1.0.2.yaml file to the business cluster and execute:

kubectl apply -f hami-vgpu-metrics-dashboard-v1.0.2.yaml

Installing Alauda Build of Hami-WebUI (optional)

Alauda Build of Hami-WebUI version compatibility:

  • v1.10.0 is compatible with Hami v2.7 and v2.8.
  • v1.5.0 is not compatible with Hami v2.8.
  • When deploying Hami v2.8, use Alauda Build of Hami-WebUI v1.10.0.
  1. Go to the Administrator -> Marketplace -> Cluster Plugin page, switch to the target cluster, and then deploy the Alauda Build of Hami-WebUI Cluster plugin. Fill in the Prometheus address and Prometheus authentication; enabling NodePort access is recommended. The Prometheus address and auth string can be retrieved with the following script:
    #!/bin/bash

    # Prefer the in-cluster service address; fall back to the external address.
    addr=$(kubectl get feature monitoring -o jsonpath='{.spec.accessInfo.database.service}')
    if [ -z "$addr" ]; then
      addr=$(kubectl get feature monitoring -o jsonpath='{.spec.accessInfo.database.address}')
    fi
    echo "Prometheus Address: $addr"

    # Build the Basic auth header from the monitoring secret.
    secret_name=$(kubectl get feature monitoring -o jsonpath='{.spec.accessInfo.database.basicAuth.secretName}')
    namespace="cpaas-system"

    username=$(kubectl get secret "$secret_name" -n "$namespace" -o jsonpath='{.data.username}' | base64 -d)
    password=$(kubectl get secret "$secret_name" -n "$namespace" -o jsonpath='{.data.password}' | base64 -d)

    auth="Basic $(echo -n "$username:$password" | base64 -w 0)"
    echo "Prometheus Auth   : $auth"
  2. Verify the result. The plugin shows a status of "Installed" in the UI, or you can check the pod status:
    kubectl get pods -n cpaas-system | grep "hami-webui"

Verification

This section describes how to verify that the installed Alauda Build of Hami and the related monitoring components are working.

Verify Hami

  1. Check whether there are allocatable GPU resources on the GPU node. Run the following command on a control node of the business cluster:
    kubectl get node ${nodeName} -o=jsonpath='{.status.allocatable}'
    # The output contains: "nvidia.com/gpualloc":"10" (the specific value depends on the number of GPU cards and installation parameters)
  2. Deploy a GPU demo instance and check whether there is any GPU-related resource consumption; a minimal demo Pod sketch follows the command below. Run the following command on the GPU node of the business cluster:
    nvidia-smi pmon -s u -d 1
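
    A minimal demo Pod sketch, assuming the public CUDA vectorAdd sample image; substitute any CUDA workload image available in your registry:

    kubectl apply -f - <<EOF
    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-demo
    spec:
      restartPolicy: OnFailure
      containers:
        - name: cuda-demo
          image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04  # assumed sample image
          resources:
            limits:
              nvidia.com/gpualloc: 1     # 1 physical GPU (required)
              nvidia.com/gpucores: "50"  # optional: 50% of one GPU's compute
              nvidia.com/gpumem: 4000    # optional: 4000 MB of video memory
    EOF

    The vectorAdd sample exits quickly, so run the pmon command while the pod is executing to catch its sm/mem activity.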

If both sm and mem contain data, the GPU is ready, and you can start developing GPU applications on the GPU node. Note: When deploying GPU applications, be sure to configure the resource parameters shown below (nvidia.com/gpualloc is required; gpucores and gpumem are optional):

spec:
  containers:
    - image: your-image
      imagePullPolicy: IfNotPresent
      name: gpu
      resources:
        limits:
          cpu: '2'
          memory: 4Gi
          nvidia.com/gpualloc: 1     # Request 1 physical GPU (required)
          nvidia.com/gpucores: "50"  # Request 50% of the compute resources per GPU (optional)
          nvidia.com/gpumem: 8000    # Request 8000MB of video memory per GPU (optional)
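
Once the pod is scheduled, the allocated amounts also show up in the node's resource accounting; a quick check using standard kubectl:

kubectl describe node ${nodeName} | grep -A 6 "Allocated resources"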

Verify MonitorDashboard

After the HAMi vgpu service has been running for a while, navigate to Administrator -> Operations Center -> Monitor -> Dashboards page and switch to the HAMi GPU Monitoring panel under Hami. You will see the relevant chart data.

Verify Hami-WebUI

After the HAMi-WebUI components have been running for a while, access http://{business cluster node IP}:{NodePort} in your browser.