Install for NVIDIA GPU
This chapter covers the end-to-end installation steps for clusters with NVIDIA GPUs. For Huawei Ascend NPUs, see Install for Huawei Ascend NPU.
Prerequisites
- Cluster administrator access to your ACP cluster
- Kubernetes version: v1.16+
- CUDA version: v10.2+
- NVIDIA driver: v440+ for HAMi and v450+ for DCGM-Exporter
- ACP version: v4.0+
Procedure
Installing the NVIDIA driver on your GPU node
Refer to the NVIDIA official installation guide.
Installing the NVIDIA Container Runtime
Refer to the NVIDIA Container Toolkit installation guide.
Add the NVIDIA yum repository on the GPU node
Note: Make sure the GPU node can access nvidia.github.io.
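A typical sequence on an RPM-based GPU node (these commands follow the NVIDIA Container Toolkit documentation; adjust them to your distribution):

```bash
# Add the NVIDIA Container Toolkit yum repository
curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | \
  sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo

# Rebuild the repository metadata cache
sudo yum makecache
```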
When the message "Metadata cache created." appears, it indicates that the addition was successful.
Installing NVIDIA Container Runtime
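Install the toolkit from the repository added above; a minimal example (the package ships the NVIDIA container runtime):

```bash
# Install the NVIDIA Container Toolkit, which provides the NVIDIA container runtime
sudo yum install -y nvidia-container-toolkit
```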
When the prompt "Complete!" appears, it means the installation is successful.
Configure containerd to use the NVIDIA runtime and restart it:
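A common way to do this is with the nvidia-ctk helper shipped with the toolkit (shown as a sketch; you can also edit /etc/containerd/config.toml manually):

```bash
# Register the NVIDIA runtime in the containerd configuration
sudo nvidia-ctk runtime configure --runtime=containerd

# Restart containerd so the new runtime takes effect
sudo systemctl restart containerd
```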
Downloading Cluster plugin
The Alauda Build of Hami, Alauda Build of DCGM-Exporter, and (optionally) Alauda Build of Hami-WebUI cluster plugins can be retrieved from the Customer Portal.
Please contact Customer Support for more information.
Note: Alauda Build of DCGM-Exporter v4.2.3-413 deployed in the global cluster may cause the component to be continuously reinstalled. Version v4.2.3-413-1 resolves this issue, so be sure to use that version.
Uploading the Cluster plugin
For more information on uploading the cluster plugin, please refer to Uploading Cluster Plugins.
Installing Alauda Build of Hami
- Add the label "gpu=on" to every NVIDIA GPU node so that hami-device-plugin (NVIDIA) only runs there (see the first example after this list).
  TIP: This label is for NVIDIA nodes only; Ascend nodes use the "ascend=on" label instead. See Install for Huawei Ascend NPU.
- Go to the Administrator -> Marketplace -> Cluster Plugin page, switch to the target cluster, and then deploy the Alauda Build of Hami cluster plugin. Keep Enable NVIDIA on in the deploy form. If the cluster does not contain any Huawei Ascend NPU nodes, leave Enable Ascend off. Other parameters can be kept at their defaults or modified once you are familiar with them.
  TIP: Enable NVIDIA and Enable Ascend are independent. You can turn either of them off, but keep at least one device type enabled.
- Verify the result. You can see the status "Installed" in the UI, or you can check the pod status (see the second example after this list).
- Create ConfigMaps that define extended resources, which can be used to set extended resources on ACP. Run the script in your GPU cluster.
After this, Hami appears in the extended resource type drop-down on the resource configuration page when creating an application in the ACP business view, and you can start using it.
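For the node-label step above, a minimal example (replace the node name with your own):

```bash
# Label each NVIDIA GPU node so the HAMi device plugin is scheduled onto it
kubectl label node <gpu-node-name> gpu=on
```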
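For the verification step above, a sketch of a pod-status check, assuming the HAMi components are deployed into the kube-system namespace (adjust the namespace if your plugin installs elsewhere):

```bash
# The HAMi device plugin and scheduler pods should all be Running
kubectl get pods -n kube-system | grep -i hami
```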
Installing Alauda Build of DCGM-Exporter
- Go to the Administrator -> Marketplace -> Cluster Plugin page, switch to the target cluster, and then deploy the Alauda Build of DCGM-Exporter cluster plugin. Set the node labels in the popup form:
  - Node Label Key: gpu
  - Node Label Value: on
  If you need to enable dcgm-exporter for pgpu, add the following label:
  - Node Label Key: nvidia-device-enable
  - Node Label Value: pgpu
- Verify the result. You can see the status "Installed" in the UI, or you can check the pod status:
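A sketch of such a check, assuming the exporter pods carry "dcgm" in their names (adjust the filter and namespace to your deployment):

```bash
# One dcgm-exporter pod should be Running on every labeled GPU node
kubectl get pods -A -o wide | grep -i dcgm
```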
Installing Monitor
You can use the ACP MonitorDashboard or the Alauda Build of Hami-WebUI.
Installing ACP MonitorDashboard (optional)
Create the ACP MonitorDashboard resource for the HAMi GPU monitor in the ACP dashboard.
Save the hami-vgpu-metrics-dashboard-v1.0.2.yaml file to the business cluster and execute the command: kubectl apply -f hami-vgpu-metrics-dashboard-v1.0.2.yaml
Installing Alauda Build of Hami-WebUI (optional)
Alauda Build of Hami-WebUI version compatibility:
- v1.10.0 is compatible with HAMi v2.7 and v2.8.
- v1.5.0 is not compatible with HAMi v2.8.
- When deploying HAMi v2.8, use Alauda Build of Hami-WebUI v1.10.0.
- Go to the Administrator -> Marketplace -> Cluster Plugin page, switch to the target cluster, and then deploy the Alauda Build of Hami-WebUI cluster plugin. Fill in the Prometheus address and Prometheus authentication; it is recommended to enable NodePort access. The Prometheus address and auth string can be retrieved with a script (see the sketch after this list).
- Verify the result. You can see the status "Installed" in the UI, or you can check the pod status.
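The retrieval script is environment specific; as a rough sketch, you can locate the platform Prometheus endpoint like this (the filter is an assumption; adjust it to your monitoring stack, and obtain the authentication credentials from wherever your Prometheus basic-auth secret is kept):

```bash
# Find the Prometheus service and its port across all namespaces
kubectl get svc -A | grep -i prometheus

# The address to fill into the deploy form typically has the form:
#   http://<service-name>.<namespace>.svc:<port>
```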
Verification
This section describes how to verify that the installed Alauda Build of Hami and the related monitoring components are working.
Verify Hami
- On a control node of the business cluster, check whether the GPU node reports allocatable GPU resources (see the first example below).
- Deploy a GPU demo instance and check whether there is any GPU-related resource consumption. Run the check on the GPU node of the business cluster (see the second example below).
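For the first check, a sketch (replace the node name; the exact GPU resource names depend on your HAMi configuration):

```bash
# List the allocatable resources reported by the GPU node; GPU entries should appear
kubectl describe node <gpu-node-name> | grep -A 10 "Allocatable"
```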
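For the second check, a sketch to run directly on the GPU node; the sm and mem columns mentioned below are the per-device utilization columns printed by nvidia-smi dmon (assuming the NVIDIA driver utilities are installed on the node):

```bash
# Sample per-GPU utilization a few times; sm and mem show compute and memory activity
nvidia-smi dmon -c 5
```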
If both the sm and mem columns contain data, the GPU is ready, and you can start developing GPU applications on the GPU node. Note: when deploying GPU applications, be sure to configure the mandatory GPU resource parameters.
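As a reference only (the exact mandatory parameters are defined by your plugin configuration), a pod that requests HAMi vGPU resources typically sets limits such as the following, using the upstream HAMi resource names:

```bash
# Deploy a minimal GPU demo pod with HAMi vGPU resource limits (illustrative values)
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-demo
spec:
  containers:
    - name: cuda
      image: nvidia/cuda:12.2.0-base-ubuntu22.04
      command: ["sleep", "infinity"]
      resources:
        limits:
          nvidia.com/gpu: 1        # number of vGPUs requested
          nvidia.com/gpumem: 2000  # vGPU memory in MiB (optional)
          nvidia.com/gpucores: 30  # percentage of SM cores (optional)
EOF
```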
Verify MonitorDashboard
After the HAMi vgpu service has been running for a while, navigate to Administrator -> Operations Center -> Monitor -> Dashboards page and switch to the HAMi GPU Monitoring panel under Hami.
You will see the relevant chart data.
Verify Hami-WebUI
After HAMi-WebUI components have been running for a while, access http://{business cluster node IP}:NodePort in your browser.
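If you did not note the NodePort during deployment, a sketch for looking it up (the service name filter and namespace are assumptions; adjust them to match where the Hami-WebUI plugin was installed):

```bash
# Find the Hami-WebUI service and read the NodePort from the PORT(S) column
kubectl get svc -A | grep -i webui
```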