# Installation
## Prerequisites

- Cluster administrator access to your ACP cluster
- Kubernetes version: v1.16+
- CUDA version: v10.2+
- NVIDIA driver version: v440+ for HAMi and v450+ for DCGM-Exporter
- ACP version: v3.18.2, v4.0, or v4.1
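On a GPU node where the driver is already installed, you can confirm that the driver version meets the minimums above with `nvidia-smi`:

```shell
# Print the installed NVIDIA driver version for each GPU.
# The full `nvidia-smi` header also shows the highest CUDA version the driver supports.
nvidia-smi --query-gpu=driver_version --format=csv,noheader
```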
## Procedure
### Installing the NVIDIA driver on your GPU node

Refer to the installation guide on the NVIDIA official website.
### Installing the NVIDIA Container Runtime

Refer to the installation guide of the NVIDIA Container Toolkit.

#### Adding the NVIDIA yum repository on the GPU node
Note: Make sure the GPU node can access nvidia.github.io
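The repository can be added with commands along these lines, following NVIDIA's Container Toolkit installation instructions (verify the repository URL against the current NVIDIA documentation):

```shell
# Add the NVIDIA Container Toolkit yum repository
curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | \
  sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo

# Rebuild the yum metadata cache
sudo yum makecache
```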
When the message "Metadata cache created." appears, it indicates that the addition was successful.
#### Installing the NVIDIA Container Toolkit packages
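A sketch of the package install and runtime configuration, again following NVIDIA's documented steps (the `nvidia-ctk` step assumes containerd as the container runtime; adjust for your environment):

```shell
# Install the NVIDIA Container Toolkit from the repository added above
sudo yum install -y nvidia-container-toolkit

# Configure the container runtime (containerd assumed here) to use the NVIDIA runtime
sudo nvidia-ctk runtime configure --runtime=containerd
sudo systemctl restart containerd
```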
When the prompt "Complete!" appears, it means the installation is successful.
### Downloading the cluster plugins

The Alauda Build of Hami, Alauda Build of DCGM-Exporter, and (optionally) Alauda Build of Hami-WebUI cluster plugins can be retrieved from the Customer Portal.
Please contact Customer Support for more information.

Note: Deploying Alauda Build of DCGM-Exporter version v4.2.3-413 in the global cluster may cause the component to be reinstalled continuously. Version v4.2.3-413-1 resolves this issue, so be sure to use that version.
### Uploading the cluster plugins

For more information on uploading a cluster plugin, please refer to the related documentation.
### Installing Alauda Build of Hami

1. Add the label `gpu=on` to your GPU node so that HAMi can schedule workloads to it.

2. Go to the **Administrator** > **Marketplace** > **Cluster Plugin** page, switch to the target cluster, and deploy the **Alauda Build of Hami** cluster plugin.

   Note: The deploy form parameters can be kept at their defaults, or modified once you understand how to use them.

3. Verify the result. You should see the status "Installed" in the UI, or you can check the pod status.

4. Create ConfigMaps that define extended resources, which can be used to set extended resources on the ACP. Run the following script in your GPU cluster:

After that, HAMi appears in the extended resource type drop-down box on the resource configuration page when creating an application in the ACP business view, and you can start using it.
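Steps 1 and 3 above can also be performed from the command line; a minimal sketch (the node name is a placeholder, and the `hami-system` namespace is an assumption — check which namespace the plugin actually installs into):

```shell
# Step 1: label the GPU node so the HAMi scheduler can place workloads on it
kubectl label node <your-gpu-node> gpu=on

# Step 3: verify that the HAMi pods are Running
# (namespace is an assumption; list all namespaces with -A if unsure)
kubectl get pods -n hami-system
```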
### Installing Alauda Build of DCGM-Exporter

1. Go to the **Administrator** > **Marketplace** > **Cluster Plugin** page, switch to the target cluster, and deploy the **Alauda Build of DCGM-Exporter** cluster plugin. Set the node labels in the popup form:

   - Node Label Key: gpu
   - Node Label Value: on

   If you need to enable DCGM-Exporter for pGPU, add another label:

   - Node Label Key: nvidia-device-enable
   - Node Label Value: pgpu

2. Verify the result. You should see the status "Installed" in the UI, or you can check the pod status.
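The pod check in step 2 might look like the following (the label selector is an assumption based on common dcgm-exporter deployments; adjust to the labels the plugin actually uses):

```shell
# Verify the DCGM-Exporter pods are Running on the labeled GPU nodes
kubectl get pods -A -l app.kubernetes.io/name=dcgm-exporter -o wide
```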
### Installing monitoring

You can use the ACP MonitorDashboard or the Alauda Build of Hami-WebUI.

#### Installing the ACP MonitorDashboard (optional)
Create the ACP MonitorDashboard resource for the HAMi GPU monitor in the ACP dashboard.

Save the hami-vgpu-metrics-dashboard-v1.0.2.yaml file to the business cluster and execute the command `kubectl apply -f hami-vgpu-metrics-dashboard-v1.0.2.yaml`.
#### Installing Alauda Build of Hami-WebUI (optional)

1. Go to the **Administrator** > **Marketplace** > **Cluster Plugin** page, switch to the target cluster, and deploy the **Alauda Build of Hami-WebUI** cluster plugin. Fill in the Prometheus address and Prometheus authentication; enabling NodePort access is recommended. The Prometheus address and authentication information can be retrieved with the following scripts:

2. Verify the result. You should see the status "Installed" in the UI, or you can check the pod status.