Refer to the installation guide on the NVIDIA official website.
Refer to the installation guide for the NVIDIA Container Toolkit.
Note: Make sure the GPU node can access nvidia.github.io
When the message "Metadata cache created." appears, the repository has been added successfully.
When the message "Complete!" appears, the installation has succeeded.
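As a reference only, on an RPM-based GPU node the toolkit is typically installed with commands like the sketch below; the repository URL and package manager follow NVIDIA's public instructions and may differ for your OS, so always follow the official guide.

```bash
# Add the NVIDIA Container Toolkit repository (RPM-based distributions).
curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | \
  sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo

# Install the toolkit; yum/dnf prints "Metadata cache created." after the repository
# is added and "Complete!" when the installation succeeds.
sudo yum install -y nvidia-container-toolkit
```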
On GPU nodes that have nvidia-container-toolkit installed and that need to use the current plugin, you need to configure the default container runtime.
Containerd
Edit the /etc/containerd/config.toml file, check whether the nvidia runtime exists, and then update default_runtime_name to nvidia.
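For reference, the relevant part of /etc/containerd/config.toml usually looks like the sketch below; the section names follow containerd's CRI plugin, and the runtime binary path is the common install location and may differ on your node. Restart containerd after editing (for example, systemctl restart containerd).

```toml
[plugins."io.containerd.grpc.v1.cri".containerd]
  # Use the nvidia runtime by default for all containers.
  default_runtime_name = "nvidia"

  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
    runtime_type = "io.containerd.runc.v2"

    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
      BinaryName = "/usr/bin/nvidia-container-runtime"
```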
Docker
Add the following configuration to the /etc/docker/daemon.json file:
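A typical /etc/docker/daemon.json for this purpose is sketched below; the path is the usual install location of nvidia-container-runtime and may differ on your node. Restart Docker afterwards (for example, systemctl restart docker).

```json
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
```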
The Alauda Build of NVIDIA GPU Device Plugin cluster plugin can be retrieved from the Customer Portal. Please contact Customer Support for more information.
For more information on uploading the cluster plugin, please refer to
Add the label nvidia-device-enable=pgpu to your GPU nodes so the nvidia-device-plugin can be scheduled onto them.
Note: The same node cannot have both the gpu=on and nvidia-device-enable=pgpu labels at the same time.
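For example, assuming a node named gpu-node-1 (a placeholder), the label can be applied with kubectl:

```bash
# Label the GPU node so the device plugin is scheduled onto it.
kubectl label node gpu-node-1 nvidia-device-enable=pgpu
```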
Go to the Administrator -> Marketplace -> Cluster Plugin page, switch to the target cluster, and then deploy the Alauda Build of NVIDIA GPU Device Plugin cluster plugin.
Note: The deploy form parameters can be kept at their defaults, or modified once you understand how they are used.
Verify the result. You should see the status "Installed" in the UI, or you can check the pod status:
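For example, the plugin pods can be listed with kubectl; the pod name prefix shown here is an assumption and may differ in your installation:

```bash
# Look for the device plugin pods and confirm they are Running.
kubectl get pods -A | grep nvidia-device-plugin
```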
Finally, when creating an application in ACP, the GPU appears under Extended Resources in the resources section, and you can select GPU core.
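As a sketch, a workload can request a whole GPU through the extended resource advertised by the device plugin. The resource name nvidia.com/gpu and the image below are assumptions based on the upstream NVIDIA device plugin; the option shown as GPU core in the ACP form maps to such an extended resource.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test                                    # hypothetical example pod
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.2.0-base-ubuntu22.04    # any CUDA-capable image
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1                         # request one whole GPU (extended resource)
```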
Go to the Administrator -> Marketplace -> Cluster Plugin page, switch to the target cluster, and then deploy the Alauda Build of DCGM-Exporter cluster plugin:
Set the node labels in the popup form. If you need to enable dcgm-exporter for HAMi, you can add additional labels: