Installation

Prerequisites

  • Cluster administrator access to your ACP cluster
  • NVIDIA driver: v450+
  • ACP version: v3.18, v4.0, v4.1

Procedure

Installing the NVIDIA driver on your GPU node

Refer to the installation guide on the official NVIDIA website.
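
Once the driver is installed, you can confirm that the node sees its GPUs:

    # Should print the driver version and a table listing every GPU on the node
    nvidia-smi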

Installing the NVIDIA Container Runtime

Refer to the installation guide of the NVIDIA Container Toolkit.

Add the NVIDIA yum repository on the GPU node

Note: Make sure the GPU node can access nvidia.github.io

distribution=$(. /etc/os-release;echo $ID$VERSION_ID) && curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.repo | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
yum makecache -y

When the message "Metadata cache created." appears, the repository has been added successfully.
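
You can also double-check that the repository is registered:

    # The NVIDIA container toolkit repository should appear in the enabled list
    yum repolist enabled | grep -i nvidia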

Install the NVIDIA Container Toolkit

yum install nvidia-container-toolkit -y

When the prompt "Complete!" appears, the installation has succeeded.
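
To double-check the installed package and its version:

    # Shows the installed nvidia-container-toolkit package
    rpm -q nvidia-container-toolkit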

Configuring the default Container Runtime

On GPU nodes that have nvidia-container-toolkit installed and will use this plugin, you need to configure the default container runtime.

Add the following configuration, depending on the container runtime in use:

  1. Containerd: update the /etc/containerd/config.toml file. Check that the nvidia runtime exists, and then set default_runtime_name to nvidia.
    [plugins]
     [plugins."io.containerd.grpc.v1.cri"]
       [plugins."io.containerd.grpc.v1.cri".containerd]
    ...
          default_runtime_name = "nvidia"
    ...
            [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
    ...
              [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
                runtime_type = "io.containerd.runc.v2"
                runtime_engine = ""
                runtime_root = ""
                privileged_without_host_devices = false
                base_runtime_spec = ""
                [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
                  SystemdCgroup = true
              [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
                privileged_without_host_devices = false
                runtime_engine = ""
                runtime_root = ""
                runtime_type = "io.containerd.runc.v1"
                [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
                  BinaryName = "/usr/bin/nvidia-container-runtime"
                  SystemdCgroup = true
    ...
  2. Docker: update the /etc/docker/daemon.json file:
    {
    ...
        "default-runtime": "nvidia",
        "runtimes": {
            "nvidia": {
                "path": "/usr/bin/nvidia-container-runtime"
            }
        },
    ...
    }
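
Alternatively, recent versions of the NVIDIA Container Toolkit ship the nvidia-ctk helper, which can write this configuration for you instead of hand-editing the files. A minimal sketch, assuming a toolkit version that includes nvidia-ctk:

    # Register the nvidia runtime in containerd's config and make it the default
    sudo nvidia-ctk runtime configure --runtime=containerd --set-as-default

    # Or, for Docker, update /etc/docker/daemon.json the same way
    sudo nvidia-ctk runtime configure --runtime=docker --set-as-default

Either way, restart the container runtime afterwards as described below.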

Restarting Containerd / Docker

  • Containerd

    $ systemctl restart containerd
    
    # Check Default Runtime is nvidia
    $ crictl info | grep Runtime
    ...
          "defaultRuntimeName": "nvidia"
    ...
  • Docker

    $ systemctl restart docker
    
    # Check Default Runtime is nvidia
    $ docker info | grep Runtime
    ...
     Default Runtime: nvidia
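
Optionally, you can sanity-check the whole stack by running a CUDA container. A minimal sketch assuming Docker and the public nvidia/cuda image (pick a tag compatible with your driver version):

    # Should print the same GPU table inside the container as on the host
    docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi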

Downloading the Cluster plugin

Note: The Alauda Build of NVIDIA GPU Device Plugin cluster plugin can be retrieved from the Customer Portal. Please contact Customer Support for more information.

Uploading the Cluster plugin

For more information on uploading the cluster plugin, please refer to the documentation on uploading cluster plugins.

Installing Alauda Build of NVIDIA GPU Device Plugin

  1. Add the label nvidia-device-enable=pgpu to your GPU node so that nvidia-device-plugin can be scheduled onto it.

    kubectl label nodes {nodeid} nvidia-device-enable=pgpu

    Note: The same node cannot have both the gpu=on and nvidia-device-enable=pgpu labels at the same time

  2. Go to the Administrator -> Marketplace -> Cluster Plugin page, switch to the target cluster, and then deploy the Alauda Build of NVIDIA GPU Device Plugin cluster plugin. Note: the deploy form parameters can be kept at their defaults, or modified once you understand their usage.

  3. Verify the result. The plugin shows a status of "Installed" in the UI, or you can check the pod status:

    kubectl get pods -n kube-system | grep "nvidia-device-plugin"
  4. Finally, when creating an application in ACP, the GPU appears under Extended Resources in the resources form, where you can select GPU cores.
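
Beyond the UI, you can also confirm from the command line that the labeled nodes now advertise GPUs, assuming the plugin registers the standard nvidia.com/gpu extended resource:

    # List the nodes carrying the scheduling label
    kubectl get nodes -l nvidia-device-enable=pgpu

    # The Capacity/Allocatable sections should now list nvidia.com/gpu
    kubectl describe node {nodeid} | grep nvidia.com/gpu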

Installing Alauda Build of DCGM-Exporter

  1. Go to the Administrator -> Marketplace -> Cluster Plugin page, switch to the target cluster, and then deploy the Alauda Build of DCGM-Exporter cluster plugin. Set the node labels in the popup form:
  • Node Label Key: nvidia-device-enable
  • Node Label Value: pgpu

If you need to enable dcgm-exporter for HAMi, you can add another pair of labels:

  • Node Label Key: gpu
  • Node Label Value: on
  2. Verify the result. The plugin shows a status of "Installed" in the UI, or you can check the pod status:

    kubectl get pods -n kube-system | grep dcgm-exporter
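
To confirm that metrics are actually flowing, you can scrape an exporter pod directly. A minimal sketch, assuming the default dcgm-exporter metrics port of 9400 (substitute a pod name from the command above):

    # Forward the metrics port and fetch a well-known DCGM metric
    kubectl port-forward -n kube-system pod/{dcgm-exporter-pod} 9400:9400 &
    curl -s http://localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL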