Installation

Prerequisites

  • Cluster administrator access to your ACP cluster
  • NVIDIA driver: v450+
  • ACP version: v3.18, v4.0, v4.1

Procedure

Installing the NVIDIA driver on your GPU node

Refer to the installation guide on the official NVIDIA website.
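
Once the driver is installed, you can confirm that the node sees its GPUs:

    # Should print the driver version and a table listing every GPU on the node
    nvidia-smi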

Installing the NVIDIA Container Runtime

Refer to the installation guide of the NVIDIA Container Toolkit.

Add the NVIDIA yum repository on the GPU node

Note: Make sure the GPU node can access nvidia.github.io

distribution=$(. /etc/os-release;echo $ID$VERSION_ID) && curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.repo | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
yum makecache -y

When the message "Metadata cache created." appears, the repository has been added successfully.
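
You can also double-check that the repository is registered:

    # The NVIDIA container toolkit repository should appear in the enabled list
    yum repolist enabled | grep -i nvidia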

Install the NVIDIA Container Toolkit

yum install nvidia-container-toolkit -y

When the prompt "Complete!" appears, the installation has succeeded.
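
To double-check the installed package and its version:

    # Shows the installed nvidia-container-toolkit package
    rpm -q nvidia-container-toolkit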

Configuring the default Container Runtime

On GPU nodes that have nvidia-container-toolkit installed and will use this plugin, you need to configure the default container runtime.

Add the following configuration, depending on the container runtime in use:

  1. Containerd: update the /etc/containerd/config.toml file. Check that the nvidia runtime exists, and then set default_runtime_name to nvidia.
    [plugins]
     [plugins."io.containerd.grpc.v1.cri"]
       [plugins."io.containerd.grpc.v1.cri".containerd]
    ...
          default_runtime_name = "nvidia"
    ...
            [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
    ...
              [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
                runtime_type = "io.containerd.runc.v2"
                runtime_engine = ""
                runtime_root = ""
                privileged_without_host_devices = false
                base_runtime_spec = ""
                [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
                  SystemdCgroup = true
              [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
                privileged_without_host_devices = false
                runtime_engine = ""
                runtime_root = ""
                runtime_type = "io.containerd.runc.v1"
                [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
                  BinaryName = "/usr/bin/nvidia-container-runtime"
                  SystemdCgroup = true
    ...
  2. Docker: update the /etc/docker/daemon.json file:
    {
    ...
        "default-runtime": "nvidia",
        "runtimes": {
            "nvidia": {
                "path": "/usr/bin/nvidia-container-runtime"
            }
        },
    ...
    }
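
Alternatively, recent versions of the NVIDIA Container Toolkit ship the nvidia-ctk helper, which can write this configuration for you instead of hand-editing the files. A minimal sketch, assuming a toolkit version that includes nvidia-ctk:

    # Register the nvidia runtime in containerd's config and make it the default
    sudo nvidia-ctk runtime configure --runtime=containerd --set-as-default

    # Or, for Docker, update /etc/docker/daemon.json the same way
    sudo nvidia-ctk runtime configure --runtime=docker --set-as-default

Either way, restart the container runtime afterwards as described below.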

Restarting Containerd / Docker

  • Containerd

    $ systemctl restart containerd
    
    # Check Default Runtime is nvidia
    $ crictl info | grep Runtime
    ...
          "defaultRuntimeName": "nvidia"
    ...
  • Docker

    $ systemctl restart docker
    
    # Check Default Runtime is nvidia
    $ docker info | grep Runtime
    ...
     Default Runtime: nvidia
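
Optionally, you can sanity-check the whole stack by running a CUDA container. A minimal sketch assuming Docker and the public nvidia/cuda image (pick a tag compatible with your driver version):

    # Should print the same GPU table inside the container as on the host
    docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi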

Downloading the Cluster plugin

Note: The Alauda Build of NVIDIA GPU Device Plugin cluster plugin can be retrieved from the Customer Portal. Please contact Customer Support for more information.

Uploading the Cluster plugin

For more information on uploading the cluster plugin, please refer to the documentation on uploading cluster plugins.

Installing Alauda Build of NVIDIA GPU Device Plugin

  1. Add the label nvidia-device-enable=pgpu to your GPU node so that nvidia-device-plugin can be scheduled onto it.

    kubectl label nodes {nodeid} nvidia-device-enable=pgpu

    Note: The same node cannot have both the gpu=on and nvidia-device-enable=pgpu labels at the same time

  2. Go to the Administrator -> Marketplace -> Cluster Plugin page, switch to the target cluster, and then deploy the Alauda Build of NVIDIA GPU Device Plugin cluster plugin. Note: the deploy form parameters can be kept at their defaults, or modified once you understand their usage.

  3. Verify the result. The plugin shows a status of "Installed" in the UI, or you can check the pod status:

    kubectl get pods -n kube-system | grep "nvidia-device-plugin"
  4. Finally, when creating an application in ACP, the GPU appears under Extended Resources in the resources form, where you can select GPU cores.
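
Beyond the UI, you can also confirm from the command line that the labeled nodes now advertise GPUs, assuming the plugin registers the standard nvidia.com/gpu extended resource:

    # List the nodes carrying the scheduling label
    kubectl get nodes -l nvidia-device-enable=pgpu

    # The Capacity/Allocatable sections should now list nvidia.com/gpu
    kubectl describe node {nodeid} | grep nvidia.com/gpu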

Installing Alauda Build of DCGM-Exporter

  1. Go to the Administrator -> Marketplace -> Cluster Plugin page, switch to the target cluster, and then deploy the Alauda Build of DCGM-Exporter cluster plugin. Set the node labels in the popup form:
  • Node Label Key: nvidia-device-enable
  • Node Label Value: pgpu

If you need to enable dcgm-exporter for HAMi, you can add another pair of labels:

  • Node Label Key: gpu
  • Node Label Value: on
  2. Verify the result. The plugin shows a status of "Installed" in the UI, or you can check the pod status:

    kubectl get pods -n kube-system | grep dcgm-exporter
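
To confirm that metrics are actually flowing, you can scrape an exporter pod directly. A minimal sketch, assuming the default dcgm-exporter metrics port of 9400 (substitute a pod name from the command above):

    # Forward the metrics port and fetch a well-known DCGM metric
    kubectl port-forward -n kube-system pod/{dcgm-exporter-pod} 9400:9400 &
    curl -s http://localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL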