Configure hardware accelerators on GPU nodes

As the volume of business data grows, especially in scenarios such as artificial intelligence and data analysis, you may want to use GPUs in your self-built business cluster to accelerate data processing. Besides provisioning GPU hardware on the cluster nodes, you also need to configure the GPUs as described in this section.

In this solution, cluster nodes with GPU computing capabilities are referred to as GPU nodes.

Note: Unless otherwise specified, the following steps apply to both types of nodes. For issues related to driver installation, refer to the official NVIDIA installation documentation.

Prerequisites

GPU hardware has been prepared on the node you are operating on; that node is a GPU node as defined in this section.

Install GPU driver

Notice: If the GPU node will use the NVIDIA MPS plugin, make sure the node's GPU architecture is Volta or newer (Volta, Turing, Ampere, Hopper, etc.) and that the installed driver supports CUDA 11.5 or higher.

Get the driver download address

  1. Log in to the GPU node and run the command lspci | grep -i NVIDIA to check the GPU model of the node.

    In the following example, the GPU model is Tesla T4.

    lspci | grep NVIDIA
    00:08.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
  2. Go to the NVIDIA official website to obtain the driver download link.

    1. Click on Drivers in the top navigation bar on the homepage.

    2. Fill in the required driver download information according to the GPU model of the node.

    3. Click on Search.

    4. Click on Download.

    5. Right-click on Download > Copy Link Address to copy the download link of the driver.

  3. Run the following commands on the GPU node, in order, to create the /home/gpu directory and download the driver file into it.

    # Create the /home/gpu directory
    mkdir -p /home/gpu
    cd /home/gpu/

    # Download the driver file to the /home/gpu directory, for example:
    # wget https://cn.download.nvidia.com/tesla/515.65.01/NVIDIA-Linux-x86_64-515.65.01.run
    wget <Driver download address>

    # Verify that the driver file was downloaded successfully. If the driver file name
    # (for example NVIDIA-Linux-x86_64-515.65.01.run) is returned, the download succeeded.
    ls <Driver file name>

Install the driver

  1. Execute the following command on the GPU node to install dkms, gcc, and the kernel-devel package matching the currently running kernel.

    sudo yum install dkms gcc kernel-devel-$(uname -r) -y
  2. Execute the following commands in order to install the GPU driver.

    chmod a+x /home/gpu/<Driver file name>
    /home/gpu/<Driver file name> --dkms
  3. After the installation finishes, run the nvidia-smi command. If GPU information similar to the following example is returned, the driver was installed successfully.

    # nvidia-smi
    Tue Sep 13 01:31:33 2022
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  Tesla T4            Off  | 00000000:00:08.0 Off |                    0 |
    | N/A   55C    P0    28W /  70W |      2MiB / 15360MiB |      5%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+

    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+

Install the NVIDIA Container Runtime

  1. On the GPU Node, add the NVIDIA yum repository.

    distribution=$(. /etc/os-release;echo $ID$VERSION_ID) && curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.repo | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
    yum makecache -y
    

    When the prompt "Metadata cache created." appears, it indicates that the addition is successful.

  2. Install NVIDIA Container Runtime.

    yum install nvidia-container-toolkit -y
    

    When the message Complete! appears, the installation succeeded.

  3. Configure the default runtime by adding the following configuration to the file for your container runtime.

    • Containerd: Modify the /etc/containerd/config.toml file.

      [plugins]
        [plugins."io.containerd.grpc.v1.cri"]
          [plugins."io.containerd.grpc.v1.cri".containerd]
            ...
            default_runtime_name = "nvidia"
            ...
            [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
              ...
              [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
                runtime_type = "io.containerd.runc.v2"
                runtime_engine = ""
                runtime_root = ""
                privileged_without_host_devices = false
                base_runtime_spec = ""
                [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
                  SystemdCgroup = true
              [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
                privileged_without_host_devices = false
                runtime_engine = ""
                runtime_root = ""
                runtime_type = "io.containerd.runc.v1"
                [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
                  BinaryName = "/usr/bin/nvidia-container-runtime"
                  SystemdCgroup = true
      ...
    • Docker: Modify the /etc/docker/daemon.json file.

      { ... "default-runtime": "nvidia", "runtimes": { "nvidia": { "path": "/usr/bin/nvidia-container-runtime" } }, ... }
  4. Restart Containerd / Docker.

    • Containerd

      systemctl restart containerd   # Restart
      crictl info | grep Runtime     # Check
    • Docker

      systemctl restart docker       # Restart
      docker info | grep Runtime     # Check
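
When the default runtime is configured correctly, the check commands should report nvidia as the default runtime. The exact wording varies with the runtime version; the lines below are hedged examples of what to look for rather than verbatim output.

  # Containerd: crictl info prints the CRI configuration as JSON
  "defaultRuntimeName": "nvidia"

  # Docker
  Default Runtime: nvidia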

Physical GPU configuration

Deploy physical GPU plugin on a GPU Business Cluster

On the management interface of the GPU cluster, perform the following actions:

  1. In the left sidebar, go to Catalog > Cluster Plugins, deploy the "ACP GPU Device Plugin", and enable the "pGPU" option.

  2. On the "Nodes" tab, select the nodes on which the physical GPU plugin should be deployed, click "Label and Taint Manager", add a device label with the value "pGPU", and click OK.

  3. On the "Pods" tab, check the running status of the Pods belonging to nvidia-device-plugin-ds, confirm there are no abnormalities, and verify that they are running on the selected nodes.

NVIDIA MPS configuration (the driver must support CUDA 11.5 or higher)

Deploy NVIDIA MPS plugin on a GPU Business Cluster

On the management interface of the GPU cluster, perform the following actions:

  1. In the left sidebar, go to Catalog > Cluster Plugins, deploy the "ACP GPU Device Plugin", and enable the "MPS" option.

  2. On the "Nodes" tab, select the nodes on which the MPS plugin should be deployed, click "Label and Taint Manager", add a device label with the value "MPS", and click OK.

  3. On the "Pods" tab, check the running status of the Pods belonging to nvidia-mps-device-plugin-daemonset, confirm there are no abnormalities, and verify that they are running on the selected nodes.

Configure kube-scheduler (Kubernetes >= 1.23)

  1. On the control plane node of the business cluster, check whether the scheduler correctly references the scheduling policy file.

    cat /etc/kubernetes/manifests/kube-scheduler.yaml

    Check that the kube-scheduler command includes the --config option with the value /etc/kubernetes/scheduler-config.yaml, for example:

    apiVersion: v1
    kind: Pod
    metadata:
      creationTimestamp: null
      labels:
        component: kube-scheduler
        tier: control-plane
      name: kube-scheduler
      namespace: kube-system
    spec:
      containers:
      - command:
        - kube-scheduler
        - --config=/etc/kubernetes/scheduler-config.yaml
    

Note: The parameters and values above are the platform's default configuration. If you have modified them, change them back to the defaults; your original custom configuration can be copied into the scheduling policy file.
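
One quick way to perform the check in step 1, assuming the default manifest path, is a simple grep:

  grep -- '--config' /etc/kubernetes/manifests/kube-scheduler.yaml
  # Expected: - --config=/etc/kubernetes/scheduler-config.yaml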

  2. Check the configuration of the scheduling policy file.

    1. Execute the command: kubectl describe service kubernetes -n default | grep Endpoints.

      Expected output: Endpoints: 192.168.130.240:6443
    2. Replace the contents of the /etc/kubernetes/scheduler-config.yaml file on all Master nodes with the following content, where ${kube-apiserver} should be replaced with the endpoint obtained in the previous step (for example, 192.168.130.240:6443).

      apiVersion: kubescheduler.config.k8s.io/v1beta2
      kind: KubeSchedulerConfiguration
      clientConnection:
        kubeconfig: /etc/kubernetes/scheduler.conf
      extenders:
      - enableHTTPS: true
        filterVerb: predicates
        managedResources:
        - ignoredByScheduler: false
          name: nvidia.com/mps-core
        nodeCacheCapable: false
        urlPrefix: https://${kube-apiserver}/api/v1/namespaces/kube-system/services/nvidia-mps-scheduler-plugin/proxy/scheduler
        tlsConfig:
          insecure: false
          certFile: /etc/kubernetes/pki/apiserver-kubelet-client.crt
          keyFile: /etc/kubernetes/pki/apiserver-kubelet-client.key
          caFile: /etc/kubernetes/pki/ca.crt

      If scheduler-config.yaml already contains an extenders section, append the following entry to the end of that section instead:

      - enableHTTPS: true
        filterVerb: predicates
        managedResources:
        - ignoredByScheduler: false
          name: nvidia.com/mps-core
        nodeCacheCapable: false
        urlPrefix: https://${kube-apiserver}/api/v1/namespaces/kube-system/services/nvidia-mps-scheduler-plugin/proxy/scheduler
        tlsConfig:
          insecure: false
          certFile: /etc/kubernetes/pki/apiserver-kubelet-client.crt
          keyFile: /etc/kubernetes/pki/apiserver-kubelet-client.key
          caFile: /etc/kubernetes/pki/ca.crt
  3. Run the following command to obtain the container ID:

    • Containerd: Execute crictl ps | grep kube-scheduler. The output is similar to the following; the first column is the container ID.

      1d113ccf1c1a9 03c72176d0f15 2 seconds ago Running kube-scheduler 3 ecd054bbdd465 kube-scheduler-192.168.176.47
    • Docker: Run docker ps | grep kube-scheduler. The output is similar to the following; the first column is the container ID.

      30528a45a118 d8a9fef7349c "kube-scheduler --au..." 37 minutes ago Up 37 minutes k8s_kube-scheduler_kube-scheduler-192.168.130.240_kube-system_3e9f7007b38f4deb6ffd1c7587621009_28
  4. Restart the Containerd/Docker container using the container ID obtained in the previous step.

    • Containerd

      crictl stop <Container ID>
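    • Docker (a hedged equivalent: restarting the container with the standard docker restart command causes the scheduler to reload its configuration)

      docker restart <Container ID>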
  5. Restart kubelet.

    systemctl restart kubelet
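
    After kubelet restarts, you can confirm that the kube-scheduler static Pod came back up; depending on the container runtime, either check should show a freshly started container:

    crictl ps | grep kube-scheduler    # Containerd
    docker ps | grep kube-scheduler    # Docker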

GPU-Manager configuration

Configure kube-scheduler (Kubernetes >= 1.23)

  1. On the control plane node of the business cluster, check whether the scheduler correctly references the scheduling policy file.

    cat /etc/kubernetes/manifests/kube-scheduler.yaml

    Check that the kube-scheduler command includes the --config option with the value /etc/kubernetes/scheduler-config.yaml, for example:

    apiVersion: v1
    kind: Pod
    metadata:
      creationTimestamp: null
      labels:
        component: kube-scheduler
        tier: control-plane
      name: kube-scheduler
      namespace: kube-system
    spec:
      containers:
      - command:
        - kube-scheduler
        - --config=/etc/kubernetes/scheduler-config.yaml
    

Note: The parameters and values above are the platform's default configuration. If you have modified them, change them back to the defaults; your original custom configuration can be copied into the scheduling policy file.

  2. Check the configuration of the scheduling policy file.

    1. Execute the command: kubectl describe service kubernetes -n default | grep Endpoints.

      Expected output: Endpoints: 192.168.130.240:6443
    2. Replace the contents of the /etc/kubernetes/scheduler-config.yaml file on all Master nodes with the following content, where ${kube-apiserver} should be replaced with the endpoint obtained in the previous step (for example, 192.168.130.240:6443).

      apiVersion: kubescheduler.config.k8s.io/v1beta2
      kind: KubeSchedulerConfiguration
      clientConnection:
        kubeconfig: /etc/kubernetes/scheduler.conf
      extenders:
      - enableHTTPS: true
        filterVerb: predicates
        managedResources:
        - ignoredByScheduler: false
          name: tencent.com/vcuda-core
        nodeCacheCapable: false
        urlPrefix: https://${kube-apiserver}/api/v1/namespaces/kube-system/services/gpu-quota-admission/proxy/scheduler
        tlsConfig:
          insecure: false
          certFile: /etc/kubernetes/pki/apiserver-kubelet-client.crt
          keyFile: /etc/kubernetes/pki/apiserver-kubelet-client.key
          caFile: /etc/kubernetes/pki/ca.crt
  3. Run the following command to obtain the container ID:

    • Containerd: Execute crictl ps | grep kube-scheduler. The output is similar to the following; the first column is the container ID.

      1d113ccf1c1a9 03c72176d0f15 2 seconds ago Running kube-scheduler 3 ecd054bbdd465 kube-scheduler-192.168.176.47
    • Docker: Run docker ps | grep kube-scheduler. The output is similar to the following; the first column is the container ID.

      30528a45a118 d8a9fef7349c "kube-scheduler --au..." 37 minutes ago Up 37 minutes k8s_kube-scheduler_kube-scheduler-192.168.130.240_kube-system_3e9f7007b38f4deb6ffd1c7587621009_28
  4. Restart the Containerd/Docker container using the container ID obtained in the previous step.

    • Containerd

      crictl stop <Container ID>
  5. Restart kubelet.

    systemctl restart kubelet

Deploy GPU Manager plugin on a GPU Business Cluster

On the management interface of the GPU cluster, perform the following actions:

  1. In the left sidebar, go to Catalog > Cluster Plugins, deploy the "ACP GPU Device Plugin", and enable the "GPU-Manager" option.

  2. On the "Nodes" tab, select the nodes on which GPU-Manager should be deployed, click "Label and Taint Manager", add a device label with the value "vGPU", and click OK.

  3. On the "Pods" tab, check the running status of the Pods belonging to gpu-manager-daemonset, confirm there are no abnormalities, and verify that they are running on the selected nodes.

Validation of results

Method 1: Check if there are available GPU resources on the GPU nodes by running the following command on the control node of the business cluster:

kubectl get node ${nodeName} -o=jsonpath='{.status.allocatable}'
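
As a hedged example, on a node where the physical GPU plugin is deployed the output would contain an entry similar to the following (the exact resource name depends on the plugin you deployed, for example nvidia.com/gpu, nvidia.com/mps-core, or tencent.com/vcuda-core):

{"cpu":"8","memory":"16110372Ki","nvidia.com/gpu":"1","pods":"110"}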

Method 2: Deploy a GPU application on the platform, specifying the required amount of GPU resources. After deployment, exec into the Pod and run the following command:

# nvidia-smi
Tue Sep 13 01:31:33 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:08.0 Off |                    0 |
| N/A   55C    P0    28W /  70W |      2MiB / 15360MiB |      5%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Check whether the correct GPU information is retrieved.