Configure Hardware Accelerators on GPU Nodes
As business data grows, especially in scenarios such as artificial intelligence and data analysis, you may want to use GPUs in your self-built business cluster to accelerate data processing. Besides provisioning GPU hardware on the cluster nodes, you also need to perform the GPU configuration described below.
In this solution, cluster nodes with GPU computing capabilities are referred to as GPU nodes.
Note: Unless otherwise specified, the steps apply to both types of nodes. For driver installation issues, refer to NVIDIA's official installation documentation.
Prerequisites
GPU resources have been prepared on the node you are operating on; that node is one of the GPU nodes described in this section.
Install GPU driver
Notice: If the GPU node will use the NVIDIA MPS plugin, make sure the node's GPU architecture is Volta or newer (Volta/Turing/Ampere/Hopper, etc.) and that the driver supports CUDA 11.5 or higher.
Get the driver download address
- Log in to the GPU node and run `lspci | grep -i NVIDIA` to check the GPU model of the node. In the following example, the GPU model is Tesla T4.
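  An illustrative line of output (a sample only; the PCI bus address and revision will differ on your node):

  ```
  00:08.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
  ```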
- Go to the NVIDIA official website to obtain the driver download link:
  - Click Drivers in the top navigation bar of the homepage.
  - Fill in the information required for downloading the driver according to the GPU node's model.
  - Click Search.
  - Click Download.
  - Right-click Download > Copy Link Address to copy the driver download link.
- Execute the following commands on the GPU node, in order, to create the `/home/gpu` directory and download the driver file into it (a sketch is shown below).
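  A minimal sketch of this step; `${DRIVER_URL}` is a placeholder for the link copied from the NVIDIA website:

  ```bash
  # Create the download directory and fetch the driver package into it.
  mkdir -p /home/gpu
  cd /home/gpu
  wget "${DRIVER_URL}"
  ```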
Install the driver
- Execute the following command on the GPU node to install the gcc and kernel-devel packages matching the current operating system kernel (sketch below).
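  A sketch assuming a yum-based distribution (the document adds yum repositories later); kernel-devel must match the running kernel:

  ```bash
  # Install the compiler and the kernel headers/devel package for the running kernel.
  yum install -y gcc kernel-devel-$(uname -r)
  ```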
- Execute the following commands, in order, to install the GPU driver (sketch below).
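  A sketch of the installation step; the `.run` file name is a placeholder for the package downloaded to `/home/gpu`:

  ```bash
  # Run the NVIDIA driver installer downloaded earlier.
  cd /home/gpu
  chmod +x NVIDIA-Linux-x86_64-<version>.run
  sh ./NVIDIA-Linux-x86_64-<version>.run
  ```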
- After installation, execute the `nvidia-smi` command. If GPU information is returned, the driver installation was successful.
Install the NVIDIA Container Runtime
- On the GPU node, add the NVIDIA yum repository (sketch below). When the prompt "Metadata cache created." appears, the repository has been added successfully.
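  A sketch of adding the repository; the repository definition below is NVIDIA's public container-toolkit repo and may differ from the one your environment expects, so check NVIDIA's installation documentation:

  ```bash
  # Add NVIDIA's container toolkit yum repository and rebuild the metadata cache.
  curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo \
    | tee /etc/yum.repos.d/nvidia-container-toolkit.repo
  yum makecache
  ```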
- Install the NVIDIA Container Runtime (sketch below). When the prompt "Complete!" appears, the installation was successful.
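  A minimal sketch, assuming the repository added above provides the `nvidia-container-runtime` package:

  ```bash
  # Install the NVIDIA container runtime from the repository added above.
  yum install -y nvidia-container-runtime
  ```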
- Configure the default runtime by adding the following configuration to the corresponding file (sketches of both files follow below):
  - Containerd: modify the `/etc/containerd/config.toml` file.
  - Docker: modify the `/etc/docker/daemon.json` file.
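  The exact stanzas depend on your containerd/Docker versions; the following are sketches of the commonly documented settings that make `nvidia-container-runtime` the default runtime.

  Containerd (`/etc/containerd/config.toml`):

  ```toml
  # Sketch for containerd config version 2: register the nvidia runtime and make it the default.
  version = 2
  [plugins."io.containerd.grpc.v1.cri".containerd]
    default_runtime_name = "nvidia"
    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
      runtime_type = "io.containerd.runc.v2"
      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
        BinaryName = "/usr/bin/nvidia-container-runtime"
  ```

  Docker (`/etc/docker/daemon.json`):

  ```json
  {
    "default-runtime": "nvidia",
    "runtimes": {
      "nvidia": {
        "path": "/usr/bin/nvidia-container-runtime",
        "runtimeArgs": []
      }
    }
  }
  ```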
- Restart Containerd or Docker, depending on which runtime the node uses (sample commands below).
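  A sketch, assuming both services are managed by systemd:

  ```bash
  # Containerd
  systemctl daemon-reload
  systemctl restart containerd

  # Docker
  systemctl daemon-reload
  systemctl restart docker
  ```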
Physical GPU configuration
Deploy the physical GPU plugin on a GPU Business Cluster
On the management interface of the GPU cluster, perform the following actions:
- In the Catalog left sidebar, choose the "Cluster Plugins" section, deploy the "ACP GPU Device Plugin", and enable the "pGPU" option.
- In the "Nodes" tab, select the nodes on which the physical GPU plugin should be deployed, click "Label and Taint Manager", add a "device label", choose "pGPU", and click OK.
- In the "Pods" tab, check the running status of the pod corresponding to nvidia-device-plugin-ds, confirm there are no abnormalities, and ensure it is running on the specified nodes.
NVIDIA MPS configuration (the driver must support CUDA 11.5 or higher)
Deploy the NVIDIA MPS plugin on a GPU Business Cluster
On the management interface of the GPU cluster, perform the following actions:
- In the Catalog left sidebar, choose the "Cluster Plugins" section, deploy the "ACP GPU Device Plugin", and enable the "MPS" option.
- In the "Nodes" tab, select the nodes on which the MPS plugin should be deployed, click "Label and Taint Manager", add a "device label", choose "MPS", and click OK.
- In the "Pods" tab, check the running status of the pod corresponding to nvidia-mps-device-plugin-daemonset, confirm there are no abnormalities, and ensure it is running on the specified nodes.
Configure kube-scheduler (Kubernetes >= 1.23)
- On the control node of the business cluster, check whether the scheduler correctly references the scheduling policy: the kube-scheduler command line should contain a `--config` option whose value is `/etc/kubernetes/scheduler-config.yaml` (see the sketch below).
  Note: The above parameter and value are the platform's default configuration. If you have modified them, change them back to the defaults; your original custom configuration can be copied into the scheduling policy file.
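  A sketch of one way to perform the check (the exact flags on your cluster may differ):

  ```bash
  # Confirm kube-scheduler is started with --config pointing at the scheduling policy file.
  ps -ef | grep kube-scheduler | grep -- --config
  # Expected to contain something like:
  #   ... kube-scheduler ... --config=/etc/kubernetes/scheduler-config.yaml ...
  ```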
- Check the configuration of the scheduling policy file.
  - Execute the command `kubectl describe service kubernetes -n default | grep Endpoints`.
  - Replace the contents of the `/etc/kubernetes/scheduler-config.yaml` file on all Master nodes with the required content, where `${kube-apiserver}` should be replaced with the output of the previous step. If scheduler-config.yaml already contains an `extenders` section, append the new extender entry to the end of it. A structural sketch of the file is shown below.
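  The platform's exact extender definition is not reproduced here; the following is only a structural sketch of a `KubeSchedulerConfiguration` with an `extenders` list, using placeholder values:

  ```yaml
  # Structural sketch only -- the urlPrefix path and managed resource name are placeholders,
  # not the platform's actual extender definition.
  # On Kubernetes 1.23/1.24 the apiVersion may be kubescheduler.config.k8s.io/v1beta2 or v1beta3.
  apiVersion: kubescheduler.config.k8s.io/v1
  kind: KubeSchedulerConfiguration
  clientConnection:
    kubeconfig: /etc/kubernetes/scheduler.conf
  extenders:
    - urlPrefix: "https://${kube-apiserver}/<scheduler-extender-proxy-path>"  # placeholder
      filterVerb: filter
      bindVerb: bind
      enableHTTPS: true
      nodeCacheCapable: true
      managedResources:
        - name: example.com/gpu-resource   # placeholder resource name
          ignoredByScheduler: false
  ```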
- Run the following command to obtain the container ID; in the output, the first column is the container ID:
  - Containerd: execute `crictl ps | grep kube-scheduler`.
  - Docker: run `docker ps | grep kube-scheduler`.
- Restart the kube-scheduler container with Containerd or Docker, using the container ID obtained in the previous step (sample commands below).
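  A sketch, assuming `${CONTAINER_ID}` holds the ID from the previous step:

  ```bash
  # Containerd: stop the container; kubelet recreates the static pod container automatically.
  crictl stop ${CONTAINER_ID}

  # Docker
  docker restart ${CONTAINER_ID}
  ```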
- Restart kubelet (sketch below).
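  A sketch, assuming kubelet runs as a systemd service:

  ```bash
  systemctl restart kubelet
  ```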
GPU-Manager configuration
Configure kube-scheduler (Kubernetes >= 1.23)
- On the control node of the business cluster, check whether the scheduler correctly references the scheduling policy: the kube-scheduler command line should contain a `--config` option whose value is `/etc/kubernetes/scheduler-config.yaml` (the same check sketched in the previous kube-scheduler section applies).
  Note: The above parameter and value are the platform's default configuration. If you have modified them, change them back to the defaults; your original custom configuration can be copied into the scheduling policy file.
- Check the configuration of the scheduling policy file.
  - Execute the command `kubectl describe service kubernetes -n default | grep Endpoints`.
  - Replace the contents of the `/etc/kubernetes/scheduler-config.yaml` file on all Master nodes with the required content, where `${kube-apiserver}` should be replaced with the output of the previous step (the file follows the same structure sketched in the previous kube-scheduler section).
- Run the following command to obtain the container ID; in the output, the first column is the container ID:
  - Containerd: execute `crictl ps | grep kube-scheduler`.
  - Docker: run `docker ps | grep kube-scheduler`.
- Restart the kube-scheduler container with Containerd or Docker, using the container ID obtained in the previous step (the same sample commands as in the previous kube-scheduler section apply).
- Restart kubelet.
Deploy the GPU-Manager plugin on a GPU Business Cluster
On the management interface of the GPU cluster, perform the following actions:
- In the Catalog left sidebar, choose the "Cluster Plugins" section, deploy the "ACP GPU Device Plugin", and enable the "GPU-Manager" option.
- In the "Nodes" tab, select the nodes on which the GPU-Manager plugin should be deployed, click "Label and Taint Manager", add a "device label", choose "vGPU", and click OK.
- In the "Pods" tab, check the running status of the pod corresponding to gpu-manager-daemonset, confirm there are no abnormalities, and ensure it is running on the specified nodes.
Validation of results
Method 1: Check whether GPU resources are available on the GPU nodes by running the following command on the control node of the business cluster (sketch below):
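A sketch of one way to check; the extended resource name reported depends on the plugin that was deployed (for example `nvidia.com/gpu` for the physical GPU plugin), so adjust the name you look for accordingly:

```bash
# List the allocatable resources advertised by a GPU node and look for the GPU resource.
kubectl get node <gpu-node-name> -o jsonpath='{.status.allocatable}'
```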
Method 2: Deploy a GPU application on the platform, specifying the required amount of GPU resources. After deployment, exec into the Pod and execute the following command:
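A sketch, assuming the intended check is `nvidia-smi` inside the container (the pod name is a placeholder):

```bash
# Open a shell command in the GPU pod and verify that the GPU is visible inside the container.
kubectl exec -it <gpu-pod-name> -- nvidia-smi
```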
Check whether the correct GPU information is retrieved.