As business data grows, especially in scenarios such as artificial intelligence and data analysis, you may want to use GPUs in your self-built business cluster to accelerate data processing. In addition to provisioning GPU resources on the cluster nodes, you must also complete the GPU configuration described below.
This solution refers to nodes in the cluster that have GPU computing capabilities as GPU Nodes.
Note: Unless otherwise specified, the steps below apply to both types of nodes. For driver installation issues, refer to NVIDIA's official installation documentation.
GPU resources have been prepared on the node to be operated on, and that node is one of the GPU nodes described in this section.
Notice: If the GPU node uses the NVIDIA MPS plugin, ensure that the GPU architecture of the node is Volta or newer (Volta/Turing/Ampere/Hopper, etc.), and the driver supports CUDA version 11.5 or higher.
Log in to the GPU node and run lspci | grep -i NVIDIA to check the GPU model of the node.
In the following example, the GPU model is Tesla T4.
Go to the NVIDIA official website to obtain the driver download link.
Click on Drivers in the top navigation bar on the homepage.
Fill in the information required to download the driver according to the GPU model of the node.
Click on Search.
Click on Download.
Right-click on Download > Copy Link Address to copy the download link of the driver.
Execute the following commands on the GPU node in order to create the /home/gpu directory and download the driver file into it.
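A minimal sketch of these commands; the URL is a placeholder for the link copied in the previous step:

```bash
mkdir -p /home/gpu
cd /home/gpu
# Replace with the driver download link copied from the NVIDIA website.
wget "<driver-download-link>"
```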
Execute the following command on the GPU node to install the gcc and kernel-devel packages corresponding to the current operating system.
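A sketch for yum-based systems (as implied by the yum repository steps later in this section); the kernel-devel package should match the running kernel:

```bash
yum install -y gcc kernel-devel-$(uname -r)
```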
Execute the following commands in order to install the GPU driver.
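A sketch, assuming the .run installer downloaded above; the file name depends on the driver version:

```bash
chmod +x NVIDIA-Linux-x86_64-<version>.run
sh ./NVIDIA-Linux-x86_64-<version>.run
```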
After installation, execute the nvidia-smi command. If GPU information similar to the following example is returned, the driver installation was successful.
On the GPU Node, add the NVIDIA yum repository.
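The repository definition itself is not reproduced here; per NVIDIA's public packaging documentation, a common way to add it on yum-based systems is:

```bash
# Detect the distribution and add the NVIDIA container runtime repository.
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-container-runtime/$distribution/nvidia-container-runtime.repo | \
  tee /etc/yum.repos.d/nvidia-container-runtime.repo
# Rebuild the yum metadata cache.
yum makecache
```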
When the prompt "Metadata cache created." appears, the repository was added successfully.
Install NVIDIA Container Runtime.
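Assuming the yum repository added above, the package can be installed with:

```bash
yum install -y nvidia-container-runtime
```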
When the prompt Complete! appears, the installation was successful.
Configure the default runtime by adding the following configuration to the corresponding file.
Containerd: modify the /etc/containerd/config.toml file.
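A hedged sketch of the relevant config.toml fragment; the section names follow the containerd 1.x CRI plugin layout, so verify them against the containerd version on the node:

```toml
[plugins."io.containerd.grpc.v1.cri".containerd]
  # Make the NVIDIA runtime the default for all containers.
  default_runtime_name = "nvidia"

  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
    runtime_type = "io.containerd.runc.v2"

    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
      BinaryName = "/usr/bin/nvidia-container-runtime"
```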
Docker: modify the /etc/docker/daemon.json file.
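A sketch of the daemon.json content, following NVIDIA's documented Docker configuration (merge it with any existing settings in the file rather than overwriting them):

```json
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
```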
Restart Containerd or Docker, depending on the container runtime in use, as shown below.
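On systemd-managed nodes this is typically:

```bash
# Containerd
systemctl restart containerd

# Docker
systemctl daemon-reload
systemctl restart docker
```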
On the management interface of the GPU cluster, perform the following actions:
In the left sidebar, choose "Catalog" > "Cluster Plugins", deploy the "ACP GPU Device Plugin", and enable the "pGPU" option;
In the "Nodes" tab, select the nodes that need to deploy the physical GPU, then click on "Label and Taint Manager", add a "device label" and choose "pGPU", and click OK;
In the "Pods" tab, check the running status of the container group corresponding to nvidia-device-plugin-ds to see if there are any abnormalities and ensure it is running on the specified nodes.
In the left sidebar, choose "Catalog" > "Cluster Plugins", deploy the "ACP GPU Device Plugin", and enable the "MPS" option;
In the "Nodes" tab, select the nodes that need to deploy the physical GPU, then click on "Label and Taint Manager", add a "device label" and choose "MPS", and click OK;
In the "Pods" tab, check the running status of the container group corresponding to nvidia-mps-device-plugin-daemonset to see if there are any abnormalities and ensure it is running on the specified nodes.
On the Business Cluster Control Node, check if the scheduler correctly references the scheduling policy.
Check that kube-scheduler is started with the --config option set to /etc/kubernetes/scheduler-config.yaml, for example:
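One way to check this (a sketch, assuming a kubeadm-style cluster where kube-scheduler runs as a static pod under /etc/kubernetes/manifests):

```bash
# Look for the --config flag in the kube-scheduler static pod manifest.
grep -- "--config" /etc/kubernetes/manifests/kube-scheduler.yaml
# Expected: - --config=/etc/kubernetes/scheduler-config.yaml
```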
Note: The parameters and values above are the platform's default configuration. If you have modified them, change them back to the default values; your original custom configuration can be copied into the scheduling policy file.
Check the configuration of the scheduling policy file.
Execute the command: kubectl describe service kubernetes -n default | grep Endpoints.
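The Endpoints value is the API server address used in the next step; an illustrative (not actual) output:

```bash
kubectl describe service kubernetes -n default | grep Endpoints
# Endpoints:          192.168.1.10:6443   <- illustrative value; use the address returned in your cluster
```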
Replace the contents of the /etc/kubernetes/scheduler-config.yaml file on all Master nodes with the following content, where ${kube-apiserver} should be replaced with the output of the first step.
If scheduler-config.yaml already contains an extenders section, append the extender configuration to the end of the existing content instead of replacing it.
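The exact content is supplied by the platform and is not reproduced here; as a rough sketch of the expected shape (the apiVersion, urlPrefix path, and extender options below are placeholders, not the platform's actual values), a KubeSchedulerConfiguration with an extenders section looks like this:

```yaml
# Sketch only: the apiVersion depends on the Kubernetes version, and the
# urlPrefix must be the extender endpoint provided by the platform.
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
clientConnection:
  kubeconfig: /etc/kubernetes/scheduler.conf
extenders:
  - urlPrefix: "https://${kube-apiserver}/<platform-provided-extender-path>"
    filterVerb: filter
    enableHTTPS: true
    nodeCacheCapable: true
```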
Run the following command to obtain the container ID:
Containerd: execute crictl ps | grep kube-scheduler. The output is as follows, with the first column being the container ID.
Docker: run docker ps | grep kube-scheduler. The output is as follows, with the first column being the container ID.
Restart the kube-scheduler container using the container ID obtained in the previous step (via Containerd or Docker, depending on the runtime in use), as shown below.
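A sketch of the restart commands, using the container ID from the previous step:

```bash
# Containerd: stopping the container lets kubelet recreate the static pod.
crictl stop <container-id>

# Docker
docker restart <container-id>
```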
Restart Kubelet.
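On systemd-managed nodes this is typically:

```bash
systemctl restart kubelet
```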
On the Business Cluster Control Node, check if the scheduler correctly references the scheduling policy.
Check that kube-scheduler is started with the --config option set to /etc/kubernetes/scheduler-config.yaml, as described above.
Note: The parameters and values above are the platform's default configuration. If you have modified them, change them back to the default values; your original custom configuration can be copied into the scheduling policy file.
Check the configuration of the scheduling policy file.
Execute the command: kubectl describe service kubernetes -n default | grep Endpoints.
Replace the contents of the /etc/kubernetes/scheduler-config.yaml file on all Master nodes with the following content, where ${kube-apiserver} should be replaced with the output of the first step.
Run the following command to obtain the container ID:
Containerd: execute crictl ps | grep kube-scheduler. The output is as follows, with the first column being the container ID.
Docker: run docker ps | grep kube-scheduler. The output is as follows, with the first column being the container ID.
Restart the kube-scheduler container using the container ID obtained in the previous step (via Containerd or Docker, depending on the runtime in use), as described above.
Restart Kubelet.
On the management interface of the GPU cluster, perform the following actions:
In the left sidebar, choose "Catalog" > "Cluster Plugins", deploy the "ACP GPU Device Plugin", and enable the "GPU-Manager" option;
In the "Nodes" tab, select the nodes that need to deploy the physical GPU, then click on "Label and Taint Manager", add a "device label" and choose "vGPU", and click OK;
In the "Pods" tab, check the running status of the container group corresponding to gpu-manager-daemonset to see if there are any abnormalities and ensure it is running on the specified nodes.
Method 1: Check if there are available GPU resources on the GPU nodes by running the following command on the control node of the business cluster:
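A sketch of such a check; the node name is a placeholder, and the extended resource names shown in the output depend on the plugin in use (e.g. nvidia.com/gpu for pGPU):

```bash
kubectl describe node <gpu-node-name> | grep -i -A 10 "Allocatable"
```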
Method 2: Deploy a GPU application on the platform, specifying the required amount of GPU resources. After deployment, exec into the Pod and execute the following command:
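The command itself is not shown in this excerpt; a common check, assuming nvidia-smi is available in the container image, is:

```bash
# Run inside the GPU Pod; if the GPU is visible to the container, its details are printed.
nvidia-smi
```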
Check whether the correct GPU information is retrieved.