Verification

This article describes how to verify that the installed Alauda Build of NVIDIA GPU Device Plugin and its related monitoring are working correctly.

TOC

Verify Alauda Build of NVIDIA GPU Device Plugin

  1. On a control plane node of the business cluster, check whether the GPU node reports allocatable GPU resources. Run the following command:
    kubectl get node ${nodeName} -o=jsonpath='{.status.allocatable}'
    # The output should contain "nvidia.com/gpu":"1" (the value depends on the number of GPU cards on the node)
  2. Deploy a GPU demo instance and check whether it consumes GPU resources (a complete example Pod manifest is provided at the end of this section). Run the following command on the GPU node of the business cluster:
    nvidia-smi pmon -s u -d 1

If both the sm and mem columns contain data, the GPU is ready and you can start developing GPU applications on the GPU node. Note: when deploying GPU applications, be sure to configure the following mandatory parameters:

spec:
  containers:
    - image: your-image
      imagePullPolicy: IfNotPresent
      name: gpu
      resources:
        limits:
          cpu: '2'
          memory: 4Gi
          nvidia.com/gpu: 1 # Request 1 physical GPU (required)
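
For reference, the required parameters above can be embedded in a complete Pod manifest. The following is a minimal sketch: the Pod name gpu-demo and the nvidia-smi command are illustrative assumptions, and your-image must be replaced with a CUDA-capable image available in your registry.

apiVersion: v1
kind: Pod
metadata:
  name: gpu-demo # illustrative name
spec:
  restartPolicy: Never
  containers:
    - name: gpu
      image: your-image # replace with a CUDA-capable image from your registry
      imagePullPolicy: IfNotPresent
      command: ['nvidia-smi'] # assumption: the image provides nvidia-smi; any GPU workload works
      resources:
        limits:
          cpu: '2'
          memory: 4Gi
          nvidia.com/gpu: 1 # Request 1 physical GPU (required)

Apply it with kubectl apply -f gpu-demo.yaml; while the Pod is running, the nvidia-smi pmon command from step 2 should show activity in the sm and mem columns.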

Verify GPU Dashboards

After the HAMi vGPU service has been running for a while, navigate to Administrator -> Operations Center -> Monitor -> Dashboards and switch to the node and pod panels under GPU. The relevant GPU monitoring data should be displayed.
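
If the dashboards remain empty, you can optionally confirm that GPU metrics are reaching Prometheus. The commands below are a minimal sketch that assumes the monitoring stack scrapes DCGM-style metrics; the metric name DCGM_FI_DEV_GPU_UTIL, the monitoring namespace, and the prometheus service name are assumptions, so substitute the values used in your cluster.

# Forward the Prometheus API to your workstation (namespace and service name are assumptions)
kubectl -n monitoring port-forward svc/prometheus 9090:9090 &

# Query a GPU utilization metric; a non-empty "result" array means GPU metrics are flowing
curl -s 'http://localhost:9090/api/v1/query?query=DCGM_FI_DEV_GPU_UTIL'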