Installation


Prerequisites

  • Cluster administrator access to your ACP cluster
  • Kubernetes version: v1.16+
  • CUDA version: v10.2+
  • Nvidia driver: v440+ for HAMi and v450+ for DCGM-Exporter
  • ACP version: v3.18.2, v4.0, v4.1

Procedure

Installing the Nvidia driver on your GPU node

Refer to the installation guide on the Nvidia official website.
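
Once the driver is installed, you can run a quick sanity check on the GPU node. This is a minimal sketch; it assumes nvidia-smi is on the node's PATH, which the standard driver packages provide:

# Prints the driver version, CUDA version, and detected GPUs
nvidia-smi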

Installing Nvidia Container Runtime

Refer to the installation guide for the Nvidia Container Toolkit.

Add the Nvidia yum repository on the GPU node

Note: Make sure the GPU node can access nvidia.github.io

distribution=$(. /etc/os-release;echo $ID$VERSION_ID) && curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.repo | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
yum makecache -y

When the message "Metadata cache created." appears, the repository was added successfully.

Installing Nvidia Container Runtime

yum install nvidia-container-toolkit -y

When the prompt "Complete!" appears, the installation was successful.
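
After installing the toolkit, the container runtime on the node still has to be configured to use the Nvidia runtime. A minimal sketch, assuming containerd is the container runtime on the node (use --runtime=docker on Docker nodes):

# Register the Nvidia runtime in the containerd configuration and restart containerd
nvidia-ctk runtime configure --runtime=containerd
systemctl restart containerd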

Downloading Cluster plugin

INFO

The Alauda Build of Hami, Alauda Build of DCGM-Exporter, and Alauda Build of Hami-WebUI (optional) cluster plugins can be retrieved from the Customer Portal.

Please contact Customer Support for more information.

Note: Version v4.2.3-413 of Alauda Build of DCGM-Exporter, when deployed in the global cluster, may cause the component to be continuously reinstalled. Version v4.2.3-413-1 resolves this issue, so be sure to use that version.

Uploading the Cluster plugin

For more information on uploading the cluster plugin, please refer to

Installing Alauda Build of Hami

  1. Add the label "gpu=on" to your GPU node so HAMi can schedule workloads onto it.

    kubectl label nodes {nodeid} gpu=on
  2. Go to the Administrator -> Marketplace -> Cluster Plugin page, switch to the target cluster, and then deploy the Alauda Build of Hami cluster plugin. Note: The deployment form parameters can be kept at their defaults, or modified once you understand how they are used.

  3. Verify the result. The plugin shows a status of "Installed" in the UI, or you can check the pod status:

    kubectl get pods -n kube-system | grep -E "hami-scheduler|hami-device-plugin"
  4. Create ConfigMaps that define the extended resources, which allow these resources to be configured on ACP. Run the following script in your GPU cluster:

    kubectl apply -f - <<EOF
    apiVersion: v1
    data:
      dataType: integer
      defaultValue: "1"
      descriptionEn: Declare how many physical GPUs needs and the requests of gpu core and gpu memory are the usage of per physical GPU
      descriptionZh: 申请的物理 gpu 个数, 申请的算力和显存都是每个物理 GPU 的使用量
      group: hami-nvidia
      groupI18n: '{"zh": "HAMi NVIDIA", "en": "HAMi NVIDIA"}'
      key: nvidia.com/gpualloc
      labelEn: gpu number
      labelZh: gpu 个数
      limits: optional
      requests: disabled
      resourceUnit: "count"
      relatedResources: "nvidia.com/gpucores,nvidia.com/gpumem"
      excludeResources: "nvidia.com/mps-core,nvidia.com/mps-memory,tencent.com/vcuda-core,tencent.com/vcuda-memory"
      runtimeClassName: ""
    kind: ConfigMap
    metadata:
      labels:
        features.cpaas.io/enabled: "true"
        features.cpaas.io/group: hami-nvidia
        features.cpaas.io/type: CustomResourceLimitation
      name: cf-crl-hami-nvidia-gpualloc
      namespace: kube-public
    ---
    apiVersion: v1
    data:
      dataType: integer
      defaultValue: "20"
      descriptionEn: vgpu cores, 100 cores represents the all computing power of a physical GPU
      descriptionZh: vgpu 算力, 100 算力代表一个物理 GPU 的全部算力
      group: hami-nvidia
      groupI18n: '{"zh": "HAMi NVIDIA", "en": "HAMi NVIDIA"}'
      key: nvidia.com/gpucores
      labelEn: vgpu cores
      labelZh: vgpu 算力
      limits: optional
      requests: disabled
      resourceUnit: "%"
      relatedResources: "nvidia.com/gpualloc,nvidia.com/gpumem"
      excludeResources: "nvidia.com/mps-core,nvidia.com/mps-memory,tencent.com/vcuda-core,tencent.com/vcuda-memory"
      runtimeClassName: ""
      ignoreNodeCheck: "true"
    kind: ConfigMap
    metadata:
      labels:
        features.cpaas.io/enabled: "true"
        features.cpaas.io/group: hami-nvidia
        features.cpaas.io/type: CustomResourceLimitation
      name: cf-crl-hami-nvidia-gpucores
      namespace: kube-public
    ---
    apiVersion: v1
    data:
      dataType: integer
      defaultValue: "4000"
      group: hami-nvidia
      groupI18n: '{"zh": "HAMi NVIDIA", "en": "HAMi NVIDIA"}'
      key: nvidia.com/gpumem
      labelEn: vgpu memory
      labelZh: vgpu 显存
      limits: optional
      requests: disabled
      resourceUnit: "Mi"
      relatedResources: "nvidia.com/gpualloc,nvidia.com/gpucores"
      excludeResources: "nvidia.com/mps-core,nvidia.com/mps-memory,tencent.com/vcuda-core,tencent.com/vcuda-memory"
      runtimeClassName: ""
      ignoreNodeCheck: "true"
    kind: ConfigMap
    metadata:
      labels:
        features.cpaas.io/enabled: "true"
        features.cpaas.io/group: hami-nvidia
        features.cpaas.io/type: CustomResourceLimitation
      name: cf-crl-hami-nvidia-gpumem
      namespace: kube-public
    ---
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: cf-crl-hami-config
      namespace: kube-public
      labels:
        device-plugin.cpaas.io/config: "true"
    data:
      deviceName: "HAMi"
      nodeLabelKey: "gpu"
      nodeLabelValue: "on"
    EOF
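
    To confirm the ConfigMaps were created, list them (a quick check; the names come from the metadata above):

    kubectl get configmaps -n kube-public | grep cf-crl-hami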

HAMi will then appear in the extended resource type drop-down box on the resource configuration page when creating an application in the ACP business view, where you can use it.
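
For illustration, the sketch below shows a Pod that requests these extended resources directly. It is a hypothetical example: the pod name and image are placeholders, and ACP may translate the values you choose in the form rather than passing these keys through verbatim, so check the workload spec the platform actually generates.

kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: hami-vgpu-demo               # hypothetical example name
spec:
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.0-base-ubuntu22.04   # any CUDA-capable image works
      command: ["sleep", "infinity"]
      resources:
        limits:
          nvidia.com/gpualloc: "1"   # number of physical GPUs requested
          nvidia.com/gpucores: "20"  # compute share per GPU, in percent
          nvidia.com/gpumem: "4000"  # vGPU memory per GPU, in Mi
EOF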

Installing Alauda Build of DCGM-Exporter

  1. Go to the Administrator -> Marketplace -> Cluster Plugin page, switch to the target cluster, and then deploy the Alauda Build of DCGM-Exporter cluster plugin. Set the node labels in the popup form:

    • Node Label Key: gpu
    • Node Label Value: on

    If you need to enable dcgm-exporter for pGPU, add the following labels as well:

    • Node Label Key: nvidia-device-enable
    • Node Label Value: pgpu
  2. Verify the result. The plugin shows a status of "Installed" in the UI, or you can check the pod status:

    kubectl get pods -n kube-system | grep dcgm-exporter
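
    Optionally, scrape the exporter's metrics endpoint directly as a sanity check. This is a quick sketch; dcgm-exporter serves Prometheus metrics on port 9400 by default, and {dcgm-exporter-pod} is a placeholder for one of the pod names listed above:

    # Forward the metrics port and confirm GPU metrics such as DCGM_FI_DEV_GPU_UTIL are present
    kubectl port-forward -n kube-system pod/{dcgm-exporter-pod} 9400:9400 &
    curl -s localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL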

Installing Monitoring

You can use the ACP MonitorDashboard or the Alauda Build of Hami-WebUI.

Installing ACP MonitorDashboard (optional)

Create the ACP MonitorDashboard resource for HAMi GPU monitoring in the ACP dashboard. Save the hami-vgpu-metrics-dashboard-v1.0.2.yaml file to the business cluster and execute the command:

kubectl apply -f hami-vgpu-metrics-dashboard-v1.0.2.yaml

Installing Alauda Build of Hami-WebUI (optional)

  1. Go to the Administrator -> Marketplace -> Cluster Plugin page, switch to the target cluster, and then deploy the Alauda Build of Hami-WebUI cluster plugin. Fill in the Prometheus address and Prometheus authentication. It is recommended to enable NodePort access. The Prometheus address and auth can be retrieved with the following script:
    #!/bin/bash

    # Prometheus address: prefer the in-cluster service, fall back to the external address
    addr=$(kubectl get feature monitoring -o jsonpath='{.spec.accessInfo.database.service}')
    if [ -z "$addr" ]; then
      addr=$(kubectl get feature monitoring -o jsonpath='{.spec.accessInfo.database.address}')
    fi
    echo "Prometheus Address: $addr"

    # Basic-auth credentials are stored in a secret referenced by the monitoring feature
    secret_name=$(kubectl get feature monitoring -o jsonpath='{.spec.accessInfo.database.basicAuth.secretName}')
    namespace="cpaas-system"

    username=$(kubectl get secret $secret_name -n $namespace -o jsonpath='{.data.username}' | base64 -d)
    password=$(kubectl get secret $secret_name -n $namespace -o jsonpath='{.data.password}' | base64 -d)

    auth="Basic $(echo -n "$username:$password" | base64 -w 0)"
    echo "Prometheus Auth   : $auth"
  2. Verify the result. The plugin shows a status of "Installed" in the UI, or you can check the pod status:
    kubectl get pods -n cpaas-system | grep "hami-webui"
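
    If you enabled NodePort access, you can locate the exposed port and open the WebUI in a browser. This is a quick sketch; it assumes the Service name contains "hami-webui" and lives in the cpaas-system namespace alongside the pods, so adjust the grep if your installation differs:

    # Find the NodePort of the Hami-WebUI service, then browse to http://{node-ip}:{node-port}
    kubectl get svc -n cpaas-system | grep hami-webui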