Installation


Prerequisites

  • Cluster administrator access to your ACP cluster
  • Kubernetes version: v1.16+
  • CUDA version: v10.2+
  • Nvidia driver: v440+ for HAMi and v450+ for DCGM-Exporter
  • ACP version: v3.18.2, v4.0, v4.1

Procedure

Installing the Nvidia driver on your GPU node

Refer to the installation guide on the Nvidia official website.
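
Once the driver is installed, you can run a quick sanity check on the GPU node. This is a minimal sketch; it assumes nvidia-smi is on the node's PATH, which the standard driver packages provide:

# Prints the driver version, CUDA version, and detected GPUs
nvidia-smi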

Installing Nvidia Container Runtime

Refer to the installation guide for the Nvidia Container Toolkit.

Add the Nvidia yum repository on the GPU node

Note: Make sure the GPU node can access nvidia.github.io

distribution=$(. /etc/os-release;echo $ID$VERSION_ID) && curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.repo | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
yum makecache -y

When the message "Metadata cache created." appears, the repository was added successfully.

Installing Nvidia Container Runtime

yum install nvidia-container-toolkit -y

When the prompt "Complete!" appears, the installation was successful.
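
After installing the toolkit, the container runtime on the node still has to be configured to use the Nvidia runtime. A minimal sketch, assuming containerd is the container runtime on the node (use --runtime=docker on Docker nodes):

# Register the Nvidia runtime in the containerd configuration and restart containerd
nvidia-ctk runtime configure --runtime=containerd
systemctl restart containerd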

Downloading Cluster plugin

INFO

The Alauda Build of Hami, Alauda Build of DCGM-Exporter, and Alauda Build of Hami-WebUI (optional) cluster plugins can be retrieved from the Customer Portal.

Please contact Customer Support for more information.

Note: Version v4.2.3-413 of Alauda Build of DCGM-Exporter, when deployed in the global cluster, may cause the component to be continuously reinstalled. Version v4.2.3-413-1 resolves this issue, so be sure to use that version.

Uploading the Cluster plugin

For more information on uploading the cluster plugin, please refer to

Installing Alauda Build of Hami

  1. Add the label "gpu=on" to your GPU node so HAMi can schedule workloads onto it.

    kubectl label nodes {nodeid} gpu=on
  2. Go to the Administrator -> Marketplace -> Cluster Plugin page, switch to the target cluster, and then deploy the Alauda Build of Hami cluster plugin. Note: The deployment form parameters can be kept at their defaults, or modified once you understand how they are used.

  3. Verify the result. The plugin shows a status of "Installed" in the UI, or you can check the pod status:

    kubectl get pods -n kube-system | grep -E "hami-scheduler|hami-device-plugin"
  4. Create ConfigMaps that define the extended resources, which allow these resources to be configured on ACP. Run the following script in your GPU cluster:

    kubectl apply -f - <<EOF
    apiVersion: v1
    data:
      dataType: integer
      defaultValue: "1"
      descriptionEn: Declare how many physical GPUs needs and the requests of gpu core and gpu memory are the usage of per physical GPU
      descriptionZh: 申请的物理 gpu 个数, 申请的算力和显存都是每个物理 GPU 的使用量
      group: hami-nvidia
      groupI18n: '{"zh": "HAMi NVIDIA", "en": "HAMi NVIDIA"}'
      key: nvidia.com/gpualloc
      labelEn: gpu number
      labelZh: gpu 个数
      limits: optional
      requests: disabled
      resourceUnit: "count"
      relatedResources: "nvidia.com/gpucores,nvidia.com/gpumem"
      excludeResources: "nvidia.com/mps-core,nvidia.com/mps-memory,tencent.com/vcuda-core,tencent.com/vcuda-memory"
      runtimeClassName: ""
    kind: ConfigMap
    metadata:
      labels:
        features.cpaas.io/enabled: "true"
        features.cpaas.io/group: hami-nvidia
        features.cpaas.io/type: CustomResourceLimitation
      name: cf-crl-hami-nvidia-gpualloc
      namespace: kube-public
    ---
    apiVersion: v1
    data:
      dataType: integer
      defaultValue: "20"
      descriptionEn: vgpu cores, 100 cores represents the all computing power of a physical GPU
      descriptionZh: vgpu 算力, 100 算力代表一个物理 GPU 的全部算力
      group: hami-nvidia
      groupI18n: '{"zh": "HAMi NVIDIA", "en": "HAMi NVIDIA"}'
      key: nvidia.com/gpucores
      labelEn: vgpu cores
      labelZh: vgpu 算力
      limits: optional
      requests: disabled
      resourceUnit: "%"
      relatedResources: "nvidia.com/gpualloc,nvidia.com/gpumem"
      excludeResources: "nvidia.com/mps-core,nvidia.com/mps-memory,tencent.com/vcuda-core,tencent.com/vcuda-memory"
      runtimeClassName: ""
      ignoreNodeCheck: "true"
    kind: ConfigMap
    metadata:
      labels:
        features.cpaas.io/enabled: "true"
        features.cpaas.io/group: hami-nvidia
        features.cpaas.io/type: CustomResourceLimitation
      name: cf-crl-hami-nvidia-gpucores
      namespace: kube-public
    ---
    apiVersion: v1
    data:
      dataType: integer
      defaultValue: "4000"
      group: hami-nvidia
      groupI18n: '{"zh": "HAMi NVIDIA", "en": "HAMi NVIDIA"}'
      key: nvidia.com/gpumem
      labelEn: vgpu memory
      labelZh: vgpu 显存
      limits: optional
      requests: disabled
      resourceUnit: "Mi"
      relatedResources: "nvidia.com/gpualloc,nvidia.com/gpucores"
      excludeResources: "nvidia.com/mps-core,nvidia.com/mps-memory,tencent.com/vcuda-core,tencent.com/vcuda-memory"
      runtimeClassName: ""
      ignoreNodeCheck: "true"
    kind: ConfigMap
    metadata:
      labels:
        features.cpaas.io/enabled: "true"
        features.cpaas.io/group: hami-nvidia
        features.cpaas.io/type: CustomResourceLimitation
      name: cf-crl-hami-nvidia-gpumem
      namespace: kube-public
    ---
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: cf-crl-hami-config
      namespace: kube-public
      labels:
        device-plugin.cpaas.io/config: "true"
    data:
      deviceName: "HAMi"
      nodeLabelKey: "gpu"
      nodeLabelValue: "on"
    EOF
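
    To confirm the ConfigMaps were created, list them (a quick check; the names come from the metadata above):

    kubectl get configmaps -n kube-public | grep cf-crl-hami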

HAMi will then appear in the extended resource type drop-down box on the resource configuration page when creating an application in the ACP business view, where you can use it.
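
For illustration, the sketch below shows a Pod that requests these extended resources directly. It is a hypothetical example: the pod name and image are placeholders, and ACP may translate the values you choose in the form rather than passing these keys through verbatim, so check the workload spec the platform actually generates.

kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: hami-vgpu-demo               # hypothetical example name
spec:
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.0-base-ubuntu22.04   # any CUDA-capable image works
      command: ["sleep", "infinity"]
      resources:
        limits:
          nvidia.com/gpualloc: "1"   # number of physical GPUs requested
          nvidia.com/gpucores: "20"  # compute share per GPU, in percent
          nvidia.com/gpumem: "4000"  # vGPU memory per GPU, in Mi
EOF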

Installing Alauda Build of DCGM-Exporter

  1. Go to the Administrator -> Marketplace -> Cluster Plugin page, switch to the target cluster, and then deploy the Alauda Build of DCGM-Exporter cluster plugin. Set the node labels in the popup form:

    • Node Label Key: gpu
    • Node Label Value: on

    If you need to enable dcgm-exporter for pGPU, add the following labels as well:

    • Node Label Key: nvidia-device-enable
    • Node Label Value: pgpu
  2. Verify the result. The plugin shows a status of "Installed" in the UI, or you can check the pod status:

    kubectl get pods -n kube-system | grep dcgm-exporter
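
    Optionally, scrape the exporter's metrics endpoint directly as a sanity check. This is a quick sketch; dcgm-exporter serves Prometheus metrics on port 9400 by default, and {dcgm-exporter-pod} is a placeholder for one of the pod names listed above:

    # Forward the metrics port and confirm GPU metrics such as DCGM_FI_DEV_GPU_UTIL are present
    kubectl port-forward -n kube-system pod/{dcgm-exporter-pod} 9400:9400 &
    curl -s localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL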

Installing Monitoring

You can use the ACP MonitorDashboard or the Alauda Build of Hami-WebUI.

Installing ACP MonitorDashboard (optional)

Create the ACP MonitorDashboard resource for HAMi GPU monitoring in the ACP dashboard. Save the hami-vgpu-metrics-dashboard-v1.0.2.yaml file to the business cluster and execute the command:

kubectl apply -f hami-vgpu-metrics-dashboard-v1.0.2.yaml

Installing Alauda Build of Hami-WebUI (optional)

  1. Go to the Administrator -> Marketplace -> Cluster Plugin page, switch to the target cluster, and then deploy the Alauda Build of Hami-WebUI cluster plugin. Fill in the Prometheus address and Prometheus authentication. It is recommended to enable NodePort access. The Prometheus address and auth can be retrieved with the following script:
    #!/bin/bash

    # Prometheus address: prefer the in-cluster service, fall back to the external address
    addr=$(kubectl get feature monitoring -o jsonpath='{.spec.accessInfo.database.service}')
    if [ -z "$addr" ]; then
      addr=$(kubectl get feature monitoring -o jsonpath='{.spec.accessInfo.database.address}')
    fi
    echo "Prometheus Address: $addr"

    # Basic-auth credentials are stored in a secret referenced by the monitoring feature
    secret_name=$(kubectl get feature monitoring -o jsonpath='{.spec.accessInfo.database.basicAuth.secretName}')
    namespace="cpaas-system"

    username=$(kubectl get secret $secret_name -n $namespace -o jsonpath='{.data.username}' | base64 -d)
    password=$(kubectl get secret $secret_name -n $namespace -o jsonpath='{.data.password}' | base64 -d)

    auth="Basic $(echo -n "$username:$password" | base64 -w 0)"
    echo "Prometheus Auth   : $auth"
  2. Verify the result. The plugin shows a status of "Installed" in the UI, or you can check the pod status:
    kubectl get pods -n cpaas-system | grep "hami-webui"
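
    If you enabled NodePort access, you can locate the exposed port and open the WebUI in a browser. This is a quick sketch; it assumes the Service name contains "hami-webui" and lives in the cpaas-system namespace alongside the pods, so adjust the grep if your installation differs:

    # Find the NodePort of the Hami-WebUI service, then browse to http://{node-ip}:{node-port}
    kubectl get svc -n cpaas-system | grep hami-webui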