How to Define a vGPU (HAMi) Cost Model
Prerequisites
In the GPU cluster:
- Alauda Build of Hami installed
- The Cost Management Agent installed
About Alauda Build of Hami
Heterogeneous AI Computing Virtualization Middleware (HAMi), formerly known as k8s-vGPU-scheduler, is an "all-in-one" chart designed to manage Heterogeneous AI Computing Devices in a k8s cluster. It can provide the ability to share Heterogeneous AI devices among tasks.
Note
Because Alauda Build of Hami releases on a different cadence from Alauda Container Platform, the Alauda Build of Hami documentation is available as a separate documentation set.
Procedure
Create a PrometheusRule to generate the required metrics
Create the following PrometheusRule in the HAMi cluster.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus: kube-prometheus
  name: hami-gpu-labels
  namespace: kube-system
spec:
  groups:
    - name: hami-gpu-labels.rules
      rules:
        - expr: |
            min by (podnamespace, deviceuuid, label_modelName, label_device) (
              vGPUCorePercentage
              * on (deviceuuid) group_left(label_modelName, label_device) (
                label_replace(
                  label_replace(
                    label_replace(
                      DCGM_FI_DEV_SM_CLOCK,
                      "deviceuuid", "$1", "UUID", "(.*)"
                    ),
                    "label_modelName", "$1", "modelName", "(.*)"
                  ),
                  "label_device", "$1", "device", "([a-zA-Z]+)[0-9]+$"
                )
              )
            )
          record: vgpu_core_labels
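After Prometheus has evaluated the rule, you can check that the recording rule is producing data before continuing. A minimal verification query, assuming you can run PromQL against the cluster's Prometheus:

# Should return one series per GPU device UUID and namespace,
# carrying the joined label_modelName and label_device labels
vgpu_core_labels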
Add Collection Config (Cost Management Agent)
Create the following ConfigMaps in the HAMi cluster (where the Cost Management Agent runs) to declare what to collect.
Note: The project quota ConfigMap is only supported in HAMi 2.7+.
apiVersion: v1
data:
  config: >
    - kind: vGPU
      category: vGPUCore
      item: vGPUCoreQuota
      period: Hourly
      labels:
        query: "vgpu_core_labels{}"
        mappers:
          name: deviceuuid
          namespace: podnamespace
          cluster: ""
          project: ""
      usage:
        query: sum by (deviceuuid,podnamespace) (avg_over_time(vGPUCorePercentage{}[5m]))
        step: 5m
        mappers:
          name: deviceuuid
          namespace: podnamespace
          cluster: ""
          project: ""
    - kind: vGPU
      category: vGPUMemory
      item: vGPURamBytesQuota
      period: Hourly
      labels:
        query: "vgpu_core_labels{}"
        mappers:
          name: deviceuuid
          namespace: podnamespace
          cluster: ""
          project: ""
      usage:
        query: sum by (deviceuuid,podnamespace) (avg_over_time(vGPU_device_memory_limit_in_bytes{}[5m]))
        step: 5m
        mappers:
          name: deviceuuid
          namespace: podnamespace
          cluster: ""
          project: ""
    - kind: vGPU
      category: vGPUCore
      item: vGPUCoreUsed
      period: Hourly
      labels:
        query: "vgpu_core_labels{}"
        mappers:
          name: deviceuuid
          namespace: podnamespace
          cluster: ""
          project: ""
      usage:
        query: sum by (deviceuuid,podnamespace) (avg_over_time(Device_utilization_desc_of_container{}[5m]))
        step: 5m
        mappers:
          name: deviceuuid
          namespace: podnamespace
          cluster: ""
          project: ""
    - kind: vGPU
      category: vGPUMemory
      item: vGPURamBytesUsed
      period: Hourly
      labels:
        query: "vgpu_core_labels{}"
        mappers:
          name: deviceuuid
          namespace: podnamespace
          cluster: ""
          project: ""
      usage:
        query: sum by (deviceuuid,podnamespace) (avg_over_time(vGPU_device_memory_usage_in_bytes{}[5m]))
        step: 5m
        mappers:
          name: deviceuuid
          namespace: podnamespace
          cluster: ""
          project: ""
kind: ConfigMap
metadata:
  labels:
    cpaas.io/slark.collection.config: "true"
  name: slark-agent-vgpu-namespace-config
  namespace: cpaas-system
---
# Note: The following ConfigMap is only supported in HAMi 2.7+
apiVersion: v1
data:
  config: >
    - kind: Project
      category: vGPUCore
      item: vGPUCoresProjectQuota
      period: Hourly
      usage:
        query: avg by (project, cluster) (avg_over_time(cpaas_project_resourcequota{resource="limits.nvidia.com/gpucores", type="project-hard"}[5m]))
        step: 5m
        mappers:
          name: project
          namespace: ""
          cluster: cluster
          project: project
    - kind: Project
      category: vGPUMemory
      item: vGPURamBytesProjectQuota
      period: Hourly
      usage:
        query: avg by (project, cluster) (avg_over_time(cpaas_project_resourcequota{resource="limits.nvidia.com/gpumem", type="project-hard"}[5m]))
        step: 5m
        mappers:
          name: project
          namespace: ""
          cluster: cluster
          project: project
kind: ConfigMap
metadata:
  labels:
    cpaas.io/slark.collection.config: "true"
  name: slark-agent-project-config-vgpu
  namespace: cpaas-system
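Assuming you saved the ConfigMaps above to a local file (the file name below is only an example), apply them to the HAMi cluster:

kubectl apply -f slark-agent-vgpu-collection-config.yaml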
After applying the YAML, restart the Agent Pod to reload the configuration.
kubectl delete pods -n cpaas-system -l service_name=slark-agent
Add Display/Storage Config (Cost Management Server)
Create a ConfigMap in the cluster where the Cost Management Server runs to declare billing items, methods, units, and display names. This tells the server what to bill and how.
Note:
Request Usage is only meaningful when the GPU Overcommitment Ratio is enabled. If you bill by Request Usage, enable the GPU Overcommitment Ratio.
apiVersion: v1
data:
  config: |
    - name: vGPUCore
      displayname:
        zh: "HAMi NVIDIA vGPU Cores"
        en: "HAMi NVIDIA vGPU Cores"
      methods:
        - name: Request
          displayname:
            zh: "请求量"
            en: "Request Usage"
          item: vGPUCoreQuota
          divisor: 1
          unit:
            zh: "core-hours"
            en: "core-hours"
        - name: Usage
          displayname:
            zh: "使用量"
            en: "Used Usage"
          item: vGPUCoreUsed
          divisor: 1
          unit:
            zh: "core-hours"
            en: "core-hours"
        - name: ProjectQuota
          displayname:
            zh: "项目配额"
            en: "Project Quota"
          item: vGPUCoresProjectQuota
          unit:
            zh: "core-hours"
            en: "core-hours"
          divisor: 1
    - name: vGPUMemory
      displayname:
        zh: "HAMi NVIDIA vGPU Memory"
        en: "HAMi NVIDIA vGPU Memory"
      methods:
        - name: Request
          displayname:
            zh: "请求量"
            en: "Request Usage"
          item: vGPURamBytesQuota
          divisor: 1073741824
          unit:
            zh: "Gi-hours"
            en: "Gi-hours"
        - name: Used
          displayname:
            zh: "使用量"
            en: "Used Usage"
          item: vGPURamBytesUsed
          divisor: 1073741824
          unit:
            zh: "Gi-hours"
            en: "Gi-hours"
        - name: ProjectQuota
          displayname:
            zh: "项目配额"
            en: "Project Quota"
          item: vGPURamBytesProjectQuota
          unit:
            zh: "Gi-hours"
            en: "Gi-hours"
          divisor: 1024 # Mi/1024
kind: ConfigMap
metadata:
  labels:
    cpaas.io/slark.display.config: "true"
  name: slark-display-config-for-vgpu
  namespace: kube-public
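The divisor converts the collected raw value into the displayed unit. As a worked example: vGPU memory quota and usage are collected from byte-valued metrics, so a workload that holds 2 GiB of vGPU memory for one hour corresponds to 2147483648 byte-hours; dividing by 1073741824 yields 2 Gi-hours, which is then multiplied by the configured price. The project quota for gpumem uses a divisor of 1024 because, per the comment above, it is collected in Mi rather than bytes.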
After applying the YAML, restart the Server Pod to reload the configuration.
kubectl delete pods -n cpaas-system -l service_name=slark
Add Price For a vGPU Cost Model
If the GPU cluster does not yet have a cost model, create one first.
Then you can add prices to the cost model of the GPU cluster:
- Billing Method Description
- Add Price For a Cost Model
- Select vGPUCore or vGPUMemory in Billing Items.
- Select Request Usage, Used Usage, or Project Quota in Method (core-hours for vGPUCore, Gi-hours for vGPUMemory).
- Set Default Price.
- Config Price By Label (optional).
  Currently only two keys are supported: modelName and device.
  - modelName: the GPU model, for example "Tesla P100-PCIE-16GB" or "Tesla T4" (as reported by nvidia-smi; see the example after this list).
  - device: the GPU manufacturer, for example "nvidia" or "ascend".
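To look up the exact modelName value for a node's GPUs, you can run nvidia-smi on that node. A minimal example using standard nvidia-smi flags (the output shown is only illustrative):

# Print only the GPU product names, one per line
nvidia-smi --query-gpu=name --format=csv,noheader
# Example output:
# Tesla T4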
Cost Details and Cost Statistics
Finally, after one or more hours, you can view the cost details in Cost Details, broken down by namespace and card UUID.
You can also view total costs by cluster, project, and namespace in Cost Statistics.