How to Define a Custom Cost Model

This guide shows how to add new billing items by defining a custom cost model. The workflow has four parts:

  • Prepare usage metrics
  • Prepare label metrics
  • Add a collection config for the Cost Management Agent
  • Add a display config for the Cost Management Server

As a concrete example, we will add two GPU billing items, GPU Compute (cores) and GPU Memory, and enable pricing by label.

Prepare Usage Metrics

Usage metrics represent the consumption to be billed. Provide a PromQL query that aggregates usage per step (for example, 5m). Each step yields one data point that represents usage over the preceding window; the final usage is the sum across the selected time range.

Prometheus metric types include Counter, Gauge, Summary, and Histogram. For usage, use Counter or Gauge. Examples:

# Counter example (e.g., GPU core usage time):
vgpu_core_usage_seconds_total{containeridx="0",deviceuuid="GPU-2eec4202-80dc-870f-3ca5-25879d96eca7",endpoint="monitor",instance="10.3.0.130:9395",ip="192.168.131.32",job="hami-scheduler",namespace="kube-system",node_name="192.168.131.32",nodename="192.168.133.48",pod="hami-scheduler-86b7ff47c6-sp7kb",podname="pytorch-cuda12-1-57d9ff4544-bqkvg",podnamespace="yulin-2",service="hami-scheduler",zone="vGPU"}

# Corresponding usage query (step = 5m):
sum by (deviceuuid, podnamespace) (rate(vgpu_core_usage_seconds_total{}[5m]))


# Gauge example (e.g., GPU memory usage in bytes):
vgpu_memory_usage_bytes{containeridx="0",deviceuuid="GPU-2eec4202-80dc-870f-3ca5-25879d96eca7",endpoint="monitor",instance="10.3.0.130:9395",ip="192.168.131.32",job="hami-scheduler",namespace="kube-system",node_name="192.168.131.32",nodename="192.168.133.48",pod="hami-scheduler-86b7ff47c6-sp7kb",podname="pytorch-cuda12-1-57d9ff4544-bqkvg",podnamespace="yulin-2",service="hami-scheduler",zone="vGPU"}

# Corresponding usage query (step = 5m):
sum by (deviceuuid, podnamespace) (avg_over_time(vgpu_memory_usage_bytes{}[5m]))
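
Before wiring a usage query into the collection config, it is worth running it against Prometheus at the same step you intend to configure. The sketch below uses the Prometheus range-query HTTP API; the Prometheus address and the time window are placeholders to adjust for your environment.

# Sanity-check the GPU core usage query at a 5m step (address and time range are placeholders)
curl -sG 'http://<prometheus-address>/api/v1/query_range' \
  --data-urlencode 'query=sum by (deviceuuid, podnamespace) (rate(vgpu_core_usage_seconds_total{}[5m]))' \
  --data-urlencode 'start=2024-04-29T00:00:00Z' \
  --data-urlencode 'end=2024-04-29T01:00:00Z' \
  --data-urlencode 'step=5m'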

Prepare Label Metrics

Label metrics attach attributes to usage so you can price by labels. They must be joinable with the usage metrics—share at least one label (such as deviceuuid)—so the system can enrich usage with those attributes.

For label metrics, use Gauge. Example:

# Label metrics (label fields start with label_ in this example):
vgpu_device_labels{deviceuuid="GPU-2eec4202-80dc-870f-3ca5-25879d96eca7",label_device="nvidia",label_modelName="Tesla T4",podnamespace="yulin-1"}

# Query for label metrics:
vgpu_device_labels{}

# How the usage and label queries relate (GPU core & memory example):
# 1) The usage query groups by (deviceuuid, podnamespace) and each series includes deviceuuid
# 2) The label query groups by its own labels and each series includes deviceuuid
# 3) Because both include the unique deviceuuid, labels from the label query can be attached to usage
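
The enrichment itself is done by the collection pipeline, but you can check that the join key lines up by running an ad-hoc PromQL join. The query below is only an illustrative sketch: it assumes each deviceuuid has a single series in vgpu_device_labels and that the label metric's value is 1, so the multiplication leaves the usage value unchanged.

# Illustrative join of usage and label metrics on deviceuuid (Prometheus address is a placeholder)
QUERY='sum by (deviceuuid, podnamespace) (rate(vgpu_core_usage_seconds_total{}[5m]))
  * on (deviceuuid) group_left (label_device, label_modelName)
  max by (deviceuuid, label_device, label_modelName) (vgpu_device_labels{})'
curl -sG 'http://<prometheus-address>/api/v1/query' --data-urlencode "query=${QUERY}"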

Add Collection Config (Cost Management Agent)

Create a ConfigMap in each cluster where the Cost Management Agent runs to declare what to collect.

A typical collected record (GPU core/memory) maps to config fields as shown:

{
    "id": "cab9881e380fcf72726ccf45565ffc2d",           # Auto-generated from agreed fields
    "kind": "Vgpu",                                     # Matches kind in config
    "name": "GPU-2eec4202-80dc-870f-3ca5-25879d96eca7", # From usage.mappers.name
    "namespace": "cpaas-system",                        # From usage.mappers.namespace
    "cluster": "default-cluster",                       # Cluster where the agent runs
    "project": "cpaas-system",                          # Derived from namespace
    "labels": {                                          # From labels.query if configured
        "key1": "val1",
        "key2": "val2"
    },
    "date": "2024-04-29T00:00:00Z",                     # Start of that day
    "period": "hourly",                                 # Matches period in config
    "start": "2024-04-29T00:05:00Z",                    # Period start
    "end": "2024-04-29T00:59:59Z",                      # Period end
    "category": "VgpuCore",                             # Matches category in config
    "item": "VgpuCoreUsed",                             # Matches item in config
    "usage": 200                                         # From usage query result
}

Now create the ConfigMap for the agent (GPU core & memory):

apiVersion: v1
data:
  config: |
    - kind: Vgpu                                        # Required; must be consistent with server config
      category: VgpuCore                                # Required; must be consistent with server config
      item: VgpuCoreUsed                                # Required and unique; must match server config
      period: Hourly                                    # Required; Hourly or Daily; prefer Hourly
      labels:                                           # Optional; enrich usage with labels from this query
        query: "vgpu_core_labels{}"                     # E.g., add GPU model, vendor to labels
        mappers:
          name: deviceuuid                              # Map deviceuuid as name
          namespace: podnamespace                       # Map podnamespace as namespace
          cluster: ""                                   # Leave empty to auto-fill current cluster
          project: ""                                   # Leave empty to auto-fill project from namespace
      usage:                                            # Required; usage query (grouped by deviceuuid,podnamespace)
        query: "sum by (deviceuuid, podnamespace) (rate(vgpu_core_usage_seconds_total{}[5m]))"
        step: 5m                                        # Required; step for sampling points
        mappers:
          name: deviceuuid                              # Map deviceuuid to name
          namespace: podnamespace                       # Map podnamespace to namespace
          cluster: ""                                   # Auto-fill if empty
          project: ""                                   # Auto-fill if empty
    - kind: Vgpu                                        # Second billing item
      category: VgpuMemory                              # Must be consistent with server config
      item: VgpuMemoryUsed                              # Unique item name
      period: Hourly
      labels:
        query: "vgpu_core_labels{}"
        mappers:
          name: deviceuuid
          namespace: podnamespace
          cluster: ""
          project: ""
      usage:
        query: "sum by (deviceuuid, podnamespace) (avg_over_time(vgpu_memory_usage_bytes{}[5m]))"
        step: 5m
        mappers:
          name: deviceuuid
          namespace: podnamespace
          cluster: ""
          project: ""
kind: ConfigMap
metadata:
  labels:
    cpaas.io/slark.collection.config: "true"            # Required; enables collection config
  name: slark-agent-project-config-vgpu
  namespace: cpaas-system                               # Required

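Apply the ConfigMap to each cluster where the agent runs; the file name below is only an example.

kubectl apply -f slark-agent-project-config-vgpu.yaml
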
After applying the YAML, restart the Agent Pods to reload the configuration:

kubectl delete pods -n cpaas-system -l service_name=slark-agent
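
To confirm the agent picked up the new billing items, you can follow its logs after the restart. The label selector is the same one used in the restart command above; the exact log lines vary by version.

kubectl logs -n cpaas-system -l service_name=slark-agent --tail=100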

Add Display/Storage Config (Cost Management Server)

Create a ConfigMap in the cluster where the Cost Management Server runs to declare billing items, methods, units, and display names. This tells the server what and how to bill.

apiVersion: v1
data:
  config: |
    - name: VgpuCore                                     # Billing item name; must match category in the agent config
      displayname:
        zh: "Vgpu"
        en: "Vgpu"
      methods:                                           # List of billing methods (unique names)
        - name: Usage                                    # Method name
          displayname:
            zh: "使用量"
            en: "Used Usage"
          item: VgpuCoreUsed                             # Must match the agent config item
          divisor: 1000                                  # Unit conversion (e.g., mCPU to cores)
          unit:
            zh: "core-hours"
            en: "core-hours"
    - name: VgpuMemory                                   # Second billing item
      displayname:
        zh: "Vgpu 显存"
        en: "VgpuMemory"
      methods:
        - name: Used
          displayname:
            zh: "使用量"
            en: "Used Usage"
          item: VgpuMemoryUsed
          divisor: 1073741824                            # bytes -> Gi
          unit:
            zh: "Gi-hours"
            en: "Gi-hours"
kind: ConfigMap
metadata:
  labels:
    cpaas.io/slark.display.config: "true"                # Required; enables display/storage config
  name: slark-display-config-for-vgpu
  namespace: kube-public                                 # Required

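Apply this ConfigMap in the cluster where the Cost Management Server runs; the file name below is only an example.

kubectl apply -f slark-display-config-for-vgpu.yaml
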
After applying the YAML, restart the Server Pods to reload the configuration:

kubectl delete pods -n cpaas-system -l service_name=slark-server
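
After the restart, confirm the Server Pods come back up before checking the new billing items in the UI; the label selector matches the restart command above.

kubectl get pods -n cpaas-system -l service_name=slark-server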

Notes and Best Practices

  • Keep naming consistent across agent and server configs: kind, category, and item must match.
  • Prefer Hourly for finer granularity and faster feedback.
  • Ensure label metrics can be joined with usage via a shared label (for example, deviceuuid).
  • Validate PromQL locally before rollout (see the sketch below); monitor initial runs for data sanity.
  • Start small (few clusters/items), then scale after validation.
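
If Prometheus is not directly reachable from your workstation, one way to validate queries locally is to port-forward the Prometheus service first. The service name and namespace below are assumptions; adjust them to match your monitoring deployment, and keep the port-forward running in a separate terminal.

# Assumed service name and namespace; run in a separate terminal and keep it open
kubectl port-forward -n cpaas-system svc/prometheus 9090:9090

# Then run the query checks against localhost, e.g. the GPU memory usage query:
curl -sG 'http://127.0.0.1:9090/api/v1/query_range' \
  --data-urlencode 'query=sum by (deviceuuid, podnamespace) (avg_over_time(vgpu_memory_usage_bytes{}[5m]))' \
  --data-urlencode 'start=2024-04-29T00:00:00Z' \
  --data-urlencode 'end=2024-04-29T01:00:00Z' \
  --data-urlencode 'step=5m'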