⚠️ This feature is still experimental. Please use it with caution.

Enable dynamic MIG feature

HAMi now supports dynamic MIG, using mig-parted to create and adjust MIG devices on demand. This includes:

  • Dynamic MIG Instance Management: Users no longer need to operate directly on GPU nodes or use commands like nvidia-smi -i 0 -mig 1 to manage MIG instances. HAMi-device-plugin will handle this automatically.

  • Dynamic MIG Adjustment: Each MIG device managed by HAMi will dynamically adjust its MIG template according to the jobs submitted, as needed.

  • Device MIG Observation: Each MIG instance generated by HAMi will be displayed in the scheduler monitor, along with job information, providing a clear overview of MIG nodes.

  • Compatibility with HAMi-Core Nodes: HAMi can manage a unified GPU pool across both hami-core nodes and MIG nodes. A job can be scheduled to either type of node unless a mode is pinned explicitly with the nvidia.com/vgpu-mode annotation.

  • Unified API with HAMi-Core: No additional work is required to make jobs compatible with the dynamic MIG feature.
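
For example, a job can be pinned to one mode through the annotation mentioned above. The "mig" value is shown later in this document; "hami-core" as the complementary value is an assumption — verify it against your HAMi version:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pinned-pod
  annotations:
    # "mig" pins the pod to MIG nodes; "hami-core" (assumed value) pins it to
    # hami-core nodes. Omit the annotation to let the scheduler choose either mode.
    nvidia.com/vgpu-mode: "mig"
```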

Prerequisites

  • NVIDIA Blackwell, Hopper™, and Ampere GPUs
  • Alauda Build of HAMi installed

Enable dynamic MIG support

  1. Set operatingmode to mig in the hami-device-plugin ConfigMap for each MIG node
    kubectl edit configmap hami-device-plugin -n kube-system
    apiVersion: v1
    data:
      config.json: |
        {
            "nodeconfig": [
                {
                    "name": "MIG-NODE-A",
                    "operatingmode": "mig",
                    "migstrategy":"mixed",
                    "filterdevices": {
                      "uuid": [],
                      "index": []
                    }
                }
            ]
        }
    kind: ConfigMap
    Replace the node name in the nodeconfig array with the target node's name. To cover multiple nodes, add more entries to the array.
  2. Restart the following pods for the change to take effect:
  • hami-scheduler
  • hami-device-plugin on node 'MIG-NODE-A'
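
A possible way to perform the restarts — the Deployment name and the device-plugin label selector below are assumptions, so check them against your installation first:

```shell
# Restart the scheduler (Deployment name may differ in your install)
kubectl -n kube-system rollout restart deployment hami-scheduler

# Restart the device plugin pod on the MIG node; verify the label with
# `kubectl -n kube-system get pods --show-labels` before relying on it
kubectl -n kube-system delete pod \
  -l app.kubernetes.io/component=hami-device-plugin \
  --field-selector spec.nodeName=MIG-NODE-A
```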

Note: The configuration above is lost on chart upgrade; future versions of HAMi will improve this.

Custom MIG configuration (optional)

HAMi ships with a default MIG configuration.

You can customize the MIG configuration by following the steps below:

kubectl -n kube-system edit configmap hami-scheduler-device
apiVersion: v1
data:
  device-config.yaml: |-
      knownMigGeometries:
      - models: [ "A30" ]
        allowedGeometries:
          -
            - name: 1g.6gb
              memory: 6144
              count: 4
          -
            - name: 2g.12gb
              memory: 12288
              count: 2
          -
            - name: 4g.24gb
              memory: 24576
              count: 1
      - models: [ "A100-SXM4-40GB", "A100-40GB-PCIe", "A100-PCIE-40GB" ]
        allowedGeometries:
          -
            - name: 1g.5gb
              memory: 5120
              count: 7
          -
            - name: 2g.10gb
              memory: 10240
              count: 3
            - name: 1g.5gb
              memory: 5120
              count: 1
          -
            - name: 3g.20gb
              memory: 20480
              count: 2
          -
            - name: 7g.40gb
              memory: 40960
              count: 1
      - models: [ "A100-SXM4-80GB", "A100-80GB-PCIe", "A100-PCIE-80GB"]
        allowedGeometries:
          -
            - name: 1g.10gb
              memory: 10240
              count: 7
          -
            - name: 2g.20gb
              memory: 20480
              count: 3
            - name: 1g.10gb
              memory: 10240
              count: 1
          -
            - name: 3g.40gb
              memory: 40960
              count: 2
          -
            - name: 7g.79gb
              memory: 80896
              count: 1

Then restart the hami-scheduler components. HAMi identifies and uses the first MIG template that matches the job, in the order defined in this ConfigMap.
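
For example — assuming the scheduler runs as a Deployment named hami-scheduler, which may differ in your installation:

```shell
# Verify the name first with `kubectl -n kube-system get deploy`
kubectl -n kube-system rollout restart deployment hami-scheduler
```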

Note: The configuration above is lost on chart upgrade; future versions of HAMi will improve this.

Running MIG jobs

A MIG instance can now be requested by a container in the same way as on hami-core nodes, simply by specifying the nvidia.com/gpualloc and nvidia.com/gpumem resource types.

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
  annotations:
    nvidia.com/vgpu-mode: "mig" # (Optional) If not set, this pod can be assigned to either a MIG instance or a hami-core instance
spec:
  containers:
    - name: ubuntu-container
      image: ubuntu:18.04
      command: ["bash", "-c", "sleep 86400"]
      resources:
        limits:
          nvidia.com/gpualloc: 1
          nvidia.com/gpumem: 8000
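
After saving the manifest (the file name gpu-pod.yaml below is illustrative), the pod can be applied and the allocated MIG instance inspected from inside the container:

```shell
kubectl apply -f gpu-pod.yaml
kubectl wait --for=condition=Ready pod/gpu-pod --timeout=300s

# List the devices visible to the container; on a MIG node this should
# show a MIG device rather than a full GPU
kubectl exec gpu-pod -- nvidia-smi -L
```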

Note:

  • The nvidia.com/gpualloc request cannot exceed the actual number of physical GPUs. For example, a single GPU in MIG mode can only serve a request of 1. This is a current HAMi limitation and will be improved in future versions.
  • No action is required on MIG nodes — everything is managed by mig-parted in hami-device-plugin.
  • NVIDIA devices older than the Ampere architecture do not support MIG mode.
  • MIG resources (example: nvidia.com/mig-1g.10gb) will not be visible on the node. HAMi uses a unified resource name for both MIG and hami-core nodes.
  • The DCGM-exporter component deployed on MIG nodes must be stopped when performing MIG partitioning, because MIG partitioning requires resetting the GPU. After the first MIG-enabled workload is created, automatic MIG partitioning is performed. Subsequent workloads will not trigger further partitioning. When all workloads stop, starting the first workload again will trigger MIG partitioning once more.
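
One way to stop DCGM-exporter before the first MIG-enabled workload triggers partitioning — assuming it runs as a DaemonSet named dcgm-exporter in the monitoring namespace, which you should adjust to your environment — is to temporarily add a non-matching node selector:

```shell
# Take DCGM-exporter off the nodes by adding a node selector no node matches
kubectl -n monitoring patch daemonset dcgm-exporter \
  --type merge \
  -p '{"spec":{"template":{"spec":{"nodeSelector":{"dcgm-exporter-paused":"true"}}}}}'

# After MIG partitioning completes, remove the selector to restore it
kubectl -n monitoring patch daemonset dcgm-exporter \
  --type json \
  -p '[{"op":"remove","path":"/spec/template/spec/nodeSelector/dcgm-exporter-paused"}]'
```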