MIG Support
The Multi-Instance GPU (MIG) feature enables securely partitioning GPUs such as the NVIDIA A100 into several separate GPU instances for CUDA applications. For example, the NVIDIA A100 supports up to seven separate GPU instances.
MIG provides multiple users with separate GPU resources for optimal GPU utilization. It is particularly beneficial for workloads that do not fully saturate the GPU's compute capacity, allowing users to run different workloads in parallel to maximize utilization.
This document provides an overview of the necessary steps to enable MIG support for Alauda Build of NVIDIA GPU Device Plugin. Refer to the MIG User Guide for more details on the technical concepts, setting up MIG and the NVIDIA Container Toolkit for running containers with MIG.
Prerequisites
- Alauda Build of GPU Device Plugin: v0.18.0+
- NVIDIA Blackwell, Hopper™, and Ampere GPUs (see Supported GPUs)
Testing with Different MIG Strategies
The none strategy
The none strategy is designed to keep the Alauda Build of GPU Device Plugin running exactly as it always has. The plugin makes no distinction between GPUs with MIG enabled and those without, enumerating all GPUs on the node and making them available through the nvidia.com/gpu resource type.
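Under the none strategy, workloads request GPUs the same way as on non-MIG nodes. A minimal pod manifest (pod and container names are illustrative) requesting one full GPU might look like:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-none-example       # illustrative name
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu20.04
      command: ["nvidia-smi", "-L"]
      resources:
        limits:
          nvidia.com/gpu: 1    # one full GPU, regardless of MIG mode
```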
Procedure
To test this strategy, we check the enumeration of a GPU with and without MIG enabled and make sure we can see it in both cases. The test assumes a single GPU on a single node in the cluster.
-
Verify that MIG is disabled on the GPU:
nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02 Driver Version: 450.80.02 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 A100-SXM4-40GB Off | 00000000:36:00.0 Off | 0 |
| N/A 29C P0 62W / 400W | 0MiB / 40537MiB | 6% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
-
Go to the Administrator -> Marketplace -> Cluster Plugin page, switch to the target cluster, and then deploy or update the Alauda Build of GPU Device Plugin Cluster plugin:
Set the MIG strategy to none in the config form.
-
Observe that 1 GPU is available on the node with resource type nvidia.com/gpu:
kubectl describe node {node-id}
...
Capacity:
nvidia.com/gpu: 1
...
Allocatable:
nvidia.com/gpu: 1
...
-
Deploy a pod to consume the GPU and run nvidia-smi
kubectl run -it --rm \
--image=nvidia/cuda:12.4.1-base-ubuntu20.04 \
--restart=Never \
--limits=nvidia.com/gpu=1 \
mig-none-example -- nvidia-smi -L
GPU 0: A100-SXM4-40GB (UUID: GPU-15f0798d-c807-231d-6525-a7827081f0f1)
The single strategy
The single strategy is designed to keep the user experience of working with GPUs in Kubernetes the same as it has always been. MIG devices are enumerated with the nvidia.com/gpu resource type just as before. However, the properties associated with that resource type now map to the MIG devices available on the node, rather than to full GPUs.
Procedure
To test this strategy, we check that MIG devices of a single type are enumerated using the traditional nvidia.com/gpu resource type. The test assumes a single GPU on a single node in the cluster; MIG is enabled on it in the first step below.
-
Enable MIG on the GPU (requires stopping all GPU clients first)
On the control node, remove the node label so no new GPU workloads are scheduled:
kubectl label node {node-id} nvidia-device-enable-
On the GPU node, enable MIG mode:
nvidia-smi -mig 1
Enabled MIG Mode for GPU 00000000:36:00.0
All done.
Verify that MIG mode is now enabled:
nvidia-smi --query-gpu=mig.mode.current --format=csv,noheader
Enabled
On the control node, restore the node label:
kubectl label node {node-id} nvidia-device-enable=pgpu
-
Create 7 single-slice MIG devices on the GPU:
INFO
The following example is for the NVIDIA A100. For other models, you can check the supported GPU instance profiles by running nvidia-smi mig -lgip.
For example, on the NVIDIA A30 the command prints:
+-----------------------------------------------------------------------------+
| GPU instance profiles: |
| GPU Name ID Instances Memory P2P SM DEC ENC |
| Free/Total GiB CE JPEG OFA |
|=============================================================================|
| 0 MIG 1g.6gb 14 4/4 5.81 No 14 1 0 |
| 1 0 0 |
+-----------------------------------------------------------------------------+
| 0 MIG 1g.6gb+me 21 1/1 5.81 No 14 1 0 |
| 1 1 1 |
+-----------------------------------------------------------------------------+
| 0 MIG 2g.12gb 5 2/2 11.75 No 28 2 0 |
| 2 0 0 |
+-----------------------------------------------------------------------------+
| 0 MIG 2g.12gb+me 6 1/1 11.75 No 28 2 0 |
| 2 1 1 |
+-----------------------------------------------------------------------------+
| 0 MIG 4g.24gb 0 1/1 23.56 No 56 4 0 |
| 4 1 1 |
+-----------------------------------------------------------------------------+
You can run nvidia-smi mig -cgi 14,14,14,14 -C to create 4 single-slice MIG devices, or nvidia-smi mig -cgi 14,14,5 -C to create 2 single-slice MIG devices and 1 double-slice MIG device.
Refer to Supported MIG Profiles.
# They must all be of the same MIG device type for the single strategy
nvidia-smi mig -cgi 19,19,19,19,19,19,19 -C
nvidia-smi -L
GPU 0: A100-SXM4-40GB (UUID: GPU-4200ccc0-2667-d4cb-9137-f932c716232a)
MIG 1g.5gb Device 0: (UUID: MIG-GPU-4200ccc0-2667-d4cb-9137-f932c716232a/7/0)
MIG 1g.5gb Device 1: (UUID: MIG-GPU-4200ccc0-2667-d4cb-9137-f932c716232a/8/0)
MIG 1g.5gb Device 2: (UUID: MIG-GPU-4200ccc0-2667-d4cb-9137-f932c716232a/9/0)
MIG 1g.5gb Device 3: (UUID: MIG-GPU-4200ccc0-2667-d4cb-9137-f932c716232a/10/0)
MIG 1g.5gb Device 4: (UUID: MIG-GPU-4200ccc0-2667-d4cb-9137-f932c716232a/11/0)
MIG 1g.5gb Device 5: (UUID: MIG-GPU-4200ccc0-2667-d4cb-9137-f932c716232a/12/0)
MIG 1g.5gb Device 6: (UUID: MIG-GPU-4200ccc0-2667-d4cb-9137-f932c716232a/13/0)
-
Go to the Administrator -> Marketplace -> Cluster Plugin page, switch to the target cluster, and then deploy or update the Alauda Build of GPU Device Plugin Cluster plugin:
Set the MIG strategy to single in the config form.
-
Observe that 7 MIG devices are available on the node with resource type nvidia.com/gpu:
kubectl describe node {node-id}
...
Capacity:
nvidia.com/gpu: 7
...
Allocatable:
nvidia.com/gpu: 7
...
-
Deploy 7 pods, each consuming one MIG device (then read their logs and delete them)
for i in $(seq 7); do
kubectl run \
--image=nvidia/cuda:12.4.1-base-ubuntu20.04 \
--restart=Never \
--limits=nvidia.com/gpu=1 \
mig-single-example-${i} -- bash -c "nvidia-smi -L; sleep infinity"
done
pod/mig-single-example-1 created
pod/mig-single-example-2 created
pod/mig-single-example-3 created
pod/mig-single-example-4 created
pod/mig-single-example-5 created
pod/mig-single-example-6 created
pod/mig-single-example-7 created
for i in $(seq 7); do
echo "mig-single-example-${i}";
kubectl logs mig-single-example-${i}
echo "";
done
mig-single-example-1
GPU 0: A100-SXM4-40GB (UUID: GPU-4200ccc0-2667-d4cb-9137-f932c716232a)
MIG 1g.5gb Device 0: (UUID: MIG-GPU-4200ccc0-2667-d4cb-9137-f932c716232a/7/0)
mig-single-example-2
GPU 0: A100-SXM4-40GB (UUID: GPU-4200ccc0-2667-d4cb-9137-f932c716232a)
MIG 1g.5gb Device 0: (UUID: MIG-GPU-4200ccc0-2667-d4cb-9137-f932c716232a/9/0)
...
for i in $(seq 7); do
kubectl delete pod mig-single-example-${i};
done
pod "mig-single-example-1" deleted
pod "mig-single-example-2" deleted
...
The mixed strategy
The mixed strategy is designed to enumerate a different resource type for every MIG device configuration available in the cluster.
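With the mixed strategy, a pod requests a specific MIG profile by its own resource name. A minimal manifest (pod and container names are illustrative) requesting a 1g.5gb slice on an A100 might look like:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mig-mixed-example        # illustrative name
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu20.04
      command: ["nvidia-smi", "-L"]
      resources:
        limits:
          nvidia.com/mig-1g.5gb: 1   # one 1g.5gb MIG device; other profiles use their own resource names
```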
Procedure
To test this strategy, we check that all MIG devices are enumerated using their fully qualified name of the form nvidia.com/mig-<slice_count>g.<memory_size>gb. The test assumes a single GPU on a single node in the cluster with MIG enabled on it already.
-
Verify that MIG is enabled on the GPU and that no MIG devices are present:
nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02 Driver Version: 450.80.02 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 A100-SXM4-40GB On | 00000000:00:04.0 Off | On |
| N/A 32C P0 43W / 400W | 0MiB / 40537MiB | N/A Default |
| | | Enabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| MIG devices: |
+------------------+----------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG|
| | | ECC| |
|==================+======================+===========+=======================|
| No MIG devices found |
+-----------------------------------------------------------------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
-
Create 3 MIG devices of different sizes on the GPU (NVIDIA A100):
nvidia-smi mig -cgi 9,14,19 -C
nvidia-smi -L
GPU 0: A100-SXM4-40GB (UUID: GPU-4200ccc0-2667-d4cb-9137-f932c716232a)
MIG 3g.20gb Device 0: (UUID: MIG-GPU-4200ccc0-2667-d4cb-9137-f932c716232a/2/0)
MIG 2g.10gb Device 1: (UUID: MIG-GPU-4200ccc0-2667-d4cb-9137-f932c716232a/3/0)
MIG 1g.5gb Device 2: (UUID: MIG-GPU-4200ccc0-2667-d4cb-9137-f932c716232a/9/0)
-
Go to the Administrator -> Marketplace -> Cluster Plugin page, switch to the target cluster, and then deploy or update the Alauda Build of GPU Device Plugin Cluster plugin:
Set the MIG strategy to mixed in the config form.
-
Observe that 3 MIG devices are available on the node, each with its own nvidia.com/mig-* resource type:
kubectl describe node {node-id}
...
Capacity:
nvidia.com/mig-1g.5gb: 1
nvidia.com/mig-2g.10gb: 1
nvidia.com/mig-3g.20gb: 1
...
Allocatable:
nvidia.com/mig-1g.5gb: 1
nvidia.com/mig-2g.10gb: 1
nvidia.com/mig-3g.20gb: 1
...
-
Deploy a pod for each available MIG device; for example, to consume the 1g.5gb device:
kubectl run -it --rm \
--image=nvidia/cuda:12.4.1-base-ubuntu20.04 \
--restart=Never \
--limits=nvidia.com/mig-1g.5gb=1 \
mig-mixed-example -- nvidia-smi -L
GPU 0: A100-SXM4-40GB (UUID: GPU-4200ccc0-2667-d4cb-9137-f932c716232a)
MIG 1g.5gb Device 0: (UUID: MIG-GPU-4200ccc0-2667-d4cb-9137-f932c716232a/9/0)
pod "mig-mixed-example" deleted