Introduction

The Kubernetes Hardware Accelerator Suite is an enterprise-grade solution for GPU resource allocation, isolation, and sharing in cloud-native environments. Built on the Kubernetes device plugin framework and NVIDIA-native technologies, it provides three core modules:

  1. vGPU Module
    Based on the open-source GPU-Manager project, this module enables fine-grained GPU virtualization by splitting physical GPUs into shareable virtual units with per-container memory and compute quotas. Ideal for multi-tenant environments requiring dynamic resource allocation.

  2. pGPU Module
    Leveraging NVIDIA's official Device Plugin, it delivers full physical GPU isolation with NUMA-aware scheduling. Designed for high-performance computing (HPC) workloads needing dedicated GPU access.

  3. MPS Module
    Implements NVIDIA's Multi-Process Service (MPS) to allow multiple processes to execute kernels concurrently within a shared GPU context, subject to per-process resource constraints. Optimizes latency-sensitive applications by eliminating context-switch overhead between clients.
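
Which module is active on a node shows up as the extended resources it advertises. A minimal check, assuming the conventional resource names (nvidia.com/gpu for the NVIDIA device plugin; tencent.com/vcuda-core and tencent.com/vcuda-memory for GPU-Manager; your deployment may differ):

# List extended resources advertised by each node (output illustrative)
$ kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable}{"\n"}{end}'
node-1	{"cpu":"64","memory":"251Gi","nvidia.com/gpu":"8",...}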

Product Advantages

vGPU Module

  • Dynamic Slicing: Split a physical GPU into fractional units so that multiple pods can share one card concurrently
  • QoS Enforcement: Guaranteed compute quotas (vcuda-core) and memory quotas (vcuda-memory), as requested in the sketch below
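
A minimal pod sketch of these quotas, assuming GPU-Manager's conventional resource names and units (tencent.com/vcuda-core, where 100 units equal one full GPU, and tencent.com/vcuda-memory, counted in 256 MiB blocks; verify both against your deployment):

$ cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: vgpu-demo
spec:
  containers:
  - name: cuda-app
    image: nvidia/cuda:11.8.0-base-ubuntu22.04
    resources:
      limits:
        tencent.com/vcuda-core: "50"    # ~50% of one GPU's compute
        tencent.com/vcuda-memory: "16"  # 16 x 256 MiB = 4 GiB GPU memory
EOF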

pGPU Module

  • Hardware-Level Isolation: Direct PCIe passthrough with IOMMU protection
  • NUMA Optimization: Minimize cross-socket data transfer via automatic NUMA node binding
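
A minimal sketch of a dedicated-GPU request via the NVIDIA device plugin's nvidia.com/gpu resource. NUMA binding additionally depends on the kubelet Topology Manager (e.g. --topology-manager-policy=single-numa-node), which is an assumption about node configuration:

$ cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: pgpu-demo
spec:
  containers:
  - name: trainer
    image: nvidia/cuda:12.2.0-base-ubuntu22.04
    resources:
      limits:
        nvidia.com/gpu: "1"   # one dedicated physical GPU
EOF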

MPS Module

  • Low-Latency Execution: 30-50% latency reduction by multiplexing clients onto a shared CUDA context
  • Resource Caps: Limit per-process GPU compute (0-100%) and memory usage
  • Zero Code Changes: Works with unmodified CUDA applications, as in the host-side sketch below
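
A host-side sketch of starting MPS and capping client resources. CUDA_MPS_ACTIVE_THREAD_PERCENTAGE is the documented compute cap and CUDA_MPS_PINNED_DEVICE_MEM_LIMIT (newer CUDA releases) bounds device memory; my_cuda_app is a hypothetical client, and deployments managed through a device plugin will differ:

# Start the MPS control daemon (as the GPU owner, typically root)
$ export CUDA_VISIBLE_DEVICES=0
$ export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
$ export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-mps-log
$ nvidia-cuda-mps-control -d

# Cap each client to ~25% of SMs and 4 GiB of device memory on GPU 0
$ export CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=25
$ export CUDA_MPS_PINNED_DEVICE_MEM_LIMIT="0=4G"
$ ./my_cuda_app   # unmodified CUDA application runs as an MPS client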

Application Scenarios

vGPU Use Cases

  • Multi-Tenant AI Platforms: Share A100/H100 GPUs across teams with guaranteed SLAs
  • VDI Environments: Deliver GPU-accelerated virtual desktops for CAD/3D rendering
  • Batch Inference: Parallelize model serving with fractional GPU allocations

pGPU Use Cases

  • HPC Clusters: Run MPI jobs with exclusive GPU access for weather simulation
  • ML Training: Full GPU utilization for large language model training
  • Medical Imaging: Process high-resolution MRI data without resource contention

MPS Use Cases

  • Real-Time Inference: Low-latency video analytics using concurrent CUDA streams
  • Microservice Orchestration: Co-locate multiple GPU microservices on shared hardware
  • High-Concurrency Serving: Improve QPS by 3x for recommendation systems

Technical Limitations

Privileged Access Required

Hardware Device Access Requirements

Device File Permissions

NVIDIA GPU devices require direct access to protected system resources:

# Device file ownership and permissions
$ ls -l /dev/nvidia*
crw-rw-rw- 1 root root 195,   0 Aug 1 10:00 /dev/nvidia0
crw-rw-rw- 1 root root 195, 255 Aug 1 10:00 /dev/nvidiactl
crw-rw-rw- 1 root root 195, 254 Aug 1 10:00 /dev/nvidia-uvm
  • Requirement: Root access to read/write device files
  • Consequence: Non-root containers get permission denied errors
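
One common (if blunt) way to satisfy this is to run the container privileged with the host device tree mounted; a minimal sketch, assuming your security policy allows it:

$ cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-privileged-demo
spec:
  containers:
  - name: gpu-workload
    image: nvidia/cuda:12.2.0-base-ubuntu22.04   # placeholder image
    securityContext:
      privileged: true        # grants access to /dev/nvidia* device files
    volumeMounts:
    - name: dev
      mountPath: /dev
  volumes:
  - name: dev
    hostPath:
      path: /dev
EOF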

Kernel-Level Operations

Essential NVIDIA Driver Interactions

Operation          | Privilege Requirement | Purpose
Module Loading     | CAP_SYS_MODULE        | Load NVIDIA kernel modules
Memory Management  | CAP_IPC_LOCK          | GPU memory allocation
Interrupt Handling | CAP_SYS_RAWIO         | Process GPU interrupts
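
Instead of full privilege, these capabilities can be granted individually through the pod securityContext (Kubernetes drops the CAP_ prefix); a minimal sketch:

$ cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-caps-demo
spec:
  containers:
  - name: gpu-workload
    image: nvidia/cuda:12.2.0-base-ubuntu22.04   # placeholder image
    securityContext:
      capabilities:
        add: ["SYS_MODULE", "IPC_LOCK", "SYS_RAWIO"]
EOF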

K8s Device Plugin Architecture Requirements

  1. Socket Creation: Write to /var/lib/kubelet/device-plugins
  2. Health Monitoring: Access to nvidia-smi and kernel logs
  3. Resource Allocation: Modify device cgroups
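
The kubelet discovers device plugins through gRPC sockets under /var/lib/kubelet/device-plugins; a quick node-side check (the plugin socket name varies by implementation):

# Verify the device plugin registered its socket with the kubelet
$ ls -l /var/lib/kubelet/device-plugins/
srwxr-xr-x 1 root root 0 Aug 1 10:05 kubelet.sock
srwxr-xr-x 1 root root 0 Aug 1 10:06 nvidia.sock   # name varies by plugin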

vGPU Constraints

  • Supports only CUDA versions below 12.4
  • No MIG support when vGPU is enabled
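
A quick pre-flight check of the node's driver-reported CUDA version (output illustrative):

# The driver-reported CUDA version must be below 12.4 for the vGPU module
$ nvidia-smi | grep "CUDA Version"
| NVIDIA-SMI 535.104.05    Driver Version: 535.104.05    CUDA Version: 12.2 |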

pGPU Constraints

  • No GPU sharing capability (1:1 pod-to-GPU mapping)
  • Requires Kubernetes 1.25+ with SR-IOV enabled
  • Limited to PCIe/NVSwitch-connected GPUs
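
The IOMMU prerequisite can be verified on the node before deploying the pGPU module; a minimal sketch:

# IOMMU must be enabled on the kernel command line (intel_iommu=on or amd_iommu=on)
$ grep -Eo '(intel|amd)_iommu=on' /proc/cmdline
intel_iommu=on

# Populated IOMMU groups confirm the IOMMU is active
$ ls /sys/kernel/iommu_groups/ | head -3
0
1
2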

MPS Constraints

  • Potential fault propagation across fused contexts
  • Requires CUDA 11.4+ for memory limits
  • No support for MIG-sliced GPUs
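
Since MPS cannot run on MIG-sliced GPUs, confirm MIG is disabled before enabling the module (mig.mode.current is a documented nvidia-smi query field; output illustrative):

# MIG must be disabled on every GPU that will be shared via MPS
$ nvidia-smi --query-gpu=index,mig.mode.current --format=csv
index, mig.mode.current
0, Disabled
1, Disabled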