Introduction
The Kubernetes Hardware Accelerator Suite is an enterprise-grade solution for optimizing GPU resource allocation, isolation, and sharing in cloud-native environments. Built on Kubernetes device plugins and NVIDIA-native technologies, it provides three core modules:
- vGPU Module: Based on the open-source GPU-Manager project, it enables fine-grained GPU virtualization by splitting physical GPUs into shareable virtual units with memory and compute quotas. Ideal for multi-tenant environments requiring dynamic resource allocation.
- pGPU Module: Leveraging NVIDIA's official device plugin, it delivers full physical GPU isolation with NUMA-aware scheduling. Designed for high-performance computing (HPC) workloads that need dedicated GPU access.
- MPS Module: Implements NVIDIA's Multi-Process Service (MPS) to allow concurrent GPU execution from multiple processes with resource constraints. Optimizes latency-sensitive applications by multiplexing clients onto a shared CUDA context.
Product Advantages
vGPU Module
- Dynamic Slicing: Split a physical GPU into fractional units so multiple pods or processes can share one card
- QoS Enforcement: Guaranteed compute quotas (vcuda-core) and memory quotas (vcuda-memory); see the pod sketch below
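A minimal sketch of a pod requesting a fractional GPU follows. It assumes the upstream GPU-Manager resource-name convention (`tencent.com/vcuda-core`, where 100 units equal one full GPU, and `tencent.com/vcuda-memory`, where one unit is 256 MiB); verify the exact names and unit sizes against your deployment.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: vgpu-demo
spec:
  containers:
  - name: cuda-app
    image: nvidia/cuda:11.8.0-base-ubuntu22.04   # illustrative image
    command: ["sleep", "infinity"]
    resources:
      limits:
        tencent.com/vcuda-core: "50"     # 50/100 = half of one physical GPU's compute
        tencent.com/vcuda-memory: "16"   # 16 x 256 MiB = 4 GiB of GPU memory
```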
pGPU Module
- Hardware-Level Isolation: Direct PCIe passthrough with IOMMU protection
- NUMA Optimization: Minimize cross-socket data transfer via automatic NUMA node binding (see the pod sketch below)
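Exclusive physical GPU access is requested through the standard `nvidia.com/gpu` resource exposed by NVIDIA's device plugin; a minimal sketch (image name illustrative). Note that NUMA alignment is typically enforced by the kubelet Topology Manager (e.g. the `single-numa-node` policy) rather than expressed in the pod spec.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pgpu-demo
spec:
  containers:
  - name: trainer
    image: nvidia/cuda:12.2.0-base-ubuntu22.04   # illustrative image
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1   # one dedicated physical GPU; no sharing with other pods
```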
MPS Module
- Low-Latency Execution: 30-50% lower latency in suitable workloads by multiplexing clients onto a shared CUDA context
- Resource Caps: Limit per-process GPU compute (0-100%) and memory usage
- Zero Code Changes: Works with unmodified CUDA applications (see the pod sketch below)
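The per-client caps correspond to standard MPS environment variables (`CUDA_MPS_ACTIVE_THREAD_PERCENTAGE` for compute, `CUDA_MPS_PINNED_DEVICE_MEM_LIMIT` for memory). The sketch below assumes the MPS control daemon runs on the host with the default pipe directory; how the module wires this up is deployment-specific.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mps-demo
spec:
  hostIPC: true   # MPS clients must share the IPC namespace with the MPS control daemon
  containers:
  - name: inference
    image: nvidia/cuda:11.8.0-base-ubuntu22.04   # illustrative image
    command: ["sleep", "infinity"]
    env:
    - name: CUDA_MPS_ACTIVE_THREAD_PERCENTAGE    # cap this client at 30% of the GPU's SMs
      value: "30"
    - name: CUDA_MPS_PINNED_DEVICE_MEM_LIMIT     # cap device 0 memory at 4 GiB
      value: "0=4G"
    volumeMounts:
    - name: mps-pipe
      mountPath: /tmp/nvidia-mps                 # default MPS pipe directory
  volumes:
  - name: mps-pipe
    hostPath:
      path: /tmp/nvidia-mps
```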
Application Scenarios
vGPU Use Cases
- Multi-Tenant AI Platforms: Share A100/H100 GPUs across teams with guaranteed SLAs
- VDI Environments: Deliver GPU-accelerated virtual desktops for CAD/3D rendering
- Batch Inference: Parallelize model serving with fractional GPU allocations
pGPU Use Cases
- HPC Clusters: Run MPI jobs with exclusive GPU access for weather simulation
- ML Training: Full GPU utilization for large language model training
- Medical Imaging: Process high-resolution MRI data without resource contention
MPS Use Cases
- Real-Time Inference: Low-latency video analytics using concurrent CUDA streams
- Microservice Orchestration: Co-locate multiple GPU microservices on shared hardware
- High-Concurrency Serving: Up to 3x QPS improvement for recommendation systems
Technical Limitations
Privileged Mode Required
Hardware Device Access Requirements
Device File Permissions
Access to NVIDIA GPUs goes through protected character devices on the host:

```shell
# Device file ownership and permissions
$ ls -l /dev/nvidia*
crw-rw-rw- 1 root root 195,   0 Aug  1 10:00 /dev/nvidia0
crw-rw-rw- 1 root root 195, 255 Aug  1 10:00 /dev/nvidiactl
crw-rw-rw- 1 root root 235,   0 Aug  1 10:00 /dev/nvidia-uvm
```
- Requirement: Containers must be granted access to these character devices (via device cgroups, explicit device mounts, or privileged mode); see the sketch below
- Consequence: Containers without such access receive permission-denied errors when initializing CUDA
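The bluntest way to satisfy the device-access requirement is to run the container privileged, which exposes all host devices; a minimal sketch (narrower alternatives such as explicit device-cgroup rules exist but depend on the container runtime):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-device-access
spec:
  containers:
  - name: app
    image: nvidia/cuda:11.8.0-base-ubuntu22.04   # illustrative image
    securityContext:
      privileged: true   # grants access to /dev/nvidia* and bypasses device cgroup filtering
```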
Kernel-Level Operations
Essential NVIDIA Driver Interactions
| Operation | Privilege Requirement | Purpose |
|---|---|---|
| Module Loading | CAP_SYS_MODULE | Load NVIDIA kernel modules |
| Memory Management | CAP_IPC_LOCK | GPU memory allocation |
| Interrupt Handling | CAP_SYS_RAWIO | Process GPU interrupts |
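When full privilege is undesirable, a subset of the capabilities in the table above can be granted explicitly; a sketch under that assumption (CAP_SYS_MODULE is usually unnecessary inside the container, since the NVIDIA kernel modules are loaded on the host):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-capabilities-demo
spec:
  containers:
  - name: app
    image: nvidia/cuda:11.8.0-base-ubuntu22.04   # illustrative image
    securityContext:
      capabilities:
        add:
        - IPC_LOCK    # allow pinning GPU-mapped memory
        - SYS_RAWIO   # allow raw device I/O paths
```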
K8s Device Plugin Architecture Requirements
- Socket Creation: Write access to /var/lib/kubelet/device-plugins (see the DaemonSet sketch below)
- Health Monitoring: Access to nvidia-smi and kernel logs
- Resource Allocation: Permission to modify device cgroups
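These requirements are conventionally met by deploying the plugin as a DaemonSet that hostPath-mounts the kubelet's device-plugin directory; a minimal sketch (the image name is a placeholder):

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: gpu-device-plugin
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: gpu-device-plugin
  template:
    metadata:
      labels:
        app: gpu-device-plugin
    spec:
      containers:
      - name: plugin
        image: example.com/gpu-device-plugin:latest     # placeholder image
        volumeMounts:
        - name: device-plugins
          mountPath: /var/lib/kubelet/device-plugins    # plugin registers its gRPC socket here
      volumes:
      - name: device-plugins
        hostPath:
          path: /var/lib/kubelet/device-plugins
```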
vGPU Constraints
- Supports only CUDA versions below 12.4
- No MIG support while vGPU is enabled
pGPU Constraints
- No GPU sharing capability (1:1 pod-to-GPU mapping)
- Requires Kubernetes 1.25+ with SR-IOV enabled
- Limited to PCIe/NVSwitch-connected GPUs
MPS Constraints
- Potential fault propagation across clients sharing a fused context (a fatal error in one client can affect others)
- Requires CUDA 11.4+ for memory limits
- No support for MIG-sliced GPUs