Release Notes
v2.8.1
Key features:
Support Huawei Ascend NPU sharing
Support CDI (Container Device Interface) mode on NVIDIA devices
Sync with NVIDIA k8s-device-plugin v0.18.0
Add hami_build_info Prometheus metrics and version print
Watch and hot reload TLS certificates without restarting pods
Support NVIDIA GPU Operator toolkit readiness check
Support GPUDirect RDMA copy (GDRCopy) and GPUDirect Storage (GDS) configuration
Support mock device plugin for testing environments
HAMi-WebUI upgraded to v1.10.0 for HAMi v2.8 compatibility; v1.10.0 is compatible with HAMi v2.7 and v2.8, while v1.5.0 is not compatible with HAMi v2.8
Bug fixes
Fix: HAMi-core updated to fix vLLM-related issues
Fix: Quota calculation error
Fix: MIG instance allocation error where scheduler was allocating incorrect MIG instances
Fix: nvidia-mig-parted upgraded to v0.12.2 for security fixes
Fix: After removing the device plugin from a GPU node, it could still appear
Fix: Concurrent map read/write errors
Fix: Device-NUMA acquisition logic
Fix: ClusterRoleBinding error when changing release name or chart name
v2.7.1
Key features:
Support NVIDIA GPU ResourceQuota (ACP 4.2+)
Aggregated Scheduling Failure Events
Make node lock timeout configurable
Bug fixes
Fix: Pods could still be scheduled to a GPU node after the device plugin was removed from it
v2.6.1
Bug fixes
Fix: Device memory not counted properly when allocating with 'cuMallocAsync'
Fix: Device memory not counted properly when running gpu_burn
Fix: Segmentation fault in some scenarios
Fix: Utilization metrics not counted properly when using multiple devices
Fix: Initialization error when using vLLM with tp>2
v2.6.0
Key features:
Optimize scheduler log
Support Enflame gcu-share
Support MetaX GPU and MetaX sGPU
Helm chart adds a checksum annotation to restart HAMi components after ConfigMap modification
Support for using RuntimeClass with NVIDIA devices
Add support for profiling via the net/http/pprof package
Add NVIDIA GPU topology score registry to node
Support MigInfo metrics in vGPUmonitor
Bug fixes
Fix: Hang with driver 570+
Fix: Device memory not counted properly in ComfyUI tasks
Fix: Cambricon devices not allocated properly
Fix: Incorrect log output and wrong device count for container requests
Fix: Inconsistent vgpu-devices-allocated annotations
Fix: Removal of node devices from the node manager
Fix: Dynamic GPU partitioning lacks single-GPU-level granularity
Fix: Device memory count error with cuMallocAsync
Fix: Scheduler crash when a 'mig' task accidentally runs on a 'hami-core' GPU
Fix: Multi-process device memory count