Release Notes

v2.8.1

Key feature:

  • Support Huawei Ascend NPU sharing
  • Support CDI (Container Device Interface) mode on NVIDIA devices
  • Sync with NVIDIA k8s-device-plugin v0.18.0
  • Add hami_build_info Prometheus metrics and version print
  • Watch and hot reload TLS certificates without restarting pods
  • Support NVIDIA GPU Operator toolkit readiness check
  • Support GPUDirect RDMA copy (GDRCopy) and GPUDirect Storage (GDS) configuration
  • Support mock device plugin for testing environments
  • Upgrade HAMi-WebUI to v1.10.0 for HAMi v2.8 compatibility; v1.10.0 works with HAMi v2.7 and v2.8, while v1.5.0 is not compatible with HAMi v2.8

Bug fix

  • Fix: HAMi-core updated to fix vLLM-related issues
  • Fix: Quota calculation error
  • Fix: Scheduler allocating incorrect MIG instances
  • Fix: nvidia-mig-parted upgraded to v0.12.2 for security fixes
  • Fix: After removing the device plugin from a GPU node, it could still appear
  • Fix: Concurrent map read/write errors
  • Fix: Device-NUMA acquisition logic
  • Fix: ClusterRoleBinding error when changing release name or chart name

v2.7.1

Key feature:

  • Support NVIDIA GPU ResourceQuota (ACP 4.2+)
  • Aggregated Scheduling Failure Events
  • Make node lock timeout configurable

Bug fix

  • Fix: After removing the device plugin from a GPU node, pods could still be scheduled to that node

v2.6.1

Bug fix

  • Fix: Device memory not counted properly when allocating with 'cuMallocAsync'
  • Fix: Device memory not counted properly when running gpu_burn
  • Fix: Segmentation fault in some scenarios
  • Fix: Utilization metrics not counted properly when using multiple devices
  • Fix: Initialization error when using vLLM with tp>2

v2.6.0

Key feature:

  • Optimize scheduler log
  • Support Enflame gcu-share
  • Support MetaX GPU and MetaX sGPU
  • Add a checksum annotation to the Helm chart so HAMi components restart after a ConfigMap modification
  • Support using RuntimeClass with NVIDIA devices
  • Add support for profiling via net/http/pprof package
  • Add NVIDIA GPU topology score registry to node
  • Support MigInfo metrics in vGPUmonitor

Bug fix

  • Fix hang with driver 570+
  • Fix device memory not counted properly in ComfyUI tasks
  • Fix Cambricon devices not allocated properly
  • Fix wrong log and container request device count error
  • Fix inconsistent vgpu-devices-allocated annotations
  • Fix removing node devices from node manager
  • Fix: Dynamic GPU partitioning lacks single-GPU-level granularity
  • Fix device memory count error on cuMallocAsync
  • Fix scheduler crash when a 'mig' task accidentally runs on a 'hami-core' GPU
  • Fix multi-process device memory count