Release Notes


v2.6.1

Bug fix

  • Fix: Device memory not counted properly when allocating with 'cuMallocAsync'
  • Fix: Device memory not counted properly when running gpu_burn
  • Fix: Segmentation fault in some scenarios
  • Fix: Utilization metrics not counted properly when using multiple devices
  • Fix: Initialization error when using vllm with tp>2

v2.6.0

Key features:

  • Optimize scheduler log
  • Support enflame gcu-share
  • Support metax GPU and metax sGPU
  • Helm chart: add checksum annotation so that hami components restart after ConfigMap modification
  • Support for using RuntimeClass with nvidia devices
  • Add support for profiling via net/http/pprof package
  • Add nvidia gpu topology score registry to node
  • Feat: vGPUmonitor supports MigInfo metrics
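
For readers unfamiliar with net/http/pprof, the following is a minimal sketch of how a Go service can expose the standard profiling endpoints. It is not taken from the hami source: the function names newPprofMux and statusFor, and the listen address in the comment, are hypothetical and chosen only for illustration. The actual flag or port used by the project may differ.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/http/pprof"
)

// newPprofMux (hypothetical name) returns a mux that serves the standard
// pprof handlers under /debug/pprof/, without touching http.DefaultServeMux.
func newPprofMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/debug/pprof/", pprof.Index)
	mux.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
	mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
	mux.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
	mux.HandleFunc("/debug/pprof/trace", pprof.Trace)
	return mux
}

// statusFor (hypothetical helper) issues an in-process GET against the mux
// and returns the HTTP status code, so the wiring can be checked without
// opening a network port.
func statusFor(mux *http.ServeMux, path string) int {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, path, nil)
	mux.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	mux := newPprofMux()
	fmt.Println("index status:", statusFor(mux, "/debug/pprof/"))

	// In a long-running component the mux would typically be served on a
	// local-only address (assumed here), for example:
	//   go http.ListenAndServe("127.0.0.1:6060", mux)
}
```

With such a mux served, profiles can be collected with the standard tooling, for example "go tool pprof http://127.0.0.1:6060/debug/pprof/heap", assuming the address above.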

Bug fix

  • Fix hang with driver 570+
  • Fix device memory not counted properly in comfyUI task
  • Fix cambricon devices not allocated properly
  • Fix wrong log and container request device count error
  • Fix inconsistent vgpu-devices-allocated annotations
  • Fix removing node devices from node manager
  • Fix: Dynamic GPU partitioning lacks single-GPU-level granularity
  • Fix device memory count error on cuMallocAsync
  • Fix scheduler crash if a 'mig' task is accidentally run on a 'hami-core' GPU
  • Fix multi-process device memory count