Release Notes
v2.6.1
Bug fix
- Fix: Device memory not counted properly when allocating with 'cuMallocAsync'
- Fix: Device memory not counted properly when running gpu_burn
- Fix: Segmentation fault in some scenarios
- Fix: Utilization metrics not properly counted when using multiple devices
- Fix: Initialization error when using vllm with tp>2
v2.6.0
Key feature:
- Optimize scheduler log
- Support enflame gcu-share
- Support metax GPU and metax sGPU
- Helm chart: add checksum annotation to restart hami components after ConfigMap modification
- Support for using RuntimeClass with nvidia devices
- Add support for profiling via net/http/pprof package
- Add nvidia gpu topology score registry to node
- Feat: vGPUmonitor support MigInfo metrics
Bug fix
- Fix hang with driver 570+
- Fix device memory not counted properly in comfyUI task
- Fix cambricon devices not allocated properly
- Fix wrong log and container request device count error
- Fix inconsistent vgpu-devices-allocated annotations
- Fix removing node devices from node manager
- Fix: Dynamic GPU partitioning lacks single-GPU-level granularity
- Fix device memory count error on cuMallocAsync
- Fix scheduler crash if a 'mig' task accidentally runs on a 'hami-core' GPU
- Fix multi-process device memory count