FAQ
TOC
- Hami device plugin pod can't start when the NVIDIA driver API times out
- RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
- The hami scheduler locks a Node and cannot schedule pods on it

Hami device plugin pod can't start when the NVIDIA driver API times out
When the NVIDIA driver API responds slowly (nvidia-smi also takes a long time to return), the HAMi device plugin pod fails to start.
Run nvidia-smi -pm 1 on the affected node to enable persistence mode, then restart the HAMi device plugin pod to resolve it.
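The two steps above can be sketched as the following shell snippet. It is guarded so it is a no-op on machines without nvidia-smi or kubectl; the namespace and pod label used for the restart are assumptions, so adjust them to match your HAMi deployment.

```shell
#!/bin/sh
# Step 1: enable persistence mode so the driver stays initialized
# between API calls instead of reinitializing on each query.
enable_persistence() {
  if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi -pm 1   # persistence mode: 1 = enabled, 0 = disabled
  else
    echo "nvidia-smi not found; run this on the GPU node"
  fi
}

# Step 2: restart the device plugin pod so it re-queries the driver.
# Namespace and label selector below are assumptions, not HAMi defaults.
restart_device_plugin() {
  if command -v kubectl >/dev/null 2>&1; then
    kubectl -n kube-system delete pod \
      -l app.kubernetes.io/component=hami-device-plugin
  else
    echo "kubectl not found; run this where you manage the cluster"
  fi
}

enable_persistence
restart_device_plugin
```

Persistence mode keeps the kernel driver loaded while no client is connected, which removes the slow first-call initialization that makes the plugin's startup probe time out.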
RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
When two inference services run on the same GPU, one of them always fails with this error. This typically happens when the GPU's compute mode is set to Exclusive Process, which allows only one CUDA context at a time.
Run nvidia-smi -i 0 -c 0 to reset the compute mode to DEFAULT so all processes can access the GPU (replace -i 0 with the index of the affected GPU).
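A minimal sketch of diagnosing and fixing the compute mode, guarded so it does nothing on machines without an NVIDIA GPU; the GPU index 0 is taken from the command above and may need adjusting:

```shell
#!/bin/sh
# Show the current compute mode, then reset it to DEFAULT.
# "Exclusive_Process" means only one CUDA context per GPU, which is
# what produces the "busy or unavailable" error for the second service.
check_and_reset_compute_mode() {
  if command -v nvidia-smi >/dev/null 2>&1; then
    # Query the compute mode of GPU 0 (one line, e.g. "Default").
    nvidia-smi -i 0 --query-gpu=compute_mode --format=csv,noheader
    # 0 = DEFAULT: multiple processes may share the GPU concurrently.
    nvidia-smi -i 0 -c 0
  else
    echo "nvidia-smi not found; run this on the GPU node"
  fi
}

check_and_reset_compute_mode
```

Note that changing the compute mode requires root privileges and does not persist across reboots unless persistence mode or a boot-time script reapplies it.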
The hami scheduler locks a Node and cannot schedule pods on it
This occurs when a pod is deleted during the bind phase, leaving a dangling NodeLock on the node. Other pods must wait for the lock to expire before they can be scheduled there. The fix proactively clears the NodeLock when an error occurs during binding, eliminating this issue. Fixed in HAMi v2.7 and later.
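On versions before v2.7, a dangling lock can be cleared by hand. A sketch of that workaround is below; the annotation key hami.io/mutex.lock and the node name are assumptions, so verify the actual key your HAMi version writes on the node before removing anything.

```shell
#!/bin/sh
# Workaround for HAMi < v2.7: inspect and clear a dangling NodeLock.
# ASSUMPTION: the lock is stored as the node annotation
# "hami.io/mutex.lock" -- confirm this against your HAMi version.
clear_node_lock() {
  node="$1"
  if command -v kubectl >/dev/null 2>&1; then
    # Print the current lock annotation (empty if no lock is held).
    kubectl get node "$node" \
      -o jsonpath='{.metadata.annotations.hami\.io/mutex\.lock}'
    echo
    # A trailing '-' on the key tells kubectl to delete the annotation.
    kubectl annotate node "$node" hami.io/mutex.lock-
  else
    echo "kubectl not found; run this where you manage the cluster"
  fi
}

clear_node_lock gpu-node-1   # hypothetical node name; substitute your own
```

Only do this after confirming no bind operation is actually in progress on the node, since the lock exists precisely to serialize binds.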