FAQ
TOC
- Hami device plugin pod can't start when the NVIDIA driver API times out
- RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
- The hami scheduler locks a Node and cannot schedule pods on it

Hami device plugin pod can't start when the NVIDIA driver API times out
When the NVIDIA driver API responds slowly (nvidia-smi also takes a long time to return), the HAMi device plugin pod fails to start.
Run nvidia-smi -pm 1 on the affected node to enable persistence mode, then restart the HAMi device plugin pod to resolve it.
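The two steps above can be sketched as the following shell snippet. It is guarded so it is a no-op on machines without nvidia-smi or kubectl; the namespace and pod label used for the restart are assumptions, so adjust them to match your HAMi deployment.

```shell
#!/bin/sh
# Step 1: enable persistence mode so the driver stays initialized
# between API calls instead of reinitializing on each query.
enable_persistence() {
  if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi -pm 1   # persistence mode: 1 = enabled, 0 = disabled
  else
    echo "nvidia-smi not found; run this on the GPU node"
  fi
}

# Step 2: restart the device plugin pod so it re-queries the driver.
# Namespace and label selector below are assumptions, not HAMi defaults.
restart_device_plugin() {
  if command -v kubectl >/dev/null 2>&1; then
    kubectl -n kube-system delete pod \
      -l app.kubernetes.io/component=hami-device-plugin
  else
    echo "kubectl not found; run this where you manage the cluster"
  fi
}

enable_persistence
restart_device_plugin
```

Persistence mode keeps the kernel driver loaded while no client is connected, which removes the slow first-call initialization that makes the plugin's startup probe time out.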
RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
When two inference services run on the same GPU, one of them always fails with this error. This typically happens when the GPU's compute mode is set to Exclusive Process, which allows only one CUDA context at a time.
Run nvidia-smi -i 0 -c 0 to reset the compute mode to DEFAULT so all processes can access the GPU (replace -i 0 with the index of the affected GPU).
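A minimal sketch of diagnosing and fixing the compute mode, guarded so it does nothing on machines without an NVIDIA GPU; the GPU index 0 is taken from the command above and may need adjusting:

```shell
#!/bin/sh
# Show the current compute mode, then reset it to DEFAULT.
# "Exclusive_Process" means only one CUDA context per GPU, which is
# what produces the "busy or unavailable" error for the second service.
check_and_reset_compute_mode() {
  if command -v nvidia-smi >/dev/null 2>&1; then
    # Query the compute mode of GPU 0 (one line, e.g. "Default").
    nvidia-smi -i 0 --query-gpu=compute_mode --format=csv,noheader
    # 0 = DEFAULT: multiple processes may share the GPU concurrently.
    nvidia-smi -i 0 -c 0
  else
    echo "nvidia-smi not found; run this on the GPU node"
  fi
}

check_and_reset_compute_mode
```

Note that changing the compute mode requires root privileges and does not persist across reboots unless persistence mode or a boot-time script reapplies it.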
The hami scheduler locks a Node and cannot schedule pods on it
This occurs when a pod is deleted during the bind phase, leaving a dangling NodeLock on the node. Other pods must wait for the lock to expire before they can be scheduled there. The fix proactively clears the NodeLock when an error occurs during binding, eliminating this issue. Fixed in HAMi v2.7 and later.
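On versions before v2.7, a dangling lock can be cleared by hand. A sketch of that workaround is below; the annotation key hami.io/mutex.lock and the node name are assumptions, so verify the actual key your HAMi version writes on the node before removing anything.

```shell
#!/bin/sh
# Workaround for HAMi < v2.7: inspect and clear a dangling NodeLock.
# ASSUMPTION: the lock is stored as the node annotation
# "hami.io/mutex.lock" -- confirm this against your HAMi version.
clear_node_lock() {
  node="$1"
  if command -v kubectl >/dev/null 2>&1; then
    # Print the current lock annotation (empty if no lock is held).
    kubectl get node "$node" \
      -o jsonpath='{.metadata.annotations.hami\.io/mutex\.lock}'
    echo
    # A trailing '-' on the key tells kubectl to delete the annotation.
    kubectl annotate node "$node" hami.io/mutex.lock-
  else
    echo "kubectl not found; run this where you manage the cluster"
  fi
}

clear_node_lock gpu-node-1   # hypothetical node name; substitute your own
```

Only do this after confirming no bind operation is actually in progress on the node, since the lock exists precisely to serialize binds.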