FAQ
The HAMi device plugin pod can't start when the NVIDIA driver API times out.
When the NVIDIA driver API responds too slowly (the nvidia-smi command also takes a long time to return), the HAMi device plugin fails to start.
You can run nvidia-smi -pm 1 to enable persistence mode and then restart the HAMi device plugin pod to resolve it.
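A minimal sketch of this workaround, assuming the device plugin runs in the kube-system namespace with a hami-device-plugin component label (adjust both to match your installation):

```bash
# Enable persistence mode on the affected node so the NVIDIA driver stays
# loaded and nvidia-smi / NVML calls return quickly.
nvidia-smi -pm 1

# Restart the HAMi device plugin pod so it retries initialization.
# The namespace and label selector below are assumptions; adjust them
# to match your deployment.
kubectl delete pod -n kube-system -l app.kubernetes.io/component=hami-device-plugin
```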
RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
When two inference services run on the same GPU card, one of them always fails with this error.
You can run nvidia-smi -i 0 -c 0 to set the GPU's compute mode back to Default so that all processes can access the GPU.
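A short sketch for checking and resetting the compute mode (GPU index 0 is an assumption; substitute the index of the affected card):

```bash
# Show the current compute mode; "Exclusive Process" prevents a second
# process from using the GPU at the same time.
nvidia-smi -i 0 --query-gpu=compute_mode --format=csv

# Set the compute mode back to Default (0) so multiple processes can share the GPU.
nvidia-smi -i 0 -c 0
```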
The HAMi scheduler locks a node and cannot schedule pods onto it.
This occurs when a pod is deleted unexpectedly during the bind phase, leaving a dangling NodeLock; other pods must then wait for the lock to expire before they can be scheduled. A fix proactively clears the NodeLock when an error occurs, eliminating this issue; it will be included in the next HAMi version (2.7).
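Until that release, a possible manual workaround is to clear the dangling lock yourself. The sketch below assumes the NodeLock is stored as a node annotation named hami.io/mutex.lock, which may differ between HAMi versions; verify the exact key on your cluster before removing anything.

```bash
# Inspect the node's annotations to find the HAMi NodeLock entry.
kubectl get node <node-name> -o jsonpath='{.metadata.annotations}'

# Remove the dangling lock annotation (the key below is an assumption;
# use the key you actually see on the node). The trailing "-" deletes it.
kubectl annotate node <node-name> hami.io/mutex.lock-
```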