FAQ
The HAMi device plugin pod can't start when the NVIDIA driver API times out.
When the NVIDIA driver API responds too slowly (the nvidia-smi command also takes a long time to return), the HAMi device plugin fails to start.
You can run nvidia-smi -pm 1 to enable persistence mode and then restart the HAMi device plugin pod to resolve it.
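A minimal sketch of this workaround, assuming the device plugin runs in the kube-system namespace with a hami-device-plugin component label (adjust both to match your installation):

```bash
# Enable persistence mode on the affected node so the NVIDIA driver stays
# loaded and nvidia-smi / NVML calls return quickly.
nvidia-smi -pm 1

# Restart the HAMi device plugin pod so it retries initialization.
# The namespace and label selector below are assumptions; adjust them
# to match your deployment.
kubectl delete pod -n kube-system -l app.kubernetes.io/component=hami-device-plugin
```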
RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
When two inference services run on the same GPU card, one of them always fails with this error.
You can run nvidia-smi -i 0 -c 0 to set the GPU's compute mode back to Default so that all processes can access the GPU.
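A short sketch for checking and resetting the compute mode (GPU index 0 is an assumption; substitute the index of the affected card):

```bash
# Show the current compute mode; "Exclusive Process" prevents a second
# process from using the GPU at the same time.
nvidia-smi -i 0 --query-gpu=compute_mode --format=csv

# Set the compute mode back to Default (0) so multiple processes can share the GPU.
nvidia-smi -i 0 -c 0
```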
The HAMi scheduler locks a node and cannot schedule pods onto it.
This occurs when a pod is deleted unexpectedly during the bind phase, leaving a dangling NodeLock; other pods must then wait for the lock to expire before they can be scheduled. A fix proactively clears the NodeLock when an error occurs, eliminating this issue; it will be included in the next HAMi version (2.7).
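Until that release, a possible manual workaround is to clear the dangling lock yourself. The sketch below assumes the NodeLock is stored as a node annotation named hami.io/mutex.lock, which may differ between HAMi versions; verify the exact key on your cluster before removing anything.

```bash
# Inspect the node's annotations to find the HAMi NodeLock entry.
kubectl get node <node-name> -o jsonpath='{.metadata.annotations}'

# Remove the dangling lock annotation (the key below is an assumption;
# use the key you actually see on the node). The trailing "-" deletes it.
kubectl annotate node <node-name> hami.io/mutex.lock-
```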