How to reset GPUs on a running instance?
The simplest and most robust way to reset a GPU is to restart the instance, which resets all of its GPUs. This section covers a different situation: resetting GPUs on a running instance without rebooting it. Beware that this process is more complex and potentially fragile.
GPUs can be reset with the command sudo nvidia-smi -r [-i <idx>] or sudo nvidia-smi --gpu-reset [-i <idx>]. When no -i argument is specified, the reset is applied to all GPUs. After a successful reset, the message "GPU … was successfully reset" is printed for each selected GPU.
A successful reset requires that no applications, services, kernel modules, or other clients are using the affected GPUs. If a GPU is still in use, the reset command fails with the error "The following GPUs could not be reset: ... In use by another client".
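A script that drives the reset may want to tell these outcomes apart. The following is a minimal sketch, assuming the two messages quoted above appear verbatim in the command output. The function name classify_reset_output is hypothetical and is not part of nvidia-smi.

```shell
#!/bin/sh
# classify_reset_output is a hypothetical helper, not part of nvidia-smi.
# It reads the text printed by the reset command on stdin and prints:
#   in-use - the "In use by another client" error is present
#   ok     - a success message is present and no in-use error was found
#   error  - neither message was found
classify_reset_output() {
    _reset_out=$(cat)
    case "$_reset_out" in
        *"In use by another client"*) echo "in-use" ;;
        *"was successfully reset"*)   echo "ok" ;;
        *)                            echo "error" ;;
    esac
}

# Example (needs a GPU and root access, so it is left commented out):
# sudo nvidia-smi --gpu-reset -i 0 2>&1 | classify_reset_output
```

The in-use check comes first on purpose: when several GPUs are selected and only some of them are reset, the result should still be treated as a failure.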
The exact list of software that uses GPUs and prevents a reset depends on your setup. In general, we recommend the following steps.
1. Make sure that no processes are listed in the output of nvidia-smi.
2. Disable Persistence Mode on all GPUs using sudo nvidia-smi -pm 0.
3. Stop the following services (e.g., with sudo systemctl stop): nvidia-persistenced.service, nvidia-fabricmanager.service, and docker.service (if you are using NVIDIA Container Runtime).
4. Unload kernel modules (e.g., with sudo modprobe -r): nvidia_uvm, nvidia_drm, nvidia_peermem, and other modules shown in sudo lsmod | grep nvidia.
5. Run the GPU reset command. If it fails with "The following GPUs could not be reset: ... In use by another client", search for more apps, services, and/or kernel modules that are using the chosen GPUs and repeat steps 3 and 4 for those.
6. After a successful reset, load the unloaded kernel modules again, start the stopped services, and enable Persistence Mode (if you would like to have it). Make sure the loading order is correct; for example, match the order used when booting the OS.
If the GPU reset succeeded but the working software configuration cannot be restored in step 6, fall back to rebooting your cloud instance.
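The six steps above can be sketched as a small script. This is an illustration under stated assumptions, not a tested procedure for every setup: the service and module lists contain only the names mentioned in the steps, the restore order in step 6 is simply the reverse of the stop order, and nvidia-smi is assumed to return a non-zero exit code when the reset fails. The names run, reset_gpus, and DRY_RUN are hypothetical. By default the script only prints the commands it would execute.

```shell
#!/bin/sh
# Sketch of the reset procedure. DRY_RUN, run, and reset_gpus are hypothetical
# names for this example. With DRY_RUN=1 (the default) commands are only
# printed; set DRY_RUN=0 to actually execute them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

reset_gpus() {
    # Step 1: list compute processes on the GPUs; the list should be empty.
    run nvidia-smi --query-compute-apps=pid,process_name --format=csv,noheader

    # Step 2: disable Persistence Mode on all GPUs.
    run sudo nvidia-smi -pm 0

    # Step 3: stop services that hold the GPUs (extend for your setup).
    for svc in docker.service nvidia-fabricmanager.service nvidia-persistenced.service; do
        run sudo systemctl stop "$svc"
    done

    # Step 4: unload kernel modules (add others shown by lsmod | grep nvidia).
    for mod in nvidia_peermem nvidia_drm nvidia_uvm; do
        run sudo modprobe -r "$mod"
    done

    # Step 5: reset all GPUs. Assumption: a failed reset gives a non-zero exit code.
    if ! run sudo nvidia-smi --gpu-reset; then
        echo "Reset failed: find remaining GPU users and repeat steps 3 and 4." >&2
        return 1
    fi

    # Step 6: restore in reverse order (an assumption; match your boot order).
    for mod in nvidia_uvm nvidia_drm nvidia_peermem; do
        run sudo modprobe "$mod"
    done
    for svc in nvidia-persistenced.service nvidia-fabricmanager.service docker.service; do
        run sudo systemctl start "$svc"
    done
    run sudo nvidia-smi -pm 1
}

# Usage: call reset_gpus to print the plan, or set DRY_RUN=0 first to execute it.
```

Printing the plan first makes it easy to review the service and module lists against your own setup before anything is stopped or unloaded.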