What can I do if a GPU has fallen off the bus?
While you are using a cloud instance, a GPU may disappear from the output of nvidia-smi while still being listed in the output of lspci. This is usually accompanied by an Xid error, “GPU has fallen off the bus”. If you are distributing a workload across GPUs, you may notice a drop in overall performance on that instance.
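The symptom described above can be checked with a short script. The following is a minimal sketch, not an official diagnostic tool: it assumes a Linux instance on which lspci, nvidia-smi and dmesg are installed, and the helper names (count_pci_gpus, count_driver_gpus, find_xid_lines) are hypothetical ones chosen for illustration. It compares the number of NVIDIA GPUs visible on the PCI bus with the number reported by the driver and collects any Xid messages from the kernel log. Reading the kernel log with dmesg may require root privileges on some systems.

```python
import re
import subprocess


def count_pci_gpus(lspci_output):
    """Count NVIDIA display-class devices listed in `lspci` output."""
    return sum(
        1
        for line in lspci_output.splitlines()
        if "NVIDIA" in line
        and re.search(r"VGA compatible controller|3D controller", line)
    )


def count_driver_gpus(nvidia_smi_output):
    """Count the 'GPU <n>: ...' lines printed by `nvidia-smi -L`."""
    return sum(
        1
        for line in nvidia_smi_output.splitlines()
        if re.match(r"GPU \d+:", line)
    )


def find_xid_lines(kernel_log):
    """Return the kernel log lines that contain an NVIDIA Xid message."""
    return [line for line in kernel_log.splitlines() if "NVRM: Xid" in line]


def run(cmd):
    """Run a command and return its stdout, or '' if it cannot be started."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, check=False)
        return result.stdout
    except OSError:
        return ""


if __name__ == "__main__":
    on_bus = count_pci_gpus(run(["lspci"]))
    in_driver = count_driver_gpus(run(["nvidia-smi", "-L"]))
    xid_lines = find_xid_lines(run(["dmesg"]))

    print(f"GPUs on the PCI bus (lspci):      {on_bus}")
    print(f"GPUs seen by driver (nvidia-smi): {in_driver}")
    if on_bus > in_driver:
        print("At least one GPU is on the bus but missing from nvidia-smi.")
    for line in xid_lines:
        print(line)
```

If the script reports more GPUs on the PCI bus than in nvidia-smi, that matches the situation described here, and the printed Xid lines may be useful details to include when you contact support.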
Although we test all GPUs extensively before handing them over to a customer, such errors can still occur at random. Restarting the cloud instance may fix the issue.
If the problem persists after a restart, please open a support ticket with us so that we are aware of the problem and can work with our operations and on-site teams to resolve it as quickly as possible.