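A teacher at my school handed me a server and asked me to reinstall the operating system. After the reinstall, the driver for the NVIDIA Tesla T4 simply would not load, no matter which version I tried; I installed and uninstalled it over and over. What now? The teacher was the one who asked for the reinstall, but a GPU that stops working right afterwards is still hard to stomach. If the teacher decided to hold me responsible, I would have no way to defend myself.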
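I tried ubuntu-drivers. I also installed a driver found through NVIDIA's driver search, and I followed NVIDIA's official driver installation guide; none of these worked. Running nvidia-smi fails with: NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.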
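What was I supposed to do? I started to panic. With nothing better to do, I looked the card up on Xianyu (a second-hand marketplace): about ¥20,000 new, ¥3,000 to ¥4,000 used. My thought: that's it, the New Year's money I saved up is gone. My palms were sweating and I was a nervous wreck, but then I saw it was already 3 a.m. Between money and my life, I still choose my life, so I went to bed.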
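The next day I carried on over the SSH access I had set up in advance. I had already asked DeepSeek, ChatGPT, and other AI assistants, but none of them produced a working solution. (I later realized that, for questions like this, the AI does not proactively ask the user to narrow the problem down with the usual diagnostic steps; it tends to make up plausible-sounding answers instead.) So I completely uninstalled the driver (see this article for reference), rebooted, and then reinstalled it following NVIDIA's official driver installation guide: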
distro=ubuntu2404;arch=x86_64;arch_ext=amd64
wget https://developer.download.nvidia.com/compute/cuda/repos/$distro/$arch/cuda-keyring_1.1-1_all.deb
dpkg -i cuda-keyring_1.1-1_all.deb
apt update
apt -V install cuda-drivers
reboot
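This was a completely clean install, so I was fairly confident the software installation itself was no longer the problem. After rebooting I checked dmesg: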
dmesg | grep -i nvidia
I found this output:
[ 5.251470] nvidia: loading out-of-tree module taints kernel.
[ 5.251483] nvidia: module license 'NVIDIA' taints kernel.
[ 5.251489] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[ 5.251490] nvidia: module license taints kernel.
[ 5.624694] nvidia-nvlink: Nvlink Core is being initialized, major device number 236
[ 5.627970] nvidia 0000:02:00.0: enabling device (0100 -> 0102)
[ 5.628036] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
[ 5.628047] nvidia 0000:02:00.0: probe with driver nvidia failed with error -1
[ 5.628071] NVRM: The NVIDIA probe routine failed for 1 device(s).
[ 5.628073] NVRM: None of the NVIDIA devices were initialized.
[ 5.628468] nvidia-nvlink: Unregistered Nvlink Core, major device number 236
[ 6.369985] nvidia-nvlink: Nvlink Core is being initialized, major device number 236
[ 6.372853] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
[ 6.372874] nvidia 0000:02:00.0: probe with driver nvidia failed with error -1
[ 6.372896] NVRM: The NVIDIA probe routine failed for 1 device(s).
[ 6.372898] NVRM: None of the NVIDIA devices were initialized.
[ 6.373109] nvidia-nvlink: Unregistered Nvlink Core, major device number 236
[ 12.784066] nvidia-nvlink: Nvlink Core is being initialized, major device number 236
[ 12.787226] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
[ 12.787240] nvidia 0000:02:00.0: probe with driver nvidia failed with error -1
[ 12.787262] NVRM: The NVIDIA probe routine failed for 1 device(s).
[ 12.787264] NVRM: None of the NVIDIA devices were initialized.
[ 12.787553] nvidia-nvlink: Unregistered Nvlink Core, major device number 236
[ 25.221420] nvidia-nvlink: Nvlink Core is being initialized, major device number 236
[ 25.225024] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
[ 25.225037] nvidia 0000:02:00.0: probe with driver nvidia failed with error -1
[ 25.225061] NVRM: The NVIDIA probe routine failed for 1 device(s).
[ 25.225063] NVRM: None of the NVIDIA devices were initialized.
[ 25.225316] nvidia-nvlink: Unregistered Nvlink Core, major device number 236
[ 26.563426] nvidia-nvlink: Nvlink Core is being initialized, major device number 236
[ 26.566237] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
[ 26.566250] nvidia 0000:02:00.0: probe with driver nvidia failed with error -1
[ 26.566273] NVRM: The NVIDIA probe routine failed for 1 device(s).
[ 26.566275] NVRM: None of the NVIDIA devices were initialized.
[ 26.566505] nvidia-nvlink: Unregistered Nvlink Core, major device number 236
[ 28.293579] nvidia-nvlink: Nvlink Core is being initialized, major device number 236
[ 28.296300] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
[ 28.296314] nvidia 0000:02:00.0: probe with driver nvidia failed with error -1
[ 28.296339] NVRM: The NVIDIA probe routine failed for 1 device(s).
[ 28.296341] NVRM: None of the NVIDIA devices were initialized.
[ 28.296618] nvidia-nvlink: Unregistered Nvlink Core, major device number 236
[ 29.144924] nvidia-nvlink: Nvlink Core is being initialized, major device number 236
[ 29.147344] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
[ 29.147372] nvidia 0000:02:00.0: probe with driver nvidia failed with error -1
[ 29.147396] NVRM: The NVIDIA probe routine failed for 1 device(s).
[ 29.147397] NVRM: None of the NVIDIA devices were initialized.
[ 29.147662] nvidia-nvlink: Unregistered Nvlink Core, major device number 236
The key message is: NVRM: This PCI I/O region assigned to your NVIDIA device is invalid
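Because the same failure repeats many times in the log, a small filter can make the pattern easier to see. The following is a minimal sketch, not part of the original troubleshooting: `summarize_nvidia_probe` is a hypothetical helper name, and it assumes the kernel words the failure as "probe with driver nvidia failed", as in the log above (other kernel versions may phrase it differently). It reads kernel log text on stdin, counts the "invalid PCI I/O region" messages, and lists each PCI address whose probe failed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: summarize NVIDIA probe failures from kernel log text on stdin.
summarize_nvidia_probe() {
    local log invalid
    log=$(cat)
    # Count the "invalid PCI I/O region" messages (grep -c prints 0 when none match).
    invalid=$(printf '%s\n' "$log" |
        grep -c 'NVRM: This PCI I/O region assigned to your NVIDIA device is invalid' || true)
    echo "invalid_region_messages=$invalid"
    # Print each PCI address whose probe failed, once per address.
    printf '%s\n' "$log" |
        grep -oE 'nvidia [0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f]: probe with driver nvidia failed' |
        awk '{ print $2 }' | sed 's/:$//' | sort -u |
        while read -r addr; do
            echo "failed_device=$addr"
        done
}

# On the affected machine (not run here):
#   sudo dmesg | summarize_nvidia_probe
#   sudo lspci -s 02:00.0 -vv | grep -i 'region'   # inspect the memory regions assigned to the GPU
```

The `lspci -s 02:00.0 -vv` line uses the PCI address from the log above; on another machine the address would differ.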
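Solution
The NVIDIA driver rejects the PCI I/O region that was assigned to the GPU. After asking DeepSeek, I got two possible fixes:
- Fix it in the BIOS
- Adjust the kernel parameters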
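# BIOS fix
sudo systemctl reboot --firmware-setup # reboot into the firmware setup, then:
# - Disable Secure Boot
# - Enable Above 4G Decoding/Resizable BAR
# Kernel parameter adjustment
sudo nano /etc/default/grub
# In that file, set GRUB_CMDLINE_LINUX to:
GRUB_CMDLINE_LINUX="pci=realloc=off pci=nommconf"
# Regenerate the GRUB configuration and reboot
sudo update-grub && sudo reboot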
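The server is not physically with me, so I tried the second option first: I added pci=realloc=off pci=nommconf to the kernel boot parameters in GRUB (GRUB_CMDLINE_LINUX), applied the change, and rebooted.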
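On a remote machine, editing /etc/default/grub by hand in an editor is easy to get wrong. As a rough sketch only, the edit can be scripted so that it is repeatable and does not add a parameter twice. `add_kernel_params` is a hypothetical helper name, not something the original post used. It assumes GNU sed, a double-quoted `GRUB_CMDLINE_LINUX="..."` line, and parameters that contain no `|`, `&`, or backslash characters.

```shell
#!/usr/bin/env bash
# Hypothetical helper: add_kernel_params FILE PARAM...
# Appends each PARAM to GRUB_CMDLINE_LINUX="..." in FILE unless it is already there.
add_kernel_params() {
    local file=$1; shift
    local current param
    # Current value between the double quotes (empty if the line is missing or empty).
    current=$(sed -n 's/^GRUB_CMDLINE_LINUX="\(.*\)"$/\1/p' "$file" | tail -n 1)
    for param in "$@"; do
        case " $current " in
            *" $param "*) ;;                              # already present, skip
            *) current="${current:+$current }$param" ;;   # append, space-separated
        esac
    done
    if grep -q '^GRUB_CMDLINE_LINUX=' "$file"; then
        # Rewrite the existing line; GRUB_CMDLINE_LINUX_DEFAULT is left untouched.
        sed -i "s|^GRUB_CMDLINE_LINUX=.*|GRUB_CMDLINE_LINUX=\"$current\"|" "$file"
    else
        printf 'GRUB_CMDLINE_LINUX="%s"\n' "$current" >> "$file"
    fi
}

# On the real system (not run here), after backing the file up:
#   sudo cp /etc/default/grub /etc/default/grub.bak
#   sudo bash -c "$(declare -f add_kernel_params); add_kernel_params /etc/default/grub pci=realloc=off pci=nommconf"
#   sudo update-grub && sudo reboot
```

Keeping a backup copy of /etc/default/grub matters more than usual when the only access is over SSH, since a bad boot configuration cannot be fixed remotely.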
After that, the driver loaded successfully:
➜ ~ dmesg | grep -i nvidia
[ 5.393452] nvidia: loading out-of-tree module taints kernel.
[ 5.393464] nvidia: module license 'NVIDIA' taints kernel.
[ 5.393469] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[ 5.393470] nvidia: module license taints kernel.
[ 5.598248] nvidia-nvlink: Nvlink Core is being initialized, major device number 236
[ 5.600918] nvidia 0000:02:00.0: enabling device (0100 -> 0102)
[ 5.648904] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 570.124.06 Wed Feb 26 02:12:04 UTC 2025
[ 5.673728] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 570.124.06 Wed Feb 26 01:42:18 UTC 2025
[ 5.680488] [drm] [nvidia-drm] [GPU ID 0x00000200] Loading driver
[ 7.255760] [drm] Initialized nvidia-drm 0.0.0 for 0000:02:00.0 on minor 0
[ 7.255819] nvidia 0000:02:00.0: [drm] No compatible format found
[ 7.255823] nvidia 0000:02:00.0: [drm] Cannot find any crtc or sizes
➜ ~ nvidia-smi
Wed Mar 26 22:34:57 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.124.06 Driver Version: 570.124.06 CUDA Version: 12.8 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 Tesla T4 Off | 00000000:02:00.0 Off | 0 |
| N/A 38C P8 10W / 70W | 5MiB / 15360MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 1787 G /usr/lib/xorg/Xorg 4MiB |
+-----------------------------------------------------------------------------------------+
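To confirm that the result really comes from the new boot parameters, it can help to check that they reached the running kernel. A minimal sketch, assuming the parameters appear verbatim and space-separated in /proc/cmdline; `cmdline_has_params` is a hypothetical helper name, and the file path is passed as an argument so the function can also be tried on a sample file:

```shell
#!/usr/bin/env bash
# Hypothetical helper: cmdline_has_params FILE PARAM...
# Returns 0 if every PARAM appears as a whole word in FILE, 1 otherwise.
cmdline_has_params() {
    local cmdline_file=$1; shift
    local cmdline param
    cmdline=$(cat "$cmdline_file") || return 2
    for param in "$@"; do
        case " $cmdline " in
            *" $param "*) ;;                               # found
            *) echo "missing: $param"; return 1 ;;
        esac
    done
    return 0
}

# On the real system (not run here):
#   cmdline_has_params /proc/cmdline pci=realloc=off pci=nommconf && echo "parameters active"
```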
My New Year's money is safe. I'm so happy!
References
https://www.nvidia.com/en-us/drivers/details/241401
https://docs.nvidia.com/datacenter/tesla/driver-installation-guide