Ubuntu 24.04: the NVIDIA Tesla T4 driver fails to install properly

Posted 8 days ago · 63 views


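A teacher at my school handed me a server and asked me to reinstall the operating system. After the reinstall, though, the NVIDIA Tesla T4 driver simply would not work, no matter which version I switched to or how many times I installed and uninstalled it. What now? The teacher was the one who asked for the reinstall, but a GPU that stops working afterwards is still hard to stomach. If the teacher decided to hold me responsible for it, I would have no way to defend myself.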

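I tried ubuntu-drivers. I also installed a driver found through NVIDIA's driver search page, and I followed NVIDIA's official driver installation guide. None of them worked, and running nvidia-smi reported: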
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

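What was I supposed to do? I started to panic. With nothing better to do, I looked the card up on Xianyu (a second-hand marketplace): about 20,000 yuan new, 3,000 to 4,000 yuan used. My thought: that's it, the New Year gift money I had saved up is gone. My palms were sweating and I was a nervous wreck, but then I checked the time and it was already 3 a.m. Between money and my life, I still choose my life, so I went to bed.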

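The next day I carried on over the SSH access I had set up beforehand. I had already asked DeepSeek, ChatGPT, and other AI assistants, but none of them produced a solution. (I later realized that, for questions like this, the AI does not proactively ask the user to narrow the problem down with the usual diagnostic methods; it tends to make up plausible answers instead.) So I tried completely uninstalling the driver (see the uninstall article listed under References), rebooted, and then reinstalled the driver by following NVIDIA's official driver installation guide: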

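# Run as root, or prefix the dpkg, apt and reboot commands with sudo.
distro=ubuntu2404;arch=x86_64;arch_ext=amd64
# Fetch and install NVIDIA's repository keyring package.
wget https://developer.download.nvidia.com/compute/cuda/repos/$distro/$arch/cuda-keyring_1.1-1_all.deb
dpkg -i cuda-keyring_1.1-1_all.deb
apt update
# Install the driver packages (-V prints package versions), then reboot.
apt -V install cuda-drivers
reboot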
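For the "completely uninstall" step mentioned earlier, one way to confirm that nothing was left behind before reinstalling is to scan the package list. The following is only a sketch: list_nvidia_leftovers is a hypothetical helper name, and it assumes that matching "nvidia" in the package name is enough (packages named cuda-* would need an extra pattern).

```shell
# Print NVIDIA-related packages that are still installed ("ii") or that
# were removed but left configuration files behind ("rc").
# Reads `dpkg -l` output from stdin; the helper name is hypothetical.
list_nvidia_leftovers() {
    awk '$1 ~ /^(ii|rc)$/ && $2 ~ /nvidia/ { print $1, $2 }'
}
```

Running dpkg -l | list_nvidia_leftovers after the purge should print nothing if the removal was complete.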

This was a completely clean install, so I was fairly sure the problem was no longer on the software side. After rebooting, I checked dmesg:

dmesg | grep -i nvidia

I found this output:

[ 5.251470] nvidia: loading out-of-tree module taints kernel.
[ 5.251483] nvidia: module license 'NVIDIA' taints kernel.
[ 5.251489] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[ 5.251490] nvidia: module license taints kernel.
[ 5.624694] nvidia-nvlink: Nvlink Core is being initialized, major device number 236
[ 5.627970] nvidia 0000:02:00.0: enabling device (0100 -> 0102)
[ 5.628036] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
[ 5.628047] nvidia 0000:02:00.0: probe with driver nvidia failed with error -1
[ 5.628071] NVRM: The NVIDIA probe routine failed for 1 device(s).
[ 5.628073] NVRM: None of the NVIDIA devices were initialized.
[ 5.628468] nvidia-nvlink: Unregistered Nvlink Core, major device number 236
[ 6.369985] nvidia-nvlink: Nvlink Core is being initialized, major device number 236
[ 6.372853] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
[ 6.372874] nvidia 0000:02:00.0: probe with driver nvidia failed with error -1
[ 6.372896] NVRM: The NVIDIA probe routine failed for 1 device(s).
[ 6.372898] NVRM: None of the NVIDIA devices were initialized.
[ 6.373109] nvidia-nvlink: Unregistered Nvlink Core, major device number 236
[ 12.784066] nvidia-nvlink: Nvlink Core is being initialized, major device number 236
[ 12.787226] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
[ 12.787240] nvidia 0000:02:00.0: probe with driver nvidia failed with error -1
[ 12.787262] NVRM: The NVIDIA probe routine failed for 1 device(s).
[ 12.787264] NVRM: None of the NVIDIA devices were initialized.
[ 12.787553] nvidia-nvlink: Unregistered Nvlink Core, major device number 236
[ 25.221420] nvidia-nvlink: Nvlink Core is being initialized, major device number 236
[ 25.225024] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
[ 25.225037] nvidia 0000:02:00.0: probe with driver nvidia failed with error -1
[ 25.225061] NVRM: The NVIDIA probe routine failed for 1 device(s).
[ 25.225063] NVRM: None of the NVIDIA devices were initialized.
[ 25.225316] nvidia-nvlink: Unregistered Nvlink Core, major device number 236
[ 26.563426] nvidia-nvlink: Nvlink Core is being initialized, major device number 236
[ 26.566237] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
[ 26.566250] nvidia 0000:02:00.0: probe with driver nvidia failed with error -1
[ 26.566273] NVRM: The NVIDIA probe routine failed for 1 device(s).
[ 26.566275] NVRM: None of the NVIDIA devices were initialized.
[ 26.566505] nvidia-nvlink: Unregistered Nvlink Core, major device number 236
[ 28.293579] nvidia-nvlink: Nvlink Core is being initialized, major device number 236
[ 28.296300] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
[ 28.296314] nvidia 0000:02:00.0: probe with driver nvidia failed with error -1
[ 28.296339] NVRM: The NVIDIA probe routine failed for 1 device(s).
[ 28.296341] NVRM: None of the NVIDIA devices were initialized.
[ 28.296618] nvidia-nvlink: Unregistered Nvlink Core, major device number 236
[ 29.144924] nvidia-nvlink: Nvlink Core is being initialized, major device number 236
[ 29.147344] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
[ 29.147372] nvidia 0000:02:00.0: probe with driver nvidia failed with error -1
[ 29.147396] NVRM: The NVIDIA probe routine failed for 1 device(s).
[ 29.147397] NVRM: None of the NVIDIA devices were initialized.
[ 29.147662] nvidia-nvlink: Unregistered Nvlink Core, major device number 236

The key line is: NVRM: This PCI I/O region assigned to your NVIDIA device is invalid
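In a long log like this, the repeated probe attempts make the important line easy to miss. A small helper along these lines could count the relevant messages. It is only a sketch: summarize_nvidia_dmesg is a name made up for this post, and the patterns are copied from the log above, so other driver versions may word them differently.

```shell
# Count the NVIDIA messages that matter for this failure mode.
# Reads kernel log text from stdin; the helper name is hypothetical and
# the patterns are taken verbatim from the dmesg output in this post.
summarize_nvidia_dmesg() {
    awk '
        /NVRM: This PCI I\/O region assigned to your NVIDIA device is invalid/ { bar++ }
        /probe with driver nvidia failed/ { probe++ }
        /NVRM: loading NVIDIA UNIX/ { loaded++ }
        END {
            printf "invalid_pci_region=%d probe_failed=%d driver_loaded=%d\n", bar, probe, loaded
        }
    '
}
```

It would be used as dmesg | summarize_nvidia_dmesg. A non-zero invalid_pci_region together with driver_loaded=0 points at PCI resource assignment rather than at the driver packages.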

Solution

The NVIDIA driver cannot correctly recognize the PCI region assigned to the card. According to DeepSeek, there are two possible fixes:

  • BIOS fix
  • Kernel parameter adjustment

# BIOS fix
sudo systemctl reboot --firmware-setup  # reboot into the firmware setup, then:
#   - disable Secure Boot
#   - enable Above 4G Decoding / Resizable BAR
# Kernel parameter adjustment
sudo nano /etc/default/grub
# In that file, set GRUB_CMDLINE_LINUX to:
GRUB_CMDLINE_LINUX="pci=realloc=off pci=nommconf"
sudo update-grub && sudo reboot

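The server is not physically with me, so I went with the second option first: I added pci=realloc=off pci=nommconf to the kernel boot parameters in GRUB (GRUB_CMDLINE_LINUX), applied the change, and rebooted.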
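Because all of this happens over SSH, the same edit can also be made without opening an editor. The following is a minimal sketch, not the exact procedure used here. It assumes GNU sed and a double-quoted GRUB_CMDLINE_LINUX="..." line as in Ubuntu's default file, and that the parameters contain no | or & characters. The names add_grub_params and cmdline_has_params are hypothetical.

```shell
# Append kernel parameters to GRUB_CMDLINE_LINUX in a grub defaults file.
# Usage: add_grub_params FILE PARAM...
# Parameters that are already present are skipped, so running it twice
# leaves the file unchanged. The body runs in a subshell, which keeps the
# helper variables out of the caller's environment.
add_grub_params() (
    file=$1
    shift
    # Keep one backup of the original file.
    [ -e "$file.bak" ] || cp "$file" "$file.bak"
    # Current value between the double quotes (empty if the line is missing).
    current=$(sed -n 's/^GRUB_CMDLINE_LINUX="\(.*\)"$/\1/p' "$file" | head -n 1)
    new=$current
    for param in "$@"; do
        case " $new " in
            *" $param "*) ;;                      # already present
            *) new="${new:+$new }$param" ;;       # append with a separator
        esac
    done
    if grep -q '^GRUB_CMDLINE_LINUX=' "$file"; then
        # '|' is the sed delimiter because kernel parameters may contain '/'.
        sed -i "s|^GRUB_CMDLINE_LINUX=.*|GRUB_CMDLINE_LINUX=\"$new\"|" "$file"
    else
        printf 'GRUB_CMDLINE_LINUX="%s"\n' "$new" >> "$file"
    fi
)

# Succeed only if every given parameter appears in a kernel command line.
# Usage: cmdline_has_params "$(cat /proc/cmdline)" PARAM...
cmdline_has_params() (
    cmdline=$1
    shift
    for param in "$@"; do
        case " $cmdline " in
            *" $param "*) ;;
            *) exit 1 ;;
        esac
    done
    exit 0
)
```

Run as root, add_grub_params /etc/default/grub pci=realloc=off pci=nommconf would be followed by update-grub and a reboot. After the reboot, cmdline_has_params "$(cat /proc/cmdline)" pci=realloc=off pci=nommconf confirms that the kernel actually booted with both parameters.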

After that, everything came up successfully:

➜  ~ dmesg | grep -i nvidia
[    5.393452] nvidia: loading out-of-tree module taints kernel.
[    5.393464] nvidia: module license 'NVIDIA' taints kernel.
[    5.393469] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[    5.393470] nvidia: module license taints kernel.
[    5.598248] nvidia-nvlink: Nvlink Core is being initialized, major device number 236
[    5.600918] nvidia 0000:02:00.0: enabling device (0100 -> 0102)
[    5.648904] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  570.124.06  Wed Feb 26 02:12:04 UTC 2025
[    5.673728] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  570.124.06  Wed Feb 26 01:42:18 UTC 2025
[    5.680488] [drm] [nvidia-drm] [GPU ID 0x00000200] Loading driver
[    7.255760] [drm] Initialized nvidia-drm 0.0.0 for 0000:02:00.0 on minor 0
[    7.255819] nvidia 0000:02:00.0: [drm] No compatible format found
[    7.255823] nvidia 0000:02:00.0: [drm] Cannot find any crtc or sizes

➜  ~ nvidia-smi
Wed Mar 26 22:34:57 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.124.06             Driver Version: 570.124.06     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla T4                       Off |   00000000:02:00.0 Off |                    0 |
| N/A   38C    P8             10W /   70W |       5MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            1787      G   /usr/lib/xorg/Xorg                        4MiB |
+-----------------------------------------------------------------------------------------+

My New Year gift money is safe. So happy!

References

https://www.nvidia.com/en-us/drivers/details/241401

https://docs.nvidia.com/datacenter/tesla/driver-installation-guide

https://www.oryoy.com/news/ubuntu-xi-tong-xia-che-di-xie-zai-bing-qing-li-nvidia-xian-ka-qu-dong-de-xiang-xi-bu-zhou-jiao-cheng.html