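I. Basic Environment Preparation
1. System Tuning
1. Pin the kernel version
Hold the kernel packages so that an automatic upgrade cannot install a new kernel that the driver was not built against: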
|
apt-mark hold linux-generic linux-image-generic linux-headers-generic
|
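To confirm that the hold took effect, the held packages can be compared against the expected list. The following is a minimal sketch; not_held is a hypothetical helper name, and it relies only on the output of apt-mark showhold (one package name per line).

```shell
#!/bin/sh
# not_held (hypothetical helper): read a list of held packages from stdin
# and print every expected package that is NOT in that list.
not_held() {
    held=$(cat)
    for pkg in "$@"; do
        # -x: match the whole line, -F: treat the name as a fixed string
        echo "$held" | grep -qxF "$pkg" || echo "$pkg"
    done
}

if command -v apt-mark > /dev/null 2>&1; then
    # Prints nothing when all three packages are held.
    apt-mark showhold | not_held linux-generic linux-image-generic linux-headers-generic
fi
```

Any package name printed by this sketch is not held yet and may still be upgraded.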
2. Disable automatic updates
|
sudo sed -i.bak 's/"1"/"0"/' /etc/apt/apt.conf.d/10periodic > /dev/null 2>&1
sudo sed -i.bak 's/"1"/"0"/' /etc/apt/apt.conf.d/20auto-upgrades > /dev/null 2>&1
|
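To preview what such a substitution does before touching the real files, the sketch below applies it to a scratch file. It uses a slightly stricter pattern that only matches a quoted "1". The sample content is an assumption modeled on a typical 20auto-upgrades file, and disable_periodic is a hypothetical helper name.

```shell
#!/bin/sh
# disable_periodic (hypothetical helper): set every quoted "1" value to "0"
# in the given file, keeping a .bak backup next to it.
disable_periodic() {
    sed -i.bak 's/"1"/"0"/' "$1"
}

scratch=$(mktemp)
# Assumed sample content; the real file on your system may differ.
cat > "$scratch" << 'EOF'
APT::Periodic::Update-Package-Lists "1";
APT::Periodic::Unattended-Upgrade "1";
EOF

disable_periodic "$scratch"
cat "$scratch"    # both settings should now read "0"
```

The original content stays available in the .bak file if the change needs to be reverted.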
2. Installing Dependencies
Install the tools required to compile the kernel module:
|
apt install gcc make -y
|
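A quick check that the build tools are actually on the PATH can save a failed driver build later. This is a minimal sketch; missing_tools is a hypothetical helper name.

```shell
#!/bin/sh
# missing_tools (hypothetical helper): print each named command
# that cannot be found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" > /dev/null 2>&1 || echo "$tool"
    done
}

missing=$(missing_tools gcc make)
if [ -n "$missing" ]; then
    echo "Missing build tools:" $missing
else
    echo "gcc and make are available"
fi
```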
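II. Driver Download
1. NVIDIA Driver Download
Visit the official NVIDIA driver download page:
Official link: https://www.nvidia.cn/drivers/lookup/

Select the driver version that matches your GPU model and operating system version. The download is a .run installer that can be uploaded to the server as-is.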
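2. CUDA Download
Visit the NVIDIA CUDA download page:
Official link: https://developer.nvidia.com/cuda-downloads

Select the matching operating system, architecture, and version. The .run installer is recommended because it makes it easy to customize the installation options.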
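3. NVIDIA Fabric Manager Download
Note: only GPUs with the SXM interface need this component; PCIe GPUs do not. The Fabric Manager version must match the installed driver version exactly.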
|
wget https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2004/x86_64/nvidia-fabricmanager-535_535.154.05-1_amd64.deb
|
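Because the package version has to follow the driver version, it can help to derive the download URL from the version string. The sketch below assumes the naming pattern seen in the URL above (nvidia-fabricmanager-BRANCH_VERSION-1_amd64.deb); other releases or repositories may not follow it, and fm_deb_url is a hypothetical helper name.

```shell
#!/bin/sh
# fm_deb_url (hypothetical helper): build a Fabric Manager package URL
# from a full driver version, assuming the naming pattern of the URL above.
fm_deb_url() {
    version=$1                 # full driver version, e.g. 535.154.05
    branch=${version%%.*}      # major branch, e.g. 535
    repo=${2:-ubuntu2004}      # repo directory; the default is taken from the URL above
    echo "https://developer.download.nvidia.cn/compute/cuda/repos/${repo}/x86_64/nvidia-fabricmanager-${branch}_${version}-1_amd64.deb"
}

fm_deb_url 535.154.05
```

It is worth opening the printed URL in a browser first, since a given version may not be published for every repository.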
III. Driver Installation
1. NVIDIA Driver Installation
Install the NVIDIA driver with the following command (adjust the version number to match your file):
|
sh ./NVIDIA-Linux-x86_64-535.154.05.run \
--ui=none \
--no-questions \
--accept-license \
--disable-nouveau \
--install-libglvnd
|
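Since the installer is asked to disable nouveau, it can be useful to check whether that module is still loaded afterwards. The sketch below reads a /proc/modules-style listing; module_loaded is a hypothetical helper name.

```shell
#!/bin/sh
# module_loaded (hypothetical helper): succeed if the named module appears
# as the first field of a /proc/modules-style listing read from stdin.
module_loaded() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }'
}

if [ -r /proc/modules ] && module_loaded nouveau < /proc/modules; then
    echo "nouveau is still loaded; a reboot may be needed before the NVIDIA module can load"
else
    echo "nouveau is not loaded"
fi
```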
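2. CUDA Installation
Install the CUDA Toolkit (adjust the version number to match your file). The --toolkit option installs only the toolkit, so the driver installed in the previous step is left in place: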
|
sh ./cuda_12.2.0_535.154.05_linux.run \
--silent \
--toolkit \
--override \
--no-drm \
--no-opengl-libs
|
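After the silent install there is no interactive summary, so a short check of the installed toolkit release can be helpful. This sketch assumes the toolkit is reachable through the default /usr/local/cuda location; cuda_release is a hypothetical helper name.

```shell
#!/bin/sh
# cuda_release (hypothetical helper): extract the "release X.Y" number
# from nvcc --version style text read from stdin.
cuda_release() {
    sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p' | head -n 1
}

nvcc=/usr/local/cuda/bin/nvcc   # assumed default install location
if [ -x "$nvcc" ]; then
    echo "CUDA toolkit release: $("$nvcc" --version | cuda_release)"
else
    echo "nvcc not found at $nvcc"
fi
```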
3. NVIDIA Fabric Manager Installation
Only applies to GPUs with the SXM interface (for example, the SXM variants of the A800 and A100):
|
dpkg -i nvidia-fabricmanager-535_535.154.05-1_amd64.deb
|
IV. Service Configuration
1. NVIDIA Fabric Manager Service
Start the Fabric Manager service, which is used for NVLink connectivity, and enable it at boot:
|
systemctl daemon-reload
systemctl start nvidia-fabricmanager
systemctl enable nvidia-fabricmanager
|
2. NVIDIA Persistence Daemon
Create the systemd unit for the NVIDIA persistence daemon, then enable and start it:
|
cat << EOF > /lib/systemd/system/nvidia-persistenced.service
[Unit]
Description=NVIDIA Persistence Daemon
After=syslog.target
[Service]
Type=forking
PIDFile=/var/run/nvidia-persistenced/nvidia-persistenced.pid
Restart=always
ExecStart=/usr/bin/nvidia-persistenced --verbose
ExecStopPost=/bin/rm -rf /var/run/nvidia-persistenced/*
TimeoutSec=300
[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl enable nvidia-persistenced
systemctl start nvidia-persistenced
|
V. Environment Configuration
1. CUDA Environment Settings
Add the CUDA paths to the system-wide environment variables:
|
cat << 'EOF' >> /etc/profile
export PATH=/usr/local/cuda/bin:$PATH
export CUDA_HOME=/usr/local/cuda/
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/usr/local/cuda/targets/x86_64-linux/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/usr/local/cuda/targets/x86_64-linux/lib/stubs:$LD_LIBRARY_PATH
EOF
|
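Whether the here-document delimiter is quoted matters for a block like the one above. With an unquoted delimiter the shell expands $PATH and $LD_LIBRARY_PATH while writing the file, so the current values get baked in. A quoted delimiter writes them literally, so they are expanded at each login. The sketch below illustrates the difference with a made-up variable, DEMO_VAR.

```shell
#!/bin/sh
# Sketch: compare an unquoted and a quoted here-document delimiter.
# DEMO_VAR is a made-up variable used only for illustration.
DEMO_VAR="expanded-now"

unquoted=$(cat << EOF
export DEMO=$DEMO_VAR
EOF
)

quoted=$(cat << 'EOF'
export DEMO=$DEMO_VAR
EOF
)

echo "unquoted: $unquoted"   # the variable was substituted when the text was written
echo "quoted:   $quoted"     # the literal $DEMO_VAR is kept for later expansion
```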
Load the environment variables into the current shell:
|
source /etc/profile
|
VI. Environment Verification
1. nvidia-smi Verification
Run the nvidia-smi command to verify the driver installation:
|
nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05   Driver Version: 535.154.05   CUDA Version: 12.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A800-SXM4-80GB On | 00000000:10:00.0 Off |                    0 |
| N/A   41C    P0    62W / 400W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A800-SXM4-80GB On | 00000000:16:00.0 Off |                    0 |
| N/A   39C    P0    63W / 400W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
...
|
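For scripted checks, the table is awkward to parse; nvidia-smi can also emit CSV through its --query-gpu option. The sketch below compares each GPU's driver version with an expected value and checks that persistence mode is on. check_gpus is a hypothetical helper name, and the expected version is simply the one used throughout this guide.

```shell
#!/bin/sh
# check_gpus (hypothetical helper): read "name, driver_version, persistence_mode"
# lines from stdin and report GPUs that deviate from what is expected.
check_gpus() {
    awk -F', ' -v want="$1" '
        $2 != want      { printf "%s: driver %s, expected %s\n", $1, $2, want; bad = 1 }
        $3 != "Enabled" { printf "%s: persistence mode is %s\n", $1, $3; bad = 1 }
        END             { exit bad }
    '
}

if command -v nvidia-smi > /dev/null 2>&1; then
    nvidia-smi --query-gpu=name,driver_version,persistence_mode --format=csv,noheader |
        check_gpus 535.154.05 && echo "All GPUs look consistent"
fi
```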
2. NVIDIA Fabric Manager Verification
Check the Fabric Manager service status:
|
systemctl status nvidia-fabricmanager
● nvidia-fabricmanager.service - NVIDIA fabric manager service
     Loaded: loaded (/lib/systemd/system/nvidia-fabricmanager.service; enabled; vendor preset: enabled)
     Active: active (running) since Mon 2023-05-07 08:33:50 CST; 3 months 20 days ago
   Main PID: 2979 (nv-fabricmanage)
      Tasks: 18 (limit: 629145)
     Memory: 17.9M
        CPU: 19min 27.766s
     CGroup: /system.slice/nvidia-fabricmanager.service
             └─2979 /usr/bin/nv-fabricmanager -c /usr/share/nvidia/nvswitch/fabricmanager.cfg
|
3. NVIDIA Persistence Daemon Verification
Check the persistence daemon status:
|
systemctl status nvidia-persistenced
● nvidia-persistenced.service - NVIDIA Persistence Daemon
     Loaded: loaded (/lib/systemd/system/nvidia-persistenced.service; enabled; vendor preset: enabled)
     Active: active (running) since Fri 2023-05-07 15:13:59 CST; 3 months 16 days ago
   Main PID: 304182 (nvidia-persiste)
      Tasks: 1 (limit: 629145)
     Memory: 360.0K
        CPU: 214ms
     CGroup: /system.slice/nvidia-persistenced.service
             └─304182 /usr/bin/nvidia-persistenced --verbose
|
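The two status checks can also be condensed into a pass/fail summary with systemctl is-active. This is a minimal sketch; report_services is a hypothetical helper name, and the SYSTEMCTL variable is an assumption added here so the function can be dry-run with another command. On machines without SXM GPUs the Fabric Manager service is not installed, so it will be reported as not active.

```shell
#!/bin/sh
# SYSTEMCTL can be overridden for dry runs; it defaults to the real systemctl.
SYSTEMCTL=${SYSTEMCTL:-systemctl}

# report_services (hypothetical helper): print one line per service and
# return non-zero if any of them is not active.
report_services() {
    failed=0
    for svc in "$@"; do
        if "$SYSTEMCTL" is-active --quiet "$svc" 2> /dev/null; then
            echo "$svc: active"
        else
            echo "$svc: NOT active"
            failed=1
        fi
    done
    return "$failed"
}

report_services nvidia-fabricmanager nvidia-persistenced ||
    echo "At least one service needs attention"
```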