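I. Basic Environment Preparation

1. System Tuning

1. Lock the kernel version

Hold the kernel meta-packages so that an automatic kernel upgrade cannot leave the installed driver incompatible with the running kernel: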

apt-mark hold linux-generic linux-image-generic linux-headers-generic
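To confirm that the hold took effect, the list printed by `apt-mark showhold` can be compared with the packages that were meant to be held. The following is a minimal sketch of such a check. The function name `missing_holds` is made up for illustration, and the hold list is read from standard input so the helper can be tried without touching APT.

```shell
# missing_holds PKG...: print every PKG that does not appear in the hold list
# read from standard input (one package name per line).
# The function name is hypothetical; it is not part of apt.
missing_holds() {
  held="$(cat)"
  for pkg in "$@"; do
    # -x: whole-line match, -F: fixed string, so "linux-generic" does not
    # match a longer name that merely contains it.
    printf '%s\n' "$held" | grep -qxF -- "$pkg" || printf '%s\n' "$pkg"
  done
}

# On a real system (not executed here):
#   apt-mark showhold | missing_holds linux-generic linux-image-generic linux-headers-generic
# An empty result means every listed package is held.
```

Reading from standard input keeps the helper independent of APT itself, which makes it easy to try with a hand-written list first.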

2. Disable automatic updates

sudo sed -i.bak 's/"1"/"0"/' /etc/apt/apt.conf.d/10periodic &> /dev/null
sudo sed -i.bak 's/"1"/"0"/' /etc/apt/apt.conf.d/20auto-upgrades &> /dev/null
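Because the commands above discard their output, it can help to see what the substitution does before running it against the real files. The sketch below applies the same kind of edit to a scratch file. The function name `disable_periodic` and the variable `demo_file` are made up for illustration, and the two option lines are the ones normally found in 20auto-upgrades.

```shell
# disable_periodic FILE: switch every APT periodic option in FILE from "1" to
# "0", keeping the original as FILE.bak. A missing file is skipped silently.
# The function name is hypothetical.
disable_periodic() {
  [ -f "$1" ] || return 0
  sed -i.bak 's/"1";/"0";/' "$1"
}

# Try it on a scratch copy instead of the real /etc/apt/apt.conf.d files.
demo_file="$(mktemp)"
cat > "$demo_file" << 'EOF'
APT::Periodic::Update-Package-Lists "1";
APT::Periodic::Unattended-Upgrade "1";
EOF
disable_periodic "$demo_file"
cat "$demo_file"
```

Matching the quoted value rather than a bare 1 keeps the substitution from touching any other digit on the line.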

2. Install Dependencies

Install the build tools that the driver installer needs to compile its kernel module:

apt install gcc make -y
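A quick way to confirm that the build tools are actually on the PATH before starting the driver installer is sketched below. The function name `missing_tools` is made up for illustration. The comment about kernel headers reflects the usual Ubuntu package naming and should be checked against your system.

```shell
# missing_tools CMD...: print each command that cannot be found on PATH.
# The function name is hypothetical.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" > /dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# The driver installer compiles a kernel module, so gcc and make must resolve.
# The headers for the running kernel are needed as well; on Ubuntu that
# package is normally linux-headers-$(uname -r).
missing="$(missing_tools gcc make)"
if [ -n "$missing" ]; then
  echo "not installed: $missing"
fi
```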

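II. Downloading the Drivers

1. NVIDIA Driver

Open the official NVIDIA driver download page: https://www.nvidia.cn/drivers/lookup/

Select the driver version that matches your GPU model and operating system. The download is a single .run installer, which can be uploaded to the server as is.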

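2. CUDA

Open the NVIDIA CUDA download page: https://developer.nvidia.com/cuda-downloads

Select your operating system, architecture, and version. The .run installer is recommended because it makes it easy to choose which components to install.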

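3. NVIDIA Fabric Manager

Note: Fabric Manager is only needed for SXM GPUs; PCIe GPUs do not need this component. Its version must match the installed driver version exactly.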

wget https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2004/x86_64/nvidia-fabricmanager-535_535.154.05-1_amd64.deb
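Since the package version has to track the driver version, it can be convenient to derive the file name from a single version string. The sketch below does that. The function name `fm_deb_name` is made up, and the naming pattern is inferred only from the one file name used in this guide, so treat it as an assumption and check it against the repository listing.

```shell
# fm_deb_name DRIVER_VERSION: print the Fabric Manager .deb file name for a
# driver version, following the pattern of the file name used in this guide:
#   nvidia-fabricmanager-<branch>_<version>-1_amd64.deb
# The pattern is inferred from that single example and is an assumption.
fm_deb_name() {
  version="$1"
  branch="${version%%.*}"   # "535.154.05" -> "535"
  printf 'nvidia-fabricmanager-%s_%s-1_amd64.deb\n' "$branch" "$version"
}

fm_deb_name 535.154.05
```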

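III. Installing the Drivers

1. Installing the NVIDIA Driver

Install the NVIDIA driver with the following command (change the version number to match your download; the .run file must be executable, for example after chmod +x):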

./NVIDIA-Linux-x86_64-535.154.05.run \
  --ui=none \
  --no-questions \
  --accept-license \
  --disable-nouveau \
  --install-libglvnd
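The --disable-nouveau option asks the installer to disable the open-source nouveau driver. If nouveau is already loaded, a reboot is typically needed before the NVIDIA module can take over. The sketch below is one way to check for that. The function name `nouveau_loaded` is made up, and it reads `lsmod`-style text from standard input so it can be tried on sample data.

```shell
# nouveau_loaded: succeed if the module list on standard input (in `lsmod`
# format) contains the nouveau module. The function name is hypothetical.
nouveau_loaded() {
  awk '$1 == "nouveau" { found = 1 } END { exit !found }'
}

# On a real system (not executed here):
#   if lsmod | nouveau_loaded; then
#     echo "nouveau is still loaded; reboot before using the NVIDIA driver"
#   fi
```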

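2. Installing CUDA

Install the CUDA Toolkit (change the version number to match your download). The --toolkit option installs only the toolkit, so the driver installed in the previous step is left in place: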

./cuda_12.2.0_535.154.05_linux.run \
  --silent \
  --toolkit \
  --override \
  --no-drm \
  --no-opengl-libs

3. Installing NVIDIA Fabric Manager

Only for SXM GPUs (for example the SXM variants of the A800 and A100):

dpkg -i nvidia-fabricmanager-535_535.154.05-1_amd64.deb

IV. Service Configuration

1. NVIDIA Fabric Manager Service

Start the Fabric Manager service and enable it at boot (it is used for NVLink connectivity):

systemctl daemon-reload
systemctl start nvidia-fabricmanager
systemctl enable nvidia-fabricmanager

2. NVIDIA Persistence Daemon

Create a systemd unit for the NVIDIA persistence daemon, then enable and start it:

cat << EOF > /lib/systemd/system/nvidia-persistenced.service
[Unit]
Description=NVIDIA Persistence Daemon
After=syslog.target

[Service]
Type=forking
PIDFile=/var/run/nvidia-persistenced/nvidia-persistenced.pid
Restart=always
ExecStart=/usr/bin/nvidia-persistenced --verbose
ExecStopPost=/bin/rm -rf /var/run/nvidia-persistenced/*
TimeoutSec=300

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable nvidia-persistenced
systemctl start nvidia-persistenced
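Once the daemon is running, each GPU should report persistence mode as enabled. The sketch below checks that from the query output of nvidia-smi. The function name `all_persistent` is made up, and it reads the values from standard input so the logic can be tried without a GPU.

```shell
# all_persistent: succeed only if standard input has at least one line and
# every line reads "Enabled". Input is expected in the format printed by
#   nvidia-smi --query-gpu=persistence_mode --format=csv,noheader
# The function name is hypothetical.
all_persistent() {
  awk '{ n++ } $0 != "Enabled" { bad = 1 } END { exit (n == 0 || bad) }'
}

# On a real system (not executed here):
#   nvidia-smi --query-gpu=persistence_mode --format=csv,noheader | all_persistent \
#     || echo "persistence mode is not enabled on every GPU"
```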

V. Environment Configuration

1. CUDA Environment Settings

Append the CUDA paths to the system-wide environment variables:

cat << 'EOF' >> /etc/profile
export PATH=/usr/local/cuda/bin:$PATH
export CUDA_HOME=/usr/local/cuda/
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/usr/local/cuda/targets/x86_64-linux/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/usr/local/cuda/targets/x86_64-linux/lib/stubs:$LD_LIBRARY_PATH
EOF
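Whether the heredoc delimiter is quoted matters here. With a plain EOF the shell expands $PATH and $LD_LIBRARY_PATH while writing, so the current values of whoever runs the command are frozen into /etc/profile. With 'EOF' the variable references are stored literally and expanded at login. The sketch below shows the difference on a scratch file; the variable name `profile_demo` is made up for illustration.

```shell
# Write the same line twice to a scratch file: once with an unquoted
# delimiter (the shell expands $PATH while writing) and once with a quoted
# delimiter (the text $PATH is stored literally).
profile_demo="$(mktemp)"

cat << EOF >> "$profile_demo"
export PATH=/usr/local/cuda/bin:$PATH
EOF

cat << 'EOF' >> "$profile_demo"
export PATH=/usr/local/cuda/bin:$PATH
EOF

# The first line shows the expanded PATH, the second keeps the reference.
cat "$profile_demo"
```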

Load the environment variables into the current shell:

source /etc/profile
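With the variables loaded, nvcc should resolve from /usr/local/cuda/bin, and its reported release can be compared with the toolkit that was installed. The sketch below extracts the release number. The function name `cuda_release` is made up, and the parsing relies on the "release X.Y," wording in the nvcc banner, which should be treated as an assumption.

```shell
# cuda_release: read `nvcc --version` output on standard input and print the
# toolkit release number (for example 12.2). Relies on the "release X.Y,"
# wording of the nvcc banner; treat the exact format as an assumption.
# The function name is hypothetical.
cuda_release() {
  sed -n 's/.*release \([0-9][0-9.]*\),.*/\1/p'
}

# On a real system (not executed here):
#   nvcc --version | cuda_release
```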

VI. Verifying the Environment

1. nvidia-smi Verification

Run nvidia-smi to verify that the driver is installed:

nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05     Driver Version: 535.154.05     CUDA Version: 12.2 |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A800-SXM4-80GB  On   | 00000000:10:00.0 Off |                    0 |
| N/A   41C    P0    62W / 400W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A800-SXM4-80GB  On   | 00000000:16:00.0 Off |                    0 |
| N/A   39C    P0    63W / 400W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
...
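The table above is meant for reading by eye. For a scripted check, nvidia-smi also has a query mode that prints plain CSV. The sketch below flags any GPU whose driver version differs from the expected one. The function name `driver_mismatches` is made up, and it reads the CSV from standard input so it can be tried on sample lines.

```shell
# driver_mismatches EXPECTED: read "index, driver_version" CSV lines on
# standard input, as printed by
#   nvidia-smi --query-gpu=index,driver_version --format=csv,noheader
# and print the index of every GPU whose driver version differs from EXPECTED.
# The function name is hypothetical.
driver_mismatches() {
  # Appending "" forces a string comparison rather than a numeric one.
  awk -F', ' -v want="$1" '($2 "") != (want "") { print $1 }'
}

# On a real system (not executed here):
#   nvidia-smi --query-gpu=index,driver_version --format=csv,noheader \
#     | driver_mismatches 535.154.05
# An empty result means every GPU reports the expected version.
```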

2. NVIDIA Fabric Manager Verification

Check the status of the Fabric Manager service:

systemctl status nvidia-fabricmanager

● nvidia-fabricmanager.service - NVIDIA fabric manager service
   Loaded: loaded (/lib/systemd/system/nvidia-fabricmanager.service; enabled; vendor preset: enabled)
   Active: active (running) since Mon 2023-05-07 08:33:50 CST; 3 months 20 days ago
   Main PID: 2979 (nv-fabricmanage)
   Tasks: 18 (limit: 629145)
   Memory: 17.9M
   CPU: 19min 27.766s
   CGroup: /system.slice/nvidia-fabricmanager.service
           └─2979 /usr/bin/nv-fabricmanager -c /usr/share/nvidia/nvswitch/fabricmanager.cfg
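The status output is convenient to read but awkward to use in scripts. The is-active and is-enabled subcommands of systemctl report the same facts through their exit codes. The sketch below combines them. The function name `service_ok` and the `SYSTEMCTL` override are made up for illustration; the override exists only so the logic can be dry-run without systemd.

```shell
# service_ok UNIT: succeed if UNIT is both running and enabled at boot.
# SYSTEMCTL can be overridden (a made-up hook for dry runs); by default the
# real systemctl is used. The function name is hypothetical.
service_ok() {
  "${SYSTEMCTL:-systemctl}" is-active --quiet "$1" &&
    "${SYSTEMCTL:-systemctl}" is-enabled --quiet "$1"
}

# On a real system (not executed here):
#   service_ok nvidia-fabricmanager || echo "Fabric Manager needs attention"
#   service_ok nvidia-persistenced  || echo "persistence daemon needs attention"
```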

3. NVIDIA Persistence Daemon Verification

Check the status of the persistence daemon:

systemctl status nvidia-persistenced

● nvidia-persistenced.service - NVIDIA Persistence Daemon
   Loaded: loaded (/lib/systemd/system/nvidia-persistenced.service; enabled; vendor preset: enabled)
   Active: active (running) since Fri 2023-05-07 15:13:59 CST; 3 months 16 days ago
   Main PID: 304182 (nvidia-persiste)
   Tasks: 1 (limit: 629145)
   Memory: 360.0K
   CPU: 214ms
   CGroup: /system.slice/nvidia-persistenced.service
           └─304182 /usr/bin/nvidia-persistenced --verbose