1. Check nvidia-smi status
First, run nvidia-smi to check the current state of the GPUs:
ubuntu@ubuntu:~$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.14       Driver Version: 430.14       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:1B:00.0 Off |                  N/A |
| 22%   42C    P0    63W / 250W |      0MiB / 11019MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  Off  | 00000000:1C:00.0 Off |                  N/A |
| 23%   43C    P0    57W / 250W |      0MiB / 11019MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce RTX 208...  Off  | 00000000:1D:00.0 Off |                  N/A |
| 23%   43C    P0    62W / 250W |      0MiB / 11019MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce RTX 208...  Off  | 00000000:1E:00.0 Off |                  N/A |
| 22%   42C    P0    49W / 250W |      0MiB / 11019MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  GeForce RTX 208...  Off  | 00000000:89:00.0 Off |                  N/A |
| 22%   41C    P0    75W / 250W |      0MiB / 11019MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  GeForce RTX 208...  Off  | 00000000:8A:00.0 Off |                  N/A |
| 22%   43C    P0    63W / 250W |      0MiB / 11019MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  GeForce RTX 208...  Off  | 00000000:8B:00.0 Off |                  N/A |
| 22%   41C    P0    50W / 250W |      0MiB / 11019MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+
|   7  GeForce RTX 208...  Off  | 00000000:8C:00.0 Off |                  N/A |
| 20%   41C    P0    54W / 250W |      0MiB / 11019MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
2. Simulate removing a GPU
Remove a specific GPU manually through the PCI sysfs interface:
ubuntu@ubuntu:~$ echo 1 | sudo tee /sys/bus/pci/devices/0000:8a:00.0/remove
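Two details of this step are easy to get wrong. In a command of the form `sudo echo 1 > file`, the redirection is performed by the calling, unprivileged shell rather than by root, so the write has to go through something like `sudo tee`. In addition, nvidia-smi prints Bus IDs with an 8-digit domain and upper-case hex (`00000000:8A:00.0`), while sysfs names the same device `0000:8a:00.0`. The following is a minimal sketch of a helper that handles both; the function names `sysfs_pci_addr` and `remove_gpu` are hypothetical and not part of any NVIDIA or kernel tool.

```shell
# Convert a Bus ID as printed by nvidia-smi or lspci into the sysfs form:
# 4-digit domain, lower-case hex (hypothetical helper).
sysfs_pci_addr() {
    case "$1" in
        *:*:*) printf '%s\n' "$1" ;;        # domain already present
        *)     printf '0000:%s\n' "$1" ;;   # assume domain 0000 if omitted
    esac | tr 'A-F' 'a-f' | sed -E 's/^[0-9a-f]*([0-9a-f]{4}:)/\1/'
}

# Remove one PCI function through sysfs (hypothetical helper).
# The write goes through "sudo tee" because a plain "> file" redirection
# would be opened by the unprivileged shell, not by root.
remove_gpu() {
    dev="/sys/bus/pci/devices/$(sysfs_pci_addr "$1")"
    if [ ! -e "$dev/remove" ]; then
        echo "no such PCI device: $dev" >&2
        return 1
    fi
    echo 1 | sudo tee "$dev/remove" > /dev/null
}
```

Under these assumptions, `remove_gpu 00000000:8A:00.0` performs the same removal as the command above.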
After the removal, check nvidia-smi again:
ubuntu@ubuntu:~$ nvidia-smi
Example output after the GPU is removed:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.14       Driver Version: 430.14       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:1B:00.0 Off |                  N/A |
| 22%   42C    P0    63W / 250W |      0MiB / 11019MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  Off  | 00000000:1C:00.0 Off |                  N/A |
| 24%   44C    P0    55W / 250W |      0MiB / 11019MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce RTX 208...  Off  | 00000000:1D:00.0 Off |                  N/A |
| 23%   43C    P0    61W / 250W |      0MiB / 11019MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce RTX 208...  Off  | 00000000:1E:00.0 Off |                  N/A |
| 22%   42C    P0    50W / 250W |      0MiB / 11019MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  GeForce RTX 208...  Off  | 00000000:89:00.0 Off |                  N/A |
| 25%   44C    P2    77W / 250W |  10527MiB / 11019MiB |     12%      Default |
+-------------------------------+----------------------+----------------------+
|   5  GeForce RTX 208...  Off  | 00000000:8B:00.0 Off |                  N/A |
| 22%   40C    P0    49W / 250W |      0MiB / 11019MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+
|   6  GeForce RTX 208...  Off  | 00000000:8C:00.0 Off |                  N/A |
| 19%   40C    P0    54W / 250W |      0MiB / 11019MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    4      6149      C   python                                     10517MiB |
+-----------------------------------------------------------------------------+
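Note: the original GPU 5 (Bus ID 8A:00.0) is no longer visible, and the GPU count has dropped from 8 to 7. The GPUs that followed it (8B:00.0 and 8C:00.0) now appear as indices 5 and 6.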
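To confirm which GPU disappeared without comparing the two tables by eye, one option is to save the Bus ID list before and after the removal and print the difference. This is a minimal sketch, assuming the lists were captured with `nvidia-smi --query-gpu=pci.bus_id --format=csv,noheader` (one Bus ID per line); the function name `missing_gpus` and the file names are hypothetical.

```shell
# Capture the lists around the removal step, for example:
#   nvidia-smi --query-gpu=pci.bus_id --format=csv,noheader > before.txt
#   ... remove the GPU ...
#   nvidia-smi --query-gpu=pci.bus_id --format=csv,noheader > after.txt

# Print every line of the first file that has no exact match in the second.
# -F: fixed strings, -x: whole-line match, -v: invert, -f: patterns from file.
missing_gpus() {
    grep -Fxv -f "$2" "$1"
}
```

For the two states shown above, `missing_gpus before.txt after.txt` would print `00000000:8A:00.0`.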
3. Rescan PCI devices
Use the following command to rescan the PCI bus:
echo 1 | sudo tee /sys/bus/pci/rescan
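Writing 1 to this file makes the kernel rescan the PCI buses and add any device it finds that is not currently registered, which brings the GPU removed above back into the system.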
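A script that triggers the rescan usually wants to know that the device has actually reappeared before it continues. The sketch below polls for the sysfs entry after the rescan. The function names `wait_for_path` and `rescan_and_wait` and the 10-second default are assumptions made for illustration. A device reappearing in sysfs does not by itself show that the NVIDIA driver has finished binding to it, so checking nvidia-smi afterwards is still worthwhile.

```shell
# Poll once per second until the path exists or the retry budget runs out
# (hypothetical helper). Returns 0 if the path exists, non-zero otherwise.
wait_for_path() {
    path=$1
    tries=${2:-10}
    while [ "$tries" -gt 0 ]; do
        [ -e "$path" ] && return 0
        tries=$((tries - 1))
        sleep 1
    done
    [ -e "$path" ]
}

# Trigger a PCI rescan, then wait for the given device (sysfs address form,
# e.g. 0000:8a:00.0) to show up again (hypothetical helper).
rescan_and_wait() {
    echo 1 | sudo tee /sys/bus/pci/rescan > /dev/null || return 1
    wait_for_path "/sys/bus/pci/devices/$1" "${2:-10}"
}
```

For the GPU removed earlier, this would be called as `rescan_and_wait 0000:8a:00.0`.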
Getting GPU Bus ID information
Method 1: list VGA controllers with lspci
ubuntu@ubuntu:~$ lspci | grep VGA
Example output:
03:00.0 VGA compatible controller: ASPEED Technology, Inc. ASPEED Graphics Family (rev 41)
1b:00.0 VGA compatible controller: NVIDIA Corporation Device 1e04 (rev a1)
1c:00.0 VGA compatible controller: NVIDIA Corporation Device 1e04 (rev a1)
1d:00.0 VGA compatible controller: NVIDIA Corporation Device 1e04 (rev a1)
1e:00.0 VGA compatible controller: NVIDIA Corporation Device 1e04 (rev a1)
89:00.0 VGA compatible controller: NVIDIA Corporation Device 1e04 (rev a1)
8b:00.0 VGA compatible controller: NVIDIA Corporation Device 1e04 (rev a1)
8c:00.0 VGA compatible controller: NVIDIA Corporation Device 1e04 (rev a1)
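`grep VGA` also matches the onboard ASPEED adapter (`03:00.0` above), and it would miss NVIDIA GPUs that lspci reports under a different class such as `3D controller`. A sketch of a narrower filter follows; it assumes lspci's default one-line output format (without `-nn`), and the function name `nvidia_gpu_addrs` is hypothetical. lspci can also filter by vendor ID directly with `lspci -d 10de:` (`10de` is NVIDIA's PCI vendor ID) and print domain-qualified addresses with `-D`.

```shell
# Read lspci output on stdin and print the address of every NVIDIA
# display-class function (VGA compatible controller or 3D controller).
nvidia_gpu_addrs() {
    awk '/(VGA compatible|3D) controller: NVIDIA/ { print $1 }'
}

# Typical use:
#   lspci -D | nvidia_gpu_addrs
```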

Method 2: show Bus IDs with nvidia-smi
ubuntu@ubuntu:~$ nvidia-smi -a | grep "Bus Id"
Example output:
Bus Id : 00000000:1B:00.0
Bus Id : 00000000:1C:00.0
Bus Id : 00000000:1D:00.0
Bus Id : 00000000:1E:00.0
Bus Id : 00000000:89:00.0
Bus Id : 00000000:8B:00.0
Bus Id : 00000000:8C:00.0
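When the Bus IDs are consumed by a script, nvidia-smi's query mode is easier to parse than `-a | grep`. The sketch below assumes input produced by `nvidia-smi --query-gpu=index,pci.bus_id --format=csv,noheader` and prints each GPU index next to the matching sysfs-style address (4-digit domain, lower-case hex). The function name `gpu_sysfs_addrs` is hypothetical.

```shell
# Read "index, bus_id" CSV lines on stdin and print "index sysfs_address".
gpu_sysfs_addrs() {
    awk -F', *' '{
        n = split(tolower($2), part, ":")   # domain, bus, device.function
        if (n != 3) next                    # skip lines in another format
        # Keep only the last four hex digits of the domain.
        domain = substr(part[1], length(part[1]) - 3)
        printf "%s %s:%s:%s\n", $1, domain, part[2], part[3]
    }'
}

# Typical use:
#   nvidia-smi --query-gpu=index,pci.bus_id --format=csv,noheader | gpu_sysfs_addrs
```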
Method 3: list all NVIDIA devices with lspci
ubuntu@ubuntu:~$ lspci | grep -i nvidia
Example output:
1b:00.0 VGA compatible controller: NVIDIA Corporation Device 1e04 (rev a1)
1b:00.1 Audio device: NVIDIA Corporation Device 10f7 (rev a1)
1b:00.2 USB controller: NVIDIA Corporation Device 1ad6 (rev a1)
1b:00.3 Serial bus controller [0c80]: NVIDIA Corporation Device 1ad7 (rev a1)
1c:00.0 VGA compatible controller: NVIDIA Corporation Device 1e04 (rev a1)
1c:00.1 Audio device: NVIDIA Corporation Device 10f7 (rev a1)
1c:00.2 USB controller: NVIDIA Corporation Device 1ad6 (rev a1)
1c:00.3 Serial bus controller [0c80]: NVIDIA Corporation Device 1ad7 (rev a1)
[...]
8a:00.1 Audio device: NVIDIA Corporation Device 10f7 (rev a1)
8a:00.2 USB controller: NVIDIA Corporation Device 1ad6 (rev a1)
8a:00.3 Serial bus controller [0c80]: NVIDIA Corporation Device 1ad7 (rev a1)
[...]
Note: this output still lists 8a:00.1, 8a:00.2, and 8a:00.3, but 8a:00.0 (the main VGA controller) is missing, which confirms that the GPU has indeed been removed.
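
The same check can be automated: a slot whose function 0 is missing while other functions are still listed is a candidate for a removed GPU. This is a minimal sketch that reads lspci output on stdin; the function name `slots_missing_fn0` is hypothetical, and the heuristic assumes that only function 0 was removed, as in the example above.

```shell
# Print every bus:device slot that has functions listed but no function 0.
slots_missing_fn0() {
    awk '$1 ~ /^[0-9a-f:]+\.[0-7]$/ {
        split($1, a, ".")               # a[1] = bus:device, a[2] = function
        seen[a[1]] = 1
        if (a[2] == "0") has_fn0[a[1]] = 1
    }
    END {
        for (slot in seen)
            if (!(slot in has_fn0)) print slot
    }' | sort
}

# Typical use:
#   lspci | grep -i nvidia | slots_missing_fn0
```

For the listing above, this prints `8a:00`.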