1. Check the nvidia-smi status

First, run nvidia-smi to see the current state of all GPUs:

ubuntu@ubuntu:~$ nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.14       Driver Version: 430.14       CUDA Version: 10.2   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:1B:00.0 Off |                  N/A |
| 22%   42C    P0    63W / 250W |      0MiB / 11019MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  Off  | 00000000:1C:00.0 Off |                  N/A |
| 23%   43C    P0    57W / 250W |      0MiB / 11019MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce RTX 208...  Off  | 00000000:1D:00.0 Off |                  N/A |
| 23%   43C    P0    62W / 250W |      0MiB / 11019MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce RTX 208...  Off  | 00000000:1E:00.0 Off |                  N/A |
| 22%   42C    P0    49W / 250W |      0MiB / 11019MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  GeForce RTX 208...  Off  | 00000000:89:00.0 Off |                  N/A |
| 22%   41C    P0    75W / 250W |      0MiB / 11019MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  GeForce RTX 208...  Off  | 00000000:8A:00.0 Off |                  N/A |
| 22%   43C    P0    63W / 250W |      0MiB / 11019MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  GeForce RTX 208...  Off  | 00000000:8B:00.0 Off |                  N/A |
| 22%   41C    P0    50W / 250W |      0MiB / 11019MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+
|   7  GeForce RTX 208...  Off  | 00000000:8C:00.0 Off |                  N/A |
| 20%   41C    P0    54W / 250W |      0MiB / 11019MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

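2. Simulate removing a GPU

Manually remove a specific GPU through the PCI sysfs interface. The write itself must run as root, so pipe the value through sudo tee; with "sudo echo 1 > file" the redirection is opened by the unprivileged shell and fails with "Permission denied":

ubuntu@ubuntu:~$ echo 1 | sudo tee /sys/bus/pci/devices/0000:8a:00.0/remove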
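
Before writing to a device's remove node, it can help to confirm that the address actually resolves to a registered PCI device, since a mistyped address only makes the write fail. A minimal sketch, assuming a hypothetical helper name `gpu_remove_path` and a hypothetical `SYSFS_PCI_ROOT` override that exists only for dry runs:

```shell
# Hypothetical helper: print the sysfs "remove" node for a PCI address,
# or fail if no such device is currently registered.
# SYSFS_PCI_ROOT is an assumed override for dry runs; by default the
# real sysfs directory is used.
gpu_remove_path() {
    addr=$1
    root=${SYSFS_PCI_ROOT:-/sys/bus/pci/devices}
    if [ ! -d "$root/$addr" ]; then
        echo "no PCI device at $addr" >&2
        return 1
    fi
    printf '%s\n' "$root/$addr/remove"
}
```

It could be combined with the write like this: `node=$(gpu_remove_path 0000:8a:00.0) && echo 1 | sudo tee "$node"`.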

Then check the nvidia-smi status again:

ubuntu@ubuntu:~$ nvidia-smi

Example output after the GPU has been removed:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.14       Driver Version: 430.14       CUDA Version: 10.2   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:1B:00.0 Off |                  N/A |
| 22%   42C    P0    63W / 250W |      0MiB / 11019MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  Off  | 00000000:1C:00.0 Off |                  N/A |
| 24%   44C    P0    55W / 250W |      0MiB / 11019MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce RTX 208...  Off  | 00000000:1D:00.0 Off |                  N/A |
| 23%   43C    P0    61W / 250W |      0MiB / 11019MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce RTX 208...  Off  | 00000000:1E:00.0 Off |                  N/A |
| 22%   42C    P0    50W / 250W |      0MiB / 11019MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  GeForce RTX 208...  Off  | 00000000:89:00.0 Off |                  N/A |
| 25%   44C    P2    77W / 250W |  10527MiB / 11019MiB |     12%      Default |
+-------------------------------+----------------------+----------------------+
|   5  GeForce RTX 208...  Off  | 00000000:8B:00.0 Off |                  N/A |
| 22%   40C    P0    49W / 250W |      0MiB / 11019MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+
|   6  GeForce RTX 208...  Off  | 00000000:8C:00.0 Off |                  N/A |
| 19%   40C    P0    54W / 250W |      0MiB / 11019MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    4      6149    C   python                                    10517MiB |
+-----------------------------------------------------------------------------+

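Note: the GPU that was previously GPU 5 (Bus ID 8A:00.0) is no longer listed, and the GPU count has dropped from 8 to 7. The remaining devices are renumbered, so 8B:00.0 and 8C:00.0 now appear as GPU 5 and GPU 6.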
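
To identify the missing GPU without comparing the two tables by eye, one option is to save the Bus IDs before and after the removal and diff the lists. The sketch below assumes each list was saved one Bus ID per line (for example with `nvidia-smi --query-gpu=pci.bus_id --format=csv,noheader`); the function name `missing_bus_ids` and the file names are hypothetical:

```shell
# Hypothetical helper: print Bus IDs that appear in the "before" list
# but not in the "after" list (one Bus ID per line in each file).
missing_bus_ids() {
    before=$1
    after=$2
    # -v: invert the match, -x: match whole lines only,
    # -F: fixed strings, -f: read the patterns from a file
    grep -vxFf "$after" "$before"
}
```

For example, `missing_bus_ids before.txt after.txt` prints every Bus ID that was present only in the first snapshot.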

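3. Rescan the PCI bus

Rescanning makes the kernel re-enumerate the PCI bus and register any device that is physically present but not currently known to the system, which brings the removed GPU back. The write requires root:

ubuntu@ubuntu:~$ echo 1 | sudo tee /sys/bus/pci/rescan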
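
A script that continues after the rescan can confirm that the device is registered again before it tries to use it. A minimal sketch, assuming a hypothetical helper `wait_for_pci_device` and a hypothetical `SYSFS_PCI_ROOT` override intended only for dry runs:

```shell
# Hypothetical helper: poll until a PCI address is registered in sysfs,
# giving up after a number of one-second attempts (default 5).
# SYSFS_PCI_ROOT is an assumed override; it defaults to the real sysfs path.
wait_for_pci_device() {
    addr=$1
    attempts=${2:-5}
    root=${SYSFS_PCI_ROOT:-/sys/bus/pci/devices}
    while [ "$attempts" -gt 0 ]; do
        [ -e "$root/$addr" ] && return 0
        attempts=$((attempts - 1))
        sleep 1
    done
    return 1
}
```

A possible use is `echo 1 | sudo tee /sys/bus/pci/rescan && wait_for_pci_device 0000:8a:00.0`, which succeeds only if the address shows up within the polling window.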

Getting the GPU Bus IDs

Method 1: list the VGA controllers with lspci

ubuntu@ubuntu:~$ lspci | grep VGA

Example output:

03:00.0 VGA compatible controller: ASPEED Technology, Inc. ASPEED Graphics Family (rev 41)
1b:00.0 VGA compatible controller: NVIDIA Corporation Device 1e04 (rev a1)
1c:00.0 VGA compatible controller: NVIDIA Corporation Device 1e04 (rev a1)
1d:00.0 VGA compatible controller: NVIDIA Corporation Device 1e04 (rev a1)
1e:00.0 VGA compatible controller: NVIDIA Corporation Device 1e04 (rev a1)
89:00.0 VGA compatible controller: NVIDIA Corporation Device 1e04 (rev a1)
8b:00.0 VGA compatible controller: NVIDIA Corporation Device 1e04 (rev a1)
8c:00.0 VGA compatible controller: NVIDIA Corporation Device 1e04 (rev a1)
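
The VGA filter also matches the onboard ASPEED controller, so a script that wants only the NVIDIA GPUs needs one more condition. A small sketch that reads lspci output on standard input and prints just the addresses; the function name `nvidia_vga_addrs` is hypothetical:

```shell
# Hypothetical helper: from lspci output on stdin, print the address
# (first column) of every NVIDIA VGA controller and skip everything else.
nvidia_vga_addrs() {
    awk '/VGA compatible controller/ && /NVIDIA/ { print $1 }'
}
```

It would be used as `lspci | nvidia_vga_addrs`.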


Method 2: list the Bus IDs with nvidia-smi

ubuntu@ubuntu:~$ nvidia-smi -a | grep "Bus Id"

Example output:

Bus Id : 00000000:1B:00.0
Bus Id : 00000000:1C:00.0
Bus Id : 00000000:1D:00.0
Bus Id : 00000000:1E:00.0
Bus Id : 00000000:89:00.0
Bus Id : 00000000:8B:00.0
Bus Id : 00000000:8C:00.0
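
nvidia-smi prints Bus IDs with an eight-digit domain and uppercase hex digits (00000000:8B:00.0), while the sysfs paths under /sys/bus/pci/devices use a four-digit domain in lowercase (0000:8b:00.0). A small sketch of the conversion, with `to_sysfs_addr` as a hypothetical helper name:

```shell
# Hypothetical helper: convert an nvidia-smi style Bus ID
# (00000000:8B:00.0) to the form used in sysfs paths (0000:8b:00.0).
to_sysfs_addr() {
    printf '%s\n' "$1" \
        | tr 'ABCDEF' 'abcdef' \
        | sed 's/^0000\([0-9a-f]\{4\}:\)/\1/'   # drop the extra four leading domain digits
}
```

An address that already has a four-digit domain passes through unchanged apart from the lowercasing.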

Method 3: list all NVIDIA devices with lspci

ubuntu@ubuntu:~$ lspci | grep -i nvidia

Example output:

1b:00.0 VGA compatible controller: NVIDIA Corporation Device 1e04 (rev a1)
1b:00.1 Audio device: NVIDIA Corporation Device 10f7 (rev a1)
1b:00.2 USB controller: NVIDIA Corporation Device 1ad6 (rev a1)
1b:00.3 Serial bus controller [0c80]: NVIDIA Corporation Device 1ad7 (rev a1)
1c:00.0 VGA compatible controller: NVIDIA Corporation Device 1e04 (rev a1)
1c:00.1 Audio device: NVIDIA Corporation Device 10f7 (rev a1)
1c:00.2 USB controller: NVIDIA Corporation Device 1ad6 (rev a1)
1c:00.3 Serial bus controller [0c80]: NVIDIA Corporation Device 1ad7 (rev a1)
[...]
8a:00.1 Audio device: NVIDIA Corporation Device 10f7 (rev a1)
8a:00.2 USB controller: NVIDIA Corporation Device 1ad6 (rev a1)
8a:00.3 Serial bus controller [0c80]: NVIDIA Corporation Device 1ad7 (rev a1)
[...]

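Note: this output still lists 8a:00.1, 8a:00.2, and 8a:00.3, but 8a:00.0 (the main VGA controller) is missing, which confirms that the GPU function itself was removed while its companion functions stayed registered.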
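
The same check can be automated by grouping the lspci lines by slot and reporting every slot that has companion functions but no function 0. A sketch that reads lspci output on standard input; the function name `slots_missing_fn0` is hypothetical:

```shell
# Hypothetical helper: from lspci output on stdin, print each slot
# (bus:device) that has at least one function listed but no function 0.
slots_missing_fn0() {
    awk '
        # Only consider lines whose first field looks like a PCI address,
        # which skips separators such as "[...]".
        $1 ~ /^([0-9a-fA-F]+:)+[0-9a-fA-F]+\.[0-7]$/ {
            split($1, parts, ".")      # "8a:00.1" -> slot "8a:00", function "1"
            seen[parts[1]] = 1
            if (parts[2] == "0") has_fn0[parts[1]] = 1
        }
        END {
            for (slot in seen)
                if (!(slot in has_fn0)) print slot
        }
    ' | sort
}
```

Running `lspci | grep -i nvidia | slots_missing_fn0` on the system above would be expected to report the 8a:00 slot.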