0. Enable vGPU
Using vGPU in KubeVirt
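- Install the vGPU software on each compute node that has a GPU. For the installation procedure, see
  openstack计算节点上vGPU虚拟化配置(A10).pdf ("vGPU virtualization configuration on OpenStack compute nodes (A10)"). Note: the nouveau driver must be disabled.
- Generate a UUID for the desired vGPU type: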

      # 0000:3b:00.4 is the VF
      cd /sys/class/mdev_bus/0000:3b:00.4/mdev_supported_types
      cd nvidia-594
      # write the generated UUID into create
      uuidgen > ../nvidia-594/create
      ls -l /sys/bus/mdev/devices/
      total 0
      lrwxrwxrwx 1 root root 0 May 27 12:48 d289e97f-8ff9-4ec3-aa32-7e89d736a6f8 -> ../../../devices/pci0000:3a/0000:3a:00.0/0000:3b:00.4/d289e97f-8ff9-4ec3-aa32-7e89d736a6f8

  nvidia-594 corresponds to NVIDIA A10-6Q, which supports 4 vGPUs, so 3 more can be created under other VFs. Repeat the steps above on each of those VFs.

  To check whether a VF can still be assigned a vGPU, run the command below. An available count of 0 means that no further vGPU mdev device can be created on the current VF.

      cd mdev_supported_types/
      [root@srv11-kvm mdev_supported_types]# for i in * ; do echo $i, $(cat $i/name) $(cat $i/ava*) ; done
      nvidia-588, NVIDIA A10-1B 0
      nvidia-589, NVIDIA A10-2B 0
      nvidia-590, NVIDIA A10-1Q 0
      nvidia-591, NVIDIA A10-2Q 0
      nvidia-592, NVIDIA A10-3Q 0
      nvidia-593, NVIDIA A10-4Q 0
      nvidia-594, NVIDIA A10-6Q 0
      nvidia-595, NVIDIA A10-8Q 0
      nvidia-596, NVIDIA A10-12Q 0
      nvidia-597, NVIDIA A10-24Q 0
      nvidia-598, NVIDIA A10-1A 0
      nvidia-599, NVIDIA A10-2A 0
      nvidia-600, NVIDIA A10-3A 0
      nvidia-601, NVIDIA A10-4A 0
      nvidia-602, NVIDIA A10-6A 0
      nvidia-603, NVIDIA A10-8A 0
      nvidia-604, NVIDIA A10-12A 0
      nvidia-605, NVIDIA A10-24A 0
      nvidia-610, NVIDIA A10-4C 0
      nvidia-611, NVIDIA A10-6C 0
      nvidia-612, NVIDIA A10-8C 0
      nvidia-613, NVIDIA A10-12C 0
      nvidia-614, NVIDIA A10-24C 0

  The output lists the vGPU partition types that the A10 supports, together with the number of vGPU instances still available on the current VF.

- Enable the GPU feature gate in the KubeVirt CR and configure the mdev devices. Pay attention to the YAML indentation:

      spec:
        configuration:
          developerConfiguration:
            featureGates:
            # the GPU feature gate must be enabled
            - GPU
          ...
          mediatedDevicesConfiguration:
            nodeMediatedDeviceTypes:
            - mediatedDevicesTypes:
              - nvidia-594
              nodeSelector:
                kubernetes.io/hostname: k8s-11
          # https://kubevirt.io/user-guide/virtual_machines/host-devices/#listing-permitted-devices
          permittedHostDevices:
            mediatedDevices:
            # mdevNameSelector: read the value from /sys/class/mdev_bus/0000:xx:xx.x/mdev_supported_types/nvidia-xx/name
            - mdevNameSelector: NVIDIA A10-6Q
              # resourceName: referenced when creating a VM, under domain.devices.gpus[].deviceName of the VirtualMachine spec
              resourceName: nvidia.com/NVIDIA_A10-6Q
            pciHostDevices:
            - externalResourceProvider: true
              # vendor_id:product_id, as shown by lspci -nnv|grep -i nvidia
              pciVendorSelector: 10DE:2236
              resourceName: nvidia.com/GA102GL_A10
            # passthrough card: RTX 3080 Ti
            - externalResourceProvider: true
              pciVendorSelector: 10DE:2208
              resourceName: nvidia.com/GA102_GEFORCE_RTX_3080_TI
            ## To switch the A10 to passthrough as well, configure it as follows
            #- externalResourceProvider: true
            #  pciVendorSelector: 10DE:2236
            #  resourceName: nvidia.com/GA102GL_A10

- Install the vGPU device plugin.
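- After the vGPU plugin is installed, the command kubectl describe node k8s-11 shows how many vGPUs are currently available on the node and how many have already been allocated.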
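
Once the node advertises the vGPU resource, a VM requests it by naming that resource in its GPU device list. The following is a minimal sketch rather than a complete VM definition: the function name render_vgpu_vmi, the VMI name, the device name gpu1, and the 2Gi memory request are all illustrative assumptions, and disks and volumes are omitted. Only the deviceName value comes from the permittedHostDevices configuration above.

```shell
#!/usr/bin/env bash
# Sketch: print a minimal VirtualMachineInstance manifest that requests one vGPU.
# render_vgpu_vmi is a hypothetical helper; name, gpu1 and 2Gi are placeholders.
render_vgpu_vmi() {
  local name="$1" resource="$2"
  cat <<EOF
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  name: ${name}
spec:
  domain:
    devices:
      gpus:
      # deviceName must match a resourceName listed under permittedHostDevices
      - deviceName: ${resource}
        name: gpu1
    resources:
      requests:
        memory: 2Gi
EOF
}

# Example usage (assumes kubectl is configured for the cluster):
#   render_vgpu_vmi vmi-vgpu nvidia.com/NVIDIA_A10-6Q | kubectl apply -f -
```

Printing the manifest to stdout keeps the sketch easy to review before anything is applied to the cluster. For a VirtualMachine object the same gpus list sits under spec.template.spec.domain.devices.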
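Question 1: The KubeVirt CR has been modified and a UUID has been generated for the vGPU type and written to nvidia-594/create. How can the device under /sys/bus/mdev/devices/xxxx be unloaded or deleted so that it can be added again?
To unload or delete a device under /sys/bus/mdev/devices/, proceed as follows: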

    # fuser -c /sys/bus/mdev/devices/d289e97f-8ff9-4ec3-aa32-7e89d736a6f8/
    /sys/devices/pci0000:3a/0000:3a:00.0/0000:3b:00.4/d289e97f-8ff9-4ec3-aa32-7e89d736a6f8: 2224 3847883c
    # echo 1 > /sys/bus/mdev/devices/d289e97f-8ff9-4ec3-aa32-7e89d736a6f8/remove
    # fuser -c /sys/bus/mdev/devices/d289e97f-8ff9-4ec3-aa32-7e89d736a6f8/
    Specified filename /sys/bus/mdev/devices/d289e97f-8ff9-4ec3-aa32-7e89d736a6f8/ does not exist.

Solution: repeat step 2 to generate a new UUID manually, then delete the nvidia-kubevirt-gpu pod that runs on that host.
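
The remove-and-recreate cycle described above can be sketched as a few shell functions. This is an illustration under stated assumptions, not a tested procedure for any particular driver version: the function names and the SYSFS_ROOT variable are hypothetical (SYSFS_ROOT exists only so the functions can be tried against a scratch directory), and the sysfs paths are the ones used earlier in this document.

```shell
#!/usr/bin/env bash
# Sketch: remove an mdev device and create a fresh one of the same vGPU type.
# SYSFS_ROOT is a hypothetical override; on a real host it stays /sys.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# Print available_instances for a vGPU type on one VF,
# e.g. mdev_available 0000:3b:00.4 nvidia-594
mdev_available() {
  local vf="$1" type="$2"
  cat "$SYSFS_ROOT/class/mdev_bus/$vf/mdev_supported_types/$type/available_instances"
}

# Remove an existing mdev device by UUID (writes 1 to its remove file).
mdev_remove() {
  local uuid="$1"
  local dev="$SYSFS_ROOT/bus/mdev/devices/$uuid"
  if [ ! -e "$dev" ]; then
    echo "mdev $uuid does not exist" >&2
    return 1
  fi
  echo 1 > "$dev/remove"
}

# Create a new mdev of the given type on the given VF and print its UUID.
# Refuses to proceed when the VF reports no free instance of that type.
mdev_create() {
  local vf="$1" type="$2"
  local type_dir="$SYSFS_ROOT/class/mdev_bus/$vf/mdev_supported_types/$type"
  local avail uuid
  avail="$(mdev_available "$vf" "$type")" || return 1
  if [ "$avail" -lt 1 ]; then
    echo "no free $type instance on $vf" >&2
    return 1
  fi
  # Fall back to the kernel's UUID source if uuidgen is not installed.
  uuid="$(uuidgen 2>/dev/null || cat /proc/sys/kernel/random/uuid)"
  echo "$uuid" > "$type_dir/create"
  echo "$uuid"
}

# Example usage (as root on the GPU host):
#   mdev_remove d289e97f-8ff9-4ec3-aa32-7e89d736a6f8
#   mdev_create 0000:3b:00.4 nvidia-594
```

After the new device exists, the nvidia-kubevirt-gpu pod on that host still has to be deleted as described above. The sketch leaves that step out because the pod name and namespace depend on the deployment.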