Hi,
Maybe someone experienced a similar issue. We use:
- HP Proliant DL380 Gen9 Servers latest firmware (881936_001_spp-2017.07.2-SPP2017072.2017_0922.6)
- Tesla M60 latest drivers (NVIDIA-vGPU-xenserver-7.1-384.73.x86_64.rpm)
- Windows 7 Enterprise 64Bit, latest driver (385.41_grid_win8_win7_server2012R2_server2008R2_64bit_international.exe)
- Xendesktop (Win7)and XenApp (Windows Server 2012 R2), 7.13
- XenServer 7.1, latest updates applied
- GRID M60-0B profiles 512MB
Since we updated to the latest driver, NVIDIA-GRID-XenServer-7.1-384.73-385.41, we see various VMs just freezing while people are working on them. The Win7 OS crashes. We also see the following issue in Citrix XenCenter when the Delivery Controller tries to boot new VMs: "An emulator required to run this VM failed to start." The same applies to the XenApp servers: they freeze, the VMs hang, and finally they crash.
In the console of the XenServer, nvidia-smi shows that one card is at 100% vgpu use.
Mon Oct 2 11:54:15 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.73 Driver Version: 384.73 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla M60 On | 00000000:86:00.0 Off | Off |
| N/A 45C P8 25W / 150W | 3066MiB / 8191MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla M60 On | 00000000:87:00.0 Off | Off |
| N/A 48C P0 58W / 150W | 18MiB / 8191MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
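(Side note: with the 384.73 vGPU manager it should also be possible to break the utilization down per vGPU on the XenServer host, which might help identify which guest is driving the load. Something along these lines, if your build supports the vgpu subcommand; the exact flags and output may differ:)
nvidia-smi vgpu        # list the active vGPUs and the VMs they are attached to
nvidia-smi vgpu -u     # per-vGPU engine utilization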
One could think that this is some kind of memory exhaustion, but we see that it happens out of the blue, when neither memory nor the GPU is fully under load. Here is the state just before it happens:
timestamp         name       pci.bus_id        driver_version  pstate  pcie.link.gen.max  pcie.link.gen.current  temperature.gpu  utilization.gpu [%]  utilization.memory [%]  memory.total [MiB]  memory.free [MiB]  memory.used [MiB]
02.10.2017 09:03  Tesla M60  00000000:87:00.0  384.73          P0      3                  3                      40               1 %                  0 %                     8191 MiB            3093 MiB            5098 MiB
02.10.2017 09:03  Tesla M60  00000000:87:00.0  384.73          P0      3                  3                      40               3 %                  0 %                     8191 MiB            3093 MiB            5098 MiB
02.10.2017 09:03  Tesla M60  00000000:87:00.0  384.73          P0      3                  3                      41               16 %                 1 %                     8191 MiB            3093 MiB            5098 MiB
02.10.2017 09:03  Tesla M60  00000000:87:00.0  384.73          P0      3                  3                      41               100 %                0 %                     8191 MiB            3093 MiB            5098 MiB
02.10.2017 09:04  Tesla M60  00000000:87:00.0  384.73          P0      3                  3                      43               100 %                0 %                     8191 MiB            3093 MiB            5098 MiB
02.10.2017 09:04  Tesla M60  00000000:87:00.0  384.73          P0      3                  3                      43               100 %                0 %                     8191 MiB            3093 MiB            5098 MiB
02.10.2017 09:04  Tesla M60  00000000:87:00.0  384.73          P0      3                  3                      44               100 %                0 %                     8191 MiB            3093 MiB            5098 MiB
As you can see, the load was not particularly high just before this happened.
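(For reference: the column headers above look like output from nvidia-smi's CSV query interface, so a command along these lines should produce an equivalent log; the 10-second sampling interval is just an example:)
nvidia-smi --query-gpu=timestamp,name,pci.bus_id,driver_version,pstate,pcie.link.gen.max,pcie.link.gen.current,temperature.gpu,utilization.gpu,utilization.memory,memory.total,memory.free,memory.used --format=csv -l 10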
The sad thing about this is that users lose their work when the VMs crash. On top of that, the VMs cannot be started again; the only thing that resolves the issue on a temporary basis is to reboot the XenServer. Sadly, even that does not really help, since it happens again quickly. We had to remove all our GPUs from the VMs,...
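(In case it helps anyone in the same situation, detaching the vGPU from a VM can be done on the XenServer CLI roughly as follows, with the VM shut down; the UUIDs are placeholders:)
xe vgpu-list vm-uuid=<vm-uuid>      # find the vGPU object attached to the VM
xe vgpu-destroy uuid=<vgpu-uuid>    # remove that vGPU from the VM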
Citrix claims this issue is not their problem. Everything points to Nvidia at the moment.
We first saw this issue on 27.09.2017.