宫昊

[Q&A]

Tesla M60 freeze, 100% load issue

Hi,

Maybe someone experienced a similar issue. We use:
  • HP ProLiant DL380 Gen9 servers, latest firmware (881936_001_spp-2017.07.2-SPP2017072.2017_0922.6)
  • Tesla M60, latest drivers (NVIDIA-vGPU-xenserver-7.1-384.73.x86_64.rpm)
  • Windows 7 Enterprise 64-bit, latest driver (385.41_grid_win8_win7_server2012R2_server2008R2_64bit_international.exe)
  • XenDesktop (Win7) and XenApp (Windows Server 2012 R2), 7.13
  • XenServer 7.1, latest updates applied
  • GRID M60-0B profiles (512MB)


Since we updated to the latest driver, NVIDIA-GRID-XenServer-7.1-384.73-385.41, we have seen various VMs freeze while people are working on them. The Win7 OS crashes. We also see the following issue in Citrix XenCenter when the Delivery Controller tries to boot new VMs: "An emulator required to run this VM failed to start." The same applies to the XenApp servers: freezes, VMs hanging, and finally crashes.
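
A diagnostic step that may help in this situation (a general suggestion, not something confirmed in this thread): when vGPU-enabled VMs fail to start on XenServer, the NVIDIA host driver usually logs NVRM/Xid messages in dom0, which can be checked with:

grep -i -e NVRM -e Xid /var/log/messages | tail -n 50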

In the XenServer console, nvidia-smi shows one card at 100% vGPU utilization.


Mon Oct  2 11:54:15 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.73                 Driver Version: 384.73                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M60           On   | 00000000:86:00.0 Off |                  Off |
| N/A   45C    P8    25W / 150W |   3066MiB /  8191MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M60           On   | 00000000:87:00.0 Off |                  Off |
| N/A   48C    P0    58W / 150W |     18MiB /  8191MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+


One could think that this is some kind of memory exhaustion, but we see that it happens out of the blue, when neither memory nor the GPU is fully loaded. Here is the state just before it happens:

timestamp name pci.bus_id driver_version pstate pcie.link.gen.max pcie.link.gen.current temperature.gpu utilization.gpu [%] utilization.memory [%] memory.total [MiB] memory.free [MiB] memory.used [MiB]
02.10.2017 09:03  Tesla M60  00000000:87:00.0  384.73  P0  3  3  40    1%  0%  8191 MiB  3093 MiB  5098 MiB
02.10.2017 09:03  Tesla M60  00000000:87:00.0  384.73  P0  3  3  40    3%  0%  8191 MiB  3093 MiB  5098 MiB
02.10.2017 09:03  Tesla M60  00000000:87:00.0  384.73  P0  3  3  41   16%  1%  8191 MiB  3093 MiB  5098 MiB
02.10.2017 09:03  Tesla M60  00000000:87:00.0  384.73  P0  3  3  41  100%  0%  8191 MiB  3093 MiB  5098 MiB
02.10.2017 09:04  Tesla M60  00000000:87:00.0  384.73  P0  3  3  43  100%  0%  8191 MiB  3093 MiB  5098 MiB
02.10.2017 09:04  Tesla M60  00000000:87:00.0  384.73  P0  3  3  43  100%  0%  8191 MiB  3093 MiB  5098 MiB
02.10.2017 09:04  Tesla M60  00000000:87:00.0  384.73  P0  3  3  44  100%  0%  8191 MiB  3093 MiB  5098 MiB

As you can see, the load was not that high just before this happened.
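
For reference, a log in this format can be captured in the XenServer console with nvidia-smi's query mode; a minimal sketch (the field list mirrors the header above, while the 60-second interval and the output file are assumptions):

nvidia-smi --query-gpu=timestamp,name,pci.bus_id,driver_version,pstate,pcie.link.gen.max,pcie.link.gen.current,temperature.gpu,utilization.gpu,utilization.memory,memory.total,memory.free,memory.used --format=csv -l 60 >> /tmp/m60-log.csv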

The sad thing about this is that users lose their work when the VMs crash. On top of that, the VMs cannot start again; the only thing that resolves the issue on a temporary basis is to reboot the XenServer. Sadly enough, even that does not really help, since it happens again quickly. We had to remove all our GPUs from the VMs, ...
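
For anyone who has to do the same: the vGPUs can be detached from dom0 with the xe CLI; a minimal sketch (the UUID is a placeholder taken from the list output, and the VMs must be halted first):

xe vgpu-list                                  # list vGPU objects with their VM and UUID
xe vgpu-destroy uuid=<vgpu-uuid-from-above>   # detach that vGPU from its (halted) VM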

Citrix claims this issue is not their problem. Everything points to NVIDIA at the moment.

We first saw this issue on 27.09.2017.

Replies (11)

王立冕

2018-10-8 14:34:25

Maybe as an add-on: we don't use HDX 3D Pro; we use standard VDA deployments for our XenDesktop environment.

朱佳婧

2018-10-8 14:50:32

Hi

Have you tried a different vGPU profile size? Maybe the 1B profile?

I take it you have SUMS; have you raised it with NVIDIA?

Failing both of the above, can you not roll back to the previous driver that was working, to give you stability while you troubleshoot on a Dev platform?
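
For reference, a rollback on XenServer means swapping the vGPU Manager RPM in dom0; a minimal sketch (the package and file names are placeholders, to be taken from the rpm -qa output and your previous download):

rpm -qa | grep -i nvidia                                # find the installed vGPU Manager package name
rpm -e <package-name-from-above>                        # remove the 384.73 host driver
rpm -ivh NVIDIA-vGPU-xenserver-7.1-<older>.x86_64.rpm   # install the previous build
reboot                                                  # reboot so the older vGPU Manager loads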

Regards

周芳卿

2018-10-8 15:02:29

Thanks for your reply. Yes, we have SUMS, and yes, we've raised the issue with NVIDIA (no solution so far). A rollback could be an option; for now we simply removed the GPUs, since we had to have a quick solution. The 1GB profile could be tried as well, but then I can only run 64 users, so I would need more M60s for that. Since we use Win7 in standard VDA mode, we thought the 512MB profile would be just right. Our test environment does not have any M60s in it at the moment; those cards are quite expensive :-).
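
For context, the density math behind those numbers (the board count is an inference from the figures above, not stated in the thread): an M60 board carries two GPUs with 8GB of framebuffer each, so the M60-0B (512MB) profile allows 16 vGPUs per GPU, i.e. 32 per board, while the M60-1B (1GB) profile halves that to 8 per GPU, i.e. 16 per board. 64 users on 1GB profiles would therefore correspond to four boards, which on 512MB profiles could host 128.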

熊辉

2018-10-8 15:14:50
PM Sent ...
