We have a 3-node cluster, and all 3 nodes crashed and generated dump files.
Looking at the crash errors, we found that all 3 nodes crashed with the same error code. vGPU was enabled on all 3 nodes. These are the crash dump details:

VIDEO_TDR_FAILURE (116)
Attempt to reset the display driver and recover from timeout failed.
Arguments:
Arg1: ffff8d03a76a5010, Optional pointer to internal TDR recovery context (TDR_RECOVERY_CONTEXT).
Arg2: fffff80d3a752678, The pointer into responsible device driver module (e.g. owner tag).
Arg3: ffffffffc000009a, Optional error code (NTSTATUS) of the last failed operation.
Arg4: 0000000000000004, Optional internal context dependent data.

Debugging Details:
------------------
TRIAGER: Could not open triage file : e:\dump_analysis\program\triage\modclass.ini, error 2

FAULTING_IP:
nvlddmkm+982678
fffff80d`3a752678 48ff25a145e7ff jmp qword ptr [nvlddmkm+0x7f6c20 (fffff80d`3a5c6c20)]

DEFAULT_BUCKET_ID: GRAPHICS_DRIVER_TDR_FAULT
BUGCHECK_STR: 0x116

   Child-SP          RetAddr           Call Site
00 ffff8a00`aaa17a58 fffff806`44b3a298 nt!KeBugCheckEx
01 ffff8a00`aaa17a60 fffff806`44b1d13f dxgkrnl!TdrBugcheckOnTimeout+0xec
02 ffff8a00`aaa17aa0 fffff806`44b1a2ef dxgkrnl!ADAPTER_RENDER::Reset+0x153
03 ffff8a00`aaa17ad0 fffff806`44b39a85 dxgkrnl!DXGADAPTER::Reset+0x307
04 ffff8a00`aaa17b20 fffff806`44b39bc7 dxgkrnl!TdrResetFromTimeout+0x15
05 ffff8a00`aaa17b50 fffff802`e8ae2599 dxgkrnl!TdrResetFromTimeoutWorkItem+0x27
06 ffff8a00`aaa17b80 fffff802`e8b32965 nt!ExpWorkerThread+0xe9
07 ffff8a00`aaa17c10 fffff802`e8bd0e26 nt!PspSystemThreadStartup+0x41
08 ffff8a00`aaa17c60 00000000`00000000 nt!KiStartSystemThread+0x16
-----------------------------
02 ffff8a00`aaa17aa0 fffff806`44b1a2ef dxgkrnl!ADAPTER_RENDER::Reset+0x153

1. All 3 crash dumps point to the same stack and register values.

FAULTING_IP:
nvlddmkm+982678
fffff80d`3a752678 48ff25a145e7ff jmp qword ptr [nvlddmkm+0x7f6c20 (fffff80d`3a5c6c20)]

2. The WinDbg stack points to VIDEO_TDR_FAILURE (116).

37: kd> !analyze -v
*******************************************************************************
*                              Bugcheck Analysis                              *
*******************************************************************************
VIDEO_TDR_FAILURE (116)
Attempt to reset the display driver and recover from timeout failed.
Arguments:
Arg1: ffffdd84719ea010, Optional pointer to internal TDR recovery context (TDR_RECOVERY_CONTEXT).
Arg2: fffff80fe60e2678, The pointer into responsible device driver module (e.g. owner tag).
Arg3: ffffffffc000009a, Optional error code (NTSTATUS) of the last failed operation.
Arg4: 0000000000000004, Optional internal context dependent data.

According to the Microsoft documentation, this bug check can be caused by the following (see the Resolution section):
https://docs.microsoft.com/en-us/windows-hardware/drivers/debugger/bug-check-0x116---video-tdr-error

* Over-clocked components, such as the motherboard
* Incorrect component compatibility and settings (especially memory configuration and timings)
* Defective parts (memory modules, motherboards, etc.)
* Insufficient system power
* Insufficient system cooling

We are using HP servers with the following specification: HP ProLiant DL380 Gen 9, ROM version P89 v2.30 (09/13/2016).

Moreover, when we tried to upgrade the driver to the latest version, 385.54 (release date 25.9.2017), we were unable to run virtual GPU (RemoteFX) because the GPU does not show up in the Hyper-V settings. Once we reverted to the old driver, 376.84, we could see the physical GPUs under the Hyper-V settings again.

Can anyone tell us whether they have experienced the same issue with this driver version?
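For anyone comparing the two dumps quoted above, here is a rough Python sketch (not part of the original post) that decodes the NTSTATUS in Arg3 and checks whether Arg2 in both dumps is consistent with the same offset inside nvlddmkm. The helper names `decode_ntstatus` and `implied_base` are made up for illustration. The one-entry name table is my own addition: 0xC000009A is STATUS_INSUFFICIENT_RESOURCES in Microsoft's NTSTATUS list. The offset 0x982678 is taken from the FAULTING_IP line of the first dump. Applying it to the second dump is an assumption, since the post only gives Arg2 for that one.

```python
# Sketch only: helper names are hypothetical, values are copied from the post.

# Single known entry from Microsoft's NTSTATUS value list (my addition).
KNOWN_STATUS = {0xC000009A: "STATUS_INSUFFICIENT_RESOURCES"}

# Offset into nvlddmkm reported by FAULTING_IP in the first dump.
FAULTING_OFFSET = 0x982678


def decode_ntstatus(status: int) -> dict:
    """Split an NTSTATUS value into its documented bit fields."""
    status &= 0xFFFFFFFF  # WinDbg prints Arg3 sign-extended to 64 bits
    severities = ["success", "informational", "warning", "error"]
    return {
        "value": status,
        "name": KNOWN_STATUS.get(status, "unknown"),
        "severity": severities[status >> 30],     # bits 31-30
        "customer": bool((status >> 29) & 1),     # bit 29
        "facility": (status >> 16) & 0xFFF,       # bits 27-16
        "code": status & 0xFFFF,                  # bits 15-0
    }


def implied_base(arg2: int, offset: int = FAULTING_OFFSET) -> int:
    """Module base that Arg2 would imply if it sits at the given offset."""
    return arg2 - offset


if __name__ == "__main__":
    print(decode_ntstatus(0xFFFFFFFFC000009A))
    for arg2 in (0xFFFFF80D3A752678, 0xFFFFF80FE60E2678):
        base = implied_base(arg2)
        # A page-aligned base is what a loaded driver image would have.
        print(hex(arg2), "->", hex(base), "page-aligned:", base % 0x1000 == 0)
```

If both implied bases come out page-aligned, that fits the claim that the nodes fail at the same spot in the driver, although it does not prove it. Checking `lm m nvlddmkm` in each dump would confirm the real base addresses.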
3 replies
Hi Venky,

As you are running RemoteFX on a Tesla M60, I assume you have the required vPC licenses, so please open a support ticket with ESP. You should run the supported driver from the GRID 5.0 package (R384 branch).

Regards,
Simon
Hello Simon,

Thanks for getting back to us on this. We don't use any licenses; we just use the driver from nvidia.com, with no GRID software or anything like that. We have been using RemoteFX with the previous version of the Tesla M60 driver. Lately we observed some crashes, so we decided to update the driver, and then ran into the issue that we could no longer use the vGPU.

Regards,
Venky
Hi Venky,

Please check our licensing terms/EULA, as you need to buy licenses for a deployment with RemoteFX and Tesla M60. See here:
http://images.nvidia.com/content/grid/pdf/161207-GRID-Packaging-and-Licensing-Guide.pdf

And by the way, it doesn't matter which driver you're using.

Regards,
Simon