Hello.

We have three new Dell R730 servers with Tesla M60 cards. They are installed with XenServer 7.0, with all patches up to 21 applied. All three have the following problem: as soon as a few VMs start, the hosts simply reboot without showing any information.

The following is logged in the event log:

A bus fatal error was detected on a component at slot 6.
A fatal error was detected on a component at bus 0 device 2 function 0.

The M60 is installed in slot 6. The power plug has already been replaced (the original was not correct). The same thing happens if we move the card to slot 4. There is no XenServer crash dump. http://nvidia-esp.custhelp.com/app/answers/detail/a_id/4249 did not fix it.

Any hints on where to look?
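For anyone wanting to match these event-log lines against the PCI device list on the host, here is a minimal sketch that pulls the slot number and the bus/device/function triple out of messages worded like the ones quoted above. The function names are made up for illustration, and it assumes the numbers in the log are decimal (so bus 64 would correspond to 40 in the hexadecimal notation PCI tools usually print); check that against your own platform before relying on it.

```python
import re
from typing import Optional

# Patterns for the two message shapes quoted in this thread.
BDF_RE = re.compile(r"bus (\d+) device (\d+) function (\d+)")
SLOT_RE = re.compile(r"slot (\d+)")


def to_bdf(message: str) -> Optional[str]:
    """Return the PCI address as 'bb:dd.f' in hex, or None if the message has none.

    Assumption: the log prints bus/device/function as decimal numbers.
    """
    match = BDF_RE.search(message)
    if match is None:
        return None
    bus, dev, fn = (int(group) for group in match.groups())
    return f"{bus:02x}:{dev:02x}.{fn:x}"


def to_slot(message: str) -> Optional[int]:
    """Return the physical slot number mentioned in the message, or None."""
    match = SLOT_RE.search(message)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    for line in (
        "A bus fatal error was detected on a component at slot 6.",
        "A fatal error was detected on a component at bus 0 device 2 function 0.",
    ):
        print(to_slot(line), to_bdf(line))
```

The resulting address can then be compared with the device list on the host to see which device or root port the error is being reported against.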
|
I've run a GPU-enabled XenServer 7.0 with 768GB RAM without any issues (Cisco UCS). Although that article does list the Dell R720 and R730, unless you are actually experiencing that issue, I wouldn't bother with it on any other hosts.

Just out of interest, what CPU do you have installed? Can you give its full name? (E.g. E5-2670 v4)

With newer hypervisors you shouldn't need to do this, but on one of your hosts, can you try disabling "Memory Mapped I/O above 4GB" in the BIOS and see what happens? If it won't boot afterwards or throws any errors, just set it back to how it was.

Power supplies are fine, that's what I expected. Can you just confirm how you have the M60 powered? Are you able to take a clear photo? (Feel free to PM me that if you'd rather not post it on here.) It's fairly straightforward, but mistakes can happen, I've been there. (JS, if you read this, not a word! ;-) )

As your servers are pretty much unusable at the moment, if the BIOS change mentioned above doesn't do anything, do you have time to remove one host from the Resource Pool and start again? Reset the R730 BIOS to factory defaults, do a clean install of XenServer (don't add it back into the Resource Pool, keep it stand-alone), license XenServer (Enterprise or above), don't use the memory workaround you mentioned above, fully update XenServer, install the latest GRID drivers, build a clean Windows VM (from an .iso, not a pre-built template) with all Windows updates and install the GRID drivers. Don't bother with apps or running it through MCS, just see if the issue remains.

It's strange that you have the same issue on all of the R730s. If this were only one host, a faulty R730 or M60 would be possible, but with it being on all of them this is far less likely. There is obviously a common issue between them somewhere: something has been connected, installed or configured incorrectly.

How did you get on with Passthrough?
|
I did some tests with the Master VM and GPU Passthrough: no problems so far. So it looks like it's limited to vGPU.

CPU: E5-2667

I will check the Memory Mapped I/O setting later.

Would you like pictures of the M60 power cabling? The separate cable is plugged into the riser card. The two ends (6-pin and 8-pin) are connected to the two 8-pin connectors of the card. Then one 8-pin goes into the card.
|
OK, I was just checking CPU TDP, as it sounds like you've swapped bits around between servers (R7910 > R730, at least as far as the M60 goes), so I have no idea what systems you bought together and what other components you may have moved between chassis.

Cabling sounds right. But there are different cables for different GPU generations and models, and you've already mentioned you had issues with the cables you had been sent ... If you're confident that it's right, we can forget about that.

Good, if it's just vGPU, it's a software issue. This is why I asked in my first two posts to try Passthrough: it removes the hypervisor driver from the loop, so you know where to look for the issue.

If you've already removed the driver from the hypervisor > rebooted > installed the new driver > rebooted again, and made sure that the correctly paired driver is in your Master Image (that bit is important, because it's the difference between the problem being in your VM and following you between hosts, or a problem of some sort with all three hosts), then I'd do as suggested above and start again with one of your XenServers. Complete fresh start. Reset the BIOS etc. (as above) ... It doesn't take long, XenServer is a quick and easy install.
|
No, we didn't swap things between the servers ;) I will take a picture later and attach it.

The Master VM is using the driver paired with 367.64.

The first check now is the Memory Mapped I/O setting ...
|
Something to check in your Master Image ... In "Device Manager", enable "Show Hidden Devices". Go through and remove every Ghost device that is listed (every one of them! Regardless of what it is, if it's a Ghost, remove it). Then try a vGPU again.

After you've either changed the vGPU profile or switched between Passthrough and vGPU, you should go back and remove the Ghost profile for each GPU that is no longer in use.

I didn't suggest this before because I've never heard of it crashing a host, so it may not be relevant, but it's worth checking and should be done anyway just to keep the Master Image clean.
|
Sorry for not replying for such a long time (I was on holiday etc.).

We are in an escalation with Dell/Citrix/Nvidia, but no solution has been found so far. I was told that other people have the same problems with the Dell R730, but I have never heard of the same problems with other hardware vendors.

A few interesting notes:
No problems with XenServer 6.5.
No problems when we use a Xeon v3 CPU (instead of v4).
|
Hi jhmeier,

Judging by your name, you're German ;-)

I have the same problem at several customers and have support cases open, indirectly, with FSC and Dell. In my opinion it comes down to the same problem as described here: https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2146388

FSC has already released a BIOS update. A BIOS update from Dell should surely follow.
|
Hey,

We are having the exact same issues on Xen 7.0 servers with GRID M60 cards. The changes in this article help, and we are still testing: https://support.citrix.com/article/CTX220674
|
We had the exact same issue in our lab environment. We put an M60 in an R720 server. It ran fine for a few months, but last week we had a host crash with exactly the same errors reported:

A bus fatal error was detected on a component at bus 64 device 2 function 0.
A bus fatal error was detected on a component at slot 4.

Is there any update on this case?
|
Hi RKossen,

The result of the case was the workaround (WAR) posted above in the CTX220674 article. Your issue seems to be different, as I doubt you're using Intel v4 CPUs in a Dell R720. In addition, the Tesla M60 is not supported at all on this hardware.

Regards
Simon
|
Yes, "no-pml" fixes the problem. There is also a private hotfix available to fix it (I can't confirm that, because I didn't have the time to test it). For now it's fine to disable the PML feature: it is used for live migrations, which are currently not possible with vGPU VMs anyway.
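If you want to confirm whether a host is actually booting with the workaround, a rough sketch like the one below can check a Xen command line string for it. Only the "no-pml" token comes from this thread; the full `ept=no-pml` spelling, the sample command line and the helper name are assumptions for illustration, so follow CTX220674 for the exact option and for how to set it on XenServer.

```python
def pml_disabled(xen_cmdline: str) -> bool:
    """Return True if the given Xen command line disables PML via an ept= option.

    Assumptions: the workaround is spelled "ept=no-pml", ept= takes a
    comma-separated list of flags, and a later ept= option overrides an
    earlier one.
    """
    disabled = False
    for token in xen_cmdline.split():
        if not token.startswith("ept="):
            continue
        flags = token[len("ept="):].split(",")
        if "no-pml" in flags:
            disabled = True
        elif "pml" in flags:
            disabled = False
    return disabled


if __name__ == "__main__":
    # Hypothetical command line, not taken from a real host.
    sample = "dom0_mem=4096M,max:4096M ept=no-pml console=vga"
    print(pml_disabled(sample))
```

Running the same check across all hosts in a pool is a quick way to spot one that was rebuilt or updated without the option being re-applied.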
|
I know it is not officially supported, but it worked for several months and then suddenly we got a BSOD. So I thought our issue might be related to this topic (the crash codes are the same).