Hello.

We have three new Dell R730 servers with Tesla M60 cards. They are installed with XenServer 7.0, with all patches up to 21. All three have the following problem: as soon as a few VMs start, the hosts simply reboot without showing any information. The event log records the following:

A bus fatal error was detected on a component at slot 6.
A fatal error was detected on a component at bus 0 device 2 function 0.

The M60 is installed in slot 6. The power plug has already been replaced (the original one was not correct). The same happens if we move the card to slot 4. There is no XenServer crash dump. http://nvidia-esp.custhelp.com/app/answers/detail/a_id/4249 did not fix it.

Any hints on where to look?

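The event log identifies the failing component only by bus, device and function. The following is a minimal sketch, assuming `lspci` is available in the XenServer control domain and that the numbers in the log are decimal, of turning that triple into the address form `lspci -s` accepts. `bdf_to_addr` is a hypothetical helper name, not part of any tool.

```shell
#!/bin/sh
# Hypothetical helper: format bus/device/function numbers (assumed decimal)
# as the hex "bb:dd.f" address that lspci accepts with -s.
bdf_to_addr() {
    printf '%02x:%02x.%x\n' "$1" "$2" "$3"
}

addr=$(bdf_to_addr 0 2 0)
echo "$addr"

# On the host itself, show what sits at that address (commented out so the
# sketch runs anywhere):
# lspci -vv -s "$addr"
```

On many Intel server platforms that address belongs to a PCIe root port rather than to the card itself, but that is an assumption to confirm against the `lspci` output on the affected host.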
Do you have SUMS support or is this a pre-sales POC?

Best wishes,
Rachel

Depends :)

We already have some systems licensed, but for this one we need to test which licenses are necessary, and for that we need a working environment to test on :)

Details I'd like added:
* the VDA / XD versions
* the NVIDIA driver versions
* the BIOS version

With all new M60s I'd recommend checking that the mode switch has been applied correctly: http://nvidia.custhelp.com/app/answers/detail/a_id/4106/kw/modeswitch

VDA 7.11
XD 7.11

Drivers:
XenServer:
NVIDIA vGPU (version 361.45.09)
NVIDIA vGPU (version 367.43)
VM: 369.17_grid_win8_win7_server2012R2_server2008R2_64bit_international

BIOS: 2.2.5

In compute mode the VMs didn't start; they only start in graphics mode :)

Hi jhmeier

Can you please let us know why you have 2 different host drivers and only 1 VM driver listed above? The drivers are released in pairs (host / VM). If you are using multiple drivers, I would have expected to see them listed in pairs.

361.45.09 is from the GRID 3.1 package and should only be paired with 362.56. As you can see from the version numbers, it's quite a way behind the current release. The latest drivers for Xen are 367.64, paired only with 369.71, available from here: https://nvidia.flexnetoperations.com/control/nvda/login

Does the problem occur when you start only a single VM, or only when multiple VMs are started? Are you able to start any VMs at all with a vGPU assigned, or do none power on successfully?

When you run nvidia-smi on the Xen hosts, what are the results?

When you created your master image, the VM obviously had a vGPU assigned so that you could install the NVIDIA drivers. Did you experience any issues then?

Are you running just XenDesktop, or XenApp as well, and which operating systems are you using?

Does it happen with Passthrough as well?

What is your provisioning method? MCS or PVS?

Regards
Ben

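The pairing rule described above can be expressed as a small lookup. This is only a sketch: the table holds just the two host/VM pairs quoted in this post, the function names are hypothetical, and any version not listed is reported as unknown rather than guessed at.

```shell
#!/bin/sh
# Host/VM driver pairs exactly as quoted in the post above; anything else
# is reported as unknown.
expected_guest_driver() {
    case "$1" in
        361.45.09) echo "362.56" ;;
        367.64)    echo "369.71" ;;
        *)         echo "unknown" ;;
    esac
}

# Compare an installed host/VM driver combination against the table.
check_pair() {
    host="$1"
    guest="$2"
    want=$(expected_guest_driver "$host")
    if [ "$want" = "unknown" ]; then
        echo "host $host: no pairing listed in this thread"
    elif [ "$want" = "$guest" ]; then
        echo "host $host + guest $guest: matched pair"
    else
        echo "host $host + guest $guest: mismatch, expected guest $want"
    fi
}

check_pair 367.64 369.71
check_pair 361.45.09 369.17
check_pair 367.43 369.17
```

For versions outside this table, the release notes shipped with each GRID package remain the authority on which VM driver belongs to which host driver.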
Hi,

We have two host drivers because we started with the old version and updated to the new one. As far as I know, it's not possible to remove a version from XenCenter (except with a full reinstallation). Thus we have 367.64 with 369.17 in use.

As far as I can see, it has only happened when a few VMs had been started.

nvidia-smi:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.43                 Driver Version: 367.43                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M60           On   | 0000:05:00.0     Off |                  Off |
| N/A   35C    P8    25W / 150W |     14MiB /  8191MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M60           On   | 0000:06:00.0     Off |                  Off |
| N/A   31C    P8    23W / 150W |     14MiB /  8191MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

No problems with the master image, but it was created on other hosts (which do not have the problem).

We are using both XA and XD, but in this case it's XD 7.11 with Windows 7.

MCS

I just crashed a server with only the master VM running and no other VMs, so it also happens with just one VM.

To remove the NVIDIA driver from XenServer:

- Query the NVIDIA driver: rpm -qa | grep -i nvidia
  Let's assume it comes back with: NVIDIA-vgx-xenserver-7.0-361.45.09 (adjust to your version if it is different)
- Remove the NVIDIA driver: rpm -ev NVIDIA-vgx-xenserver-7.0-361.45.09
  I typically put a reboot in here after removal.
- Copy the new NVIDIA driver to the Xen host using WinSCP and install it: rpm -iv /change-to-your-path/NVIDIA-vgx-xenserver-7.0. . . .rpm (adjust to your driver version)
  Or you can use the GUI method of mounting using a .iso

Reboot after the install completes.

=====

When you say "crashed a server", do you mean the R730 rebooted?

I take it all 3 XenServer hosts are correctly licensed?

Which vGPU profiles are you using?

Have you made any changes to the R730 BIOS?

Can you try a Passthrough profile for me and let me know what happens?

Regards
Ben

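The remove-and-install sequence above can be rehearsed before touching a host. The sketch below is a dry run by default: with `DRY_RUN=1` it only prints the `rpm` commands it would issue. The `run` and `swap_driver` names are hypothetical, and the RPM path passed at the bottom is a placeholder, not a real file name.

```shell
#!/bin/sh
# Dry-run sketch of the remove/install sequence described above.
# With DRY_RUN=1 (the default) commands are only printed, never executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

swap_driver() {
    old_pkg="$1"   # package name reported by: rpm -qa | grep -i nvidia
    new_rpm="$2"   # path of the new driver RPM copied to the host
    run rpm -ev "$old_pkg"
    echo "reboot the host here before installing"
    run rpm -iv "$new_rpm"
    echo "reboot the host again after the install completes"
}

# Placeholder arguments; substitute your own package name and RPM path.
swap_driver NVIDIA-vgx-xenserver-7.0-361.45.09 /change-to-your-path/new-driver.rpm
```

The reboots are left as printed reminders rather than commands, so the script never restarts a host on its own.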
Yes, the R730 rebooted.

Yes, all licensed.

M60-0B. I'm preparing a test with M60-1B, but deployment takes some time.

No. I found a hint that there should be a Dell document with BIOS settings for GRID, but I can't find it.

Is the M60-1B test also OK?

Ok, thanks for the additional info.

With that test, all you're doing is increasing the framebuffer from 512MB to 1GB. The reason I asked for a Passthrough test is that Passthrough does not use the driver in the hypervisor, whereas any other profile does. I don't think increasing the framebuffer will stop this issue.

Let me do some investigation ...

Regards
Ben

Hi

Can you please review this and let me know what you think: http://nvidia.custhelp.com/app/answers/detail/a_id/4163/~/nvidia-grid-vgpu-on-dell-r730-%2F-r720-servers,-on-upgrade-to-citrix-xenserver

It may well be worth updating your current BIOS version... You can also help validate that by trying a Passthrough profile. There are also some other suggestions at the bottom of that page.

Regards
Ben

Thanks for the hint. I had already checked that: BIOS etc. is all up to date.

The major difference from most of the hints is that our VMs do start; in most of those scenarios the VMs don't boot.

I just tried to remove one of the old NVIDIA supplemental packs:

error: package NVIDIA-vgx-xenserver-7.0-361.45.09 is not installed

I guess they are removed during the upgrade, but not completely, so the old version is still visible in XenCenter.

Ok

There are 3x Dell R730 servers. All identical specs, all identical firmware and all completely up to date.

The BIOS is the same version across all 3 hosts (2.2.5) and configured in the same way (factory default). (The Dell website reports that a newer BIOS version (2.3.4) is available. I'm not saying it will fix the issue, as the VMs boot, only that a newer version is available.) http://www.dell.com/support/home/uk/en/ukbsdt1/Drivers/DriversDetails?driverId=0FR48&fileId=3586834680&osCode=CXS07&productCode=poweredge-r730&languageCode=en&categoryId=BI

XS 7.0 is fully patched and licensed across all R730s.

NVIDIA drivers are the same (or should be) across all XS hosts. Using the information provided further up, you can now run the same GRID drivers across all XS hosts.

The VM driver is correctly paired with the host driver.

Running nvidia-smi on each host shows no errors.

Can you run "rpm -qa | grep -i nvidia" on all 3 XS hosts and post the results?

As you're using MCS, you have a storage connection for each GPU profile you wish to use, and the correct storage connection is being used when you create your XD Catalog.

However, even before you get that far, when you create a VM and assign a vGPU, the entire host crashes with either a single VM or multiple VMs?

Based on that, the only step left I can think of is to try a Passthrough GPU and see if that makes any difference ...

You might want to post on the Citrix Forums to see if anyone there can offer some advice: https://discussions.citrix.com/forum/523-gpu-technologies/

Regards
Ben

The BIOS was configured in the same way on all hosts. We changed one to UEFI boot, but it made no difference.

I just installed the new BIOS (it is not available through the Lifecycle Controller and ftp.dell.com): no difference.

XS 7 is fully patched and the NVIDIA driver is the same on all hosts.

nvidia-smi shows no errors.

rpm -qa | grep -i nvidia
All three show: NVIDIA-vGPU-xenserver-7.0-367.43.x86_64

At the moment I can reproduce the problem with a single, manually created VM; no MCS is involved.

I will check Passthrough later and give feedback.

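Running the same `rpm` query on every host and comparing the answers by eye is easy to get wrong once more hosts are involved. A minimal sketch of the comparison follows, assuming the results are collected as "host package" lines. The host names `xs-host1` to `xs-host3` are hypothetical, and the sample package name is the one reported above.

```shell
#!/bin/sh
# Reads "host package" lines on stdin and reports whether every host
# carries the same NVIDIA package.
same_package_everywhere() {
    distinct=$(awk '{print $2}' | sort -u | wc -l | tr -d ' ')
    if [ "$distinct" -eq 1 ]; then
        echo "consistent"
    else
        echo "inconsistent"
    fi
}

# Collecting the input over SSH (hypothetical host names, commented out):
# for h in xs-host1 xs-host2 xs-host3; do
#     ssh "root@$h" 'rpm -qa | grep -i nvidia' | sed "s/^/$h /"
# done | same_package_everywhere

# Sample input using the package name reported in the post above.
printf '%s\n' \
    'xs-host1 NVIDIA-vGPU-xenserver-7.0-367.43.x86_64' \
    'xs-host2 NVIDIA-vGPU-xenserver-7.0-367.43.x86_64' \
    'xs-host3 NVIDIA-vGPU-xenserver-7.0-367.43.x86_64' | same_package_everywhere
```

An empty input, or a host that reports more than one NVIDIA package, is deliberately flagged as inconsistent so that leftovers from an earlier driver version do not go unnoticed.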
With only 512 GB of RAM the error messages are gone; instead the server just freezes.

We had the same problem with a Dell 7910 when the M60 was connected with the wrong power plug, but the power plug in the R730 is correct (according to Dell). Is there a way to check whether the M60 receives enough power?

(The servers mostly freeze when the VMs are rebooted for the second time.)

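One partial way to look at the power question is through the values the driver itself reports. The sketch below assumes the installed `nvidia-smi` supports the `--query-gpu` fields `power.draw` and `power.limit`; it shows only what the driver sees and cannot confirm that the cabling is correct. `power_headroom` is a hypothetical helper, and the sample numbers are the ones from the nvidia-smi output earlier in the thread.

```shell
#!/bin/sh
# Reads "index, draw_watts, limit_watts" CSV lines on stdin and prints the
# draw as a percentage of the limit for each GPU.
power_headroom() {
    awk -F', *' '{ printf "GPU %s: %.0f%% of limit\n", $1, ($2 / $3) * 100 }'
}

# On the host (commented out so the sketch runs without a GPU):
# nvidia-smi --query-gpu=index,power.draw,power.limit \
#            --format=csv,noheader,nounits | power_headroom

# Sample values taken from the nvidia-smi output earlier in the thread.
printf '%s\n' '0, 25, 150' '1, 23, 150' | power_headroom
```

Idle readings like these say little about behaviour under load, so the query would need to be repeated while a VM with a vGPU is starting for the numbers to be meaningful.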
With only 512GB RAM? What did you have installed previously? ... If you had 1TB or more RAM installed, this is not a compatible server configuration.

Which power supplies do you have in the R730s?

When you ordered the R730s, you should also have ordered "GPU Enablement Kits" for each server (from Dell). These give you the correct power cables and low-profile heat sinks. Did you order those? ...

576 GB had been installed. The workaround for systems with more than 512 GB,

/opt/xensource/libexec/xen-cmdline --set-dom0 iommu=dom0-passthrough

did not help.

Power supplies: redundant 1100 W.

Yes, the GPU installation kit was part of the order. (It was also ordered for the other ones, but Dell delivered the wrong ones. Furthermore, in the R730 not all power plugs had been connected, and they were in the air flow....)
Powered by 电子发烧友网
© 2015 bbs.elecfans.com