Reset and resource allocation failures on Metis PCIe device on AMD64 system
Dear all,
I am trying to get the Metis PCIe card running on my current PC, which is set up as follows:
mainboard: Asus Prime X570-P
CPU: AMD Ryzen 9 3900X
OS: Ubuntu 24.04.2 LTS, 6.8.0-58-generic x86_64
The first issue I face is that warm reset does not seem to work for me: if I reboot without power-cycling the system, the Metis device does not show up in lspci, whereas after a full power-down (physically switching off the PSU) it is there. This is reproducible and, in my experience, points either to my mainboard BIOS issuing a wrong reset sequence to the PCIe slot or to the device itself not performing a full reset on warm boot. Since I know how to get the card detected, this is only an annoyance, but I saw that another user with an AMD64 system in the forum has similar issues, so the problem might go deeper.
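To compare a warm reboot against a cold boot without reading through the full lspci output, the check can be scripted. The following is a minimal sketch, assuming a Linux host with the standard sysfs layout under /sys/bus/pci/devices; the vendor/device pair 1f9d:1100 is taken from the kernel log in this thread, and the function name is made up for illustration.

```python
from pathlib import Path

# Vendor/device ID of the Metis endpoint as it appears in the kernel log
# of this thread ([1f9d:1100]).
METIS_VENDOR = 0x1F9D
METIS_DEVICE = 0x1100


def find_pci_devices(vendor, device, sysfs_root="/sys/bus/pci/devices"):
    """Return the bus addresses of all PCI functions matching vendor:device."""
    matches = []
    root = Path(sysfs_root)
    if not root.is_dir():
        return matches
    for dev_dir in sorted(root.iterdir()):
        try:
            ven = int((dev_dir / "vendor").read_text().strip(), 16)
            dev = int((dev_dir / "device").read_text().strip(), 16)
        except (OSError, ValueError):
            continue  # skip entries without readable ID files
        if (ven, dev) == (vendor, device):
            matches.append(dev_dir.name)
    return matches


if __name__ == "__main__":
    found = find_pci_devices(METIS_VENDOR, METIS_DEVICE)
    print("Metis present at:", ", ".join(found) if found else "not enumerated")
```

Running it once after a cold boot and once after a warm reboot shows whether the endpoint was enumerated at all in each case.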
This leads to the second issue I am observing, for which I already filed a support request before realizing that this forum is the better place to discuss it.
According to David M., the way to check whether the card is properly installed at the HW and driver level is to issue a
[  131.149382] pci_bus 0000:05: busn_res: [bus 05] is released
[  133.174548] pci 0000:03:02.0: [1022:57a3] type 01 class 0x060400 PCIe Switch Downstream Port
[  133.174598] pci 0000:03:02.0: PCI bridge to [bus 05]
[  133.174981] pci 0000:03:02.0: PME# supported from D0 D3hot D3cold
[  133.175833] pci 0000:05:00.0: [1f9d:1100] type 00 class 0x120000 PCIe Endpoint
[  133.175898] pci 0000:05:00.0: BAR 0 [mem 0xfa010000-0xfa010fff 64bit]
[  133.175902] pci 0000:05:00.0: BAR 2 [mem 0xf8000000-0xf9ffffff]
[  133.175911] pci 0000:05:00.0: ROM [mem 0xfa000000-0xfa00ffff pref]
[  133.176025] pci 0000:05:00.0: supports D1
[  133.176027] pci 0000:05:00.0: PME# supported from D0 D1 D3hot
[  133.176281] pci 0000:03:02.0: PCI bridge to [bus 05]
[  133.176408] pci 0000:03:02.0: bridge window [mem size 0x03000000]: can't assign; no space
[  133.176411] pci 0000:03:02.0: bridge window [mem size 0x03000000]: failed to assign
[  133.176415] pci 0000:05:00.0: BAR 2 [mem size 0x02000000]: can't assign; no space
[  133.176417] pci 0000:05:00.0: BAR 2 [mem size 0x02000000]: failed to assign
[  133.176419] pci 0000:05:00.0: ROM [mem size 0x00010000 pref]: can't assign; no space
[  133.176421] pci 0000:05:00.0: ROM [mem size 0x00010000 pref]: failed to assign
[  133.176423] pci 0000:05:00.0: BAR 0 [mem size 0x00001000 64bit]: can't assign; no space
[  133.176425] pci 0000:05:00.0: BAR 0 [mem size 0x00001000 64bit]: failed to assign
[  133.176427] pci 0000:03:02.0: PCI bridge to [bus 05]
[  133.176897] axl 0000:05:00.0: Failed to request resources
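The "can't assign; no space" lines are the ones that matter in this log. As a rough sketch, they can be pulled out of a longer dmesg capture with a regular expression; the pattern below only assumes the message wording seen in this thread, so other kernel versions may phrase it differently, and the function name is hypothetical.

```python
import re

# Matches kernel lines such as:
# [  133.176415] pci 0000:05:00.0: BAR 2 [mem size 0x02000000]: can't assign; no space
FAIL_RE = re.compile(
    r"pci (?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"(?P<resource>.+?) \[mem size (?P<size>0x[0-9a-f]+)[^\]]*\]: "
    r"can't assign; no space"
)


def failed_assignments(log_text):
    """Return (device, resource, size_in_bytes) for each unassigned region."""
    results = []
    for line in log_text.splitlines():
        m = FAIL_RE.search(line)
        if m:
            results.append((m["bdf"], m["resource"], int(m["size"], 16)))
    return results


# Excerpt of the log lines quoted in this thread.
SAMPLE = """\
[  133.176408] pci 0000:03:02.0: bridge window [mem size 0x03000000]: can't assign; no space
[  133.176415] pci 0000:05:00.0: BAR 2 [mem size 0x02000000]: can't assign; no space
[  133.176419] pci 0000:05:00.0: ROM [mem size 0x00010000 pref]: can't assign; no space
[  133.176423] pci 0000:05:00.0: BAR 0 [mem size 0x00001000 64bit]: can't assign; no space
[  133.176897] axl 0000:05:00.0: Failed to request resources
"""

for bdf, resource, size in failed_assignments(SAMPLE):
    print(f"{bdf}  {resource:<14} {size // 1024:>6} KiB")
```

On the excerpt above this lists four failures: the bridge window on 0000:03:02.0 and the three endpoint regions on 0000:05:00.0.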
Note that I tried all the suggestions proposed in the other thread (amd_iommu=off intel_iommu=off pcie_aspm=off) with no difference. Also, the above failure message (PCI device count mismatch: lspci=1, triton=0) is common to all tools in the SDK that try to make use of the device.
Any suggestions?
Dear Zefir,
Thanks for sharing this detailed description with us and I hope (besides the booting issue) you already managed to get some models deployed on Metis PCIe.
On the second issue: Can you try “axdevice --refresh”? Sometimes this helps to identify the device.
On the first issue: Of course there is always more information, which is helpful in understanding what’s going on here: When you run the axdevice command after a cold reboot in a freshly activated container, what is the output that you get? (I expect it should be something like “Firmware version: v1.2.xxx”)
Hi @jonask-ai
no, the commands you suggested fail due to the same issue:
it checks whether the PCI bridge has at least 47MB and, since it does not, tries a rescan
[    3.874903] pcieport 0000:03:02.0: Removing bridge device: 0000:03:02.0
[    3.881586] pci_bus 0000:05: busn_res: [bus 05] is released
[    3.886408] pci 0000:03:02.0: [1022:57a3] type 01 class 0x060400 PCIe Switch Downstream Port
[    3.886451] pci 0000:03:02.0: PCI bridge to [bus 05]
[    3.886465] pci 0000:03:02.0: bridge window [mem 0xf8000000-0xfa0fffff]
[    3.886796] pci 0000:03:02.0: PME# supported from D0 D3hot D3cold
the assumption here seems to be that the bridge would resize its memory window to match that of its attached nodes, which it does not
[    3.900213] pci 0000:05:00.0: [1f9d:1100] type 00 class 0x120000 PCIe Endpoint
[    3.900282] pci 0000:05:00.0: BAR 0 [mem 0xfa010000-0xfa010fff 64bit]
[    3.900287] pci 0000:05:00.0: BAR 2 [mem 0xf8000000-0xf9ffffff]
[    3.900296] pci 0000:05:00.0: ROM [mem 0xfa000000-0xfa00ffff pref]
[    3.900417] pci 0000:05:00.0: supports D1
[    3.900420] pci 0000:05:00.0: PME# supported from D0 D1 D3hot
[    3.916294] pci 0000:03:02.0: PCI bridge to [bus 05]
[    3.916367] pci 0000:03:02.0: bridge window [mem size 0x03000000]: can't assign; no space
[    3.916371] pci 0000:03:02.0: bridge window [mem size 0x03000000]: failed to assign
[    3.916376] pci 0000:05:00.0: BAR 2 [mem size 0x02000000]: can't assign; no space
[    3.916379] pci 0000:05:00.0: BAR 2 [mem size 0x02000000]: failed to assign
[    3.916382] pci 0000:05:00.0: ROM [mem size 0x00010000 pref]: can't assign; no space
[    3.916385] pci 0000:05:00.0: ROM [mem size 0x00010000 pref]: failed to assign
[    3.916388] pci 0000:05:00.0: BAR 0 [mem size 0x00001000 64bit]: can't assign; no space
[    3.916391] pci 0000:05:00.0: BAR 0 [mem size 0x00001000 64bit]: failed to assign
[    3.916394] pci 0000:03:02.0: PCI bridge to [bus 05]
[    3.921373] pci 0000:03:02.0: Bridge removal and PCIe rescan completed successfully
What I don’t quite understand is where the requirement to have 47MB of memory at the bridge (EXPECTED_MEM_BEHIND_BRIDGE_SIZE) comes from, since the two BARs of the Metis card are smaller than that (0xfa010000-0xfa010fff, 0xf8000000-0xf9ffffff).
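For reference, the sizes behind this question can be worked out directly from the address ranges in the log. The sketch below only does the arithmetic; it does not explain where EXPECTED_MEM_BEHIND_BRIDGE_SIZE comes from, and the variable names are chosen for illustration.

```python
MIB = 1024 * 1024


def span(start, end):
    """Size in bytes of an inclusive address range [start, end]."""
    return end - start + 1


# Ranges as reported in the kernel log of this thread.
bar0 = span(0xFA010000, 0xFA010FFF)            # 4 KiB
bar2 = span(0xF8000000, 0xF9FFFFFF)            # 32 MiB
rom = span(0xFA000000, 0xFA00FFFF)             # 64 KiB
bridge_window = span(0xF8000000, 0xFA0FFFFF)   # 33 MiB

endpoint_total = bar0 + bar2 + rom
requested_window = 0x03000000  # size the kernel tried to assign after the rescan

print(f"endpoint regions: {endpoint_total / MIB:.2f} MiB")
print(f"original window:  {bridge_window / MIB:.2f} MiB")
print(f"requested window: {requested_window / MIB:.2f} MiB")
```

The endpoint regions add up to roughly 32.07 MiB and fit into the original 33 MiB bridge window, which is below the 47MB threshold. After the rescan the kernel asks for a 48 MiB window (0x03000000), and that request is the one that fails with "no space".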
Anyway, since it seems to work on other platforms, it might be that my mainboard is not compatible. I might retire the PCIe device and pre-order a Metis Compute Board instead.
Is there anything else I could try to help narrow down the issue?
Yeah, as you say, there could be an inherent compatibility issue with the mainboard. Although one thing that just sprang to mind (which may or may not help as it was for a different issue!) is someone said a while ago that some AMD boards ship with “Above 4G Decoding” turned off in the BIOS. Turning it on can help the system to assign resources more appropriately when multiple devices are sharing limited space.
Have you seen a setting for that in your BIOS @zefir? Might be worth a quick test, if so?
Finally!
This was a tough one, and in a desperate monkey-testing session I finally succeeded in getting proper PCI access to the Metis card. It is a very system-specific issue which might not affect many users out there, but I’m posting it just in case.
As written above, my system is based on the AMD X570 chipset, here in an ASUS PRIME X570-P board. That board has two PCIEx16 slots and distributes 16 lanes to those slots based on how they are populated. What is buried deep in the manual is this: the slots are not independent but are meant to power either one or two GPU cards. It turned out that when the first slot has a GPU installed, the second slot must either remain empty or have a second GPU attached. Other cards are silently disabled, which leaves them visible in lspci while their memory access remains disabled.
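A device that is visible in lspci but has memory access disabled can be recognized from its PCI command register (the 16-bit field at config space offset 0x04, where bit 1 is Memory Space Enable). The following is a minimal sketch, assuming a Linux host where the first 64 bytes of /sys/bus/pci/devices/<address>/config are readable; the function names are made up for illustration, and the bus address 0000:05:00.0 is the one from the earlier logs and may differ after moving the card to another slot.

```python
import struct

# Bit positions in the 16-bit PCI command register (config space offset 0x04).
IO_SPACE = 1 << 0
MEMORY_SPACE = 1 << 1
BUS_MASTER = 1 << 2


def decode_command(config_bytes):
    """Decode the command register from the first bytes of PCI config space."""
    if len(config_bytes) < 6:
        raise ValueError("need at least 6 bytes of config space")
    # Offset 4, little-endian unsigned 16-bit value.
    (command,) = struct.unpack_from("<H", config_bytes, 4)
    return {
        "io": bool(command & IO_SPACE),
        "memory": bool(command & MEMORY_SPACE),
        "bus_master": bool(command & BUS_MASTER),
    }


def read_command(bdf, sysfs_root="/sys/bus/pci/devices"):
    """Read and decode the command register of one device, e.g. '0000:05:00.0'."""
    with open(f"{sysfs_root}/{bdf}/config", "rb") as f:
        return decode_command(f.read(64))
```

Calling read_command("0000:05:00.0") and seeing "memory": False would match the "Mem-" flag that lspci -vv prints in the Control line for such a device.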
So the first part of the solution was: put the Metis in the primary slot and the VGA card in the second.
On the BIOS side I had to play around until I found a combination that worked and was stable:
CPU PCIE ASPM Mode: Auto → Disabled
PCIEx16_1 Bandwidth Bifurcation Configuration: Auto → x8/x8
PCIE Above 4G Decoding: Disabled → Enabled
PCIE Resize BAR Support: Disabled → Auto
PCIE SR-IOV Support: Auto → Disabled
With those BIOS settings, my system no longer requires any additional kernel boot parameters.
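When cleaning up after such experiments it is easy to leave one of the earlier workaround parameters in the bootloader configuration. As a small sketch, assuming a Linux host where the active kernel command line can be read from /proc/cmdline, the following checks for the three parameters mentioned earlier in this thread; the function names are hypothetical.

```python
# Parameters tried earlier in this thread as workarounds.
WORKAROUNDS = ("amd_iommu", "intel_iommu", "pcie_aspm")


def boot_params(cmdline):
    """Split a kernel command line into a {name: value} mapping.

    Flags without '=' map to None. Quoted values containing spaces
    are not handled by this simple split.
    """
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def leftover_workarounds(cmdline):
    """Return the workaround parameters that are still present."""
    active = boot_params(cmdline)
    return {name: active[name] for name in WORKAROUNDS if name in active}
```

Passing the contents of /proc/cmdline to leftover_workarounds() should return an empty mapping once the parameters have been removed and the system has been rebooted.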
As noted in the other thread, the warm-boot failure persists, which is already a known issue.
Thanks for the support so far.
Thanks for sharing the solution @zefir !
@Victor Labian - worth adding to the WIP PCI troubleshooting guide
Hello @zefir ,
Thank you for sharing your case and solution.
Regarding the warm boot I have a few things we can check. Can you please help us by testing them one by one and see if one of these makes it work?
(If 1. works, no need to test 2. or 3.)
1. Metis device not detected after startup, but is after reboot
This can be due to the pcie-rescan script not being enabled on boot. Run the following to enable it:
systemctl enable pcie-check.service
Then in any future boots the service will be enabled.
(I don’t expect this to solve your issue but it is worth checking)
2. Enable PCIe Shutdown Mode on BIOS
Check if your BIOS has any of these options and enable it.
Look for an option such as:
"PCIe Slot Power Control"
"PCIe Power-On Reset"
"PCIe Cold Reset"
"PCIe Shutdown Mode"
PCIe Shutdown Mode (Cold Reset) is a feature on some boards that cuts power entirely to a PCIe slot during a reboot or reset. We have seen hosts in which enabling this feature helped.
3. Cold boot via PCIe [uses triton_multi_ctx]
If you have established a PCIe connection with the Metis card but face communication issues, performing a cold boot via PCIe might help.
Go to your voyager-sdk folder and activate the virtual environment with the source venv/bin/activate command.
Execute the following command:
triton_multi_ctx --cold-boot 3
Let us know if any of these helped.
Best,
Victor
Wow, well done on figuring that complex combo out @zefir! Amazing work! And thanks for sharing - who knows, it may turn out not to be as fringe an issue as we think, as more and more host systems are brought into action.
I just invented a System Sensei badge for exactly this kind of thing, and I think you’ve more than earned it!
Yo, that’s a huge badge - was definitely worth the work :) Thanks @Spanner
Also thanks @Victor Labian for the suggestions regarding warm-boot failure.
Tried the first: the service is now running, but the card is still not listed after warm-boot.
As for the second, my BIOS has billions of settings on how to overclock this and that, but unfortunately nothing regarding PCIe slot handling at reboot - priorities Asus...
The last might come in handy if I observe the card hanging at later stages, but as long as it does not show up on the bus, none of the tools can work.
For now I can live with the limitation and power-cycle the PC to get access. Maybe you can manage to add warm-boot detection to a future FW to handle the reset on the card side.
BTW, the two micro-switches at the upper end of the PCB (SW1, SW2) are not HW-reset switches by any chance, right?