Reset and resource allocation failures on Metis PCIe device on AMD64 system
Dear all,
I am trying to get the Metis PCIe card running on my current working PC, which has:
mainboard: Asus Prime X570-P
CPU: AMD Ryzen 9 3900X
OS: Ubuntu 24.04.2 LTS, 6.8.0-58-generic x86_64
The first issue is that warm reset does not work for me: if I reboot without power-cycling the system, the Metis device does not show up in lspci, whereas after a full power-down (with the PSU physically switched off) it is there. This is reproducible and, in my experience, points to either the mainboard BIOS issuing a wrong reset sequence to the PCIe slot or the device itself not performing a full reset on warm boot. Since I know how to get the card detected, this is only an annoyance, but I saw that another user with an AMD64 system in the forum has similar issues, so the problem might go deeper.
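The "does it show up after this boot?" check can also be scripted instead of reading lspci by eye. The following is only a minimal sketch: it assumes the card enumerates with the vendor and device IDs 1f9d:1100 that the kernel log in this thread reports for the endpoint, and it reads the standard Linux sysfs files `vendor` and `device`. The function name `find_pci_devices` is made up for this example.

```python
from pathlib import Path

# Vendor/device IDs as they appear in this thread's kernel log ([1f9d:1100]).
METIS_VENDOR = 0x1F9D
METIS_DEVICE = 0x1100


def find_pci_devices(vendor, device, sysfs_root="/sys/bus/pci/devices"):
    """Return the bus addresses of all PCI functions matching vendor:device."""
    root = Path(sysfs_root)
    if not root.is_dir():
        return []
    matches = []
    for dev_dir in sorted(root.iterdir()):
        try:
            # Each file holds one hex number such as "0x1f9d\n".
            found_vendor = int((dev_dir / "vendor").read_text().strip(), 16)
            found_device = int((dev_dir / "device").read_text().strip(), 16)
        except (OSError, ValueError):
            continue  # entry vanished or is unreadable; skip it
        if (found_vendor, found_device) == (vendor, device):
            matches.append(dev_dir.name)
    return matches


if __name__ == "__main__":
    hits = find_pci_devices(METIS_VENDOR, METIS_DEVICE)
    print("Metis found at: " + ", ".join(hits) if hits else "Metis not enumerated")
```

Running it once after a cold boot and once after a warm boot gives a simple yes/no record of the behaviour described above.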
This leads to the second issue, for which I already filed a support request before realizing that this would be the better place to discuss it.
According to David M., the way to check whether the card is properly installed at the HW and driver level is to issue a
[  131.149382] pci_bus 0000:05: busn_res: [bus 05] is released
[  133.174548] pci 0000:03:02.0: [1022:57a3] type 01 class 0x060400 PCIe Switch Downstream Port
[  133.174598] pci 0000:03:02.0: PCI bridge to [bus 05]
[  133.174981] pci 0000:03:02.0: PME# supported from D0 D3hot D3cold
[  133.175833] pci 0000:05:00.0: [1f9d:1100] type 00 class 0x120000 PCIe Endpoint
[  133.175898] pci 0000:05:00.0: BAR 0 [mem 0xfa010000-0xfa010fff 64bit]
[  133.175902] pci 0000:05:00.0: BAR 2 [mem 0xf8000000-0xf9ffffff]
[  133.175911] pci 0000:05:00.0: ROM [mem 0xfa000000-0xfa00ffff pref]
[  133.176025] pci 0000:05:00.0: supports D1
[  133.176027] pci 0000:05:00.0: PME# supported from D0 D1 D3hot
[  133.176281] pci 0000:03:02.0: PCI bridge to [bus 05]
[  133.176408] pci 0000:03:02.0: bridge window [mem size 0x03000000]: can't assign; no space
[  133.176411] pci 0000:03:02.0: bridge window [mem size 0x03000000]: failed to assign
[  133.176415] pci 0000:05:00.0: BAR 2 [mem size 0x02000000]: can't assign; no space
[  133.176417] pci 0000:05:00.0: BAR 2 [mem size 0x02000000]: failed to assign
[  133.176419] pci 0000:05:00.0: ROM [mem size 0x00010000 pref]: can't assign; no space
[  133.176421] pci 0000:05:00.0: ROM [mem size 0x00010000 pref]: failed to assign
[  133.176423] pci 0000:05:00.0: BAR 0 [mem size 0x00001000 64bit]: can't assign; no space
[  133.176425] pci 0000:05:00.0: BAR 0 [mem size 0x00001000 64bit]: failed to assign
[  133.176427] pci 0000:03:02.0: PCI bridge to [bus 05]
[  133.176897] axl 0000:05:00.0: Failed to request resources
Note that I tried all the suggestions proposed in the other thread (amd_iommu=off intel_iommu=off pcie_aspm=off) with no difference. Also, the above failure message (PCI device count mismatch: lspci=1, triton=0) is common to all tools in the SDK that try to use the device.
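For anyone comparing logs across machines, the "failed to assign" lines can be pulled out of a dmesg dump and grouped per device. This is only a rough sketch: the regular expression assumes the usual kernel wording `[mem size 0x...]: failed to assign` seen in this thread, and the helper name `failed_assignments` is invented for the example.

```python
import re
from collections import defaultdict

# Matches kernel lines such as:
#   pci 0000:05:00.0: BAR 2 [mem size 0x02000000]: failed to assign
FAILED = re.compile(
    r"pci (?P<dev>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"(?P<res>.+?) \[mem size (?P<size>0x[0-9a-f]+)[^\]]*\]: failed to assign"
)


def failed_assignments(dmesg_text):
    """Map each PCI address to a list of (resource name, size in bytes)."""
    failures = defaultdict(list)
    for match in FAILED.finditer(dmesg_text):
        failures[match["dev"]].append((match["res"], int(match["size"], 16)))
    return dict(failures)


if __name__ == "__main__":
    # Sample lines taken from the log in this post.
    sample = """\
[  133.176411] pci 0000:03:02.0: bridge window [mem size 0x03000000]: failed to assign
[  133.176417] pci 0000:05:00.0: BAR 2 [mem size 0x02000000]: failed to assign
[  133.176421] pci 0000:05:00.0: ROM [mem size 0x00010000 pref]: failed to assign
[  133.176425] pci 0000:05:00.0: BAR 0 [mem size 0x00001000 64bit]: failed to assign
"""
    for dev, items in failed_assignments(sample).items():
        for name, size in items:
            print(f"{dev}: {name} needs {size // 1024} KiB")
```

The "can't assign; no space" lines are deliberately ignored so that each failed resource is counted once.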
Any suggestions?
Dear Zefir,
Thanks for sharing this detailed description with us and I hope (besides the booting issue) you already managed to get some models deployed on Metis PCIe.
On the second issue: Can you try “axdevice --refresh”? Sometimes this helps to identify the device.
On the first issue: more information always helps us understand what is going on here. When you run the axdevice command after a cold reboot in a freshly activated container, what output do you get? (I expect it to be something like “Firmware version: v1.2.xxx”.)
Hi @jonask-ai
No, the commands you suggested fail due to the same issue:
it checks whether the PCI bridge has at least 47MB and, since it does not, tries a rescan
[    3.874903] pcieport 0000:03:02.0: Removing bridge device: 0000:03:02.0
[    3.881586] pci_bus 0000:05: busn_res: [bus 05] is released
[    3.886408] pci 0000:03:02.0: [1022:57a3] type 01 class 0x060400 PCIe Switch Downstream Port
[    3.886451] pci 0000:03:02.0: PCI bridge to [bus 05]
[    3.886465] pci 0000:03:02.0: bridge window [mem 0xf8000000-0xfa0fffff]
[    3.886796] pci 0000:03:02.0: PME# supported from D0 D3hot D3cold
The assumption here seems to be that the bridge would resize its memory window to match what its attached nodes need, which it does not:
[    3.900213] pci 0000:05:00.0: [1f9d:1100] type 00 class 0x120000 PCIe Endpoint
[    3.900282] pci 0000:05:00.0: BAR 0 [mem 0xfa010000-0xfa010fff 64bit]
[    3.900287] pci 0000:05:00.0: BAR 2 [mem 0xf8000000-0xf9ffffff]
[    3.900296] pci 0000:05:00.0: ROM [mem 0xfa000000-0xfa00ffff pref]
[    3.900417] pci 0000:05:00.0: supports D1
[    3.900420] pci 0000:05:00.0: PME# supported from D0 D1 D3hot
[    3.916294] pci 0000:03:02.0: PCI bridge to [bus 05]
[    3.916367] pci 0000:03:02.0: bridge window [mem size 0x03000000]: can't assign; no space
[    3.916371] pci 0000:03:02.0: bridge window [mem size 0x03000000]: failed to assign
[    3.916376] pci 0000:05:00.0: BAR 2 [mem size 0x02000000]: can't assign; no space
[    3.916379] pci 0000:05:00.0: BAR 2 [mem size 0x02000000]: failed to assign
[    3.916382] pci 0000:05:00.0: ROM [mem size 0x00010000 pref]: can't assign; no space
[    3.916385] pci 0000:05:00.0: ROM [mem size 0x00010000 pref]: failed to assign
[    3.916388] pci 0000:05:00.0: BAR 0 [mem size 0x00001000 64bit]: can't assign; no space
[    3.916391] pci 0000:05:00.0: BAR 0 [mem size 0x00001000 64bit]: failed to assign
[    3.916394] pci 0000:03:02.0: PCI bridge to [bus 05]
[    3.921373] pci 0000:03:02.0: Bridge removal and PCIe rescan completed successfully
What I don’t quite understand is where the requirement to have 47MB of memory at the bridge (EXPECTED_MEM_BEHIND_BRIDGE_SIZE) comes from, since the two BARs of the Metis card are smaller than that (0xfa010000-0xfa010fff, 0xf8000000-0xf9ffffff).
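To make the sizes concrete, here is the arithmetic on the ranges from the log as a small sketch. The only assumption beyond the log itself is that bridge memory windows are allocated in 1 MiB steps, which is how PCI-to-PCI bridge windows work in general. The script does not explain where the 47MB threshold comes from; it only puts the numbers side by side.

```python
def region_size(start, end):
    """Size in bytes of an inclusive address range, as printed by the kernel."""
    return end - start + 1


def align_up(value, alignment):
    """Round value up to the next multiple of alignment (a power of two)."""
    return (value + alignment - 1) & ~(alignment - 1)


MIB = 1024 * 1024

# Ranges copied from the kernel log in this thread.
regions = {
    "BAR 0": (0xFA010000, 0xFA010FFF),
    "BAR 2": (0xF8000000, 0xF9FFFFFF),
    "ROM": (0xFA000000, 0xFA00FFFF),
}
bridge_window = (0xF8000000, 0xFA0FFFFF)
requested_window = 0x03000000  # from "bridge window [mem size 0x03000000]"

if __name__ == "__main__":
    total = sum(region_size(*r) for r in regions.values())
    print(f"endpoint regions: {total} bytes")
    print(f"rounded to 1 MiB: {align_up(total, MIB) // MIB} MiB")
    print(f"firmware-assigned window: {region_size(*bridge_window) // MIB} MiB")
    print(f"window size requested on rescan: {requested_window // MIB} MiB")
```

With these numbers, the three endpoint regions (4 KiB, 32 MiB and 64 KiB) round up to 33 MiB, which is exactly the size of the window the bridge had before removal, while the rescan asks for a 48 MiB window.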
Anyway, since it seems to work on other platforms, it might be that my mainboard is not compatible. I might retire the PCIe device and pre-order a Metis Compute Board instead.
Is there anything else I could try to help narrow down the issue?
Yeah, as you say, there could be an inherent compatibility issue with the mainboard. One thing that just sprang to mind, though (which may or may not help, as it was for a different issue!): someone said a while ago that some AMD boards ship with “Above 4G Decoding” turned off in the BIOS. Turning it on can help the system assign resources more appropriately when multiple devices are sharing limited space.
Have you seen a setting for that in your BIOS @zefir? Might be worth a quick test, if so?
Finally!
This was a tough one, and in a desperate monkey-testing session I finally succeeded in getting proper PCI access to the Metis card. It is a very system-specific issue which might not affect many users out there, but I’m posting it just in case.
As written above, my system is based on the AMD X570 chipset, here in an ASUS PRIME X570-P board. It has two PCIEx16 slots and distributes 16 lanes to those slots based on how they are populated. What is buried deep in the manual is this: the slots are not independent but are meant to power either one or two GPU cards. It turned out that when the first slot has a GPU installed, the second slot must either remain empty or hold a second GPU. Other cards are silently disabled: they remain visible in lspci, but their memory access stays disabled.
So the first part of the solution was: put the Metis in the primary slot and the VGA card in the second.
On the BIOS side I had to play around until I found a combination that was stable:
CPU PCIE ASPM Mode: Auto → Disabled
PCIEx16_1 Bandwidth Bifurcation Configuration: Auto → x8/x8
PCIE Above 4G Decoding: Disabled → Enabled
PCIE Resize BAR Support: Disabled → Auto
PCIE SR-IOV Support: Auto → Disabled
With those BIOS settings, my system no longer needs any extra kernel boot parameters.
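The "visible in lspci but memory access disabled" state can be checked from software by looking at the Memory Space Enable bit of the PCI Command register. The sketch below assumes the standard configuration-space layout (Command register at offset 0x04, bit 1 for memory decoding, bit 2 for bus mastering) and the Linux sysfs `config` file; the function names are made up for this example. A cleared bit is only a hint, since it can also mean that no driver has enabled the device yet.

```python
import struct

COMMAND_OFFSET = 0x04  # PCI Command register, 16 bits, little-endian
IO_SPACE_ENABLE = 1 << 0
MEMORY_SPACE_ENABLE = 1 << 1
BUS_MASTER_ENABLE = 1 << 2


def decode_command(config_bytes):
    """Decode enable bits from the first bytes of PCI configuration space."""
    if len(config_bytes) < COMMAND_OFFSET + 2:
        raise ValueError("need at least 6 bytes of configuration space")
    (command,) = struct.unpack_from("<H", config_bytes, COMMAND_OFFSET)
    return {
        "io": bool(command & IO_SPACE_ENABLE),
        "memory": bool(command & MEMORY_SPACE_ENABLE),
        "bus_master": bool(command & BUS_MASTER_ENABLE),
    }


def read_command(address, sysfs_root="/sys/bus/pci/devices"):
    """Read and decode the Command register of one device via sysfs."""
    with open(f"{sysfs_root}/{address}/config", "rb") as handle:
        return decode_command(handle.read(64))


if __name__ == "__main__":
    # 0000:05:00.0 is the address the card had in this thread's log;
    # it will differ on other systems.
    try:
        print(read_command("0000:05:00.0"))
    except OSError as error:
        print(f"could not read configuration space: {error}")
```

The same information is shown by lspci -vv in the "Control:" line as Mem+ or Mem-.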
As noted in the other thread, the warm-boot failure persists, which is already a known issue.
Thanks for the support so far.
Thanks for sharing the solution @zefir !
@Victor Labian - worth adding to the WIP PCI troubleshooting guide
Hello @zefir ,
Thank you for sharing your case and solution.
Regarding the warm boot I have a few things we can check. Can you please help us by testing them one by one and see if one of these makes it work?
(If 1. works, no need to test 2. or 3.)
1. Metis device not detected after startup, but is after reboot
This can be due to the pcie-rescan script not being enabled on boot. Run the following to enable it:
systemctl enable pcie-check.service
Then in any future boots the service will be enabled.
(I don’t expect this to solve your issue but it is worth checking)
2. Enable PCIe Shutdown Mode in the BIOS
Check whether your BIOS has any of these options and, if so, enable it.
Look for an option such as:
"PCIe Slot Power Control"
"PCIe Power-On Reset"
"PCIe Cold Reset"
"PCIe Shutdown Mode"
PCIe Shutdown Mode (Cold Reset) is a feature on some boards that cuts power entirely to a PCIe slot during a reboot or reset. We have seen hosts in which enabling this feature helped.
3. Cold boot via PCIe [uses triton_multi_ctx]
If you established a PCIe connection with the Metis card but face communication issues, performing a cold boot via PCIe might help.
Go to your voyager-sdk folder and activate the virtual environment with the source venv/bin/activate command.
Execute the following command:
triton_multi_ctx --cold-boot 3
Let us know if any of these helped.
Best,
Victor
Wow, well done on figuring that complex combo out @zefir! Amazing work! And thanks for sharing - who knows, it may turn out not to be as fringe an issue as we think, as more and more host systems are brought into action.
I just invented a System Sensei badge for exactly this kind of thing, and I think you’ve more than earned it!
Yo, that’s a huge badge - was definitely worth the work :) Thanks @Spanner
Also thanks @Victor Labian for the suggestions regarding warm-boot failure.
I tried the first: the service is now running, but the card is still not listed after warm-boot.
As for the second, my BIOS has billions of settings on how to overclock this and that, but unfortunately nothing regarding PCIe slot handling at reboot - priorities, Asus...
The last might come in handy once I observe the card hanging at later stages, but as long as it does not show up on the bus, none of the tools can work.
For now I can live with the limitation and power-cycle the PC to get access. Maybe you can add warm-boot detection to a future FW so that the reset is handled on the card side.
BTW, the two micro-switches at the upper end of the PCB (SW1, SW2) by chance are no HW-reset ones, right?
Hi @zefir , thanks for the suggestion.
The switches are meant for internal testing; playing with them is not recommended.
Hi @Bram Verhoef - of course I tried the switches (how would one not :)) and the card neither did a HW-reset nor exploded, so I’ll leave them to internal testing.
@Victor Labian - to sum up and close the warm-boot issue: over the weekend I tried the card on various PCs in my lab, and for me the failure is systematic. They ranged from recent mainboards (e.g. MSI MAG Z790) down to 10+ year old ones (Gigabyte B75M); on all of them the card worked after cold-boot but did not show up in lspci after warm-boot.
Thank you for the update @zefir .
@zefir For our information, how did you cold-boot the Metis card? Did you just unplug and re-plug the card manually?
@zefir Just to confirm, are you using an M.2 Metis card?
@Bram Verhoef @Spanner from Zefir’s update (this issue occurring systematically on multiple hosts) it looks to me as if the Metis card has old FW/BC.
Let’s wait for Zefir’s reply, but we might need to consider replacing it with a card on which we are sure the FW and BC are updated. It should have been updated, but maybe there was a mistake. If we need to do this, @Spanner, can you take care of it and support @zefir in the replacement process? Thanks!
Hi @zefir ,
Before we consider replacing the card, there is something we can check to see whether the card has an old Board Controller or not.
Can you help us by sharing the following?
Right after the cold boot, without running inference, do the following and share the output with us (of course this will work only if you manage to establish a PCIe connection after cold boot as before):
1. Check Firmware version of your Metis card [uses triton_multi_ctx]
Go to your voyager-sdk folder and activate the virtual environment with the source venv/bin/activate command.
Execute the following command:
triton_multi_ctx --fwver
This will display the version of the Metis Firmware loaded in your card, for example,
Firmware version: v1.2.5
Share this version and your Voyager SDK version with Axelera AI Support Team.
2. Check Board Controller version of your Metis card [uses triton_multi_ctx]
Go to your voyager-sdk folder and activate the virtual environment with the source venv/bin/activate command.
Execute the following command:
triton_multi_ctx --bc-version
This will display the version of the Board Controller in your card, for example,
Share this version and your Voyager SDK version with Axelera AI Support Team.
If the commands didn’t work, try the following and then repeat 1. and 2.
Cold boot via PCIe [uses triton_multi_ctx]
If you established a PCIe connection with the Metis card but face communication issues, performing a cold boot via PCIe might help.
Go to your voyager-sdk folder and activate the virtual environment with the source venv/bin/activate command.
Execute the following command:
triton_multi_ctx --cold-boot 3
FYI @Spanner @Bram Verhoef
You’re the man @Victor Labian!
Hi @Victor Labian,
For clarification, this is a PCIe card. By cold-boot I mean power-cycling the PC, i.e. issuing `poweroff` in Linux, waiting until the PSU powers off, and then powering up again; by warm-boot I mean rebooting without powering down, by issuing `reboot`.
The commands you suggested give the following result:
If it is the latter, this is after I ran what manubot proposed to update FW with `axdevice --reload-firmware` in the other thread re FW update.
Thanks, and no need for card replacement from my side as long as power-cycling works for me.
Hi @zefir ,
Thank you for the update. The Board Controller version looks fine (v1.0) and the flashed FW (Firmware) version looks ok as well.
To your question: the FW version you see now is 1.2.0, which is the FW flashed in the device; that is OK and expected. When you run inference for the first time after power-on, the SDK automatically loads the 1.2.5 FW onto the Metis card. You can also do that manually with axdevice --reload-firmware.
We are investigating the warm boot issue and will update you if we have any findings.
Thank you for your understanding and for your help in providing all the details; this will be very helpful for the community.
Best,
Victor
Hi @zefir — thanks for sharing those version details.
Just to clarify: despite the version number being a bit confusing, v1.2.0 is the expected firmware version for SDK v1.2.5, so it looks like your card is actually running the correct firmware.
The naming will likely be tweaked in future versions to make it clearer, but for now, it doesn’t look as though any further firmware update is needed from your side.
How’s everything else going with your project build?