
Hi all,

not sure if my issue is related to this one, but the symptoms are at least similar, so I hope it fits best here. Otherwise, sorry for derailing.

I am trying to get my hands on the PCIe Metis card on my AMD64-based system with an ASUS PRIME X570-P mainboard under Ubuntu 24.04 (6.8.0-58-generic x86_64). The first issue was that the card did not show up under lspci, no matter which kernel boot options I tried. I then looked into the mainboard BIOS to double-check the peripheral config and noticed that the PCIe slots are not fully independent. In fact, the board has two PCIe x16 slots and three PCIe x1 slots, of which the latter are meant for expansion cards and the former for graphics cards. They are pre-configured for ‘auto’ mode, which means the BIOS detects how the slots are populated and splits the available lanes between the two x16 slots.

I changed this ‘auto’ setting to manually split the lanes as ‘x8+x8’, and after rebooting the card is now visible to the kernel:

zefir@zefir-PC:~$ lspci | grep -v AMD
01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
04:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller PM9A1/PM9A3/980PRO
05:00.0 Processing accelerators: Axelera AI Metis AIPU (rev 02)
06:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8211/8411 PCI Express Gigabit Ethernet Controller (rev 15)
0a:00.0 VGA compatible controller: NVIDIA Corporation GK208B [GeForce GT 710] (rev a1)
0a:00.1 Audio device: NVIDIA Corporation GK208 HDMI/DP Audio Controller (rev a1)
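
For anyone who wants to verify the lane split from within Linux, lspci can also report the negotiated link width directly. A generic check (assuming the card stays at 05:00.0 after the BIOS change):

sudo lspci -vv -s 05:00.0 | grep -iE 'lnkcap|lnksta'

LnkCap shows what the card advertises, LnkSta what was actually negotiated with the slot.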

 

So far so good, BUT I observed that this only works after a cold reset, i.e. after power-cycling the system. A warm reboot on my side makes the Metis card disappear; it only shows up again after the PSU has been physically switched off beforehand. From experience with PCIe cards we build in our company, this looks to me like either the BIOS is not issuing the correct PCIe reset sequence to the slot, or the card’s reset handling is outside the spec. But anyway, at least I have a means to get the card visible.
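
Before the next power cycle I also want to try a forced remove/rescan from the kernel side. This is just the generic sysfs mechanism, nothing Metis-specific, and it may well not help if the link never trains after a warm reboot, but it is cheap to test (addresses taken from the topology above):

# drop the downstream port (or the endpoint, if it is still listed), then rescan
echo 1 | sudo tee /sys/bus/pci/devices/0000:03:02.0/remove
echo 1 | sudo tee /sys/bus/pci/rescan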

Unfortunately this turned out to be only the first issue. As instructed by David Marks from the support team, I tried to figure out whether the card is running the latest FW by issuing:

(venv) root@zefir-PC:~/voyager-sdk# axdevice -v --refresh
INFO:axelera.runtime.axdevice:Removing 0000:03:02.0
INFO:axelera.runtime.axdevice:PCIE rescan
0000:05:00.0 : Axelera AI Metis AIPU (rev 02)
INFO:axelera.runtime:Found PCI device: 05:00.0 Processing accelerators: Axelera AI Metis AIPU (rev 02)
INFO:axelera.runtime:Found AIPU driver: metis                  90112  0
WARNING:axelera.runtime:PCI device count mismatch: lspci=1, triton=0
RuntimeError: No devices found, use -v for more information
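
One generic way to make sense of the lspci=1 vs. triton=0 mismatch is to check whether the metis driver actually bound to the function; when the probe fails (see the kernel log below), lspci still lists the device, but no driver is in use:

lspci -k -s 05:00.0

A “Kernel driver in use:” line only appears in the output when the probe succeeded.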

 

While this failure happens, the kernel log shows:

[  175.217669] pci_bus 0000:05: busn_res: [bus 05] is released
[  177.252957] pci 0000:03:02.0: [1022:57a3] type 01 class 0x060400 PCIe Switch Downstream Port
[  177.253000] pci 0000:03:02.0: PCI bridge to [bus 05]
[  177.253327] pci 0000:03:02.0: PME# supported from D0 D3hot D3cold
[  177.253890] pci 0000:03:02.0: Adding to iommu group 17
[  177.254434] pci 0000:05:00.0: [1f9d:1100] type 00 class 0x120000 PCIe Endpoint
[  177.254502] pci 0000:05:00.0: BAR 0 [mem 0xfa010000-0xfa010fff 64bit]
[  177.254507] pci 0000:05:00.0: BAR 2 [mem 0xf8000000-0xf9ffffff]
[  177.254516] pci 0000:05:00.0: ROM [mem 0xfa000000-0xfa00ffff pref]
[  177.254635] pci 0000:05:00.0: supports D1
[  177.254638] pci 0000:05:00.0: PME# supported from D0 D1 D3hot
[  177.254978] pci 0000:05:00.0: Adding to iommu group 23
[  177.255273] pci 0000:03:02.0: PCI bridge to [bus 05]
[  177.255400] pci 0000:03:02.0: bridge window [mem size 0x03000000]: can't assign; no space
[  177.255403] pci 0000:03:02.0: bridge window [mem size 0x03000000]: failed to assign
[  177.255407] pci 0000:05:00.0: BAR 2 [mem size 0x02000000]: can't assign; no space
[  177.255409] pci 0000:05:00.0: BAR 2 [mem size 0x02000000]: failed to assign
[  177.255411] pci 0000:05:00.0: ROM [mem size 0x00010000 pref]: can't assign; no space
[  177.255413] pci 0000:05:00.0: ROM [mem size 0x00010000 pref]: failed to assign
[  177.255416] pci 0000:05:00.0: BAR 0 [mem size 0x00001000 64bit]: can't assign; no space
[  177.255418] pci 0000:05:00.0: BAR 0 [mem size 0x00001000 64bit]: failed to assign
[  177.255420] pci 0000:03:02.0: PCI bridge to [bus 05]
[  177.255930] axl 0000:05:00.0: Failed to request resources
[  177.255968] axl: probe of 0000:05:00.0 failed with error -12
 

And unfortunately, here my expertise ends. Obviously the memory the card requires cannot be assigned, but I don’t know whether that is a limitation of the bridge configuration or whether my chipset is too old.
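
To narrow that down, the window the bridge actually got and the overall MMIO layout can be inspected with generic tools (nothing vendor-specific, just what I plan to look at next):

# memory windows currently assigned to the upstream bridge
sudo lspci -vv -s 03:02.0 | grep -i 'memory behind'
# full MMIO map; shows how much space is left below 4G vs. above
sudo cat /proc/iomem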

I am digging further into similar issues (e.g. https://forums.unraid.net/topic/132930-drives-and-usb-devices-visible-in-bios-not-available-once-booted-asus-wrx80-sage-5965wx/) and will report back with results.

 

Happy hacking

Not totally sure if this helps, but maybe try checking these in your BIOS:

  • Above 4G Decoding – try turning this on if it’s available. It lets the firmware map 64-bit PCIe BARs above the 4 GB boundary, which frees up address space for PCIe devices.

  • IOMMU – you could try turning this off (on AMD boards it is sometimes tied to the SVM setting). It was mentioned in a similar case where PCIe devices couldn’t get assigned enough memory.

It might not solve everything, but could be worth a shot, especially since this looks like a memory mapping issue (a couple of generic checks to verify both settings from Linux are below). Let me know if it helps at all, and we can keep working on it!
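
If you want to verify both settings from a running system before going back into the BIOS, these generic checks should do (assuming standard kernel logging; nothing board-specific):

# IOMMU: AMD-Vi messages appear in the kernel log when it is enabled
sudo dmesg | grep -iE 'amd-vi|iommu'
# With Above 4G Decoding on, the firmware may place 64-bit BARs above 4G,
# which frees up the scarce 32-bit window; check where the PCI buses landed
sudo grep -i 'pci bus' /proc/iomem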


I think this thread can be closed, since the warm-boot issue is already known and documented in the Voyager SDK v1.2.5 Release Notes, which state:

Known Issues & Limitations

Here we document known issues and limitations for this release.

  • Metis does not power on after reboot on RK3588 hosts (SDK-5176)
    Sometimes on RK3588-based platforms the PCIe card or M.2 module is not powered on by the host upon reboot. Until the issue is solved by Rockchip, the issue can be prevented by powering the host off and on instead of rebooting. To recover from a system where the issue manifests, a host power cycle is required, too (i.e., power-off followed by power-on).

 

… my bad, I did not RTFM before wasting that much time :(

The only thing relevant to add is that this is not limited to Rockchip but also happens on other systems, in this case on an AMD X570 chipset-based PC.


Worth knowing that it also happens beyond Rockchip systems, thanks @zefir 👍

