
Hi,

Bit of a weird issue with the M.2 Evaluation System (SBC model AIB-MR1B-A1), and I’m worried it could be a hardware fault. I somehow got to a state where I would see 

AXR_ERROR_CONNECTION_ERROR: No target device found in lspci output

lspci didn’t show the accelerator card, but instead listed “Non-VGA unclassified device”:

00:00.0 PCI bridge: Rockchip Electronics Co., Ltd RK3588 (rev 01)
01:00.0 Non-VGA unclassified device: Synopsys, Inc. DWC_usb3 / PCIe bridge

dmesg showed:

[Tue Sep 2 19:51:14 2025] rk-pcie fe170000.pcie: PCIe Linking... LTSSM is 0x3
[Tue Sep 2 19:51:16 2025] rk-pcie fe170000.pcie: PCIe Link Fail
[Tue Sep 2 19:51:16 2025] rk-pcie fe170000.pcie: failed to initialize host

After trying a few things, I resorted to re-imaging the SBC. Then, on the first log-in from adb shell, I still saw Non-VGA unclassified device, but after a reboot, SSH’ing into the Eval System, I saw it as:

00:00.0 PCI bridge: Rockchip Electronics Co., Ltd Device 3588 (rev 01)
01:00.0 Processing accelerators: Device 1f9d:1100

And then after installing the Voyager SDK:

00:00.0 PCI bridge: Rockchip Electronics Co., Ltd RK3588 (rev 01)
01:00.0 Processing accelerators: Axelera AI Metis AIPU (rev 02)

Incidentally I always shut down using the sudo poweroff command, i.e. never an uncontrolled shutdown. All fine, but again today when I came to use it, I saw Non-VGA unclassified device. I rebooted, and lspci showed nothing at all.

root@aetina:~# lspci
root@aetina:~# sudo sh -c 'echo 1 > /sys/bus/pci/rescan'
root@aetina:~# lspci -nn
root@aetina:~# lspci
root@aetina:~# dmesg | grep -iE 'rk-pcie|pcie link|nvme'
[    1.810292] rk-pcie fe150000.pcie: invalid prsnt-gpios property in node
[    1.810321] rk-pcie fe170000.pcie: invalid prsnt-gpios property in node
[    1.815781] rk-pcie fe170000.pcie: missing legacy IRQ resource
[    1.815800] rk-pcie fe170000.pcie: IRQ msi not found
[    1.815811] rk-pcie fe170000.pcie: use outband MSI support
[    1.815819] rk-pcie fe170000.pcie: Missing *config* reg space
[    1.815832] rk-pcie fe170000.pcie: host bridge /pcie@fe170000 ranges:
[    1.815854] rk-pcie fe170000.pcie: err 0x00f2000000..0x00f20fffff -> 0x00f2000000
[    1.815870] rk-pcie fe170000.pcie: IO 0x00f2100000..0x00f21fffff -> 0x00f2100000
[    1.815886] rk-pcie fe170000.pcie: MEM 0x00f2200000..0x00f2ffffff -> 0x00f2200000
[    1.815898] rk-pcie fe170000.pcie: MEM 0x0980000000..0x09bfffffff -> 0x0980000000
[    1.815931] rk-pcie fe170000.pcie: Missing *config* reg space
[    1.815959] rk-pcie fe170000.pcie: invalid resource
[    1.826810] rk-pcie fe150000.pcie: missing legacy IRQ resource
[    1.826836] rk-pcie fe150000.pcie: IRQ msi not found
[    1.826845] rk-pcie fe150000.pcie: use outband MSI support
[    1.826865] rk-pcie fe150000.pcie: host bridge /pcie@fe150000 ranges:
[    1.826905] rk-pcie fe150000.pcie: IO 0x00f0100000..0x00f01fffff -> 0x00f0100000
[    1.826939] rk-pcie fe150000.pcie: MEM 0x0900000000..0x091fffffff -> 0x0040000000
[    1.826957] rk-pcie fe150000.pcie: MEM 0x0920000000..0x093fffffff -> 0x0060000000
[    1.827026] rk-pcie fe150000.pcie: invalid resource
[    2.021962] rk-pcie fe170000.pcie: PCIe Linking... LTSSM is 0x3
[    2.031961] rk-pcie fe150000.pcie: PCIe Linking... LTSSM is 0x0
[    2.047510] rk-pcie fe170000.pcie: PCIe Linking... LTSSM is 0x3
[    2.057516] rk-pcie fe150000.pcie: PCIe Linking... LTSSM is 0x0
<truncated similar lines>
[   27.187539] rk-pcie fe150000.pcie: PCIe Linking... LTSSM is 0x1
[   27.207577] rk-pcie fe170000.pcie: PCIe Linking... LTSSM is 0x3
[   27.214288] rk-pcie fe150000.pcie: PCIe Linking... LTSSM is 0x0
[   29.164223] rk-pcie fe170000.pcie: PCIe Link Fail
[   29.164294] rk-pcie fe170000.pcie: failed to initialize host
[   29.170893] rk-pcie fe150000.pcie: PCIe Link Fail
[   29.170958] rk-pcie fe150000.pcie: failed to initialize host
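
For anyone comparing boots, the rk-pcie lines above can be condensed into a per-controller summary. This is only a rough sketch: the function name summarize_rk_pcie and the regular expressions are my own, written against the message wording in this log, and other kernel versions may print different text.

```python
import re
from collections import defaultdict

# Matches the driver prefix seen in the log above, e.g.
#   rk-pcie fe170000.pcie: PCIe Linking... LTSSM is 0x3
#   rk-pcie fe170000.pcie: PCIe Link Fail
LINE_RE = re.compile(r"rk-pcie (?P<ctrl>[0-9a-f]+\.pcie): (?P<msg>.*)")
LTSSM_RE = re.compile(r"LTSSM is (0x[0-9a-fA-F]+)")


def summarize_rk_pcie(dmesg_text):
    """Return {controller: {"ltssm_seen": [...], "link_fail": bool}}.

    ltssm_seen lists the distinct LTSSM values in order of first appearance.
    """
    summary = defaultdict(lambda: {"ltssm_seen": [], "link_fail": False})
    for line in dmesg_text.splitlines():
        match = LINE_RE.search(line)
        if not match:
            continue  # not an rk-pcie message
        entry = summary[match.group("ctrl")]
        msg = match.group("msg")
        ltssm = LTSSM_RE.search(msg)
        if ltssm and ltssm.group(1) not in entry["ltssm_seen"]:
            entry["ltssm_seen"].append(ltssm.group(1))
        if "PCIe Link Fail" in msg:
            entry["link_fail"] = True
    return dict(summary)
```

Feeding it the text captured from dmesg (for example via subprocess.run(["dmesg"], capture_output=True, text=True).stdout, run as root) gives one entry per controller, which makes it easier to see whether a given boot differs from the failing one.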

After another reboot, I’m back to the Non-VGA unclassified device.

I can try re-imaging the system again, but I’m concerned the same thing will happen again, unless I try to figure out what could have gone wrong. 

I also tried retrieving the live device-tree using sudo dtc -I dtb -O dts -o live.dts /sys/firmware/fdt and it is attached.

Has anyone seen a similar issue? Any debugging steps I should take? If it is indeed an SBC hardware issue, or likely to be one, I’ll try to purchase another SBC or Eval System, but I would like to be fairly sure that it is hardware-related before I try that option.

Incidentally I have tried to re-seat the accelerator card, but it made no difference. I was fairly sure that couldn’t have been an issue anyway, since the board is protected in a cover with just fan holes (no dust or knocks possible to unseat or affect the connections), but figured it was worth a try.

Many thanks!

 

I narrowed it down; I tried re-imaging again, and at the same stage (just after the ./flash.sh and sudo adb shell), I again saw “Non-VGA unclassified device” followed by the correct “Processing accelerators: Device 1f9d:1100” after a reboot.

This was too much of a coincidence! 

Test 1: After a sudo poweroff, followed by unplugging and re-inserting the DC power connector, I again saw “Non-VGA unclassified device”. I repeated this test 10 times and saw “Non-VGA unclassified device” each time.

Test 2: After a sudo poweroff, followed by pressing the little black power button, I still saw “Non-VGA unclassified device”. I repeated this about 5 times, with no difference.

Test 3: After a sudo poweroff, followed by unplugging and re-inserting the DC power connector, I again saw “Non-VGA unclassified device” (this is the same as Test 1 so far), but this time I typed sudo reboot and then I saw “Processing accelerators: Device 1f9d:1100”! I repeated this 10 times, and it’s consistent: I see the correct output after a subsequent reboot, but not immediately after a poweroff cycle.
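
To make repeated runs of these tests less manual, the lspci output can be reduced to one of the states seen in this thread. This is a minimal sketch: the state names and the functions classify_lspci and current_state are hypothetical, and the matching relies only on the class strings quoted above.

```python
import subprocess


def classify_lspci(lspci_text):
    """Map lspci output to one of the states observed in this thread."""
    lines = [ln for ln in lspci_text.splitlines() if ln.strip()]
    if not lines:
        return "no-devices"      # lspci printed nothing at all
    if any("Processing accelerators" in ln for ln in lines):
        return "accelerator"     # card enumerated as expected
    if any("Non-VGA unclassified device" in ln for ln in lines):
        return "unclassified"    # the faulty state after a power cycle
    return "other"               # e.g. only the PCI bridge is listed


def current_state():
    """Run lspci on this machine and classify its output."""
    result = subprocess.run(["lspci"], capture_output=True, text=True)
    return classify_lspci(result.stdout)
```

Calling current_state() once after each boot and appending the result to a file would give a simple record of how often each state occurs after a cold power-up versus a reboot.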

I think this confirms that it’s an SBC issue, most likely power-related. I’m speculating a lot, but perhaps the power consumption at power-up is just high enough to cause a slight voltage dip that makes the link fail, whereas the power may be more stable during a reboot. I’ll try to reach out to the SBC manufacturer to see if they have something to suggest based on these symptoms. At least I now know that a reboot will (or may! - I have yet to plug other things into the SBC which will consume some current) get things working if the board has been power-cycled. I think I may need to replace this SBC however, since there may be knock-on effects if it is indeed a hardware fault.

 


Hi @shabaz,

 

thanks for your extensive investigation and description!  Before we continue, can you clarify which version of the Voyager SDK you were using during your tests?  And also which firmware versions were installed on your Metis card and which kernel driver version was installed on the host?

You can find out the versions as follows:

  • Voyager SDK version: Starting from the latest 1.4.0 release, the command axversion is available. Otherwise, inspect the existing directory names under /opt/axelera/ and derive the version number from those (e.g., /opt/axelera/runtime-1.3.3-1 would mean SDK 1.3.3).
  • Firmware versions on Metis card: Run axdevice -v.  This only works if the device was recognized by the host, but from your description it seems you now know a way to get it into that state reliably.
  • Kernel driver version: cat /sys/class/metis/version
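
The three checks above could be gathered in one go with something like the following. This is a sketch under stated assumptions: the helper names are hypothetical, only the commands and paths mentioned in the list are used, and the directory-name pattern is inferred from the single runtime-1.3.3-1 example.

```python
import os
import re
import shutil
import subprocess


def sdk_version_from_dirname(name):
    """Extract 'X.Y.Z' from a directory name such as 'runtime-1.3.3-1'."""
    match = re.search(r"(\d+\.\d+\.\d+)", name)
    return match.group(1) if match else None


def run_if_available(cmd):
    """Run cmd and return its output, or None if the tool is not on PATH."""
    if shutil.which(cmd[0]) is None:
        return None
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.stdout.strip() or result.stderr.strip()


def collect_versions(opt_dir="/opt/axelera",
                     driver_path="/sys/class/metis/version"):
    """Collect SDK, firmware and kernel driver version information."""
    info = {"sdk": run_if_available(["axversion"])}
    if info["sdk"] is None:
        # Fall back to directory names for SDKs without axversion.
        try:
            names = os.listdir(opt_dir)
        except OSError:
            names = []
        found = {v for v in map(sdk_version_from_dirname, names) if v}
        info["sdk"] = ", ".join(sorted(found)) or None
    # Only meaningful while the host recognizes the device.
    info["firmware"] = run_if_available(["axdevice", "-v"])
    try:
        with open(driver_path) as fh:
            info["driver"] = fh.read().strip()
    except OSError:
        info["driver"] = None
    return info
```

Printing the result of collect_versions() after the reboot that makes the card visible should capture all three values at once; any entry that comes back as None means the corresponding tool or file was not found.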

Hi Manuel,

Thanks for the response!

Here is the info:

Voyager SDK version: 1.4.0

Metis Firmware versions (obtained after doing the sudo reboot to get it visible):

INFO: Found PCI device: 01:00.0 Processing accelerators: Axelera AI Metis AIPU (rev 02)
INFO: Found AIPU driver: metis 57344 0
INFO: Current firmware version v1.2.0-rc2+bl1-stage0 != required version v1.4.0
INFO: Device firmware version is not compatible, loading now
INFO: Using device metis-0:1:0
Device 0: metis-0:1:0 1GiB m2 flver=1.2.0-rc2 bcver=1.0 clock=800MHz(0-3:800MHz) mvm=0-3:100%
device_runtime_firmware=v1.4.0
board_controller_board_type=ortles
sw_throttling: 200°C, hysteresis 5°C, throttle rate:12%
hw_throttling: 105°C, hysteresis 10°C
pvt_warning_threshold: 95°C

Kernel driver version: 1.2.2

 

