Question

Radxa Rock 5T pcie insufficient memory allocation

April 1, 2026

Maverickz1989
Cadet

Dear Axelera Support,

I am trying to install the Voyager SDK on a Radxa Rock 5T board with the Axelera AI Accelerator M.2. The SDK recommends Ubuntu 22.04, but I could not find an Ubuntu build for the 5T (https://github.com/DHDAXCW/ubuntu-rockchip-rk3588?tab=readme-ov-file)

I have reviewed previous threads, but they are either unresolved (https://community.axelera.ai/support-central-47/subject-metis-aipu-detected-on-pcie-but-not-responding-stage0-load-failure-on-rk3588-compute-board-voyager-sdk-1-5-3-1280) or concern a board that has an Ubuntu image available (https://community.axelera.ai/metis-pcie-7/pcie-connection-doesn-t-work-with-radxa-rock-5b-1101)

I am not sure whether the Ubuntu image meant for the Radxa Rock 5B+ can be used on the Rock 5T.

radxa@rock-5t:~$ cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 12 (bookworm)"
NAME="Debian GNU/Linux"
VERSION_ID="12"
VERSION="12 (bookworm)"
VERSION_CODENAME=bookworm
ID=debian

radxa@rock-5t:~$ cat /sys/class/metis/version
1.4.16

Installation of SDK for debian

This is how I managed to get the SDK installed on Debian:

After cloning the repository: https://github.com/axelera-ai-hub/voyager-sdk/blob/0b25b098ca5fa591d7b5aa2fe71ad017089d64b3/docs/tutorials/install.md

Install base dependencies:

sudo apt update
sudo apt install -y \
  gstreamer1.0-tools \
  gstreamer1.0-plugins-base \
  gstreamer1.0-plugins-good \
  gstreamer1.0-plugins-bad \
  gstreamer1.0-libav \
  libgstreamer1.0-dev \
  python3-gi \
  python3-pip \
  pciutils \
  dkms

Create Debian config (copy from Ubuntu):

cd ~/Documents/voyager-sdk
cp cfg/config-ubuntu-2204-arm64.yaml \
  cfg/config-debian-12-arm64.yaml
sed -i 's/python3\.10-dev/python3.11-dev/g' \
  cfg/config-debian-12-arm64.yaml

Fix Python version mismatch:

sudo apt install -y \
  python3.11-dev \
  libpython3-dev \
  python3-venv \
  build-essential

Remove invalid Ubuntu-only GStreamer package:

python3 - <<'PY'
from pathlib import Path

# Remove the Ubuntu-only package entry from the copied Debian config.
p = Path("cfg/config-debian-12-arm64.yaml")
text = p.read_text().replace(
    " - libgstreamer-plugins-good1.0-dev\n", ""
)
p.write_text(text)
print("patched")
PY

sudo apt install -y \
  gstreamer1.0-plugins-good \
  libgstreamer-plugins-base1.0-dev \
  libgstreamer-plugins-bad1.0-dev \
  libgstrtspserver-1.0-dev

sudo apt install -y \
  librga-dev \
  librga2 \
  libasio-dev \
  libboost-log1.74.0 \
  libboost-program-options1.74.0 \
  libboost-regex1.74.0 \
  libeigen3-dev \
  graphviz \
  unzip

Operator build failure - Fixing Werror / arch workaround

Problem: Unsupported architecture: unknown ""

export ARCH=arm64
export AARCH64=1
export WERROR=0

Run installer:

./install.sh --runtime --no-development

(venv) radxa@rock-5t:~/Documents/voyager-sdk$ lspci | grep -i axelera
0000:01:00.0 Processing accelerators: Axelera AI Metis AIPU (rev 02)

(venv) radxa@rock-5t:~/Documents/voyager-sdk$ lsmod | grep metis
metis 118784 0

(venv) radxa@rock-5t:~/Documents/voyager-sdk$ triton_multi_ctx
usage: triton_multi_ctx

Issues

However, these are the issues:

1st issue: (venv) radxa@rock-5t:~/Documents/voyager-sdk$ axdevice

WARNING: 4PCI device count mismatch: lspci=1, triton=0

The output can be seen in axdevice.txt

I could not properly diagnose this, so I passed the output to ChatGPT, which gave me this diagnosis (needs to be verified):

BAR 2: no space for [mem size 0x02000000]

followed by BAR 0 / BAR 6 also failing to assign. That means the ROCK 5T host can enumerate the Metis card on PCIe, but it is not giving the device enough MMIO/BAR address space, so the Axelera runtime cannot fully map the card. That matches why you get lspci=1, triton=0: Linux sees the device, but the runtime cannot use it.

2nd issue: axelera-multi-device.service also failed to start. Its output can be found in axelera-multi-device.service.txt.

Root problem

The Metis needs 32MB for BAR2 (non-prefetchable). The original pcie@fe150000 DT node only had a 14MB MEM window (0xf0200000–0xf0ffffff), so the kernel could never assign BAR2.
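The size mismatch can be checked with a little arithmetic. The sketch below uses only the numbers quoted in this post (the 0x02000000 BAR 2 size from the log and the 0xf0200000 to 0xf0ffffff window); the helper name `window_size` is made up for illustration.

```python
MIB = 1024 * 1024

def window_size(start: int, end_inclusive: int) -> int:
    """Return the size in bytes of a window with inclusive bounds."""
    return end_inclusive - start + 1

# Values quoted in the post.
bar2_size = 0x02000000                             # "BAR 2: no space for [mem size 0x02000000]"
orig_window = window_size(0xF0200000, 0xF0FFFFFF)  # original pcie@fe150000 MEM window

print(f"BAR 2 needs {bar2_size // MIB} MiB")       # 32 MiB
print(f"window has  {orig_window // MIB} MiB")     # 14 MiB
print("fits" if bar2_size <= orig_window else "does not fit")
```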

Attempted Fixes

The attempted fixes target the PCIe memory allocation in the device tree rk3588-rock-5t.dtb, found in radxa@rock-5t:/usr/lib/linux-image-6.1.84-8-rk2410/rockchip/rk3588-rock-5t.dtb :

Fallback, Recovery & Serial Debugging

I made a backup copy before any edits. This is the usual pipeline:

vi ~/rock5t-safe.dts # change only 0xe00000 → 0x04000000
dtc -I dts -O dtb -o ~/rock5t-safe.dtb ~/rock5t-safe.dts
sudo cp ~/rock5t-safe.dtb /usr/lib/linux-image-$(uname -r)/rockchip/rk3588-rock-5t.dtb
sudo reboot

If an edit causes a crash at boot, I recover by moving the SD card to my laptop and restoring the backed-up dtb:

sudo cp \
  /media/<usr-name>/rootfs/home/radxa/dtb-backup/linux-image-6.1.84-8-rk2410/rockchip/rk3588-rock-5t.dtb \
  /media/<usr-name>/rootfs/usr/lib/linux-image-6.1.84-8-rk2410/rockchip/rk3588-rock-5t.dtb
sync
sudo umount /media/<usr-name>/rootfs
sudo umount /media/<usr-name>/config

The reason for each crash can be found by serial debugging over a USB-to-TTL cable.

GPIO pins: pin 6 → GND, pin 8 → TX, pin 10 → RX

I view the serial logs in Tabby on device /dev/ttyUSB0 at baud rate 1500000 with 8 data bits, 1 stop bit and no parity. All other settings are left at their defaults.

I’m not an expert, so I relied on Claude and ChatGPT to help with the ideation and diagnosis of the fixes. I can provide the logs where necessary.

Attempt Logs

Attempt 1 — Expand fe150000 MEM to 64MB

Changed ranges size from 0x00e00000 to 0x04000000.

Failed — this caused fe150000 MEM end (0xf41fffff) to overlap with fe180000 config (0xf3000000) and fe190000 config (0xf4000000), triggering a resource collision and kernel NULL pointer dereference crash in rk_pcie_remove.
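The collision in Attempt 1 follows from the addresses alone. As a rough sketch using the values from this post (the function names are hypothetical, and only the two config bases mentioned above are checked):

```python
def window_end(base: int, size: int) -> int:
    """Last address covered by a window of `size` bytes starting at `base`."""
    return base + size - 1

def bases_inside(base: int, size: int, others: dict[str, int]) -> list[str]:
    """Names of regions whose base address falls inside the window."""
    end = window_end(base, size)
    return [name for name, addr in others.items() if base <= addr <= end]

mem_base = 0xF0200000   # start of the fe150000 MEM window
new_size = 0x04000000   # 64 MiB, as tried in Attempt 1
config_bases = {"fe180000": 0xF3000000, "fe190000": 0xF4000000}

print(hex(window_end(mem_base, new_size)))             # 0xf41fffff
print(bases_inside(mem_base, new_size, config_bases))  # ['fe180000', 'fe190000']
```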

Attempt 2 — Expand fe150000 + shift fe160000/fe170000/fe180000/fe190000

Moved all five controllers to new address ranges to avoid collision.

Failed — fe180000 and fe190000 were moved to 0xf7/0xf8 ranges which appear to be outside what the RK3588 hardware actually supports, causing a boot hang. The system never reached SSH.

Attempt 3 — Expand fe150000 to 46MB + shift only fe160000/fe170000, leave fe180000/fe190000 original

fe150000 MEM: 0xf0200000–0xf2ffffff (46MB, enough for Metis).

fe160000 config: 0xf5000000, fe170000 config: 0xf6000000.

fe180000/fe190000 unchanged at original 0xf3/0xf4 addresses.

Partially successful: no collision, no crash. fe150000 linked up at Gen.3 x2. However, the system still hung, apparently on a readl spin on CPU7 during PCIe enumeration.
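One further check may be worth running on the Attempt 3 window. PCI memory BARs are naturally aligned to their size, so a 32 MiB BAR has to start on a 32 MiB boundary. The sketch below assumes the window is identity-mapped (PCI bus address equals CPU address), which this thread does not state, and `aligned_slot` is a hypothetical helper.

```python
MIB = 1024 * 1024

def aligned_slot(win_start: int, win_end_inclusive: int, bar_size: int):
    """Return the first naturally aligned start address for a BAR of
    `bar_size` bytes inside the window, or None if no such slot exists.
    `bar_size` must be a power of two."""
    start = (win_start + bar_size - 1) & ~(bar_size - 1)  # round up to alignment
    if start + bar_size - 1 <= win_end_inclusive:
        return start
    return None

win_start, win_end = 0xF0200000, 0xF2FFFFFF  # Attempt 3 MEM window
bar2_size = 0x02000000                       # 32 MiB BAR 2

print((win_end - win_start + 1) // MIB)      # 46
slot = aligned_slot(win_start, win_end, bar2_size)
print(hex(slot) if slot is not None else "no aligned 32 MiB slot")
```

Under that assumption the window is large enough in total but contains no 32 MiB-aligned slot, so it may be worth confirming in dmesg that BAR 2 is actually assigned once the board boots far enough.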

Attempt 4 — Identify source of hang

Disabled fe150000 entirely to isolate whether it was causing the hang.

Result — system still hung at same point ([14.77x] imx415 sensor). Revealed the hang was not fe150000 but something else.

Attempt 5 — Disable camera DTBO overlays

Removed rock-5t-cam0-radxa-camera-4k.dtbo and rock-5t-cam1-radxa-camera-4k.dtbo from extlinux.conf.

Result — revealed a new and different crash: kernel stack overflow in pci_do_find_bus with infinite recursion (~80+ levels deep). The camera overlays were masking this crash by hanging earlier.

Root cause of stack overflow identified

The Metis internal bridge advertises subordinate bus = 0xff, causing pci_scan_bridge_extend → pci_scan_child_bus_extend → pci_scan_bridge_extend to recurse infinitely until kernel stack exhausts.

Attempt 6 — Change bus-range from <0x00 0x0f> to <0x01 0x0f>

Hypothesis was that bus 0 conflict caused the recursion.

Failed — same stack overflow, same recursion depth. Bus range starting at 1 did not prevent the subordinate bus scan recursion.

Attempt 7 — Add pci=noaer pcie_aspm.policy=performance kernel parameters

Failed — same stack overflow unchanged. These parameters do not affect bridge subordinate bus scanning behavior.

Current state

The DT memory allocation is correct and working (46MB window, no collisions, Gen.3 x2 link confirmed). The remaining problem is purely a kernel-level PCIe bridge scanning issue triggered by the Metis advertising subordinate=0xff. The next attempt is pci=noaer,noexpand pcie_aspm=off pcie_port_pm=off, which is meant to prevent subordinate bus range expansion during enumeration. With this bug I still cannot complete the boot sequence.

I have attached the dtb in .txt format for your perusal. Thank you!

Spanner
Axelera Team
April 1, 2026

Hi @Maverickz1989, really thorough debugging here, especially catching that the camera overlays were masking the real crash.

The BAR allocation issue is a known RK3588 limitation and your Attempt 3 DT changes (46MB window) are on the right track. On another RK3588 board (Orange Pi 5 Plus) the missing piece was adding pci=realloc as a kernel parameter alongside the expanded DT memory window. That parameter changes how the kernel assigns PCIe resources during enumeration, which may also affect the subordinate bus scanning behaviour you're hitting.

Could you try your Attempt 3 DT with pci=realloc added to your kernel cmdline in extlinux.conf and share the full dmesg from that boot? That should tell us whether we’re moving in the right direction. Let me know!
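For anyone following along, adding the parameter means appending it to the `append` line of the boot entry in extlinux.conf. A minimal sketch of that edit, run against a made-up sample rather than a real Rock 5T file (the label, kernel path, and root device below are placeholders):

```python
def add_kernel_param(extlinux_text: str, param: str) -> str:
    """Append `param` to every `append` line that does not already have it."""
    out = []
    for line in extlinux_text.splitlines():
        words = line.split()
        if words and words[0].lower() == "append" and param not in words:
            line = f"{line.rstrip()} {param}"
        out.append(line)
    return "\n".join(out) + "\n"

# Hypothetical extlinux.conf content, for illustration only.
sample = (
    "label Debian\n"
    "    linux /boot/vmlinuz\n"
    "    append root=/dev/mmcblk0p3 rw\n"
)

patched = add_kernel_param(sample, "pci=realloc")
print(patched)
```

Running the function a second time leaves the text unchanged, so the parameter is not duplicated.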

Maverickz1989
Author
Cadet
April 2, 2026

Hi Spanner, thank you for the follow up.

pci=realloc didn't help — the stack overflow is identical to before. The crash is the same root cause: the Metis internal bridge advertising subordinate=0xff, causing infinite recursion in pci_do_find_bus.

The log confirms this:

[   14.460750] Insufficient stack space to handle exception!
[   14.460783] pc : pci_do_find_bus+0x14/0x68
[   14.460797] lr : pci_do_find_bus+0x58/0x68
[   14.460879]  pci_do_find_bus+0x14/0x68
[   14.460885]  pci_do_find_bus+0x58/0x68
                ... (repeated ~50 times)
[   14.461222]  pci_scan_bridge_extend+0x450/0x56c
[   14.461230]  pci_scan_child_bus_extend+0x204/0x2a8
[   14.461799]  pci_scan_child_bus+0x18/0x20
[   14.461804]  pci_scan_root_bus_bridge+0xa4/0xc4
[   14.461822]  rk_pcie_really_probe+0x8d8/0xafc
[   14.461826]  kthread+0xc0/0xd0
[   14.461838] Kernel panic - not syncing: kernel stack overflow

The Metis bridge is advertising subordinate bus 0xff (255). The kernel keeps calling pci_do_find_bus recursively for every bus number up to 255, exhausting the kernel stack before it gets there.

I believe that pci=realloc tells the kernel to re-examine and reassign all BAR/memory resources from scratch rather than trusting what firmware set up. Useful for the original BAR allocation problem, but completely unrelated to the bridge scanning recursion. It doesn't touch how many buses get scanned.

I’ve attached the boot up messages (serial logging) in add-pci-realloc.txt

A comment from claude:

The recursion happens inside pci_do_find_bus, which walks the kernel's internal linked list of already-registered bus structs — not the DT bus-range. Here's the sequence:

1. Kernel enumerates fe150000, finds the Metis card
2. Metis bridge advertises subordinate=0xff in its PCI config register
3. Kernel calls pci_scan_bridge_extend() which reads that 0xff directly from the hardware register
4. Kernel then calls pci_do_find_bus() recursively for every bus up to 0xff
5. Stack overflows at around bus 50-something

The critical point is step 3: the kernel reads subordinate=0xff from the hardware, not from the DT. The DT bus-range governs resource reservation but does not cap what the kernel reads from the bridge's subordinate register during scanning.
This was confirmed by the previous Attempt 6: changing bus-range from <0x00 0x0f> to <0x01 0x0f> made zero difference to the recursion depth.
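To make the recursion easier to picture, here is a toy model in Python of a depth-first lookup over a tree of registered buses. It is not kernel code: the `Bus` class, `find_bus`, and `chain` are invented for illustration, and the 50-bus chain is an arbitrary stand-in for the roughly 50 stacked frames in the log. In this model the recursion depth follows how deeply the buses are nested.

```python
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class Bus:
    number: int
    children: list[Bus] = field(default_factory=list)

def find_bus(bus: Bus, busnr: int, depth: int = 1):
    """Depth-first search for `busnr`.
    Returns (matching bus or None, deepest recursion level reached)."""
    if bus.number == busnr:
        return bus, depth
    deepest = depth
    for child in bus.children:
        found, reached = find_bus(child, busnr, depth + 1)
        deepest = max(deepest, reached)
        if found is not None:
            return found, deepest
    return None, deepest

def chain(length: int) -> Bus:
    """Build buses 0..length-1, each nested under the previous one."""
    root = Bus(0)
    node = root
    for n in range(1, length):
        child = Bus(n)
        node.children.append(child)
        node = child
    return root

root = chain(50)
_, depth_hit = find_bus(root, 49)     # deepest bus in the chain
_, depth_miss = find_bus(root, 0xFF)  # bus number that is not registered
print(depth_hit, depth_miss)          # 50 50
```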

Additional kernel config findings

After that, I looked through the kernel's PCI config options for anything else that could be used.

They can be found in pci-configs.txt. This is what Claude commented:

Two important things from your kernel config:

CONFIG_PCIE_RK_THREADED_INIT=y

The Rockchip PCIe driver uses a threaded init — this is why the crash happens on a kthread (rk-pcie) rather than during main boot. It won't help directly but confirms the driver behaviour.

# CONFIG_PCIEAER is not set

# CONFIG_HOTPLUG_PCI_PCIE is not set

AER and PCIe hotplug are already compiled out, so pci=noaer was always a no-op too.

Alternative kernel investigated - linux-image-6.1.84-13-rk2410-nocsf

Since kernel parameters were exhausted, I looked for a newer Radxa BSP kernel available via apt that might contain a PCIe fix. linux-image-6.1.84-13-rk2410-nocsf appeared promising as it is a newer patch revision of the same base kernel (-13 vs -8). The -nocsf suffix suggested a meaningful difference — CSF stands for Command Stream Frontend, a Mali GPU feature.
To assess whether it contained any PCIe fixes without installing it, I downloaded the .deb and extracted its kernel config, then diffed it against the running kernel's config filtering for PCI-related lines. The diff returned no PCI differences whatsoever. The only changes were:

CONFIG_BLK_DEV_LOOP changed from built-in to module (irrelevant)
CONFIG_MALI_CSF_SUPPORT disabled (confirming -nocsf is purely a Mali GPU change)

The PCI subsystem is compiled identically to the current kernel. This kernel was ruled out as it would not affect the bridge scanning behaviour.
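The comparison described above could be scripted roughly as follows. This is a sketch, not the exact commands used: the two config snippets are tiny stand-ins built from the option names mentioned in this post, and `pci_lines` and `pci_diff` are hypothetical helpers.

```python
def pci_lines(config_text: str) -> set[str]:
    """Collect the config lines that mention PCI (set or unset)."""
    return {ln.strip() for ln in config_text.splitlines() if "PCI" in ln}

def pci_diff(old_cfg: str, new_cfg: str) -> tuple[list[str], list[str]]:
    """Return (lines only in old, lines only in new) among PCI-related lines."""
    old, new = pci_lines(old_cfg), pci_lines(new_cfg)
    return sorted(old - new), sorted(new - old)

# Stand-in for the running kernel's config.
running = """\
CONFIG_PCIE_RK_THREADED_INIT=y
# CONFIG_PCIEAER is not set
CONFIG_BLK_DEV_LOOP=y
CONFIG_MALI_CSF_SUPPORT=y
"""
# Stand-in for the -nocsf kernel's config.
candidate = """\
CONFIG_PCIE_RK_THREADED_INIT=y
# CONFIG_PCIEAER is not set
CONFIG_BLK_DEV_LOOP=m
# CONFIG_MALI_CSF_SUPPORT is not set
"""

print(pci_diff(running, candidate))  # ([], [])
```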

Potential solutions?

Will be attempting these next

Option 1 (Kernel Fix): The kernel source for pci_scan_bridge_extend() reads the subordinate bus number directly from the bridge's config register. When it gets 0xff, it trusts it unconditionally and tries to scan all 255 buses. A patch would add a clamp like: "if subordinate - primary > some reasonable limit (e.g. 8), cap it."

Option 2 (Ubuntu Radxa 5B+ kernel + 5T DTB): I have also found an alternative but less proven solution which is to take the ubuntu version meant for rock 5b+ (https://joshua-riek.github.io/ubuntu-rockchip-download/boards/rock-5b-plus.html) and utilise the device tree blob for 5T (https://github.com/rafayahmed317/Rock-5T-DTB).
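The clamp described in Option 1 can be sketched in a few lines. This is only the arithmetic of the idea in Python, not a kernel patch; `clamp_subordinate` is a hypothetical name and the limit of 8 is the example value from the text.

```python
def clamp_subordinate(primary: int, subordinate: int, limit: int = 8) -> int:
    """Cap the subordinate bus number if it is implausibly far from primary."""
    if not (0 <= primary <= 0xFF and 0 <= subordinate <= 0xFF):
        raise ValueError("bus numbers must fit in 8 bits")
    if subordinate - primary > limit:
        return primary + limit
    return subordinate

print(clamp_subordinate(0, 0xFF))  # 8
print(clamp_subordinate(0, 4))     # 4
```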

Some questions

Does Axelera have a downstream kernel patch or recommended kernel version for this board/card combination?
Is there a known BSP patch for drivers/pci/probe.c targeting the subordinate=0xff infinite recursion on RK3588?

Thank you so much!

add-pci-realloc.txt

pci-configs.txt

Habib
Community Manager
April 2, 2026

@Maverickz1989 thanks for your message! For custom SBCs using Voyager SDK v1.5.3, we recommend installing the driver natively (you likely already have the latest version) and installing the SDK itself via Docker, although the Docker recommendation might change in future releases.

Also, please try to install the driver natively via the Voyager SDK’s installer. This will make sure the driver is compatible with release v1.5.3.

Victor Labian
Axelera Team
April 2, 2026

Hi @Maverickz1989 ,

What you are describing is a known issue on RK3588 hosts.

The default SoC device tree assigns less than 33MB for the non-prefetchable memory window of each PCIe root complex.

To work with Metis, you need to increase that size in your custom device tree or insert a PCIe overlay.

I have never tested a Radxa Rock, but here are two guides I created some time ago on how I managed to support Orange Pi 5 Plus and NanoPC-T6 by modifying the device tree:

In the case of NanoPC-T6 I needed to use the Ubuntu 22.04 image from https://joshua-riek.github.io/ubuntu-rockchip-download/boards/nanopc-t6.html, as the official NanoPC-T6 image did not allow us to properly modify the device tree.

I hope these two guides serve as a good initial reference. Note that we now support Ubuntu 22.04 and Ubuntu 24.04 (more OSs to come soon). For other OSs please use Docker as explained in https://support.axelera.ai/hc/en-us/articles/25953148201362-Install-Voyager-SDK-in-a-Docker-Container .

Let us know how it goes.

Best,

Victor

Maverickz1989
Author
Cadet
April 22, 2026

Hi Victor,

I have got my hands on a Radxa Rock 5B Plus and also the NanoPC-T6.

The NanoPC-T6 guide works with Joshua’s Ubuntu 22.04 image! Thanks for putting the guide together.

However, the Radxa Rock 5B Plus shows exactly the same issue as the Rock 5T. I will keep following the threads until a solution is found.

I do understand these are 2 separate products:

I’m quite curious how the PCIe version works on the Rock 5B Plus while the M.2 version does not.

Victor Labian
Axelera Team
April 29, 2026

Hi @Maverickz1989 ,

Thank you for the update, let us know how it goes for the Rock 5B.

I recommend that you install the latest SDK, which is v1.6 (https://github.com/axelera-ai-hub/voyager-sdk).

Something you might need to check is that the linux headers are correctly installed and that the driver is built for the right kernel version. I include here a guide with info about this: https://support.axelera.ai/hc/en-us/articles/29335553753874-Build-and-load-Metis-driver-on-host-manually

Kind regards,

Victor
