Hello Axelera support team,
I am running the Axelera Metis M.2 accelerator on a Radxa Rock 5B (RK3588 SoC) with Armbian and kernel 6.18.10-current-rockchip64. After extensive debugging over several days, I have identified a compatibility issue between the metis driver and the ARM SMMU-v3 on this platform. I would appreciate your guidance on resolving this.
HARDWARE & SOFTWARE
- Board: Radxa Rock 5B (RK3588, 8GB RAM, PCIE3.0-4X)
- OS: Armbian, kernel 6.18.10-current-rockchip64 (mainline)
- Accelerator: Axelera Metis M.2
- Driver version: metis 1.4.16
- SDK/Runtime version: voyager-sdk / runtime-1.6.0-1
- PSU: 65W
WHAT WORKS
- The Metis is detected correctly via PCIe: lspci shows "Axelera AI Metis AIPU (rev 02)" at 0000:01:00.0
- After patching the Device Tree to expand the PCIe memory window (the default 14MB window was too small; expanded the 64-bit prefetchable range at 0x900000000 to 128MB), the driver loads successfully:
- /dev/metis0 -> metis-0:1:0 is created
- "Axelera AIPU PCIe Driver, version 1.4.16, init OK" appears in dmesg
- PCIe link runs at 8GT/s x4 correctly
- Model compilation and deployment complete correctly (yolov8s-coco-onnx for 4 cores) when CPU frequency is limited to 1.8 GHz or below
THE PROBLEM
After the model deploy completes, the system fails with a Bus error when the SDK attempts to load the runtime firmware into the Metis via DMA:
[libaxldev_linux.c:1515] Device communication timed out: device did not respond within 1 seconds.
Failed to load firmware: /opt/axelera/device-1.6.0-1/omega/bin/start_axelera_runtime.elf
Bus error
With the default MSI configuration (32 MSI), the system instead performs a silent hard reset 25-50 seconds into the deploy process. The relevant dmesg pattern before the reset is:
axl 0000:01:00.0: vmsi configured
axl 0000:01:00.0: IRQ MSI timeout (12 1)
axl 0000:01:00.0: vmsi configured
Important observations about the reset behaviour:
- kernel.panic=10 has no effect — it is a hardware-level reset, not a kernel crash
- No AER errors are recorded in /sys/bus/pci/devices/0000:01:00.0/aer_dev_fatal or aer_rootport_total_err_fatal
- CPU load at time of reset is ~20%, temperature is ~40°C — ruling out thermal or CPU overload causes
- With additional external cooling keeping temperatures below 40°C, the reset takes longer (50 seconds vs 25 seconds without extra cooling). This correlates with CPU thermal throttling: without extra cooling the CPU runs at higher frequencies causing more aggressive DMA, which triggers the SMMU fault faster
- Limiting CPU frequency to 1008MHz allows the deploy to complete (though it takes ~28 minutes), but the Bus error still occurs when loading the runtime firmware afterwards
- The problem persists regardless of the number of cores configured: both AXELERA_CONFIGURE_BOARD=,30 (4 cores) and AXELERA_CONFIGURE_BOARD=,10 (1 core) produce the same Bus error
- The reset is completely silent — no kernel messages whatsoever before the system goes down
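In case it helps reproduce the AER check, this is roughly how the counters can be read. It is only a sketch: it assumes the aer_dev_fatal file uses the "name count" per-line layout with a final TOTAL_ERR_FATAL line, and the sample text in the usage lines is illustrative, not captured from my board.

```python
from pathlib import Path

def parse_aer_stats(text):
    """Parse 'Error Name <count>' lines into a dict of counters."""
    counters = {}
    for line in text.splitlines():
        # Error names may contain spaces, so split on the last space only.
        name, _, count = line.rpartition(" ")
        if name and count.isdigit():
            counters[name] = int(count)
    return counters

def read_aer_fatal(bdf="0000:01:00.0", sysfs_root="/sys"):
    """Return the fatal AER counters for a device, or None if unavailable."""
    path = Path(sysfs_root) / "bus/pci/devices" / bdf / "aer_dev_fatal"
    try:
        return parse_aer_stats(path.read_text())
    except OSError:
        return None

# Illustrative sample (assumed layout), not real output from the Rock 5B.
sample = "Undefined 0\nData Link Protocol 0\nTOTAL_ERR_FATAL 0\n"
stats = parse_aer_stats(sample)
print(stats["TOTAL_ERR_FATAL"])
```

Polling read_aer_fatal() in a loop during the deploy is how the counters were watched; they stayed at zero until the reset.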
The root cause I have identified is the ARM SMMU-v3 (fc900000) intercepting MSI interrupts and DMA from the Metis, as shown by these events in dmesg:
arm-smmu-v3 fc900000.iommu: event: F_TRANSLATION client: 0000:01:00.0 sid: 0x100 ssid: 0x0 iova: 0x30 ipa: 0x0
arm-smmu-v3 fc900000.iommu: unpriv data write s1 "Input address caused fault"
The SMMU-v3 is initialized by TF-A/BL31 firmware before the kernel boots. Even with status="disabled" in the Device Tree for the iommu@fc900000 node, the kernel finds it already active and continues using it. Kernel parameters iommu=off and iommu.passthrough=1 have no effect.
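As a cross-check that the faulting stream really is the Metis, the event line can be decoded with a small script. This is a sketch under two assumptions: the regular expression matches only the exact message layout shown above, and the stream ID is treated as the plain PCI requester ID (bus/device/function), which holds only if the iommu-map applies no offset.

```python
import re

# Assumed layout of the arm-smmu-v3 event line, taken from the dmesg output above.
EVENT_RE = re.compile(
    r"event: (?P<event>\w+) client: (?P<client>\S+) "
    r"sid: (?P<sid>0x[0-9a-fA-F]+) ssid: (?P<ssid>0x[0-9a-fA-F]+) "
    r"iova: (?P<iova>0x[0-9a-fA-F]+) ipa: (?P<ipa>0x[0-9a-fA-F]+)"
)

def parse_smmu_event(line):
    """Extract the fields of one SMMU event line, or return None if it does not match."""
    m = EVENT_RE.search(line)
    if m is None:
        return None
    fields = m.groupdict()
    for key in ("sid", "ssid", "iova", "ipa"):
        fields[key] = int(fields[key], 16)
    return fields

def sid_to_bdf(sid):
    """Interpret a stream ID as a PCI requester ID (assumes no iommu-map offset)."""
    bus = (sid >> 8) & 0xFF
    dev = (sid >> 3) & 0x1F
    fn = sid & 0x7
    return f"{bus:02x}:{dev:02x}.{fn}"

line = ("arm-smmu-v3 fc900000.iommu: event: F_TRANSLATION client: 0000:01:00.0 "
        "sid: 0x100 ssid: 0x0 iova: 0x30 ipa: 0x0")
event = parse_smmu_event(line)
print(event["event"], sid_to_bdf(event["sid"]), hex(event["iova"]))
```

Under those assumptions sid 0x100 decodes to 01:00.0, which matches the Metis address reported by lspci, and the faulting IOVA (0x30) is a very low address that has no mapping in the device's translation tables.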
THINGS ALREADY TRIED
- iommu.passthrough=1, iommu=off, pci=noaer kernel parameters — no effect
- status="disabled" on iommu@fc900000 in Device Tree — SMMU still active (TF-A initializes it before kernel)
- Removing iommu-map from pcie@fe150000 DT node — causes C_BAD_STREAMID errors, same reset
- modprobe metis single_msi=1 — changes error from F_TRANSLATION to C_BAD_STREAMID + IRQ MSI timeout + Bus error
- modprobe metis single_msi=1 dma_poll=1 — no improvement
- Changing msi-map to use ITS0 (0x89) instead of ITS1 (0x132) — same result
- pcie_acs_override=downstream,multifunction — not compiled in this kernel
- Disabling SMMU via sysfs bypass — rejected (EINVAL, group shares Root Port)
- Limiting CPU to 1.8 GHz — deploy completes but Bus error persists when loading runtime firmware
- Using AXELERA_CONFIGURE_BOARD=,10 (single core) — same Bus error, problem is not related to number of cores
- Monitoring AER counters in real time — no errors recorded before reset
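For the sysfs bypass attempt, the default domain type of the device's IOMMU group can be inspected as sketched below. The path assumes the kernel exposes the per-group "type" attribute through the device's iommu_group symlink; on kernels that do, it reports values such as "DMA", "DMA-FQ" or "identity". The helper name and its sysfs_root parameter are my own, added only so the function can be exercised against a test directory.

```python
from pathlib import Path

def iommu_domain_type(bdf, sysfs_root="/sys"):
    """Return the default domain type of the device's IOMMU group, or None if unavailable."""
    type_file = Path(sysfs_root) / "bus/pci/devices" / bdf / "iommu_group" / "type"
    try:
        return type_file.read_text().strip()
    except OSError:
        # No IOMMU group, attribute not present on this kernel, or no permission.
        return None

if __name__ == "__main__":
    # 0000:01:00.0 is the Metis address reported by lspci on this board.
    print(iommu_domain_type("0000:01:00.0"))
```

Writing "identity" to that same file is the bypass that was rejected with EINVAL on this board.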
KEY OBSERVATION
This platform uses kernel 6.18 (mainline) with CONFIG_IOMMU_DEFAULT_TRANSLATED. The ARM SMMU-v3 cannot be disabled at kernel level because TF-A initializes it before the kernel. The metis driver 1.4.16 does not appear to support operation under an active ARM SMMU-v3 with translated DMA in mainline kernels.
The Axelera documentation references the Orange Pi 5 Plus (also RK3588) as a supported platform. That board typically runs a BSP kernel (5.10 or 6.1) where the SMMU is not active for PCIe. Could you confirm whether the metis driver supports ARM SMMU-v3 with mainline kernels, and if so, what configuration is required?
QUESTIONS
- Does metis driver 1.4.16 support operation with ARM SMMU-v3 active (mainline kernel, translated DMA mode)?
- Is there a known workaround for RK3588 platforms with mainline kernel 6.x?
- Is a driver update planned that adds proper SMMU-v3 support?
- Can you share the exact kernel configuration used for the Orange Pi 5 Plus reference setup?
- Is there a way to configure the DMA operations in the SDK to work within the SMMU constraints?
Thank you for your time. I am happy to provide additional logs, dmesg output, or test any patches you may have.
Best regards, Miguel
