## Subject
Metis M.2 on Toradex Verdin i.MX95 enumerates over PCIe, but all Axelera commands time out after a reportedly successful firmware update; MSI interrupt counters never increment and the board firmware version can no longer be queried
## Summary
We are using an Axelera Metis M.2 accelerator on a Toradex Verdin i.MX95 WB module on a Mallow carrier board. PCIe enumeration succeeds and the `metis` kernel module (driver name `axl`) binds to the endpoint, but every Axelera device command times out, so no board or device information can be retrieved. The kernel repeatedly logs:
axl 0000:01:00.0: IRQ MSI timeout (12 1)
This persists with ARM SMMU passthrough enabled via `iommu.passthrough=1`.
The most important observation is that the Metis MSI interrupt counters do not increment when an Axelera command is issued. The endpoint advertises MSI and the driver allocates MSI vectors, but the expected completion interrupt is never observed by Linux.
## Environment
Host:
Toradex Verdin i.MX95 WB on Mallow carrier board
Linux 6.6.138-7.6.1-devel
TDX Wayland with XWayland 7.6.1-devel
aarch64
Axelera stack:
Voyager SDK: 1.6.1
axelera-runtime-1.6.1: 1.6.1
axelera-device-1.6.1: 1.6.1
Metis kernel driver: 1.4.16
Container: voyager-sdk-1.6
Current kernel command line:
root=PARTUUID=9cb1583e-02 ro rootwait console=tty1 console=ttyLP2,115200 iommu.passthrough=1
IOMMU state:
iommu: Default domain type: Passthrough (set via kernel command line)
arm-smmu-v3 490d0000.iommu: msi_domain absent - falling back to wired irqs
## PCIe state
Metis endpoint enumerates:
0000:01:00.0 Processing accelerators [1200]: Axelera AI Metis AIPU (rev 02) [1f9d:1100]
Kernel driver in use: axl
Kernel modules: metis
Region 0: Memory at 912010000 (64-bit, non-prefetchable) [size=4K]
Region 2: Memory at 910000000 (32-bit, non-prefetchable) [size=32M]
LnkSta: Speed 8GT/s, Width x1 (downgraded)
MSI after loading with `single_msi=1`:
Capabilities: [50] MSI: Enable+ Count=1/32 Maskable+ 64bit+
Address: 0000000048050040 Data: 0000
Previously, with default driver options:
MSI registered 32 (32)
irq vec number 119
Initializing Metis MSI (nmsi=32, max_msi=32)
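Because an MSI is a memory write issued by the endpoint itself, delivery also depends on Bus Master Enable in the PCI Command register, not only on the MSI capability shown above. The following is a minimal sketch for decoding a Command register value. The helper name `decode_pci_command` is ours, and the bit positions are the standard PCI ones (bit 1 Memory Space Enable, bit 2 Bus Master Enable, bit 10 INTx Disable). We have not yet captured this value on the affected system.

```bash
# Decode the PCI Command register bits that matter for MSI delivery.
# $1 = 16-bit register value in hex without prefix, as printed by setpci (e.g. 0406).
decode_pci_command() {
    val=$((0x$1))
    mem=$(( (val >> 1) & 1 ))     # bit 1: Memory Space Enable (BAR access)
    bme=$(( (val >> 2) & 1 ))     # bit 2: Bus Master Enable (needed for MSI writes and DMA)
    intx=$(( (val >> 10) & 1 ))   # bit 10: INTx Disable (normally set when MSI is in use)
    echo "mem_space=$mem bus_master=$bme intx_disable=$intx"
}
```

On the target it could be fed from pciutils, run as root: `decode_pci_command "$(setpci -s 01:00.0 COMMAND)"`. A result with `bus_master=0` would explain why no MSI write ever reaches the host.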
## Current failure
`axdevice -v` reports:
[libaxldev_linux.c:1515] Device communication timed out: device did not respond within 1 seconds. (4294967295)
WARNING: Failed to get valid board type for device metis-0:1:0 got 8
INFO: Ignoring device 0(metis-0:1:0) as it has not returned a valid board type: 8
Device 0: metis-0:1:0 board_type=unknown (not responding)
`axcmd --device metis-0:1:0 --fwver` reports:
[libaxldev_linux.c:1515] Device communication timed out: device did not respond within 1 seconds. (4294967295)
Failed to read firmware version from device.
The generated `axdevice report archive` shows:
Firmware version:
Board type: BoardType.unknown
Board revision: 0
Board DDR: 0
Flashed firmware version:
Board controller firmware version:
Board controller board type:
DRAM size: 18446744073709551615 bytes (17592186044416 MiB)
## Interrupt observations
With default driver options, `/proc/interrupts` showed 32 Metis MSI vectors, all zero.
Before issuing `axcmd --fwver`:
```text
119: 0 0 0 0 0 0 ITS-MSI 524288 Edge msi-metis-0:1:0-0
...
150: 0 0 0 0 0 0 ITS-MSI 524319 Edge msi-metis-0:1:0-31
```
After issuing `axcmd --fwver`, the counters were still zero. The command timed out and dmesg logged:
axl 0000:01:00.0: IRQ MSI timeout (12 1)
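The before/after comparison can be scripted so that it is repeatable across driver options. This is a minimal sketch, not the exact commands we ran: the helper names `sum_irq_counts` and `irq_delta` are ours, and the `msi-metis` pattern is taken from the interrupt names shown above.

```bash
# Sum the per-CPU counters of every interrupt line whose text matches a pattern.
# $1 = file to read (normally /proc/interrupts), $2 = awk regular expression.
sum_irq_counts() {
    awk -v pat="$2" '
        $0 ~ pat {
            # Field 1 is the IRQ number ("119:"); the numeric fields that
            # follow are per-CPU counts. Stop at the first non-numeric field.
            for (i = 2; i <= NF && $i ~ /^[0-9]+$/; i++) total += $i
        }
        END { print total + 0 }
    ' "$1"
}

# Run a command and print how many matching interrupts arrived while it ran.
# $1 = file, $2 = pattern, remaining arguments = command to execute.
irq_delta() {
    file=$1; pat=$2; shift 2
    before=$(sum_irq_counts "$file" "$pat")
    "$@" >/dev/null 2>&1 || true   # the command is expected to time out
    after=$(sum_irq_counts "$file" "$pat")
    echo $((after - before))
}
```

For example, `irq_delta /proc/interrupts msi-metis axcmd --device metis-0:1:0 --fwver` prints the number of Metis MSIs delivered during the command. A result of `0` corresponds to the behavior described here.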
We then tried:
modprobe -r metis
modprobe metis single_msi=1
The same behavior was observed: PCIe enumeration and driver bind work, but `axdevice` and `axcmd` cannot retrieve board information and no command completion interrupt is observed.
We also tried:
modprobe -r metis
modprobe metis single_msi=1 dma_poll=1
No interrupt increment was observed, and board information retrieval still failed.
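To rule out the options being silently ignored, the values the loaded module actually sees can be read back from sysfs. This is a sketch under one assumption: that the `metis` module exports `single_msi` and `dma_poll` as readable parameters under `/sys/module/metis/parameters`, which we have not confirmed. The helper name `show_module_params` is ours.

```bash
# Print "name=value" for every parameter file in a module's sysfs directory.
# $1 = directory (normally /sys/module/metis/parameters; assumed to exist).
show_module_params() {
    for f in "$1"/*; do
        [ -f "$f" ] || continue   # skip if the glob matched nothing
        printf '%s=%s\n' "$(basename "$f")" "$(cat "$f")"
    done
}
```

Running `show_module_params /sys/module/metis/parameters` after each `modprobe` shows which options took effect. Empty output means the module exposes no readable parameters.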
## SMMU / DMA tests already performed
Initially the kernel default domain was translated:
iommu: Default domain type: Translated
We added:
iommu.passthrough=1
through the Toradex `tdxargs` U-Boot variable and verified after reboot:
iommu: Default domain type: Passthrough (set via kernel command line)
This did not change the Axelera failure mode.
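The verification step can be repeated with a small check against the running kernel's command line. This is a minimal sketch; the helper name `has_cmdline_arg` is ours.

```bash
# Succeed if the kernel command line contains exactly the given argument.
# $1 = file to read (normally /proc/cmdline), $2 = argument, e.g. iommu.passthrough=1
has_cmdline_arg() {
    # Split on spaces, then require a whole-line, fixed-string match.
    tr ' ' '\n' < "$1" | grep -qxF -- "$2"
}
```

Usage on the target: `has_cmdline_arg /proc/cmdline iommu.passthrough=1 && echo "passthrough requested"`. The effective domain type is still best confirmed from the kernel log, as in the `iommu: Default domain type` line above.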
## Firmware update history
Before the firmware update, the card was responsive enough for inference after runtime workarounds. The system previously reported approximately:
driver: metis.ko 1.4.16
SDK/runtime: 1.6.1
device package: 1.6.1
flver: 1.3.2
bcver: 1.4
board_type: m2
DRAM: 1 GiB
We ran the official interactive firmware update tool. The update reported success, but after the required restart/cold-cycle attempts, the device now enumerates on PCIe while all Axelera commands time out as described above.
We inspected the lower-level flash scripts and observed that the available update paths still require the normal `axcmd` command/MSI path, for example `--fwload`, `--flashforce`, and `--flashload`. Since `axcmd --fwver` and `axdevice` cannot currently complete, we have not attempted to bypass the updater version checks or force a lower-level flash.
## Other relevant observations
The driver logs:
```text
axl 0000:01:00.0: vmsi not available
```
but still allocates normal MSI vectors. We are unsure whether `vmsi not available` is expected on Verdin i.MX95 or a sign that the Metis/host MSI path is not configured correctly.
We also observed that some Metis debugfs DMA register files read back as `0xffffffff`. Attempting to read the deeper linked-list debugfs paths in that state caused a kernel oops inside the Metis driver's debugfs read path, so we stopped reading those debugfs files.
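Several values in this report are all-ones patterns: `0xffffffff` in debugfs, the error code `4294967295` (2^32 - 1), and the reported DRAM size `18446744073709551615` (2^64 - 1). PCIe reads that fail commonly return all ones, so these may all indicate that the device is not answering reads, although they could also be plain error sentinels in the tools. A small sketch for flagging such values when scanning logs follows; the helper name `is_all_ones` is ours.

```bash
# Succeed if the value is an all-ones 32-bit or 64-bit pattern.
# Accepts hex with a lowercase 0x prefix or plain decimal.
is_all_ones() {
    v=$(printf '%s' "$1" | tr 'A-F' 'a-f')   # normalize hex digits to lowercase
    case "$v" in
        0xffffffff|0xffffffffffffffff) return 0 ;;     # hex forms
        4294967295|18446744073709551615) return 0 ;;   # decimal 2^32-1 and 2^64-1
        *) return 1 ;;
    esac
}
```

For example, `is_all_ones 18446744073709551615 && echo "suspicious"` flags the DRAM size from the report, whereas a plausible 1 GiB value (`1073741824`) is not flagged.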
## Previous runtime issues before firmware update
Before the firmware update, we were debugging two separate runtime issues:
1. DMA heap naming on Toradex:
   - Toradex exposed `/dev/dma_heap/linux,cma` and `/dev/dma_heap/linux,cma-uncached`.
   - SDK expected `/dev/dma_heap/system` or `/dev/dma_heap/reserved`.
   - Temporary aliases allowed runtime allocation to proceed.
2. Per-core clock configuration:
   - SDK attempted per-core clock configuration.
   - Direct global clock command worked:
     ```bash
     axdevice --set-clock 600 -dmetis-0:1:0
     axdevice --set-clock 800 -dmetis-0:1:0
     ```
   - Per-core clock path failed with:
     ```text
     CMD_SET_CLOCK_AICORE_FREQ returned an error code
     ```
   - Mapping SDK clock requests to the global `clock_profile` path allowed inference to run before the firmware update.
These earlier issues may be unrelated to the current post-update board communication failure, but they may be useful context. We had hoped that the firmware update would fix the per-core clock failure.
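For context, the DMA heap aliasing in item 1 can be sketched as follows. The use of symbolic links and the choice of which Toradex heap backs which expected name are assumptions made for illustration only, not a record of the exact workaround. The helper name `alias_heap` is ours.

```bash
# Create alias $2 pointing at an existing heap node $1, unless $2 already exists.
# Assumption: a symlink is sufficient for the SDK to open the heap.
alias_heap() {
    src=$1; dst=$2
    [ -e "$src" ] || { echo "missing: $src" >&2; return 1; }
    [ -e "$dst" ] || ln -s "$src" "$dst"
}
```

A hypothetical invocation, run as root, would be `alias_heap /dev/dma_heap/linux,cma /dev/dma_heap/system`. Links under `/dev` do not survive a reboot unless they are recreated, for example by a udev rule.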
## Questions
- Is `vmsi not available` expected on Toradex Verdin i.MX95 with Metis M.2 and driver 1.4.16?
- Given PCIe enumeration and MSI allocation succeed, but `/proc/interrupts` does not increment during `axcmd --fwver`, what is the recommended next diagnostic step?
- Does the Verdin i.MX95 PCIe/ITS/SMMU configuration require a specific device tree property or kernel config for Metis MSI writes to reach the GIC ITS?
- Is there a supported recovery path when the official interactive firmware update reports success but the device then cannot answer `axcmd --fwver`, `axdevice`, or `axdevice --reload-firmware`?
- Is there a known compatibility issue with early Metis M.2 cards updated from `flver=1.3.2` / `bcver=1.4` to SDK 1.6.1 firmware on ARM64 platforms?
Thank you very much for any support you can give us. We were very careful with the firmware update process, and this is not what we expected after the tool reported a “successful” update.
