Hello, I just got my M.2 card. On my AMD64 desktop workstation, the Metis M.2 AIPU does not survive an S3
(suspend-to-RAM) cycle. After resume, every large DMA transfer to the card
fails permanently. Only a full cold power-cycle restores function; a warm
reboot is not enough. This makes the card unusable on any normal desktop,
where suspend/resume is the standard daily workflow.
I narrowed this down together with Claude Opus 4.8 (Anthropic) over a detailed
debugging session. Full diagnosis below.
## Hardware / Software
- Host: AMD Ryzen, B850 platform, RTX 3090 in primary slot
- OS: CachyOS (Arch-based), kernel 7.0.12 (Clang/LLVM-built), Secure Boot off
- Metis M.2 AIPU, PCIe 0000:0c:00.0 (1f9d:1100 rev 02), Gen3 x4
- Driver: axelera-driver release/v1.6 (module version 1.4.17), built with
make LLVM=1, DKMS-managed
- SDK: Voyager 1.6.1 (axelera-rt / devkit), inference via distrobox/Ubuntu 24.04
## Reproduction
1. Cold boot. Run inference (yolo26s-coco-onnx, fakevideo) → works, ~529 fps.
2. systemctl suspend → resume (verified as a real S3 resume via dmesg marker,
NOT a reboot).
3. Run the same inference again → DMA timeout, crash.
Kernel side:
axl 0000:0c:00.0: DMA RD CH0 timeout (irq 4)
axl 0000:0c:00.0: DMA WR CH0 status -110
Userspace side:
DMABUF_METIS_WAIT failed: Connection timed out
Failed to write module binary to device memory
→ inference aborts.
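For anyone who wants to confirm the same failure on their own box without reading dmesg by hand, here is a minimal Python sketch that picks out the driver DMA errors logged after the most recent resume. The error patterns are taken from the kernel lines quoted above. The marker string `PM: suspend exit` and the function name `errors_after_resume` are my assumptions, not something the driver defines, so adjust the marker to whatever your kernel actually prints at the end of a suspend cycle.

```python
import re
from typing import Iterable, List

# Assumption: the kernel prints this string when a suspend cycle finishes.
RESUME_MARKER = "PM: suspend exit"

# Patterns modeled on the kernel messages quoted in this post.
ERROR_PATTERNS = [
    re.compile(r"axl \S+: DMA (RD|WR) CH\d+ timeout"),
    re.compile(r"axl \S+: DMA (RD|WR) CH\d+ status -\d+"),
]


def errors_after_resume(lines: Iterable[str], marker: str = RESUME_MARKER) -> List[str]:
    """Return DMA error lines that appear after the last resume marker.

    If the marker never appears, return an empty list: without it we
    cannot say the errors happened after a resume.
    """
    lines = list(lines)
    last_marker = -1
    for index, line in enumerate(lines):
        if marker in line:
            last_marker = index
    if last_marker < 0:
        return []
    tail = lines[last_marker + 1:]
    return [ln for ln in tail if any(p.search(ln) for p in ERROR_PATTERNS)]
```

Feed it the lines of a saved dmesg dump (for example `open("dmesg.txt").read().splitlines()`). An empty result right after resume and a non-empty one after the first inference run matches what I see here.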
## Key diagnostic evidence (this is the important part)
After resume, with DMA already dead, I checked the device state:
- power_state = D0 (NOT D3cold)
- runtime PM = active
- PCIe link = full 8GT/s x4
- DevSta clean: no CorrErr / NonFatalErr / FatalErr
- dmesg from the resume marker onward shows NO AER and NO DMA error at resume
itself — only "axl vmsi configured" (driver re-inits MSI).
- No AMD-Vi IO_PAGE_FAULT → IOMMU is not involved.
So after resume the link is up at full speed, ASPM is not the cause, and the
device is not stuck in D3cold. The loss appears to be device-internal: the DMA
engine (and the on-device firmware / model-binary state) is lost across S3, and
the driver's resume only re-initializes MSI; it never re-initializes the DMA
engine or reloads firmware. This is consistent with the fact that only a cold
power-cycle (= firmware reload) recovers the card.
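The PCI-level checks above can be captured in one go with a small Python sketch like this, assuming the standard Linux PCI sysfs attributes (`power_state`, `power/runtime_status`, `current_link_speed`, `current_link_width`). The helper names `pci_snapshot` and `diff_snapshots` are hypothetical, and the default device address is simply the one from my system. Taking one snapshot after a cold boot and one after resume and comparing them is a quick way to show that nothing differs at the PCI level even though DMA is dead.

```python
from pathlib import Path
from typing import Dict, Optional, Tuple

# Standard PCI sysfs attributes, relative to the device directory.
ATTRS = {
    "power_state": "power_state",
    "runtime_status": "power/runtime_status",
    "link_speed": "current_link_speed",
    "link_width": "current_link_width",
}


def pci_snapshot(
    bdf: str = "0000:0c:00.0",            # device address from this post
    root: str = "/sys/bus/pci/devices",
) -> Dict[str, Optional[str]]:
    """Read a few PCI state attributes; missing ones are reported as None."""
    dev = Path(root) / bdf
    snap: Dict[str, Optional[str]] = {}
    for name, rel in ATTRS.items():
        try:
            snap[name] = (dev / rel).read_text().strip()
        except OSError:
            snap[name] = None
    return snap


def diff_snapshots(
    before: Dict[str, Optional[str]],
    after: Dict[str, Optional[str]],
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return only the attributes whose value changed between snapshots."""
    return {
        key: (before.get(key), after.get(key))
        for key in before
        if before.get(key) != after.get(key)
    }
```

On my machine I would expect `diff_snapshots(cold_boot, after_resume)` to come back empty, which is the point: the fault is not visible in PCI power or link state.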
## Things that do NOT fix it (already tested, please save others the time)
- pcie_port_pm=off : no effect (device never enters D3cold anyway)
- pcie_aspm=off : pointless (link is already full speed)
- axdevice --reboot (device self-reset) : does NOT restore DMA
- PCIe remove + rescan as a resume workaround
(echo 1 > /sys/.../remove ; echo 1 > /sys/bus/pci/rescan) :
caused a complete kernel HARD-FREEZE requiring a hard power-off — see
related unload bug below.
## Related driver bug (likely blocks a clean resume workaround)
modprobe -r metis / rmmod metis crashes in the unload path:
RIP: axl_aipu_irq_fn_metis+0x27 [metis], called from cleanup_module
i.e. the IRQ handler fires (card sends an MSI) while module_exit is already
freeing DMA memory → use-after-free. cleanup_module never completes, module
stuck in "going", refcnt -1, only a reboot recovers. The PCIe sysfs "remove"
path hits this same detach code, which is why remove+rescan freezes the box.
Interrupts appear to be torn down too late in the v1.4.17 unload sequence.
## Request
This issue is already acknowledged by Axelera staff on AMD64 and RK3588
("systematic warm-boot issue across multiple host platforms", cold power-cycle
required), and the SDK v1.2.5 release notes (SDK-5176) document the same
cold-power-cycle requirement for RK3588 hosts. But the driver currently has NO
suspend/resume path at all (no .suspend/.resume callbacks in dev_pm_ops).
For anyone using Metis in a desktop/workstation — not a 24/7 edge box —
suspend/resume is mandatory. Please either:
1. implement a proper S3 resume path (re-init DMA engine + reload firmware
on .resume), and
2. fix the unload/detach IRQ ordering so a remove+rescan recovery is at
least possible,
or, at minimum, document prominently that the card requires a cold power-cycle
and is not suitable for systems that suspend.
Happy to provide full dmesg, lspci -vvv, and re-run any diagnostics you need.
As a note, I really wanted to play around with this card, but the suspend issue is a deal breaker for my use case.
Best regards,
Markus, with the help of Opus 4.8
