S3 suspend/resume → permanent DMA timeout on Metis M.2 (AMD64 desktop) — driver has no resume path. Please fix or document.

  • June 14, 2026

Hello, I just got my Metis M.2 card. On my AMD64 desktop workstation, the
AIPU does not survive an S3 (suspend-to-RAM) cycle. After resume, every large
DMA transfer to the card fails permanently. Only a full cold power-cycle
restores function; a warm reboot is not enough. This makes the card unusable
on any normal desktop, where suspend/resume is the standard daily workflow.

I narrowed this down together with Claude Opus 4.8 (Anthropic) over a detailed
debugging session. Full diagnosis below.

## Hardware / Software

- Host: AMD Ryzen, B850 platform, RTX 3090 in primary slot
- OS: CachyOS (Arch-based), kernel 7.0.12 (Clang/LLVM-built), Secure Boot off
- Metis M.2 AIPU, PCIe 0000:0c:00.0 (1f9d:1100 rev 02), Gen3 x4
- Driver: axelera-driver release/v1.6 (module version 1.4.17), built with
  make LLVM=1, DKMS-managed
- SDK: Voyager 1.6.1 (axelera-rt / devkit), inference via distrobox/Ubuntu 24.04

## Reproduction

1. Cold boot. Run inference (yolo26s-coco-onnx, fakevideo) → works, ~529 fps.
2. systemctl suspend → resume (verified as a real S3 resume via dmesg marker,
   NOT a reboot).
3. Run the same inference again → DMA timeout, crash.

Kernel side:
  axl 0000:0c:00.0: DMA RD CH0 timeout (irq 4)
  axl 0000:0c:00.0: DMA WR CH0 status -110
Userspace side:
  DMABUF_METIS_WAIT failed: Connection timed out
  Failed to write module binary to device memory
  → inference aborts.
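
To make step 3 easier to confirm on other machines, here is a minimal Python sketch that filters a dmesg dump for the two kernel messages quoted above, looking only at lines after the last resume. The marker string `PM: suspend exit` is an assumption about what the kernel prints on resume (adjust it if your log shows something else), and the function name is hypothetical, not part of any Axelera tool.

```python
import re

# Assumption: the kernel logs this line when it returns from suspend.
RESUME_MARKER = "PM: suspend exit"

# Built from the two kernel messages quoted in this post.
DMA_ERROR = re.compile(r"axl \S+: DMA (RD|WR) CH\d+ (timeout|status -\d+)")


def dma_errors_after_resume(dmesg_text, marker=RESUME_MARKER):
    """Return the DMA error lines logged after the last resume marker.

    Returns None if the marker never appears, i.e. the log contains no resume.
    """
    lines = dmesg_text.splitlines()
    last_resume = None
    for index, line in enumerate(lines):
        if marker in line:
            last_resume = index
    if last_resume is None:
        return None
    return [line for line in lines[last_resume + 1:] if DMA_ERROR.search(line)]
```

Feed it the output of `dmesg` (for example captured with `subprocess.run(["dmesg"], capture_output=True, text=True).stdout`). An empty list right after resume followed by a non-empty list after the first inference run matches what I see here.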

## Key diagnostic evidence (this is the important part)

After resume, with DMA already dead, I checked the device state:

- power_state = D0 (NOT D3cold)
- runtime PM = active
- PCIe link = full 8GT/s x4
- DevSta clean: no CorrErr / NonFatalErr / FatalErr
- dmesg from the resume marker onward shows NO AER and NO DMA error at resume
  itself — only "axl vmsi configured" (driver re-inits MSI).
- No AMD-Vi IO_PAGE_FAULT → IOMMU is not involved.
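
The bus-level part of the checks above can be collected in one go from sysfs. This is a rough sketch, not an official tool: it reads the standard PCI sysfs attributes for the device address from this report, and the helper names are made up for illustration. It assumes a reasonably recent kernel that exposes `power_state` and reports the link speed as a string starting with `8` for 8 GT/s. DevSta and AER bits are not covered here; those still need `lspci -vvv`.

```python
from pathlib import Path

# Standard PCI sysfs attributes, relative to the device directory.
ATTRS = {
    "power_state": "power_state",
    "runtime_status": "power/runtime_status",
    "link_speed": "current_link_speed",
    "link_width": "current_link_width",
}


def pci_snapshot(bdf="0000:0c:00.0", sysfs_root="/sys/bus/pci/devices"):
    """Read power and link attributes for one PCI device.

    Attributes that cannot be read are reported as None instead of raising.
    """
    device_dir = Path(sysfs_root) / bdf
    snapshot = {}
    for key, relative_path in ATTRS.items():
        try:
            snapshot[key] = (device_dir / relative_path).read_text().strip()
        except OSError:
            snapshot[key] = None
    return snapshot


def bus_level_ok(snapshot):
    """True if the state matches what I observed after resume: D0, active, Gen3 x4."""
    speed = snapshot["link_speed"]
    return (
        snapshot["power_state"] == "D0"
        and snapshot["runtime_status"] == "active"
        and speed is not None
        and speed.startswith("8")  # assumption: 8 GT/s is printed as "8..." 
        and snapshot["link_width"] == "4"
    )
```

On my machine `bus_level_ok(pci_snapshot())` is true both before suspend and after resume, which is the point: everything the host can see looks healthy while DMA is dead.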

So the link never drops, ASPM is not the cause, and the device does not fall
into D3cold. The loss appears to be device-internal: the DMA engine (and the
on-device firmware / model-binary state) is lost across S3, and the only driver
activity visible at resume is the MSI re-initialization. Nothing re-initializes
the DMA engine or reloads firmware. This is consistent with the fact that only
a cold power-cycle (= firmware reload) recovers the card.

## Things that do NOT fix it (already tested, please save others the time)

- pcie_port_pm=off : no effect (device never enters D3cold anyway)
- pcie_aspm=off : pointless (link is already full speed)
- axdevice --reboot (device self-reset) : does NOT restore DMA
- PCIe remove + rescan as a resume workaround
  (echo 1 > /sys/.../remove ; echo 1 > /sys/bus/pci/rescan) :
  caused a complete kernel HARD-FREEZE requiring a hard power-off — see
  related unload bug below.

## Related driver bug (likely blocks a clean resume workaround)

modprobe -r metis / rmmod metis crashes in the unload path:
  RIP: axl_aipu_irq_fn_metis+0x27 [metis], called from cleanup_module
i.e. the IRQ handler fires (card sends an MSI) while module_exit is already
freeing DMA memory → use-after-free. cleanup_module never completes, module
stuck in "going", refcnt -1, only a reboot recovers. The PCIe sysfs "remove"
path hits this same detach code, which is why remove+rescan freezes the box.
Interrupts appear to be torn down too late in the v1.4.17 unload sequence.
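
If you want to check whether your box is in the same stuck state before it locks up further, the module state can be read from /proc/modules. A small sketch, assuming the usual line layout (name, size, refcount, dependencies, state, address); the kernel reports a module that is "going" as `Unloading` there. The function names are hypothetical.

```python
def module_state(name, proc_modules_text):
    """Return (refcount, state) for a module from /proc/modules text, or None.

    Expected line layout: name size refcount dependencies state address [taint].
    The refcount is None when the kernel prints "-" instead of a number.
    """
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if len(fields) >= 5 and fields[0] == name:
            try:
                refcount = int(fields[2])
            except ValueError:
                refcount = None
            return refcount, fields[4]
    return None


def is_stuck_unloading(name, proc_modules_text):
    """True if the module is listed and still marked as Unloading."""
    info = module_state(name, proc_modules_text)
    return info is not None and info[1] == "Unloading"
```

Call it with `open("/proc/modules").read()` and the module name `metis`. If it still reports `Unloading` well after the rmmod, you are in the state described above and only a reboot helps.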

## Request

This issue is already acknowledged by Axelera staff on AMD64 and RK3588
("systematic warm-boot issue across multiple host platforms", cold power-cycle
required), and the SDK v1.2.5 release notes (SDK-5176) document the same
cold-power-cycle requirement for RK3588 hosts. But there is currently NO
suspend/resume (.suspend/.resume dev_pm_ops) path in the driver at all.

For anyone using Metis in a desktop/workstation — not a 24/7 edge box —
suspend/resume is mandatory. Please either:
  1. implement a proper S3 resume path (re-init DMA engine + reload firmware
     on .resume), and
  2. fix the unload/detach IRQ ordering so a remove+rescan recovery is at
     least possible,
or, at minimum, document prominently that the card requires a cold power-cycle
and is not suitable for systems that suspend.

Happy to provide full dmesg, lspci -vvv, and re-run any diagnostics you need.

 

As a note, I really wanted to play around with this card, but the suspend issue is a deal breaker for my use case.

Best regards,

Markus, with the help of Claude Opus 4.8