I have a 4-AIPU Axelera card plugged into a Dell PowerEdge R760 server.
The card fails to enumerate all four Metis devices (lspci shows only two), and the kernel log shows PCIe AER errors related to the Metis devices.
Below is the output of the axdevice command; the problem report is attached as well.
Have you seen this issue before?
Command line logs:
(axelera-env) user@r760-sales-demo:~$ sudo lspci -d 1f9d: -vt
-[0000:0c]---01.0-[0d-12]----00.0--+-02.0-[11]----00.0 Axelera AI Metis AIPU (rev 02)
                                   \-03.0-[12]----00.0 Axelera AI Metis AIPU (rev 02)
(axelera-env) user@r760-sales-demo:~$ cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-6.8.0-111-generic root=UUID=5c7f0039-5e33-4e4b-85e3-9d0215b77b59 ro iommu=pt loglevel=8
(axelera-env) user@r760-sales-demo:~$ axdevice
WARNING: 4PCI device count mismatch: lspci=2, triton=0
ERROR: min() iterable argument is empty
(axelera-env) user@r760-sales-demo:~$ axdevice -v
INFO: Found AIPU driver: metis 208896 0
WARNING: 4PCI device count mismatch: lspci=2, triton=0
INFO: Using device
Traceback (most recent call last):
  File "/home/user/lukasz/axelera-env/bin/axdevice", line 8, in <module>
    sys.exit(entrypoint_main())
             ^^^^^^^^^^^^^^^^^
  File "/home/user/lukasz/axelera-env/lib/python3.12/site-packages/axelera/runtime/axdevice.py", line 1121, in entrypoint_main
    result = args.func(args, extras)
             ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/lukasz/axelera-env/lib/python3.12/site-packages/axelera/runtime/axdevice.py", line 807, in main
    max_cores = min(d.subdevice_count for d in devices)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: min() iterable argument is empty
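
Side note on the traceback: the ValueError looks like a secondary symptom rather than the root problem. With zero devices detected (triton=0), min() is called on an empty sequence. The sketch below reproduces that failure mode and shows one possible guard. The Device class and max_cores function are hypothetical stand-ins for illustration, not the real axelera.runtime code.

```python
from dataclasses import dataclass


@dataclass
class Device:
    """Hypothetical stand-in for a detected device (not the real runtime type)."""
    subdevice_count: int


def max_cores(devices):
    # min() raises ValueError on an empty iterable unless default= is given,
    # so an empty device list is mapped to 0 cores here (an assumed policy).
    return min((d.subdevice_count for d in devices), default=0)


if __name__ == "__main__":
    print(max_cores([]))                      # no devices detected
    print(max_cores([Device(4), Device(2)]))  # smallest core count wins
```
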
(axelera-env) user@r760-sales-demo:~$ axdevice --report
WARNING: 4PCI device count mismatch: lspci=2, triton=0
Wrote report to /home/user/report-2026-05-13_16_29_08.zip
If you create a ticket, please attach this file.

Kernel Hardware Error:
[ +0.204022] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
[ +0.000057] axl 0000:11:00.0: Link down
[ +0.000049] axl 0000:12:00.0: Unregister directory 0000:12:00.0
[ +0.000075] axl 0000:12:00.0: Unregistered metis-0:12:0 (3 3)
[ +0.000002] axl 0000:12:00.0: Release dma mem metis-0:12:0
[ +0.000253] {1}[Hardware Error]: event severity: recoverable
[ +0.000306] {1}[Hardware Error]: Error 0, type: fatal
[ +0.000298] {1}[Hardware Error]: section_type: PCIe error
[ +0.000289] {1}[Hardware Error]: port_type: 4, root port
[ +0.000211] axl 0000:11:00.0: Unregister directory 0000:11:00.0
[ +0.000078] {1}[Hardware Error]: version: 3.0
[ +0.000038] axl 0000:11:00.0: Unregistered metis-0:11:0 (2 2)
[ +0.000254] {1}[Hardware Error]: command: 0x0547, status: 0x4010
[ +0.000002] {1}[Hardware Error]: device_id: 0000:0c:01.0
[ +0.000001] {1}[Hardware Error]: slot: 0
[ +0.000001] {1}[Hardware Error]: secondary_bus: 0x0d
[ +0.000000] {1}[Hardware Error]: vendor_id: 0x8086, device_id: 0x352a
[ +0.000001] {1}[Hardware Error]: class_code: 060400
[ +0.000193] axl 0000:11:00.0: Release dma mem metis-0:11:0
[ +0.000284] {1}[Hardware Error]: bridge: secondary_status: 0x6000, control: 0x0003
[ +0.001420] {1}[Hardware Error]: aer_uncor_status: 0x00004000, aer_uncor_mask: 0x03310000
[ +0.000278] {1}[Hardware Error]: aer_uncor_severity: 0x044ef030
[ +0.000269] {1}[Hardware Error]: TLP Header: 05880001 fc005e03 12000044 00000000
[ +0.000270] {1}[Hardware Error]: Error 1, type: fatal
[ +0.000265] {1}[Hardware Error]: section_type: PCIe error
[ +0.000260] {1}[Hardware Error]: port_type: 6, downstream switch port
[ +0.000261] {1}[Hardware Error]: version: 3.0
[ +0.000257] {1}[Hardware Error]: command: 0x0407, status: 0x0010
[ +0.000258] {1}[Hardware Error]: device_id: 0000:0e:00.0
[ +0.000254] {1}[Hardware Error]: slot: 1
[ +0.000248] {1}[Hardware Error]: secondary_bus: 0x0f
[ +0.000247] {1}[Hardware Error]: vendor_id: 0x11f8, device_id: 0x8562
[ +0.000248] {1}[Hardware Error]: class_code: 060400
[ +0.000244] {1}[Hardware Error]: bridge: secondary_status: 0x0000, control: 0x0003
[ +0.000246] {1}[Hardware Error]: aer_uncor_status: 0x00000020, aer_uncor_mask: 0x01a10000
[ +0.000248] {1}[Hardware Error]: aer_uncor_severity: 0x044ef030
[ +0.000249] {1}[Hardware Error]: TLP Header: 00000000 00000000 00000000 00000000
[ +0.000250] {1}[Hardware Error]: Error 2, type: fatal
[ +0.000247] {1}[Hardware Error]: section_type: PCIe error
[ +0.000241] {1}[Hardware Error]: port_type: 6, downstream switch port
[ +0.000239] {1}[Hardware Error]: version: 3.0
[ +0.000231] {1}[Hardware Error]: command: 0x0407, status: 0x0010
[ +0.000228] {1}[Hardware Error]: device_id: 0000:0e:01.0
[ +0.000225] {1}[Hardware Error]: slot: 2
[ +0.000218] {1}[Hardware Error]: secondary_bus: 0x10
[ +0.000217] {1}[Hardware Error]: vendor_id: 0x11f8, device_id: 0x8562
[ +0.000215] {1}[Hardware Error]: class_code: 060400
[ +0.000212] {1}[Hardware Error]: bridge: secondary_status: 0x0000, control: 0x0003
[ +0.000216] {1}[Hardware Error]: aer_uncor_status: 0x00000020, aer_uncor_mask: 0x01a10000
[ +0.000216] {1}[Hardware Error]: aer_uncor_severity: 0x044ef030
[ +0.000213] {1}[Hardware Error]: TLP Header: 00000000 00000000 00000000 00000000
[ +0.000211] {1}[Hardware Error]: Error 3, type: fatal
[ +0.000209] {1}[Hardware Error]: section_type: PCIe error
[ +0.000206] {1}[Hardware Error]: port_type: 6, downstream switch port
[ +0.000213] {1}[Hardware Error]: version: 3.0
[ +0.000200] {1}[Hardware Error]: command: 0x0407, status: 0x0010
[ +0.000200] {1}[Hardware Error]: device_id: 0000:0e:02.0
[ +0.000199] {1}[Hardware Error]: slot: 3
[ +0.000198] {1}[Hardware Error]: secondary_bus: 0x11
[ +0.000200] {1}[Hardware Error]: vendor_id: 0x11f8, device_id: 0x8562
[ +0.000201] {1}[Hardware Error]: class_code: 060400
[ +0.000200] {1}[Hardware Error]: bridge: secondary_status: 0x0000, control: 0x0003
[ +0.000204] {1}[Hardware Error]: aer_uncor_status: 0x00000020, aer_uncor_mask: 0x01a10000
[ +0.000208] {1}[Hardware Error]: aer_uncor_severity: 0x044ef030
[ +0.000209] {1}[Hardware Error]: TLP Header: 00000000 00000000 00000000 00000000
[ +0.000212] {1}[Hardware Error]: Error 4, type: fatal
[ +0.000225] {1}[Hardware Error]: section_type: PCIe error
[ +0.000208] {1}[Hardware Error]: port_type: 6, downstream switch port
[ +0.000208] {1}[Hardware Error]: version: 3.0
[ +0.000205] {1}[Hardware Error]: command: 0x0407, status: 0x0010
[ +0.000205] {1}[Hardware Error]: device_id: 0000:0e:03.0
[ +0.000458] {1}[Hardware Error]: slot: 4
[ +0.000201] {1}[Hardware Error]: secondary_bus: 0x12
[ +0.000203] {1}[Hardware Error]: vendor_id: 0x11f8, device_id: 0x8562
[ +0.000202] {1}[Hardware Error]: class_code: 060400
[ +0.000202] {1}[Hardware Error]: bridge: secondary_status: 0x0000, control: 0x0003
[ +0.000205] {1}[Hardware Error]: aer_uncor_status: 0x00000020, aer_uncor_mask: 0x01a10000
[ +0.000210] {1}[Hardware Error]: aer_uncor_severity: 0x044ef030
[ +0.000210] {1}[Hardware Error]: TLP Header: 00000000 00000000 00000000 00000000
[ +0.000249] pcieport 0000:0c:01.0: AER: aer_status: 0x00004000, aer_mask: 0x03310000
[ +0.000160] pcieport 0000:0c:01.0: [14] CmpltTO (First)
[ +0.000157] pcieport 0000:0c:01.0: AER: aer_layer=Transaction Layer, aer_agent=Requester ID
[ +0.000158] pcieport 0000:0c:01.0: AER: aer_uncor_severity: 0x044ef030
[ +0.066062] pci 0000:11:00.0: AER: can't recover (no error_detected callback)
[ +0.000011] pci 0000:12:00.0: AER: can't recover (no error_detected callback)
[ +0.000002] pci 0000:0d:00.1: AER: can't recover (no error_detected callback)
[ +0.135027] pcieport 0000:0c:01.0: AER: Root Port link has been reset (0)
[ +0.000108] pcieport 0000:0c:01.0: AER: device recovery failed
[ +0.000020] pcieport 0000:0e:00.0: AER: aer_status: 0x00000020, aer_mask: 0x01a10000
[ +0.000164] axl 0000:10:00.0: Unregister directory 0000:10:00.0
[ +0.000418] pcieport 0000:0e:00.0: [ 5] SDES (First)
[ +0.000079] axl 0000:10:00.0: Unregistered metis-0:10:0 (1 1)
[ +0.000615] axl 0000:10:00.0: Release dma mem metis-0:10:0
[ +0.000004] pcieport 0000:0e:00.0: AER: aer_layer=Transaction Layer, aer_agent=Receiver ID
[ +0.000534] pcieport 0000:0e:00.0: AER: aer_uncor_severity: 0x044ef030
[ +0.000796] axl 0000:0f:00.0: axl_io_error_detected : Request a slot reset 00000000252f50e9:000000000c4eaacf
[ +0.381261] pcieport 0000:0e:00.0: AER: Downstream Port link has been reset (0)
[ +0.000270] axl 0000:0f:00.0: Unregister directory 0000:0f:00.0
[ +0.000351] axl 0000:0f:00.0: Unregistered metis-0:f:0 (0 0)
[ +0.000020] axl 0000:0f:00.0: Release dma mem metis-0:f:0
[ +0.002383] ------------[ cut here ]------------
[ +0.000003] axl 0000:0f:00.0: disabling already-disabled device
[ +0.000012] WARNING: CPU: 62 PID: 2898 at drivers/pci/pci.c:2387 pci_disable_device+0xc4/0xf0
[ +0.000010] Modules linked in: overlay qrtr cfg80211 sunrpc binfmt_misc xfs nls_iso8859_1 intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common intel_ifs i10nm_edac skx_edac_common nfit x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel dell_wmi cmdlinepart video kvm dax_hmem ledtrig_audio cxl_acpi iaa_crypto spi_nor sparse_keymap ipmi_ssif cxl_port dell_smbios pmt_telemetry irqbypass mtd pmt_class intel_sdsi rapl dcdbas intel_cstate wmi_bmof dell_wmi_descriptor cxl_core i2c_i801 isst_if_mbox_pci mgag200 idxd mei_me isst_if_mmio spi_intel_pci i2c_algo_bit metis(OE) idxd_bus intel_vsec switchtec isst_if_common mei spi_intel i2c_ismt i2c_smbus ipmi_si acpi_power_meter acpi_ipmi ipmi_devintf ipmi_msghandler input_leds joydev mac_hid sch_fq_codel dm_multipath msr parport_pc ppdev lp parport efi_pstore nfnetlink dmi_sysfs ip_tables x_tables autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0
[ +0.000086] hid_generic usbhid hid crct10dif_pclmul crc32_pclmul polyval_clmulni polyval_generic ghash_clmulni_intel nvme sha256_ssse3 sha1_ssse3 megaraid_sas nvme_core ahci nvme_auth tg3 xhci_pci libahci xhci_pci_renesas wmi pinctrl_emmitsburg aesni_intel crypto_simd cryptd
[ +0.000023] CPU: 62 PID: 2898 Comm: axelera-multi-d Tainted: G OE 6.8.0-111-generic #111-Ubuntu
[ +0.000004] Hardware name: Dell Inc. PowerEdge R760/05H0JD, BIOS 2.6.3 03/26/2025
[ +0.000002] RIP: 0010:pci_disable_device+0xc4/0xf0
[ +0.000004] Code: 4d 85 e4 75 07 4c 8b a3 c8 00 00 00 48 8d bb c8 00 00 00 e8 5e e6 21 00 4c 89 e2 48 c7 c7 b8 1a 4c 9a 48 89 c6 e8 3c d9 77 ff <0f> 0b e9 57 ff ff ff 48 89 df e8 8d fe ff ff 80 a3 51 08 00 00 df
[ +0.000004] RSP: 0018:ff5adcfec77f3ba8 EFLAGS: 00010246
[ +0.000003] RAX: 0000000000000000 RBX: ff2b2e6dc86b8000 RCX: 0000000000000000
[ +0.000002] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[ +0.000002] RBP: ff5adcfec77f3bb8 R08: 0000000000000000 R09: 0000000000000000
[ +0.000001] R10: 0000000000000000 R11: 0000000000000000 R12: ff2b2e6dc7b20fb0
[ +0.000002] R13: 0000000000000000 R14: ff2b2e6dc86b8148 R15: 0000000000000080
[ +0.000001] FS: 0000795150e8c740(0000) GS:ff2b2e6da0180000(0000) knlGS:0000000000000000
[ +0.000003] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ +0.000002] CR2: 0000795150e04650 CR3: 0000000106c72006 CR4: 0000000000f71ef0
[ +0.000002] PKRU: 55555554
[ +0.000002] Call Trace:
[ +0.000002] <TASK>
[ +0.000005] axl_aipu_remove+0xf6/0x110 [metis]
[ +0.000011] pci_device_remove+0x3e/0xb0
[ +0.000005] device_remove+0x40/0x80
[ +0.000004] device_release_driver_internal+0x20b/0x270
[ +0.000004] device_release_driver+0x12/0x20
[ +0.000003] pci_stop_bus_device+0x7a/0xb0
[ +0.000004] pci_stop_bus_device+0x30/0xb0
[ +0.000003] pci_stop_bus_device+0x41/0xb0
[ +0.000002] pci_stop_bus_device+0x41/0xb0
[ +0.000003] pci_stop_and_remove_bus_device_locked+0x1a/0x40
[ +0.000004] remove_store+0x8f/0xa0
[ +0.000005] dev_attr_store+0x14/0x40
[ +0.000004] sysfs_kf_write+0x3b/0x60
[ +0.000006] kernfs_fop_write_iter+0x163/0x1f0
[ +0.000005] vfs_write+0x2a1/0x470
[ +0.000005] ksys_write+0x73/0x100
[ +0.000004] __x64_sys_write+0x19/0x30
[ +0.000003] x64_sys_call+0x7e/0x25a0
[ +0.000004] do_syscall_64+0x7f/0x180
[ +0.000004] ? arch_exit_to_user_mode_prepare.isra.0+0x1a/0xe0
[ +0.000005] ? irqentry_exit_to_user_mode+0x38/0x1e0
[ +0.000005] ? irqentry_exit+0x43/0x50
[ +0.000004] ? exc_page_fault+0x94/0x1b0
[ +0.000003] entry_SYSCALL_64_after_hwframe+0x78/0x80
[ +0.000007] RIP: 0033:0x795150d1c5a4
[ +0.000032] Code: c7 00 16 00 00 00 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 80 3d a5 ea 0e 00 00 74 13 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 55 48 89 e5 48 83 ec 20 48 89
[ +0.000002] RSP: 002b:00007fffd768d748 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
[ +0.000003] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 0000795150d1c5a4
[ +0.000002] RDX: 0000000000000002 RSI: 00005d8f2ecac450 RDI: 0000000000000001
[ +0.000001] RBP: 00007fffd768d770 R08: 0000000000000073 R09: 0000000000000000
[ +0.000002] R10: 00000000ffffffff R11: 0000000000000202 R12: 0000000000000002
[ +0.000001] R13: 00005d8f2ecac450 R14: 0000795150e045c0 R15: 0000795150e01ee0
[ +0.000004] </TASK>
[ +0.000001] ---[ end trace 0000000000000000 ]---
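
To make the aer_uncor_status values above easier to read, here is a small sketch that decodes them using the standard bit positions of the PCIe AER Uncorrectable Error Status register. The function name decode_uncor and the table layout are my own; only the bit assignments come from the PCIe specification. It is a reading aid under those assumptions, not a diagnostic tool.

```python
# Bit positions in the PCIe AER Uncorrectable Error Status register.
# Only a subset of defined bits is listed; others are reported by number.
UNCOR_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request",
}


def decode_uncor(status, mask=0):
    """Return the names of the error bits set in status.

    Bits that are also set in mask are tagged as masked.
    """
    names = []
    for bit in range(32):
        if status & (1 << bit):
            name = UNCOR_BITS.get(bit, f"bit {bit} (not in table)")
            if mask & (1 << bit):
                name += " (masked)"
            names.append(name)
    return names


if __name__ == "__main__":
    # Values copied from the log: root port 0000:0c:01.0 and the
    # downstream switch ports 0000:0e:0x.0.
    print(decode_uncor(0x00004000, 0x03310000))
    print(decode_uncor(0x00000020, 0x01A10000))
```

This matches what the kernel prints further up: CmpltTO (bit 14) on the root port and SDES (bit 5) on the downstream switch ports.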

