Question

AER Hardware Error, enumeration issues - 4 AIPU Metis board

  • May 13, 2026
  • 3 replies
  • 19 views

I have a 4-AIPU Axelera Metis card plugged into a Dell PowerEdge R760 server.

Currently the card fails to enumerate all the Metis devices, and I see PCIe AER errors related to the Metis devices in the logs.

Below is the output of the axdevice command. I have attached the problem report as well.

Have you seen this issue before?


Command line logs:

(axelera-env) user@r760-sales-demo:~$ sudo lspci -d 1f9d: -vt
-[0000:0c]---01.0-[0d-12]----00.0--+-02.0-[11]----00.0 Axelera AI Metis AIPU (rev 02)
\-03.0-[12]----00.0 Axelera AI Metis AIPU (rev 02)
(axelera-env) user@r760-sales-demo:~$ cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-6.8.0-111-generic root=UUID=5c7f0039-5e33-4e4b-85e3-9d0215b77b59 ro iommu=pt loglevel=8
(axelera-env) user@r760-sales-demo:~$ axdevice
WARNING: 4PCI device count mismatch: lspci=2, triton=0
ERROR: min() iterable argument is empty
(axelera-env) user@r760-sales-demo:~$ axdevice -v
INFO: Found AIPU driver: metis 208896 0
WARNING: 4PCI device count mismatch: lspci=2, triton=0
INFO: Using device
Traceback (most recent call last):
File "/home/user/lukasz/axelera-env/bin/axdevice", line 8, in <module>
sys.exit(entrypoint_main())
^^^^^^^^^^^^^^^^^
File "/home/user/lukasz/axelera-env/lib/python3.12/site-packages/axelera/runtime/axdevice.py", line 1121, in entrypoint_main
result = args.func(args, extras)
^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/lukasz/axelera-env/lib/python3.12/site-packages/axelera/runtime/axdevice.py", line 807, in main
max_cores = min(d.subdevice_count for d in devices)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: min() iterable argument is empty
(axelera-env) user@r760-sales-demo:~$ axdevice --report
WARNING: 4PCI device count mismatch: lspci=2, triton=0


Wrote report to /home/user/report-2026-05-13_16_29_08.zip
If you create a ticket, please attach this file.
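
The `ValueError` in the traceback above is what Python's built-in `min()` raises when it is given an empty iterable, which is consistent with the tool seeing zero usable devices (`triton=0`). A minimal sketch of that failure mode and one way to guard against it follows. The `Device` class, the `max_common_cores` helper, and the core counts are hypothetical stand-ins for illustration, not the Axelera runtime's actual code.

```python
from dataclasses import dataclass


@dataclass
class Device:
    # Hypothetical stand-in for the per-device object the runtime iterates over.
    name: str
    subdevice_count: int


def max_common_cores(devices):
    """Return the smallest subdevice_count across devices, or 0 if there are none."""
    # Without default=..., min() raises ValueError on an empty iterable.
    return min((d.subdevice_count for d in devices), default=0)


# No devices enumerated: returns 0 instead of raising.
print(max_common_cores([]))

# Two devices with made-up core counts: returns the smaller one.
print(max_common_cores([Device("metis-0:f:0", 4), Device("metis-0:10:0", 2)]))
```

The guard only changes how the empty case is reported; it does not address why no devices were found.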

Kernel Hardware Error:
 

[  +0.204022] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
[ +0.000057] axl 0000:11:00.0: Link down
[ +0.000049] axl 0000:12:00.0: Unregister directory 0000:12:00.0
[ +0.000075] axl 0000:12:00.0: Unregistered metis-0:12:0 (3 3)
[ +0.000002] axl 0000:12:00.0: Release dma mem metis-0:12:0
[ +0.000253] {1}[Hardware Error]: event severity: recoverable
[ +0.000306] {1}[Hardware Error]: Error 0, type: fatal
[ +0.000298] {1}[Hardware Error]: section_type: PCIe error
[ +0.000289] {1}[Hardware Error]: port_type: 4, root port
[ +0.000211] axl 0000:11:00.0: Unregister directory 0000:11:00.0
[ +0.000078] {1}[Hardware Error]: version: 3.0
[ +0.000038] axl 0000:11:00.0: Unregistered metis-0:11:0 (2 2)
[ +0.000254] {1}[Hardware Error]: command: 0x0547, status: 0x4010
[ +0.000002] {1}[Hardware Error]: device_id: 0000:0c:01.0
[ +0.000001] {1}[Hardware Error]: slot: 0
[ +0.000001] {1}[Hardware Error]: secondary_bus: 0x0d
[ +0.000000] {1}[Hardware Error]: vendor_id: 0x8086, device_id: 0x352a
[ +0.000001] {1}[Hardware Error]: class_code: 060400
[ +0.000193] axl 0000:11:00.0: Release dma mem metis-0:11:0
[ +0.000284] {1}[Hardware Error]: bridge: secondary_status: 0x6000, control: 0x0003
[ +0.001420] {1}[Hardware Error]: aer_uncor_status: 0x00004000, aer_uncor_mask: 0x03310000
[ +0.000278] {1}[Hardware Error]: aer_uncor_severity: 0x044ef030
[ +0.000269] {1}[Hardware Error]: TLP Header: 05880001 fc005e03 12000044 00000000
[ +0.000270] {1}[Hardware Error]: Error 1, type: fatal
[ +0.000265] {1}[Hardware Error]: section_type: PCIe error
[ +0.000260] {1}[Hardware Error]: port_type: 6, downstream switch port
[ +0.000261] {1}[Hardware Error]: version: 3.0
[ +0.000257] {1}[Hardware Error]: command: 0x0407, status: 0x0010
[ +0.000258] {1}[Hardware Error]: device_id: 0000:0e:00.0
[ +0.000254] {1}[Hardware Error]: slot: 1
[ +0.000248] {1}[Hardware Error]: secondary_bus: 0x0f
[ +0.000247] {1}[Hardware Error]: vendor_id: 0x11f8, device_id: 0x8562
[ +0.000248] {1}[Hardware Error]: class_code: 060400
[ +0.000244] {1}[Hardware Error]: bridge: secondary_status: 0x0000, control: 0x0003
[ +0.000246] {1}[Hardware Error]: aer_uncor_status: 0x00000020, aer_uncor_mask: 0x01a10000
[ +0.000248] {1}[Hardware Error]: aer_uncor_severity: 0x044ef030
[ +0.000249] {1}[Hardware Error]: TLP Header: 00000000 00000000 00000000 00000000
[ +0.000250] {1}[Hardware Error]: Error 2, type: fatal
[ +0.000247] {1}[Hardware Error]: section_type: PCIe error
[ +0.000241] {1}[Hardware Error]: port_type: 6, downstream switch port
[ +0.000239] {1}[Hardware Error]: version: 3.0
[ +0.000231] {1}[Hardware Error]: command: 0x0407, status: 0x0010
[ +0.000228] {1}[Hardware Error]: device_id: 0000:0e:01.0
[ +0.000225] {1}[Hardware Error]: slot: 2
[ +0.000218] {1}[Hardware Error]: secondary_bus: 0x10
[ +0.000217] {1}[Hardware Error]: vendor_id: 0x11f8, device_id: 0x8562
[ +0.000215] {1}[Hardware Error]: class_code: 060400
[ +0.000212] {1}[Hardware Error]: bridge: secondary_status: 0x0000, control: 0x0003
[ +0.000216] {1}[Hardware Error]: aer_uncor_status: 0x00000020, aer_uncor_mask: 0x01a10000
[ +0.000216] {1}[Hardware Error]: aer_uncor_severity: 0x044ef030
[ +0.000213] {1}[Hardware Error]: TLP Header: 00000000 00000000 00000000 00000000
[ +0.000211] {1}[Hardware Error]: Error 3, type: fatal
[ +0.000209] {1}[Hardware Error]: section_type: PCIe error
[ +0.000206] {1}[Hardware Error]: port_type: 6, downstream switch port
[ +0.000213] {1}[Hardware Error]: version: 3.0
[ +0.000200] {1}[Hardware Error]: command: 0x0407, status: 0x0010
[ +0.000200] {1}[Hardware Error]: device_id: 0000:0e:02.0
[ +0.000199] {1}[Hardware Error]: slot: 3
[ +0.000198] {1}[Hardware Error]: secondary_bus: 0x11
[ +0.000200] {1}[Hardware Error]: vendor_id: 0x11f8, device_id: 0x8562
[ +0.000201] {1}[Hardware Error]: class_code: 060400
[ +0.000200] {1}[Hardware Error]: bridge: secondary_status: 0x0000, control: 0x0003
[ +0.000204] {1}[Hardware Error]: aer_uncor_status: 0x00000020, aer_uncor_mask: 0x01a10000
[ +0.000208] {1}[Hardware Error]: aer_uncor_severity: 0x044ef030
[ +0.000209] {1}[Hardware Error]: TLP Header: 00000000 00000000 00000000 00000000
[ +0.000212] {1}[Hardware Error]: Error 4, type: fatal
[ +0.000225] {1}[Hardware Error]: section_type: PCIe error
[ +0.000208] {1}[Hardware Error]: port_type: 6, downstream switch port
[ +0.000208] {1}[Hardware Error]: version: 3.0
[ +0.000205] {1}[Hardware Error]: command: 0x0407, status: 0x0010
[ +0.000205] {1}[Hardware Error]: device_id: 0000:0e:03.0
[ +0.000458] {1}[Hardware Error]: slot: 4
[ +0.000201] {1}[Hardware Error]: secondary_bus: 0x12
[ +0.000203] {1}[Hardware Error]: vendor_id: 0x11f8, device_id: 0x8562
[ +0.000202] {1}[Hardware Error]: class_code: 060400
[ +0.000202] {1}[Hardware Error]: bridge: secondary_status: 0x0000, control: 0x0003
[ +0.000205] {1}[Hardware Error]: aer_uncor_status: 0x00000020, aer_uncor_mask: 0x01a10000
[ +0.000210] {1}[Hardware Error]: aer_uncor_severity: 0x044ef030
[ +0.000210] {1}[Hardware Error]: TLP Header: 00000000 00000000 00000000 00000000
[ +0.000249] pcieport 0000:0c:01.0: AER: aer_status: 0x00004000, aer_mask: 0x03310000
[ +0.000160] pcieport 0000:0c:01.0: [14] CmpltTO (First)
[ +0.000157] pcieport 0000:0c:01.0: AER: aer_layer=Transaction Layer, aer_agent=Requester ID
[ +0.000158] pcieport 0000:0c:01.0: AER: aer_uncor_severity: 0x044ef030
[ +0.066062] pci 0000:11:00.0: AER: can't recover (no error_detected callback)
[ +0.000011] pci 0000:12:00.0: AER: can't recover (no error_detected callback)
[ +0.000002] pci 0000:0d:00.1: AER: can't recover (no error_detected callback)
[ +0.135027] pcieport 0000:0c:01.0: AER: Root Port link has been reset (0)
[ +0.000108] pcieport 0000:0c:01.0: AER: device recovery failed
[ +0.000020] pcieport 0000:0e:00.0: AER: aer_status: 0x00000020, aer_mask: 0x01a10000
[ +0.000164] axl 0000:10:00.0: Unregister directory 0000:10:00.0
[ +0.000418] pcieport 0000:0e:00.0: [ 5] SDES (First)
[ +0.000079] axl 0000:10:00.0: Unregistered metis-0:10:0 (1 1)
[ +0.000615] axl 0000:10:00.0: Release dma mem metis-0:10:0
[ +0.000004] pcieport 0000:0e:00.0: AER: aer_layer=Transaction Layer, aer_agent=Receiver ID
[ +0.000534] pcieport 0000:0e:00.0: AER: aer_uncor_severity: 0x044ef030
[ +0.000796] axl 0000:0f:00.0: axl_io_error_detected : Request a slot reset 00000000252f50e9:000000000c4eaacf
[ +0.381261] pcieport 0000:0e:00.0: AER: Downstream Port link has been reset (0)
[ +0.000270] axl 0000:0f:00.0: Unregister directory 0000:0f:00.0
[ +0.000351] axl 0000:0f:00.0: Unregistered metis-0:f:0 (0 0)
[ +0.000020] axl 0000:0f:00.0: Release dma mem metis-0:f:0
[ +0.002383] ------------[ cut here ]------------
[ +0.000003] axl 0000:0f:00.0: disabling already-disabled device
[ +0.000012] WARNING: CPU: 62 PID: 2898 at drivers/pci/pci.c:2387 pci_disable_device+0xc4/0xf0
[ +0.000010] Modules linked in: overlay qrtr cfg80211 sunrpc binfmt_misc xfs nls_iso8859_1 intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common intel_ifs i10nm_edac skx_edac_common nfit x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel dell_wmi cmdlinepart video kvm dax_hmem ledtrig_audio cxl_acpi iaa_crypto spi_nor sparse_keymap ipmi_ssif cxl_port dell_smbios pmt_telemetry irqbypass mtd pmt_class intel_sdsi rapl dcdbas intel_cstate wmi_bmof dell_wmi_descriptor cxl_core i2c_i801 isst_if_mbox_pci mgag200 idxd mei_me isst_if_mmio spi_intel_pci i2c_algo_bit metis(OE) idxd_bus intel_vsec switchtec isst_if_common mei spi_intel i2c_ismt i2c_smbus ipmi_si acpi_power_meter acpi_ipmi ipmi_devintf ipmi_msghandler input_leds joydev mac_hid sch_fq_codel dm_multipath msr parport_pc ppdev lp parport efi_pstore nfnetlink dmi_sysfs ip_tables x_tables autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0
[ +0.000086] hid_generic usbhid hid crct10dif_pclmul crc32_pclmul polyval_clmulni polyval_generic ghash_clmulni_intel nvme sha256_ssse3 sha1_ssse3 megaraid_sas nvme_core ahci nvme_auth tg3 xhci_pci libahci xhci_pci_renesas wmi pinctrl_emmitsburg aesni_intel crypto_simd cryptd
[ +0.000023] CPU: 62 PID: 2898 Comm: axelera-multi-d Tainted: G OE 6.8.0-111-generic #111-Ubuntu
[ +0.000004] Hardware name: Dell Inc. PowerEdge R760/05H0JD, BIOS 2.6.3 03/26/2025
[ +0.000002] RIP: 0010:pci_disable_device+0xc4/0xf0
[ +0.000004] Code: 4d 85 e4 75 07 4c 8b a3 c8 00 00 00 48 8d bb c8 00 00 00 e8 5e e6 21 00 4c 89 e2 48 c7 c7 b8 1a 4c 9a 48 89 c6 e8 3c d9 77 ff <0f> 0b e9 57 ff ff ff 48 89 df e8 8d fe ff ff 80 a3 51 08 00 00 df
[ +0.000004] RSP: 0018:ff5adcfec77f3ba8 EFLAGS: 00010246
[ +0.000003] RAX: 0000000000000000 RBX: ff2b2e6dc86b8000 RCX: 0000000000000000
[ +0.000002] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[ +0.000002] RBP: ff5adcfec77f3bb8 R08: 0000000000000000 R09: 0000000000000000
[ +0.000001] R10: 0000000000000000 R11: 0000000000000000 R12: ff2b2e6dc7b20fb0
[ +0.000002] R13: 0000000000000000 R14: ff2b2e6dc86b8148 R15: 0000000000000080
[ +0.000001] FS: 0000795150e8c740(0000) GS:ff2b2e6da0180000(0000) knlGS:0000000000000000
[ +0.000003] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ +0.000002] CR2: 0000795150e04650 CR3: 0000000106c72006 CR4: 0000000000f71ef0
[ +0.000002] PKRU: 55555554
[ +0.000002] Call Trace:
[ +0.000002] <TASK>
[ +0.000005] axl_aipu_remove+0xf6/0x110 [metis]
[ +0.000011] pci_device_remove+0x3e/0xb0
[ +0.000005] device_remove+0x40/0x80
[ +0.000004] device_release_driver_internal+0x20b/0x270
[ +0.000004] device_release_driver+0x12/0x20
[ +0.000003] pci_stop_bus_device+0x7a/0xb0
[ +0.000004] pci_stop_bus_device+0x30/0xb0
[ +0.000003] pci_stop_bus_device+0x41/0xb0
[ +0.000002] pci_stop_bus_device+0x41/0xb0
[ +0.000003] pci_stop_and_remove_bus_device_locked+0x1a/0x40
[ +0.000004] remove_store+0x8f/0xa0
[ +0.000005] dev_attr_store+0x14/0x40
[ +0.000004] sysfs_kf_write+0x3b/0x60
[ +0.000006] kernfs_fop_write_iter+0x163/0x1f0
[ +0.000005] vfs_write+0x2a1/0x470
[ +0.000005] ksys_write+0x73/0x100
[ +0.000004] __x64_sys_write+0x19/0x30
[ +0.000003] x64_sys_call+0x7e/0x25a0
[ +0.000004] do_syscall_64+0x7f/0x180
[ +0.000004] ? arch_exit_to_user_mode_prepare.isra.0+0x1a/0xe0
[ +0.000005] ? irqentry_exit_to_user_mode+0x38/0x1e0
[ +0.000005] ? irqentry_exit+0x43/0x50
[ +0.000004] ? exc_page_fault+0x94/0x1b0
[ +0.000003] entry_SYSCALL_64_after_hwframe+0x78/0x80
[ +0.000007] RIP: 0033:0x795150d1c5a4
[ +0.000032] Code: c7 00 16 00 00 00 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 80 3d a5 ea 0e 00 00 74 13 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 55 48 89 e5 48 83 ec 20 48 89
[ +0.000002] RSP: 002b:00007fffd768d748 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
[ +0.000003] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 0000795150d1c5a4
[ +0.000002] RDX: 0000000000000002 RSI: 00005d8f2ecac450 RDI: 0000000000000001
[ +0.000001] RBP: 00007fffd768d770 R08: 0000000000000073 R09: 0000000000000000
[ +0.000002] R10: 00000000ffffffff R11: 0000000000000202 R12: 0000000000000002
[ +0.000001] R13: 00005d8f2ecac450 R14: 0000795150e045c0 R15: 0000795150e01ee0
[ +0.000004] </TASK>
[ +0.000001] ---[ end trace 0000000000000000 ]---
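
The `aer_uncor_status`, `aer_uncor_mask` and `aer_uncor_severity` values in the log above are bit fields, and the kernel's own summary lines (`[14] CmpltTO`, `[ 5] SDES`) name the bit that is set. A minimal sketch of decoding them by hand follows, assuming the standard PCIe AER Uncorrectable Error Status bit positions. The table is deliberately partial and the `decode_uncor` helper is made up for illustration.

```python
# Partial table of PCIe AER Uncorrectable Error Status bits.
# Short names follow the abbreviations the Linux kernel prints.
UNCOR_BITS = {
    4: "DLP (Data Link Protocol Error)",
    5: "SDES (Surprise Down Error)",
    12: "TLP (Poisoned TLP)",
    13: "FCP (Flow Control Protocol Error)",
    14: "CmpltTO (Completion Timeout)",
    15: "CmpltAbrt (Completer Abort)",
    16: "UnxCmplt (Unexpected Completion)",
    17: "RxOF (Receiver Overflow)",
    18: "MalfTLP (Malformed TLP)",
    19: "ECRC (ECRC Error)",
    20: "UnsupReq (Unsupported Request)",
    21: "ACSViol (ACS Violation)",
}


def decode_uncor(status, mask=0, severity=0):
    """List the known error bits set in status, with their mask and severity flags."""
    rows = []
    for bit, name in sorted(UNCOR_BITS.items()):
        if status & (1 << bit):
            rows.append({
                "bit": bit,
                "name": name,
                "masked": bool(mask & (1 << bit)),    # masked errors are not reported
                "fatal": bool(severity & (1 << bit)),  # severity bit set means fatal
            })
    return rows


# Root port 0000:0c:01.0, values copied from the log.
print(decode_uncor(0x00004000, 0x03310000, 0x044EF030))

# Downstream switch ports 0000:0e:0x.0, values copied from the log.
print(decode_uncor(0x00000020, 0x01A10000, 0x044EF030))
```

With the values from the log, the root port reports an unmasked, fatal Completion Timeout and each downstream switch port reports an unmasked, fatal Surprise Down Error, which matches the `type: fatal` lines.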

3 replies

Spanner
Axelera Team
  • May 14, 2026

Hi @Lukasz! This feels familiar. I think it, or something similar, might have come up elsewhere. This might sound obvious, so you’ve probably tried it already, but while I look into it, have you tried a full cold reboot (powering off completely, not a warm restart)? If it’s similar to the other case I’m thinking of, as I recall that kicked things back into action...


  • Author
  • Cadet
  • May 14, 2026

I did try a cold reboot, but it did not work. What did work was changing the PCIe slot, as well as changing the host machine to a different model. From what I can tell from the logs, the switch on the card stopped working.

The TLP header in “Error 0” indicates that the Metis at 12:00.0 requested a memory read from fc:00.0, which is mapped to a DMAR area, and the request timed out, resulting in the Completion Timeout.
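
As a cross-check of that reading, the four header words can be split into fields. This is a minimal sketch that assumes the log prints the header DWORDs in order, with header byte 0 in the top bits of the first word, and that the standard PCIe request header layout applies. It only names a handful of TLP kinds, and the helper names are made up for illustration. Under those assumptions the leading byte `0x05` decodes as a Type 1 configuration read rather than a memory read, with fc:00.0 in the Requester ID field and 12:00.0 as the target, so the direction of the request may be worth double-checking against the PCIe specification.

```python
# (fmt, type) pairs for a few request TLPs; anything else is reported as "other".
TLP_KINDS = {
    (0b000, 0b00000): "MRd (32-bit address)",
    (0b001, 0b00000): "MRd (64-bit address)",
    (0b000, 0b00100): "CfgRd0",
    (0b000, 0b00101): "CfgRd1",
    (0b010, 0b00100): "CfgWr0",
    (0b010, 0b00101): "CfgWr1",
}


def bdf(id16):
    """Format a 16-bit bus/device/function ID as bb:dd.f."""
    return f"{id16 >> 8:02x}:{(id16 >> 3) & 0x1F:02x}.{id16 & 0x7:x}"


def decode_tlp_header(text):
    """Decode the common request fields from four hex DWORDs of a TLP header."""
    dw0, dw1, dw2, _dw3 = (int(word, 16) for word in text.split())
    fmt = (dw0 >> 29) & 0x7
    typ = (dw0 >> 24) & 0x1F
    info = {
        "kind": TLP_KINDS.get((fmt, typ), "other"),
        "length_dw": dw0 & 0x3FF,
        "requester": bdf(dw1 >> 16),
        "tag": (dw1 >> 8) & 0xFF,   # low 8 tag bits only
        "first_be": dw1 & 0xF,
    }
    if typ in (0b00100, 0b00101):
        # Configuration requests carry the target BDF and register offset in DW2.
        info["target"] = bdf(dw2 >> 16)
        info["register"] = dw2 & 0xFFC
    return info


# Header copied from the "Error 0" section of the boot log in this reply.
print(decode_tlp_header("05800001 fc005e03 12000044 00000000"))
```

The `register` value is the configuration-space byte offset (extended register number and register number combined).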
 

[    7.776740] switchtec switchtec0: unregistered.
[ 7.979869] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
[ 7.979872] {1}[Hardware Error]: event severity: recoverable
[ 7.979874] {1}[Hardware Error]: Error 0, type: fatal
[ 7.979875] {1}[Hardware Error]: section_type: PCIe error
[ 7.979875] {1}[Hardware Error]: port_type: 4, root port
[ 7.979876] {1}[Hardware Error]: version: 3.0
[ 7.979877] {1}[Hardware Error]: command: 0x0547, status: 0x4010
[ 7.979878] {1}[Hardware Error]: device_id: 0000:0c:01.0
[ 7.979904] axl 0000:12:00.0: Link down
[ 7.979921] axl 0000:11:00.0: Link down
[ 7.980030] axl 0000:12:00.0: Unregister directory 0000:12:00.0
[ 7.980121] axl 0000:12:00.0: Unregistered metis-0:12:0 (3 3)
[ 7.980488] {1}[Hardware Error]: slot: 0
[ 7.980489] {1}[Hardware Error]: secondary_bus: 0x0d
[ 7.980706] axl 0000:12:00.0: Release dma mem metis-0:12:0
[ 7.980947] {1}[Hardware Error]: vendor_id: 0x8086, device_id: 0x352a
[ 7.980949] {1}[Hardware Error]: class_code: 060400
[ 7.980949] {1}[Hardware Error]: bridge: secondary_status: 0x6000, control: 0x0003
[ 7.980950] {1}[Hardware Error]: aer_uncor_status: 0x00004000, aer_uncor_mask: 0x03310000
[ 7.980951] {1}[Hardware Error]: aer_uncor_severity: 0x044ef030
[ 7.980951] {1}[Hardware Error]: TLP Header: 05800001 fc005e03 12000044 00000000
[ 7.980953] {1}[Hardware Error]: Error 1, type: fatal
[ 7.980953] {1}[Hardware Error]: section_type: PCIe error
[ 7.980954] {1}[Hardware Error]: port_type: 6, downstream switch port
[ 7.980954] {1}[Hardware Error]: version: 3.0
[ 7.980955] {1}[Hardware Error]: command: 0x0407, status: 0x0010
[ 7.980956] {1}[Hardware Error]: device_id: 0000:0e:00.0
[ 7.984893] {1}[Hardware Error]: slot: 1
[ 7.985088] {1}[Hardware Error]: secondary_bus: 0x0f
[ 7.985281] {1}[Hardware Error]: vendor_id: 0x11f8, device_id: 0x8562
[ 7.985475] {1}[Hardware Error]: class_code: 060400
[ 7.985665] {1}[Hardware Error]: bridge: secondary_status: 0x0000, control: 0x0003
[ 7.985858] {1}[Hardware Error]: aer_uncor_status: 0x00000020, aer_uncor_mask: 0x01a10000
[ 7.986051] {1}[Hardware Error]: aer_uncor_severity: 0x044ef030
[ 7.986242] {1}[Hardware Error]: TLP Header: 00000000 00000000 00000000 00000000
[ 7.986437] {1}[Hardware Error]: Error 2, type: fatal
[ 7.986629] {1}[Hardware Error]: section_type: PCIe error
[ 7.986817] {1}[Hardware Error]: port_type: 6, downstream switch port
[ 7.987002] {1}[Hardware Error]: version: 3.0
[ 7.987182] {1}[Hardware Error]: command: 0x0407, status: 0x0010
[ 7.987359] {1}[Hardware Error]: device_id: 0000:0e:01.0
[ 7.987534] {1}[Hardware Error]: slot: 2
[ 7.987704] {1}[Hardware Error]: secondary_bus: 0x10
[ 7.987871] {1}[Hardware Error]: vendor_id: 0x11f8, device_id: 0x8562
[ 7.988038] {1}[Hardware Error]: class_code: 060400
[ 7.988203] {1}[Hardware Error]: bridge: secondary_status: 0x0000, control: 0x0003
[ 7.988371] {1}[Hardware Error]: aer_uncor_status: 0x00000020, aer_uncor_mask: 0x01a10000
[ 7.988539] {1}[Hardware Error]: aer_uncor_severity: 0x044ef030
[ 7.988703] {1}[Hardware Error]: TLP Header: 00000000 00000000 00000000 00000000
[ 7.988868] {1}[Hardware Error]: Error 3, type: fatal
[ 7.989030] {1}[Hardware Error]: section_type: PCIe error
[ 7.989190] {1}[Hardware Error]: port_type: 6, downstream switch port
[ 7.989351] {1}[Hardware Error]: version: 3.0
[ 7.989507] {1}[Hardware Error]: command: 0x0407, status: 0x0010
[ 7.989662] {1}[Hardware Error]: device_id: 0000:0e:02.0
[ 7.989818] {1}[Hardware Error]: slot: 3
[ 7.989972] {1}[Hardware Error]: secondary_bus: 0x11
[ 7.990127] {1}[Hardware Error]: vendor_id: 0x11f8, device_id: 0x8562
[ 7.990284] {1}[Hardware Error]: class_code: 060400
[ 7.990441] {1}[Hardware Error]: bridge: secondary_status: 0x0000, control: 0x0003
[ 7.990601] {1}[Hardware Error]: aer_uncor_status: 0x00000020, aer_uncor_mask: 0x01a10000
[ 7.990764] {1}[Hardware Error]: aer_uncor_severity: 0x044ef030
[ 7.990926] {1}[Hardware Error]: TLP Header: 00000000 00000000 00000000 00000000
[ 7.991092] {1}[Hardware Error]: Error 4, type: fatal
[ 7.991255] {1}[Hardware Error]: section_type: PCIe error
[ 7.991417] {1}[Hardware Error]: port_type: 6, downstream switch port
[ 7.991580] {1}[Hardware Error]: version: 3.0
[ 7.991740] {1}[Hardware Error]: command: 0x0407, status: 0x0010
[ 7.991900] {1}[Hardware Error]: device_id: 0000:0e:03.0
[ 7.992059] {1}[Hardware Error]: slot: 4
[ 7.992217] {1}[Hardware Error]: secondary_bus: 0x12
[ 7.992375] {1}[Hardware Error]: vendor_id: 0x11f8, device_id: 0x8562
[ 7.992534] {1}[Hardware Error]: class_code: 060400
[ 7.992693] {1}[Hardware Error]: bridge: secondary_status: 0x0000, control: 0x0003
[ 7.992855] {1}[Hardware Error]: aer_uncor_status: 0x00000020, aer_uncor_mask: 0x01a10000
[ 7.993020] {1}[Hardware Error]: aer_uncor_severity: 0x044ef030
[ 7.993184] {1}[Hardware Error]: TLP Header: 00000000 00000000 00000000 00000000
[ 7.993379] pcieport 0000:0c:01.0: AER: aer_status: 0x00004000, aer_mask: 0x03310000
[ 7.993508] pcieport 0000:0c:01.0: [14] CmpltTO (First)
[ 7.993636] pcieport 0000:0c:01.0: AER: aer_layer=Transaction Layer, aer_agent=Requester ID
[ 7.993761] pcieport 0000:0c:01.0: AER: aer_uncor_severity: 0x044ef030
[ 7.994053] axl 0000:11:00.0: axl_io_error_detected : Request a slot reset 00000000be549c38:00000000a224ba97
[ 9.019816] pcieport 0000:0e:02.0: broken device, retraining non-functional downstream link at 2.5GT/s
[ 10.021796] pcieport 0000:0e:02.0: retraining failed
[ 10.021904] pcieport 0000:0e:02.0: Data Link Layer Link Active not set in 100 msec
[ 10.124670] pci 0000:12:00.0: AER: can't recover (no error_detected callback)
[ 10.124673] pci 0000:0d:00.1: AER: can't recover (no error_detected callback)
[ 10.180822] axl 0000:11:00.0: Link up
[ 10.260812] pcieport 0000:0c:01.0: AER: Root Port link has been reset (0)
[ 10.260903] axl 0000:11:00.0: Unregister directory 0000:11:00.0
[ 10.260915] pcieport 0000:0c:01.0: AER: device recovery failed
[ 10.260939] pcieport 0000:0e:00.0: AER: aer_status: 0x00000020, aer_mask: 0x01a10000
[ 10.261040] axl 0000:11:00.0: Unregistered metis-0:11:0 (2 2)
[ 10.261419] axl 0000:11:00.0: Release dma mem metis-0:11:0
[ 10.261418] pcieport 0000:0e:00.0: [ 5] SDES (First)
[ 10.261717] pcieport 0000:0e:00.0: AER: aer_layer=Transaction Layer, aer_agent=Receiver ID
[ 10.262101] pcieport 0000:0e:00.0: AER: aer_uncor_severity: 0x044ef030
[ 10.280480] axl 0000:0f:00.0: axl_io_error_detected : Request a slot reset 000000001187962f:00000000fb6d0c4c
[ 10.644795] pcieport 0000:0e:00.0: AER: Downstream Port link has been reset (0)
[ 10.644798] axl 0000:0f:00.0: Called after the pci bus has been reset
[ 10.644828] axl 0000:0f:00.0: Unable to change power state from D3cold to D0, device inaccessible
[ 10.645021] axl 0000:0f:00.0: Restore bars
[ 10.754783] workqueue: drm_fb_helper_damage_work hogged CPU for >10000us 128 times, consider switching to WQ_UNBOUND
[ 10.763222] pcieport 0000:0e:00.0: AER: device recovery successful
[ 10.763232] pcieport 0000:0e:01.0: AER: aer_status: 0x00000020, aer_mask: 0x01a10000
[ 10.763383] pcieport 0000:0e:01.0: [ 5] SDES (First)
[ 10.763528] pcieport 0000:0e:01.0: AER: aer_layer=Transaction Layer, aer_agent=Receiver ID
[ 10.763675] pcieport 0000:0e:01.0: AER: aer_uncor_severity: 0x044ef030
[ 10.782468] axl 0000:10:00.0: axl_io_error_detected : Request a slot reset 0000000029ed1e03:000000008555d27f
[ 11.013747] tg3 0000:01:00.0 eno8303: Link is up at 1000 Mbps, full duplex
[ 11.013754] tg3 0000:01:00.0 eno8303: Flow control is off for TX and off for RX
[ 11.013756] tg3 0000:01:00.0 eno8303: EEE is disabled
[ 11.148825] pcieport 0000:0e:01.0: AER: Downstream Port link has been reset (0)
[ 11.148832] axl 0000:10:00.0: Called after the pci bus has been reset
[ 11.148864] axl 0000:10:00.0: Unable to change power state from D3cold to D0, device inaccessible
[ 11.149061] axl 0000:10:00.0: Restore bars
[ 11.267609] pcieport 0000:0e:01.0: AER: device recovery successful
[ 11.267616] pcieport 0000:0e:02.0: AER: aer_status: 0x00000020, aer_mask: 0x01a10000
[ 11.267775] pcieport 0000:0e:02.0: [ 5] SDES (First)
[ 11.267927] pcieport 0000:0e:02.0: AER: aer_layer=Transaction Layer, aer_agent=Receiver ID
[ 11.268079] pcieport 0000:0e:02.0: AER: aer_uncor_severity: 0x044ef030
[ 11.288991] axl 0000:11:00.0: Unable to change power state from D3cold to D0, device inaccessible
[ 11.289199] axl 0000:11:00.0: vmsi not available
[ 11.374076] pci 0000:11:00.0: AER: can't recover (no error_detected callback)
[ 11.508805] pcieport 0000:0e:02.0: AER: Downstream Port link has been reset (0)
[ 11.508814] pcieport 0000:0e:02.0: AER: device recovery failed
[ 11.508822] pci 0000:0e:03.0: AER: aer_status: 0x00000020, aer_mask: 0x01a10000
[ 11.508823] pci 0000:0e:03.0: [ 5] SDES (First)
[ 11.508826] pci 0000:0e:03.0: AER: aer_layer=Transaction Layer, aer_agent=Receiver ID
[ 11.508827] pci 0000:0e:03.0: AER: aer_uncor_severity: 0x044ef030
[ 11.508828] pci 0000:12:00.0: AER: can't recover (no error_detected callback)
[ 11.509281] axl 0000:10:00.0: Unregister directory 0000:10:00.0
[ 11.509465] axl 0000:10:00.0: Unregistered metis-0:10:0 (1 1)
[ 11.509470] axl 0000:10:00.0: Release dma mem metis-0:10:0
[ 11.590582] axl 0000:0f:00.0: Unregister directory 0000:0f:00.0
[ 11.590685] axl 0000:0f:00.0: Unregistered metis-0:f:0 (0 0)
[ 11.590690] axl 0000:0f:00.0: Release dma mem metis-0:f:0
[ 11.644801] pci 0000:0e:03.0: AER: Downstream Port link has been reset (0)
[ 11.644822] pci 0000:0e:03.0: AER: device recovery failed
[ 11.679485] pci_bus 0000:0f: busn_res: [bus 0f] is released
[ 11.679897] pci_bus 0000:10: busn_res: [bus 10] is released
[ 11.680338] pci_bus 0000:11: busn_res: [bus 11] is released
[ 11.680617] pci_bus 0000:12: busn_res: [bus 12] is released
[ 11.680800] pci_bus 0000:0e: busn_res: [bus 0e-12] is released
[ 11.681172] pci_bus 0000:0d: busn_res: [bus 0d-12] is released
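
Because the AER recovery messages for each port are interleaved with driver output, it can help to pull out just the per-port outcome. A minimal sketch follows, assuming the `AER: device recovery successful|failed` wording seen in this log. The `recovery_summary` helper is made up for illustration, and the sample lines are copied from the log above.

```python
import re

# Matches e.g. "pcieport 0000:0e:00.0: AER: device recovery successful"
RECOVERY = re.compile(
    r"(?:pcieport|pci) (\S+): AER: device recovery (successful|failed)"
)


def recovery_summary(lines):
    """Map each PCI address to the last AER recovery outcome reported for it."""
    outcome = {}
    for line in lines:
        match = RECOVERY.search(line)
        if match:
            outcome[match.group(1)] = match.group(2)
    return outcome


sample = [
    "[ 10.260915] pcieport 0000:0c:01.0: AER: device recovery failed",
    "[ 10.763222] pcieport 0000:0e:00.0: AER: device recovery successful",
    "[ 11.267609] pcieport 0000:0e:01.0: AER: device recovery successful",
    "[ 11.508814] pcieport 0000:0e:02.0: AER: device recovery failed",
    "[ 11.644822] pci 0000:0e:03.0: AER: device recovery failed",
]

for address, result in recovery_summary(sample).items():
    print(address, result)
```

On this log the summary shows recovery failing on the root port and on downstream ports 0e:02.0 and 0e:03.0, the two ports leading to buses 11 and 12.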

 


  • Author
  • Cadet
  • May 14, 2026

I am not sure if it is related, but when I first got the cards they had firmware 1.5.x on them. I performed a firmware upgrade, which did upgrade the firmware but failed to update the board controller firmware. However, the devices seem to work: I can run models on them, but maybe not everything is working as it is supposed to? I did the 2-stage firmware update process through axdevice; here are the logs: