
I’m also not seeing the card recognised in `lspci`. I’m adding a post to this thread to keep all the info in one place for others, but let me know if it should be a different thread.

I’ve enabled `pcie-check.service` and get the following when I check the status;

alsutton@svr220:~$ sudo systemctl status pcie-check.service

● pcie-check.service - Check for PCIe devices with vendor ID 1f9d and reboot once if not found
     Loaded: loaded (/etc/systemd/system/pcie-check.service; enabled; vendor preset: enabled)
     Active: active (exited) since Wed 2025-08-06 09:33:01 UTC; 11min ago
    Process: 1709 ExecStart=/usr/local/bin/check_pcie_device.sh (code=exited, status=0/SUCCESS)
   Main PID: 1709 (code=exited, status=0/SUCCESS)
        CPU: 6ms

Aug 06 09:33:01 svr220 systemd[1]: Starting Check for PCIe devices with vendor ID 1f9d and reboot once if not found...
Aug 06 09:33:01 svr220 check_pcie_device.sh[1709]: Reboot already performed; skipping further checks.
Aug 06 09:33:01 svr220 systemd[1]: Finished Check for PCIe devices with vendor ID 1f9d and reboot once if not found.

`dmesg` contains the following lines;

[  519.220595] triton: root directory for triton
[  844.197296] triton: debugfs root directory triton removed

`lspci -tv` contains;

 +-[0000:80]-+-02.0-[81]--
 |           +-02.3-[82]----00.0 Micron/Crucial Technology P2 [Nick P2] / P3 / P3 Plus NVMe PCIe SSD (DRAM-less)

(The NVMe PCIe SSD adapter is on the same riser card as the Axelera card, and when the card isn’t present there is no [81] entry)

The metis module doesn’t autoload, so I’m loading it manually, and after doing that, any attempt to run `triton_multi_ctx` results in;

(venv) alsutton@svr220:~/Utils/voyager-sdk$ triton_multi_ctx --cold-boot 3
[libtriton_linux.c:985] Could not open directory '/sys/class/metis/': No such file or directory
Fail to get device name
(venv) alsutton@svr220:~/Utils/voyager-sdk$ sudo modprobe metis
[sudo] password for alsutton:
(venv) alsutton@svr220:~/Utils/voyager-sdk$ triton_multi_ctx --cold-boot 3
Fail to get device name

I’ve run `update-pciids`, have put `pcie_aspm=off` into the grub cmdline (and run `update-grub`), and tried reinstalling the kernel modules (`metis-dkms`), set `d3cold_allowed` to 0, and tried setting it to 1, triggering a pci rescan after each change.
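For anyone repeating these checks, the sysfs lookup and the rescan can be scripted. This is only a minimal sketch, not the contents of `check_pcie_device.sh`: the function names are made up for illustration, and it assumes the card would enumerate with vendor ID `1f9d`, as the service description above suggests.

```python
from pathlib import Path

# Assumption: vendor ID taken from the pcie-check.service description.
AXELERA_VENDOR_ID = 0x1F9D


def find_pci_devices(vendor_id, sysfs_root="/sys/bus/pci/devices"):
    """Return PCI addresses under sysfs_root whose vendor file matches vendor_id."""
    matches = []
    root = Path(sysfs_root)
    if not root.is_dir():
        return matches
    for dev in sorted(root.iterdir()):
        try:
            # sysfs stores the ID as text such as "0x1f9d\n"
            value = int((dev / "vendor").read_text().strip(), 16)
        except (OSError, ValueError):
            continue
        if value == vendor_id:
            matches.append(dev.name)
    return matches


def rescan_pci_bus(rescan_path="/sys/bus/pci/rescan"):
    """Ask the kernel to re-enumerate the PCI bus (needs root)."""
    Path(rescan_path).write_text("1")


if __name__ == "__main__":
    found = find_pci_devices(AXELERA_VENDOR_ID)
    print(found if found else "no device with vendor ID 1f9d found")
```

An empty result after a rescan matches what `lspci` shows here: the card never appears on the bus, so there is nothing for the `metis` driver to bind to.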

Does anyone have any thoughts?

[Machine Details; Supermicro X10DRU-i+ with two Intel Xeon E5-2673 v4’s]

Welcome to the community @alsutton! Great to have you here.


Just to tick off a couple of obvious thoughts before digging deeper: has Above 4G Decoding been enabled in the BIOS?

Any improvement after running `axdevice --refresh` in the SDK environment?


Thanks for the greeting 🙂.

Above 4G decoding has been enabled. The other potentially relevant settings (all under “Advanced” in the BIOS) are;

CPU Configuration > X2APIC = Enabled
CPU Configuration > Intel Virtualisation Technology = Disabled
Chipset > Northbridge > IIO > IOU2 = 4x4x4x4
Chipset > Northbridge > IIO > IIO2 > IIO Port Link Speed = Gen 3 (8 GT/s)
Chipset > Northbridge > IIO > IIO2 > IIO Port Non-Posted Prefetch = Disabled
Chipset > Northbridge > IIO > Intel VT for Directed I/O (VT-d) = Disabled
Chipset > PCIe/PCI/PnP Configuration > PCI PERR/SERR Support = Disabled
Chipset > PCIe/PCI/PnP Configuration > SR-IOV Support = Disabled
Chipset > PCIe/PCI/PnP Configuration > ASPM Support = Disabled


Hello!
I have tried both Intel and AMD environments, and I think the following conditions are necessary for recognition by lspci.


1. OS installation in UEFI mode (the card could not be recognized with CSM enabled or Legacy Boot, even when the conditions below were met).

2. Above 4G settings (must be enabled in the BIOS or UEFI).

3. Resizable BAR might also be a factor. Where this setting could not be configured in UEFI, the card was not recognized by lspci.

Please check if it can be set in UEFI. You might also want to consider changing the motherboard.
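Condition 1 can be confirmed from the running system rather than from the firmware menus: on Linux, the directory `/sys/firmware/efi` is only present when the kernel was booted through UEFI. A minimal sketch of that check (the function name is just for illustration):

```python
import os


def booted_in_uefi_mode(efi_dir="/sys/firmware/efi"):
    """Return True if the running Linux system was booted via UEFI.

    The kernel only creates /sys/firmware/efi on a UEFI boot; on a
    legacy/CSM boot the directory is absent.
    """
    return os.path.isdir(efi_dir)


if __name__ == "__main__":
    print("UEFI boot" if booted_in_uefi_mode() else "Legacy/CSM boot")
```

This only covers the boot mode. Above 4G Decoding and Resizable BAR still have to be checked in the firmware setup itself.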


`axdevice --refresh` output;

(venv) alsutton@svr220:~/Utils/voyager-sdk$ axdevice --refresh
ERROR: No target device found in lspci output
ERROR: AXR_ERROR_CONNECTION_ERROR: No target device found in lspci output
(venv) alsutton@svr220:~/Utils/voyager-sdk$ axdevice --refresh -v
INFO: Removing 0000:00:01.0
INFO: PCIE rescan
ERROR: No target device found in lspci output
Traceback (most recent call last):
  File "/home/alsutton/.cache/axelera/venvs/83d579fa/bin/axdevice", line 8, in <module>
    sys.exit(entrypoint_main())
  File "/home/alsutton/.cache/axelera/venvs/83d579fa/lib/python3.10/site-packages/axelera/runtime/axdevice.py", line 739, in entrypoint_main
    main(args)
  File "/home/alsutton/.cache/axelera/venvs/83d579fa/lib/python3.10/site-packages/axelera/runtime/axdevice.py", line 698, in main
    found_devices = c.list_devices()
  File "/home/alsutton/.cache/axelera/venvs/83d579fa/lib/python3.10/site-packages/axelera/runtime/objects.py", line 277, in list_devices
    _maybe_raise_error(self)
  File "/home/alsutton/.cache/axelera/venvs/83d579fa/lib/python3.10/site-packages/axelera/runtime/objects.py", line 91, in _maybe_raise_error
    _raise_error(ctx, err_no)
  File "/home/alsutton/.cache/axelera/venvs/83d579fa/lib/python3.10/site-packages/axelera/runtime/objects.py", line 83, in _raise_error
    raise exc(f"{err}: {msg}")
ConnectionError: AXR_ERROR_CONNECTION_ERROR: No target device found in lspci output
(venv) alsutton@svr220:~/Utils/voyager-sdk$

 


Hello!
I have tried both Intel and AMD environments, and I think the following conditions are necessary for recognition by lspci.


1. OS installation in UEFI mode (the card could not be recognized with CSM enabled or Legacy Boot, even when the conditions below were met).

Done.

2. Above 4G settings (must be enabled in the BIOS or UEFI).

Done.

3. Resizable BAR might also be a factor. Where this setting could not be configured in UEFI, the card was not recognized by lspci.

There’s no option for SAM/Resizable BAR in the BIOS/UEFI.

Please check if it can be set in UEFI. You might also want to consider changing the motherboard.

Changing motherboards isn’t an option, unfortunately. It’s more likely we’d go with a different accelerator card.

Thanks for your suggestions.


Here’s the manual for the system, which includes all the possible settings, for those who are interested; https://www.supermicro.com/manuals/motherboard/C612/MNL-1597.pdf


Above 4G decoding is supported. That's good to know.


I also tried swapping the expansion slots, but since your motherboard is dual socket, maybe you could try with one CPU?


It looks like you have the UEFI set up.


If possible, it might be worth updating the Metis AIPU firmware in another motherboard to see whether the card then behaves differently.


I popped the card out of a server and put it in a Dell T5810 desktop. It looks, to me, like the card is faulty.

I see another non-entry in `lspci -tv`;

 |           +-01.0-[01]--
| +-02.0-[02]--+-00.0 NVIDIA Corporation AD104GL [RTX 4000 Ada Generation]
| | \-00.1 NVIDIA Corporation AD104 High Definition Audio Controller

(The RTX4000 Ada is in the next slot, so I’d expect to see the card at `01.0`)
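For comparing machines, the "bridge with nothing behind it" pattern can be picked out of saved `lspci -tv` output automatically. The sketch below is an illustration only: it assumes that an unpopulated bridge line ends in `-[NN]--`, which is how the output looks in this thread, and the function name is made up.

```python
import re

# Assumption: an unpopulated bridge line ends with "-[NN]--" (optionally a bus
# range such as [01-02]); a populated one continues with a device after it.
EMPTY_BRIDGE = re.compile(
    r"([0-9a-f]{2}\.[0-7])-\[([0-9a-f]{2}(?:-[0-9a-f]{2})?)\]-+\s*$"
)


def empty_bridges(lspci_tree):
    """Return (bridge slot, secondary bus) pairs with no device listed behind them."""
    result = []
    for line in lspci_tree.splitlines():
        match = EMPTY_BRIDGE.search(line)
        if match:
            result.append((match.group(1), match.group(2)))
    return result


if __name__ == "__main__":
    sample = (
        " |           +-01.0-[01]--\n"
        " |           +-02.0-[02]--+-00.0 NVIDIA Corporation AD104GL [RTX 4000 Ada Generation]\n"
        " |           |            \\-00.1 NVIDIA Corporation AD104 High Definition Audio Controller\n"
    )
    print(empty_bridges(sample))
```

On the sample above this reports the bridge at `01.0` with bus `01`, the slot where the card sits.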

I’ve gone through the rescan, etc., etc., etc. process, but still no joy.

Here’s what we see in dmesg on the 5810;

[    1.509977] pci 0000:00:01.0: [8086:6f02] type 01 class 0x060400 PCIe Root Port
[    1.510026] pci 0000:00:01.0: PCI bridge to [bus 01]
[    1.510031] pci 0000:00:01.0:   bridge window [mem 0xf2000000-0xf40fffff]
[    1.510061] pci 0000:00:01.0: broken device, retraining non-functional downstream link at 2.5GT/s
[    2.506890] pci 0000:00:01.0: retraining failed
[    2.506960] pci 0000:00:01.0: PME# supported from D0 D3hot D3cold
[    2.513307] pci 0000:00:01.0: PCI bridge to [bus 01]
[    2.557694] pci 0000:00:01.0: PCI bridge to [bus 01]
[    2.557704] pci 0000:00:01.0:   bridge window [mem 0xf2000000-0xf40fffff]
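The telling lines are the `retraining` ones: the root port tried to bring the downstream link up at 2.5GT/s and gave up. For anyone wanting to check their own logs for the same symptom, here is a rough sketch that scans saved `dmesg` text for it. The function name is invented for illustration, and it only matches the exact message wording shown above.

```python
import re

# Matches lines such as "[    2.506890] pci 0000:00:01.0: retraining failed"
PCI_LINE = re.compile(
    r"^\[\s*\d+\.\d+\]\s+pci\s+"
    r"(?P<addr>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]):\s+(?P<msg>.*)$"
)


def failed_link_retraining(dmesg_text):
    """Return PCI addresses of ports for which the kernel logged 'retraining failed'."""
    failed = []
    for line in dmesg_text.splitlines():
        match = PCI_LINE.match(line.strip())
        if not match:
            continue
        if match.group("msg").strip() == "retraining failed":
            addr = match.group("addr")
            if addr not in failed:
                failed.append(addr)
    return failed


if __name__ == "__main__":
    import subprocess

    # Reading the kernel log may need root on some systems.
    log = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
    print(failed_link_retraining(log))
```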

Given we’re in the UK, and import tax has already been paid, I’m going to draw a line under this and write the card off as a learning experience.

Thanks to everyone for their help.


Hi @alsutton! Hmm, yeah, that’s a good test of the card, and it is pointing towards a hardware issue.

We can look at getting a replacement out to you ASAP - I’ll DM you so we can get any required details together 👍

