Hey everyone,

 

I’m having trouble getting my Axelera Metis PCIe AI Accelerator to be recognized by `lspci` (and by my system in general). I tested the card on two different systems.

 

On my AMD setup, I am using an AMD 5950X with an ASUS B550-F. I tried installing the card in the slot I normally use for my RTX 3080 as well as in another slot that meets the specifications. In both cases, `lspci` does not list the card even though the fan spins and I know it is getting power. The Voyager SDK doesn’t recognize the card either.

 

I also tried it on an Intel system with an Intel i5-8500T on a Supermicro X11SCA-F motherboard. The same issue occurs. The card is powered (the fan spins) but not recognized.

 

I noticed that the boot time increases significantly when the accelerator is installed. This makes me think that UEFI might be attempting to detect something, even though no error messages are shown. As I already tried a lot and I couldn’t get it running, I wonder if you have any ideas what I can test.

Could there be a problem with the BIOS on the card, similar to issues sometimes seen with GPUs (Video Bios)? If so, is there a way to flash or update it? I think I would also have the necessary tools to attach to the debug port if that helps.

 

Any insights or suggestions would be greatly appreciated. Thanks in advance!

Ah, interesting issue, thanks for bringing it to attention ​@tripton!

Bit out of my expertise, but let’s see what we can find out for you. 👍 Help is on its way!


Hi ​@tripton 
Thank you for your question. Indeed, sometimes the kernel driver for Metis gets messed up. Generally, the following commands help me in such a case:

sudo apt remove metis-dkms

sudo rmmod metis

After which installing the sdk again should help your lspci to recognize the card!

//EDIT do not get scared by not-found errors with these uninstall commands


Hello ​@tripton .
Please try the following:

Add the following kernel parameter at boot:

  • On Intel CPUs: intel_iommu=off
  • On AMD CPUs: amd_iommu=off

For example, in Ubuntu installations, this may be done in the /etc/default/grub file by adding the parameter to the GRUB_CMDLINE_LINUX_DEFAULT line.

  • Then run sudo update-grub
  • Then do  sudo reboot
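
The GRUB edit described above can be sketched as a small shell helper. This is only a sketch under a few assumptions, not an Axelera tool: the function name `add_kernel_param` is made up for illustration, it expects a single double-quoted GRUB_CMDLINE_LINUX_DEFAULT line, and it only edits the file, so `sudo update-grub` and a reboot are still needed afterwards.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the Voyager SDK): append a kernel parameter
# to GRUB_CMDLINE_LINUX_DEFAULT in the given file unless it is already there.
# Assumes the parameter contains no '/' or '&' (both are special to sed).
add_kernel_param() {
    local param="$1"
    local file="${2:-/etc/default/grub}"
    local current word

    # Extract the text between the double quotes of the existing line.
    current=$(sed -n 's/^GRUB_CMDLINE_LINUX_DEFAULT="\(.*\)"$/\1/p' "$file")

    # Skip the edit if the parameter is already listed.
    for word in $current; do
        if [ "$word" = "$param" ]; then
            return 0
        fi
    done

    # Append the parameter before the closing quote; a .bak copy is kept.
    sed -i.bak "s/^\(GRUB_CMDLINE_LINUX_DEFAULT=\".*\)\"\$/\1 ${param}\"/" "$file"
}

# Example, run as root:
#   add_kernel_param amd_iommu=off
#   update-grub && reboot
```

Running it twice leaves the line unchanged the second time, which makes it safe to repeat while trying different parameters.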

Let us know if that solved your issue.

Also, please be sure secure boot is disabled.

Kind regards,

Victor

Axelera AI, Customer Engineer


Thank you everyone for your suggestions, but unfortunately none of them helped. lspci still doesn't show the card. I also can't find any related output in dmesg (I'm not even sure what I should be looking for). Secure boot is definitely disabled and, as suggested, I also disabled IOMMU.

Any other ideas?


Hmm, sorry that’s not got things there yet. I’ve just been having a look around in some of the internal docs, to see if I can spot anything (bit out of my specialty). Running sudo dmesg should print the message buffer of the kernel, by the looks of things.

Maybe if you could try that, and post what you see in here? Might give us some ideas on what to try next?


Hey, thanks for looking into this. When I purchased the card, I knew it was a new company doing important work, so no worries at all. I appreciate the effort to diagnose the issue.

I ran sudo dmesg and got quite a lot of output. However, I received an error stating "The text may contain a maximum of 30000 characters." when trying to post. Because of that, I've uploaded the logs to Pastebin:

https://pastebin.com/YE3XYeBy

Let me know if you need any more information or if there is something else I can try.


Really appreciate that ​@tripton, and it’s more helpful than we can say, having early adopters like yourself putting the effort in and joining us so early in the journey!

And thanks for the dmesg output - let’s see what we can fathom from that 👍


Hi Tripton!
Have you run the install script in your sdk again after running the remove dkms and rmmod metis calls?
I think I was unclear (edited now): you shouldn’t reinstall your card in a physical way, but you should reinstall the sdk.
Sorry for the confusion.

//EDIT in the meantime I’ll have a look into your Dmesg output, see if anything stands out


Hi Tripton,

I have a few more questions

  1. Did you try both reboot and power cycle? Sometimes one of the two can help with the lspci issues (they behave differently)
  2. Could you post your lspci -tv output to see if the link is detected during BIOS enumeration?
  3. If the link is being detected there, could you disable d3cold on the pcie slot (in your BIOS) and force a rescan?

By the way, you can rescan your pcie by going into sudo bash and running the following command

echo 1 > /sys/bus/pci/rescan

 


Hi everyone,

Sorry, my mistake—I only uninstalled the driver and forgot to reinstall it. I've reinstalled it, so if that affects the dmesg output, I've uploaded the current logs (https://pastebin.com/7zjnDCTE). Also, for lspci -tv, I still can’t see the card. But for completeness, I've included the lspci -tv output below:

-[0000:00]-+-00.0  Advanced Micro Devices, Inc. [AMD] Starship/Matisse Root Complex
           +-00.2  Advanced Micro Devices, Inc. [AMD] Starship/Matisse IOMMU
           +-01.0  Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
           +-01.1-[01]----00.0  Sandisk Corp WD Black SN750 / PC SN730 NVMe SSD
           +-01.2-[02-0a]--+-00.0  Advanced Micro Devices, Inc. [AMD] Device 43ee
           |               +-00.1  Advanced Micro Devices, Inc. [AMD] Device 43eb
           |               \-00.2-[03-0a]--+-00.0-[04]--
           |                               +-01.0-[05]--
           |                               +-02.0-[06]----00.0  Device 1af2:a001
           |                               +-03.0-[07]----00.0  Renesas Technology Corp. uPD720201 USB 3.0 Host Controller
           |                               +-04.0-[08]----00.0  Sandisk Corp WD Blue SN550 NVMe SSD
           |                               +-08.0-[09]--
           |                               \-09.0-[0a]----00.0  Intel Corporation Ethernet Controller I225-V
           +-02.0  Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
           +-03.0  Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
           +-03.1-[0b]--+-00.0  NVIDIA Corporation GA102 [GeForce RTX 3080]
           |            \-00.1  NVIDIA Corporation GA102 High Definition Audio Controller
           +-04.0  Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
           +-05.0  Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
           +-07.0  Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
           +-07.1-[0c]----00.0  Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Function
           +-08.0  Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
           +-08.1-[0d]--+-00.0  Advanced Micro Devices, Inc. [AMD] Starship/Matisse Reserved SPP
           |            +-00.1  Advanced Micro Devices, Inc. [AMD] Starship/Matisse Cryptographic Coprocessor PSPCPP
           |            +-00.3  Advanced Micro Devices, Inc. [AMD] Matisse USB 3.0 Host Controller
           |            \-00.4  Advanced Micro Devices, Inc. [AMD] Starship/Matisse HD Audio Controller
           +-14.0  Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller
           +-14.3  Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge
           +-18.0  Advanced Micro Devices, Inc. [AMD] Matisse/Vermeer Data Fabric: Device 18h; Function 0
           +-18.1  Advanced Micro Devices, Inc. [AMD] Matisse/Vermeer Data Fabric: Device 18h; Function 1
           +-18.2  Advanced Micro Devices, Inc. [AMD] Matisse/Vermeer Data Fabric: Device 18h; Function 2
           +-18.3  Advanced Micro Devices, Inc. [AMD] Matisse/Vermeer Data Fabric: Device 18h; Function 3
           +-18.4  Advanced Micro Devices, Inc. [AMD] Matisse/Vermeer Data Fabric: Device 18h; Function 4
           +-18.5  Advanced Micro Devices, Inc. [AMD] Matisse/Vermeer Data Fabric: Device 18h; Function 5
           +-18.6  Advanced Micro Devices, Inc. [AMD] Matisse/Vermeer Data Fabric: Device 18h; Function 6
           \-18.7  Advanced Micro Devices, Inc. [AMD] Matisse/Vermeer Data Fabric: Device 18h; Function 7

I'm currently only testing on the AMD configuration. If needed, I can provide the corresponding logs for the Intel setup as well.

Thanks for letting me know how to rescan (I tried that with no luck). Please let me know if there’s anything else I should try!

Thanks again


Hi,

While we’re processing your response, I remembered that we never asked which ubuntu version you’re running. So, which version of ubuntu are you running?

We officially support version 22.04. Official support for 24.04 will be released later, but for now we recommend using Docker for version 24.04.

 

 


Hi ​@tripton can you try the following:

lspci -tv
echo 1 | sudo tee /sys/bus/pci/devices/0000\:xx\:yy.0/d3cold_allowed
echo 1 | sudo tee /sys/bus/pci/devices/0000\:xx\:yy.0/d3cold_allowed
echo 1 | sudo tee /sys/bus/pci/devices/0000\:xx\:yy.0/d3cold_allowed
lspci -tv
echo 1 | sudo tee /sys/bus/pci/rescan > /dev/null
lspci -tv

where xx and yy need to correspond to the three bridges (empty ports) that you see in lspci -tv.

  • If you have doubts about the port numbers, please share your lspci output with us.

Please share with us the exact same commands you have executed.
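
If it is unclear which xx:yy values to use, one option is to list the bridges that have nothing behind them straight from sysfs. The following is a minimal sketch, not part of the Voyager SDK: `list_empty_bridges` is a made-up helper name, and it assumes the standard Linux sysfs layout, where a PCI-to-PCI bridge has a class value starting with 0x0604 and downstream devices appear as subdirectories of the bridge.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the addresses of PCI bridges with no device
# behind them. The optional argument (default /sys/bus/pci/devices) exists
# only so the function can be pointed at a test directory.
list_empty_bridges() {
    local root="${1:-/sys/bus/pci/devices}"
    local dev class child has_child

    for dev in "$root"/*; do
        if [ ! -r "$dev/class" ]; then
            continue
        fi
        class=$(cat "$dev/class")

        # Class 0x0604xx is a PCI-to-PCI bridge; skip everything else.
        case "$class" in
            0x0604*) ;;
            *) continue ;;
        esac

        # Devices behind a bridge show up as subdirectories named like
        # 0000:04:00.0 inside the bridge's own directory.
        has_child=no
        for child in "$dev"/????:??:??.?; do
            if [ -d "$child" ]; then
                has_child=yes
                break
            fi
        done

        if [ "$has_child" = no ]; then
            basename "$dev"
        fi
    done
}

# Example:
#   list_empty_bridges
```

Each printed address can then be substituted for 0000:xx:yy.0 in the d3cold_allowed commands above. For the AMD lspci -tv tree posted earlier in this thread, the empty ports are the ones leading to buses 04, 05 and 09, so the output should be along the lines of 0000:03:00.0, 0000:03:01.0 and 0000:03:08.0.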


Hi Tripton,

I set up an amd machine and we ran into the same issue. What we saw was that the system had a setting for pcie to go into power saving mode.

To overcome this issue, we need to change the grub file again:

sudo vim /etc/default/grub

where we need to disable PCIe ASPM (Active State Power Management) by adding pcie_aspm=off. Your grub would, for example, look like this

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash pcie_aspm=off"

Afterwards, save the file and update grub using sudo update-grub

And then reboot

sudo reboot now

Please let me know if this works for you!


Hi everyone,

Here’s an update and my responses to the questions and suggestions:

  1. Ubuntu Version:
    I am running Ubuntu 22.04. Docker support sounds promising for future tests on 24.04, but for now I’ll stick with the supported version. Is there already an install instruction? I just saw some files on GitHub. If it would help, I could completely reinstall Ubuntu, although it is already a relatively fresh install.

  2. lspci and d3cold Commands:
    Thanks, I tried the suggested commands. You can find a link with all the commands I executed and their corresponding output here: https://pastebin.com/q7U7yFkC. For context, I installed the card into the PCIe 3.0 x16 (x4) slot on the motherboard, the one connected via the chipset, since my GPU is directly connected to the CPU (should be fine?). I wasn’t sure which index to use, so I tried them all.

  3. PCIe ASPM Setting:
    Very interesting suggestion regarding disabling PCIe ASPM by editing /etc/default/grub and adding the option. I tried this (with and without amd_iommu=off as suggested earlier), but unfortunately it did not fix the issue.

  4. Remote Access:
    I am also open to providing remote access to my machine if that helps with troubleshooting. I have no problem testing by myself, so only if it helps you. I can hook up Nano KVM so you can even access the UEFI.

Please let me know if you need any further information or if there is any additional testing I should do!

Thanks again,

Tripton


Hi Tripton!

Thank you for the extensive answer. We do, in fact, see our device now:

Device 1f9d:1100

So it seems the d3cold_allowed calls worked.

To see the Axelera device name, please run the following 

sudo update-pciids

lspci -tv
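
Until the PCI ID database is updated, the card can also be spotted by its numeric ID alone. A minimal sketch, assuming the vendor ID 1f9d from the "Device 1f9d:1100" line above; the function name `has_metis_vendor_id` is made up for illustration and simply scans `lspci -n` output.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `lspci -n` output on stdin and succeeds if any
# device reports vendor ID 1f9d (the ID shown for the Metis card in this thread).
# Expected line shape: "04:00.0 1200: 1f9d:1100 (rev 02)"
has_metis_vendor_id() {
    grep -Eiq '^[0-9a-f:.]+ [0-9a-f]{4}: 1f9d:[0-9a-f]{4}'
}

# Example:
#   lspci -n | has_metis_vendor_id && echo "Metis card enumerated on PCIe"
# lspci can also filter by vendor directly:
#   lspci -nn -d 1f9d:
```

A match only shows that the card was enumerated on the PCIe bus; it says nothing about whether the driver or the Voyager SDK runtime can talk to it.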

 

So with this device recognized, please move forward with your installation and evaluation and let us know how it goes!


Hi everyone,

Sorry, I'm stupid. I didn't realize it was working since the device name was missing. I followed your instructions, and now I see the device:

04:00.0 Processing accelerators: Axelera AI Metis AIPU (rev 02)

I also tested without the grub flags, and I can still see the device. However, when I try to run the YOLO example it doesn't work. Here's what I get:

WARNING : 4PCI device count mismatch: lspci=1, triton=0
ERROR : No devices found

And when I run explicitly with PCIe:

./inference.py yolov5s-v7-coco media/traffic1_1080p.mp4 --metis pcie

I receive:

WARNING : 4PCI device count mismatch: lspci=1, triton=0
INFO : Failed to detect device: No devices found
WARNING : This model is restricted to deploy for single-core (but can be run using multiple cores).
INFO : Deploying model yolov5s-v7-coco for 1 core. This may take a while...
|████████████████████████████████████████| 2:58.8
ERROR : list index out of range

I also reinstalled everything and noticed the following at the end of the installation:

building operators
refreshing pcie and firmware
WARNING: Failed to refresh pcie and firmware

Any idea what could be causing this? Thanks for bringing me this far; I appreciate your help and guidance!


Hello ​@tripton ,

Let’s try two things:

1

Remove and rescan devices with axdevice Voyager SDK tool

Go to your voyager-sdk folder and activate the virtual environment with source venv/bin/activate command.

Then run the following command:

axdevice --refresh

This command will:

  • remove all Axelera PCIE/M.2 devices and provoke a rescan

  • reload firmware for Axelera PCIE/M.2 devices

2

If 1 didn’t make it work, try the following.

Disable IOMMU

In some kernels, the intel_iommu=off or amd_iommu=off kernel parameters should be added at boot.

For example, in Ubuntu, the grub file needs to be modified:

sudo nano /etc/default/grub

where you need to disable IOMMU by adding intel_iommu=off or amd_iommu=off parameter:

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash intel_iommu=off"

or for AMD:

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amd_iommu=off"

Save the file, then run:

sudo update-grub 

and to apply the changes:

sudo reboot


@tripton 

If my previous message didn’t work, please do the following and try my previous message again. Please share with us the results for each step.

 

Remove old driver and install updated driver

Run the following command to check your driver version:

cat /sys/class/metis/version

It will display the version of the driver

0.07.16

Share this version and your Voyager SDK version with Axelera AI Support Team.

If an old driver version is displayed, you must remove the old driver:

sudo rmmod metis

Running Voyager SDK installer will install the latest driver for the current version of Voyager SDK:

./install.sh --all --YES
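
To decide whether the displayed driver counts as "old", the version string can be compared against the one expected for your SDK release. This is a rough sketch only: `version_older_than` is a made-up helper, it relies on `sort -V` from GNU coreutils, and the expected version has to come from Axelera's release notes or support, since it is not stated here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if version $1 sorts strictly before version $2.
version_older_than() {
    local have="$1"
    local want="$2"
    local lowest

    # Equal versions are not "older".
    if [ "$have" = "$want" ]; then
        return 1
    fi

    # sort -V orders version strings; the first line is the lower one.
    lowest=$(printf '%s\n%s\n' "$have" "$want" | sort -V | head -n 1)
    [ "$lowest" = "$have" ]
}

# Example (EXPECTED is a placeholder you must fill in yourself):
#   EXPECTED=0.07.16
#   if version_older_than "$(cat /sys/class/metis/version)" "$EXPECTED"; then
#       echo "driver is older than expected; remove it and rerun the installer"
#   fi
```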


Sadly, neither option worked for me. I still get the same error. When I run axdevice -v I see:

INFO:axelera.runtime:Found PCI device: 04:00.0 Processing accelerators: Axelera AI Metis AIPU (rev 02)
INFO:axelera.runtime:Found AIPU driver: metis 90112 0
WARNING:axelera.runtime:4PCI device count mismatch: lspci=1, triton=0

It appears that the device is seen (at least by lspci), but it's still not recognized. I checked the driver version using cat /sys/class/metis/version and it shows:

0.07.16

which I believe is the latest version. For the Voyager SDK, I’m using release/v1.2.5 from GitHub.


Hello ​@tripton , 

Just to be sure, you tried both my instructions for “Remove and rescan devices with axdevice Voyager SDK tool” and for “Disable IOMMU”, right? Can you confirm that?

 

Can you try the following and share the exact output with us for each?:

 

Check Firmware version of your Metis card [uses triton_multi_ctx]

Go to your voyager-sdk folder and activate the virtual environment with source venv/bin/activate command.

Execute the following command:

triton_multi_ctx --fwver

This will display the version of the Metis Firmware loaded in your card, for example,

Firmware version: v1.2.5

Share this version and your Voyager SDK version with Axelera AI Support Team.

 

Check Board Controller version of your Metis card [uses triton_multi_ctx]

Go to your voyager-sdk folder and activate the virtual environment with source venv/bin/activate command.

Execute the following command:

triton_multi_ctx --bc-version

This will display the version of the Board Controller in your card, for example,

zephyr version: baf12c979030

app name: board_bringup

app version: v1.0

Share this version and your Voyager SDK version with Axelera AI Support Team.

 


Hello,

I confirm that I tried both instructions. Removing and rescanning the devices with axdevice and disabling IOMMU. Unfortunately, neither option helped. I even reinstalled the driver (as suggested before), but it also didn’t help. The Voyager SDK version (release/v1.2.5) and the driver version (0.07.16) remain unchanged from my previous posts.

Here is the output for the additional commands you suggested:

(venv) tripton@tripton-ubuntu:~/repos/voyager-sdk$ triton_multi_ctx --fwver
Fail to get device name
(venv) tripton@tripton-ubuntu:~/repos/voyager-sdk$ triton_multi_ctx --bc-version
Fail to get device name
(venv) tripton@tripton-ubuntu:~/repos/voyager-sdk$ axdevice -v
INFO:axelera.runtime:Found PCI device: 04:00.0 Processing accelerators: Axelera AI Metis AIPU (rev 02)
INFO:axelera.runtime:Found AIPU driver: metis 90112 0
WARNING:axelera.runtime:4PCI device count mismatch: lspci=1, triton=0

As it appears that no device is found with triton_multi_ctx, I’m including the output from axdevice again. The device is still found by lspci, but I don't get any driver version.

Any idea what to try next?

Thanks,

Tripton


Hi ​@tripton 

I am really sorry to hear that. 

Let’s try one more thing. Please run

axdevice --refresh -v

multiple times. Run it at least 3 or 4 times and let’s see if there is any difference.

Please share the exact output with us from the 4 runs of that command.

Kind regards,

Victor 


Hey Victor,

No worries, I'm still optimistic about getting it to work. Here is the output for four consecutive runs of the command:

(venv) tripton@tripton-ubuntu:~/repos/voyager-sdk$ axdevice --refresh -v
INFO:axelera.runtime.axdevice:Removing 0000:03:00.0
INFO:axelera.runtime.axdevice:PCIE rescan
0000:04:00.0 : Axelera AI Metis AIPU (rev 02)
INFO:axelera.runtime:Found PCI device: 04:00.0 Processing accelerators: Axelera AI Metis AIPU (rev 02)
INFO:axelera.runtime:Found AIPU driver: metis 90112 0
WARNING:axelera.runtime:4PCI device count mismatch: lspci=1, triton=0
Traceback (most recent call last):
File "/home/tripton/.cache/axelera/venvs/93f45ae3/bin/axdevice", line 8, in <module>
sys.exit(entrypoint_main())
File "/home/tripton/.cache/axelera/venvs/93f45ae3/lib/python3.10/site-packages/axelera/runtime/axdevice.py", line 625, in entrypoint_main
main(args)
File "/home/tripton/.cache/axelera/venvs/93f45ae3/lib/python3.10/site-packages/axelera/runtime/axdevice.py", line 608, in main
devices = _find_devices(found_devices, device_id)
File "/home/tripton/.cache/axelera/venvs/93f45ae3/lib/python3.10/site-packages/axelera/runtime/axdevice.py", line 187, in _find_devices
raise RuntimeError("No devices found, use -v for more information")
RuntimeError: No devices found, use -v for more information
(venv) tripton@tripton-ubuntu:~/repos/voyager-sdk$ axdevice --refresh -v
INFO:axelera.runtime.axdevice:Removing 0000:03:00.0
INFO:axelera.runtime.axdevice:PCIE rescan
0000:04:00.0 : Axelera AI Metis AIPU (rev 02)
INFO:axelera.runtime:Found PCI device: 04:00.0 Processing accelerators: Axelera AI Metis AIPU (rev 02)
INFO:axelera.runtime:Found AIPU driver: metis 90112 0
WARNING:axelera.runtime:4PCI device count mismatch: lspci=1, triton=0
Traceback (most recent call last):
File "/home/tripton/.cache/axelera/venvs/93f45ae3/bin/axdevice", line 8, in <module>
sys.exit(entrypoint_main())
File "/home/tripton/.cache/axelera/venvs/93f45ae3/lib/python3.10/site-packages/axelera/runtime/axdevice.py", line 625, in entrypoint_main
main(args)
File "/home/tripton/.cache/axelera/venvs/93f45ae3/lib/python3.10/site-packages/axelera/runtime/axdevice.py", line 608, in main
devices = _find_devices(found_devices, device_id)
File "/home/tripton/.cache/axelera/venvs/93f45ae3/lib/python3.10/site-packages/axelera/runtime/axdevice.py", line 187, in _find_devices
raise RuntimeError("No devices found, use -v for more information")
RuntimeError: No devices found, use -v for more information
(venv) tripton@tripton-ubuntu:~/repos/voyager-sdk$ axdevice --refresh -v
INFO:axelera.runtime.axdevice:Removing 0000:03:00.0
INFO:axelera.runtime.axdevice:PCIE rescan
0000:04:00.0 : Axelera AI Metis AIPU (rev 02)
INFO:axelera.runtime:Found PCI device: 04:00.0 Processing accelerators: Axelera AI Metis AIPU (rev 02)
INFO:axelera.runtime:Found AIPU driver: metis 90112 0
WARNING:axelera.runtime:4PCI device count mismatch: lspci=1, triton=0
Traceback (most recent call last):
File "/home/tripton/.cache/axelera/venvs/93f45ae3/bin/axdevice", line 8, in <module>
sys.exit(entrypoint_main())
File "/home/tripton/.cache/axelera/venvs/93f45ae3/lib/python3.10/site-packages/axelera/runtime/axdevice.py", line 625, in entrypoint_main
main(args)
File "/home/tripton/.cache/axelera/venvs/93f45ae3/lib/python3.10/site-packages/axelera/runtime/axdevice.py", line 608, in main
devices = _find_devices(found_devices, device_id)
File "/home/tripton/.cache/axelera/venvs/93f45ae3/lib/python3.10/site-packages/axelera/runtime/axdevice.py", line 187, in _find_devices
raise RuntimeError("No devices found, use -v for more information")
RuntimeError: No devices found, use -v for more information
(venv) tripton@tripton-ubuntu:~/repos/voyager-sdk$ axdevice --refresh -v
INFO:axelera.runtime.axdevice:Removing 0000:03:00.0
INFO:axelera.runtime.axdevice:PCIE rescan
0000:04:00.0 : Axelera AI Metis AIPU (rev 02)
INFO:axelera.runtime:Found PCI device: 04:00.0 Processing accelerators: Axelera AI Metis AIPU (rev 02)
INFO:axelera.runtime:Found AIPU driver: metis 90112 0
WARNING:axelera.runtime:4PCI device count mismatch: lspci=1, triton=0
Traceback (most recent call last):
File "/home/tripton/.cache/axelera/venvs/93f45ae3/bin/axdevice", line 8, in <module>
sys.exit(entrypoint_main())
File "/home/tripton/.cache/axelera/venvs/93f45ae3/lib/python3.10/site-packages/axelera/runtime/axdevice.py", line 625, in entrypoint_main
main(args)
File "/home/tripton/.cache/axelera/venvs/93f45ae3/lib/python3.10/site-packages/axelera/runtime/axdevice.py", line 608, in main
devices = _find_devices(found_devices, device_id)
File "/home/tripton/.cache/axelera/venvs/93f45ae3/lib/python3.10/site-packages/axelera/runtime/axdevice.py", line 187, in _find_devices
raise RuntimeError("No devices found, use -v for more information")
RuntimeError: No devices found, use -v for more information
(venv) tripton@tripton-ubuntu:~/repos/voyager-sdk$

It looks kind of weird that the PCIe device at 03:00.0 was removed while the device appears at 04:00.0. I have no clue if that matters, but I thought I would share the logs as they are.

Thanks, Tripton


Thank you for sharing the information ​@tripton .

  • 03:00.0 was removed because it intentionally removes the bridge into which the Metis card (at 04:00.0) is connected. So that is fine and expected.

Unfortunately, it was not possible to read the firmware and board controller versions from the card; knowing them would have helped.

  • I suspect that the card might have an old firmware or an old board controller, as that would explain this behaviour. But it is only a suspicion that I cannot confirm with the current information.
  • Of course, if the card did have an old firmware or board controller, that would have been a mistake.

I am checking internally with our experts how we can proceed or if there is anything else we can try.

I will update you as soon as I have any news. Again, we apologise for these difficulties.

Regards,

Victor


Hello ​@tripton ,

I have received further feedback.

Please run the following command and share the output with us:

cat /proc/cmdline
 

This shows the exact kernel parameters passed to the currently running kernel.

  • If you see pcie_aspm=off in the output, it means the parameter was successfully applied during boot.
  • If it is not in the output, the host may be using a different boot method like cloud-init.
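
The check described above can also be scripted. A minimal sketch; the function name `cmdline_has_param` is made up for illustration, and the optional second argument exists only so the check can be tried on a sample string instead of the running kernel.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if the given parameter appears as a whole word
# in a kernel command line. Defaults to the running kernel's /proc/cmdline.
cmdline_has_param() {
    local param="$1"
    local cmdline="${2:-$(cat /proc/cmdline)}"
    local word

    for word in $cmdline; do
        if [ "$word" = "$param" ]; then
            return 0
        fi
    done
    return 1
}

# Example:
#   for p in pcie_aspm=off amd_iommu=off; do
#       if cmdline_has_param "$p"; then echo "$p: applied"; else echo "$p: missing"; fi
#   done
```

Matching whole words avoids a false positive when one parameter is a prefix of another.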

Kind regards,

Victor

