Ah, interesting issue, thanks for bringing it to attention @tripton!
Bit out of my expertise, but let’s see what we can find out for you.
Help is on its way!
Hi @tripton
Thank you for your question. Indeed, sometimes the kernel driver for Metis gets messed up. Generally, the following commands help me in such a case:
sudo apt remove metis-dkms
sudo rmmod metis
After which installing the sdk again should help your lspci to recognize the card!
//EDIT do not get scared by not-found errors with these uninstall commands
Hello @tripton .
Please try the following:
Add the following kernel parameter at boot:
- On Intel CPUs: intel_iommu=off
- On AMD CPUs: amd_iommu=off
For example, on Ubuntu installations, this may be done in the /etc/default/grub file by adding the parameter to the GRUB_CMDLINE_LINUX_DEFAULT line.
- Then run: sudo update-grub
- Then reboot: sudo reboot
Let us know if that solved your issue.
Also, please be sure secure boot is disabled.
Kind regards,
Victor
Axelera AI, Customer Engineer
Thank you everyone for your suggestions, but unfortunately none of them helped. lspci still doesn't show the card. I also can't find any related output in dmesg (I'm not even sure what I should be looking for). Secure boot is definitely disabled and, as suggested, I also disabled IOMMU.
Any other ideas?
Hmm, sorry that’s not got things there yet. I’ve just been having a look around in some of the internal docs, to see if I can spot anything (bit out of my specialty). Running sudo dmesg should print the message buffer of the kernel, by the looks of things.
Maybe if you could try that, and post what you see in here? Might give us some ideas on what to try next?
Hey, thanks for looking into this. When I purchased the card, I knew it was a new company doing important work, so no worries at all. I appreciate the effort to diagnose the issue.
I ran sudo dmesg and got quite a lot of output. However, I received an error stating "The text may contain a maximum of 30000 characters." when trying to post. Because of that, I've uploaded the logs to Pastebin:
https://pastebin.com/YE3XYeBy
Let me know if you need any more information or if there is something else I can try.
Really appreciate that @tripton, and it’s more helpful than we can say, having early adopters like yourself putting the effort in and joining us so early in the journey!
And thanks for the dmesg output - let’s see what we can fathom from that 
Hi Tripton!
Have you run the install script in your sdk again after running the remove dkms and rmmod metis calls?
I think I was unclear (edited now): you shouldn’t reinstall your card in a physical way, but you should reinstall the sdk.
Sorry for the confusion.
//EDIT in the meantime I’ll have a look into your Dmesg output, see if anything stands out
Hi Tripton,
I have a few more questions
- Did you try both a reboot and a power cycle? Sometimes one of the two can help with lspci issues (they behave differently).
- Could you post your lspci -tv output to see if the link is detected during BIOS enumeration?
- If the link is being detected there, could you disable d3cold on the pcie slot (in your BIOS) and force a rescan?
By the way, you can rescan your pcie by going into sudo bash
and running the following command
echo 1 > /sys/bus/pci/rescan
Hi everyone,
Sorry, my mistake—I only uninstalled the driver and forgot to reinstall it. I've reinstalled it, so if that affects the dmesg output, I've uploaded the current logs (https://pastebin.com/7zjnDCTE). Also, for lspci -tv, I still can’t see the card. But for completeness, I've included the lspci -tv output below:
-[0000:00]-+-00.0 Advanced Micro Devices, Inc. [AMD] Starship/Matisse Root Complex
+-00.2 Advanced Micro Devices, Inc. [AMD] Starship/Matisse IOMMU
+-01.0 Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
+-01.1-[01]----00.0 Sandisk Corp WD Black SN750 / PC SN730 NVMe SSD
+-01.2-[02-0a]--+-00.0 Advanced Micro Devices, Inc. [AMD] Device 43ee
| +-00.1 Advanced Micro Devices, Inc. [AMD] Device 43eb
| \-00.2-[03-0a]--+-00.0-[04]--
| +-01.0-[05]--
| +-02.0-[06]----00.0 Device 1af2:a001
| +-03.0-[07]----00.0 Renesas Technology Corp. uPD720201 USB 3.0 Host Controller
| +-04.0-[08]----00.0 Sandisk Corp WD Blue SN550 NVMe SSD
| +-08.0-[09]--
| \-09.0-[0a]----00.0 Intel Corporation Ethernet Controller I225-V
+-02.0 Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
+-03.0 Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
+-03.1-[0b]--+-00.0 NVIDIA Corporation GA102 [GeForce RTX 3080]
| \-00.1 NVIDIA Corporation GA102 High Definition Audio Controller
+-04.0 Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
+-05.0 Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
+-07.0 Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
+-07.1-[0c]----00.0 Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Function
+-08.0 Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
+-08.1-[0d]--+-00.0 Advanced Micro Devices, Inc. [AMD] Starship/Matisse Reserved SPP
| +-00.1 Advanced Micro Devices, Inc. [AMD] Starship/Matisse Cryptographic Coprocessor PSPCPP
| +-00.3 Advanced Micro Devices, Inc. [AMD] Matisse USB 3.0 Host Controller
| \-00.4 Advanced Micro Devices, Inc. [AMD] Starship/Matisse HD Audio Controller
+-14.0 Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller
+-14.3 Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge
+-18.0 Advanced Micro Devices, Inc. [AMD] Matisse/Vermeer Data Fabric: Device 18h; Function 0
+-18.1 Advanced Micro Devices, Inc. [AMD] Matisse/Vermeer Data Fabric: Device 18h; Function 1
+-18.2 Advanced Micro Devices, Inc. [AMD] Matisse/Vermeer Data Fabric: Device 18h; Function 2
+-18.3 Advanced Micro Devices, Inc. [AMD] Matisse/Vermeer Data Fabric: Device 18h; Function 3
+-18.4 Advanced Micro Devices, Inc. [AMD] Matisse/Vermeer Data Fabric: Device 18h; Function 4
+-18.5 Advanced Micro Devices, Inc. [AMD] Matisse/Vermeer Data Fabric: Device 18h; Function 5
+-18.6 Advanced Micro Devices, Inc. [AMD] Matisse/Vermeer Data Fabric: Device 18h; Function 6
\-18.7 Advanced Micro Devices, Inc. [AMD] Matisse/Vermeer Data Fabric: Device 18h; Function 7
I'm currently only testing on the AMD configuration. If needed, I can provide the corresponding logs for the Intel setup as well.
Thanks for letting me know how to rescan (I tried that with no luck). Please let me know if there’s anything else I should try!
Thanks again
Hi,
While we’re processing your response, I remembered that we never asked which ubuntu version you’re running. So, which version of ubuntu are you running?
We officially support version 22.04. Official support for 24.04 will be released later, but for now we recommend using Docker for version 24.04.
Hi @tripton can you try the following:
lspci -tv
echo 1 | sudo tee /sys/bus/pci/devices/0000\:xx\:yy.0/d3cold_allowed
echo 1 | sudo tee /sys/bus/pci/devices/0000\:xx\:yy.0/d3cold_allowed
echo 1 | sudo tee /sys/bus/pci/devices/0000\:xx\:yy.0/d3cold_allowed
lspci -tv
echo 1 | sudo tee /sys/bus/pci/rescan > /dev/null
lspci -tv
where xx and yy need to correspond to the three bridges (empty ports) that you see in lspci -tv.
- If you have doubts of the ports numbers, please share lspci output with us.
Please share with us the exact same commands you have executed.
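In case the xx/yy placeholders are unclear, here is a minimal sketch that only prints the commands for a list of bridges instead of running them, so the result can be checked before anything is written to sysfs. The helper names d3cold_path and print_d3cold_commands are made up for this illustration, and the addresses 03:00, 03:01 and 03:08 are an assumption based on the bridges that appear to have nothing behind them in the lspci -tv output posted above; replace them with the ones from your own system.

```shell
#!/bin/sh
# Build the sysfs path of the d3cold_allowed attribute for a PCI function.
# Arguments: bus (xx) and device (yy) as two-digit hex strings from lspci -tv.
d3cold_path() {
    printf '/sys/bus/pci/devices/0000:%s:%s.0/d3cold_allowed\n' "$1" "$2"
}

# Print (do not run) one command per bridge, each bridge given as "xx:yy".
print_d3cold_commands() {
    for bridge in "$@"; do
        bus=${bridge%%:*}   # part before the colon
        dev=${bridge##*:}   # part after the colon
        printf 'echo 1 | sudo tee %s\n' "$(d3cold_path "$bus" "$dev")"
    done
}

# Assumed example addresses; replace them with your own empty bridges.
print_d3cold_commands 03:00 03:01 03:08
```

Copy the printed lines into the terminal once they point at the right bridges, then run the rescan as described above.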
Hi Tripton,
I set up an amd machine and we ran into the same issue. What we saw was that the system had a setting for pcie to go into power saving mode.
To overcome this issue, we need to change the grub file again:
sudo vim /etc/default/grub
where we need to disable PCIe ASPM (Active State Power Management) by adding pcie_aspm=off. Your grub file would, for example, look like this:
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash pcie_aspm=off"
Afterwards, save the file and update grub using sudo update-grub
And then reboot
sudo reboot now
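If you would rather not edit the file by hand, the same change can be sketched as a small shell function. This is only an illustration under a few assumptions: add_grub_param is a made-up helper name, it relies on GNU sed (as shipped with Ubuntu), and the demo below works on a temporary file rather than the real /etc/default/grub.

```shell
#!/bin/sh
# Append a kernel parameter to GRUB_CMDLINE_LINUX_DEFAULT in the given file,
# unless the line already contains it. The file is edited in place.
add_grub_param() {
    file=$1
    param=$2
    line=$(grep '^GRUB_CMDLINE_LINUX_DEFAULT=' "$file")
    # Take the text between the double quotes and look for the parameter.
    case " $(printf '%s' "$line" | cut -d'"' -f2) " in
        *" $param "*) return 0 ;;   # already present, nothing to do
    esac
    sed -i -E "s/^(GRUB_CMDLINE_LINUX_DEFAULT=\"[^\"]*)\"/\\1 $param\"/" "$file"
}

# Demo on a temporary file so that nothing on the system is modified.
tmp=$(mktemp)
printf 'GRUB_DEFAULT=0\nGRUB_CMDLINE_LINUX_DEFAULT="quiet splash"\n' > "$tmp"
add_grub_param "$tmp" pcie_aspm=off
add_grub_param "$tmp" pcie_aspm=off   # second call changes nothing
grep '^GRUB_CMDLINE_LINUX_DEFAULT=' "$tmp"
rm -f "$tmp"
```

To use it on the real file you would need root and should keep a backup copy of /etc/default/grub first; sudo update-grub and a reboot are still required afterwards.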
Please let me know if this works for you!
Hi everyone,
Here’s an update and my responses to the questions and suggestions:
-
Ubuntu Version:
I am running Ubuntu 22.04. Docker support sounds promising for future tests on 24.04, but for now I’ll stick with the supported version. Is there already an install instruction? I just saw some files on GitHub. If it would help, I could completely reinstall Ubuntu, although it is already a relatively fresh install.
-
lspci and d3cold Commands:
Thanks, I tried the suggested commands. You can find a link with all the commands I executed and their corresponding output here: https://pastebin.com/q7U7yFkC. For context, I installed the card into the PCIe 3.0 x16 (x4) slot on the motherboard, the one connected via the chipset, since my GPU is directly connected to the CPU (should be fine?). I wasn’t sure which index to use, so I tried them all.
-
PCIe ASPM Setting:
Very interesting suggestion regarding disabling PCIe ASPM by editing /etc/default/grub and adding the option. I tried this (with and without amd_iommu=off, as suggested earlier), but unfortunately it did not fix the issue.
-
Remote Access:
I am also open to providing remote access to my machine if that helps with troubleshooting. I have no problem testing by myself, so only if it helps you. I can hook up Nano KVM so you can even access the UEFI.
Please let me know if you need any further information or if there is any additional testing I should do!
Thanks again,
Tripton
Hi Tripton!
Thank you for the extensive answer. We do, in fact, see our device now:
Device 1f9d:1100
So it seems the d3cold_allowed calls worked.
To see the Axelera device name, please run the following
sudo update-pciids
lspci -tv
So with this device recognized, please move forward with your installation and evaluation and let us know how it goes!
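Since the device shows up only as a numeric ID until the PCI ID database is updated, it can be easier to search for the vendor ID directly. The sketch below counts matching entries in lspci -n style output. The vendor ID 1f9d comes from the Device 1f9d:1100 line above; the helper name count_vendor_devices and the two sample lines are assumptions made up so the example runs without the hardware.

```shell
#!/bin/sh
# Count lines of `lspci -n` style output whose vendor ID matches.
# Reads the listing on stdin; the vendor ID is the first argument.
count_vendor_devices() {
    awk -v vendor="$1" '
        # Field 3 of `lspci -n` output is "vendor:device", e.g. 1f9d:1100.
        { split($3, id, ":"); if (id[1] == vendor) n++ }
        END { print n + 0 }
    '
}

# On a real system: lspci -n | count_vendor_devices 1f9d
# Here an illustrative sample is piped in instead.
printf '%s\n' \
    '0b:00.0 0300: 10de:2206 (rev a1)' \
    '04:00.0 1200: 1f9d:1100 (rev 02)' | count_vendor_devices 1f9d
```

A count of 0 on the real system would suggest the card is not enumerated at all, while 1 or more means the PCIe side is working and any remaining problem is further up the stack.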
Hi everyone,
Sorry, I'm stupid. I didn't realize it was working since the device name was missing. I followed your instructions, and now I see the device:
04:00.0 Processing accelerators: Axelera AI Metis AIPU (rev 02)
I also tested without the grub flags, and I can still see the device. However, when I try to run the YOLO example it doesn't work. Here's what I get:
WARNING : 4PCI device count mismatch: lspci=1, triton=0
ERROR : No devices found
And when I run explicitly with PCIe:
./inference.py yolov5s-v7-coco media/traffic1_1080p.mp4 --metis pcie
I receive:
WARNING : 4PCI device count mismatch: lspci=1, triton=0
INFO : Failed to detect device: No devices found
WARNING : This model is restricted to deploy for single-core (but can be run using multiple cores).
INFO : Deploying model yolov5s-v7-coco for 1 core. This may take a while...
|████████████████████████████████████████| 2:58.8
ERROR : list index out of range
I also reinstalled everything and noticed the following at the end of the installation:
building operators
refreshing pcie and firmware
WARNING: Failed to refresh pcie and firmware
Any idea what could be causing this? Thanks for bringing me this far; I appreciate your help and guidance!
Hello @tripton ,
Let’s try two things:
1
Go to your voyager-sdk folder and activate the virtual environment with source venv/bin/activate
command.
Then run the following command:
axdevice --refresh
This command will:
2
If 1 didn’t make it work, try the following.
Disable IOMMU
In some kernels, the intel_iommu=off
or amd_iommu=off
kernel parameters should be added at boot.
For example, in Ubuntu, the grub
file needs to be modified:
sudo nano /etc/default/grub
where you need to disable IOMMU by adding intel_iommu=off
or amd_iommu=off
parameter:
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash intel_iommu=off"
or for AMD:
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amd_iommu=off"
Save the file, then run:
sudo update-grub
and to apply the changes:
sudo reboot
@tripton
If my previous message didn’t work, please do the following and try my previous message again. Please share with us the results for each step.
Remove old driver and install updated driver
Run the following command to check your driver version:
cat /sys/class/metis/version
It will display the version of the driver
0.07.16
Share this version and your Voyager SDK version with Axelera AI Support Team.
If an old driver version is displayed, you must remove the old driver:
sudo rmmod metis
Running Voyager SDK installer will install the latest driver for the current version of Voyager SDK:
./install.sh --all --YES
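Before and after reinstalling, it may help to confirm that the driver module is actually loaded. The following is a rough sketch, not an official tool: module_loaded is a made-up helper that reads /proc/modules (the same information lsmod shows), and it assumes the module is named metis and exposes /sys/class/metis/version, as in the commands above.

```shell
#!/bin/sh
# Return success if a kernel module with the given name is listed.
# $1: module name, $2: modules listing to read (defaults to /proc/modules).
module_loaded() {
    awk -v name="$1" '$1 == name { found = 1 } END { exit !found }' \
        "${2:-/proc/modules}"
}

if module_loaded metis; then
    echo "metis module is loaded"
    # The version file is only expected to exist while the driver is loaded.
    if [ -r /sys/class/metis/version ]; then
        cat /sys/class/metis/version
    fi
else
    echo "metis module is not loaded"
fi
```

If the module is reported as not loaded right after the installer finishes, that is worth including in the output you share.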
Sadly, neither option worked for me. I still get the same error. When I run axdevice -v
I see:
INFO:axelera.runtime:Found PCI device: 04:00.0 Processing accelerators: Axelera AI Metis AIPU (rev 02)
INFO:axelera.runtime:Found AIPU driver: metis 90112 0
WARNING:axelera.runtime:4PCI device count mismatch: lspci=1, triton=0
It appears that the device is seen (at least by lspci), but it's still not recognized. I checked the driver version using cat /sys/class/metis/version
and it shows:
0.07.16
which I believe is the latest version. For the Voyager SDK, I’m using release/v1.2.5 from GitHub.
Hello @tripton ,
Just to be sure, you tried both my instructions for “Remove and rescan devices with axdevice Voyager SDK tool” and for “Disable IOMMU”, right? Can you confirm that?
Can you try the following and share the exact output with us for each?:
Check Firmware version of your Metis card [uses triton_multi_ctx]
Go to your voyager-sdk folder and activate the virtual environment with source venv/bin/activate
command.
Execute the following command:
triton_multi_ctx --fwver
This will display the version of the Metis Firmware loaded in your card, for example,
Firmware version: v1.2.5
Share this version and your Voyager SDK version with Axelera AI Support Team.
Check Board Controller version of your Metis card [uses triton_multi_ctx]
Go to your voyager-sdk folder and activate the virtual environment with source venv/bin/activate
command.
Execute the following command:
triton_multi_ctx --bc-version
This will display the version of the Board Controller in your card, for example,
zephyr version: baf12c979030
app name: board_bringup
app version: v1.0
Share this version and your Voyager SDK version with Axelera AI Support Team.
Hello,
I confirm that I tried both instructions. Removing and rescanning the devices with axdevice and disabling IOMMU. Unfortunately, neither option helped. I even reinstalled the driver (as suggested before), but it also didn’t help. The Voyager SDK version (release/v1.2.5) and the driver version (0.07.16) remain unchanged from my previous posts.
Here is the output for the additional commands you suggested:
(venv) tripton@tripton-ubuntu:~/repos/voyager-sdk$ triton_multi_ctx --fwver
Fail to get device name
(venv) tripton@tripton-ubuntu:~/repos/voyager-sdk$ triton_multi_ctx --bc-version
Fail to get device name
(venv) tripton@tripton-ubuntu:~/repos/voyager-sdk$ axdevice -v
INFO:axelera.runtime:Found PCI device: 04:00.0 Processing accelerators: Axelera AI Metis AIPU (rev 02)
INFO:axelera.runtime:Found AIPU driver: metis 90112 0
WARNING:axelera.runtime:4PCI device count mismatch: lspci=1, triton=0
As it appears that no device is found with triton_multi_ctx, I’m including the output from axdevice again. The device is still found by lspci, but I don't get any driver version.
Any idea what to try next?
Thanks,
Tripton
Hi @tripton
I am really sorry to hear that.
Let’s try one more thing. Please run
axdevice --refresh -v
multiple times. Run it at least 3 or 4 times and let’s see if there is any difference.
Please share the exact output with us from the 4 runs of that command.
Kind regards,
Victor
Hey Victor,
No worries, I'm still optimistic about getting it to work. Here is the output for four consecutive runs of the command:
(venv) tripton@tripton-ubuntu:~/repos/voyager-sdk$ axdevice --refresh -v
INFO:axelera.runtime.axdevice:Removing 0000:03:00.0
INFO:axelera.runtime.axdevice:PCIE rescan
0000:04:00.0 : Axelera AI Metis AIPU (rev 02)
INFO:axelera.runtime:Found PCI device: 04:00.0 Processing accelerators: Axelera AI Metis AIPU (rev 02)
INFO:axelera.runtime:Found AIPU driver: metis 90112 0
WARNING:axelera.runtime:4PCI device count mismatch: lspci=1, triton=0
Traceback (most recent call last):
File "/home/tripton/.cache/axelera/venvs/93f45ae3/bin/axdevice", line 8, in <module>
sys.exit(entrypoint_main())
File "/home/tripton/.cache/axelera/venvs/93f45ae3/lib/python3.10/site-packages/axelera/runtime/axdevice.py", line 625, in entrypoint_main
main(args)
File "/home/tripton/.cache/axelera/venvs/93f45ae3/lib/python3.10/site-packages/axelera/runtime/axdevice.py", line 608, in main
devices = _find_devices(found_devices, device_id)
File "/home/tripton/.cache/axelera/venvs/93f45ae3/lib/python3.10/site-packages/axelera/runtime/axdevice.py", line 187, in _find_devices
raise RuntimeError("No devices found, use -v for more information")
RuntimeError: No devices found, use -v for more information
(venv) tripton@tripton-ubuntu:~/repos/voyager-sdk$ axdevice --refresh -v
INFO:axelera.runtime.axdevice:Removing 0000:03:00.0
INFO:axelera.runtime.axdevice:PCIE rescan
0000:04:00.0 : Axelera AI Metis AIPU (rev 02)
INFO:axelera.runtime:Found PCI device: 04:00.0 Processing accelerators: Axelera AI Metis AIPU (rev 02)
INFO:axelera.runtime:Found AIPU driver: metis 90112 0
WARNING:axelera.runtime:4PCI device count mismatch: lspci=1, triton=0
Traceback (most recent call last):
File "/home/tripton/.cache/axelera/venvs/93f45ae3/bin/axdevice", line 8, in <module>
sys.exit(entrypoint_main())
File "/home/tripton/.cache/axelera/venvs/93f45ae3/lib/python3.10/site-packages/axelera/runtime/axdevice.py", line 625, in entrypoint_main
main(args)
File "/home/tripton/.cache/axelera/venvs/93f45ae3/lib/python3.10/site-packages/axelera/runtime/axdevice.py", line 608, in main
devices = _find_devices(found_devices, device_id)
File "/home/tripton/.cache/axelera/venvs/93f45ae3/lib/python3.10/site-packages/axelera/runtime/axdevice.py", line 187, in _find_devices
raise RuntimeError("No devices found, use -v for more information")
RuntimeError: No devices found, use -v for more information
(venv) tripton@tripton-ubuntu:~/repos/voyager-sdk$ axdevice --refresh -v
INFO:axelera.runtime.axdevice:Removing 0000:03:00.0
INFO:axelera.runtime.axdevice:PCIE rescan
0000:04:00.0 : Axelera AI Metis AIPU (rev 02)
INFO:axelera.runtime:Found PCI device: 04:00.0 Processing accelerators: Axelera AI Metis AIPU (rev 02)
INFO:axelera.runtime:Found AIPU driver: metis 90112 0
WARNING:axelera.runtime:4PCI device count mismatch: lspci=1, triton=0
Traceback (most recent call last):
File "/home/tripton/.cache/axelera/venvs/93f45ae3/bin/axdevice", line 8, in <module>
sys.exit(entrypoint_main())
File "/home/tripton/.cache/axelera/venvs/93f45ae3/lib/python3.10/site-packages/axelera/runtime/axdevice.py", line 625, in entrypoint_main
main(args)
File "/home/tripton/.cache/axelera/venvs/93f45ae3/lib/python3.10/site-packages/axelera/runtime/axdevice.py", line 608, in main
devices = _find_devices(found_devices, device_id)
File "/home/tripton/.cache/axelera/venvs/93f45ae3/lib/python3.10/site-packages/axelera/runtime/axdevice.py", line 187, in _find_devices
raise RuntimeError("No devices found, use -v for more information")
RuntimeError: No devices found, use -v for more information
(venv) tripton@tripton-ubuntu:~/repos/voyager-sdk$ axdevice --refresh -v
INFO:axelera.runtime.axdevice:Removing 0000:03:00.0
INFO:axelera.runtime.axdevice:PCIE rescan
0000:04:00.0 : Axelera AI Metis AIPU (rev 02)
INFO:axelera.runtime:Found PCI device: 04:00.0 Processing accelerators: Axelera AI Metis AIPU (rev 02)
INFO:axelera.runtime:Found AIPU driver: metis 90112 0
WARNING:axelera.runtime:4PCI device count mismatch: lspci=1, triton=0
Traceback (most recent call last):
File "/home/tripton/.cache/axelera/venvs/93f45ae3/bin/axdevice", line 8, in <module>
sys.exit(entrypoint_main())
File "/home/tripton/.cache/axelera/venvs/93f45ae3/lib/python3.10/site-packages/axelera/runtime/axdevice.py", line 625, in entrypoint_main
main(args)
File "/home/tripton/.cache/axelera/venvs/93f45ae3/lib/python3.10/site-packages/axelera/runtime/axdevice.py", line 608, in main
devices = _find_devices(found_devices, device_id)
File "/home/tripton/.cache/axelera/venvs/93f45ae3/lib/python3.10/site-packages/axelera/runtime/axdevice.py", line 187, in _find_devices
raise RuntimeError("No devices found, use -v for more information")
RuntimeError: No devices found, use -v for more information
(venv) tripton@tripton-ubuntu:~/repos/voyager-sdk$
It looks kind of weird that the PCIe device at 03:00.0 was removed while the device appears at 04:00.0. I have no clue if that matters, but I thought I would share the logs as they are.
Thanks, Tripton
Thank you for sharing the information @tripton .
- 03:00.0 was removed because it intentionally removes the bridge into which the Metis card (at 04:00.0) is connected. So that is fine and expected.
Unfortunately, it was not possible to read the firmware and board controller versions of the board, which would have helped.
- I suspect that the card might have old firmware or an old board controller, as that would explain this behaviour. But it is only a suspicion that I cannot confirm with the current information.
- Of course, if the card had old firmware or an old board controller, it would have been a mistake.
I am checking internally with our experts how we can proceed or if there is anything else we can try.
I will update you as soon as I have any news. Again, we apologise for these difficulties.
Regards,
Victor
Hello @tripton ,
I have received further feedback.
Please run the following command and share the output with us:
cat /proc/cmdline
This shows the exact kernel parameters passed to the currently running kernel.
- If you see pcie_aspm=off in the output, it means the parameter was successfully applied during boot.
- If it is not in the output, the host may be using a different boot method like cloud-init.
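To check several parameters in one go, the comparison can be sketched as a small function. This is just an illustration: has_kernel_param is a made-up helper name, and the three parameters in the loop are the ones mentioned earlier in this thread.

```shell
#!/bin/sh
# Return success if the kernel command line contains the exact parameter.
# $1: parameter, $2: command line text (defaults to the running kernel's).
has_kernel_param() {
    cmdline=${2-$(cat /proc/cmdline)}
    case " $cmdline " in
        *" $1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

for p in pcie_aspm=off amd_iommu=off intel_iommu=off; do
    if has_kernel_param "$p"; then
        echo "$p: active"
    else
        echo "$p: not set"
    fi
done
```

The function matches whole space-separated words only, so a parameter that is merely a prefix of another one is not reported as active.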
Kind regards,
Victor