Hi all,

I am trying to enable the use of the Metis PCIe rev02 on RISC-V hosts, in particular the SiFive P550. So far I have (at least I think so) successfully installed the drivers, and the device is correctly identified and mapped in the system. I attach the logs just to be sure:

dmesg (after rescan):

[  858.500785] pci 0000:01:00.0: [1f9d:1100] type 00 class 0x120000
[  858.500843] pci 0000:01:00.0: reg 0x10: [mem 0x04380000-0x04380fff 64bit]
[  858.500865] pci 0000:01:00.0: reg 0x18: [mem 0x08000000-0x09ffffff]
[  858.500922] pci 0000:01:00.0: reg 0x30: [mem 0x00000000-0x0000ffff pref]
[  858.500943] pci 0000:01:00.0: Max Payload Size set to 512 (was 128, max 512)
[  858.501115] pci 0000:01:00.0: supports D1
[  858.501121] pci 0000:01:00.0: PME# supported from D0 D1 D3hot
[  858.516166] pci_bus 0000:01: busn_res: [bus 01] end is updated to 01
[  858.516197] pcieport 0000:00:00.0: BAR 14: assigned [mem 0x42000000-0x44ffffff]
[  858.516209] pci 0000:01:00.0: BAR 2: assigned [mem 0x42000000-0x43ffffff]
[  858.516222] pci 0000:01:00.0: BAR 6: assigned [mem 0x44000000-0x4400ffff pref]
[  858.516229] pci 0000:01:00.0: BAR 0: assigned [mem 0x44010000-0x44010fff 64bit]
[  858.518612] metis: loading out-of-tree module taints kernel.
[  858.518630] metis: module verification failed: signature and/or required key missing - tainting kernel
[  858.521107] pci 0000:01:00.0: Found target device: TRITON_OMEGA_DEVICE_ID
[  858.521116] pci 0000:01:00.0: Found target device: 0000:01:00.0
[  858.521123] pcieport 0000:00:00.0: Found bridge device: 0000:00:00.0
[  858.521130] Invalid memory base and limit values: base=0xfff00000, limit=0x0
[  858.528281] axl: Bridge not reset becuse of a previously reported error: 4294967274
[  858.528286] axl: This is not fatal and is normal for passtrough devices
[  858.528290] axl: The module will continue to load without attempting bridge reset
[  858.528371] triton: root directory for triton
[  858.532286] axl 0000:01:00.0: Adding to iommu group 5
[  858.532379] axl 0000:01:00.0: enabling device (0000 -> 0002)
[  858.533898] axl 0000:01:00.0: MSI registered 32 (1)
[  858.533909] axl 0000:01:00.0: irq vec number 92
[  858.534011] axl 0000:01:00.0: Init irq handler for single msi
[  858.541250] axl 0000:01:00.0: Data Link Layer Link Active Reporting capability
[  858.541445] axl 0000:01:00.0: Register directory 0000:01:000
[  858.541598] Triton Linux Driver, version 0.07.16, init OK
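
A side note on the "Bridge not reset" line in the dmesg output: the value 4294967274 looks like a negative kernel error code printed as an unsigned 32-bit integer. Below is a minimal Python sketch of that reinterpretation. It assumes the driver keeps the error in a 32-bit int, and the helper name `as_signed32` is made up for illustration.

```python
import errno
import os

def as_signed32(value: int) -> int:
    """Reinterpret an unsigned 32-bit value as a signed (two's complement) one."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value

code = as_signed32(4294967274)
print(code)                         # -22
print(errno.errorcode.get(-code))   # symbolic errno name on this platform
print(os.strerror(-code))           # human-readable description
```

On Linux, errno 22 is EINVAL. That would be consistent with the "Invalid memory base and limit values" message printed just before it, although that link is only a guess from the log order.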

 

lspci

01:00.0 Processing accelerators: Axelera AI Metis AIPU (rev 02)
        Subsystem: Axelera AI Metis AIPU (rev 02)
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin A routed to IRQ 92
        IOMMU group: 5
        Region 0: Memory at 44010000 (64-bit, non-prefetchable) [size=4K]
        Region 2: Memory at 42000000 (32-bit, non-prefetchable) [size=32M]
        Expansion ROM at 44000000 [virtual] [disabled] [size=64K]
        Capabilities: <access denied>
        Kernel driver in use: axl
        Kernel modules: metis

 

After this I was able to compile libuio and use “triton_multi_ctx” to read the different characteristics of the device.

 ./triton_multi_ctx --fwver
[libdmabuf.c:1860] Device 0: metis-0:1:0
[libdmabuf.c:1281] edma init
[libdmabuf.c:1330] msg_tv_sec = 1, krn_tv_sec = 120, dma_tv_sec = 5
[libdmabuf.c:1725] uio_dev_msg msg_tv_sec 1
[libdmabuf.c:1664] Sending device command: opcode=111 size=0 msg=[00 ]
[libdmabuf.c:617] Wait MSG 12 (timeout 1sec)
[libdmabuf.c:542] Wait MSI 12 (timeout 1sec)
[libdmabuf.c:550] Wait MSI 12 DONE
[libdmabuf.c:1664] Received device response: status=1 size=21 msg=[76 31 2e 32 2e 30 2d 72 63 32 2b 62 6c 31 2d 73 74 61 67 65 30 00 ]
Firmware version: v1.2.0-rc2+bl1-stage0
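
The response payload in the log above is plain ASCII printed as hex, so it can be checked by hand against the reported firmware string. A small Python sketch (the helper name `decode_msg` is just for illustration, and it assumes the payload is NUL-terminated ASCII, as it appears to be here):

```python
def decode_msg(hex_bytes: str) -> str:
    """Decode space-separated hex bytes into ASCII, stopping at the first NUL."""
    raw = bytes.fromhex(hex_bytes)
    return raw.split(b"\x00", 1)[0].decode("ascii")

payload = "76 31 2e 32 2e 30 2d 72 63 32 2b 62 6c 31 2d 73 74 61 67 65 30 00"
print(decode_msg(payload))  # v1.2.0-rc2+bl1-stage0
```

The decoded text matches the "Firmware version" line, which suggests the message path (command out, MSI back, response read) is working in both directions for small payloads.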

But right now I am trying to debug the data communication between the Metis and the host using the provided unit tests. All the initial tests pass, but the data transfer tests are giving problems: the memory read/write test to/from the device using ‘Device’ fails on both writing and reading, and the ‘DmaBuf’ one only fails on the read. I don’t really know the cause. It might be that the device is mapped in a different address range than it expects (I can see 0x80000000 hard-coded in many .sh files), whereas the BAR regions are at:

41000000-4fffffff : pcie@0x54000000
  41000000-410fffff : 0000:00:00.0
  42000000-44ffffff : PCI Bus 0000:01
    42000000-43ffffff : 0000:01:00.0
      42000000-43ffffff : triton-0:1:0
    44000000-4400ffff : 0000:01:00.0
    44010000-44010fff : 0000:01:00.0
      44010000-44010fff : triton-0:1:00.0

… (but I can also see the following in /proc/iomem):

8000000000-81ffffffff : pcie@0x54000000
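
To compare a hard-coded address such as 0x80000000 against these windows, the /proc/iomem lines can be parsed and checked programmatically. The following is a rough Python sketch, not part of the Axelera tooling. `IOMEM_SAMPLE` contains only the ranges quoted in this post, and the helper names are invented for illustration. On a real system you would read /proc/iomem as root, since unprivileged reads show zeroed addresses.

```python
import re

# Only the ranges quoted in this post; a real /proc/iomem has many more entries.
IOMEM_SAMPLE = """\
41000000-4fffffff : pcie@0x54000000
  42000000-44ffffff : PCI Bus 0000:01
    42000000-43ffffff : 0000:01:00.0
    44000000-4400ffff : 0000:01:00.0
    44010000-44010fff : 0000:01:00.0
8000000000-81ffffffff : pcie@0x54000000
"""

LINE_RE = re.compile(r"^\s*([0-9a-fA-F]+)-([0-9a-fA-F]+) : (.+)$")

def parse_iomem(text):
    """Return (start, end, name) tuples for every range line in /proc/iomem text."""
    ranges = []
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m:
            ranges.append((int(m.group(1), 16), int(m.group(2), 16), m.group(3).strip()))
    return ranges

def containing(ranges, addr):
    """Names of all ranges that contain the given address."""
    return [name for start, end, name in ranges if start <= addr <= end]

ranges = parse_iomem(IOMEM_SAMPLE)
for start, end, name in ranges:
    print(f"{start:#012x}-{end:#012x} size={(end - start + 1) // 1024} KiB  {name}")
print(containing(ranges, 0x80000000))  # [] -> not inside any range listed here
```

Note that the second window starts at 0x8000000000 (ten hex digits), not at 0x80000000, so the hard-coded value falls in neither of the two PCIe windows shown. Whether the scripts use that value as a host address or as a device-side address is something the script sources would have to confirm.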

In the “Device” code path, when using DMA transfer, the main error is this one: [AxeleraDevice.cpp:341] UIO_IOCTL_USR_DMA_XFER failed: Connection timed out. And for “DmaBuf” it is [AxeleraDmaBuf.cpp:262] DMABUF_METIS_WAIT failed: Connection timed out.

The normal method and the mmap method seem to succeed in the transfer, though. I will attach the complete logs in case they help. Some extra notes: this host is known not to have I/O coherency, and the only way of getting DMA to work (on other projects, in my experience) is when the memory is requested via dma_alloc_coherent(), as they patched the page configuration at kernel level with an uncached bit for this purpose. Using normal memory allocation will result in reading the wrong value from the cache. I hope I didn’t add too much info, but it seemed needed to explain the current state ahhahaha. Any help is more than welcome, as I am not really an expert in this.

Thanks for the detailed post ​@jjpr! This is really helpful stuff.

From the logs and the tests you’ve shared, the setup looks well advanced, and it’s great that you’ve got the driver loaded, the device identified, and the firmware version retrieved. As a starting point, one useful test might be to check the PCIe communication and firmware status using Axelera’s axdevice tool with the --refresh flag.

source venv/bin/activate
axdevice --refresh

This command:
    •    Rescans and reinitialises any Metis devices.
    •    Reloads the firmware.
    •    Helps clear up inconsistencies between what the OS thinks is available and what’s actually ready for use.

Let’s start with this and see what it reports back. If you could share the result of that test, we can dig deeper into where we go next 👍


Hi, due to the lack of Debian packages compiled by Axelera for RISC-V, I can’t use the SDK or the Python-related tools (the wheels also come precompiled only for x86 or ARM). Currently I am using the PCIe driver code provided by your engineer (host.pcie-driver-1.2.0) together with the test scripts and utilities. So I can do the equivalent operations using the `triton_multi_ctx` utility and system commands. I have checked the documentation, and I should be able to do the same. Currently I can use the commands already explained in other posts, plus a few extra ones, to get a similar result:

echo 1 | sudo tee /sys/bus/pci/rescan > /dev/null

lspci -vvv (as you can see from my previous logs, the system seems to detect it completely fine)

echo "1" > /sys/bus/pci/devices/0000\:01\:00.0/reset  (this should restart the device)

echo 1 | sudo tee /sys/bus/pci/rescan > /dev/null

(also, because there is the PCIe bridge, I can reset it with this) sudo setpci -s 00:00.0 COMMAND=0x7
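
For reference, `COMMAND=0x7` writes the PCI command register of the bridge. The sketch below decodes such a value using the standard PCI command register bit layout and prints it in the style lspci uses for its Control line. The helper name `decode_command` is illustrative only.

```python
# Bit positions of the PCI command register, in the order lspci prints them.
COMMAND_BITS = [
    (0, "I/O"), (1, "Mem"), (2, "BusMaster"), (3, "SpecCycle"),
    (4, "MemWINV"), (5, "VGASnoop"), (6, "ParErr"), (7, "Stepping"),
    (8, "SERR"), (9, "FastB2B"), (10, "DisINTx"),
]

def decode_command(value: int) -> str:
    """Render a command register value in lspci's 'Name+/Name-' style."""
    return " ".join(
        f"{name}{'+' if value & (1 << bit) else '-'}" for bit, name in COMMAND_BITS
    )

print(decode_command(0x7))     # value written to the bridge with setpci
print(decode_command(0x0406))  # value matching the Metis Control line from lspci
```

So 0x7 turns on I/O space, memory space and bus mastering on the bridge. It sets enable bits rather than triggering a reset by itself, which may still be what is needed if the bridge was left with those bits cleared.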

 

But as you can check in the previous post, from the lspci information it seems the device is correctly identified and given the needed resources.

Also, I could get the reloc_hello_world binary by unpacking some Debian packages, to try loading a kernel directly, but due to the DMA problem already explained it doesn’t work:

risc550@ubuntu:~/Desktop/javier/axelera_material/host.pcie-driver-1.2.0/apps/libuio/build$ ./triton_multi_ctx --device metis-0:1:0 --debug --kernel0 ./kernels/reloc_hello_world --run f
kernel0 ./kernels/reloc_hello_world
[DEBUG]-Init resources...
[DEBUG]-Parent process with PID 5132
[DEBUG]-Parent process waiting for children to create a context... 0/1
[DEBUG]-Child process 0 with PID 5133
[DEBUG]-Read file ./kernels/reloc_hello_world
[libdmabuf.c:1281] edma init
[libdmabuf.c:1330] msg_tv_sec = 1, krn_tv_sec = 120, dma_tv_sec = 5
[libdmabuf.c:1725] uio_dev_msg msg_tv_sec 1
[libdmabuf.c:1664] Sending device command: opcode=101 size=0 msg=[00 ]
[libdmabuf.c:617] Wait MSG 12 (timeout 1sec)
[libdmabuf.c:542] Wait MSI 12 (timeout 1sec)
[libdmabuf.c:550] Wait MSI 12 DONE
[libdmabuf.c:1704] Device command returned an error code.
[libdmabuf.c:1705] Consider inspecting the device log via `triton_trace --slog`.
[libdmabuf.c:1725] uio_dev_msg msg_tv_sec 1
[libdmabuf.c:1664] Sending device command: opcode=102 size=0 msg=[00 ]
[libdmabuf.c:617] Wait MSG 12 (timeout 1sec)
[libdmabuf.c:542] Wait MSI 12 (timeout 1sec)
[libdmabuf.c:550] Wait MSI 12 DONE
[libdmabuf.c:1704] Device command returned an error code.
[libdmabuf.c:1705] Consider inspecting the device log via `triton_trace --slog`.
[libdmabuf.c:1725] uio_dev_msg msg_tv_sec 1
[libdmabuf.c:1664] Sending device command: opcode=103 size=0 msg=[00 ]
[libdmabuf.c:617] Wait MSG 12 (timeout 1sec)
[libdmabuf.c:542] Wait MSI 12 (timeout 1sec)
[libdmabuf.c:550] Wait MSI 12 DONE
[libdmabuf.c:1704] Device command returned an error code.
[libdmabuf.c:1705] Consider inspecting the device log via `triton_trace --slog`.
[libdmabuf.c:1569] ctx info (0) l2_base=0x8100000 l2_size=0x7be000 ictx=0x1 actxid-0x1
[DEBUG]-PID 5133 : base 0x8100000 size 0x7be000
[DEBUG]-Parent process waiting for children to create a context... 1/1
[DEBUG]-Unblock children
[libdmabuf.c:358] Discovering dmabuf driver
[libdmabuf.c:367] dmabuf alloc: heap/system
[libdmabuf.c:444] Open fdi: fde 4 fdi 6 dmabuf_fd 5 : imported and attached
[libdmabuf.c:1429] DMABUF_METIS_XFER failed: Connection timed out
[libdmabuf.c:747] Failed to transfer to device
[libdmabuf.c:453] Close fdi: fde 4 fdi 6 dmabuf_fd 5
Failed to load kernel.
[DEBUG]-Cleaning resources...
[DEBUG]-Child process 0 with PID 5133 has completed with status 1
[DEBUG]-Cleaning resources...

If you want me to try anything further, you will need to give me the commands or the binary you want me to run, taking into account that I don’t have the SDK hahahaha. I know it is a pain, but currently even the Python packages are only delivered as wheels (installing from a tar archive would be feasible if they were uploaded), and therefore I can’t install them on RISC-V.

