
Hello,

 

I'm trying to run a YOLO model on 8 camera streams in parallel on the Metis PCIe card. 20-30 fps per camera stream is all I need. With the 548 fps end-to-end stated for YOLOv8s in https://axelera.ai/metis-aipu-benchmarks, it should be possible to reach ~68 fps per camera.

 

As a small test I wrote a Python script that creates an inference stream with 8 videos as input:

```
from axelera.app import config, display, inf_tracers
from axelera.app.stream import create_inference_stream


def run(window, stream):
    # Show every inferred frame and print the end-to-end FPS tracer value.
    for frame_result in stream:
        window.show(frame_result.image, frame_result.meta, frame_result.stream_id)

        fps = stream.get_all_metrics()['end_to_end_fps']
        print(fps.value)


def main():
    tracers = inf_tracers.create_tracers('core_temp', 'end_to_end_fps', 'cpu_usage')
    # Eight copies of the same 1080p video emulate eight camera streams.
    stream = create_inference_stream(
        network="yolov5s-v7-coco",
        sources=[
            str(config.env.framework / "media/traffic1_1080p.mp4"),
            str(config.env.framework / "media/traffic1_1080p.mp4"),
            str(config.env.framework / "media/traffic1_1080p.mp4"),
            str(config.env.framework / "media/traffic1_1080p.mp4"),
            str(config.env.framework / "media/traffic1_1080p.mp4"),
            str(config.env.framework / "media/traffic1_1080p.mp4"),
            str(config.env.framework / "media/traffic1_1080p.mp4"),
            str(config.env.framework / "media/traffic1_1080p.mp4"),
        ],
        tracers=tracers,
    )
    print(stream.sources)

    with display.App(visible=True) as app:
        wnd = app.create_window("App", (900, 600))
        app.start_thread(run, (wnd, stream), name='InferenceThread')
        app.run(interval=1 / 10)
    stream.stop()


if __name__ == "__main__":
    main()
```

The tracers report an end-to-end rate of ~80 fps, similar to when I run it with only one video as input. But the output shown in the window is very choppy; I'd guess less than 10 fps per camera.

I'm probably using the stream interface of the Python API incorrectly. Can someone please help me with this issue? What is the recommended way to run inference on multiple camera streams (or videos, as a test) in parallel?

Thanks
Maximilian

Hi @maximiliankir, welcome on board! Love this project.

I’ll need to check with some of the team, but I wonder if a multi-stream approach would help, rather than eight sources on a single inference stream?

In the meantime, these may help with looking into this approach:

But I’ll also come back to you with some thoughts from the Axelera crew!
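
In case it's useful, here's a very rough sketch of the kind of multi-stream approach I mean: one inference stream per source, each drained in its own thread, instead of eight sources on a single stream. This is untested and purely illustrative; whether several streams can share one Metis card like this is something I still need to confirm.

```
# Illustrative sketch only (untested): one create_inference_stream() per source,
# each consumed in its own thread. Sharing the card across several streams is
# an assumption to be verified.
import threading
from axelera.app import config
from axelera.app.stream import create_inference_stream

SOURCES = [str(config.env.framework / "media/traffic1_1080p.mp4")] * 8

def consume(stream, stream_idx):
    for frame_result in stream:
        # Handle or render each camera's frames here.
        pass

streams = [create_inference_stream(network="yolov5s-v7-coco", sources=[src])
           for src in SOURCES]
threads = [threading.Thread(target=consume, args=(s, i), daemon=True)
           for i, s in enumerate(streams)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```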


Hi @Spanner,

Thanks for discussing this issue with the team. Ultimately, the best solution for me would be a complete pipeline via GStreamer, since I receive the camera frames via a GStreamer interface anyway. Can you please point me to an example of how to connect these GStreamer inputs to a YOLO inference pipeline? Maybe this would also help achieve the required performance of 8 streams at 30 fps each.


@maximiliankir 
Can you please let us know the host machine you are using? 
Thanks!


@maximiliankir 

Adding multiple sources should be fairly easy with inference.py, and you are right that each stream in an 8-parallel-stream setup should be able to run at >= 60 FPS if we can reach ~548 FPS with a single stream using YOLOv8s. That being said, inference.py uses GStreamer under the hood, so it does not matter whether we set up an 8-stream pipeline directly in GStreamer or indirectly via inference.py; the performance should (ideally) be the same. With inference.py, we can set up 8 parallel streams (each running at 60 FPS [1]) and verify an end-to-end FPS of ~548 as follows:

```
./inference.py \
    --pipe="gst" \
    --aipu-cores=4 \
    --disable-vaapi \
    --disable-opencl \
    --frame-rate 0 \
    --frames 5000 \
    --show-stats \
    --no-display \
    yolov8s-coco-onnx \
    rtsp://127.0.0.1:8551/test \
    rtsp://127.0.0.1:8551/test \
    rtsp://127.0.0.1:8551/test \
    rtsp://127.0.0.1:8551/test \
    rtsp://127.0.0.1:8551/test \
    rtsp://127.0.0.1:8551/test \
    rtsp://127.0.0.1:8551/test \
    rtsp://127.0.0.1:8551/test | grep -e "Element" -e "End-to-end"
```

The above should show the "End-to-end" FPS reachable by Metis as follows:

```
Element                                         Latency(us)   Effective FPS
End-to-end average measurement                                        552.5
```

Hope this helps with your use case. If you still require custom integration of axinferencenet with custom GStreamer pipelines, then please take a look at [2].

Please feel free to reach out if you have any more questions, comments or suggestions!
Thanks!

---
[1] In the above we run an RTSP-based stream at 60 FPS via an RTSP server, which can be set up as follows:

```
wget https://github.com/GStreamer/gst-rtsp-server/raw/refs/tags/1.14.5/examples/test-launch.c
sudo apt-get install libgstrtspserver-1.0 libgstreamer1.0-dev
gcc test-launch.c -o test-launch $(pkg-config --cflags --libs gstreamer-1.0 gstreamer-rtsp-server-1.0)
./test-launch -p 8551 "\
    filesrc location=<path-to>/media/output.mp4 \
    ! qtdemux \
    ! h264parse \
    ! avdec_h264 \
    ! videorate ! video/x-raw,framerate=60/1 \
    ! videoscale ! video/x-raw,width=640,height=640 \
    ! queue \
    ! videoconvert \
    ! x264enc \
    ! h264parse \
    ! rtph264pay name=pay0 pt=96"
```

[2] https://community.axelera.ai/voyager-sdk-2/raspberry-pi-5-metis-inference-with-gstreamer-pipeline-221?postid=684#post684


Thank you very much for the detailed answer. I will try to run parallel inference via inference.py as soon as possible. I will also try running it from RTSP streams instead of video files directly, to see if this alone gives a performance improvement.

 

To answer the question about the host machine: 

  • I tried it on an x86 machine with an AMD Ryzen 9 3900X and 64 GB RAM. I don't think this should limit the performance.
  • Our target platform will ultimately have an Ampere Altra ARM CPU with 128 cores. I also tested on this machine, but the performance is even worse, probably due to its very limited single-core performance.

I have another question: what does the host FPS metric shown when I use inference.py mean? It reports something over 800 FPS, but what does that refer to?

 

I am now out of office for 2 weeks, but I will try your suggestions as soon as I am back and come back to you with feedback.


@Habib 

I finally got to test your suggestions. On the x86 machine I can achieve the desired performance. The important flag was --disable-opencl. Disabling OpenCL seems to improve the performance a lot, but it also increases the CPU load. Adding --enable-hardware-codec improved that a bit. In the end I achieve ~400 fps, which is 50 fps per stream. That is sufficient for our use case.

 

But as I wrote before: our target platform is an Ampere Altra ARM CPU, and there the performance is much worse. I can only reach 23 fps using a single input stream with YOLOv5n.
When I start inference.py I get the following warning:
```
WARNING : Failed to get OpenCL platforms : clGetPlatformIDs failed: PLATFORM_NOT_FOUND_KHR
WARNING : Please check the documentation for installation instructions
```

Adding the parameter --disable-opencv makes the performance even worse.

The stats look like this:
```
===========================================================================                                                                                                                   
Element                                         Latency(us)   Effective FPS
===========================================================================
qtdemux0                                                 81        12,299.9
h264parse0                                              276         3,622.4
capsfilter0                                             139         7,179.7
axinplace-addstreamid0                                  166         5,991.3
videoconvert0                                        39,781            25.1
inference-task0:libtransform_resize_0                 2,358           424.0
inference-task0:libinplace_normalize_0               13,194            75.8
inference-task0:libtransform_padding_0                  562         1,777.1
inference-task0:Preprocessing latency                17,942             n/a
inference-task0:inference                            42,389            23.6
└─ Metis                                               884         1,131.2
└─ Host                                                954         1,047.4
inference-task0:Inference latency                   695,433             n/a
inference-task0:libdecode_yolov5_0                      617         1,618.9
inference-task0:libinplace_nms_0                        111         8,994.7
inference-task0:Postprocessing latency                2,453             n/a
inference-task0:Total latency                       715,853             n/a
===========================================================================
End-to-end average measurement                                         22.8
===========================================================================

```

videoconvert0 seems to limit the performance, and the inference latency is very high. Where does this come from?

Can you please help me debug this further? How can I dig deeper into the performance bottleneck?


I notice the warning about missing OpenCL support. Could that mean that there's an issue with the GPU driver, and that all the processing is therefore being pushed to the CPU instead? That could potentially cause quite a bottleneck around videoconvert0, I guess.
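
As a quick sanity check (just a sketch, assuming pyopencl is installed; clinfo on the command line gives the same information), you could list what the OpenCL loader actually sees:

```
# Quick diagnostic: list the OpenCL platforms/devices the ICD loader can find.
# Assumes pyopencl is installed; this is only a sketch, not SDK tooling.
import pyopencl as cl

try:
    platforms = cl.get_platforms()
except cl.Error as exc:
    platforms = []
    print(f"clGetPlatformIDs failed: {exc}")

for platform in platforms:
    print(platform.name, [device.name for device in platform.get_devices()])

if not platforms:
    print("No OpenCL platforms found; preprocessing will fall back to the CPU.")
```

If that comes back empty, it would at least confirm that the preprocessing stages have no OpenCL device to run on.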


Hi,

We use a very small AMD GPU that doesn't support OpenCL in our system, so we want to do all pre- and post-processing on the CPU.

We ran the inference with multiple input streams and got the following stats:

```
===========================================================================
qtdemux0                                                 76        13,050.3
qtdemux1                                                 76        13,076.9
h264parse0                                              209         4,782.1
capsfilter0                                              73        13,647.8
h264parse3                                              243         4,104.8
capsfilter3                                              73        13,654.7
qtdemux3                                                 76        13,092.5
qtdemux2                                                 78        12,794.8
h264parse1                                              221         4,521.8
capsfilter1                                              66        14,953.4
h264parse2                                              198         5,030.8
capsfilter2                                              63        15,641.7
axinplace-addstreamid1                                   88        11,351.9
axinplace-addstreamid2                                   93        10,701.3
axinplace-addstreamid0                                   89        11,200.5
axinplace-addstreamid3                                   80        12,345.7
videoconvert1                                        28,582            35.0
videoconvert2                                        28,592            35.0
videoconvert0                                        28,675            34.9
videoconvert3                                        28,693            34.9
inference-task0:libtransform_resize_0                 2,123           471.0
inference-task0:libinplace_normalize_0               11,631            86.0
inference-task0:libtransform_padding_0                  466         2,143.4
inference-task0:Preprocessing latency                86,443             n/a
inference-task0:inference                            14,120            70.8
└─ Metis                                             1,242           805.0
└─ Host                                              1,344           743.9
inference-task0:Inference latency                   232,751             n/a
inference-task0:libdecode_yolov8_0                      535         1,866.1
inference-task0:libinplace_nms_0                         64        15,543.5
inference-task0:Postprocessing latency                1,343             n/a
inference-task0:Total latency                       320,564             n/a
===========================================================================
End-to-end average measurement                                         69.7
===========================================================================
```

The videoconvert stage is executed in parallel for each input stream and reaches 35 fps per stream, which is fine for our use case. However, the libinplace_normalize_0 stage seems to be the bottleneck, as it only runs once for all streams, at 86 fps. Does the FPS figure stated for the inference-task0:inference step include the preprocessing steps? It does not seem to agree with the FPS values for Metis and Host.

We looked into the normalize implementation in the Voyager SDK, and it should use SIMD through the libsimde library. However, when we disable the simd:avx2 flag in the operators/mega.py file, the performance stays exactly the same. We assume no SIMD is being used, either because the inplace function in AxInPlaceNormalize.cpp is not compiled with support for NEON instructions, or because the libsimde library in the Ubuntu 22.04 repositories is old and has only partial NEON support.
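
A rough way we could probe that assumption (sketch only; it assumes binutils' objdump is available, and the library path is a placeholder that has to be adjusted to wherever the SDK installs it) is to disassemble the library and count instructions that use AArch64 vector registers:

```
# Rough diagnostic sketch: disassemble the library and count instructions that
# touch AArch64 vector registers (v0..v31). Few or no hits would support the
# suspicion that the normalize kernel was built without NEON code paths.
# Assumes binutils' objdump is installed; LIB is a placeholder path.
import re
import subprocess

LIB = "libinplace_normalize.so"  # adjust to the actual install location

disasm = subprocess.run(
    ["objdump", "-d", LIB], capture_output=True, text=True, check=True
).stdout

vector_ops = [line for line in disasm.splitlines()
              if re.search(r"\bv\d{1,2}\.(16b|8b|8h|4h|4s|2s|2d)\b", line)]
print(f"{len(vector_ops)} instructions use NEON vector registers")
```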

Can you please check how libinplace_normalize.so has been built for ARM? Can you tell us how to recompile the library with NEON support? Or do you see another way to run the normalize stage separately for each input stream?

