Hello,
I am trying to run a YOLO model on 8 camera streams in parallel on the Metis PCIe card; 20-30 fps per camera stream is all I need. With the 548 fps end-to-end stated for YOLOv8s at https://axelera.ai/metis-aipu-benchmarks, it should be possible to reach ~68 fps per camera.
As a small test I wrote a Python script that creates an inference stream with 8 videos as input:
```
from axelera.app import config, display, inf_tracers
from axelera.app.stream import create_inference_stream


def run(window, stream):
    for frame_result in stream:
        window.show(frame_result.image, frame_result.meta, frame_result.stream_id)
        fps = stream.get_all_metrics()['end_to_end_fps']
        print(fps.value)


def main():
    tracers = inf_tracers.create_tracers('core_temp', 'end_to_end_fps', 'cpu_usage')
    stream = create_inference_stream(
        network="yolov5s-v7-coco",
        sources=[
            str(config.env.framework / "media/traffic1_1080p.mp4"),
            str(config.env.framework / "media/traffic1_1080p.mp4"),
            str(config.env.framework / "media/traffic1_1080p.mp4"),
            str(config.env.framework / "media/traffic1_1080p.mp4"),
            str(config.env.framework / "media/traffic1_1080p.mp4"),
            str(config.env.framework / "media/traffic1_1080p.mp4"),
            str(config.env.framework / "media/traffic1_1080p.mp4"),
            str(config.env.framework / "media/traffic1_1080p.mp4"),
        ],
        tracers=tracers,
    )
    print(stream.sources)

    with display.App(visible=True) as app:
        wnd = app.create_window("App", (900, 600))
        app.start_thread(run, (wnd, stream), name='InferenceThread')
        app.run(interval=1 / 10)
    stream.stop()


if __name__ == "__main__":
    main()
```
The tracers report an end-to-end rate of ~80 fps, similar to what I get with only one video as input. But the output shown in the window is very choppy, I'd guess less than 10 fps per camera.
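To put a number on that guess, one thing I could do is count frames per `stream_id` inside the loop, roughly like this (just a sketch of how I would measure it, not something I have verified):

```
import time
from collections import Counter

def run(window, stream):
    counts = Counter()              # frames seen per stream_id
    start = time.monotonic()
    for frame_result in stream:
        window.show(frame_result.image, frame_result.meta, frame_result.stream_id)
        counts[frame_result.stream_id] += 1
        elapsed = time.monotonic() - start
        if elapsed >= 5.0:          # report per-stream fps every ~5 seconds
            for sid, n in sorted(counts.items()):
                print(f"stream {sid}: {n / elapsed:.1f} fps")
            counts.clear()
            start = time.monotonic()
```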
I am probably using the stream interface of the Python API incorrectly. Can someone please help me with this? What is the recommended way to run inference on multiple camera streams (or videos, as a test) in parallel?
Thanks
Maximilian
Hi @maximiliankir, welcome on board! Love this project.
I’ll need to check with some of the team, but I wonder if a multi-stream approach would help, rather than eight sources on a single inference stream?
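Just to illustrate the kind of thing I mean, a very rough sketch might look like the code below: one create_inference_stream per source, each drained in its own thread. I haven't tested this myself, so please treat it purely as a starting point rather than a confirmed pattern:

```
import threading

from axelera.app import config
from axelera.app.stream import create_inference_stream

SOURCE = str(config.env.framework / "media/traffic1_1080p.mp4")

def consume(stream, idx):
    # Drain one inference stream; real code would hand frames to a display/consumer.
    for n, _frame_result in enumerate(stream, start=1):
        if n % 100 == 0:
            print(f"stream {idx}: {n} frames processed")

def main():
    # One inference stream per source instead of eight sources on a single stream.
    streams = [
        create_inference_stream(network="yolov5s-v7-coco", sources=[SOURCE])
        for _ in range(8)
    ]
    threads = [
        threading.Thread(target=consume, args=(s, i), name=f"Inference{i}", daemon=True)
        for i, s in enumerate(streams)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    for s in streams:
        s.stop()

if __name__ == "__main__":
    main()
```

Whether eight separate streams can share the card efficiently like this is exactly the sort of thing I'd want to confirm with the team.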
In the meantime, these may help with looking into this approach:
But I’ll also come back to you with some thoughts from the Axelera crew!
Hi @Spanner,
thanks for discussing this issue with the team. Ultimately, the best solution for me would be a complete pipeline via GStreamer, since I receive the camera frames through a GStreamer interface anyway. Can you please point me to an example of how to connect these GStreamer inputs to a YOLO inference pipeline? Maybe this would also help achieve the required performance of 8 streams at 30 fps each.
@maximiliankir
Can you please let us know the host machine you are using?
Thanks!
@maximiliankir
Adding multiple sources should be fairly easy with inference.py, and you are right that each stream in an 8-parallel-stream setup should be able to run at >= 60 FPS if we can reach ~548 FPS with a single stream using YOLOv8s. That said, inference.py uses GStreamer under the hood, so it does not matter whether we set up an 8-stream pipeline directly in GStreamer or indirectly via inference.py - the performance should (ideally) be the same. With inference.py, we can set up 8 parallel streams (each running at 60 FPS [1]) and verify an end-to-end FPS of ~548 as follows:
> ./inference.py \
--pipe="gst" \
--aipu-cores=4 \
--disable-vaapi \
--disable-opencl \
--frame-rate 0 \
--frames 5000 \
--show-stats \
--no-display \
yolov8s-coco-onnx \
rtsp://127.0.0.1:8551/test \
rtsp://127.0.0.1:8551/test \
rtsp://127.0.0.1:8551/test \
rtsp://127.0.0.1:8551/test \
rtsp://127.0.0.1:8551/test \
rtsp://127.0.0.1:8551/test \
rtsp://127.0.0.1:8551/test \
rtsp://127.0.0.1:8551/test | grep -e "Element" -e "End-to-end"
The above should show the "End-to-End" FPS reachable by Metis as follows:
```
Element                                  Latency(us)    Effective FPS
End-to-end average measurement                                  552.5
```
Hope this helps with your use case. If you still require custom integration of axinferencenet
with custom GStreamer pipelines, then please take a look at [2].
Please feel free to reach out if you have any more questions, comments or suggestions!
Thanks!
---
[1] In the above we run an RTSP-based stream at 60 FPS via an RTSP server, which can be set up as follows:
> wget https://github.com/GStreamer/gst-rtsp-server/raw/refs/tags/1.14.5/examples/test-launch.c
> sudo apt-get install libgstrtspserver-1.0 libgstreamer1.0-dev
> gcc test-launch.c -o test-launch $(pkg-config --cflags --libs gstreamer-1.0 gstreamer-rtsp-server-1.0)
> ./test-launch -p 8551 "\
filesrc location=<path-to>/media/output.mp4 \
! qtdemux \
! h264parse \
! avdec_h264 \
! videorate ! video/x-raw,framerate=60/1 \
! videoscale ! video/x-raw,width=640,height=640 \
! queue \
! videoconvert \
! x264enc \
! h264parse \
! rtph264pay name=pay0 pt=96"
[2] https://community.axelera.ai/voyager-sdk-2/raspberry-pi-5-metis-inference-with-gstreamer-pipeline-221?postid=684#post684
Thank you very much for the detailed answer. I will try to run parallel inference via inference.py as soon as possible. I will also try running it from the RTSP streams instead of video files directly, to see if this already gives a performance improvement.
To answer the question about the host machine:
- I tried it on an x86 machine with an AMD Ryzen 9 3900X and 64 GB of RAM. I don't think this should limit the performance.
- Our target platform will ultimately have an Ampere Altra ARM CPU with 128 cores. I also tested on this machine, but the performance is even worse, probably due to its very limited single-core performance.
I have another question: what does the host fps metric shown by inference.py mean? It reports something over 800 FPS, and I am not sure how to interpret that.
I am now out of office for 2 weeks, but I will try your suggestions as soon as I am back and come back to you with feedback.
@Habib
I finally got to test your suggestions. On the x86 machine I can achieve the desired performance. The important flag was --disable-opencl: disabling OpenCL improves the performance a lot, but also increases the CPU load. Adding --enable-hardware-codec improved that a bit. In the end I achieve ~400 fps, which is 50 fps per stream. That is sufficient for our use case.
But as I wrote before, our target platform is an Ampere Altra ARM CPU, and there the performance is much worse: I can only reach 23 fps using a single input stream with YOLOv5n.
When I start the inference.py I get the following warning:
```
WARNING : Failed to get OpenCL platforms : clGetPlatformIDs failed: PLATFORM_NOT_FOUND_KHR
WARNING : Please check the documentation for installation instructions
```
Adding the parameter --disable-opencv makes the performance even worse.
The stats look like this:
```
===========================================================================
Element Latency(us) Effective FPS
===========================================================================
qtdemux0 81 12,299.9
h264parse0 276 3,622.4
capsfilter0 139 7,179.7
axinplace-addstreamid0 166 5,991.3
videoconvert0 39,781 25.1
inference-task0:libtransform_resize_0 2,358 424.0
inference-task0:libinplace_normalize_0 13,194 75.8
inference-task0:libtransform_padding_0 562 1,777.1
inference-task0:Preprocessing latency 17,942 n/a
inference-task0:inference 42,389 23.6
└─ Metis 884 1,131.2
└─ Host 954 1,047.4
inference-task0:Inference latency 695,433 n/a
inference-task0:libdecode_yolov5_0 617 1,618.9
inference-task0:libinplace_nms_0 111 8,994.7
inference-task0:Postprocessing latency 2,453 n/a
inference-task0:Total latency 715,853 n/a
===========================================================================
End-to-end average measurement 22.8
===========================================================================
```
videoconvert0 seems to limit the performance, and the inference latency is very high. Where does this come from?
Can you please help me debug this further? How can I dig deeper into the performance bottleneck?
I notice the warning about missing OpenCL support. Could that mean there's an issue with the GPU driver, and that all the processing is therefore being pushed to the CPU instead? That could potentially cause quite a bottleneck around videoconvert0, I guess.
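If it's useful, here's a quick way to check from Python whether any OpenCL platform is visible at all. I'm assuming pyopencl is available (it isn't necessarily part of the SDK environment), so treat this as a rough diagnostic sketch:

```
# Rough diagnostic: list visible OpenCL platforms/devices (requires `pip install pyopencl`).
import pyopencl as cl

try:
    platforms = cl.get_platforms()
except cl.Error as exc:  # raised when no ICD/platform is installed
    print(f"No OpenCL platforms found: {exc}")
else:
    for platform in platforms:
        print(f"Platform: {platform.name}")
        for device in platform.get_devices():
            print(f"  Device: {device.name}")
```

If that reports no platforms, it would at least confirm that everything is falling back to the CPU path.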
Hi,
we use a very small AMD GPU in our system which doesn't support OpenCL, and we want to do all pre- and post-processing on the CPU.
We ran the inference with multiple input streams and got the following stats:
```
===========================================================================
qtdemux0 76 13,050.3
qtdemux1 76 13,076.9
h264parse0 209 4,782.1
capsfilter0 73 13,647.8
h264parse3 243 4,104.8
capsfilter3 73 13,654.7
qtdemux3 76 13,092.5
qtdemux2 78 12,794.8
h264parse1 221 4,521.8
capsfilter1 66 14,953.4
h264parse2 198 5,030.8
capsfilter2 63 15,641.7
axinplace-addstreamid1 88 11,351.9
axinplace-addstreamid2 93 10,701.3
axinplace-addstreamid0 89 11,200.5
axinplace-addstreamid3 80 12,345.7
videoconvert1 28,582 35.0
videoconvert2 28,592 35.0
videoconvert0 28,675 34.9
videoconvert3 28,693 34.9
inference-task0:libtransform_resize_0 2,123 471.0
inference-task0:libinplace_normalize_0 11,631 86.0
inference-task0:libtransform_padding_0 466 2,143.4
inference-task0:Preprocessing latency 86,443 n/a
inference-task0:inference 14,120 70.8
└─ Metis 1,242 805.0
└─ Host 1,344 743.9
inference-task0:Inference latency 232,751 n/a
inference-task0:libdecode_yolov8_0 535 1,866.1
inference-task0:libinplace_nms_0 64 15,543.5
inference-task0:Postprocessing latency 1,343 n/a
inference-task0:Total latency 320,564 n/a
===========================================================================
End-to-end average measurement 69.7
===========================================================================
```
The videoconvert stage is executed in parallel for each input stream and reaches 35 fps per stream, which is fine for our use case. However, the libinplace_normalize_0 stage seems to be the bottleneck, as it runs only once for all streams at 86 fps. Does the fps figure stated for the inference-task0:inference step include the preprocessing steps? It does not seem to agree with the FPS values for Metis and Host.
We looked into the normalize implementation in the Voyager SDK, and it should use SIMD through the libsimde library. However, when we disable the simd:avx2 flag in the operators/mega.py file, the performance stays exactly the same. We therefore assume no SIMD is being used, either because the inplace function in AxInPlaceNormalize.cpp is not compiled with support for NEON instructions, or because the libsimde library in the Ubuntu 22.04 repositories is old and has only partial NEON support.
Can you please check how libinplace_normalize.so has been built for ARM? Can you tell us how to recompile the library with NEON support?
Or do you see another way to run the normalize stage separately for each input stream?