Question

Aetina board with M2 Metis: low fps and high latency running unet_fcn_512-cityscapes

  • September 7, 2025
  • 15 replies
  • 379 views

Hello,

I’m running inference with the model unet_fcn_512-cityscapes using the torch-aipu pipe on the Aetina eval board. It runs at 1.8 fps system with 500 ms of latency, although the device fps shows 11.5 fps capability. The docs also mention it should reach 18 fps. I originally thought it was an issue with time wasted loading and decoding PNG images from the SD card, so I put them in shared memory, but the results are identical.
I also tested yolov5s-v7-coco, which should reach 805 fps, but I can only achieve 214 fps. Here is the output of:
 

AXELERA_USE_CL_DOUBLE_BUFFER=0 ./inference.py yolov5s-v7-coco media/traffic3_720p.mp4 --show-stats --no-display

INFO : Deploying model yolov5s-v7-coco for 4 cores. This may take a while...
|████████████████████████████████████████| 12:41.1
arm_release_ver: g13p0-01eac0, rk_so_ver: 9
========================================================================
Element Time(𝜇s) Effective FPS
========================================================================
qtdemux0 319 3,126.4
h264parse0 3,094 323.2
capsfilter0 259 3,851.4
mppvideodec0 9,563 104.6
decodebin-link0 91 10,922.0
axtransform-colorconvert0 3,404 293.8
inference-task0:libtransform_resize_cl_0 4,090 244.4
inference-task0:libtransform_padding_0 1,816 550.5
inference-task0:inference 4,405 227.0
inference-task0:Inference latency 94,835 n/a
inference-task0:libdecode_yolov5_0 991 1,008.3
inference-task0:libinplace_nms_0 130 7,679.8
inference-task0:Postprocessing latency 952 n/a
inference-task0:Total latency 110,383 n/a
========================================================================
End-to-end average measurement 214.0
========================================================================

Is there anything I can tune to reduce the latency and increase the fps?

Voyager SDK release v1.3.3, Ubuntu 22.04

Thanks in advance

15 replies

Spanner
Axelera Team
  • Axelera Team
  • September 8, 2025

Thanks for sharing this ​@npi - I’ll ask around the team, see what we can see 👍

Might be worth quickly bumping it up to the latest v1.4 either way, in the meantime?

https://github.com/axelera-ai-hub/voyager-sdk/blob/release/v1.4/docs/tutorials/firmware_flash_update.md


  • Author
  • Ensign
  • September 8, 2025

Hi ​@Spanner 
Thanks for the suggestion, however the installation failed at the end with:

[138/140] Install axelera-runtime-1.4.0
[138/140] E: Unable to locate package axelera-runtime-1.4.0
[138/140] E: Couldn't find any package by glob 'axelera-runtime-1.4.0'
[138/140] E: Couldn't find any package by regex 'axelera-runtime-1.4.0'
ERROR: Failed to install axelera-runtime-1.4.0

I had to rollback to v1.3.3


  • Axelera Team
  • September 9, 2025

Hello ​@npi 

The torch-aipu flow is not optimised for performance: it uses PyTorch for pre- and post-processing operators, and only a single core of the Axelera AIPU is used for inference. It is also not pipelined at all, so latency and throughput are poor. It is intended for comparing the accuracy of the model against a pure PyTorch pipeline, so that you can measure the impact of quantisation on accuracy. Since your Aetina does not have a GPU, it will be using a CPU-based PyTorch backend.

Consequently the 1.8 fps is not very surprising.

For the yolov5s-v7 pipeline you are using the GStreamer pipe (--pipe=gst). This is pipelined and utilises all 4 cores, but it is still limited by the performance of the slowest element in the pipeline, in this case:

inference-task0:libtransform_resize_cl_0 4,090 244.4

This is the element that resizes and letterboxes the 720p input frame to 640x640 for the YOLO model. The gap between this 244 fps and the 214 fps end-to-end figure is partly caused by `--show-stats`, which has a non-zero impact on all hosts, and more so on the Aetina Rockchip; there is also other overhead in pipeline management. We are always working on closing this gap.

What performance do you get for unet_fcn_512-cityscapes using the gst pipe?

Regarding the installation issue with 1.4, I have asked someone more familiar with the installer to investigate the likely cause. I suggest we pursue this because, aside from various performance improvements and bug fixes, 1.4 also makes it easier to use images from a directory and makes it possible to get images from a Python generator, as shown in this example:

https://github.com/axelera-ai-hub/voyager-sdk/blob/release/v1.4/examples/data_source.py
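A generator source is just a plain Python generator yielding frames; something along these lines (a minimal sketch using PIL, with `stream_from_dir` as an illustrative name; see the linked data_source.py for how it is actually wired into the SDK):

```python
from pathlib import Path

from PIL import Image

def stream_from_dir(directory):
    # Yield decoded RGB frames one at a time; a generator like this can
    # replace a video or dataset source (see data_source.py for the hookup).
    for path in sorted(Path(directory).glob("*.png")):
        yield Image.open(path).convert("RGB")
```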

 


  • Author
  • Ensign
  • September 9, 2025

Hello ​@SamP ,

Thanks for your quick and detailed answer.

I ran unet with the gst pipe and got 7.8 fps: much better, but still lower than expected.

./inference.py  unet_fcn_512-cityscapes dataset --pipe gst --no-display --show-stats

=========================================================================                                                                                                                       
Element Time(𝜇s) Effective FPS
=========================================================================
axinplace-addstreamid0 159 6,270.2
axtransform-colorconvert0 160,019 6.2
inference-task0:libtransform_resize_cl_0 2,403 416.1
inference-task0:libtransform_padding_0 1,509 662.7
inference-task0:inference 95,637 10.5
inference-task0:Inference latency 2,455,298 n/a
inference-task0:libtransform_paddingdequantize_0 17,239 58.0
inference-task0:libdecode_semantic_seg_0 14,789 67.6
inference-task0:Postprocessing latency 123,824 n/a
inference-task0:Total latency 3,157,814 n/a
=========================================================================
End-to-end average measurement 7.8
=========================================================================

I also got this warning during inference:
 

WARNING : New inference data is ready, but the InferencedStream is not being processed fast enough (backlog=10)                                                                                 
INFO : InferencedStream is being processed quickly enough again (backlog=1)

With --enable-hardware-codec I get 9.7 fps

=========================================================================                                                                                                                       
Element Time(𝜇s) Effective FPS
=========================================================================
axinplace-addstreamid0 150 6,660.3
axtransform-colorconvert0 116,649 8.6
inference-task0:libtransform_resize_cl_0 1,818 549.8
inference-task0:libtransform_padding_0 1,244 803.6
inference-task0:inference 99,256 10.1
inference-task0:Inference latency 1,679,149 n/a
inference-task0:libtransform_paddingdequantize_0 15,439 64.8
inference-task0:libdecode_semantic_seg_0 12,416 80.5
inference-task0:Postprocessing latency 19,759 n/a
inference-task0:Total latency 1,875,842 n/a
=========================================================================
End-to-end average measurement 9.7
=========================================================================

And no warning

Adding --disable-opencl I get 10.5fps

=========================================================================                                                                                                                       
Element Time(𝜇s) Effective FPS
=========================================================================
axinplace-addstreamid0 169 5,895.2
videoconvert0 9,219 108.5
capsfilter0 238 4,186.0
inference-task0:libtransform_resize_0 6,576 152.1
inference-task0:libtransform_totensor_0 333 2,996.9
inference-task0:libinplace_normalize_0 20,924 47.8
inference-task0:libtransform_padding_0 1,089 917.5
inference-task0:inference 82,273 12.2
inference-task0:Inference latency 1,743,865 n/a
inference-task0:libtransform_paddingdequantize_0 15,887 62.9
inference-task0:libdecode_semantic_seg_0 13,069 76.5
inference-task0:Postprocessing latency 62,064 n/a
inference-task0:Total latency 2,079,652 n/a
=========================================================================
End-to-end average measurement 10.5
=========================================================================

But the warning came back.

Disabling OpenGL and/or VA-API neither improved nor degraded the fps.

Is there anything else I can tune to reach 18 fps?

Thanks


  • Author
  • Ensign
  • September 17, 2025

Hello,

Could anyone help with the high latency? It is crucial for me to reduce it to a few ms. Is it due to the GStreamer pipeline being used?


Spanner
Axelera Team
  • Axelera Team
  • September 17, 2025

Hi ​@npi! Can I check - are you wanting to optimise for higher FPS (throughput) or lower latency? These can kinda work against each other...
 
Even at 10fps the minimum possible latency is 100ms (1000ms ÷ 10fps). So even at 18fps we're looking at 55ms.
 
That 1-3 second latency includes a fair amount of pipeline buffering on top of the actual frame processing time. The GStreamer pipeline contributes to this too, but it's not the only factor.
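To put numbers on that floor (plain Python, just the arithmetic):

```python
# The latency floor at a given frame rate is one full frame period:
# a frame cannot complete faster than the sustained rate allows.
def min_latency_ms(fps: float) -> float:
    return 1000.0 / fps

print(min_latency_ms(10))            # 100.0 ms at 10 fps
print(round(min_latency_ms(18), 1))  # 55.6 ms at 18 fps
```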
 
To help us move in the right direction, can you share:

  1. What's the ideal use case? Do you need real-time control or just smooth playback?
  2. Are you willing to trade FPS for lower latency?
  3. What latency would actually be acceptable for your application?

  • Author
  • Ensign
  • September 17, 2025

Hi ​@Spanner,

Thanks for the explanation of the latency numbers from the stats.

Yes, I can trade fps for lower latency, and I also don’t need the GStreamer pipeline.

The use case is inference on still raw images that are not necessarily coming at a regular fps, reacting quickly (real-time) to the output of the inference. So inference time + latency must be as small as possible.

With SDK 1.4, as ​@SamP mentioned, I could try image inference. This may help.
Will let you know.


Spanner
Axelera Team
  • Axelera Team
  • September 17, 2025

Sounds good, do keep us posted! For still image inference without GStreamer, you should see better latency, yeah. The pipeline overhead was likely adding a delay.

With SDK 1.4's image inference capabilities, you'll bypass all the video pipeline complexity and get much closer to the raw inference time we’re looking for 👍 Let me know how it goes!


  • Axelera Team
  • September 17, 2025

Latency is something that we are addressing at the moment; until relatively recently we were focussed mostly on throughput.

In 1.4 of the SDK there is a new mode that can be enabled with the env var AXELERA_LOW_LATENCY=1, which will generally enable all options that we know favour latency over throughput. You can see docs on the env vars if you run AXELERA_HELP=1 ./inference.py

The way we currently utilise the multiple cores in Metis means that the more cores are involved, the worse the latency when the incoming frame rate is low (e.g. on a USB/RTSP source). It is therefore best to use as few cores as possible. However, it is also best not to let any of the queues in the pipeline fill up, as that also hurts latency, so using fewer cores than required is suboptimal too. So you need to establish the optimal number of Metis cores. I use axrunmodel to do this, as it takes the preprocessing out of the equation, allowing you to focus on just the pure inference performance:

.../framework$ axrunmodel build/unet_fcn_512-cityscapes/unet_fcn_512-cityscapes/1/model.json --aipu-cores=2 -d0
build/unet_fcn_512-cityscapes/unet_fcn_512-cityscapes/1/model.json ... dev:21.9 host:21.8 system:21.8fps PASS
.../framework$ axrunmodel build/unet_fcn_512-cityscapes/unet_fcn_512-cityscapes/1/model.json --aipu-cores=3 -d0
build/unet_fcn_512-cityscapes/unet_fcn_512-cityscapes/1/model.json ... dev:29.8 host:29.8 system:29.8fps PASS
.../framework$ axrunmodel build/unet_fcn_512-cityscapes/unet_fcn_512-cityscapes/1/model.json --aipu-cores=4 -d0
build/unet_fcn_512-cityscapes/unet_fcn_512-cityscapes/1/model.json ... dev:32.0 host:31.9 system:32.0fps PASS

I use -d0 to ensure I only use one of the PCIe devices, and I establish that for my 24 fps source, 3 Metis cores is optimal. Your numbers will vary for the M.2.
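If it helps, the selection rule I’m describing can be sketched in a few lines of Python (the fps values are the ones from my axrunmodel runs above; treat this as an illustration, not an SDK API):

```python
# Pick the fewest AIPU cores whose measured throughput covers the source fps,
# since extra cores add pipeline-fill latency without helping a low-rate source.
measured = {2: 21.9, 3: 29.8, 4: 32.0}  # cores -> dev fps from axrunmodel

def optimal_cores(source_fps, measured):
    for cores in sorted(measured):
        if measured[cores] >= source_fps:
            return cores
    return max(measured)  # fall back to all cores if none keeps up

print(optimal_cores(24.0, measured))  # 3 for a 24 fps source
```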

Thus I can then run like this:

AXELERA_LOW_LATENCY=1 ./inference.py unet_fcn_512-cityscapes rtsp://127.0.0.1:8554/0  --show-latency --no-display -d0 --aipu-cores=3
INFO : Selected device metis-0:1:0 (deselected 1 other device)
INFO : Enabling low latency mode for inference, at the cost of performance
Detecting... : | | 1339/0 [00:55<00:00, 24.20frames/s]INFO : Interrupting the inference stream
Core Temp : 47.0°C
CPU % : 1.1%
End-to-end : 24.0fps
Latency : 130.2ms (min:106.6 max:260.0 σ:8.2 x̄:133.7)ms

The latency here is still higher than I would like; this is something we will improve in the next release.

Oh, one other point: when measuring performance it’s generally best not to use a dataset, as parsing JPEGs from disk in our dataset adapter is typically slower than decoding a video source. You can use fakevideo or an mp4 file instead.
 

 


  • Author
  • Ensign
  • September 17, 2025

Ok, here we are:

I have implemented an image-inference Python script and limited the throughput to between 1 and 5 fps. I have also preloaded the raw PIL images in memory, already resized to 512x512, and commented out the ‘resize’ preprocessing in the YAML to limit any preprocessing overhead. I also wanted to pre-normalize the images, but when commenting out the ‘normalize’ section in the YAML file I got a strange error:

ERROR   : PermuteChannels is not implemented for gst pipeline

So I kept it.
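For reference, my preloading step is essentially this (a sketch; the directory path and PNG glob are illustrative):

```python
from pathlib import Path

from PIL import Image

# Decode and resize every image once, up front, so nothing but the
# CPU->AIPU transfer, normalize, inference and postprocessing remains
# on the per-frame path.
def preload(directory, size=(512, 512)):
    return [Image.open(p).convert("RGB").resize(size)
            for p in sorted(Path(directory).glob("*.png"))]
```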

Here are the stats at 1 fps:

=========================================================================
Element Time(𝜇s) Effective FPS
=========================================================================
decodebin-link0 506 1,973.2
axtransform-colorconvert-cl0 1,012,609 1.0
inference-task0:libtransform_resize_cl_0 4,450 224.7
inference-task0:libtransform_padding_0 2,947 339.3
inference-task0:inference 1,007,822 1.0
inference-task0:Inference latency 4,038,389 n/a
inference-task0:libtransform_paddingdequantize_0 34,140 29.3
inference-task0:libdecode_semantic_seg_0 30,763 32.5
inference-task0:Postprocessing latency 31,090 n/a
inference-task0:Total latency 5,117,034 n/a
=========================================================================
End-to-end average measurement 1.0
=========================================================================
Core Temp : 44.0°C
CPU % : 8.4%
End-to-end : 1.0fps
Latency : 0.0ms (min:inf max:-inf σ:0.0 x̄:0.0)ms

So my understanding is that inference-task0:inference = 1,007,822 is actually not the time to infer the image but the time between two inferences. Same for axtransform-colorconvert-cl0 = 1,012,609. For the others it seems to be the real time taken to perform the task in question. Am I right?

Now speaking about latencies: there is inference-task0:Inference latency = 4,038,389, whose meaning I don’t understand. And finally at the bottom there is Latency: 0.0ms (I love this one 😉), but it does not seem to correlate with the other latencies, and I’m not sure how reliable the value is because, as you can see below, it jumps up to 2109.8ms at 3 fps and drops down to 0 again at 4 fps.

I have repeated the experiments several times at each fps and the figures are every time very close.

Below are the results for 2,3,4,5 fps:

2 FPS

=========================================================================
Element Time(𝜇s) Effective FPS
=========================================================================
decodebin-link0 458 2,180.2
axtransform-colorconvert-cl0 511,596 2.0
inference-task0:libtransform_resize_cl_0 3,864 258.8
inference-task0:libtransform_padding_0 2,673 374.0
inference-task0:inference 504,555 2.0
inference-task0:Inference latency 2,035,085 n/a
inference-task0:libtransform_paddingdequantize_0 34,771 28.8
inference-task0:libdecode_semantic_seg_0 30,436 32.9
inference-task0:Postprocessing latency 30,767 n/a
inference-task0:Total latency 2,613,026 n/a
=========================================================================
End-to-end average measurement 2.0
=========================================================================
Core Temp : 45.0°C
CPU % : 9.0%
End-to-end : 2.0fps
Latency : 0.0ms (min:inf max:-inf σ:0.0 x̄:0.0)ms

3 FPS

=========================================================================
Element Time(𝜇s) Effective FPS
=========================================================================
decodebin-link0 384 2,601.7
axtransform-colorconvert-cl0 339,694 2.9
inference-task0:libtransform_resize_cl_0 2,933 340.9
inference-task0:libtransform_padding_0 2,399 416.8
inference-task0:inference 335,063 3.0
inference-task0:Inference latency 1,359,281 n/a
inference-task0:libtransform_paddingdequantize_0 35,002 28.6
inference-task0:libdecode_semantic_seg_0 30,929 32.3
inference-task0:Postprocessing latency 31,252 n/a
inference-task0:Total latency 1,768,174 n/a
=========================================================================
End-to-end average measurement 3.0
=========================================================================
Core Temp : 45.0°C
CPU % : 10.0%
End-to-end : 3.0fps
Latency : 2109.8ms (min:2048.0 max:2130.2 σ:21.7 x̄:2104.3)ms

4 FPS

=========================================================================
Element Time(𝜇s) Effective FPS
=========================================================================
decodebin-link0 326 3,059.9
axtransform-colorconvert-cl0 255,848 3.9
inference-task0:libtransform_resize_cl_0 2,148 465.5
inference-task0:libtransform_padding_0 1,993 501.7
inference-task0:inference 247,873 4.0
inference-task0:Inference latency 1,017,325 n/a
inference-task0:libtransform_paddingdequantize_0 34,442 29.0
inference-task0:libdecode_semantic_seg_0 29,487 33.9
inference-task0:Postprocessing latency 30,163 n/a
inference-task0:Total latency 1,340,274 n/a
=========================================================================
End-to-end average measurement 4.0
=========================================================================
Core Temp : 46.0°C
CPU % : 12.5%
End-to-end : 4.0fps
Latency : 0.0ms (min:inf max:-inf σ:0.0 x̄:0.0)ms

5 FPS

=========================================================================
Element Time(𝜇s) Effective FPS
=========================================================================
decodebin-link0 322 3,104.8
axtransform-colorconvert-cl0 207,562 4.8
inference-task0:libtransform_normalize_cl_0 2,283 437.9
inference-task0:libtransform_padding_0 2,084 479.8
inference-task0:inference 201,393 5.0
inference-task0:Inference latency 826,500 n/a
inference-task0:libtransform_paddingdequantize_0 33,833 29.6
inference-task0:libdecode_semantic_seg_0 27,540 36.3
inference-task0:Postprocessing latency 31,434 n/a
inference-task0:Total latency 1,106,963 n/a
=========================================================================
End-to-end average measurement 4.9
=========================================================================
Core Temp : 47.0°C
CPU % : 13.6%
End-to-end : 4.9fps
Latency : 1313.9ms (min:1250.0 max:1356.8 σ:20.8 x̄:1313.2)ms

One last question: how can I make the pipeline avoid the ‘axtransform-colorconvert-cl0’ element?

 

Sorry for the long answer.

Hoping to get some help understanding these timings.

Best


  • Author
  • Ensign
  • September 18, 2025

Hi ​@SamP ,

Thanks very much for the tips on reducing latency, and for your work on that topic, which is very important for my use case.

I did not get your message before I sent mine (just above), although yours seems to have been sent before mine. Anyway.

Using axrunmodel on my Metis M.2 I figured out that aipu-cores=4 was better than fewer:

1 aipu-core: dev:11.3 host:11.1 system:11.0fps PASS
2 aipu-core: dev:15.5 host:15.1 system:15.0fps PASS
3 aipu-core: dev:18.4 host:18.2 system:18.2fps PASS
4 aipu-core: dev:18.5 host:18.3 system:18.5fps PASS

So I repeated the image-inference experiments with AXELERA_LOW_LATENCY=1 and -d0 --aipu-cores 4.

Here are the latencies I got. (BTW, the 0ms latency I was getting was because I did not wait long enough: not enough samples were processed to have meaningful stats on latency.)
1 FPS

Inferred 1000 images
CPU % : 4.3%
End-to-end : 1.0fps
Latency : 1098.9ms (min:1042.9 max:1138.2 σ:16.5 x̄:1099.2)ms

2 FPS

Inferred 1000 images
CPU % : 6.5%
End-to-end : 2.0fps
Latency : 598.6ms (min:542.5 max:645.0 σ:17.2 x̄:597.9)ms

3 FPS

Inferred 1000 images
CPU % : 8.4%
End-to-end : 3.0fps
Latency : 428.6ms (min:360.5 max:469.1 σ:16.5 x̄:427.4)ms

4 FPS

Inferred 1000 images
CPU % : 9.9%
End-to-end : 3.9fps
Latency : 341.5ms (min:290.2 max:380.0 σ:16.1 x̄:340.6)ms

5 FPS

Inferred 1000 images
CPU % : 11.0%
End-to-end : 4.9fps
Latency : 282.6ms (min:231.0 max:399.4 σ:18.0 x̄:283.8)ms

If I understand properly, the Latency value is the in-out time for a frame to be processed, so if I remove the inference latency, which is 1/fps, I get a delayed output of ~100ms. Is that correct?
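To check that reading, subtracting one frame period (1000/fps ms) from each measured latency leaves a roughly constant residual (plain Python over the figures above):

```python
# Measured end-to-end latency (ms) at each injected frame rate, from the runs above.
latency_ms = {1: 1098.9, 2: 598.6, 3: 428.6, 4: 341.5, 5: 282.6}

for fps, lat in latency_ms.items():
    residual = lat - 1000.0 / fps   # strip one frame period
    print(fps, round(residual, 1))  # residual stays in the ~80-100 ms band
```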


  • Axelera Team
  • September 18, 2025

> BTW the 0ms latency I was getting was because I did not wait enough (not enough samples processed to have meaningful stats on latency)

Yes, we ignore the first 200 frames because GStreamer does pre-rolling and buffer filling that make the first 200 frames pretty meaningless. Then we use a rolling 10,000-frame buffer for the stats.

> If I understand properly the Latency values is the in-out time for a frame to be processed

The measured latency is from the moment the frame leaves the video decoder until it arrives at the main inference loop (for frame_result in stream). We skip the decoder because it’s hard for us to measure and is very source-format dependent (e.g. number of key frames etc.).

The reason you see the latency decrease as the fps increases is that our current executor effectively requires aipu-cores-1 frames to enter the inference element before the first frame is emitted. So that time is reduced if the frame rate is increased, or if the number of aipu-cores is reduced. I would suggest that the M.2 is bandwidth-limited between 3 and 4 cores on that model, so you may want to consider using only 3 cores if latency is your priority, since the 4th core only brings a 0.3 fps throughput increase.

The same is also true of OpenCL elements, where we use a similar algorithm to swap buffers. This is why the latency in your experiments above is `1,000,000` µs at 1 fps. If you set the env variable

AXELERA_USE_CL_DOUBLE_BUFFER=0

then we disable OpenCL buffering and I think you will see that the numbers make a lot more sense. Another factor with OpenCL double buffering is that the measured latencies are usually attributed to the wrong element: more often than not it is the buffer transfer from host<->gpu that is the biggest bottleneck, and it is the element AFTER the offending element that actually takes that hit. TL;DR: disable CL double buffering when measuring, or use AXELERA_LOW_LATENCY=1, which also disables OpenCL double buffering. Note that in the next release we use a better approach to pipelining OpenCL computation.
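As a toy model of the buffering effect (illustrative only, not how the executor is actually implemented): each double-buffered stage holds a frame until its successor arrives, so at low input rates every such stage adds roughly one frame period:

```python
# Each double-buffered stage (an OpenCL element, an executor slot) delays a
# frame by one frame period while it waits for the next frame to displace it.
def fill_latency_ms(fps, buffered_stages):
    return buffered_stages * 1000.0 / fps

# One buffered CL element at 1 fps adds ~1 s, which is why colorconvert
# showed ~1,000,000 us with double buffering enabled.
print(fill_latency_ms(1.0, 1))  # 1000.0
```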

> so if I remove the inference latency that is 1 / fps I got a delayed output of ~100ms. Is that correct

Not quite sure I understand the question. The inference latency still contributes to your delay. Or do you mean extra on top of the inference latency?

 

 


  • Author
  • Ensign
  • September 18, 2025

Hi ​@SamP ,

Thanks again, this is really helping me understand how you are measuring the latency.

I have also tried with aipu-cores 1, 2 and 3, and it actually makes no difference in latency.

I’m using AXELERA_LOW_LATENCY=1, thus, as you mention, I don’t also need to set AXELERA_USE_CL_DOUBLE_BUFFER=0.


As my images are pre-decoded and stored in RAM as PIL images, and are also already resized, the processing time is limited to only: the CPU->AIPU memory transfer + normalization + inference + post-processing.

I have also measured the time between when I ‘yield’ the frame in my image_pusher loop and when it arrives at the main inference loop (for frame_result in stream), and I find the same latency, so indeed the fps is part of the game:

For 5 FPS I now have:

Inferred 1000 images
========================================================================
Element Time(𝜇s) Effective FPS
========================================================================
decodebin-link0 390 2,557.9
axtransform-colorconvert-cl0 4,331 230.9
inference-task0:libtransform_normalize_cl_0 4,704 212.6
inference-task0:libtransform_padding_0 2,474 404.1
inference-task0:inference 199,443 5.0
inference-task0:Inference latency 206,586 n/a
inference-task0:libtransform_paddingdequantize_0 34,068 29.4
inference-task0:libdecode_semantic_seg_0 28,820 34.7
inference-task0:Postprocessing latency 30,488 n/a
inference-task0:Total latency 277,964 n/a
========================================================================
End-to-end average measurement 5.0
========================================================================
Core Temp : 48.0°C
CPU % : 12.0%
End-to-end : 5.0fps
Latency : 285.3ms (min:228.8 max:378.6 σ:19.5 x̄:284.6)ms
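The yield-to-arrival measurement I mention is just bookkeeping with a monotonic clock (generic Python; `timed_source` and `on_result` are illustrative names, not SDK API):

```python
import time

sent = {}  # frame id -> monotonic timestamp recorded at yield time

def timed_source(frames):
    # Wraps the image pusher: stamp each frame as it is yielded.
    for i, frame in enumerate(frames):
        sent[i] = time.monotonic()
        yield i, frame

def on_result(frame_id):
    # Call this when the matching result reaches the main inference loop.
    return (time.monotonic() - sent.pop(frame_id)) * 1000.0  # latency in ms
```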

But 285ms of latency is still a very, very high value for my use case. I’m glad you’re working on this for the next release; anything I can test in the meantime would be welcome.

 

I also see inference-task0:libdecode_semantic_seg_0 = ~30ms; is that the ArgMax?

 

I’m still scratching my head over why the inference latency depends on the fps.

    our current executor effectively requires aipu-cores-1 frames to enter the inference element before the first frame is emitted

So if I understand correctly, there is a kind of FIFO buffer of size aipu-cores-1 that is clocked at the pace of the incoming fps, hence the need to fill the FIFO before the first frame goes out to be inferred. Even if the size of the FIFO is one, we still need to wait for the next frame to push the first frame out. Is there a way to bypass this FIFO and push the frame directly to the model to be inferred?

Or, as the size of the FIFO is reduced to 1 frame, could I push the good frame followed immediately by a black frame, and then, if I could reset the FIFO to remove the black frame, go on like this frame after frame? The hypothesis here is that the model is independently clocked (not clocked by the fps), which I presume is the case.
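To make the flush idea concrete, here is a toy simulation of my mental model (not the actual executor): a 1-slot FIFO only emits a frame when a newer one displaces it, so a dummy frame pushed right behind the real one flushes it out immediately:

```python
from collections import deque

def emitted_frames(inputs, fifo_size=1):
    # A frame leaves the FIFO only when a newer frame displaces it.
    fifo, out = deque(), []
    for frame in inputs:
        fifo.append(frame)
        if len(fifo) > fifo_size:
            out.append(fifo.popleft())
    return out

print(emitted_frames(["real"]))           # [] - the real frame is stuck
print(emitted_frames(["real", "black"]))  # ['real'] - the dummy flushes it
```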

 


  • Author
  • Ensign
  • September 25, 2025

@SamP gentle ping


Spanner
Axelera Team
  • Axelera Team
  • October 10, 2025

Hi ​@npi ! Sorry again for the delay. I got some feedback from the team. It looks like the latency you're seeing is an inherent limitation of the current AxInferenceNet implementation, as it requires buffering frames before processing, which particularly impacts low FPS scenarios.

There’s an improved low latency mode in the works that directly addresses this issue, and early testing shows significant improvements - though there isn’t a delivery date on this as yet.

In the meantime, you could bypass AxInferenceNet and use the lower-level axruntime API directly. This would eliminate the buffering overhead but requires writing custom code. There are examples in the SDK's examples directory if you wanted to explore this approach.

Otherwise, waiting for the update might be the most practical option. 👍