Question

Multi core mode

  • May 12, 2026
  • 4 replies
  • 49 views

I am currently working with a Metis M2 card and trying to run a YOLOv8 model using the inference.py file from the SDK. The throughput is very low. I set the AIPU cores to 4 and verified this in the compile_config.json file. But there is another argument called multicore_mode ("multicore_mode": "batch"). What exactly does the multicore mode do, and is there a way to set it to an option other than ‘batch’?

4 replies

Spanner
Axelera Team
  • Axelera Team
  • May 12, 2026

Hi @manoj bhat!

Good news: "multicore_mode": "batch" is actually the right setting for maximum throughput. It tells the compiler to target all four cores together with the full memory budget, which is what you want. It's not something you'd normally need to change.

Could you share a few more details, such as your host system, OS, and the current output (FPS, any errors, etc.)?

 


  • Author
  • Cadet
  • May 13, 2026

Thank you for the reply. So, if I understood correctly, in batch mode (batch = 1) four instances of the model would be created, one for each core, and each of them performs inference?
Host system: Jetson Orin NX (16 GB) dev kit. OS: Tegra Linux (with the Axelera patch).
The input data is a 640x640 JPG file. I am currently getting around 18 FPS with a latency of around 2800 ms for my custom model, and around 28 FPS for the model from the SDK's model zoo.
 


Spanner
Axelera Team
  • Axelera Team
  • May 13, 2026

Hi again @manoj bhat! So, batch mode doesn't create four separate instances. It refers to all four cores working together on a single inference with the shared memory budget 👍

The 2800 ms latency is a bit of a red flag, though. At 18 FPS you'd expect something closer to 55 ms per frame (1000 ms / 18), so there's clearly overhead somewhere in the pipeline beyond the inference itself. A good first step is to run with --show-stream-timing, which breaks out latency and jitter through the run and should help pinpoint where the time is going. The instrumentation panel also splits "system throughput" from "device throughput", which will tell you whether the bottleneck is on the Metis side or in pre/post-processing on the host.
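To make that arithmetic concrete, here is a rough sketch in plain Python (not part of the SDK). It computes the frame period implied by 18 FPS and, as an added assumption, applies Little's law (frames in flight ≈ throughput × latency) to estimate how many frames would be sitting in the pipeline at 2800 ms latency. The function names are made up for illustration, and the estimate only holds if the pipeline is running at a steady rate.

```python
def frame_period_ms(fps: float) -> float:
    """Time between consecutive output frames at a given throughput."""
    return 1000.0 / fps


def frames_in_flight(fps: float, latency_ms: float) -> float:
    """Little's law estimate: average number of frames inside the pipeline.

    Assumes steady state (frames enter and leave at the same rate).
    """
    return fps * latency_ms / 1000.0


# Figures reported in the thread for the custom model.
fps = 18.0
latency_ms = 2800.0

period = frame_period_ms(fps)
queued = frames_in_flight(fps, latency_ms)

print(f"frame period at {fps:.0f} FPS: {period:.1f} ms")
print(f"estimated frames in flight at {latency_ms:.0f} ms latency: {queued:.1f}")
```

If the estimate is in the right ballpark, roughly 50 frames are buffered somewhere between input and output, which would point at queuing in the pipeline rather than at slow inference on a single frame.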

One other thing to be aware of: Jetson Orin NX is currently on limited/beta support, so some performance variability is expected on that platform, but I’d still expect better than what we’re seeing here. Keep me posted!


  • Author
  • Cadet
  • May 18, 2026

Thank you for the help. I tried the multi-model parallel execution pipeline with 2 and 4 streams of the same model, with the AIPU cores set to 1 for 4 streams and 2 for 2 streams. With 4 streams, the Metis inference throughput increased to 170, the latency is now 1609 and the end-to-end throughput is around 43 (σ: 84.3, x̄: 1609.5).
With 2 streams, the Metis inference throughput is 155, with a latency of 277.4 and an end-to-end throughput of 77.7 (σ: 44.6, x̄: 277.4).
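One way to read these two runs side by side is to compare end-to-end throughput against Metis inference throughput, in the spirit of the "system throughput" versus "device throughput" split mentioned earlier. The snippet below is only a sketch in plain Python: it assumes the throughput figures are in FPS and are measured on the same basis (both aggregate or both per stream), which the post does not state. The dictionary keys and the helper name are made up for illustration.

```python
# Figures as reported above; units (FPS) are an assumption.
runs = {
    "4 streams, 1 core each": {"device_fps": 170.0, "end_to_end_fps": 43.0},
    "2 streams, 2 cores each": {"device_fps": 155.0, "end_to_end_fps": 77.7},
}


def end_to_end_ratio(device_fps: float, end_to_end_fps: float) -> float:
    """Fraction of the device throughput that survives to the pipeline output."""
    return end_to_end_fps / device_fps


ratios = {
    name: end_to_end_ratio(r["device_fps"], r["end_to_end_fps"])
    for name, r in runs.items()
}

for name, ratio in ratios.items():
    print(f"{name}: end-to-end is {ratio:.0%} of device throughput")
```

Under those assumptions, the end-to-end rate is about a quarter of the device rate with 4 streams and about half with 2 streams, which would suggest the limit is outside the Metis inference step in both cases.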