Question

Run simple Gemm model

  • November 5, 2025
  • 12 replies
  • 198 views

Hello,

I am very new to this, so this question might be very basic. I am trying to run a very basic matrix multiplication application on the AIPU. I have a model whose forward pass does only a GEMM, and I exported it to ONNX. I tried following the tutorial, writing a .yaml file to then deploy and use the model (maybe with axrunmodel?), but I am failing to do so because I don't know what to put in the model info in the .yaml file. For example, I don't know what to put in the task_category.

Could you please explain how one would do that or tell me if I am doing all this completely wrong? 

Thank you very much

12 replies

Spanner
  • Axelera Team
  • November 5, 2025

Hi ​@MArio, welcome on board! Great to have you here!

So, the Gemm operator is supported by the Metis compiler, so you can run matrix multiplication operations. However, there's no pre-made GEMM model in the model zoo; you'd need to create your own ONNX model with the Gemm operator and deploy it.

For task_category, I'd suggest using Classification. It's the simplest category for basic tensor operations.

The best starting point is the t1-simplest-onnx tutorial built into the SDK. Run ./deploy.py t1-simplest-onnx from your SDK root to see how to deploy a minimal ONNX model. It'll show you the YAML structure you need.

Useful resources:

t1-simplest-onnx tutorial: https://github.com/axelera-ai-hub/voyager-sdk/blob/release/v1.3/docs/reference/model_zoo.md

ONNX operator support docs (Gemm info and limitations): https://github.com/axelera-ai-hub/voyager-sdk/blob/release/v1.2.5/docs/reference/onnx-opset14-support.md

Someone else doing similar custom ONNX work: https://community.axelera.ai/voyager-sdk-2/is-there-some-low-level-numpy-onnx-example-956

Hopefully some of that gives you a starting point, so let me know where you get to and maybe there’s more we can do to get up and running!


  • Author
  • Ensign
  • November 6, 2025

Hi ​@Spanner , Thank you for your answer.

I have tried to create my own ONNX model from PyTorch with only the GEMM operator inside, but I am having trouble deploying it.

It seems that whatever way I code the model, I always end up with the same error:

ValueError: Invalid data shape: (X, Y). Only scalars and 4D tensors supported.
 

with X and Y varying depending on my choices in the model.

It seems another user had the same issue, so I will try fixing it following what they said, but I am not really sure what the root cause of this error would be.


  • Author
  • Ensign
  • November 6, 2025

I seem to have issues with Gemm, as it uses (M, K) tensor shapes, while all the examples I have seen use 4D tensor shapes (NCHW). I tried implementing two Reshape nodes (the first one functioning as squeeze() and the second one unsqueezing), but I get this:
IndexError: tuple index out of range

Is there a way of handling 2D shaped tensors?


  • Author
  • Ensign
  • November 7, 2025

I am now getting an "IndexError: tuple index out of range" but I can't find where it comes from. From what I have checked, the dimensions all seem correct. I am attaching the compiler error in case someone can help me from there.


  • Axelera Team
  • November 7, 2025

Hi ​@MArio,

Welcome on board!

Thanks for sharing the error report. Could you please also share the ONNX file with us? That will help us pinpoint the layers that are causing this issue.

Many thanks!


  • Author
  • Ensign
  • November 7, 2025

Hi ​@Habib ,

Thank you very much for your answer, here is my onnx file.

There are 3 nodes in the model:

Flatten, to go from a 4D tensor to 2D

Gemm, which is the operation I want

LeakyRelu with alpha=1.0 (which basically does nothing, but is there to overcome the error about preprocessing nodes connected to the output)

 


  • Axelera Team
  • November 14, 2025

​Hi @MArio,

Thanks for your patience and for sharing the onnx model file.

It seems this model is quite minimal, and for some reason the compiler could not figure out how to map the input directly to the Flatten layer. This is also mentioned in the “Axelera's notes for developers” section in the docs, though I'll admit it could be more explicit.

Please try the attached original_with_conv.zip file with the compile binary as shown below, and feel free to let us know if you encounter any issues or have any further questions.

Thanks again!

--

> unzip original_with_conv.zip
> compile -i original_with_conv.onnx -o v0 --overwrite

10:37:32 [INFO] Dump used CLI arguments to: /home/ubuntu/v141_test/voyager-sdk/customers/community/v0/cli_args.json
10:37:32 [INFO] Dump used compiler configuration to: /home/ubuntu/v141_test/voyager-sdk/customers/community/v0/conf.json
10:37:32 [INFO] Input model has static input shape(s): ((1, 10, 2, 2),). Use it for quantization.
10:37:32 [INFO] Data layout of the input model: NCHW
10:37:32 [INFO] Using dataset of size 100 for calibration.
10:37:32 [INFO] In case of compilation failures, turn on 'save_error_artifact' and share the archive with Axelera AI.
10:37:32 [INFO] Quantizing '' using QToolsV2.
10:37:33 [INFO] ONNX model validation can be turned off by setting 'validate_operators' to 'False'.
10:37:33 [INFO] Checking ONNX model compatibility with the constraints of opset 17.
Calibrating... ✨✨✨✨✨✨✨✨✨✨✨✨✨✨✨✨✨✨✨✨ | 100% | 460.56it/s | 100it |
10:37:33 [INFO] Exporting '' using GraphExporterV2.
10:37:33 [INFO] Quantization finished.
10:37:33 [INFO] Quantization took: 0.59 seconds.
10:37:33 [INFO] Export quantized model manifest to JSON file: /home/ubuntu/v141_test/voyager-sdk/customers/community/v0/quantized_model_manifest.json
10:37:33 [INFO] Lower input model to target device...
10:37:33 [INFO] In case of compilation failures, turn on 'save_error_artifact' and share the archive with Axelera AI.
10:37:33 [INFO] Lowering '' to target 'device' in 'multiprocess' mode for 1 AIPU core(s) using 100.0% of available AIPU resources.
10:37:33 [INFO] Running LowerFrontend...
10:37:34 [INFO] Running FrontendToMidend...
10:37:34 [INFO] Running LowerMidend...
10:37:34 [INFO] Running MidendToTIR...
10:37:34 [INFO] Running LowerTIR...
10:37:35 [INFO] LowerTIR succeeded to fit buffers into memory after iteration 0/4.
Pool usage: {L1: alloc:115,968B avail:4,194,304B over:0B util:2.76%, L2: alloc:115,712B avail:32,309,248B over:0B util:0.36%, DDR: alloc:320B avail:1,040,187,392B over:0B util:0.00%}
Overflowing buffer IDs: set()
10:37:35 [INFO] Running TirToAtex...
10:37:35 [INFO] Running LowerATEX...
10:37:35 [INFO] Running AtexToArtifact...
10:37:35 [INFO] Allocate L1 memory pool pool_l1
10:37:35 [INFO] Allocating pool_l1 at 0x0183E3B00 (size: 115968 bytes, alignment: 64 bytes)
10:37:35 [INFO] Lowering finished!
10:37:35 [INFO] Compilation took: 2.22 seconds.
10:37:35 [INFO] Passes report was generated and saved to: /home/ubuntu/v141_test/voyager-sdk/customers/community/v0/compiled_model/pass_benchmark_report.json
10:37:35 [INFO] Lowering finished. Export model manifest to JSON file: /home/ubuntu/v141_test/voyager-sdk/customers/community/v0/compiled_model_manifest.json
10:37:35 [INFO] Total time: 2.82 seconds.
10:37:35 [INFO] Done.

 


  • Author
  • Ensign
  • November 14, 2025

Hi ​@Habib ,

Thank you very much for your answer. I would have imagined that since my Flatten node comes before the Gemm it would work. Anyway, the ONNX you sent me compiles for me too, so thank you very much again.

As you said, my model is very minimal. I am planning to perform some fault injection on the model and would like it to be as minimal as possible for easier analysis.

Do you know if there would be a way to use a model with only one Gemm node?


  • Axelera Team
  • November 14, 2025

You are most welcome ​@MArio!

Perhaps we can try with only one Gemm node and no Flatten layer; it may or may not work, and it would be interesting to know if it does. Or you could initialize the input Conv as an identity conv.
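As a sketch of the identity-conv idea (the channel count here is just an example): a 1x1 convolution whose weight is the identity matrix passes each channel through unchanged.

```python
import torch
import torch.nn as nn

channels = 10  # example channel count
conv = nn.Conv2d(channels, channels, kernel_size=1, bias=False)
with torch.no_grad():
    # Weight shape is (out, in, 1, 1); an identity matrix maps each
    # input channel straight to the matching output channel
    conv.weight.copy_(torch.eye(channels).reshape(channels, channels, 1, 1))

x = torch.randn(1, channels, 2, 2)
y = conv(x)  # y equals x
```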

Your use case for fault injection is quite interesting. If you could share more about your application and its main KPIs, it would greatly help the community understand your goal. More context could also lead to some excellent suggestions from other members.

We look forward to hearing about your results!

Thanks again!


  • Author
  • Ensign
  • November 18, 2025

Hello ​@Habib ,

I have not tried implementing GEMM alone, as this seems rather difficult if I must use 4D tensors.

For my use case, I want to inject faults during inference and see where, when, and how it fails.

That is why I am using a simple model: to be able to easily understand what it is doing. I can then change the tensor size to use most of the core and get more observable errors/faults.

For now I am mostly happy with the simple model I have (my Conv is an identity conv), but I have two main questions:

1. Can I see how long it takes to execute each layer in my model (i.e. the identity conv, the Flatten, the Gemm, ...)?

2. Can I access the memory in order to verify, for example, that my weights haven't been corrupted during my fault injection?


  • Axelera Team
  • November 19, 2025

Hi ​@MArio,

Ok, sounds interesting! Please see my replies below:

1. Can I see how long it takes to execute each layer in my model (i.e. the identity conv, the Flatten, the Gemm, ...)?
As far as I know, this is currently not possible; we don't yet support a way to profile the graph that runs on Metis, but perhaps we will have this soon.

2. Can I access the memory in order to verify, for example, that my weights haven't been corrupted during my fault injection?
Once the weights of the Gemm layer are compiled for Metis, they don't change. If I understood correctly, your fault injection will be in the input. If instead you want to fix your input and see how the output changes with faults in the weights, you could compile multiple models with varying weights. That's what I would do, but I could be wrong.

Hope that helps!


  • Author
  • Ensign
  • November 20, 2025

Hi ​@Habib ,

Thank you very much for your answer. Ideally I would perform fault injection on the device, not on the input (physical fault injection as well), so the errors should come from computational blocks and memories. Since your device is a compute-in-memory one, I would like to be able to tell, when there is an error, whether it happened as a memory error or as a computational error.
That is why, as from what I understand the weights are stored in memory, I would ideally like to check their value.

Thank you for your help!