Solved

Is there some low level NumPy / ONNX example?

  • August 22, 2025
  • 3 replies
  • 119 views

I’m doing some testing with the PCI-Express based AI card. 

After some initial hiccups with motherboard compatibility, which required me to swap the slots of the GPU and the AI card and then expose the card to an Ubuntu 22.04 Docker container, it's working fine.

 

I'm impressed by the performance of the card compared to running the traffic-identification benchmark problems with `--pipe torch` and a CUDA-based backend.

 

I want to do inference with not only custom weights but also custom nets, and see how it performs. My wish is to have a sample that uses NumPy float32 arrays as input and output and does inference on the card given a model from an ONNX file.

 

To keep it simple, I made a PyTorch net [3, 64][ ReLU ][64, 64][ ReLU ][64, 1] and trained it on a mathematical function f(x1, x2, x3) = y. This toy net was exported to ONNX.
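For reference, the layer stack can be sketched in plain NumPy. The weights below are random placeholders, not the trained parameters, and `forward` is a name chosen here for illustration; only the shapes follow the [3, 64] → ReLU → [64, 64] → ReLU → [64, 1] description above.

```python
import numpy as np

# Placeholder weights with the same shapes as the toy net.
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((3, 64), dtype=np.float32), np.zeros(64, dtype=np.float32)
W2, b2 = rng.standard_normal((64, 64), dtype=np.float32), np.zeros(64, dtype=np.float32)
W3, b3 = rng.standard_normal((64, 1), dtype=np.float32), np.zeros(1, dtype=np.float32)

def relu(x):
    return np.maximum(x, 0.0)

def forward(x):  # x: [batch, 3] float32
    h = relu(x @ W1 + b1)
    h = relu(h @ W2 + b2)
    return h @ W3 + b3  # [batch, 1]

y = forward(np.array([[0.1, 0.2, 0.3]], dtype=np.float32))
print(y.shape)  # (1, 1)
```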

 

From there I wanted to test inference on the AI card before moving to more complex things.

I could not see how my intended use case was compatible with `deploy.py` and the required entries in the model/pipeline YAML file.

So instead I tried to use the compiler directly, which gave an error:


07:58:42 [INFO] Running LowerFrontend...
07:58:42 [ERROR] Failed passes: ['axelera.PadChannelsToPword', 'LowerFrontend']
07:58:42 [INFO] TVM pass trace information stored in: /drive/build/compiled_model
07:58:42 [ERROR] Lowering failed. Failed pass: axelera.PadChannelsToPword <- LowerFrontend

 

With a small change to the default compilation config, I got a successful compilation and thought I now had something that I should be able to load onto the card.


compile --generate-config --output config

sed -i 's/quantize_and_lower/quantize_only/'  config/default_conf.json 

compile --input toy-model/toy3d_model.onnx --input-shape 1,3 --overwrite --output build --config config/default_conf.json


...

07:59:58 [INFO] Checking ONNX model compatibility with the constraints of opset 17.
Calibrating... ✨✨✨✨✨✨✨✨✨✨✨✨✨✨✨✨✨✨✨✨ | 100% | 469.31it/s | 100it |
07:59:59 [INFO] Exporting '' using GraphExporterV2.
07:59:59 [INFO] Quantization finished.
07:59:59 [INFO] Quantization took: 1.42 seconds.
07:59:59 [INFO] Export quantized model manifest to JSON file: /drive/build/quantized_model_manifest.json
07:59:59 [INFO] Quantization only was requested. Skipping lowering.
07:59:59 [INFO] Done.


Next I wanted to load the compiled model using python and axelera.runtime.

 

Which JSON file should I use for loading? I tried several, but loading the model does not work and I get the error:


ValueError: AXR_ERROR_VALUE_ERROR: Failed to load model from /drive/build/quantized_model/quantized_model.json
Error: Version not found in model


 

Is there some example in the SDK that shows something similar to what I'm trying to do, and that I've missed? Any hints or guidance would be appreciated.

Best answer by Linde

 

3 replies

  • Author
  • Cadet
  • August 25, 2025

Follow up:

 

`quantize_only` is not enough as the compile mode setting; lowering is needed.

After a successful `quantize_and_lower`, `build/compiled_model/model.json` is the file that should be used with the context's `load_model`.


Spanner
  • Axelera Team
  • August 28, 2025

Hi there @Linde! Great to see you here.

Maybe take a look at the t1-simplest-onnx tutorial in the Voyager SDK. It shows how to compile a (small) ONNX model and run it with axelera.runtime — inputs and outputs can be plain NumPy arrays. I think that might be what you’re looking for? But let me know if I’ve missed the mark there 😄

Also, I think you’ll need quantize_and_lower (not just quantize_only) so that the compiler produces a usable model.json (under build/compiled_model/), which is the one to load with AxRuntime. 👍


  • Author
  • Cadet
  • Answer
  • August 28, 2025

Thanks for the information!

 

I did look at that tutorial and at a few other places. 

The required parts of the YAML file needed for e.g. `deploy.py` did not match my “simple” test.

The few numbers in my 1D input don't have a color format.

 

You are correct, `quantize_and_lower` is needed, otherwise there is nothing that can be loaded onto the card.

 

I had errors in several places in my test that caused problems.

I'll try to list the ones I remember, and how I fixed them, in case it helps someone else.

  • My exported ONNX file had a dynamic batch size. I changed this to a static size.
  • When the input was 2D [batch_size, vector_size], I got compilation errors during lowering. One of the errors mentioned that only 1D and 4D are supported. I ended up using a 4D input to get a working setup.
  • I had to add some initial reshaping of the data to match the Linear layer, which expected a 1D vector. My first attempts at this resulted in compatibility warnings. In the end I used an ugly workaround, a fixed Conv2d with no learnable parameters plus a reshape, to make the compiler happy.
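A rough NumPy illustration of that workaround: the fixed 1x1 convolution is modeled here as a channel-mixing matmul with identity weights, so the values pass through unchanged before being flattened back to the vector the linear layer expects. The shapes and names are mine, not SDK API.

```python
import numpy as np

# 4D input [N, C, H, W]: the three scalars become three channels.
x4d = np.array([0.1, 0.2, 0.3], dtype=np.float32).reshape(1, 3, 1, 1)

# Fixed, non-learnable 1x1 conv weights [C_out, C_in] = identity.
conv_w = np.eye(3, dtype=np.float32)

# A 1x1 conv over [N, C, H, W] is a matmul across the channel axis.
y4d = np.einsum('oc,nchw->nohw', conv_w, x4d)

# Flatten back to [batch, 3] for the linear layer; values are unchanged.
x1d = y4d.reshape(1, -1)
print(x1d.shape)  # (1, 3)
```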

The next thing was quantization:

  • I loaded `build/compiled_model/model.json` and that worked, but calling model.inputs() and model.outputs() gave TensorInfo objects with no information about scale and zero-point. I had to load and parse `build/compiled_model_manifest.json` to get that information.
  • Using one JSON file for ctx.load_model and parsing another feels like I'm doing something wrong… but it worked.
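The manifest parsing amounts to plain JSON digging. The key layout in this sketch is purely hypothetical, chosen only to show the idea; inspect your own `build/compiled_model_manifest.json` to find the real schema.

```python
import json

# Hypothetical manifest excerpt -- NOT the documented schema.
manifest_text = '''
{
  "inputs":  [{"name": "x", "scale": 0.00392156862745098, "zero_point": 0}],
  "outputs": [{"name": "y", "scale": 0.01, "zero_point": -128}]
}
'''
manifest = json.loads(manifest_text)

# Pull out the per-tensor quantization parameters.
in_scale = manifest["inputs"][0]["scale"]
in_zp = manifest["inputs"][0]["zero_point"]
out_scale = manifest["outputs"][0]["scale"]
out_zp = manifest["outputs"][0]["zero_point"]
print(in_scale, in_zp, out_scale, out_zp)
```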

 

The full list of steps in my working setup was:

  • Run Python script 1, which builds a PyTorch net, trains it, and exports an ONNX file.
  • Run the compiler manually on the exported ONNX file, with all settings left at their default values.
  • Run Python script 2, which loads the compiled model, loads another JSON file, quantizes my floats, performs inference, and de-quantizes the output.

It is also good to know how the post-training quantization works; I made some guesses.

The manifest shows a scale of around 0.0039, which is close to 1/256, so I have a strong feeling that the randomized data used during quantization was in the range [0, 1]. Hence I can do inference for my learned function f with arguments in the same range; other values will be clipped and will not give the expected results.
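Under that assumption, the host-side quantize/de-quantize round trip looks roughly like this in NumPy. The scale (1/255) and zero-point (-128) are assumed values for an asymmetric int8 mapping of [0, 1], not taken from an actual manifest; the last sample shows the clipping of out-of-range inputs.

```python
import numpy as np

# Assumed parameters for an int8 mapping of the range [0, 1].
scale, zero_point = 1.0 / 255.0, -128

def quantize(x, scale, zp, dtype=np.int8):
    info = np.iinfo(dtype)
    q = np.round(x / scale) + zp
    return np.clip(q, info.min, info.max).astype(dtype)

def dequantize(q, scale, zp):
    return (q.astype(np.float32) - zp) * scale

# 1.5 lies outside the calibrated range, so it saturates at the int8
# maximum and de-quantizes back to 1.0, not 1.5.
x = np.array([0.0, 0.25, 0.5, 1.0, 1.5], dtype=np.float32)
q = quantize(x, scale, zero_point)
x_back = dequantize(q, scale, zero_point)
```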

 

Many details, and many places where one can make mistakes. Because of that, there might be a point in having a step-by-step guide for this. But then again, one must be aware of the limitations: what precision can be expected when float32 is quantized to int8 for general math, which range of inputs is used when the histogram is computed, and which values will be clipped.