Question

Custom Model on Metis SBC

May 28, 2026

spectral369
Cadet

Hello,

I am trying to compile an IBM Granite embedding model for the Metis AIPU.

Current test model:

ibm-granite/granite-embedding-311m-multilingual-r2
https://huggingface.co/ibm-granite/granite-embedding-311m-multilingual-r2

My final target model is:

ibm-granite/granite-switch-4.1-3b-preview
https://huggingface.co/ibm-granite/granite-switch-4.1-3b-preview

Target hardware and software:

Metis SBC
16 GB board RAM
4 GB AIPU
Python 3.12
Voyager SDK 1.6.0

I tried to compile the embedding model with `axcompile`, but it fails during quantization.

Command:

axcompile \
  --input granite_embedding_metis_work/granite_embedding_311m_static.onnx \
  --config granite_embedding_metis_work/granite_embedding_metis_4gb.json \
  --output granite_embedding_metis_work/axcompile_out \
  --overwrite \
  --dataset-len 2 \
  --log-level DEBUG \
  --quantize-only

The ONNX export succeeds.

Current ONNX details:

ONNX opset: 17
Input: inputs_embeds [1, 128, 768]
Output: embeddings [1, 768]

The model reaches calibration, then fails after calibration completes.

Error:

Calibrating... | 100% | 11.05s/it | 2it |

RuntimeError: External op model_layers_dot_0_attn_squeeze_const_input1 found in the model (<class 'qtoolsv2.intermediate_representation.operators.constant.Constant'> op). QTools may have issues quantizing this model.

ONNX operator summary:

Constant: 789
Mul: 226
Add: 154
MatMul: 134
Slice: 132
Transpose: 68
Squeeze: 66
Cast: 52
Concat: 46
Reshape: 44
Neg: 44
LayerNormalization: 44
Div: 44
Gather: 23
Split: 22
Softmax: 22
Shape: 22
Erf: 22
Unsqueeze: 4
ConstantOfShape: 2
Equal: 2
Where: 2
Expand: 2
Cos: 2
Sin: 2
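
For anyone wanting to reproduce a summary like the one above, here is a minimal sketch. It assumes the `onnx` Python package is installed; `load_op_types` and `summarize_ops` are hypothetical helper names for illustration, not part of the Voyager SDK.

```python
from collections import Counter


def load_op_types(onnx_path):
    """Return the op_type of every node in the top-level graph of an ONNX file."""
    import onnx  # assumption: the onnx package is available in this environment

    model = onnx.load(onnx_path)
    return [node.op_type for node in model.graph.node]


def summarize_ops(op_types):
    """Count operators and return (op_type, count) pairs, most frequent first."""
    return Counter(op_types).most_common()


# Usage (path taken from the axcompile command above):
# for op, count in summarize_ops(load_op_types(
#         "granite_embedding_metis_work/granite_embedding_311m_static.onnx")):
#     print(f"{op}: {count}")
```

Only the top-level graph is counted here; nodes inside subgraphs (for example in `If` or `Loop` bodies) would need a recursive walk.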

I tried several changes already:

Moved tokenizer and token embedding lookup to CPU.
Changed the ONNX input to inputs_embeds [1, 128, 768].
Used a fixed attention mask.
Moved final L2 normalization to CPU.
Tried FP32 export instead of FP16.
Used dataset-len 2 because dataset-len 1 fails.
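
The CPU-side steps in the list above (embedding lookup before the model, L2 normalization after it) can be sketched roughly as follows. This is a plain-Python illustration under assumptions: the function names, the `pad_id=0` default and the list-of-lists embedding table are hypothetical, and a real pipeline would more likely use NumPy arrays.

```python
import math


def pad_or_truncate(token_ids, seq_len=128, pad_id=0):
    """Force a token sequence to the static length the exported graph expects."""
    ids = list(token_ids)[:seq_len]
    return ids + [pad_id] * (seq_len - len(ids))


def lookup_embeddings(token_ids, table):
    """CPU-side embedding lookup: one table row per token id."""
    return [table[i] for i in token_ids]


def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit L2 norm (applied on the CPU to the model output)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, eps) for x in vec]
```

With the shapes quoted earlier, the lookup result would be the `[1, 128, 768]` `inputs_embeds` tensor (after adding a batch dimension), and `l2_normalize` would be applied to the `[1, 768]` output row.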

My question is:

Can Voyager SDK 1.6.0 compile Transformer embedding models like this through the generic `axcompile` ONNX path?

Or do IBM Granite models need a precompiled Metis package, similar to the LLM flow with `precompiled_url`?

I can provide these files if useful:

metadata.json
cli_args.json
conf.json
compilation_log.txt
compilation_report.json
ONNX operator summary
the export script

I would really appreciate guidance on this.

Thank you,
Peter

Spanner
Axelera Team
May 28, 2026

Hi there @spectral369! Welcome to the show, and thanks for the detailed write-up. You've clearly put a lot of work into the export already, nice one. 👍

So, for LLM models on Metis, the supported path at present (I believe) is the precompiled AxLLM flow rather than generic axcompile. The LLM tutorial goes into more depth on this. All the SLMs on Metis are precompiled, and arbitrary models can't be loaded unless they've been compiled for the platform (catch-22, at the moment anyway, though this is getting a lot of work right now!). I bet the QTools error you're hitting is the compiler saying it doesn't have rules for some of the Constant ops, which in turn fits with Transformer-class embedding models like Granite not being on the generic ONNX compile path at the moment.

Before going further with the export, maybe it’d be illuminating to try running one of the precompiled SLMs as a sanity check? Something like `axllm llama-3-2-1b-1024-4core-static --prompt "Hello"` should run on the MCB. The walkthrough is in that same LLM tutorial. If that runs cleanly, we know your Compute Board and SDK install are good and you've seen the supported LLM path working. If it errors, we can dig in deeper from there.

Out of curiosity, what's the wider use case?

spectral369
Author
Cadet
May 28, 2026

Hello,

Thank you for your swift response!

I tried the SBC tutorial, and it worked as described. At first, I was a bit confused by the stripped-down OS (just Docker with no additional packages), but I figured it out after manually installing a few from the Debian repositories 😅. The Llama model worked, and that was the extent of it, since I had to build my demo app on my PC and so far haven’t had a reason to push it to the SBC.

The demo use case is both naive and complex. At its core, it’s a custom RAG system using Docling and Qdrant for sensitive data. Since Granite doesn’t support my native language, I’m also using [Helsinki-NLP](https://huggingface.co/Helsinki-NLP). For now, the data ingestion is at a decent level. I could elaborate, but it would take a few paragraphs.

Is there an experimental workflow for my case, so I can test it myself?
Is there anything I can do besides waiting for LLM precompilation?

Thank you,
Peter

Spanner
Axelera Team
May 28, 2026

That’s not naive at all, it’s a great use case! Really interested to see how that progresses.

So, the custom model deployment tutorial we currently have is very much focused on vision models, since that’s primarily Axelera’s bread and butter. It doesn’t currently have an LLM angle to it, so unfortunately wouldn’t get you closer to Granite.

At this stage, probably the best thing you can do is add it as a feature request in the Launchpad section. The team does keep a close eye on that when working on priorities and the roadmap, so if it’s in there, it’ll get seen. And I’ll also pass your request along internally (we have a group chat specifically about new models people are looking for) to give it a bump there, too.

So, is it Finnish you’re looking to use? (Guessing from the model name!) I’m just wondering: if it's one of the languages Llama 3.2 supports, there might be a path that drops the Helsinki-NLP translation step out of your stack entirely. Not a like-for-like Granite replacement, but maybe worth considering depending on your accuracy requirements.

spectral369
Author
Cadet
May 28, 2026

Hello,

Oh, I understand, this feels like I’ve hit a wall 😞

Llama 3.2 doesn’t fit, but microsoft/Phi-3-mini-4k-instruct might be viable. From the start, Docling and Granite (embedding + model) were part of my plan, as they are well maintained and compatible with each other, and the model also includes the Query Rewrite adapter.

I’ll try to create a feature request!

Are the Phi 3/4 mini models from Hugging Face usable with `axcompile`?

Thank you for your time and help!
Peter

Spanner
Axelera Team
May 29, 2026

Hi Peter,

So, there's a ready-to-run set of LLMs for your 4 GB Compute Board, used exactly the way you ran the Llama 1B. Full list here: Model Zoo — Large Language Models. Looking over it, the ones that fit a 4 GB card are Phi-3 mini (512 tokens), Llama 3.2 1B and 3B (1024 tokens), and Almawave Velvet-2B (1024 tokens). The HuggingFace links on that page show each model's language coverage, so you can match it to your native language. Velvet is an Italian-built model though, so not sure if that meets the application you were looking for?

These are all precompiled by Axelera, and the public axcompile path doesn't cover LLM-class models at the moment, so Phi 3/4 from HuggingFace can't be compiled through it yourself. Phi-3 mini is in the zoo because it was statically compiled with Axelera's own tooling (and Phi-4 mini isn't in the set at the moment).

Quickest path could be to try one of the 4 GB models above against your language needs. Let me know how it goes!

spectral369
Author
Cadet
June 11, 2026

Hello,

You’re right: no matter how much I squeezed the input and answers, I couldn’t fit them into 512 tokens. So, Llama 3B it is, though I’d now prefer 2048 tokens 😅

I’m using Romanian as the main language, but 4 GB of AIPU memory now seems way too low. I’m feeling a little frustrated. I thought I could optimize this for short, high-quality retrieval, but the AIPU size and LLM seem to be ruining my plans.

Now I’m testing the app I managed to create on the Metis SBC. I’m having a bit of a hard time understanding how to integrate the inference stream. As I mentioned, the tutorial works:

axllm llama-3-2-3b-1024-4core-static --show-stats --prompt "Tell me short joke"
WARNING : Unsupported tracer: core_temp: valid tracers are: cpu_usage, end_to_end_fps, end_to_end_infs, latency
INFO    : Attempting to use tokenizer from URL: https://llm.axelera.ai/embeddings/meta-llama_Llama-3.2-3B-Instruct_tokenizer.zip
INFO    : Found AIPU driver: metis                 126976  0
INFO    : Firmware version matches: v1.6.1
INFO    : Firmware version matches: v1.6.1
INFO    : Selected available device metis-0:1:0 with 4.0 GB memory
INFO    : Model already downloaded and verified: /home/ubuntu/voyager-sdk/build/llama-3-2-3b-1024-4core-static/model/model.json
INFO    : Loaded embeddings: vocab_size=128256, embedding_dim=3072
INFO    : EmbeddingProcessor initialized with file: /home/ubuntu/voyager-sdk/build/llama-3-2-3b-1024-4core-static/llama_3_2_3b_embeddings.npz
INFO    : AxInstance initialized with model: /home/ubuntu/voyager-sdk/build/llama-3-2-3b-1024-4core-static/model/model.json
Assistant: I'm still in beta, but I've got a chip that's "neuron-ally" funny. Why did the transistor go to therapy? Because it was feeling a little "short-circuited."
INFO    : Tokenization: 1.5ms | Prefill: 58.4us | TTFT: 0.593s | Gen: 6.850s | Tokens/sec: 6.28 | Tokens: 43
INFO    : CPU %: 9.8%

Any tips would be appreciated, especially a code example 😉

I’ll post more info here as I get further.

P.S. I didn’t check, but can I add an M.2 Max card to the SBC?
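
One simple way to drive the CLI from application code is to wrap the same `axllm` call in a subprocess and parse its output. The following is a minimal sketch under assumptions, not an official integration API: the helper names are hypothetical, only the `--prompt` and `--show-stats` flags seen in the run above are used, and the reply is assumed to sit on a single line starting with `Assistant:`.

```python
import re
import subprocess


def build_axllm_command(model, prompt, show_stats=True):
    """Assemble the same CLI call that was run by hand above."""
    cmd = ["axllm", model, "--prompt", prompt]
    if show_stats:
        cmd.append("--show-stats")
    return cmd


def parse_axllm_output(text):
    """Extract the reply and tokens/sec from captured axllm output."""
    reply = None
    tokens_per_sec = None
    for line in text.splitlines():
        if line.startswith("Assistant:"):
            reply = line[len("Assistant:"):].strip()
        match = re.search(r"Tokens/sec:\s*([\d.]+)", line)
        if match:
            tokens_per_sec = float(match.group(1))
    return reply, tokens_per_sec


def ask(model, prompt, timeout_s=120):
    """Run one prompt through axllm and return (reply, tokens_per_sec)."""
    result = subprocess.run(
        build_axllm_command(model, prompt),
        capture_output=True, text=True, timeout=timeout_s, check=True,
    )
    # Assumption: log lines may go to stderr, so both streams are scanned.
    return parse_axllm_output(result.stdout + "\n" + result.stderr)


# Usage, inside the SDK environment:
# reply, tps = ask("llama-3-2-3b-1024-4core-static", "Tell me a short joke")
```

Each call starts a new `axllm` process and so reloads the model, which makes this suitable for a quick test rather than an interactive app; a long-running server process avoids that cost.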

spectral369
Author
Cadet
June 12, 2026

Re,

[?] Web query: Care este adresa persoanei care are ROL-ul 1000211 ?
[2026-06-12 08:19:01] [*] BM25 query terms: adresa adresa adresa persoanei persoanei persoanei ROL ROL ROL 1000211 1000211 1000211 1000211 1000211 adresa adresa adresa persoana persoana persoana
Batches: 100%|██████████| 1/1 [00:01<00:00, 1.63s/it]
2026-06-12 08:19:02,951 - httpx - INFO - HTTP Request: POST http://localhost:6333/collections/legal_docs/points/query "HTTP/1.1 200 OK"
[2026-06-12 08:19:02] [*] Reranking 12 candidates...
Batches: 100%|██████████| 1/1 [00:08<00:00, 8.02s/it]
[2026-06-12 08:19:10] [*] Rerank cutoff=-2.000 (top=8.675, rel_drop=0.0, floor=-2.0); 12/12 above cutoff.
[2026-06-12 08:19:10] rank 1 score=8.675 crop_test_2026.pdf
2026-06-12 08:19:10,977 - httpx - INFO - HTTP Request: POST http://localhost:6333/collections/legal_docs/points "HTTP/1.1 200 OK"
[2026-06-12 08:19:10] [*] Neighbor expansion: requested 2, fetched 2.
[2026-06-12 08:19:10] [*] Context group 1: 3 chunk(s), score=8.675, crop_test_2026.pdf, page 5
INFO: 192.168.10.4:53468 - "POST /api/chat HTTP/1.1" 200 OK
[2026-06-12 08:19:12] [*] Query translated: What is the address of the person with the ROL 1000211?
[2026-06-12 08:19:12] [*] Prompt budget: fixed=108, context=686, gen_reserve=200, final=794/1024
[2026-06-12 08:19:12] [*] Prompt length (chars): 2147
2026-06-12 08:19:12,102 - httpx - INFO - HTTP Request: POST http://localhost:7860/gradio_api/queue/join "HTTP/1.1 200 OK"
2026-06-12 08:19:12,177 - httpx - INFO - HTTP Request: GET http://localhost:7860/gradio_api/queue/data?session_hash=729d656b-77db-4bb7-926a-58e75df78d96 "HTTP/1.1 200 OK"
2026-06-12 08:19:15,323 - httpx - INFO - HTTP Request: POST http://localhost:6333/collections/legal_docs/facet "HTTP/1.1 200 OK"
INFO: 192.168.10.4:37360 - "GET /api/categories HTTP/1.1" 200 OK
2026-06-12 08:19:15,356 - httpx - INFO - HTTP Request: GET http://localhost:6333/collections/legal_docs "HTTP/1.1 200 OK"
INFO: 192.168.10.4:37360 - "GET /health HTTP/1.1" 200 OK
2026-06-12 08:19:16,097 - httpx - INFO - HTTP Request: GET http://localhost:6333/collections/legal_docs "HTTP/1.1 200 OK"
INFO: 127.0.0.1:59598 - "GET /health HTTP/1.1" 200 OK
[2026-06-12 08:19:20] [*] English answer: The address of the person with the ROL 1000211 is: STRADA <REDACTED> TIMISOARA.
[2026-06-12 08:19:21] [*] Timing: {'retrieval_s': 9.76, 'translate_query_s': 1.07, 'generate_s': 8.13, 'translate_answer_s': 1.58}

I managed to make it work using:
`docker exec -it Voyager-SDK bash -lc 'cd /home/ubuntu/voyager-sdk && source venv/bin/activate && axllm llama-3-2-3b-1024-4core-static --ui local --port 7860'`

On the PC, processing the files takes a reasonable amount of time on the CPU. However, on the Metis board, a 6-page PDF (with 5 pages of non-standard tables) takes a little over an hour. I can optimize this quite easily.
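
The `Prompt budget` line in the log above follows simple arithmetic: with a 1024-token window, 108 fixed prompt tokens and 200 tokens reserved for generation, at most 1024 - 108 - 200 = 716 tokens remain for retrieved context, and the logged 686 fits within that. A minimal sketch of that budgeting, with hypothetical function names and a caller-supplied token counter (a real count should come from the model's tokenizer):

```python
def context_budget(window, fixed, gen_reserve):
    """Tokens left for retrieved context after fixed prompt and reply reserve."""
    return max(window - fixed - gen_reserve, 0)


def select_chunks(chunks, budget, count_tokens):
    """Keep chunks in ranked order until the next one would exceed the budget."""
    kept, used = [], 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > budget:
            break
        kept.append(chunk)
        used += cost
    return kept, used


# Usage with a crude whitespace counter as a stand-in for the real tokenizer:
# budget = context_budget(window=1024, fixed=108, gen_reserve=200)
# kept, used = select_chunks(ranked_chunks, budget, lambda s: len(s.split()))
```

Stopping at the first chunk that does not fit keeps the reranked order intact; skipping it and trying smaller chunks further down would pack the window more tightly at the cost of that ordering.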

Spanner
Axelera Team
June 12, 2026

Awesome work @spectral369! In terms of the processing time, I’d say it’s almost certainly the CPU that’s grinding there, rather than the Metis AIPU. So if you aim any optimisations in the direction of the CPU, I think that’s where you’ll see the quickest improvements 👍

As for adding another Metis card to the Compute Board, I’m actually not sure! I don’t even know if anyone’s tried that, given it’s got a Metis already built in, but I’ll ask! It’d be a really interesting experiment, if nothing else… 😄
