Solved

M.2 AI Inference Acceleration card

  • July 13, 2025
  • 7 replies
  • 187 views

Is there no way to add RAM to this beauty 😁?

It's perfect for my laptop or an Orange Pi (embedded).

PCIe 5.0?

LLMs won't run on the M.2? 😣

Reading the SDK… Interesting… 🤔

Best answer by jonask-ai


7 replies

Spanner
  • Axelera Team
  • July 14, 2025

Is there no way to add RAM to this beauty 😁?

Hi there @Falcon9! Not to the Metis cards, though if you mean the hosts, some of them can certainly take more RAM.

 

It's perfect for my laptop or an Orange Pi (embedded).

PCIe 5.0?

Yeah, there's been some great (and very promising) experimentation on Orange Pi! This, for instance. And Metis is PCIe 3, but it's forward compatible (at PCIe 3 speeds, I'd assume).

 

LLMs won't run on the M.2? 😣

Reading the SDK… Interesting… 🤔

Not really enough RAM on the M.2 to run an LLM, at least the way everything currently works. Check out the PCIe card if LLMs are what you need, though.


  • Author
  • Cadet
  • July 14, 2025

Hello Spanner, thanks for the replies...

Ollama says:

Model: gemma3:12b

Microsoft Windows [Version 10.0.26100.4652]
(c) Microsoft Corporation. All rights reserved.

C:\Users\LucaF>ollama ps
NAME          ID              SIZE     PROCESSOR          UNTIL
gemma3:12b    f4031aab637d    11 GB    55%/45% CPU/GPU    4 minutes from now

Does that mean the CUDA drivers are in use?

 

CPU:

11th Gen Intel Core i9-11900H @ 2.5 GHz (8 cores)

RAM:

32 GB DDR4 2666 MHz

VIDEO CARD:

NVIDIA RTX 3060 6 GB (Laptop series)

 

So it seems that Ollama can split the work between the CPU and the GPU. In this configuration the model tends to be slow (I mean inference, right?)… The model has 12 billion parameters…
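A quick back-of-the-envelope check (just a sketch, using the 11 GB size from `ollama ps`, the laptop's 6 GB of VRAM, and an assumed ~1 GB of VRAM reserved for the KV cache and runtime overhead) shows why roughly half the model spills to the CPU:

```python
# Rough estimate of Ollama's CPU/GPU split for gemma3:12b on a 6 GB GPU.
# The 1 GB reserve is an assumption; the real split also depends on
# context size and per-layer granularity.
model_size_gb = 11.0   # size reported by `ollama ps`
vram_gb = 6.0          # RTX 3060 Laptop VRAM
reserve_gb = 1.0       # assumed VRAM kept free for KV cache / overhead

gpu_fraction = max(0.0, vram_gb - reserve_gb) / model_size_gb
print(f"on GPU: ~{gpu_fraction:.0%}, spilled to CPU: ~{1 - gpu_fraction:.0%}")
# → on GPU: ~45%, spilled to CPU: ~55%
```

Which lines up with the 55%/45% CPU/GPU figure in the `ollama ps` output above.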

What do you think ?

Seriously, my laptop needs a new brain. A way to communicate with me…

Yes, the cloud and other machines can do their jobs… (embedded)

Why not? Like HAL9000, maybe? Or better?

A new window (OS) onto the future :)

 

I still believe that PCIe 3 is not a problem…

PCIe 3.0, also known as PCIe Gen 3, has a transfer speed of 8 GT/s (gigatransfers per second) per lane. This translates to roughly 1 GB/s (gigabyte per second) of effective data transfer per lane. A standard PCIe 3.0 x4 slot, commonly found on motherboards, offers 4 lanes, for a maximum theoretical bandwidth of about 4 GB/s.
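Those figures follow from the link encoding: PCIe 3.0 runs 128b/130b encoding on top of the 8 GT/s line rate, which is where the ~1 GB/s-per-lane rule of thumb comes from. A quick sanity check:

```python
# PCIe 3.0: 8 GT/s per lane with 128b/130b encoding (vs. 8b/10b in Gen 2).
line_rate_gt_s = 8.0
encoding_efficiency = 128 / 130   # payload bits per bit on the wire
bits_per_byte = 8

per_lane_gb_s = line_rate_gt_s * encoding_efficiency / bits_per_byte
x4_gb_s = 4 * per_lane_gb_s
print(f"per lane: {per_lane_gb_s:.3f} GB/s, x4 slot: {x4_gb_s:.2f} GB/s")
# → per lane: 0.985 GB/s, x4 slot: 3.94 GB/s
```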

Anyway, my Orange Pi is still on its way… So, party? :)


jonask-ai
  • Axelera Team
  • Answer
  • July 16, 2025

Hi Falcon9,

honestly, I can't quite follow you 😅

I also find the thought interesting that LLMs will, in a way, become an addition to existing operating systems, like what we see in science fiction. It would be great to have one in a smart speaker, for a start 🤓

Just to make sure I'm understanding the technical part right:
you are already running inference with LLMs like gemma3:12b on your laptop (with Intel CPU and NVIDIA GPU) via Ollama, but you're not happy with the performance, right?

So, to run language models with the current version of our SDK, you'd need a PCIe card and a system with a slot for it. The M.2 will not work: our Metis chip relies on its associated memory, and on the M.2 module that memory is not sufficient to run the LLMs in our test environment.


  • Author
  • Cadet
  • July 16, 2025

Hello jonask-ai,

Yes, the actual M.2 Metis card (NOT the chip itself) has only 1 GB of RAM. OK.

That means this product is suited to AI machine vision.

Embedded… in…

Orange Pi 5 Plus 16GB LPDDR4/4X Rockchip RK3588 8-Core 64-Bit Single Board Computer with eMMC Socket. (ARRIVED TODAY :) )

Now I'm able to add computer vision to my smart home (with a Linux-based router, access point, custom home-management server, MQTT, Zigbee), at little cost…

And Alexa… Now I want to do it locally...

MentorPi Open Source Robot Car: ROS2 & Raspberry Pi 5

Hiwonder xArm AI Programmable Desktop Robot Arm with AI Vision & Voice Interaction

This isn't meant as criticism, just a consideration… What is the Metis chip capable of…?

So many papers and very interesting YouTube videos about AI… Have you seen the new Hugging Face hand?

Anyway… Just a consideration...


  • Cadet
  • November 19, 2025

Is there no way to add RAM to this beauty 😁?

It's perfect for my laptop or an Orange Pi (embedded).

PCIe 5.0?

LLMs won't run on the M.2? 😣

Reading the SDK… Interesting… 🤔


Hi @Falcon9,

Welcome to the community! It sounds like you have an awesome setup going with the Orange Pi 5 Plus and your smart home integration. 🤖🏠

To clear up the confusion regarding the Metis chip, memory, and what it can actually run, here is the breakdown:

1. Computer Vision is the "Home Turf"

You are absolutely right. The Metis accelerator was primarily born for high-performance Computer Vision. It excels at:

  • Classification: EfficientNet, MobileNet, ResNet.

  • Object Detection: YOLO series, SSD-MobileNet, RetinaFace.

  • Segmentation: U-Net, Mask R-CNN.

  • Depth: FastDepth. 

  • Keypoint Detection: YOLO.

For your Orange Pi / Robot car projects, the standard M.2 (even with 1GB RAM) is a beast for these vision tasks.

2. From LLMs to SLMs (Small Language Models)

Regarding your "HAL9000" dreams: technically, the conversation is shifting from Large Language Models to SLMs (Small Language Models). With the current Voyager SDK, Axelera has opened up support for SLMs, but memory is physics. As per the Model Zoo docs:

  • Phi-3-mini-4k-instruct: Requires ~4GB memory (for a 512 token context window).

  • Llama-3.2 (1B and 3B): Requires ~4GB.

  • Llama-3.2 (8B): Requires ~8GB.

  • Velvet-2b: Requires ~4GB.
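To see why even the smallest of these models will not fit in 1 GB, a rough weight-only estimate helps (the parameter counts and the 1-byte-per-parameter INT8 assumption below are illustrative, not Model Zoo figures, and real deployments also need room for activations and the KV cache):

```python
# Weight-only memory estimate; ignores activations and KV cache, which add more.
def weight_footprint_gb(params_billion: float, bytes_per_param: float = 1.0) -> float:
    """Gigabytes needed just to hold the weights at the given precision."""
    # 1e9 params at N bytes each = N GB per billion parameters
    return params_billion * bytes_per_param

# Illustrative parameter counts (assumptions, not Model Zoo figures):
for name, billions in [("a ~1B model", 1.2), ("a ~3B model", 3.2), ("Phi-3-mini", 3.8)]:
    print(f"{name}: ~{weight_footprint_gb(billions):.1f} GB of weights at INT8")
```

Even a ~1B-parameter model at INT8 already exceeds the M.2's 1 GB once any overhead is counted.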

3. The "Hacker" Route: Using Your 1GB M.2 for Language

Since you have the 1GB M.2 module, you cannot simply load these standard SLMs entirely onto the chip. However, if you are an expert developer and willing to get your hands dirty, you have two theoretical alternatives:

  • Extreme Optimization: You could work on optimizing/quantizing a very tiny custom Language Model to fit within the 1GB constraint.

  • Layer Offloading (Hybrid Inference): Similar to how llama.cpp handles models that don't fit entirely in VRAM, you could attempt to construct a pipeline where you offload only specific layers to the Metis NPU, while running the remaining layers on your Laptop's GPU or the Orange Pi's CPU.

    • Note: This is not a "plug-and-play" feature. It would require deep knowledge of the Voyager SDK and significant custom development to manage the split-computing flow between AIPU, CPU, and GPU.
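As a toy illustration of the split-computing idea (this is not a Voyager SDK API; the per-layer backends below are hypothetical placeholders), layer offloading amounts to routing each layer to a device, the way llama.cpp assigns a number of layers to the GPU and leaves the rest on the CPU:

```python
# Toy sketch of hybrid inference: the first K layers run on an accelerator,
# the rest on the host. The "devices" are plain Python functions standing in
# for real backends; no actual NPU or GPU dispatch happens here.
def run_layer_on_npu(x, layer_id):
    # hypothetical: dispatch one layer to the accelerator
    return x + 1  # placeholder compute

def run_layer_on_cpu(x, layer_id):
    # hypothetical: run one layer on the host CPU
    return x + 1  # placeholder compute

def hybrid_forward(x, n_layers=12, n_npu_layers=4):
    for i in range(n_layers):
        backend = run_layer_on_npu if i < n_npu_layers else run_layer_on_cpu
        x = backend(x, i)
    return x

print(hybrid_forward(0))  # → 12 (all layers applied, split across "devices")
```

The hard part in practice is not the routing but moving activations between devices efficiently, which is why this would need deep SDK work.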

4. The Hardware Route: "A Bigger Brain"

If you want a smoother experience running SLMs without heavily modifying the SDK, hardware with more memory is the way to go:

  • M.2 Max: Keep an eye out for this version. It upgrades the onboard memory from 1GB to 16GB, which is perfect for the models listed above.

  • PCIe Cards: The Metis PCIe cards come in 4GB and 16GB variants. There is even a 4x Metis card coming that pushes memory up to 64GB (allowing for much larger models).

  • Future Tech (Europa): Axelera is also working on the Europa chip (2nd Gen AIPU), which targets high-performance inference for large models (potentially up to Llama-3 70B).

Summary: For now, your 1GB M.2 is a powerhouse for Vision. For text generation, you either need to upgrade the hardware (M.2 Max/PCIe) or do some serious software engineering to split the workload!

Enjoy the Orange Pi party! :)


  • Cadet
  • November 19, 2025


@Spanner and @jonask-ai,

Since you both have much more experience with the Axelera ecosystem, I wanted to quickly check whether you agree with the summary I shared with Falcon9 above.

Specifically, regarding the "hybrid inference" idea (offloading specific layers to Metis while keeping others on CPU/GPU): is that theoretically feasible with the current state of the Voyager SDK, or is it still out of reach for now?

Also, do we have any rough ETAs or availability updates for the upcoming hardware?

  • Metis M.2 Max (16GB)

  • The 4x Metis PCIe Card (64GB)

  • The Europa Chip

I think knowing when these might land would really help Falcon9 plan his "HAL9000" upgrade path! 😄

Thanks for the insights!


  • Cadet
  • November 21, 2025

Hi @Falcon9,

Following up on our discussion about the "Hardware Route": for your specific setup with the Orange Pi, the M.2 MAX (16GB) is definitely still the one to keep on your radar. It is not commercially available yet, but it remains the perfect upgrade path to run those SLMs locally on your robot without the bulk of a full card.

However, for other scenarios requiring massive throughput or "heavy lifting" at the edge, Axelera has just released the PCIe AI accelerator card powered by 4 quad-core Metis AIPUs.

It’s now available in 16GB and massive 64GB configurations here: https://store.axelera.ai/products/pcie-ai-accelerator-card-powered-by-4-metis-aipu

I’ve been analyzing the technical datasheet (PDF here: https://axelera.ai/hubfs/Axelera%20AI%20Metis%20PCIe%20x4%20Edge%20Accelerator%20Card.pdf), and for pure inference, this looks like a game-changer:

  • Raw Power: A peak of 856 TOPS in INT8.

  • Efficiency: While the card has a Power Rating of 225W (for headroom), the Typical Power draw is stated at just 30W–58W.

The Inference Comparison

The most interesting part is how it compares to standard data-center cards like the NVIDIA L4. If we look at YOLOv8l:

  • Axelera (4-Metis): ~720 FPS

  • NVIDIA L4: ~238 FPS

Considering the L4 has a 72W TDP, 485 TOPS, and a higher price point, this new 4-chip solution from Axelera seems extremely promising for high-performance inference tasks.
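Putting the quoted numbers in FPS-per-watt terms makes the gap concrete (a rough comparison, since Axelera's 58 W typical draw and NVIDIA's 72 W TDP are not measured the same way):

```python
# FPS per watt on YOLOv8l, using the figures quoted in this thread.
cards = {
    "Axelera 4x Metis": (720, 58),   # (FPS, watts: upper end of typical draw)
    "NVIDIA L4":        (238, 72),   # (FPS, watts: TDP)
}
for name, (fps, watts) in cards.items():
    print(f"{name}: {fps / watts:.1f} FPS/W")
# → Axelera 4x Metis: 12.4 FPS/W
# → NVIDIA L4: 3.3 FPS/W
```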

So, while we wait for the M.2 MAX for your project, this new card is a very interesting development for the ecosystem!