
I just built a quick demo showing the Llama 3.2 3B chatbot running on our Metis platform, totally offline. The model packs 3 billion parameters and runs smoothly both on a standard Lenovo P360 with our PCIe card and on an Arduino-based dev board (the Portenta X8).

We hit 6+ tokens/sec per core, which is enough for real-time chat. That makes it a good fit for smart customer-support bots, digital concierge systems, and really any edge AI assistant application, all running fully on-device. No cloud needed.
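As a rough sanity check on the "real-time chat" claim, here is a quick back-of-the-envelope conversion of that throughput into words per minute. The 0.75 words-per-token figure is a common rule of thumb for English text, not a number from the demo:

```python
# Rough sanity check: what does 6 tokens/sec feel like in a chat?
# Assumption (rule of thumb, not measured): ~0.75 English words per token.

TOKENS_PER_SEC = 6
WORDS_PER_TOKEN = 0.75

words_per_minute = TOKENS_PER_SEC * WORDS_PER_TOKEN * 60
print(f"~{words_per_minute:.0f} words/minute")  # ~270 words/minute
```

Since typical silent reading speed is around 200-250 words per minute, generation at this rate stays ahead of the reader, which is what makes it feel like real-time conversation.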

Check out the video and let me know what you think. Any projects you can think of where you could use a self-contained, power-efficient, offline AI chatbot like this?

 

//EDIT: I am aware that the YouTube link is currently broken. I will reupload it soon.

 

For me right now, running a local LLM would be the main use case for buying the Metis AI hardware, and this demo convinced me that this could indeed be a viable application!

I’m running Home Assistant locally and it already has all the infrastructure in place for AI integration (openwakeword, text-to-speech, speech-to-text). Nextcloud is also starting to include (local) AI options, which I'd like to integrate into my setup.
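For anyone curious what wiring this up might look like: Home Assistant's conversation integrations can point at a local, OpenAI-compatible chat endpoint, which is the kind of API most local LLM servers expose. Below is a minimal sketch of that request shape. The endpoint URL and model id are placeholders for whatever the local server actually serves, not anything confirmed in this thread:

```python
# Sketch: talking to a local, OpenAI-compatible LLM endpoint of the kind
# Home Assistant's conversation integrations can be pointed at.
# LOCAL_ENDPOINT and MODEL_NAME are hypothetical placeholders.

import json
import urllib.request

LOCAL_ENDPOINT = "http://localhost:8080/v1/chat/completions"  # hypothetical
MODEL_NAME = "llama-3.2-3b"  # hypothetical model id on the local server


def build_chat_payload(user_text, system_prompt="You are a helpful home assistant."):
    """Assemble an OpenAI-style chat completion request body."""
    return {
        "model": MODEL_NAME,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_text},
        ],
        "stream": False,
    }


def ask(user_text):
    """POST the payload to the local server and return the reply text."""
    data = json.dumps(build_chat_payload(user_text)).encode()
    req = urllib.request.Request(
        LOCAL_ENDPOINT,
        data=data,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    # Inspect the request body without needing a running server:
    print(json.dumps(build_chat_payload("Turn off the living room lights?"), indent=2))
```

The nice part of this shape is that the wake-word, speech-to-text, and text-to-speech pieces already in Home Assistant sit in front of and behind this single HTTP call, so swapping the backing hardware doesn't change the rest of the pipeline.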

The common route in the homelab community is to go for an energy-intensive (and pricey) GPU with a lot of VRAM, but a dedicated device like the Arduino or the SBC sounds like an excellent alternative, especially once you factor in energy usage, price, and TOPS.

Question: the key to LLM performance seems to be VRAM. The way I understand it, your chip design takes a different approach to how memory and processing power are integrated. Does the raw power (214 TOPS) balance out the lower memory availability? What impact does loading a larger model have on token generation, for example?


Hi there @Eis-T! Yeah, you’ve nailed it there - Metis balances out the lower available memory by relying on smarter memory usage, reduced data movement, and highly parallel compute.

And I’m right with you regarding how amazing it’d be to see this coupled with Home Assistant! I’m a big HA user as well, and having a local LLM integrated into it would be incredible.

Is that something you’re actively working on?


Thanks for the quick response @Spanner!

It's not something I'm actively working on, but more something I want to sink my teeth into and learn more about. I don't plan on developing a product or service, but I do plan on sharing my findings with the Home Assistant community. Curious to hear if you are already in contact with Nabu Casa / the Open Home Foundation, as they are based in the Netherlands too. Building an ethical, local alternative to Amazon's Alexa/Echo is something they are dreaming about, just saying :)

Follow-up question: which of these two systems would you advise for this purpose, the Arduino or the ARM SBC? Spoiler: the one you recommend is the one I'll be ordering.

