Question

inference.py always deploys model

  • April 1, 2026
  • 1 reply
  • 15 views

I am using Metis dev board with Arduino x8. When running
./inference_llm.py llama-3-2-1b-1024-4core-static --prompt "Give me a joke"
it always deploys the model first. Even when I run this command several times in a row, it deploys the model again each time, even though it was just deployed.

Is there a way to deploy the model once and then just run inference against it?

1 reply

Spanner
Axelera Team
  • April 1, 2026

Yo @Dominik!

I think that's expected behaviour when using --prompt: each command invocation is a separate process that loads the model, runs your prompt, and then exits.
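In other words, the deployment cost is tied to process lifetime. A rough sketch of the difference (using hypothetical stand-in functions, not the real SDK API):

```python
# Stand-ins for illustration only -- the real deployment and inference
# calls belong to the Axelera SDK and are not shown here.

def deploy_model(name):
    # In reality this is the slow step that happens on every
    # `./inference_llm.py ... --prompt ...` invocation.
    return {"name": name}

def generate(model, prompt):
    # Stand-in for running inference on an already-deployed model.
    return f"[{model['name']}] response to: {prompt}"

# One-shot mode: deploy + single prompt per process (what you're seeing).
# Interactive mode: deploy once, then loop over many prompts.
model = deploy_model("llama-3-2-1b-1024-4core-static")  # happens once
for prompt in ["Give me a joke", "Another one"]:
    print(generate(model, prompt))
```

The interactive session below follows the second pattern: the process stays alive, so the model stays deployed between prompts.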

If you want to keep the model loaded and send multiple prompts without redeploying each time, give axllm a shot in interactive mode:

axllm llama-3-2-1b-1024-4core-static

This drops you into a chat session where the model stays loaded and you can keep sending prompts. Type exit when you're done. The LLM tutorial has more details on the available options like --temperature and --system-prompt. Let me know how it goes!