Question

Voyager `deploy.py` bug

  • June 26, 2026

Summary

On a Metis dev system, ./deploy.py <model> fails at the compile step. It exits with RC=1, writes nothing to stdout or stderr (captured with 1>out.txt 2>err.txt; both files are empty), and produces no log file or traceback. It reproduces on a trivial 27×30 model, so it is not related to model size. The only time the compiler did emit an error (an earlier run, attached as earlier_pkl_memoize_PermissionError.md), it was a TVM .pkl_memoize_py3 PermissionError. That makes me suspect the current silent failure is the same TVM-cache / working-directory-writability problem, now swallowed.

Environment

  • Hardware: Metis dev system — Portenta X8 on the board, host access via a Raspberry Pi acting as an adb relay over USB.
  • SDK: Voyager container voyager-sdk-1.4.0, SDK at /home/ubuntu/voyager-sdk, venv /home/ubuntu/.cache/axelera/venvs/b4c581dc (Python 3.10, TVM backend).
  • Import check: from axelera import compiler, runtime succeeds (prints OK).

What I'm deploying

A single 1×1 Conv2d exported to ONNX (it implements a matrix–vector product A@x for a PDHG LP solver), wrapped via the tutorial custom AxONNXModel flow (ax_models/tutorials/onnx/ + a YAML + matvec_model.py that returns a representative input vector for INT8 calibration). The smallest case (genip054) is [1,30,1,1] → [1,27,1,1] — a 27×30 weight. (Full files attached.)
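To make the equivalence explicit, here is a minimal dependency-free sketch of why a 1×1 convolution on a [1, C_in, 1, 1] input is the same computation as A@x. The helper names (conv1x1, matvec) and the 2×3 toy matrix are illustrative only and are not taken from the attached files; the real case is 27×30.

```python
# A 1x1 Conv2d applied to a [1, C_in, 1, 1] input reduces to y = A @ x,
# where A[o][i] = weight[o][i][0][0]. Plain Python, no dependencies.

def conv1x1(weight, x):
    """weight: [C_out][C_in][1][1], x: [1][C_in][1][1] -> [1][C_out][1][1]."""
    c_in = len(x[0])
    return [[[[sum(weight[o][i][0][0] * x[0][i][0][0] for i in range(c_in))]]
             for o in range(len(weight))]]

def matvec(a, v):
    """a: [rows][cols], v: [cols] -> [rows]."""
    return [sum(a_ij * v_j for a_ij, v_j in zip(row, v)) for row in a]

# Toy 2x3 matrix standing in for the 27x30 weight.
A = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]
v = [1.0, 0.0, -1.0]

# Reshape A into a conv weight and v into an NCHW input.
weight = [[[[a_ij]] for a_ij in row] for row in A]   # [2][3][1][1]
x = [[[[v_j]] for v_j in v]]                          # [1][3][1][1]

# Flatten the [1][C_out][1][1] conv output back to a plain vector.
y_conv = [channel[0][0] for channel in conv1x1(weight, x)[0]]
y_mat = matvec(A, v)
print(y_conv, y_mat)
```

Both results agree element by element, which is why the model is expected to behave like a plain fully-connected layer.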

Current symptom

==> [3/4] compiling forward matvec  A @ x   ([1,30,1,1] -> [1,27,1,1])
<returns to prompt; RC=1; out.txt empty; err.txt empty; no log written>

Captured directly (bypassing my wrapper script), unbuffered:

source venv/bin/activate; cd /home/ubuntu/voyager-sdk
MATVEC_CALIB=.../calib_genip054_x.npy python -u ./deploy.py matvec_forward --num-cal-images=8 \
1>out.txt 2>err.txt ; echo RC=$? # -> RC=1, out.txt=0 bytes, err.txt=empty
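For a second opinion on the exit status, the same capture can be sketched with Python's subprocess module, which separates a normal exit (positive return code) from death by signal (negative return code on POSIX). The helper name run_captured is hypothetical, and the demo runs a stand-in child process rather than deploy.py. The -X faulthandler interpreter option is a standard CPython flag that dumps a Python traceback on a fatal signal; whether it reveals anything here is untested.

```python
import signal
import subprocess
import sys

def run_captured(cmd, cwd=None, env=None):
    """Run cmd and return (returncode, stdout, stderr, description)."""
    proc = subprocess.run(cmd, cwd=cwd, env=env, capture_output=True, text=True)
    rc = proc.returncode
    if rc < 0:
        # On POSIX, a negative return code means the child died from a signal.
        try:
            desc = f"killed by signal {signal.Signals(-rc).name}"
        except ValueError:
            desc = f"killed by signal {-rc}"
    elif rc == 0:
        desc = "exited normally"
    else:
        desc = f"exited with status {rc}"
    return rc, proc.stdout, proc.stderr, desc

# Stand-in for the failing command: a child that exits 1 without printing.
# For the real case the command might look like this (assumption, untested):
#   [sys.executable, "-u", "-X", "faulthandler", "./deploy.py",
#    "matvec_forward", "--num-cal-images=8"]
demo_cmd = [sys.executable, "-X", "faulthandler", "-c", "import sys; sys.exit(1)"]
rc, out, err, desc = run_captured(demo_cmd)
print(f"RC={rc} ({desc}); stdout={len(out)} bytes; stderr={len(err)} bytes")
```

A plain status of 1 with empty output points to a deliberate exit rather than a crash or a kill, since a signal would show up as a negative code here (or as 128+N in the shell).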

A related earlier error (the only traceback I ever got)

From an earlier session (attached earlier_pkl_memoize_PermissionError.md), deploy.py failed at import with:

File ".../tvm/contrib/pickle_memoize.py", line 47, in __init__
os.mkdir(cache_dir)
PermissionError: [Errno 13] Permission denied: '.pkl_memoize_py3'

That is, TVM creates .pkl_memoize_py3 in the current working directory and fails when the CWD isn't writable. I worked around the import-time case by running from $SDK (which is writable), and the import check now prints OK, but the compile step still dies silently. My hypothesis: the compile (or a TVM worker it spawns) runs with a CWD that isn't writable, hits the same .pkl_memoize_py3 PermissionError, and the error is suppressed.
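The writability part of this hypothesis can be probed directly. The sketch below repeats the operation that failed in the traceback (an os.mkdir inside a given directory) under a throwaway name and returns the error instead of raising it. The helper name can_mkdir_in and the probe directory name are hypothetical; the demo uses a temporary directory rather than the SDK tree.

```python
import os
import tempfile

def can_mkdir_in(directory):
    """Return (True, None) if a subdirectory can be created in `directory`,
    otherwise (False, the OSError that was raised)."""
    probe = os.path.join(directory, f".mkdir_probe_{os.getpid()}")
    try:
        os.mkdir(probe)
    except OSError as exc:  # includes PermissionError and FileNotFoundError
        return False, exc
    os.rmdir(probe)  # clean up the probe directory
    return True, None

with tempfile.TemporaryDirectory() as writable_dir:
    ok_writable, err_writable = can_mkdir_in(writable_dir)
    missing_dir = os.path.join(writable_dir, "does_not_exist")
    ok_missing, err_missing = can_mkdir_in(missing_dir)

print("writable temp dir:", ok_writable)
print("missing dir:", ok_missing, type(err_missing).__name__)
```

Calling can_mkdir_in(os.getcwd()) from whichever process performs the compile would show whether that process sees a writable CWD. This only tests the directory-creation step; it does not confirm that TVM is what fails.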

What I've already ruled out

  • Model size: the 27×30 model fails identically, so the conv is not too large.
  • Files: both .onnx files are present and valid (they load fine in onnxruntime off-board).
  • Memory / OOM: docker inspect shows Memory=0, MemorySwap=0, OOMKilled=false, so there is no container memory limit and this is not an OOM kill.
  • Stale cache: cleared $SDK/build/*, all .pkl_memoize_py3 directories, and ~/.tvm, then restarted the container with docker restart; it still fails the same way.
  • Reboot: the failure persists across a board power cycle.
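For anyone repeating the memory check, here is a small sketch that extracts the same three fields from docker inspect output. It relies on the standard layout of that JSON (a one-element array with HostConfig.Memory, HostConfig.MemorySwap and State.OOMKilled). The helper name memory_summary is hypothetical, and the sample JSON simply mirrors the values reported above rather than a real dump.

```python
import json

def memory_summary(inspect_json):
    """Summarise memory limits and OOM state from `docker inspect` output."""
    info = json.loads(inspect_json)[0]  # docker inspect prints a JSON array
    host = info["HostConfig"]
    state = info["State"]
    return {
        "memory_limit_bytes": host["Memory"],
        "memory_swap_bytes": host["MemorySwap"],
        "oom_killed": state["OOMKilled"],
        # Docker reports a limit of 0 when no memory limit is set.
        "unlimited": host["Memory"] == 0,
    }

# Trimmed sample mirroring the values quoted in the list above.
sample = """
[{"State": {"OOMKilled": false},
  "HostConfig": {"Memory": 0, "MemorySwap": 0}}]
"""
summary = memory_summary(sample)
print(summary)
```

On the host, the input would come from docker inspect followed by the container name, piped in or read from a file.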

Reproduction

# inside the SDK container, from $SDK:
source /home/ubuntu/voyager-sdk/venv/bin/activate
cd /home/ubuntu/voyager-sdk
cp <attached>/deploy/matvec_forward.yaml <attached>/deploy/matvec_model.py ax_models/tutorials/onnx/
MATVEC_CALIB=<attached>/calib/calib_genip054_x.npy ./deploy.py matvec_forward --num-cal-images=8 -v
# -> exits RC=1 with no console output and no log

Questions

  1. How do I get deploy.py to emit the actual compiler error? Is there a log‑level env var (e.g. AXELERA_LOG_LEVEL) or a log‑file path the compiler writes to, so a failing compile isn't silent?
  2. Given the earlier .pkl_memoize_py3 PermissionError, is the current silent RC=1 the same TVM‑cache / CWD‑writability issue at compile time, and what is the robust fix (a writable cache dir, a TVM_* env var, a required -w/CWD)?
  3. Is there a known limitation compiling a single 1×1 Conv2d (a plain fully‑connected matvec) through the custom AxONNXModel tutorial path?
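Related to question 1, one generic Python technique I could try is running the script in-process with runpy and catching SystemExit, so that a silent sys.exit(1) at least yields the call stack that triggered it. This is a sketch under assumptions: the helper name run_and_trace_exit is hypothetical, the demo uses a throwaway fake_deploy.py rather than the real deploy.py, and the approach cannot see an exit that happens in a child process or through os._exit.

```python
import os
import runpy
import sys
import tempfile
import traceback

def run_and_trace_exit(script_path, argv=()):
    """Run a script in-process; return (exit_code, traceback_text)."""
    script_dir = os.path.dirname(os.path.abspath(script_path))
    saved_argv, saved_path = sys.argv, list(sys.path)
    sys.argv = [script_path, *argv]
    sys.path.insert(0, script_dir)  # let the script import its sibling modules
    try:
        runpy.run_path(script_path, run_name="__main__")
    except SystemExit as exc:
        # format_exc() includes the frames that led to sys.exit().
        return exc.code, traceback.format_exc()
    finally:
        sys.argv = saved_argv
        sys.path[:] = saved_path
    return 0, ""

# Throwaway script that fails silently, standing in for the real one.
demo_source = (
    "import sys\n"
    "def compile_step():\n"
    "    sys.exit(1)  # silent failure\n"
    "compile_step()\n"
)
with tempfile.TemporaryDirectory() as tmp:
    demo_script = os.path.join(tmp, "fake_deploy.py")
    with open(demo_script, "w") as fh:
        fh.write(demo_source)
    exit_code, exit_trace = run_and_trace_exit(demo_script, ["matvec_forward"])

print("exit code:", exit_code)
print(exit_trace)
```

If the returned traceback is empty while the shell still reports RC=1, that would suggest the exit originates outside the main interpreter, for example in a spawned worker.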

Attached (in the zip)

  • model/matvec_forward.onnx, model/matvec_adjoint.onnx (27×30 / 30×27)
  • deploy/matvec_forward.yaml, deploy/matvec_adjoint.yaml, deploy/matvec_model.py, deploy/make_repr_dir.py
  • calib/calib_genip054_x.npy, calib/calib_genip054_y.npy
  • deploy_genip054.sh (my wrapper: copy YAMLs → make repr imgs → clear build → compile fwd+adj)
  • logs/out.txt, logs/err.txt (current run — both empty), logs/deploy.log (wrapper output, dies at [3/4])
  • logs/earlier_pkl_memoize_PermissionError.md (the one real traceback)
  • relaxed_gen-ip054.npz (the LP, for context)