Summary
On a Metis dev system, ./deploy.py <model> fails at the compile step: it exits with RC=1, writes nothing to stdout or stderr (captured with 1>out.txt 2>err.txt, both empty), and produces no log file and no traceback. It reproduces on a trivial 27×30 model, so it is not related to model size. The one time the compiler did emit an error (an earlier run, attached as earlier_pkl_memoize_PermissionError.md), it was a TVM .pkl_memoize_py3 PermissionError, which makes me suspect the current silent failure is the same TVM-cache / working-directory-writability problem, now swallowed.
Environment
- Hardware: Metis dev system, with a Portenta X8 on the board; host access via a Raspberry Pi acting as an adb relay over USB.
- SDK: Voyager container voyager-sdk-1.4.0, SDK at /home/ubuntu/voyager-sdk, venv /home/ubuntu/.cache/axelera/venvs/b4c581dc (Python 3.10, TVM backend). from axelera import compiler, runtime imports successfully (prints OK).
What I'm deploying
A single 1×1 Conv2d exported to ONNX (it implements a matrix–vector product A@x for a PDHG LP solver), wrapped via the tutorial custom AxONNXModel flow (ax_models/tutorials/onnx/ + a YAML + matvec_model.py that returns a representative input vector for INT8 calibration). The smallest case (genip054) is [1,30,1,1] → [1,27,1,1] — a 27×30 weight. (Full files attached.)
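To make the model's arithmetic concrete, here is a minimal NumPy sketch of why a 1×1 Conv2d on a [1,30,1,1] input is the same thing as A@x. It is not the attached ONNX export: the matrix is random stand-in data, and the helper name conv1x1_as_matvec is made up for illustration.

```python
import numpy as np


def conv1x1_as_matvec(weight, x):
    """Apply a 1x1 convolution (weight [out_ch, in_ch, 1, 1]) to an NCHW input [1, in_ch, 1, 1]."""
    # Contract over the input-channel axis; both spatial axes have length 1.
    out = np.einsum("oihw,bihw->bo", weight, x)
    # Restore the two singleton spatial axes so the result is NCHW again.
    return out[:, :, None, None]


rng = np.random.default_rng(0)
A = rng.standard_normal((27, 30)).astype(np.float32)  # stand-in for the LP matrix (assumption)
x = rng.standard_normal(30).astype(np.float32)        # stand-in for the input vector

weight = A.reshape(27, 30, 1, 1)  # Conv2d weight layout: [out_ch, in_ch, kH, kW]
x_nchw = x.reshape(1, 30, 1, 1)   # batch of one, 30 channels, 1x1 spatial

y_nchw = conv1x1_as_matvec(weight, x_nchw)
print(y_nchw.shape)  # (1, 27, 1, 1)
print(np.allclose(y_nchw.reshape(27), A @ x, atol=1e-4))
```

The adjoint model is the same construction with the transposed matrix, i.e. a [30,27,1,1] weight mapping [1,27,1,1] to [1,30,1,1].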
Current symptom
==> [3/4] compiling forward matvec A @ x ([1,30,1,1] -> [1,27,1,1])
<returns to prompt; RC=1; out.txt empty; err.txt empty; no log written>
Captured directly (bypassing my wrapper script), unbuffered:
source venv/bin/activate; cd /home/ubuntu/voyager-sdk
MATVEC_CALIB=.../calib_genip054_x.npy python -u ./deploy.py matvec_forward --num-cal-images=8 \
1>out.txt 2>err.txt ; echo RC=$? # -> RC=1, out.txt=0 bytes, err.txt=empty
A related earlier error (the only traceback I ever got)
From an earlier session (attached earlier_pkl_memoize_PermissionError.md), deploy.py failed at import with:
File ".../tvm/contrib/pickle_memoize.py", line 47, in __init__
os.mkdir(cache_dir)
PermissionError: [Errno 13] Permission denied: '.pkl_memoize_py3'
i.e. TVM creates .pkl_memoize_py3 in the current working directory and fails when the CWD isn't writable. I worked around the import-time case by running from $SDK (writable), and the axelera import check now prints OK, but the compile step still dies silently. My hypothesis: the compile (or a TVM worker it spawns) runs with a CWD that isn't writable, hits the same .pkl_memoize_py3 PermissionError, and the error is being suppressed.
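A small probe along these lines could check the writability half of that hypothesis before launching the compile. It is only a sketch: dir_is_writable and pick_writable_cwd are hypothetical helper names, and it tests the directory it is run from, which says nothing about the CWD of any worker process the compiler might spawn.

```python
import os
import tempfile


def dir_is_writable(path):
    """Return True if a subdirectory can actually be created inside `path`."""
    probe = os.path.join(path, ".write_probe_%d" % os.getpid())
    try:
        # Same kind of call that raised PermissionError in pickle_memoize.py.
        os.mkdir(probe)
    except OSError:
        return False
    os.rmdir(probe)
    return True


def pick_writable_cwd(candidates):
    """Return the first existing, writable directory from `candidates`, or None."""
    for path in candidates:
        if os.path.isdir(path) and dir_is_writable(path):
            return path
    return None


if __name__ == "__main__":
    cwd = os.getcwd()
    print("CWD:", cwd, "writable:", dir_is_writable(cwd))
    # Fall back to the system temp directory if the CWD is read-only.
    print("usable directory:", pick_writable_cwd([cwd, tempfile.gettempdir()]))
```

Creating a real directory is deliberate: os.access can report a directory as writable in cases (for example some read-only mounts) where mkdir still fails.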
What I've already ruled out
- Model size: the 27×30 model fails identically (so not the conv being too large).
- Files: both .onnx files are present and valid (load fine in onnxruntime off-board).
- Memory / OOM: docker inspect → Memory=0, MemorySwap=0, OOMKilled=false; no container memory limit, not an OOM kill.
- Stale cache: cleared $SDK/build/*, all .pkl_memoize_py3, ~/.tvm, and restarted the container with docker restart; still fails the same way.
- Reboot: persists across a board power cycle.
Reproduction
# inside the SDK container, from $SDK:
source /home/ubuntu/voyager-sdk/venv/bin/activate
cd /home/ubuntu/voyager-sdk
cp <attached>/deploy/matvec_forward.yaml <attached>/deploy/matvec_model.py ax_models/tutorials/onnx/
MATVEC_CALIB=<attached>/calib/calib_genip054_x.npy ./deploy.py matvec_forward --num-cal-images=8 -v
# -> exits RC=1 with no console output and no log
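For anyone trying to reproduce this, a capture wrapper like the following sketch records how the process ended as well as its output. The function name run_and_report is hypothetical. PYTHONFAULTHANDLER and PYTHONUNBUFFERED are standard CPython environment variables, but I have not confirmed that deploy.py or any subprocess it starts will surface anything through them.

```python
import os
import signal
import subprocess
import sys


def run_and_report(cmd, cwd=None, extra_env=None):
    """Run `cmd`, capture stdout/stderr, and describe how the process ended."""
    env = dict(os.environ)
    env["PYTHONFAULTHANDLER"] = "1"  # dump Python tracebacks on fatal signals
    env["PYTHONUNBUFFERED"] = "1"    # same effect as `python -u`
    env.update(extra_env or {})
    proc = subprocess.run(cmd, cwd=cwd, env=env, capture_output=True, text=True)
    rc = proc.returncode
    if rc < 0:
        # subprocess reports death-by-signal as a negative return code.
        try:
            name = signal.Signals(-rc).name
        except ValueError:
            name = "unknown"
        how = f"killed by signal {-rc} ({name})"
    else:
        how = f"exited with status {rc}"
    return {"returncode": rc, "how": how, "stdout": proc.stdout, "stderr": proc.stderr}


if __name__ == "__main__":
    # Stand-in child that exits 1 silently, like the failing compile.
    # For the real case, pass the deploy.py command line from above with
    # cwd="/home/ubuntu/voyager-sdk" and MATVEC_CALIB supplied via extra_env.
    report = run_and_report([sys.executable, "-c", "import sys; sys.exit(1)"])
    print(report["how"])
    print("stdout bytes:", len(report["stdout"]), "stderr bytes:", len(report["stderr"]))
```

In a shell, a process killed by a signal shows up as RC=128+N, so a plain RC=1 already suggests that something exited deliberately rather than crashed. That fits an error being caught and swallowed, but does not prove it.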
Questions
- How do I get deploy.py to emit the actual compiler error? Is there a log-level env var (e.g. AXELERA_LOG_LEVEL) or a log-file path the compiler writes to, so a failing compile isn't silent?
- Given the earlier .pkl_memoize_py3 PermissionError, is the current silent RC=1 the same TVM-cache / CWD-writability issue at compile time, and what is the robust fix (a writable cache dir, a TVM_* env var, a required -w/CWD)?
- Is there a known limitation compiling a single 1×1 Conv2d (a plain fully-connected matvec) through the custom AxONNXModel tutorial path?
Attached (in the zip)
- model/matvec_forward.onnx, model/matvec_adjoint.onnx (27×30 / 30×27)
- deploy/matvec_forward.yaml, deploy/matvec_adjoint.yaml, deploy/matvec_model.py, deploy/make_repr_dir.py
- calib/calib_genip054_x.npy, calib/calib_genip054_y.npy
- deploy_genip054.sh (my wrapper: copy YAMLs → make repr imgs → clear build → compile fwd+adj)
- logs/out.txt, logs/err.txt (current run, both empty), logs/deploy.log (wrapper output, dies at [3/4])
- logs/earlier_pkl_memoize_PermissionError.md (the one real traceback)
- relaxed_gen-ip054.npz (the LP, for context)
