
Jetson + metis real-time inference on stereo camera

  • June 12, 2026


use the metis m.2, zed x and nvme on a jetson orin nx 16gb

 

a condensed field guide to a real-time vision stack on the orin nx 16gb: preempt_rt kernel, axelera metis npu, zed x stereo, nvme. every section here exists in long form in the repository docs, linked throughout and listed at the end.

 

what this is

a complete recipe for taking a stock jetson orin nx 16gb and turning it into a real-time vision platform:

  • custom preempt_rt kernel with low-jitter tuning (cyclictest avg ~3 µs on the isolated cores, max under the 100 µs gate that make verify enforces; run it headless, since a desktop session adds ~150 µs ipi spikes)
  • axelera metis m.2 as the inference accelerator (in-tree driver, not oot, vermagic-safe). detection costs the jetson's gpu essentially nothing, which is the whole point
  • stereolabs zed x via the zed link mono capture card (max9296 deserializer)
  • nvme boot + a btrfs data partition for recordings
  • dmabuf zero-copy pipeline (design goal: camera → isp → npu with no cpu memcpy)

everything below is live-verified on the reference unit (2026-06-11) — measurements, not projections. the headline numbers:

metric | measured
yolov5s-v7-coco on 1080p video → metis, end-to-end | 49.2 fps at 13.7% cpu
live zed x → metis, c++ headless | 56.4 fps @ HD1200@60 · 95.7 fps @ SVGA@120 (yolov8s)
fusion sample: npu detection + gpu depth + point cloud + imu pose + skeletons | 46–53 fps @ HD1200@60, every feature live
gpu cost of detection | zero: ~25% GR3D, all of it zed rectification + compositor
cyclictest, 10 s burst on isolated core 1, headless | avg ~3 µs · max under the 100 µs verify gate
boot to sshd | ~60 s

full tables, exact commands, expected outputs and the bench harness are in the samples and benchmarks docs. it's all in one repo: https://github.com/silicondoritos/jetson-rt-stack (apache 2.0), and it builds on the original axelera bring-up guide and axl-jetson.patch from the axelera team.

 

Demo image

a still from the sample app streaming the zed x feed and performing real-time inference.

the sample application shows what the metis running yolov8l and the zed x camera sdk can do together: labels, distances, boxes, an arrow showing each object's calculated direction, and 3d volumes for detected objects.

 

 

 

what you need

  • a jetson orin nx 16gb (p3767 module, p3509-class carrier), an axelera metis m.2 (key m, 2280, gen3 x4 — not 2230 / gen4 x2 as some secondary sources claim), a zed x + zed link mono capture card, an nvme drive
  • an ubuntu 22.04 host (or docker on anything newer) with ~100 gb free
  • the third-party pieces the repo can't ship: the nvidia l4t r36.4.3 tarballs, the bootlin toolchain, axelera-driver + voyager-sdk (axelera, NDA), zedx-driver (stereolabs NDA), the public zed sdk. pinned versions and where to get each: the third-party guide in the repo.

if you only take four things from this post

  1. vermagic discipline: no third-party .ko anyone ships will load on a preempt_rt kernel. promote drivers in-tree, ship a matching linux-headers .deb, and build the stereolabs modules as external modules with KBUILD_EXTRA_SYMBOLS pointing at the nvidia-oot + hwpm Module.symvers (or modpost dies "undefined!" and the camera silently ships absent).
  2. the dtbo trap: nvidia's kernel build silently skips dtbo-y targets. your camera overlay can "build" for an hour and not exist. compile it yourself, then check the .dtbo is in /boot/ and the OVERLAYS line is in extlinux.conf.
  3. LINK_WAIT_MAX_RETRIES=200: at the stock 10 pcie link-training retries the metis ghosts on cold boots and brownouts. 200 trains it reliably (axelera suggests 50, i chose 200). the cost: each retry sleeps ~90–100 ms, so an empty m.2 slot now burns ~18–20 s of boot — with the metis seated, that path is never hit.
  4. GST_PLUGIN_FEATURE_RANK=nvv4l2decoder:NONE: nvidia's hardware decoder outputs NVMM-memory caps the axelera gstreamer elements can't consume, so every file/rtsp inference source dies not-negotiated until you force software decode.
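
the boot cost in item 3 is plain arithmetic. a minimal sketch, using only the retry counts and the ~90–100 ms per-retry sleep quoted above (the function name is made up for illustration, it is not part of the repo):

```python
def empty_slot_delay_s(retries: int, sleep_ms: float) -> float:
    """Worst-case time spent waiting on a pcie link that never trains."""
    return retries * sleep_ms / 1000.0


# stock (10), axelera's suggestion (50), and the value chosen here (200)
for retries in (10, 50, 200):
    low = empty_slot_delay_s(retries, 90)
    high = empty_slot_delay_s(retries, 100)
    print(f"{retries:>3} retries: {low:.1f}-{high:.1f} s with nothing in the slot")
```

the delay only applies when the link never comes up; a seated metis that trains on an early retry leaves the loop straight away.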

the one-command recipe

everything below stages inputs; the build itself is one command at the end. make doctor runs first and refuses to start if anything is missing.

1. host setup (one-time, ubuntu 22.04 or newer with docker):

git clone https://github.com/silicondoritos/jetson-rt-stack.git
cd jetson-rt-stack

sudo apt update && sudo apt install -y \
build-essential bc bison flex git rsync zstd make openssl xxd \
libssl-dev dpkg-dev qemu-user-static device-tree-compiler \
nfs-kernel-server docker.io curl
sudo usermod -aG docker $USER && newgrp docker
pip install kconfiglib

2. the three public nvidia tarballs, dropped at the repo root:

wget https://developer.nvidia.com/downloads/embedded/l4t/r36_release_v4.3/release/Jetson_Linux_r36.4.3_aarch64.tbz2
wget https://developer.nvidia.com/downloads/embedded/l4t/r36_release_v4.3/release/Tegra_Linux_Sample-Root-Filesystem_r36.4.3_aarch64.tbz2
wget https://developer.nvidia.com/downloads/embedded/l4t/r36_release_v4.3/sources/public_sources.tbz2

(the bootlin toolchain url circulating in older guides, under v5.0, 404s — nvidia moved the toolchain to r36_release_v3.0/. the dockerfile carries the fixed url.)

3. the vendor trees, also at the repo root:

jetson-rt-stack/
axelera-driver/ <- metis kernel driver (NDA, ask axelera support)
voyager-sdk/ <- voyager sdk + axl-jetson.patch (same package)
zedx-driver/ <- zed x kernel driver (NDA, ask stereolabs)
zed-sdk/ <- ZED_SDK_Tegra_*.run (public, stereolabs.com/developers/release)

no camera? skip the two stereolabs entries and set CONFIG_CAMERA_NONE=y in make menuconfig — the metis + nvme baseline builds with no stereolabs NDA at all and still does 49.2 fps on video.
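
a quick way to sanity-check the layout before starting a build. this is a hypothetical stand-in for illustration, not the repo's make doctor; the directory names are the ones listed above, everything else is an assumption:

```python
from pathlib import Path

# directory names from the layout above; the split into "always" and
# "camera only" mirrors the no-camera note, not any file in the repo
ALWAYS = ("axelera-driver", "voyager-sdk")
CAMERA_ONLY = ("zedx-driver", "zed-sdk")


def missing_vendor_trees(repo_root, with_camera=True):
    """Return the expected vendor directories that are absent under repo_root."""
    root = Path(repo_root)
    wanted = ALWAYS + (CAMERA_ONLY if with_camera else ())
    return [name for name in wanted if not (root / name).is_dir()]
```

calling missing_vendor_trees(".") from the repo root should return an empty list once all four trees are staged, and with_camera=False matches the CONFIG_CAMERA_NONE=y baseline.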

4. profile + build container (one-time):

make defconfig       # committed defaults: rt kernel + metis + zed x mono + maxn_super
make docker-build # ~5 min

5. the recipe:

SEED_USER=j SEED_WIFI_SSID=yournet SEED_WIFI_PSK=yourpass make ignite

make ignite = preflight, extract, kernel build, rootfs bake, hard audit gate, then it pauses ONCE for you to put the jetson in force recovery (short REC+GND, usb-c straight into the host, no hub), flashes, waits out first boot, and runs post-flash validation. about 90 minutes of host-side work end to end; budget separately for the one-time on-device steps (the opencv-cuda build on unit #1, an hour-plus; the ~17 min first model compile). make ignite-no-flash does everything except touch the device.

the three things no script can do for you: the NDA paperwork, the two physical moves (recovery jumper on, then off + power-cycle), and one-time internet on the jetson (first boot provisions the python env online; offline it defers cleanly and finishes on the next connected boot). after that, make verify over ssh comes back green and the fun part starts.

the kernel, compressed

the defconfig is the heart of it; every knob and its reason is in the configuration doc. the load-bearing decisions:

  • rt core: PREEMPT_RT + NO_HZ_FULL + CPU_ISOLATION + RCU_NOCB_CPU move together — isolcpus / nohz_full / rcu_nocbs all set to 1-5, drop any one and the jitter comes back. one sharp edge: isolcpus removes cores from the scheduler domains, so threads never load-balance between isolated cores. the single-threaded metis loop on core 1 loves this; threaded consumers (zed sdk, cuvslam) present as "pinned but lopsided" — one core pegged, its sibling idle — and need per-thread affinity or cgroup-v2 isolated partitions.
  • debug stripping: KASAN + ftrace + lockdep (a bring-up debug fragment — stock flashed l4t does not run kasan) put my cyclictest max at 180–220 µs; stripped, 30–50 µs. honesty about what that buys: at 60 fps the 16.7 ms frame period dominates latency, so the real case for rt is the tail — bounding the multi-millisecond stalls a loaded debug kernel occasionally throws. that bound is exactly what the validation gate measures.
  • pcie aspm off at three layers (defconfig, boot arg, per-device sysfs): a sleeping link costs 50 µs per dma wakeup and each wakeup is a correctable-error event. aer stays on, so post-mortem you can correlate "metis disappeared at T+1247s" with "aer correctable +3 at T+1245s" and know it was electrical, not driver.
  • corrected from earlier revisions: the cortex-a78ae is armv8.2-a — pauth (8.3) and bti (8.5) don't exist on orin silicon; those configs are harmless hint-space no-ops and only the crypto extension + kernel-mode neon do real work. bbr is tcp-only and inert for dds — the fq qdisc is what protects ros traffic. and the tpm group is staged, not active: no verified discrete tpm on the reference carrier (check /dev/tpm0; spi parts also need TCG_TIS_SPI).
  • MODVERSIONS=y, MODULE_FORCE_LOAD off: per-symbol crc checks on top of vermagic, and no escape hatch for insmod --force. that's a feature.
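
since the three isolation boot args only work as a set, it is worth checking them together on the target. a minimal sketch that parses a kernel command line; it assumes the plain core-list form (isolcpus=1-5) and would flag flag-prefixed forms such as isolcpus=managed_irq,1-5 as a mismatch:

```python
def parse_cmdline(cmdline: str) -> dict:
    """Split a kernel command line into {key: value}; bare flags map to ''."""
    params = {}
    for token in cmdline.split():
        key, _, value = token.partition("=")
        params[key] = value
    return params


def isolation_mismatches(cmdline: str, cores: str = "1-5") -> list:
    """Return the isolation parameters that are missing or not set to cores."""
    params = parse_cmdline(cmdline)
    wanted = ("isolcpus", "nohz_full", "rcu_nocbs")
    return [key for key in wanted if params.get(key) != cores]
```

on the jetson, isolation_mismatches(open("/proc/cmdline").read()) should come back empty; any name in the list is the knob that fell off.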

vermagic is the single biggest source of "why doesn't my driver load" on a custom rt kernel. our string is 5.15.x-tegra SMP preempt_rt mod_unload aarch64, and that decides what loads: stock nvidia kernel-module .debs: no (preempt vs preempt_rt); stereolabs .debs: no; prebuilt community metis .ko: no; dkms: conditional (only after our headers .deb is installed); our in-tree phase-2 build: yes. the three-layer defense: promote drivers in-tree (metis at drivers/misc/axelera/, zed x at drivers/media/i2c/zedx/, built by the kernel's own make modules), bake a matching linux-headers .deb installed at first boot, and hard-fail gates at end-of-build, pre-flash, and on-target.
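
the comparison behind those gates can be sketched in a few lines. this is an illustration under assumptions, not the repo's gate: it takes the string that modinfo -F vermagic prints and the output of uname -r, and it is stricter than the kernel's own check because it compares the release token literally. the release 5.15.0-tegra in the usage note is a placeholder.

```python
def vermagic_problems(vermagic: str, kernel_release: str) -> list:
    """Compare a module's vermagic string against the running kernel.

    vermagic:        output of `modinfo -F vermagic some.ko`
    kernel_release:  output of `uname -r`
    """
    tokens = vermagic.split()
    if not tokens:
        return ["empty vermagic string"]
    release, flags = tokens[0], tokens[1:]
    problems = []
    if release != kernel_release:
        problems.append(f"release {release!r} != running {kernel_release!r}")
    if "preempt_rt" not in flags:
        problems.append("built without preempt_rt")
    return problems
```

vermagic_problems("5.15.0-tegra SMP preempt mod_unload aarch64", "5.15.0-tegra") returns one entry, which is the stock-.deb case from the list above.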

the dtbo mechanism, since the trap is item 2 up top: nvidia's Makefile.lib adds dtb-y to always-y but not dtbo-y (registered, never built), the overlay dts needs -DBUILDOVERLAY or you get a malformed empty file, and dtc 1.5.x false-positives on overlays. phase 2 compiles the dtbo directly (cpp + the kernel's in-tree dtc) and the pre-flash audit refuses to flash unless the ~79 kb .dtbo exists.
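
a sketch of what such an audit can look like, assuming the overlay is referenced by file name on an OVERLAYS line. the 0xd00dfeed magic is the standard flattened-device-tree header; the min_bytes threshold and the function itself are illustrative, not the repo's audit:

```python
from pathlib import Path

FDT_MAGIC = b"\xd0\x0d\xfe\xed"  # flattened device tree magic, big-endian


def dtbo_problems(dtbo_path, extlinux_text, min_bytes=1024):
    """List the reasons an overlay would not be applied at boot."""
    path = Path(dtbo_path)
    if not path.is_file():
        return [f"{path} does not exist"]
    problems = []
    data = path.read_bytes()
    if len(data) < min_bytes:  # catches the malformed near-empty file
        problems.append(f"{path.name} is only {len(data)} bytes")
    if data[:4] != FDT_MAGIC:
        problems.append(f"{path.name} lacks the device tree magic")
    overlay_lines = [
        line for line in extlinux_text.splitlines()
        if line.strip().startswith("OVERLAYS")
    ]
    if not any(path.name in line for line in overlay_lines):
        problems.append(f"no OVERLAYS line references {path.name}")
    return problems
```

run it against the staged rootfs before flashing, passing the contents of its extlinux.conf; an empty list is the only acceptable answer.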

drivers, the traps

  • the metis pci id is 1f9d:1100. i carried 1d60 in every grep for months and chased a ghost that was actually a wrong query.
  • the voyager pip index must end in /api/pypi/<repo>/simple or pip 404s; pin numpy<2.0.0 (voyager 1.6 hasn't certified 2.x).
  • zed link mono = max9296. pick max96712 and the camera still "works" at 30 fps with clean dmesg while stereo depth is silently garbage. the choice is enforced in the defconfig and a vendor-makefile sed — one without the other = corrupted frames.
  • the zed sdk installer has no skip_drivers flag (it's silently ignored); the working flags are silent runtime_only skip_python skip_cuda skip_tools skip_od_module skip_hub nvpmodel=0. it also leaves three gaps our daemons script closes: the bmi088/spsc kernel modules built against our kernel (sdk ≥5.x refuses to open the camera without them), the privileged spsc broker daemons, and the patched libnvisppg.so for the stock-r36.4.x soft-image bug.
  • isp .isp calibrations go to /var/nvidia/nvcam/settings/ or colors drift and visual slam degrades.
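
the wrong-id ghost from the first bullet is easy to reproduce and easy to guard against. a minimal sketch that filters lspci -n output by vendor id; the sample line in the usage note is illustrative (slot, class code and revision are made up), only the 1f9d:1100 pair comes from this post:

```python
import re

# `lspci -n` prints: <slot> <class>: <vendor>:<device> [(rev xx)]
LSPCI_N = re.compile(
    r"^(\S+)\s+[0-9a-f]{4}:\s+([0-9a-f]{4}):([0-9a-f]{4})", re.IGNORECASE
)


def devices_for_vendor(lspci_n_output: str, vendor: str) -> list:
    """Return (slot, device_id) for every line whose vendor id matches."""
    hits = []
    for line in lspci_n_output.splitlines():
        match = LSPCI_N.match(line.strip())
        if match and match.group(2).lower() == vendor.lower():
            hits.append((match.group(1), match.group(3).lower()))
    return hits
```

given a line like "0004:01:00.0 1200: 1f9d:1100 (rev 02)", devices_for_vendor(text, "1f9d") finds the card and devices_for_vendor(text, "1d60") returns an empty list, which looks identical to a card that never enumerated.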

flash + first boot, the traps

  • the board target is jetson-orin-nano-devkit — not -super, which is an orin NANO power-table variant: the flash "succeeds" and the board runs ~30% slow with weird thermals, the hardest kind of bug because nothing fails loud.
  • apply_binaries.sh silently reinstalls the stock kernel; the flash script backs up and restores the rt kernel behind a vermagic gate. the initrd's early-boot modules must themselves be preempt_rt (we build nvme/pcie in, so none are needed).
  • root=/dev/nvme0n1p1 explicitly — the emmc default makes an nvme-only board hang forever on rootwait, no error.
  • a blank hdmi during boot is normal on orin. judge the boot by the usb gadget (0955:7020) → ping 192.168.55.1 → ssh, ~60 s.
  • an unseeded flash boots into the oem wizard, which blocks sshd while still answering ping — looks exactly like a hang. SEED_USER=j at flash kills it; the script hard-fails if the wizard symlink survives.
  • first boot regenerates ssh host keys (i once shipped 5 jetsons with identical keys), pins nvidia-l4t-kernel* and bootloader packages at Pin-Priority: -1 (apt-mark hold alone is overridable; the pin also rejects nvidia's security updates — deliberate, those components update only through build → audit → flash, but the cve watch is yours now), installs our headers .deb, builds /opt/av-env, and defers cleanly if offline.
  • per-boot rt-tune relocks the volatile stuff: nvpmodel mode 0 = MAXN_SUPER on the super conf table (earlier revisions said mode 4 — on this table that's the fixed 40 w profile and it silently downgraded the board every boot), clocks, gpu devfreq at .gpu on r36.x (.ga10b is r35.x and silently no-ops), irq pinning per core role, the fq qdisc, and an oom shield on the axelera runtime.
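
for the apt pin in the first-boot bullet, the stanza follows the standard apt_preferences format, where a negative priority prevents the matching versions from being installed. a sketch that renders one; the "release a=*" selector and the idea of one stanza per glob are assumptions about how you might write it, not a copy of the repo's file:

```python
def pin_stanza(package_glob: str, priority: int = -1) -> str:
    """Render one apt preferences stanza that blocks a package glob."""
    return (
        f"Package: {package_glob}\n"
        "Pin: release a=*\n"
        f"Pin-Priority: {priority}\n"
    )


# the kernel glob comes from the bullet above; bootloader package names
# are not listed in this post, so they are left for you to fill in
print(pin_stanza("nvidia-l4t-kernel*"))
```

a file holding such stanzas belongs under /etc/apt/preferences.d/, and apt-cache policy on one of the pinned packages shows whether the pin took effect.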

inference, the payoff

prove the npu on a file first — a deterministic input isolates the metis + runtime from the whole camera stack:

cd ~/voyager-sdk
GST_PLUGIN_FEATURE_RANK=nvv4l2decoder:NONE PYTHONPATH=$PWD \
/opt/av-env/bin/python inference.py yolov5s-v7-coco media/h264/traffic1_1080p.mp4
# ~49 fps end-to-end, <15% cpu. first run compiles the model (~17 min), then cached.

four traps on the way there: the app framework lives in the voyager checkout, not the wheels (install its requirements minus opencv-python — it shadows your cuda cv2 build — and minus pyopencl); the gstreamer operators are a source build (make operators, plus ninja-build opencl-headers ocl-icd-opencl-dev libsimde-dev); the decode workaround above; and pass aipu_cores=4 to create_inference_stream or you trigger a full recompile (the app api default is 1, inference.py's is 4).
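
if you launch the file test from a script rather than a shell, the decode workaround has to travel in the child's environment. a minimal sketch that mirrors the command above; the helper names are made up, and the voyager api itself is deliberately not touched here:

```python
import os
import subprocess


def software_decode_env(base=None):
    """Copy an environment and demote nvv4l2decoder so software decode is used."""
    env = dict(os.environ if base is None else base)
    rank = "nvv4l2decoder:NONE"
    existing = env.get("GST_PLUGIN_FEATURE_RANK")
    # the variable takes a comma-separated list, so keep any existing entries
    env["GST_PLUGIN_FEATURE_RANK"] = f"{existing},{rank}" if existing else rank
    return env


def run_file_inference(voyager_dir, network, source):
    """Run inference.py from the voyager checkout, as in the command above."""
    env = software_decode_env()
    env["PYTHONPATH"] = str(voyager_dir)
    return subprocess.run(
        ["/opt/av-env/bin/python", "inference.py", network, source],
        cwd=voyager_dir, env=env, check=True,
    )
```

run_file_inference(os.path.expanduser("~/voyager-sdk"), "yolov5s-v7-coco", "media/h264/traffic1_1080p.mp4") is the scripted equivalent of the shell lines above.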

then the live camera — the c++ sample runs 56.4 fps at HD1200@60 and up to 95.7 at SVGA@120 (yolov8s), detector at zero gpu — and the full fusion sample: npu detection + gpu depth + point cloud + imu-fused pose + skeleton tracking concurrently at 46–53 fps with every feature on. --depth-every N is the big tuning lever, and think of N in meters, not frames: at HD1200@60 every skipped grab is 16.7 ms of depth staleness, so N=3 means the geometry under your distance/ttc numbers is up to ~50 ms / 1 m old at 20 m/s, and N=6 doubles that — pick N from closing speed and stopping distance, not from the fps chart. launch ad-hoc runs through axrun (core 1, oom-shielded) or eat ±300 µs of jitter from landing on a random core.
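
the "N in meters" rule above can be worked out directly. a minimal sketch using the same model as the paragraph (worst-case depth age is N grab periods); the function names and the idea of inverting it into a budget are illustrative:

```python
def depth_staleness(n: int, fps: float, closing_speed_mps: float):
    """Worst-case age of the depth map when depth runs every n-th grab.

    Returns (age in milliseconds, distance closed in meters).
    """
    age_s = n / fps
    return age_s * 1000.0, age_s * closing_speed_mps


def max_depth_every(fps: float, closing_speed_mps: float, max_error_m: float) -> int:
    """Largest N whose worst-case position error stays within max_error_m."""
    # small epsilon so an exact ratio such as 3.0 is not floored to 2
    return max(1, int(max_error_m * fps / closing_speed_mps + 1e-9))


for n in (1, 3, 6):
    age_ms, meters = depth_staleness(n, fps=60, closing_speed_mps=20)
    print(f"N={n}: up to {age_ms:.0f} ms / {meters:.2f} m stale")
```

going the other way, max_depth_every(60, 20, 1.0) gives 3: start from the error you can tolerate at your closing speed and let that set N.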

one more: apt's python3-opencv ships without cuda — every cv2.cuda call silently runs on cpu, a 10–30x slowdown. the stack builds opencv 4.10 from source and caches the .deb so units 2..N never rebuild.
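
a quick way to tell which cv2 you actually have. the sketch below parses the text returned by cv2.getBuildInformation() and, if that says cuda, asks cv2.cuda.getCudaEnabledDeviceCount() for a device; the exact wording of the build-information line is assumed from typical opencv builds, so treat the regex as a starting point:

```python
import re


def build_info_has_cuda(build_info: str) -> bool:
    """True if an opencv build-information dump reports cuda as enabled."""
    pattern = r"^\s*NVIDIA CUDA:\s+YES"
    return re.search(pattern, build_info, re.MULTILINE) is not None


def cv2_has_cuda() -> bool:
    """Check the cv2 module that this interpreter actually imports."""
    import cv2  # imported lazily so the parser above works without opencv

    if not build_info_has_cuda(cv2.getBuildInformation()):
        return False
    return cv2.cuda.getCudaEnabledDeviceCount() > 0
```

run cv2_has_cuda() with /opt/av-env/bin/python specifically, since a pip-installed opencv-python in the same environment can shadow the source build.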

validation

make verify ssh's in and hard-gates: -tegra kernel, isolation/tickless boot args, CmaTotal matching the device-tree pool (a cmdline cma= bypasses it and breaks gpu init), vermagic on every loaded .ko, expected drivers per expectations.conf, MAXN_SUPER active, no throttling, a 10 s cyclictest burst on core 1 with max < 100 µs (run headless), and the /opt/av-env imports. exits 0 only on full green. past one unit, the repo has signed releases, batch flashing, and bit-identical golden-image cloning — the runbook.
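
the cyclictest part of that gate is easy to reproduce by hand. a sketch that pulls the Max: field out of cyclictest's per-thread summary lines and applies the 100 µs limit; the invocation in the usage note is one plausible set of flags, not necessarily what make verify runs:

```python
import re


def cyclictest_max_us(output: str) -> int:
    """Largest Max: value across all thread lines of cyclictest output."""
    values = [int(v) for v in re.findall(r"Max:\s*(\d+)", output)]
    if not values:
        raise ValueError("no Max: field found in cyclictest output")
    return max(values)


def passes_gate(output: str, limit_us: int = 100) -> bool:
    """Apply the same strict bound as the gate above: max < limit."""
    return cyclictest_max_us(output) < limit_us
```

feed it the output of something like cyclictest -m -p 95 -a 1 -t 1 -D 10 -q, run headless, and passes_gate(text) tells you whether the burst stayed under the bound.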

troubleshooting, the top hits

symptom | fix
metis ghost: lspci -d 1f9d: empty on cold boot | the LINK_WAIT_MAX_RETRIES=200 patch; verify in source, rebuild
"Invalid module format" / .ko on disk but not loaded | modinfo vermagic vs uname -r; rebuild from clean tree, never force-load
dtbo missing after an hour of make dtbs | nvidia skips dtbo-y; the direct-compile path + pre-flash audit
stereo depth garbage, zero errors anywhere | wrong deserializer (max96712 vs max9296); both enforcement points required
first boot "hangs" but answers ping | oem wizard blocking sshd; seed a user at flash
inference dies not-negotiated on file/rtsp | GST_PLUGIN_FEATURE_RANK=nvv4l2decoder:NONE
cv2 import dies after installing voyager app deps | opencv-python shadowed the cuda build; install requirements minus it (and pyopencl)
cuda pip build dies with a bare Killed | the image ships zero swap (zram off for rt determinism); add a low-swappiness nvme swapfile

the full symptom-first catalog is 32 entries deep in the troubleshooting doc.

closing

it's the artifact you need to run metis + zed x + nvme on an orin nx 16gb doing real-time vision without it falling apart on a cold boot or a brownout — and it's not a beginner guide: you need to know what make does, be willing to read a kernel defconfig, and be comfortable when dmesg is the only thing between you and the answer. one gap named out loud: the boot path is unsigned and the data partition unencrypted, so the blackbox's sha256 hash chain starts at userspace. fused secure boot, measured boot (if a tpm materializes on the carrier), and luks under the btrfs partition are the natural sequel post.

contributions welcome — file an issue with the bug-report template, include make logs output if you have it. if your numbers differ from the samples-doc tables on a different carrier or sdk version, post them: comparative data is how the catalog grows. and if you build something with this and it flies, say so in a github issue. seriously.

contact: silicondoritos at gmail.

the docs (where every cut section lives in full)

2 replies

Spanner
Axelera Team
  • June 12, 2026

Wow, that's top banana @doritos! Outstanding work there - practically one command to save days of head scratching!

Excellent FPS that barely even touches the CPU, too. That alone is awesome.

Great for smart cameras, but honestly, I could see this being amazing for autonomous robotics, drones, warehouse automation - anywhere 3D depth, AI vision and edge operation are needed. Which actually, is SO many potential applications! Thanks so much for sharing, this is so awesome. 😎

Do you have any plans to put it to work in a specific use case?


  • Author
  • Ensign
  • June 12, 2026

Thanks! The goal was a stable foundation to develop applications on, without getting hassled by all the configuration and prerequisites just to start writing code. Once you have this, it's very easy to expand the sample app into a perception stack you can plug into Isaac ROS, cuVSLAM and more, so you can close the automation loop.


I'm personally building an autonomous perception stack around the ZED X camera, 1D lidar rangefinders and GPS. But the tutorial is camera-agnostic too: swap the ZED SDK for Isaac ROS DNN Stereo Depth (ESS) and you get GPU stereo depth from any synchronized stereo pair, no ZED hardware required.