
Project Gaius: 2-Stage Pipeline & Smart Cropping

  • February 2, 2026


Hi Community!

I want to share the architectural progress I've made with the Voyager SDK since my last post about Project Gaius.

My original plan was to use a pre-trained YOLOv11, which showed great promise for hand keypoint detection. However, after running into persistent matmul errors when deploying it to the AIPU, I pivoted to YOLOv8n-pose. I trained a custom model on the Ultralytics Hand Keypoints Dataset, which compiled successfully and is now running smoothly on the Metis.
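For reference, the training itself is standard Ultralytics usage. A minimal sketch looks roughly like this (the hand-keypoints.yaml dataset config name and the hyperparameters here are illustrative, not my final settings):

```python
from ultralytics import YOLO

# Start from the pre-trained pose weights and fine-tune on hand keypoints.
model = YOLO("yolov8n-pose.pt")

# "hand-keypoints.yaml" is assumed to be the Ultralytics-provided config for
# the Hand Keypoints Dataset; epochs/imgsz are placeholder values.
model.train(data="hand-keypoints.yaml", epochs=100, imgsz=640)

# Export to ONNX as a starting point for compilation to the Metis AIPU.
model.export(format="onnx")
```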

To address the resolution challenge where downscaling camera streams makes hands too pixelated for accurate tracking, I am implementing a "Smart Cropping" 2-stage inference pipeline. Two networks work in tandem: the first detects the global context and roughly locates the hand, while the second extracts precise 21-point keypoints. Instead of resizing the whole image to the standard 640x640 input, the logic uses the first stage to crop the hand from the high-quality stream before feeding it into the second model.
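To make the cropping step concrete, here is a minimal sketch of the idea (the helper name, margin, and letterbox padding value are just placeholders): pad the rough stage-1 box, cut it out of the full-resolution frame, and letterbox it to the 640x640 input of the keypoint model.

```python
import cv2
import numpy as np

def crop_for_stage2(frame: np.ndarray, box_xyxy, margin: float = 0.25,
                    target: int = 640) -> np.ndarray:
    """Cut a padded hand crop out of the full-resolution frame and resize it
    to the 640x640 input expected by the second-stage keypoint model.

    `box_xyxy` is the rough hand box from stage 1, in pixel coords of `frame`.
    """
    h, w = frame.shape[:2]
    x1, y1, x2, y2 = box_xyxy

    # Grow the box by a relative margin so fingertips are not clipped.
    bw, bh = x2 - x1, y2 - y1
    x1 = max(0, int(x1 - margin * bw))
    y1 = max(0, int(y1 - margin * bh))
    x2 = min(w, int(x2 + margin * bw))
    y2 = min(h, int(y2 + margin * bh))

    crop = frame[y1:y2, x1:x2]

    # Letterbox to preserve the aspect ratio the keypoint model was trained on.
    scale = target / max(crop.shape[:2])
    resized = cv2.resize(crop, (int(crop.shape[1] * scale),
                                int(crop.shape[0] * scale)))
    canvas = np.full((target, target, 3), 114, dtype=np.uint8)
    canvas[:resized.shape[0], :resized.shape[1]] = resized
    return canvas
```

The crop offsets and scale are kept around so the 21 keypoints can be mapped back into full-frame coordinates afterwards.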

To achieve this specific behavior, I decided to build on the low-level axelera.runtime API rather than the high-level inference_stream. The low-level API gives me the control needed for the dynamic cropping logic, which isn't possible in the stream API because the first model recognizes body keypoints, and it improves efficiency by enabling conditional execution: if the first stage sees no hand, the second stage never runs, saving energy on this always-on system.
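The glue between the two stages then reduces to a small conditional loop. The sketch below is deliberately framework-agnostic: run_stage1 and run_stage2 are placeholders for however the two compiled models end up being invoked through axelera.runtime, not real SDK calls.

```python
def process_frame(frame, run_stage1, run_stage2, conf_threshold=0.5):
    """One pass of the 2-stage pipeline. `run_stage1`/`run_stage2` are
    hypothetical callables wrapping the compiled models; detection objects
    are assumed to expose `.score` and `.box_xyxy`."""
    detections = run_stage1(frame)                 # global context, rough hand location
    hands = [d for d in detections if d.score >= conf_threshold]
    if not hands:
        return None                                # no hand: stage 2 never runs

    best = max(hands, key=lambda d: d.score)
    crop = crop_for_stage2(frame, best.box_xyxy)   # helper from the cropping sketch above
    keypoints = run_stage2(crop)                   # precise 21-point hand keypoints
    return keypoints
```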

Currently, I have both models running independently and am implementing a geometric classifier in Python that accurately identifies gestures like Fist, Palm, and Like based on finger linearity. My next steps involve finalizing the code that binds the two models together and beginning the integration with Home Assistant to turn these gestures into real-world actions.
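The core idea behind the geometric classifier is that an extended finger has nearly collinear bones. Here is a minimal sketch of that logic, assuming the standard 21-point hand layout (wrist at index 0, four joints per finger) and purely illustrative thresholds:

```python
import numpy as np

# Standard 21-point hand layout assumed: 0 = wrist, then 4 joints per finger.
FINGERS = {"thumb": (1, 2, 3, 4), "index": (5, 6, 7, 8),
           "middle": (9, 10, 11, 12), "ring": (13, 14, 15, 16),
           "pinky": (17, 18, 19, 20)}

def finger_is_straight(kpts: np.ndarray, joints, cos_thresh: float = 0.9) -> bool:
    """A finger counts as extended when its three bone segments are nearly
    collinear, i.e. adjacent segment vectors have high cosine similarity."""
    pts = kpts[list(joints)]
    v = np.diff(pts, axis=0)                               # 3 bone vectors
    v = v / (np.linalg.norm(v, axis=1, keepdims=True) + 1e-6)
    cosines = np.sum(v[:-1] * v[1:], axis=1)               # angles between adjacent bones
    return bool(np.all(cosines > cos_thresh))

def classify(kpts: np.ndarray) -> str:
    """kpts: (21, 2) array of hand keypoints in image coordinates."""
    straight = {name: finger_is_straight(kpts, j) for name, j in FINGERS.items()}
    non_thumb = [straight[f] for f in ("index", "middle", "ring", "pinky")]
    if all(non_thumb):
        return "Palm"
    if not any(non_thumb) and not straight["thumb"]:
        return "Fist"
    if straight["thumb"] and not any(non_thumb):
        return "Like"                                      # thumbs-up: only the thumb extended
    return "Unknown"
```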

Stay tuned for the next update!