Built a physical AI agent device that controls any connected device via USB HID, and sharing the demo here
Hugging Face Forums [Unofficial]
June 12, 2026
Really like this. “Don’t ask the host for permission - just present as a keyboard and mouse” is exactly the right lateral move: the integration wall is where most agent deployments quietly die, and meeting it at the HID layer instead of the API layer sidesteps the whole problem. Nice work.
I’m heads-down on my own thing right now (latent-space work - injecting into and reading out of a model’s hidden state rather than going through text), so I genuinely don’t know what kind of help you’re after, if any. But I’ve been circling this same corner - autonomous agents on cheap edge silicon - for a while, so a few threads in case any are worth chewing on:
1. The loop latency, not the HID. HID output is basically free; what sets the felt responsiveness is the observe → reason cycle - HDMI frame → JPEG → LLM round-trip, every turn. What’s your end-to-end latency per action, and is the LLM the bottleneck? I’d be curious whether you frame-diff so an unchanged screen doesn’t pay for a fresh round-trip - seems like the obvious lever for a screen-driven ReAct loop.
2. Where does the on-device line land? You’ve already got Silero VAD running local on the RV1106 NPU, which is great; the reasoning still leans cloud. My whole rabbit hole is running capable models on tiny/fixed-point silicon, so I’m naturally curious how far onto the NPU you think the loop can move - privacy, offline, and cost all pull that way even if the heavy reasoning stays remote.
3. USB/IP. You’re already on the Linux gadget stack - tunnelling the HID gadget over usbip would decouple the brain from the physical endpoint (one agent driving a fleet of hosts, or the endpoint living somewhere else entirely). Felt like a natural marriage with what you’re doing.
4. Security, in both directions. The universality that makes this powerful is also the classic BadUSB surface - a keyboard that types on its own - so attestation/detection becomes an interesting co-problem. And the reverse: the agent acts on what it reads off the screen, which quietly makes the screen itself a prompt-injection channel. Not a knock, just the rich (and slightly scary) part of the design.
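To make the frame-diff idea in point 1 concrete, here is a rough sketch of the kind of gate I mean. It assumes each captured frame has already been reduced to a flat buffer of 8-bit grayscale pixels of fixed size; `FrameGate`, `changed_fraction`, and both thresholds are names and values I made up for illustration, not anything from your project.

```python
from typing import Optional


def changed_fraction(prev: bytes, curr: bytes, pixel_tol: int = 8) -> float:
    """Fraction of pixels whose grayscale value moved by more than pixel_tol."""
    if len(prev) != len(curr):
        raise ValueError("frames must have the same size")
    if not curr:
        return 0.0
    changed = sum(1 for a, b in zip(prev, curr) if abs(a - b) > pixel_tol)
    return changed / len(curr)


class FrameGate:
    """Decides whether a new frame is worth a fresh LLM round-trip.

    Hypothetical helper: thresholds are illustrative and would need tuning
    against real capture noise (JPEG artifacts, cursor blink, clocks).
    """

    def __init__(self, min_changed: float = 0.01, pixel_tol: int = 8) -> None:
        self.min_changed = min_changed
        self.pixel_tol = pixel_tol
        self._last_sent: Optional[bytes] = None

    def should_send(self, frame: bytes) -> bool:
        # Always send the first frame. After that, compare against the last
        # frame that was actually sent (not the last one seen), so slow
        # drift across many frames still adds up to a send eventually.
        if self._last_sent is None or changed_fraction(
            self._last_sent, frame, self.pixel_tol
        ) >= self.min_changed:
            self._last_sent = frame
            return True
        return False


if __name__ == "__main__":
    gate = FrameGate()
    blank = bytes(100)                       # 100 black pixels
    moved = bytes([255] * 5 + [0] * 95)      # 5% of pixels changed
    print(gate.should_send(blank))           # first frame
    print(gate.should_send(blank))           # identical frame
    print(gate.should_send(moved))           # visibly different frame
```

A per-pixel tolerance plus a minimum changed fraction is only one possible choice; a perceptual hash or tile-level diff would probably cope better with compression noise, and a blinking cursor may need masking either way.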
On squeezing silicon: I had a video bookmarked of someone wringing absurd performance out of bare-metal RISC-V that I’ve annoyingly lost, but the nearest kindred spirit I can point at is Hazard3 (Wren6991/Hazard3 on GitHub, a 3-stage RV32IMACZb* processor with debug) - apt given the RV1106 already has a RISC-V core sitting right next to the A7 and the NPU. If I dig the original up I’ll drop it in here.
Anyway - really like the direction, and I think this field is about to get very rich. I’ll be watching. Best of luck.