FlashRT: realtime small-batch inference for VLA / embodied AI models
Hi Hugging Face community,
I’d like to share FlashRT, an open-source realtime inference engine I’m building for small-batch, latency-sensitive AI workloads.
The motivation is simple: most modern inference infrastructure is extremely good at large-batch cloud serving, but many emerging workloads look very different:
- VLA / embodied AI policies
- local LLM inference
- diffusion / video generation
- world models
- robot control loops
- single-user local GPU applications
- edge deployment on Jetson-class devices
These workloads are often batch size 1, latency-sensitive, multi-modal, and hard to optimize with general-purpose serving assumptions. Average throughput is not enough: p50 / p95 latency, launch overhead, memory movement, quantization behavior, and small-op fragmentation matter a lot.
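To illustrate why an average hides what matters here, below is a minimal sketch that computes mean, p50, and p95 from a list of per-inference latencies. The sample values are hypothetical and are not FlashRT measurements; the nearest-rank percentile definition is one common choice among several.

```python
def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Ceiling division in integer arithmetic avoids floating-point rank errors.
    rank = max(1, -(-pct * len(ordered) // 100))
    return ordered[rank - 1]

# Hypothetical latencies in milliseconds: mostly steady, with two stalls.
latencies_ms = [18.0] * 18 + [60.0, 95.0]

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p95_ms = percentile(latencies_ms, 95)

print(f"mean={mean_ms:.2f} ms  p50={p50_ms:.1f} ms  p95={p95_ms:.1f} ms")
```

In this made-up trace the mean and p50 both look healthy, while p95 shows that one control tick in twenty waits more than three times longer than the typical one.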
FlashRT is my attempt to build a CUDA-first inference runtime for this setting.
It focuses on:
- direct CUDA kernel execution
- model-specific small-batch optimization
- fused quantization and dequantization
- FP8 / INT8 / FP4-oriented execution paths
- CUDA Graph replay
- reducing Python/runtime overhead
- avoiding long compile/export/calibration pipelines
- making realtime inference practical on both edge devices and consumer GPUs
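For readers less familiar with the quantization item in the list above, here is a minimal pure-Python sketch of symmetric per-tensor INT8 quantization and dequantization. It only shows the arithmetic; it is not FlashRT's kernel code, and the scaling scheme and input values are assumptions chosen for illustration. In a fused kernel, these steps would typically run inside the surrounding operation instead of as separate passes over memory.

```python
def quantize_int8(values):
    """Symmetric per-tensor INT8: map [-max_abs, max_abs] onto [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale

def dequantize_int8(quantized, scale):
    """Recover approximate real values from INT8 codes and the shared scale."""
    return [q * scale for q in quantized]

# Hypothetical activation values, not taken from any real model.
activations = [0.02, -1.27, 0.64, 0.0, 1.0]

codes, scale = quantize_int8(activations)
restored = dequantize_int8(codes, scale)
max_err = max(abs(a - b) for a, b in zip(activations, restored))

print(codes, scale, max_err)
```

With round-to-nearest, the round-trip error per element is bounded by half the scale, which is why the choice of scale matters so much for accuracy.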
Currently supported / tested directions include:
VLA / robotics:
- Pi0.5
- Pi0
- Pi0-FAST
- GR00T N1.6
- Jetson Thor / Orin / RTX local GPU deployment
LLM:
- Qwen-style local inference experiments
- single-stream / small-batch optimization
- kernel-level quantization and fusion paths
Diffusion / world models:
- Wan2.2 / Motus-style world model optimization
- video/world-model inference on consumer GPUs
- reducing fragmented launch overhead with fused kernels
Some current benchmark examples:
Pi0.5 / VLA inference:
- Jetson AGX Orin 64GB / SM87, Pi0.5 DROID INT8, 2 cameras, 27 layers, 10 diffusion steps, cache_frames=1:
  - P50 latency: 124 ms
  - Throughput: 8.04 Hz
  - Cosine: 1.000 vs BF16 reference
- Jetson Thor / SM110, Pi0.5-class workload: around 44–46 ms
- RTX 5090 / SM120, Pi0.5-class workload: around 17–18 ms
GR00T N1.6:
- Jetson Thor: around 41–45 ms depending on sequence setting
- RTX 5090: around 12–13 ms
Pi0-FAST:
- Jetson Thor: around 8.1 ms / token
- RTX 5090: around 2.4 ms / token
World model / video diffusion direction:
Wan2.2 / Motus-style 5B world model:
- baseline: around 1.2 s E2E for 10-step inference
- current optimized path: around 200 ms
- target: around 100 ms-class realtime world-model inference on consumer GPUs
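For anyone who wants to reproduce this kind of comparison, the sketch below shows how a cosine score against a reference output, a latency-to-rate conversion, and a speedup ratio can be computed. The two action vectors are hypothetical stand-ins, not FlashRT outputs; only the 124 ms, 1.2 s, and 200 ms figures come from the numbers above. A measured throughput need not equal 1000 / P50 exactly.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_a * norm_b)

def rate_hz(latency_ms):
    """Back-to-back inference rate implied by a single latency figure."""
    return 1000.0 / latency_ms

# Hypothetical action outputs: a BF16 reference and a quantized path.
reference_actions = [0.10, -0.25, 0.40, 0.05]
quantized_actions = [0.10, -0.25, 0.41, 0.05]

cos = cosine_similarity(reference_actions, quantized_actions)
implied_hz = rate_hz(124.0)     # from the 124 ms P50 figure above
speedup = 1200.0 / 200.0        # 1.2 s baseline vs 200 ms optimized path

print(f"cosine={cos:.5f}  implied rate={implied_hz:.2f} Hz  speedup={speedup:.1f}x")
```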
The core idea is that many of these models are not limited by “GEMM speed” alone. They are often limited by runtime overhead, fragmented kernels, poor small-batch scheduling, quantization boundaries, memory movement, and compiler/toolchain mismatch.
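A toy cost model can make the fragmentation point concrete. The sketch below assumes that every kernel launch adds a fixed overhead that is not hidden behind compute, which is a simplification that fits small batch-1 kernels best. All numbers (kernel counts, compute time, per-launch cost) are hypothetical and are not FlashRT measurements.

```python
def step_time_us(num_kernels, compute_us, launch_overhead_us):
    """Toy model: useful compute plus a fixed, unhidden cost per kernel launch."""
    return compute_us + num_kernels * launch_overhead_us

# Hypothetical forward pass: same math, different number of kernel launches.
compute_us = 8000.0
launch_overhead_us = 5.0

fragmented_us = step_time_us(2000, compute_us, launch_overhead_us)
fused_us = step_time_us(400, compute_us, launch_overhead_us)

overhead_share = (fragmented_us - compute_us) / fragmented_us

print(f"fragmented={fragmented_us / 1000:.1f} ms  fused={fused_us / 1000:.1f} ms")
print(f"launch overhead share before fusion: {overhead_share:.0%}")
```

Under these assumptions, more than half of the fragmented step is launch overhead, so a faster GEMM alone would not change the picture much, while fusing kernels or replaying a captured graph attacks the dominant term.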
For example, in robotics inference, an action chunk can give you some temporal buffer, but the observation still needs to be fresh. If the runtime introduces unstable latency or excessive overhead, the robot ends up executing stale actions. This makes VLA inference a realtime systems problem, not just a model-serving problem.
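The staleness argument can be written as simple arithmetic. This sketch assumes the observation is captured when inference starts and that action k of a chunk is executed k control periods after the result arrives. The 20 ms control period and the chunk length of 10 are hypothetical; the 124 ms and 18 ms latencies are taken from the Orin and RTX 5090 figures above.

```python
def observation_age_ms(inference_latency_ms, control_period_ms, step_index):
    """Age of the observation when action `step_index` of the chunk executes."""
    return inference_latency_ms + step_index * control_period_ms

def chunk_ages_ms(inference_latency_ms, control_period_ms, chunk_length):
    """Observation age for every action in one chunk, in execution order."""
    return [
        observation_age_ms(inference_latency_ms, control_period_ms, k)
        for k in range(chunk_length)
    ]

control_period_ms = 20.0   # hypothetical 50 Hz control loop
chunk_length = 10          # hypothetical number of actions per chunk

ages_slow = chunk_ages_ms(124.0, control_period_ms, chunk_length)
ages_fast = chunk_ages_ms(18.0, control_period_ms, chunk_length)

print(f"124 ms inference: first action {ages_slow[0]:.0f} ms old, last {ages_slow[-1]:.0f} ms")
print(f" 18 ms inference: first action {ages_fast[0]:.0f} ms old, last {ages_fast[-1]:.0f} ms")
```

Lower and more stable inference latency shifts every action in the chunk closer to the observation it was computed from, and also makes it feasible to replan before the chunk is exhausted.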
Similarly, for local LLM and world-model workloads, single-user latency is very different from datacenter batch throughput. A backend optimized for high-throughput serving is not always optimal for a single stream on a local GPU.
Repo:
https://github.com/LiangSu8899/FlashRT
Jetson Orin deployment docs:
https://github.com/LiangSu8899/FlashRT/blob/main/docs/deployment_orin.md
The project is still early, but I’d love feedback from people working on:
- VLA / robotics policies
- local LLM inference
- diffusion / video generation
- world models
- Jetson deployment
- CUDA kernels
- FP8 / INT8 / FP4 quantization
- Hugging Face model integration
- realtime inference APIs
My longer-term goal is to make FlashRT a practical low-latency backend for models that do not fit the standard “large-batch cloud serving” assumption, especially embodied AI, local agents, and realtime multi-modal systems.
Feedback, benchmarks, issues, PRs, and model integration suggestions are very welcome.