FlashRT: realtime small-batch inference for VLA / embodied AI models

Hugging Face Forums [Unofficial] May 25, 2026

Hi Hugging Face community,

I’d like to share FlashRT, an open-source realtime inference engine I’m building for small-batch, latency-sensitive AI workloads.

The motivation is simple: most modern inference infrastructure is extremely good at large-batch cloud serving, but many emerging workloads look very different:

  • VLA / embodied AI policies

  • local LLM inference

  • diffusion / video generation

  • world models

  • robot control loops

  • single-user local GPU applications

  • edge deployment on Jetson-class devices

These workloads often run at batch size 1, are latency-sensitive and multi-modal, and are hard to optimize under general-purpose serving assumptions. Average throughput alone is not enough: p50 / p95 latency, launch overhead, memory movement, quantization behavior, and small-op fragmentation all matter a lot.
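As a rough illustration of why percentiles are worth tracking separately from the average, here is a minimal timing sketch in plain Python. It is not part of FlashRT: `dummy_step` is a hypothetical stand-in for one batch-size-1 inference call, and the nearest-rank percentile is just one common convention. For real GPU work the step would also have to synchronize the device before the clock is read, otherwise only the launch is timed.

```python
import math
import time


def percentile(samples, q):
    """Nearest-rank percentile of a list of numbers, with q in (0, 100]."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def measure_latency_ms(step, warmup=10, iters=100):
    """Time each call to `step()` and return (p50, p95) in milliseconds."""
    for _ in range(warmup):  # discard warm-up calls (caches, lazy init)
        step()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        step()
        samples.append((time.perf_counter() - start) * 1e3)
    return percentile(samples, 50), percentile(samples, 95)


def dummy_step():
    # Hypothetical stand-in for one batch-size-1 inference call.
    sum(i * i for i in range(10_000))


p50, p95 = measure_latency_ms(dummy_step)
print(f"p50={p50:.3f} ms  p95={p95:.3f} ms")
```

A large gap between p50 and p95 is the kind of instability that an average throughput number hides.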

FlashRT is my attempt to build a CUDA-first inference runtime for this setting.

It focuses on:

  • direct CUDA kernel execution

  • model-specific small-batch optimization

  • fused quantization and dequantization

  • FP8 / INT8 / FP4-oriented execution paths

  • CUDA Graph replay

  • reducing Python/runtime overhead

  • avoiding long compile/export/calibration pipelines

  • making realtime inference practical on both edge devices and consumer GPUs
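To make the quantization items above concrete, the sketch below shows the arithmetic of symmetric per-tensor INT8 quantization and dequantization, plus a cosine-similarity check against the unquantized reference (the same kind of metric quoted in the benchmarks in this post). This is an assumption-laden illustration in plain Python, not FlashRT's kernels: a fused implementation would fold these steps into neighboring GPU kernels instead of running them as separate passes, and the input values here are made up.

```python
import math


def quantize_int8(values):
    """Symmetric per-tensor INT8 quantization: returns (integer codes, scale)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize_int8(codes, scale):
    """Map INT8 codes back to floating point."""
    return [c * scale for c in codes]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical activation values standing in for a real tensor.
reference = [0.02, -1.30, 0.75, 2.54, -0.41, 0.00, 1.12, -2.10]
codes, scale = quantize_int8(reference)
restored = dequantize_int8(codes, scale)
print(f"scale={scale:.5f}  cosine={cosine_similarity(reference, restored):.4f}")
```

With round-to-nearest, the per-element error is bounded by half the scale, which is why a single outlier that inflates `max_abs` hurts every other element in the tensor.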

Currently supported / tested directions include:

VLA / robotics:
  - Pi0.5
  - Pi0
  - Pi0-FAST
  - GR00T N1.6
  - Jetson Thor / Orin / RTX local GPU deployment

LLM:
  - Qwen-style local inference experiments
  - single-stream / small-batch optimization
  - kernel-level quantization and fusion paths

Diffusion / world models:
  - Wan2.2 / Motus-style world model optimization
  - video/world-model inference on consumer GPUs
  - reducing fragmented launch overhead with fused kernels

Some current benchmark examples:

Pi0.5 / VLA inference:

Jetson AGX Orin 64GB / SM87
  Pi0.5 DROID INT8, 2 cameras, 27 layers, 10 diffusion steps
  cache_frames=1:
    P50 latency: 124 ms
    Throughput: 8.04 Hz
    Cosine: 1.000 vs BF16 reference

Jetson Thor / SM110
  Pi0.5-class workload:
    around 44–46 ms

RTX 5090 / SM120
  Pi0.5-class workload:
    around 17–18 ms

GR00T N1.6:

Jetson Thor:
  around 41–45 ms depending on sequence setting

RTX 5090:
  around 12–13 ms

Pi0-FAST:

Jetson Thor:
  around 8.1 ms / token

RTX 5090:
  around 2.4 ms / token

World model / video diffusion direction:

Wan2.2 / Motus-style 5B world model:
  baseline: around 1.2 s E2E for 10-step inference
  current optimized path: around 200 ms
  target: 100 ms-class realtime world-model inference on consumer GPUs
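The figures above can be cross-checked with a little arithmetic. The sketch below converts a per-call latency into an implied call rate, and works out the speedup and per-step budget for the world-model numbers. One assumption to flag: the per-step figures assume the optimized and target paths still run 10 steps, which the post states only for the baseline.

```python
def rate_hz(latency_ms):
    """Back-to-back call rate implied by a per-call latency."""
    return 1000.0 / latency_ms


def speedup(baseline_ms, optimized_ms):
    """How many times faster the optimized path is than the baseline."""
    return baseline_ms / optimized_ms


def per_step_ms(total_ms, steps):
    """Average time available per diffusion step."""
    return total_ms / steps


# Orin Pi0.5: a 124 ms P50 implies a rate close to the reported 8.04 Hz.
print(f"Orin Pi0.5: {rate_hz(124):.2f} Hz from 124 ms")

# World model: roughly 1.2 s baseline versus roughly 200 ms optimized.
print(f"World-model speedup: {speedup(1200, 200):.1f}x")

# Per-step budget, assuming 10 steps in every case (see note above).
print(f"Per step now: {per_step_ms(200, 10):.0f} ms")
print(f"Per step at 100 ms target: {per_step_ms(100, 10):.0f} ms")
```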

The core idea is that many of these models are not limited by “GEMM speed” alone. They are often limited by runtime overhead, fragmented kernels, poor small-batch scheduling, quantization boundaries, memory movement, and compiler/toolchain mismatch.
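A toy cost model helps show why fragmentation matters at batch size 1. The sketch below assumes kernels run strictly one after another and that each pays a fixed launch cost on top of its compute time. Every number in it is hypothetical and chosen only for illustration; real GPUs can overlap launches with execution, so treat this as an upper-bound intuition and not a measurement of FlashRT.

```python
def total_time_us(num_kernels, launch_us, compute_us_per_kernel):
    """Serial model: every kernel pays a fixed launch cost plus its compute."""
    return num_kernels * (launch_us + compute_us_per_kernel)


def overhead_share(num_kernels, launch_us, compute_us_per_kernel):
    """Fraction of total time spent on launch overhead."""
    total = total_time_us(num_kernels, launch_us, compute_us_per_kernel)
    return num_kernels * launch_us / total


# (num_kernels, launch_us, compute_us_per_kernel); all values hypothetical.
cases = {
    # Many tiny kernels, each dominated by its launch cost.
    "fragmented": (2000, 5.0, 3.0),
    # Fuse every 4 small ops into one kernel: same compute, 4x fewer launches.
    "fused 4:1": (500, 5.0, 12.0),
    # Same fused kernels with a cheaper per-kernel launch (e.g. graph replay).
    "fused + cheaper launches": (500, 1.0, 12.0),
}

for name, args in cases.items():
    total_ms = total_time_us(*args) / 1000.0
    print(f"{name:26s} {total_ms:5.1f} ms, overhead {overhead_share(*args):.0%}")
```

In this model the total compute is identical in all three cases; only the number and cost of launches change.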

For example, in robotics inference, an action chunk can give you some temporal buffer, but the observation still needs to be fresh. If the runtime introduces unstable latency or excessive overhead, the robot ends up executing stale actions. This makes VLA inference a realtime systems problem, not just a model-serving problem.
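The staleness argument can be put into numbers with a simple model. Assume the policy captures an observation, spends `inference_ms` computing an action chunk, and the robot then executes actions from that chunk at a fixed control rate until the next chunk arrives. The 50 Hz control rate and the 10 actions per replan below are hypothetical; the two latencies are taken from the Pi0.5 figures quoted earlier in this post.

```python
def observation_age_ms(inference_ms, action_index, control_hz):
    """Age of the observation when the action at `action_index` executes.

    The action at index 0 runs right after inference finishes; each later
    action runs one control period after the previous one.
    """
    return inference_ms + action_index * 1000.0 / control_hz


def worst_case_age_ms(inference_ms, actions_per_replan, control_hz):
    """Observation age at the last action executed before the next chunk."""
    return observation_age_ms(inference_ms, actions_per_replan - 1, control_hz)


CONTROL_HZ = 50          # hypothetical control rate
ACTIONS_PER_REPLAN = 10  # hypothetical number of actions used per chunk

for label, latency_ms in [("Orin, 124 ms", 124), ("RTX 5090, 17 ms", 17)]:
    age = worst_case_age_ms(latency_ms, ACTIONS_PER_REPLAN, CONTROL_HZ)
    print(f"{label}: oldest observation acted on is {age:.0f} ms old")
```

Inference latency adds directly to the age of every action in the chunk, so latency jitter shows up as jitter in how stale the robot's view of the world is.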

Similarly, for local LLM and world-model workloads, single-user latency is very different from datacenter batch throughput. A backend optimized for high-throughput serving is not always optimal for a single stream on a local GPU.

Repo:

https://github.com/LiangSu8899/FlashRT

Jetson Orin deployment docs:

https://github.com/LiangSu8899/FlashRT/blob/main/docs/deployment_orin.md

The project is still early, but I’d love feedback from people working on:

  • VLA / robotics policies

  • local LLM inference

  • diffusion / video generation

  • world models

  • Jetson deployment

  • CUDA kernels

  • FP8 / INT8 / FP4 quantization

  • Hugging Face model integration

  • realtime inference APIs

My longer-term goal is to make FlashRT a practical low-latency backend for models that do not fit the standard “large-batch cloud serving” assumption — especially embodied AI, local agents, and realtime multi-modal systems.

Feedback, benchmarks, issues, PRs, and model integration suggestions are very welcome.
