{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreigbpwczankpshwjmtlt3ypajn5smlebuq4lufrwyqhlyfnuzc3oce",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mmo7g7judn32"
  },
  "path": "/t/flashrt-realtime-small-batch-inference-for-vla-embodied-ai-models/176212#post_1",
  "publishedAt": "2026-05-25T08:05:05.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "https://github.com/LiangSu8899/FlashRT",
    "https://github.com/LiangSu8899/FlashRT/blob/main/docs/deployment_orin.md"
  ],
  "textContent": "Hi Hugging Face community,\n\nI’d like to share **FlashRT**, an open-source realtime inference engine I’m building for small-batch, latency-sensitive AI workloads.\n\nThe motivation is simple: most modern inference infrastructure is extremely good at large-batch cloud serving, but many emerging workloads look very different:\n\n  * VLA / embodied AI policies\n\n  * local LLM inference\n\n  * diffusion / video generation\n\n  * world models\n\n  * robot control loops\n\n  * single-user local GPU applications\n\n  * edge deployment on Jetson-class devices\n\n\n\n\nThese workloads are often **batch size 1**, latency-sensitive, multi-modal, and hard to optimize with general-purpose serving assumptions. Average throughput is not enough: p50 / p95 latency, launch overhead, memory movement, quantization behavior, and small-op fragmentation matter a lot.\n\nFlashRT is my attempt to build a CUDA-first inference runtime for this setting.\n\nIt focuses on:\n\n  * direct CUDA kernel execution\n\n  * model-specific small-batch optimization\n\n  * fused quantization and dequantization\n\n  * FP8 / INT8 / FP4-oriented execution paths\n\n  * CUDA Graph replay\n\n  * reducing Python/runtime overhead\n\n  * avoiding long compile/export/calibration pipelines\n\n  * making realtime inference practical on both edge devices and consumer GPUs\n\n\n\n\nCurrent supported / tested directions include:\n\n\n    VLA / robotics:\n      - Pi0.5\n      - Pi0\n      - Pi0-FAST\n      - GR00T N1.6\n      - Jetson Thor / Orin / RTX local GPU deployment\n\n    LLM:\n      - Qwen-style local inference experiments\n      - single-stream / small-batch optimization\n      - kernel-level quantization and fusion paths\n\n    Diffusion / world models:\n      - Wan2.2 / Motus-style world model optimization\n      - video/world-model inference on consumer GPUs\n      - reducing fragmented launch overhead with fused kernels\n\n\n\nSome current benchmark examples:\n\n\n    Pi0.5 / VLA 
inference:\n\n    Jetson AGX Orin 64GB / SM87\n      Pi0.5 DROID INT8, 2 cameras, 27 layers, 10 diffusion steps\n      cache_frames=1:\n        P50 latency: 124 ms\n        Throughput: 8.04 Hz\n        Cosine: 1.000 vs BF16 reference\n\n    Jetson Thor / SM110\n      Pi0.5-class workload:\n        around 44–46 ms\n\n    RTX 5090 / SM120\n      Pi0.5-class workload:\n        around 17–18 ms\n\n\n\n\n    GR00T N1.6:\n\n    Jetson Thor:\n      around 41–45 ms depending on sequence setting\n\n    RTX 5090:\n      around 12–13 ms\n\n\n\n\n    Pi0-FAST:\n\n    Jetson Thor:\n      around 8.1 ms / token\n\n    RTX 5090:\n      around 2.4 ms / token\n\n\n\n\n    World model / video diffusion direction:\n\n    Wan2.2 / Motus-style 5B world model:\n      baseline: around 1.2 s E2E for 10-step inference\n      current optimized path: around 200 ms\n      target: around 100 ms-class realtime world-model inference on consumer GPUs\n\n\n\nThe core idea is that many of these models are limited not only by “GEMM speed”. They are often limited by runtime overhead, fragmented kernels, poor small-batch scheduling, quantization boundaries, memory movement, and compiler/toolchain mismatch.\n\nFor example, in robotics inference, an action chunk can give you some temporal buffer, but the observation still needs to be fresh. If the runtime introduces unstable latency or excessive overhead, the robot ends up executing stale actions. This makes VLA inference a realtime systems problem, not just a model-serving problem.\n\nSimilarly, for local LLM and world-model workloads, single-user latency is very different from datacenter batch throughput. 
A backend optimized for high-throughput serving is not always optimal for a single stream on a local GPU.\n\nRepo:\n\nhttps://github.com/LiangSu8899/FlashRT\n\nJetson Orin deployment docs:\n\nhttps://github.com/LiangSu8899/FlashRT/blob/main/docs/deployment_orin.md\n\nThe project is still early, but I’d love feedback from people working on:\n\n  * VLA / robotics policies\n\n  * local LLM inference\n\n  * diffusion / video generation\n\n  * world models\n\n  * Jetson deployment\n\n  * CUDA kernels\n\n  * FP8 / INT8 / FP4 quantization\n\n  * Hugging Face model integration\n\n  * realtime inference APIs\n\n\n\n\nMy longer-term goal is to make FlashRT a practical low-latency backend for models that do not fit the standard “large-batch cloud serving” assumption — especially embodied AI, local agents, and realtime multi-modal systems.\n\nFeedback, benchmarks, issues, PRs, and model integration suggestions are very welcome.",
  "title": "FlashRT: realtime small-batch inference for VLA / embodied AI models"
}