{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreieqo4el27k5lqrmfz3swotjjd52de5phjltosbp57cfgve2zetzri",
"uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mjjlcvb3ivp2"
},
"path": "/t/docling-studio-0-4-0-from-ocr-debugger-to-rag-pipeline-inspection-tool/175267#post_1",
"publishedAt": "2026-04-15T09:01:29.000Z",
"site": "https://discuss.huggingface.co",
"tags": [
"github.com/scub-france/Docling-Studio",
"huggingface.co/spaces/Pier-Jean/Docling-Studio"
],
"textContent": "Hey everyone,\n\nJust shipped Docling Studio 0.4.0 and wanted to share here since the project started getting traction on HF.\n\nQuick recap: Docling Studio is a visual inspection tool for Docling (IBM Research / LF AI & Data). You convert a PDF, you see bounding boxes, chunks, layout — everything Docling extracts, rendered visually so you can actually debug what’s going on.\n\nThat part is still there and unchanged. But 0.4.0 adds something I’ve been working toward for a while: **a full ingestion pipeline**.\n\nThe flow is now: Docling → chunking → embedding (sentence-transformers) → OpenSearch. End-to-end, orchestrated, with idempotent re-ingestion.\n\nWhy does this matter? If you’re building RAG on top of Docling, at some point your retrieval gives bad results and you need to figure out why. Was the chunking wrong? Did a table get split across two chunks? Is there garbage text from a bad OCR region? Docling Studio now lets you visually inspect what’s actually in your vector store, edit chunk text inline, soft-delete chunks that shouldn’t be there, and search across indexed content.\n\nA few things worth noting:\n\n * The whole ingestion pipeline is **opt-in via feature flags**. No `OPENSEARCH_URL` set → no ingestion UI, no extra dependencies, same lightweight image as before. People using it as a pure OCR debugger won’t notice any difference.\n\n * Architecture is hexagonal (ports & adapters). OpenSearch is the first `VectorStore` adapter. 
The port is a Python Protocol with 5 methods — adding another store is straightforward.\n\n * 541 tests (380 backend, 161 frontend) including Karate E2E tests covering the full PDF-to-OpenSearch flow.\n\n * Still ships as a single Docker image, multi-arch.\n\nYou can try it right now:\n\n    docker pull ghcr.io/scub-france/docling-studio:0.4.0-remote\n\nOr check the repo: github.com/scub-france/Docling-Studio\n\nThere’s also a demo on HF Spaces (OCR debug mode only, no ingestion there obviously): huggingface.co/spaces/Pier-Jean/Docling-Studio\n\nWould love to hear feedback — especially from people building RAG pipelines with Docling. What vector store would you want to see next? What’s your biggest pain point when debugging retrieval quality?",
"title": "Docling Studio 0.4.0 — from OCR debugger to RAG pipeline inspection tool"
}