{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreihrovy7owe46bdutac62l52ifxhvg5jip6ecdfcnehzlwr6o2azdu",
"uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mhsijudqjr32"
},
"path": "/t/architecture-suggestions-for-a-chatbot-website-widget/174553#post_2",
"publishedAt": "2026-03-24T08:04:16.000Z",
"site": "https://discuss.huggingface.co",
"tags": [
"National Center for State Courts",
"OpenAI Developers",
"FastAPI",
"Docling Project",
"qdrant.tech",
"Pinecone Docs",
"GitHub",
"Claude Platform",
"vLLM"
],
"textContent": "for now:\n\n* * *\n\nThe right next step is to turn your current metadata lookup system into a **routed retrieval system** with **offline document preparation** and **grounded answer generation**.\n\nYour current design already proves one important thing: the corpus is structured enough to answer exact questions like “What is the Gazette for date X?” The part that is missing is not “more RAG.” It is a second layer that can **read inside the matched document** , select the right sections, and synthesize them without losing legal structure. Public-sector chatbot guidance points in the same direction: these systems work best when they are narrow, task-oriented, use plain language, keep answers short, stay current, and provide a clear path to human follow-up when they cannot fully resolve the question. (National Center for State Courts)\n\n## The architecture I would use\n\nI would split the system into **two lanes**.\n\n### Lane 1: deterministic lookup\n\nUse this for questions such as:\n\n * “What is the Gazette for date X?”\n * “Which ordinance was published on day Y?”\n * “What is the latest version of rule Z?”\n\n\n\nThis lane should mostly bypass the LLM. It should hit PostgreSQL or your structured store directly, return the exact record, and optionally let the LLM rewrite it into plain language. This keeps latency low and avoids hallucinations.\n\n### Lane 2: grounded synthesis\n\nUse this for questions such as:\n\n * “Summarize the ordinance from Gazette X.”\n * “How do I file a complaint?”\n * “When did street parking legislation first emerge?”\n\n\n\nThis lane should do five things in order:\n\n 1. identify the document set with metadata\n 2. retrieve the relevant text sections\n 3. expand to surrounding legal structure\n 4. synthesize from those sections only\n 5. answer with citations\n\n\n\nThis is where tool calling helps. 
OpenAI’s function-calling guide is explicit that models can be connected to external data and actions through structured tools, which is exactly what you need for routing between SQL lookup, document retrieval, summary generation, and procedural extraction. (OpenAI Developers)\n\n## Orchestration for 50+ concurrent users\n\nFor your scale, I would not overbuild. A good production baseline is:\n\n * **FastAPI** for the HTTP layer\n * **PostgreSQL** for metadata and legal structure\n * **Qdrant** for semantic and hybrid retrieval\n * **Redis + Celery** for background jobs\n * **object storage** for PDFs and derived artifacts\n * **API-based LLMs** first, not self-hosting first\n\n\n\nFastAPI’s own deployment docs say that when deploying you typically want replication to take advantage of multiple cores and handle more requests. They also note that in Kubernetes you will usually run a **single Uvicorn process per container** and scale by replicas instead of cramming many workers into one container. Celery’s docs describe the exact queue model you need: a task queue with workers consuming jobs from a broker, supporting both real-time processing and scheduling. (FastAPI)\n\nThe key is not “how many workers?” The key is **queue separation**.\n\nI would create three execution classes:\n\n### A. Interactive queue\n\nFor:\n\n * exact lookup\n * short procedural answers\n * answer generation over already indexed chunks\n\n\n\nTarget latency here should feel like chat.\n\n### B. Heavy synthesis queue\n\nFor:\n\n * full-gazette summaries\n * multi-document comparisons\n * historical timeline reconstruction\n\n\n\nThese jobs are slower and should not block chat.\n\n### C. 
Ingestion queue\n\nFor:\n\n * PDF parsing\n * OCR fallback\n * embeddings\n * summary generation\n * document reindexing\n\n\n\nThat separation is what protects the user experience when several users ask expensive questions at the same time.\n\n## Context processing for large gazettes and laws\n\nThis is your second major problem, and the answer is **hierarchical summarization plus parent-child retrieval**.\n\nOpenAI’s long-document summarization cookbook shows the correct pattern: split a large document into manageable pieces, summarize the pieces, then combine them into a higher-level summary with controllable detail. That is much more reliable than stuffing one huge document into a single prompt. (OpenAI Developers)\n\nFor your corpus, I would precompute four levels:\n\n### 1. Section summary\n\nFor each article, clause, or ordinance block:\n\n * plain-language summary\n * legal summary\n * key obligations\n * dates\n * penalties\n * responsible office\n\n\n\n### 2. Document summary\n\nFor each ordinance or law:\n\n * what it does\n * what changed\n * who is affected\n * current status\n * relationship to prior rules\n\n\n\n### 3. Gazette-day summary\n\nFor each gazette date:\n\n * all ordinances\n * major changes\n * repeals\n * amendments\n * citizen-facing impact\n\n\n\n### 4. 
Topic timeline summary\n\nFor topics like parking, complaints, permits, zoning:\n\n * first appearance\n * major amendments\n * latest controlling rule\n * related procedures\n\n\n\nThat gives you a very efficient runtime pattern:\n\n * retrieve the record with metadata\n * load the precomputed summary\n * verify against the source sections\n * answer with evidence\n\n\n\nThis is much faster and cheaper than re-reading the whole PDF every time.\n\n## Document parsing is not optional\n\nYour current failure mode likely starts **before retrieval**.\n\nIf the parser breaks reading order, merges columns, loses tables, or cuts article boundaries badly, the best model in the world will still answer poorly. That is why I would treat parsing as a first-class subsystem.\n\nDocling’s docs say it supports advanced PDF understanding, including page layout, reading order, table structure, formulas, and a unified document representation. PaddleOCR-VL-1.5’s current model card says it is built for document understanding and reaches state-of-the-art accuracy on OmniDocBench v1.5, with strong robustness to skew, warping, and screen-photography artifacts. 
(Docling Project)\n\nFor your project, that means:\n\n * use **Docling** first for structured extraction\n * use **OCR/document VLM fallback** only for hard scans\n * chunk by **legal structure**, not just by token count\n\n\n\nI would chunk at three levels:\n\n * full document\n * structural unit like chapter or article range\n * small retrieval chunk\n\n\n\nAt runtime, retrieve small chunks, rerank them, then expand to the parent structural unit before asking the LLM to answer.\n\n## Which vector database I recommend\n\nFor your workload, my first recommendation is **Qdrant**.\n\nWhy:\n\n * it supports **hybrid retrieval**\n * it can combine **dense and sparse** queries\n * it documents **Reciprocal Rank Fusion** for combining them\n * it works well with metadata-heavy search patterns\n\n\n\nQdrant’s hybrid query docs show exactly the pattern you need: prefetch sparse and dense candidates, fuse them with RRF, then limit the final set. That is a strong fit for legal corpora because legal search needs both semantic meaning and lexical precision. (qdrant.tech)\n\n### Why not pure vector search\n\nYour corpus has:\n\n * dates\n * gazette numbers\n * ordinance numbers\n * jurisdictions\n * statuses\n * exact legal phrases\n\n\n\nSo the right pattern is:\n\n**metadata filter → hybrid retrieval → rerank → synthesize**\n\n### When I would choose something else\n\nIf you want the least operational work, **Pinecone** is a credible managed alternative. Its own docs recommend a single hybrid index for most cases because it reduces operational overhead and allows single-request hybrid queries. (Pinecone Docs)\n\nIf you want the smallest possible stack and your corpus is still moderate, **pgvector** is a real option. But its own README shows the tradeoff: with approximate indexes, filtering is applied **after** the index scan, so filtered ANN queries often need tuning with `hnsw.ef_search`, iterative scans, partial indexes, or partitioning. 
That is workable, but it is not as clean for metadata-heavy legal search as a dedicated retrieval system. (GitHub)\n\nSo my recommendation is:\n\n * **Qdrant + PostgreSQL** for the best balance\n * **Pinecone + PostgreSQL** if you want lower ops\n * **pgvector** only if simplicity matters more than retrieval sophistication\n\n\n\n## API models or local models\n\nFor your current stage, I would start with **API models**.\n\nNot because local serving is bad. Because your true bottlenecks are still:\n\n * parsing\n * routing\n * retrieval quality\n * summary design\n * evidence formatting\n\n\n\nLocal inference becomes attractive later, when you know your real token volume and your workload is steady enough to keep GPUs busy.\n\n### Why APIs make sense first\n\nOpenAI’s current API pricing page lists **gpt-5.4-mini** at **$0.75 per 1M input tokens** and **$4.50 per 1M output tokens**, with **Batch** pricing at half that level. The same docs say prompt caching can reduce latency by up to **80%** and input token cost by up to **90%** when requests share long prompt prefixes, and the Batch API gives **50% lower costs**, a separate higher-rate-limit pool, and async completion within 24 hours. OpenAI’s data-controls guide also says API data is not used to train or improve OpenAI models unless you explicitly opt in. (OpenAI Developers)\n\nAnthropic’s current pricing and Sonnet pages say **Claude Sonnet 4.6** starts at **$3 per million input tokens** and **$15 per million output tokens**, supports a **1M-token context window**, and also offers up to **90% savings with prompt caching** and **50% savings with batch processing**. 
(Claude Platform)\n\nThat leads to a practical rule:\n\n### Use API models for:\n\n * live chat answers\n * routing\n * short summaries\n * high-quality final answer synthesis\n\n\n\n### Use Batch / offline jobs for:\n\n * embeddings\n * document summaries\n * gazette summaries\n * timeline generation\n * nightly reprocessing\n\n\n\nThis is the highest-leverage cost optimization for your case.\n\n### When local inference becomes worth it\n\nvLLM is the right self-hosting path when you get there, because it provides an **OpenAI-compatible server**. That lowers migration friction. But self-hosting only starts to make economic sense when you have one or more of these conditions:\n\n * strict sovereignty or residency needs\n * large, steady token volume\n * strong in-house ops capability\n * predictable, heavy offline workloads\n\n\n\nUntil then, APIs are usually cheaper in total engineering cost, even if the raw per-token price looks higher. vLLM solves serving. It does not solve parsing, retrieval, routing, monitoring, or concurrency tuning for you. (vLLM)\n\n## How your example questions should work\n\n### “When did street parking legislation first emerge?”\n\nThis should not be treated as plain semantic search.\n\nThe system should:\n\n 1. detect a history/timeline question\n 2. filter candidate documents by topic and jurisdiction\n 3. sort by publication or effective date\n 4. retrieve the earliest candidates\n 5. verify that they actually introduce parking rules\n 6. answer with the earliest verified source and later milestones\n\n\n\nThat is **metadata logic plus retrieval plus synthesis**.\n\n### “How do I submit a law for floor approval?”\n\nThis is a **procedure question**.\n\nThe system should:\n\n 1. search manuals, workflow rules, forms, and deadlines\n 2. extract a step-by-step procedure\n 3. answer in plain language\n 4. 
cite the rule and any required forms or offices\n\n\n\n### “How do I file a complaint?”\n\nThis should return:\n\n * who can file\n * where to file\n * required documents\n * deadlines\n * online or in-person options\n * what happens next\n\n\n\nThat is a structured service answer, not a chunk dump.\n\n## Product design ideas that fit this domain\n\nThe strongest product idea is to stop thinking in terms of “chat only” and start thinking in terms of **answer cards**.\n\nI would render answers as:\n\n### Direct answer\n\nOne short paragraph.\n\n### Legal basis\n\nGazette, ordinance, section, date.\n\n### What to do next\n\nFor procedures.\n\n### Related resources\n\nForms, offices, newer version, older version.\n\n### Limits\n\nIf the evidence is weak, say so.\n\nThat matches public legal-information guidance well. The NCSC guide explicitly recommends plain language, short responses, clear expectation-setting, and a clear path to follow up with the court when the chatbot cannot answer everything. (National Center for State Courts)\n\n## My direct answers to your four questions\n\n### 1. Orchestration\n\nUse **FastAPI + PostgreSQL + Qdrant + Redis/Celery + object storage + API models**. Scale FastAPI by replicas. Keep interactive traffic separate from heavy summary and ingestion jobs. (FastAPI)\n\n### 2. Context processing\n\nUse **hierarchical summarization**, structural chunking, parent-child retrieval, and precomputed summaries. Do not rely on one huge prompt per gazette. (OpenAI Developers)\n\n### 3. Vector infrastructure\n\nUse **Qdrant** first. It is a strong fit for metadata-aware hybrid retrieval. Use Pinecone if you want lower operational burden. Use pgvector only if you deliberately want a smaller, simpler stack and can tune filtered ANN behavior yourself. (qdrant.tech)\n\n### 4. Cost-benefit\n\nFor your scale today, **API models are the better first choice**. Use prompt caching and batch for the heavy offline work. 
Revisit vLLM only after you have measured real usage and know that GPU utilization will stay high enough to justify the operational overhead. (OpenAI Developers)\n\n## Bottom line\n\nYour best architecture is not “better RAG.” It is:\n\n**structured ingestion → metadata routing → hybrid retrieval → reranking → grounded synthesis → citations**\n\nThat is the design that will let your widget feel fluid like a general AI assistant while still behaving like a trustworthy government information system. The next concrete step is to define **four request flows** only: lookup, summary, procedure, and history.",
"title": "Architecture Suggestions for a Chatbot (Website Widget)"
}