{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreigy2pvbzunmtvfp7emhaaag2luwd6vi47la6uqgvctswszsrb5u2a",
"uri": "at://did:plc:lk3jfj3zq4k4wxnk474axylu/app.bsky.feed.post/3mikw6otfpva2"
},
"path": "/t/hallucinationbench-detect-hallucinations-in-rag-output-now-on-pypi/1378411#post_1",
"publishedAt": "2026-04-03T04:02:00.000Z",
"site": "https://community.openai.com",
"tags": [
"Client Challenge",
"GitHub - bdeva1975/hallucinationbench: Detect hallucinations in your RAG pipeline output — in two lines of Python. · GitHub"
],
"textContent": "Hi everyone,\n\nJust published HallucinationBench to PyPI — a lightweight library for\ndetecting hallucinations in RAG pipeline output.\n\npip install hallucinationbench\n\nUsage:\n\nfrom hallucinationbench import score\n\nresult = score(context=docs, response=llm_output)\nprint(result.verdict) # PASS / WARN / FAIL\nprint(result.faithfulness_score) # 0.0 – 1.0\nprint(result.hallucinated_claims) # list of fabricated statements\n\nIt uses GPT-4o-mini as a structured judge (~$0.001 per eval).\nNo embeddings, no vector DB, no infrastructure.\n\nTwo design decisions I would love feedback on from this community:\n\n 1. Using response_format: json_object with temperature=0 for\ndeterministic structured output — any edge cases I should handle?\n\n 2. Verdict thresholds (PASS >= 0.8, WARN >= 0.5, FAIL < 0.5) —\ndo these feel right for production RAG systems?\n\n\n\n\nPyPI: Client Challenge\nGitHub: GitHub - bdeva1975/hallucinationbench: Detect hallucinations in your RAG pipeline output — in two lines of Python. · GitHub\n\nFeedback and PRs welcome!",
"title": "HallucinationBench — detect hallucinations in RAG output, now on PyPI"
}