Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreigy2pvbzunmtvfp7emhaaag2luwd6vi47la6uqgvctswszsrb5u2a",
    "uri": "at://did:plc:lk3jfj3zq4k4wxnk474axylu/app.bsky.feed.post/3mikw6otfpva2"
  },
  "path": "/t/hallucinationbench-detect-hallucinations-in-rag-output-now-on-pypi/1378411#post_1",
  "publishedAt": "2026-04-03T04:02:00.000Z",
  "site": "https://community.openai.com",
  "tags": [
    "Client Challenge",
    "GitHub - bdeva1975/hallucinationbench: Detect hallucinations in your RAG pipeline output — in two lines of Python. · GitHub"
  ],
  "textContent": "Hi everyone,\n\nJust published HallucinationBench to PyPI — a lightweight library for\ndetecting hallucinations in RAG pipeline output.\n\npip install hallucinationbench\n\nUsage:\n\nfrom hallucinationbench import score\n\nresult = score(context=docs, response=llm_output)\nprint(result.verdict) # PASS / WARN / FAIL\nprint(result.faithfulness_score) # 0.0 – 1.0\nprint(result.hallucinated_claims) # list of fabricated statements\n\nIt uses GPT-4o-mini as a structured judge (~$0.001 per eval).\nNo embeddings, no vector DB, no infrastructure.\n\nTwo design decisions I would love feedback on from this community:\n\n  1. Using response_format: json_object with temperature=0 for\ndeterministic structured output — any edge cases I should handle?\n\n  2. Verdict thresholds (PASS >= 0.8, WARN >= 0.5, FAIL < 0.5) —\ndo these feel right for production RAG systems?\n\n\n\n\nPyPI: Client Challenge\nGitHub: GitHub - bdeva1975/hallucinationbench: Detect hallucinations in your RAG pipeline output — in two lines of Python. · GitHub\n\nFeedback and PRs welcome!",
  "title": "HallucinationBench — detect hallucinations in RAG output, now on PyPI"
}
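
The verdict thresholds quoted in the post's textContent (PASS >= 0.8, WARN >= 0.5, FAIL < 0.5) can be sketched as below, together with one defensive way to read a judge's JSON reply. This is only an illustration under stated assumptions, not the library's actual implementation: the function names `verdict_for` and `parse_judge_json` are hypothetical, and the JSON keys `faithfulness_score` and `hallucinated_claims` are assumed to mirror the result attributes shown in the post.

```python
import json
import math

# Thresholds as stated in the post: PASS >= 0.8, WARN >= 0.5, FAIL < 0.5.
PASS_THRESHOLD = 0.8
WARN_THRESHOLD = 0.5


def verdict_for(score: float) -> str:
    """Map a faithfulness score in [0.0, 1.0] to PASS / WARN / FAIL.

    Hypothetical helper; not part of the hallucinationbench API.
    """
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        raise ValueError(f"score must be a number, got {type(score).__name__}")
    if math.isnan(score) or not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be within [0.0, 1.0], got {score}")
    if score >= PASS_THRESHOLD:
        return "PASS"
    if score >= WARN_THRESHOLD:
        return "WARN"
    return "FAIL"


def parse_judge_json(raw: str) -> dict:
    """Validate a judge reply before trusting it.

    Assumed reply shape (an assumption, not confirmed by the post):
    {"faithfulness_score": <number>, "hallucinated_claims": [<string>, ...]}
    """
    data = json.loads(raw)  # raises json.JSONDecodeError on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("judge reply must be a JSON object")
    if "faithfulness_score" not in data:
        raise ValueError("judge reply is missing 'faithfulness_score'")
    score = data["faithfulness_score"]
    claims = data.get("hallucinated_claims", [])
    if not isinstance(claims, list):
        raise ValueError("'hallucinated_claims' must be a list")
    return {
        "faithfulness_score": float(score) if not isinstance(score, bool) else score,
        "hallucinated_claims": [str(claim) for claim in claims],
        "verdict": verdict_for(score),  # also rejects non-numeric or out-of-range scores
    }


if __name__ == "__main__":
    reply = '{"faithfulness_score": 0.6, "hallucinated_claims": ["example claim"]}'
    print(parse_judge_json(reply))
```

Scores that fall exactly on a boundary (0.8 and 0.5) land in the higher bucket because the comparisons are inclusive, which matches the ">=" wording in the post. Rejecting a missing, non-numeric, or out-of-range score covers some of the edge cases the post asks about for JSON-mode output.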