HallucinationBench — detect hallucinations in RAG output, now on PyPI

OpenAI Developer Community April 3, 2026

Source

Hi everyone,

Just published HallucinationBench to PyPI — a lightweight library for detecting hallucinations in RAG pipeline output.

pip install hallucinationbench

Usage:

from hallucinationbench import score

result = score(context=docs, response=llm_output) print(result.verdict) # PASS / WARN / FAIL print(result.faithfulness_score) # 0.0 – 1.0 print(result.hallucinated_claims) # list of fabricated statements

It uses GPT-4o-mini as a structured judge (~$0.001 per eval). No embeddings, no vector DB, no infrastructure.

Two design decisions I would love feedback on from this community:

Using response_format: json_object with temperature=0 for deterministic structured output — any edge cases I should handle?
Verdict thresholds (PASS >= 0.8, WARN >= 0.5, FAIL < 0.5) — do these feel right for production RAG systems?

PyPI: Client Challenge GitHub: GitHub - bdeva1975/hallucinationbench: Detect hallucinations in your RAG pipeline output — in two lines of Python. · GitHub

Feedback and PRs welcome!

Discussion in the ATmosphere