TRACE Score — a metric for multi-turn LLM consistency
Hugging Face Forums [Unofficial]
April 19, 2026
Built a metric that evaluates the full conversation arc instead of individual turns.
Example: for a conversation where the model ignores every user correction, BERTScore gives 0.84 and TRACE gives 0.61.
TRACE has five components: fact retention, self-contradiction, correction retention, topic coherence, and confidence stability. I benchmarked it on 102 conversations with Llama-3.1-8B. Across failure categories, TRACE scores span a range of 0.277, while BERTScore scores span only 0.044, so TRACE separates the categories much more clearly. The model retains user corrections only 25% of the time. No per-turn metric can detect this, because the failure only shows up across turns.
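To make the numbers above concrete, here is a rough sketch of how a correction-retention rate, a five-component composite, and a category "range" could be computed. This is not the trace-score package's API. The function names, the equal weighting, the "higher is better" orientation of every component, and the definition of range as max minus min of per-category means are all my assumptions for illustration.

```python
from statistics import mean

# Component names come from the post; treating each as a score in [0, 1]
# where higher is better is an assumption, not documented package behavior.
COMPONENTS = (
    "fact_retention",
    "self_contradiction",
    "correction_retention",
    "topic_coherence",
    "confidence_stability",
)


def correction_retention(retained_flags: list[bool]) -> float:
    """Fraction of user corrections the model still respects in later turns."""
    if not retained_flags:
        return 1.0  # assumed convention: nothing to retain counts as perfect
    return sum(retained_flags) / len(retained_flags)


def composite_score(components: dict[str, float]) -> float:
    """Unweighted mean of the five components (equal weights are an assumption)."""
    missing = [name for name in COMPONENTS if name not in components]
    if missing:
        raise ValueError(f"missing components: {missing}")
    return mean(components[name] for name in COMPONENTS)


def separation_range(scores_by_category: dict[str, list[float]]) -> float:
    """Assumed definition of 'range': max minus min of per-category mean scores."""
    category_means = [mean(scores) for scores in scores_by_category.values()]
    return max(category_means) - min(category_means)
```

For instance, `correction_retention([True, False, False, False])` gives 0.25, which is what "retains corrections 25% of the time" would look like for a conversation with four corrections. A larger `separation_range` means the metric pulls the failure categories further apart.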
PyPI package: trace-score
GitHub: github.com/Giri530/trace-score