External Publication

TRACE Score — a metric for multi-turn LLM consistency

Hugging Face Forums [Unofficial] April 19, 2026

Built a metric that evaluates the full conversation arc instead of individual turns.

BERTScore for a conversation where the model ignores every user correction: 0.84. TRACE for the same conversation: 0.61.

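TRACE has five components: fact retention, self-contradiction, correction retention, topic coherence, and confidence stability. I benchmarked it on 102 conversations with Llama-3.1-8B. Across failure categories, TRACE scores span a range of 0.277, while BERTScore spans only 0.044, so TRACE separates the categories far more clearly. The model retains user corrections only 25% of the time. No per-turn metric can detect this, because a per-turn metric scores each turn in isolation and never checks whether an earlier correction still holds later in the conversation.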
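For anyone wondering how a "range across failure categories" can be computed, here is a minimal sketch. It assumes the range is the gap between the highest and lowest per-category mean score, which the numbers above do not spell out. The category names and scores are toy values made up for illustration, not benchmark results, and `category_range` is a hypothetical helper, not part of the trace-score package.

```python
from collections import defaultdict
from statistics import mean


def category_range(scored):
    """Return (spread, per-category means) for (category, score) pairs.

    Assumption: "range" = max per-category mean minus min per-category mean.
    """
    by_category = defaultdict(list)
    for category, score in scored:
        by_category[category].append(score)
    means = {cat: mean(vals) for cat, vals in by_category.items()}
    return max(means.values()) - min(means.values()), means


# Toy scores only (hypothetical categories, invented numbers).
toy_scores = [
    ("clean", 0.9),
    ("clean", 0.8),
    ("ignored_correction", 0.6),
    ("ignored_correction", 0.5),
    ("self_contradiction", 0.7),
]

spread, means = category_range(toy_scores)
print(f"spread across categories: {spread:.3f}")
```

A metric with a wide spread tells the failure categories apart; one with a narrow spread gives nearly the same score to a clean conversation and a broken one.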

PyPI package: trace-score

GitHub: Giri530/trace-score
