{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreibnkl5ndkolqn6pj75olqfp6clymeptzej76ifesoy7rewpk5faeq",
"uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mjm3uetrtdt2"
},
"path": "/t/why-this-bertscore-has-a-high-precision/175276#post_2",
"publishedAt": "2026-04-16T07:45:57.000Z",
"site": "https://discuss.huggingface.co",
"tags": [
"GitHub",
"Comet",
"Sentence Transformers"
],
"textContent": "It seems to be a feature of the library that the scores can be hard to interpret unless you explicitly rescale them…\n\n* * *\n\nNo. You are not doing it wrong.\n\nThe confusing part is that **BERTScore precision is not “86% correct”**. BERTScore is a **token-matching similarity metric**. It compares each token in your prediction with tokens in the reference using **contextual embeddings** and **cosine similarity** , then computes precision, recall, and F1 from those matches. The official Hugging Face metric docs describe it exactly that way. (GitHub)\n\n## What your result really means\n\nYour output:\n\n\n {'precision': [0.8612],\n 'recall': [0.8896],\n 'f1': [0.8752],\n 'hashcode': 'roberta-large_L17_no-idf_version=0.3.12(hug_trans=5.0.0)'}\n\n\ndoes **not** mean:\n\n * the two sentences are 86% semantically similar\n * or that 86% of the words are “correct”\n\n\n\nIt means something narrower:\n\n * for each token in **“who are you”** , BERTScore found the **best-matching token** in **“look at the dogs”**\n * those best-match similarities were fairly high on average\n * so the final precision came out around `0.86` (GitHub)\n\n\n\n## Why this can happen even for unrelated sentences\n\nBERTScore uses **soft matches** , not exact matches.\n\nSo it is not asking:\n\n> “Are these two sentences obviously the same meaning?”\n\nIt is asking something closer to:\n\n> “How similar are the candidate tokens to their best reference-token matches in embedding space?” (GitHub)\n\nThat is why a short unrelated pair can still score high.\n\n## Why your score is especially high\n\nYour hashcode says:\n\n * `roberta-large`\n * layer `L17`\n * `no-idf`\n\n\n\nThat is the standard English setup. The official BERTScore repo uses **`roberta-large` as the default English model**, and its rescaling note says raw BERTScore values with large RoBERTa models often live in a **narrow high range** , often around **0.85 to 0.95**. 
So your `0.8612` is surprising to a human reader, but **not surprising to the metric**. (GitHub)\n\n## Is your input format okay?\n\nYes.\n\nThis is valid:\n\n\n references=[[\"look at the dogs\"]]\n\n\nThe Hugging Face BERTScore metric accepts `references` as either:\n\n * a list of strings, or\n * a list of lists of strings\n\n\n\nThe nested-list form is for **multiple references per prediction**. Since you gave only one reference, your format is fine. (GitHub)\n\n## The easiest way to think about it\n\nImagine BERTScore doing this:\n\n * candidate: `who | are | you`\n * reference: `look | at | the | dogs`\n\n\n\nFor precision, it roughly does:\n\n * best match for `who`\n * best match for `are`\n * best match for `you`\n * average those 3 values\n\n\n\nSo if those best matches were something like:\n\n\n who -> 0.84\n are -> 0.87\n you -> 0.87\n\n\nthen the average would be about `0.86`.\n\nThat still would **not** mean the sentences are 86% similar. It would only mean the **best token-level contextual matches** were high. (Comet)\n\n## What to do if you want a more human-readable score\n\nUse:\n\n\n rescale_with_baseline=True\n\n\nThe official BERTScore rescaling note says this was added because raw scores are hard to interpret and are often compressed into a small range. (GitHub)\n\nExample:\n\n\n import evaluate\n # Assumes the metric was loaded through the Hugging Face evaluate library\n bert_metric = evaluate.load(\"bertscore\")\n results = bert_metric.compute(\n     predictions=[\"who are you\"],\n     references=[\"look at the dogs\"],\n     lang=\"en\",\n     rescale_with_baseline=True,\n )\n print(results)\n\n\n## If your real goal is sentence similarity\n\nIf what you really want is:\n\n> “How semantically similar are these two sentences?”\n\nthen a sentence-similarity model is usually a better fit than raw BERTScore. Sentence-Transformers’ official docs cover **Semantic Textual Similarity** for exactly this kind of task. 
(Sentence Transformers)\n\n## Bottom line\n\nThe simplest answer is:\n\n * **Your code is fine**\n * **Your interpretation is the part to change**\n * **`0.8612` is not “86% similar”**\n * it is a **raw token-level contextual similarity score**\n * and with default English `roberta-large`, raw BERTScore values often look **artificially high** to humans (GitHub)\n\n",
"title": "Why this BERTScore has a high precision?"
}