Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreibnkl5ndkolqn6pj75olqfp6clymeptzej76ifesoy7rewpk5faeq",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mjm3uetrtdt2"
  },
  "path": "/t/why-this-bertscore-has-a-high-precision/175276#post_2",
  "publishedAt": "2026-04-16T07:45:57.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "GitHub",
    "Comet",
    "Sentence Transformers"
  ],
  "textContent": "It seems to be a feature of the library that the scores can be hard to interpret unless you explicitly rescale them…\n\n* * *\n\nNo. You are not doing it wrong.\n\nThe confusing part is that **BERTScore precision is not “86% correct”**. BERTScore is a **token-matching similarity metric**. It compares each token in your prediction with tokens in the reference using **contextual embeddings** and **cosine similarity**, then computes precision, recall, and F1 from those matches. The official Hugging Face metric docs describe it exactly that way. (GitHub)\n\n## What your result really means\n\nYour output:\n\n\n    {'precision': [0.8612],\n     'recall': [0.8896],\n     'f1': [0.8752],\n     'hashcode': 'roberta-large_L17_no-idf_version=0.3.12(hug_trans=5.0.0)'}\n\n\ndoes **not** mean:\n\n  * the two sentences are 86% semantically similar\n  * or that 86% of the words are “correct”\n\n\n\nIt means something narrower:\n\n  * for each token in **“who are you”**, BERTScore found the **best-matching token** in **“look at the dogs”**\n  * those best-match similarities were fairly high on average\n  * so the final precision came out around `0.86` (GitHub)\n\n\n\n## Why this can happen even for unrelated sentences\n\nBERTScore uses **soft matches**, not exact matches.\n\nSo it is not asking:\n\n> “Are these two sentences obviously the same meaning?”\n\nIt is asking something closer to:\n\n> “How similar are the candidate tokens to their best reference-token matches in embedding space?” (GitHub)\n\nThat is why a short unrelated pair can still score high.\n\n## Why your score is especially high\n\nYour hashcode says:\n\n  * `roberta-large`\n  * layer `L17`\n  * `no-idf`\n\n\n\nThat is the standard English setup. The official BERTScore repo uses **`roberta-large` as the default English model**, and its rescaling note says raw BERTScore values with large RoBERTa models often live in a **narrow high range**, often around **0.85 to 0.95**. 
So your `0.8612` is surprising to a human reader, but **not surprising to the metric**. (GitHub)\n\n## Is your input format okay?\n\nYes.\n\nThis is valid:\n\n\n    references=[[\"look at the dogs\"]]\n\n\nThe Hugging Face BERTScore metric accepts `references` as either:\n\n  * a list of strings, or\n  * a list of lists of strings\n\n\n\nThe nested-list form is for **multiple references per prediction**. Since you only gave one reference, your format is fine. (GitHub)\n\n## The easiest way to think about it\n\nImagine BERTScore doing this:\n\n  * candidate: `who | are | you`\n  * reference: `look | at | the | dogs`\n\n\n\nFor precision, it roughly does:\n\n  * best match for `who`\n  * best match for `are`\n  * best match for `you`\n  * average those 3 values\n\n\n\nSo if those best matches were something like:\n\n\n    who -> 0.84\n    are -> 0.87\n    you -> 0.87\n\n\nthen the average would be about `0.86`.\n\nThat still would **not** mean the sentences are 86% similar. It would only mean the **best token-level contextual matches** were high. (Comet)\n\n## What to do if you want a more human-readable score\n\nUse:\n\n\n    rescale_with_baseline=True\n\n\nThe official BERTScore rescaling note says this was added because raw scores are hard to interpret and are often compressed into a small range. (GitHub)\n\nExample:\n\n\n    bert_metric.compute(\n        predictions=[\"who are you\"],\n        references=[\"look at the dogs\"],\n        lang=\"en\",\n        rescale_with_baseline=True,\n    )\n\n\n## If your real goal is sentence similarity\n\nIf what you really want is:\n\n> “How semantically similar are these two sentences?”\n\nthen a sentence-similarity model is usually a better fit than raw BERTScore. Sentence-Transformers’ official docs are specifically about **Semantic Textual Similarity** for this kind of task. 
(Sentence Transformers)\n\n## Bottom line\n\nThe simplest answer is:\n\n  * **Your code is fine**\n  * **Your interpretation is the part to change**\n  * **`0.8612` is not “86% similar”**\n  * it is a **raw token-level contextual similarity score**\n  * and with default English `roberta-large`, raw BERTScore values often look **artificially high** to humans (GitHub)\n\n",
  "title": "Why this BERTScore has a high precision?"
}