Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreifhbqjbrkczczj7o3qca23qshnbyli5weybs3cdmitlw3ff7bgek4",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mkhzcjzgy2u2"
  },
  "path": "/t/custom-batches-in-sentence-transformers-for-multiplenegativesrankingloss/175587#post_1",
  "publishedAt": "2026-04-27T10:39:07.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "MultipleNegativesRankingLoss",
    "SentenceTransformerTrainer"
  ],
  "textContent": "I am using the `sentence-transformers` library to finetune a model to generate embeddings for postal addresses so that embeddings for the same address written in different ways are close to each other.\nHowever, addresses that differ only in a small part (e.g. the street number, or the name of the city) must have sufficiently different embeddings, which is not the case when I try to finetune the `all-mpnet-base-v2` model using the `CosineSimilarityLoss` (or similar losses).\n\nTherefore, I am trying to use the MultipleNegativesRankingLoss. As far as I understand, the computation of this loss function takes into account the whole minibatch, not just the individual pairs of sentences/addresses. It not only pulls the embeddings of the two sentences/addresses in a given pair together, but also treats sentences/addresses from other pairs in the same batch as negatives (which is exactly what I need).\n\nTherefore, I prepared a training set that is already partitioned into batches of 256 pairs each, taking care to put in the same batch pairs that must be considered strong negatives even if they are quite similar.\n\n\n    batches: list[tuple[tuple[str, str], 256]] = [\n        (\n            (batch1_anchor1,  batch1_positive1),  # ('Blue Street, 1, New York', 'Blue Street 1 - New York'),\n            (batch1_anchor2,  batch1_positive2),  # ('Blue Street, 11, New York', 'Blue Street 11 - New York'),\n            (batch1_anchor3,  batch1_positive3),\n            ...\n        ),\n        (\n            (batch2_anchor1,  batch2_positive1),\n            (batch2_anchor2,  batch2_positive2),\n            (batch2_anchor3,  batch2_positive3),\n            ...\n        ),\n        ...\n    ]\n\n\nMy question is: how do I preserve this batch structure when loading the training data into the trainer?\nThe SentenceTransformerTrainer class only accepts a `datasets.Dataset`, and I see no way to preserve my batches.\n\n\n        loss_fn = 
MultipleNegativesRankingLoss(\n            model,\n            directions=('query_to_doc', 'query_to_query', 'doc_to_query', 'doc_to_doc')\n        )\n\n        trainer = SentenceTransformerTrainer(\n            model=model,\n            args=args,\n            train_dataset=???,  # here I can pass a datasets.Dataset, not a torch.utils.data.DataLoader or equivalent\n            loss=loss_fn,\n        )\n",
  "title": "Custom batches in sentence-transformers for MultipleNegativesRankingLoss"
}