Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreiftg3fkqglzzby7t6sujlxp47h3dnfpacpajm22bhsg6m43wdacai",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mlqp5vufjxu2"
  },
  "path": "/t/fine-tuning-microsoft-harrier-oss-v1-270m-with-sentencetransformertrainer-is-it-supported/175947#post_2",
  "publishedAt": "2026-05-13T14:21:46.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "microsoft/harrier-oss-v1-270m",
    "SentenceTransformerTrainingArguments",
    "MultipleNegativesRankingLoss",
    "Harrier-270M model card",
    "config_sentence_transformers.json",
    "Sentence Transformers training overview",
    "loss docs",
    "BAAI/bge-m3",
    "mainguyen9/vietlegal-harrier-0.6b",
    "MTEB",
    "MMTEB",
    "MTEB paper",
    "Quati",
    "pt-mteb/quati_1M_retrieval",
    "JurisTCU",
    "ASSIN2",
    "SkillRet paper",
    "Qwen3 Embedding discussion"
  ],
  "textContent": "Seems practically supported?\n\n* * *\n\n## Fine-tuning `microsoft/harrier-oss-v1-270m` with `SentenceTransformerTrainer`\n\nYes — **this should be a supported and reasonable setup** , with one important caveat: I have not found a public example that exactly combines:\n\n\n    microsoft/harrier-oss-v1-270m\n    + SentenceTransformerTrainer\n    + MultipleNegativesRankingLoss\n    + Portuguese QA retrieval\n\n\nHowever, the evidence strongly points to this being a valid path:\n\n  * microsoft/harrier-oss-v1-270m is packaged as a `sentence-transformers` model and can be loaded with `SentenceTransformer(\"microsoft/harrier-oss-v1-270m\")`.\n  * The model card explicitly shows Sentence Transformers usage and encodes **queries with a prompt** while encoding **documents without a prompt**.\n  * The model card says Harrier uses a **decoder-only architecture** , **last-token pooling** , and **L2-normalized embeddings**.\n  * The model card also says query instructions are how the model is trained and that omitting them can degrade performance; document-side instructions are not needed.\n  * SentenceTransformerTrainingArguments supports training-time `prompts`, including column-specific prompt mappings.\n  * MultipleNegativesRankingLoss is the standard Sentence Transformers loss for positive `(query, document)` / `(anchor, positive)` retrieval pairs.\n  * Nearby public examples exist, especially a Harrier-family Vietnamese legal retrieval model using Sentence Transformers + MNRL, and SkillRet-style decoder-embedding fine-tuning using query instructions and unprompted documents.\n\n\n\nMy recommendation is:\n\n\n    training query / anchor:      Instruct: ...\\nQuery: <Portuguese question>\n    training document / positive: <raw Portuguese passage>\n\n    inference query:              Instruct: ...\\nQuery: <Portuguese question>\n    inference document:           <raw Portuguese passage>\n\n\nSo: **apply the instruction prefix to the query/anchor side 
during training, not only at inference.** Keep documents/passages unprompted.\n\n* * *\n\n## 1. Should the instruction be applied during training?\n\nYes. For Harrier, the instruction is not just an inference-time decoration. It is part of the expected query-side input format.\n\nThe Harrier-270M model card shows this Sentence Transformers pattern:\n\n\n    query_embeddings = model.encode(queries, prompt_name=\"web_search_query\")\n    document_embeddings = model.encode(documents)\n\n\nThe same model card also shows the raw Transformers pattern:\n\n\n    def get_detailed_instruct(task_description: str, query: str) -> str:\n        return f\"Instruct: {task_description}\\nQuery: {query}\"\n\n    # Each query must come with a one-sentence instruction that describes the task.\n    # No need to add instruction for retrieval documents.\n\n\nThe FAQ is especially relevant: it says query instructions are how the model is trained, omitting them causes degradation, and document-side instructions are not needed.\n\nSo for Portuguese QA retrieval, I would train with:\n\n\n    query / anchor:\n    Instruct: Given a Portuguese question, retrieve relevant Portuguese passages that answer the question\n    Query: <question>\n\n    document / positive:\n    <passage>\n\n\nand infer with exactly the same policy:\n\n\n    query:\n    Instruct: Given a Portuguese question, retrieve relevant Portuguese passages that answer the question\n    Query: <question>\n\n    document:\n    <passage>\n\n\nThis avoids a train/inference mismatch.\n\n### Bad pattern\n\n\n    training query:\n    Qual é o prazo para interpor recurso administrativo?\n\n    inference query:\n    Instruct: Given a Portuguese question, retrieve relevant Portuguese passages that answer the question\n    Query: Qual é o prazo para interpor recurso administrativo?\n\n\nThis fine-tunes the model on raw questions but deploys it on prompted questions. 
For Harrier, that is probably the wrong input distribution, because the model card says queries are expected to carry an instruction.\n\n### Better pattern\n\n\n    training query:\n    Instruct: Given a Portuguese question, retrieve relevant Portuguese passages that answer the question\n    Query: Qual é o prazo para interpor recurso administrativo?\n\n    inference query:\n    Instruct: Given a Portuguese question, retrieve relevant Portuguese passages that answer the question\n    Query: Qual é o prazo para interpor recurso administrativo?\n\n\nDocuments should remain raw in both training and inference:\n\n\n    training document:\n    O prazo para interposição de recurso administrativo é de 10 dias úteis...\n\n    indexed document:\n    O prazo para interposição de recurso administrativo é de 10 dias úteis...\n\n\n* * *\n\n## 2. How to apply query-only prompts in `SentenceTransformerTrainer`\n\nIf your dataset columns are named `query` and `document`, use column-specific prompts:\n\n\n    from sentence_transformers import (\n        SentenceTransformer,\n        SentenceTransformerTrainer,\n        SentenceTransformerTrainingArguments,\n        losses,\n    )\n    from sentence_transformers.training_args import BatchSamplers\n\n    model = SentenceTransformer(\n        \"microsoft/harrier-oss-v1-270m\",\n        model_kwargs={\"dtype\": \"auto\"},\n    )\n\n    query_prompt = (\n        \"Instruct: Given a Portuguese question, \"\n        \"retrieve relevant Portuguese passages that answer the question\\n\"\n        \"Query: \"\n    )\n\n    args = SentenceTransformerTrainingArguments(\n        output_dir=\"harrier-270m-pt-qa-mnrl\",\n\n        per_device_train_batch_size=8,\n        # 8 x 16 = 128 samples per optimizer step on 1 GPU; MNRL in-batch\n        # negatives still come only from each batch of 8\n        gradient_accumulation_steps=16,\n\n        learning_rate=5e-6,\n        num_train_epochs=1,\n        warmup_ratio=0.10,\n        lr_scheduler_type=\"cosine\",\n\n        bf16=True,\n        gradient_checkpointing=True,\n\n        batch_sampler=BatchSamplers.NO_DUPLICATES,\n\n        prompts={\n            \"query\": 
query_prompt,\n            \"document\": \"\",\n        },\n\n        logging_steps=50,\n        save_strategy=\"steps\",\n        save_steps=500,\n        save_total_limit=2,\n    )\n\n    loss = losses.MultipleNegativesRankingLoss(\n        model,\n        directions=(\"query_to_doc\",),\n    )\n\n    trainer = SentenceTransformerTrainer(\n        model=model,\n        args=args,\n        train_dataset=train_dataset,  # columns: query, document\n        loss=loss,\n    )\n\n    trainer.train()\n    trainer.save_model(\"harrier-270m-pt-qa-mnrl/final\")\n\n\nIf your dataset columns are named `anchor` and `positive`, change only the prompt mapping:\n\n\n    prompts={\n        \"anchor\": query_prompt,\n        \"positive\": \"\",\n    }\n\n\nThe important rule is simple:\n\n\n    query-like column:    prompt\n    document-like column: no prompt\n\n\n* * *\n\n## 3. Should the prompt be in English or Portuguese?\n\nI would start with the English instruction format because it matches Harrier’s public prompt style:\n\n\n    Instruct: Given a Portuguese question, retrieve relevant Portuguese passages that answer the question\n    Query:\n\n\nThe query and document texts themselves should remain Portuguese.\n\nAfter you have a baseline, test a Portuguese instruction as an ablation:\n\n\n    Instruct: Dada uma pergunta em português, recupere passagens em português relevantes que respondam à pergunta\n    Query:\n\n\nI would not start by translating the structural markers to `Instrução:` and `Consulta:`. Keep `Instruct:` and `Query:` first, because that matches the Harrier format shown in the model card and config_sentence_transformers.json.\n\nRecommended first prompt:\n\n\n    query_prompt = (\n        \"Instruct: Given a Portuguese question, \"\n        \"retrieve relevant Portuguese passages that answer the question\\n\"\n        \"Query: \"\n    )\n\n\n* * *\n\n## 4. Is `MultipleNegativesRankingLoss` appropriate?\n\nYes. 
For QA retrieval, your data usually has the form:\n\n\n    query:    Portuguese question\n    positive: passage that answers the question\n\n\nThat is a natural fit for MultipleNegativesRankingLoss, which is designed for positive pairs such as `(query, response)` or `(anchor, positive)`.\n\nA basic version is:\n\n\n    loss = losses.MultipleNegativesRankingLoss(model)\n\n\nFor clarity in retrieval, I would write:\n\n\n    loss = losses.MultipleNegativesRankingLoss(\n        model,\n        directions=(\"query_to_doc\",),\n    )\n\n\nThis trains the model so that each query is closer to its matching passage than to other passages in the batch.\n\n* * *\n\n## 5. Main MNRL caveat: false negatives\n\nMNRL uses other positives in the batch as negatives. That is efficient, but it can be harmful if some of those “negatives” are actually relevant.\n\nExample:\n\n\n    query:\n    Como solicitar a segunda via da fatura?\n\n    positive A:\n    A segunda via da fatura pode ser solicitada no portal do cliente.\n\n    positive B for another query:\n    Para emitir uma cópia da fatura, acesse Minha Conta e clique em Segunda Via.\n\n\nFor the first query, positive B is not really a negative: it probably answers the same question. If it appears in the same batch, MNRL may incorrectly push it away.\n\nThis is common in QA retrieval, FAQ retrieval, legal retrieval, policy retrieval, support retrieval, and any corpus with repeated answer templates.\n\nTo reduce the problem, use:\n\n\n    batch_sampler=BatchSamplers.NO_DUPLICATES\n\n\nThis sampler only keeps identical texts out of the same batch; a paraphrase such as positive B still has to be handled by deduplication or filtering. The Sentence Transformers training overview specifically notes that losses using in-batch negatives benefit from no duplicate samples in a batch. 
The loss docs also discuss cached / larger-batch variants of MNRL.\n\nAlso deduplicate aggressively before training:\n\n  * exact duplicate passages;\n  * near-duplicate chunks;\n  * boilerplate-heavy passages;\n  * repeated FAQ answers;\n  * multiple chunks from the same source document;\n  * multiple positives that answer the same query.\n\n\n\n* * *\n\n## 6. Decoder-only Harrier vs encoder-style BGE-M3\n\nHarrier and BGE-M3 should not be treated as interchangeable SBERT-style encoders.\n\n### Harrier-specific considerations\n\nmicrosoft/harrier-oss-v1-270m is:\n\n  * decoder-only;\n  * multilingual;\n  * 270M parameters;\n  * 640-dimensional embeddings;\n  * up to 32,768 tokens;\n  * last-token pooled;\n  * L2-normalized;\n  * instruction-sensitive on the query side.\n\n\n\nWhen used through `SentenceTransformer`, last-token pooling and normalization are handled automatically.\n\nIf using raw `AutoModel`, you must reproduce the model-card pooling behavior yourself:\n\n\n    def last_token_pool(last_hidden_states, attention_mask):\n        left_padding = attention_mask[:, -1].sum() == attention_mask.shape[0]\n        if left_padding:\n            return last_hidden_states[:, -1]\n        sequence_lengths = attention_mask.sum(dim=1) - 1\n        batch_size = last_hidden_states.shape[0]\n        return last_hidden_states[torch.arange(batch_size), sequence_lengths]\n\n\nFor this use case, I would stay with `SentenceTransformer` unless there is a strong reason not to.\n\n### BGE-M3-specific considerations\n\nBAAI/bge-m3 is not just a dense embedding model. Its model card describes it as multi-functional, multilingual, and multi-granular:\n\n  * dense retrieval;\n  * sparse retrieval;\n  * multi-vector retrieval;\n  * more than 100 languages;\n  * up to 8192 tokens.\n\n\n\nThis matters for a fair comparison. 
Do not compare:\n\n\n    BGE-M3 hybrid/sparse/multi-vector system\n    vs\n    Harrier dense-only system\n\n\nand call that a model-only comparison.\n\nFairer comparisons are:\n\n\n    BGE-M3 dense vs Harrier dense\n    BGE-M3 hybrid vs Harrier dense + BM25\n    BGE-M3 + reranker vs Harrier + reranker\n\n\n* * *\n\n## 7. Recommended starting hyperparameters\n\nFor a first full-model Harrier-270M MNRL run, I would start conservatively.\n\nParameter | Recommended first value\n---|---\nBase model | `microsoft/harrier-oss-v1-270m`\nLoss | `MultipleNegativesRankingLoss`\nDirection | `(\"query_to_doc\",)`\nQuery prompt | yes\nDocument prompt | no\nLearning rate | `5e-6`\nLR candidates | `3e-6`, `5e-6`, `1e-5`\nEpochs | `1`\nWarmup ratio | `0.10`\nScheduler | `cosine`\nPrecision | `bf16` if supported\nPhysical batch size | `4–16`, depending on GPU\nEffective batch size | `128–256`\nBatch sampler | `BatchSamplers.NO_DUPLICATES`\nGradient checkpointing | yes if memory-bound\nMax sequence length | `512` or `1024` first\n\nI would not start with `5e-5` for full-model MNRL fine-tuning. Harrier is already a strong embedding model; the goal is adaptation, not overwriting its embedding geometry.\n\nA useful nearby reference is mainguyen9/vietlegal-harrier-0.6b, a Harrier-family Vietnamese legal retrieval model that reports Sentence Transformers training, MNRL, hard-negative mining, LR `3e-6`, batch size `256`, one epoch, warmup `10%`, cosine scheduler, and bf16. It is not the same model size or language, but it is a closer reference than generic BERT/SBERT defaults.\n\n* * *\n\n## 8. Should you use `CachedMultipleNegativesRankingLoss`?\n\nUse it if your GPU memory prevents a useful effective batch size.\n\nMNRL benefits from larger batches because larger batches provide more in-batch negatives. 
If normal MNRL is memory-bound, try:\n\n\n    loss = losses.CachedMultipleNegativesRankingLoss(\n        model,\n        mini_batch_size=32,\n    )\n\n\nThen test effective batch sizes like:\n\n\n    256\n    512\n    1024\n\n\nBut I would not make cached MNRL the first experiment. First establish that the simple MNRL setup works.\n\n* * *\n\n## 9. Suggested experiment matrix\n\nRun these in order.\n\nRun | Model | Training | Query prompt | Doc prompt | LR | Effective batch | Purpose\n---|---|---|---|---|---|---|---\nA | BGE-M3 | existing fine-tune | current | current | current | current | incumbent baseline\nB | Harrier-270M | none | yes | no | — | — | zero-shot baseline\nC | Harrier-270M | MNRL | yes | no | `5e-6` | 128 | main first run\nD | Harrier-270M | MNRL | yes | no | `3e-6` | 128–256 | lower-LR check\nE | Harrier-270M | MNRL | yes | no | `1e-5` | 128–256 | upper-LR check\nF | Harrier-270M | MNRL | no | no | `5e-6` | 128 | prompt ablation\nG | Harrier-270M | Cached MNRL | yes | no | `5e-6` | 256–1024 | batch-size check\nH | Harrier-270M | hard-negative stage | yes | no | `3e-6–5e-6` | task-dependent | ranking refinement\n\nThe most important comparison is:\n\n\n    fine-tuned BGE-M3\n    vs\n    zero-shot Harrier with query instruction\n    vs\n    fine-tuned Harrier with query instruction\n    vs\n    fine-tuned Harrier without query instruction\n\n\nLeaderboard scores are useful for model shortlisting, but the final decision should be based on your own Portuguese QA retrieval benchmark.\n\nFor broader benchmark context, see MTEB, MMTEB, and the original MTEB paper. MTEB-style scores are useful, but they do not replace task-specific evaluation.\n\n* * *\n\n## 10. 
Evaluation metrics\n\nUse the same evaluation pipeline for BGE-M3 and Harrier.\n\nMinimum retrieval metrics:\n\n\n    nDCG@10\n    MRR@10\n    Recall@5\n    Recall@10\n    Recall@50\n    Recall@100\n\n\nWhy these metrics matter:\n\nMetric | What it tells you\n---|---\n`Recall@50` / `Recall@100` | Whether the retriever can put the right passage somewhere in the candidate pool\n`Recall@5` / `Recall@10` | Whether the retriever is good enough for direct RAG context selection\n`MRR@10` | Whether the first relevant passage appears early\n`nDCG@10` | Ranking quality when there are multiple relevant passages\n\nAlso track operational metrics:\n\n\n    embedding throughput\n    query latency\n    index size\n    GPU memory\n    embedding dimension\n    chunk length\n    max sequence length\n\n\nFor Portuguese-specific external sanity checks, useful resources include:\n\n  * Quati for Brazilian Portuguese retrieval evaluation;\n  * pt-mteb/quati_1M_retrieval for an MTEB-style Quati variant;\n  * JurisTCU if the domain is legal or institutional;\n  * ASSIN2 for Portuguese semantic similarity sanity checks.\n\n\n\n* * *\n\n## 11. Hard negatives: useful, but second stage\n\nDo not start with hard negatives. Start with clean query-positive MNRL.\n\nAfter the first baseline is stable:\n\n\n    1. Embed the full corpus.\n    2. Retrieve top 100 candidates per training query.\n    3. Remove known positives.\n    4. Skip the top few candidates if they may be unlabeled positives.\n    5. Sample negatives from ranks 20–100 or 50–100.\n    6. Train a second stage with explicit negatives or a hard-negative-aware setup.\n\n\nThe reason to avoid the top retrieved “negative” is that it may actually be a valid answer that was not labeled.\n\nThe SkillRet paper is a useful related reference. 
It fine-tunes decoder-style embedding models using `MultipleNegativesRankingLoss`, applies the same task-specific query instruction to anchor queries during training, uses no document prompt for Harrier/Qwen-style embedding models, and mines hard negatives for the reranker stage. It also reports that fine-tuning Harrier-OSS-0.6B and Qwen3-Embedding-0.6B gives nearly identical performance in that task, suggesting that the training recipe matters at least as much as the exact decoder-embedding base.\n\n* * *\n\n## 12. Common pitfalls\n\n### Pitfall 1: double prompting\n\nBad:\n\n\n    dataset query already contains:\n    Instruct: ...\n    Query: ...\n\n    and TrainingArguments also uses:\n    prompts={\"query\": \"Instruct: ...\\nQuery: \"}\n\n\nThis produces:\n\n\n    Instruct: ...\n    Query: Instruct: ...\n    Query: <question>\n\n\nUse one method:\n\n\n    Either store raw queries and use prompts=...\n    or store prompted queries and do not use prompts=...\n\n\nI recommend storing raw queries and using `prompts=...`.\n\n* * *\n\n### Pitfall 2: prompting documents\n\nBad:\n\n\n    document:\n    Instruct: Given a Portuguese question, retrieve relevant Portuguese passages that answer the question\n    Query: <passage>\n\n\nFor Harrier retrieval, documents should be raw passages.\n\n* * *\n\n### Pitfall 3: train/inference mismatch\n\nBad:\n\n\n    training query:  raw question\n    inference query: prompted question\n\n\nBetter:\n\n\n    training query:  prompted question\n    inference query: prompted question\n\n\n* * *\n\n### Pitfall 4: comparing systems unfairly\n\nBad:\n\n\n    BGE-M3 hybrid vs Harrier dense-only\n\n\nBetter:\n\n\n    BGE-M3 dense vs Harrier dense\n    BGE-M3 hybrid vs Harrier dense + BM25\n    BGE-M3 + reranker vs Harrier + reranker\n\n\n* * *\n\n### Pitfall 5: starting with too much context\n\nHarrier supports long context, but that does not mean a first fine-tune should use 32k tokens.\n\nStart with:\n\n\n    512 or 1024 tokens\n\n\nThen 
test:\n\n\n    2048\n    4096\n    8192\n\n\nonly if your evaluation set shows that longer passages help.\n\nIn retrieval, better chunking is often more useful than simply increasing max length.\n\n* * *\n\n## 13. Inference after fine-tuning\n\nUse the same query prompt and raw documents:\n\n\n    from sentence_transformers import SentenceTransformer\n\n    model = SentenceTransformer(\"harrier-270m-pt-qa-mnrl/final\")\n\n    query_prompt = (\n        \"Instruct: Given a Portuguese question, \"\n        \"retrieve relevant Portuguese passages that answer the question\\n\"\n        \"Query: \"\n    )\n\n    queries = [\n        \"Qual é o prazo para interpor recurso administrativo?\",\n    ]\n\n    documents = [\n        \"O prazo para interposição de recurso administrativo é de 10 dias úteis...\",\n        \"A segunda via da fatura pode ser solicitada no portal do cliente...\",\n    ]\n\n    query_embeddings = model.encode(\n        queries,\n        prompt=query_prompt,\n        normalize_embeddings=True,\n    )\n\n    document_embeddings = model.encode(\n        documents,\n        normalize_embeddings=True,\n    )\n\n    scores = query_embeddings @ document_embeddings.T\n    print(scores)\n\n\nAvoid passing both `prompt` and `prompt_name` unless you intentionally want one to override the other. A related Qwen3 Embedding discussion notes that explicit `prompt` takes priority over `prompt_name` in Sentence Transformers-style usage.\n\n* * *\n\n## 14. Bottom-line recommendation\n\nFor this Portuguese QA retrieval use case, I would proceed like this:\n\n  1. Keep your fine-tuned BGE-M3 model as the incumbent baseline.\n  2. Evaluate Harrier-270M zero-shot with the correct query instruction and raw documents.\n  3. Fine-tune Harrier with MNRL using query-side instruction during training.\n  4. Do not prompt documents.\n  5. Start with `lr=5e-6`, one epoch, warmup `0.10`, cosine scheduler, bf16, effective batch size around `128`.\n  6. 
Run LR ablations at `3e-6`, `5e-6`, and `1e-5`.\n  7. Use `BatchSamplers.NO_DUPLICATES`.\n  8. Deduplicate query/document pairs aggressively.\n  9. Try `CachedMultipleNegativesRankingLoss` if memory prevents larger effective batches.\n  10. Add hard negatives only after the clean first-stage baseline works.\n  11. Compare systems fairly: dense vs dense, hybrid vs hybrid, reranked vs reranked.\n  12. Decide based on your own held-out Portuguese QA retrieval set, not only Multilingual MTEB v2.\n\n\n\n### Final concise answer\n\n  * **Supported?** Yes, practically. Harrier-270M is a Sentence Transformers model and should work with `SentenceTransformerTrainer`.\n  * **Exact public recipe?** I have not found an exact Harrier-270M + `SentenceTransformerTrainer` + MNRL + Portuguese QA recipe.\n  * **Instruction during training?** Yes. Apply it to the query/anchor side during both training and inference.\n  * **Documents?** Keep documents/passages unprompted.\n  * **Loss?** `MultipleNegativesRankingLoss` is appropriate for `(query, positive passage)` pairs.\n  * **Main risks?** Prompt mismatch, false negatives, duplicates, too-high LR, too-small effective batch, and incorrect pooling if using raw `AutoModel`.\n  * **Starting hyperparameters?** `lr=3e-6` to `1e-5`, one epoch, warmup `0.10`, cosine scheduler, bf16, effective batch `128–256`, `BatchSamplers.NO_DUPLICATES`.\n  * **Best next experiment?** Harrier zero-shot prompted vs Harrier MNRL prompted vs Harrier no-prompt ablation vs your fine-tuned BGE-M3 baseline.\n\n",
  "title": "Fine-tuning microsoft/harrier-oss-v1-270m with SentenceTransformerTrainer — is it supported?"
}