{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreigigfknffdtfub7hnqhvdh2mdwvwykfutapkixxmg2ne63umif7ku",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mnbvcd63iqa2"
  },
  "path": "/t/finetuning-a-reasoning-llm-with-supervised-or-reinforcement-learning/176449#post_2",
  "publishedAt": "2026-06-02T04:45:34.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "SFTTrainer",
    "Transformers: Tool use",
    "Agent Data Protocol",
    "Transformers chat templating docs",
    "Qwen-Agent",
    "Harmony response format",
    "tool calling docs",
    "tool-use docs",
    "SFTTrainer: Train on assistant messages only",
    "TRL issue #5471",
    "When2Call: When (not) to Call Tools",
    "ToolMind",
    "Berkeley Function Calling Leaderboard",
    "ToolSandbox",
    "DPOTrainer",
    "GRPOTrainer",
    "OpenEnv integration",
    "ToolRL: Reward is All Tool Learning Needs",
    "OTC: Optimal Tool Calls via Reinforcement Learning",
    "TRL SFTTrainer: Tool Calling with SFT",
    "Transformers: Chat templates",
    "TRL DPOTrainer: Tool Calling with DPO",
    "TRL GRPOTrainer",
    "TRL OpenEnv integration",
    "ToolSandbox: A Stateful, Conversational, Interactive Evaluation Benchmark",
    "OpenAI Harmony response format",
    "vLLM Tool Calling"
  ],
  "textContent": "Hmm, maybe something like this:\n\n* * *\n\nI would separate this into three questions that often get mixed together:\n\n  1. **How should I represent the training data?**\n  2. **Which tokens should actually receive loss during SFT?**\n  3. **When, if ever, should I move from SFT to preference optimization or RL?**\n\n\n\nMy short answer would be:\n\n> Start with SFT if you already have correct trajectories.\n>  Treat `assistant_think`, `assistant_tool`, and `assistant_answer` as an internal annotation format, not necessarily as model roles.\n>  Convert them into the target model’s actual chat template / tool-calling format.\n>  Add no-tool, clarification, and unavailable-tool examples.\n>  Consider DPO if you can create good/bad trajectory pairs.\n>  Consider GRPO/RL only if you have executable tools, a rollout environment, and reliable rewards.\n\nBelow is the longer version.\n\n## 1. Your SFT intuition is mostly right, but I would not train arbitrary custom roles directly\n\nYour idea of making each training example condition on the conversation history and supervise the next assistant-side behavior is basically reasonable.\n\nFor example, conceptually:\n\n\n    sample 1:\n      system\n      user_1\n      assistant_1\n\n    sample 2:\n      system\n      user_1\n      assistant_1\n      user_2\n      assistant_2\n\n\nThat is a normal way to turn multi-turn dialogue into next-assistant-response training examples.\n\nHowever, I would be careful with custom roles like:\n\n\n    assistant_think\n    assistant_tool\n    assistant_answer\n\n\nThose can be useful as **raw annotations** , but I would not assume the model understands them as roles unless your target model’s chat template explicitly supports them.\n\nIn Hugging Face / Transformers / TRL terms, the more standard representation is usually closer to:\n\n\n    assistant message containing reasoning-compatible content, if you really want to supervise reasoning text\n    assistant message 
containing tool_calls\n    tool role message containing the tool result\n    assistant message containing the final answer\n    tools column containing the available tool schemas\n\n\nTRL’s SFTTrainer now explicitly supports tool-calling SFT. Its docs say that each tool-calling dataset example should include conversation messages with `tool_calls` and `tool` role messages, plus the list of available tools in a `tools` column, typically as JSON schemas.\n\nThe Transformers tool-use docs also describe the same general shape: assistant messages can contain `tool_calls`, tool responses should be represented as `tool` role messages, and tools are supplied as schemas/functions to the chat template / tokenizer layer. See Transformers: Tool use.\n\nSo I would think of your format like this:\n\nYour raw annotation | Training-format target\n---|---\n`assistant_think` | Model-specific reasoning span, if you want to train visible reasoning\n`assistant_tool` | `assistant` message with `tool_calls`\ntool result / observation | `tool` role message\n`assistant_answer` | final `assistant` message\navailable tools | `tools` column / JSON schemas\n\nThis distinction between “raw trajectory format” and “training format” is important. A related research direction is the Agent Data Protocol, which treats heterogeneous agent trajectories as something to normalize into a common schema before training. You do not need to adopt ADP specifically, but the principle is useful: keep your internal annotation format separate from the model-specific training format.\n\n## 2. The chat template is not a cosmetic wrapper\n\nFor chat/instruct/tool models, the chat template is part of the interface the model was trained on.\n\nThe Transformers chat templating docs explain that role/content dictionaries are converted into a token sequence through the model’s chat template. 
Different model families use different control tokens and different tool-call formats.\n\nThat means this is risky:\n\n\n    role = assistant_think\n    role = assistant_tool\n    role = assistant_answer\n\n\nunless you intentionally write a chat template that renders those roles into the exact format your model should learn and later use.\n\nThis becomes especially important for reasoning/tool models:\n\nModel family / runtime | Why format matters\n---|---\nQwen / Qwen3 | Qwen has model-specific function-calling templates and parsers; Qwen-Agent encapsulates Qwen’s tool-calling templates/parsers.\nGPT-OSS | GPT-OSS models were trained on the Harmony response format, which defines conversation structure, reasoning output, and function calls.\nvLLM serving | vLLM’s tool calling docs require a chat template that handles `tool` role messages and assistant messages containing previous tool calls.\nGeneric Transformers | The tool-use docs expect tool schemas and model-specific rendering through `apply_chat_template`.\n\nSo my practical recommendation would be:\n\n> Keep `assistant_think`, `assistant_tool`, and `assistant_answer` in your preprocessing code if they help you reason about the data, but convert them before training into the exact message/tool format expected by your target model and inference stack.\n\n## 3. Which tokens should receive loss?\n\nFor SFT, you usually do **not** want to train on every token in the serialized conversation.\n\nA reasonable default is:\n\nSpan | Should receive loss? 
| Notes\n---|---|---\nsystem prompt | No | Conditioning context\nuser messages | No | Conditioning context\nassistant reasoning / thinking | Maybe | Only if you intentionally want the model to emit that reasoning format\nassistant tool call | Yes | The model must learn when/how to call tools\ntool result / observation | No | External environment output, not model-generated text\nfinal assistant answer | Yes | The model should learn the final response\n\nTRL has `assistant_only_loss=True` for assistant-message-only loss, and also supports completion-only loss for prompt/completion style datasets. See SFTTrainer: Train on assistant messages only.\n\nHowever, there is an important caveat: `assistant_only_loss=True` depends on the chat template being able to mark generation spans. The TRL docs mention that this uses `{% generation %}` / `{% endgeneration %}` blocks in the chat template. There is also an active-looking implementation/documentation issue around adding such generation markers to common chat templates: TRL issue #5471.\n\nSo I would not just trust the flag blindly. I would inspect the first batch.\n\nA simple sanity check is:\n\n\n    # Pseudocode / sketch\n    batch = next(iter(trainer.get_train_dataloader()))\n\n    input_ids = batch[\"input_ids\"][0]\n    labels = batch[\"labels\"][0]\n\n    visible_label_ids = [\n        token_id for token_id, label_id in zip(input_ids, labels)\n        if label_id != -100\n    ]\n\n    print(tokenizer.decode(visible_label_ids))\n\n\nYou want this decoded text to contain only the assistant-side spans you intend to supervise, such as tool calls and final answers. If it includes user messages, tool observations, or system text, your masking is wrong.\n\n## 4. Tool-call examples alone are not enough\n\nA common failure mode is: after fine-tuning on tool-call examples, the model starts calling tools too often.\n\nSo the dataset should not only contain “here is how to call a tool” examples. 
It should also contain:\n\nCase type | Why it matters\n---|---\nTool-required examples | Teach the model to call tools when needed\nNo-tool examples | Teach the model to answer directly when no tool is needed\nClarification examples | Teach the model to ask for missing required arguments\nUnavailable-tool examples | Teach the model to admit that the provided tools cannot solve the request\nIrrelevant-tool examples | Teach the model not to force an unrelated tool call\nBad-result / failed-tool examples | Teach recovery or fallback behavior\nMulti-turn tool-result examples | Teach the model to incorporate observations into later turns\n\nThis point is not just theoretical. The paper When2Call: When (not) to Call Tools focuses exactly on tool-calling decision-making: when to call a tool, when to ask follow-up questions, and when to admit that the question cannot be answered with the provided tools.\n\nThat is the part people often miss. Calling the right tool with the right arguments is one skill. Deciding whether a tool call should happen at all is another skill.\n\n## 5. Validate trajectories at the step level, not only the final answer level\n\nIf you have multi-turn trajectories, I would also inspect them at the turn/step level before training.\n\nA trajectory can have a correct final answer but still contain a bad intermediate action, such as:\n\n\n    wrong tool call\n    lucky tool result\n    correct final answer\n\n\nIf you train on that trajectory, the model may learn the bad intermediate policy.\n\nThis is one reason recent tool-use dataset work emphasizes filtering or validating intermediate steps. 
For example, ToolMind argues that trajectory-level validation can miss turn-level errors, and uses fine-grained turn-level filtering to remove erroneous or suboptimal steps.\n\nFor your case, I would check each step:\n\nStep | Check\n---|---\nReasoning / planning | Did the assistant correctly identify whether a tool is needed?\nTool selection | Was the selected tool relevant?\nArguments | Were the arguments available from context and schema-valid?\nTool result | Was the observation inserted into the dialogue correctly?\nFinal answer | Did the final answer use the tool result rather than hallucinating?\nCost | Did the trajectory avoid unnecessary tool calls?\n\n## 6. When is SFT enough?\n\nSFT is the right first move when you have high-quality demonstrations.\n\nSFT is especially good for:\n\nGoal | SFT suitability\n---|---\nLearning the serialized tool-call format | High\nLearning JSON/schema shape | High\nLearning basic tool choice from examples | Medium to high\nLearning to use tool results in final answers | High\nLearning no-tool behavior | Good if no-tool examples are included\nLearning robust exploration over new tools | Limited\nOptimizing tool-use cost | Limited\nRecovering from tool failure | Depends heavily on data\n\nSo I would start with SFT, but I would not assume that SFT alone solves the full policy problem.\n\nA practical first checkpoint after SFT:\n\nMetric | What to measure\n---|---\nFormat validity | Can you parse the model’s tool call?\nSchema validity | Do required fields and types match the schema?\nTool selection accuracy | Is the selected tool correct?\nNo-tool accuracy | Does it avoid tools when unnecessary?\nClarification accuracy | Does it ask for missing required info?\nGrounding | Does the final answer use the tool result?\nFinal answer correctness | Is the final answer correct?\nTool-call count | Is the model overusing tools?\n\nFor evaluation inspiration, see the Berkeley Function Calling Leaderboard, which focuses on 
function/tool-call accuracy, and ToolSandbox, which evaluates stateful, conversational, interactive tool use.\n\n## 7. DPO can be a natural next step before RL\n\nIf you can build preferred/rejected trajectory pairs, DPO is often simpler than full RL.\n\nTRL’s DPOTrainer supports tool-calling data too: examples can include `prompt`, `chosen`, and `rejected` conversations with `tool_calls`, `tool` role messages, and a `tools` column.\n\nExamples of useful DPO pairs:\n\nSituation | Chosen | Rejected\n---|---|---\nTool needed | Correct tool call + grounded answer | Hallucinated direct answer\nTool not needed | Direct answer | Unnecessary tool call\nMissing required argument | Clarifying question | Invalid tool call with guessed argument\nIrrelevant tools only | Explain that available tools are not enough | Force an unrelated tool call\nTool result given | Answer grounded in result | Answer ignores result\nCost-sensitive task | Minimal sufficient calls | Excessive repeated calls\nInvalid JSON risk | Parseable/schema-valid call | Malformed call\n\nThis is often a very practical middle ground:\n\n\n    SFT teaches the model the basic behavior.\n    DPO nudges the model away from bad variants of that behavior.\n    RL is only needed if you have an executable environment and reliable rewards.\n\n\n## 8. 
When should you use RL / GRPO?\n\nI would only move to RL if you have more than just example trajectories.\n\nYou need at least some of the following:\n\nRequirement | Why it matters\n---|---\nExecutable tools | The model’s tool calls must actually run during rollout\nParser | The training loop must parse tool calls from model output\nEnvironment state | Multi-turn tool use often changes state\nVerifier | You need to score success or failure\nReward components | Tool selection, arguments, execution, grounding, cost\nStable chat template | Tool calls and observations must serialize consistently\nInitial tool-capable policy | Otherwise RL may not explore useful tool calls\n\nTRL’s GRPOTrainer supports tools and also an `environment_factory` mode, where the trainer creates an environment instance per rollout and exposes public methods as tools. TRL’s OpenEnv integration is also relevant if you want environment-backed training.\n\nThe important point is that RL is not just “SFT plus a reward function”. You need the full loop:\n\n\n    model generates\n    → parser extracts tool call\n    → tool/environment executes\n    → observation is returned to the model\n    → model continues\n    → verifier computes rewards\n    → policy update happens\n\n\nIf you cannot execute tools during rollout or cannot compute meaningful rewards, I would not start with RL.\n\n## 9. 
Reward design for tool use should be decomposed\n\nA final-answer-only reward is often too coarse.\n\nThe paper ToolRL: Reward is All Tool Learning Needs makes this point directly: tool-use RL is hard because multiple tools and diverse parameters require more fine-grained feedback than simple answer matching.\n\nA useful reward decomposition might be:\n\nReward component | Example\n---|---\nFormat reward | Output is parseable as a tool call or final answer\nSchema reward | Required arguments exist and have correct types\nTool selection reward | Correct tool selected\nArgument semantic reward | Arguments are correct given the conversation\nExecution reward | Tool executes successfully\nGrounding reward | Final answer uses the tool observation\nFinal correctness reward | The final answer is correct\nNo-tool reward | Avoids tools when no tool is needed\nClarification reward | Asks for missing required information\nCost penalty | Penalizes unnecessary tool calls or excessive calls\n\nAlso, beware of overusing tools. Work such as OTC: Optimal Tool Calls via Reinforcement Learning focuses on encouraging accurate answers with fewer tool calls. This matters because a reward that only values final correctness can accidentally teach the model to call tools too often.\n\n## 10. Suggested practical training path\n\nI would use this staged approach:\n\nStage | Do this | Move on when\n---|---|---\n0. Normalize data | Convert raw `assistant_think/tool/answer` annotations into target chat/tool format | The rendered examples match the target model’s template\n1. Mask inspection | Verify which tokens receive loss | Only intended assistant spans are supervised\n2. SFT | Train on high-quality trajectories | Format, schema, and basic tool use work\n3. Evaluation | Test tool/no-tool, schema, grounding, final correctness | You know the failure modes\n4. DPO | Use chosen/rejected pairs for common mistakes | Over-calling, invalid calls, and hallucinations improve\n5. 
RL/GRPO | Only if tools are executable and rewards are reliable | You can run environment-backed rollouts\n\nIn short:\n\n\n    If you have demonstrations:\n      start with SFT.\n\n    If you have good vs bad trajectory pairs:\n      consider DPO.\n\n    If you have executable tools + verifier + reward:\n      consider GRPO/RL.\n\n    If you have none of those:\n      build evaluation and clean the dataset first.\n\n\n## 11. A possible data representation\n\nAs an internal raw format, something like this is fine:\n\n\n    {\n      \"system\": \"You are a helpful assistant with tool access.\",\n      \"turns\": [\n        {\n          \"user\": \"What's the weather in Paris tomorrow?\",\n          \"assistant_think\": \"The user asks for current/future weather, so I need a weather tool.\",\n          \"assistant_tool\": {\n            \"name\": \"get_weather\",\n            \"arguments\": {\n              \"city\": \"Paris\",\n              \"date\": \"tomorrow\"\n            }\n          },\n          \"tool_result\": {\n            \"forecast\": \"Light rain, 13C\"\n          },\n          \"assistant_answer\": \"Tomorrow in Paris, expect light rain and about 13°C.\"\n        }\n      ]\n    }\n\n\nBut before training, I would convert it to a model/tool format closer to:\n\n\n    {\n      \"messages\": [\n        {\n          \"role\": \"system\",\n          \"content\": \"You are a helpful assistant with tool access.\"\n        },\n        {\n          \"role\": \"user\",\n          \"content\": \"What's the weather in Paris tomorrow?\"\n        },\n        {\n          \"role\": \"assistant\",\n          \"tool_calls\": [\n            {\n              \"type\": \"function\",\n              \"function\": {\n                \"name\": \"get_weather\",\n                \"arguments\": {\n                  \"city\": \"Paris\",\n                  \"date\": \"tomorrow\"\n                }\n              }\n            }\n          ]\n        },\n        {\n          
\"role\": \"tool\",\n          \"content\": \"{\\\"forecast\\\":\\\"Light rain, 13C\\\"}\"\n        },\n        {\n          \"role\": \"assistant\",\n          \"content\": \"Tomorrow in Paris, expect light rain and about 13°C.\"\n        }\n      ],\n      \"tools\": [\n        {\n          \"type\": \"function\",\n          \"function\": {\n            \"name\": \"get_weather\",\n            \"description\": \"Get a weather forecast for a city and date.\",\n            \"parameters\": {\n              \"type\": \"object\",\n              \"properties\": {\n                \"city\": {\n                  \"type\": \"string\"\n                },\n                \"date\": {\n                  \"type\": \"string\"\n                }\n              },\n              \"required\": [\"city\", \"date\"]\n            }\n          }\n        }\n      ]\n    }\n\n\nThe exact schema may differ depending on your model, trainer, and serving stack. The key point is not this exact JSON shape. The key point is that the training format should match the model’s tool-calling chat template.\n\n## 12. Final recommendation\n\nSo my answer would be:\n\n  1. **Yes, start with SFT** if you have correct trajectories.\n  2. **Do not train arbitrary custom roles directly** unless your target model’s template supports them.\n  3. **Convert your annotations into the target tool-call format** , usually `tool_calls`, `tool` role messages, and `tools` schemas.\n  4. **Mask loss carefully** : user/system/tool observations should generally not be supervised; assistant tool calls and final answers should be.\n  5. **Inspect the labels** , because assistant-only loss depends on the chat template.\n  6. **Add no-tool, clarification, and unavailable-tool cases** , not only positive tool-call examples.\n  7. **Use DPO** if you can create chosen/rejected trajectory pairs.\n  8. **Use GRPO/RL only when you have executable tools and meaningful rewards**.\n  9. 
**Evaluate more than final accuracy**: measure format validity, schema validity, tool selection, no-tool behavior, clarification behavior, grounding, final correctness, and tool-call cost.\n\n\n\nThe practical path is:\n\n\n    SFT first.\n    DPO if you can create preference pairs.\n    GRPO/RL only if you can run tools during rollout and compute reliable rewards.\n\n\nUseful references:\n\n  * TRL SFTTrainer: Tool Calling with SFT\n  * Transformers: Tool use\n  * Transformers: Chat templates\n  * TRL DPOTrainer: Tool Calling with DPO\n  * TRL GRPOTrainer\n  * TRL OpenEnv integration\n  * When2Call: When (not) to Call Tools\n  * ToolRL: Reward is All Tool Learning Needs\n  * ToolSandbox: A Stateful, Conversational, Interactive Evaluation Benchmark\n  * Berkeley Function Calling Leaderboard\n  * Agent Data Protocol\n  * ToolMind\n  * Qwen-Agent\n  * OpenAI Harmony response format\n  * vLLM Tool Calling\n\n",
  "title": "Finetuning a Reasoning LLM with Supervised or Reinforcement Learning?"
}