{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreihwkp4hxdwho45gkyawqhyd5ug3vvbqw6xdudcr5nin63coiwhocu",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mnafzyqaupp2"
  },
  "path": "/t/finetuning-a-reasoning-llm-with-supervised-or-reinforcement-learning/176449#post_1",
  "publishedAt": "2026-06-01T14:05:13.000Z",
  "site": "https://discuss.huggingface.co",
  "textContent": "Hello,\n\nI have a task to fine-tune small LLMs on annotated conversational data. The dataset contains not only the final answers, but also reasoning traces and tool-calling decisions (i.e., when the model should think and when it should call a tool).\n\nI am wondering what the best training approach would be and why.\n\nMy current dataset is stored in a chat format similar to this:\n\n\n    system\n    user\n    assistant_think\n    assistant_tool\n    assistant_answer\n\n    user\n    assistant_think\n    assistant_tool\n    assistant_answer\n    ...\n\n\nMy current idea is to split each conversation into multiple training samples. For example, if a conversation contains two user turns, I would create two samples:\n\n**Sample 1**\n\n\n    system\n    user\n    assistant_think\n    assistant_tool\n    assistant_answer\n\n\n**Sample 2**\n\n\n    system\n    user\n    assistant_think\n    assistant_tool\n    assistant_answer\n\n    user\n    assistant_think\n    assistant_tool\n    assistant_answer\n\n\nIn other words, each sample contains all previous conversation history up to the assistant response being trained.\n\nFor training, the loss would be computed only on the assistant-generated tokens:\n\n\n    assistant_think\n    assistant_tool\n    assistant_answer\n\n\nwhile the system and user messages would be masked out from the loss.\n\nIs this approach correct, or is there a better way to structure the training data for reasoning and tool-calling behavior?\n\nMy second question is about reinforcement learning.\n\nAfter completing supervised fine-tuning (SFT) on the dataset described above, should I also incorporate RL to further train the model on e.g. when a tool should or should not be called?\n\nIf so:\n\n  * What advantages would RL provide over SFT alone for tool-use and reasoning?\n\n  * How would you design the reward function?\n\n  * Under what circumstances is RL actually necessary, and when is SFT sufficient?\n\n\n\n\nI would appreciate any practical advice, papers, blog posts, or open-source examples related to training reasoning and tool-calling models.",
  "title": "Finetuning a Reasoning LLM with Supervised or Reinforcement Learning?"
}