External Publication

Finetuning a Reasoning LLM with Supervised or Reinforcement Learning?

Hugging Face Forums [Unofficial] June 1, 2026

Hello,

I have a task to fine-tune small LLMs on annotated conversational data. The dataset contains not only the final answers, but also reasoning traces and tool-calling decisions (i.e., when the model should think and when it should call a tool).

I am wondering what the best training approach would be and why.

My current dataset is stored in a chat format similar to this:

system
user
assistant_think
assistant_tool
assistant_answer

user
assistant_think
assistant_tool
assistant_answer
...

My current idea is to split each conversation into multiple training samples. For example, if a conversation contains two user turns, I would create two samples:

Sample 1

system
user
assistant_think
assistant_tool
assistant_answer

Sample 2

system
user
assistant_think
assistant_tool
assistant_answer

user
assistant_think
assistant_tool
assistant_answer

In other words, each sample contains all previous conversation history up to the assistant response being trained.

For training, the loss would be computed only on the assistant-generated tokens:

assistant_think
assistant_tool
assistant_answer

while the system and user messages would be masked out from the loss.

Is this approach correct, or is there a better way to structure the training data for reasoning and tool-calling behavior?

My second question is about reinforcement learning.

After completing supervised fine-tuning (SFT) on the dataset described above, should I also incorporate RL to further train the model on e.g. when a tool should or should not be called?

If so:

What advantages would RL provide over SFT alone for tool-use and reasoning?
How would you design the reward function?
Under what circumstances is RL actually necessary, and when is SFT sufficient?

I would appreciate any practical advice, papers, blog posts, or open-source examples related to training reasoning and tool-calling models.

Discussion in the ATmosphere