What prompt format/template should I use for training Unsloth/Phi-3.5-mini-instruct?
Personally, when the model ships with a chat template (very old models sometimes do not), I think using it is often the safer choice:
Short answer
I would keep your raw {input, output} data structure, but I would not use the hand-written
### Input: ...
### Output: ...
<|endoftext|>
string as my default training format for Phi-3.5-mini-instruct.
For this model, I would start with this route instead:
- keep your raw examples as {input, output};
- convert each row into messages;
- render those messages with the tokenizer/model chat template;
- use add_generation_prompt=False for training;
- use the same chat-style structure again when serving the model through your API.
The key idea is to separate how you store your dataset from the final text/token format the model sees.
A minimal shape would be:
import json

messages = [
    {
        "role": "system",
        "content": (
            "You convert receipt data into the requested spending insight JSON. "
            "Return only valid JSON matching the expected output shape."
        ),
    },
    {
        "role": "user",
        "content": json.dumps(example["input"], ensure_ascii=False),
    },
    {
        "role": "assistant",
        "content": json.dumps(example["output"], ensure_ascii=False),
    },
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=False,
)
Then use that rendered text in your dataset.
This is not a claim that your current ### Input / ### Output string can never train. It probably can train as plain text-completion data. I would just treat the Phi-3.5 chat template as the lower-risk default for an already instruction-tuned chat model.
Recommended data flow
I would structure it like this:
raw {input, output}
-> messages: system / user / assistant
-> tokenizer.apply_chat_template(...)
-> text column
-> SFTTrainer
So your current raw data can stay mostly as-is. The main change is the formatting function.
import json

def format_example(example, tokenizer):
    # Map one raw {input, output} row onto system / user / assistant turns.
    messages = [
        {
            "role": "system",
            "content": (
                "You convert receipt data into the requested spending insight JSON. "
                "Return only valid JSON matching the expected output shape."
            ),
        },
        {
            "role": "user",
            "content": json.dumps(example["input"], ensure_ascii=False),
        },
        {
            "role": "assistant",
            "content": json.dumps(example["output"], ensure_ascii=False),
        },
    ]
    # Render the full conversation as text; no generation prompt, because the
    # assistant answer is already part of the training sample.
    return {
        "text": tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=False,
        )
    }
Before training, print one rendered example.
formatted = format_example(train_data[0], tokenizer)
print(formatted["text"])
You want to confirm that the final string actually looks like a Phi-style chat conversation, not like a mixture of multiple prompt formats.
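To try the same flow over several rows before the real tokenizer is loaded, here is a minimal sketch. `StubTokenizer` is a hypothetical stand-in that only imitates the general shape of a Phi-style rendering; it is not the actual Phi-3.5 template, and the two sample rows are made up. With the real tokenizer, only the `tokenizer = ...` line should need to change.

```python
import json


class StubTokenizer:
    """Hypothetical stand-in for the real tokenizer (NOT the actual Phi-3.5 template)."""

    eos_token = "<|endoftext|>"

    def apply_chat_template(self, messages, tokenize=False, add_generation_prompt=False):
        # Render each turn as "<|role|>\ncontent<|end|>\n" (illustrative layout only).
        text = "".join(f"<|{m['role']}|>\n{m['content']}<|end|>\n" for m in messages)
        # Open an assistant turn for inference, otherwise close the sample with EOS.
        return text + ("<|assistant|>\n" if add_generation_prompt else self.eos_token)


def format_example(example, tokenizer):
    # Same structure as the formatting function above: raw row -> messages -> text.
    messages = [
        {"role": "system", "content": "Return only valid JSON matching the expected output shape."},
        {"role": "user", "content": json.dumps(example["input"], ensure_ascii=False)},
        {"role": "assistant", "content": json.dumps(example["output"], ensure_ascii=False)},
    ]
    return {
        "text": tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=False
        )
    }


# Made-up rows in the raw {input, output} shape.
rows = [
    {"input": {"merchant": "Cafe", "total": 4.5}, "output": {"category": "food"}},
    {"input": {"merchant": "Metro", "total": 2.0}, "output": {"category": "transport"}},
]

tokenizer = StubTokenizer()
train_texts = [format_example(row, tokenizer)["text"] for row in rows]
print(train_texts[0])
```

If the rows live in a Hugging Face `datasets.Dataset`, the same function can usually be applied with `dataset.map(format_example, fn_kwargs={"tokenizer": tokenizer})`.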
Minimal checks before a real training run
Before starting a full job, I would inspect the tokenizer and one formatted sample.
print("=== chat template ===")
print(tokenizer.chat_template)
print("=== eos token ===")
print(tokenizer.eos_token)
print("=== special tokens ===")
print(tokenizer.special_tokens_map)
print("=== rendered example ===")
print(text)
Things I would check:
| Check | Why |
|---|---|
| The rendered text contains a user turn | Confirms the input JSON is in the expected role. |
| The rendered text contains an assistant turn | Confirms the output JSON is treated as the target answer. |
| The assistant JSON is complete | Avoids training on truncated/malformed outputs. |
| Turn boundaries are consistent | Avoids mixing multiple stop/end conventions. |
| You are not manually inserting extra special tokens | Avoids EOS / stop-token confusion. |
| The same message structure can be used by the API | Avoids train/serve mismatch. |
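Some of these checks can be automated on the rendered string. The following is a rough sketch, not a complete validator: `check_rendered_sample` is a hypothetical helper, and the marker strings passed to it (`<|user|>`, `<|assistant|>`, `<|endoftext|>`) are assumptions that should be confirmed against the printed chat template and `tokenizer.eos_token`.

```python
import json


def check_rendered_sample(text, expected_output, user_marker, assistant_marker, eos_token):
    """Return a list of problems found in one rendered training string."""
    problems = []
    if user_marker not in text:
        problems.append("no user turn found")
    if assistant_marker not in text:
        problems.append("no assistant turn found")
    # The target JSON should appear intact, not truncated.
    if json.dumps(expected_output, ensure_ascii=False) not in text:
        problems.append("assistant JSON missing or truncated")
    # More than one EOS usually means one was appended by hand on top of the template's own.
    if eos_token and text.count(eos_token) > 1:
        problems.append("EOS token appears more than once")
    # A leading hand-written header suggests two prompt formats were mixed.
    if text.lstrip().startswith("### Input"):
        problems.append("text starts with a hand-written ### Input header")
    return problems


# Illustrative sample string; in practice pass the output of apply_chat_template.
sample = (
    '<|user|>\n{"total": 4.5}<|end|>\n'
    '<|assistant|>\n{"category": "food"}<|end|>\n'
    "<|endoftext|>"
)
markers = ("<|user|>", "<|assistant|>", "<|endoftext|>")

ok_report = check_rendered_sample(sample, {"category": "food"}, *markers)
bad_report = check_rendered_sample(sample + "<|endoftext|>", {"category": "food"}, *markers)
print(ok_report)
print(bad_report)
```

An empty list means none of these particular checks fired; it does not prove the sample matches the model's template, so reading one printed example by eye is still worthwhile.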
If you like ### Input, keep it inside the user message
If you prefer the readability of headings like ### Input, I would put those inside the user content, not use them as the outer conversation template.
For example:
messages = [
    {
        "role": "system",
        "content": "Return only the requested JSON object.",
    },
    {
        "role": "user",
        "content": (
            "### Input\n"
            + json.dumps(example["input"], ensure_ascii=False)
            + "\n\n### Task\n"
            + "Generate the spending insight JSON."
        ),
    },
    {
        "role": "assistant",
        "content": json.dumps(example["output"], ensure_ascii=False),
    },
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=False,
)
This keeps the outer format aligned with the model’s chat template, while still letting your task content have readable headings.
Practical summary
My first implementation would be:
import json

def format_example(example, tokenizer):
    messages = [
        {
            "role": "system",
            "content": (
                "You convert receipt data into the requested spending insight JSON. "
                "Return only valid JSON matching the expected output shape."
            ),
        },
        {
            "role": "user",
            "content": json.dumps(example["input"], ensure_ascii=False),
        },
        {
            "role": "assistant",
            "content": json.dumps(example["output"], ensure_ascii=False),
        },
    ]
    return {
        "text": tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=False,
        )
    }
Then inspect before training:
formatted = format_example(train_data[0], tokenizer)
print(formatted["text"])
And for API inference, use the same message structure, but without the assistant answer:
messages = [
    {
        "role": "system",
        "content": (
            "You convert receipt data into the requested spending insight JSON. "
            "Return only valid JSON matching the expected output shape."
        ),
    },
    {
        "role": "user",
        "content": json.dumps(input_payload, ensure_ascii=False),
    },
]
If using Transformers directly:
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
)
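One way to keep training and serving from drifting apart is to build both message lists with a single function. This is only a sketch: `build_messages` and `SYSTEM_PROMPT` are hypothetical names, and the payloads are made up.

```python
import json

# Shared system prompt so training and inference see identical instructions.
SYSTEM_PROMPT = (
    "You convert receipt data into the requested spending insight JSON. "
    "Return only valid JSON matching the expected output shape."
)


def build_messages(input_payload, output=None):
    """Build chat messages; include the assistant turn only when a target output is given."""
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": json.dumps(input_payload, ensure_ascii=False)},
    ]
    if output is not None:
        # Training sample: the assistant turn carries the target JSON.
        messages.append(
            {"role": "assistant", "content": json.dumps(output, ensure_ascii=False)}
        )
    return messages


# Training: pair with add_generation_prompt=False.
train_messages = build_messages({"total": 4.5}, {"category": "food"})
# Inference: pair with add_generation_prompt=True.
serve_messages = build_messages({"total": 4.5})
```

Because both paths go through the same function, a change to the system prompt or to the user-content serialization applies to training and to the API at once.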
So my practical recommendation would be:
- keep {input, output} as the raw data format;
- convert rows into messages;
- use the Phi-3.5 tokenizer chat template for the final training text;
- use add_generation_prompt=False for training;
- use the same chat structure for API inference;
- do not manually append <|endoftext|> until you have inspected what the tokenizer already emits;
- treat response-only loss and production JSON validation as separate follow-up layers.
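For the JSON validation layer, a very small starting point could look like the sketch below. `parse_model_json` and `required_keys` are hypothetical, and the `category` key is a made-up example; a real API would likely validate against a fuller schema.

```python
import json


def parse_model_json(raw_text, required_keys=()):
    """Parse a model completion as a JSON object, or return None if it is unusable."""
    try:
        parsed = json.loads(raw_text.strip())
    except json.JSONDecodeError:
        # Not valid JSON at all (for example, a truncated or chatty completion).
        return None
    if not isinstance(parsed, dict):
        # Valid JSON, but not an object (for example, a bare list or string).
        return None
    if any(key not in parsed for key in required_keys):
        return None
    return parsed


good = parse_model_json(' {"category": "food"} ', required_keys=("category",))
truncated = parse_model_json('Sure! {"category"', required_keys=("category",))
missing_key = parse_model_json('{"label": "food"}', required_keys=("category",))
```

Returning `None` instead of raising lets the API layer decide whether to retry, fall back, or return an error to the caller.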