{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreifryrqa3bskowcvrgyxijnoukt74ai7es53taydfzngy42wwxfjhu",
"uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mp3spxzxhoc2"
},
"path": "/t/what-prompt-format-template-should-i-use-for-training-unsloth-phi-3-5-mini-instruct/177144#post_2",
"publishedAt": "2026-06-25T05:13:39.000Z",
"site": "https://discuss.huggingface.co",
"tags": [
"(click for more details)"
],
"textContent": "Hmm… personally, when the model has one available (very old models sometimes do not) I think using the Chat Template is often the safer choice:\n\n* * *\n\n## Short answer\n\nI would keep your raw `{input, output}` data structure, but I would not use the hand-written\n\n\n ### Input: ...\n ### Output: ...\n <|endoftext|>\n\n\nstring as my default training format for `Phi-3.5-mini-instruct`.\n\nFor this model, I would start with this route instead:\n\n 1. keep your raw examples as `{input, output}`;\n 2. convert each row into `messages`;\n 3. render those `messages` with the tokenizer/model chat template;\n 4. use `add_generation_prompt=False` for training;\n 5. use the same chat-style structure again when serving the model through your API.\n\n\n\nThe key idea is to separate **how you store your dataset** from **the final text/token format the model sees**.\n\nA minimal shape would be:\n\n\n import json\n\n messages = [\n {\n \"role\": \"system\",\n \"content\": (\n \"You convert receipt data into the requested spending insight JSON. \"\n \"Return only valid JSON matching the expected output shape.\"\n ),\n },\n {\n \"role\": \"user\",\n \"content\": json.dumps(example[\"input\"], ensure_ascii=False),\n },\n {\n \"role\": \"assistant\",\n \"content\": json.dumps(example[\"output\"], ensure_ascii=False),\n },\n ]\n\n text = tokenizer.apply_chat_template(\n messages,\n tokenize=False,\n add_generation_prompt=False,\n )\n\n\nThen use that rendered `text` in your dataset.\n\nThis is not a claim that your current `### Input / ### Output` string can never train. It probably can train as plain text-completion data. 
I would just treat the Phi-3.5 chat template as the lower-risk default for an already instruction-tuned chat model.\n\n## Recommended data flow\n\nI would structure it like this:\n\n    raw {input, output}\n      -> messages: system / user / assistant\n      -> tokenizer.apply_chat_template(...)\n      -> text column\n      -> SFTTrainer\n\nSo your current raw data can stay mostly as-is. The main change is the formatting function.\n\n    import json\n\n    def format_example(example, tokenizer):\n        # Wrap one raw {input, output} row as a system / user / assistant conversation.\n        messages = [\n            {\n                \"role\": \"system\",\n                \"content\": (\n                    \"You convert receipt data into the requested spending insight JSON. \"\n                    \"Return only valid JSON matching the expected output shape.\"\n                ),\n            },\n            {\n                \"role\": \"user\",\n                \"content\": json.dumps(example[\"input\"], ensure_ascii=False),\n            },\n            {\n                \"role\": \"assistant\",\n                \"content\": json.dumps(example[\"output\"], ensure_ascii=False),\n            },\n        ]\n\n        return {\n            \"text\": tokenizer.apply_chat_template(\n                messages,\n                tokenize=False,\n                add_generation_prompt=False,\n            )\n        }\n\nBefore training, print one rendered example.\n\n    formatted = format_example(train_data[0], tokenizer)\n    print(formatted[\"text\"])\n\nYou want to confirm that the final string actually looks like a Phi-style chat conversation, not like a mixture of multiple prompt formats.\n\n## Minimal checks before a real training run\n\nBefore starting a full job, I would inspect the tokenizer and one formatted sample.\n\n    print(\"=== chat template ===\")\n    print(tokenizer.chat_template)\n\n    print(\"=== eos token ===\")\n    print(tokenizer.eos_token)\n\n    print(\"=== special tokens ===\")\n    print(tokenizer.special_tokens_map)\n\n    print(\"=== rendered example ===\")\n    print(formatted[\"text\"])\n\nThings I would check:\n\nCheck | Why\n---|---\nThe rendered text contains a user turn | Confirms the input JSON is in the expected 
role.\nThe rendered text contains an assistant turn | Confirms the output JSON is treated as the target answer.\nThe assistant JSON is complete | Avoids training on truncated/malformed outputs.\nTurn boundaries are consistent | Avoids mixing multiple stop/end conventions.\nYou are not manually inserting extra special tokens | Avoids EOS / stop-token confusion.\nThe same message structure can be used by the API | Avoids train/serve mismatch.\n\n## If you like `### Input`, keep it inside the user message\n\nIf you prefer the readability of headings like `### Input`, I would put those inside the `user` content, not use them as the outer conversation template.\n\nFor example:\n\n    messages = [\n        {\n            \"role\": \"system\",\n            \"content\": \"Return only the requested JSON object.\",\n        },\n        {\n            \"role\": \"user\",\n            \"content\": (\n                \"### Input\\n\"\n                + json.dumps(example[\"input\"], ensure_ascii=False)\n                + \"\\n\\n### Task\\n\"\n                + \"Generate the spending insight JSON.\"\n            ),\n        },\n        {\n            \"role\": \"assistant\",\n            \"content\": json.dumps(example[\"output\"], ensure_ascii=False),\n        },\n    ]\n\n    text = tokenizer.apply_chat_template(\n        messages,\n        tokenize=False,\n        add_generation_prompt=False,\n    )\n\nThis keeps the outer format aligned with the model’s chat template, while still letting your task content have readable headings.\n\n## Practical summary\n\nMy first implementation would be:\n\n    import json\n\n    def format_example(example, tokenizer):\n        messages = [\n            {\n                \"role\": \"system\",\n                \"content\": (\n                    \"You convert receipt data into the requested spending insight JSON. \"\n                    \"Return only valid JSON matching the expected output shape.\"\n                ),\n            },\n            {\n                \"role\": \"user\",\n                \"content\": json.dumps(example[\"input\"], ensure_ascii=False),\n            },\n            {\n                \"role\": \"assistant\",\n                \"content\": json.dumps(example[\"output\"], ensure_ascii=False),\n            },\n        ]\n\n        return {\n            \"text\": tokenizer.apply_chat_template(\n                messages,\n                tokenize=False,\n                add_generation_prompt=False,\n            )\n        }\n\nThen inspect before training:\n\n    formatted = format_example(train_data[0], tokenizer)\n    print(formatted[\"text\"])\n\nAnd for API inference, use the same message structure, but without the assistant answer:\n\n    messages = [\n        {\n            \"role\": \"system\",\n            \"content\": (\n                \"You convert receipt data into the requested spending insight JSON. \"\n                \"Return only valid JSON matching the expected output shape.\"\n            ),\n        },\n        {\n            \"role\": \"user\",\n            \"content\": json.dumps(input_payload, ensure_ascii=False),\n        },\n    ]\n\nIf using Transformers directly:\n\n    # add_generation_prompt=True ends the prompt with the assistant turn header,\n    # so generation starts at the model's answer.\n    inputs = tokenizer.apply_chat_template(\n        messages,\n        tokenize=True,\n        add_generation_prompt=True,\n        return_tensors=\"pt\",\n    )\n\nSo my practical recommendation would be:\n\n * keep `{input, output}` as the raw data format;\n * convert rows into `messages`;\n * use the Phi-3.5 tokenizer chat template for the final training text;\n * use `add_generation_prompt=False` for training;\n * use the same chat structure for API inference;\n * do not manually append `<|endoftext|>` until you have inspected what the tokenizer already emits;\n * treat response-only loss and production JSON validation as separate follow-up layers.",
"title": "What prompt format/template should I use for training Unsloth/Phi-3.5-mini-instruct?"
}