Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreifryrqa3bskowcvrgyxijnoukt74ai7es53taydfzngy42wwxfjhu",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mp3spxzxhoc2"
  },
  "path": "/t/what-prompt-format-template-should-i-use-for-training-unsloth-phi-3-5-mini-instruct/177144#post_2",
  "publishedAt": "2026-06-25T05:13:39.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [],
  "textContent": "Hmm… personally, when the model has one available (very old models sometimes do not) I think using the Chat Template is often the safer choice:\n\n* * *\n\n## Short answer\n\nI would keep your raw `{input, output}` data structure, but I would not use the hand-written\n\n\n    ### Input: ...\n    ### Output: ...\n    <|endoftext|>\n\n\nstring as my default training format for `Phi-3.5-mini-instruct`.\n\nFor this model, I would start with this route instead:\n\n  1. keep your raw examples as `{input, output}`;\n  2. convert each row into `messages`;\n  3. render those `messages` with the tokenizer/model chat template;\n  4. use `add_generation_prompt=False` for training;\n  5. use the same chat-style structure again when serving the model through your API.\n\n\n\nThe key idea is to separate **how you store your dataset** from **the final text/token format the model sees**.\n\nA minimal shape would be:\n\n\n    import json\n\n    messages = [\n        {\n            \"role\": \"system\",\n            \"content\": (\n                \"You convert receipt data into the requested spending insight JSON. \"\n                \"Return only valid JSON matching the expected output shape.\"\n            ),\n        },\n        {\n            \"role\": \"user\",\n            \"content\": json.dumps(example[\"input\"], ensure_ascii=False),\n        },\n        {\n            \"role\": \"assistant\",\n            \"content\": json.dumps(example[\"output\"], ensure_ascii=False),\n        },\n    ]\n\n    text = tokenizer.apply_chat_template(\n        messages,\n        tokenize=False,\n        add_generation_prompt=False,\n    )\n\n\nThen use that rendered `text` in your dataset.\n\nThis is not a claim that your current `### Input / ### Output` string can never train. It probably can train as plain text-completion data. 
I would just treat the Phi-3.5 chat template as the lower-risk default for an already instruction-tuned chat model.\n\n## Recommended data flow\n\nI would structure it like this:\n\n\n    raw {input, output}\n        -> messages: system / user / assistant\n        -> tokenizer.apply_chat_template(...)\n        -> text column\n        -> SFTTrainer\n\n\nSo your current raw data can stay mostly as-is. The main change is the formatting function.\n\n\n    def format_example(example, tokenizer):\n        messages = [\n            {\n                \"role\": \"system\",\n                \"content\": (\n                    \"You convert receipt data into the requested spending insight JSON. \"\n                    \"Return only valid JSON matching the expected output shape.\"\n                ),\n            },\n            {\n                \"role\": \"user\",\n                \"content\": json.dumps(example[\"input\"], ensure_ascii=False),\n            },\n            {\n                \"role\": \"assistant\",\n                \"content\": json.dumps(example[\"output\"], ensure_ascii=False),\n            },\n        ]\n\n        return {\n            \"text\": tokenizer.apply_chat_template(\n                messages,\n                tokenize=False,\n                add_generation_prompt=False,\n            )\n        }\n\n\nBefore training, print one rendered example.\n\n\n    formatted = format_example(train_data[0], tokenizer)\n    print(formatted[\"text\"])\n\n\nYou want to confirm that the final string actually looks like a Phi-style chat conversation, not like a mixture of multiple prompt formats.\n\n## Minimal checks before a real training run\n\nBefore starting a full job, I would inspect the tokenizer and one formatted sample.\n\n\n    print(\"=== chat template ===\")\n    
print(tokenizer.chat_template)\n\n    print(\"=== eos token ===\")\n    print(tokenizer.eos_token)\n\n    print(\"=== special tokens ===\")\n    print(tokenizer.special_tokens_map)\n\n    print(\"=== rendered example ===\")\n    print(text)\n\n\nThings I would check:\n\nCheck | Why\n---|---\nThe rendered text contains a user turn | Confirms the input JSON is in the expected role.\nThe rendered text contains an assistant turn | Confirms the output JSON is treated as the target answer.\nThe assistant JSON is complete | Avoids training on truncated/malformed outputs.\nTurn boundaries are consistent | Avoids mixing multiple stop/end conventions.\nYou are not manually inserting extra special tokens | Avoids EOS / stop-token confusion.\nThe same message structure can be used by the API | Avoids train/serve mismatch.\n\n## If you like `### Input`, keep it inside the user message\n\nIf you prefer the readability of headings like `### Input`, I would put those inside the `user` content, not use them as the outer conversation template.\n\nFor example:\n\n\n    messages = [\n        {\n            \"role\": \"system\",\n            \"content\": \"Return only the requested JSON object.\",\n        },\n        {\n            \"role\": \"user\",\n            \"content\": (\n                \"### Input\\n\"\n                + json.dumps(example[\"input\"], ensure_ascii=False)\n                + \"\\n\\n### Task\\n\"\n                + \"Generate the spending insight JSON.\"\n            ),\n        },\n        {\n            \"role\": \"assistant\",\n            \"content\": json.dumps(example[\"output\"], ensure_ascii=False),\n        },\n    ]\n\n    text = tokenizer.apply_chat_template(\n        messages,\n        tokenize=False,\n        add_generation_prompt=False,\n    )\n\n\nThis keeps the outer format aligned with the model’s chat template, while still letting your task content have 
readable headings.\n\n## Practical summary\n\nMy first implementation would be:\n\n\n    import json\n\n    def format_example(example, tokenizer):\n        messages = [\n            {\n                \"role\": \"system\",\n                \"content\": (\n                    \"You convert receipt data into the requested spending insight JSON. \"\n                    \"Return only valid JSON matching the expected output shape.\"\n                ),\n            },\n            {\n                \"role\": \"user\",\n                \"content\": json.dumps(example[\"input\"], ensure_ascii=False),\n            },\n            {\n                \"role\": \"assistant\",\n                \"content\": json.dumps(example[\"output\"], ensure_ascii=False),\n            },\n        ]\n\n        return {\n            \"text\": tokenizer.apply_chat_template(\n                messages,\n                tokenize=False,\n                add_generation_prompt=False,\n            )\n        }\n\n\nThen inspect before training:\n\n\n    formatted = format_example(train_data[0], tokenizer)\n    print(formatted[\"text\"])\n\n\nAnd for API inference, use the same message structure, but without the assistant answer:\n\n\n    messages = [\n        {\n            \"role\": \"system\",\n            \"content\": (\n                \"You convert receipt data into the requested spending insight JSON. 
\"\n                \"Return only valid JSON matching the expected output shape.\"\n            ),\n        },\n        {\n            \"role\": \"user\",\n            \"content\": json.dumps(input_payload, ensure_ascii=False),\n        },\n    ]\n\n\nIf using Transformers directly:\n\n\n    inputs = tokenizer.apply_chat_template(\n        messages,\n        tokenize=True,\n        add_generation_prompt=True,\n        return_tensors=\"pt\",\n    )\n\n\nSo my practical recommendation would be:\n\n  * keep `{input, output}` as the raw data format;\n  * convert rows into `messages`;\n  * use the Phi-3.5 tokenizer chat template for the final training text;\n  * use `add_generation_prompt=False` for training;\n  * use the same chat structure for API inference;\n  * do not manually append `<|endoftext|>` until you have inspected what the tokenizer already emits;\n  * treat response-only loss and production JSON validation as separate follow-up layers.",
  "title": "What prompt format/template should I use for training Unsloth/Phi-3.5-mini-instruct?"
}
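
The train/serve consistency point made in the record's `textContent` can be checked without loading a model. The sketch below is only an illustration: `render_chat` and `build_messages` are hypothetical helpers, and the role markers and end tokens are an assumption about what the Phi-3.5 chat template emits, so print `tokenizer.chat_template` and compare before relying on it. It shows that the serving prompt (`add_generation_prompt=True`, no assistant turn) should be an exact prefix of the training text (`add_generation_prompt=False`, with the assistant turn).

```python
import json

# Assumption: these markers approximate the Phi-3.5 chat template.
# Confirm against tokenizer.chat_template before trusting them.
END, EOS = "<|end|>", "<|endoftext|>"


def render_chat(messages, add_generation_prompt):
    """Hypothetical stand-in for tokenizer.apply_chat_template(..., tokenize=False)."""
    text = "".join(f"<|{m['role']}|>\n{m['content']}{END}\n" for m in messages)
    # Training text ends with EOS; a serving prompt ends with an open assistant turn.
    return text + ("<|assistant|>\n" if add_generation_prompt else EOS)


def build_messages(payload, answer=None):
    """Build system/user messages, plus the assistant target when training."""
    messages = [
        {"role": "system", "content": "Return only the requested JSON object."},
        {"role": "user", "content": json.dumps(payload, ensure_ascii=False)},
    ]
    if answer is not None:
        messages.append(
            {"role": "assistant", "content": json.dumps(answer, ensure_ascii=False)}
        )
    return messages


# Made-up example row in the {input, output} shape discussed in the post.
example = {"input": {"store": "Cafe", "total": 4.5}, "output": {"category": "food"}}

train_text = render_chat(
    build_messages(example["input"], example["output"]), add_generation_prompt=False
)
serve_text = render_chat(build_messages(example["input"]), add_generation_prompt=True)

# If this fails, the training format and the API format have drifted apart.
assert train_text.startswith(serve_text)
print(train_text)
```

With a real tokenizer, the same prefix check can be run by swapping `render_chat` for `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=...)`.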