{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreifmgskqh7pvnts64at3oljnalo7xeaqrxyqochnd5aduyqytdtobm",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mmh2hxfvwma2"
  },
  "path": "/t/date-format-for-tine-tuning-ai-models/176116#post_6",
  "publishedAt": "2026-05-22T13:35:48.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "Hugging Face TRL SFTTrainer docs",
    "Hugging Face TRL chat templates docs",
    "Hugging Face Transformers chat templates docs",
    "Hugging Face RAG docs",
    "Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations?",
    "Fine-Tuning or Retrieval? Comparing Knowledge Injection in LLMs",
    "Fine Tuning vs. Retrieval Augmented Generation for Less Popular Knowledge",
    "vLLM structured outputs",
    "Outlines structured generation",
    "Ragas faithfulness metric",
    "Ragas context precision metric",
    "TRL SFTTrainer docs",
    "TRL chat templates docs",
    "TRL issue on generation markers for assistant-only loss",
    "Transformers chat templates docs",
    "chat template documentation",
    "TRL issue #5471: generation markers for common model families",
    "Hugging Face Advanced RAG cookbook",
    "Code a simple RAG from scratch",
    "LM Format Enforcer",
    "llguidance",
    "Ragas context precision",
    "Ragas faithfulness",
    "Hugging Face RAG evaluation cookbook"
  ],
  "textContent": "Are you perhaps trying to have the LLM handle databases and strict data management on its own? This is generally a task that LLMs aren’t well-suited for. While it may be unavoidable if there are specific constraints, it’s generally more reliable to divide these tasks among different systems:\n\n* * *\n\n## Date format for fine-tuning Qwen/Gemma/Llama models: likely not just a date-format issue\n\nI do **not** think there is a hidden “correct date format” that must be discovered separately for Qwen, Gemma, or Llama.\n\nA consistent date format is still important. But if you already tried several formats, standardized the data, and still see the fine-tuned model inventing wrong or invalid dates, then the problem is probably **outside date formatting**.\n\nThe likely issue is one or more of these:\n\n  1. You are asking SFT to teach **closed-book factual recall**:\n`<person> + <company> -> exact start/end dates`.\n\n  2. The model may learn the **answer shape** but not the **exact factual mapping**:\nit learns “answer with a date range,” but not the correct date range.\n\n  3. The correct date may appear in `input_ids`, but not in the supervised `labels`.\n\n  4. The chat template / assistant mask may be wrong for the model family.\n\n  5. The output is unconstrained, so invalid dates such as `2021-13` are still possible.\n\n  6. 
The actual dates may need to live in a database, search index, RAG system, or structured record store rather than only in the model weights.\n\n\n\n\nSo my practical recommendation is:\n\n> Use one clean date format, but do not rely on fine-tuning alone to memorize exact dates.\n>  Use structured lookup or RAG for the facts.\n>  Use SFT to teach the model how to use provided records, normalize dates, return JSON, and abstain when no record exists.\n>  Use validation or constrained decoding to prevent invalid dates.\n\nUseful background links:\n\n  * Hugging Face TRL SFTTrainer docs\n  * Hugging Face TRL chat templates docs\n  * Hugging Face Transformers chat templates docs\n  * Hugging Face RAG docs\n  * Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations?\n  * Fine-Tuning or Retrieval? Comparing Knowledge Injection in LLMs\n  * Fine Tuning vs. Retrieval Augmented Generation for Less Popular Knowledge\n  * vLLM structured outputs\n  * Outlines structured generation\n  * Ragas faithfulness metric\n  * Ragas context precision metric\n\n\n\n* * *\n\n## 1. The date format I would use\n\nIf the source data only has month and year, use month precision:\n\n\n    From 2021-01 to 2021-12\n\n\nor, better, structured JSON:\n\n\n    {\n      \"start_date\": \"2021-01\",\n      \"start_precision\": \"month\",\n      \"end_date\": \"2021-12\",\n      \"end_precision\": \"month\"\n    }\n\n\nIf the source data truly has exact days, use exact ISO dates:\n\n\n    From 2021-01-01 to 2021-12-01\n\n\nor:\n\n\n    {\n      \"start_date\": \"2021-01-01\",\n      \"start_precision\": \"day\",\n      \"end_date\": \"2021-12-01\",\n      \"end_precision\": \"day\"\n    }\n\n\nDo **not** silently convert this:\n\n\n    January 2021 to December 2021\n\n\ninto this:\n\n\n    2021-01-01 to 2021-12-01\n\n\nunless those exact days are actually known.\n\nThat would add fake precision. 
The model may learn that unknown days should be invented as `01`, which is usually not what you want.\n\nFor employment histories, resumes, HR-style records, contracts, and timelines, I would usually use:\n\n\n    YYYY-MM\n\n\nfor month-level records, and keep a separate precision field.\n\n* * *\n\n## 2. Why the format changes did not fix it\n\nThe tested formats mix several patterns:\n\n\n    From January 2021 till December 2021.\n    From 01, 2021 till 12, 2021.\n    From 01-01-2021 till 12-01-2021.\n    From January 2021 01-01-2021 till December 2021 12-01-2021.\n\n\nThat inconsistency is worth fixing. It gives the model multiple competing surface patterns.\n\nHowever, once you standardized to something like:\n\n\n    From 2021-01 to 2021-12\n\n\nand the model still invented wrong or invalid dates, that strongly suggests that date format was not the main bottleneck.\n\nThere are two separate tasks here:\n\n\n    Task A:\n    Normalize \"January 2021\" to \"2021-01\".\n\n    Task B:\n    Know that <person> worked for <company> from 2021-01 to 2021-12.\n\n\nTask A is date normalization. Fine-tuning can handle that.\n\nTask B is factual recall. Fine-tuning is much less reliable for that.\n\nThe model may learn:\n\n\n    When asked about employment dates, answer with:\n    \"From YYYY-MM to YYYY-MM.\"\n\n\nbut fail to learn:\n\n\n    <person> + <company> = 2021-01 to 2021-12\n\n\nThat distinction is the core of the problem.\n\n* * *\n\n## 3. This is probably a closed-book factual-recall problem\n\nYour examples have this structure:\n\n\n    User:\n    When did XYZ work for ABC?\n\n    Assistant:\n    From 2021-01 to 2021-12.\n\n\nAt inference time, the model sees only:\n\n\n    When did XYZ work for ABC?\n\n\nIt does **not** see the employment record. 
Therefore, it must recover the fact from model weights:\n\n\n    XYZ + ABC -> 2021-01 to 2021-12\n\n\nThat is **closed-book factual recall**.\n\nClosed-book means the answer is not provided in the prompt, not retrieved from an external source, and not looked up from a database visible to the model. The model must answer from memory.\n\nThis is fragile for exact dates because dates are high-entropy factual values. A model can often learn that an answer should look like a date range, but the exact dates are not predictable from the words `XYZ` and `ABC`.\n\nThis matches existing research:\n\n  * Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations? reports that LLMs struggle to acquire new factual knowledge through fine-tuning and that fine-tuning new knowledge can increase hallucination tendency.\n  * Fine-Tuning or Retrieval? Comparing Knowledge Injection in LLMs finds that RAG consistently outperforms unsupervised fine-tuning on knowledge-intensive tasks, including entirely new knowledge.\n  * Fine Tuning vs. Retrieval Augmented Generation for Less Popular Knowledge is especially relevant if your facts are private, rare, internal, or low-frequency.\n\n\n\nIn other words, you may not be debugging a date-string issue. You may be seeing the limits of using SFT as a factual database.\n\n* * *\n\n## 4. `input_ids` matching the dataset is not enough\n\nChecking `input_ids` is useful. It tells you that tokenization did not completely destroy the date string.\n\nBut it does **not** prove the model is learning those dates.\n\nIn causal LM SFT, the answer text often appears in `input_ids`. That is normal. The real question is whether the answer tokens also appear in `labels` where loss is applied.\n\nMany SFT pipelines mask tokens by setting labels to `-100`. Tokens with label `-100` are ignored by the loss.\n\nSo this can happen:\n\n\n    input_ids:\n    ... From 2021-01 to 2021-12 ...\n\n    labels:\n    ... 
-100 -100 -100 -100 ...\n\n\nIn that case, the date is visible in the batch, but the model is not trained to generate it.\n\nThis is particularly important when using:\n\n  * `assistant_only_loss`\n  * `completion_only_loss`\n  * `DataCollatorForCompletionOnlyLM`\n  * packing\n  * custom chat templates\n  * Qwen/Gemma/Llama-specific templates\n\n\n\nSee:\n\n  * TRL SFTTrainer docs\n  * TRL chat templates docs\n  * TRL issue on generation markers for assistant-only loss\n  * Transformers chat templates docs\n\n\n\nRun this check before training:\n\n\n    batch = next(iter(trainer.get_train_dataloader()))\n\n    input_ids = batch[\"input_ids\"][0]\n    labels = batch[\"labels\"][0]\n\n    print(\"FULL INPUT\")\n    print(tokenizer.decode(input_ids, skip_special_tokens=False))\n\n    print(\"\\nSUPERVISED TOKENS ONLY\")\n    supervised_ids = labels[labels != -100]\n    print(tokenizer.decode(supervised_ids, skip_special_tokens=False))\n\n\nYou want to see the assistant answer in the supervised region:\n\n\n    From 2021-01 to 2021-12.\n\n\nor, if using JSON:\n\n\n    {\"start_date\":\"2021-01\",\"end_date\":\"2021-12\"}\n\n\nIf the date is not present in `labels[labels != -100]`, the model is not being trained to generate that date.\n\n* * *\n\n## 5. Chat templates are another likely cause\n\nQwen, Gemma, and Llama-style instruction models do not all expect the same prompt format.\n\nA raw text format like this:\n\n\n    user: When did XYZ work for ABC.\n    Assistant: From 2021-01 to 2021-12.\n\n\nmay not match the actual chat template expected by the model.\n\nHugging Face’s chat template documentation explains that chat models use model-specific control tokens. 
The same user/assistant conversation can be rendered differently for different model families.\n\nSo training examples should usually be represented structurally:\n\n\n    messages = [\n        {\"role\": \"user\", \"content\": \"When did XYZ work for ABC?\"},\n        {\"role\": \"assistant\", \"content\": \"From 2021-01 to 2021-12.\"},\n    ]\n\n\nThen render them using the model tokenizer:\n\n\n    text = tokenizer.apply_chat_template(\n        messages,\n        tokenize=False,\n        add_generation_prompt=False,\n    )\n\n    print(text)\n\n\nAt inference time:\n\n\n    messages = [\n        {\"role\": \"user\", \"content\": \"When did XYZ work for ABC?\"},\n    ]\n\n    prompt = tokenizer.apply_chat_template(\n        messages,\n        tokenize=False,\n        add_generation_prompt=True,\n    )\n\n    print(prompt)\n\n\nTraining and inference must be structurally compatible.\n\nA common failure is:\n\n\n    Training:\n    manual \"user:\" / \"Assistant:\" strings\n\n    Inference:\n    tokenizer.apply_chat_template(...)\n\n\nor:\n\n\n    Training:\n    one model family's chat template\n\n    Inference:\n    another model family's chat template\n\n\nThis can produce poor generations even when the data itself is correct.\n\n* * *\n\n## 6. Assistant-only loss and generation markers can matter\n\nIf you use `assistant_only_loss=True`, the trainer needs to know which tokens belong to the assistant response.\n\nThat often depends on the chat template producing an assistant-token mask. 
TRL documents that SFT with assistant-only loss requires `{% generation %}` and `{% endgeneration %}` markers around assistant output so that the loss mask can target only assistant tokens.\n\nUseful links:\n\n  * TRL SFTTrainer docs\n  * TRL chat templates docs\n  * TRL issue #5471: generation markers for common model families\n\n\n\nFor Qwen, Gemma, or Llama models, verify:\n\n\n    Does the tokenizer chat template support assistant masks?\n    Does the trainer actually supervise assistant tokens?\n    Does the supervised region include the date?\n    Does this still work if packing is enabled?\n\n\nFor debugging, temporarily disable packing:\n\n\n    packing = False\n\n\nPacking is useful later, but it makes it harder to inspect example boundaries, EOS behavior, truncation, and loss masks.\n\n* * *\n\n## 7. Fine-tuning alone can work only in narrow conditions\n\nFine-tuning alone may work if all of these are true:\n\n  * The number of facts is small.\n  * The facts are static.\n  * Each fact appears many times.\n  * You use many paraphrases per fact.\n  * The questions are highly repetitive.\n  * The use case can tolerate occasional wrong answers.\n  * The deployment is low-risk.\n  * You evaluate exact-match behavior carefully.\n\n\n\nExample of a narrow case where no-RAG SFT might be acceptable:\n\n\n    20 fictional people\n    20 fictional companies\n    many paraphrases per fact\n    internal demo\n    low consequence if an answer is occasionally wrong\n\n\nBut if you have:\n\n  * hundreds or thousands of people,\n  * many companies,\n  * similar names,\n  * changing records,\n  * private facts,\n  * audit requirements,\n  * exact-date requirements,\n\n\n\nthen fine-tuning alone is the wrong default.\n\nFor exact employment dates, the model should not be the only database.\n\n* * *\n\n## 8. 
Better approach: structured lookup or RAG\n\nFor employment dates, store the facts outside the model.\n\nA record should look something like this:\n\n\n    {\n      \"record_id\": \"employment_000123\",\n      \"person_id\": \"person_xyz\",\n      \"person_name\": \"XYZ\",\n      \"employer_id\": \"org_abc\",\n      \"employer_name\": \"ABC\",\n      \"start_date\": \"2021-01\",\n      \"start_precision\": \"month\",\n      \"end_date\": \"2021-12\",\n      \"end_precision\": \"month\",\n      \"is_current\": false\n    }\n\n\nThen, at inference time:\n\n  1. Parse the question.\n  2. Resolve the person and employer.\n  3. Retrieve the matching employment record.\n  4. Give the record to the model.\n  5. Ask the model to answer using only that record.\n  6. Validate the output.\n\n\n\nThis changes the task from:\n\n\n    The model must remember the date.\n\n\nto:\n\n\n    The model must read the date from the retrieved record and format it correctly.\n\n\nThat is a much better use of an LLM.\n\nSee:\n\n  * Hugging Face RAG docs\n  * Hugging Face Advanced RAG cookbook\n  * Code a simple RAG from scratch\n\n\n\nIf your data is already structured, start with **structured lookup** , not pure vector search.\n\nFor example:\n\n\n    SELECT start_date, end_date\n    FROM employment_records\n    WHERE person_id = 'person_xyz'\n    AND employer_id = 'org_abc';\n\n\nPure vector RAG is useful for unstructured text. But if you already have person IDs, employer IDs, start dates, and end dates, structured lookup is more reliable.\n\n* * *\n\n## 9. What SFT should do in the better design\n\nSFT is still useful. 
It is just useful for a different job.\n\nDo **not** primarily fine-tune the model to memorize this:\n\n\n    XYZ worked for ABC from 2021-01 to 2021-12.\n\n\nInstead, fine-tune it to do this:\n\n\n    Given retrieved records + question -> return the correct structured answer using only the records.\n\n\nA better SFT example:\n\n\n    {\n      \"messages\": [\n        {\n          \"role\": \"system\",\n          \"content\": \"Use only the provided employment records. Return JSON only. Do not invent dates.\"\n        },\n        {\n          \"role\": \"user\",\n          \"content\": \"Records:\\n[{\\\"person\\\":\\\"XYZ\\\",\\\"employer\\\":\\\"ABC\\\",\\\"start\\\":\\\"January 2021\\\",\\\"end\\\":\\\"December 2021\\\"}]\\n\\nQuestion: When did XYZ work for ABC?\"\n        },\n        {\n          \"role\": \"assistant\",\n          \"content\": \"{\\\"record_found\\\":true,\\\"start_date\\\":\\\"2021-01\\\",\\\"start_precision\\\":\\\"month\\\",\\\"end_date\\\":\\\"2021-12\\\",\\\"end_precision\\\":\\\"month\\\"}\"\n        }\n      ]\n    }\n\n\nThis teaches:\n\n  * use context,\n  * normalize dates,\n  * preserve precision,\n  * return schema-compatible JSON,\n  * do not invent dates,\n  * abstain when no record exists.\n\n\n\nThat is a much better SFT target than closed-book memorization.\n\n* * *\n\n## 10. Add no-answer and distractor examples\n\nA common mistake is training only examples where every question has an answer.\n\nThat teaches the model:\n\n\n    Always produce a date.\n\n\nThen when it does not know, it still invents one.\n\nYou need examples where the correct output is null.\n\nExample:\n\n\n    {\n      \"messages\": [\n        {\n          \"role\": \"system\",\n          \"content\": \"Use only the provided employment records. 
Return JSON only.\"\n        },\n        {\n          \"role\": \"user\",\n          \"content\": \"Records:\\n[]\\n\\nQuestion: When did XYZ work for ABC?\"\n        },\n        {\n          \"role\": \"assistant\",\n          \"content\": \"{\\\"record_found\\\":false,\\\"start_date\\\":null,\\\"end_date\\\":null,\\\"reason\\\":\\\"No matching employment record was provided.\\\"}\"\n        }\n      ]\n    }\n\n\nYou also need distractor examples:\n\n\n    {\n      \"records\": [\n        {\n          \"person\": \"XYZ\",\n          \"employer\": \"ABD\",\n          \"start_date\": \"2021-01\",\n          \"end_date\": \"2021-12\"\n        },\n        {\n          \"person\": \"XYX\",\n          \"employer\": \"ABC\",\n          \"start_date\": \"2020-01\",\n          \"end_date\": \"2020-12\"\n        }\n      ],\n      \"question\": \"When did XYZ work for ABC?\",\n      \"answer\": {\n        \"record_found\": false,\n        \"start_date\": null,\n        \"end_date\": null\n      }\n    }\n\n\nThis teaches the model not to combine the right person from one record with the right company from another.\n\nThat matters a lot for employment data, where names and organizations can be similar.\n\n* * *\n\n## 11. 
Invalid dates need validation or constrained decoding\n\nEven after SFT, the model can still produce:\n\n\n    2021-13\n    2021-00\n    2021-02-31\n    2021/01\n    January 2021 2021-01\n\n\nFor machine-readable outputs, prompting is not enough.\n\nUse one or more of:\n\n  * JSON Schema,\n  * regex-constrained decoding,\n  * grammar-constrained decoding,\n  * post-generation validation,\n  * retry on invalid output.\n\n\n\nUseful links:\n\n  * vLLM structured outputs\n  * Outlines structured generation\n  * LM Format Enforcer\n  * llguidance\n\n\n\nFor month-level dates, validate with:\n\n\n    import re\n\n    MONTH_RE = re.compile(r\"^\\d{4}-(0[1-9]|1[0-2])$\")\n\n    def valid_month(value: str) -> bool:\n        return bool(MONTH_RE.fullmatch(value))\n\n\nFor exact ISO dates:\n\n\n    from datetime import date\n\n    def valid_iso_date(value: str) -> bool:\n        try:\n            date.fromisoformat(value)\n            return True\n        except ValueError:\n            return False\n\n\nAlso validate date ordering:\n\n\n    def month_key(value: str) -> tuple[int, int]:\n        year, month = value.split(\"-\")\n        return int(year), int(month)\n\n    assert month_key(start_date) <= month_key(end_date)\n\n\nImportant limitation:\n\n\n    Validation can reject invalid dates.\n    It cannot prove that a valid date is factually correct.\n\n\nFor factual correctness, you need the retrieved record or database.\n\n* * *\n\n## 12. 
Evaluation should be exact, not semantic\n\nFor this task, do not evaluate only by reading a few answers manually.\n\nUse exact metrics:\n\n  * `start_date` exact match\n  * `end_date` exact match\n  * `start_precision` exact match\n  * `end_precision` exact match\n  * `record_id` exact match\n  * invalid date rate\n  * malformed JSON rate\n  * false answer rate when no record exists\n  * wrong-person rate\n  * wrong-employer rate\n\n\n\nIf you use RAG, evaluate retrieval separately from generation.\n\nUseful RAG evaluation links:\n\n  * Ragas context precision\n  * Ragas faithfulness\n  * Hugging Face RAG evaluation cookbook\n\n\n\nFor this task:\n\n\n    Context precision:\n    Did retrieval find the correct employment record?\n\n    Faithfulness:\n    Did the model answer using only the retrieved record?\n\n    Exact match:\n    Did the final start_date and end_date match the expected dates?\n\n\nAll three matter.\n\n* * *\n\n## 13. Debug checklist for the current SFT setup\n\n### Check 1: print the rendered chat template\n\n\n    messages = [\n        {\"role\": \"user\", \"content\": \"When did XYZ work for ABC?\"},\n        {\"role\": \"assistant\", \"content\": \"From 2021-01 to 2021-12.\"},\n    ]\n\n    rendered = tokenizer.apply_chat_template(\n        messages,\n        tokenize=False,\n        add_generation_prompt=False,\n    )\n\n    print(rendered)\n\n\nVerify:\n\n  * Is this the model’s expected template?\n  * Does it include the assistant answer?\n  * Does it include duplicated special tokens?\n  * Does it match the inference template?\n\n\n\n* * *\n\n### Check 2: print supervised tokens\n\n\n    batch = next(iter(trainer.get_train_dataloader()))\n\n    input_ids = batch[\"input_ids\"][0]\n    labels = batch[\"labels\"][0]\n\n    print(\"FULL INPUT\")\n    print(tokenizer.decode(input_ids, skip_special_tokens=False))\n\n    print(\"\\nSUPERVISED TOKENS ONLY\")\n    print(tokenizer.decode(labels[labels != -100], 
skip_special_tokens=False))\n\n\nVerify that the supervised tokens include:\n\n\n    From 2021-01 to 2021-12.\n\n\nor the expected JSON answer.\n\n* * *\n\n### Check 3: count supervised tokens\n\n\n    num_total = labels.numel()\n    num_supervised = (labels != -100).sum().item()\n\n    print(\"total tokens:\", num_total)\n    print(\"supervised tokens:\", num_supervised)\n    print(\"supervised ratio:\", num_supervised / num_total)\n\n\nRed flags:\n\n  * `supervised tokens = 0`\n  * only special tokens are supervised\n  * prompt tokens are supervised but answer tokens are ignored\n  * answer date is missing from supervised tokens\n\n\n\n* * *\n\n### Check 4: disable packing temporarily\n\n\n    packing = False\n\n\nInspect one example at a time.\n\nPacking is useful later, but it makes debugging masks, boundaries, EOS, and truncation harder.\n\n* * *\n\n### Check 5: run a tiny overfit test\n\nCreate one synthetic example with unusual dates:\n\n\n    Person_AAA worked for Company_BBB from 2091-03 to 2092-07.\n\n\nTrain on that one example.\n\nAsk:\n\n\n    When did Person_AAA work for Company_BBB?\n\n\nExpected answer:\n\n\n    From 2091-03 to 2092-07.\n\n\nIf the model cannot memorize one example, the problem is probably not date format. 
It is likely one of:\n\n  * labels,\n  * masking,\n  * chat template,\n  * adapter loading,\n  * learning rate,\n  * LoRA target modules,\n  * EOS/truncation,\n  * inference prompt mismatch.\n\n\n\n* * *\n\n### Check 6: run a ten-example overfit test\n\nTrain on 10 synthetic examples with unusual dates.\n\nEvaluate on the same exact prompts.\n\nIf it fails, your training pipeline is probably broken.\n\nIf it succeeds on exact prompts but fails on paraphrases, the model memorized the prompt surface rather than robustly learning entity-date associations.\n\n* * *\n\n### Check 7: run a context-present test\n\nCompare these two prompts.\n\nClosed-book:\n\n\n    When did XYZ work for ABC?\n\n\nContext-present:\n\n\n    Record:\n    XYZ worked for ABC from 2021-01 to 2021-12.\n\n    Question:\n    When did XYZ work for ABC?\n\n\nIf context-present works and closed-book fails, the solution is retrieval or structured lookup, not more date-format changes.\n\n* * *\n\n## 14. Recommended target format\n\nInstead of this as the main target:\n\n\n    user: When did XYZ work for ABC.\n    assistant: From 2021-01 to 2021-12.\n\n\nuse a context-grounded format:\n\n\n    {\n      \"messages\": [\n        {\n          \"role\": \"system\",\n          \"content\": \"Answer employment-date questions using only the provided records. Return JSON only. 
Do not invent dates.\"\n        },\n        {\n          \"role\": \"user\",\n          \"content\": \"Records:\\n[{\\\"record_id\\\":\\\"employment_000123\\\",\\\"person\\\":\\\"XYZ\\\",\\\"employer\\\":\\\"ABC\\\",\\\"start_date\\\":\\\"2021-01\\\",\\\"start_precision\\\":\\\"month\\\",\\\"end_date\\\":\\\"2021-12\\\",\\\"end_precision\\\":\\\"month\\\"}]\\n\\nQuestion: When did XYZ work for ABC?\"\n        },\n        {\n          \"role\": \"assistant\",\n          \"content\": \"{\\\"record_found\\\":true,\\\"record_id\\\":\\\"employment_000123\\\",\\\"start_date\\\":\\\"2021-01\\\",\\\"start_precision\\\":\\\"month\\\",\\\"end_date\\\":\\\"2021-12\\\",\\\"end_precision\\\":\\\"month\\\"}\"\n        }\n      ]\n    }\n\n\nFor a missing record:\n\n\n    {\n      \"messages\": [\n        {\n          \"role\": \"system\",\n          \"content\": \"Answer employment-date questions using only the provided records. Return JSON only. Do not invent dates.\"\n        },\n        {\n          \"role\": \"user\",\n          \"content\": \"Records:\\n[]\\n\\nQuestion: When did XYZ work for ABC?\"\n        },\n        {\n          \"role\": \"assistant\",\n          \"content\": \"{\\\"record_found\\\":false,\\\"record_id\\\":null,\\\"start_date\\\":null,\\\"end_date\\\":null,\\\"reason\\\":\\\"No matching employment record was provided.\\\"}\"\n        }\n      ]\n    }\n\n\nThis trains the right behavior:\n\n  * use context,\n  * do not invent,\n  * return structured output,\n  * preserve precision,\n  * abstain when the record is absent.\n\n\n\nIt does not try to turn the model into the only source of truth.\n\n* * *\n\n## 15. 
Can fine-tuning alone solve it?\n\n### In theory\n\nYes, sometimes.\n\nFine-tuning alone can memorize a small number of static facts if:\n\n  * the dataset is small,\n  * facts are repeated many times,\n  * the same facts appear in many paraphrases,\n  * the evaluation questions are similar to training questions,\n  * occasional errors are acceptable.\n\n\n\n### In this case\n\nProbably not reliably.\n\nThe symptoms suggest that the model is either:\n\n  * not actually being supervised on the answer tokens,\n  * not receiving the correct chat template,\n  * learning the answer pattern but not the date mapping,\n  * or being asked to do a task that should use retrieval or structured lookup.\n\n\n\nEven if the SFT pipeline is fixed, fine-tuning alone is still not the best design if exact dates matter.\n\nFor employment dates, I would not trust model weights as the only source of truth.\n\n* * *\n\n## 16. Better architecture\n\nI would use this design:\n\n\n    User question\n      ↓\n    Entity extraction / normalization\n      ↓\n    Person and employer resolution\n      ↓\n    Structured lookup or RAG\n      ↓\n    Retrieved employment record(s)\n      ↓\n    LLM answers using only retrieved records\n      ↓\n    Strict JSON output\n      ↓\n    Date/schema validator\n      ↓\n    Final natural-language answer\n\n\nExample final prompt:\n\n\n    You answer employment-date questions using only the provided records.\n\n    Rules:\n    - Do not invent dates.\n    - If no exact matching record is provided, return null.\n    - Preserve date precision.\n    - Use YYYY-MM for month precision.\n    - Use YYYY-MM-DD for day precision.\n    - Return JSON only.\n\n    Records:\n    [\n      {\n        \"record_id\": \"employment_000123\",\n        \"person_name\": \"XYZ\",\n        \"employer_name\": \"ABC\",\n        \"start_date\": \"2021-01\",\n        \"start_precision\": \"month\",\n        \"end_date\": \"2021-12\",\n        \"end_precision\": \"month\"\n      }\n    
]\n\n    Question:\n    When did XYZ work for ABC?\n\n\nExpected output:\n\n\n    {\n      \"record_found\": true,\n      \"record_id\": \"employment_000123\",\n      \"person_name\": \"XYZ\",\n      \"employer_name\": \"ABC\",\n      \"start_date\": \"2021-01\",\n      \"start_precision\": \"month\",\n      \"end_date\": \"2021-12\",\n      \"end_precision\": \"month\",\n      \"answer\": \"XYZ worked for ABC from 2021-01 to 2021-12.\"\n    }\n\n\nThis system is easier to debug because you can inspect:\n\n  * Was the right record retrieved?\n  * Was the right record passed to the model?\n  * Did the model copy the right dates?\n  * Did validation pass?\n\n\n\nWith fine-tuning alone, a wrong date is much harder to diagnose.\n\n* * *\n\n## 17. Final recommendation\n\nUse this division of labor:\n\nComponent | Responsibility\n---|---\nStructured database / RAG | Store and retrieve actual employment dates\nSFT | Teach context use, date normalization, abstention, schema-following\nChat template | Ensure the model sees the correct conversation format\nLabels/masks | Ensure assistant answer tokens receive loss\nValidator | Reject invalid date strings\nEvaluator | Measure exact date correctness\n\nDo not keep searching for a model-specific date format. 
Use a clean format, but move the factual burden out of the model weights.\n\nThe best answer is:\n\n\n    Use YYYY-MM for month-level employment periods.\n    Use YYYY-MM-DD only for real day-level dates.\n    Verify labels, not only input_ids.\n    Use model-specific chat templates.\n    Do not expect SFT alone to reliably memorize arbitrary exact dates.\n    Use structured lookup or RAG for the facts.\n    Use SFT for behavior.\n    Use validation or constrained decoding for date validity.\n\n\n* * *\n\n## Short summary\n\n  * The date format should be consistent, preferably `YYYY-MM` for month-level data.\n  * If date-format cleanup did not fix the issue, the problem is likely not the date format.\n  * Seeing dates in `input_ids` does not prove the model is trained on them; inspect `labels != -100`.\n  * Qwen, Gemma, and Llama-style models need correct chat templates.\n  * Fine-tuning alone is weak for exact new/private factual knowledge.\n  * Use retrieval or structured lookup for actual employment dates.\n  * Fine-tune the model to use provided records, not to memorize all records.\n  * Add no-answer and distractor examples.\n  * Use JSON/schema/regex validation to prevent invalid dates.\n  * Evaluate with exact date match, not semantic similarity.\n\n",
  "title": "Date format for fine-tuning AI models"
}