External Publication

Date format for fine-tuning AI models

Hugging Face Forums [Unofficial] May 22, 2026

Are you perhaps trying to have the LLM handle databases and strict data management on its own? LLMs are generally not well suited to that task. Specific constraints may make it unavoidable, but it is usually more reliable to divide the work among different systems.

Date format for fine-tuning Qwen/Gemma/Llama models: likely not just a date-format issue

I do not think there is a hidden “correct date format” that must be discovered separately for Qwen, Gemma, or Llama.

A consistent date format is still important. But if you already tried several formats, standardized the data, and still see the fine-tuned model inventing wrong or invalid dates, then the problem is probably outside date formatting.

The likely issue is one or more of these:

You are asking SFT to teach closed-book factual recall: <person> + <company> -> exact start/end dates.
The model may learn the answer shape but not the exact factual mapping: it learns “answer with a date range,” but not the correct date range.
The correct date may appear in input_ids, but not in the supervised labels.
The chat template / assistant mask may be wrong for the model family.
The output is unconstrained, so invalid dates such as 2021-13 are still possible.
The actual dates may need to live in a database, search index, RAG system, or structured record store rather than only in the model weights.

So my practical recommendation is:

Use one clean date format, but do not rely on fine-tuning alone to memorize exact dates. Use structured lookup or RAG for the facts. Use SFT to teach the model how to use provided records, normalize dates, return JSON, and abstain when no record exists. Use validation or constrained decoding to prevent invalid dates.

Useful background links:

Hugging Face TRL SFTTrainer docs
Hugging Face TRL chat templates docs
Hugging Face Transformers chat templates docs
Hugging Face RAG docs
Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations?
Fine-Tuning or Retrieval? Comparing Knowledge Injection in LLMs
Fine Tuning vs. Retrieval Augmented Generation for Less Popular Knowledge
vLLM structured outputs
Outlines structured generation
Ragas faithfulness metric
Ragas context precision metric

1. The date format I would use

If the source data only has month and year, use month precision:

From 2021-01 to 2021-12

or, better, structured JSON:

{
  "start_date": "2021-01",
  "start_precision": "month",
  "end_date": "2021-12",
  "end_precision": "month"
}

If the source data truly has exact days, use exact ISO dates:

From 2021-01-01 to 2021-12-01

or:

{
  "start_date": "2021-01-01",
  "start_precision": "day",
  "end_date": "2021-12-01",
  "end_precision": "day"
}

Do not silently convert this:

January 2021 to December 2021

into this:

2021-01-01 to 2021-12-01

unless those exact days are actually known.

That would add fake precision. The model may learn that unknown days should be invented as 01, which is usually not what you want.

For employment histories, resumes, HR-style records, contracts, and timelines, I would usually use:

YYYY-MM

for month-level records, and keep a separate precision field.
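As a rough illustration of keeping precision explicit, here is a minimal sketch of a normalizer that never invents a day. The function name normalize_date, the MONTHS table, and the output keys are my own assumptions for the example, not part of any library.

```python
import re

# Hypothetical lookup table: English month names -> month numbers.
MONTHS = {
    name: number
    for number, name in enumerate(
        ["january", "february", "march", "april", "may", "june", "july",
         "august", "september", "october", "november", "december"],
        start=1,
    )
}


def normalize_date(text: str) -> dict:
    """Normalize a date string and report its precision without adding a day."""
    text = text.strip()
    # Shape check only; calendar validity is a separate validation step.
    if re.fullmatch(r"\d{4}-\d{2}-\d{2}", text):
        return {"date": text, "precision": "day"}
    if re.fullmatch(r"\d{4}-\d{2}", text):
        return {"date": text, "precision": "month"}
    match = re.fullmatch(r"([A-Za-z]+)\s+(\d{4})", text)
    if match and match.group(1).lower() in MONTHS:
        month = MONTHS[match.group(1).lower()]
        # "January 2021" stays month-level: no "-01" day is appended.
        return {"date": f"{match.group(2)}-{month:02d}", "precision": "month"}
    raise ValueError(f"unrecognized date: {text!r}")


print(normalize_date("January 2021"))
```

The point of the design is that the precision travels with the value, so a downstream consumer never has to guess whether a day was real or padded.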

2. Why the format changes did not fix it

The tested formats mix several patterns:

From January 2021 till December 2021.
From 01, 2021 till 12, 2021.
From 01-01-2021 till 12-01-2021.
From January 2021 01-01-2021 till December 2021 12-01-2021.

That inconsistency is worth fixing. It gives the model multiple competing surface patterns.

However, once you standardized to something like:

From 2021-01 to 2021-12

and the model still invented wrong or invalid dates, that strongly suggests that date format was not the main bottleneck.

There are two separate tasks here:

Task A:
Normalize "January 2021" to "2021-01".

Task B:
Know that <person> worked for <company> from 2021-01 to 2021-12.

Task A is date normalization. Fine-tuning can handle that.

Task B is factual recall. Fine-tuning is much less reliable for that.

The model may learn:

When asked about employment dates, answer with:
"From YYYY-MM to YYYY-MM."

but fail to learn:

<person> + <company> = 2021-01 to 2021-12

That distinction is the core of the problem.

3. This is probably a closed-book factual-recall problem

Your examples have this structure:

User:
When did XYZ work for ABC?

Assistant:
From 2021-01 to 2021-12.

At inference time, the model sees only:

When did XYZ work for ABC?

It does not see the employment record. Therefore, it must recover the fact from model weights:

XYZ + ABC -> 2021-01 to 2021-12

That is closed-book factual recall.

Closed-book means the answer is not provided in the prompt, not retrieved from an external source, and not looked up from a database visible to the model. The model must answer from memory.

This is fragile for exact dates because dates are high-entropy factual values. A model can often learn that an answer should look like a date range, but the exact dates are not predictable from the words XYZ and ABC.

This matches existing research:

Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations? reports that LLMs struggle to acquire new factual knowledge through fine-tuning and that fine-tuning new knowledge can increase hallucination tendency.
Fine-Tuning or Retrieval? Comparing Knowledge Injection in LLMs finds that RAG consistently outperforms unsupervised fine-tuning on knowledge-intensive tasks, including entirely new knowledge.
Fine Tuning vs. Retrieval Augmented Generation for Less Popular Knowledge is especially relevant if your facts are private, rare, internal, or low-frequency.

In other words, you may not be debugging a date-string issue. You may be seeing the limits of using SFT as a factual database.

4. input_ids matching the dataset is not enough

Checking input_ids is useful. It tells you that tokenization did not completely destroy the date string.

But it does not prove the model is learning those dates.

In causal LM SFT, the answer text often appears in input_ids. That is normal. The real question is whether the answer tokens also appear in labels where loss is applied.

Many SFT pipelines mask tokens by setting labels to -100. Tokens with label -100 are ignored by the loss.

So this can happen:

input_ids:
... From 2021-01 to 2021-12 ...

labels:
... -100 -100 -100 -100 ...

In that case, the date is visible in the batch, but the model is not trained to generate it.

This is particularly important when using:

assistant_only_loss
completion_only_loss
DataCollatorForCompletionOnlyLM
packing
custom chat templates
Qwen/Gemma/Llama-specific templates

See:

TRL SFTTrainer docs
TRL chat templates docs
TRL issue on generation markers for assistant-only loss
Transformers chat templates docs

Run this check before training:

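# Assumes `trainer` is an already-constructed SFTTrainer and `tokenizer` is its tokenizer.
batch = next(iter(trainer.get_train_dataloader()))

# Take the first example in the batch.
input_ids = batch["input_ids"][0]
labels = batch["labels"][0]

print("FULL INPUT")
print(tokenizer.decode(input_ids, skip_special_tokens=False))

# Keep only the positions where loss is applied (label != -100).
print("\nSUPERVISED TOKENS ONLY")
supervised_ids = labels[labels != -100]
print(tokenizer.decode(supervised_ids, skip_special_tokens=False))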

You want to see the assistant answer in the supervised region:

From 2021-01 to 2021-12.

or, if using JSON:

{"start_date":"2021-01","end_date":"2021-12"}

If the date is not present in labels[labels != -100], the model is not being trained to generate that date.
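To make the masking idea concrete without loading a model, here is a toy sketch in plain Python. It uses a whitespace split as a stand-in tokenizer, and the helper names build_labels and supervised_tokens are hypothetical; real pipelines do this inside the collator or trainer.

```python
IGNORE_INDEX = -100  # label value that the loss ignores


def build_labels(token_ids, num_prompt_tokens):
    """Mask the prompt so that only completion tokens receive loss."""
    return [IGNORE_INDEX] * num_prompt_tokens + list(token_ids[num_prompt_tokens:])


def supervised_tokens(tokens, labels):
    """Return the tokens whose labels are not masked."""
    return [tok for tok, lab in zip(tokens, labels) if lab != IGNORE_INDEX]


# Toy whitespace "tokenizer"; the IDs are just positions.
tokens = "When did XYZ work for ABC? From 2021-01 to 2021-12.".split()
token_ids = list(range(len(tokens)))

# Correct masking: the 6 prompt tokens are ignored, the answer is supervised.
good_labels = build_labels(token_ids, num_prompt_tokens=6)
# Broken masking: every label is -100, so nothing is learned.
bad_labels = [IGNORE_INDEX] * len(tokens)

print(supervised_tokens(tokens, good_labels))
print(supervised_tokens(tokens, bad_labels))
```

In both cases the date is present in the input tokens. Only the first case trains the model to produce it.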

5. Chat templates are another likely cause

Qwen, Gemma, and Llama-style instruction models do not all expect the same prompt format.

A raw text format like this:

user: When did XYZ work for ABC.
Assistant: From 2021-01 to 2021-12.

may not match the actual chat template expected by the model.

Hugging Face’s chat template documentation explains that chat models use model-specific control tokens. The same user/assistant conversation can be rendered differently for different model families.

So training examples should usually be represented structurally:

messages = [
    {"role": "user", "content": "When did XYZ work for ABC?"},
    {"role": "assistant", "content": "From 2021-01 to 2021-12."},
]

Then render them using the model tokenizer:

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=False,
)

print(text)

At inference time:

messages = [
    {"role": "user", "content": "When did XYZ work for ABC?"},
]

prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

print(prompt)

Training and inference must be structurally compatible.

A common failure is:

Training:
manual "user:" / "Assistant:" strings

Inference:
tokenizer.apply_chat_template(...)

or:

Training:
one model family's chat template

Inference:
another model family's chat template

This can produce poor generations even when the data itself is correct.
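One cheap way to catch this kind of mismatch is to check that the inference prompt is an exact prefix of the rendered training text. The sketch below uses two made-up renderers, render_a and render_b, which are purely hypothetical and are not the real Qwen, Gemma, or Llama templates. With a real tokenizer you would compare two apply_chat_template outputs instead, and some templates may legitimately break the prefix property.

```python
def render_a(messages, add_generation_prompt=False):
    """Hypothetical template A: bracketed role tags."""
    text = "".join(
        f"[{m['role']}]\n{m['content']}\n[/{m['role']}]\n" for m in messages
    )
    if add_generation_prompt:
        text += "[assistant]\n"
    return text


def render_b(messages, add_generation_prompt=False):
    """Hypothetical template B: plain 'role: content' lines."""
    text = "".join(f"{m['role']}: {m['content']}\n" for m in messages)
    if add_generation_prompt:
        text += "assistant: "
    return text


def templates_compatible(train_render, infer_render, messages):
    """True if the inference prompt is an exact prefix of the training text."""
    train_text = train_render(messages, add_generation_prompt=False)
    # Drop the final assistant turn and ask for a generation prompt instead.
    prompt = infer_render(messages[:-1], add_generation_prompt=True)
    return train_text.startswith(prompt)


messages = [
    {"role": "user", "content": "When did XYZ work for ABC?"},
    {"role": "assistant", "content": "From 2021-01 to 2021-12."},
]

print(templates_compatible(render_a, render_a, messages))
print(templates_compatible(render_a, render_b, messages))
```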

6. Assistant-only loss and generation markers can matter

If you use assistant_only_loss=True, the trainer needs to know which tokens belong to the assistant response.

That often depends on the chat template producing an assistant-token mask. TRL documents that SFT with assistant-only loss requires {% generation %} and {% endgeneration %} markers around assistant output so that the loss mask can target only assistant tokens.

Useful links:

TRL SFTTrainer docs
TRL chat templates docs
TRL issue #5471: generation markers for common model families

For Qwen, Gemma, or Llama models, verify:

Does the tokenizer chat template support assistant masks?
Does the trainer actually supervise assistant tokens?
Does the supervised region include the date?
Does this still work if packing is enabled?
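For the first question in that list, a quick text check on the template string can help. This is only a sketch: has_generation_markers is a hypothetical helper, and the demo templates below are invented. Finding the markers does not prove the mask is correct, so you should still inspect the supervised tokens.

```python
import re

# Jinja block tags may carry whitespace-control dashes, e.g. {%- generation -%}.
GENERATION_OPEN = re.compile(r"\{%-?\s*generation\s*-?%\}")
GENERATION_CLOSE = re.compile(r"\{%-?\s*endgeneration\s*-?%\}")


def has_generation_markers(chat_template: str) -> bool:
    """Return True if the template contains both generation markers."""
    return bool(GENERATION_OPEN.search(chat_template)) and bool(
        GENERATION_CLOSE.search(chat_template)
    )


# Invented templates for the demo only.
with_markers = (
    "{% for m in messages %}{% if m['role'] == 'assistant' %}"
    "{% generation %}{{ m['content'] }}{% endgeneration %}"
    "{% endif %}{% endfor %}"
)
without_markers = "{% for m in messages %}{{ m['content'] }}{% endfor %}"

print(has_generation_markers(with_markers))
print(has_generation_markers(without_markers))
```

With a real model you would pass the tokenizer's chat_template string to the same function.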

For debugging, temporarily disable packing:

packing = False

Packing is useful later, but it makes it harder to inspect example boundaries, EOS behavior, truncation, and loss masks.

7. Fine-tuning alone can work only in narrow conditions

Fine-tuning alone may work if all of these are true:

The number of facts is small.
The facts are static.
Each fact appears many times.
You use many paraphrases per fact.
The questions are highly repetitive.
The use case can tolerate occasional wrong answers.
The deployment is low-risk.
You evaluate exact-match behavior carefully.

Example of a narrow case where no-RAG SFT might be acceptable:

20 fictional people
20 fictional companies
many paraphrases per fact
internal demo
low consequence if an answer is occasionally wrong

But if you have:

hundreds or thousands of people,
many companies,
similar names,
changing records,
private facts,
audit requirements,
exact-date requirements,

then fine-tuning alone is the wrong default.

For exact employment dates, the model should not be the only database.

8. Better approach: structured lookup or RAG

For employment dates, store the facts outside the model.

A record should look something like this:

{
  "record_id": "employment_000123",
  "person_id": "person_xyz",
  "person_name": "XYZ",
  "employer_id": "org_abc",
  "employer_name": "ABC",
  "start_date": "2021-01",
  "start_precision": "month",
  "end_date": "2021-12",
  "end_precision": "month",
  "is_current": false
}

Then, at inference time:

Parse the question.
Resolve the person and employer.
Retrieve the matching employment record.
Give the record to the model.
Ask the model to answer using only that record.
Validate the output.

This changes the task from:

The model must remember the date.

to:

The model must read the date from the retrieved record and format it correctly.

That is a much better use of an LLM.

See:

Hugging Face RAG docs
Hugging Face Advanced RAG cookbook
Code a simple RAG from scratch

If your data is already structured, start with structured lookup, not pure vector search.

For example:

SELECT start_date, end_date
FROM employment_records
WHERE person_id = 'person_xyz'
AND employer_id = 'org_abc';

Pure vector RAG is useful for unstructured text. But if you already have person IDs, employer IDs, start dates, and end dates, structured lookup is more reliable.
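Here is a minimal runnable sketch of that lookup using Python's built-in sqlite3 module with an in-memory database. The table layout mirrors the query above; the helper names make_demo_db and lookup_employment are assumptions for illustration.

```python
import sqlite3


def make_demo_db():
    """Create an in-memory database with one demo employment record."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE employment_records ("
        "record_id TEXT PRIMARY KEY, person_id TEXT, employer_id TEXT, "
        "start_date TEXT, end_date TEXT)"
    )
    conn.execute(
        "INSERT INTO employment_records VALUES (?, ?, ?, ?, ?)",
        ("employment_000123", "person_xyz", "org_abc", "2021-01", "2021-12"),
    )
    return conn


def lookup_employment(conn, person_id, employer_id):
    """Return all records matching both IDs; an empty list means no record."""
    rows = conn.execute(
        "SELECT record_id, start_date, end_date FROM employment_records "
        "WHERE person_id = ? AND employer_id = ?",
        (person_id, employer_id),  # parameters, never string formatting
    ).fetchall()
    return [
        {"record_id": r[0], "start_date": r[1], "end_date": r[2]} for r in rows
    ]


conn = make_demo_db()
print(lookup_employment(conn, "person_xyz", "org_abc"))
print(lookup_employment(conn, "person_xyz", "org_abd"))
```

An empty result is a useful signal in its own right: it is what lets the model abstain instead of guessing.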

9. What SFT should do in the better design

SFT is still useful. It is just useful for a different job.

Do not primarily fine-tune the model to memorize this:

XYZ worked for ABC from 2021-01 to 2021-12.

Instead, fine-tune it to do this:

Given retrieved records + question -> return the correct structured answer using only the records.

A better SFT example:

{
  "messages": [
    {
      "role": "system",
      "content": "Use only the provided employment records. Return JSON only. Do not invent dates."
    },
    {
      "role": "user",
      "content": "Records:\n[{\"person\":\"XYZ\",\"employer\":\"ABC\",\"start\":\"January 2021\",\"end\":\"December 2021\"}]\n\nQuestion: When did XYZ work for ABC?"
    },
    {
      "role": "assistant",
      "content": "{\"record_found\":true,\"start_date\":\"2021-01\",\"start_precision\":\"month\",\"end_date\":\"2021-12\",\"end_precision\":\"month\"}"
    }
  ]
}

This teaches:

use context,
normalize dates,
preserve precision,
return schema-compatible JSON,
do not invent dates,
abstain when no record exists.

That is a much better SFT target than closed-book memorization.
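If you generate such examples programmatically, something like the following sketch may help keep them consistent. build_sft_example and SYSTEM_PROMPT are hypothetical names; the message layout follows the example above.

```python
import json

SYSTEM_PROMPT = (
    "Use only the provided employment records. Return JSON only. "
    "Do not invent dates."
)


def build_sft_example(records, question, answer):
    """Build one context-grounded chat example from records and a gold answer."""
    return {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {
                "role": "user",
                # json.dumps handles the quoting that is easy to get wrong by hand.
                "content": f"Records:\n{json.dumps(records)}\n\nQuestion: {question}",
            },
            {"role": "assistant", "content": json.dumps(answer)},
        ]
    }


example = build_sft_example(
    records=[{"person": "XYZ", "employer": "ABC",
              "start": "January 2021", "end": "December 2021"}],
    question="When did XYZ work for ABC?",
    answer={"record_found": True,
            "start_date": "2021-01", "start_precision": "month",
            "end_date": "2021-12", "end_precision": "month"},
)
print(json.dumps(example, indent=2))
```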

10. Add no-answer and distractor examples

A common mistake is training only examples where every question has an answer.

That teaches the model:

Always produce a date.

Then when it does not know, it still invents one.

You need examples where the correct output is null.

Example:

{
  "messages": [
    {
      "role": "system",
      "content": "Use only the provided employment records. Return JSON only."
    },
    {
      "role": "user",
      "content": "Records:\n[]\n\nQuestion: When did XYZ work for ABC?"
    },
    {
      "role": "assistant",
      "content": "{\"record_found\":false,\"start_date\":null,\"end_date\":null,\"reason\":\"No matching employment record was provided.\"}"
    }
  ]
}

You also need distractor examples:

{
  "records": [
    {
      "person": "XYZ",
      "employer": "ABD",
      "start_date": "2021-01",
      "end_date": "2021-12"
    },
    {
      "person": "XYX",
      "employer": "ABC",
      "start_date": "2020-01",
      "end_date": "2020-12"
    }
  ],
  "question": "When did XYZ work for ABC?",
  "answer": {
    "record_found": false,
    "start_date": null,
    "end_date": null
  }
}

This teaches the model not to combine the right person from one record with the right company from another.

That matters a lot for employment data, where names and organizations can be similar.
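When building distractor examples, it is safer to compute the gold answer from the records than to write it by hand. A minimal sketch, assuming exact string matching on both fields and abstention when there is no unique match (expected_answer is a hypothetical helper):

```python
def expected_answer(records, person, employer):
    """Gold answer: found only if exactly one record matches BOTH fields."""
    matches = [
        r for r in records
        if r["person"] == person and r["employer"] == employer
    ]
    if len(matches) != 1:
        # Zero matches: abstain. Several matches: ambiguous, so this sketch
        # also abstains (an assumption; you may prefer to return all of them).
        return {"record_found": False, "start_date": None, "end_date": None}
    record = matches[0]
    return {
        "record_found": True,
        "start_date": record["start_date"],
        "end_date": record["end_date"],
    }


records = [
    {"person": "XYZ", "employer": "ABD",
     "start_date": "2021-01", "end_date": "2021-12"},
    {"person": "XYX", "employer": "ABC",
     "start_date": "2020-01", "end_date": "2020-12"},
]

# Neither record matches both "XYZ" and "ABC", so the gold answer abstains.
print(expected_answer(records, "XYZ", "ABC"))
```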

11. Invalid dates need validation or constrained decoding

Even after SFT, the model can still produce:

2021-13
2021-00
2021-02-31
2021/01
January 2021 2021-01

For machine-readable outputs, prompting is not enough.

Use one or more of:

JSON Schema,
regex-constrained decoding,
grammar-constrained decoding,
post-generation validation,
retry on invalid output.

Useful links:

vLLM structured outputs
Outlines structured generation
LM Format Enforcer
llguidance

For month-level dates, validate with:

import re

MONTH_RE = re.compile(r"^\d{4}-(0[1-9]|1[0-2])$")

def valid_month(value: str) -> bool:
    return bool(MONTH_RE.fullmatch(value))

For exact ISO dates:

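import re
from datetime import date

ISO_DAY_RE = re.compile(r"\d{4}-\d{2}-\d{2}")

def valid_iso_date(value: str) -> bool:
    # Python 3.11+ also accepts forms like "20210101", so check the shape first.
    if not ISO_DAY_RE.fullmatch(value):
        return False
    try:
        date.fromisoformat(value)
        return True
    except ValueError:
        return False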

Also validate date ordering:

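def month_key(value: str) -> tuple[int, int]:
    year, month = value.split("-")
    return int(year), int(month)

def valid_range(start_date: str, end_date: str) -> bool:
    # Tuples of (year, month) compare chronologically.
    return month_key(start_date) <= month_key(end_date)

assert valid_range("2021-01", "2021-12")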

Important limitation:

Validation can reject invalid dates.
It cannot prove that a valid date is factually correct.

For factual correctness, you need the retrieved record or database.
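The "retry on invalid output" option can be sketched as a small loop around whatever generation call you use. Here generate is a hypothetical callable that takes a prompt and returns raw text; parse_valid_answer and generate_with_retry are also illustrative names. The sketch assumes month-level dates and requires both dates to be present, so open-ended records would need an extension.

```python
import json
import re

MONTH_RE = re.compile(r"\d{4}-(0[1-9]|1[0-2])")


def parse_valid_answer(raw):
    """Return the parsed answer dict if it passes validation, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    start, end = data.get("start_date"), data.get("end_date")
    if start is None and end is None:
        # Null dates are acceptable only as an explicit abstention.
        return data if data.get("record_found") is False else None
    for value in (start, end):
        if not isinstance(value, str) or not MONTH_RE.fullmatch(value):
            return None
    # Zero-padded YYYY-MM strings sort chronologically.
    if start > end:
        return None
    return data


def generate_with_retry(generate, prompt, max_attempts=3):
    """Call `generate` until the output validates or attempts run out."""
    for _ in range(max_attempts):
        answer = parse_valid_answer(generate(prompt))
        if answer is not None:
            return answer
    return None  # caller decides how to handle a persistent failure


# Fake generator for the demo: two bad outputs, then a valid one.
outputs = iter([
    '{"start_date": "2021-13", "end_date": "2021-12"}',
    "not json",
    '{"record_found": true, "start_date": "2021-01", "end_date": "2021-12"}',
])
result = generate_with_retry(lambda prompt: next(outputs), "When did XYZ work for ABC?")
print(result)
```

Constrained decoding avoids most of these retries in the first place, so I would treat this loop as a fallback rather than the main defense.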

12. Evaluation should be exact, not semantic

For this task, do not evaluate only by reading a few answers manually.

Use exact metrics:

start_date exact match
end_date exact match
start_precision exact match
end_precision exact match
record_id exact match
invalid date rate
malformed JSON rate
false answer rate when no record exists
wrong-person rate
wrong-employer rate

If you use RAG, evaluate retrieval separately from generation.

Useful RAG evaluation links:

Ragas context precision
Ragas faithfulness
Hugging Face RAG evaluation cookbook

For this task:

Context precision:
Did retrieval find the correct employment record?

Faithfulness:
Did the model answer using only the retrieved record?

Exact match:
Did the final start_date and end_date match the expected dates?

All three matter.
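A few of the exact metrics listed above can be computed with a handful of lines. This is a sketch under the assumption that predictions and gold answers are parallel lists of dicts with the same keys as the JSON examples in this post; the function name evaluate is hypothetical.

```python
def evaluate(predictions, gold):
    """Compute exact-match and false-answer rates over parallel answer lists."""
    if not gold or len(predictions) != len(gold):
        raise ValueError("predictions and gold must be non-empty and aligned")
    pairs = list(zip(predictions, gold))
    start_hits = sum(p.get("start_date") == g["start_date"] for p, g in pairs)
    end_hits = sum(p.get("end_date") == g["end_date"] for p, g in pairs)
    # Cases where the correct behavior is to abstain.
    no_record = [p for p, g in pairs if not g["record_found"]]
    false_answers = sum(
        p.get("start_date") is not None or p.get("end_date") is not None
        for p in no_record
    )
    return {
        "start_exact_match": start_hits / len(gold),
        "end_exact_match": end_hits / len(gold),
        "false_answer_rate": false_answers / len(no_record) if no_record else 0.0,
    }


gold = [
    {"record_found": True, "start_date": "2021-01", "end_date": "2021-12"},
    {"record_found": False, "start_date": None, "end_date": None},
]
predictions = [
    {"record_found": True, "start_date": "2021-01", "end_date": "2021-11"},
    {"record_found": True, "start_date": "2020-05", "end_date": "2020-06"},
]
print(evaluate(predictions, gold))
```

Note that the first prediction would look "almost right" to a human reader or a semantic-similarity metric, which is exactly why exact match is the metric to trust here.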

13. Debug checklist for the current SFT setup

Check 1: print the rendered chat template

messages = [
    {"role": "user", "content": "When did XYZ work for ABC?"},
    {"role": "assistant", "content": "From 2021-01 to 2021-12."},
]

rendered = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=False,
)

print(rendered)

Verify:

Is this the model’s expected template?
Does it include the assistant answer?
Does it include duplicated special tokens?
Does it match the inference template?

Check 2: print supervised tokens

batch = next(iter(trainer.get_train_dataloader()))

input_ids = batch["input_ids"][0]
labels = batch["labels"][0]

print("FULL INPUT")
print(tokenizer.decode(input_ids, skip_special_tokens=False))

print("\nSUPERVISED TOKENS ONLY")
print(tokenizer.decode(labels[labels != -100], skip_special_tokens=False))

Verify that the supervised tokens include:

From 2021-01 to 2021-12.

or the expected JSON answer.

Check 3: count supervised tokens

num_total = labels.numel()
num_supervised = (labels != -100).sum().item()

print("total tokens:", num_total)
print("supervised tokens:", num_supervised)
print("supervised ratio:", num_supervised / num_total)

Red flags:

supervised tokens = 0
only special tokens are supervised
prompt tokens are supervised but answer tokens are ignored
answer date is missing from supervised tokens

Check 4: disable packing temporarily

packing = False

Inspect one example at a time.

Packing is useful later, but it makes debugging masks, boundaries, EOS, and truncation harder.

Check 5: run a tiny overfit test

Create one synthetic example with unusual dates:

Person_AAA worked for Company_BBB from 2091-03 to 2092-07.

Train on that one example.

Ask:

When did Person_AAA work for Company_BBB?

Expected answer:

From 2091-03 to 2092-07.

If the model cannot memorize one example, the problem is probably not date format. It is likely one of:

labels,
masking,
chat template,
adapter loading,
learning rate,
LoRA target modules,
EOS/truncation,
inference prompt mismatch.

Check 6: run a ten-example overfit test

Train on 10 synthetic examples with unusual dates.

Evaluate on the same exact prompts.

If it fails, your training pipeline is probably broken.

If it succeeds on exact prompts but fails on paraphrases, the model memorized the prompt surface rather than robustly learning entity-date associations.
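For the overfit tests, synthetic examples are easy to generate. Here is one possible sketch; the name make_overfit_examples, the year range, and the naming scheme are all arbitrary choices of mine. The far-future years make it unlikely that the base model already "knows" the answer.

```python
import random


def make_overfit_examples(n=10, seed=0):
    """Generate n synthetic chat examples with unusual, far-future date ranges."""
    rng = random.Random(seed)  # seeded so the train and eval sets are identical
    examples = []
    for i in range(n):
        start_year = rng.randint(2080, 2095)
        start_month = rng.randint(1, 12)
        # End date is 1 to 36 months after the start, so start < end always holds.
        total_months = start_year * 12 + (start_month - 1) + rng.randint(1, 36)
        end_year, end_month_index = divmod(total_months, 12)
        end_month = end_month_index + 1
        person, company = f"Person_{i:03d}", f"Company_{i:03d}"
        examples.append({
            "messages": [
                {"role": "user",
                 "content": f"When did {person} work for {company}?"},
                {"role": "assistant",
                 "content": f"From {start_year}-{start_month:02d} "
                            f"to {end_year}-{end_month:02d}."},
            ]
        })
    return examples


examples = make_overfit_examples(n=10, seed=0)
for example in examples[:2]:
    print(example["messages"][0]["content"], "->", example["messages"][1]["content"])
```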

Check 7: run a context-present test

Compare these two prompts.

Closed-book:

When did XYZ work for ABC?

Context-present:

Record:
XYZ worked for ABC from 2021-01 to 2021-12.

Question:
When did XYZ work for ABC?

If context-present works and closed-book fails, the solution is retrieval or structured lookup, not more date-format changes.
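The comparison can be scripted so that it runs over many examples rather than one. In this sketch, build_prompts and diagnose are hypothetical helpers, and the wording of the diagnoses is my own reading of the test, not an established rule.

```python
def build_prompts(person, employer, record_text):
    """Return (closed_book_prompt, context_present_prompt) for one question."""
    question = f"When did {person} work for {employer}?"
    context_present = f"Record:\n{record_text}\n\nQuestion:\n{question}"
    return question, context_present


def diagnose(closed_book_ok, context_present_ok):
    """Map the two pass/fail results to a suggested next step."""
    if context_present_ok and not closed_book_ok:
        return "use retrieval or structured lookup"
    if not context_present_ok:
        # Assumption: failing with the record in context points at the
        # pipeline (template, masking, decoding), not at the date format.
        return "debug the pipeline before anything else"
    return "both pass here; test more examples and paraphrases"


closed, grounded = build_prompts(
    "XYZ", "ABC", "XYZ worked for ABC from 2021-01 to 2021-12."
)
print(closed)
print(grounded)
print(diagnose(closed_book_ok=False, context_present_ok=True))
```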

14. Recommended target format

Instead of this as the main target:

user: When did XYZ work for ABC.
assistant: From 2021-01 to 2021-12.

use a context-grounded format:

{
  "messages": [
    {
      "role": "system",
      "content": "Answer employment-date questions using only the provided records. Return JSON only. Do not invent dates."
    },
    {
      "role": "user",
      "content": "Records:\n[{\"record_id\":\"employment_000123\",\"person\":\"XYZ\",\"employer\":\"ABC\",\"start_date\":\"2021-01\",\"start_precision\":\"month\",\"end_date\":\"2021-12\",\"end_precision\":\"month\"}]\n\nQuestion: When did XYZ work for ABC?"
    },
    {
      "role": "assistant",
      "content": "{\"record_found\":true,\"record_id\":\"employment_000123\",\"start_date\":\"2021-01\",\"start_precision\":\"month\",\"end_date\":\"2021-12\",\"end_precision\":\"month\"}"
    }
  ]
}

For a missing record:

{
  "messages": [
    {
      "role": "system",
      "content": "Answer employment-date questions using only the provided records. Return JSON only. Do not invent dates."
    },
    {
      "role": "user",
      "content": "Records:\n[]\n\nQuestion: When did XYZ work for ABC?"
    },
    {
      "role": "assistant",
      "content": "{\"record_found\":false,\"record_id\":null,\"start_date\":null,\"end_date\":null,\"reason\":\"No matching employment record was provided.\"}"
    }
  ]
}

This trains the right behavior:

use context,
do not invent,
return structured output,
preserve precision,
abstain when the record is absent.

It does not try to turn the model into the only source of truth.

15. Can fine-tuning alone solve it?

In theory

Yes, sometimes.

Fine-tuning alone can memorize a small number of static facts if:

the dataset is small,
facts are repeated many times,
the same facts appear in many paraphrases,
the evaluation questions are similar to training questions,
occasional errors are acceptable.

In this case

Probably not reliably.

The symptoms suggest that the model is either:

not actually being supervised on the answer tokens,
not receiving the correct chat template,
learning the answer pattern but not the date mapping,
or being asked to do a task that should use retrieval or structured lookup.

Even if the SFT pipeline is fixed, fine-tuning alone is still not the best design if exact dates matter.

For employment dates, I would not trust model weights as the only source of truth.

16. Better architecture

I would use this design:

User question
  ↓
Entity extraction / normalization
  ↓
Person and employer resolution
  ↓
Structured lookup or RAG
  ↓
Retrieved employment record(s)
  ↓
LLM answers using only retrieved records
  ↓
Strict JSON output
  ↓
Date/schema validator
  ↓
Final natural-language answer

Example final prompt:

You answer employment-date questions using only the provided records.

Rules:
- Do not invent dates.
- If no exact matching record is provided, return null.
- Preserve date precision.
- Use YYYY-MM for month precision.
- Use YYYY-MM-DD for day precision.
- Return JSON only.

Records:
[
  {
    "record_id": "employment_000123",
    "person_name": "XYZ",
    "employer_name": "ABC",
    "start_date": "2021-01",
    "start_precision": "month",
    "end_date": "2021-12",
    "end_precision": "month"
  }
]

Question:
When did XYZ work for ABC?

Expected output:

{
  "record_found": true,
  "record_id": "employment_000123",
  "person_name": "XYZ",
  "employer_name": "ABC",
  "start_date": "2021-01",
  "start_precision": "month",
  "end_date": "2021-12",
  "end_precision": "month",
  "answer": "XYZ worked for ABC from 2021-01 to 2021-12."
}

This system is easier to debug because you can inspect:

Was the right record retrieved?
Was the right record passed to the model?
Did the model copy the right dates?
Did validation pass?

With fine-tuning alone, a wrong date is much harder to diagnose.
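To show how the stages fit together, here is a compact end-to-end sketch. Everything in it is a stand-in: RECORDS replaces the database, stub_llm replaces the model call by simply copying fields from the retrieved record, and the function names are hypothetical. The part worth keeping is the final check that the dates in the model output match the record it cites.

```python
import json
import re

MONTH_RE = re.compile(r"\d{4}-(0[1-9]|1[0-2])")

# Hypothetical in-memory store standing in for a database or index.
RECORDS = [
    {"record_id": "employment_000123", "person_name": "XYZ",
     "employer_name": "ABC", "start_date": "2021-01", "end_date": "2021-12"},
]


def retrieve(person, employer):
    """Structured lookup: exact match on both names."""
    return [r for r in RECORDS
            if r["person_name"] == person and r["employer_name"] == employer]


def stub_llm(records):
    """Stand-in for the model: returns JSON copied from a single record."""
    if len(records) != 1:
        return json.dumps({"record_found": False,
                           "start_date": None, "end_date": None})
    rec = records[0]
    return json.dumps({"record_found": True, "record_id": rec["record_id"],
                       "start_date": rec["start_date"],
                       "end_date": rec["end_date"]})


def answer(person, employer, llm=stub_llm):
    """Retrieve, generate, validate, then render a natural-language answer."""
    records = retrieve(person, employer)
    data = json.loads(llm(records))
    if not data.get("record_found"):
        return f"No employment record found for {person} at {employer}."
    source = {r["record_id"]: r for r in records}.get(data.get("record_id"))
    if source is None:
        raise ValueError("model cited a record that was not retrieved")
    for key in ("start_date", "end_date"):
        # Reject both malformed dates and dates that differ from the record.
        if not MONTH_RE.fullmatch(str(data.get(key))) or data[key] != source[key]:
            raise ValueError(f"invalid or ungrounded {key}: {data.get(key)!r}")
    return (f"{person} worked for {employer} "
            f"from {data['start_date']} to {data['end_date']}.")


print(answer("XYZ", "ABC"))
print(answer("XYZ", "ABD"))
```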

17. Final recommendation

Use this division of labor:

Component	Responsibility
Structured database / RAG	Store and retrieve actual employment dates
SFT	Teach context use, date normalization, abstention, schema-following
Chat template	Ensure the model sees the correct conversation format
Labels/masks	Ensure assistant answer tokens receive loss
Validator	Reject invalid date strings
Evaluator	Measure exact date correctness

Do not keep searching for a model-specific date format. Use a clean format, but move the factual burden out of the model weights.

The best answer is:

Use YYYY-MM for month-level employment periods.
Use YYYY-MM-DD only for real day-level dates.
Verify labels, not only input_ids.
Use model-specific chat templates.
Do not expect SFT alone to reliably memorize arbitrary exact dates.
Use structured lookup or RAG for the facts.
Use SFT for behavior.
Use validation or constrained decoding for date validity.

Short summary

The date format should be consistent, preferably YYYY-MM for month-level data.
If date-format cleanup did not fix the issue, the problem is likely not the date format.
Seeing dates in input_ids does not prove the model is trained on them; inspect labels != -100.
Qwen, Gemma, and Llama-style models need correct chat templates.
Fine-tuning alone is weak for exact new/private factual knowledge.
Use retrieval or structured lookup for actual employment dates.
Fine-tune the model to use provided records, not to memorize all records.
Add no-answer and distractor examples.
Use JSON/schema/regex validation to prevent invalid dates.
Evaluate with exact date match, not semantic similarity.