Date format for fine-tuning AI models
Are you perhaps trying to have the LLM handle databases and strict data management on its own? LLMs are generally not well suited to that. It may be unavoidable under specific constraints, but it is usually more reliable to divide these tasks among different systems:
Date format for fine-tuning Qwen/Gemma/Llama models: likely not just a date-format issue
I do not think there is a hidden “correct date format” that must be discovered separately for Qwen, Gemma, or Llama.
A consistent date format is still important. But if you already tried several formats, standardized the data, and still see the fine-tuned model inventing wrong or invalid dates, then the problem is probably outside date formatting.
The likely issue is one or more of these:
- You are asking SFT to teach closed-book factual recall: <person> + <company> -> exact start/end dates.
- The model may learn the answer shape but not the exact factual mapping: it learns “answer with a date range,” but not the correct date range.
- The correct date may appear in input_ids, but not in the supervised labels.
- The chat template / assistant mask may be wrong for the model family.
- The output is unconstrained, so invalid dates such as 2021-13 are still possible.
- The actual dates may need to live in a database, search index, RAG system, or structured record store rather than only in the model weights.
So my practical recommendation is:
Use one clean date format, but do not rely on fine-tuning alone to memorize exact dates. Use structured lookup or RAG for the facts. Use SFT to teach the model how to use provided records, normalize dates, return JSON, and abstain when no record exists. Use validation or constrained decoding to prevent invalid dates.
Useful background links:
- Hugging Face TRL SFTTrainer docs
- Hugging Face TRL chat templates docs
- Hugging Face Transformers chat templates docs
- Hugging Face RAG docs
- Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations?
- Fine-Tuning or Retrieval? Comparing Knowledge Injection in LLMs
- Fine Tuning vs. Retrieval Augmented Generation for Less Popular Knowledge
- vLLM structured outputs
- Outlines structured generation
- Ragas faithfulness metric
- Ragas context precision metric
1. The date format I would use
If the source data only has month and year, use month precision:
From 2021-01 to 2021-12
or, better, structured JSON:
{
"start_date": "2021-01",
"start_precision": "month",
"end_date": "2021-12",
"end_precision": "month"
}
If the source data truly has exact days, use exact ISO dates:
From 2021-01-01 to 2021-12-01
or:
{
"start_date": "2021-01-01",
"start_precision": "day",
"end_date": "2021-12-01",
"end_precision": "day"
}
Do not silently convert this:
January 2021 to December 2021
into this:
2021-01-01 to 2021-12-01
unless those exact days are actually known.
That would add fake precision. The model may learn that unknown days should be invented as 01, which is usually not what you want.
For employment histories, resumes, HR-style records, contracts, and timelines, I would usually use:
YYYY-MM
for month-level records, and keep a separate precision field.
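As a rough illustration of keeping precision explicit, here is a minimal sketch of a normalizer that never pads a missing day. The function name normalize_date and the returned keys are my own choices for this example, not part of any library, and it only handles the three input shapes discussed here.

```python
import re
from datetime import date

MONTHS = {
    "january": 1, "february": 2, "march": 3, "april": 4,
    "may": 5, "june": 6, "july": 7, "august": 8,
    "september": 9, "october": 10, "november": 11, "december": 12,
}

def normalize_date(text: str) -> dict:
    """Return {"date": ..., "precision": ...} without inventing missing parts."""
    text = text.strip()
    # Exact ISO day, e.g. "2021-01-15".
    if re.fullmatch(r"\d{4}-\d{2}-\d{2}", text):
        date.fromisoformat(text)  # raises ValueError for impossible days
        return {"date": text, "precision": "day"}
    # ISO month, e.g. "2021-01".
    if re.fullmatch(r"\d{4}-(0[1-9]|1[0-2])", text):
        return {"date": text, "precision": "month"}
    # Month name plus year, e.g. "January 2021".
    m = re.fullmatch(r"([A-Za-z]+)\s+(\d{4})", text)
    if m and m.group(1).lower() in MONTHS:
        month = MONTHS[m.group(1).lower()]
        return {"date": f"{m.group(2)}-{month:02d}", "precision": "month"}
    raise ValueError(f"unrecognized date: {text!r}")

print(normalize_date("January 2021"))
print(normalize_date("2021-12-01"))
```

The point of the design is that "January 2021" stays at month precision; nothing in the code can turn it into 2021-01-01.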
2. Why the format changes did not fix it
The tested formats mix several patterns:
From January 2021 till December 2021.
From 01, 2021 till 12, 2021.
From 01-01-2021 till 12-01-2021.
From January 2021 01-01-2021 till December 2021 12-01-2021.
That inconsistency is worth fixing. It gives the model multiple competing surface patterns.
However, once you standardized to something like:
From 2021-01 to 2021-12
and the model still invented wrong or invalid dates, that strongly suggests that date format was not the main bottleneck.
There are two separate tasks here:
Task A:
Normalize "January 2021" to "2021-01".
Task B:
Know that <person> worked for <company> from 2021-01 to 2021-12.
Task A is date normalization. Fine-tuning can handle that.
Task B is factual recall. Fine-tuning is much less reliable for that.
The model may learn:
When asked about employment dates, answer with:
"From YYYY-MM to YYYY-MM."
but fail to learn:
<person> + <company> = 2021-01 to 2021-12
That distinction is the core of the problem.
3. This is probably a closed-book factual-recall problem
Your examples have this structure:
User:
When did XYZ work for ABC?
Assistant:
From 2021-01 to 2021-12.
At inference time, the model sees only:
When did XYZ work for ABC?
It does not see the employment record. Therefore, it must recover the fact from model weights:
XYZ + ABC -> 2021-01 to 2021-12
That is closed-book factual recall.
Closed-book means the answer is not provided in the prompt, not retrieved from an external source, and not looked up from a database visible to the model. The model must answer from memory.
This is fragile for exact dates because dates are high-entropy factual values. A model can often learn that an answer should look like a date range, but the exact dates are not predictable from the words XYZ and ABC.
This matches existing research:
- Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations? reports that LLMs struggle to acquire new factual knowledge through fine-tuning and that fine-tuning new knowledge can increase hallucination tendency.
- Fine-Tuning or Retrieval? Comparing Knowledge Injection in LLMs finds that RAG consistently outperforms unsupervised fine-tuning on knowledge-intensive tasks, including entirely new knowledge.
- Fine Tuning vs. Retrieval Augmented Generation for Less Popular Knowledge is especially relevant if your facts are private, rare, internal, or low-frequency.
In other words, you may not be debugging a date-string issue. You may be seeing the limits of using SFT as a factual database.
4. input_ids matching the dataset is not enough
Checking input_ids is useful. It tells you that tokenization did not completely destroy the date string.
But it does not prove the model is learning those dates.
In causal LM SFT, the answer text often appears in input_ids. That is normal. The real question is whether the answer tokens also appear in labels where loss is applied.
Many SFT pipelines mask tokens by setting labels to -100. Tokens with label -100 are ignored by the loss.
So this can happen:
input_ids:
... From 2021-01 to 2021-12 ...
labels:
... -100 -100 -100 -100 ...
In that case, the date is visible in the batch, but the model is not trained to generate it.
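To make the masking effect concrete, here is a simplified pure-Python illustration. It is not TRL's actual collator; build_labels, supervised_ids, and the token ids are made up for this sketch. The only real convention it relies on is that label -100 is ignored by the loss.

```python
IGNORE_INDEX = -100  # label value that the loss skips

def build_labels(input_ids, assistant_mask):
    """Keep the token id where the mask is 1, ignore every other position."""
    return [tok if keep else IGNORE_INDEX
            for tok, keep in zip(input_ids, assistant_mask)]

def supervised_ids(labels):
    """Return only the token ids that would contribute to the loss."""
    return [tok for tok in labels if tok != IGNORE_INDEX]

# Made-up ids: three prompt tokens followed by three answer tokens.
input_ids = [11, 12, 13, 21, 22, 23]
good_mask = [0, 0, 0, 1, 1, 1]   # answer tokens are supervised
bad_mask = [0, 0, 0, 0, 0, 0]    # a broken mask supervises nothing

print(supervised_ids(build_labels(input_ids, good_mask)))  # [21, 22, 23]
print(supervised_ids(build_labels(input_ids, bad_mask)))   # []
```

In both cases input_ids is identical, which is why inspecting input_ids alone cannot reveal the problem.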
This is particularly important when using:
- assistant_only_loss
- completion_only_loss
- DataCollatorForCompletionOnlyLM
- packing
- custom chat templates
- Qwen/Gemma/Llama-specific templates
See:
- TRL SFTTrainer docs
- TRL chat templates docs
- TRL issue on generation markers for assistant-only loss
- Transformers chat templates docs
Run this check before training:
batch = next(iter(trainer.get_train_dataloader()))
input_ids = batch["input_ids"][0]
labels = batch["labels"][0]
print("FULL INPUT")
print(tokenizer.decode(input_ids, skip_special_tokens=False))
print("\nSUPERVISED TOKENS ONLY")
supervised_ids = labels[labels != -100]
print(tokenizer.decode(supervised_ids, skip_special_tokens=False))
You want to see the assistant answer in the supervised region:
From 2021-01 to 2021-12.
or, if using JSON:
{"start_date":"2021-01","end_date":"2021-12"}
If the date is not present in labels[labels != -100], the model is not being trained to generate that date.
5. Chat templates are another likely cause
Qwen, Gemma, and Llama-style instruction models do not all expect the same prompt format.
A raw text format like this:
user: When did XYZ work for ABC.
Assistant: From 2021-01 to 2021-12.
may not match the actual chat template expected by the model.
Hugging Face’s chat template documentation explains that chat models use model-specific control tokens. The same user/assistant conversation can be rendered differently for different model families.
So training examples should usually be represented structurally:
messages = [
{"role": "user", "content": "When did XYZ work for ABC?"},
{"role": "assistant", "content": "From 2021-01 to 2021-12."},
]
Then render them using the model tokenizer:
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=False,
)
print(text)
At inference time:
messages = [
{"role": "user", "content": "When did XYZ work for ABC?"},
]
prompt = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
)
print(prompt)
Training and inference must be structurally compatible.
A common failure is:
Training:
manual "user:" / "Assistant:" strings
Inference:
tokenizer.apply_chat_template(...)
or:
Training:
one model family's chat template
Inference:
another model family's chat template
This can produce poor generations even when the data itself is correct.
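One cheap way to catch this kind of mismatch is to check that the inference prompt is a prefix of the rendered training example. The sketch below assumes a Hugging Face tokenizer with a chat template; the helper name inference_prompt_is_prefix is hypothetical. Some templates legitimately insert extra content before the assistant reply, so a False result is a reason to inspect the two strings, not proof of a bug.

```python
def inference_prompt_is_prefix(tokenizer, user_text: str, answer_text: str) -> bool:
    """Check that training text starts with exactly what the model sees at inference."""
    train_messages = [
        {"role": "user", "content": user_text},
        {"role": "assistant", "content": answer_text},
    ]
    # What the model is trained on: the full conversation.
    rendered_train = tokenizer.apply_chat_template(
        train_messages, tokenize=False, add_generation_prompt=False
    )
    # What the model sees at inference: user turn plus the generation prompt.
    rendered_infer = tokenizer.apply_chat_template(
        train_messages[:1], tokenize=False, add_generation_prompt=True
    )
    return rendered_train.startswith(rendered_infer)
```

Usage would be something like inference_prompt_is_prefix(tokenizer, "When did XYZ work for ABC?", "From 2021-01 to 2021-12.") with the same tokenizer you load for training.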
6. Assistant-only loss and generation markers can matter
If you use assistant_only_loss=True, the trainer needs to know which tokens belong to the assistant response.
That often depends on the chat template producing an assistant-token mask. TRL documents that SFT with assistant-only loss requires {% generation %} and {% endgeneration %} markers around assistant output so that the loss mask can target only assistant tokens.
Useful links:
- TRL SFTTrainer docs
- TRL chat templates docs
- TRL issue #5471: generation markers for common model families
For Qwen, Gemma, or Llama models, verify:
Does the tokenizer chat template support assistant masks?
Does the trainer actually supervise assistant tokens?
Does the supervised region include the date?
Does this still work if packing is enabled?
For debugging, temporarily disable packing:
packing = False
Packing is useful later, but it makes it harder to inspect example boundaries, EOS behavior, truncation, and loss masks.
7. Fine-tuning alone can work only in narrow conditions
Fine-tuning alone may work if all of these are true:
- The number of facts is small.
- The facts are static.
- Each fact appears many times.
- You use many paraphrases per fact.
- The questions are highly repetitive.
- The use case can tolerate occasional wrong answers.
- The deployment is low-risk.
- You evaluate exact-match behavior carefully.
Example of a narrow case where no-RAG SFT might be acceptable:
20 fictional people
20 fictional companies
many paraphrases per fact
internal demo
low consequence if an answer is occasionally wrong
But if you have:
- hundreds or thousands of people,
- many companies,
- similar names,
- changing records,
- private facts,
- audit requirements,
- exact-date requirements,
then fine-tuning alone is the wrong default.
For exact employment dates, the model should not be the only database.
8. Better approach: structured lookup or RAG
For employment dates, store the facts outside the model.
A record should look something like this:
{
"record_id": "employment_000123",
"person_id": "person_xyz",
"person_name": "XYZ",
"employer_id": "org_abc",
"employer_name": "ABC",
"start_date": "2021-01",
"start_precision": "month",
"end_date": "2021-12",
"end_precision": "month",
"is_current": false
}
Then, at inference time:
- Parse the question.
- Resolve the person and employer.
- Retrieve the matching employment record.
- Give the record to the model.
- Ask the model to answer using only that record.
- Validate the output.
This changes the task from:
The model must remember the date.
to:
The model must read the date from the retrieved record and format it correctly.
That is a much better use of an LLM.
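The inference-time steps above can be sketched roughly as follows. Everything here is illustrative: the in-memory RECORDS store, the alias tables, and the retrieve and build_prompt helpers are hypothetical stand-ins, and question parsing is skipped by passing the names in directly.

```python
import json

# Hypothetical in-memory store keyed by (person_id, employer_id).
RECORDS = {
    ("person_xyz", "org_abc"): {
        "record_id": "employment_000123",
        "person_name": "XYZ",
        "employer_name": "ABC",
        "start_date": "2021-01",
        "start_precision": "month",
        "end_date": "2021-12",
        "end_precision": "month",
    },
}

# Hypothetical alias tables standing in for real entity resolution.
PERSON_IDS = {"xyz": "person_xyz"}
EMPLOYER_IDS = {"abc": "org_abc"}

def retrieve(person_name: str, employer_name: str) -> list:
    """Resolve both entities, then return the matching record or an empty list."""
    key = (PERSON_IDS.get(person_name.lower()),
           EMPLOYER_IDS.get(employer_name.lower()))
    record = RECORDS.get(key)
    return [record] if record else []

def build_prompt(question: str, records: list) -> str:
    """Put the retrieved records in front of the model together with the question."""
    return (
        "Use only the provided employment records. Return JSON only. "
        "Do not invent dates.\n\n"
        f"Records:\n{json.dumps(records)}\n\n"
        f"Question: {question}"
    )

records = retrieve("XYZ", "ABC")
print(build_prompt("When did XYZ work for ABC?", records))
```

If retrieval returns an empty list, the prompt still gets built with Records: [], which is exactly the case the model should answer with a null result.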
See:
- Hugging Face RAG docs
- Hugging Face Advanced RAG cookbook
- Code a simple RAG from scratch
If your data is already structured, start with structured lookup, not pure vector search.
For example:
SELECT start_date, end_date
FROM employment_records
WHERE person_id = 'person_xyz'
AND employer_id = 'org_abc';
Pure vector RAG is useful for unstructured text. But if you already have person IDs, employer IDs, start dates, and end dates, structured lookup is more reliable.
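For reference, here is a minimal sketch of that lookup using Python's built-in sqlite3 module with an in-memory database. The table and column names follow the SQL above; the lookup_dates helper and the sample row are assumptions for illustration. Parameterized queries are used so that names from user questions never get spliced into the SQL string.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employment_records ("
    "person_id TEXT, employer_id TEXT, start_date TEXT, end_date TEXT)"
)
conn.execute(
    "INSERT INTO employment_records VALUES (?, ?, ?, ?)",
    ("person_xyz", "org_abc", "2021-01", "2021-12"),
)

def lookup_dates(conn, person_id: str, employer_id: str) -> list:
    """Return all (start_date, end_date) rows for an exact person/employer pair."""
    return conn.execute(
        "SELECT start_date, end_date FROM employment_records "
        "WHERE person_id = ? AND employer_id = ?",
        (person_id, employer_id),
    ).fetchall()

print(lookup_dates(conn, "person_xyz", "org_abc"))  # [('2021-01', '2021-12')]
print(lookup_dates(conn, "person_xyz", "org_abd"))  # []
```

An empty result is an unambiguous "no record" signal, which is something closed-book generation cannot give you.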
9. What SFT should do in the better design
SFT is still useful. It is just useful for a different job.
Do not primarily fine-tune the model to memorize this:
XYZ worked for ABC from 2021-01 to 2021-12.
Instead, fine-tune it to do this:
Given retrieved records + question -> return the correct structured answer using only the records.
A better SFT example:
{
"messages": [
{
"role": "system",
"content": "Use only the provided employment records. Return JSON only. Do not invent dates."
},
{
"role": "user",
"content": "Records:\n[{\"person\":\"XYZ\",\"employer\":\"ABC\",\"start\":\"January 2021\",\"end\":\"December 2021\"}]\n\nQuestion: When did XYZ work for ABC?"
},
{
"role": "assistant",
"content": "{\"record_found\":true,\"start_date\":\"2021-01\",\"start_precision\":\"month\",\"end_date\":\"2021-12\",\"end_precision\":\"month\"}"
}
]
}
This teaches:
- use context,
- normalize dates,
- preserve precision,
- return schema-compatible JSON,
- do not invent dates,
- abstain when no record exists.
That is a much better SFT target than closed-book memorization.
10. Add no-answer and distractor examples
A common mistake is training only examples where every question has an answer.
That teaches the model:
Always produce a date.
Then when it does not know, it still invents one.
You need examples where the correct output is null.
Example:
{
"messages": [
{
"role": "system",
"content": "Use only the provided employment records. Return JSON only."
},
{
"role": "user",
"content": "Records:\n[]\n\nQuestion: When did XYZ work for ABC?"
},
{
"role": "assistant",
"content": "{\"record_found\":false,\"start_date\":null,\"end_date\":null,\"reason\":\"No matching employment record was provided.\"}"
}
]
}
You also need distractor examples:
{
"records": [
{
"person": "XYZ",
"employer": "ABD",
"start_date": "2021-01",
"end_date": "2021-12"
},
{
"person": "XYX",
"employer": "ABC",
"start_date": "2020-01",
"end_date": "2020-12"
}
],
"question": "When did XYZ work for ABC?",
"answer": {
"record_found": false,
"start_date": null,
"end_date": null
}
}
This teaches the model not to combine the right person from one record with the right company from another.
That matters a lot for employment data, where names and organizations can be similar.
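Writing these examples by hand is error-prone, so it may help to generate them from the records themselves. The following is only a sketch: make_example, the SYSTEM string, and the sample records are assumptions for illustration. The key idea is that the assistant answer is always derived from the records in the prompt, so answer, no-answer, and distractor cases come from the same code path.

```python
import json

SYSTEM = "Use only the provided employment records. Return JSON only. Do not invent dates."

def make_example(records: list, person: str, employer: str) -> dict:
    """Build one chat-format SFT example whose answer is derived from the records."""
    # Require person AND employer to match within a single record.
    match = next(
        (r for r in records if r["person"] == person and r["employer"] == employer),
        None,
    )
    if match:
        answer = {"record_found": True,
                  "start_date": match["start_date"], "end_date": match["end_date"]}
    else:
        answer = {"record_found": False, "start_date": None, "end_date": None}
    user = (f"Records:\n{json.dumps(records)}\n\n"
            f"Question: When did {person} work for {employer}?")
    return {"messages": [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": user},
        {"role": "assistant", "content": json.dumps(answer)},
    ]}

distractors = [
    {"person": "XYZ", "employer": "ABD", "start_date": "2021-01", "end_date": "2021-12"},
    {"person": "XYX", "employer": "ABC", "start_date": "2020-01", "end_date": "2020-12"},
]
target = {"person": "XYZ", "employer": "ABC", "start_date": "2019-03", "end_date": "2019-09"}

examples = [
    make_example(distractors + [target], "XYZ", "ABC"),  # answer among distractors
    make_example(distractors, "XYZ", "ABC"),             # distractors only
    make_example([], "XYZ", "ABC"),                      # no records at all
]
```

Mixing all three kinds in the training set is what gives the model a reason to read the records instead of always emitting a date.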
11. Invalid dates need validation or constrained decoding
Even after SFT, the model can still produce:
2021-13
2021-00
2021-02-31
2021/01
January 2021 2021-01
For machine-readable outputs, prompting is not enough.
Use one or more of:
- JSON Schema,
- regex-constrained decoding,
- grammar-constrained decoding,
- post-generation validation,
- retry on invalid output.
Useful links:
- vLLM structured outputs
- Outlines structured generation
- LM Format Enforcer
- llguidance
For month-level dates, validate with:
import re
MONTH_RE = re.compile(r"^\d{4}-(0[1-9]|1[0-2])$")
def valid_month(value: str) -> bool:
    return bool(MONTH_RE.fullmatch(value))
For exact ISO dates:
from datetime import date
def valid_iso_date(value: str) -> bool:
    try:
        date.fromisoformat(value)
        return True
    except ValueError:
        return False
Also validate date ordering:
def month_key(value: str) -> tuple[int, int]:
    year, month = value.split("-")
    return int(year), int(month)
assert month_key(start_date) <= month_key(end_date)
Important limitation:
Validation can reject invalid dates.
It cannot prove that a valid date is factually correct.
For factual correctness, you need the retrieved record or database.
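Putting the pieces together, a post-generation check on the model's JSON output could look roughly like this. It is a sketch under assumptions: check_output is a hypothetical helper, it only handles month-level dates, and it expects the record_found / start_date / end_date keys used in the examples here. A non-empty result is what you would use to trigger a retry or a rejection.

```python
import json
import re

MONTH_RE = re.compile(r"\d{4}-(0[1-9]|1[0-2])")

def check_output(raw: str) -> list[str]:
    """Return a list of problems; an empty list means the output passed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["malformed JSON"]
    if not isinstance(data, dict):
        return ["not a JSON object"]
    # Abstentions must not carry dates.
    if data.get("record_found") is False:
        if data.get("start_date") is not None or data.get("end_date") is not None:
            return ["dates given although record_found is false"]
        return []
    problems = []
    for key in ("start_date", "end_date"):
        value = data.get(key)
        if not isinstance(value, str) or not MONTH_RE.fullmatch(value):
            problems.append(f"invalid {key}: {value!r}")
    # Zero-padded YYYY-MM strings sort correctly as plain strings.
    if not problems and data["start_date"] > data["end_date"]:
        problems.append("start_date is after end_date")
    return problems

print(check_output('{"record_found": true, "start_date": "2021-13", "end_date": "2021-12"}'))
```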
12. Evaluation should be exact, not semantic
For this task, do not evaluate only by reading a few answers manually.
Use exact metrics:
- start_date exact match
- end_date exact match
- start_precision exact match
- end_precision exact match
- record_id exact match
- invalid date rate
- malformed JSON rate
- false answer rate when no record exists
- wrong-person rate
- wrong-employer rate
If you use RAG, evaluate retrieval separately from generation.
Useful RAG evaluation links:
- Ragas context precision
- Ragas faithfulness
- Hugging Face RAG evaluation cookbook
For this task:
Context precision:
Did retrieval find the correct employment record?
Faithfulness:
Did the model answer using only the retrieved record?
Exact match:
Did the final start_date and end_date match the expected dates?
All three matter.
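For the exact-match part, a scoring function can stay very small. This sketch assumes predictions and references are already parsed into dicts with the record_found, start_date, and end_date keys used earlier, and that the reference list is non-empty; the score function and its metric names are my own for illustration.

```python
def score(predictions: list, references: list) -> dict:
    """Compute exact date match and the false-answer rate on no-record cases."""
    exact = 0
    false_answers = 0
    no_record_cases = 0
    for pred, ref in zip(predictions, references):
        # Both dates must match exactly; a null reference must be matched by null.
        if (pred.get("start_date") == ref["start_date"]
                and pred.get("end_date") == ref["end_date"]):
            exact += 1
        if not ref["record_found"]:
            no_record_cases += 1
            # Any date produced when no record exists counts as a false answer.
            if pred.get("start_date") is not None or pred.get("end_date") is not None:
                false_answers += 1
    return {
        "exact_match": exact / len(references),
        "false_answer_rate": false_answers / no_record_cases if no_record_cases else 0.0,
    }

references = [
    {"record_found": True, "start_date": "2021-01", "end_date": "2021-12"},
    {"record_found": False, "start_date": None, "end_date": None},
]
predictions = [
    {"record_found": True, "start_date": "2021-01", "end_date": "2021-12"},
    {"record_found": True, "start_date": "2020-05", "end_date": "2020-06"},
]
print(score(predictions, references))
```

Here the second prediction invents dates for a question with no record, so it lowers exact match and counts as a false answer.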
13. Debug checklist for the current SFT setup
Check 1: print the rendered chat template
messages = [
{"role": "user", "content": "When did XYZ work for ABC?"},
{"role": "assistant", "content": "From 2021-01 to 2021-12."},
]
rendered = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=False,
)
print(rendered)
Verify:
- Is this the model’s expected template?
- Does it include the assistant answer?
- Does it include duplicated special tokens?
- Does it match the inference template?
Check 2: print supervised tokens
batch = next(iter(trainer.get_train_dataloader()))
input_ids = batch["input_ids"][0]
labels = batch["labels"][0]
print("FULL INPUT")
print(tokenizer.decode(input_ids, skip_special_tokens=False))
print("\nSUPERVISED TOKENS ONLY")
print(tokenizer.decode(labels[labels != -100], skip_special_tokens=False))
Verify that the supervised tokens include:
From 2021-01 to 2021-12.
or the expected JSON answer.
Check 3: count supervised tokens
num_total = labels.numel()
num_supervised = (labels != -100).sum().item()
print("total tokens:", num_total)
print("supervised tokens:", num_supervised)
print("supervised ratio:", num_supervised / num_total)
Red flags:
- supervised tokens = 0
- only special tokens are supervised
- prompt tokens are supervised but answer tokens are ignored
- answer date is missing from supervised tokens
Check 4: disable packing temporarily
packing = False
Inspect one example at a time.
Packing is useful later, but it makes debugging masks, boundaries, EOS, and truncation harder.
Check 5: run a tiny overfit test
Create one synthetic example with unusual dates:
Person_AAA worked for Company_BBB from 2091-03 to 2092-07.
Train on that one example.
Ask:
When did Person_AAA work for Company_BBB?
Expected answer:
From 2091-03 to 2092-07.
If the model cannot memorize one example, the problem is probably not date format. It is likely one of:
- labels,
- masking,
- chat template,
- adapter loading,
- learning rate,
- LoRA target modules,
- EOS/truncation,
- inference prompt mismatch.
Check 6: run a ten-example overfit test
Train on 10 synthetic examples with unusual dates.
Evaluate on the same exact prompts.
If it fails, your training pipeline is probably broken.
If it succeeds on exact prompts but fails on paraphrases, the model memorized the prompt surface rather than robustly learning entity-date associations.
Check 7: run a context-present test
Compare these two prompts.
Closed-book:
When did XYZ work for ABC?
Context-present:
Record:
XYZ worked for ABC from 2021-01 to 2021-12.
Question:
When did XYZ work for ABC?
If context-present works and closed-book fails, the solution is retrieval or structured lookup, not more date-format changes.
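A small harness makes this comparison repeatable. In the sketch below, generate is a placeholder for whatever function calls your fine-tuned model and returns its reply as a string; compare_modes and the case keys (question, record, expected) are hypothetical names for this example, and the substring check is deliberately crude.

```python
def compare_modes(generate, cases: list) -> dict:
    """Score the same questions with and without the record in the prompt.

    generate: any callable mapping a prompt string to the model's reply string.
    """
    hits = {"closed_book": 0, "context_present": 0}
    for case in cases:
        closed = case["question"]
        grounded = f"Record:\n{case['record']}\nQuestion:\n{case['question']}"
        if case["expected"] in generate(closed):
            hits["closed_book"] += 1
        if case["expected"] in generate(grounded):
            hits["context_present"] += 1
    return {mode: count / len(cases) for mode, count in hits.items()}

cases = [
    {
        "question": "When did XYZ work for ABC?",
        "record": "XYZ worked for ABC from 2021-01 to 2021-12.",
        "expected": "From 2021-01 to 2021-12.",
    },
]
```

A large gap between the two scores points at missing knowledge rather than at formatting.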
14. Recommended target format
Instead of this as the main target:
user: When did XYZ work for ABC.
assistant: From 2021-01 to 2021-12.
use a context-grounded format:
{
"messages": [
{
"role": "system",
"content": "Answer employment-date questions using only the provided records. Return JSON only. Do not invent dates."
},
{
"role": "user",
"content": "Records:\n[{\"record_id\":\"employment_000123\",\"person\":\"XYZ\",\"employer\":\"ABC\",\"start_date\":\"2021-01\",\"start_precision\":\"month\",\"end_date\":\"2021-12\",\"end_precision\":\"month\"}]\n\nQuestion: When did XYZ work for ABC?"
},
{
"role": "assistant",
"content": "{\"record_found\":true,\"record_id\":\"employment_000123\",\"start_date\":\"2021-01\",\"start_precision\":\"month\",\"end_date\":\"2021-12\",\"end_precision\":\"month\"}"
}
]
}
For a missing record:
{
"messages": [
{
"role": "system",
"content": "Answer employment-date questions using only the provided records. Return JSON only. Do not invent dates."
},
{
"role": "user",
"content": "Records:\n[]\n\nQuestion: When did XYZ work for ABC?"
},
{
"role": "assistant",
"content": "{\"record_found\":false,\"record_id\":null,\"start_date\":null,\"end_date\":null,\"reason\":\"No matching employment record was provided.\"}"
}
]
}
This trains the right behavior:
- use context,
- do not invent,
- return structured output,
- preserve precision,
- abstain when the record is absent.
It does not try to turn the model into the only source of truth.
15. Can fine-tuning alone solve it?
In theory
Yes, sometimes.
Fine-tuning alone can memorize a small number of static facts if:
- the dataset is small,
- facts are repeated many times,
- the same facts appear in many paraphrases,
- the evaluation questions are similar to training questions,
- occasional errors are acceptable.
In this case
Probably not reliably.
The symptoms suggest that the model is either:
- not actually being supervised on the answer tokens,
- not receiving the correct chat template,
- learning the answer pattern but not the date mapping,
- or being asked to do a task that should use retrieval or structured lookup.
Even if the SFT pipeline is fixed, fine-tuning alone is still not the best design if exact dates matter.
For employment dates, I would not trust model weights as the only source of truth.
16. Better architecture
I would use this design:
User question
↓
Entity extraction / normalization
↓
Person and employer resolution
↓
Structured lookup or RAG
↓
Retrieved employment record(s)
↓
LLM answers using only retrieved records
↓
Strict JSON output
↓
Date/schema validator
↓
Final natural-language answer
Example final prompt:
You answer employment-date questions using only the provided records.
Rules:
- Do not invent dates.
- If no exact matching record is provided, return null.
- Preserve date precision.
- Use YYYY-MM for month precision.
- Use YYYY-MM-DD for day precision.
- Return JSON only.
Records:
[
{
"record_id": "employment_000123",
"person_name": "XYZ",
"employer_name": "ABC",
"start_date": "2021-01",
"start_precision": "month",
"end_date": "2021-12",
"end_precision": "month"
}
]
Question:
When did XYZ work for ABC?
Expected output:
{
"record_found": true,
"record_id": "employment_000123",
"person_name": "XYZ",
"employer_name": "ABC",
"start_date": "2021-01",
"start_precision": "month",
"end_date": "2021-12",
"end_precision": "month",
"answer": "XYZ worked for ABC from 2021-01 to 2021-12."
}
This system is easier to debug because you can inspect:
- Was the right record retrieved?
- Was the right record passed to the model?
- Did the model copy the right dates?
- Did validation pass?
With fine-tuning alone, a wrong date is much harder to diagnose.
17. Final recommendation
Use this division of labor:
| Component | Responsibility |
|---|---|
| Structured database / RAG | Store and retrieve actual employment dates |
| SFT | Teach context use, date normalization, abstention, schema-following |
| Chat template | Ensure the model sees the correct conversation format |
| Labels/masks | Ensure assistant answer tokens receive loss |
| Validator | Reject invalid date strings |
| Evaluator | Measure exact date correctness |
Do not keep searching for a model-specific date format. Use a clean format, but move the factual burden out of the model weights.
The best answer is:
Use YYYY-MM for month-level employment periods.
Use YYYY-MM-DD only for real day-level dates.
Verify labels, not only input_ids.
Use model-specific chat templates.
Do not expect SFT alone to reliably memorize arbitrary exact dates.
Use structured lookup or RAG for the facts.
Use SFT for behavior.
Use validation or constrained decoding for date validity.
Short summary
- The date format should be consistent, preferably YYYY-MM for month-level data.
- If date-format cleanup did not fix the issue, the problem is likely not the date format.
- Seeing dates in input_ids does not prove the model is trained on them; inspect labels != -100.
- Qwen, Gemma, and Llama-style models need correct chat templates.
- Fine-tuning alone is weak for exact new/private factual knowledge.
- Use retrieval or structured lookup for actual employment dates.
- Fine-tune the model to use provided records, not to memorize all records.
- Add no-answer and distractor examples.
- Use JSON/schema/regex validation to prevent invalid dates.
- Evaluate with exact date match, not semantic similarity.