Is it possible to create a Résumé parser using a Huggingface model?
This reflects what I found as of June 2026. The answer changes quite a bit depending on whether you specifically need LayoutLMv3, or whether any model/tool is acceptable as long as the resume-parsing goal is met:
TL;DR
I would split this into two different tracks:
If you specifically need LayoutLMv3, the main problem is not just “which model?” or “which dataset?”. You need a dataset/pipeline with:
- page images or rendered PDF pages,
- OCR words,
- word-level bounding boxes,
- normalized LayoutLM-style boxes,
- word/token labels,
- reading order,
- and the same OCR/bbox preprocessing at training and inference time.
I did not find a clean public resume-specific dataset that is already “LayoutLMv3-ready” in that full sense.
If the goal is simply resume parsing / resume information extraction, then there are more practical resources now. I would look at:
- Alibaba SmartResume / Alibaba-EI/SmartResume
- NuExtract3
- PaddleOCR-VL / PaddleOCR-VL-1.6 collection
- sandeeppanem/resume-json-extraction-5k
- sandeeppanem/qwen3-0.6b-resume-json
- nimendraai/NuExtract-tiny-Resume-Data-Extractor
- oksomu/resume-ner
- amosify/resume-section-classifier-v1
- amosify/distilbert-resume-ner-v1
The most important practical distinction is:
LayoutLMv3 wants image + OCR words + boxes + labels. Many newer resume resources are text-to-JSON, NER, OCR, or document-parsing resources. Useful, but not the same training format.
Track A — If you must use LayoutLMv3
For LayoutLMv3, I would first check the expected input contract carefully.
Relevant docs:
- LayoutLMv3 docs
- LayoutLMv3 model card: microsoft/layoutlmv3-base
- HF forum: LayoutLMv3 for token classification
The important part is that LayoutLMv3 token classification is not just ordinary text NER. The processor/model path expects layout-aware inputs, usually something like:
{
  "image": "<page image>",
  "words": ["John", "Doe", "Software", "Engineer", "..."],
  "boxes": [[x0, y0, x1, y1], ...],
  "word_labels": ["B-NAME", "I-NAME", "B-TITLE", "I-TITLE", "..."]
}
The boxes should be word-level bounding boxes, normalized in the LayoutLM-style coordinate system, usually 0–1000 scale. The labels need to align with the OCR words, and then the tokenizer has to propagate word labels to subword tokens.
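As a minimal sketch of those two preprocessing steps, assuming pixel-space OCR boxes and integer label ids: `normalize_box` and `align_labels` are hypothetical helper names, not library functions, and `word_ids` stands for the per-token word index that fast tokenizers expose. As far as I know, the Transformers LayoutLMv3 processor performs this kind of label alignment itself when you pass word labels, so the second function is only meant to show what happens under the hood.

```python
def normalize_box(box, page_width, page_height):
    """Scale a pixel-space (x0, y0, x1, y1) box to the 0-1000 integer grid."""
    x0, y0, x1, y1 = box
    scaled = [
        int(1000 * x0 / page_width),
        int(1000 * y0 / page_height),
        int(1000 * x1 / page_width),
        int(1000 * y1 / page_height),
    ]
    # Clamp, because OCR boxes occasionally spill past the page edge.
    return [max(0, min(1000, value)) for value in scaled]


def align_labels(word_label_ids, word_ids, ignore_index=-100):
    """Give each word's label id to its first subword and mask the rest."""
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            # Special tokens (CLS/SEP/padding) carry no label.
            aligned.append(ignore_index)
        elif word_id != previous:
            # First subword of a new word keeps the word's label.
            aligned.append(word_label_ids[word_id])
        else:
            # Continuation subwords are ignored by the loss.
            aligned.append(ignore_index)
        previous = word_id
    return aligned
```

For example, a box of `(85, 110, 425, 220)` on an 850 x 1100 pixel page becomes `[100, 100, 500, 200]`. Whatever rounding and clamping you choose, the same function should be used at training and inference time.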
The forum thread above is worth reading because it points out a common failure mode: training with one kind of box/annotation setup, then using a different OCR/bbox setup at inference. That can break the model even if the training code looked fine.
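One way to guard against that mismatch is to record the preprocessing settings next to the trained model and compare them at inference. The sketch below is only an illustration of that idea: `PreprocessConfig`, its fields, and `check_same_preprocessing` are hypothetical names, and the example values are placeholders rather than recommended settings.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PreprocessConfig:
    """Hypothetical record of everything that shapes words and boxes."""
    ocr_engine: str
    ocr_version: str
    render_dpi: int
    box_scale: int = 1000
    max_tokens_per_chunk: int = 512

    def fingerprint(self) -> str:
        # Stable hash of the settings; store it alongside the model weights.
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def check_same_preprocessing(train_cfg, infer_cfg):
    """Fail loudly if inference preprocessing differs from training."""
    if train_cfg.fingerprint() != infer_cfg.fingerprint():
        raise ValueError(
            f"Preprocessing mismatch: trained with {train_cfg}, "
            f"running inference with {infer_cfg}"
        )
```

This does not prove the two pipelines behave identically, but it catches the common case where the OCR engine, version, or render resolution quietly changed.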
Practical LayoutLMv3 checklist
If I had to make the LayoutLMv3 route work, I would do something like this:
1. Choose one OCR engine and freeze it.
Examples: Tesseract, EasyOCR, PaddleOCR, or (for born-digital PDFs) pdfplumber-style PDF text extraction.
2. Convert each resume page into:
- page image
- OCR words
- word-level bounding boxes
- reading order
3. Annotate those OCR words/boxes.
Do not annotate totally separate hand-drawn boxes unless you can reproduce the same boxes at inference.
4. Convert annotations into BIO/BILOU labels.
Example labels:
- B-NAME / I-NAME
- B-EMAIL / I-EMAIL
- B-PHONE / I-PHONE
- B-COMPANY / I-COMPANY
- B-JOB_TITLE / I-JOB_TITLE
- B-DEGREE / I-DEGREE
- B-INSTITUTION / I-INSTITUTION
- B-SKILL / I-SKILL
- O
5. Normalize boxes exactly as LayoutLMv3 expects.
6. Keep OCR, ordering, box normalization, truncation, and page splitting identical at training and inference.
7. Start with a small gold evaluation set before scaling.
In other words, for LayoutLMv3 the dataset problem is really an annotation and preprocessing contract problem.
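Step 4 of the checklist can be sketched as follows, assuming annotations are stored as word-index spans over the OCR words. The `(start, end, field)` span format and the `spans_to_bio` helper are assumptions made for illustration, not a format any particular annotation tool guarantees.

```python
def spans_to_bio(num_words, spans):
    """Convert (start, end_exclusive, field) word spans into BIO labels."""
    labels = ["O"] * num_words
    for start, end, field in spans:
        if not 0 <= start < end <= num_words:
            raise ValueError(f"span out of range: {(start, end, field)}")
        if any(label != "O" for label in labels[start:end]):
            raise ValueError(f"overlapping span: {(start, end, field)}")
        # First word of the span gets B-, the remaining words get I-.
        labels[start] = f"B-{field}"
        for index in range(start + 1, end):
            labels[index] = f"I-{field}"
    return labels


words = ["John", "Doe", "Software", "Engineer", "at", "Acme"]
spans = [(0, 2, "NAME"), (2, 4, "JOB_TITLE"), (5, 6, "COMPANY")]
print(list(zip(words, spans_to_bio(len(words), spans))))
```

Rejecting overlapping spans early is a deliberate choice here: plain BIO tagging cannot represent nested entities, so overlaps usually signal an annotation error.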
Why ordinary resume text datasets are not enough for LayoutLMv3
A dataset like:
{
  "text": "John Doe\nSoftware Engineer\n...",
  "json": {
    "name": "John Doe",
    "title": "Software Engineer"
  }
}
can be useful for text-to-JSON models or LLM fine-tuning, but it is not directly enough for LayoutLMv3 token classification because it is missing:
- page image,
- OCR words,
- word bounding boxes,
- word-level labels,
- reading order,
- box normalization,
- page-level segmentation.
You might still use a text-to-JSON model to create weak labels. For example:
resume PDF
→ OCR words + boxes
→ text-to-JSON extractor
→ extracted field values
→ string-match field values back to OCR words
→ weak BIO labels
→ human correction
→ LayoutLMv3 fine-tuning
But I would treat that as a weak-labeling/bootstrap approach, not as a clean substitute for a real gold dataset.
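The string-matching step in that flow could look roughly like this. It is a sketch under simple assumptions: exact token-sequence matching after light normalization, first match wins, and anything that cannot be matched is returned for human review. `weak_bio_labels` and `normalize` are hypothetical helpers.

```python
def normalize(token):
    """Lowercase and strip surrounding punctuation so OCR noise matters less."""
    return token.strip(".,;:()[]|").lower()


def weak_bio_labels(ocr_words, fields):
    """Project extracted field values back onto OCR words as weak BIO labels.

    fields maps a label name (e.g. "NAME") to the extracted string value.
    Returns (labels, unmatched_field_names).
    """
    norm_words = [normalize(word) for word in ocr_words]
    labels = ["O"] * len(ocr_words)
    unmatched = []
    for field, value in fields.items():
        target = [t for t in (normalize(tok) for tok in value.split()) if t]
        size = len(target)
        hit = None
        if size:
            for start in range(len(norm_words) - size + 1):
                window_free = all(l == "O" for l in labels[start:start + size])
                if norm_words[start:start + size] == target and window_free:
                    hit = start
                    break
        if hit is None:
            # Paraphrased or hallucinated values end up here for review.
            unmatched.append(field)
            continue
        labels[hit] = f"B-{field}"
        for index in range(hit + 1, hit + size):
            labels[index] = f"I-{field}"
    return labels, unmatched
```

The `unmatched` list is the useful part: values the extractor reworded or invented will not align with any OCR words, which is exactly where manual correction is needed.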
A LayoutLMv3-adjacent resource
One interesting LayoutLMv3-related resume resource is:
- Smutypi3/applai-layoutlmv3
I would not treat it as a complete resume field parser, but it is relevant because it uses LayoutLMv3 on resume PDFs and discusses a resume-oriented preprocessing pattern. It may be useful as a reference if you want to see how someone handled PDF words, boxes, and LayoutLMv3-style representations in the resume domain.
Track B — If any model/tool is acceptable
If the goal is simply “parse resumes into structured fields”, I would probably not start with LayoutLMv3. I would start with a pipeline view:
PDF / DOCX / image resume
→ OCR / PDF parsing / layout reconstruction
→ clean text or Markdown with reading order
→ section routing
→ structured extraction
→ JSON validation
→ evaluation on a small hand-checked set
A resume parser is often not one model. It is a pipeline.
The hard part may be upstream: converting a visually complex resume PDF into faithful text/Markdown/layout before extracting fields.
Strongest current resource: SmartResume
I would look at SmartResume first:
- GitHub: alibaba/SmartResume
- HF model repo: Alibaba-EI/SmartResume
- Paper: Layout-Aware Parsing Meets Efficient LLMs
- HF Papers page
Why it matters:
SmartResume is very close to the actual problem. It is not just a generic NER model. It treats resume parsing as a layout-aware pipeline:
resume PDF / image / Office document
→ OCR + PDF metadata extraction
→ layout detection
→ reading order reconstruction
→ structured information extraction with a compact LLM
The paper is especially useful because it frames the problem correctly:
- resumes have diverse layouts,
- resumes often have multi-column structures,
- reading order matters,
- LLM-only extraction can be expensive or brittle,
- standardized resume extraction datasets/evaluation tools are limited.
This is probably the best “goal-first” starting point I found.
General structured extraction: NuExtract3
Another strong candidate is:
- numind/NuExtract3
This is not resume-specific, but it is very relevant. It is a vision-language document understanding model for structured extraction and image-to-Markdown conversion.
The useful pattern is:
input document + JSON template + optional instructions
→ structured JSON output
For a resume, the template might look like:
{
  "name": "verbatim-string",
  "email": "email",
  "phone": "verbatim-string",
  "location": "verbatim-string",
  "summary": "string",
  "skills": ["verbatim-string"],
  "education": [
    {
      "institution": "verbatim-string",
      "degree": "verbatim-string",
      "field": "verbatim-string",
      "start_date": "date-time",
      "end_date": "date-time"
    }
  ],
  "experience": [
    {
      "company": "verbatim-string",
      "title": "verbatim-string",
      "location": "verbatim-string",
      "start_date": "date-time",
      "end_date": "date-time",
      "responsibilities": ["string"],
      "achievements": ["string"]
    }
  ],
  "certifications": [
    {
      "name": "verbatim-string",
      "issuer": "verbatim-string",
      "date": "date-time"
    }
  ]
}
I would still evaluate it on real resumes. But as a modern structured-extraction route, it is very relevant.
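Whichever extractor you use, it helps to check that the returned JSON has the same shape as the template before looking at field quality. The following is a minimal structural check, independent of any model API; `schema_errors` is a hypothetical helper, and it assumes that `null` is an acceptable value for a field the model could not find.

```python
def schema_errors(template, output, path="$"):
    """List structural differences between a JSON template and an output."""
    errors = []
    if isinstance(template, dict):
        if not isinstance(output, dict):
            return [f"{path}: expected object"]
        for key in template:
            if key not in output:
                errors.append(f"{path}.{key}: missing")
            else:
                errors.extend(
                    schema_errors(template[key], output[key], f"{path}.{key}")
                )
        for key in output:
            if key not in template:
                errors.append(f"{path}.{key}: unexpected")
    elif isinstance(template, list):
        if not isinstance(output, list):
            return [f"{path}: expected array"]
        if template:
            # The template's single element describes every array item.
            for index, item in enumerate(output):
                errors.extend(
                    schema_errors(template[0], item, f"{path}[{index}]")
                )
    elif isinstance(output, (dict, list)):
        # Leaf slots hold a type name; the output should be a scalar or None.
        errors.append(f"{path}: expected scalar")
    return errors
```

An empty list means the output matches the template's structure. This only checks shape, not whether the values are correct, so it complements rather than replaces evaluation on real resumes.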
OCR / document parsing layer: PaddleOCR-VL, PaddleOCR 3.5, Docling, olmOCR
If your input is PDF/image resumes, I would also look at current OCR/document parsing tools. These are not resume parsers by themselves, but they may solve the most painful upstream step.
Useful resources:
- PaddleOCR-VL
- PaddleOCR-VL-1.6 collection
- PaddleOCR GitHub
- HF Blog: PaddleOCR 3.5 with Transformers backend
- Docling GitHub
- Docling docs
- Docling technical report
- olmOCR GitHub
- olmOCR paper
Why this matters:
A two-column resume, a sidebar resume, or a scanned resume can fail before the extraction model even sees the content correctly. If the OCR/layout step scrambles the reading order, a good NER or LLM extractor may still produce bad JSON.
So I would separate:
document parsing quality
from:
field extraction quality
They are related, but not the same problem.
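To make the reading-order failure concrete, here is a toy sketch on four words from a two-column page. It assumes boxes on the 0-1000 grid and a known column gutter at `x = 500`; both the fixed gutter and the helper names are simplifications for illustration, since real document parsers detect layout regions rather than hard-coding a split.

```python
def naive_order(words):
    """Sort purely top-to-bottom, then left-to-right (interleaves columns)."""
    ordered = sorted(words, key=lambda w: (w["box"][1], w["box"][0]))
    return [w["text"] for w in ordered]


def two_column_order(words, gutter_x=500):
    """Read the whole left column first, then the whole right column."""
    def center_x(word):
        return (word["box"][0] + word["box"][2]) / 2

    def top_left(word):
        return (word["box"][1], word["box"][0])

    left = [w for w in words if center_x(w) < gutter_x]
    right = [w for w in words if center_x(w) >= gutter_x]
    ordered = sorted(left, key=top_left) + sorted(right, key=top_left)
    return [w["text"] for w in ordered]


page = [
    {"text": "Skills", "box": [50, 100, 150, 120]},
    {"text": "Experience", "box": [550, 100, 700, 120]},
    {"text": "Python", "box": [50, 140, 150, 160]},
    {"text": "Acme", "box": [550, 140, 650, 160]},
]
print(naive_order(page))       # ['Skills', 'Experience', 'Python', 'Acme']
print(two_column_order(page))  # ['Skills', 'Python', 'Experience', 'Acme']
```

With the naive ordering, "Python" appears to belong under "Experience", which is the kind of error no downstream extractor can reliably undo.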
Resume text → JSON resources
If you can already get reasonably clean text from the resume, there are more direct resources.
Qwen3 resume JSON dataset/model
- sandeeppanem/resume-json-extraction-5k
- sandeeppanem/qwen3-0.6b-resume-json
- GitHub: sandeeppanem/qwen3-resume-extraction
This route is useful if your pipeline is:
PDF/DOCX/image
→ text extraction
→ raw resume text
→ structured JSON extractor
The dataset is raw resume text to structured JSON. The model is a Qwen3-0.6B LoRA adapter for resume JSON extraction.
Caveats:
- It is not a LayoutLMv3 dataset.
- It does not solve OCR/layout.
- The model repo contains the LoRA adapter, so the base model is also needed.
- Long resumes and unusual formats still need evaluation.
Small local resume extractor
- nimendraai/NuExtract-tiny-Resume-Data-Extractor
This is a resume/CV structured extraction model based on NuExtract-tiny / Qwen2.5-0.5B. It is useful if you want a small local route, especially for raw text to JSON.
Caveat: check the model card carefully. It is trained on synthetic resumes, so I would not trust it without testing on real resumes from your target distribution.
NER / section-routing route
If you want something more deterministic and easier to debug than “LLM returns JSON”, a section classifier + NER pipeline may be easier to control.
Useful resources:
- oksomu/resume-ner
- amosify/resume-section-classifier-v1
- amosify/distilbert-resume-ner-v1
- yashpwr/resume-ner-training-data
- yashpwr/resume-ner-bert-v2
- HF docs: token classification / NER
A practical version of this route could be:
OCR/PDF text chunks
→ classify chunks into sections:
contact / summary / experience / education / skills / certifications / projects / etc.
→ run section-aware NER
→ normalize dates, phone, email, skills, company names
→ group entities into experience[] and education[]
→ validate JSON
This is less glamorous than a single end-to-end model, but easier to debug.
For example:
- Contact fields can often be handled with regex + NER.
- Skills can use NER + skill dictionaries.
- Experience needs grouping: company, title, dates, bullet points.
- Education needs grouping: institution, degree, field, dates/GPA.
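The regex part of the contact-field bullet can be sketched like this. The patterns are deliberately simple and are assumptions for illustration: they will miss some valid emails and phone formats, and the 9 to 15 digit window is just a heuristic to avoid treating date ranges as phone numbers.

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Digits with optional +, spaces, dots, dashes, and parentheses on one line.
PHONE_RE = re.compile(r"\+?\d[\d ().-]{7,}\d")


def extract_contact(text):
    """Return the first plausible email and phone number found in the text."""
    email_match = EMAIL_RE.search(text)
    phone = None
    for match in PHONE_RE.finditer(text):
        digit_count = sum(ch.isdigit() for ch in match.group(0))
        # Reject candidates like "2019 - 2023" that are really date ranges.
        if 9 <= digit_count <= 15:
            phone = match.group(0)
            break
    return {
        "email": email_match.group(0) if email_match else None,
        "phone": phone,
    }


sample = "Jane Roe\njane.roe@example.com | +1 (555) 010-2030\n2019 - 2023 Acme"
print(extract_contact(sample))
```

In a section-routed pipeline this would only run on the chunk classified as contact information, which reduces false positives further.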
The important caveat with many resume NER models is that reported scores may come from internal or narrow test sets. I would always create a small hand-labeled evaluation set from your actual target resumes.
Resource table
| Resource | Type | Best for | Not for | Notes |
|---|---|---|---|---|
| LayoutLMv3 docs | Model docs | Understanding LayoutLMv3 input contract | Finding resume data | Essential for image/words/boxes/labels |
| LayoutLMv3 forum thread | Forum/debugging | OCR/bbox train-inference consistency | Turnkey solution | Very relevant practical warning |
| SmartResume | Resume-specific system | Full resume parsing pipeline | Pure LayoutLMv3 training | Strongest goal-first candidate |
| Alibaba-EI/SmartResume | Model repo | Weights/resources for SmartResume | General NER | Includes resume extraction/layout components |
| Layout-Aware Parsing Meets Efficient LLMs | Paper | Modern resume extraction architecture | Drop-in code alone | Useful framing and evaluation discussion |
| NuExtract3 | General document extraction VLM | Template-based JSON extraction | Resume-specific guarantee | Strong candidate if model choice is flexible |
| PaddleOCR-VL | OCR/document parsing | Upstream PDF/image parsing | Resume field extraction by itself | Strong document parsing candidate |
| PaddleOCR GitHub | OCR/document stack | OCR/layout/table/formula/chart extraction | Resume-specific schema | Good ingestion layer |
| Docling | Document parser | PDF/DOCX/image to structured text/Markdown/JSON | Resume labels | Useful preprocessing layer |
| olmOCR | PDF-to-text/Markdown OCR | Clean reading order / linearized text | Resume JSON fields | Useful before extraction |
| resume-json-extraction-5k | Dataset | Resume text → JSON SFT | LayoutLMv3 training | Directly relevant for text route |
| qwen3-0.6b-resume-json | LoRA adapter | Lightweight resume JSON extraction | OCR/layout | Needs base Qwen3 model |
| NuExtract-tiny Resume | Small local extractor | Local raw-text resume JSON extraction | Robust PDF layout | Synthetic-data caveat |
| oksomu/resume-ner | NER + postprocess | Deterministic entity extraction route | Full layout parsing | Detailed card; evaluate externally |
| amosify section classifier | Text classifier | Section routing | Field extraction alone | Useful middle layer |
| amosify resume NER | NER model | Section-aware NER | Nested JSON alone | Pair with section routing |
| resume-parsing model tag | HF model search | Discovering current resume models | Exhaustive coverage | Some models are not tagged consistently |
Suggested practical pipelines
Pipeline 1: LayoutLMv3-only route
Use this if LayoutLMv3 is required.
resume PDF/image
→ render pages
→ OCR with one fixed engine
→ words + word boxes + reading order
→ annotate OCR words/boxes
→ BIO labels
→ LayoutLMv3 token classification
→ field grouping + post-processing
→ JSON validation
This is the most faithful LayoutLMv3 route, but also the most annotation-heavy.
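The "field grouping + post-processing" step starts by turning per-word BIO predictions back into field values. A minimal sketch, assuming word-level labels have already been recovered from the token-level predictions; `decode_bio` is a hypothetical helper, and it treats a stray `I-` tag with no matching `B-` as the start of a new entity.

```python
def decode_bio(words, labels):
    """Merge word-level BIO labels into (field, text) entities."""
    entities = []
    field, buffer = None, []
    for word, label in zip(words, labels):
        prefix, _, name = label.partition("-")
        if prefix == "I" and name == field:
            # Continuation of the entity currently being built.
            buffer.append(word)
            continue
        if field is not None:
            entities.append((field, " ".join(buffer)))
        if prefix in ("B", "I"):
            field, buffer = name, [word]
        else:
            field, buffer = None, []
    if field is not None:
        entities.append((field, " ".join(buffer)))
    return entities


words = ["John", "Doe", "Software", "Engineer", "at", "Acme"]
labels = ["B-NAME", "I-NAME", "B-JOB_TITLE", "I-JOB_TITLE", "O", "B-COMPANY"]
print(decode_bio(words, labels))
```

Grouping these flat entities into `experience[]` and `education[]` items still needs layout or proximity rules on top, which is where most of the post-processing effort tends to go.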
Pipeline 2: Modern goal-first route
Use this if the goal is just accurate resume parsing.
resume PDF/DOCX/image
→ Docling / PaddleOCR / olmOCR / SmartResume-style parsing
→ Markdown or layout-preserving text
→ NuExtract3 / SmartResume / Qwen3-resume-json / NER pipeline
→ structured JSON
→ validation and evaluation
This is probably the more practical route for most projects.
Pipeline 3: Hybrid route
Use this if you want to eventually train LayoutLMv3, but need a bootstrap path.
resume PDF/image
→ OCR words + boxes
→ text-to-JSON or NER extractor
→ map extracted field values back to OCR words
→ create weak BIO labels
→ manually correct a subset
→ train LayoutLMv3
→ evaluate against gold set
This can reduce annotation cost, but weak labels can be noisy.
Evaluation checklist
For resume parsing, I would not evaluate only “does it produce JSON?”. I would check:
| Aspect | Example |
|---|---|
| JSON validity | Does it always return parseable JSON? |
| Schema compliance | Does it follow the target schema exactly? |
| Field exact match | email, phone, URLs |
| Normalized match | dates, locations, company names |
| Semantic match | job titles, degree names, responsibilities |
| Array alignment | does each title match the correct company/date range? |
| Omission | did it miss an experience item or degree? |
| Hallucination | did it invent a company, skill, date, or degree? |
| Layout robustness | two-column resumes, sidebars, scanned PDFs |
| Long-document handling | multi-page resumes, truncation, repeated headers |
| Privacy handling | PII, consent, local processing, data retention |
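A few of these checks are easy to automate. The sketch below covers JSON validity plus omission and hallucination for a list-valued field such as skills. The helper names are hypothetical, matching is a plain case-insensitive comparison, and the "not in source" check is only a rough hallucination signal because a correct value can be legitimately normalized away from the source wording.

```python
import json


def parse_json_or_none(raw):
    """Return the parsed object, or None if the model output is not valid JSON."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return None


def list_field_report(predicted, gold, source_text):
    """Compare a predicted list field against gold labels and the source text."""
    pred = {item.strip().lower() for item in predicted}
    ref = {item.strip().lower() for item in gold}
    source = source_text.lower()
    return {
        "omitted": sorted(ref - pred),        # in gold, missing from prediction
        "extra": sorted(pred - ref),          # predicted, not in gold
        "not_in_source": sorted(p for p in pred if p not in source),
    }


raw_output = '{"email": "jane@example.com", "skills": ["Python", "SQL", "Kubernetes"]}'
parsed = parse_json_or_none(raw_output)
report = list_field_report(
    parsed["skills"],
    gold=["Python", "SQL", "Docker"],
    source_text="Skills: Python, SQL, Docker",
)
print(report)
```

Run over a small hand-checked set, counts like these make it visible whether a model's errors are mostly omissions or mostly inventions, which JSON validity alone does not show.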
For structured extraction evaluation ideas, you can also look at:
- ExtractBench
- ParseBench dataset
- OmniDocBench
These are not resume-specific, but they are useful for thinking about PDF-to-JSON and document parsing evaluation.
How I would keep searching
I would not rely only on the resume-parsing tag. Some newer models are weakly tagged or not tagged consistently.
Search across:
Hugging Face Models
- resume parser
- resume json
- cv parser
- resume ner
- qwen resume parser
- deepseek resume parser
- resume-parsing tag
Hugging Face Spaces
Search for resume parser demos, but inspect the implementation:
- app.py
- requirements.txt
- model calls
- OCR/PDF parsing method
- whether it handles PDF, DOCX, image, or only raw text
- whether there is any evaluation
Spaces are useful for implementation patterns, but I would not treat a demo as evidence of model quality.
Blogs / Papers / Posts
Also watch document parsing and OCR releases, not just resume-specific models.
Useful entry points:
- Hugging Face Papers
- Hugging Face Blog
- Hugging Face Posts
- PaddleOCR 3.5 HF blog
- OCR open models blog
The reason is that the hardest part may be converting the resume PDF/image into a faithful text/layout representation before the actual field extraction step.
My practical recommendation
If I were trying to solve this now, I would do this:
If LayoutLMv3 is mandatory:
- stop looking only for ordinary resume text datasets;
- build a small LayoutLMv3-style gold dataset with images, OCR words, boxes, and labels;
- keep train/inference OCR identical;
- possibly use a text extractor/NER model to create weak labels, then manually correct them.
If any model/tool is acceptable:
- start with SmartResume;
- test NuExtract3 with a resume JSON schema;
- use PaddleOCR-VL, Docling, or olmOCR if PDF/image ingestion is the bottleneck;
- compare against simpler text-to-JSON or NER routes like qwen3-0.6b-resume-json, NuExtract-tiny Resume, or oksomu/resume-ner.
In both cases:
- create a small hand-checked evaluation set from your actual target resumes;
- test two-column resumes, scanned resumes, multi-page resumes, and unusual layouts;
- evaluate omissions and hallucinations, not only JSON validity.
So my short answer would be:
I did not find a perfect public LayoutLMv3-ready resume dataset. But if the goal is resume parsing rather than specifically LayoutLMv3 training, the ecosystem is much better now: SmartResume, NuExtract3, PaddleOCR-VL/PaddleOCR, Qwen3 resume JSON, NuExtract-tiny Resume, and resume NER/section-routing models are all worth checking.