
Is it possible to create a Résumé parser using a Huggingface model?

Hugging Face Forums [Unofficial] June 3, 2026

This reflects what I could find as of June 2026. The answer changes quite a bit depending on whether you specifically need to make it work with LayoutLMv3, or whether any model/tool is acceptable as long as the resume parsing goal is met:

TL;DR

I would split this into two different tracks:

If you specifically need LayoutLMv3, the main problem is not just “which model?” or “which dataset?”. You need a dataset/pipeline with:
- page images or rendered PDF pages,
- OCR words,
- word-level bounding boxes,
- normalized LayoutLM-style boxes,
- word/token labels,
- reading order,
- and the same OCR/bbox preprocessing at training and inference time.

I did not find a clean public resume-specific dataset that is already “LayoutLMv3-ready” in that full sense.

If the goal is simply resume parsing / resume information extraction, then there are more practical resources now. I would look at:
- Alibaba SmartResume / Alibaba-EI/SmartResume
- NuExtract3
- PaddleOCR-VL / PaddleOCR-VL-1.6 collection
- sandeeppanem/resume-json-extraction-5k
- sandeeppanem/qwen3-0.6b-resume-json
- nimendraai/NuExtract-tiny-Resume-Data-Extractor
- oksomu/resume-ner
- amosify/resume-section-classifier-v1
- amosify/distilbert-resume-ner-v1

The most important practical distinction is:

LayoutLMv3 wants image + OCR words + boxes + labels. Many newer resume resources are text-to-JSON, NER, OCR, or document-parsing resources. Useful, but not the same training format.

Track A — If you must use LayoutLMv3

For LayoutLMv3, I would first check the expected input contract carefully.

Relevant docs:

LayoutLMv3 docs
LayoutLMv3 model card: microsoft/layoutlmv3-base
HF forum: LayoutLMv3 for token classification

The important part is that LayoutLMv3 token classification is not just ordinary text NER. The processor/model path expects layout-aware inputs, usually something like:

{
    "image": "<page image>",
    "words": ["John", "Doe", "Software", "Engineer", "..."],
    "boxes": [[x0, y0, x1, y1], ...],
    "word_labels": ["B-NAME", "I-NAME", "B-TITLE", "I-TITLE", "..."]
}

The boxes should be word-level bounding boxes, normalized in the LayoutLM-style coordinate system, usually 0–1000 scale. The labels need to align with the OCR words, and then the tokenizer has to propagate word labels to subword tokens.
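As a rough sketch of those two steps (the function names are my own, not part of the Transformers API): the first helper scales a pixel-space box onto the 0–1000 grid, and the second propagates word-level label ids to subword tokens by labelling only the first subword of each word and masking the rest. I am assuming `word_ids` has the shape that Hugging Face fast tokenizers expose, one entry per token, with `None` for special tokens. The LayoutLMv3 processor can do this alignment for you when you pass word labels, so treat this as an illustration of the idea rather than required code.

```python
from typing import List, Optional, Sequence


def normalize_box(box: Sequence[float], page_width: float, page_height: float) -> List[int]:
    """Scale a pixel-space (x0, y0, x1, y1) box to the 0-1000 integer grid."""
    x0, y0, x1, y1 = box

    def scale(value: float, size: float) -> int:
        # Clamp so slightly out-of-page OCR boxes stay inside the grid.
        return max(0, min(1000, int(1000 * value / size)))

    return [
        scale(x0, page_width),
        scale(y0, page_height),
        scale(x1, page_width),
        scale(y1, page_height),
    ]


def propagate_labels(
    word_ids: Sequence[Optional[int]],
    word_label_ids: Sequence[int],
    ignore_index: int = -100,
) -> List[int]:
    """Label the first subword of each word; mask special tokens and later subwords."""
    token_labels: List[int] = []
    previous: Optional[int] = None
    for word_id in word_ids:
        if word_id is None:
            token_labels.append(ignore_index)  # special token such as <s> or </s>
        elif word_id != previous:
            token_labels.append(word_label_ids[word_id])  # first subword of a word
        else:
            token_labels.append(ignore_index)  # continuation subword
        previous = word_id
    return token_labels


# A 1000x500 px page: x is scaled by 1000/1000, y by 1000/500.
print(normalize_box((100, 50, 300, 100), page_width=1000, page_height=500))
# Two words, the first split into two subwords, wrapped in special tokens.
print(propagate_labels([None, 0, 0, 1, None], [1, 2]))
```

The value -100 is the index that PyTorch's cross-entropy loss ignores by default, which is why it is the usual masking choice.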

The forum thread above is worth reading because it points out a common failure mode: training with one kind of box/annotation setup, then using a different OCR/bbox setup at inference. That can break the model even if the training code looked fine.
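One way to guard against that mismatch is to record the preprocessing settings at training time and refuse to run inference under different ones. This is only a sketch of the idea: `PreprocessConfig`, its fields, and `assert_same_contract` are hypothetical names I made up, and you would add whatever settings actually matter in your pipeline.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PreprocessConfig:
    """Hypothetical record of everything that shapes words and boxes."""
    ocr_engine: str
    ocr_version: str
    render_dpi: int
    box_scale: int = 1000
    reading_order: str = "top-to-bottom"
    max_tokens: int = 512

    def fingerprint(self) -> str:
        # Stable hash of the settings; store it next to the trained checkpoint.
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def assert_same_contract(train_fingerprint: str, inference_config: PreprocessConfig) -> None:
    """Raise if inference preprocessing differs from what the model was trained on."""
    if inference_config.fingerprint() != train_fingerprint:
        raise ValueError(
            "Preprocessing differs from training: "
            f"expected {train_fingerprint}, got {inference_config.fingerprint()}"
        )


train_cfg = PreprocessConfig(ocr_engine="tesseract", ocr_version="5.3.0", render_dpi=200)
saved = train_cfg.fingerprint()

# Same settings at inference: passes silently.
assert_same_contract(saved, PreprocessConfig("tesseract", "5.3.0", 200))

# A different render DPI changes the boxes, so this is rejected.
try:
    assert_same_contract(saved, PreprocessConfig("tesseract", "5.3.0", 300))
except ValueError as err:
    print(err)
```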

Practical LayoutLMv3 checklist

If I had to make the LayoutLMv3 route work, I would do something like this:

1. Choose one OCR engine and freeze it.
   Examples: Tesseract, EasyOCR, PaddleOCR, pdfplumber/PDF text extraction, etc.

2. Convert each resume page into:
   - page image
   - OCR words
   - word-level bounding boxes
   - reading order

3. Annotate those OCR words/boxes.
   Do not annotate totally separate hand-drawn boxes unless you can reproduce the same boxes at inference.

4. Convert annotations into BIO/BILOU labels.
   Example labels:
   - B-NAME / I-NAME
   - B-EMAIL / I-EMAIL
   - B-PHONE / I-PHONE
   - B-COMPANY / I-COMPANY
   - B-JOB_TITLE / I-JOB_TITLE
   - B-DEGREE / I-DEGREE
   - B-INSTITUTION / I-INSTITUTION
   - B-SKILL / I-SKILL
   - O

5. Normalize boxes exactly as LayoutLMv3 expects.

6. Keep OCR, ordering, box normalization, truncation, and page splitting identical at training and inference.

7. Start with a small gold evaluation set before scaling.

In other words, for LayoutLMv3 the dataset problem is really an annotation and preprocessing contract problem.
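Step 4 of the checklist can be sketched like this, assuming your annotation tool gives you spans over OCR word indices as `(start, end, field)` with an exclusive `end`. That span format and the `spans_to_bio` name are my assumptions, so adapt them to whatever your tool exports.

```python
from typing import List, Sequence, Tuple

Span = Tuple[int, int, str]  # (start word index, end word index exclusive, field name)


def spans_to_bio(num_words: int, spans: Sequence[Span]) -> List[str]:
    """Convert word-index spans into one BIO label per OCR word."""
    labels = ["O"] * num_words
    for start, end, field in spans:
        if not 0 <= start < end <= num_words:
            raise ValueError(f"span out of range: {(start, end, field)}")
        if any(label != "O" for label in labels[start:end]):
            raise ValueError(f"overlapping span: {(start, end, field)}")
        labels[start] = f"B-{field}"
        for index in range(start + 1, end):
            labels[index] = f"I-{field}"
    return labels


words = ["John", "Doe", "Software", "Engineer", "at", "Acme"]
spans = [(0, 2, "NAME"), (2, 4, "JOB_TITLE"), (5, 6, "COMPANY")]
for word, label in zip(words, spans_to_bio(len(words), spans)):
    print(f"{word}\t{label}")
```

Rejecting overlapping spans up front is a deliberate choice here: plain BIO tagging cannot represent nested entities, so it is better to fail at conversion time than to silently overwrite labels.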

Why ordinary resume text datasets are not enough for LayoutLMv3

A dataset like:

{
  "text": "John Doe\nSoftware Engineer\n...",
  "json": {
    "name": "John Doe",
    "title": "Software Engineer"
  }
}

can be useful for text-to-JSON models or LLM fine-tuning, but it is not directly enough for LayoutLMv3 token classification because it is missing:

- page image,
- OCR words,
- word bounding boxes,
- word-level labels,
- reading order,
- box normalization,
- page-level segmentation.

You might still use a text-to-JSON model to create weak labels. For example:

resume PDF
→ OCR words + boxes
→ text-to-JSON extractor
→ extracted field values
→ string-match field values back to OCR words
→ weak BIO labels
→ human correction
→ LayoutLMv3 fine-tuning

But I would treat that as a weak-labeling/bootstrap approach, not as a clean substitute for a real gold dataset.
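The “string-match field values back to OCR words” step could look roughly like the sketch below. `weak_bio_labels` is a hypothetical helper: it normalizes tokens, looks for the first exact run of OCR words that matches each extracted value, and labels it. Anything OCR has split, merged, or misread will simply stay `O`, which is one reason the output needs human correction.

```python
from typing import Dict, List, Sequence


def _norm(token: str) -> str:
    # Lowercase and strip surrounding punctuation so "Doe," matches "doe".
    return token.strip(".,;:()[]|").lower()


def weak_bio_labels(words: Sequence[str], fields: Dict[str, str]) -> List[str]:
    """Assign weak BIO labels by matching extracted field values against OCR words."""
    labels = ["O"] * len(words)
    norm_words = [_norm(word) for word in words]
    for field, value in fields.items():
        target = [t for t in (_norm(tok) for tok in value.split()) if t]
        if not target:
            continue
        size = len(target)
        for start in range(len(words) - size + 1):
            window_free = all(label == "O" for label in labels[start:start + size])
            if norm_words[start:start + size] == target and window_free:
                labels[start] = f"B-{field}"
                for index in range(start + 1, start + size):
                    labels[index] = f"I-{field}"
                break  # label only the first occurrence of each value
    return labels


ocr_words = ["John", "Doe", "|", "Software", "Engineer", "john@example.com"]
extracted = {"NAME": "John Doe", "JOB_TITLE": "software engineer", "EMAIL": "john@example.com"}
print(weak_bio_labels(ocr_words, extracted))
```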

A LayoutLMv3-adjacent resource

One interesting LayoutLMv3-related resume resource is:

Smutypi3/applai-layoutlmv3

I would not treat it as a complete resume field parser, but it is relevant because it uses LayoutLMv3 on resume PDFs and discusses a resume-oriented preprocessing pattern. It may be useful as a reference if you want to see how someone handled PDF words, boxes, and LayoutLMv3-style representations in the resume domain.

Track B — If any model/tool is acceptable

If the goal is simply “parse resumes into structured fields”, I would probably not start with LayoutLMv3. I would start with a pipeline view:

PDF / DOCX / image resume
→ OCR / PDF parsing / layout reconstruction
→ clean text or Markdown with reading order
→ section routing
→ structured extraction
→ JSON validation
→ evaluation on a small hand-checked set

A resume parser is often not one model. It is a pipeline.

The hard part may be upstream: converting a visually complex resume PDF into faithful text/Markdown/layout before extracting fields.

Strongest current resource: SmartResume

I would look at SmartResume first:

GitHub: alibaba/SmartResume
HF model repo: Alibaba-EI/SmartResume
Paper: Layout-Aware Parsing Meets Efficient LLMs
HF Papers page

Why it matters:

SmartResume is very close to the actual problem. It is not just a generic NER model. It treats resume parsing as a layout-aware pipeline:

resume PDF / image / Office document
→ OCR + PDF metadata extraction
→ layout detection
→ reading order reconstruction
→ structured information extraction with a compact LLM

The paper is especially useful because it frames the problem correctly:

- resumes have diverse layouts,
- resumes often have multi-column structures,
- reading order matters,
- LLM-only extraction can be expensive or brittle,
- standardized resume extraction datasets/evaluation tools are limited.

This is probably the best “goal-first” starting point I found.

General structured extraction: NuExtract3

Another strong candidate is:

numind/NuExtract3

This is not resume-specific, but it is very relevant. It is a vision-language document understanding model for structured extraction and image-to-Markdown conversion.

The useful pattern is:

input document + JSON template + optional instructions
→ structured JSON output

For a resume, the template might look like:

{
  "name": "verbatim-string",
  "email": "email",
  "phone": "verbatim-string",
  "location": "verbatim-string",
  "summary": "string",
  "skills": ["verbatim-string"],
  "education": [
    {
      "institution": "verbatim-string",
      "degree": "verbatim-string",
      "field": "verbatim-string",
      "start_date": "date-time",
      "end_date": "date-time"
    }
  ],
  "experience": [
    {
      "company": "verbatim-string",
      "title": "verbatim-string",
      "location": "verbatim-string",
      "start_date": "date-time",
      "end_date": "date-time",
      "responsibilities": ["string"],
      "achievements": ["string"]
    }
  ],
  "certifications": [
    {
      "name": "verbatim-string",
      "issuer": "verbatim-string",
      "date": "date-time"
    }
  ]
}

I would still evaluate it on real resumes. But as a modern structured-extraction route, it is very relevant.
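For that evaluation, a small checker that compares the model's JSON against the template is a cheap first filter. The sketch below is my own helper, not part of NuExtract: it checks that keys match the template, that leaf values are strings, that `verbatim-string` values really occur in the source text, and that `email` values look like emails. I am assuming `null` means “field not present”, so it is accepted everywhere.

```python
import re
from typing import Any, List

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def check_extraction(template: Any, output: Any, source_text: str, path: str = "") -> List[str]:
    """Return a list of problems found; an empty list means none were detected."""
    here = path or "<root>"
    problems: List[str] = []
    if isinstance(template, dict):
        if not isinstance(output, dict):
            return [f"{here}: expected an object"]
        for key, sub_template in template.items():
            child = f"{path}.{key}" if path else key
            if key not in output:
                problems.append(f"{child}: missing key")
            else:
                problems += check_extraction(sub_template, output[key], source_text, child)
        for key in sorted(output.keys() - template.keys()):
            problems.append(f"{path}.{key}" if path else f"{key}: unexpected key")
    elif isinstance(template, list):
        if not isinstance(output, list):
            return [f"{here}: expected an array"]
        for index, item in enumerate(output):
            problems += check_extraction(template[0], item, source_text, f"{path}[{index}]")
    elif output is not None:  # assumption: null means the field is absent
        if not isinstance(output, str):
            problems.append(f"{here}: expected a string")
        elif template == "verbatim-string" and output not in source_text:
            problems.append(f"{here}: value not found verbatim in source text")
        elif template == "email" and not EMAIL_RE.match(output):
            problems.append(f"{here}: does not look like an email address")
    return problems


template = {"name": "verbatim-string", "email": "email", "skills": ["verbatim-string"]}
source = "John Doe\njohn@example.com\nSkills: Python, SQL"
output = {"name": "John Doe", "email": "john@example.com", "skills": ["Python", "Rust"]}
print(check_extraction(template, output, source))
```

The verbatim check doubles as a crude hallucination detector: a skill that never appears in the source text gets flagged even though the JSON itself is well formed.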

OCR / document parsing layer: PaddleOCR-VL, PaddleOCR 3.5, Docling, olmOCR

If your input is PDF/image resumes, I would also look at current OCR/document parsing tools. These are not resume parsers by themselves, but they may solve the most painful upstream step.

Useful resources:

PaddleOCR-VL
PaddleOCR-VL-1.6 collection
PaddleOCR GitHub
HF Blog: PaddleOCR 3.5 with Transformers backend
Docling GitHub
Docling docs
Docling technical report
olmOCR GitHub
olmOCR paper

Why this matters:

A two-column resume, a sidebar resume, or a scanned resume can fail before the extraction model even sees the content correctly. If the OCR/layout step scrambles the reading order, a good NER or LLM extractor may still produce bad JSON.

So I would separate:

document parsing quality

from:

field extraction quality

They are related, but not the same problem.
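To make the reading-order problem concrete, here is a toy sketch. Sorting OCR words purely top-to-bottom interleaves the two columns, while a simple projection heuristic (merge the words' horizontal extents into bands, then read band by band) keeps each column together. This is a deliberately naive heuristic of my own for illustration: a full-width header line would merge everything into one band, and real tools use proper layout detection.

```python
from typing import List, NamedTuple, Sequence


class Word(NamedTuple):
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def naive_order(words: Sequence[Word]) -> List[Word]:
    """Top-to-bottom, then left-to-right: interleaves columns."""
    return sorted(words, key=lambda w: (w.y0, w.x0))


def column_aware_order(words: Sequence[Word], min_gap: float = 30.0) -> List[Word]:
    """Group words into vertical bands separated by horizontal gaps, then read band by band."""
    bands: List[List[float]] = []
    for x0, x1 in sorted((w.x0, w.x1) for w in words):
        if bands and x0 - bands[-1][1] < min_gap:
            bands[-1][1] = max(bands[-1][1], x1)  # extend the current band
        else:
            bands.append([x0, x1])  # a gap of at least min_gap starts a new column

    def band_index(word: Word) -> int:
        center = (word.x0 + word.x1) / 2
        for index, (left, right) in enumerate(bands):
            if left <= center <= right:
                return index
        return len(bands)

    return sorted(words, key=lambda w: (band_index(w), w.y0, w.x0))


page = [
    Word("Experience", 50, 100, 150, 115),
    Word("Skills", 400, 100, 460, 115),
    Word("Acme", 50, 130, 100, 145),
    Word("Python", 400, 130, 470, 145),
]
print([w.text for w in naive_order(page)])
print([w.text for w in column_aware_order(page)])
```

With the naive order an extractor would see "Experience Skills Acme Python", which is exactly the kind of scrambled input that produces bad JSON downstream.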

Resume text → JSON resources

If you can already get reasonably clean text from the resume, there are more direct resources.

Qwen3 resume JSON dataset/model

sandeeppanem/resume-json-extraction-5k
sandeeppanem/qwen3-0.6b-resume-json
GitHub: sandeeppanem/qwen3-resume-extraction

This route is useful if your pipeline is:

PDF/DOCX/image
→ text extraction
→ raw resume text
→ structured JSON extractor

The dataset pairs raw resume text with structured JSON. The model is a Qwen3-0.6B LoRA adapter for resume JSON extraction.

Caveats:

- It is not a LayoutLMv3 dataset.
- It does not solve OCR/layout.
- The model repo contains the LoRA adapter, so the base model is also needed.
- Long resumes and unusual formats still need evaluation.

Small local resume extractor

nimendraai/NuExtract-tiny-Resume-Data-Extractor

This is a resume/CV structured extraction model based on NuExtract-tiny / Qwen2.5-0.5B. It is useful if you want a small local route, especially for raw text to JSON.

Caveat: check the model card carefully. It is trained on synthetic resumes, so I would not trust it without testing on real resumes from your target distribution.

NER / section-routing route

If you want something more deterministic and easier to debug than “LLM returns JSON”, a section classifier + NER pipeline may be easier to control.

Useful resources:

oksomu/resume-ner
amosify/resume-section-classifier-v1
amosify/distilbert-resume-ner-v1
yashpwr/resume-ner-training-data
yashpwr/resume-ner-bert-v2
HF docs: token classification / NER

A practical version of this route could be:

OCR/PDF text chunks
→ classify chunks into sections:
   contact / summary / experience / education / skills / certifications / projects / etc.
→ run section-aware NER
→ normalize dates, phone, email, skills, company names
→ group entities into experience[] and education[]
→ validate JSON

This is less glamorous than a single end-to-end model, but easier to debug.

For example:

Contact fields can often be handled with regex + NER.
Skills can use NER + skill dictionaries.
Experience needs grouping: company, title, dates, bullet points.
Education needs grouping: institution, degree, field, dates/GPA.
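A minimal sketch of the deterministic end of this route, using only heading lookup and regex. The heading list and both patterns are simplified assumptions of mine, and a trained section classifier or NER model would replace or complement them in practice.

```python
import re
from typing import Dict, List, Optional

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")
SECTION_HEADINGS = {"summary", "experience", "education", "skills", "certifications", "projects"}


def route_sections(text: str) -> Dict[str, List[str]]:
    """Split resume text into sections; lines before the first heading count as contact."""
    sections: Dict[str, List[str]] = {"contact": []}
    current = "contact"
    for line in text.splitlines():
        stripped = line.strip()
        key = stripped.lower().rstrip(":")
        if key in SECTION_HEADINGS:
            current = key
            sections.setdefault(current, [])
        elif stripped:
            sections[current].append(stripped)
    return sections


def extract_contact(lines: List[str]) -> Dict[str, Optional[str]]:
    """Pull the first email and phone number out of the contact section only."""
    blob = "\n".join(lines)
    email = EMAIL_RE.search(blob)
    phone = PHONE_RE.search(blob)
    return {
        "email": email.group() if email else None,
        "phone": phone.group() if phone else None,
    }


resume = """John Doe
john.doe@example.com | +1 (555) 123-4567

Experience
Acme Corp, Software Engineer, 2020-2023

Skills
Python, SQL"""

sections = route_sections(resume)
print(sections["experience"])
print(extract_contact(sections["contact"]))
```

Routing first matters even for regex: the loose phone pattern above would also match a date range such as "2020-2023" if it were run over the experience section.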

The important caveat with many resume NER models is that reported scores may come from internal or narrow test sets. I would always create a small hand-labeled evaluation set from your actual target resumes.

Resource table

Resource	Type	Best for	Not for	Notes
LayoutLMv3 docs	Model docs	Understanding LayoutLMv3 input contract	Finding resume data	Essential for image/words/boxes/labels
LayoutLMv3 forum thread	Forum/debugging	OCR/bbox train-inference consistency	Turnkey solution	Very relevant practical warning
SmartResume	Resume-specific system	Full resume parsing pipeline	Pure LayoutLMv3 training	Strongest goal-first candidate
Alibaba-EI/SmartResume	Model repo	Weights/resources for SmartResume	General NER	Includes resume extraction/layout components
Layout-Aware Parsing Meets Efficient LLMs	Paper	Modern resume extraction architecture	Drop-in code alone	Useful framing and evaluation discussion
NuExtract3	General document extraction VLM	Template-based JSON extraction	Resume-specific guarantee	Strong candidate if model choice is flexible
PaddleOCR-VL	OCR/document parsing	Upstream PDF/image parsing	Resume field extraction by itself	Strong document parsing candidate
PaddleOCR GitHub	OCR/document stack	OCR/layout/table/formula/chart extraction	Resume-specific schema	Good ingestion layer
Docling	Document parser	PDF/DOCX/image to structured text/Markdown/JSON	Resume labels	Useful preprocessing layer
olmOCR	PDF-to-text/Markdown OCR	Clean reading order / linearized text	Resume JSON fields	Useful before extraction
resume-json-extraction-5k	Dataset	Resume text → JSON SFT	LayoutLMv3 training	Directly relevant for text route
qwen3-0.6b-resume-json	LoRA adapter	Lightweight resume JSON extraction	OCR/layout	Needs base Qwen3 model
NuExtract-tiny Resume	Small local extractor	Local raw-text resume JSON extraction	Robust PDF layout	Synthetic-data caveat
oksomu/resume-ner	NER + postprocess	Deterministic entity extraction route	Full layout parsing	Detailed card; evaluate externally
amosify section classifier	Text classifier	Section routing	Field extraction alone	Useful middle layer
amosify resume NER	NER model	Section-aware NER	Nested JSON alone	Pair with section routing
resume-parsing model tag	HF model search	Discovering current resume models	Exhaustive coverage	Some models are not tagged consistently

Suggested practical pipelines

Pipeline 1: LayoutLMv3-only route

Use this if LayoutLMv3 is required.

resume PDF/image
→ render pages
→ OCR with one fixed engine
→ words + word boxes + reading order
→ annotate OCR words/boxes
→ BIO labels
→ LayoutLMv3 token classification
→ field grouping + post-processing
→ JSON validation

This is the most faithful LayoutLMv3 route, but also the most annotation-heavy.

Pipeline 2: Modern goal-first route

Use this if the goal is just accurate resume parsing.

resume PDF/DOCX/image
→ Docling / PaddleOCR / olmOCR / SmartResume-style parsing
→ Markdown or layout-preserving text
→ NuExtract3 / SmartResume / Qwen3-resume-json / NER pipeline
→ structured JSON
→ validation and evaluation

This is probably the more practical route for most projects.

Pipeline 3: Hybrid route

Use this if you want to eventually train LayoutLMv3, but need a bootstrap path.

resume PDF/image
→ OCR words + boxes
→ text-to-JSON or NER extractor
→ map extracted field values back to OCR words
→ create weak BIO labels
→ manually correct a subset
→ train LayoutLMv3
→ evaluate against gold set

This can reduce annotation cost, but weak labels can be noisy.

Evaluation checklist

For resume parsing, I would not stop at “does it produce JSON?”. I would check:

Aspect	Example
JSON validity	Does it always return parseable JSON?
Schema compliance	Does it follow the target schema exactly?
Field exact match	email, phone, URLs
Normalized match	dates, locations, company names
Semantic match	job titles, degree names, responsibilities
Array alignment	does each title match the correct company/date range?
Omission	did it miss an experience item or degree?
Hallucination	did it invent a company, skill, date, or degree?
Layout robustness	two-column resumes, sidebars, scanned PDFs
Long-document handling	multi-page resumes, truncation, repeated headers
Privacy handling	PII, consent, local processing, data retention
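A few of these checks are easy to automate once you have a hand-checked gold set. The sketch below uses hypothetical helper names and covers normalized exact match for scalar fields plus omission/hallucination counts for a list field such as skills. Here “hallucination” just means “predicted but not in gold”, so an incomplete gold annotation will inflate it.

```python
from typing import Dict, List, Optional, Sequence


def _norm(value: str) -> str:
    # Case-insensitive, whitespace-insensitive comparison.
    return " ".join(value.lower().split())


def exact_match(gold: Optional[str], pred: Optional[str]) -> bool:
    """Normalized exact match for scalar fields such as email or phone."""
    return _norm(gold or "") == _norm(pred or "")


def list_field_report(gold: Sequence[str], pred: Sequence[str]) -> Dict[str, object]:
    """Compare a predicted list field against gold, ignoring order and duplicates."""
    gold_set = {_norm(item) for item in gold}
    pred_set = {_norm(item) for item in pred}
    matched = gold_set & pred_set
    omissions: List[str] = sorted(gold_set - pred_set)
    hallucinations: List[str] = sorted(pred_set - gold_set)
    return {
        "precision": len(matched) / len(pred_set) if pred_set else 1.0,
        "recall": len(matched) / len(gold_set) if gold_set else 1.0,
        "omissions": omissions,
        "hallucinations": hallucinations,
    }


report = list_field_report(
    gold=["Python", "SQL", "Docker"],
    pred=["python", "SQL ", "Kubernetes"],
)
print(report["omissions"], report["hallucinations"])
print(exact_match("John@Example.com", "john@example.com "))
```

Array alignment (does each title sit with the right company and date range?) needs record-level matching on top of this and is not covered by the sketch.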

For structured extraction evaluation ideas, you can also look at:

ExtractBench
ParseBench dataset
OmniDocBench

These are not resume-specific, but they are useful for thinking about PDF-to-JSON and document parsing evaluation.

How I would keep searching

I would not rely only on the resume-parsing tag. Some newer models are weakly tagged or not tagged consistently.

Search across:

Hugging Face Models

resume parser
resume json
cv parser
resume ner
qwen resume parser
deepseek resume parser
resume-parsing tag

Hugging Face Spaces

Search for resume parser demos, but inspect the implementation:

app.py
requirements.txt
model calls
OCR/PDF parsing method
whether it handles PDF, DOCX, image, or only raw text
whether there is any evaluation

Spaces are useful for implementation patterns, but I would not treat a demo as evidence of model quality.

Blogs / Papers / Posts

Also watch document parsing and OCR releases, not just resume-specific models.

Useful entry points:

Hugging Face Papers
Hugging Face Blog
Hugging Face Posts
PaddleOCR 3.5 HF blog
OCR open models blog

The reason is that the hardest part may be converting the resume PDF/image into a faithful text/layout representation before the actual field extraction step.

My practical recommendation

If I were trying to solve this now, I would do this:

If LayoutLMv3 is mandatory:
- stop looking only for ordinary resume text datasets;
- build a small LayoutLMv3-style gold dataset with images, OCR words, boxes, and labels;
- keep train/inference OCR identical;
- possibly use a text extractor/NER model to create weak labels, then manually correct them.

If any model/tool is acceptable:
- start with SmartResume;
- test NuExtract3 with a resume JSON schema;
- use PaddleOCR-VL, Docling, or olmOCR if PDF/image ingestion is the bottleneck;
- compare against simpler text-to-JSON or NER routes like qwen3-0.6b-resume-json, NuExtract-tiny Resume, or oksomu/resume-ner.

In both cases:
- create a small hand-checked evaluation set from your actual target resumes;
- test two-column resumes, scanned resumes, multi-page resumes, and unusual layouts;
- evaluate omissions and hallucinations, not only JSON validity.

So my short answer would be:

I did not find a perfect public LayoutLMv3-ready resume dataset. But if the goal is resume parsing rather than specifically LayoutLMv3 training, the ecosystem is much better now: SmartResume, NuExtract3, PaddleOCR-VL/PaddleOCR, Qwen3 resume JSON, NuExtract-tiny Resume, and resume NER/section-routing models are all worth checking.