Is it possible to create a Résumé parser using a Huggingface model?

Hugging Face Forums [Unofficial] June 3, 2026

This reflects the state of things as of June 2026. The answer changes quite a bit depending on whether you specifically need to make it work with LayoutLMv3, or whether any model/tool is acceptable as long as the resume parsing goal is met:


TL;DR

I would split this into two different tracks:

  1. If you specifically need LayoutLMv3, the main problem is not just “which model?” or “which dataset?”. You need a dataset/pipeline with:

    • page images or rendered PDF pages,
    • OCR words,
    • word-level bounding boxes,
    • normalized LayoutLM-style boxes,
    • word/token labels,
    • reading order,
    • and the same OCR/bbox preprocessing at training and inference time.

I did not find a clean public resume-specific dataset that is already “LayoutLMv3-ready” in that full sense.

  2. If the goal is simply resume parsing / resume information extraction, then there are more practical resources now. I would look at:

    • Alibaba SmartResume / Alibaba-EI/SmartResume
    • NuExtract3
    • PaddleOCR-VL / PaddleOCR-VL-1.6 collection
    • sandeeppanem/resume-json-extraction-5k
    • sandeeppanem/qwen3-0.6b-resume-json
    • nimendraai/NuExtract-tiny-Resume-Data-Extractor
    • oksomu/resume-ner
    • amosify/resume-section-classifier-v1
    • amosify/distilbert-resume-ner-v1

The most important practical distinction is:

LayoutLMv3 wants image + OCR words + boxes + labels. Many newer resume resources are text-to-JSON, NER, OCR, or document-parsing resources. Useful, but not the same training format.


Track A — If you must use LayoutLMv3

For LayoutLMv3, I would first check the expected input contract carefully.

Relevant docs:

  • LayoutLMv3 docs
  • LayoutLMv3 model card: microsoft/layoutlmv3-base
  • HF forum: LayoutLMv3 for token classification

The important part is that LayoutLMv3 token classification is not just ordinary text NER. The processor/model path expects layout-aware inputs, usually something like:

{
    "image": "<page image>",
    "words": ["John", "Doe", "Software", "Engineer", "..."],
    "boxes": [[x0, y0, x1, y1], ...],
    "word_labels": ["B-NAME", "I-NAME", "B-TITLE", "I-TITLE", "..."]
}

The boxes should be word-level bounding boxes, normalized in the LayoutLM-style coordinate system, usually 0–1000 scale. The labels need to align with the OCR words, and then the tokenizer has to propagate word labels to subword tokens.
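A minimal sketch of those two steps, in plain Python so the rules are visible. The function names are hypothetical, and `word_ids` is assumed to be the per-token word index list that a fast tokenizer's `word_ids()` returns (`None` for special tokens). In practice the LayoutLMv3 processor is generally expected to handle the label alignment when it is given word-level labels, so treat this as an illustration rather than required code:

```python
from typing import List, Optional, Sequence


def normalize_box(box: Sequence[float], page_width: float, page_height: float,
                  scale: int = 1000) -> List[int]:
    """Map a pixel box (x0, y0, x1, y1) onto the 0..scale integer grid."""
    x0, y0, x1, y1 = box

    def clamp(value: int) -> int:
        # OCR boxes occasionally spill past the page edge; keep them in range.
        return max(0, min(scale, value))

    return [
        clamp(int(scale * x0 / page_width)),
        clamp(int(scale * y0 / page_height)),
        clamp(int(scale * x1 / page_width)),
        clamp(int(scale * y1 / page_height)),
    ]


def align_labels(word_ids: Sequence[Optional[int]], word_label_ids: Sequence[int],
                 ignore_index: int = -100) -> List[int]:
    """Label only the first subword of each word; mask everything else."""
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None or word_id == previous:
            # Special token, or a continuation subword of the same word.
            aligned.append(ignore_index)
        else:
            aligned.append(word_label_ids[word_id])
        previous = word_id
    return aligned


# A 1000x500 px page: the box lands on the 0-1000 grid.
print(normalize_box((100, 50, 300, 100), page_width=1000, page_height=500))
# Two words, the first split into two subwords, wrapped in special tokens.
print(align_labels([None, 0, 0, 1, None], [1, 2]))
```

Whatever normalization you pick (rounding vs. truncation, clamping or not), the point is to use the same function for training and inference.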

The forum thread above is worth reading because it points out a common failure mode: training with one kind of box/annotation setup, then using a different OCR/bbox setup at inference. That can break the model even if the training code looked fine.

Practical LayoutLMv3 checklist

If I had to make the LayoutLMv3 route work, I would do something like this:

1. Choose one OCR engine and freeze it.
   Examples: Tesseract, EasyOCR, PaddleOCR, pdfplumber/PDF text extraction, etc.

2. Convert each resume page into:
   - page image
   - OCR words
   - word-level bounding boxes
   - reading order

3. Annotate those OCR words/boxes.
   Do not annotate totally separate hand-drawn boxes unless you can reproduce the same boxes at inference.

4. Convert annotations into BIO/BILOU labels.
   Example labels:
   - B-NAME / I-NAME
   - B-EMAIL / I-EMAIL
   - B-PHONE / I-PHONE
   - B-COMPANY / I-COMPANY
   - B-JOB_TITLE / I-JOB_TITLE
   - B-DEGREE / I-DEGREE
   - B-INSTITUTION / I-INSTITUTION
   - B-SKILL / I-SKILL
   - O

5. Normalize boxes exactly as LayoutLMv3 expects.

6. Keep OCR, ordering, box normalization, truncation, and page splitting identical at training and inference.

7. Start with a small gold evaluation set before scaling.

In other words, for LayoutLMv3 the dataset problem is really an annotation and preprocessing contract problem.
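One way to make that contract explicit is to store a fingerprint of the preprocessing settings next to the trained model and refuse to run inference if the settings differ. This is only a sketch under my own assumptions: `PreprocessConfig`, its fields, and the example values are hypothetical, and you would list whatever actually affects your words and boxes:

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PreprocessConfig:
    """Hypothetical list of settings that change the words/boxes the model sees."""
    ocr_engine: str
    ocr_version: str
    render_dpi: int
    box_scale: int
    reading_order: str
    max_tokens: int


def fingerprint(cfg: PreprocessConfig) -> str:
    """Stable short hash of the config (keys sorted so field order is irrelevant)."""
    payload = json.dumps(asdict(cfg), sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:16]


def check_contract(train_fingerprint: str, inference_cfg: PreprocessConfig) -> None:
    """Raise if inference preprocessing differs from what was used in training."""
    current = fingerprint(inference_cfg)
    if current != train_fingerprint:
        raise ValueError(
            f"preprocessing mismatch: trained with {train_fingerprint}, got {current}"
        )


# Example values are placeholders, not recommendations.
train_cfg = PreprocessConfig("tesseract", "5.3.0", 200, 1000, "top-to-bottom", 512)
saved = fingerprint(train_cfg)      # store this alongside the model checkpoint
check_contract(saved, train_cfg)    # passes silently when nothing changed
```

It does not catch everything (for example, a silently updated OCR language pack), but it turns the most common mismatch into a loud error instead of a quiet accuracy drop.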

Why ordinary resume text datasets are not enough for LayoutLMv3

A dataset like:

{
  "text": "John Doe\nSoftware Engineer\n...",
  "json": {
    "name": "John Doe",
    "title": "Software Engineer"
  }
}

can be useful for text-to-JSON models or LLM fine-tuning, but it is not directly enough for LayoutLMv3 token classification because it is missing:

  • page image,
  • OCR words,
  • word bounding boxes,
  • word-level labels,
  • reading order,
  • box normalization,
  • page-level segmentation.

You might still use a text-to-JSON model to create weak labels. For example:

resume PDF
→ OCR words + boxes
→ text-to-JSON extractor
→ extracted field values
→ string-match field values back to OCR words
→ weak BIO labels
→ human correction
→ LayoutLMv3 fine-tuning

But I would treat that as a weak-labeling/bootstrap approach, not as a clean substitute for a real gold dataset.
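The “string-match field values back to OCR words” step could look roughly like this. It is a deliberately naive sketch: `weak_bio_labels` is a hypothetical helper, it only handles flat fields, it takes the first exact match after light normalization, and it will miss values the extractor paraphrased or the OCR misread:

```python
import re
from typing import Dict, List


def _norm(token: str) -> str:
    """Lowercase and drop punctuation, keeping characters common in emails."""
    return re.sub(r"[^\w@.+-]", "", token).lower()


def weak_bio_labels(words: List[str], fields: Dict[str, str]) -> List[str]:
    """Project extracted field values onto OCR words as weak BIO tags."""
    labels = ["O"] * len(words)
    norm_words = [_norm(w) for w in words]
    for label, value in fields.items():
        target = [t for t in (_norm(t) for t in value.split()) if t]
        if not target:
            continue
        n = len(target)
        for start in range(len(words) - n + 1):
            window_free = all(tag == "O" for tag in labels[start:start + n])
            if norm_words[start:start + n] == target and window_free:
                labels[start] = f"B-{label}"
                for i in range(start + 1, start + n):
                    labels[i] = f"I-{label}"
                break  # first match only; later duplicates stay unlabeled
    return labels


words = ["John", "Doe", "Software", "Engineer", "john@example.com"]
fields = {"NAME": "John Doe", "TITLE": "Software Engineer", "EMAIL": "john@example.com"}
print(list(zip(words, weak_bio_labels(words, fields))))
```

Fields that fail to match are exactly the ones I would queue for human correction first.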

A LayoutLMv3-adjacent resource

One interesting LayoutLMv3-related resume resource is:

  • Smutypi3/applai-layoutlmv3

I would not treat it as a complete resume field parser, but it is relevant because it uses LayoutLMv3 on resume PDFs and discusses a resume-oriented preprocessing pattern. It may be useful as a reference if you want to see how someone handled PDF words, boxes, and LayoutLMv3-style representations in the resume domain.


Track B — If any model/tool is acceptable

If the goal is simply “parse resumes into structured fields”, I would probably not start with LayoutLMv3. I would start with a pipeline view:

PDF / DOCX / image resume
→ OCR / PDF parsing / layout reconstruction
→ clean text or Markdown with reading order
→ section routing
→ structured extraction
→ JSON validation
→ evaluation on a small hand-checked set

A resume parser is often not one model. It is a pipeline.

The hard part may be upstream: converting a visually complex resume PDF into faithful text/Markdown/layout before extracting fields.


Strongest current resource: SmartResume

I would look at SmartResume first:

  • GitHub: alibaba/SmartResume
  • HF model repo: Alibaba-EI/SmartResume
  • Paper: Layout-Aware Parsing Meets Efficient LLMs
  • HF Papers page

Why it matters:

SmartResume is very close to the actual problem. It is not just a generic NER model. It treats resume parsing as a layout-aware pipeline:

resume PDF / image / Office document
→ OCR + PDF metadata extraction
→ layout detection
→ reading order reconstruction
→ structured information extraction with a compact LLM

The paper is especially useful because it frames the problem correctly:

  • resumes have diverse layouts,
  • resumes often have multi-column structures,
  • reading order matters,
  • LLM-only extraction can be expensive or brittle,
  • standardized resume extraction datasets/evaluation tools are limited.

This is probably the best “goal-first” starting point I found.


General structured extraction: NuExtract3

Another strong candidate is:

  • numind/NuExtract3

This is not resume-specific, but it is very relevant. It is a vision-language document understanding model for structured extraction and image-to-Markdown conversion.

The useful pattern is:

input document + JSON template + optional instructions
→ structured JSON output

For a resume, the template might look like:

{
  "name": "verbatim-string",
  "email": "email",
  "phone": "verbatim-string",
  "location": "verbatim-string",
  "summary": "string",
  "skills": ["verbatim-string"],
  "education": [
    {
      "institution": "verbatim-string",
      "degree": "verbatim-string",
      "field": "verbatim-string",
      "start_date": "date-time",
      "end_date": "date-time"
    }
  ],
  "experience": [
    {
      "company": "verbatim-string",
      "title": "verbatim-string",
      "location": "verbatim-string",
      "start_date": "date-time",
      "end_date": "date-time",
      "responsibilities": ["string"],
      "achievements": ["string"]
    }
  ],
  "certifications": [
    {
      "name": "verbatim-string",
      "issuer": "verbatim-string",
      "date": "date-time"
    }
  ]
}

I would still evaluate it on real resumes. But as a modern structured-extraction route, it is very relevant.
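Whatever model fills the template, I would check the returned JSON against it before trusting it. A minimal structural check, assuming template leaves are type names and the model returns a string or null for each leaf (the null convention is my assumption; check the model card). `validate_against_template` is a hypothetical helper, not part of any library:

```python
from typing import Any, List


def validate_against_template(template: Any, output: Any, path: str = "$") -> List[str]:
    """Return a list of structural problems; an empty list means the shapes match."""
    problems: List[str] = []
    if isinstance(template, dict):
        if not isinstance(output, dict):
            return [f"{path}: expected object"]
        for key in template:
            if key not in output:
                problems.append(f"{path}.{key}: missing")
            else:
                problems += validate_against_template(
                    template[key], output[key], f"{path}.{key}")
        for key in output:
            if key not in template:
                problems.append(f"{path}.{key}: unexpected key")
    elif isinstance(template, list):
        if not isinstance(output, list):
            return [f"{path}: expected array"]
        if template:
            # The first template element describes every item in the array.
            for i, item in enumerate(output):
                problems += validate_against_template(template[0], item, f"{path}[{i}]")
    elif output is not None and not isinstance(output, str):
        # Leaf: the template only names a type, so accept a string or null.
        problems.append(f"{path}: expected string or null")
    return problems


template = {"name": "verbatim-string", "skills": ["verbatim-string"],
            "education": [{"institution": "verbatim-string"}]}
candidate = {"name": 42, "skills": "Python", "extra": 1}
for problem in validate_against_template(template, candidate):
    print(problem)
```

This only checks shape. Whether the values are right is a separate evaluation question.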


OCR / document parsing layer: PaddleOCR-VL, PaddleOCR 3.5, Docling, olmOCR

If your input is PDF/image resumes, I would also look at current OCR/document parsing tools. These are not resume parsers by themselves, but they may solve the most painful upstream step.

Useful resources:

  • PaddleOCR-VL
  • PaddleOCR-VL-1.6 collection
  • PaddleOCR GitHub
  • HF Blog: PaddleOCR 3.5 with Transformers backend
  • Docling GitHub
  • Docling docs
  • Docling technical report
  • olmOCR GitHub
  • olmOCR paper

Why this matters:

A two-column resume, a sidebar resume, or a scanned resume can fail before the extraction model even sees the content correctly. If the OCR/layout step scrambles the reading order, a good NER or LLM extractor may still produce bad JSON.

So I would separate:

document parsing quality

from:

field extraction quality

They are related, but not the same problem.
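To make the reading-order failure above concrete, here is a toy sketch. It assumes each word is a `(text, x0, y0)` tuple and that the gutter position between the two columns is already known, which is exactly the part real layout detection has to work out:

```python
from typing import List, Tuple

Word = Tuple[str, float, float]  # (text, x0, y0) in page coordinates


def naive_order(words: List[Word]) -> List[str]:
    """Sort strictly top-to-bottom, then left-to-right (interleaves columns)."""
    return [w[0] for w in sorted(words, key=lambda w: (w[2], w[1]))]


def two_column_order(words: List[Word], gutter_x: float) -> List[str]:
    """Read the whole left column first, then the right column."""
    def by_position(w: Word):
        return (w[2], w[1])

    left = [w for w in words if w[1] < gutter_x]
    right = [w for w in words if w[1] >= gutter_x]
    return [w[0] for w in sorted(left, key=by_position) + sorted(right, key=by_position)]


# A sidebar with skills on the left, experience on the right.
page = [
    ("Skills", 50, 100), ("Experience", 400, 100),
    ("Python", 50, 130), ("Acme", 400, 130),
    ("SQL", 50, 160), ("Engineer", 400, 160),
]
print(naive_order(page))
# ['Skills', 'Experience', 'Python', 'Acme', 'SQL', 'Engineer']
print(two_column_order(page, gutter_x=300))
# ['Skills', 'Python', 'SQL', 'Experience', 'Acme', 'Engineer']
```

With the naive order, “Python Acme SQL Engineer” reaches the extractor as one run of text, and no field extractor can be expected to untangle that reliably.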


Resume text → JSON resources

If you can already get reasonably clean text from the resume, there are more direct resources.

Qwen3 resume JSON dataset/model

  • sandeeppanem/resume-json-extraction-5k
  • sandeeppanem/qwen3-0.6b-resume-json
  • GitHub: sandeeppanem/qwen3-resume-extraction

This route is useful if your pipeline is:

PDF/DOCX/image
→ text extraction
→ raw resume text
→ structured JSON extractor

The dataset is raw resume text to structured JSON. The model is a Qwen3-0.6B LoRA adapter for resume JSON extraction.

Caveats:

  • It is not a LayoutLMv3 dataset.
  • It does not solve OCR/layout.
  • The model repo contains the LoRA adapter, so the base model is also needed.
  • Long resumes and unusual formats still need evaluation.

Small local resume extractor

  • nimendraai/NuExtract-tiny-Resume-Data-Extractor

This is a resume/CV structured extraction model based on NuExtract-tiny / Qwen2.5-0.5B. It is useful if you want a small local route, especially for raw text to JSON.

Caveat: check the model card carefully. It is trained on synthetic resumes, so I would not trust it without testing on real resumes from your target distribution.


NER / section-routing route

If you want something more deterministic than “LLM returns JSON”, a section classifier + NER pipeline may be easier to control and debug.

Useful resources:

  • oksomu/resume-ner
  • amosify/resume-section-classifier-v1
  • amosify/distilbert-resume-ner-v1
  • yashpwr/resume-ner-training-data
  • yashpwr/resume-ner-bert-v2
  • HF docs: token classification / NER

A practical version of this route could be:

OCR/PDF text chunks
→ classify chunks into sections:
   contact / summary / experience / education / skills / certifications / projects / etc.
→ run section-aware NER
→ normalize dates, phone, email, skills, company names
→ group entities into experience[] and education[]
→ validate JSON

This is less glamorous than a single end-to-end model, but easier to debug.
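The “group entities into experience[]” step is mostly plain code. A rough sketch, with hypothetical helper names and one strong assumption: each record starts with an anchor entity (here the company). Real resumes often put the title or dates first, so this heuristic would need adapting:

```python
from typing import Dict, List, Tuple


def bio_to_entities(words: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge BIO-tagged words into (label, text) entities."""
    entities: List[Tuple[str, str]] = []
    label, buffer = None, []
    for word, tag in zip(words, tags):
        if tag.startswith("I-") and label == tag[2:]:
            buffer.append(word)  # continue the current entity
            continue
        if label is not None:
            entities.append((label, " ".join(buffer)))  # close the previous entity
        if tag == "O":
            label, buffer = None, []
        else:
            # B- tag, or a stray I- tag without a B-: start a new entity.
            label, buffer = tag[2:], [word]
    if label is not None:
        entities.append((label, " ".join(buffer)))
    return entities


def group_records(entities: List[Tuple[str, str]], anchor: str) -> List[Dict[str, str]]:
    """Start a new record at each anchor entity; attach later entities to it."""
    records: List[Dict[str, str]] = []
    for label, text in entities:
        if label == anchor or not records:
            records.append({})
        records[-1].setdefault(label.lower(), text)
    return records


words = ["Acme", "Corp", "Software", "Engineer", "Globex", "Analyst"]
tags = ["B-COMPANY", "I-COMPANY", "B-JOB_TITLE", "I-JOB_TITLE", "B-COMPANY", "B-JOB_TITLE"]
print(group_records(bio_to_entities(words, tags), anchor="COMPANY"))
```

Because every step is inspectable, a wrong output can usually be traced to one stage: bad tags, bad merging, or bad grouping.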

For example:

  • Contact fields can often be handled with regex + NER.
  • Skills can use NER + skill dictionaries.
  • Experience needs grouping: company, title, dates, bullet points.
  • Education needs grouping: institution, degree, field, dates/GPA.
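For the contact fields, the regex half can be as small as this. The patterns are illustrative rather than exhaustive, and the phone pattern is loose enough to match things like date ranges, so I would only run it on text that the section router has already classified as the contact block:

```python
import re
from typing import Dict, Optional

# Illustrative patterns only; real-world emails and phone formats vary more than this.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d ().-]{7,}\d")


def extract_contact(contact_text: str) -> Dict[str, Optional[str]]:
    """Pull the first email and phone-like string out of a contact section."""
    email = EMAIL_RE.search(contact_text)
    phone = PHONE_RE.search(contact_text)
    return {
        "email": email.group(0) if email else None,
        "phone": phone.group(0) if phone else None,
    }


section = "John Doe\njohn.doe@example.com | +1 (555) 123-4567\nSoftware Engineer"
print(extract_contact(section))
# {'email': 'john.doe@example.com', 'phone': '+1 (555) 123-4567'}
```

NER then only has to cover what regexes handle badly, such as names and locations.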

The important caveat with many resume NER models is that reported scores may come from internal or narrow test sets. I would always create a small hand-labeled evaluation set from your actual target resumes.


Resource table

| Resource | Type | Best for | Not for | Notes |
| --- | --- | --- | --- | --- |
| LayoutLMv3 docs | Model docs | Understanding LayoutLMv3 input contract | Finding resume data | Essential for image/words/boxes/labels |
| LayoutLMv3 forum thread | Forum/debugging | OCR/bbox train-inference consistency | Turnkey solution | Very relevant practical warning |
| SmartResume | Resume-specific system | Full resume parsing pipeline | Pure LayoutLMv3 training | Strongest goal-first candidate |
| Alibaba-EI/SmartResume | Model repo | Weights/resources for SmartResume | General NER | Includes resume extraction/layout components |
| Layout-Aware Parsing Meets Efficient LLMs | Paper | Modern resume extraction architecture | Drop-in code alone | Useful framing and evaluation discussion |
| NuExtract3 | General document extraction VLM | Template-based JSON extraction | Resume-specific guarantee | Strong candidate if model choice is flexible |
| PaddleOCR-VL | OCR/document parsing | Upstream PDF/image parsing | Resume field extraction by itself | Strong document parsing candidate |
| PaddleOCR GitHub | OCR/document stack | OCR/layout/table/formula/chart extraction | Resume-specific schema | Good ingestion layer |
| Docling | Document parser | PDF/DOCX/image to structured text/Markdown/JSON | Resume labels | Useful preprocessing layer |
| olmOCR | PDF-to-text/Markdown OCR | Clean reading order / linearized text | Resume JSON fields | Useful before extraction |
| resume-json-extraction-5k | Dataset | Resume text → JSON SFT | LayoutLMv3 training | Directly relevant for text route |
| qwen3-0.6b-resume-json | LoRA adapter | Lightweight resume JSON extraction | OCR/layout | Needs base Qwen3 model |
| NuExtract-tiny Resume | Small local extractor | Local raw-text resume JSON extraction | Robust PDF layout | Synthetic-data caveat |
| oksomu/resume-ner | NER + postprocess | Deterministic entity extraction route | Full layout parsing | Detailed card; evaluate externally |
| amosify section classifier | Text classifier | Section routing | Field extraction alone | Useful middle layer |
| amosify resume NER | NER model | Section-aware NER | Nested JSON alone | Pair with section routing |
| resume-parsing model tag | HF model search | Discovering current resume models | Exhaustive coverage | Some models are not tagged consistently |

Suggested practical pipelines

Pipeline 1: LayoutLMv3-only route

Use this if LayoutLMv3 is required.

resume PDF/image
→ render pages
→ OCR with one fixed engine
→ words + word boxes + reading order
→ annotate OCR words/boxes
→ BIO labels
→ LayoutLMv3 token classification
→ field grouping + post-processing
→ JSON validation

This is the most faithful LayoutLMv3 route, but also the most annotation-heavy.

Pipeline 2: Modern model-flexible route

Use this if the goal is just accurate resume parsing.

resume PDF/DOCX/image
→ Docling / PaddleOCR / olmOCR / SmartResume-style parsing
→ Markdown or layout-preserving text
→ NuExtract3 / SmartResume / Qwen3-resume-json / NER pipeline
→ structured JSON
→ validation and evaluation

This is probably the more practical route for most projects.

Pipeline 3: Hybrid route

Use this if you want to eventually train LayoutLMv3, but need a bootstrap path.

resume PDF/image
→ OCR words + boxes
→ text-to-JSON or NER extractor
→ map extracted field values back to OCR words
→ create weak BIO labels
→ manually correct a subset
→ train LayoutLMv3
→ evaluate against gold set

This can reduce annotation cost, but weak labels can be noisy.


Evaluation checklist

For resume parsing, I would not evaluate only “does it produce JSON?”. I would check:

| Aspect | Example |
| --- | --- |
| JSON validity | Does it always return parseable JSON? |
| Schema compliance | Does it follow the target schema exactly? |
| Field exact match | email, phone, URLs |
| Normalized match | dates, locations, company names |
| Semantic match | job titles, degree names, responsibilities |
| Array alignment | Does each title match the correct company/date range? |
| Omission | Did it miss an experience item or degree? |
| Hallucination | Did it invent a company, skill, date, or degree? |
| Layout robustness | two-column resumes, sidebars, scanned PDFs |
| Long-document handling | multi-page resumes, truncation, repeated headers |
| Privacy handling | PII, consent, local processing, data retention |
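A few of these checks are cheap to automate. The sketch below covers JSON validity, exact match, omissions, and a crude hallucination proxy (a predicted value that never appears verbatim in the source text). The helper names are hypothetical, arrays are compared by index, and normalized or semantic matching would need more than this:

```python
import json
from typing import Any, Dict, Iterator, Optional, Tuple


def parse_json(raw: str) -> Optional[Any]:
    """JSON validity check: return the parsed object, or None if unparseable."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return None


def flatten(obj: Any, prefix: str = "") -> Iterator[Tuple[str, str]]:
    """Yield (path, value) for every non-null scalar leaf."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from flatten(value, f"{prefix}.{key}" if prefix else key)
    elif isinstance(obj, list):
        for i, value in enumerate(obj):
            yield from flatten(value, f"{prefix}[{i}]")
    elif obj is not None:
        yield prefix, str(obj)


def score(pred: Any, gold: Any, source_text: str) -> Dict[str, Any]:
    """Compare predicted and gold leaves; flag values unsupported by the source."""
    pred_leaves, gold_leaves = dict(flatten(pred)), dict(flatten(gold))
    exact = sum(1 for path, value in gold_leaves.items() if pred_leaves.get(path) == value)
    haystack = source_text.lower()
    return {
        "exact_match": exact / len(gold_leaves) if gold_leaves else 1.0,
        "omitted": [p for p in gold_leaves if p not in pred_leaves],
        "unsupported": [p for p, v in pred_leaves.items() if v.lower() not in haystack],
    }


source = "John Doe john@example.com Python SQL Acme Corp"
gold = {"name": "John Doe", "email": "john@example.com", "skills": ["Python", "SQL"]}
pred = parse_json('{"name": "John Doe", "email": "john@example.com", '
                  '"skills": ["Python", "Kubernetes"]}')
print(score(pred, gold, source))
# {'exact_match': 0.75, 'omitted': [], 'unsupported': ['skills[1]']}
```

The verbatim check will flag legitimately normalized values (reformatted dates, for example), so I would read the “unsupported” list as things to inspect, not as confirmed hallucinations.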

For structured extraction evaluation ideas, you can also look at:

  • ExtractBench
  • ParseBench dataset
  • OmniDocBench

These are not resume-specific, but they are useful for thinking about PDF-to-JSON and document parsing evaluation.


How I would keep searching

I would not rely only on the resume-parsing tag. Some newer models are weakly tagged or not tagged consistently.

Search across:

Hugging Face Models

  • resume parser
  • resume json
  • cv parser
  • resume ner
  • qwen resume parser
  • deepseek resume parser
  • resume-parsing tag

Hugging Face Spaces

Search for resume parser demos, but inspect the implementation:

  • app.py
  • requirements.txt
  • model calls
  • OCR/PDF parsing method
  • whether it handles PDF, DOCX, image, or only raw text
  • whether there is any evaluation

Spaces are useful for implementation patterns, but I would not treat a demo as evidence of model quality.

Blogs / Papers / Posts

Also watch document parsing and OCR releases, not just resume-specific models.

Useful entry points:

  • Hugging Face Papers
  • Hugging Face Blog
  • Hugging Face Posts
  • PaddleOCR 3.5 HF blog
  • OCR open models blog

The reason is that the hardest part may be converting the resume PDF/image into a faithful text/layout representation before the actual field extraction step.


My practical recommendation

If I were trying to solve this now, I would do this:

  1. If LayoutLMv3 is mandatory:

    • stop looking only for ordinary resume text datasets;
    • build a small LayoutLMv3-style gold dataset with images, OCR words, boxes, and labels;
    • keep train/inference OCR identical;
    • possibly use a text extractor/NER model to create weak labels, then manually correct them.
  2. If any model/tool is acceptable:

    • start with SmartResume;
    • test NuExtract3 with a resume JSON schema;
    • use PaddleOCR-VL, Docling, or olmOCR if PDF/image ingestion is the bottleneck;
    • compare against simpler text-to-JSON or NER routes like qwen3-0.6b-resume-json, NuExtract-tiny Resume, or oksomu/resume-ner.
  3. In both cases:

    • create a small hand-checked evaluation set from your actual target resumes;
    • test two-column resumes, scanned resumes, multi-page resumes, and unusual layouts;
    • evaluate omissions and hallucinations, not only JSON validity.

So my short answer would be:

I did not find a perfect public LayoutLMv3-ready resume dataset. But if the goal is resume parsing rather than specifically LayoutLMv3 training, the ecosystem is much better now: SmartResume, NuExtract3, PaddleOCR-VL/PaddleOCR, Qwen3 resume JSON, NuExtract-tiny Resume, and resume NER/section-routing models are all worth checking.
