How can I build a high-quality dataset?

Hugging Face Forums [Unofficial] June 11, 2026

Hmm, after looking into it, it seems to be something like this:

Short answer

I would separate the two questions.

For the first question:

Is it true that only larger models can maintain attention and follow noisy/long instructions reliably?

Mostly yes, in practice. Larger models usually handle messy, long, multi-constraint prompts better. But I would not phrase it as “only large models can do it.” A 0.8B model can still be useful if the product design reduces the burden on the model.

For the second question:

Should the n-gram model handle noisy Wikipedia tails, or should I clean them first?

Clean them first.

The n-gram model should be a quality scorer, not a garbage collector.

If the text has good Persian prose followed by reference/citation garbage, that is usually a boundary/truncation problem, not necessarily a reason to reject the whole document.

1. About 0.8B models and noisy prompts

Your intuition is reasonable.

A 0.8B model should not be expected to handle the same prompt complexity as a strong 7B, 8B, 14B, or frontier model.

The problem is not only Persian. Even much larger models can fail when prompts are:

long
messy
ambiguous
internally contradictory
full of irrelevant context
full of embedded instructions
multi-step
multi-constraint
noisy or poorly formatted

This is related to several known issues:

Lost in the Middle: models may fail to use information reliably when it appears in the middle of long contexts.
Instruction-following robustness / prompt injection: models may struggle to distinguish which instructions to follow and which to ignore.
Instruction-following survey: instruction following is a broad and still nontrivial problem.
Long-context instruction following: longer context windows do not automatically solve instruction adherence.

So I would agree with your concern:

If an 8B model struggles with noisy, unclear, long prompts, a 0.8B model will probably struggle more.

But the practical answer is not simply “give up.” The answer is:

Do not design the system so that the 0.8B model has to solve everything inside one messy prompt.

2. Reduce the burden on the 0.8B model

For a small model, the system design matters a lot.

Instead of asking the model to handle this:

long noisy user prompt
+ mixed task
+ unclear context
+ multiple instructions
+ irrelevant text
+ long document
+ expected structured answer

try to convert it into this:

short clean task
+ one clear instruction
+ limited relevant context
+ simple expected format

A 0.8B assistant can become much more useful if you do preprocessing before the prompt reaches the model.

Practical design

Problem	Better design
User gives long messy text	clean/split/summarize before model call
User asks multiple things	split into subtasks
Prompt contains irrelevant context	retrieve/select only relevant spans
Prompt is unclear	ask a clarification question
Prompt has many constraints	use a template with explicit fields
Long document QA	use short retrieved chunks
Math	use calculator/tool when possible
Tool calling	use strict schema and small examples
Grammar help	classify the grammar task first, then answer

For example, instead of:

User gives a long messy paragraph and asks the model to understand everything, correct grammar, summarize, answer questions, and explain.

you can make a pipeline:

input
  -> normalize
  -> detect task type
  -> remove irrelevant noise
  -> split into smaller chunks
  -> select the useful part
  -> send short structured prompt to model

For a small model, this kind of pipeline is often more important than trying to make the model “smart enough” to handle arbitrary mess.

3. Train on realistic noise, not arbitrary garbage

There is a difference between useful robustness data and garbage data.

Good noisy SFT examples

Good noisy examples teach the model to handle realistic user input:

typos
informal Persian
missing punctuation
mixed Persian-English terms
short unclear question
student misconception
slightly messy formatting

Bad noisy examples

Bad examples teach the model to imitate broken data:

citation fragments
broken references
HTML leftovers
random source names
duplicated lines
malformed dates
garbled bibliography text
OCR garbage
truncated sentences

The first kind can be useful for SFT.

The second kind should usually be removed from CPT data and from the “good” n-gram corpus.

So I would use this rule:

Noise type	Use in training?
natural human typo	maybe yes
informal Persian	yes, if target includes it
student mistake	yes, for tutor SFT
unclear user question	yes, if assistant learns to clarify
Wikipedia reference tail	no, remove or use as bad data
broken source list	no
malformed citation numbers	no
duplicated boilerplate	no
random mixed-language bibliography	no

4. About your Wikipedia example

The example you showed looks like this:

The beginning is normal Persian prose.
The ending becomes reference/citation garbage.
There are broken spaced numbers like ۲ ۸, ۱ ۳ ۵ ۷, ۱ ۳ ۳ ۲.
Source names and article titles are glued into the sentence.
The passage boundary seems wrong.

So I would not treat that as a pure “bad document” problem.

It is more like:

good prose + bad trailing span

That means the best operation is often:

keep the good part
truncate the bad tail

not:

reject the whole passage

and not:

let the n-gram model figure it out

A good cleaning system should detect that the text changes from normal prose into citation/reference fragments.

5. Clean before training the Good n-gram model

For the Good n-gram model, only train on text that you would be happy for the model to imitate.

If the Good n-gram model sees citation tails, it may learn that citation tails are normal Persian.

So I would do this:

raw Wikipedia text
  -> markup/HTML/boilerplate cleanup
  -> paragraph split
  -> line/sentence split
  -> reference-tail removal
  -> Persian normalization
  -> deduplication
  -> quality scoring
  -> Good KenLM training corpus

Only after this should you train the Good n-gram model.

This is consistent with how large open corpora are usually built. For example:

RedPajama discusses preprocessing Wikipedia to remove hyperlinks, comments, and formatting boilerplate.
FineWeb emphasizes filtering and deduplication as central parts of dataset construction.
FinerWeb-10BT shows that line-level filtering can improve data quality and training efficiency.

The practical lesson is:

Filtering is not something you add only at the end. It is part of corpus construction.
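
The deduplication step in the pipeline above could start as plain exact matching on whitespace-normalized paragraphs. This is only a minimal sketch: `dedup_paragraphs` is a hypothetical helper name, and near-duplicate detection (for example MinHash) is deliberately left out.

```python
import hashlib

def dedup_paragraphs(paragraphs):
    """Keep the first occurrence of each paragraph, comparing on collapsed whitespace."""
    seen = set()
    unique = []
    for paragraph in paragraphs:
        # Collapse runs of whitespace so trivial spacing differences do not hide duplicates.
        canonical = " ".join(paragraph.split())
        key = hashlib.sha1(canonical.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(paragraph)
    return unique
```

Running this after Persian normalization, as in the pipeline order above, lets paragraphs that differ only in ي/ی or digit style collapse to the same key.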

6. Use line/sentence-level cleaning, not only document-level cleaning

Your example is exactly why document-level filtering is not enough.

A document may contain:

good paragraph
good paragraph
good paragraph
bad reference tail

If you only classify the whole document as good/bad, you lose useful text.

Instead, use smaller units:

Unit	Use
document	broad quality / source metadata
paragraph	main CPT unit
line	boilerplate/reference detection
sentence	fine-grained truncation
span	remove bad tail after good prose

For Wikipedia-like data, I would do:

article
  -> sections
  -> paragraphs
  -> lines/sentences
  -> score each unit
  -> remove bad units
  -> optionally merge clean neighboring units

This is especially useful for tails like:

... normal Persian sentence. BBC Persian Abrahamian Modern Iran p۱ ۲ ۲ ...

The normal sentence can be kept. The reference tail should be removed.

7. Heuristics for reference-tail detection

You can start with simple heuristics.

Reject or truncate spans with:

many isolated numbers
too many digits
too many parentheses/brackets
URL / DOI / ISBN / ISSN patterns
English-heavy reference fragments
source names glued into Persian prose
bibliography-like patterns
repeated source names
very high punctuation ratio
very low Persian-letter ratio
abnormal spaced digits

Examples of suspicious patterns:

۲ ۸
۱ ۳ ۵ ۷
۱ ۳ ۳ ۲
p۱ ۲ ۲
ص ۲ ۸ ۳
BBC Persian
Modern Iran
ISBN
ISSN
doi
http
www

For Persian Wikipedia specifically, also watch for section/reference terms, but do not use them too naively.

Words like:

منابع
ارجاع
پیوند
جستارهای وابسته
پانویس
کتابشناسی

can indicate reference sections, but context matters.

For example:

منابع طبیعی ایران

is normal content, not a reference section.

So I would use these words mostly as:

section heading / line-level / end-of-article signal

not as a global document rejection rule.
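
One way to apply that rule is to match the terms only when they make up the whole line, so that a sentence such as "منابع طبیعی ایران" is never flagged. This is a rough sketch: the helper names are hypothetical, the heading set is just the list above, and the `== ... ==` stripping assumes wiki-style heading markers may still be present in the extracted text.

```python
import re

# Heading terms from the list above; extend after looking at real articles.
REFERENCE_HEADINGS = {"منابع", "ارجاع", "پیوند", "جستارهای وابسته", "پانویس", "کتابشناسی"}

def heading_text(line: str) -> str:
    """Strip surrounding whitespace, '=' heading markers and a trailing colon."""
    return re.sub(r"^[=\s]+|[=\s:]+$", "", line)

def is_reference_heading(line: str) -> bool:
    # Whole-line match only: a term inside a normal sentence does not count.
    return heading_text(line) in REFERENCE_HEADINGS

def cut_at_reference_section(lines: list[str]) -> list[str]:
    """Return the lines before the first reference-style heading, or all lines if none is found."""
    for i, line in enumerate(lines):
        if is_reference_heading(line):
            return lines[:i]
    return lines
```

Everything after the first matching heading is dropped, which fits the end-of-article use. For headings that can also appear mid-article, a per-section check would be safer.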

8. Truncation is often better than rejection

For your example, I would probably do something like:

Before:
<good Persian prose>. <good Persian prose>. <citation/source garbage> <broken numbers> <bibliography tail>

After:
<good Persian prose>. <good Persian prose>.

A practical rule:

If a paragraph starts as good Persian prose but later becomes citation-like,
truncate from the first suspicious boundary.

Possible boundary signals:

sudden English source title
sudden bibliography author/title/page pattern
many spaced digits
multiple source names in a row
Persian sentence without punctuation followed by reference fragments

This is not perfect, but it is much better than letting the n-gram model learn the garbage.

9. Good LM / Bad LM setup

Your n-gram idea can still be useful.

I would use two n-gram models:

Good LM

Train on:

clean Persian prose
clean Wikipedia paragraphs
curated educational text
high-confidence manually accepted examples

Bad LM

Train on:

reference tails
citation fragments
boilerplate
broken OCR-like text
mixed-language bibliography
malformed Wikipedia tails
rejected OSCAR chunks

Then score candidates with both.

A candidate is better if:

Good LM likes it
Bad LM does not like it

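Conceptually, with log-probabilities (higher means the LM likes the text):

score = good_lm_logprob - bad_lm_logprob

so a higher score means a better candidate. If you work with perplexities instead, the sign flips (bad_lm_ppl - good_lm_ppl). Any similar ratio/difference works.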

Do not overthink the formula at first. The important idea is:

Good LM should model what you want. Bad LM should model what you want to remove.

This is better than a single perplexity threshold.
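
To make the two-model idea concrete, here is a toy sketch that uses a tiny smoothed character trigram model as a stand-in for KenLM. The class, the smoothing constants (`alpha`, `vocab_size`), and the example sentences are all assumptions for illustration. In a real setup the two models would be trained n-gram models and the scores would come from them.

```python
import math
from collections import Counter

class CharNgramLM:
    """Toy add-alpha smoothed character n-gram model (stand-in for a real n-gram LM)."""

    def __init__(self, texts, n=3, alpha=0.1, vocab_size=200):
        self.n, self.alpha, self.vocab_size = n, alpha, vocab_size
        self.ngrams, self.contexts = Counter(), Counter()
        for text in texts:
            padded = " " * (n - 1) + text
            for i in range(n - 1, len(padded)):
                self.ngrams[padded[i - n + 1:i + 1]] += 1
                self.contexts[padded[i - n + 1:i]] += 1

    def avg_logprob(self, text):
        """Average log-probability per character; higher means more familiar text."""
        padded = " " * (self.n - 1) + text
        total = 0.0
        for i in range(self.n - 1, len(padded)):
            num = self.ngrams[padded[i - self.n + 1:i + 1]] + self.alpha
            den = self.contexts[padded[i - self.n + 1:i]] + self.alpha * self.vocab_size
            total += math.log(num / den)
        return total / max(1, len(text))

def quality_score(text, good_lm, bad_lm):
    """Positive when the Good LM prefers the text, negative when the Bad LM does."""
    return good_lm.avg_logprob(text) - bad_lm.avg_logprob(text)

good_lm = CharNgramLM(["او به خانه رفت و کتاب خواند.", "هوا امروز خوب است."])
bad_lm = CharNgramLM(["BBC Persian p 1 2 2 ISBN 978", "Modern Iran p 3 4 doi 10.1"])
```

Averaging per character keeps long and short candidates comparable. The keep/reject threshold on `quality_score` would still have to be tuned on audited samples.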

10. Persian normalization

Before scoring with n-gram models, normalize Persian consistently.

Useful tools:

Hazm
Hazm GitHub
PersianTools
Lucene PersianNormalizer

Things to normalize:

Arabic/Persian ي/ی
Arabic/Persian ك/ک
heh variants
Arabic/Persian digits
diacritics
extra tatweel/kashida
extra spaces
weird zero-width characters
punctuation spacing
half-space / ZWNJ

For example, Hazm’s normalizer is useful for standard Persian text normalization, including spacing and ZWNJ-related normalization.
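
If you want to see what such a normalizer does under the hood, a small part of the list above can be sketched with the standard library alone. This is not Hazm's implementation. The mapping is a minimal assumed subset (ي/ی, ك/ک, tatweel, common diacritics, extra spaces); heh variants, digits, and ZWNJ are not handled here, and whether diacritics should be removed at all depends on your corpus.

```python
import re
import unicodedata

CHAR_MAP = str.maketrans({
    "\u064A": "\u06CC",  # Arabic yeh -> Persian yeh
    "\u0643": "\u06A9",  # Arabic kaf -> Persian keheh
    "\u0640": None,      # tatweel / kashida is dropped
})
# Fathatan .. sukun: the common Arabic short-vowel and shadda marks.
DIACRITICS = re.compile(r"[\u064B-\u0652]")
SPACES = re.compile(r"[ \t\u00A0]+")

def normalize_chars(text: str) -> str:
    text = unicodedata.normalize("NFC", text)   # one canonical Unicode form
    text = text.translate(CHAR_MAP)
    text = DIACRITICS.sub("", text)
    text = SPACES.sub(" ", text)                # collapse runs of spaces, tabs, NBSP
    return text.strip()
```

Whatever normalizer you end up using, apply the same one to the Good LM corpus, the Bad LM corpus, and every candidate you score. Otherwise the n-gram scores partly measure normalization differences.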

11. Do not simply remove all ZWNJ

For Persian, ZWNJ is not just random noise.

It can be meaningful in words like:

کتاب‌ها
می‌روم
خانه‌ای
رفته‌ام

So I would not simply delete every zero-width non-joiner.

Better:

normalize/correct ZWNJ
remove weird repeated zero-width characters
standardize Unicode form
collapse multiple zero-width chars
keep valid Persian ZWNJ where appropriate

Persian word segmentation and ZWNJ recognition are real NLP problems; see, for example, Joint Persian Word Segmentation Correction and ZWNJ Recognition.

Practical rule:

Normalize ZWNJ; do not blindly remove it.
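
A conservative version of that rule might look like the sketch below. The set of "other" zero-width characters to drop is an assumption, and the function only cleans up obviously meaningless ZWNJ. It does not try to insert a missing ZWNJ, which is the harder correction problem mentioned above.

```python
import re

ZWNJ = "\u200c"
# Assumed list of invisible characters that are rarely meaningful in Persian prose:
# zero-width space, zero-width joiner, word joiner, BOM.
OTHER_ZERO_WIDTH = re.compile("[\u200b\u200d\u2060\ufeff]")
REPEATED_ZWNJ = re.compile(f"{ZWNJ}+")
# A ZWNJ next to whitespace or at the edge of the string joins nothing.
DANGLING_ZWNJ = re.compile(rf"(?<=\s){ZWNJ}|{ZWNJ}(?=\s)|^{ZWNJ}|{ZWNJ}$")

def normalize_zwnj(text: str) -> str:
    text = OTHER_ZERO_WIDTH.sub("", text)
    text = REPEATED_ZWNJ.sub(ZWNJ, text)   # collapse runs to a single ZWNJ
    text = DANGLING_ZWNJ.sub("", text)     # keep only ZWNJ between two visible characters
    return text
```

Valid uses inside words such as کتاب‌ها and می‌روم pass through unchanged.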

12. Digits: pick a convention

Persian corpora often contain mixed:

Persian digits: ۱۲۳
Arabic digits: ١٢٣
Latin digits: 123
broken spaced digits: ۱ ۲ ۳

You should choose a convention.

For CPT prose, Persian digits may be natural.

For JSON, tool calling, math verification, and metadata, Latin digits are often easier.

A practical approach:

Context	Suggested convention
raw Persian prose CPT	Persian digits are okay
math internal answer field	Latin digits
JSON/tool arguments	Latin digits
final Persian display answer	Persian digits are okay
metadata	Latin digits
broken spaced digits	fix if obvious, otherwise reject/truncate

The important thing is consistency.

Do not let this happen randomly:

۲ 8 ٣ ۴ 5

unless you intentionally want mixed-digit robustness data.

For your example, ۲ ۸ مرداد should probably become:

۲۸ مرداد

and ۱ ۳ ۵ ۷ should become:

۱۳۵۷

if you are confident it is a date/year.

But if the number sequence is ambiguous, reject or truncate that span.
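
Both steps, picking a digit convention and repairing obvious spaced digits, can be sketched as below. The function names are hypothetical. The joining rule only merges runs of single digits separated by one space, and it cannot tell a broken year from two genuinely separate numbers, so it should only run on spans you already trust to be dates or years.

```python
import re

LATIN = "0123456789"
PERSIAN = "".join(chr(0x06F0 + i) for i in range(10))  # ۰..۹
ARABIC = "".join(chr(0x0660 + i) for i in range(10))   # ٠..٩

TO_LATIN = str.maketrans(PERSIAN + ARABIC, LATIN * 2)
TO_PERSIAN = str.maketrans(LATIN + ARABIC, PERSIAN * 2)

def to_latin_digits(text: str) -> str:
    return text.translate(TO_LATIN)

def to_persian_digits(text: str) -> str:
    return text.translate(TO_PERSIAN)

D = "0-9\u06F0-\u06F9\u0660-\u0669"
# A single digit followed by one or more " <single digit>" groups, e.g. "۱ ۳ ۵ ۷".
# The lookarounds keep multi-digit numbers such as "12 34" untouched.
SPACED_SINGLE_DIGITS = re.compile(rf"(?<![{D}])[{D}](?: [{D}](?![{D}]))+")

def join_spaced_digits(text: str) -> str:
    return SPACED_SINGLE_DIGITS.sub(lambda m: m.group(0).replace(" ", ""), text)
```

For example, `join_spaced_digits` turns "۲ ۸ مرداد ۱ ۳ ۳ ۲" into "۲۸ مرداد ۱۳۳۲", and `to_latin_digits` can then be applied wherever the table above suggests Latin digits.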

13. A simple cleaning pipeline for your current case

I would implement something like this:

1. Extract text
2. Normalize Unicode
3. Normalize Persian letters
4. Normalize digits
5. Split into paragraphs
6. Split paragraphs into sentences/lines
7. Detect reference-like lines/spans
8. Truncate bad tails
9. Remove very bad paragraphs
10. Deduplicate
11. Train Good LM on clean accepted text
12. Train Bad LM on rejected tails/noise
13. Score new chunks
14. Manually audit samples

More concrete:

input paragraph
  -> sentence split
  -> for each sentence/span:
       calculate Persian letter ratio
       calculate digit ratio
       calculate Latin ratio
       calculate punctuation ratio
       detect citation markers
       detect spaced-digit patterns
       detect source-name/reference tail
  -> if bad tail starts after good text:
       keep text before bad tail
  -> else if whole paragraph is bad:
       reject
  -> else:
       keep

14. Example pseudo-code

Very rough pseudo-code:

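import re

# Arabic-script letters only: the basic Arabic letters plus Persian پ چ ژ ک گ ی.
# Digits, tatweel and diacritics are deliberately excluded.
PERSIAN_LETTERS = "\u0621-\u063A\u0641-\u064A\u067E\u0686\u0698\u06A9\u06AF\u06CC"
DIGITS = "0-9\u06F0-\u06F9\u0660-\u0669"  # Latin, Persian, Arabic-Indic

LETTER_RE = re.compile(f"[{PERSIAN_LETTERS}]")
DIGIT_RE = re.compile(f"[{DIGITS}]")
# Digits separated by whitespace, e.g. "۱ ۳ ۵ ۷" or "۲ ۸"
SPACED_DIGITS_RE = re.compile(rf"[{DIGITS}](\s+[{DIGITS}])+")
# Citation markers and source names seen in the example; extend from your own data.
REFERENCE_RE = re.compile(
    "|".join([
        r"http",
        r"www\.",
        r"\bdoi\b",
        r"ISBN",
        r"ISSN",
        r"BBC Persian",
        r"Modern Iran",
        rf"\bp\s*[{DIGITS}]",   # English page marker, e.g. "p۱ ۲ ۲"
        rf"ص\s*[{DIGITS}]",     # Persian page marker, e.g. "ص ۲ ۸ ۳"
    ]),
    flags=re.IGNORECASE,
)

def _non_space_len(text):
    return max(1, sum(1 for c in text if not c.isspace()))

def persian_ratio(text):
    return len(LETTER_RE.findall(text)) / _non_space_len(text)

def digit_ratio(text):
    return len(DIGIT_RE.findall(text)) / _non_space_len(text)

def has_spaced_digits(text):
    return bool(SPACED_DIGITS_RE.search(text))

def looks_reference_like(text):
    return bool(REFERENCE_RE.search(text))

def is_bad_tail(sentence):
    # Thresholds are starting guesses; tune them on audited samples.
    if looks_reference_like(sentence):
        return True
    if has_spaced_digits(sentence) and digit_ratio(sentence) > 0.08:
        return True
    if persian_ratio(sentence) < 0.45:
        return True
    return False

def truncate_bad_tail(sentences):
    clean = []
    for sent in sentences:
        if is_bad_tail(sent):
            if clean:
                break      # first bad sentence after good prose: cut here
            continue       # skip bad sentences that come before any good prose
        clean.append(sent)
    return " ".join(clean)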

This is only a starting point. You would need to tune it by looking at accepted/rejected samples.

15. Manual audit is still necessary

Do not trust the cleaner blindly.

After each cleaning rule change, sample:

100 accepted chunks
100 rejected chunks
100 truncated chunks

Then inspect:

Sample type	What to check
accepted	Did garbage survive?
rejected	Did good Persian get removed?
truncated	Did truncation cut at the right place?
borderline	Should this become a new rule?

This is the same kind of loop you are already doing:

clean -> review -> adjust rules -> clean again

That loop is the correct approach.
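
Drawing the audit samples is easy to script. A minimal sketch, assuming the cleaner's output is already grouped into lists per outcome (the `audit_sample` name and the dictionary layout are hypothetical):

```python
import random

def audit_sample(chunks_by_label, k=100, seed=0):
    """Draw up to k random chunks per label (accepted / rejected / truncated / ...)."""
    rng = random.Random(seed)  # fixed seed so a review round can be reproduced
    return {
        label: rng.sample(chunks, min(k, len(chunks)))
        for label, chunks in chunks_by_label.items()
    }
```

Changing the seed after each rule change avoids re-reading the same 100 chunks every round.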

16. How to train the small model for noisy prompts

For the 0.8B model, I would not train it on arbitrary noisy prompts.

I would train it on controlled noisy prompts.

Examples:

Good robustness SFT

User has typo -> assistant still answers.
User asks unclear question -> assistant asks clarification.
User includes irrelevant sentence -> assistant focuses on main question.
User asks two things -> assistant separates them.
User provides messy Persian -> assistant normalizes meaning.

Bad robustness SFT

User prompt contains random citation garbage -> assistant imitates it.
User prompt has broken reference tail -> assistant treats it as meaningful.
User prompt has unrelated source list -> assistant summarizes garbage.

For a small model, the best behavior is often:

I cannot reliably answer from this messy text. Please provide a clearer sentence.

or:

The first part is understandable, but the ending looks like broken reference text.

That is a valid assistant behavior.

The goal is not to make the 0.8B model magically robust to every bad prompt. The goal is to make it fail gracefully.

17. Product-level strategy for a 0.8B assistant

For a low-end-device assistant, I would design guardrails around the model.

raw input
  -> normalization
  -> task classifier
  -> noise detector
  -> chunk selector
  -> small model
  -> output validator

If the input is too messy:

ask clarification

If the input is too long:

split/summarize first

If it contains reference garbage:

remove or warn

If it asks for math:

use calculator/tool if available

If it asks for grammar:

route to grammar-tutor prompt template

Small models work better when the surrounding system reduces ambiguity.
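
The routing rules above fit in a few lines once the upstream signals exist. In this sketch the signal fields, the thresholds, and the action names are all hypothetical. It assumes a task classifier and a noise detector have already run, and it only shows the order in which the checks could be applied.

```python
from dataclasses import dataclass

@dataclass
class InputSignals:
    text: str
    task: str                 # e.g. "math", "grammar", "qa" (from an assumed task classifier)
    noise_score: float        # 0.0 (clean) .. 1.0 (garbage), from an assumed noise detector
    has_reference_tail: bool  # e.g. from a reference-tail heuristic

def route(sig: InputSignals, max_chars: int = 1500, max_noise: float = 0.6) -> str:
    """Pick the next step for one user input; thresholds are placeholders to tune."""
    if sig.noise_score > max_noise:
        return "ask_clarification"
    if sig.has_reference_tail:
        return "strip_reference_tail"
    if len(sig.text) > max_chars:
        return "split_or_summarize"
    if sig.task == "math":
        return "calculator_tool"
    if sig.task == "grammar":
        return "grammar_tutor_template"
    return "small_model"
```

The cleanup actions come first on purpose: after stripping or splitting, the input would be routed again, so the small model only ever sees text that has passed every earlier check.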

18. Bottom line

For the noisy prompt question:

Larger models generally handle long, noisy, multi-constraint prompts better. But a 0.8B model can still be useful if you reduce the prompt burden with preprocessing, task routing, shorter context, templates, and controlled robustness SFT.

For the n-gram cleaning question:

Do not let the n-gram model handle structural garbage by itself. Remove obvious reference/citation/wiki-tail noise first. Train the Good LM only on text you want the model to imitate. Optionally train a Bad LM on the garbage you want to detect.

The most important rule is:

The n-gram model should be a quality scorer, not a garbage collector.