How can I build a high-quality dataset?
Hmm, after looking into it, it seems to be something like this:
Short answer
I would separate the two questions.
For the first question:
Is it true that only larger models can maintain attention and follow noisy/long instructions reliably?
Mostly yes, in practice. Larger models usually handle messy, long, multi-constraint prompts better. But I would not phrase it as “only large models can do it.” A 0.8B model can still be useful if the product design reduces the burden on the model.
For the second question:
Should the n-gram model handle noisy Wikipedia tails, or should I clean them first?
Clean them first.
The n-gram model should be a quality scorer, not a garbage collector.
If the text has good Persian prose followed by reference/citation garbage, that is usually a boundary/truncation problem, not necessarily a reason to reject the whole document.
1. About 0.8B models and noisy prompts
Your intuition is reasonable.
A 0.8B model should not be expected to handle the same prompt complexity as a strong 7B, 8B, 14B, or frontier model.
The problem is not only Persian. Even much larger models can fail when prompts are:
- long
- messy
- ambiguous
- internally contradictory
- full of irrelevant context
- full of embedded instructions
- multi-step
- multi-constraint
- noisy or poorly formatted
This is related to several known issues:
- Lost in the Middle: models may fail to use information reliably when it appears in the middle of long contexts.
- Instruction-following robustness / prompt injection: models may struggle to distinguish which instructions to follow and which to ignore.
- Instruction-following survey: instruction following is a broad and still nontrivial problem.
- Long-context instruction following: longer context windows do not automatically solve instruction adherence.
So I would agree with your concern:
If an 8B model struggles with noisy, unclear, long prompts, a 0.8B model will probably struggle more.
But the practical answer is not simply “give up.” The answer is:
Do not design the system so that the 0.8B model has to solve everything inside one messy prompt.
2. Reduce the burden on the 0.8B model
For a small model, the system design matters a lot.
Instead of asking the model to handle this:
long noisy user prompt
+ mixed task
+ unclear context
+ multiple instructions
+ irrelevant text
+ long document
+ expected structured answer
try to convert it into this:
short clean task
+ one clear instruction
+ limited relevant context
+ simple expected format
A 0.8B assistant can become much more useful if you do preprocessing before the prompt reaches the model.
Practical design
| Problem | Better design |
|---|---|
| User gives long messy text | clean/split/summarize before model call |
| User asks multiple things | split into subtasks |
| Prompt contains irrelevant context | retrieve/select only relevant spans |
| Prompt is unclear | ask a clarification question |
| Prompt has many constraints | use a template with explicit fields |
| Long document QA | use short retrieved chunks |
| Math | use calculator/tool when possible |
| Tool calling | use strict schema and small examples |
| Grammar help | classify the grammar task first, then answer |
For example, instead of:
User gives a long messy paragraph and asks the model to understand everything, correct grammar, summarize, answer questions, and explain.
you can make a pipeline:
input
-> normalize
-> detect task type
-> remove irrelevant noise
-> split into smaller chunks
-> select the useful part
-> send short structured prompt to model
For a small model, this kind of pipeline is often more important than trying to make the model “smart enough” to handle arbitrary mess.
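The last step of that pipeline, "send short structured prompt to model", can be sketched as a template with explicit fields. This is a minimal illustration, not a fixed recipe: `build_prompt`, the field names, and the `MAX_CONTEXT_CHARS` budget are all assumptions to adapt to your own model.

```python
import re

MAX_CONTEXT_CHARS = 600  # assumed budget; tune for your model's context window


def build_prompt(task: str, instruction: str, context: str) -> str:
    """Assemble a short prompt with explicit fields instead of free-form text."""
    # Collapse whitespace so stray newlines and tabs do not add noise.
    context = re.sub(r"\s+", " ", context).strip()
    # Hard cap on context length; a real system would select relevant spans first.
    context = context[:MAX_CONTEXT_CHARS]
    return (
        f"Task: {task}\n"
        f"Instruction: {instruction}\n"
        f"Context: {context}\n"
        "Answer:"
    )
```

The point of the fixed fields is that the small model always sees the same layout, so it only has to handle one clear instruction and a bounded amount of context.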
3. Train on realistic noise, not arbitrary garbage
There is a difference between useful robustness data and garbage data.
Good noisy SFT examples
Good noisy examples teach the model to handle realistic user input:
typos
informal Persian
missing punctuation
mixed Persian-English terms
short unclear question
student misconception
slightly messy formatting
Bad noisy examples
Bad examples teach the model to imitate broken data:
citation fragments
broken references
HTML leftovers
random source names
duplicated lines
malformed dates
garbled bibliography text
OCR garbage
truncated sentences
The first kind can be useful for SFT.
The second kind should usually be removed from CPT data and from the “good” n-gram corpus.
So I would use this rule:
| Noise type | Use in training? |
|---|---|
| natural human typo | maybe yes |
| informal Persian | yes, if target includes it |
| student mistake | yes, for tutor SFT |
| unclear user question | yes, if assistant learns to clarify |
| Wikipedia reference tail | no, remove or use as bad data |
| broken source list | no |
| malformed citation numbers | no |
| duplicated boilerplate | no |
| random mixed-language bibliography | no |
4. About your Wikipedia example
The example you showed looks like this:
- The beginning is normal Persian prose.
- The ending becomes reference/citation garbage.
- There are broken spaced numbers like ۲ ۸, ۱ ۳ ۵ ۷, ۱ ۳ ۳ ۲.
- Source names and article titles are glued into the sentence.
- The passage boundary seems wrong.
So I would not treat that as a pure “bad document” problem.
It is more like:
good prose + bad trailing span
That means the best operation is often:
keep the good part
truncate the bad tail
not:
reject the whole passage
and not:
let the n-gram model figure it out
A good cleaning system should detect that the text changes from normal prose into citation/reference fragments.
5. Clean before training the Good n-gram model
For the Good n-gram model, only train on text that you would be happy for the model to imitate.
If the Good n-gram model sees citation tails, it may learn that citation tails are normal Persian.
So I would do this:
raw Wikipedia text
-> markup/HTML/boilerplate cleanup
-> paragraph split
-> line/sentence split
-> reference-tail removal
-> Persian normalization
-> deduplication
-> quality scoring
-> Good KenLM training corpus
Only after this should you train the Good n-gram model.
This is consistent with how large open corpora are usually built. For example:
- RedPajama discusses preprocessing Wikipedia to remove hyperlinks, comments, and formatting boilerplate.
- FineWeb emphasizes filtering and deduplication as central parts of dataset construction.
- FinerWeb-10BT shows that line-level filtering can improve data quality and training efficiency.
The practical lesson is:
Filtering is not something you add only at the end. It is part of corpus construction.
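As one concrete piece of that construction, the deduplication step could start as exact-match deduplication over whitespace-normalized paragraphs. This is only a sketch: the function names are hypothetical, and near-duplicate detection would need a different technique that is not shown here.

```python
import hashlib
import re


def dedup_key(paragraph: str) -> str:
    """Hash of the paragraph after collapsing whitespace, used as a duplicate key."""
    normalized = re.sub(r"\s+", " ", paragraph).strip()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def deduplicate(paragraphs):
    """Keep the first occurrence of each paragraph, preserving order."""
    seen = set()
    unique = []
    for paragraph in paragraphs:
        key = dedup_key(paragraph)
        if key not in seen:
            seen.add(key)
            unique.append(paragraph)
    return unique
```

Running this after Persian normalization means that paragraphs differing only in ي/ی or spacing collapse to the same key.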
6. Use line/sentence-level cleaning, not only document-level cleaning
Your example is exactly why document-level filtering is not enough.
A document may contain:
good paragraph
good paragraph
good paragraph
bad reference tail
If you only classify the whole document as good/bad, you lose useful text.
Instead, use smaller units:
| Unit | Use |
|---|---|
| document | broad quality / source metadata |
| paragraph | main CPT unit |
| line | boilerplate/reference detection |
| sentence | fine-grained truncation |
| span | remove bad tail after good prose |
For Wikipedia-like data, I would do:
article
-> sections
-> paragraphs
-> lines/sentences
-> score each unit
-> remove bad units
-> optionally merge clean neighboring units
This is especially useful for tails like:
... normal Persian sentence. BBC Persian Abrahamian Modern Iran p۱ ۲ ۲ ...
The normal sentence can be kept. The reference tail should be removed.
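The article -> paragraphs -> lines -> score -> merge flow can be sketched as below. The scorer is passed in as a function, so any heuristic or n-gram score can be plugged in; `clean_article` and `toy_is_bad` are hypothetical names, and the toy scorer only exists to make the example runnable.

```python
def clean_article(article: str, is_bad_unit) -> str:
    """Split an article into paragraphs and lines, drop bad lines, rejoin the rest."""
    kept_paragraphs = []
    for paragraph in article.split("\n\n"):
        lines = [line.strip() for line in paragraph.splitlines() if line.strip()]
        good_lines = [line for line in lines if not is_bad_unit(line)]
        # A paragraph survives if at least one of its lines survives.
        if good_lines:
            kept_paragraphs.append("\n".join(good_lines))
    return "\n\n".join(kept_paragraphs)


def toy_is_bad(line: str) -> bool:
    # Hypothetical stand-in scorer: flags lines containing common reference markers.
    return any(marker in line for marker in ("ISBN", "doi", "http"))
```

Because the decision is made per line, a paragraph with one bad line keeps its good lines instead of being rejected as a whole.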
7. Heuristics for reference-tail detection
You can start with simple heuristics.
Reject or truncate spans with:
many isolated numbers
too many digits
too many parentheses/brackets
URL / DOI / ISBN / ISSN patterns
English-heavy reference fragments
source names glued into Persian prose
bibliography-like patterns
repeated source names
very high punctuation ratio
very low Persian-letter ratio
abnormal spaced digits
Examples of suspicious patterns:
۲ ۸
۱ ۳ ۵ ۷
۱ ۳ ۳ ۲
p۱ ۲ ۲
ص ۲ ۸ ۳
BBC Persian
Modern Iran
ISBN
ISSN
doi
http
www
For Persian Wikipedia specifically, also watch for section/reference terms, but do not use them too naively.
Words like:
منابع
ارجاع
پیوند
جستارهای وابسته
پانویس
کتابشناسی
can indicate reference sections, but context matters.
For example:
منابع طبیعی ایران
is normal content, not a reference section.
So I would use these words mostly as:
section heading / line-level / end-of-article signal
not as a global document rejection rule.
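One way to apply that distinction is to match these words only when they make up the whole line, as a heading would. The sketch below assumes headings may be wrapped in wiki-style `=` markers and compares with ZWNJ removed so both spellings of کتابشناسی match; the heading set is taken from the list above and should be extended after auditing real articles.

```python
import re

ZWNJ = "\u200c"
# Terms from the list above, stored without ZWNJ.
REFERENCE_HEADINGS = {"منابع", "ارجاع", "پیوند", "جستارهای وابسته", "پانویس", "کتابشناسی"}


def is_reference_heading(line: str) -> bool:
    """True only when the whole line is a reference heading, not a substring match."""
    # Strip heading markers (=), surrounding whitespace, and a trailing colon.
    core = re.sub(r"^[\s=]+|[\s=:]+$", "", line).replace(ZWNJ, "")
    return core in REFERENCE_HEADINGS


def cut_at_reference_section(lines):
    """Keep lines up to the first reference heading (end-of-article signal)."""
    for i, line in enumerate(lines):
        if is_reference_heading(line):
            return lines[:i]
    return lines
```

With this, a line that is just منابع ends the article, while منابع طبیعی ایران is left alone because the line contains more than the heading word.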
8. Truncation is often better than rejection
For your example, I would probably do something like:
Before:
<good Persian prose>. <good Persian prose>. <citation/source garbage> <broken numbers> <bibliography tail>
After:
<good Persian prose>. <good Persian prose>.
A practical rule:
If a paragraph starts as good Persian prose but later becomes citation-like,
truncate from the first suspicious boundary.
Possible boundary signals:
sudden English source title
sudden bibliography author/title/page pattern
many spaced digits
multiple source names in a row
Persian sentence without punctuation followed by reference fragments
This is not perfect, but it is much better than letting the n-gram model learn the garbage.
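The truncation rule can be sketched at character level: find the first citation-like span, then back up to the last sentence terminator before it. The patterns below are assumptions derived from the boundary signals listed above, and they will produce false positives on legitimate mixed Persian-English prose, so treat them as a starting point to tune.

```python
import re

DIGIT = "0-9\u06F0-\u06F9\u0660-\u0669"  # Latin, Persian, Arabic-Indic digits

# Boundary signals; the exact patterns are assumptions to tune on real samples.
BOUNDARY_PATTERNS = [
    rf"[{DIGIT}](?:\s+[{DIGIT}]){{2,}}",  # three or more space-separated digits
    r"[A-Za-z]{2,}(?:\s+[A-Za-z]{2,})+",  # run of two or more Latin words
    r"\b(?:ISBN|ISSN|doi)\b|https?://|www\.",
]
BOUNDARY_RE = re.compile("|".join(BOUNDARY_PATTERNS), flags=re.IGNORECASE)
SENTENCE_END_RE = re.compile(r"[.!?؟]")


def truncate_at_first_boundary(paragraph: str) -> str:
    """Cut a paragraph at the first citation-like span, backing up to a sentence end."""
    match = BOUNDARY_RE.search(paragraph)
    if match is None:
        return paragraph
    head = paragraph[:match.start()]
    # Back up to the last sentence terminator so no half sentence is left behind.
    ends = [m.end() for m in SENTENCE_END_RE.finditer(head)]
    return head[:ends[-1]].strip() if ends else ""
```

Backing up to a sentence end also covers the "Persian sentence without punctuation followed by reference fragments" case: the unterminated sentence is dropped along with the tail rather than kept half-finished.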
9. Good LM / Bad LM setup
Your n-gram idea can still be useful.
I would use two n-gram models:
Good LM
Train on:
clean Persian prose
clean Wikipedia paragraphs
curated educational text
high-confidence manually accepted examples
Bad LM
Train on:
reference tails
citation fragments
boilerplate
broken OCR-like text
mixed-language bibliography
malformed Wikipedia tails
rejected OSCAR chunks
Then score candidates with both.
A candidate is better if:
Good LM likes it
Bad LM does not like it
Conceptually:
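score = good_lm_logprob - bad_lm_logprob
using length-normalized log-probabilities, so a higher score means a better candidate. With perplexities, where lower is better, the same idea is bad_lm_perplexity - good_lm_perplexity. Any similar ratio or difference works.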
Do not overthink the formula at first. The important idea is:
Good LM should model what you want. Bad LM should model what you want to remove.
This is better than a single perplexity threshold.
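A minimal sketch of two-model scoring follows. It works on log-probabilities, where higher means the model likes the text more, so the difference is taken as Good minus Bad and a higher result is better. `StubLM` is a hypothetical stand-in that only mimics one assumption about the real model: that `score(text)` returns a total log10 probability, as the `kenlm` Python binding's `Model.score` does. Swap in real trained models in practice.

```python
class StubLM:
    """Stand-in for a trained n-gram model, used only to make this sketch runnable."""

    def __init__(self, known_words, hit=-1.0, miss=-5.0):
        self.known_words = set(known_words)
        self.hit, self.miss = hit, miss

    def score(self, text: str) -> float:
        # Total log10 probability: known words are cheap, unknown words are costly.
        return sum(self.hit if w in self.known_words else self.miss for w in text.split())


def per_token_logprob(lm, text: str) -> float:
    """Length-normalized log-probability, so long and short spans are comparable."""
    return lm.score(text) / max(1, len(text.split()))


def quality_score(good_lm, bad_lm, text: str) -> float:
    """Higher means the Good LM likes the text more than the Bad LM does."""
    return per_token_logprob(good_lm, text) - per_token_logprob(bad_lm, text)
```

Length normalization matters here: without it, long clean paragraphs would get worse totals than short garbage fragments simply because they contain more tokens.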
10. Persian normalization
Before scoring with n-gram models, normalize Persian consistently.
Useful tools:
- Hazm
- Hazm GitHub
- PersianTools
- Lucene PersianNormalizer
Things to normalize:
Arabic/Persian ي/ی
Arabic/Persian ك/ک
heh variants
Arabic/Persian digits
diacritics
extra tatweel/kashida
extra spaces
weird zero-width characters
punctuation spacing
half-space / ZWNJ
For example, Hazm’s normalizer is useful for standard Persian text normalization, including spacing and ZWNJ-related normalization.
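For a sense of what such normalization involves, here is a standard-library-only sketch covering a few items from the list: ي/ی, ك/ک, tatweel, diacritics, and extra spaces. It is not Hazm's implementation, the character choices are assumptions, and it deliberately leaves digits and ZWNJ untouched.

```python
import re
import unicodedata

CHAR_MAP = str.maketrans({
    "\u064A": "\u06CC",  # Arabic yeh -> Persian yeh
    "\u0649": "\u06CC",  # alef maksura -> Persian yeh (assumption: common in Persian text)
    "\u0643": "\u06A9",  # Arabic kaf -> Persian kaf
    "\u0640": None,      # tatweel/kashida is removed
})
DIACRITICS_RE = re.compile("[\u064B-\u0652]")  # fathatan through sukun
SPACES_RE = re.compile("[ \t\u00A0]+")         # spaces, tabs, no-break spaces


def normalize_persian(text: str) -> str:
    """Apply a small, consistent set of character-level normalizations."""
    text = unicodedata.normalize("NFC", text)
    text = text.translate(CHAR_MAP)
    text = DIACRITICS_RE.sub("", text)
    text = SPACES_RE.sub(" ", text)
    return text.strip()
```

Whatever normalizer you use, apply the same one to the Good LM corpus, the Bad LM corpus, and every candidate you score, otherwise the models penalize spelling variants instead of quality.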
11. Do not simply remove all ZWNJ
For Persian, ZWNJ is not just random noise.
It can be meaningful in words like:
کتاب‌ها
می‌روم
خانه‌ای
رفته‌ام
So I would not simply delete every zero-width non-joiner.
Better:
normalize/correct ZWNJ
remove weird repeated zero-width characters
standardize Unicode form
collapse multiple zero-width chars
keep valid Persian ZWNJ where appropriate
Persian word segmentation and ZWNJ recognition are real NLP problems; see, for example, Joint Persian Word Segmentation Correction and ZWNJ Recognition.
Practical rule:
Normalize ZWNJ; do not blindly remove it.
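A conservative cleanup along those lines might look like the sketch below. It removes other invisible characters, collapses repeated ZWNJ, and drops a ZWNJ that sits next to whitespace or at the edge of the string, while keeping a ZWNJ between two letters. The list of "other invisible characters" is an assumption, and the function does not try to insert missing ZWNJ, which is the harder problem.

```python
import re

ZWNJ = "\u200c"
# Other invisible characters that are usually noise in scraped text (assumed list).
OTHER_INVISIBLE_RE = re.compile("[\u200b\u200d\u200e\u200f\u2060\ufeff]")


def clean_zwnj(text: str) -> str:
    """Remove stray zero-width characters but keep ZWNJ between letters."""
    text = OTHER_INVISIBLE_RE.sub("", text)
    # Collapse runs of ZWNJ into a single one.
    text = re.sub(f"{ZWNJ}+", ZWNJ, text)
    # A ZWNJ next to whitespace joins nothing, so drop it.
    text = re.sub(f"(?<=\\s){ZWNJ}|{ZWNJ}(?=\\s)", "", text)
    # Same for a ZWNJ at the very start or end of the string.
    return text.strip(ZWNJ)
```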
12. Digits: pick a convention
Persian corpora often contain mixed:
Persian digits: ۱۲۳
Arabic digits: ١٢٣
Latin digits: 123
broken spaced digits: ۱ ۲ ۳
You should choose a convention.
For CPT prose, Persian digits may be natural.
For JSON, tool calling, math verification, and metadata, Latin digits are often easier.
A practical approach:
| Context | Suggested convention |
|---|---|
| raw Persian prose CPT | Persian digits are okay |
| math internal answer field | Latin digits |
| JSON/tool arguments | Latin digits |
| final Persian display answer | Persian digits are okay |
| metadata | Latin digits |
| broken spaced digits | fix if obvious, otherwise reject/truncate |
The important thing is consistency.
Do not let this happen randomly:
۲ 8 ٣ ۴ 5
unless you intentionally want mixed-digit robustness data.
For your example, ۲ ۸ مرداد should probably become:
۲۸ مرداد
and ۱ ۳ ۵ ۷ should become:
۱۳۵۷
if you are confident it is a date/year.
But if the number sequence is ambiguous, reject or truncate that span.
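The conversions and the two "obvious" repairs can be sketched as follows. The translation tables are straightforward; the repair patterns are assumptions: a two-digit day is joined only when a month name follows, and four spaced digits are joined only when they look like a Solar Hijri year from 1300 to 1499 and are not part of a longer digit run. Everything else is left alone for the reject/truncate step.

```python
import re

LATIN = "0123456789"
PERSIAN = "".join(chr(0x06F0 + i) for i in range(10))  # Persian digits
ARABIC = "".join(chr(0x0660 + i) for i in range(10))   # Arabic-Indic digits

TO_PERSIAN = str.maketrans(LATIN + ARABIC, PERSIAN * 2)
TO_LATIN = str.maketrans(PERSIAN + ARABIC, LATIN * 2)

D = "[\u06F0-\u06F9]"  # any Persian digit
MONTHS = "فروردین|اردیبهشت|خرداد|تیر|مرداد|شهریور|مهر|آبان|آذر|دی|بهمن|اسفند"

# Two spaced digits before a month name; the first digit is limited to 1-3.
SPACED_DAY_RE = re.compile(
    rf"(?<!{D} )(?<!{D})([\u06F1-\u06F3]) ({D}) (?=(?:{MONTHS})\b)"
)
# Four spaced digits that look like a year between 1300 and 1499.
SPACED_YEAR_RE = re.compile(
    rf"(?<!{D} )(?<!{D})(\u06F1) ([\u06F3\u06F4]) ({D}) ({D})(?! ?{D})"
)


def fix_spaced_dates(text: str) -> str:
    """Convert digits to Persian (prose convention) and join obvious day/year runs."""
    text = text.translate(TO_PERSIAN)
    text = SPACED_DAY_RE.sub(r"\1\2 ", text)
    text = SPACED_YEAR_RE.sub(r"\1\2\3\4", text)
    return text
```

For JSON, tool arguments, or metadata, `text.translate(TO_LATIN)` gives the Latin-digit form instead. A run such as p۱ ۲ ۲ matches neither repair pattern, so it stays broken and remains visible to the reference-tail detector.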
13. A simple cleaning pipeline for your current case
I would implement something like this:
1. Extract text
2. Normalize Unicode
3. Normalize Persian letters
4. Normalize digits
5. Split into paragraphs
6. Split paragraphs into sentences/lines
7. Detect reference-like lines/spans
8. Truncate bad tails
9. Remove very bad paragraphs
10. Deduplicate
11. Train Good LM on clean accepted text
12. Train Bad LM on rejected tails/noise
13. Score new chunks
14. Manually audit samples
More concrete:
input paragraph
-> sentence split
-> for each sentence/span:
     calculate Persian letter ratio
     calculate digit ratio
     calculate Latin ratio
     calculate punctuation ratio
     detect citation markers
     detect spaced-digit patterns
     detect source-name/reference tail
-> if bad tail starts after good text:
     keep text before bad tail
-> else if whole paragraph is bad:
     reject
-> else:
     keep
14. Example pseudo-code
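Very rough sketch in Python:
import re

PERSIAN_LETTERS = "\u0621-\u063A\u0641-\u064A\u067E\u0686\u0698\u06A9\u06AF\u06CC"  # letters only; a plain "آ-ی" range also covers Arabic-Indic digits, tatweel, and diacritics
DIGITS = "0-9۰-۹٠-٩"  # Latin, Persian, and Arabic-Indic digits


def non_space_chars(text):
    return [c for c in text if not c.isspace()]


def persian_ratio(text):
    """Share of non-space characters that are Persian letters."""
    letters = re.findall(f"[{PERSIAN_LETTERS}]", text)
    return len(letters) / max(1, len(non_space_chars(text)))


def digit_ratio(text):
    """Share of non-space characters that are digits in any of the three scripts."""
    digits = re.findall(f"[{DIGITS}]", text)
    return len(digits) / max(1, len(non_space_chars(text)))


def has_spaced_digits(text):
    """Single digits separated by whitespace, e.g. "۱ ۳ ۵ ۷" or "۲ ۸" (but not "۲۸ ۱۳۵۷")."""
    pattern = rf"(?<![{DIGITS}])[{DIGITS}](?:\s+[{DIGITS}](?![{DIGITS}]))+"
    return bool(re.search(pattern, text))


def looks_reference_like(text):
    patterns = [
        r"http",
        r"www\.",
        r"doi",
        r"ISBN",
        r"ISSN",
        r"BBC Persian",  # source names from this example; extend from your own audits
        r"Modern Iran",
        rf"\bp\s*[{DIGITS}]",       # page marker such as "p۱ ۲ ۲"
        rf"(?<!\w)ص\s*[{DIGITS}]",  # Persian page marker; the lookbehind skips words ending in ص
    ]
    return any(re.search(p, text, flags=re.IGNORECASE) for p in patterns)


def is_bad_tail(sentence):
    if looks_reference_like(sentence):
        return True
    if has_spaced_digits(sentence) and digit_ratio(sentence) > 0.08:
        return True
    if persian_ratio(sentence) < 0.45:
        return True
    return False


def truncate_bad_tail(sentences):
    """Skip leading bad sentences, keep good ones, stop at the first bad one after good text."""
    clean = []
    for sent in sentences:
        if is_bad_tail(sent):
            if clean:
                break  # the bad tail begins: drop it and everything after it
            continue   # leading noise before any good text
        clean.append(sent)
    return " ".join(clean)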
This is only a starting point. You would need to tune it by looking at accepted/rejected samples.
15. Manual audit is still necessary
Do not trust the cleaner blindly.
After each cleaning rule change, sample:
100 accepted chunks
100 rejected chunks
100 truncated chunks
Then inspect:
| Sample type | What to check |
|---|---|
| accepted | Did garbage survive? |
| rejected | Did good Persian get removed? |
| truncated | Did truncation cut at the right place? |
| borderline | Should this become a new rule? |
This is the same kind of loop you are already doing:
clean -> review -> adjust rules -> clean again
That loop is the correct approach.
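Drawing the audit samples is easy to automate. The sketch below assumes you already have the cleaner's outputs grouped into named buckets; `sample_for_audit` is a hypothetical helper, and the fixed seed is there so that two reviewers looking at the same rule version see the same samples.

```python
import random


def sample_for_audit(buckets, k=100, seed=0):
    """Draw up to k items per bucket with a fixed seed so reviews are repeatable."""
    rng = random.Random(seed)
    return {
        name: rng.sample(items, min(k, len(items)))
        for name, items in buckets.items()
    }
```

A call such as `sample_for_audit({"accepted": accepted, "rejected": rejected, "truncated": truncated})` returns a dictionary with the same keys, each holding at most 100 chunks to inspect.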
16. For /7: how to train the small model for noisy prompts
For the 0.8B model, I would not train it on arbitrary noisy prompts.
I would train it on controlled noisy prompts.
Examples:
Good robustness SFT
User has typo -> assistant still answers.
User asks unclear question -> assistant asks clarification.
User includes irrelevant sentence -> assistant focuses on main question.
User asks two things -> assistant separates them.
User provides messy Persian -> assistant normalizes meaning.
Bad robustness SFT
User prompt contains random citation garbage -> assistant imitates it.
User prompt has broken reference tail -> assistant treats it as meaningful.
User prompt has unrelated source list -> assistant summarizes garbage.
For a small model, the best behavior is often:
I cannot reliably answer from this messy text. Please provide a clearer sentence.
or:
The first part is understandable, but the ending looks like broken reference text.
That is a valid assistant behavior.
The goal is not to make the 0.8B model magically robust to every bad prompt. The goal is to make it fail gracefully.
17. Product-level strategy for a 0.8B assistant
For a low-end-device assistant, I would design guardrails around the model.
raw input
-> normalization
-> task classifier
-> noise detector
-> chunk selector
-> small model
-> output validator
If the input is too messy:
ask clarification
If the input is too long:
split/summarize first
If it contains reference garbage:
remove or warn
If it asks for math:
use calculator/tool if available
If it asks for grammar:
route to grammar-tutor prompt template
Small models work better when the surrounding system reduces ambiguity.
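Those rules amount to a small router that runs before the model is called. The sketch below is illustrative only: the `InputSignals` fields, the thresholds, and the action names are assumptions, and the signals themselves would come from your own noise detector and task classifier.

```python
from dataclasses import dataclass

MAX_CHARS = 1500  # assumed length budget for a 0.8B model; tune empirically


@dataclass
class InputSignals:
    text: str
    noise_ratio: float        # share of spans flagged by a noise detector (hypothetical)
    has_reference_tail: bool  # output of a reference-tail detector
    task: str                 # e.g. "math", "grammar", "chat" from a task classifier


def route(signals: InputSignals) -> str:
    """Pick one action before the small model is called, mirroring the rules above."""
    if signals.noise_ratio > 0.5:
        return "ask_clarification"
    if signals.has_reference_tail:
        return "strip_reference_tail"
    if len(signals.text) > MAX_CHARS:
        return "split_or_summarize"
    if signals.task == "math":
        return "call_calculator"
    if signals.task == "grammar":
        return "grammar_tutor_template"
    return "answer_directly"
```

Keeping the routing outside the model makes each rule testable on its own and leaves the 0.8B model with only the final, already-simplified request.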
18. Bottom line
For the noisy prompt question:
Larger models generally handle long, noisy, multi-constraint prompts better. But a 0.8B model can still be useful if you reduce the prompt burden with preprocessing, task routing, shorter context, templates, and controlled robustness SFT.
For the n-gram cleaning question:
Do not let the n-gram model handle structural garbage by itself. Remove obvious reference/citation/wiki-tail noise first. Train the Good LM only on text you want the model to imitate. Optionally train a Bad LM on the garbage you want to detect.
The most important rule is:
The n-gram model should be a quality scorer, not a garbage collector.