{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreigsorjnnmmr7xh7i7aotdwwmha62eygrl2qwaaqxkr6o4rrdlv2ce",
"uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mnwoddjw6yk2"
},
"path": "/t/how-can-i-build-a-high-quality-dataset/176571#post_7",
"publishedAt": "2026-06-10T10:40:44.000Z",
"site": "https://discuss.huggingface.co",
"textContent": "Hello again! Sorry for the late reply.\n\nI’ve been cleaning my Persian CPT data and iterating through a **`[CPT clean -> review]`** loop, and overall things are going very well. But I’ve had one question on my mind for a long time.\n\nIn the local models I tested, even relatively strong ones like **`Gemma 4 E4B UD-Q4`** struggled with noisy or unclear prompts. The model would often ignore parts of my instructions or fail to follow everything I said consistently. That made me wonder: if even large companies like Google could not make an `8B` model that reliably handles noisy prompts, how could I expect a `0.8B` model to do so while also maintaining attention across long, messy prompts?\n\n**So my question is:** is it really true that only larger models can maintain attention well and follow instructions more reliably?",
"title": "How can i build a High Quality dataset?"
}