Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreigsorjnnmmr7xh7i7aotdwwmha62eygrl2qwaaqxkr6o4rrdlv2ce",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mnwoddjw6yk2"
  },
  "path": "/t/how-can-i-build-a-high-quality-dataset/176571#post_7",
  "publishedAt": "2026-06-10T10:40:44.000Z",
  "site": "https://discuss.huggingface.co",
  "textContent": "Hello again! Sorry for the late reply.\n\nI’ve been cleaning my Persian CPT data and iterating through a **`[CPT clean -> review]`** loop, and overall things are going very well. But one question has been on my mind for a long time.\n\nIn the local models I tested, even relatively strong ones like **`Gemma 4 E4B UD-Q4`** struggled with noisy or unclear prompts. The model would often ignore parts of my instructions or follow them inconsistently. That made me wonder: if even a large company like Google could not make an `8B` model that reliably handles noisy prompts, how can I expect a `0.8B` model to do so while also maintaining attention across long, messy prompts?\n\n**So my question is:** is it really true that only larger models can maintain attention well and follow instructions reliably?",
  "title": "How can I build a high-quality dataset?"
}