How can I build a high-quality dataset?
Hello again! Sorry for the late reply.
I’ve been cleaning my Persian CPT data and iterating through a [CPT clean -> review] loop, and overall things are going very well. But I’ve had one question on my mind for a long time.
Among the local models I tested, even relatively strong ones like Gemma 4 E4B UD-Q4 struggled with noisy or unclear prompts. The model would often ignore parts of my instructions or fail to follow all of them consistently. That made me wonder: if even large companies like Google could not make an 8B model that reliably handles noisy prompts, how could I expect a 0.8B model to do so while also maintaining attention across long, messy prompts?
So my question is: is it really true that only larger models can maintain attention well and follow instructions more reliably?