{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreiaeq75mtzktt6qjytoneyuelx5cy2qypp4kky5iwe4hmdw5t437du",
"uri": "at://did:plc:lk3jfj3zq4k4wxnk474axylu/app.bsky.feed.post/3mlgqlr2fbea2"
},
"path": "/t/collection-of-gpt-image-generator-2-0-issues-bugs-and-work-around-tips-check-first-post/1379535?page=12#post_243",
"publishedAt": "2026-05-09T16:02:38.000Z",
"site": "https://community.openai.com",
"textContent": "Yes, that makes a lot of sense!\n\nThe multimodal model works differently from a diffusion model. I haven’t fully understood it yet, but as far as I can tell, the word tokens are converted more directly into image tokens. (A diffusion model searches for an image in noise. That is why diffusion models are more creative but less geometrically precise; the newer methods are more precise but less creative. At least that was the case with 1.0.)\n\nThey must have found a way to create complex and complete images from simple word tokens. So the LLM must somehow internally expand the simple words into something more detailed. If one understands this internal expansion better, one can prompt images much more effectively.\n\nThis visual expansion made me believe for a long time that some kind of prompt improvement was still taking place. And I am still not entirely sure whether ChatGPT does that. It claims that it does, but that could be a hallucination. _j says no.\n\nWhen you make the LLM hallucinate based on a non-existent trigger or a very vague input, you can see how the system tries to create associations. The same goes for images.\n\nThat is similar to what you did, only, in a sense, the other way around. You have a complex text and look at which words are important to the LLM when summarizing it. With the right words, you can guide not only an LLM but also an image generator much better.\n\nFinding the right words for DallE was sometimes like searching for gemstones.\nThis will remain the same with the new model: it is still an LLM, only now creating 2D image tokens instead of 1D word tokens.\n(And I like the result more than text.)",
"title": "Collection of GPT-image-generator 2.0 issues, bugs, and work-around tips (check first post)"
}