Raw Record Source

{
  "path": "/posts/2025/images-as-context/index",
  "site": "at://did:plc:mracrip6qu3vw46nbewg44sm/site.standard.publication/self",
  "tags": [
    "context",
    "multi-modal"
  ],
  "$type": "site.standard.document",
  "title": "Images as Context",
  "updatedAt": "2025-12-09T15:00:12.000Z",
  "publishedAt": "2025-12-06T15:00:12.000Z",
  "textContent": "import ChatContainer from '@components/prose/ChatContainer.astro';\nimport ChatMessage from '@components/prose/ChatMessage.astro';\nimport windowImage from './images/window.png';\nimport modernOsImage from './images/modern-os.png';\n\nimport Opus45 from './components/opus-4.5.astro';\nimport Gemini3Preview from './components/gemini-3-preview.astro';\nimport Gemini3PreviewDetails from './components/gemini-3-preview-details.astro';\n\nWorking with coding agents has been a dance of context management.\nThese days, if an agent loop isn't producing the result I want, it's more often than not a problem of context rather than a shortcoming of the language model or agent scaffold/harness.\n\nBecause of this, I've started thinking about context as the object of value in building software and doing things with a computer.\nOf all the ideas that exist in the universe, the context you are working with provides the directional leanings for the problem you are dealing with and the implicit shape a solution could take.\n\nContext also comes with a variety of information densities.\nThis week, I tried to provide the Macintosh Human Interface Guidelines from 1992 as context to Claude Code to coax out a CSS library that would allow me to style a website with the design patterns from Mac OS 9.\nAfter chopping up the 400+ page PDF into single-page PDFs, Claude Code navigated these pages via the table of contents it found on one of those pages, then created a markdown file with its findings from the design.\n\nThe results were far from pixel perfect, but the concept, that context can be grouped, refined, transformed, and carried around between agents crystallized for me from the experiment.\n\nMy working thesis is the agent struggled to implement pixel perfect design, not because it can't, but because I hadn't provided it a well defined design specification.\nAnd what I had provided had relatively low information signal to solve that problem.\n\nHowever, if I were to 
refine a design specification out of this more diluted context, it could become a high-information-density piece of context, portable enough to communicate the idea clearly to any agent for any problem I was trying to solve with this design system.\n\nThe image problem\n\nThe context problem is not quite that simple though.\n\nWhen it comes to reproducing something in code, an image is rarely enough to get a pixel perfect design.\nLanguage descriptions, especially those with specific measurements, seem to get better results.\n\nI'm not the only one who has noticed this behavior.\n\nHere's a straightforward example I tried for this screenshot of a text document in Mac OS 9, prompting claude-opus-4-5-20251101.\n\n<ChatContainer model=\"claude-opus-4-5-20251101\" user=\"Me\">\n  <ChatMessage role=\"user\">\n    given the provided image of the text editor in mac os 9, output a pixel perfect reproduction in html and css\n\n    <img src={windowImage.src} alt=\"A screenshot of the text editor in Mac OS 9\" class=\"chat-image-component\" />\n\n  </ChatMessage>\n</ChatContainer>\n\nHere's claude-opus-4-5-20251101's output.\n\n<Opus45 />\n\nHere is gemini-3-pro-preview's output.\n\n<Gemini3Preview />\n\nI chose these specific models because they are the top-performing models on DesignArena.\n\nBoth results are obviously inspired by the source image, but neither is close to pixel perfect.\nAnd I suspect we're not exactly starting from zero knowledge in this case.\nThese models know roughly what Mac OS 9 looks like and can describe it in language:\n\n> ...\n> Key Visual Elements\n>\n> Window Design\n>\n> - Light gray textured backgrounds with subtle horizontal pinstripes\n> - Rounded rectangular title bars with a striped/ridged texture\n>   ...\n\nAt least in broad strokes.\n\nWhile I give Gemini credit for doing the best job I've seen on this particular task by a language model, it's still not close to a pixel perfect reproduction.\n\nWords 
work better\n\nSo how do we do better?\nWe could modify the image to annotate focus points for the model, but the image is _what we want_.\nThe model just isn't quite giving it to us.\n\nSo what can we do?\nWe can use words.\n\n<ChatContainer model=\"gemini-3-pro-preview\" user=\"Me\">\n  <ChatMessage role=\"user\">\n    given the provided image of the text editor in mac os 9, output a pixel perfect reproduction in html and css. take particular care in crafting the buttons in the top right corner. these are not modern operating system buttons. the right most has two black horizontal lines going through the center with a small gap between them. the button to the left of that has the upper left hand quadrant outlined in black. be sure to get these buttons pixel perfect to the description.\n\n    <img src={windowImage.src} alt=\"A screenshot of the text editor in Mac OS 9\" class=\"chat-image-component\" />\n\n  </ChatMessage>\n</ChatContainer>\n\nWith Gemini 3 Pro Preview, we get this:\n\n<Gemini3PreviewDetails />\n\nNot exactly right, but meaningfully closer (as far as the buttons are concerned).\nThe words help get the model closer to the desired result.\nThis is the pattern we've become familiar with when working with coding agents.\n\nThe image alone is not quite sufficient to get the desired output.\nWe need to follow up with words to improve clarity if we want pixel perfect.\n\nThis recognition is informative because it suggests that text-based representations of concepts are more effective for getting models to produce the desired output in code.\nText makes it easier to correct wrong output, and it seems to capture a specification more losslessly and portably than an image.\n\nSince we're looking for language-like output, language input seems to be the most effective way to steer.\nWith an image model like Nano Banana Pro, if you want a near pixel perfect reproduction of an image in a modified environment, you just give the model the 
image.\n\n<ChatContainer model=\"gemini-3-pro-image-preview\" user=\"Me\">\n  <ChatMessage role=\"user\">\n    place the provided image of the text editor in mac os 9 in a hyper modern operating system of the future in 2095. keep the provided window design and buttons pixel perfect.\n\n    <img src={windowImage.src} alt=\"A screenshot of the text editor in Mac OS 9\" />\n\n  </ChatMessage>\n  <ChatMessage role=\"assistant\">\n    <img src={modernOsImage.src} alt=\"A screenshot of the text editor in Mac OS 9 placed in a modern operating system\" />\n  </ChatMessage>\n</ChatContainer>\n\nNote: the model still isn't quite perfect, as it adds disabled arrows on the horizontal scrollbar that aren't there in the original image.\n\nIf you want an image, give an image.\nIf you want text, give text.\n\nThe state of the art in both areas is quite good.\n\nImages are helpful but not (yet) sufficient context to produce code\n\nImages are pixels on a screen.\nIf you want to introduce a new visual element to an image, you work in these pixels and these pixels alone.\nAt least, until models start creating images with layers.\n\nTo reproduce an image as a website, a translation of the image into code that a browser renders is necessary.\nThat feels a lot more complicated to me.\n\nIt's possible this is, in a way, a data problem.\nBased on my research, there aren't that many attempts to reimplement the Mac OS 9 design system on the web.\nIf there were, the model could probably execute this task without issue, roughly reproducing from example projects.\n\nThe models' failures seem like failures to generalize for some edge cases.\nMac OS 9 button icons are unusual looking.\nNothing uses these today.\nThe UI pattern did not endure.\n\nModels can be prompted with words to reproduce button designs like these.\nBut they balk when asked to describe or reproduce them given the image alone or even just the concept, e.g. 
\"describe the two buttons in the top right corner of the windows in the Mac OS 9 Platinum UI\".\nThey just don't seem to really know what they are the way they seem to know about other things in depth.\n\nThis challenge continues to be an interesting one that shows up as I attempt to transform ideas from context into different code projects.",
  "canonicalUrl": "https://www.danielcorin.com/posts/2025/images-as-context/index"
}