Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreigm3ucmlgzcx22mf3zclcaw6y74omy5a5duoo4hl5u4xbytzv2tv4",
    "uri": "at://did:plc:lk3jfj3zq4k4wxnk474axylu/app.bsky.feed.post/3mlhs6gp2njz2"
  },
  "path": "/t/observed-mismatch-between-chatgpt-image-editing-ui-and-full-frame-regeneration-behavior/1380598#post_1",
  "publishedAt": "2026-05-10T01:27:58.000Z",
  "site": "https://community.openai.com",
  "textContent": "Report Summary\n\nThis report organizes only the facts observable by the user regarding the process presented as “image editing” within the ChatGPT application.\n\nThe conclusion is clear.\n\nThis process does not perform localized edits on the original image uploaded by the user.\n\nThe process that is actually invoked is image_gen.text2im. On the returned side, DALL-E generation metadata is displayed; even when edit_op: “inpainting” appears, the output is not a localized edit, but a full-frame regeneration.\n\nMoreover, at an earlier stage, the original image file itself is not transmitted, retained, or referenced in its original form.\n\nTherefore, the “image editing” observed in this chat is not editing of the original image.\n\nIt is a text-to-image full-frame regeneration using a reduced and converted derivative image as reference input.\n\nFinal Conclusion\n\nThe original image file uploaded by the user is not processed as-is.\n\nAt the upload stage, ChatGPT handles a reduced and converted derivative image distinct from the original.\n\nThe tool invoked during image processing is image_gen.text2im.\n\nEvery returned result displays DALL-E generation metadata.\n\nEven when edit_op: “inpainting” is displayed, the actual output is not localized editing but full-frame regeneration.\n\nEven when the correction area is explicitly specified, the process proceeds on the premise of masking, and inpainting is displayed, the entire image—including areas outside the specified region—changes at the pixel level.\n\nThe hash of the output image is also entirely different from that of the original.\n\nTherefore, this is not “image editing.” Nor is it editing based on the original image. It is image_gen.text2im / T2I full-frame regeneration using a reduced and converted derivative image as input.\n\nObserved Facts\n\nThe original image file itself is not transmitted as-is.\n\nThe user is using an image-upload feature described as permitting uploads of up to 20 MB.\n\nHowever, actual network monitoring showed that even when a large image was selected and uploaded, the amount of data transferred was only about 300 KB.\n\nThis is decisive.\n\nIf a 20 MB-class, or even several-megabyte, original image file were being sent to the server as-is, a corresponding amount of network traffic should occur.\n\nSince only about 300 KB of data is transmitted, the original image file itself is not being sent as-is.\n\nAt this point, the premise that “the original image is uploaded as-is and that original image is then edited” collapses.\n\nThe original image and the image handled on ChatGPT’s side are different objects.\n\nThe original image information on the user’s side was as follows:\n\nFilename: 1000045047_x4_drawing.png\n\nFormat: PNG\n\nResolution: 2048 × 2048\n\nSize: 5.58 MB\n\nSHA-1: 69ba09b9718bc43947e0f6510bab65319e3e0a42\n\nSHA-256: 2d6a15d7deb517c5e8885512ec73d79bd2535d5d5311a8e76a793fed391ec114\n\nBy contrast, the image accessible to the assistant within this conversation was as follows:\n\nFormat: JPEG\n\nResolution: 1536 × 1536\n\nSize: 420,655 bytes\n\nSHA-1: deff635b673de90cbadf603ce81c548cb2a805a9\n\nSHA-256: 0239d63859547149e61e5c987897291713593da222a63f7f0635e3bc0bce4d53\n\nThe format, resolution, file size, and hashes all fail to match.\n\nIn other words, what the assistant and the image-processing side are referencing is not the user’s original image file itself.\n\nIt is a reduced and converted derivative image created during the upload stage or internal expansion stage.\n\nThe explanation that the image is “temporarily compressed for transmission and later restored to the original” is untenable.\n\nIt is not credible to claim that an image of 20 MB, or even several megabytes, is reduced to approximately 300 KB for transmission and then later perfectly restored for use as the original.\n\nFor such an explanation to hold, the following would be necessary:\n\nThe original image must be losslessly recoverable from the transmitted data.\n\nThe restored image must contain pixels identical to those of the original.\n\nThe hashes must also match the original image.\n\nIn reality, however, the image accessible to the assistant does not match the original in format, resolution, file size, or hash.\n\nTherefore, this is not “temporary compression.”\n\nThe original image is not sent as-is, nor is it restored to the original.\n\nA derivative image is created, and that derivative image becomes the object of processing.\n\nThere is no indication that the original image file is reacquired or re-expanded during image editing.\n\nOne might argue that, even if only a lightweight derivative image is sent at upload time, the system later retrieves the original image file or equivalent original-quality data during the image-editing operation and processes it at high quality.\n\nThis argument also fails.\n\nWhen image editing was actually executed:\n\nThe tool invoked was image_gen.text2im.\n\nThe returned image was approximately two megapixels.\n\nNo increase in network traffic corresponding to an image file of that size was observed before or after the operation.\n\nOnly lightweight control or text-output traffic appeared to be occurring.\n\nThe downloaded image after generation was likewise an approximately two-megapixel image.\n\nIf the original image file were being reacquired or re-expanded during editing, network traffic corresponding to the image size should have occurred.\n\nIt did not.\n\nTherefore, the original image file is not being used even at the image-editing stage.\n\nWhat is used during editing is the derivative image handled within the chat.\n\nThe invoked tool is image_gen.text2im, not an image-editing tool.\n\nAlthough the feature is being used as image editing, the tool actually invoked by the assistant was image_gen.text2im.\n\nThis is the name of a text-to-image process.\n\nTherefore, at least according to the execution information observable by the user, the invoked process is not “image editing” but “text-to-image.”\n\nThis point is critically important.\n\nIf the operation were localized editing or inpainting, the process name or process structure should correspond to that function.\n\nIn reality, however, the invoked process is text2im.\n\nEvery returned result displays DALL-E generation metadata.\n\nUpon examining the images returned as generation results in this chat, DALL-E generation metadata was displayed in all 16 of the 16 confirmed cases.\n\nIn other words, although the feature is being used in the context of GPT Images / ChatGPT Images 2.0 image editing within the ChatGPT application, the returned metadata is always DALL-E generation metadata.\n\nThe important point here is not speculation about whether DALL·E is truly operating internally.\n\nThe observable fact is that the metadata visible to the user is consistently DALL-E generation metadata.\n\nThe displayed context and the returned metadata are not aligned.\n\nThe process is invoked as text2im, returned as inpainting, and produces full-frame regeneration.\n\nIn some returned metadata, edit_op: “inpainting” was displayed.\n\nHowever, the tool actually invoked was image_gen.text2im.\n\nThus, the observable correspondence is as follows:\n\nInvoked process name: image_gen.text2im\n\nReturned metadata: edit_op: “inpainting”\n\nActual output: full-frame regeneration\n\nThis is fundamentally inconsistent.\n\nA process invoked as text-to-image is labeled on return as inpainting, while the output is not a localized edit but an image whose entire frame has changed at the pixel level.\n\nTherefore, the process name, returned metadata, and actual result do not agree.\n\nAt least in this observation, this is not inpainting in the sense expected by the user.\n\nThe correction area was explicitly specified.\n\nThe problem is not that “the user gave vague instructions.”\n\nIn fact, across multiple attempts, the user clearly specified the following:\n\nWhich area should be corrected\n\nWhich areas should be preserved\n\nOnly the lower body\n\nOnly from the waist downward\n\nPreserve the face, hair, upper body, and background\n\nPreserve the clothing\n\nDo not alter anything outside the specified area\n\nUse a mask\n\nProceed on the premise of inpainting\n\nIn other words, the target area for editing was not ambiguous.\n\nThe premise of localized editing and inpainting was stated clearly.\n\nEven so, the results changed regions far beyond the specified area.\n\nTherefore, this problem did not occur because the correction area had not been specified.\n\nThe entire image, including unspecified regions, changes at the pixel level.\n\nThis is the most serious practical harm.\n\nWhen the original and output images are compared, not only the specified region but the entire frame, including areas outside the specified region, has changed at the pixel level.\n\nThe following elements changed:\n\nBackground\n\nHair\n\nFace\n\nOutfit\n\nContours\n\nColoring\n\nOrnaments\n\nShape of shadows\n\nComposition\n\nLegs\n\nShoes\n\nThis is not merely a case of slight influence around the edited area.\n\nThe entire image has been reconstructed.\n\nIn localized editing, the majority of the unspecified regions should preserve the original pixels, or at least a structure very close to them.\n\nThat is not what occurred here.\n\nTherefore, this is not localized editing.\n\nThe hash of the output image is also entirely different.\n\nThe original image and the output image differ not only visually, but also lack continuity as files.\n\nThe hash of the output image is completely different from that of the original.\n\nThis is significant.\n\nIf localized editing were replacing only a portion of the image while preserving most of the original, one would expect at least some continuity as an edited result based on the original image.\n\nIn reality, however, all three of the following are true:\n\nThe entire image changes at the pixel level.\n\nUnspecified regions also change comprehensively.\n\nThe output image hash is entirely different.\n\nTherefore, this is not “the result of partially editing the original image.”\n\nIt is a newly generated image created with reference to the original.\n\nThe resolution is not consistent.\n\nAlthough the original image is uploaded at roughly one megapixel or higher resolution, the processed and returned images are handled at around two megapixels, or after being converted to another resolution.\n\nThe important point is that the resolution of the input image does not match the resolution of the processing target or returned image.\n\nThis is not the behavior of localized editing.\n\nRather than using the original image itself as the base for partial editing, the system appears to transfer the image into a different resolution regime and reconstruct it there.\n\nTherefore, at minimum, this process is not “editing the original image itself.”\n\nAspect-ratio and canvas specifications do not function as independent factors.\n\nOrdinarily, the conditions passed to an image engine should include structured parameters handled separately from the prompt text itself.\n\nAt minimum, the following should be treated as independent factors:\n\nAspect ratio\n\nCanvas size\n\nReference image\n\nImage to be edited\n\nMask or target editing area\n\nStyle-preservation conditions\n\nIn practice, however, the conditions specified by the user do not operate rigorously as independent factors.\n\nAspect-ratio specifications are not reliably obeyed.\n\nCanvas conditions are not passed through as-is.\n\nThe editing area is not fixed.\n\nThis is because conditions that ought to be handled as independent control factors are instead forced into the prompt text, and even that text itself is summarized or compressed.\n\nAs a result, size, ratio, editing range, preservation conditions, and style conditions are dropped, weakened, or entangled.\n\nThis input design is broken.\n\nThe user input, the assistant-created prompt, the tool call, and the prompt in the returned metadata do not match.\n\nEven when the user explicitly sends text and states, “treat this as the prompt,” that text is not necessarily used as the actual input to the image engine.\n\nThe assistant translates it into English, adds supplementary details, appends conditions, and sends a different text to the tool.\n\nAn additional problem is that, in some cases, the returned metadata shows prompt: “” as an empty field.\n\nThus, at least within the range observable by the user, the following do not match:\n\nThe user’s input text\n\nThe prompt text created by the assistant\n\nThe prompt used in the image-tool call\n\nThe prompt shown in the returned metadata\n\nUnder these conditions, the user cannot verify what was actually supplied to the image engine.\n\nReproducibility and transparency are not achieved.\n\nThe actual result is not “correction” but a full reinterpretation each time.\n\nEven when localized corrections are requested for fingers, the face, the lower body, or similar elements, parts that were not specified are reinterpreted each time.\n\nTypically, the following were affected:\n\nDirectionality of the face\n\nHair color\n\nRibbons\n\nClothing\n\nBackground density\n\nStructure of the painted planes\n\nLeg structure\n\nShoe shape\n\nIn other words, the workflow is not “preserve the parts that have been fixed, then correct only the remaining unfixed parts.”\n\nInstead, the entire image is reinterpreted each time, and even previously corrected parts regress.\n\nThis is not image editing; it is the behavior of regeneration.\n\nFragmented and mosaic-like coloring arises not as a failure of localized editing, but as a side effect of full-frame regeneration.\n\nThe outputs repeatedly exhibited breakdowns in coloring such as the following:\n\nSmall fragmentary shadows\n\nMosaic-like coloring\n\nSpeckled highlights\n\nClusters of tiny paint fragments\n\nA glaring, glittering texture\n\nUnnaturally high density\n\nEven after repeatedly specifying “flat coloring,” “no mosaic-like coloring,” “organize into large planes,” and “do not subdivide,” the problem did not stop.\n\nThis is because the system is not editing the specified local area, but regenerating the entire frame.\n\nNeither preservation of the coloring nor localized retention is functioning.\n\nAs a result, the overall coloring style is reconstructed every time.\n\nEven at the chat-thumbnail stage, the original image data is not handled as-is.\n\nFrom the moment the image is displayed in the chat, it is already no longer the original image itself.\n\nWhat is displayed is a thumbnail or otherwise processed derivative image.\n\nAfter that, even when the image engine is invoked, no network traffic corresponding to the image size occurs.\n\nIn other words, the image-system data visible in the chat is itself being used as the processing target, and the original image file is not being fetched again.\n\nThe image ultimately downloaded is, in the end, a separately generated image.\n\nThe entire flow is consistent not with “editing the original image,” but with “regeneration using a derivative image as reference.”\n\nAlthough presented as image editing, the actual process is image_gen.text2im / T2I full-frame regeneration.\n\nSummarizing the observed facts above, the processing structure is consistent:\n\nThe original image file itself is not sent.\n\nThe original image file itself is not retained or reacquired.\n\nWhat is referenced is a reduced and converted derivative image.\n\nThe invoked tool is image_gen.text2im.\n\nThe returned metadata is DALL-E generation metadata.\n\nEven with edit_op: “inpainting”, localized editing is not achieved.\n\nThe entire frame, including unspecified areas, changes at the pixel level.\n\nThe hash becomes entirely different.\n\nTherefore, the process observed in this chat is not image editing.\n\nIt is image_gen.text2im / T2I full-frame regeneration using a reduced and converted derivative image as input.\n\nRelated Input-System Issues\n\nIn voice input, fixed text not spoken by the user is transmitted.\n\nSeparate from the image-related issues, there was also a serious anomaly in input processing.\n\nDuring voice input, the UI displays a waveform and appears to be processing audio input.\n\nIn reality, however, the spoken content is not transmitted; instead, fixed text such as the following is sent:\n\n“This transcript may contain references to ChatGPT, OpenAI, DALL·E, GPT-3, GPT-4.”\n\n“This transcript may include references to ChatGPT, OpenAI, DALL·E, GPT-3, GPT-4.”\n\nThis is not the user’s speech.\n\nNor is it a mere speech-recognition mistranscription.\n\nAn internal boilerplate sentence or notice is being transmitted as user input.\n\nThus, not only in the image-generation system but also in input processing, the state shown in the UI and the content actually transmitted do not match.\n\nWhat This Report Demonstrates\n\nThis is not a mere quality issue.\n\nNor is it simply a matter of “a bad prompt,” “overly complex instructions,” or “the editing area expanding.”\n\nThe essence of the problem is as follows:\n\nThe original image itself is not sent.\n\nThe original image itself is not retained or reacquired.\n\nA reduced and converted derivative image becomes the processing target.\n\nThe invoked process is image_gen.text2im.\n\nThe returned data is DALL-E generation metadata.\n\nEven when inpainting is displayed, the result is not localized editing.\n\nThe entire image, including unspecified areas, changes at the pixel level.\n\nThe hash also becomes entirely different.\n\nNevertheless, in the UI context, the operation is treated as “image editing.”\n\nTherefore, this is a problem in which the description “image editing” does not match the actual processing performed.\n\nIt is a transparency problem, an input-design problem, and a discrepancy between functional labeling and real behavior.\n\nRequests\n\nClearly state whether the original image file itself is actually transmitted, retained, and referenced.\n\nIf the image is converted into a derivative image after upload, clearly disclose that specification.\n\nClearly explain why the invoked tool is image_gen.text2im.\n\nClearly explain why DALL-E generation metadata is returned.\n\nClearly explain the conditions under which edit_op: “inpainting” is displayed, and what it actually means.\n\nClearly state whether the process is localized editing or full-frame regeneration.\n\nClearly explain how masks and target editing areas are actually handled.\n\nClearly explain how independent factors such as aspect ratio, size, and style-preservation conditions are passed to the engine.\n\nClearly explain the relationship among the user input, the assistant-generated prompt, the actual engine input, and the prompt shown in the returned metadata.\n\nExplain the input anomaly in which internal boilerplate text is inserted during voice input.\n\nClosing Statement\n\nThe process observed in this chat is not editing of the original image.\n\nIt is image_gen.text2im / T2I full-frame regeneration using a reduced and converted derivative image as reference.\n\nMoreover, it has been observed in the following form:\n\nIt is invoked as image_gen.text2im.\n\nIt returns DALL-E generation metadata.\n\nIt may even be displayed as inpainting.\n\nIn reality, it is not localized editing.\n\nThe entire frame, including unspecified regions, changes at the pixel level.\n\nThe hash becomes entirely different.\n\nUnder these conditions, presenting the feature as “image editing” is inaccurate.\n\nAllowing users to treat it as image editing without clearly disclosing the actual processing gives rise to misunderstanding.\n\nThis report demonstrates that such misunderstanding is supported by observable facts.",
  "title": "Observed mismatch between ChatGPT image-editing UI and full-frame regeneration behavior"
}