Raw Record Source

{
  "path": "/posts/2024/vlms-hallucinate/index",
  "site": "at://did:plc:mracrip6qu3vw46nbewg44sm/site.standard.publication/self",
  "tags": [
    "vlms",
    "model_hallucination"
  ],
  "$type": "site.standard.document",
  "title": "VLMs Hallucinate",
  "updatedAt": "2024-08-16T22:44:57.000Z",
  "publishedAt": "2024-08-16T22:44:57.000Z",
  "textContent": "import Chat from '@components/prose/Chat.astro';\nimport { Tweet } from 'astro-embed';\nimport vlmsHallucinateImage from './images/vlms-hallucinate.png';\n\nI've done some experimentation extracting structured data from documents using VLMs.\nA summary of one approach I've tried can be found in my repo, impulse.\nI've found using Protobufs to be a relatively effective approach for extracting values from documents.\nThe high-level idea is you write a Protobuf as your target data model then use that Protobuf itself as most of the prompt[^1].\nI discussed the approach in more detail in this post so I am going to jump right into it.\n\nThe problem\n\nWhen relevant contextual data is available in an image, it can be hard to prevent a VLM from hallucinating values that plausibly could be in the image, even when they're not.\nFor reference, the image below is a png version of receipt-no-tax-or-totals.pdf which I will reference later.\nThe data is fake, generated by a model to look realistic.\nThe example:\n\n<Chat\n  model=\"gpt-4o-2024-08-06\"\n  messages={[\n    {\n      role: 'user',\n      content: [\n        {\n          type: 'text',\n          text: '\\n\\nUsing the provided content and images, extract an instance of Receipt as JSON in adherence to the above schema.\\nNo talk. 
JSON only.',\n        },\n        {\n          type: 'image_url',\n          image_url: {\n            url: vlmsHallucinateImage.src,\n          },\n        },\n      ],\n    },\n    {\n      role: 'assistant',\n      content: [\n        {\n          type: 'text',\n          text: '',\n        },\n      ],\n    },\n  ]}\n/>\n\nLooking at the model output, we see subtotal, tax and total in the response.\n\nclaude-3-5-sonnet-20240620 has the same problem.\n\n<Chat\n  model=\"claude-3-5-sonnet-20240620\"\n  messages={[\n    {\n      role: 'user',\n      content: [\n        {\n          type: 'text',\n          text: '\\n\\nUsing the provided content and images, extract an instance of Receipt as JSON in adherence to the above schema.\\nNo talk. JSON only.',\n        },\n        {\n          type: 'image_url',\n          image_url: {\n            url: vlmsHallucinateImage.src,\n          },\n        },\n      ],\n    },\n    {\n      role: 'assistant',\n      content: [\n        {\n          type: 'text',\n          text: '',\n        },\n      ],\n    },\n  ]}\n/>\n\nWhy?\n\nIn the Protobuf, we specified these fields as optional, yet the model has output them anyway[^2].\nThis doesn't happen every time, but it happens far more often than I would like.\n\nWe could handwave these results away and say more prompt engineering would help, but it doesn't seem to, at least not reliably.\n\nIt's been difficult to get stable results for this particular extraction.\nThere are a lot of elements that can be varied:\n\n- the schema (Protobuf, Pydantic, JSON schema, etc.)\n- the surrounding prompt, e.g. \"... No talk. 
JSON only\".\n- the image quality\n- whether the image has the subtotal, tax and total labels (with the values always missing)\n\nI tried 4 different Protobufs against 5 different receipt PDFs.\nreceipt-original.pdf was a standard receipt with all the data you would expect.\nBoth models consistently extracted the data correctly from it -- 8/8 tests were successful.\n\n| Model                      | Proto                                                                                                                            | PDF                                                                                                              | Result |\n| -------------------------- | -------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------- | ------ |\n| gpt-4o-2024-08-06          | receipt.proto                             | receipt-original.pdf | ✅     |\n| gpt-4o-2024-08-06          | receipt_comments.proto           | receipt-original.pdf | ✅     |\n| gpt-4o-2024-08-06          | receipt_item_comments.proto | receipt-original.pdf | ✅     |\n| gpt-4o-2024-08-06          | receipt_optionals.proto         | receipt-original.pdf | ✅     |\n| claude-3-5-sonnet-20240620 | receipt.proto                             | receipt-original.pdf | ✅     |\n| claude-3-5-sonnet-20240620 | receipt_comments.proto           | receipt-original.pdf | ✅     |\n| claude-3-5-sonnet-20240620 | receipt_item_comments.proto | receipt-original.pdf | ✅     |\n| claude-3-5-sonnet-20240620 | receipt_optionals.proto         | receipt-original.pdf | ✅     |\n\nAs soon as I removed the values for subtotal, tax and total, the hallucinations started.\nI tried examples with just the values removed and with the values and labels removed.\nWe see test failures (hallucinations) by both models across all of these 
examples.\n\n| Model                      | Proto                                                                                                                            | PDF                                                                                                                              | Result |\n| -------------------------- | -------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------- | ------ |\n| gpt-4o-2024-08-06          | receipt.proto                             | receipt-no-tax-or-totals.pdf | ❌     |\n| gpt-4o-2024-08-06          | receipt_comments.proto           | receipt-no-tax-or-totals.pdf | ❌     |\n| gpt-4o-2024-08-06          | receipt_item_comments.proto | receipt-no-tax-or-totals.pdf | ✅     |\n| gpt-4o-2024-08-06          | receipt_optionals.proto         | receipt-no-tax-or-totals.pdf | ❌     |\n| gpt-4o-2024-08-06          | receipt.proto                             | receipt-no-total-labels.pdf   | ❌     |\n| gpt-4o-2024-08-06          | receipt_comments.proto           | receipt-no-total-labels.pdf   | ✅     |\n| gpt-4o-2024-08-06          | receipt_item_comments.proto | receipt-no-total-labels.pdf   | ✅     |\n| gpt-4o-2024-08-06          | receipt_optionals.proto         | receipt-no-total-labels.pdf   | ✅     |\n| gpt-4o-2024-08-06          | receipt.proto                             | receipt-wild-numbers.pdf         | ❌     |\n| gpt-4o-2024-08-06          | receipt_comments.proto           | receipt-wild-numbers.pdf         | ✅     |\n| gpt-4o-2024-08-06          | receipt_item_comments.proto | receipt-wild-numbers.pdf         | ✅     |\n| gpt-4o-2024-08-06          | receipt_optionals.proto         | receipt-wild-numbers.pdf         | ❌     |\n| claude-3-5-sonnet-20240620 | receipt.proto                             | 
receipt-no-tax-or-totals.pdf | ❌     |\n| claude-3-5-sonnet-20240620 | receipt_comments.proto           | receipt-no-tax-or-totals.pdf | ✅     |\n| claude-3-5-sonnet-20240620 | receipt_item_comments.proto | receipt-no-tax-or-totals.pdf | ✅     |\n| claude-3-5-sonnet-20240620 | receipt_optionals.proto         | receipt-no-tax-or-totals.pdf | ❌     |\n| claude-3-5-sonnet-20240620 | receipt.proto                             | receipt-no-total-labels.pdf   | ❌     |\n| claude-3-5-sonnet-20240620 | receipt_comments.proto           | receipt-no-total-labels.pdf   | ✅     |\n| claude-3-5-sonnet-20240620 | receipt_item_comments.proto | receipt-no-total-labels.pdf   | ✅     |\n| claude-3-5-sonnet-20240620 | receipt_optionals.proto         | receipt-no-total-labels.pdf   | ✅     |\n| claude-3-5-sonnet-20240620 | receipt.proto                             | receipt-wild-numbers.pdf         | ❌     |\n| claude-3-5-sonnet-20240620 | receipt_comments.proto           | receipt-wild-numbers.pdf         | ❌     |\n| claude-3-5-sonnet-20240620 | receipt_item_comments.proto | receipt-wild-numbers.pdf         | ✅     |\n| claude-3-5-sonnet-20240620 | receipt_optionals.proto         | receipt-wild-numbers.pdf         | ❌     |\n\nFor receipt-no-tax-or-totals.pdf, the receipt with the subtotal, tax and total labels but values missing, 5/8 of the tests failed, meaning the model output at least one of these values even though none of them are actually in the document.\n\nI ran three more rounds of testing for this document specifically.\n\n| Model                      | Proto                                                                                                                            | PDF                                                                                                                              | Failures |\n| -------------------------- | 
-------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------- | -------- |\n| gpt-4o-2024-08-06          | receipt.proto                             | receipt-no-tax-or-totals.pdf | 3/3      |\n| gpt-4o-2024-08-06          | receipt_comments.proto           | receipt-no-tax-or-totals.pdf | 1/3      |\n| gpt-4o-2024-08-06          | receipt_item_comments.proto | receipt-no-tax-or-totals.pdf | 1/3      |\n| gpt-4o-2024-08-06          | receipt_optionals.proto         | receipt-no-tax-or-totals.pdf | 1/3      |\n| claude-3-5-sonnet-20240620 | receipt.proto                             | receipt-no-tax-or-totals.pdf | 3/3      |\n| claude-3-5-sonnet-20240620 | receipt_comments.proto           | receipt-no-tax-or-totals.pdf | 2/3      |\n| claude-3-5-sonnet-20240620 | receipt_item_comments.proto | receipt-no-tax-or-totals.pdf | 0/3      |\n| claude-3-5-sonnet-20240620 | receipt_optionals.proto         | receipt-no-tax-or-totals.pdf | 2/3      |\n\nIn all but one model/Protobuf combination, we see hallucinations in at least 1 of 3 runs, and the only combination for which we don't (claude-3-5-sonnet-20240620/receipt_item_comments.proto) is pretty kludgy.\n\nTakeaways\n\nThere are lots more things to try her",
  "canonicalUrl": "https://www.danielcorin.com/posts/2024/vlms-hallucinate/index"
}