Raw Record Source

{
  "path": "/posts/2024/vlm-data-extraction-with-protobufs",
  "site": "at://did:plc:mracrip6qu3vw46nbewg44sm/site.standard.publication/self",
  "tags": [
    "vlms",
    "protobuf"
  ],
  "$type": "site.standard.document",
  "title": "VLM data extraction with Protobufs",
  "updatedAt": "2024-08-03T16:37:52.000Z",
  "publishedAt": "2024-08-03T16:37:52.000Z",
  "textContent": "import Chat from '@components/prose/Chat.astro';\n\nIn light of OpenAI releasing structured output in the model API, let's move output structuring another level up the stack to the microservice/RPC level.\n\nA light intro to Protobufs\n\nMany services (mostly in microservice land) use Protocol Buffers (protobufs) to establish contracts for what data an RPC requires and what it will return.\nIf you're completely unfamiliar with protobufs, you can read up on them here.\n\nHere is an example of a message that a protobuf service might return.\n\nHere's a very simple service that makes use of that message.\n\nProtobuf messages define the container into which data is packed and sent back to the caller.\nProtobuf messages and services typically live in .proto files.\nA tool called protoc can be used to generate code in the language of your choice to help you interact with a system that requires and responds with protobufs.\n\nExtraction with a model\n\nSimilar to how we can use libraries like Pydantic (via libraries like instructor) and targeted prompting to get Pydantic objects in and out of a model, we can accomplish something quite similar with protobufs.\nThe benefit of using protobufs rather than Pydantic is that they are a language-agnostic data interchange format.\nIt doesn't matter what language our caller or our server uses -- the following approach can still be applied.\n\nA simple example\n\n<Chat\n  model=\"gpt-4o\"\n  summary=\"Simple extraction with a protobuf\"\n  messages={[\n    {\n      role: 'user',\n      content: [\n        {\n          type: 'text',\n          text: '\\n\\nUsing the provided content, extract data as JSON in adherence to the above schema.\\nNo talk. 
JSON only.\\n\\nContent:\\nReceipt from Acme Groceries\\nDate: 2024-03-15\\nTotal: $42.99\\nReceipt ID: R12345\\n',\n        },\n      ],\n    },\n    {\n      role: 'assistant',\n      content: [\n        {\n          type: 'text',\n          text: '',\n        },\n      ],\n    },\n  ]}\n/>\n\nWe've seen this before.\nLet's try more things.\n\nData extraction with VLMs\n\nWe now have vision models that can do inference and perform tasks with images in the context.\nLet's prompt the model to extract data from a screenshot I took while finetuning gpt-3.5-turbo.\n\n<Chat\n  model=\"gpt-4o\"\n  summary=\"Extract fine-tuning job data\"\n  messages={[\n    {\n      role: 'user',\n      content: [\n        {\n          type: 'text',\n          text: '\\n\\nUsing the provided content and images, extract data as JSON in adherence to the above schema.\\nIf multiple pages or images are provided, combine the information into a single JSON object.\\nNo talk. JSON only.',\n        },\n        {\n          type: 'image_url',\n          image_url: {\n            url: 'https://danielcorin.com/img/posts/2024/fine-tuning-connections.png',\n          },\n        },\n      ],\n    },\n    {\n      role: 'assistant',\n      content: [\n        {\n          type: 'text',\n          text: '',\n        },\n      ],\n    },\n  ]}\n/>\n\nThis seems to work quite well.\n\nParameterize the approach\n\nA parameterized version of the prompts above looks like this.\n\n<Chat\n  model=\"gpt-4o\"\n  summary=\"Prompt template for data extraction\"\n  messages={[\n    {\n      role: 'user',\n      content: [\n        {\n          type: 'text',\n          text: '\\n\\nUsing the provided content and images, extract an instance of {ProtoMessage} as JSON in adherence to the above schema.\\nIf multiple pages or images are provided, combine the information into a single JSON object.\\nNo talk. 
JSON only.',\n        },\n        {\n          type: 'text',\n          text: '{some image}',\n        },\n      ],\n    },\n  ]}\n/>\n\nThis next part is a bit unusual, but stick with me.\n\nWe can set up a gRPC server and proto service that accepts a path to a .proto and a file path and responds with the Protobuf and JSON versions of the extraction using the approach above with a model. E.g.\n\nIf we assume we have the working server described above, given the FinetuneJob object above, here is what an example request script might look like\n\nFor this example, our server is running on the same system as the client.\nThe server\n\n1. receives the request above\n2. reads the image and .proto files from the file system\n3. builds a prompt\n4. makes a call to a model to do inference to extract the desired data in adherence to the specified protobuf message schema\n5. returns the protobuf message as a google.protobuf.Any type, which can be unpacked by the client into the specific message type passed in\n\nWith this approach, we now have a Protobuf service that returns a Protobuf message in the schema we describe.\nMore interestingly, we can change the extraction instructions and the resulting structure of the returned protobuf message by modifying the message itself.\n\nIn the FinetuneJob example above, if we also want to extract status from the image, we only need to augment the message:\n\nRunning the inference again with the same prompt and the updated protobufs yields\n\nNote the last item status now has the expected value \"Succeeded\".\n\nTry it out\n\nI wrote up a more complete proof-of-concept in this repo.",
  "canonicalUrl": "https://www.danielcorin.com/posts/2024/vlm-data-extraction-with-protobufs"
}