VLM data extraction with Protobufs
import Chat from '@components/prose/Chat.astro';
In light of OpenAI releasing structured output in the model API, let's move output structuring another level up the stack to the microservice/RPC level.
A light intro to Protobufs
Many services (mostly in microservice land) use Protocol Buffers (protobufs) to establish contracts for what data an RPC requires and what it will return. If you're completely unfamiliar with protobufs, you can read up on them here.
Here is an example of a message that a protobuf service might return.
Here's a very simple service that makes use of that message.
Protobuf messages define the container into which data is packed and sent back to the caller. Protobuf messages and services typically live in .proto files. A tool called protoc can be used to generate code in the language of your choice to help you interact with a system that requires and responds with protobufs.
Extraction with a model
Similar to how we can use libraries like Pydantic (via libraries like instructor) and targeted prompting to get Pydantic objects in and out of a model, we can accomplish something quite similar with protobufs. The benefit of using protobufs rather than Pydantic, is they are a language agnostic data interchange format. It doesn't matter what language our caller or our server uses -- the following approach can still be applied.
A simple example
<Chat model="gpt-4o" summary="Simple extraction with a protobuf" messages={[ { role: 'user', content: [ { type: 'text', text: '\n\nUsing the provided content, extract data as JSON in adherence to the above schema.\nNo talk. JSON only.\n\nContent:\nReceipt from Acme Groceries\nDate: 2024-03-15\nTotal: $42.99\nReceipt ID: R12345\n', }, ], }, { role: 'assistant', content: [ { type: 'text', text: '', }, ], }, ]} />
We've seen this before. Let's try more things.
Data extraction with VLMs
We now have Vision Models can do inference and perform tasks with images in the context. Let's prompt the model to extract data from a screenshot I took while finetuning gpt-3.5.turbo.
<Chat model="gpt-4o" summary="Extract fine-tuning job data" messages={[ { role: 'user', content: [ { type: 'text', text: '\n\nUsing the provided content and images, extract data as JSON in adherence to the above schema.\nIf multiple pages or images are provided, combine the information into a single JSON object.\nNo talk. JSON only.', }, { type: 'image_url', image_url: { url: 'https://danielcorin.com/img/posts/2024/fine-tuning-connections.png', }, }, ], }, { role: 'assistant', content: [ { type: 'text', text: '', }, ], }, ]} />
This seems to work quite well.
Parameterize the approach
A parameterized version of the prompts above looks like this.
<Chat model="gpt-4o" summary="Prompt template for data extraction" messages={[ { role: 'user', content: [ { type: 'text', text: '\n\nUsing the provided content and images, extract an instance of {ProtoMessage} as JSON in adherence to the above schema.\nIf multiple pages or images are provided, combine the information into a single JSON object.\nNo talk. JSON only.', }, { type: 'text', text: '{some image}', }, ], }, ]} />
This next part is a bit unusual but stick with me.
We can setup a gRPC server and proto service that accepts a path to a .proto and a file path and responds with the Protobuf and JSON versions of the extraction using the approach above with a model. E.g.
If we assume we have the working server described above, given the FinetuneJob object above, here is what an example request script might look like
For this example, our server is running on the same system as the client. The server
- receives the request above
- reads the image and .proto files from the file system
- builds a prompt
- makes a call to a model to do inference to extract the desired data in adherence to the specified protobuf message schema
- returns the protobuf message as a google.protobuf.Any type, which can be unpacked by the client into the specific message type passed in
With this approach, we now have a Protobuf service that returns a Protobuf message in the schema we describe. However, more interestingly, we can modify the extraction instructions and resulting structure of the returned protobuf message by modifying the message itself.
In the FinetuneJob example above, it we also want to extract status from the image, we only need to augment the message:
Running the inference again with the same prompt above and updated protobufs, yields
Note the last item status now has the expected value "Succeeded".
Try it out
I wrote up a more complete proof-of-concept in this repo.
Discussion in the ATmosphere