Thought Eddies

Using Marvin for Structured Data Extraction

Dan Corin July 12, 2023

I've been following the "AI engineering framework" marvin for several months now. In addition to openai_function_call, it's currently one of my favorite abstractions built on top of a language model. The docs are quite good, but as a quick demo, I've ported over a simplified version of an example from an earlier post, this time using marvin.

The result:

The code is clean and the result is good quality. The abstraction lets me almost entirely avoid writing code that calls the language model. I get to think in data structures and code, and the language model's response is woven into the software through the primitives I define. However, the response isn't exactly what I want. I don't like that extra suffixes are included in some of the units, for example "unit": "cup unsalted". The following modification to the Ingredient class helps improve this:
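A minimal sketch of this kind of change, assuming Ingredient is a pydantic model and that field descriptions are passed along to the language model. Both are assumptions, as are the `name` and `quantity` fields and the wording of the description.

```python
from pydantic import BaseModel, Field


class Ingredient(BaseModel):
    # `name` and `quantity` are illustrative; only `unit` is named in the text.
    name: str
    quantity: float
    # The description is a hint to the model about what belongs in this field.
    unit: str = Field(
        description="Unit of measure only, e.g. 'cup' or 'tbsp', "
        "with no ingredient descriptors"
    )


# Constructing the model by hand shows the shape the extraction should produce.
example = Ingredient(name="butter", quantity=0.5, unit="cup")
print(example.unit)
```

The idea is to keep the steering next to the field it applies to, rather than in a separate prompt string.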

New output:

This mostly looks good. My only remaining complaint is that if no details are extracted, the field is still included as an empty string.

I tried a few different modifications to the Ingredient class to eliminate this, but all were unsuccessful: the output still included "details": "" for some ingredients.

Without reading the prompt and response verbatim, it's hard to tell what is going on here. Inspecting pydantic's behavior for a null value, we see details show up as None rather than an empty string:

The outputted JSON now contains null for the field:
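As a small sketch of that pydantic behavior, assuming an Ingredient model with an optional `details` field (the other field names and the sample values are illustrative):

```python
import json
from typing import Optional

from pydantic import BaseModel


class Ingredient(BaseModel):
    # Field names other than `unit` and `details` are assumptions.
    name: str
    quantity: float
    unit: str
    details: Optional[str] = None  # falls back to None when not supplied


ingredient = Ingredient(name="butter", quantity=0.5, unit="cup")

# Iterating a pydantic model yields (field, value) pairs, so dict() works here.
as_json = json.dumps(dict(ingredient))

print(ingredient.details)
print(as_json)
```

When the field is left out, pydantic fills in None and the serialized JSON carries null, so an empty string in the final output most likely comes from the model's response rather than from pydantic.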

I have to assume the language model is outputting the empty string ("") rather than null or omitting the field. As a final test, I ran the code again using gpt-4 and the last definition for details above.

GPT-4 is slower and more expensive, and it still does not do what I want. This small issue isn't difficult to correct in code, but it provides a bit of signal into how well the model follows instructions with this approach to prompting, which is a function of both the model and the prompt itself.
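One way to correct it in code is a small post-processing step, sketched here with the standard library only. The function name `drop_empty_details` and the sample data are hypothetical, and the sketch assumes the extracted ingredients are available as plain dictionaries.

```python
from typing import Any


def drop_empty_details(ingredients: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Return copies of the ingredient dicts with empty `details` removed."""
    cleaned = []
    for ingredient in ingredients:
        item = dict(ingredient)  # shallow copy so the input is left untouched
        details = item.get("details")
        # Treat None, "" and whitespace-only strings as "no details".
        if details is None or (isinstance(details, str) and not details.strip()):
            item.pop("details", None)
        cleaned.append(item)
    return cleaned


# Hypothetical extraction result, shaped like the output described above.
extracted = [
    {"name": "butter", "quantity": 0.5, "unit": "cup", "details": "unsalted"},
    {"name": "sugar", "quantity": 1, "unit": "cup", "details": ""},
]
cleaned = drop_empty_details(extracted)
print(cleaned)
```

This hides the symptom without changing what the model returns, so the empty string remains a useful signal when comparing prompts or models.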
