Raw Record Source

{
  "path": "/posts/2023/using-marvin-for-structured-data-extraction",
  "site": "at://did:plc:mracrip6qu3vw46nbewg44sm/site.standard.publication/self",
  "$type": "site.standard.document",
  "title": "Using Marvin for Structured Data Extraction",
  "updatedAt": "2023-07-12T12:28:51.000Z",
  "publishedAt": "2023-07-12T12:28:51.000Z",
  "textContent": "I've been following the \"AI engineering framework\" marvin for several months now.\nIn addition to openai_function_call, it's currently one of my favorite abstractions built on top of a language model.\nThe docs are quite good, but as a quick demo, I've ported over a simplified version of an example from an earlier post, this time using marvin.\n\nThe result:\n\nThe code is clean and the result is good quality.\nThe abstraction allows me to almost entirely avoid dealing with code that calls the language model.\nI get to think in data structures and code, and the language model's response is woven into the software using the primitives I define.\nHowever, the response isn't exactly how I want it.\nI don't like that additional suffixes are being included in some of the units.\nFor example, \"unit\": \"cup unsalted\".\nThe following modification to the Ingredient class helps improve this:\n\nNew output:\n\nThis mostly looks good.\nMy only remaining complaint is that if no details are extracted, the field is still included as an empty string.\n\nI tried a few different modifications to the Ingredient class to eliminate this, but all were unsuccessful: the output still included \"details\": \"\" for some ingredients.\n\nIt's hard to tell what is going on here without actually reading the prompt and response verbatim.\nInspecting pydantic's behavior for a null value, we see details show up as None rather than an empty string:\n\nThe outputted JSON now contains null for the field:\n\nI have to assume the language model is outputting the empty string (\"\") rather than null or omitting the field.\nAs a final test, I ran the code again using gpt-4 and the last definition for details above.\n\nGPT-4 is slower and more expensive, and it still does not do what I want.\nThis small issue isn't difficult to correct in code, but it provides a bit of signal into how well the model follows instructions with this approach to prompting, which is a function of both the model and the prompt itself.",
  "canonicalUrl": "https://www.danielcorin.com/posts/2023/using-marvin-for-structured-data-extraction"
}