{
"path": "/posts/2023/promptfoo-and-output-structure",
"site": "at://did:plc:mracrip6qu3vw46nbewg44sm/site.standard.publication/self",
"tags": [
"language_models",
"promptfoo"
],
"$type": "site.standard.document",
"title": "Promptfoo and standardizing output structure across models",
"updatedAt": "2023-07-27T23:20:00.000Z",
"publishedAt": "2023-07-27T23:20:00.000Z",
"textContent": "promptfoo is a Javascript library and CLI for testing and evaluating LLM output quality.\nIt's straightforward to install and get up and running quickly.\nAs a first experiment, I've used it to compare the output of three similar prompts that specify their output structure using different modes of schema definition.\nTo get started\n\nThe scaffold creates a prompts.txt file, and this is where I wrote a parameterized prompt to classify and extract data from a support message.\n\nThe more interesting part is the promptfooconfig.yaml file which specifies the different inputs and schemas:\n\nSince we're keeping input constant, the output is a matrix of result that vary by schema and provider.\n\n<table>\n <tr>\n <th>Provider</th>\n <th>Schema</th>\n <th>Output</th>\n </tr>\n <tr>\n <td>gpt-3.5-turbo-16k</td>\n <td>Pydantic</td>\n <td>{\n\"message_type\": \"complaint\"\n}\n </td>\n </tr>\n <tr>\n <tr>\n <td>gpt-3.5-turbo-16k</td>\n <td>JSON</td>\n <td>{\n\"message_type\": \"complaint\"\n}\n </td>\n </tr>\n <tr>\n <td>gpt-3.5-turbo-16k</td>\n <td>Protobuf</td>\n <td>{\n\"order_id\": \"\",\n\"message_type\": \"COMPLAINT\"\n}\n </td>\n </tr>\n <tr>\n <td>gpt-4</td>\n <td>Pydantic</td>\n <td>{\n\"order_id\": null,\n\"message_type\": \"complaint\"\n}\n </td>\n </tr>\n <tr>\n <tr>\n <td>gpt-4</td>\n <td>JSON</td>\n <td>{\n\"order_id\": null,\n\"message_type\": \"complaint\"\n}\n </td>\n </tr>\n <tr>\n <td>gpt-4</td>\n <td>Protobuf</td>\n <td>{\n\"order_id\": \"\",\n\"message_type\": 1\n}\n </td>\n </tr>\n</table>\n\nThis is a lot of variety!\nHowever, the results could be helpful as we think about the output we actually want and the model we plan to use.\nWe see order_id is sometimes omitted, sometimes the empty string \"\" and sometimes null.\nWe also see message_type shows up as the upper and lowercase version of \"complaint\" as well as the corresponding protobuf enum integer value 1.\n\nIt would be nice if there was more consistency.\n\nIt would be useful (if possible) to write prompts that yield object schema with consistent structure across models.\nWe could try and prompt engineer in this direction or try and create a dataset and fine tune.\n\nSome minor changes to the prompt yield consistency across all schemas and models for the input.\n\nWith the new prompt, all outputs become",
"canonicalUrl": "https://www.danielcorin.com/posts/2023/promptfoo-and-output-structure"
}