Raw Record Source

{
  "path": "/posts/2024/evals-unit-testing-for-lms",
  "site": "at://did:plc:mracrip6qu3vw46nbewg44sm/site.standard.publication/self",
  "tags": [
    "evals",
    "language_models"
  ],
  "$type": "site.standard.document",
  "title": "Evals: unit testing for language models",
  "updatedAt": "2024-05-15T18:42:38.000Z",
  "publishedAt": "2024-05-15T18:42:38.000Z",
  "textContent": "Generative AI and language models are fun to play with but you don't really have\nsomething you can confidently ship to users until you test what you've built.\n\nWhat are evals?\n\nEvals are like the unit tests for LLMs. Similar to unit tests, evals can take on\nmany different forms -- they are just code you run to generate a model\ncompletion then check the contents of that completion. A more challenging part\nabout LLMs relative to \"average code\" is that their outputs aren't really\ndeterministic. Let's think about non-deterministic (less-deterministic?) code\nfor a second. If you were testing a random number generator you might write code\nlike this:\n\nThis approach allows you to test the bounds of the function random() without\nrelying on a single specific result.\n\n<aside>This is not entirely sufficient testing for random number generation; we\nwould probably want to test more things like the distribution of values, trying\ndifferent seeds, etc.</aside>\n\nA simple LLM use case\n\nIn the case of LLMs, I've observed several different approaches to determine\nwhether the model is behaving as expected. If the LLM output is highly\nconstrained (e.g., if it's being used as a classifier), simple assertions could\nbe sufficient to validate the LLM is performing its function as intended.\n\nNote: I'm using a\nMarvin-esque style of writing an \"AI-powered\" function.\nThe code is not meant to be runnable, just illustrative of the approach.\n\nA more complex use case\n\nIf the LLM is doing something more complicated, a more flexible approach could\nbe required. For example, let's say we expect the LLM to output a recipe as a\nmarkdown list. 
It would be somewhat hard to validate the contents of the recipe\nwith deterministic code, but we could validate the structure of the model\nresponse (to start at least).\n\nThese approaches are somewhat naive but they impose helpful guardrails around\nthe basic structure and expectations for the LLM outputs of an application.\n\nUsing a model to evaluate a model response?\n\nSome folks are going further, using the model to validate its own outputs in the\nsame completion (by prompting the model to explain itself or refine an initial\nresponse) or in separate calls where the model takes a previous model output in as\npart of its prompt then generates a new completion. A couple of places I've\nnoticed this approach being used are to try and detect hallucinations or\ntoxicity.\n\nHere is an example of what a simple implementation of an LLM-based toxicity\ndetector for LLM outputs could look like:\n\nOur \"test\" now has two non-deterministic components:\n\n- the model-generated birthday card\n- the model-generated evaluation of the birthday card's contents\n\nI think you can derive a directional signal from this approach. Say we called\ngenerate_birthday_card in production and then contains_toxic_language on its\noutput. We could report stats on the approximate % of toxic responses. We could\ntry and tweak our prompt in generate_birthday_card to reduce this percentage\nor block the response to the user if contains_toxic_language == True. It seems\nlike the library (or OpenAI API itself) may even help with this.\n\nAt scale with this approach, there will still probably be both false positives\nand false negatives. Sometimes the model will detect toxicity when we wouldn't\nexpect it to and sometimes the model will fail to detect toxicity when it is\npresent in the contents of the birthday card. To distill these model-based\nmeasurements down to \"% of toxic responses\" is a bit misleading. 
There can be\nerrors at either step, which can compound errors in the reporting of \"% of toxic\nresponses\", which is decided entirely by the model. Lastly, it's likely possible\nto do prompt injection in a way that produces toxic output when calling\ngenerate_birthday_card and \"fools\" the model when it runs the\ncontains_toxic_language check into reporting the content is not toxic. This\nthwarts your ability to measure the \"% of toxic responses\" because the model\nyou're attempting to use to measure toxicity has been undermined and does not\nreport correctly. This means an aggregate measurement of 2% toxicity in the\nresponses of your birthday card-generating LLM app may not reflect reality at\nall.\n\nWhy is this bad?\n\nThis approach is not necessarily _bad_, but we shouldn't lull ourselves into\nfeeling a false sense of security when we have models evaluating the outputs of\nmodels. To start, it's important to consider your use case. If you're building a\nchatbot for your e-commerce store visitors, the potential downsides of an\nimperfect model response are likely less impactful than if you are reading data\nfrom receipts and trying to do accounting for your business with the output\ndata. The former has a wide range of possible, useful modes of operation. The\nlatter generally has only one correct answer. If you're relying on a model to\nreport on whether your model generations are correct, healthy, or fitting a\ncertain criterion, you need to anticipate ways in which the reporting model\nmight perform its job incorrectly and add other guardrails and measurements that\ncan give you more signal about the health of your model responses.\n\nWhy models are still worth it\n\nModels don't have to be perfect to be useful. Even in the accounting example,\nwhere we require our numbers to be correct, we can add deterministic checks and\nsafeguards to our system (do line items add up correctly, does the sum of all\nreceipts match the system's total?) 
that can flag potentially incorrect\ncalculations for a second look. Even deterministic software breaks all the time.\nWe engineer around these breakages by fixing things(!) or with other things like\nerror messages, system restarts and human processes. Models are useful, flexible\ntools but we shouldn't abandon existing best practices just because we ran our\ndemo a few times and it looks like it worked. Measure and plan for failures as a\npart of your design. I'd love to hear what works for you.\n\nI got to the end of this post and decided to make the code\nreal.\nI wasn't quite able to build a successful prompt injection for the birthday card\nuse case, but hopefully the attempt describes the threat vector reasonably well.",
  "canonicalUrl": "https://www.danielcorin.com/posts/2024/evals-unit-testing-for-lms"
}