{
  "$type": "site.standard.document",
  "canonicalUrl": "https://rednafi.com/go/test-state-not-interactions/",
  "description": "Avoid brittle AI-generated tests that check implementation details. Write maintainable tests that verify behavior, not method calls.",
  "path": "/go/test-state-not-interactions/",
  "publishedAt": "2025-09-14T00:00:00.000Z",
  "site": "at://did:plc:fgtm2c26vfcj74rfmeggbyqj/site.standard.publication/3mnl6f7ob462z",
  "tags": [
    "Go",
    "Testing"
  ],
  "textContent": "With the advent of LLMs, the temptation to churn out a flood of unit tests for a false\nveneer of productivity and protection is stronger than ever.\n\nMy colleague Matthias Doepmann recently fired [a shot at AI-generated tests] that don't\nvalidate the behavior of the System Under Test (SUT) but instead create needless ceremony\naround internal implementations. At best, these tests give a shallow illusion of confidence\nin the system's correctness while breaking at the smallest change. At worst, they remain\ngreen even when the SUT's behavior changes.\n\nIn practice, they add maintenance overhead and drag down code reviews. The frustration in\nthat post wasn't about violating some abstract testing philosophy. It came from having to\nwade through countless implementation-checking tests churned out by LLMs across components\nof a real, large-scale distributed system.\n\nI think the problem persists for three reasons:\n\n- First, many developers have begun defaulting to LLMs for generating tests. Regrettably,\n  even in critical systems. In greenfield projects with no test baseline, AI agents often go\n  rogue and churn out these cheap implementation-checking tests. Google calls them\n  [interaction tests].\n- Second, the prevalence of mocking libraries encourages this pattern. They make it too easy\n  to write tests that assert \"which function called which\" instead of \"what actually\n  happened.\"\n- Third, once these tests exist, they create inertia and people keep piling on the same\n  style of tests to be consistent.\n\nTest state, not interactions\n\nThe general theme when writing unit tests should be checking the behavior of the system, not\nthe scaffolding of its implementation. It doesn't matter which method called which, how many\ntimes, or with what arguments.\n\nWhat matters is: if you give the SUT some input, does it return the expected output? 
In a\nstateful system, does the input cause the system to mutate some persistence layer in the\nexpected way? That persistence layer doesn't always need to be a real database; it could be\nan in-memory buffer.\n\nIn scenarios where your code invokes external systems, it is more useful to test your system\nwith canned responses from upstream calls rather than testing which method is being called.\n\nThe salient point is: _test outcomes, not implementation details._ As the book [Software\nEngineering at Google] puts it: _test state, not interactions:_\n\n> With state testing, you observe the system itself to see what it looks like after invoking\n> with it. With interaction testing, you instead check that the system took an expected\n> sequence of actions on its collaborators in response to invoking it. Many tests will\n> perform a combination of state and interaction validation.\n\nAnd the guidance that follows:\n\n> By far the most important way to ensure this is to write tests that invoke the system\n> being tested in the same way its users would; that is, make calls against its public API\n> rather than its implementation details. If tests work the same way as the system's users,\n> by definition, change that breaks a test might also break a user.\n\nI think the first step in the right direction is to accept that LLMs can't substitute for\nthought. The first few critical tests in your system shouldn't be written by LLMs, and you\nmust vet the tests churned out by the [genie that wants to leap]. Next up, you can often get\naway without a mocking library and, more often than not, doing so improves the quality and\nmaintainability of your tests.\n\nMocking libraries often don't help\n\nMocking libraries come with their own idiosyncratic syntax and workflows. On most occasions,\nhandwritten fakes are better than mocks. 
I'll use Go to make my point here because that's\nwhat I write the most these days, but the lesson applies to other languages too.\n\nConsider a simple UserService that depends on a DB interface. Its job is to delegate\nuser creation to the database and return any error to the caller:\n\nA mocking tool such as [mockery] can generate a mock implementation of the DB interface.\nThe generated code records calls and arguments so that tests can later assert whether the\nexpected interactions happened:\n\nUsing this mock, a test can be written to check that CreateUser interacts with the\ndependency in the expected way:\n\nThis works mechanically, but it breaks down in practice:\n\n1. It checks the collaborator call, not the result\n\n    A useful test would assert that \"alice\" was actually added or that a duplicate error was\n    returned. This one only verifies that InsertUser(\"alice\") was invoked once.\n\n2. It breaks on harmless refactors\n\n    If the database method is renamed while keeping the same semantics, callers see no\n    difference but the test fails:\n\n    \n\n    The mock-based test no longer compiles or needs rewiring, even though the public\n    behavior didn't change.\n\n3. And worse, it survives real bugs\n\n    If an error is accidentally swallowed, callers get the wrong signal but the test still\n    passes:\n\n    \n\n    A real DB or an in-memory fake would raise a constraint error that should propagate. The\n    mock test goes green anyway because it only checked the call path.\n\nThe common thread is that mocks lock tests to implementation details. They don't protect the\nbehavior that real users rely on.\n\nInterface-guided design and fakes\n\nA better approach is to keep the same interface but back it with a handwritten fake. 
The\nfake encodes the domain rules you care about, and tests can focus on outcomes instead of\nverifying which collaborator methods were called.\n\nHere, we're handwriting the fake implementation of the DB interface instead of generating\nit with a mock-generation tool.\n\nTests with the fake read like a statement of expected behavior:\n\nThis avoids the fragility of mocks. The tests survive harmless refactors, fail when behavior\nchanges, and stay readable without a mocking DSL.\n\nThe cost is maintaining the fake as the interface evolves. In practice, though, that's\nstill easier than constantly updating brittle mock expectations and occasionally dealing\nwith the mock library's [lengthy migration workflow].\n\nFakes vs real systems\n\nSometimes the right move is to test against a real database running in a container. That is\nstill state testing, just at a higher fidelity. The tradeoff is speed: you get stronger\nconfidence in behavior, but the tests run slower.\n\nMost of the time, handwritten in-memory fakes are what you need, and most tests should stick\nto those. When you do need the same behavior you would see in production, tools like\n[testcontainers] let you spin up databases, queues, or caches inside containers. Your tests\ncan then call the SUT normally, with its configuration pointing at the containerized\nservice, just as production code would connect to a production resource.\n\nParting words\n\nThis is not a rally against using LLMs for tests. But the seed tests, the first handful that\nset the standard, need to come from you. They define what correctness means in your system\nand give the ensuing tests a model to follow. If you hand that job to an LLM, you give up\nthe chance to shape how the rest of the suite grows.\n\nThis isn't to disparage mocking libraries either. But I have seen people armed with\noverzealous LLMs and mocks wreak havoc on a test suite and then unironically ask reviewers\nto review the mess. 
Instead of validating behavior, the suite fills up with fragile\ninteraction checks that break on refactors and stay green through real bugs.\n\nMore often than not, you can skip mocking libraries and rely on handwritten fakes that check\nthe behavior of the SUT instead of its interactions. The next person that needs to read and\nextend your tests might thank you for that.\n\n\n\n\n\n[a shot at ai-generated tests]:\n    https://revontulet.dev/p/2025-dont-let-your-mocks-mock-you/\n\n[interaction tests]:\n    https://abseil.io/resources/swe-book/html/ch12.html#:~:text=With%20interaction%20testing%2C%20you%20instead%20check%20that%20the%20system%20took%20an%20expected%20sequence%20of%20actions%20on%20its%20collaborators%20in%20response%20to%20invoking%20it.%20Many%20tests%20will%20perform%20a%20combination%20of%20state%20and%20interaction%20validation.\n\n[software engineering at google]:\n    https://abseil.io/resources/swe-book/html/ch12.html\n\n[genie that wants to leap]:\n    https://tidyfirst.substack.com/p/genie-wants-to-leap\n\n[mockery]:\n    https://vektra.github.io/mockery/latest/\n\n[lengthy migration workflow]:\n    https://vektra.github.io/mockery/v3.5/v3/\n\n[testcontainers]:\n    https://testcontainers.com/",
  "title": "Test state, not interactions"
}