{
  "$type": "site.standard.document",
  "canonicalUrl": "https://rednafi.com/system/tap-compare-testing/",
  "description": "Master shadow testing for large-scale system migrations. Learn to safely rewrite services by comparing outputs between old and new implementations.",
  "path": "/system/tap-compare-testing/",
  "publishedAt": "2025-12-13T00:00:00.000Z",
  "site": "at://did:plc:fgtm2c26vfcj74rfmeggbyqj/site.standard.publication/3mnl6f7ob462z",
  "tags": [
    "Distributed Systems",
    "Go",
    "Testing"
  ],
  "textContent": "Throughout the years, I've been part of a few medium- to large-scale system migrations. As\nin, rewriting old logic in a new language or stack. The goal is usually better scalability,\nresilience, and maintainability, or more flexibility to adapt to changing requirements. Now,\nwhether rewriting your system is the right move is its own debate.\n\nA common question that shows up during a migration is, \"How do we make sure the new system\nbehaves exactly like the old one, minus the icky parts?\" Another one is, \"How do we build\nthe new system while the old one keeps changing without disrupting the business?\"\n\nThere's no universal playbook. It depends on how gnarly the old system is, how ambitious the\nnew system is, and how much risk the business can stomach. After going through a few of\nthese migrations, I realized one approach keeps showing up. So I'll expand on it here.\n\nThe idea is that you shadow a slice of production traffic to the new system. The old system\nkeeps serving real users. A copy of that same traffic is forwarded to the new system along\nwith the old system's response. The new system runs the same business logic and compares its\noutputs with the old one. The entire point is to make the new system return the exact same\nanswer the old one would have, for the same inputs and the same state.\n\nAt the start, you don't rip out bad behavior or ship new features. Everything is about\noutput parity. Once the systems line up and the new one has processed enough real traffic to\nearn some trust, you start sending actual user traffic to it. If something blows up, you\nroll back. If it behaves as expected, you push more traffic. Eventually the old system gets\nto ride off into the sunset.\n\nThis workflow is typically known as _[shadow testing]_ or _tap and compare testing_.\n\nThe scenario\n\nSay we have a Python service with a handful of read and write endpoints the business depends\non. 
It's been around for a while, and different teams have patched it over the years. Some\nof the logic does what it does for reasons nobody remembers anymore. It still works, but\nit's getting harder to maintain. Also, the business wants a tighter SLO. So the team decides\nto rewrite it in Go.\n\nTo keep the scope tight, I'm only talking about HTTP read and write endpoints on the main\nrequest path. The same applies to gRPC, minus the transport details. I'm ignoring everything\nelse: message queues, background workers, async job processing, analytics pipelines, and\nother side channels that also need migrating.\n\nDuring shadow testing, the Python service stays on the main request path. All real user\ntraffic still goes to the Python service. A proxy or load balancer sitting in front of it\nforwards requests as usual, gets an answer back, and returns that answer to the user.\n\nThat same proxy also emits tap events. Each tap event contains a copy of the request and the\ncanonical response the Python service sent to the user. Those tap events go to the Go\nservice on a shadow path. From the outside world, nothing has changed. Clients talk to\nPython, and Python talks to the live production database.\n\nThe Go service never serves real users during this phase. It only sees tap events. For each\nevent, it reconstructs the request, runs its version of the logic against a separate\ndatastore, and compares its outputs with the Python response recorded in the event. The\nPython response is always the source of truth.\n\nThe Go service has its own datastore, usually a snapshot or replica of production that's\nbeen detached so it can be written freely. This is the sister datastore. The Go service only\ntalks to it for reads and writes. It never touches the real production DB. The sister\ndatastore is close enough to show real-world behavior but isolated enough that nothing\nbreaks.\n\nWith this setup in place, you spend time fixing differences. 
If the Python service returns a\nspecific payload shape or some quirky value, the Go service has to match it. If Python gets\na bug fix or a new feature, you update Go. You keep doing this until shadow traffic stops\nproducing mismatches. Then you start thinking about cutover.\n\nStart with read endpoints\n\nReads don't change anything in the database, so they are easier to start with.\n\nOn the main path, a user sends a request. The proxy forwards it to the Python service as\nusual. The Python service reads from the real database, builds a response, and returns it to\nthe caller.\n\nWhile that is happening, the proxy also constructs a tap event. At minimum, this event\ncontains:\n\n- The original request: method, URL, headers, body.\n- The canonical Python response: status code, headers, body.\n\nThe proxy sends this tap event to the Go service via an internal HTTP or RPC endpoint.\nAlternatively, it can publish the event to a Kafka stream, where a consumer eventually\nforwards it to the internal tap endpoint.\n\nThe important thing is that the tap event captures the exact input and output of the Python\nservice as seen by the real user.\n\nA typical read path diagram during tap compare looks like this:\n\n\n\n{{< mermaid >}}\ngraph TD\n    subgraph MAIN_PATH [MAIN PATH]\n        User([User]) --> Proxy\n        Proxy --> Python\n        Python <-- reads production state --> ProdDB[(Prod DB)]\n    end\n\n    subgraph SHADOW_PATH [SHADOW PATH]\n        Proxy -- \"tap event: {request, python_resp}\" --> Go\n        Go <--> SisterDB[(Sister DB)]\n        Go --> Log[Log mismatch?]\n    end\n\n    classDef userStyle fill:#6b7280,stroke:#4b5563,color:#fff\n    classDef proxyStyle fill:#7c3aed,stroke:#5b21b6,color:#fff\n    classDef pythonStyle fill:#2563eb,stroke:#1d4ed8,color:#fff\n    classDef goStyle fill:#0d9488,stroke:#0f766e,color:#fff\n    classDef dbStyle fill:#ca8a04,stroke:#a16207,color:#fff\n    classDef logStyle fill:#dc2626,stroke:#b91c1c,color:#fff\n\n    
class User userStyle\n    class Proxy proxyStyle\n    class Python pythonStyle\n    class Go goStyle\n    class ProdDB,SisterDB dbStyle\n    class Log logStyle\n{{</ mermaid >}}\n\n\n\nFrom the Go service's point of view, a tap event is just structured data. A simple shape\nmight look like this on the wire:\n\nThe Go side reconstructs the request, runs its own logic against the sister datastore, and\ncompares its answer with python_response. No extra call back into Python. No race between\na second read and the response that already went to the user.\n\nOn the Go side, a handler for a read tap event might look like this:\n\nA few things to notice:\n\n- Truth lives with the Python response that already went to the user.\n- The Go service sees exactly the same request the Python service saw.\n- Comparison happens off the user path. Users never wait on the Go service.\n- The Go service only touches the sister datastore, never the real one.\n- The tap handler doesn't return any payload. It just compares service outputs and emits\n  logs.\n\nWhen the read diffs drop to zero (or near zero) against live traffic, you can trust the Go\nimplementation matches the Python one.\n\nWrite endpoints are trickier\n\nWrite endpoints change state, so they are harder to migrate.\n\nOn the main path, only the Python service is allowed to mutate production state.\n\nA typical write looks like this on the main path:\n\n1. User sends a write request.\n2. Proxy forwards it to the Python service.\n3. Python runs the real write logic, talks to the live database, sends emails, charges\n   cards, and returns a response.\n4. Proxy returns that response to the user.\n\nThat path is the only one touching production. 
The Go service must not:\n\n- write anything to the real production database\n- trigger real external side effects\n- call any real Python write endpoint in a way that causes a second write\n\nFor writes, the tap event pushed by the proxy looks quite similar to reads:\n\nThe write path diagram during tap compare becomes:\n\n\n\n{{< mermaid >}}\ngraph TD\n    subgraph MAIN_PATH [MAIN PATH]\n        User([User]) --> Proxy\n        Proxy --> Python\n        Python <-- writes prod state, triggers side effects --> ProdDB[(Prod DB)]\n    end\n\n    subgraph SHADOW_PATH [SHADOW PATH]\n        Proxy -- \"tap event: {request, python_resp}\" --> Go\n        Go <--> SisterDB[(Sister DB)]\n        Go --> Log[Log mismatch?]\n    end\n\n    classDef userStyle fill:#6b7280,stroke:#4b5563,color:#fff\n    classDef proxyStyle fill:#7c3aed,stroke:#5b21b6,color:#fff\n    classDef pythonStyle fill:#2563eb,stroke:#1d4ed8,color:#fff\n    classDef goStyle fill:#0d9488,stroke:#0f766e,color:#fff\n    classDef dbStyle fill:#ca8a04,stroke:#a16207,color:#fff\n    classDef logStyle fill:#dc2626,stroke:#b91c1c,color:#fff\n\n    class User userStyle\n    class Proxy proxyStyle\n    class Python pythonStyle\n    class Go goStyle\n    class ProdDB,SisterDB dbStyle\n    class Log logStyle\n{{</ mermaid >}}\n\n\n\nOn the Go side, the write tap handler follows the same pattern as reads but has more corner\ncases to think through.\n\nA shadow write handler might look like this:\n\nYou are comparing how each system transforms the same request into a domain object and\nresponse. You are not trying to drive the Python service a second time. You are not trying\nto rebuild the Python result from scratch against changed state.\n\nBut with this setup, the write path has several corner cases to think through.\n\nUniqueness, validation, and state-dependent logic\n\nUniqueness checks, conditional updates, and other validations that depend on database state\nare sensitive to timing. 
The Python write runs against the actual production state at the\nmoment the main request hits. The Go write runs against whatever state exists in the sister\ndatastore when the tap event arrives.\n\nIf the sister datastore is a snapshot that is not continuously replicated, it will drift\nalmost immediately. Even with streaming replication, there may be short lags. That means:\n\n- A create request that was valid in prod might look invalid against a slightly stale\n  snapshot if another request changed state in between.\n- A conditional update like \"only update if version is X\" can take different branches if the\n  sister store has not applied the latest change yet.\n- A multi-entity invariant that Python enforced with a transaction might appear broken in\n  the sister store if replication replayed statements in a different order relative to the\n  tap event.\n\nYou should expect some write comparisons to be noisy because of state drift and treat those\nseparately. In practice you often:\n\n- Keep replication as close to real time as you can, or regularly reseed the sister\n  datastore.\n- Attach a few state fingerprints to the tap event, like the version of the row before and\n  after the write, so you can tell when the sister store is simply behind.\n- Filter out mismatches that can be traced to obvious replication lag when you look at diff\n  reports.\n\nThe important thing is: when you see a mismatch, you can decide whether it is a real logic\ndifference or just the sister store living in a slightly different universe for that\nrequest.\n\nIdempotency, retries, and ordering\n\nReal systems don't get one clean write per user action. 
You get retries, duplicates, and\nconcurrent updates.\n\nOn the main path, you might have:\n\n- A user hitting \"submit\" twice.\n- A client retrying on a network timeout.\n- Two services racing to update the same record.\n\nYour Python service probably already has a story for this, such as idempotency keys, version\nchecks, or last-write-wins semantics. The tap path needs to reflect what actually happened,\nnot an idealized story.\n\nBecause the tap event is constructed from the real request and real response at the proxy,\nit naturally honors whatever the Python service did. If a retry was coalesced into a single\nwrite under an idempotency key, you will see a single successful response in the tap stream.\nIf the second retry was rejected as a conflict, you will see that error. The Go service just\nneeds to implement the same semantics against the sister datastore.\n\nWhat still bites you is ordering. Tap events may arrive at the Go service a little out of\norder relative to how mutations hit production. If two writes race, Python might process\nthem in order A, B while the tap messages arrive as B, A. The sister datastore will then\nexperience a different sequence of state changes than production did, which can yield\nlegitimate differences in final state.\n\nYou can't fully eliminate this. 
What you can do is:\n\n- Keep tap delivery low latency and best-effort ordered.\n- Focus your comparisons more on single-request behavior (did CreateUser behave the same)\n  than on multi-request history until you are comfortable with the noise.\n- Use version numbers or timestamps in the domain model to detect when the sister store is\n  applying changes in a different order, and treat those as \"not comparable\" rather than\n  bugs.\n\nExternal side effects\n\nWrites often have external side effects: emails, payment gateways, cache invalidations,\nsearch indexing, analytics.\n\nThe tap path isolates database writes by using the sister datastore, but that is not enough\non its own. You have to run the Go service in a mode where those side-effectful calls are\neither disabled or mocked.\n\nThe usual pattern is:\n\n- Centralize side-effectful behavior behind interfaces or specific modules.\n- In normal production mode, those modules call real providers.\n- In tap compare mode, they are wired to no-op or record-only implementations.\n\nYou want the code paths that decide \"should we send a welcome email\" or \"should we charge\nthis card\" to run, because they influence the domain model and response shape. You don't\nwant the actual email to go out or the real payment provider to be hit twice.\n\nOn the Python side, you don't need dry runs or special write endpoints. The real main path\nalready did the work, and the tap event gives you the results. The only thing the Python\nservice might need for tap compare is a dedicated read endpoint that returns a normalized\nview of state if you want to sample post-write state directly. 
That read endpoint must not\ncause extra writes or side effects.\n\nWhat tap compare can and can't tell you on writes\n\nIt tells you:\n\n- For a given real user request and the production state that existed at that moment, what\n  the Python service chose to return.\n- Whether your Go service, running against a similar but separate view of state, tends to\n  produce the same shape and content of domain objects and responses.\n- Whether your Go write path can execute at all against realistic traffic without panicking\n  or tripping over obvious logic errors.\n\nIt doesn't guarantee:\n\n- That the Go service produces exactly the same side effects in exactly the same order as\n  the Python service. External systems and replication noise get in the way.\n- That the Go service behaves identically under arbitrary concurrent write histories. You\n  saw the histories that actually happened during the tap window, which might miss some edge\n  case interleavings.\n- That all mismatches are bugs. Some will be explained by replication lag, idempotency\n  behavior, or intended fixes.\n\nThe right way to think about it is: tap compare lets you align the new system with the old\none for the traffic you actually have, under the state and timing conditions you actually\nexperienced. It shrinks the unknowns before you put the new system in front of real users.\n\nFrom tap handlers to production handlers\n\nThe Tap handlers are test-only. They will never be promoted to production. They exist to\nvalidate the domain logic, not to serve users. The 204 No Content response makes this\nclear.\n\nHere's how the pieces fit together:\n\n- Core domain logic: methods on goUserService that take a context and input, return a\n  response. This is the code you're actually testing.\n- Tap handlers: call the domain logic, compare against the Python response, discard the\n  result. Pure validation.\n- Production handlers: call the same domain logic and write real HTTP responses. 
This is\n  what users hit after cutover.\n\nBoth tap and production handlers call the same domain logic. The difference is what happens\nto the result. Tap handlers compare and throw away. Production handlers serialize and\nreturn.\n\nA production handler might look like this:\n\nDuring tap compare, TapHandleGetUser feeds the same inputs into goUserService.GetUser\nand compares resp against the Python response. Meanwhile, HandleGetUser exists but isn't\non the main path yet. It might serve staging traffic or a canary behind a flag.\n\nOnce the diffs drop to zero, you have evidence goUserService.GetUser works correctly. At\nthat point, you route real traffic to HandleGetUser. The domain logic has already been\nvalidated. The production handler just wires it to HTTP.\n\nOnce the production handlers have started to serve real traffic, you can remove the tap\ntests:\n\n- Delete the tap handlers. The Tap prefix makes them easy to find.\n- Remove tap-only wiring. Strip out comparison code and sister-datastore plumbing.\n- Point domain logic at the real datastore. Flip a config or swap the write path.\n- Flip the proxy. Route traffic to HandleGetUser and HandleCreateUser.\n- Optionally keep a thin tap path. Mirror a small slice of traffic for extra safety.\n\nTap compare is scaffolding. Once you trust the domain logic, you throw it away and let the\nproduction handlers take over.\n\nOther risks and pitfalls\n\nA few things worth calling out beyond what the write section already covers:\n\n- Logging and privacy: Dumping full requests and responses on every mismatch is a good\n  way to leak user data. If this is relevant in your case, log IDs and fingerprints, not\n  full payloads.\n- Non-deterministic data: Auto-incremented IDs diverge, timestamps can differ by\n  milliseconds, and 10.0 vs 10 is not a meaningful difference. Normalize or ignore these\n  fields.\n- Bug compatibility: The Python code has bugs. The Go code may fix them, which shows up\n  as a mismatch. 
Sometimes you replicate the bug to keep the migration low-risk, then fix it\n  later once the new system is live.\n- Cost and blast radius: Shadowing production traffic is expensive. Plan for the extra\n  load so the tap path doesn't degrade the main path.\n\nParting words\n\nTypically, you don't have to build all the plumbing by hand. Proxies like [Envoy], [NGINX],\nand [HAProxy], or a service mesh like [Istio], can help you mirror traffic, capture tap\nevents, and feed them into a shadow service. I left out tool-specific workflows so that the\ncore concept doesn't get obscured.\n\nTap compare doesn't remove all the risk from a migration, but it moves a lot of it into a\nplace you can see: mismatched payloads, noisy writes, and gaps in business logic. Once those\nare understood, switching over to the new service is less of a big bang and more of a boring\nconfiguration change, followed by trimming a pile of Tap code you no longer need.\n\n\n\n\n[envoy]: https://www.envoyproxy.io/\n\n[nginx]: https://nginx.org/\n\n[haproxy]: https://www.haproxy.org/\n\n[istio]: https://istio.io/\n\n[shadow testing]: https://microsoft.github.io/code-with-engineering-playbook/automated-testing/shadow-testing/",
  "title": "Tap compare testing for service migration"
}