{
  "$type": "site.standard.document",
  "canonicalUrl": "https://rednafi.com/misc/notes-on-event-driven-systems/",
  "description": "Deep dive into event-driven architecture patterns. Learn publish-subscribe, CQRS, outbox pattern, eventual consistency, and handling microservice coupling.",
  "path": "/misc/notes-on-event-driven-systems/",
  "publishedAt": "2024-09-21T00:00:00.000Z",
  "site": "at://did:plc:fgtm2c26vfcj74rfmeggbyqj/site.standard.publication/3mnl6f7ob462z",
  "tags": [
    "Networking",
    "Distributed Systems",
    "Design Patterns"
  ],
  "textContent": "I spent the evening watching this incredibly grokkable talk on event-driven services by\nJames Eastham at NDC London 2024. Below is a cleaned-up version of my notes.\n\nI highly recommend watching the full talk if you're interested before reading this\ndistillation.\n\n{{< youtube qcJASFx-F5g >}}\n\nThe curse of tightly coupled microservices\n\nMicroservices often start with HTTP-based request-response communication, which seems\nstraightforward but quickly becomes a pain as systems grow. Coupling - where one service\ndepends on another - creates a few issues. Take the _order processing service_ in a\nfictional Plant-Based Pizza company. It has to talk to the _pickup service_, _delivery\nservice_, _kitchen_, and _loyalty point_ service. They're all tied together, so if one\nfails, the whole system could go down.\n\nThe system relies on all services being up at the same time, which causes issues when any\nservice crashes. Even something like loyalty points can take the whole thing offline, making\nyou wonder if the order processing service really needs to care about that.\n\nOn top of that, there's semantic coupling - things like data formats. _\"How do you handle\nnull values in strings? What casing is your JSON using - camelCase?\"_ These details might\nseem minor, but in tightly coupled systems, they pile up, making the integrations fragile\nand complicated.\n\nWhat event-driven architecture solves\n\nEvent-driven architecture offers a way to decouple services. Instead of one service needing\nto communicate directly with another, services react to events, giving more flexibility and\nscalability. _\"Event-driven architecture is about reversing dependencies - reversing the\nlines of integration.\"_ Now, the order processing service doesn't need to know where the\ndownstream services are. It simply publishes an event, and the downstream services react to\nit.\n\nThis shift is powerful because it frees services from having to know about each other. 
In\nthis model, the kitchen doesn't wait for the order processing service to send a direct HTTP\nrequest. Instead, it listens for an event that triggers its response without direct\nintegration. _\"You've removed that runtime coupling because producers and consumers no\nlonger need to know each other exist.\"_ By decoupling systems, event-driven architecture\nimproves fault tolerance, scalability, and flexibility.\n\nThe nature of an event\n\nAt the core of event-driven systems is the concept of an event. An event is _\"an immutable\nfact. It's something that's happened in the past. It cannot be changed.\"_ When an event is\npublished, it's a record of something that has already occurred, like _\"order confirmed\"_ or\n_\"pizza boxed.\"_ Events are simple, factual, and unchangeable.\n\nThe analogy of a light switch brings this to life: _\"You hit the light switch, and that\nraises a light switched-on event. You can't un-switch on a light.\"_ To turn the light off,\nyou generate a new event - _\"light switched-off\"_ - but you don't undo the original. This\nprinciple of immutability ensures that events in the system are reliable and unambiguous,\nforming the foundation for how systems react.\n\nEvent-driven vs. event-based systems\n\nIt's easy to confuse event-driven systems with event-based systems, but the distinction is\ncrucial. Event-driven systems are driven by business-specific events - things that reflect\nreal-world actions and decisions, not just technical events like a button click. _\"An\nevent-driven system uses events like these: 'order confirmed,' 'pizza boxed,' 'staff member\nclocked in.'\"_ These are business-level events that reflect the narrative of the company,\nnot just low-level system changes.\n\nIn contrast, event-based systems simply react to any change, such as a file being added to\nS3 or a button being clicked in a UI. 
_\"We've been building event-based systems for years...\nthat doesn't make it event-driven.\"_ The difference is significant because event-driven\nsystems align technical architecture with business needs, creating a more meaningful,\ncoherent system where the events reflect the organization's core processes.\n\nFat vs. sparse events\n\nOne of the critical design decisions in event-driven systems is choosing between fat events\n(which carry a lot of data) and sparse events (which carry minimal data). Fat events, also\nknown as Event-Carried State Transfer (ECST), include all the information a consumer might\nneed. For instance, _\"the kitchen can consume this event - it's got the list of items on the\norder, so now it knows what it needs to cook.\"_ This reduces the need for callbacks or\nadditional requests for data back to the original system that publishes the event, making\nthe system more robust in terms of runtime interaction.\n\nHowever, fat events come with risks. _\"The downside of that is that you get more coupling at\nthe schema level.\"_ Because fat events contain so much information, it becomes harder to\nchange the event format without impacting multiple consumers. As more services depend on\nthat data, the risk of breaking something grows.\n\nIn contrast, sparse events are lightweight but require callbacks to get additional\ninformation. Initially, this might seem more efficient, but as more services join the\nsystem, the number of callbacks grows with every new consumer. _\"Now you've got this potentially\ninfinite number of downstream services that are all making calls back to get more\ninformation.\"_ The result is a more tightly coupled system, albeit in a different form.\n\nSo which one of these is the right one to choose? The answer is: it depends. Sparse events\nreduce the need for frequent schema changes, but fat events reduce the need for constant\nback-and-forth communication. 
Often, a combination of both is necessary, depending on the\nuse case.\n\nPublish-subscribe pattern and the role of the broker\n\nThe core of event-driven architecture lies in the publish-subscribe pattern, facilitated by\nan event broker. _\"At its core, an event-driven architecture is made up of three parts: You\nhave a producer, a consumer, and some kind of event broker in the middle.\"_ The producer\ngenerates the event, the broker routes it, and the consumer processes it. The beauty of this\nsystem is that producers and consumers don't need to know about each other's existence.\n\n_\"The first thing you'll notice is that the producer and the consumer here have no idea each\nother exists - the communication is managed by the broker.\"_ This decoupling makes the\nsystem more flexible and scalable. A consumer can be added or removed without impacting the\nproducer. The broker ensures that events are delivered, allowing the system to continue\nfunctioning smoothly even as it evolves.\n\nHowever, one responsibility remains: _\"The schema of your event - the format of that event -\nis the biggest part of the coupling that you will see in event-driven architecture.\"_ While\nruntime coupling is removed, semantic coupling still exists. Producers must ensure that the\nevent schema doesn't change in ways that break existing consumers.\n\nHandling constraints and governance\n\nIn event-driven systems, the responsibility for handling constraints shifts from the\nproducer to the consumer. Producers generate events as quickly as they can, without worrying\nabout the load on consumers. _\"As a producer, it's not your responsibility to care about how\nyour events are used... 
that's the subscriber's responsibility.\"_ Consumers must handle\ntheir own ingestion rates and ensure they don't get overloaded.\n\nGovernance plays a critical role in managing these systems, particularly as they evolve.\nWhen changes are made to event schemas, it's essential to communicate those changes to all\nconsumers. _\"Governance is really important with event-driven architecture because you've\ngot these systems that just don't care about each other.\"_ One effective method for managing\nthis is through Request for Comments (RFCs), which allow for collaborative discussion before\nany changes are implemented.\n\n_\"Rather than just publishing an event and hoping for the best, introducing governance\nensures that events remain consistent and understandable across teams.\"_ This helps prevent\nbreaking changes that could take down systems you didn't even know were relying on your\nevents.\n\nMetadata-data pattern for evolvability\n\nTo enhance the evolvability of an event-driven system, Eastham recommends using the\nmetadata-data pattern. This pattern separates the event's core data from its metadata,\nallowing for greater flexibility. _\"Splitting your event down into a metadata section and\nthe data section helps you to stay evolvable.\"_ The data contains the specifics of the\nevent, while the metadata includes information like _\"event type,\"_ _\"event ID,\"_ and\n_\"version.\"_\n\nThis separation allows consumers to understand and process events more easily while\nproviding room for schema changes. For example, _\"event versioning allows you to introduce\nbreaking changes in a controlled manner.\"_ By publishing multiple versions of an event, you\ncan ensure backward compatibility while encouraging consumers to upgrade to the latest\nschema.\n\nEventual consistency in event-driven systems\n\nOne of the trade-offs in event-driven architecture is that systems must embrace eventual\nconsistency. 
In a request-response system, actions happen immediately and are reflected in\nreal-time. But in an event-driven system, updates propagate over time. _\"Eventually, over\ntime, these systems will converge on the same view of the world.\"_ This is a shift in\nmindset for many developers used to strong consistency.\n\nTo illustrate this, consider a card payment: _\"When you make a card transaction, all you're\ndoing is making a theoretical guarantee that, at some point in the future, that money is\ngoing to move from your bank account to theirs.\"_ While the system is eventually consistent,\nthe end result will be correct, just not immediately. Event-driven architecture functions\nsimilarly - updates happen asynchronously, and systems eventually reach a consistent state.\n\nHandling HTTP communication in an event-driven world\n\nNot every system can fully adopt event-driven architecture, and many still rely on\nHTTP-based communication. To integrate these systems into an event-driven world, you need a\nmiddle layer. For example, if your loyalty point service is being replaced by a third-party\nSaaS product that only supports an HTTP API, you'd still have a service managing that\nintegration. This service listens for events and translates them into HTTP requests for\nsystems that aren't event-driven.\n\nTo handle differences in response times and reliability between HTTP-based and event-driven\nsystems, introducing a queue or intermediary storage is crucial. _\"Introducing this queue\nmeans you can keep this amount of durability... you can process to the third-party API as\nand when you need to.\"_ This queue adds resilience, allowing your system to continue\nfunctioning smoothly, even when interacting with external services that don't follow\nevent-driven principles.\n\nAsynchronous commands\n\nCommands in an event-driven system don't always need to be synchronous. 
Instead of waiting\nfor an immediate response, systems can issue commands asynchronously, allowing for greater\nflexibility and non-blocking workflows. _\"You want to send an email, but you might not\nnecessarily want it to be completely request-response.\"_\n\nAn asynchronous command might still send a request to a service, but the response isn't\nrequired to continue processing. This allows a service like the email notification service to\nhandle requests at its own pace, rather than blocking the core order processing service.\n_\"Your email service can still expose an endpoint, but as opposed to that being an\nHTTP-based endpoint, that could just be a message channel.\"_ This approach decouples the\nservices further and ensures more efficient use of resources.\n\nCQRS for separating reads and writes\n\nCommand Query Responsibility Segregation (CQRS) is a powerful pattern that pairs well with\nevent-driven architecture. CQRS separates the system into two parts: one for handling\ncommands (writes) and another for handling queries (reads). _\"In CQRS, you split your system\ninto two completely independent services - one for processing commands, one for handling\nqueries.\"_ This allows each part of the system to be optimized for its specific workload.\n\nFor example, the command service focuses on writing data to the database and publishing\nevents, while the query service listens for those events and updates a read-optimized view\nof the world. This separation enables more efficient scaling, as the query service can be\ntuned for fast reads, potentially storing data in caches like Redis or even keeping it in\nmemory.\n\nI'm a bit skeptical about CQRS since I've worked on a system with a terrible implementation\nthat went horribly wrong. But I intend to keep an open mind.\n\nHandling failure with the outbox pattern\n\nIn event-driven systems, failure is inevitable, so you need strategies to handle situations\nwhere events fail to publish. 
The outbox pattern is one such approach. _\"At the point you\nwrite the data to the main database... you also write the data to a secondary table.\"_ This\noutbox table ensures that if the event fails to publish initially, it can be retried later.\n\nThis creates consistency across the system by acting as a buffer between the database and\nthe event bus. Alternatively, systems can use change data capture to respond directly to\nchanges in the database. _\"As a record is written to the database, you can stream that - you\ncan react to that - and you can publish events off the back of that.\"_ Both methods ensure\nreliability, preventing events from being lost due to temporary failures.\n\nThe outbox pattern sounds great in theory, but in practice, if you have a large system with\nmany services publishing to the broker, managing an extra process for each service to read\nfrom the outbox table and publish to the event bus becomes a hassle. Instead, on the\npublisher side, retrying with a circuit breaker has worked better for me. Also, fun fact: I\nwas asked about the outbox pattern in 4 of the last 5 places I interviewed for a backend\nrole.",
  "title": "Notes on building event-driven systems"
}