Recently, I came across a problem statement that revolved around inter-service communication: we want a sort of "workflow" of actions in which each service performs its task once it knows that its prerequisite service has completed its own. Sounds confusing! Let me break this down with an example. Let's think of a shopping experience on an e-commerce website. What steps do you take, at a high level, as a user to complete a "shopping transaction"? Let's list them:
- Search for an item.
- Select an item and add it to the cart.
- Checkout and move to payments.
- Make payments and receive a bill.
These four seemingly simple "sub-transactions" are actually quite complex and can involve a lot of back-end engineering, ranging from updating various databases and caches for each transaction to fetching information from multiple services (for example, you should not be redirected to payments if, by the time you add an item to the cart, it has sold out of stock). Beyond these, one of the major complications is the "sequence" in which these back-end calls are made as the transactions complete. We might want some calls to be asynchronous and some synchronous. One way to deal with this is to have a central service that every call goes through; it in turn calls the dependent services and gathers the information. This central service acts as a controller and does all the "call management" for us. But it has serious problems: it is a single point of failure (SPOF), and it is not fast enough (it will be queue-based, after all).
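To make the central-controller approach concrete, here is a minimal sketch of it. All class names, the payment URL, and the method signatures are hypothetical, invented purely for illustration:

```python
class StockService:
    """Toy stock service: tracks item counts and reserves items."""
    def __init__(self, counts):
        self.counts = counts

    def reserve(self, item):
        # Reserve one unit if available.
        if self.counts.get(item, 0) > 0:
            self.counts[item] -= 1
            return True
        return False


class PaymentsService:
    """Toy payments service: hands back a payment link."""
    def create_link(self, amount):
        return f"https://pay.example/{amount}"


class ControllerService:
    """The central controller: every step flows through this one
    service, which is exactly why it becomes a SPOF and a bottleneck."""
    def __init__(self, stock, payments):
        self.stock = stock
        self.payments = payments

    def checkout(self, item, amount):
        # The controller sequences every dependent call itself.
        if not self.stock.reserve(item):
            return None  # sold out: never reach payments
        return self.payments.create_link(amount)


controller = ControllerService(StockService({"book": 1}), PaymentsService())
print(controller.checkout("book", 20))  # -> https://pay.example/20
print(controller.checkout("book", 20))  # -> None (sold out)
```

If the controller goes down, the whole workflow goes down with it, and every new step in the workflow means editing this one service.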
To tackle this issue, engineers came up with event-based architecture. This architecture treats each "sub-transaction" as an "event", and any event that must be followed by an async call or a back-end task is announced to the relevant service through a notification. The concerned service queues the received notification and processes it in parallel. This ensures there is no SPOF, and no system is "blocked" by another unless it "should" be blocked because the two genuinely need to run one after the other. It also follows the design best practice of decoupling related services: each one can be modified separately without concerning the others. So we can implement a "workflow" without deeply coupling the "steps" of the workflow. Let us understand this in the context of our shopping example:
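The heart of this idea is publish/subscribe: a publisher emits an event without knowing who is listening. A minimal in-process sketch follows; real systems would use a broker (Kafka, RabbitMQ, etc.), and the event names here are just placeholders:

```python
from collections import defaultdict


class EventBus:
    """Minimal in-process pub/sub bus (illustration only; a real
    deployment would put a message broker between the services)."""
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, event_name, handler):
        self.subscribers[event_name].append(handler)

    def publish(self, event_name, payload):
        # Each interested service receives the event independently;
        # the publisher does not know or care who is listening.
        for handler in self.subscribers[event_name]:
            handler(payload)


bus = EventBus()
bus.subscribe("item_searched", lambda term: print("cache updated for", term))
bus.subscribe("item_searched", lambda term: print("analytics logged", term))
bus.publish("item_searched", "laptop")
```

Note that adding a third listener (say, a recommendations service) requires no change to the publisher at all; that is the decoupling the text describes.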
- Search for an item: As soon as you search for a term, an "item searched" event is generated. The cache services listen to it and update the cache; the analytics service also listens to it and updates the database/data lake for analytics purposes.
- Add the item to the cart: This generates an event that is received by the stock service, which checks whether the item is still in stock and blocks it until the payment is made. In parallel, the event is also received by the payments service, which generates a payment link. Once both of these services complete their tasks, they emit their own events, which are received by the cart service; only when it has both events does it start the payment process.
- Once the payment is made: This generates a "payment succeeded" event that is received by the billing service and the stock service, which respectively generate a bill, and initiate packing of the blocked item and reduce the item count.
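The "add to cart" step above contains a subtle join: the cart service must wait for both the stock event and the payments event before proceeding. A small sketch of that join (event and class names are hypothetical):

```python
class CartService:
    """Starts payment only after BOTH prerequisite events arrive,
    in whatever order the stock and payments services emit them."""
    REQUIRED = {"stock_blocked", "payment_link_ready"}

    def __init__(self):
        self.received = set()
        self.payment_started = False

    def on_event(self, name):
        self.received.add(name)
        # Proceed only once every required event has been seen.
        if self.REQUIRED <= self.received:
            self.payment_started = True


cart = CartService()
cart.on_event("stock_blocked")
assert not cart.payment_started   # still waiting on the payments service
cart.on_event("payment_link_ready")
assert cart.payment_started       # both events in: start the payment
```

This is why no service blocks another unnecessarily: the cart service waits exactly where the workflow demands it, and nowhere else.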
This way, all these services can communicate without being tightly bound to a common interface, which reduces coupling while still letting the system behave as a single large piece of software. This looks like a better approach to solving our problem. Let's see the whole flow via a sample HLD:
Figure 1: A very high-level design of how a potential event-based architecture may work to support an e-commerce transaction
As we can see from the above HLD, a seemingly simple "e-commerce transaction" is actually very complicated. The diagram itself is heavily simplified and makes many assumptions; a real system is more complicated still, because everything works at scale. Now we can imagine how complicated it would have been if these services were not loosely coupled and had no proper mechanism for communicating with each other. A slight variant and modern version of this pattern is called the "saga" pattern (for obvious reasons!).
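One thing the saga pattern adds on top of the plain event flow is compensation: if a later step of the workflow fails, each already-completed step is undone by a matching compensating action. A minimal sketch, with invented step names, of that idea:

```python
def run_saga(steps):
    """Run (action, compensation) pairs in order. If any action
    fails, undo the completed steps in reverse order."""
    done = []
    try:
        for action, compensate in steps:
            action()
            done.append(compensate)
    except Exception:
        for compensate in reversed(done):
            compensate()
        return False
    return True


log = []

def fail_payment():
    raise RuntimeError("payment failed")

steps = [
    (lambda: log.append("stock blocked"), lambda: log.append("stock released")),
    (fail_payment, lambda: log.append("payment refunded")),
]

ok = run_saga(steps)
# ok is False, and log == ["stock blocked", "stock released"]:
# the failed payment step triggered compensation of the stock step.
```

In a distributed saga the "steps" would be events exchanged between services rather than local function calls, but the undo-in-reverse idea is the same.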
Despite all this "good stuff", event-driven architecture suffers from the following drawbacks:
- Heavy engineering: It slightly disagrees with the KISS principle (Keep It Simple!). It involves a lot of components and is prone to over-engineering if not designed well. As you can see in the HLD, it needs many notifiers and queues that must stay in sync.
- Queue management: Queues must be managed carefully (ordering, retries, duplicate messages) to avoid inconsistent behaviour.
- Error handling: When something fails, it is hard to trace because execution is not a single linear flow. And the more components there are, the more opportunities for failure.
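One common mitigation for the queue-management and error-handling drawbacks is to retry failed messages a bounded number of times and quarantine the rest in a dead-letter queue for inspection. A minimal sketch of that policy (not any specific broker's API; names are made up):

```python
def process_with_retries(messages, handler, max_attempts=3):
    """Process each message, retrying up to max_attempts times.
    Messages that keep failing are returned as a dead-letter list
    instead of being silently lost or blocking the queue."""
    dead_letter = []
    for message in messages:
        for attempt in range(1, max_attempts + 1):
            try:
                handler(message)
                break
            except Exception:
                if attempt == max_attempts:
                    dead_letter.append(message)
    return dead_letter


# A handler that always fails on one "poison" message:
def handler(msg):
    if msg == "bad":
        raise ValueError("cannot process")

dead = process_with_retries(["ok", "bad", "ok"], handler)
# dead == ["bad"]: the poison message is quarantined, the rest succeed.
```

Real brokers (e.g. RabbitMQ dead-letter exchanges, Kafka retry topics) implement the same idea at the infrastructure level.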
Despite these drawbacks, this architecture is heavily used in modern systems, as most of its weaknesses can be handled with good engineering. I hope this doc made my points on the topic clear. I am open to feedback and improvements.
Amrit Raj