An HLD attempt for a chat system!


    Let's attempt to design a simple chat application that resembles WhatsApp or Facebook Messenger. We must remember that these are very mature systems with many complex features, so we will scope our requirements carefully to keep this article at a reasonable complexity. So let's note down the requirements, to begin with.
Functional requirements:
  1. The system must allow one-to-one chats between two users. Group chats are out of scope for this design, but the system should be extensible to support them later.
  2. Messages should be preserved for at least 5 years, after which we may clean them up.
  3. The system should allow text messaging and later be extendable to media messages as well (images, videos, GIFs, etc.).
  4. The system should be able to detect and show online / offline users, and messages sent to offline users should be delivered once they are back online.
Non-functional requirements:
  1. The system should support a huge scale. We can expect 1 billion monthly active users, extensible to even more.
  2. The system should be highly available. We can expect greater than 99.99% availability (more on availability numbers can be read here).
  3. Expect a read-to-write ratio of 1:1.
    Based on these requirements, let's do some back-of-the-envelope calculations and come up with ballpark numbers that will help us choose components better.
  1. QPS requirements: 1 billion (1,000,000,000) users * 20 messages per user per month / (30 * 24 * 60 * 60) seconds ~ 8,000 queries per second. Peak QPS ~ 2 * average QPS ~ 16,000 QPS. This is a big number! It will require a lot of optimization and a properly distributed system.
  2. Storage requirements: Assuming a character takes 1 byte and a typical message is 100 characters, we would need the following storage to keep messages for 5 years: 1,000,000,000 * 20 * 100 * 12 * 5 ~ 120 TB (roughly 110 TiB), assuming text only. Media would need way more! This is still reasonable, and even if we include media at, say, 10X, we would need around 1 PB of storage. For our use case, let's take the text-only figure.
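The two estimates above can be reproduced with a few lines of arithmetic. The constants come straight from the assumptions stated in the article (1B monthly active users, ~20 messages of ~100 bytes per user per month, 5-year retention):

```python
# Back-of-the-envelope numbers for the chat system.
MAU = 1_000_000_000                 # monthly active users
MSGS_PER_USER_PER_MONTH = 20
SECONDS_PER_MONTH = 30 * 24 * 60 * 60

qps = MAU * MSGS_PER_USER_PER_MONTH / SECONDS_PER_MONTH
peak_qps = 2 * qps                  # assume peak is ~2x average
print(f"average QPS ~ {qps:,.0f}, peak QPS ~ {peak_qps:,.0f}")

BYTES_PER_MESSAGE = 100             # ~100 one-byte characters
MONTHS_RETAINED = 12 * 5            # 5-year retention
storage_bytes = (MAU * MSGS_PER_USER_PER_MONTH
                 * BYTES_PER_MESSAGE * MONTHS_RETAINED)
# 1.2e14 bytes = 120 TB, i.e. about 110 TiB in binary units
print(f"text storage ~ {storage_bytes / 1e12:,.0f} TB")
```

Running this gives roughly 7,700 average QPS (which we round up to ~8,000) and 120 TB of text storage.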
    Now that we have quantified our requirements with these back-of-the-envelope numbers, let's try to answer the other design questions.
  1. Connection: This is a unique part of the problem: we need to decide the communication channels involved and the kind of connections used. We cannot really use the usual HTTP request/response model. The reason is that HTTP is a client-initiated protocol, that is, the client only gets a response when it "asks" for one. In our case, the message receiver can't keep knocking on the servers to check if there are new messages! That polling would overload them. To resolve this, modern messaging apps invest heavily in persistent-connection protocols; WhatsApp, for example, uses a customized version of the open-standard Extensible Messaging and Presence Protocol (XMPP). For our use case, a WebSocket-based connection will serve the purpose. Whenever a client logs in / goes online, it sends a handshake request to the server to establish a bidirectional connection, which is persisted until a timeout occurs or the person logs out / stays offline for a specified time (say 5 minutes). This makes sure messages can travel both ways without knocking. This is an important piece of our design. You can read more about it here.
  2. Database: For complex problems like these, we usually need both a SQL and a NoSQL database, each tailored to a separate use case. For our use case, we may use MySQL to keep the usual login and user data, and a NoSQL database, say Amazon DynamoDB for now, for chats and other content.
  3. Services: We would need at least three major services: one to handle our chats, one to handle online / offline status (presence) and one to support the usual API operations. These three services should suffice for now.
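The persistent-connection idea from point 1 can be sketched as a toy, in-memory chat server: clients register a live connection when they come online, and messages addressed to offline users are queued and flushed on reconnect (which also covers functional requirement 4). All names here are illustrative; a real implementation would sit behind an actual WebSocket library rather than plain callbacks.

```python
from collections import defaultdict

class ChatServer:
    """Toy sketch of persistent connections with offline delivery."""

    def __init__(self):
        self.connections = {}                    # user_id -> delivery callback (the "socket")
        self.offline_queue = defaultdict(list)   # user_id -> messages pending delivery

    def connect(self, user_id, deliver):
        """Register a live connection; flush messages queued while offline."""
        self.connections[user_id] = deliver
        for msg in self.offline_queue.pop(user_id, []):
            deliver(msg)

    def disconnect(self, user_id):
        """Drop the connection on logout / timeout."""
        self.connections.pop(user_id, None)

    def send(self, to_user, message):
        """Push over the open connection if online, otherwise queue."""
        if to_user in self.connections:
            self.connections[to_user](message)
        else:
            self.offline_queue[to_user].append(message)
```

Usage: `server.send("bob", "hi")` while bob is offline queues the message; a later `server.connect("bob", inbox.append)` delivers it immediately.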
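To make the SQL/NoSQL split from point 2 concrete, a chat message in the NoSQL store could be modeled as an item keyed by conversation. This is a hypothetical item shape, not DynamoDB's actual API; the field names are purely illustrative:

```python
# Hypothetical item shape for a DynamoDB-style messages table. The
# partition key groups a 1:1 conversation; the sort key (a sortable,
# unique message id) lets us page through a chat in send order.
message_item = {
    "chat_id": "user_42#user_97",    # partition key: the conversation
    "message_id": 7236481927012352,  # sort key: sortable unique id
    "sender_id": "user_42",
    "body": "Hey, are we still on for tonight?",
    "sent_at": "2024-05-01T18:30:00Z",
}

def page_key(item):
    """Composite key used to read a conversation's messages in order."""
    return (item["chat_id"], item["message_id"])
```

Login and profile data, by contrast, stay in relational MySQL tables where joins and transactions are cheap.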
    Based on these, let's come up with the first draft of our system. It will look something like this: 

Figure 1: HLD of the chat application system. 

    The above Figure 1 explains my thought process at a high level. There are multiple considerations here. Let's discuss them one by one:
  1. Message-id generator: With storage distributed across tables and queues, we need ids to be generated, and it is not easy to generate sorted, unique ids in a distributed system (we will need them to order and sync messages when we do the LLD). Hence we need a separate service for this (maybe even a 3rd-party one).
  2. Status check for online / offline: This is done via the presence service. It sends a "heartbeat" to the client and checks whether the client responds before a timeout. If it responds, the service stores a key-value pair in the cache, with the userId / sessionId as the key and the time of the last positive heartbeat response as the value. This can then be used to check the online / offline status.
  3. Scale: We would need to deploy the whole system in a distributed fashion, following best practices like consistent hashing, error handling and failure-retry mechanisms, with metrics/alarms/dashboards in place to monitor things. We would also need logging and audit mechanisms.
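The sortable-id problem from point 1 is commonly solved with a Snowflake-style generator: pack a millisecond timestamp, a machine id and a per-millisecond sequence into one integer, so ids sort by creation time without coordination. A minimal sketch, with illustrative bit widths (real systems such as Twitter's Snowflake tune these and add a custom epoch):

```python
import time

class IdGenerator:
    """Snowflake-style ids: 41-bit timestamp | 10-bit machine | 12-bit sequence."""

    def __init__(self, machine_id):
        assert 0 <= machine_id < 1024   # 10 bits for the machine id
        self.machine_id = machine_id
        self.sequence = 0
        self.last_ms = -1

    def next_id(self):
        now_ms = int(time.time() * 1000)
        if now_ms == self.last_ms:
            # Same millisecond: bump the 12-bit sequence.
            self.sequence = (self.sequence + 1) % 4096
            if self.sequence == 0:
                # Sequence exhausted: spin until the next millisecond.
                while now_ms <= self.last_ms:
                    now_ms = int(time.time() * 1000)
        else:
            self.sequence = 0
        self.last_ms = now_ms
        return (now_ms << 22) | (self.machine_id << 12) | self.sequence
```

Ids from one generator are strictly increasing, and distinct machine ids keep generators on different hosts from colliding.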
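The presence check from point 2 boils down to a cache of last-heartbeat timestamps compared against a timeout. A minimal sketch, assuming an illustrative 30-second timeout (the class and method names are mine, not a real library's):

```python
import time

class PresenceService:
    """Tracks online/offline status from heartbeat timestamps."""

    def __init__(self, timeout_seconds=30):
        self.timeout = timeout_seconds
        self.last_heartbeat = {}   # userId -> last positive heartbeat time

    def heartbeat(self, user_id, now=None):
        """Record a positive heartbeat response (the cache write)."""
        self.last_heartbeat[user_id] = time.time() if now is None else now

    def is_online(self, user_id, now=None):
        """Online iff the last heartbeat is newer than the timeout."""
        now = time.time() if now is None else now
        last = self.last_heartbeat.get(user_id)
        return last is not None and (now - last) < self.timeout
```

In production the dictionary would be a shared cache (e.g. Redis) so any chat server can answer the status query; keys could expire automatically with a TTL instead of the explicit comparison.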
    In totality, this is what a typical chat application looks like at a very high level. The implementation, the LLD and other low-level design components like table schemas and connection protocols are complex affairs, and we are keeping them out of the scope of this doc for now. To understand the best practices and principles used here, and for a better understanding of many other concepts, please refer to my other posts on this blog. As I always say, no design is perfect (and I am anyway just learning :). I am open to comments, feedback and more details. Let's keep learning! You can read a bit more here.

Amrit Raj
