How to design our own key-value store


    There are a lot of modern-day use cases of a key-value pair-based store. We have many NoSQL-based DB's based on key-value pairs (like Amazon DynamoDB). We have a lot of key-value-based caches in existence (Redis cache). It would be really cool if we could design one which works at scale!
Functional Requirements
  1. It should be able to store and persist data in the form of a key and value where the key will be a string and the value would be an object (JSON etc).
  2. It should be able to fetch the value based on the key.
Non-Functional Requirments
  1. It should be highly available. Eventual consistency would be acceptable. It should be partition tolerant and the addition or reduction of servers should not affect availability. 
  2. It should be able to handle very large amounts of data.
  3. It should be scalable at will. It should have tunable parameters for adjusting the consistency and availability trade-off. 
    Based on these requirements, let us first try to clarify a few things. 
  • Storage: Let's assume we are storing the data in individual servers (let's call them nodes). Each node will consist of separate caches for temporary storage and when a certain threshold (say THRESHOLD_CACHE) is reached, it stores it in the permanent file in the system in sorted order of keys. Each node will also maintain a log file which will serve as an audit mechanism for the transactions in case something goes wrong and the node goes down.
  • Distribution of the nodes: To cater to the high availability requirement, it is almost certain that we would need to distribute the nodes across regions so there is no single point of failure. To distribute the incoming requests to various nodes, we would use consistent hashing. This would ensure that our nodes are partition tolerant and even if one node goes down or a new node is added, our service will still be available. 
  • Replication of data and syncing: To provide consistency, if we go for data replication, we would need to sync the data among all the replicated partitions. For this, let us take three parameters. 1) N = Number of replicas of the data, 2) W = the minimum number of successful acknowledgement of replication for a write operation to finally conclude it as a successful write. For example, if say we have N = 5 and W = 3, If we get a write request, we will start replicating the newly written key-value-pair to all 5 replicas. Until it is successfully written in 3 out of 5 replicas, we would not return a successful write to the client. 3) R = the minimum number of successful read acknowledgements from the replicas for considering it a successful read operation.

    Now, we can tune our consistency and availability using the parameters N, W, and R: 
    If W + R > N, it will be a highly consistent system as each read and write will need a lot of acknowledgements before it is considered successful. It has a cost that each query will then take longer and may reduce availability.
    If W + R <= N, it will be an eventually consistent system but responses will be faster as it needs fewer acknowledgements from the replicas.
    Hence, with W, R and N, we can tune our availability and consistency as per requirement.
  • Conflict of key-value pairs: In case there is a conflict between two key-value pairs generated at the same time, we may use the versioning technique based on server id and created timestamp and most recent version. After a certain time, older versions can then be cleaned up at each node.
    Now, with these components and baseline information, let us try to come up with a design for such a system.

Figure 1: A High-level design of the distributed node system to support the key-value pair system

    In the above HLD, the users can save and fetch the key-value pairs. We can use two simple APIs : 
  1. put(final String key, final String value)
  2. get(final String key)
This design, at a high level, will serve all the requirements for a key-value pair system which can scale to very large data and users distributed across the globe and is tolerant to breaks. There is always a scope for the betterment and comments are welcome!

Amrit Raj

Comments