Databases: An HLD perspective!

 


    This post is an extract from one of the courses that I am recording for HLD design. This would talk about dataset-related concepts and dedication just from the HLD perspective without diving deep into more such details. I hope this will serve as a good reference for HLD decisions!

A database is a storage system that stores data in such a way as to optimize certain types of operations on that data which include but are not limited to fast retrieval based on requirements (queries), aggregation (like basic math operations) etc. The database can choose to store data in one of the two formats which are :

1.    Structured: Storage in tabular form. So each unit of data is stored in one row of the table and hence each row of data should be "similar" in type as the column structure is mostly fixed. An example of a database which does this is MySQL. Sometimes also referred to as SQL databases as they can be queried using a SQL (Structured Query Language) command.

2.    Unstructured: Storage of data in non-structured / semi-structured format and not in tables. Data can be stored as a key-value pair, as separate documents, JSON or in other such ways. These offer flexibility for each unit of data storage. An example of such a database is MongoDB. Sometimes also referred to as No-SQL databases as they can't be queried using a SQL (Structured Query Language) command.

ACID properties of the database:

1.    A (Atomicity): Entire transaction (A read/write/update of data) that happens at once or doesn't happen at all. There is no "partial transaction".

2.    C (Consistency): The database must be consistent before and after transactions. For eg, if your opening account balance is 1000 and post deduction of 100 (a transaction), it should be 900 and not anything else. If your transaction is aborted say after we read 1000 and updated to 900 but before we could write 900, we would still have 1000 in the record. This would be inconsistent.

3.    I (Isolation): Multiple transactions occur in parallel without interference.

4.    D (Durability): The transactions are done and persisted even if the system fails.

These "ACID" principles are very rigidly followed by SQL-based databases (for obvious reasons as they store data in a very structured way and "enforce" the ACID is easy). These are not strictly followed by most No-SQL databases and they generally enforce what is called "eventually consistent" where for a few moments, the database may be in an inconsistent state, but soon it becomes consistent. Hence, for use cases where the data to be stored is "structured" and needs strict ACID properties, we should opt for SQL-based databases (eg, a banking system. Here, an inconsistent database even for a few moments may be costly!). If we need the flexibility to store any type of data with very quick access, we may opt for a No-SQL-based system.

Vertical Vs Horizontal scaling:

1.    Vertical scaling: Say today your database can handle 10GB of data. Tomorrow if we want to store 20GB, we may add more disk space to your server hosting this database. This is called vertical scaling as we are adding more resources to the same "node". It has hardware limitations and we can't go beyond a certain point and it poses a risk of SPOF.

2.    Horizontal scaling: Now, to solve the above problem, if we go ahead and add a new "node" altogether to store the new 10GB on a different server located at some other place maybe, we can call it horizontal scaling. It is preferred as it has fewer hardware limitations and is not prone to SPOF easily.

The SQL-based databases are usually horizontally scalable easily as "spreading" the data across the regions on different nodes while keeping the "structure" intact is not an easy task. The No-SQL databases are easier to scale horizontally which is a big advantage.

CAP Theorem:

A distributed system can deliver only two of three desired characteristics: consistency, availability, and partition tolerance (the ‘C,’ ‘A’ and ‘P’ in CAP. In the context of the databases, if we want consistency, we will need to go for a SQL database which will not scale very well horizontally creating an availability issue in cross regions. If we go for availability, we may choose a No-SQL database which will compromise consistency. If we go for both consistency and availability, we will need to compromise with the partition tolerance(The ability to keep working even if some nodes of the system are down). So in all cases, we must know the trade-off and choose two of these three. This is known as the CAP theorem.

Replication:

Database replication refers to the process of copying data from a primary database to one or more replica databases in order to improve data accessibility and system fault tolerance and reliability. Database replication is typically an ongoing process which occurs in real-time as data is created, updated, or deleted in the primary database but it can also occur as one-time or scheduled batch projects. It ensures security, no SPOF, and availability.

Sharding:

Database sharding splits a single dataset into partitions or shards. Each shard contains unique rows of information that you can store separately across multiple computers, called nodes. All shards run on separate nodes but share the original database’s schema or design. 

Database sharding methods apply different rules to the shard key to determine the correct node for a particular data row. The following are common sharding architectures.

Range-based sharding

Range-based sharding, or dynamic sharding, splits database rows based on a range of values. Then the database designer assigns a shard key to the respective range. For example, the database designer partitions the data according to the first alphabet in the customer's name.

Hashed sharding

Hashed sharding assigns the shard key to each row of the database by using a mathematical formula called a hash function. The hash function takes the information from the row and produces a hash value. The application uses the hash value as a shard key and stores the information in the corresponding physical shard.

Geo sharding

Geo-sharding splits and stores database information according to geographical location. For example, a dating service website uses a database to store customer information from various cities as follows. 

 

Comments