This post is an extract from one of the courses that I am recording for HLD design. This would talk about dataset-related concepts and dedication just from the HLD perspective without diving deep into more such details. I hope this will serve as a good reference for HLD decisions!
A database is a storage system that stores data in such a way as to optimize certain types of operations on that data which include but are not limited to fast retrieval based on requirements (queries), aggregation (like basic math operations) etc. The database can choose to store data in one of the two formats which are :
1.
Structured: Storage in
tabular form. So each unit of data is stored in one row of the table and hence
each row of data should be "similar" in type as the column structure
is mostly fixed. An example of a database which does this is MySQL. Sometimes
also referred to as SQL databases as they can be queried using a SQL
(Structured Query Language) command.
2.
Unstructured: Storage
of data in non-structured / semi-structured format and not in tables. Data can
be stored as a key-value pair, as separate documents, JSON or in other such
ways. These offer flexibility for each unit of data storage. An example of such
a database is MongoDB. Sometimes also referred to as No-SQL databases as they
can't be queried using a SQL (Structured Query Language) command.
ACID properties of the
database:
1.
A (Atomicity): Entire
transaction (A read/write/update of data) that happens at once or doesn't happen at
all. There is no "partial transaction".
2.
C (Consistency): The
database must be consistent before and after transactions. For eg, if your
opening account balance is 1000 and post deduction of 100 (a transaction), it
should be 900 and not anything else. If your transaction is aborted say after
we read 1000 and updated to 900 but before we could write 900, we would still
have 1000 in the record. This would be inconsistent.
3.
I (Isolation):
Multiple transactions occur in parallel without interference.
4.
D (Durability): The
transactions are done and persisted even if the system fails.
These "ACID"
principles are very rigidly followed by SQL-based databases (for obvious
reasons as they store data in a very structured way and "enforce" the
ACID is easy). These are not strictly followed by most No-SQL databases and
they generally enforce what is called "eventually consistent" where
for a few moments, the database may be in an inconsistent state, but soon it
becomes consistent. Hence, for use cases where the data to be stored is
"structured" and needs strict ACID properties, we should opt for
SQL-based databases (eg, a banking system. Here, an inconsistent database even
for a few moments may be costly!). If we need the flexibility to store any type
of data with very quick access, we may opt for a No-SQL-based system.
Vertical Vs Horizontal
scaling:
1.
Vertical scaling: Say
today your database can handle 10GB of data. Tomorrow if we want to store 20GB,
we may add more disk space to your server hosting this database. This is called
vertical scaling as we are adding more resources to the same "node".
It has hardware limitations and we can't go beyond a certain point and it poses
a risk of SPOF.
2.
Horizontal scaling:
Now, to solve the above problem, if we go ahead and add a new "node"
altogether to store the new 10GB on a different server located at some other
place maybe, we can call it horizontal scaling. It is preferred as it has fewer
hardware limitations and is not prone to SPOF easily.
The SQL-based
databases are usually horizontally scalable easily as "spreading" the
data across the regions on different nodes while keeping the
"structure" intact is not an easy task. The No-SQL databases are
easier to scale horizontally which is a big advantage.
CAP Theorem:
A distributed system
can deliver only two of three desired characteristics: consistency,
availability, and partition tolerance (the ‘C,’ ‘A’ and ‘P’ in CAP.
In the context of the databases, if we want consistency, we will need to go for
a SQL database which will not scale very well horizontally creating an
availability issue in cross regions. If we go for availability, we may choose a
No-SQL database which will compromise consistency. If we go for both
consistency and availability, we will need to compromise with the partition
tolerance(The ability to keep working even if some nodes of the system are
down). So in all cases, we must know the trade-off and choose two of these
three. This is known as the CAP theorem.
Replication:
Database replication
refers to the process of copying data from a primary database to one or more
replica databases in order to improve data accessibility and system fault
tolerance and reliability. Database replication is typically an ongoing process
which occurs in real-time as data is created, updated, or deleted in the
primary database but it can also occur as one-time or scheduled batch projects.
It ensures security, no SPOF, and availability.
Sharding:
Database sharding
splits a single dataset into partitions or shards. Each shard contains unique
rows of information that you can store separately across multiple computers,
called nodes. All shards run on separate nodes but share the original
database’s schema or design.
Database sharding
methods apply different rules to the shard key to determine the correct node
for a particular data row. The following are common sharding architectures.
Range-based sharding
Range-based sharding,
or dynamic sharding, splits database rows based on a range of values. Then the
database designer assigns a shard key to the respective range. For example, the
database designer partitions the data according to the first alphabet in the
customer's name.
Hashed sharding
Hashed sharding
assigns the shard key to each row of the database by using a mathematical
formula called a hash function. The hash function takes the information from
the row and produces a hash value. The application uses the hash value as a
shard key and stores the information in the corresponding physical shard.
Geo sharding
Geo-sharding splits and stores database information according to
geographical location. For example, a dating service website uses a database to
store customer information from various cities as follows.
Comments