Database schema design - A skill to cherish


    Any software solution we design, in some way or the other, generally depends on a persistence solution which in most cases is a database. We do come across solutions like file-based or hybrid persistence solutions, but databases are our primary persistence solution. While we write our low-level design document, we must ensure we come up with the most suitable database and persistence design. We need to answer a few questions before we can jump into designing the database solution. Each of these questions will help us make a few design decisions. Let me put it this way:
  1. Whether to use an SQL-based database or a No-SQL based? Questions to ask:
    1. Do I know the "structure" of the data that I am going to store?
    2. Is that "structure" going to change frequently or is it mostly going to be the same?
    3. Do I need to perform a lot of "join" based queries where we fetch information from two different tables?
    4. Will I need horizontal scaling or will vertical scaling serve my purpose?
    5. Will I need "hard consistency" or will "eventual consistency" be good enough for me given I can optimize the availability?
    6. Do I need strict adherence to ACID properties?

      Based on these questions, if your answer resolves around these keywords: "structure", "joins", "complex queries", "vertical scalability", "ACID", "throughout consistent" etc, go for a SQL-based DB. Otherwise, go for the No-SQL DB. Most real-world systems are a blend of both and try to get the best of both worlds! You can read more here.
  2. Now I know if I need an SQL database or a No-SQL database. Great! Now how do I design the schema for my use case? Questions to ask:
    1. What are the potential queries that my table will cater to and what volume of such queries will come to me? Are the queries going to be "write heavy" or "read heavy"?
    2. What is the expectation from my table in terms of responses? Do we expect the table to be lightning fast when it comes to insertions and slow in the retrieval or the other way around?
    3. What all parameters are "unique" in the data that the table is going to store?
    4. How will I "connect" two pieces of data? Do I have a parameter which establishes a "connection" between two pieces of independent data?

      Based on these questions, you will get a hint of what a table would look like. Post that, there are normalization techniques (for SQL DB's) that will help you reduce the "duplicity" of data stored. For No-SQL-based DB's, you will have to handle them more by "intuition". Now, how these questions will help you design a table schema, I will cover in the second segment of this article with an example to make it concrete.
Let's take an example for both SQL and No-SQL-based databases.

Case 1:

I want to design a storage solution to keep a user's request data. The data can contain information around the user credentials and "some" metadata which might "evolve" over time. The data is sensitive and subject to country-level compliance and data from a certain region of the world needs to be stored only in "certain regions" databases. The data will be written once and read multiple times. Strict consistency is not required at all times as the request will anyways be used mostly for later dumping. 

    With these requirements, it is almost certain that we must use a No-SQL database for this use case as there is no structure and complex queries are not expected. Horizontal scaling is expected though. Now, for our use case, let's assume we use DynamoDB as our No-SQL DB. 
    Based on our second set of questions, we will be able to figure out the schema for this table. We know that the table is going to be "read" heavy, hence we might want to add more "indexes" on the table for more flexible and faster queries. This will slow down writes as more indexes mean more time to balance the data tree internally with each write but we are good with that. Now, what is "unique" about the data? Let's assume that the encrypted Adhaar number of every person is going to be unique, so we can make it the partition/primary key. This key can also connect two "people" centred tables as they will be the same for a given person across the tables. Now, as this is a No-SQL database, this much information is enough to create a table and other columns may be added on the go. We can add the indexes based on what queries will be made. Say we would need to search the DB for a contact number or date of birth etc, and we can make their indexes. So our schema would look something like this :
{
    adhaarHash: String // Partition Key
    contactNumber: Number // Global secondary index
    dob: Number/Date // Global secondary index
}
For more documentation of AWS DynamoDB and indexing, refer to this

Case 2:

I want to design a system to store the payment details of the users for a bank. We do not want to bug the customers if the payments are made and hence would want the system to be consistent at all times. We can compromise on the availability if needed a bit but our data should at all times be consistent. The data should be stored on Bank's servers only and not on other servers. We keep minimal data of our users so as to comply with government laws and hence we have strict guidelines on what data we need to store. For updating other tables of the bank, we would need to perform complex queries on this table.

    Based on the above use case, we can see a SQL-based DB would best suit us. It will provide us consistency, vertical scale, a structured approach and the ability to perform complex queries.
    Now, let's try to come up with a schema. The "unique" item here will be the customer_id. Let's make it the primary key. The "connecting" factor, apart from the customer_id to the other bank's table could be the account_number. Let's make it a foreign key. To comply with government guidelines, let's say we store only the payment_amount, and payment_status. The rest of the information can be found in other details tables existing with the bank using our primary and foreign keys. Now, what could be a potential query. Maybe we might need to see a list of all people whose payments are not done. For this, we can make the payment_status column an index. 

    This is the basic approach to choosing and designing a database schema. In the real world, problem statements are more complex and lengthier, but they can always be broken down into these fundamental questions. If we can answer them correctly, we would be able to design the schema correctly almost always. Obviously, this skill requires a lot of practice but based on these foundations, we can keep practising and ace the art of designing a database eventually. 

Amrit Raj

Comments