How about a simple HLD of Google Docs... our way!


    Recently, I had a discussion with one of my friends about what the high-level design (HLD) of Google Docs might look like. It sparked a lot of interest in me, and I attempted to design a simpler version of Google Docs with whatever minimal knowledge I had. Let me try and share this with you guys.
    Let's first freeze the functional and non-functional requirements. Google Docs is a very vast space to design, so we must choose which subset of it we wish to dig into. For now, let's take the following functional requirements: 
  1. Users should be able to create, update and delete a doc. For now, a doc may contain text, images or videos.
  2. A doc should be shareable, and up to 5 people should be able to collaborate on it and edit it simultaneously.
  3. A user should be able to search for a doc among a large number of docs. 
    The non-functional requirements, for now, can be as follows:
  1. The doc should be eventually consistent. Strict consistency is not mandatory. 
  2. The system should be scalable horizontally to a larger audience.
    With these requirements in hand, let's begin our journey and come up with a first draft of the systems/components we will need. Here they are, as they come to my mind: 
  1. A service to host the APIs. We would need some basic CRUD APIs and other APIs. To begin with, I can think of the following APIs:
    1. CRUD (Create, Read, Update and Delete) APIs for each doc.
    2. Sync API: Individual users make changes locally, and the browser periodically calls this API (like a cron job) to sync those changes with the server, so they reflect for the other users after a few seconds. The system is not expected to be consistent at every moment.
    3. API to share the doc.
  2. A file storage mechanism to store and index the created docs.
    1. Based on the non-functional requirements, this system needs to scale. For that, I would opt for Amazon S3 as the storage, as it is almost infinitely scalable. It can also generate pre-signed URLs for the docs stored in it, which our share doc API can then use off the shelf. 
  3. A database to store metadata around the main doc.
    1. The metadata has no clear, fixed structure: a doc may involve text, a video, an image and other such information. We also need a DB that scales horizontally, and per the requirements we are happy to compromise on strict ACID properties. Based on these conditions, my personal choice would be a NoSQL database, probably Amazon DynamoDB.
  4. Other common things like load balancers, caches and utility services such as a URL shortener.
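The Sync API idea from the list above (edit locally, push to the server every few seconds) could be sketched roughly as follows. This is only a minimal sketch under my own assumptions: `SyncClient`, `recordLocalEdit`, `flush` and `startPeriodicSync` are hypothetical names, edits are plain strings, and the real API call is replaced by a `Consumer` callback.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical client-side buffer: collects local edits and pushes them in batches.
class SyncClient {
    private final List<String> pendingEdits = new ArrayList<>();
    private final Consumer<List<String>> syncApi; // stand-in for the real Sync API call

    SyncClient(Consumer<List<String>> syncApi) {
        this.syncApi = syncApi;
    }

    // Called on every local change; nothing is sent to the server yet.
    synchronized void recordLocalEdit(String edit) {
        pendingEdits.add(edit);
    }

    // Sends all buffered edits in one call and clears the buffer.
    // Returns the number of edits that were sent.
    synchronized int flush() {
        if (pendingEdits.isEmpty()) {
            return 0;
        }
        List<String> batch = new ArrayList<>(pendingEdits);
        pendingEdits.clear();
        syncApi.accept(batch);
        return batch.size();
    }

    // Starts the cron-like timer that calls flush() every intervalSeconds.
    ScheduledExecutorService startPeriodicSync(long intervalSeconds) {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        timer.scheduleAtFixedRate(this::flush, intervalSeconds, intervalSeconds, TimeUnit.SECONDS);
        return timer;
    }
}
```

Batching like this keeps the number of API calls independent of how fast a user types, at the cost of other users seeing changes a few seconds late, which fits the eventual consistency requirement.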
With these basic components, let me try and come up with the first draft of a high-level component diagram:

Figure 1: HLD component diagram for Google docs
    
    Now that we have our draft of an HLD component diagram, let's drill down into each component so that we can support the requirements. Before that, let me briefly explain the data flow in the diagram above. The three actors can simultaneously interact with our docs via the Docs UI, and their local changes are stored in a browser-level cache. Any API call made from the UI is directed to the load balancer, which routes the request to one of the two API servers. The API server in turn fetches the required data from the AWS account, where our data would be stored across multiple regions, with multiple replicas and proper sharding. 
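    To make the routing step concrete, here is a minimal sketch of how a load balancer might spread requests over the two API servers. Round-robin is my assumption here, not something the diagram prescribes, and `RoundRobinBalancer` and the server names are hypothetical.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical round-robin balancer; a real one would also do health checks.
class RoundRobinBalancer {
    private final List<String> servers;
    private final AtomicInteger next = new AtomicInteger(0);

    RoundRobinBalancer(List<String> servers) {
        if (servers.isEmpty()) {
            throw new IllegalArgumentException("need at least one server");
        }
        this.servers = List.copyOf(servers);
    }

    // Returns the server that should handle the next request.
    String pickServer() {
        // floorMod keeps the index valid even after the counter overflows.
        int index = Math.floorMod(next.getAndIncrement(), servers.size());
        return servers.get(index);
    }
}
```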
    Let's now see the schema of the database. I propose the following schema for the user table:
{
    username: String // Partition key (unique). Probably the user's Gmail address.
    baseDirectoryUrl: String // The S3 location where the docs of this particular person reside.
    userMetadata: JSON // This represents all the other additional info we might want to keep.
}
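    As a rough illustration, the schema above could map to a plain Java type like the one below. This is a sketch under assumptions: `UserRecord` and `UserTable` are hypothetical names, the table is an in-memory map rather than a real DynamoDB table, and `userMetadata` is simplified to flat key/value pairs instead of free-form JSON.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Mirrors the proposed user table item; field names follow the schema above.
final class UserRecord {
    final String username;                  // partition key, unique
    final String baseDirectoryUrl;          // S3 location of this user's docs
    final Map<String, String> userMetadata; // simplified stand-in for the JSON blob

    UserRecord(String username, String baseDirectoryUrl, Map<String, String> userMetadata) {
        this.username = username;
        this.baseDirectoryUrl = baseDirectoryUrl;
        this.userMetadata = Map.copyOf(userMetadata);
    }
}

// Hypothetical in-memory stand-in for the table, keyed by the partition key.
class UserTable {
    private final Map<String, UserRecord> items = new ConcurrentHashMap<>();

    // Inserts only if the username is not taken; returns the record that ends up stored.
    UserRecord putIfAbsent(UserRecord record) {
        UserRecord existing = items.putIfAbsent(record.username, record);
        return existing != null ? existing : record;
    }

    // Returns null when the user does not exist.
    UserRecord get(String username) {
        return items.get(username);
    }
}
```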

    Let's now try to come up with a directory structure for S3. I propose keeping the files in the following way:
  1. s3://<username>/ This will serve as the base directory for the user. The link to this will be stored in the table defined above for each user.
  2. Each of the user directories defined in step 1 will further contain three subdirectories, one for text, one for images and one for videos. Say s3://<username>/text, s3://<username>/images, s3://<username>/videos.
  3. All docs created by the user will go to the "text" subdirectory, and images and videos will go to their respective subdirectories. If an image/video is attached to a text doc, the doc will store a link to that image/video in place of the original, and the link will be replaced with the actual image/video only while rendering.
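    The layout above can be captured in a small helper, sketched below. `DocLayout` and the `[[media:...]]` placeholder syntax are my own hypothetical choices. The sketch follows the s3://<username>/ notation used here, although in real S3 the first component of such a URI is a bucket name, so a deployment would more likely keep one bucket and use the username as a key prefix.

```java
// Hypothetical helper that builds locations following the layout above.
class DocLayout {
    enum Kind {
        TEXT("text"),
        IMAGES("images"),
        VIDEOS("videos");

        final String dir;

        Kind(String dir) {
            this.dir = dir;
        }
    }

    // Base directory of a user, as stored in baseDirectoryUrl.
    static String baseDirectory(String username) {
        return "s3://" + username + "/";
    }

    // Full location of a file inside one of the three subdirectories.
    static String location(String username, Kind kind, String fileName) {
        return baseDirectory(username) + kind.dir + "/" + fileName;
    }

    // Placeholder stored inside the text doc instead of the raw image/video bytes.
    // The [[media:...]] syntax is an assumption made for this sketch.
    static String mediaPlaceholder(String username, Kind kind, String fileName) {
        return "[[media:" + location(username, kind, fileName) + "]]";
    }
}
```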
    Now, based on our data storage structure, let us try to come up with our APIs and maybe a basic pseudocode-level algorithm for them:
  1. createDoc(final String docName, final String userName) {
     // Takes a docName and a userName. Checks if the user exists, then goes to the user's
     // baseDirectoryUrl and creates the doc. If the user does not exist, it first creates an
     // entry in the table and then creates the doc.
    }
  2. getDoc(final String userName, final String docName) {
     // Refers to the table and gets the base URL and then fetches the doc from there.
    }
  3. syncData(final String userName, final String docName, final String docImage) {
     // Sends an image (snapshot) of the local changes to the server after a certain interval.
     // The server makes sure there are no merge conflicts and then merges the net changes
     // from the three users into the doc.
    }
  4. shareDoc(final String docName, final String userName, final JSON permissions) {
     // Creates and returns a link to the doc based on an S3 pre-signed URL and the permission
     // JSON. The link can then be shared.
    }
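    To show how createDoc, getDoc and syncData could fit together, here is a minimal in-memory sketch. It is not the actual merge algorithm: I have swapped real merging for a simple optimistic version check (a sync is accepted only if it was based on the latest version), and the extra `baseVersion` parameter, the `DocService` and `Doc` names and the maps standing in for DynamoDB and S3 are all assumptions of mine.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory stand-in for the API server; maps replace DynamoDB and S3.
class DocService {
    static final class Doc {
        final String content;
        final int version;

        Doc(String content, int version) {
            this.content = content;
            this.version = version;
        }
    }

    private final Map<String, String> userBaseDirs = new ConcurrentHashMap<>(); // user table
    private final Map<String, Doc> storage = new ConcurrentHashMap<>();         // "S3"

    private String locationOf(String userName, String docName) {
        String base = userBaseDirs.get(userName);
        if (base == null) {
            throw new IllegalArgumentException("unknown user: " + userName);
        }
        return base + "text/" + docName;
    }

    // Registers the user on first use, then creates an empty doc at version 0.
    String createDoc(String docName, String userName) {
        userBaseDirs.putIfAbsent(userName, "s3://" + userName + "/");
        String location = locationOf(userName, docName);
        if (storage.putIfAbsent(location, new Doc("", 0)) != null) {
            throw new IllegalStateException("doc already exists: " + location);
        }
        return location;
    }

    // Looks up the base directory in the user table, then fetches the doc.
    Doc getDoc(String userName, String docName) {
        Doc doc = storage.get(locationOf(userName, docName));
        if (doc == null) {
            throw new IllegalArgumentException("no such doc: " + docName);
        }
        return doc;
    }

    // Stores the snapshot only if it was based on the latest version.
    // Returns false on a conflict, so the caller can re-fetch and retry.
    synchronized boolean syncData(String userName, String docName, String snapshot, int baseVersion) {
        String location = locationOf(userName, docName);
        Doc current = getDoc(userName, docName);
        if (current.version != baseVersion) {
            return false; // someone else synced first
        }
        storage.put(location, new Doc(snapshot, baseVersion + 1));
        return true;
    }
}
```

    A rejected sync tells the client that another collaborator got there first. A fuller design would merge the two sets of changes instead of asking the client to retry.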
    With this design and these APIs, we can solve the problems we set out to tackle based on our requirements. The LLD would be even more interesting, but it looks like it would be overkill for this post! I leave it to the readers to improve on this, validate it, suggest changes and maybe come up with an LLD!

Amrit Raj
