How about adding a search layer to your design? Elasticsearch is at your service!


    Many modern software solutions need a dedicated search capability to serve their users well. An e-commerce system needs it so that users can find the product they want efficiently, and a metrics dashboard needs it so that we can index and search the emitted data to build the dashboard. Searching the database directly might not be the best solution for such use cases. Imagine having to type the exact product name to search in an e-commerce web app. We usually search with a string close to the actual name of the product, rarely the exact string. We rely on the search system to "improvise" and be "smart" enough to "guess" what we mean. This is where the need for a dedicated search solution came up, and Elasticsearch is one such powerful solution. Let's discuss the search use case and how we can integrate it into our design using one of the most popular search solutions, Elasticsearch.
     You can read the whole documentation around Elasticsearch here. For now, let me bring out the details of ES (Elasticsearch) that are relevant to us:
  1. It is a distributed search and analytics engine for all types of data. 
  2. It is built on Apache Lucene using Java and interacts using simple REST APIs.
  3. It is highly scalable and very fast. 
  4. When combined with Kibana (a tool to visualize and analyze the data in ES), it becomes a great analytics tool as well.
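    To make the "simple REST APIs" point concrete, here is a minimal sketch of what talking to ES looks like. It only builds the HTTP requests and does not send them. The base URL (a local node on the default port 9200), the `products` index and the sample document are assumptions made up for illustration; the `_doc` and `_search` endpoints and the `match` query are part of the ES REST API.

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:9200"  # assumption: a local ES node on its default port


def index_request(index, doc_id, document):
    """Build (without sending) a request that stores one JSON document."""
    return Request(
        url=f"{BASE_URL}/{index}/_doc/{doc_id}",
        data=json.dumps(document).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def search_request(index, field, text):
    """Build (without sending) a full-text search on a single field."""
    body = {"query": {"match": {field: text}}}
    return Request(
        url=f"{BASE_URL}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical index name and document, for illustration only.
put = index_request("products", 1, {"name": "running shoes", "price": 49.99})
query = search_request("products", "name", "shoes")

print(put.get_method(), put.full_url)
print(query.get_method(), query.full_url, query.data.decode("utf-8"))
```

    Passing either request to `urllib.request.urlopen` would actually send it, provided a node is reachable at that address.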
    Now, let's try to understand how a typical search engine works. Every search engine has a different algorithm, but let us take ES as the example and see how it handles the use case; most other engines work along similar lines. Let us first understand some concepts:
  1. Documents: Each JSON-based "record" stored in ES is called a document. ES stores and operates on JSON records (which resembles NoSQL document databases a bit). An example of a document could be a student record, say:
    {"name": "foo", "percentage": {"Math": 90, "English": 8}, "class": 10, "Father": "bar"}
  2. Indices: A collection of "logically related" documents is called an index. It can be visualized as a table holding each record (or document). This analogy is only an aid to understanding; no actual tables are involved.
  3. Inverted index: It is a hashmap-like structure where keys are "tokens" and values point to the documents that contain them. For example, take the student document above. If we "tokenize" it, we get a list of strings containing each word, something like this: ["name", "foo", "percentage", "Math", "English", "bar", .... ]. Now, when we build a map with each word as a key and, as its value, pointers to the documents that contain that word, we get an inverted index.
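    The inverted index idea can be sketched in a few lines of Python. This is only a toy version that follows the simplification used in this post (every key and value becomes a lowercase token); real ES analyzes field values and keeps its index per field. The second student record is made up so that a token can point to more than one document.

```python
import re
from collections import defaultdict


def tokenize(value):
    """Yield lowercase word tokens from every key and value of a nested record."""
    if isinstance(value, dict):
        for key, inner in value.items():
            yield from tokenize(key)
            yield from tokenize(inner)
    elif isinstance(value, (list, tuple)):
        for item in value:
            yield from tokenize(item)
    else:
        yield from re.findall(r"\w+", str(value).lower())


def build_inverted_index(documents):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, doc in documents.items():
        for token in tokenize(doc):
            index[token].add(doc_id)
    return index


documents = {
    1: {"name": "foo", "percentage": {"Math": 90, "English": 8}, "class": 10, "Father": "bar"},
    # Hypothetical second record, added only for illustration.
    2: {"name": "baz", "percentage": {"Math": 75, "English": 60}, "class": 9, "Father": "bar"},
}

index = build_inverted_index(documents)
print(sorted(index["bar"]))  # ids of documents containing "bar"
print(sorted(index["foo"]))  # ids of documents containing "foo"
```

    A lookup is then a single dictionary access instead of a scan over every document, which is where the speed comes from.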
    Based on this index, we may compute a "score" for each matching document based on how many times the searched word occurs in it (ES's default scoring, BM25, also accounts for how rare the word is across all documents and for document length). This score is the core idea behind search results and recommendations. When we search for a keyword, ES returns the documents which contain that word in decreasing order of this "score".
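    Here is a rough sketch of score-based ranking, assuming the simplest possible score: the number of times the searched word occurs in a document. This is not ES's actual relevance formula, which is more involved; the three sample documents are made up for illustration.

```python
import re
from collections import Counter


def term_counts(text):
    """Count how often each lowercase word occurs in the text."""
    return Counter(re.findall(r"\w+", text.lower()))


def search(term, documents):
    """Return (doc_id, score) pairs for matching documents, best score first."""
    term = term.lower()
    scored = []
    for doc_id, text in documents.items():
        count = term_counts(text)[term]
        if count > 0:
            scored.append((doc_id, count))
    # Sort by descending score; ties are broken by document id for stable output.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical product descriptions.
documents = {
    "a": "red shoes and red laces",
    "b": "blue shoes",
    "c": "red hat",
}

print(search("red", documents))  # [('a', 2), ('c', 1)]
```

    Document "a" ranks first because "red" occurs twice in it, and document "b" is not returned at all because it never contains the word.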
    There are other concepts around ES as well, like node (a single running instance of ES), cluster (a group of nodes), shards (partitions of an index, so that different documents are stored on different nodes based on some routing logic and the data and load are spread out), replicas (copies of shards so that if the original is lost, we still have a copy) etc. We also have a whole mechanism of how ES saves documents to disk (it is an expensive operation and hence is done only once in a while; until then the changes are maintained in a sort of log file) and how documents are extracted/updated (stored documents are immutable, so every update creates a new copy with a newer version and marks the old version as deleted, since removing it in place would require index updates) using REST APIs (POST, GET, PUT etc. calls with JSON payloads).
    For now, all we need to know is that once we have stored our data in ES as documents and indexed them, we can query them using JSON-based API calls. With those we can perform complex queries such as lookups, grouping, arithmetic operations, fuzzy search (approximate matching, like searching even when spellings are not correct; this is done using this algorithm) etc. With these building blocks and ways to access the documents, we can easily build our search tool on top of ES.
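    To give a feel for fuzzy search, here is a small sketch that matches a misspelled word against known tokens. It assumes the matching is based on edit distance (Levenshtein distance), which is how the ES documentation describes its fuzzy query; the token list and the `name` field are made up for illustration. The last few lines show what the corresponding JSON query body could look like.

```python
import json


def edit_distance(a, b):
    """Levenshtein distance: fewest single-character inserts, deletes or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def fuzzy_lookup(term, tokens, max_edits=1):
    """Return the tokens that are within max_edits edits of the search term."""
    return sorted(t for t in tokens if edit_distance(term, t) <= max_edits)


# Hypothetical tokens taken from an index.
tokens = ["laptop", "lamp", "tablet", "desktop"]
print(fuzzy_lookup("laptap", tokens))  # ['laptop']

# A fuzzy query body for the same search; "name" is a hypothetical field.
fuzzy_query = {"query": {"fuzzy": {"name": {"value": "laptap", "fuzziness": 1}}}}
print(json.dumps(fuzzy_query))
```

    A larger `max_edits` tolerates worse typos but also lets more unrelated words through, so it is worth keeping small.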
    If we include Kibana on top of this, we can build real-time dashboards which get their data and metrics from ES and run those complex arithmetic operations to produce analytics, graphs, metrics, data points etc., which may be critical to analyse performance and usage and to make business and tech decisions. You can read more about Kibana here. Let's now finally see how we can make a simple design to include these.

Figure 1: Very high-level diagram of a potential search bar design using Elasticsearch and associated dashboards using Kibana

    The above diagram is highly simplified. An actual production system would include many other pieces such as load balancers, shards, replicas and clusters, and even the data entry and access mechanisms would be far more detailed. Still, the diagram gives readers a bird's-eye view of how things work. There are also costs associated with storing data in ES, and they need to be evaluated before blindly dumping everything into it. Besides, things like the number of primary shards and the index mappings need to be defined well in advance, as changing them later generally means reindexing the data (the replica count, by contrast, can be adjusted on a live index). This is an interesting topic to explore and these videos will help the readers a lot. I leave the rest to my curious readers. Please share your views, improvements and suggestions in the comments!

Amrit Raj
