Before aiming for a solution to a problem, we must understand the problem. And before understanding the problem itself, we must understand the subtle details of the technical terms and considerations involved. Let us first dive deep into these.
Storage and retrieval issues
Law of diminishing utility of an event: An event in software loses its relevance and utility with time. For example, suppose I receive an event on my app that someone ordered a burger! If I process that event within a second, that is amazing! If I process it within a minute, it is so-so. If I take an hour to process it, it is irrelevant, as the person might no longer be hungry or might have ordered from somewhere else. This idea can be represented with the following graph:
Figure 1: Graph representing the value of an event over time
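To make the decay concrete, here is a minimal Python sketch of my own (not from any library) that models the value of an event falling off exponentially with its age; the one-minute half-life is an arbitrary assumption for illustration:

```python
import math

def event_value(age_seconds: float, half_life_seconds: float = 60.0) -> float:
    """Relative value of an event as it ages, modeled as exponential decay."""
    return 0.5 ** (age_seconds / half_life_seconds)

# The burger-order event from above, processed at different delays:
print(event_value(1))      # ~0.99 -> processed in a second: amazing
print(event_value(60))     # 0.5   -> processed in a minute: so-so
print(event_value(3600))   # ~0.0  -> processed in an hour: irrelevant
```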
The data on the left side of the graph in Figure 1 is called hot data, and the data on the right side is called cold data. We employ very different storage techniques for the two types. To store very hot data, we usually employ techniques that are tightly coupled with the consuming system, for example, storing all the data on a nearby local disk to avoid network calls and provide low-latency retrieval. But this comes at a great cost and with hardware limitations. This can be represented as follows:
Figure 2: Latency and cost of tightly coupled storage solutions
The solution used for cold data, which is accessed less frequently and by fewer concurrent users, is generally decoupled and cloud-based. This is because network calls are no longer much of an issue, and cost becomes the more important factor. This can be represented by a figure like this:
Figure 3: Latency and cost of loosely coupled cloud storage
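As an illustration of such decoupled cloud storage, here is a hedged Python sketch that parks a batch of cold events in Amazon S3 using boto3; the bucket name and key layout are hypothetical, and credentials are assumed to be configured:

```python
import json
import boto3

s3 = boto3.client("s3")  # assumes AWS credentials are already configured

# Write a batch of old (cold) events as one object: a network round trip,
# but cheap and horizontally scalable.
cold_events = [{"order_id": 42, "item": "burger", "ts": 1700000000}]
s3.put_object(
    Bucket="my-cold-event-archive",            # hypothetical bucket
    Key="events/2023/11/14/batch-0001.json",   # hypothetical key layout
    Body=json.dumps(cold_events).encode(),
)

# Reading it back costs a network call, which is acceptable for cold data.
obj = s3.get_object(Bucket="my-cold-event-archive",
                    Key="events/2023/11/14/batch-0001.json")
events = json.loads(obj["Body"].read())
```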
In the industry, we generally use both, and we have to set up an architecture that migrates data from hot storage to cold storage based on a purge policy. This is a very tiresome process and requires some serious engineering. This is where Apache Pinot (the first component of our trio!) comes in, giving us the sweet spot of both worlds and making our chart look something like this:
Figure 4: The cost and latency analysis with hybrid methods like Apache Pinot
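To give a feel for how Pinot covers both sides, below is a hedged sketch of a realtime table config (heavily trimmed; the field names follow Pinot's documented table-config format, but exact keys can vary by version, and the table, topic, and broker values are assumptions). Fresh events are served from in-memory consuming segments (hot), while completed segments persist and eventually age out under the retention policy (cold). It would be registered via the Pinot controller's REST API:

```python
import requests  # pip install requests

# Trimmed Pinot REALTIME table config; a matching schema named "orders"
# is assumed to have been created already.
table_config = {
    "tableName": "orders",
    "tableType": "REALTIME",
    "segmentsConfig": {
        "timeColumnName": "ts",
        "schemaName": "orders",
        "replication": "1",
        "retentionTimeUnit": "DAYS",
        "retentionTimeValue": "30",   # segments older than this are purged
    },
    "tableIndexConfig": {
        "loadMode": "MMAP",
        "streamConfigs": {
            "streamType": "kafka",
            "stream.kafka.topic.name": "orders",
            "stream.kafka.broker.list": "localhost:9092",
            "stream.kafka.consumer.type": "lowlevel",
            "stream.kafka.consumer.factory.class.name":
                "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
            "stream.kafka.decoder.class.name":
                "org.apache.pinot.plugin.inputformat.json.JSONMessageDecoder",
        },
    },
    "tenants": {},
    "metadata": {},
}

# Register the table with the Pinot controller (default port 9000).
resp = requests.post("http://localhost:9000/tables", json=table_config)
print(resp.status_code, resp.text)
```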
Real-time ingestion issue
The data that we receive in the form of events needs to reach our processors and our data stores in real time. Many architectures have been built to solve problems like this, such as pub-sub and event-driven architectures. Most of these architectures depend on a queue or a streaming service/component. Apache offers Kafka for this use case, which can ingest data from almost every source to almost any sink (with a few tweaks in some cases). It is a low-cost, high-throughput, low-latency distributed solution to this problem. This is the second piece of our trio!
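As a minimal sketch of that ingestion path, here is a producer using the kafka-python client that publishes the burger-order event from the earlier example; the topic name and event shape are my own assumptions:

```python
import json
from kafka import KafkaProducer  # pip install kafka-python

# Serialize event dicts to JSON bytes on the way out.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Publish an order event; downstream consumers (e.g., Pinot's realtime
# ingestion) can pick it up within moments of the broker receiving it.
producer.send("orders", value={"order_id": 42, "item": "burger", "ts": 1700000000})
producer.flush()  # block until the event is actually handed to the broker
```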
Data visualization
Data visualization is an expensive and time-consuming process. Hiring a front-end engineer and designing for a particular backend or database not only couples the front end to the back end but also increases cost. A better solution offered by Apache is Superset (the third component of our trio!). This is an open-source, low-latency solution, decoupled from any particular backend or database, that connects to almost any relational database off the shelf and lets you design dashboards and charts via drag and drop! It provides easy deployment, role-based access restrictions, an inbuilt cache, and cron-based email reporting. On top of this, you can always clone its code base and tweak it into a more custom solution!
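For reference, Superset talks to Pinot through a SQLAlchemy URI backed by the pinotdb driver. Here is a hedged sketch of sanity-checking that same connection directly from Python; the host, port, and table name reuse the hypothetical values above:

```python
from pinotdb import connect  # pip install pinotdb

# Superset would be configured with a SQLAlchemy URI of roughly this shape:
#   pinot://localhost:8099/query/sql?controller=http://localhost:9000/
# The same driver can be exercised directly to verify connectivity:
conn = connect(host="localhost", port=8099, path="/query/sql", scheme="http")
cur = conn.cursor()
cur.execute("SELECT item, COUNT(*) FROM orders GROUP BY item LIMIT 10")
for row in cur:
    print(row)
```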
Now, let's get back to our core problem statement and how this trio solves it. The exact problem statement (which almost every software firm faces these days) is:
- Real-time event capture and data ingestion into the system and data store.
- Millisecond-level retrieval of hot data.
- Milliseconds to single-digit seconds for retrieval of cold data.
- Bulk data storage on a horizontally scalable system.
- A low-cost, easy-to-maintain system.
- Great visualization tools with low cost and high customizability.
To solve these use cases, we can use the trio of Apache Kafka, Apache Pinot and Apache Superset. Let me now dive into the architecture of a flow which may potentially solve them. In the diagram, I will also go slightly deeper into Apache Pinot's internal workings, as it is the major component in the design. The design could look something like this:
Figure 5: The trio of Kafka, Pinot and Superset
The above design is very easy to implement for a PoC, as the three components of our trio fit together very well thanks to their active support for integration. The details of a production architecture would vary, keeping in mind that these are all distributed systems, and we would need to give some time to the finer details of the design. But at a high level, this solves our use cases really efficiently. You can read more in the official documentation of each of the three projects.
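As a closing sketch for such a PoC, here is a tiny end-to-end smoke test: push one event into Kafka and, after giving Pinot a moment to consume it, confirm it is queryable. All names reuse the hypothetical topic, table, and ports from the earlier sketches:

```python
import json
import time
from kafka import KafkaProducer  # pip install kafka-python
from pinotdb import connect      # pip install pinotdb

# 1. Produce a single event into the Kafka topic Pinot is consuming.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("orders", value={"order_id": 1, "item": "burger", "ts": int(time.time())})
producer.flush()

# 2. Give Pinot's realtime ingestion a moment to pick it up.
time.sleep(5)

# 3. Query it back through the broker: hot data, moments after ingestion.
conn = connect(host="localhost", port=8099, path="/query/sql", scheme="http")
cur = conn.cursor()
cur.execute("SELECT COUNT(*) FROM orders")
print("events visible in Pinot:", cur.fetchone()[0])
```

If the count comes back non-zero within seconds of producing, the whole hot path of the trio is wired correctly.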