Data ingestion pipelines: The veins carrying your system's data!



     Data ingestion has been a topic of discussion for me recently, as I started working on a particular project. I started reading about the topic extensively and realized that data ingestion serves as the circulatory system of data: like the heart, it pumps data to every part of a huge system. Let us see what data ingestion exactly does.

     At a very high level, data ingestion is the process of gathering data from one or more sources and dumping it at a different place (a sink) with some modifications along the way. This seemingly simple task of data movement quickly turns into a complex endeavour as the size of the data increases and the response time needs to be minimized. Making the system fault tolerant as it scales is another challenge. Usually, there are these three needs/types of data ingestion:

  1. ETL (Extract, Transform, Load): This is the use case where we need to fetch data from various sources, transform it, and then load it into some other place in a transactional manner. For example, a nightly job that pulls records from several databases, joins them, and writes the result to a warehouse (a minimal sketch follows this list).

  2. CDC (Change Data Capture): In this use case, you only pull data from the source when an action is performed on it (an insert, update, or delete, for example). This kind of pipeline is more optimized in terms of payload size but makes more calls to the source and the sink. For example, streaming row-level change events out of a database log (a consumer sketch appears a little further down).

  3. Replication: This kind of pipeline takes data from the source and performs conflict resolution before writing it to the sink. It is usually used to replicate a data source or keep two data sources in sync. For example, merging concurrent changes with a last-write-wins rule (a small sketch appears near the end of this post).

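     To make the ETL flavour concrete, here is a minimal, self-contained Python sketch. The source rows and table name are made up for illustration; it extracts records from an in-memory source, transforms them, and loads them into SQLite in a single transaction so a failure leaves no partial state.

```python
import sqlite3

# Hypothetical source rows; in a real pipeline this would be a database, an API, or files.
SOURCE_ROWS = [
    {"order_id": 1, "amount_cents": 1999, "currency": "usd"},
    {"order_id": 2, "amount_cents": 4500, "currency": "usd"},
]

def extract():
    """Pull raw records from the source system."""
    return SOURCE_ROWS

def transform(rows):
    """Normalize units and casing before loading."""
    return [
        (r["order_id"], r["amount_cents"] / 100.0, r["currency"].upper())
        for r in rows
    ]

def load(rows, conn):
    """Load all rows in one transaction so a failure leaves no partial state."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "CREATE TABLE IF NOT EXISTS orders "
            "(order_id INTEGER PRIMARY KEY, amount REAL, currency TEXT)"
        )
        conn.executemany("INSERT OR REPLACE INTO orders VALUES (?, ?, ?)", rows)

conn = sqlite3.connect(":memory:")
load(transform(extract()), conn)
print(conn.execute("SELECT * FROM orders").fetchall())
```
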
     These are the three main types of data ingestion pipelines. Of course, there are any number of permutations and combinations of these three variants suited to specific purposes. The most popular open-source ETL-type pipeline tools are Apache Airflow and Temporal. Both are built around DAG (Directed Acyclic Graph) based transformations: the input is transformed step by step, where each step is a node of the DAG and the end result is deterministic. You can read more about them here.
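     To get a feel for what such a DAG looks like in code, here is a minimal sketch using Airflow's TaskFlow API (Airflow 2.x). The DAG id, schedule, and task bodies are placeholders for illustration, not a real pipeline.

```python
from datetime import datetime

from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def billing_pipeline():
    @task
    def extract():
        # Each task is a node of the DAG; Airflow passes results between nodes.
        return [{"order_id": 1, "amount": 19.99}]

    @task
    def transform(rows):
        return [{**r, "amount_usd": round(r["amount"], 2)} for r in rows]

    @task
    def load(rows):
        print(f"would load {len(rows)} rows into the warehouse")

    # Calling the tasks wires the DAG edges: extract -> transform -> load.
    load(transform(extract()))

billing_pipeline()
```
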
     Similarly, a good example of a CDC-type pipeline is Debezium. It generates a stream of events for every change occurring at the source, which can then be read by Kafka, Airflow, or Temporal, transformed, and saved to the sink.
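     As a sketch of what consuming such a stream might look like, here is a small Kafka consumer reading Debezium-style change events. The topic name and connection details are hypothetical; this assumes the kafka-python client and Debezium's default event envelope, in which the op field is 'c' for create, 'u' for update, and 'd' for delete.

```python
import json

from kafka import KafkaConsumer  # pip install kafka-python

# Hypothetical topic produced by a Debezium connector watching an orders table.
consumer = KafkaConsumer(
    "inventory.public.orders",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    payload = message.value.get("payload", {})
    op = payload.get("op")  # 'c' = create, 'u' = update, 'd' = delete
    if op in ("c", "u"):
        row = payload["after"]   # the new state of the row
        print("upsert into sink:", row)
    elif op == "d":
        row = payload["before"]  # the last state before deletion
        print("delete from sink:", row)
```

Now, let us discuss the typical use cases of such data ingestion pipelines: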
  1. To collect data from various sources for processing. For example, to generate a bill you might need data located in different tables, some of which are SQL-based and some NoSQL-based.
  2. For reporting purposes. Most of the time, the data needed to generate reports, charts, and dashboards is located in different places and needs to be moved in bulk to one place where the graphical interface can be built on top of it.
  3. For purging data. Databases are very expensive! They are good for storing hot or warm data (data that is used frequently), not cold data. We might as well move the cold data to a data lake or a data warehouse. This involves regular, reliable movement of bulk data, which reduces cost and improves database performance. The cold data can then also be used for analytics and ML purposes. This is best done via data pipelines.
  4. For creating a replica, keeping sources in sync, or migrating. Whenever we migrate bulk data we might consider a data ingestion pipeline. Similarly, for keeping two databases in sync, we might consider one (a small conflict-resolution sketch follows this list).
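     For the replication case, here is a tiny illustration of one possible conflict-resolution strategy, last-write-wins: given the same row from two replicas, keep whichever copy was updated most recently. Real replication systems use more sophisticated schemes (vector clocks, CRDTs, and so on), so treat this purely as a sketch.

```python
from datetime import datetime, timezone

def last_write_wins(row_a, row_b):
    # Resolve a conflict by keeping the copy with the newer updated_at timestamp.
    return row_a if row_a["updated_at"] >= row_b["updated_at"] else row_b

replica_a = {"id": 7, "email": "old@example.com",
             "updated_at": datetime(2024, 1, 1, tzinfo=timezone.utc)}
replica_b = {"id": 7, "email": "new@example.com",
             "updated_at": datetime(2024, 3, 1, tzinfo=timezone.utc)}

# The sink receives the winning copy; the losing write is discarded.
print(last_write_wins(replica_a, replica_b))  # keeps the March update
```
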
     There are other use cases as well, but these are the most popular ones in the industry. I have kept the examples above very simple, but usually such pipelines come with a great number of analytics tools and dashboards. They also come with good retry mechanisms (sometimes off the shelf and sometimes manually configured). These pipelines are usually whole infrastructures in themselves and require careful design and implementation, as they may quickly become a fatal bottleneck for the whole system; their failure can be critical. You can read more about an example of such an infrastructure here.

-Amrit Raj
