Data ingestion has been a topic of discussion for me recently as I started working on a particular project. I started reading about it extensively and realized that data ingestion works like the circulatory system of a data platform: it pumps data to every part of the system, much like a heart. Let us see what data ingestion actually does.
At a very high level, data ingestion is the process of gathering data from one or more sources and dumping it at a different place (a sink) with some modifications along the way. This seemingly simple task of data movement quickly turns into a complex endeavour once the size of the data grows and the response time needs to be minimized. Making the system fault tolerant as it scales is another challenge. Usually, data ingestion pipelines come in these three types:
- ETL (Extract, Transform, Load): Here we fetch data from various sources, transform it, and then load it into some other place in a transactional manner. For example, cleaning raw orders from an application database and loading them into a reporting table (a minimal sketch follows this list).
- CDC (Change Data Capture): In this use case, you only pull data from the source when an action is performed on it (for example an insert, update, or delete). This kind of pipeline is more optimized in terms of payload size but makes more calls to the source and the sink. For example, shipping row-level changes from a database's change log into a search index (a polling-based sketch follows this list).
- Replication: This kind of pipeline takes the data from the source and performs conflict resolution before dumping it into the sink. These are usually used to replicate a data source or keep two data sources in sync. For example, maintaining a read replica of a primary database in another region (a last-write-wins sketch follows this list).
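To make the ETL idea concrete, here is a minimal sketch using Python's standard-library sqlite3 module. The table and column names (raw_orders, clean_orders, and their fields) are made up for illustration, and a real pipeline would add batching, retries, and error handling.

```python
# A minimal ETL sketch using the standard-library sqlite3 module.
# The raw_orders / clean_orders tables and their columns are hypothetical.
import sqlite3


def extract(source: sqlite3.Connection):
    # Extract: pull every row from the (hypothetical) raw_orders table.
    return source.execute(
        "SELECT order_id, customer_email, amount_cents FROM raw_orders"
    ).fetchall()


def transform(rows):
    # Transform: normalize emails and convert cents into a decimal amount.
    for order_id, email, amount_cents in rows:
        yield order_id, email.strip().lower(), amount_cents / 100


def load(sink: sqlite3.Connection, rows):
    # Load transactionally: either every row lands in the sink or none do.
    with sink:
        sink.executemany(
            "INSERT INTO clean_orders (order_id, customer_email, amount) "
            "VALUES (?, ?, ?)",
            rows,
        )


def run_etl(source: sqlite3.Connection, sink: sqlite3.Connection) -> None:
    load(sink, transform(extract(source)))
```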
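A CDC pipeline can be built in several ways; the sketch below simply polls a hypothetical change_log table and applies each change to the sink. Production systems more often read the database's own change stream (a WAL or binlog), so treat this purely as an illustration of the pattern: small payloads, but frequent calls to both the source and the sink.

```python
# A minimal polling-based CDC sketch. The change_log and customers tables
# are hypothetical; customers.id is assumed to be a unique/primary key.
import sqlite3
import time


def apply_change(sink: sqlite3.Connection, op: str, row_id: int, payload: str) -> None:
    # Apply a single captured change to the sink (upsert or delete).
    with sink:
        if op == "delete":
            sink.execute("DELETE FROM customers WHERE id = ?", (row_id,))
        else:  # insert or update
            sink.execute(
                "INSERT INTO customers (id, data) VALUES (?, ?) "
                "ON CONFLICT(id) DO UPDATE SET data = excluded.data",
                (row_id, payload),
            )


def capture_changes(source: sqlite3.Connection,
                    sink: sqlite3.Connection,
                    poll_seconds: float = 5.0) -> None:
    last_seen = 0  # highest change_id already shipped to the sink
    while True:
        changes = source.execute(
            "SELECT change_id, op, row_id, payload FROM change_log "
            "WHERE change_id > ? ORDER BY change_id",
            (last_seen,),
        ).fetchall()
        for change_id, op, row_id, payload in changes:
            # Only the changed rows travel, so payloads stay small,
            # but every change costs an extra call against the sink.
            apply_change(sink, op, row_id, payload)
            last_seen = change_id
        time.sleep(poll_seconds)
```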
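And here is a bare-bones take on replication with last-write-wins conflict resolution. The Record shape, the timestamps, and the in-memory dictionary standing in for the sink are assumptions to keep the example self-contained; real replication would compare versions or vector clocks coming from the actual data sources.

```python
# A minimal replication sketch with last-write-wins conflict resolution.
# The Record shape and the in-memory "sink" are assumptions for illustration.
from dataclasses import dataclass


@dataclass
class Record:
    key: str
    value: str
    updated_at: float  # e.g. a Unix timestamp assigned by the source system


def replicate(source_records: list[Record], sink: dict[str, Record]) -> None:
    """Copy records into the sink, resolving conflicts by newest timestamp."""
    for record in source_records:
        existing = sink.get(record.key)
        # Last-write-wins: keep whichever side was updated most recently.
        if existing is None or record.updated_at > existing.updated_at:
            sink[record.key] = record


if __name__ == "__main__":
    replica: dict[str, Record] = {"a": Record("a", "old", 1.0)}
    replicate([Record("a", "new", 2.0), Record("b", "fresh", 3.0)], replica)
    print(replica)  # "a" is overwritten because 2.0 > 1.0; "b" is added
```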
Beyond these three types, there are a few common needs that data ingestion pipelines serve:
- To collect data from various sources for processing. For example, to generate a bill you might need data spread across different tables, some of which live in SQL databases and some in NoSQL stores.
- For reporting purposes. Most of the time, the data needed to generate reports, charts, and dashboards lives in different places and has to be moved in bulk to a single location before the graphical interface can be built on top of it.
- For purging the data. Databases are very expensive! They are good for storing hot or warm data (data which is used frequently), not cold data. We might as well move the cold data to a data lake or a data warehouse. This involves regular movement of bulk data without failure, which reduces cost and improves database performance. The cold data can then also be used for analytics and ML purposes. This is best done via data pipelines (a purge sketch appears after this list).
- For creating a replica, keeping systems in sync, or migrating data. Whenever we migrate bulk data we might reach for a data ingestion pipeline, and the same goes for keeping two databases in sync.
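As a rough illustration of the purging case, the sketch below exports rows older than a cutoff to a newline-delimited JSON file (standing in for a data lake) and only then deletes them from the database. The events table, its columns, the ISO-8601 UTC timestamps, and the 90-day cutoff are all hypothetical choices for this example.

```python
# A minimal cold-data purge sketch: rows older than a cutoff are archived to
# newline-delimited JSON (standing in for a data lake) and then deleted.
# The events table and the 90-day cutoff are hypothetical; created_at is
# assumed to be stored as ISO-8601 UTC text so string comparison works.
import json
import sqlite3
from datetime import datetime, timedelta, timezone


def purge_cold_rows(db: sqlite3.Connection, archive_path: str, days: int = 90) -> int:
    cutoff = (datetime.now(timezone.utc) - timedelta(days=days)).isoformat()
    cold = db.execute(
        "SELECT id, payload, created_at FROM events WHERE created_at < ?",
        (cutoff,),
    ).fetchall()
    # Write the archive first, delete second, so a crash never loses data.
    with open(archive_path, "a", encoding="utf-8") as archive:
        for row_id, payload, created_at in cold:
            archive.write(json.dumps(
                {"id": row_id, "payload": payload, "created_at": created_at}
            ) + "\n")
    with db:
        db.execute("DELETE FROM events WHERE created_at < ?", (cutoff,))
    return len(cold)  # number of rows moved out of the hot store
```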