Compute systems needed at Petabyte scale! Apache Spark to the rescue!

Some references:

  1. https://spark.apache.org/
  2. https://www.uber.com/en-IN/blog/uscs-apache-spark/
  3. https://www.javatpoint.com/apache-spark-architecture
  4. https://n8n.io/integrations/aws-sns/and/kafka/
Imagine being given the task of designing a system that has to work with hundreds of terabytes, or even petabytes, of data. The data is in JSON or, better, Parquet format. The requirements for this system are as follows:
  1. The user should be able to query the data like a usual SQL-based database, even though the data may be unstructured.
  2. The user should have a UI through which they can query the data and even visualize it.
  3. The system need not be real-time like a usual MySQL-based system. However, there should be a way for the user to know that their query has been resolved and the UI has been updated with the results.
  4. The system should store the results in a separate S3 file for later retrieval (a rough sketch of requirements 3 and 4 follows this list).
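
Jumping slightly ahead to the engine we will land on below, here is a minimal, hedged sketch of what a job satisfying requirements 3 and 4 could look like: read Parquet from S3, run a query, write the results to a separate S3 location, and publish a notification that the results are ready. The bucket names, topic ARN, and query text are purely illustrative assumptions; boto3 is assumed to be available on the driver, and the s3:// paths assume a setup like EMR where the S3 filesystem is already configured.

from pyspark.sql import SparkSession
import boto3  # assumption: boto3 is installed where the driver runs

spark = SparkSession.builder.appName("QueryJobSketch").getOrCreate()

# Hypothetical input location holding the large Parquet dataset
events = spark.read.parquet("s3://my-input-bucket/events/")
events.createOrReplaceTempView("events")

# The user's query, submitted through the UI (illustrative query text)
result = spark.sql("SELECT country, COUNT(*) AS cnt FROM events GROUP BY country")

# Requirement 4: store the results in a separate S3 location for later retrieval
result.write.mode("overwrite").parquet("s3://my-results-bucket/query-123/")

# Requirement 3: tell the user / UI backend that the query has been resolved
sns = boto3.client("sns")
sns.publish(
    TopicArn="arn:aws:sns:us-east-1:123456789012:query-completed",  # hypothetical topic
    Message="Results for query-123 are ready at s3://my-results-bucket/query-123/",
)

spark.stop()
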
If we look at this problem, its core is two-pronged:
  1. The volume of the data.
  2. The structure of the data and the need to run nested queries on nested data.
Usual databases are mostly either SQL-based or NoSQL-based (keeping aside other databases built for specific use cases). On one hand, the data volume and data type suggest a NoSQL database, but the query pattern suggests an SQL database. Here, Apache Spark comes to the rescue! As per the official documentation:
Apache Spark is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.
In a very over-simplified definition, Apache Spark allows users to write code in Java, Python, Scala, etc. that loads huge chunks of data in a distributed fashion into an RDD (Resilient Distributed Dataset) or a DataFrame, which can then be queried like a normal SQL database, with even nested queries being supported.
The whole process is really fast (relative to the size of the data!). There are other tools, such as DuckDB, that can do similar things. Here is a sample code example that converts JSON data into a DataFrame and then queries for the maximum and minimum salary:

from pyspark.sql import SparkSession

# Initialize SparkSession
spark = SparkSession.builder \
    .appName("MaxMinSalaryExample") \
    .getOrCreate()

# Sample JSON data
json_data = [
    {"Name": "Alice", "Age": 30, "Salary": 50000},
    {"Name": "Bob", "Age": 35, "Salary": 60000},
    {"Name": "Charlie", "Age": 40, "Salary": 55000},
    {"Name": "David", "Age": 45, "Salary": 70000},
    {"Name": "Eve", "Age": 50, "Salary": 45000}
]

# Create DataFrame from JSON data
df = spark.createDataFrame(json_data)

# Show the DataFrame
print("DataFrame:")
df.show()

# Find the maximum and minimum values of the Salary column
max_salary = df.agg({"Salary": "max"}).collect()[0][0]
min_salary = df.agg({"Salary": "min"}).collect()[0][0]

# Print the results
print(f"\nMaximum Salary: {max_salary}")
print(f"Minimum Salary: {min_salary}")

# Stop SparkSession
spark.stop()

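Since nested queries on nested data are at the core of our problem, here is a second minimal sketch that loads nested JSON records, registers them as a temporary view, and runs a SQL query with both a nested field access and a subquery. The field names and records are illustrative assumptions:

import json

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("NestedQueryExample").getOrCreate()

# Nested JSON records: each employee carries a nested "Address" object
records = [
    {"Name": "Alice", "Salary": 50000, "Address": {"City": "Delhi", "Zip": "110001"}},
    {"Name": "Bob", "Salary": 60000, "Address": {"City": "Mumbai", "Zip": "400001"}},
    {"Name": "Eve", "Salary": 45000, "Address": {"City": "Delhi", "Zip": "110002"}},
]

# Reading JSON strings lets Spark infer the nested object as a struct
df = spark.read.json(spark.sparkContext.parallelize([json.dumps(r) for r in records]))

# Register the DataFrame as a temporary view so it can be queried with plain SQL
df.createOrReplaceTempView("employees")

# Nested field access plus a nested (scalar sub-) query
spark.sql("""
    SELECT Name, Address.City AS city, Salary
    FROM employees
    WHERE Salary > (SELECT AVG(Salary) FROM employees)
""").show()

spark.stop()
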
Such a script can be run on any cluster of EC2 instances with Spark installed on them; we usually use AWS EMR for this purpose. The best part is that Spark can do the same computation as the code above at the scale of petabytes of data in a matter of minutes, or sometimes even seconds, depending on the cluster configuration. We will use this technology to solve our use case. Here is one potential design (it might not be the perfect one, so pardon me for my ignorance!):


This design caters to all (or at least most) of the requirements. But I must tell the readers that working with Spark is like taming a beast; there are many things to keep in mind before using it, which are as follows:
  1. Running Spark is costly! It takes a lot of hardware, and EMR clusters are expensive to spin up. So if your problem is not that big (in terms of data volume), Spark might be overkill. You can explore DuckDB instead if it fits your needs.
  2. Spark needs specialized skills, and you might need to hire data engineers or Spark specialists as part of the development team.
  3. Development timelines are slower, as testing is not as quick as in usual software development.
  4. It needs additional infrastructure for deployment, such as Airflow for orchestration (see the sketch after this list), and for testing we might need notebooks like Jupyter.
  5. Version mismatches are a nightmare to resolve most of the time. A particular version of Spark only works with a particular version of Postgres, a particular version of Kafka, and so on.
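
As a rough illustration of the Airflow point above, here is a minimal sketch of a DAG that submits the Spark script on a schedule. It assumes Airflow 2.x with spark-submit available on the worker; the DAG id, schedule, and script location are hypothetical:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_spark_query_job",     # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",         # assumption: a daily batch run
    catchup=False,
) as dag:
    # Submit the PySpark script to the cluster via spark-submit
    run_spark_job = BashOperator(
        task_id="run_spark_job",
        bash_command=(
            "spark-submit --master yarn --deploy-mode cluster "
            "s3://my-code-bucket/query_job.py"  # hypothetical script location
        ),
    )
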
Despite all these complexities, Spark is a great tool that we can use to solve complex problems of a similar type and scale. I have only recently started learning and working with Spark, so please pardon any ignorance on my part. It's just my two cents.

Thanks!
Amrit Raj
