It’s still one of the most popular data science tools for data cleaning and analytics.
However, after being in data science field for some time, the data volume that I’m dealing with increases from 10MB, 10GB, 100GB, to 500GB or sometimes even more than that.
My PC either suffered low performance or long runtime due to the inefficient local memory usage for data that was larger than 100GB.
That was the time when I realized Pandas wasn’t initially designed for data at large scales.
That was the time when I realized the stark difference between large data and big data.
The word large and big are in themselves “relative” and in my humble opinion, large data is data sets that are less than 100GB.
Now, Pandas is very efficient with small data (usually from 100MB up to 1GB) and performance is rarely a concern.
But when you have more data that’s way larger than your local RAM (say 100GB), you can either still use Pandas to handle data with some tricks to certain extent or choose a better tool — in this case, Dask.
This time, I chose the latter.
To some of us, Dask might be something that you’re already familiar with.
But to most aspiring data scientists or people who just got started in data science, Dask might sound a little bit foreign.
And this is perfectly fine.
In fact, I didn’t get to know Dask until I faced the real limitation of Pandas.
Through its parallel computing features, Dask allows for rapid and efficient scaling of computation.
It provides an easy way to handle large and big data in Python with minimal extra effort beyond the regular Pandas workflow.
In other words, Dask allows us to easily scale out to clusters to handle big data or scale down to single computers to handle large data through harnessing the full power of CPU/GPU, all beautifully integrated with Python code.
Think of Dask as an extension of Pandas in terms of performance and scalability.
What’s even cooler is that you can switch between Dask dataframe and Pandas Dataframe to do any data transformation and operation on demand.
Okay, enough of theory.
It’s time to get our hands dirty.
You can install Dask and try that in your local PC to use your CPU/GPU.
Instead of taming the “beast” by scaling down to single computers, let’s discover the full power of the “beast” by scaling out to clusters, for FREE.
YES, I mean it.
Understanding that setting up a cluster (AWS for example) and connecting Jupyter notebook to the cloud can be a pain to some data scientists, especially for beginners in cloud computing, let’s use Saturn Cloud.
This is a new platform that I’ve been trying out recently.
Saturn Cloud is a managed data science and machine learning platform that automates DevOps and ML infrastructure engineering.
To my surprise, it uses Jupyter and Dask to scale Python for big data using the libraries we know and love (Numpy, Pandas, Scikit-Learn etc.). It also leverages Docker and Kubernetes so that your data science work is reproducible, shareable and ready for production.
There are three main types of Dask’s user interfaces, namely Array, Bag, and Dataframe. We’ll focus mainly on Dask Dataframe in the code snippets below as this is what we mostly would be using for data cleaning and analytics as a data scientist.
Dask dataframe is no different from Pandas dataframe in terms of normal files reading and data transformation which makes it so attractive to data scientists, as you’ll see later.
Here we just read a single CSV file stored in S3. Since we just want to test out Dask dataframe, the file size is quite small with 541909 rows.
NOTE: We can also read multiple files to the Dask dataframe in one line of code, regardless of the files size.
When we load up our data from the CSV, Dask will create a DataFrame that is row-wise partitioned i.e rows are grouped by index value. That’s how Dask is able to load the data into memory on-demand and process it super fast — it goes by partition.
In our case, we see that the Dask dataframe has 2 partitions (this is because of the specified when reading CSV) with 8 tasks.
“Partitions” here simply mean the number of Pandas dataframes split within the Dask dataframe.
The more partitions we have, the more tasks we will need for each computation.
Now that we’ve read the CSV file to Dask dataframe.
It is important to remember that, while Dask dataframe is very similar to Pandas dataframe, some differences do exist.
The main difference that I notice is this method in Dask dataframe.
Most Dask user interfaces are lazy, meaning that they don’t evaluate until you explicitly ask for a result using the method.
This is how we calculate the mean of the by adding method right after the method.
Similarly, if we want to check the number of missing values for each column, we need to add method.
During the data cleaning or Exploratory Data Analysis (EDA) process, we often need to filter rows based on certain conditions to understand the “story” behind the data.
We can do the exact operation as what we do in Pandas by just adding method.
And BOOM! We get the results!
Now that we’ve understood how to use Dask in general.
It’s time to see how to create a Dask cluster on Saturn Cloud and run Python code in Jupyter at scale.
I recorded a short video to show you exactly how to do the setup and run Python code in a Dask cluster in minutes. Enjoy! ????