We all believe that data science is a strong asset for gaining crucial insights from business data. However, I still find that many people (including those who already do data science professionally) struggle to explain how data science provides these insights. There is still a great barrier to understanding, or even believing, how these insights are derived.

Data science is the ability to generate data-driven business value. It is a marriage of art and science, in the sense that data scientists must be able to express their understanding through a visualized data story. As we learn to tell this story well, the scientific and business benefits follow.

Data visualizations are a great way to present your data in an easy-to-understand way. There is a plethora of data visualization techniques, such as graphs, box plots, bar plots, and histograms, to name a few. A key skill that all great data scientists possess is knowing which visualization technique best suits what they are trying to show.

In this blog post, we will shift our attention to time-series data. We will answer the following questions:

1. How can we best illustrate time-series data?
2. How can we easily develop it in Python?
There are multiple ways to visually present time-series data. Traders (and anyone who has dabbled with crypto) will most definitely be familiar with the scatter plot, or its candle chart variation, as a means of depicting time series. Despite its ability to show how our target value changes over time, this type of plot can become really complex really fast. Firstly, the longer the data range, the more difficult it becomes to show all of the data. Secondly, comparing multiple target values over the same period also becomes tedious. Sure, we can get away with a comparison between 2, maybe 3, targets. Anything beyond that, and the visualization becomes too heavy.

A cleaner way to achieve this is to bring our static plots to life through animation. Animation is a very powerful tool for presenting your data, especially when time series are involved. Animated visualizations let you clearly see patterns in your data that you might otherwise have overlooked in a static chart. These types of plots, most commonly referred to as data-races, have increased in popularity over the years. There is even a YouTube channel dedicated to uploading only data-races.

In this blog, you will learn how to create your first animated data visualization in Python, which will give you a better understanding of how to work with time-series data. A mixture of pandas, matplotlib, and bokeh is used for creating this visualization.

As an example, we will be using the Earth Surface Temperature Data dataset from Kaggle. This dataset provides an Earth surface temperature reading by country per month. The raw dataset has a shape of 577462 rows by 4 columns.

Before we can generate the visualization, we need to transform our dataset to a wide format. Wide format means that every row represents a new time element, while every column holds the value of a specific target. Our dataset has the following structure:

Step 1: Decide which metric to focus on.
In this case, we will visualize the AverageTemperature column. Our columns should be the names of the different countries, and their values should reflect the average temperature reading for that time element. This step can easily be done using a pivot table.
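As a rough illustration of the pivot step, here is a minimal sketch on a tiny made-up sample. The column names `dt` and `Country` are assumptions (only `AverageTemperature` is named in the text), and the readings are invented for the example.

```python
import pandas as pd

# Tiny stand-in for the raw long-format data. The column names "dt" and
# "Country" are assumptions; only "AverageTemperature" is named in the text.
raw = pd.DataFrame(
    {
        "dt": ["2000-01-01", "2000-01-01", "2000-02-01", "2000-02-01"],
        "Country": ["Malta", "Norway", "Malta", "Norway"],
        "AverageTemperature": [12.1, -4.3, 12.4, -3.9],
    }
)
raw["dt"] = pd.to_datetime(raw["dt"])

# Wide format: one row per timestamp, one column per country.
wide = raw.pivot_table(index="dt", columns="Country", values="AverageTemperature")
print(wide)
```

Note that `pivot_table` averages duplicate (timestamp, country) pairs by default, which is harmless when there is one reading per country per month.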
This creates a table with 3239 rows and 243 columns (belonging to the 243 countries available in the dataset).
The next step is to fill in any missing values. When creating our animation, it is important that we always have some value to show. Any NULL values will break the aesthetics of our animation.
In our case, our dataset does suffer from missing values, especially for the early years. Our filling strategy will be to forward-fill our missing recordings, that is, to use the last valid reading in place of the missing one.
We also want to drop all countries from our dataset that do not have any recordings.
For the sake of simplicity and performance, I will also limit our data to start from the year 2000 onwards. If you are following along, you absolutely do not have to do this step.
We can easily do all of this in Python using pandas.
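The cleaning steps described above could look roughly like the sketch below. It runs on a small hypothetical wide-format frame (the country names and values are made up), so treat it as an outline of the idea and not the exact code behind the numbers in this post.

```python
import numpy as np
import pandas as pd

# Small hypothetical wide-format frame standing in for the pivoted data.
idx = pd.to_datetime(["1999-12-01", "2000-01-01", "2000-02-01", "2000-03-01"])
wide = pd.DataFrame(
    {
        "Malta": [13.0, 12.1, np.nan, 13.5],
        "Norway": [-2.0, np.nan, -3.9, -1.0],
        "Nowhere": [np.nan, np.nan, np.nan, np.nan],  # never recorded
    },
    index=idx,
)

# Drop countries with no readings at all, then forward-fill the gaps
# with the last valid reading.
clean = wide.dropna(axis=1, how="all").ffill()

# Optional: keep only rows from the year 2000 onwards.
clean = clean.loc["2000-01-01":]
print(clean)
```

Forward-filling before slicing by date means a gap in early 2000 can still borrow the last reading from 1999. A column whose very first readings are missing would keep those leading gaps, since there is nothing earlier to carry forward.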
Our dataset now has 165 rows and 242 columns and it’s ready to be animated!
One of the main reasons why I love Python is that someone, somewhere, has already created a package to solve our task using a few lines of code.
This is the case for data race visualizations as well.
We can install the package either using pip or conda, as follows:
After installation, we can import the package into our Python module and initialize the visualization.
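The package itself is not named in this copy of the post, so the sketch below does not try to reproduce its API. As a stand-in, it shows how a similar bar-chart race could be built with matplotlib's `FuncAnimation` directly. The sample data, the `draw_frame` helper, and the output filename are all made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.animation import FuncAnimation

# Hypothetical cleaned wide-format data: rows are months, columns are countries.
clean = pd.DataFrame(
    {
        "Malta": [12.1, 12.4, 13.5],
        "Norway": [-4.3, -3.9, -1.0],
        "Kenya": [24.8, 25.3, 25.6],
    },
    index=pd.to_datetime(["2000-01-01", "2000-02-01", "2000-03-01"]),
)

fig, ax = plt.subplots(figsize=(6, 3))

def draw_frame(i):
    """Redraw the bars for row i, sorted so the warmest country is on top."""
    ax.clear()
    row = clean.iloc[i].sort_values()
    bars = ax.barh(row.index, row.values)
    # Fix the x-axis range so the scale does not jump between frames.
    ax.set_xlim(clean.min().min() - 1, clean.max().max() + 1)
    ax.set_title(clean.index[i].strftime("%Y-%m"))
    return bars

# One frame per row of the wide-format table, shown for 500 ms each.
anim = FuncAnimation(fig, draw_frame, frames=len(clean), interval=500, repeat=False)
# anim.save("temperature_race.gif", writer="pillow")  # needs Pillow installed
```

Each frame corresponds to one row of the wide-format table, which is why the earlier reshaping and gap-filling steps matter: a missing value would simply leave a bar out of that frame.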
And the result? (Note: I had to convert the result to a GIF so that Medium could display it. As a result, some of the scales look a little off.)
And there you have it — your first animated plot using Python!
I encourage you to go over the documentation of this package to familiarise yourself with all the extra functionality.
Article originally posted here by David Farrugia. Reposted with permission.