6 Fundamental Visualizations for Data Analysis

A practical guide with Matplotlib
Photo by Lucas Benjamin on Unsplash
Data visualization is a very important part of data science. It is quite useful in exploring and understanding the data. In some cases, visualizations are much better than plain numbers at conveying information.
The relationships among variables, the distribution of variables, and underlying structure in data can easily be discovered using data visualization techniques.
In this post, we will learn how to create the 6 basic yet commonly used types of data visualizations. I also wrote a post that explains how to create these visualizations with Seaborn.
We will be using Matplotlib for this post. Thus, you will not only learn about the visualizations but also see the difference between Matplotlib and Seaborn syntax.
We will use the grocery and direct marketing datasets available on Kaggle to create the visualizations.
Let’s start by reading the datasets into pandas dataframes.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt  # needed for the plt calls used throughout the post

grocery = pd.read_csv("/content/Groceries_dataset.csv",
                      parse_dates=['Date'])
marketing = pd.read_csv("/content/DirectMarketing.csv")
The first 5 rows of the grocery dataframe (image by author)
The first 5 rows of the marketing dataframe (image by author)
We can start to create visualizations and explore the datasets now.
1. Line plot
Line plots visualize the relation between two variables. One of them is usually the time so that we can see how a variable changes over time.
For the grocery dataset, we can use a line plot to visualize how the number of items purchased changes over time.
Let’s first calculate the number of items purchased on each day using the groupby function of pandas.
items = grocery[['Date','itemDescription']]\
.groupby('Date').count().reset_index()
items.rename(columns={'itemDescription':'itemCount'}, inplace=True)
items.head()
        Date  itemCount
0 2014-01-01         48
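To make the grouping step concrete, here is a minimal sketch on a hypothetical three-row frame (the rows are made up, not taken from the Kaggle file). It uses size() instead of count(): size() counts rows per group, whereas count() counts non-null values per column, so the two agree only when the counted column has no missing values.

```python
import pandas as pd

# Hypothetical toy rows standing in for the grocery dataset
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2014-01-01", "2014-01-01", "2014-01-02"]),
    "itemDescription": ["whole milk", "yogurt", "rolls/buns"],
})

# size() counts the rows in each group; reset_index(name=...) names the new column
toy_counts = toy.groupby("Date").size().reset_index(name="itemCount")
print(toy_counts)
```

This variant also avoids the separate rename step, since the column name is given when the index is reset.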
Here is the matplotlib syntax to create the line plot.
plt.figure(figsize=(10,6))
plt.title("Number of Items Purchased - Daily", fontsize=16)
plt.plot('Date', 'itemCount', data=items[items.Date > '2015-08-01'])
plt.xlabel('Date', fontsize=14)
plt.ylabel('Item Count', fontsize=14)
(image by author)
The first line creates a Figure object, the second line adds the title, and the third line plots the data on the Figure object. The last two lines add the labels for the x-axis and y-axis.
The plot contains the data after 2015-08-01 for demonstration purposes.
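The same plot can also be written against an explicit Axes object instead of the plt state machine. The sketch below is one possible equivalent, not the method used in this post; the toy_items frame and its values are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical daily counts standing in for the items dataframe
toy_items = pd.DataFrame({
    "Date": pd.date_range("2015-08-02", periods=5, freq="D"),
    "itemCount": [40, 52, 47, 61, 55],
})

# subplots() returns the Figure and a single Axes to draw on
fig, ax = plt.subplots(figsize=(10, 6))
ax.set_title("Number of Items Purchased - Daily", fontsize=16)
ax.plot(toy_items["Date"], toy_items["itemCount"])
ax.set_xlabel("Date", fontsize=14)
ax.set_ylabel("Item Count", fontsize=14)
fig.autofmt_xdate()  # rotate the date tick labels so they do not overlap
```

Working on the Axes directly tends to pay off once a figure holds more than one plot, because each call states which Axes it targets.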
Note: In the notebook environment used here, the default figure size is (6,4); a stock Matplotlib installation defaults to (6.4, 4.8). We can change it for each figure separately or update the default figure size.
#to get the default figure size
plt.rcParams.get('figure.figsize')
[6.0, 4.0]

#to update the default figure size
plt.rcParams['figure.figsize'] = (10,6)
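If the new size should apply only to a few figures rather than the whole session, Matplotlib's rc_context can scope the change. This is a small sketch of that alternative; the variable names are illustrative only.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

default_size = list(plt.rcParams["figure.figsize"])

# Settings passed to rc_context apply only inside the with-block
with plt.rc_context(rc={"figure.figsize": (10, 6)}):
    fig = plt.figure()
    size_inside = tuple(fig.get_size_inches())

# After the block, the previous default is restored
size_after = list(plt.rcParams["figure.figsize"])
plt.close(fig)
```

Assigning to plt.rcParams directly, as above, is simpler when every figure in the notebook should share the same size.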
2. Scatter plot
A scatter plot is commonly used to visualize the values of two numerical variables, so we can observe whether there is a correlation between them. Thus, it is also a relational plot.
A scatter plot can be used to check if there is a correlation between the salary and spent amount in the marketing dataset. We can also distinguish the values based on a categorical variable.
Let’s create a scatter plot of the salary and spent amount for married and single people separately.
fig, ax = plt.subplots()
plt.title("Salary vs Spent Amount", fontsize=16)

ax.scatter('Salary', 'AmountSpent',
           data=marketing[marketing.Married == 'Married'])
ax.scatter('Salary', 'AmountSpent',
           data=marketing[marketing.Married == 'Single'])

ax.legend(labels=['Married','Single'],
          loc='upper left', fontsize=12)
(image by author)
We have created a Figure object with a single Axes object. The scatter plots for both categories (Married and Single) are drawn on that same Axes, and each call to scatter gets its own color.
It is much simpler to separate categories with Seaborn. We just pass the name of the column to the hue parameter.
There is a positive correlation between salary and spent amount, which is not a surprise. Another insight is that married people generally earn more than single people.
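When a categorical column has more than two values, repeating the scatter call by hand gets tedious. One way to generalize it in plain Matplotlib is to loop over a pandas groupby, as in this sketch; the toy rows are hypothetical and only mimic the column names used above.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical rows that mimic the marketing columns used above
toy = pd.DataFrame({
    "Salary": [30000, 52000, 75000, 41000],
    "AmountSpent": [400, 900, 1600, 700],
    "Married": ["Single", "Married", "Married", "Single"],
})

fig, ax = plt.subplots()
ax.set_title("Salary vs Spent Amount", fontsize=16)

# One scatter call per category; label= lets legend() pick up the names
for status, group in toy.groupby("Married"):
    ax.scatter(group["Salary"], group["AmountSpent"], label=status)

ax.legend(loc="upper left", fontsize=12)
legend_labels = [text.get_text() for text in ax.get_legend().get_texts()]
```

Passing label= to each scatter call keeps the legend tied to the data, so the labels cannot drift out of order the way a hand-written list can.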
Note: You may have noticed that the “xticks” and “yticks” label sizes differ between the first and second plots. I have updated these settings using the rc function as below.
plt.rc('xtick', labelsize=12)
plt.rc('ytick', labelsize=12)
3. Histogram
A histogram is a way to check the distribution of a continuous variable. It divides the value range of the variable into bins and shows the number of values in each bin. Thus, we get an overview of how the values are distributed.
We can check the distribution of spent amount using a histogram.
plt.title("Distribution of Spent Amount", fontsize=16)
plt.hist('AmountSpent', data=marketing, bins=16)
(image by author)
The bins parameter is used to change the number of bins. More bins give a more detailed view of the distribution, although too many bins can make the plot noisy.
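To see what the bins parameter does to the underlying counts, the sketch below calls numpy.histogram, which Matplotlib's hist uses for binning. The values are made up for illustration.

```python
import numpy as np

values = np.array([1, 2, 2, 3, 3, 3, 8, 9])  # made-up values

# Four equal-width bins between the min (1) and max (9)
counts, edges = np.histogram(values, bins=4)
print(edges)   # bin boundaries
print(counts)  # number of values in each bin

# Fewer bins hide the gap between the two clusters of values
coarse_counts, coarse_edges = np.histogram(values, bins=2)
print(coarse_counts)
```

With four bins the empty stretch between 5 and 7 shows up as a zero count; with two bins it disappears into a single wide bin.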
4. Box plot
A box plot provides an overview of the distribution of a variable. It shows how values are spread out by means of quartiles and outliers.
A box plot can be used to check the distribution of spent amount in the marketing dataset. We can also differentiate based on the “OwnHome” column.
X1 = marketing[marketing.OwnHome == 'Own']['AmountSpent']
X2 = marketing[marketing.OwnHome == 'Rent']['AmountSpent']

plt.title("Distribution of Spent Amount", fontsize=16)
plt.boxplot((X1,X2), labels=['Own Home', 'Rent'])
(image by author)
We can pass an array of values to the box plot function or multiple arrays in a tuple. People who own a home spend more in general. The values are also more spread out for them.
The line in the middle represents the median value of the variable.
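The numbers behind a box plot can be computed by hand. The following sketch uses made-up values and the 1.5 × IQR rule, which is Matplotlib's default whisker setting (whis=1.5), to flag points that would be drawn as outliers.

```python
import numpy as np

spent = np.array([100, 200, 300, 400, 500, 600, 700, 800, 5000])  # made-up values

# The box spans the first to third quartile; the middle line is the median
q1, median, q3 = np.percentile(spent, [25, 50, 75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are treated as outliers
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
outliers = spent[(spent < lower_bound) | (spent > upper_bound)]

print(q1, median, q3)  # quartiles
print(outliers)        # values outside the whisker bounds
```

Here only the value 5000 falls outside the bounds, so it would appear as a separate point above the upper whisker.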
5. Bar plot
A bar plot is mainly used for categorical variables. It is a simple plot but useful for reports or delivering results.
We can create a Figure with two bar plots using the subplots function.
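# location and age are not defined in the original listing; they are assumed here
# to be the value counts of the 'Location' and 'Age' columns
location = marketing['Location'].value_counts()
age = marketing['Age'].value_counts()

fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2,
                               sharey=True, figsize=(8,5))

ax1.bar(x=location.index, height=location.values, width=0.5)
ax1.set_title("Location", fontsize=14)

ax2.bar(x=age.index, height=age.values, width=0.5)
ax2.set_title("Age Groups", fontsize=14)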
(image by author)
We can see how many values exist in each category. The same information can be obtained with the value_counts function of pandas, but a visualization is usually easier to read in a report.
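As a sketch of how value_counts output maps onto a bar plot, the example below uses a hypothetical five-row frame; the category values are invented for illustration. The index of the resulting Series supplies the bar positions and its values supply the heights.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical categorical column
toy = pd.DataFrame({"Location": ["Close", "Far", "Close", "Close", "Far"]})

# value_counts returns a Series sorted by count, indexed by category
location_counts = toy["Location"].value_counts()

fig, ax = plt.subplots()
bars = ax.bar(x=location_counts.index, height=location_counts.values, width=0.5)
ax.set_title("Location", fontsize=14)

bar_heights = [bar.get_height() for bar in bars]
```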
6. 2D Histogram
2D histograms combine 2 different histograms on a grid (x-axis and y-axis). Thus, we are able to visualize the density of overlaps or concurrence. In other words, we visualize the distribution of a pair of variables.
We can easily create a 2D histogram using the hist2d function.
plt.figure(figsize=(8, 8))
plt.title("Histogram of Spent Amount and Salary", fontsize=16)

plt.hist2d("AmountSpent", "Salary", range=[[0, 2000], [0, 80000]],
           data=marketing, cmap='Blues')
(image by author)
To get a more informative picture, I have used the range parameter to limit the ranges on x-axis and y-axis. Otherwise, most of the values would be squeezed to the bottom left corner because of the outliers.
The darker regions contain more data points. We can say most people are in the lower region of both ‘AmountSpent’ and ‘Salary’ columns.
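The grid of counts behind a 2D histogram can be inspected with numpy.histogram2d, which Matplotlib's hist2d uses for the binning. The sketch below uses made-up values and a coarse 2 × 2 grid to show how the range parameter drops an outlier.

```python
import numpy as np

# Made-up values; the last pair is an outlier on both axes
spent = np.array([100, 150, 300, 900, 1200, 6000])
salary = np.array([20000, 25000, 30000, 60000, 70000, 150000])

# Two bins per axis, limited to the same ranges used in the plot above
counts, x_edges, y_edges = np.histogram2d(
    spent, salary, bins=2, range=[[0, 2000], [0, 80000]]
)

# Rows follow the x bins (spent), columns follow the y bins (salary)
print(counts)
```

Five of the six points land in the grid; the outlier pair lies outside the given range and is not counted, which is why limiting the range keeps the remaining cells readable.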
Conclusion
What we have covered in this post is just a small part of Matplotlib’s capabilities. However, these basic plots are commonly used in exploratory data analysis or creating data reports.
Furthermore, they are a good way to learn the Matplotlib syntax. The best way to master Matplotlib, as with any other subject, is to practice. Once you are comfortable with the basic functionality, you can proceed to the more advanced features.
Matplotlib syntax is more complex than Seaborn but it provides more control and flexibility on the plots.
Thank you for reading. Please let me know if you have any feedback.