The Data Daily

Tidyverse, an opinionated Data Science Toolbox in R from Hadley Wickham

Get your productivity boosted with Hadley Wickham's powerful R library, Tidyverse. It has all you need to start developing your own data analytics and data science workflows.

Here we review Tidyverse (short for Tidy Universe), a ready-to-use collection of R packages that share a common data representation and API design, which makes scripting and data analysis more consistent and fluent. It is also easy to install, and loading it attaches the core packages at once, so you can get started in no time. It can be considered the data science and data management toolbox in R.
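As a minimal sketch of that setup, assuming a working R installation with access to CRAN: the install line is commented out because it only needs to run once.

```r
# One-time installation from CRAN (uncomment on first use).
# install.packages("tidyverse")

# Attaching the meta-package loads the core packages
# (ggplot2, dplyr, tidyr, readr, purrr, tibble, among others).
library(tidyverse)

# List every package that belongs to the collection.
pkgs <- tidyverse_packages()
print(pkgs)
```

The startup message printed by `library(tidyverse)` also reports any function names that now mask base R functions, which is worth a glance in a new project.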

With this library loaded into your project, you can easily perform the fundamental data science tasks: importing, plotting, wrangling and modelling data, as well as functional programming for new developments. The backbone of this library is a robust set of R packages such as ggplot2, dplyr, tidyr, readr, purrr and tibble, among others. We will go into the details of each of these packages further along, with some basic examples to get you started with this amazing toolbox. The library is intended to be a harmonious and compatible set of tools and commands that brings to life the by-the-book definition of an effective data science workflow.
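To give a feel for how these packages divide the work, here is a small sketch that touches four of them. It uses R's built-in `mtcars` dataset purely for illustration; the variable names (`cars`, `heavy`, `col_means`, `p`) and the 3.5 weight threshold are arbitrary choices, not part of the original article.

```r
library(tibble)
library(dplyr)
library(purrr)
library(ggplot2)

# tibble: turn a base data frame into a tibble, keeping row names as a column.
cars <- as_tibble(mtcars, rownames = "model")

# dplyr: wrangling, here keeping only the heavier cars.
heavy <- cars %>% filter(wt > 3.5)

# purrr: functional programming, applying mean() to each selected column.
col_means <- cars %>% select(mpg, wt, hp) %>% map_dbl(mean)

# ggplot2: plotting; the plot object is built here and drawn when printed.
p <- ggplot(cars, aes(x = wt, y = mpg)) + geom_point()
```

Calling `print(p)` in an interactive session draws the scatter plot.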

First, some history: how Tidyverse came to be and who is the mastermind behind it. This ‘package of packages’ was developed by Hadley Wickham, Chief Scientist at RStudio and co-author of the amazing O’Reilly book “R for Data Science”; he also maintains the library. You may already be familiar with Hadley, as he developed earlier R packages such as reshape, reshape2 and plyr, which in turn became building blocks of Tidyverse after several experiments and versions. It was created for statisticians and data scientists with the purpose of boosting their productivity, and as an attempt to reproduce and abstract the Canonical Data Science Workflow (figure) into an actual product. It is an extremely versatile and consistent library, with uses ranging from productivity and workflow enhancement to new data science software development and data science education.

It is important to expand a bit more on the general features of this library, so we can see it in action later in our practical example. Its consistency is deeply rooted in the fact that variables, functions and operators follow regular patterns and syntax. For example, the first argument of most data-manipulation functions is a tidy data frame (one row per observation, one column per variable, one value per cell). To perform operations, one can intuitively connect a sequence of commands, base functions and operators to create a tidy pipeline. The way in which the packages are organized, the coding style and the testing procedures provide a second, lower-level degree of consistency across the library, so when we say the word “consistent”, it should not be taken lightly. Finally, because there is a one-to-one relationship between the analysis workflow processes and the different Tidyverse packages, it is extremely easy to establish effective end-to-end workflows that respond to specific analytic purposes and utilize several types of data.
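The tidy-data idea and the "data frame in, data frame out" pattern can be sketched as follows. The table `wide` and its values are hypothetical and invented only for this illustration.

```r
library(tibble)
library(tidyr)
library(dplyr)

# Hypothetical "wide" table: one column per year, so it is not tidy
# (the variable "year" is spread across the column names).
wide <- tibble(
  country = c("A", "B"),
  `2019`  = c(10, 20),
  `2020`  = c(15, 25)
)

# tidyr reshapes it so each row is one country-year observation.
tidy <- wide %>%
  pivot_longer(cols = c(`2019`, `2020`), names_to = "year", values_to = "cases")

# Each verb takes a data frame first and returns a data frame,
# which is what lets the steps chain into a pipeline.
summary_tbl <- tidy %>%
  group_by(country) %>%
  summarise(total_cases = sum(cases), .groups = "drop")
```

Here `tidy` has four rows (two countries times two years) and `summary_tbl` has one row per country.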

It is therefore no surprise to see its popularity increasing among users, both experienced and beginners. In fact, those looking to learn R for data science should definitely start with Tidyverse, as it has a friendly, gentle learning curve that allows an early-career professional to clean and tackle nontrivial datasets in a short time.

Now, let’s dig deeper into each of the core packages so we can have an overview of the fluency and consistency of the library.

Let’s get started with a practical example. We will be using the Titanic dataset from the Kaggle competition, which you may easily download here. The purpose of our example is to walk you through some common operations you can perform in Tidyverse and to show a bit more of the syntax and the consistency of this library. I hope you enjoy reproducing this code yourself. I encourage you to go through the documentation so you can come up with your own Tidyverse recipes for data analysis.
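The following is a sketch of the kind of operations meant here, not the original article's code. So that it runs without the download, it uses a handful of made-up stand-in rows that borrow the Kaggle column names (`PassengerId`, `Survived`, `Pclass`, `Sex`, `Age`); they are not real passenger records. The file name `train.csv` in the comment is an assumption about where you saved the Kaggle file.

```r
library(tibble)
library(dplyr)

# With the Kaggle file on disk you would instead load it with readr, e.g.:
#   titanic <- readr::read_csv("train.csv")   # path is an assumption
# Made-up stand-in rows (NOT real data) using the Kaggle column names.
titanic <- tribble(
  ~PassengerId, ~Survived, ~Pclass, ~Sex,     ~Age,
  1,            0,         3,       "male",   22,
  2,            1,         1,       "female", 38,
  3,            1,         3,       "female", NA,
  4,            0,         1,       "male",   54,
  5,            1,         2,       "female", 27,
  6,            0,         3,       "male",   NA
)

survival_by_sex <- titanic %>%
  filter(!is.na(Age)) %>%              # drop rows with a missing age
  group_by(Sex) %>%                    # one group per sex
  summarise(
    passengers    = n(),               # rows in each group
    survival_rate = mean(Survived),    # share of survivors (0/1 column)
    .groups = "drop"
  ) %>%
  arrange(desc(survival_rate))         # highest survival rate first

print(survival_by_sex)
```

Swapping the `tribble()` call for the `read_csv()` line should leave the rest of the pipeline unchanged, since both produce a tibble with the same column names.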

Note: You will see this symbol (‘%>%’) very often throughout the example. It is called the pipe operator, and it is very useful when scripting because it lets you keep track of the logic of your analysis. Think of it in the following way: the function call f(x) is now expressed as x %>% f, so the result of one step becomes the first input of the next.
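A minimal illustration of that equivalence, using the magrittr package (which provides `%>%` and is loaded along with the Tidyverse); the vector `x` is just an example input:

```r
library(magrittr)

x <- c(1, 4, 9)

# Nested calls read inside-out: sqrt first, then mean, then round.
nested <- round(mean(sqrt(x)), 1)

# The piped version reads left to right, in the order the steps happen.
piped <- x %>% sqrt() %>% mean() %>% round(1)

# Both forms compute the same value.
same <- identical(nested, piped)
print(same)
```

Extra arguments simply follow the piped value, so `x %>% round(1)` is `round(x, 1)`.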

For more documentation on Tidyverse go to the Official site and see additional examples.

If you are starting out with R and would like to learn more about the usage of Tidyverse, the book “R for Data Science” by Hadley Wickham is the best resource out there.

Also, for another great summary on Tidyverse and its features, this post from R Views is an amazing reference.