Steve Lohr of The New York Times said: "Data scientists, according to interviews and expert estimates, spend 50 percent to 80 percent of their time mired in the mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets."
Up to 80% of a data scientist’s time and effort can go into collecting, cleaning and preparing data for analysis, because datasets come in various sizes and differ in nature. A data scientist has to reshape and refine raw datasets into usable ones that can be leveraged for analytics. In this article we will look at data preparation, its importance and how it is done.
Do you know the difference between a great enterprise data scientist and a mediocre one? Most people think that the greatness of a data scientist lies in building better algorithms. However, ProjectPro data science experts say that building a better algorithm is like building a superfast rocket ship: it is useful only if the ship is pointed in the right direction. Nobody wants to move faster in the wrong direction, yet this is what happens in many organizations. Data cleaning plays a crucial role in ensuring that the rocket points in the right direction.
Suppose you are trying to analyse the log files of a website to find out which IP addresses spammers are coming from, which demographic brings the website more sales, or in which geographic region the website is popular. To answer these questions, the analysis has to be performed on data with two important columns: the IP address of each hit and the number of hits made to the website from that address. Log files are not structured and contain a lot of free-form textual information. In simple terms, preparing the log file to extract data in the required format (IP address + Number of Hits) for analysis can be termed data preparation.
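To make the log file example concrete, here is a minimal sketch in Python. It assumes each valid line starts with an IPv4 address, as in the common Apache/Nginx access-log layout; the sample lines, the regular expression and the function name `count_hits_by_ip` are illustrative assumptions, and real logs may need a different pattern.

```python
import re
from collections import Counter

# Hypothetical sample lines in a common access-log layout.
# Real log formats vary, so the pattern below is an assumption.
SAMPLE_LOG = """\
203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326
198.51.100.23 - - [10/Oct/2023:13:55:40 +0000] "GET /cart HTTP/1.1" 200 512
203.0.113.7 - - [10/Oct/2023:13:56:01 +0000] "POST /login HTTP/1.1" 302 0
this line is malformed and should be skipped
203.0.113.7 - - [10/Oct/2023:13:56:09 +0000] "GET /index.html HTTP/1.1" 200 2326
"""

# An IPv4-looking address at the start of the line (octet ranges are not checked).
IP_AT_START = re.compile(r"^(\d{1,3}(?:\.\d{1,3}){3})\s")


def count_hits_by_ip(log_text):
    """Return a Counter mapping each IP address to its number of hits."""
    hits = Counter()
    for line in log_text.splitlines():
        match = IP_AT_START.match(line)
        if match:  # lines that do not start with an IP address are skipped
            hits[match.group(1)] += 1
    return hits


hits = count_hits_by_ip(SAMPLE_LOG)
for ip, n in hits.most_common():
    print(ip, n)
```

The result is exactly the two-column shape described above (IP address + Number of Hits), ready for further analysis.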
CrowdFlower, provider of a “data enrichment” platform for data scientists, conducted a survey of about 80 data scientists to find out how they spend their time.
The survey statistics clearly reveal that most of a data scientist’s time is spent on data preparation (collecting, cleaning and organizing data) before they can begin data analysis. There are several valuable data science tasks, such as data exploration and data visualization, but the least glamorous and least enjoyable one is data preparation. Data preparation is also referred to as data wrangling, data munging or data cleaning. The amount of time needed for data preparation for a particular analysis problem depends directly on the health of the data, i.e. how complete it is, how many values are missing, how clean it is and what inconsistencies it contains.
The survey also revealed that 57% of the data scientists consider cleaning and organizing data the most boring and least enjoyable task of the data science process, and 19% consider collecting datasets the least enjoyable task.
Monica Rogati, VP for Data Science at Jawbone, said in reference to data preparation: “It’s something that is not appreciated by data civilians. At times, it feels like [data wrangling] is everything we do.”
Let us consider a simple example where your goal as a data scientist is to estimate how many burgers McDonald’s sells every day in the US. You have a .csv file in which each row describes the finances of McDonald’s, with columns like state, city and the number of burgers sold. However, instead of having all this data in one single document, you probably receive it in multiple files and in diverse formats. A data scientist has to join all this data and make sure that the resulting combination makes sense for further analysis. Usually there are several formatting inconsistencies and other issues in the dataset. For instance, there could be some rows where the state is 101 and the number of burgers sold is “New York”. The data cleaning process requires a data scientist to find all these glitches, fix them and ensure that the next time such data comes in, it is fixed automatically. A data scientist’s predictive analysis results can only be as good as the data they have assembled. Data preparation is a vital step of the data science process if any valuable insights are to emerge, which is one reason a data scientist job commands a high pay package in the industry.
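A check for the kind of glitch described above (a numeric state, or text in the burgers column) could look like the following sketch. The column names, the sample rows and the validation rule are assumptions made for illustration; a real pipeline would need rules that match the actual files.

```python
import csv
import io

# Hypothetical data: column names and values are invented for illustration.
RAW_CSV = """\
state,city,burgers_sold
New York,Albany,1200
101,Buffalo,New York
Texas,Austin,950
Ohio,Columbus,
"""


def split_clean_and_suspect(csv_text):
    """Separate rows that pass basic checks from rows that need review."""
    clean, suspect = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        state = row["state"].strip()
        sold = row["burgers_sold"].strip()
        # Assumed rule: a valid row has a non-numeric state name
        # and a whole-number burger count.
        if state and not state.isdigit() and sold.isdigit():
            clean.append(
                {"state": state, "city": row["city"].strip(), "burgers_sold": int(sold)}
            )
        else:
            suspect.append(row)
    return clean, suspect


clean_rows, suspect_rows = split_clean_and_suspect(RAW_CSV)
total = sum(r["burgers_sold"] for r in clean_rows)
print("burgers in clean rows:", total)
print("rows needing review:", len(suspect_rows))
```

Wrapping the rule in a function is what allows the same fix to run automatically the next time a file arrives. Suspect rows are set aside rather than silently dropped so that someone can decide how to repair them.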
There are petabytes of data available out there, but most of it is not in an easy-to-use format for predictive analysis. The data cleaning or preparation phase of the data science process ensures that the data is formatted consistently and adheres to a specific set of rules. Data quality is the driving factor for the data science process, and clean data is important for building successful machine learning models because it improves the performance and accuracy of the model. Data scientists evaluate the suitability and quality of a dataset to identify whether any improvements can be made to achieve the required results. For instance, a data scientist might discover that a few data points bias the machine learning model towards a certain result, and can then create a filter to tackle this situation.
According to a Gartner research report, poor-quality or bad data costs an average organization $13.5 million every year, which is too high a cost to bear. Bad data can reduce the accuracy of insights or lead to incorrect insights, which is why data preparation or data cleaning is of utmost importance even though it is time-consuming and the least enjoyable task of the data science process.
The first step of data preparation is data cleaning, which deals with correcting inconsistent data, filling in missing values and smoothing out noisy data. There could be many rows in the dataset that do not have a value for the attributes of interest, or there could be inconsistent data, duplicate records or some other random error. All these data quality issues are tackled in this first step.
Missing values are tackled in various ways depending on the requirement: by ignoring the tuple, by filling in the missing value with the mean value of the attribute, by using a global constant, or with other techniques such as decision trees or Bayesian formulae. Noisy data is tackled manually or through various regression or clustering techniques.
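The three simplest strategies for missing values mentioned above can be sketched in a few lines of Python. The records and function names are hypothetical, and `None` is assumed to mark a missing value; which strategy is appropriate depends on the analysis.

```python
from statistics import mean

# Hypothetical records; None marks a missing value.
rows = [
    {"city": "Albany", "burgers_sold": 1200},
    {"city": "Buffalo", "burgers_sold": None},
    {"city": "Austin", "burgers_sold": 900},
]


def drop_missing(records, field):
    """Strategy 1: ignore any tuple whose field is missing."""
    return [r for r in records if r[field] is not None]


def fill_with_mean(records, field):
    """Strategy 2: replace missing values with the mean of the observed ones."""
    observed = [r[field] for r in records if r[field] is not None]
    fill = mean(observed)
    return [{**r, field: fill if r[field] is None else r[field]} for r in records]


def fill_with_constant(records, field, constant):
    """Strategy 3: replace missing values with a global constant."""
    return [{**r, field: constant if r[field] is None else r[field]} for r in records]


dropped = drop_missing(rows, "burgers_sold")
mean_filled = fill_with_mean(rows, "burgers_sold")
const_filled = fill_with_constant(rows, "burgers_sold", 0)
print(len(dropped), mean_filled[1], const_filled[1])
```

Each function returns new records and leaves the input untouched, which makes it easy to compare strategies on the same data.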
The data integration step involves schema integration, resolving any data conflicts and handling redundancies in the data.
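As a rough illustration of these three concerns, the sketch below merges two hypothetical sources that describe the same customers under different field names. The field maps, the country alias table and the "first record wins" rule for duplicates are all assumptions chosen to keep the example short.

```python
# Hypothetical sources describing the same customers under different schemas.
crm_rows = [
    {"cust_id": 1, "full_name": "Ada Lovelace", "country": "UK"},
    {"cust_id": 2, "full_name": "Alan Turing", "country": "UK"},
]
shop_rows = [
    {"customer_number": 2, "name": "Alan Turing", "country": "United Kingdom"},
    {"customer_number": 3, "name": "Grace Hopper", "country": "USA"},
]

# Schema integration: map each source's field names onto one shared schema.
CRM_MAP = {"cust_id": "id", "full_name": "name", "country": "country"}
SHOP_MAP = {"customer_number": "id", "name": "name", "country": "country"}

# Conflict resolution: one canonical spelling per country (assumed mapping).
COUNTRY_ALIASES = {"United Kingdom": "UK"}


def to_shared_schema(row, field_map):
    """Rename fields and normalise conflicting country spellings."""
    unified = {field_map[key]: value for key, value in row.items()}
    unified["country"] = COUNTRY_ALIASES.get(unified["country"], unified["country"])
    return unified


def integrate(*sources):
    """Merge (rows, field_map) pairs, keeping one record per id."""
    merged = {}
    for rows, field_map in sources:
        for row in rows:
            record = to_shared_schema(row, field_map)
            # Redundancy handling: the first record seen for an id wins.
            merged.setdefault(record["id"], record)
    return [merged[key] for key in sorted(merged)]


customers = integrate((crm_rows, CRM_MAP), (shop_rows, SHOP_MAP))
for customer in customers:
    print(customer)
```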
The data transformation step involves removing noise from the data, normalization, aggregation and generalization.
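Two of these operations, aggregation and normalization, are easy to show in code. The sketch below sums hypothetical city-level sales up to state level and rescales values to the range 0 to 1 with min-max normalization; min-max is only one of several normalization methods, and the data and function names are invented.

```python
from collections import defaultdict


def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:  # all values identical: avoid dividing by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def aggregate_sum(records, group_field, value_field):
    """Sum value_field over all records that share the same group_field."""
    totals = defaultdict(int)
    for record in records:
        totals[record[group_field]] += record[value_field]
    return dict(totals)


# Hypothetical city-level sales.
city_sales = [
    {"state": "New York", "city": "Albany", "burgers_sold": 200},
    {"state": "New York", "city": "Buffalo", "burgers_sold": 400},
    {"state": "Texas", "city": "Austin", "burgers_sold": 600},
    {"state": "Texas", "city": "Dallas", "burgers_sold": 1000},
]

state_totals = aggregate_sum(city_sales, "state", "burgers_sold")
scaled = min_max_normalize([r["burgers_sold"] for r in city_sales])
print(state_totals)
print(scaled)
```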
A data warehouse might contain petabytes of data, and running analysis on the complete data in the warehouse could be a time-consuming process. In the data reduction step, data scientists obtain a reduced representation of the dataset that is smaller in size but yields almost the same analysis outcomes. There are various data reduction strategies a data scientist can apply based on the requirement: dimensionality reduction, data cube aggregation and numerosity reduction.
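Simple random sampling is one basic form of numerosity reduction. The sketch below uses synthetic data standing in for a warehouse table and compares the mean of a 1% sample with the mean of the full data. The sample size and seed are arbitrary assumptions, and how close the two means are will depend on the data.

```python
import random
from statistics import mean

# Hypothetical "warehouse": 100,000 synthetic values cycling through 0..99.
population = [i % 100 for i in range(100_000)]


def sample_without_replacement(values, size, seed=42):
    """Draw a simple random sample; the fixed seed makes the run repeatable."""
    rng = random.Random(seed)
    return rng.sample(values, size)


sample = sample_without_replacement(population, 1_000)
full_mean = mean(population)
sample_mean = mean(sample)
print("full data mean:", full_mean)
print("1% sample mean:", sample_mean)
```

The sample is a hundred times smaller than the full data, yet for a summary statistic like the mean it typically gives a similar answer. That trade of a little accuracy for a much smaller dataset is what data reduction aims for.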
A dataset usually contains three types of attributes: continuous, nominal and ordinal. Some algorithms accept only categorical attributes. The data discretization step helps a data scientist divide continuous attributes into intervals and also helps reduce the data size, preparing it for analysis.
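Equal-width binning is one straightforward way to discretize a continuous attribute. In the sketch below, the ages, the choice of three bins and the function name are assumptions for illustration; other schemes, such as equal-frequency binning, may suit skewed data better.

```python
def equal_width_bins(values, n_bins):
    """Assign each value the index of the equal-width interval it falls into."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:  # all values identical: everything lands in bin 0
        return [0 for _ in values]
    labels = []
    for v in values:
        # The maximum value would compute to index n_bins, so clamp it
        # into the last bin.
        labels.append(min(int((v - lo) / width), n_bins - 1))
    return labels


# Hypothetical continuous attribute.
ages = [18, 22, 35, 41, 58, 67, 78]
age_bins = equal_width_bins(ages, 3)
print(age_bins)  # [0, 0, 0, 1, 2, 2, 2]
```

With a minimum of 18 and a maximum of 78, each of the three intervals is 20 years wide, so the continuous ages collapse into three categories that a categorical-only algorithm can consume.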
Many methods and techniques have been developed for data preparation, but it is still an active area of research in which many scientists are exploring novel techniques and strategies.
Tools like OpenRefine (formerly Google Refine), DataCleaner and many others are being built to automate the data preparation or data cleaning process, so that data scientists can save data preparation time. IDC predicted that by the end of 2020, spending on data preparation tools would grow 2.5 times faster than spending on regular IT-controlled tools. Another study by Forrester predicted that in 2016, machine learning would replace manual data preparation. However, automating data preparation is not so easy, because no two data preparation tasks are the same. Data preparation is not an art, and hence it is necessary for aspiring data scientists to learn the Python and R languages to be successful in this part of the data science process.
All data is dirty! It is up to you, as a data scientist, to improve it. If you are an aspiring data scientist, it is necessary to have a good knowledge of data cleaning or data munging toolkits in Python and R. The pandas package in Python and the dplyr, reshape2, lubridate and tidyr packages in R are a good fit for most data munging tasks.
We invite the data science community to chime in with their comments on what challenges they have come across in data preparation for data mining.
If you are an aspiring data scientist and would like to learn more about data preparation tools like Python and R, please send an email to anjali@projectpro.io