Data preparation is an important step in any data analysis. This article offers suggestions for making that process easier and more effective.
You just updated your LinkedIn profile with the sexiest job of the 21st Century, according to Harvard Business Review. That’s right: you’re a data scientist. You’re pulling down a six-figure salary. You’re single-handedly turning your once-tired business into a data-driven machine with fancy new machine learning models and algorithms. Your parents may not understand what you do, but they’re proud.
If only they knew that you’re basically a data janitor.
That’s not to say that janitorial work isn’t a noble profession, whether it’s of the sweep-the-floors or the cleanse-the-data variety. Both are important, and in data science, data cleansing (or data preparation) is a critical precursor to doing anything useful with data.
According to Anaconda’s 2021 State of Data Science survey, respondents reported they spend “39% of their time on data prep and data cleansing, which is more than the time spent on model training, model selection and deploying models combined.” According to other studies, data preparation can claim as much as 80% of a data scientist’s time.
Data preparation takes so much of a data scientist’s time because, ultimately, data can’t do much if it hasn’t been vetted and prepped for success. Because good data science depends on good data preparation, it’s worth understanding what it is and how to do it well.
According to TechRepublic, data preparation is “the process of cleaning, transforming and restructuring data so that users can use it for analysis, business intelligence and visualization.” AWS’s definition is even simpler: “Data preparation is the process of preparing raw data so that it is suitable for further processing and analysis.”
But what does this actually mean in practice?
Data doesn’t typically reach enterprises in a standardized format and, thus, needs to be prepared for enterprise use. Some of the data is structured, like customer names, addresses and product preferences, while most is almost certainly unstructured, like geospatial data, product reviews, mobile activity and tweets.
Before data scientists can run machine learning models to tease out insights, they’re first going to need to transform the data, reformatting it or perhaps correcting it, so it’s in a consistent format that serves their needs. This is where data preparation makes all the difference.
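To make "reformatting or correcting" a little more concrete, here is a minimal Python sketch of what that kind of cleanup can look like. Everything in it is an assumption for illustration: the sample records, the field names, the two accepted date formats and the helper names `parse_date` and `prepare` are all hypothetical, and real data will need its own rules.

```python
from datetime import datetime

# Hypothetical raw records, invented for illustration only.
RAW_RECORDS = [
    {"name": "  alice SMITH ", "signup": "2021-03-04", "country": "us"},
    {"name": "Bob Jones", "signup": "04/03/2021", "country": "US "},
    {"name": "alice smith", "signup": "2021-03-04", "country": "US"},
    {"name": "Carol Diaz", "signup": "not a date", "country": "mx"},
]

# Assumed input date formats; a real dataset would need its own list.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")


def parse_date(text):
    """Return an ISO 8601 date string, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def prepare(records):
    """Normalize each field, then drop exact duplicates while keeping order."""
    seen = set()
    cleaned = []
    for record in records:
        row = {
            # Collapse stray whitespace and use consistent capitalization.
            "name": " ".join(record["name"].split()).title(),
            # Unparseable dates become None so they can be reviewed later.
            "signup": parse_date(record["signup"]),
            "country": record["country"].strip().upper(),
        }
        key = (row["name"], row["signup"], row["country"])
        if key not in seen:
            seen.add(key)
            cleaned.append(row)
    return cleaned


CLEANED = prepare(RAW_RECORDS)
for row in CLEANED:
    print(row)
```

In this sketch the two "alice smith" entries collapse into one record once their formatting matches, and the unparseable date is flagged as missing rather than silently guessed. Small as it is, that is the janitorial work in miniature: make the formats consistent, then remove what is redundant and mark what is broken.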
Talend, a company that provides tools to help enterprises ensure the integrity of their data, has suggested a few key benefits of data preparation, including:
In addition, data preparation can help to reduce data management costs that balloon when you try to apply bad data to otherwise good ML models. Now, given the importance of getting data preparation right, what are some tips for doing it well?
If you’ve read this far, you are hopefully convinced that you can’t deliver ML success without substantial investment in data preparation.