The Data Daily

The Key Concepts To Investigate Your Dataset

This article was published as a part of the Data Science Blogathon.

“Garbage in, garbage out” is common advice among data scientists. If your data set is messy, building models will not help you solve your problem. To build a powerful machine learning model, we need to explore and understand our data set before we define a predictive task and solve it.

Data scientists spend most of their time exploring, cleaning, and preparing their data for modeling. This helps them build accurate models and check the assumptions required for fitting those models.

If you understand your data and prepare it well, a large share of the work (by a common rule of thumb, almost 80%) is already done.

Whether it’s survey results, sales data, or an email campaign, you’ve collected data for a specific purpose. By extension, apply this purpose to the questions you’re asking of the data itself. Beginning with some specific questions can keep your research focused and allow you to see the forest for the trees. A question like “what does my revenue look like for the past 3 years?” is vague: it allows for exploration, but also for confusion.

Instead, something like “which channel brought in the most revenue over the past 3 years?” has a clearer answer. Subsequent questions may be “which department brings in the most revenue per year?” or “are sales in climbing gear increasing or decreasing this year?” It’s important to have a specific question in mind when you begin data analysis, both to provide some structure and to avoid stumbling into false positives.

It’s easier to spot relationships if you analyze the data in different subsets. For example, segment your revenue data by channel, as in the chart above, or by department. Experiment with the subsets and variables that make the most sense for the questions you developed in the previous step.
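As a rough illustration of this kind of segmentation, the sketch below sums revenue by channel and by department using only the Python standard library. The records, the field layout, and the helper name `revenue_by` are all made up for the example; they are not taken from the retailer data discussed in this article.

```python
from collections import defaultdict

# Hypothetical sales records: (year, channel, department, revenue).
# These numbers are invented purely for illustration.
sales = [
    (2011, "online", "Climbing", 120.0),
    (2011, "retail", "Running", 200.0),
    (2012, "online", "Climbing", 180.0),
    (2012, "retail", "Running", 170.0),
    (2013, "online", "Running", 90.0),
    (2013, "retail", "Climbing", 60.0),
]


def revenue_by(records, key_index):
    """Sum revenue (the last field) grouped by the field at key_index."""
    totals = defaultdict(float)
    for record in records:
        totals[record[key_index]] += record[-1]
    return dict(totals)


# The same records, segmented two different ways.
by_channel = revenue_by(sales, 1)
by_department = revenue_by(sales, 2)

# A specific question: which channel brought in the most revenue?
top_channel = max(by_channel, key=by_channel.get)

print(by_channel)
print(by_department)
print(top_channel)
```

Changing `key_index` is all it takes to look at the same data through a different subset, which keeps the focus on the question rather than on the mechanics.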

Working this way lets you stay within your train of thought and move smoothly from question to question without tripping over formatting or equations. It can also be helpful to use what Excel calls a pivot table. In our outdoor gear retailer example, you can switch from a quarter-by-quarter timeline to revenue by quarter of the year just by selecting an option in a drop-down menu. The graph below then aggregates each quarter’s revenue between 2010 and 2013.
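The switch between those two views can be sketched in a few lines of plain Python. This is only an approximation of what a pivot table does, under assumed data: the `quarterly` rows and the `pivot_sum` helper are hypothetical, and the figures are not the retailer's real numbers.

```python
from collections import defaultdict

# Hypothetical rows: (year, quarter, revenue). Invented for illustration.
quarterly = [
    (2010, 1, 100.0), (2010, 2, 120.0), (2010, 3, 90.0), (2010, 4, 150.0),
    (2011, 1, 110.0), (2011, 2, 130.0), (2011, 3, 95.0), (2011, 4, 160.0),
]


def pivot_sum(rows, key):
    """Sum revenue (third field) grouped by whatever key(row) returns."""
    totals = defaultdict(float)
    for row in rows:
        totals[key(row)] += row[2]
    return dict(sorted(totals.items()))


# Timeline view: one value per (year, quarter).
timeline_view = pivot_sum(quarterly, key=lambda r: (r[0], r[1]))

# Seasonal view: each quarter of the year, aggregated across all years.
seasonal_view = pivot_sum(quarterly, key=lambda r: r[1])

print(timeline_view)
print(seasonal_view)
```

Only the grouping key changes between the two views, which mirrors choosing a different option in a drop-down menu.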

Experiment with your time variables. Look at the quarter, month, or week, whichever makes sense for what you’re looking for. Sometimes what is missing is just as important as what is there. If there are holes in your data, take note. It can be helpful to keep notes throughout your analysis as reminders of what you’d like to research or discuss with colleagues later.
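One simple way to notice holes is to list the periods you expect and check which ones never appear. The sketch below does this for months; the `observed` list and the `missing_months` helper are hypothetical and stand in for whatever time column your own data has.

```python
def missing_months(observed, start, end):
    """Return the (year, month) pairs from start to end (inclusive)
    that do not appear in observed."""
    seen = set(observed)
    year, month = start
    gaps = []
    while (year, month) <= end:
        if (year, month) not in seen:
            gaps.append((year, month))
        # Step forward one month, rolling over into the next year.
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return gaps


# Hypothetical months for which sales records exist.
observed = [(2012, 11), (2012, 12), (2013, 2), (2013, 3)]

gaps = missing_months(observed, start=(2012, 11), end=(2013, 3))
print(gaps)
```

A gap found this way is a prompt for a note, not a conclusion: it may be a real month without sales or a problem in how the data was collected.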

Take a look at this quarterly analysis of revenue by department. It’s not very helpful because it’s hard to spot trends.

This yearly line graph makes it much easier to see that Climbing is the fastest-growing department and Running sales have been decreasing for the past three years.
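The move from a quarterly to a yearly view can be sketched as a roll-up followed by a simple growth calculation. The figures below are invented so that they show the same pattern described here (Climbing growing, Running declining); they are not the data behind the graph, and `total_growth` is a hypothetical helper.

```python
from collections import defaultdict

# Hypothetical rows: (year, quarter, department, revenue).
# Only two quarters per year are listed to keep the example short.
quarterly = [
    (2011, 1, "Climbing", 10.0), (2011, 2, "Climbing", 12.0),
    (2012, 1, "Climbing", 15.0), (2012, 2, "Climbing", 17.0),
    (2013, 1, "Climbing", 20.0), (2013, 2, "Climbing", 24.0),
    (2011, 1, "Running", 30.0), (2011, 2, "Running", 28.0),
    (2012, 1, "Running", 26.0), (2012, 2, "Running", 25.0),
    (2013, 1, "Running", 22.0), (2013, 2, "Running", 21.0),
]

# Roll quarters up into one total per department per year.
yearly = defaultdict(lambda: defaultdict(float))
for year, _quarter, dept, revenue in quarterly:
    yearly[dept][year] += revenue


def total_growth(series):
    """Relative change from the first year to the last year in series."""
    years = sorted(series)
    first, last = series[years[0]], series[years[-1]]
    return (last - first) / first


growth = {dept: total_growth(series) for dept, series in yearly.items()}
fastest_growing = max(growth, key=growth.get)

print(growth)
print(fastest_growing)
```

Comparing first and last year is a deliberately crude measure; it is enough to rank departments by trend, but it ignores what happens in between.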

Data analysis is a continual process and the best way to approach it is to try to get less and less wrong. You probably won’t ever have all the data you want or need to answer every question about your business, but you can at least push toward more answers and better decisions. This continual feedback loop (question, analyze, investigate, repeat) can be improved but will never be perfect.

Understanding and interpreting data is a crucial step in machine learning. In this blog post, we tried to provide an overview of techniques that can help you get to know your data better.

Depending on the size, dimensionality, and type of your data, you can choose a suitable algorithm. For instance, when you have a large amount of raw data, you can look at representative examples instead of random samples. If you have a wide data set, you can also identify the important dimensions to help you understand those representative samples.

Different techniques can give you different insights into your data. It is your job to use the tools to solve the mystery like a detective.
