The Data Daily

The high level Data Science brief you need before you start your career

I remember stumbling through articles and videos trying to understand data science. I did not understand what data science actually is until I did it as part of a project. I am writing this article to help anyone who wants to know what data science really is and why it is trending so much these days, and to give them a taste of it before they dive deeper into the field.

Data science is the analysis of very large amounts of data to produce predictions and actionable insights, measured by some metrics. You can think of it as a weather forecast: if rain is predicted, you wear a raincoat or carry an umbrella. It is not really about the past itself, but about learning from past data so the business can act on the insights.

Data science isn’t new, though the term is on the rise. It isn’t new because scientists have been using data to improve their findings for years. Why is the term on the rise and the field trending so much? Because our capacity to store data has increased significantly. It is estimated that the total data stored in the world grew from 0.8 ZB in 2008–09 to 30 ZB in 2020, and will exceed 120 ZB by 2025. Hence businesses are keen to understand their data and make data-driven decisions.

A very good data scientist needs skills spanning domain expertise, mathematics, software development, soft skills, the scientific method, data engineering, statistics, visualization, advanced computing and so on. But finding all these qualities in a single person is quite rare, and hence big companies hire teams of such people to share the expertise.

But what we actually need to be a data scientist is:

1. The foremost: passion for data
2. Relating problems to analytics
3. Caring about engineering solutions
4. Exhibiting curiosity
5. Communicating well with the team

The steps of any data science project, at a high level, look somewhat like this and follow a similar pattern.

Data can be acquired through APIs, databases (SQL or NoSQL), RSS feeds, live data from sensors, or spreadsheets. We have different mediums to store and retrieve data, and these are a few of them. Before starting the process, we get the data from whichever such medium is available. The raw dataset is imported into the analytics platform.
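As a rough sketch in Python (assuming pandas; the CSV contents here are hypothetical and held in memory so the example is self-contained, whereas in practice you would pass a file path, SQL query, or API response), importing a raw dataset into the analytics platform might look like:

```python
import io

import pandas as pd

# Simulate a small raw CSV file; in a real project this would come
# from a spreadsheet, a database export, or an API download.
raw_csv = io.StringIO(
    "city,temp_c,humidity\n"
    "Pune,31.5,0.40\n"
    "Mumbai,29.0,0.78\n"
    "Delhi,35.2,0.31\n"
)

# Import the raw dataset into the analytics platform (here, a DataFrame).
df = pd.read_csv(raw_csv)
print(df.shape)  # (3, 3): three rows, three columns
```

The same `pd.read_csv` call works on a file path or URL, and pandas has sibling readers (`read_sql`, `read_excel`, `read_json`) for the other mediums mentioned above.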

1. Understanding the nature of the data: preliminary analysis (the customer’s/client’s needs, feasibility, economic analysis), finding correlations within the data, general trends, and outliers (any rare event or exception in the data, to be ruled out for better analytics).

2. Describing the data: this involves computing summary statistics, including but not limited to, a. Mean and median of the columns. b. Mode: the value that occurs most often. c. Range: how spread out the data is, from minimum to maximum.

3. Visualizing the data: the data is visualized through a range of visuals such as heat maps, histograms (distribution of the data), boxplots, line graphs, and scatter plots. This gives insights into the data before we actually start working on it, and can surface information that would be missed through non-visual processes.
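A minimal sketch of the descriptive statistics above, using a hypothetical column of daily sales figures (chosen so that one value, 500, acts as an outlier that pulls the mean away from the median):

```python
import pandas as pd

# Hypothetical daily sales figures, for illustration only.
sales = pd.Series([120, 135, 120, 150, 500, 128, 120])

mean = sales.mean()            # central tendency; dragged up by the outlier
median = sales.median()        # robust central tendency
mode = sales.mode().iloc[0]    # the value that occurs most often
value_range = sales.max() - sales.min()  # how spread out the data is

print(mean, median, mode, value_range)
```

Comparing the mean against the median like this is one quick way to spot the outliers mentioned in step 1; `sales.describe()` produces most of these numbers in one call, and `sales.plot.hist()` would give the histogram from step 3.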

This process involves cleaning the data: real-world data is usually inconsistent, duplicated, or invalid. We usually clean the data by removing duplicate rows, handling missing values (dropping or imputing them), and correcting or discarding invalid entries.
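A small sketch of such cleaning in pandas, on hypothetical records that exhibit the usual problems (a duplicate row, a missing value, and an invalid negative age):

```python
import numpy as np
import pandas as pd

# Hypothetical raw records with typical real-world defects.
df = pd.DataFrame({
    "name": ["Asha", "Ravi", "Ravi", "Meera", "John"],
    "age": [29, 34, 34, np.nan, -5],
})

df = df.drop_duplicates()                         # remove exact duplicate rows
df = df[df["age"].isna() | (df["age"] >= 0)]      # rule out invalid ages
df["age"] = df["age"].fillna(df["age"].median())  # impute missing values

print(df)  # three clean rows remain
```

Which strategy to use (drop vs. impute, and what counts as invalid) depends entirely on the domain; the rules above are illustrative, not prescriptive.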

This stage involves building a model. Data is fed into the model, an analysis technique is chosen, and what the model generates is the output. Different techniques and models are available depending on the requirement.

Modelling: to predict something useful from the dataset, we need to use machine learning algorithms. This is where machine learning comes in as a subdivision of data science.

The steps are as follows:

1. Select technique: of the available techniques, we select the one that suits what we require from the data.
2. Build model: we can use existing pre-defined models, or customize them.
3. Validate model: check whether the model is giving the right output.
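The three steps above can be sketched end to end with a deliberately tiny example (the technique choice, the hours-vs-score data, and the expected relationship are all hypothetical; simple linear regression stands in for whatever technique the real problem calls for):

```python
import numpy as np

# 1. Select technique: simple linear regression, chosen here purely
#    for illustration; the right technique depends on the data.
#    Toy data: hours studied vs. exam score, following score = 10*h + 20.
hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
score = np.array([30.0, 40.0, 50.0, 60.0, 70.0])

# 2. Build model: fit a degree-1 polynomial (a straight line).
slope, intercept = np.polyfit(hours, score, deg=1)

# 3. Validate model: check a prediction against what we expect.
predicted = slope * 6.0 + intercept  # for 6 hours, expect about 80
print(slope, intercept, predicted)
```

In practice the validation step uses data the model has not seen during fitting; on this perfectly linear toy data the fit recovers the slope and intercept exactly.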

Validation can be done for:

1. Classification and regression: by comparing predicted values against correct values.
2. Clustering: verify whether the groups assigned are useful for the business.
3. Association analysis: takes longer, as it has to be investigated by actually looking at the data and verifying the relations between different entities.
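For the first case, "comparing predicted value vs correct value" boils down to a metric. A minimal sketch with hypothetical labels and values (accuracy for classification, mean absolute error for regression):

```python
# Classification: fraction of predicted labels that match the correct labels.
y_true = ["spam", "ham", "spam", "ham", "spam"]
y_pred = ["spam", "ham", "ham", "ham", "spam"]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Regression: mean absolute error between predicted and correct values
# (smaller is better).
v_true = [10.0, 20.0, 30.0]
v_pred = [12.0, 18.0, 33.0]
mae = sum(abs(t - p) for t, p in zip(v_true, v_pred)) / len(v_true)

print(accuracy, mae)  # 0.8 and about 2.33
```

Libraries such as scikit-learn provide these and many richer metrics ready-made; the plain-Python versions here just make the arithmetic explicit.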

We can also repeat the analysis and rebuild the model if we are not satisfied with the results. This is decided after looking at the results, and is done until the model reaches an acceptable level of accuracy.

This step is mostly about deciding what to present: which parts of the analysis to show, and what delivers the best value to the business. Find and report the main values and the main result of the whole process. What question do our findings and analysis answer? Think of the business questions that are being answered by the analysis.

All findings must be reported so the next steps can be decided. This can also result in scrapping the project if no useful outcome is visible from it.

The whole process of data science is done to get actionable insights for the business or the product. Hence this final step is where it is decided what actions to take after the analysis.

Usually the business decides the action from the insights, after suggestions from the data team.

Now that you have an idea of what to expect in the field of data science, what the industry expects from a data scientist, and what processes are involved, you can make a better judgement about picking data science as your career.
