Should a Data Scientist Know How to Code?


A data scientist can be many things, and a coder could be one of them. Over the course of my career in data science, I have seen a wide array of professionals who use only the tiniest amount of code. On the other hand, I have seen people write books of code to explain their model. Really, it comes down to what type of data scientist you want to be. After interviewing at countless companies over the last few years, I have learned that there are specific data scientist roles. They can be broken down into two main bins: those who use modeling to explain the more business-focused side of data science, and those who write advanced Python code that not only puts the model into production but also streamlines the data so that the entire pipeline is automated. Below, I will elaborate on the main differences between business data scientists and coding data scientists.

Certain big tech giants will post a job titled data scientist for the business-focused role, but when you look closely at the description you will see sentences upon sentences before the first programming language is listed.

Another key focus for this position is homing in on your visualization skills. That means you will have to be a guru at platforms or tools like Google Data Studio, Power BI, or the most popular, Tableau. Visualizing your data is important, but so is getting it in the first place. You will need strong SQL skills to query the databases that make up your model dataset. Once you begin modeling, you will need to know essential statistics, such as A/B testing. To sum up, the key parts of this type of data scientist are:

Visualizations — Tableau, the most prominent tool to show business stakeholders what your key metrics and findings are.

SQL — querying your database by joining on various tables, subquerying, and other more complex ways of getting your data.

Statistics/Insights — this part of the role entails modeling with regression and Analysis of Variance (ANOVA), running A/B tests, and using tools like SAS.
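
To make the SQL and A/B testing items above more concrete, here is a minimal sketch in Python. Everything in it is an assumption made up for illustration: the `users` and `orders` tables, the variant labels, the data, and the helper `two_proportion_z_test`. It joins users to a subquery of distinct buyers to count conversions per variant, then compares the two conversion rates with a pooled two-proportion z-test, which is one common way to read an A/B test and not the only one.

```python
import math
import sqlite3

# Hypothetical schema and data, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (user_id INTEGER PRIMARY KEY, variant TEXT);
    CREATE TABLE orders (user_id INTEGER, amount REAL);
""")
# 200 users: even ids see variant A, odd ids see variant B.
users = [(i, "A" if i % 2 == 0 else "B") for i in range(200)]
# 20 buyers in A (even ids 0..38) and 32 buyers in B (odd ids 1..63).
orders = [(i, 10.0) for i in range(0, 40, 2)] + [(i, 10.0) for i in range(1, 65, 2)]
orders.append((0, 5.0))  # a repeat purchase; must not count as a second conversion
conn.executemany("INSERT INTO users VALUES (?, ?)", users)
conn.executemany("INSERT INTO orders VALUES (?, ?)", orders)

# The subquery de-duplicates buyers; the LEFT JOIN keeps users with no orders.
query = """
    SELECT u.variant,
           COUNT(*)          AS n_users,
           COUNT(b.user_id)  AS n_converted
    FROM users AS u
    LEFT JOIN (SELECT DISTINCT user_id FROM orders) AS b
           ON b.user_id = u.user_id
    GROUP BY u.variant
    ORDER BY u.variant;
"""
rows = conn.execute(query).fetchall()


def two_proportion_z_test(x_a, n_a, x_b, n_b):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p_a, p_b = x_a / n_a, x_b / n_b
    p_pool = (x_a + x_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of the normal
    return z, p_value


(_, n_a, x_a), (_, n_b, x_b) = rows
z, p_value = two_proportion_z_test(x_a, n_a, x_b, n_b)
print(rows, round(z, 2), round(p_value, 3))
```

In practice the query would run against your company's database, and the per-variant counts would be the kind of figures you later show to stakeholders in a tool like Tableau.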

Unlike the more business-centered role above, this role highlights people who have a great knack for software engineering and who work with machine learning models in a more object-oriented fashion.

Object-Oriented Programming — instead of working only in your Jupyter Notebook, you will be writing clean and structured code (most likely Python) in .py files that use classes and functions.

Docker/Airflow/DAGs — these tools help to automatically bring in new data, then train and evaluate the model. Based on certain parameters, you can run your model every hour, day, etc., or when a certain amount of new evaluation data has arrived. Another term for all of these methods together is a machine learning pipeline.
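
As a rough illustration of the two items above, the sketch below shows what class-based pipeline code in a .py file might look like. It is a toy under stated assumptions, not how any particular company or tool does it: `MeanBaselineModel`, `TrainingPipeline`, and `retrain_threshold` are hypothetical names, the "model" simply predicts the mean of its training targets, and a scheduler such as Airflow is assumed to be what calls `run()` on each tick.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Optional


@dataclass
class MeanBaselineModel:
    """Toy stand-in for a real model: predicts the mean of the training targets."""
    prediction: float = 0.0

    def fit(self, targets):
        self.prediction = mean(targets)
        return self

    def mean_absolute_error(self, targets):
        return mean(abs(t - self.prediction) for t in targets)


@dataclass
class TrainingPipeline:
    """Hypothetical pipeline: ingest new data, evaluate, and retrain when enough arrives."""
    retrain_threshold: int = 3
    model: MeanBaselineModel = field(default_factory=MeanBaselineModel)
    training_data: list = field(default_factory=list)
    pending_data: list = field(default_factory=list)
    last_error: Optional[float] = None
    runs: int = 0

    def ingest(self, new_targets):
        """Queue newly arrived rows until the next run."""
        self.pending_data.extend(new_targets)

    def should_retrain(self):
        """Trigger on the amount of new data rather than on a fixed clock."""
        return len(self.pending_data) >= self.retrain_threshold

    def run(self):
        """One scheduled tick. Returns True if the model was retrained."""
        if not self.should_retrain():
            return False
        if self.runs > 0:
            # Score the current model on data it has not seen yet.
            self.last_error = self.model.mean_absolute_error(self.pending_data)
        self.training_data.extend(self.pending_data)
        self.pending_data.clear()
        self.model.fit(self.training_data)
        self.runs += 1
        return True


pipeline = TrainingPipeline(retrain_threshold=3)
pipeline.ingest([10.0, 12.0])
first = pipeline.run()    # too little new data, so nothing happens
pipeline.ingest([14.0])
second = pipeline.run()   # threshold reached: first training run
pipeline.ingest([12.0, 12.0, 18.0])
third = pipeline.run()    # evaluate on the new rows, then retrain on everything
print(first, second, third, pipeline.last_error, pipeline.model.prediction)
```

Keeping the trigger logic in `should_retrain()` separate from `run()` means the same class works whether it is called every hour or only when new evaluation data lands.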

Although the two roles often have separate job descriptions, they overlap, and at your company you may find yourself doing one or the other. If you are lucky (or not), you will have an official machine learning engineer or software engineer to place your saved model into production so you can focus on the model itself, such as its hyperparameters and the choice between unsupervised and supervised learning. To be the best, you will need skills from both of these roles. However, keep in mind that some companies will tell you that you will be a data scientist when the role is really closer to a data analyst or business intelligence analyst.

To find out more about the difference between a data scientist and data analyst, I have written a previous article that you can find here [3]. Thank you for reading and I hope you found this article interesting and useful!


The Data Daily
