Do you really need a data scientist?
Just “plugging” a data scientist into your databases won’t deliver the expected results. First, you need to ensure your data is actually valuable.
Dec 18, 2018 · 8 min read
Your company has data, and likely a lot of it. Millions of rows, maybe even images, audio and video. But nothing you can call big data… yet. That data was collected over time through many systems, whether your own or third parties’, like ERPs, CRMs and other applications. You stored it somewhere: relational databases, spreadsheets, NoSQL databases or anywhere else. It may even be stored in a third-party database (inside some software that you use). The fact is: the data belongs to you, it is stored and you have access to it. Because data science is a popular topic right now, a common illusion arises:
If there is data stored somewhere, all we need to do is hire a data scientist. He/she will certainly extract something from the data and then turn it into something valuable for us.
Well, if your company’s current situation somehow relates to the description above…I’m sorry. You probably don’t really need a data scientist. At least, not yet. Let’s explore a hypothetical scenario where you hire a data scientist.
Hypothetical scenario: hiring a data scientist
Most data scientists I see in the job market are actually just analysts who have learned Python, R, pandas and scikit-learn from MOOCs. They took part in some Kaggle competitions and have little professional experience. And they are eager to demonstrate their knowledge in the “real world”.
It’s the sexiest job of the 21st century. Nobody wants to create dashboards and reports anymore; everyone wants to work with artificial intelligence.
If you post a job opening for a data scientist, tons of candidates will show up and you will probably face the following situations:
Your interviews, tests and screenings will successfully select a data scientist with some machine learning engineering qualities. He/she will have good programming skills and theoretical knowledge of several algorithms and their applications.
You will show them your “data stored somewhere”. Then you will give them an open problem to work on, something like “we would like to reduce default risk” or “we need to increase sales”.
They will try to use the data, applying models and algorithms to solve the problem you proposed, and will probably fail.
Then you’ll start giving them tasks that should go to a business intelligence analyst, like designing a dashboard to track daily sales or automating some back-office tasks.
After a while, your data scientist will be frustrated, unable to apply what was learned in courses and competitions after all. They will start looking at other companies’ job postings, since those companies are apparently doing real data science (the grass is always greener on the other side of the fence).
Your business won’t get the expected benefits from hiring a data scientist. At most, you will have an unmotivated BI analyst earning a data scientist’s salary. Maybe you will have to search for a replacement.
Note that the problem is not the data scientist, who has the necessary knowledge and tried to do the work properly. The actual problem lies in your data… and also in the lack of a real scientist.
The biggest issue with most data scientists is that they are not actually scientists. The eagerness to apply models and algorithms quickly ends up overshadowing important stages of good data science work, such as contextualization, problem framing, experiment design and data collection.
Your data is (probably) garbage!
Transactional databases (those that store data from orders, payments, access logs, etc.) were designed specifically to store the transaction data that sustains applications. This raw data doesn’t have much value for data science. The developers who structured those databases did not think, and probably shouldn’t have had to think, about how that data would be used for analysis. They simply created data models that would improve the performance of whichever application they were building at that moment.
Data in those transactional databases will quite possibly be badly formatted and undocumented, with unrepresentative column names, inconsistent keys, lots of duplicated or missing rows and conflicting values between different sources, among other issues. It is also possible that, in big and small companies alike, some processes rely entirely on spreadsheets. As a consequence, your data won’t have change logs or history. The lack of processes and controls will make it really difficult to establish what actually happened in each event.
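Several of the problems listed above can be measured before any modeling starts. The snippet below is only a minimal sketch using the Python standard library: the function `audit_rows`, the `orders` sample and the column names are all hypothetical, made up for illustration, and it assumes rows are plain dictionaries with simple (hashable) values.

```python
from collections import Counter


def audit_rows(rows, key_field, valid_keys):
    """Summarize common quality problems in a list of dict rows.

    Hypothetical helper: counts missing values per column, exact
    duplicate rows, and rows whose key has no match in valid_keys.
    """
    missing = Counter()
    seen = set()
    duplicates = 0
    orphans = 0
    for row in rows:
        # Count empty or absent values per column.
        for column, value in row.items():
            if value is None or value == "":
                missing[column] += 1
        # A row is a duplicate if an identical row was already seen.
        fingerprint = tuple(sorted(row.items()))
        if fingerprint in seen:
            duplicates += 1
        else:
            seen.add(fingerprint)
        # A key with no match in the reference set breaks key consistency.
        if row.get(key_field) not in valid_keys:
            orphans += 1
    return {"missing": dict(missing), "duplicates": duplicates, "orphans": orphans}


# Made-up sample data, not taken from any real system.
orders = [
    {"order_id": 1, "customer_id": "C1", "amount": 120.0},
    {"order_id": 1, "customer_id": "C1", "amount": 120.0},  # exact duplicate
    {"order_id": 2, "customer_id": "C9", "amount": None},   # unknown customer, no amount
    {"order_id": 3, "customer_id": "", "amount": 35.5},     # customer not recorded
]
known_customers = {"C1", "C2"}

report = audit_rows(orders, "customer_id", known_customers)
print(report)
```

A report like this does not fix anything by itself, but it turns “the data is probably garbage” into numbers that can be discussed with the people who own the source systems.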
It is likely that you already have a business intelligence structure running on these transactional databases. You just plugged some tools directly into the databases or spreadsheets and assigned some data analysts to create reports and dashboards. This worked pretty well and delivered value to the company. You saw all that you had made, and it was very good.
But then came the moment when you started to have insights into how data could help your company increase market share, be more efficient and so on. And you saw all that data there… available… just being used to create some reports and explain things that have already happened… just waiting to deliver more value.
What if predicting the future was possible? What about optimizing business policies? Or retaining a customer who is just about to leave? Well, you have data, a whole lot of it. It is there… restless, screaming, urging to be used. You feel like Eve in the Garden of Eden. The snake asks: “won’t you eat from any tree in the garden?” Yes, you fall into temptation and hire that data scientist I mentioned in the previous hypothetical situation.
The thing is: simply having data stored in transactional databases doesn’t mean you have a gold mine to dig. The data needs to be carefully worked on and transformed into analytical databases, and this process takes a lot of time. Creating an analytical database requires a specific context and an understanding of the business’s peculiarities, so that mere transactions are transformed into something meaningful.
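To make the idea of turning transactions into an analytical base concrete, here is a small sketch that rolls raw payment rows up into one row per customer. It is an illustration only: the function `build_customer_features`, the field names and the sample rows are hypothetical, and which features are worth computing depends entirely on the business context discussed above.

```python
from collections import defaultdict
from datetime import date


def build_customer_features(transactions, as_of):
    """Aggregate raw transaction rows into one analytical row per customer.

    Hypothetical helper: the chosen features are examples, not a recipe.
    """
    grouped = defaultdict(list)
    for tx in transactions:
        grouped[tx["customer_id"]].append(tx)

    features = {}
    for customer_id, txs in grouped.items():
        total = sum(tx["amount"] for tx in txs)
        last_purchase = max(tx["day"] for tx in txs)
        features[customer_id] = {
            "n_orders": len(txs),
            "total_spent": total,
            "avg_ticket": total / len(txs),
            # Recency is measured against an explicit reference date so
            # that the result is reproducible.
            "days_since_last_order": (as_of - last_purchase).days,
        }
    return features


# Made-up sample transactions.
transactions = [
    {"customer_id": "C1", "day": date(2018, 11, 1), "amount": 100.0},
    {"customer_id": "C1", "day": date(2018, 12, 1), "amount": 50.0},
    {"customer_id": "C2", "day": date(2018, 10, 15), "amount": 80.0},
]

customer_features = build_customer_features(transactions, as_of=date(2018, 12, 18))
for customer_id, row in customer_features.items():
    print(customer_id, row)
```

Each transaction means little in isolation; the per-customer row is the kind of object a retention or default-risk analysis would actually work with, and deciding what goes into it is where business knowledge comes in.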
Still, there are situations in which transactional data is actually garbage, because it was not collected as it should have been. I have seen cases where the system doesn’t store all of a transaction’s information, or stores only transformed data, making it impossible to recover the original components. More than two years of collected data, which could have been useful for developing a fantastic prediction model, served only to increase the company’s storage costs.
Let’s make things clear: you have a data problem. Any data scientists you eventually hire will have all the tools and methods to create value. But, without good data, anything they do will be useless. And this is not their fault; it’s yours. In this scenario, each day you spend with a data scientist is a day spent watching your money leave through the front door.
Hire a scientist. One who works with data.
Scientists have to deal with every step of an experiment, from its conception to the publication of the results. They are usually professionals with degrees in physics, chemistry, mathematics, statistics or biology. They may not know everything about machine learning but, at this point in the article, I think I have already made my point: you don’t necessarily need someone who can implement scikit-learn methods.
The scientific research flow usually goes as follows:
Hire people who know how to think like this. At interviews, don’t test whether the candidate knows all the tools and technologies. Instead, test whether he/she can follow a line of thought like the one above. Look for situations and evidence that show the candidate’s ability to follow every step of the scientific method. A candidate with these skills will probably perform really well with your business data.
A scientist tests ideas and hypotheses, fails, learns and eventually comes up with a solution. If you need things done really fast, then hire an experienced scientist. Know exactly what your data situation is, then test the applicant’s experience by checking whether he/she has already faced something similar. Ask which course of action was taken and what was learned from it. An experienced scientist will be able to solve similar problems much more quickly.
Learning how to use new tools is relatively fast. Learning how to think, on the other hand, is a slow, long process. It usually takes four or more years of a bachelor’s degree in the hard sciences and, sometimes, also a master’s or a doctoral degree. Hiring someone who knows how to think and has some basic knowledge of data science tools is much better than hiring someone who knows a lot of different tools but doesn’t think about where and how to apply them. This doesn’t mean you should hire a scientist who doesn’t know how to program, but rather that programming skills shouldn’t be the most relevant criterion when choosing a candidate.
You will also have to rethink all your processes and data flows. The scientist needs to know the business environment and understand how your data is collected. He/she should suggest changes in your applications in order to capture the right data at the right moment, and must ensure all processes are reproducible. This will require your developers to alter services, create new APIs and meet the demands of the scientist (who must know how to advocate for them with solid evidence). Your technology team will need to grant the scientist a certain autonomy to create new data schemas, automate flows and build new ETL jobs.
Look, this does not mean forgetting immediate payback and focusing only on the long term. Your scientist will work to organize the data and improve its quality. Along the way, he/she will uncover processes that need optimization, detect some patterns and apply human intelligence to make adjustments that turn into short-term payback. At the same time, this will bring you a step closer to the point where you will actually need a data scientist who knows everything about machine learning.