The Data Daily

The 4 Hard Truths Data Science Blogs Don't Teach You About

I’m angry. Yes, I’m angry at data science blogs. Which is weird, because data science is blooming right now and we see a wide variety of knowledge being spread at every corner of the Internet. So, explaining why I am angry at them will be a tough one, but bear with me while I describe my journey through this industry.

Let me give you some context first. A few years ago I was on the road to data science. I wanted to learn everything about this field; the mere idea of building something intelligent that could help someone predict something amazed me.

Inspired by this idea, I decided I wanted to become a data scientist, and like many others, I jumped from engineering to this new landscape. Not knowing where to start, I began searching the Internet to see for myself what data science looked like.

Soon enough, I ended up in a vast sea of blogs soaked in hype and expectations. I was reading titles such as:

I was ready to read them all. I thought to myself:

I was about to learn that skill only gets you so far.

After months of hard study, I started looking for a job in this new and exciting field. I was pretty good at pandas, and I was able to get my head around scikit-learn. TensorFlow? No problem.

I landed a job as a Data Analyst. I was lucky! And this company was great. I was hyped, I felt awesome, and I wanted to show what I could do. I wanted to help the company grow, and show them how we could apply data analysis and machine learning to their operations.

But in my small mind, I had no idea how utterly wrong I was, and let me tell you why.

Guess what? Every blog out there builds its analysis on the assumption that the data is already there, waiting to be analyzed. They never question this assumption, and I, as a rookie, was lied to by omission.

As a data analyst, I was tasked with analyzing our sales, monthly revenue, cancellations, and everything else that is without a doubt important for a SaaS company. To get a dataset, I had to connect to production servers, APIs, buckets, etc.

And you could say: Well, of course, that’s expected! The only problem is that programs do not generate datasets for human consumption.

Most of the time you will have a SQL table in a production database riddled with a huge number of columns whose meaning you don’t even understand. Or JSON files that don’t have a proper structure. Or incomplete datasets that you need to join from multiple sources before you have something you can work with.
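To make that concrete, here is a minimal sketch of the kind of glue work involved, using only the Python standard library. Everything in it is hypothetical: the `accounts_rows` list stands in for rows pulled from a production SQL table, `events_json` stands in for a loosely structured JSON export, and the field names (`account_id`, `meta.reason`, and so on) are invented for illustration. It shows one possible way to flatten nested records and join two sources into a single working dataset, not how any particular company does it.

```python
import json

# Hypothetical rows, as they might come out of a production SQL table.
accounts_rows = [
    {"account_id": 1, "plan": "pro", "mrr_usd": 99.0},
    {"account_id": 2, "plan": "basic", "mrr_usd": 19.0},
    {"account_id": 3, "plan": "pro", "mrr_usd": 99.0},
]

# Hypothetical JSON export: nested, and not every record has the same keys.
events_json = """
[
  {"account": {"id": 1}, "event": "cancel", "meta": {"reason": "price"}},
  {"account": {"id": 2}, "event": "login"},
  {"account": {"id": 1}, "event": "login"}
]
"""


def flatten(record, parent_key="", sep="."):
    """Turn nested dicts into a single-level dict with dotted keys."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


def build_dataset(accounts, raw_events):
    """Left-join account rows with a per-account cancellation summary."""
    events = [flatten(event) for event in json.loads(raw_events)]

    # Collect cancellation reasons by account; the reason may be missing.
    cancelled = {}
    for event in events:
        if event.get("event") == "cancel":
            cancelled[event["account.id"]] = event.get("meta.reason")

    dataset = []
    for row in accounts:
        account_id = row["account_id"]
        dataset.append({
            **row,
            "cancelled": account_id in cancelled,
            "cancel_reason": cancelled.get(account_id),
        })
    return dataset


dataset = build_dataset(accounts_rows, events_json)
for row in dataset:
    print(row)
```

Even in this toy version, most of the code deals with missing keys and mismatched shapes rather than with analysis, which is exactly the part the blogs tend to skip.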

Now, imagine doing that over, and over, and over.

I struggled to get data, but I finally knew my way around. I was already building and putting together datasets, and it was about time to fire up Jupyter and start tinkering with it.

My objective was straightforward: I wanted to know why people cancel their subscription with us. I started my EDA right away. I found some hard truths, cleared up assumptions, and built a small model that could predict the probability of someone churning.

It wasn’t the best model, but it was good enough, and I was proud of it. I presented my findings to the stakeholders and they were delighted. They now expected a monthly report in their email listing the accounts most likely to cancel. The experiment was a success!

Dear reader, did you notice what I just said? If you haven’t managed a data science team before, you may think there’s no issue here. However, anyone who has to deal with the coordination, capacity, and planning of such a team will soon realize that this strategy is not scalable.

You have a data analyst (or a data scientist) extracting data by himself, running a report locally in a Jupyter notebook that only works on his computer, and manually delivering that report to stakeholders. If you don’t think this is a recipe for disaster, then I invite you to reconsider your strategy for building scalable teams.

Moreover, with this approach you’re going to burn out. Stakeholders will depend on your ability to send them the report on time, and you’re teaching the company that they don’t need to learn about data because they have you.
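One small step away from the notebook-on-my-laptop setup is to move the report logic into plain functions that any scheduler or teammate could run. The sketch below is only an illustration under assumptions: the function names (`build_churn_report`, `write_report`), the `churn_probability` field, the 0.5 threshold, and the sample scores are all made up, and it assumes some model has already scored the accounts. It is not a full pipeline, just the shape of one.

```python
import csv
import io
from datetime import date


def build_churn_report(scored_accounts, threshold=0.5, top_n=10):
    """Return the accounts at or above the threshold, riskiest first."""
    at_risk = [a for a in scored_accounts if a["churn_probability"] >= threshold]
    at_risk.sort(key=lambda a: a["churn_probability"], reverse=True)
    return at_risk[:top_n]


def write_report(rows, out, report_date):
    """Write the report as CSV to any file-like object."""
    writer = csv.DictWriter(
        out, fieldnames=["report_date", "account_id", "churn_probability"]
    )
    writer.writeheader()
    for row in rows:
        writer.writerow({
            "report_date": report_date.isoformat(),
            "account_id": row["account_id"],
            "churn_probability": f"{row['churn_probability']:.2f}",
        })


# Hypothetical model output; in practice this would come from the scoring step.
scored = [
    {"account_id": 1, "churn_probability": 0.62},
    {"account_id": 2, "churn_probability": 0.10},
    {"account_id": 3, "churn_probability": 0.91},
]

report = build_churn_report(scored, threshold=0.5)
buffer = io.StringIO()  # stands in for a real file or an email attachment
write_report(report, buffer, date(2024, 1, 31))
print(buffer.getvalue())
```

Because the inputs, threshold, and output target are all parameters, the same code can run on a schedule or on someone else’s machine, so the report no longer depends on one person remembering to open a notebook.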

Even though our processes were not scalable, we kept going, developing model after model, even the same models with different algorithms. We had just discovered AI, and we wanted to make it ours!

After all those models I built, I realized that people were asking me for things that seemed to tackle no business problem. Predict revenue? Ok. Cancellations? Here you go. Forecast new accounts? No problem.

Now, let me ask some harsh questions to my past self:

If you lay out a myriad of models without any business objective and with no execution plan, only to feed your stakeholders’ wonder and amazement, then let me politely tell you that you’re providing nothing of value.

Having a business objective and an execution plan is paramount to building successful AI products that are going to change the way you do business. Let’s go even further: your responsibility as a data scientist is to educate your stakeholders! They trust your expertise and field knowledge to guide them through this AI revolution.

Have them prepare a business case, ask them difficult questions about their execution plans, prepare to mitigate failure, and clear up assumptions around the business risks involved. You’ll provide a clearer agenda, better models, and more value to the company.

We are in the Fourth Industrial Revolution, and data is on the front line.

That is something that I bet most people don’t understand. Data has been such a breakthrough that it has transformed businesses entirely!

Remember how important it was to know how to use a computer? I still remember when having a Microsoft Office certificate guaranteed you a job somewhere. People were replaced one by one by newer generations who were more adept at computers.

Years later, that is no longer a qualification; it is an expectation. Companies now expect you to know how to use a computer, they expect you to know how to use Excel, and they expect you to be able to browse the web without an issue.

If you think that data is going to be different, then I beg you to reconsider your priorities. We see companies investing in data democratization like there’s no tomorrow, teaching their employees how to handle data, interpret it, and use it to enhance their operations.

They have seen the value that data brings to the business. They know how important it is to make informed decisions, and developing strategies around data is now becoming the norm.

If you think Data Science is going to be reserved for teams who know how to handle data, then I’m afraid you’re wrong. If you want to succeed in your business, whether you are a data analyst, or a Chief Data Officer of an organization, then you have to push for data democratization.

If you see yourself in one of these points, then let me give you some recommendations:

I hope you liked the entry. Follow me on Twitter (@__franccesco) if you want to read more entries like this.

Also, share this article if you found it interesting. See you soon.
