The lost art of data science for understanding


There is a tremendous difference between data science for understanding and data science for prediction.

The former is understanding why people use the ???? emoji, what emotional states they are trying to communicate, and how this differs across cultures and age groups.

The latter is predicting that if someone types certain words in a certain order then the next emoji they’ll type is ????.

The former requires a rich and interdisciplinary set of skills — mostly human skills — as I first argued in a talk at Penn State in 2016.

The latter is a purely technical problem — and may even be a trivial technical problem — and is just one part of the end-to-end data science process.

Unfortunately, the “predictors” have dominated the popular conception of the field of data science today — especially with the hype around machine learning and artificial intelligence.

Everyone wants to master the latest technologies and modeling techniques with no understanding of the underlying phenomena we are modeling.

Not only is there no understanding, there is not even any curiosity.

For years, one of my favorite questions to ask data science students has been what they would do if they had access to all of the data in the world. In the past, they would say “I would seek to understand the birth of stars, or the progress of societies, or the causes of depression.”

Now, they say “I would seek to learn computer vision, or deep learning, or neural networks.”

We have put all our focus on the tools and lost sight of the problems we are trying to solve. Erich Fromm, the brilliant German psychologist, foresaw this in 1947 when he wrote: “We have become enmeshed in a net of means and have lost sight of the ends.”

Not only that, when it comes to data science, we have made the field much less human and much less accessible as a result.

This is almost certainly a controversial position today, and likely a minority one, but I’m not even sure whether data science for prediction should be considered data science at all, or whether it should be considered a branch of engineering instead.

To be fair, I identify primarily as a researcher who uses data science as a tool to understand the world, rather than as the object of my research itself, and this perspective is driven by that background.

The questions I’ve been interested in throughout my career have been fundamentally human questions. How does advertising influence purchase behavior? How do we get people to vote for our candidate? Why do people use specific emojis and what do they mean?

These are all questions that involve the complete data science process, end-to-end. In order to solve them, we need dynamic, visionary, cross-functional scientists with an innate sense of curiosity who are passionate about understanding the world and driving impact.

Because of the frenetic focus on prediction — and the Kaggleization of data science education — most of the data science resumes that cross my desk these days feature students eagerly seeking to one-up each other in the fanciest technologies.

Yet almost everything they feature on their resumes is a well-defined project where someone else did the hardest work for them. Someone else defined the problem, and someone else collected, processed, and assembled the data. And after they built a model, someone else decided what it meant and what to do next.

To be clear, there certainly are data science jobs that focus only on building models and most of them pay very well.

But those are not the types of jobs that embody what for me is the true potential and promise of data science — to advance our understanding of the world around us.

If today a data scientist is seen as someone who sits behind a computer all day, tuning away at hyper-parameters, then we have crippled our field greatly and made it much less diverse as a result.

When I first got into data science (by way of biostatistics) — when Hal Varian first said in 2009 that “the sexy job in the next ten years will be statisticians” — the person we envisioned was a visionary and an evangelist, exciting people about the potential of this fascinating new field.

Even Varian described this person as having “the ability to take data — to be able to understand it, to process it, to extract value from it, to visualize it, to communicate.”

How did we go from that starting point in 2009 to today, when one of my colleagues said to me, only half-jokingly, “I would never trust a data scientist to make a PowerPoint”?

I recently met a young data scientist — she confessed to me that she had a lot of experience working with data but was new to machine learning and she wasn’t sure she should even consider herself a data scientist.

I don’t know what the schools are teaching these days or what Silicon Valley hiring practices are communicating, but this is a very new phenomenon!

I first started interviewing for data science jobs in 2012 and machine learning would come up as an afterthought, and only rarely.

In almost every position I interviewed for, at startups and major tech companies alike, I was only occasionally asked about machine learning. I would laugh and say that all I knew was the difference between clustering and classification. And the hiring managers would say, “That’s more than enough.”

In almost all cases, the part of data science that involves fitting the model is the least interesting and least difficult part of the process. That part is bracketed on both sides by dynamic and exciting problem spaces: on the left, translating a real-world problem into a data science question and acquiring, preparing, and cleaning the data; on the right, finding insights and weaving them into a story that leads to impact.

One of the greatest realizations in my career over the past few years has been how much I love “data detective work” — which is also where my passion for journalism comes in handy. Tracking down different sources of data, understanding what they mean, profiling the data, and learning about its provenance. I find it all so fascinating!

If 80% of data science is cleaning data, then we need people who love that part of it too, not just people who view it as a nuisance.

Are we able to attract those types of dynamic, deeply curious, methodical, and persistent thinkers to the field if we reduce the field only to machine learning and prediction?

Or does that only intimidate the people who would make the best big-picture data scientists out of the field? And do the engineering-oriented people who do enter the field, expecting to do ML all the time, become disaffected by what actual data science is like most of the time and leave?

I don’t pretend to have all the answers here, or even any of them, but I do think we need to reclaim the value of data science for understanding the world, not just predicting the future.


The Data Daily
