The Data Daily

When is enough data enough?

The problem and promise of artificial intelligence (AI) is people. This has always been true, whatever our hopes (and fears) of robotic overlords taking over. In AI, and data science more generally, the trick is to blend the best of humans and machines. For some time, the AI industry’s cheerleaders have tended to stress the machine side of the equation. But as Spring Health data scientist Elena Dyachkova intimates, data (and the machines behind it) is only as useful as the people interpreting it are smart.

Dyachkova was replying to a comment made by Sarah Catanzaro, a general partner with Amplify Partners and former head of data at Mattermark. Discussing the utility of imperfect data/analysis in decision-making, Catanzaro says, “I think the data community often misses the value of reports and analysis that [are] flawed but directionally correct.” She then goes on to argue, “Many decisions don’t require high-precision insights; we shouldn’t shy from the quick and dirty in many contexts.”

It’s a useful reminder that we don’t need perfect data to inform a decision. Gary Marcus, a scientist and founder of Geometric Intelligence, an ML company acquired by Uber in 2016, insists that the key to appreciating AI and its subsets, machine learning (ML) and deep learning, is to recognize that such pattern-recognition tools are at their “best when all we need are rough-ready results, where stakes are low and perfect results optional.” Despite this, in our quest for more powerful AI-fueled applications, we keep angling for more and more data, expecting that with enough of it, ML models will somehow give us better than “rough-ready results.”

Alas! It simply doesn’t work that way in the real world. Although more data can be good, for many applications, we don’t need more data. We need people better prepared to understand the data we already have.

As Dyachkova notes, “Product analytics is 80% quick and dirty. But the ability to judge when quick and dirty is appropriate requires a pretty good understanding of stats.” Got that? Vincent Dowling, a data scientist with Indeed.com, makes it even clearer: “A lot of the value in being an experienced analyst/scientist is determining the amount of rigor needed to make a decision.”

They’re both talking about how to make decisions, and in both cases, the experience of the people looking at the data matters more than the data itself. Machines will never be able to compensate for insufficient savvy in the people who run them. As an editorial in The Guardian posits, “The promise of AI is that it will imbue machines with the ability to spot patterns from data and make decisions faster and better than humans do. What happens if they make worse decisions faster?”

This is a very real possibility if people abdicate ownership, thinking the data/machines will somehow speak for themselves.

Putting the people in charge is not all that easy to pull off in practice. As Gartner Research Vice President Manjunath Bhat suggests, AI is influenced by human inputs, including the data we choose to feed into the machines. The results of our algorithms, in turn, influence the data with which we make decisions. “People consume facts in the form of data. However, data can be mutated, transformed, and altered—all in the name of making it easy to consume. We have no option then but to live within the confines of a highly contextualized view of the world.”

For a successful ML project, argues Amazon applied scientist Eugene Yan, “You need data. You need a robust pipeline to support your data flows. And most of all, you need high-quality labels.” But there’s no way to properly label that data without experienced people. To label it well, you need to understand the data to some degree. This harks back to a point made by Gartner analyst Svetlana Sicular a decade ago: Enterprises are filled with people who understand the nuances of their business. They’re the best positioned to figure out the right sorts of questions to ask of the company’s data. What they may lack is the added understanding of statistics that Dyachkova points out: the ability to know when “good enough” results are actually good enough.
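To make that judgment call a little more concrete, here is a minimal sketch of one way an analyst might check whether a quick-and-dirty comparison is at least directionally trustworthy. Nothing in it comes from the people quoted above: the function names, the 95% confidence level, and the sample counts are all illustrative assumptions, and the normal-approximation (Wald) interval it uses is itself a rough tool that behaves poorly with small samples or rates near 0 or 1.

```python
import math
from statistics import NormalDist


def diff_interval(success_a, n_a, success_b, n_b, confidence=0.95):
    """Wald (normal-approximation) interval for the rate difference p_b - p_a.

    Hypothetical helper for illustration; the 95% default is an assumption.
    """
    p_a = success_a / n_a
    p_b = success_b / n_b
    # Standard error of the difference between two independent proportions.
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    diff = p_b - p_a
    return diff - z * se, diff + z * se


def direction_is_settled(interval):
    """True when the whole interval sits on one side of zero."""
    low, high = interval
    return low > 0 or high < 0


# Made-up counts: the same 12% vs. 15% rates at two sample sizes.
small = diff_interval(12, 100, 15, 100)
large = diff_interval(1200, 10_000, 1500, 10_000)

print(direction_is_settled(small))  # False: the interval still spans zero
print(direction_is_settled(large))  # True: the direction is clear
```

The point of the sketch is not the arithmetic. The same observed lift can be “good enough” to act on in one setting and not in another, and deciding which confidence level, sample size, and stakes apply is exactly where human judgment comes in.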

Of course, this is why data science is difficult. In surveys on the top roadblocks to AI/ML adoption, “talent” routinely tops the list. Sometimes we think that’s down to a shortage of data science talent, but maybe we should instead be worried about shortages of basic understanding of statistics, mathematics, and a given company’s business.
