Machine learning pioneer Andrew Ng argues that focusing on the quality of data fueling AI systems will help unlock its full power.
The last 10 years have brought tremendous growth in artificial intelligence. Consumer internet companies have gathered vast amounts of data, which has been used to train powerful machine learning programs. Machine learning algorithms are widely available for many commercial applications, and some are open source.
Now it’s time to focus on the data that fuels these systems, according to AI pioneer Andrew Ng, SM ’98, the founder of the Google Brain research lab, co-founder of Coursera, and former chief scientist at Baidu.
Ng advocates for “data-centric AI,” which he describes as “the discipline of systematically engineering the data needed to build a successful AI system.”
AI systems need both code and data, and “all that progress in algorithms means it’s actually time to spend more time on the data,” Ng said at the recent EmTech Digital conference hosted by MIT Technology Review.
Focusing on high-quality data that is consistently labeled would unlock the value of AI for sectors such as health care, government technology, and manufacturing, Ng said.
“If I go see a health care system or manufacturing organization, frankly, I don’t see widespread AI adoption anywhere.” This is due in part to the ad hoc way data has been engineered, which often relies on the luck or skills of individual data scientists, said Ng, who is also the founder and CEO of Landing AI.
Data-centric AI is a new idea that is still being discussed, Ng said, including at a data-centric AI workshop he convened last December. But he pointed to some common problems he sees with data:
Differences in labeling. In fields like manufacturing and pharmaceuticals, AI systems are trained to recognize product defects. But reasonable, well-trained people can disagree about whether a pill is “chipped” or “scratched,” for example — and that ambiguity can create confusion for the AI system. Similarly, each hospital codes electronic records in different ways. This is a problem because AI systems are best trained on consistent data.
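One common way to quantify the kind of labeler disagreement Ng describes is Cohen's kappa, which measures agreement between two labelers corrected for chance. Below is a minimal sketch; the inspector names and pill labels are hypothetical, and the statistic itself is standard.

```python
# A minimal sketch: measuring agreement between two labelers with
# Cohen's kappa. The inspectors and their labels are hypothetical.
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Agreement between two labelers, corrected for chance agreement."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each labeler's marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Two hypothetical inspectors labeling the same ten pills.
inspector_1 = ["chipped", "chipped", "scratched", "ok", "ok",
               "chipped", "scratched", "ok", "ok", "chipped"]
inspector_2 = ["chipped", "scratched", "scratched", "ok", "ok",
               "chipped", "chipped", "ok", "ok", "chipped"]

print(round(cohens_kappa(inspector_1, inspector_2), 3))  # prints 0.688
```

A kappa well below 1.0 signals that the labeling instructions are ambiguous — the cue, in a data-centric workflow, to clarify the defect definitions before retraining the model.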
The emphasis on big data. A common belief holds that more data is always better. But for some uses, especially manufacturing and health care, there isn’t that much data to collect, and smaller amounts of high-quality data might be sufficient, Ng said. For example, there might not be many X-rays of a given medical condition if not that many patients have it, or a factory might have only made 50 defective cell phones.
For industries that don’t have access to tons of data, “being able to get things to work with small data, with good data, rather than just a giant dataset, that would be key to making these algorithms work,” Ng said.
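One standard tactic for working with small image datasets like the 50 defective phones above is label-preserving augmentation: flips and rotations of each image yield extra training examples with the same defect label. A minimal sketch, with images represented as 2D lists of pixel values and a hypothetical one-example dataset:

```python
# A minimal sketch of label-preserving augmentation for a small defect
# dataset. Images are 2D lists of pixel values; the dataset is hypothetical.

def hflip(img):
    """Mirror an image left-to-right."""
    return [row[::-1] for row in img]

def vflip(img):
    """Mirror an image top-to-bottom."""
    return img[::-1]

def rot90(img):
    """Rotate an image 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def augment(dataset):
    """Expand each (image, label) pair 4x; the defect label is unchanged."""
    out = []
    for img, label in dataset:
        for variant in (img, hflip(img), vflip(img), rot90(img)):
            out.append((variant, label))
    return out

# A tiny hypothetical dataset: one 2x2 "image" of a defective part.
dataset = [([[1, 0], [0, 0]], "defective")]
print(len(augment(dataset)))  # prints 4: one original plus three variants
```

Augmentation only helps if the transformations respect the problem — a flip that turns a valid part into an impossible one would inject exactly the kind of label noise a small dataset cannot absorb.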
Ad hoc data curation. Data is often messy and has errors. For decades, individuals have been looking for problems and fixing them on their own. “It’s often been the cleverness of an individual’s skill, or luck with an individual engineer, that determines whether it gets done well,” Ng said. “Making this more systematic through principles and [the use of tools] will help a lot of teams build more AI systems.”
Some of these problems are inherent to differences between companies. Organizations have different ways of coding, and factories make different products, so one AI system won’t be able to work for everyone, Ng said.
The recipe for AI adoption in consumer software internet companies doesn’t work for many other industries, Ng said, because of the smaller data sets and the amount of customization needed. “I think what every hospital needs, what every health care system may need, is a custom AI system trained on their data,” Ng said. “Same for manufacturing. In deep visual defect inspection, every factory makes something different. And so, every factory may need a custom AI model that's trained on pictures.”
But to date there’s been a focus on more multipurpose AI systems that unlock billions of dollars of value.
“I see lots of, let’s call them $1 million to $5 million projects, there are tens of thousands of them sitting around that no one is really able to execute successfully,” Ng said. “Someone like me, I can’t hire 10,000 machine learning engineers to go build 10,000 custom machine learning systems.”
Data-centric AI is a key part of the solution, Ng said, because it could give people the tools to engineer data and build the custom AI systems they need. “That seems to me, the only recipe I’m aware of, that could unlock a lot of this value of AI in other industries,” he said.
While these problems are still being explored, and data-centric AI is in the “ideas and principles” phase, the keys will likely be tools and education, Ng said.
Moving toward standardization is something to look at, Ng said, but physical infrastructure can be a limiting factor. A seven-year-old X-ray machine will generate different images than a brand-new one, and there is no practical path to making sure every hospital uses machines from the same generation. It’s also hard to standardize between a factory that makes car parts and one that makes candy.
“Heterogeneity in the physical environment, which is very difficult to change, leads to a very fundamental heterogeneity in the data,” he said. “These different sorts of data need different custom AI systems.”