Data-centric AI prioritizes data quality over quantity, while traditional model-centric AI does the opposite. Thus, a data-centric approach can help mitigate many of the challenges that can arise when deploying AI infrastructure.
Nowadays, no matter what artificial intelligence (AI) project we want to build, we need two main ingredients: a model (the code or algorithm) and data.
Lots of progress has been made in developing effective models, which has led AI to many breakthroughs. However, equivalent effort has not gone into the data itself, beyond making data sets bigger.
While progress in conventional model-centric AI is yielding diminishing returns, Andrew Ng and many other leading scientists and academics are arguing for data-centric AI: a new paradigm for systematically improving data quality.
Data-centric AI differs from model-centric AI in where the effort goes: in the latter, the main focus is on developing and improving models and algorithms to achieve better performance on a given task. In other words, while model-centric AI treats data as a fixed artifact and focuses on improving the model, data-centric AI treats the model as a fixed artifact and focuses on improving data quality. (Also read: What is Data Profiling & Why is it Important in Business Analytics?)
Data is vital in AI, and adopting an approach to obtain good-quality data is crucial: real-world data is not only error-prone and limited but also very costly to obtain.
The key idea of data-centric AI is to handle data the same way we would high-quality materials when building a house: We spend relatively more time labeling, augmenting, managing and curating the data.
The "mantra" of conventional model-centric AI is to optimize highly parameterized models with bigger data sets to achieve performance gains.
While this mantra works for many industries, such as media and advertising, it presents industries like healthcare and manufacturing with many challenges. These include:
A data-centric AI approach can help mitigate these challenges and, in turn, help organizations get more out of their data.
The essence of data-centric AI is to treat data as a key asset while deploying AI infrastructure.
Unlike model-centric AI, which treats data collection as little more than archiving records into a repository, this paradigm emphasizes developing a shared understanding of the data so that it is described uniformly.
So how can we do that? What important aspects should we consider when implementing this approach? It turns out that adopting data-centric AI means following some guidelines. They are:
Data labeling, as its name suggests, deals with assigning labels to data—for example assigning disease labels to medical images.
Data labels provide crucial information about data sets, which an AI algorithm uses to learn. So, it is imperative that the information should be correct and consistent. Moreover, it has been shown that fewer well-labeled data instances (e.g., images) can produce better outcomes than more data with incorrect labels. (Also read: Why Diversity is Essential for Quality Data to Train AI.)
Data-centric AI places a strong emphasis on the quality of data labels, which requires dealing with inconsistencies in labels and maintaining a labeling manual. The best way to find these inconsistencies is to use multiple data labelers. After finding an inconsistency or ambiguity, labelers should decide how to correct the inconsistent labels and document their decision in the labeling manual. It is also helpful to include examples of correct and incorrect data labels in the manual.
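With multiple labelers per example, finding these inconsistencies can be partially automated by flagging every example whose labels disagree. Here is a minimal sketch in Python; the data format and label names are hypothetical, not from the article:

```python
from collections import Counter

def find_inconsistent_labels(annotations):
    """Flag examples whose labels disagree across multiple labelers.

    annotations: dict mapping an example ID to a list of labels,
    one per labeler (hypothetical format).
    """
    flagged = {}
    for example_id, labels in annotations.items():
        counts = Counter(labels)
        majority_label, majority_count = counts.most_common(1)[0]
        if majority_count < len(labels):  # at least one labeler disagreed
            flagged[example_id] = {"votes": dict(counts), "majority": majority_label}
    return flagged

# Toy data: three labelers annotate the same medical images.
annotations = {
    "img_001": ["pneumonia", "pneumonia", "pneumonia"],
    "img_002": ["pneumonia", "normal", "pneumonia"],
    "img_003": ["normal", "normal", "effusion"],
}
disagreements = find_inconsistent_labels(annotations)
# img_002 and img_003 would be sent back for review and a manual update.
```

Examples that every labeler agrees on pass through untouched; the flagged ones are exactly the cases the labeling manual should document.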
Some examples of inconsistent labels in iguanas detection, as depicted by Andrew Ng, are shown below. Notice how the labelers are inconsistent in marking the iguanas:
You can remove noisy data instances outright by tossing them out. This improves a model's capacity to generalize to new data.
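One simple way to toss out noisy instances is a robust outlier filter. The sketch below uses the median absolute deviation (MAD) and assumes noise shows up as extreme numeric values; both the threshold and the toy readings are illustrative assumptions:

```python
import statistics

def drop_outliers(values, threshold=5.0):
    """Remove instances that deviate from the median by more than
    threshold * MAD (median absolute deviation).

    Assumes noise appears as extreme values; if more than half the
    values are identical, MAD is 0 and only median values survive.
    """
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    return [v for v in values if abs(v - med) <= threshold * mad]

readings = [10.1, 9.8, 10.3, 9.9, 10.0, 97.0]  # 97.0 is a sensor glitch
clean = drop_outliers(readings)  # the glitch is dropped, the rest kept
```

MAD is used here instead of a mean/standard-deviation z-score because a single extreme value inflates the standard deviation enough to hide itself on small samples.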
This task involves generating more data instances from existing instances through, for example, interpolation or extrapolation.
Data-centric AI focuses on data quality rather than quantity, but some AI models require a high volume of data to perform well; data augmentation can help you find a middle ground.
It's important to note, however, that generating more data would not help if that data contains noisy instances.
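For numeric feature vectors, interpolation-based augmentation can be sketched in a few lines. The scheme below, random convex combinations of existing pairs, is one illustrative choice, not a method prescribed by the article:

```python
import random

def augment_by_interpolation(instances, n_new, seed=0):
    """Create synthetic instances by linearly interpolating between
    random pairs of existing feature vectors (hypothetical scheme)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(instances, 2)   # pick two distinct instances
        t = rng.random()                  # interpolation weight in [0, 1)
        synthetic.append([x + t * (y - x) for x, y in zip(a, b)])
    return synthetic

data = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
new_points = augment_by_interpolation(data, n_new=2)
# Each synthetic point lies between two existing points, so it inherits
# any noise present in them -- which is why cleaning must come first.
```

Note that every synthetic point is bounded by the originals it was built from: if those originals are noisy, the augmented set simply multiplies the noise.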
Feature engineering deals with representing raw data in terms of the most relevant variables (i.e., features) using prior knowledge or an algorithm.
The idea is to use domain knowledge as features to improve a predictive model's quality instead of supplying raw data to the model. Feature engineering is vital to add extra features that may not exist in raw data but that can make a significant difference in performance and mitigate the need for collecting large data sets. (Also read: Why is feature selection so important in machine learning?)
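As an illustration, here is how domain knowledge might be encoded as features for a hypothetical fraud-detection record; the schema and the specific feature choices are assumptions for the sketch, not part of the article:

```python
from datetime import datetime

def engineer_features(record):
    """Derive domain-informed features from a raw transaction record.

    Hypothetical schema: {'amount': float, 'timestamp': ISO-8601 string}.
    """
    ts = datetime.fromisoformat(record["timestamp"])
    return {
        "amount": record["amount"],
        "hour_of_day": ts.hour,            # fraud often clusters by hour
        "is_weekend": ts.weekday() >= 5,   # Saturday=5, Sunday=6
        "amount_digits": len(str(int(record["amount"]))),  # rough magnitude
    }

raw = {"amount": 2500.0, "timestamp": "2021-06-05T23:40:00"}
features = engineer_features(raw)
```

None of these derived variables exist in the raw record, yet each encodes a pattern a domain expert already knows matters, which is exactly the point of feature engineering.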
After training a model on a given data set, error analysis can help you find a subset of the data set to improve. By performing this process repeatedly, you can gradually improve your data's quality and, as a result, your model's performance.
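One common way to find such a subset is to slice the data by metadata and rank slices by error rate. A minimal sketch, assuming each record carries a hypothetical "slice" tag and the trained model is available as a plain prediction function:

```python
def error_rates_by_slice(records, predict):
    """Rank data slices by misclassification rate, worst first.

    records: list of dicts with 'features', 'label' and 'slice' keys
             (hypothetical format).
    predict: a trained model's prediction function.
    """
    stats = {}  # slice -> (total, errors)
    for r in records:
        total, errors = stats.get(r["slice"], (0, 0))
        wrong = predict(r["features"]) != r["label"]
        stats[r["slice"]] = (total + 1, errors + int(wrong))
    rates = {s: err / tot for s, (tot, err) in stats.items()}
    return sorted(rates.items(), key=lambda kv: kv[1], reverse=True)

# Toy example: a deliberately bad "model" that always predicts class 0.
records = [
    {"features": [0.1], "label": 0, "slice": "hospital_A"},
    {"features": [0.9], "label": 1, "slice": "hospital_A"},
    {"features": [0.8], "label": 1, "slice": "hospital_B"},
    {"features": [0.7], "label": 1, "slice": "hospital_B"},
]
ranked = error_rates_by_slice(records, predict=lambda x: 0)
# hospital_B has the highest error rate, so its data is improved first.
```

The worst-ranked slice becomes the target of the next iteration: relabel it, collect more of it, or augment it, then retrain and re-rank.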
In model-centric AI, domain experts are typically not involved as data is considered to be a given artifact.
However, domain knowledge plays a vital role in data-centric AI, because domain experts can often spot inconsistencies in the data that would otherwise go unnoticed, which may lead to better results.
Though most data-centric AI ideas already exist as conventional wisdom among AI engineers, data-centric AI aims to build a systematic approach and tools to facilitate the process. A typical life cycle of a data-centric AI project is illustrated below:
As shown in the figure, data-centric AI is an iterative process, where analysis of training and deployment outcomes can send you back to the data collection and model training phases to observe and correct issues in the data.
To help AI engineers adopt data-centric AI in their projects, the AI community has developed various tools. These include:
Data-centric AI prioritizes data quality over quantity. Compared to model-centric AI, which seeks to engineer performance gains by expanding data sets, a data-centric approach can help mitigate many of the challenges that can arise when deploying AI infrastructure.