Synthetic data is often treated as a lower-quality substitute, used when real data is inconvenient to obtain, expensive, or constrained by regulation. However, this view misses the true potential of synthetic data: Gartner estimates that by 2030, synthetic data will completely overshadow real data in AI models.
Gartner analysts will discuss use cases and the prospect of synthetic data at upcoming Gartner Data & Analytics Summits, taking place regionally from August through November.
We caught up with Alexander Linden, VP Analyst at Gartner, to understand the promise of synthetic data and why it is paramount for the future of AI. Members of the media who would like to attend the upcoming conferences and/or speak with Alexander can contact Laurence Goasduff.
Q: What is the promise of synthetic data, and when should it be used?
A: Synthetic data is a class of data that is artificially generated, in contrast with real data, which is directly observed from the real world. While real data is almost always the best source of insights, it is often expensive, imbalanced, unavailable or unusable due to privacy regulations. Synthetic data can be an effective supplement or alternative to real data, providing access to better-annotated data to build accurate, extensible AI models. When combined with real data, synthetic data creates an enhanced dataset that often can mitigate the weaknesses of the real data.
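As a minimal illustration of combining real and synthetic records, the sketch below fits a deliberately naive per-column Gaussian to a small (hypothetical, randomly generated) "real" dataset and samples synthetic rows from it. Production synthetic-data tools model joint distributions and correlations; this only shows the supplement-real-with-synthetic workflow in principle.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "real" dataset: 200 records of (age, income) observations.
real = np.column_stack([
    rng.normal(40, 10, 200),          # age
    rng.normal(55_000, 12_000, 200),  # income
])

# Fit a simple per-column Gaussian to the real data and draw synthetic
# records from it (a naive generator for illustration only).
mean, std = real.mean(axis=0), real.std(axis=0)
synthetic = rng.normal(mean, std, size=(1000, real.shape[1]))

# Combine real and synthetic rows into an enhanced training set.
augmented = np.vstack([real, synthetic])
print(augmented.shape)  # (1200, 2)
```

The enhanced dataset is five times larger than the original, while preserving the marginal statistics of the observed columns.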
Organizations can use synthetic data to test a new system where no live data exists or when existing data is biased. They can also take advantage of synthetic data to supplement small, existing datasets that are currently being ignored. Alternatively, they can choose synthetic data when real data can’t be used, can’t be shared or can’t be moved. In that sense, synthetic data is one further AI enabler.
Q: Why is synthetic data essential for the future of AI?
A: There are many other forms of “data synthesis,” such as data augmentation and pseudonymization/anonymization. Those methods are a must-have in any modern data science team. But with synthetic data, professionals inject information into their AI models and obtain artificially generated data that is more valuable than direct observation.
Synthetic data can be used for hackathons, product demos and internal prototyping to replicate a set of data with the right statistical attributes. For example, banks and financial services institutions use synthetic data by setting up multiagent simulations to explore market behaviors (such as pension investments and loans), to make better lending decisions or to combat financial fraud. Retailers use synthetic data for autonomous check-out systems, cashierless stores or analysis of customer demographics.
In addition, synthetic data can increase the accuracy of machine learning models. Real-world data is happenstance and does not contain all permutations of conditions or events possible in the real world. Synthetic data can counter this by generating data at the edges, or for conditions not yet seen.
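One way to "generate data at the edges" is to oversample the few observed rare events with small random perturbations, a simplified, SMOTE-like idea (real SMOTE interpolates between nearest neighbors). The sketch below uses hypothetical, randomly generated data to show how a handful of rare records can be expanded into a balanced training set.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical imbalanced dataset: 500 "normal" events, only 5 "rare" ones.
normal = rng.normal(0.0, 1.0, size=(500, 3))
rare = rng.normal(4.0, 0.5, size=(5, 3))

def oversample_with_jitter(samples, n_new, noise=0.1, rng=rng):
    """Create synthetic rare-event records by resampling the observed
    rare events and adding small Gaussian noise around each one."""
    idx = rng.integers(0, len(samples), size=n_new)
    jitter = rng.normal(0.0, noise, size=(n_new, samples.shape[1]))
    return samples[idx] + jitter

# Generate 495 synthetic rare events so both classes have 500 examples.
synthetic_rare = oversample_with_jitter(rare, n_new=495)
balanced = np.vstack([normal, rare, synthetic_rare])
print(balanced.shape)  # (1000, 3)
```

A model trained on the balanced set sees rare conditions as often as common ones, rather than almost never.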
The breadth of its applicability will make it a critical accelerator for AI. Synthetic data makes AI possible where lack of data makes AI unusable due to bias or inability to recognize rare or unprecedented scenarios.
Q: What are the risks of synthetic data?
A: While synthetic data techniques can score quite highly on cost-effectiveness and privacy, they do have significant risks and limitations. The quality of synthetic data often depends on the quality of the model that created it and of the dataset used to develop that model.
Using synthetic data requires additional verification steps, such as comparing model results against human-annotated, real-world data, to ensure the fidelity of results. In addition, synthetic data may be misleading and can lead to inferior results, and it may not be 100% fail-safe when it comes to privacy.
Because of these technological challenges, user skepticism may be another hard challenge for synthetic data to overcome, as users may perceive it to be “inferior” or “fake” data.
Finally, as synthetic data gains broader adoption, business leaders may raise questions on the openness of the data generation techniques, especially when it comes to transparency and explainability.
Gartner analysts will provide additional analysis on the future of synthetic data at the Gartner Data & Analytics Summits 2022, taking place August 22-24 in Orlando, FL; September 14-16 in Tokyo; September 19-20 in Mumbai; and November 7-8 in Sydney. Follow news and updates from the conferences on Twitter using #GartnerDA.
Gartner clients can find more information in the report Emerging Technologies: When and How to Use Synthetic Data. Learn about the top priorities for data & analytics leaders in 2022 in the complimentary Gartner ebook 2022 Leadership Vision for Data & Analytics Leaders.
If you are a member of the media who would like to speak further on this topic with Alexander Linden, please contact Laurence Goasduff at Laurence.Goasduff@Gartner.com. Members of the media can reference this material in their articles with proper attribution to Gartner.