The Data Daily

AI And Machine Learning: How Much Data Is Enough?

It's no secret that artificial intelligence is on every executive's mind. In enterprises across the globe, business leaders are talking about how to take advantage of AI insights and build new revenue streams.

Six out of 10 C-level executives surveyed by Forbes Insights believe AI is a key enabler of their organization's future success. Four out of five of those organizations have AI programs in place or are currently piloting them; 74 percent have 10 or more separate initiatives underway.

The oxygen that breathes life into these programs is data. You can't build, validate, or measure the success of a machine learning model without the right amounts and types of data, notes Yiwen Huang, CEO of r2.ai, an automated machine learning platform.

"The amount of data you have is important, but what's more important is whether the attributes of that data and its distribution represent the population you're going after," he says.

For many organizations, the sheer volume of data they have is a problem. It's not just a matter of needing more and more data but also of understanding how to manage and value the data you already have, says Ray O'Farrell, executive vice president and CTO of VMware. The bigger issue is identifying and organizing the data that has the most value to the organization, understanding where you might be missing some data, and then managing it in a way that is consistent and compliant with privacy, security, and, frequently, ethical rules.

"Questions I hear quite often from our customers are, 'Do we have the right types of data to fuel our AI programs' and, 'How do we value it?'" says O'Farrell. "Organizations know there's value in the data they've collected, but they don't always know how to quantify that value or how to extract it."

The value of this data may also change over time. Information often loses value as it ages or goes out of date, but that's not always the case, says O'Farrell. The value of a data set may suddenly rise or fall in response to an external event. Take autonomous vehicles, for example. With video, LiDAR, and other sensors, testing and training driverless cars generates and archives enormous amounts of data, the vast majority of which is repetitive and offers little immediate value.

"Ninety percent of the time nothing interesting happens," he says. "But at a future time, one of those vehicles is in an accident: now the manufacturer needs to go back and understand whether it validated the safety model correctly. The value of that data is now suddenly much higher, possibly even from a legal POV."

To be safe, many organizations seem to default to keeping all data in case they might one day need it, O'Farrell says. But the more data organizations keep, the more resources they must expend to store and secure it. The larger that data trove becomes, the more tempting a target it is for external attackers. And when that data is controlled by multiple siloed units within a company, the potential for compliance issues and data breaches grows significantly.

While organizations collect massive amounts of data, they aren't necessarily keeping it all in the same place or handling it in the same way.

In the Forbes Insights survey, 68 percent of companies are actively building a company-wide roadmap for handling data, but only 11 percent have completed it. Just two percent say they have a solid data governance process in place across the enterprise. And that's often a direct consequence of information silos.

In most organizations, the information technology department has a good data strategy in place, says O'Farrell. But IT often isn't the only group collecting data. If an organization has deployed IoT sensors, for example, that data is usually handled by a separate operations technology team. Meanwhile, sales and marketing typically collect and manage customer data.

Not only do disparate groups end up managing data under different policies, they may also store it in different places: on premises, in the cloud, or at the edge.

Those decisions are typically based on the nature of the data and how it's used, says O'Farrell. If you're serving up a million videos a day to people's phones, you want that data in the cloud and distributed widely. If you're securing customer or employee records, you're more likely to keep them close to home in your data center; if they are in the cloud, you need assurances of sovereignty and security, including where the data is located and whether it can be deleted.

But these silos make it harder for organizations to apply analytics across their entire data set, leading to the potential for conflicts and compliance issues.

Enterprise data is also likely to be mirrored, replicated, or backed up in multiple places, notes O'Farrell. Under the EU's General Data Protection Regulation (GDPR), an organization's customers have some rights to ask that their information be “forgotten”. The sheer number of places where that data might reside makes fulfilling such a request enormously complicated. The data storage problem becomes more acute when dealing with industries that are highly regulated, such as financial services or healthcare.
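To see why replication complicates an erasure request, consider a toy model in which every copy of the data must be found and cleared. This is a minimal sketch under loud assumptions: each store is an in-memory dictionary, and the store names (`crm_primary`, `analytics_replica`, `nightly_backup`) and the `erase_customer` helper are hypothetical. Real systems would also have to deal with backups that cannot be edited in place, logs, and third-party processors.

```python
def erase_customer(stores, customer_id):
    """Remove a customer's record from every store and report where copies were found.

    `stores` maps a store name to a dict of {customer_id: record}.
    Returns the names of the stores that held a copy, in iteration order.
    """
    touched = []
    for name, records in stores.items():
        if customer_id in records:
            del records[customer_id]
            touched.append(name)
    return touched


# Hypothetical stores: the same customer is replicated in three places.
stores = {
    "crm_primary": {"c-101": {"email": "a@example.com"}, "c-102": {"email": "b@example.com"}},
    "analytics_replica": {"c-101": {"email": "a@example.com"}},
    "nightly_backup": {"c-101": {"email": "a@example.com"}, "c-102": {"email": "b@example.com"}},
}

found_in = erase_customer(stores, "c-101")
print(found_in)
```

The loop only works because the inventory of stores is complete. A copy held by a team whose store is missing from that inventory would survive the request, which is the silo problem described above.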

"Companies are uncomfortable with having multiple sources of truth," O'Farrell says. "They're uncomfortable not knowing where their data is, or whether it's private and secure. Companies are starting to ask, 'How do I unify and coordinate this in some fashion?'"

The challenges posed by the data deluge lead some enterprises to seek out a new class of executives who combine business acumen with analytics expertise, says Scott Snyder, a partner with Heidrick & Struggles, an executive search and consulting firm based in Chicago. 

"At Heidrick we place a lot of leadership positions like Chief AI Officer and Chief Data Officer," Snyder continues. "We're also looking for people with data-intensive backgrounds who can graft those skills with institutional or functional knowledge, such as HR, legal, or supply chain. Companies are very interested in finding those kinds of leaders."

While O'Farrell doesn't necessarily see the need for a chief data officer, he says every large organization needs someone who can look at these issues from a 30,000-foot view to ensure that its data sets are up to date and rely on the best sources of information.

"What I've seen is the emergence of a digital officer as a subset of the CIO organization," he says. "As organizations extract ever more insights from all this data, I definitely see the need for someone who can look at an organization's data policies on a macro level, to make sure the data is valid and secure, its source is known and ensure they're using the data in a legally and ethically compliant fashion."
