Moving data to the cloud can bring immense operational benefits. However, the sheer volume and complexity of today's enterprise data can cause downstream headaches for data users.
Semantics, context, and a record of how data is tracked and used matter even more as you work toward post-migration goals. This is why, when data moves, organizations must prioritize data discovery.
In today’s AI/ML-driven world of data analytics, explainability needs a repository, just as those doing the explaining need access to metadata, that is, information about the data being used. Data discovery is also critical for data governance, which, when ineffective, can hinder organizational growth. And as organizations progress and grow, “data drift” starts to affect data usage, models, and the business itself.
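Data drift can be made concrete with a simple monitoring check. The sketch below computes a Population Stability Index (PSI), a widely used drift metric, between a baseline sample and a newer one. The function name, bin count, synthetic data, and thresholds in the comments are illustrative assumptions, not part of any specific product.

```python
import math
import random

def psi(expected, actual, bins=10):
    """Population Stability Index between two numeric samples.
    Bin edges are derived from the 'expected' (baseline) sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins
    edges = [lo + i * width for i in range(1, bins)]

    def fractions(sample):
        counts = [0] * bins
        for x in sample:
            counts[sum(1 for e in edges if x >= e)] += 1
        n = len(sample)
        # A small floor avoids log(0) when a bin is empty
        return [max(c / n, 1e-4) for c in counts]

    e_frac, a_frac = fractions(expected), fractions(actual)
    return sum((a - e) * math.log(a / e) for e, a in zip(e_frac, a_frac))

random.seed(0)
baseline = [random.gauss(0, 1) for _ in range(5000)]
stable = [random.gauss(0, 1) for _ in range(5000)]
shifted = [random.gauss(0.8, 1) for _ in range(5000)]  # mean has drifted

print(f"stable PSI:  {psi(baseline, stable):.3f}")   # typically below 0.1
print(f"shifted PSI: {psi(baseline, shifted):.3f}")  # typically above 0.25
```

A common rule of thumb treats PSI below 0.1 as stable and above 0.25 as significant drift; in practice a check like this would run on each model feature whenever new data lands.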
This two-part article will explore how the challenges of data discovery, fragmented data governance, ongoing data drift, and ML explainability can all be overcome with a data catalog that keeps accurate records of data and metadata.
With the onslaught of AI/ML, data volumes, cadence, and complexity have exploded. Cloud providers like Amazon Web Services, Microsoft Azure, Google Cloud, and Alibaba not only provide capacity beyond what the data center can, but their current and emerging capabilities and services also drive the execution of AI/ML away from the data center.
The future lies in the cloud. A cloud-ready data discovery process can ease your transition to cloud computing and streamline processes upon arrival. So how do you take full advantage of the cloud? Migration leaders would be wise to enable all the enhancements a cloud environment offers, including:
Once migration is complete, your data scientists and engineers must have the tools to search, assemble, and manipulate data sources through the following techniques and tools.
Taken together, these techniques enable everyone to trust the data and the insights of their peers. A cloud environment with such features will support collaboration across departments and standard data formats, including CSV, JSON, XML, Avro, Parquet, Hyper, and TDE.
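To illustrate why uniform format support matters, the sketch below normalizes records from two of the formats named above, CSV and JSON, into one common list of records using only the Python standard library; the sample data and function names are hypothetical.

```python
import csv
import io
import json

def records_from_csv(text):
    """Parse CSV text into a list of dicts keyed by header row."""
    return list(csv.DictReader(io.StringIO(text)))

def records_from_json(text):
    """Parse a JSON array of objects into a list of dicts."""
    return json.loads(text)

# Two hypothetical sources describing the same kind of entity
csv_src = "id,region,sales\n1,EMEA,120\n2,APAC,95\n"
json_src = '[{"id": "3", "region": "AMER", "sales": "210"}]'

# Once normalized, records from either format can be searched,
# assembled, and manipulated with the same downstream code
records = records_from_csv(csv_src) + records_from_json(json_src)
print(len(records))           # 3
print(records[0]["region"])   # EMEA
```

In a real catalog-backed environment this normalization layer would also carry the metadata (source, schema, load time) that makes the combined records trustworthy.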
The vision of Big Data freed organizations to capture more data sources at lower levels of detail and in vastly greater volumes. The trouble was that this collection exposed a far more complex problem: semantic dissonance.
For example, data science always consumes "historical" data, and there is no guarantee that the semantics of older datasets are the same, even if their names are unchanged. Pushing data to a data lake and assuming it is ready for use is shortsighted.
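A first line of defense is comparing dataset snapshots over time. The sketch below diffs two hypothetical schema snapshots to flag added, removed, and retyped columns; note that it catches only structural drift, not the semantic drift described above, in which a column keeps its name and type but quietly changes meaning.

```python
def schema_diff(old, new):
    """Compare two dataset snapshots, each given as a dict of
    column name -> type name, and report structural changes."""
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "retyped": sorted(c for c in set(old) & set(new) if old[c] != new[c]),
    }

# Hypothetical snapshots of the "same" dataset two years apart
schema_2022 = {"customer_id": "int", "revenue": "float", "region": "str"}
schema_2024 = {"customer_id": "str", "revenue": "float", "segment": "str"}

diff = schema_diff(schema_2022, schema_2024)
print(diff)
# {'added': ['segment'], 'removed': ['region'], 'retyped': ['customer_id']}
```

A data catalog automates this kind of comparison and, crucially, attaches the business definitions that structural checks alone cannot verify.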
Organizations launched initiatives to be "data-driven" (though we at Hired Brains Research prefer the term "data-aware"). They strove to ramp up skills in predictive modeling, machine learning, AI, or even deep learning. And, of course, the existing analytics could not be left behind, so any solution must also satisfy those requirements. Integrating data from your own ERP and CRM systems may be a chore, but for today's data-aware applications, the fabric of data is multi-colored.
The primary issue is that enterprise data no longer exists solely in a data center, or even in a single cloud; it is spread across multiple clouds, data centers, and combinations of both.
Edge analytics for IoT, for example, capture, digest, curate, and even pull data from other application platforms and live connections to partners (previously a snail-like exercise using obsolete processes like EDI). Edge computing can be decentralized across on-premises systems, cellular networks, data centers, or the cloud. As a result, data can originate in far-flung environments where its structures and semantics are not well understood or documented.
Problems arise when data sources are semantically incompatible, and valuable analytics are often derived by drawing on multiple sources. The challenge of smoothly moving data and logic while everything is in motion is too great for manual methods.