
Data Discovery Evolving Into Information Relationship Mapping Leveraging Machine Learning

What started as early analysis of individual data sources has evolved into far more robust ways of analyzing information and the relationships between different fields and information sources. Data discovery is another area where machine learning (ML) is beginning to make inroads.

Twenty years ago, data discovery was the term for the early analytics needed to better understand data. For instance, Evoke Software was a company that analyzed large volumes of customer data. It used metadata to understand field content and find trends and exceptions, and it also examined raw data, using algorithms to identify field boundaries in older or poorly documented data sources. The term was never quite accurate, as the systems were analyzing data in order to find information. I've always preferred the term knowledge discovery, but data too often remains the key word.
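Evoke's actual algorithms were never published, but the field-boundary idea can be illustrated with a minimal Python sketch, using made-up data: character positions that are blank in every record of an undocumented fixed-width file are treated as likely field separators.

```python
# Minimal sketch: guess field boundaries in an undocumented fixed-width file.
# Heuristic only; real profiling tools use much richer statistics.
def guess_boundaries(records):
    width = max(len(r) for r in records)
    padded = [r.ljust(width) for r in records]
    # A character column that is blank in every record likely separates fields.
    blank = [all(row[i] == " " for row in padded) for i in range(width)]
    fields, start = [], 0
    for i in range(width):
        if blank[i]:
            if start < i:
                fields.append((start, i))  # (start, end) offsets of one field
            start = i + 1
    if start < width:
        fields.append((start, width))
    return fields

rows = [
    "1001 SMITH      NY",
    "1002 NGUYEN     CA",
    "1003 OKAFOR     TX",
]
print(guess_boundaries(rows))  # [(0, 4), (5, 11), (16, 18)]
```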

Advances in hardware and software in the intervening years have greatly improved the performance of that traditional style of data discovery.

The first expansion came with the realization of metadata's importance. Using manually defined relationships between fields in multiple sources, the knowledge discovery market began to understand how different systems use what should be identical data, and to help corporate IT and management mediate and reconcile those systems.
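To make that concrete, here is a minimal sketch of such a manually maintained field mapping and a consistency check built on it; the system names, field names, and records are invented for illustration.

```python
# Hypothetical mapping: which fields in different systems should hold the
# same data, plus a check that flags records where the values disagree.
FIELD_MAP = {  # canonical name -> field name in each system
    "customer_id": {"billing": "cust_no", "crm": "customer_id"},
    "postal_code": {"billing": "zip", "crm": "postcode"},
}

def find_mismatches(billing_rec, crm_rec):
    """Compare a billing record and a CRM record for the same customer."""
    mismatches = []
    for name, fields in FIELD_MAP.items():
        b = billing_rec.get(fields["billing"])
        c = crm_rec.get(fields["crm"])
        if b != c:
            mismatches.append((name, b, c))
    return mismatches

billing = {"cust_no": "1001", "zip": "10001"}
crm = {"customer_id": "1001", "postcode": "10010"}
print(find_mismatches(billing, crm))
# [('postal_code', '10001', '10010')] -> a reconciliation task for IT
```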

Then the web arrived, massively expanding the amount of data to analyze. It wasn't just the volume of information that grew; the types of information being gathered expanded as well. The main reason was that the new information concerned consumer choices and timing, details that could not be tracked with anywhere near that precision before the advent of e-commerce.

Companies wanted to monetize that information, primarily by selling it to retailers and consumer packaged goods manufacturers. While companies such as Google and Facebook could sell the raw data they collected, finding demographic information and relationships between types of transactions (the "if you like X, you may like Y" cross-selling pitch) added value to the information and raised the price.
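How those cross-sell relationships are computed varies by vendor, but one common baseline is simple co-occurrence counting across transactions. The sketch below uses invented data; production systems use far more sophisticated models.

```python
# Baseline for "if you like X, you may like Y": count how often pairs of
# items appear in the same transaction and recommend frequent partners.
from collections import Counter
from itertools import combinations

transactions = [
    {"camera", "sd_card", "tripod"},
    {"camera", "sd_card"},
    {"tripod", "backpack"},
]

pair_counts = Counter()
for basket in transactions:
    for a, b in combinations(sorted(basket), 2):
        pair_counts[(a, b)] += 1

def also_bought(item, top=3):
    """Items most often purchased together with `item`."""
    partners = Counter()
    for (a, b), n in pair_counts.items():
        if a == item:
            partners[b] += n
        elif b == item:
            partners[a] += n
    return partners.most_common(top)

print(also_bought("camera"))  # [('sd_card', 2), ('tripod', 1)]
```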

Managers might also have heard of a newer database type called a graph database. It differs from the rows and columns of a standard relational database by storing nodes, such as the core information about a person, as single entities and linking those nodes with edges, such as the relationships among the nodes Bob and Carol and Ted and Alice (yes, that's a movie reference). Such graph links are very useful in analyzing social media and other information where different objects can be defined as nodes.
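The node-and-edge model is easy to see in plain code. The following minimal Python sketch mimics the structure rather than any particular graph database's API: nodes carry properties, and edges link pairs of nodes with a named relationship.

```python
# Minimal sketch of the graph model, not a real graph database API:
# nodes carry properties; edges link nodes with a named relationship.
nodes = {
    "bob":   {"name": "Bob",   "city": "LA"},
    "carol": {"name": "Carol", "city": "LA"},
    "ted":   {"name": "Ted",   "city": "SF"},
    "alice": {"name": "Alice", "city": "SF"},
}
edges = [
    ("bob", "married_to", "carol"),
    ("ted", "married_to", "alice"),
    ("bob", "friend_of", "ted"),
]

def neighbors(node, relation=None):
    """All nodes connected to `node`, optionally filtered by edge type."""
    out = []
    for src, rel, dst in edges:
        if relation and rel != relation:
            continue
        if src == node:
            out.append(dst)
        elif dst == node:
            out.append(src)
    return out

print(neighbors("bob"))                # ['carol', 'ted']
print(neighbors("bob", "married_to"))  # ['carol']
```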

Programmers and analysts explicitly define many relationships between objects. One problem with that is the age-old issue of documentation: many programmers are remiss in documenting their work. That, combined with the complexity of multiple systems, makes it difficult to find and define relationships between information held in different systems.

Machine learning systems can help in two key ways. First, by rapidly and intelligently traversing a company's systems, they can interrogate indices and other metadata, building a model of the defined relationships and helping an analyst identify consistency and quality issues. Second, an ML system can look at the data without preconceptions and find new relationships between entities and data.
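Io-Tahoe's specific methods aren't described here, so the following Python sketch shows the two modes generically, with invented table and column names: first harvest relationships already declared in metadata, then propose undeclared links by measuring how much of one column's values appear in another (a common containment heuristic).

```python
# Mode 1: harvest relationships the schema already declares (foreign keys).
# Mode 2: propose undeclared joins from value containment between columns.
# All table and column names are hypothetical.
declared = [
    ("orders.customer_id", "customers.id"),
]

columns = {  # sampled values per column
    "orders.customer_id": {"1001", "1002", "1003"},
    "customers.id":       {"1001", "1002", "1003", "1004"},
    "tickets.account":    {"1001", "1003", "9999"},
}

def candidate_links(columns, threshold=0.6):
    """Propose column pairs where most of A's values also occur in B."""
    found = []
    for a, va in columns.items():
        for b, vb in columns.items():
            if a == b or not va:
                continue
            containment = len(va & vb) / len(va)
            if containment >= threshold:
                found.append((a, b, round(containment, 2)))
    return found

discovered = candidate_links(columns)
new_links = [c for c in discovered if (c[0], c[1]) not in declared]
print(new_links)  # includes ('tickets.account', 'customers.id', 0.67)
```

An analyst would review the proposed links; a high containment score suggests a joinable relationship that no one ever documented.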

The associated graphic, provided by Io-Tahoe, shows how the two overlap: the dimmer lines are the defined relationships, and the brighter lines show the ML-discovered relationships. Note that relationship building isn't only for graph databases; the vast majority of enterprise data is still in tabular form.

The result is an integrated picture of the information in corporate systems, a more holistic view of corporate information. Data discovery is evolving into the more robust idea of information relationship mapping, and that evolution will be aided by machine learning.
