The Data Daily

Forrester changed the way they think about data catalogs, and here’s what you need to know | 7wData

As we predicted at the beginning of this year, metadata is hot in 2022 — and it’s only getting hotter.

But this isn’t the old-school idea of metadata we all know and hate: the IT “data inventories” that take 18 months to set up, the monolithic systems that only work when ruled by dictator-like data stewards, and the siloed data catalogs that are the last thing you want to open in the middle of working on a data dashboard or pipeline.

The data industry is in the middle of a fundamental shift in how we think about metadata. In the past year or two, we’ve seen a slew of new concepts emerge to capture this shift (e.g. the metrics layer, modern data catalogs, and active metadata), all backed by major analysts and companies in the data space.

Now we’ve got the latest sign of this shift. This summer, Forrester scrapped its Wave report on “Machine Learning Data Catalogs” to make way for one on “Enterprise Data Catalogs for DataOps”. Here’s everything you need to know about where this change came from, why it happened, and what it means for modern metadata.

In the earliest days of big data, companies’ biggest challenge was simply keeping track of all the data they now had. IT teams were tasked with creating an “inventory of data” that listed a company’s stored data and its metadata. But in this Data Catalog 1.0 era, companies spent more time implementing and updating these tools than actually using them.

In the early 2010s, there was a big shift — the Data Catalog 2.0 era emerged. This brought a greater focus on data stewardship and integrating data with business context to create a single source of truth that went beyond the IT team. At least, that was the plan. These 2.0 data catalogs came with a host of problems, including rigid data governance teams, complex technology setup, lengthy implementation cycles, and low internal adoption.

Today, metadata platforms are becoming more active, data teams are becoming more diverse than ever, and metadata itself is becoming big data. These changes have brought us to Data Catalog 3.0, a new generation of data governance and metadata management tools that promise to overcome past cataloging challenges and supercharge the power of metadata for modern businesses.

Last year, Gartner scrapped its old categorization of data catalogs in favor of one that reflects this fundamental shift in how we think about metadata. Now Forrester has made a similar move, defining this new category on its own terms.

One of the biggest challenges with Data Catalog 2.0s was adoption — no matter how it was set up, companies found that people rarely used their expensive data catalog. For a while, the data world thought that Machine Learning was the solution. That’s why, until recently, Forrester’s reports focused on evaluating “Machine Learning Data Catalogs”.

However, in early 2022, Forrester dropped machine learning in its Now Tech report. It explained that even as ML-based systems became ubiquitous, the problems they were meant to solve persisted. Although machine learning allowed data architects to get a clearer picture of the data within their organization, it didn’t fully address modern challenges around data management and provisioning.

The key change: “conceptual data understanding” via a data wiki alone is no longer enough. Instead, data teams need a catalog built to enable DataOps, one that gives them in-depth information about, and control over, their data to “build data-driven applications and address data flow and performance”.

So what actually is an enterprise data catalog for DataOps (EDC)?

According to Forrester, “[enterprise] data catalogs create data transparency and enable data engineers to implement DataOps activities that develop, coordinate, and orchestrate the provisioning of data policies and controls and manage the data and analytics product portfolio.”

There are three key ideas that distinguish EDCs from the earlier Machine Learning Data Catalogs.
