
Data mesh: a true paradigm shift?


Data mesh, data governance, data fabric, data access management, lineage, observability, orchestration… the ‘governance’ layer of the modern data stack has been attracting growing attention and debate (and confusing terminology).

Whilst the concept of data governance is not new, the emergence of data mesh — a federated, decentralised approach to data governance, coined by Zhamak Dehghani in 2019 — is more recent. This has led organisations to rethink their orchestration and observability layers, and inspired a wave of innovation in the space. But is data mesh just another trend, or does it represent a true paradigm shift?

In order to answer this question and identify the most exciting opportunities for investment, let’s start with the basics and break down the fundamentals of data governance. We’ll look at:

- What data governance and data mesh actually mean
- The orchestration, observability and access management layers that support them
- The companies innovating in the space

In the piece below, I unpack what’s actually happening, and what it means for investors. Please feel free to share your thoughts and questions in the comments.

We can think of data governance as a broad umbrella covering who should manage access to data pipelines, how those pipelines are monitored, and how data is shared. Put simply, data governance means setting internal standards for how data should be gathered, stored, processed and disposed of.

Data mesh is a federated approach within data governance, centred on distributed, decentralised enterprise data management. It treats datasets as federated products, oriented around business domains: each domain-specific dataset has its own engineers and product owners to manage it. This in turn allows for a level of self-service across an organisation.

With data mesh, a team could be composed of software engineers, analysts and data scientists, all working together to build operational and analytical data products.

Data mesh could therefore be interpreted as the opposite of current data platforms, which are more centralised and often built around complex pipelines. But whilst this federated approach removes the blockers that come with centralised systems, it can still be used alongside traditional storage systems like data warehouses or data lakes. It simply means that usage shifts from a single, centralised data platform to multiple decentralised ones.
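To make the ‘datasets as products’ idea concrete, here is a minimal sketch (in Python) of what a domain-owned data product descriptor might look like. The field names and values are illustrative assumptions, not a standard or any particular vendor’s API:

```python
from dataclasses import dataclass, field

@dataclass
class DataProduct:
    """A hypothetical descriptor for one domain-owned data product."""
    name: str                      # e.g. "orders.daily_revenue"
    domain: str                    # the owning business domain
    owner: str                     # the accountable product owner
    output_port: str               # where consumers read it (table, topic, API)
    freshness_sla_hours: int = 24  # how stale the data is allowed to get
    tags: list[str] = field(default_factory=list)

# Each domain team publishes and maintains its own products,
# enabling self-service discovery by the rest of the organisation.
orders_revenue = DataProduct(
    name="orders.daily_revenue",
    domain="orders",
    owner="orders-data-team@example.com",
    output_port="warehouse.orders.daily_revenue",
    freshness_sla_hours=6,
    tags=["finance", "pii-free"],
)

print(f"{orders_revenue.name} is owned by the {orders_revenue.domain} domain")
```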

Why the growing need? According to studies, only 32% of IT leaders realise tangible value from data, and 77% of them integrate up to 5 different types of data in their pipelines. 65% of organisations are using at least 10 different data engineering tools. And another 94% of organisations would like to deploy a data catalogue, yet only around a third say that their data catalogue has met their expectations.

In this light, we can break data mesh down into four main sub-categories, mirroring its founding principles: domain-oriented ownership, data as a product, self-serve data infrastructure, and federated computational governance.

Data mesh is complex to navigate, so specific orchestration and observability players are needed to keep up with the wave of decentralisation it brings. But what do we mean when we talk about orchestration, observability and access management, and why do they matter?

Data orchestration is the process of taking siloed data from multiple data storage locations, organising it, and making it easily accessible for analysis.

In this sense, we can group lineage under data orchestration: lineage is the process of understanding data as it flows through complex pipelines, from source to consumption.
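As a rough illustration (not tied to dbt, Fivetran, or any other specific tool), the Python sketch below shows the two ideas side by side: an orchestrator running pipeline steps in dependency order, and the lineage graph that falls out of recording which step feeds which. The pipeline steps are hypothetical:

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# A toy pipeline: each step lists the steps it depends on.
pipeline = {
    "extract_orders":   [],
    "extract_payments": [],
    "join_revenue":     ["extract_orders", "extract_payments"],
    "revenue_report":   ["join_revenue"],
}

def run_step(name: str) -> None:
    print(f"running {name}")  # real transformation work would go here

# Orchestration: execute steps in a valid dependency order.
for step in TopologicalSorter(pipeline).static_order():
    run_step(step)

# Lineage: the same graph, read backwards, answers
# "where did this output come from?"
def upstream(node: str) -> set[str]:
    deps = set(pipeline[node])
    for dep in pipeline[node]:
        deps |= upstream(dep)
    return deps

print("revenue_report lineage:", sorted(upstream("revenue_report")))
```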

Companies in this data orchestration category include dbt and Balderton portfolio company Kili Technology, as well as companies focused on extraction and ingestion, like Fivetran and Airbyte. See my colleague Sivesh’s post for visual representations of this value chain.

Meanwhile, data observability refers to an organisation’s ability to fully understand the health of the data in its systems. It works by applying DevOps observability best practices, such as automated monitoring and alerting, to eliminate data downtime, and it is commonly broken down into five main pillars: freshness, distribution, volume, schema and lineage.
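Two of these pillars, freshness and volume, are simple to express as checks. Below is a minimal, hypothetical sketch of such monitoring; the thresholds, table names and hard-coded values are assumptions made to keep it self-contained:

```python
from datetime import datetime, timedelta, timezone

def check_freshness(last_loaded_at: datetime, max_age: timedelta) -> bool:
    """Freshness: has the table been updated recently enough?"""
    return datetime.now(timezone.utc) - last_loaded_at <= max_age

def check_volume(row_count: int, expected: int, tolerance: float = 0.2) -> bool:
    """Volume: is today's row count within the expected band (default +/-20%)?"""
    return abs(row_count - expected) <= tolerance * expected

# In a real deployment these values would come from warehouse metadata;
# they are hard-coded here purely for illustration.
last_loaded = datetime.now(timezone.utc) - timedelta(hours=30)

if not check_freshness(last_loaded, max_age=timedelta(hours=24)):
    print("ALERT: orders table is stale")       # page the owning domain team
if not check_volume(row_count=480, expected=1000):
    print("ALERT: orders row count anomalous")  # possible data downtime
```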

Finally, data access management players help manage access to data, making it more secure, usable and available. Players like Atlan, Alation, Privitar and BigID allow for data cataloguing. Data catalogues are collections of metadata, search tools and data management features that help users find the data they need within an organisation. These businesses are more or less focused on one of the four data mesh sub-categories above: they can provide data discovery tools aimed at helping users understand the context around data, connect to data warehouses and business intelligence tools, update data documentation, and so on. See here for a more detailed comparison of feature differences.
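At their core, such catalogues index metadata rather than the data itself. Here is a stripped-down sketch of that idea, with entirely hypothetical entries and none of the features of the real products above:

```python
# A toy metadata catalogue: entries describe datasets, not their contents.
catalog = [
    {"name": "orders.daily_revenue", "owner": "orders-team",
     "description": "Daily revenue per market", "tags": ["finance"]},
    {"name": "crm.contacts", "owner": "sales-ops",
     "description": "Customer contact records", "tags": ["pii", "crm"]},
]

def search(query: str) -> list[dict]:
    """Return catalogue entries whose metadata mentions the query."""
    q = query.lower()
    return [
        entry for entry in catalog
        if q in entry["name"].lower()
        or q in entry["description"].lower()
        or any(q in tag for tag in entry["tags"])
    ]

for hit in search("revenue"):
    print(hit["name"], "->", hit["owner"])
```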

Therefore, data mesh offers a more agile way of combining observability, orchestration and access management — rather than waiting for the perfect data warehouse or data lake. With data mesh, one can more flexibly adapt to changing data sources and create multiple data products.

Now that we have a clear understanding of what makes up the data mesh, we can look into companies innovating in the vertical and where they fit in the paradigm shift.

Can these players exist as stand-alone businesses, or are they merely integrated layers on top of existing data warehouses?

A number of companies are innovating in the orchestration and data access management space in Europe.

A number of recent players are also innovating in data observability.

These companies have different specialities in their offerings. For example, Castor centres on a Notion-like data catalogue with 15-minute onboarding; Raito automates data access requests and pipelines with a focus on privacy and security; Y42 automates pipelines with any modelling language of choice; Sifflet focuses on metadata monitoring and ML-based anomaly alerting; and Soda offers an open-source framework that scans data from the command line.

In an ecosystem that is already crowded, I believe one of the keys to success for many of these startups will be their ability to identify and leverage their key differentiator. Is it data quality? Security? Administration? Data preparation for modelling? One good example is Stemma in the US, which focuses on building a self-serve data culture within data orchestration.

I believe the longer-term success of many of these will also depend on their ability to master sales and to partner effectively with larger players (Collibra, Monte Carlo, dbt, etc.).

To conclude, data mesh is taking off as a solution to IT leaders’ most challenging pain points. While the space is getting increasingly crowded and harder to navigate, players specialised in orchestration, observability and access management are innovating at pace.

Not every company will make it; success will largely depend on product differentiation, and on the strength of partnerships and integrations with larger DataOps/MLOps players.

Stay tuned for my next article on this topic, where we’ll be looking more closely at data lineage, reverse ETL, and the metrics that matter most.

If you are a technical investor, founder, or operator, please feel free to share your thoughts and feedback in the comments, or email me at tomwehr@balderton.com.
