The Modern Data Stack Through ‘The Gervais Principle’
Data doesn't move left-to-right in an organization; it moves through Losers, the Clueless, and Sociopaths.
Go and Google the term "Modern Data Stack" and search through images. What do you see? It's one big slew of architecture diagram after architecture diagram, with data flowing through various systems from left to right in most of them: much sound and fury, signifying nothing other than somewhere between 5 and 100 different vendor solutions to purchase to help move data around.
Fundamentally, the left-to-right flow is flawed: it is a dressed-up, back-of-napkin representation of technology flows, not of decision flows or capital-allocation flows within organizations.
Ultimately, the Modern Data Stack diagram is typically a vendor's, VC firm's, or staff augmentation firm's view of whatever is most economically beneficial to them at present.
At one point earlier this year, while smashing my head on my desk and asking myself why I did not choose a nobler and more honest profession like land speculation or political lobbying or carbon offset management, I compiled the Modern Data Stack of Modern Data Stacks from Google images of these nonsensical diagrams.
But what if we turn the stack on its side and look at how information and capital actually flow through organizations? What if we look at data flow in terms of the pathological nature of organizations on a vertical axis, not a horizontal one?
Enter ‘The Gervais Principle’
Hugh MacLeod’s Company Hierarchy
Venkatesh Rao's six-part The Gervais Principle provides the perfect lens for examining data stack problems. Data is merely a subset of information, and information flows up and down in organizations.
According to the paradigm, the organization of corporations is as follows:
The Sociopaths are the capitalist-striving, will-to-power types who drive organizations to function: most business owners, founders, and some executives, among others. Rao notes David Wallace and Jan Levinson of The Office as members.
The Clueless are the Kool-Aid drinkers, chained to corporate mantras and to operating paradigms that have done them a small favor, like an incremental bump in pay or title for pledged fealty. Most of them effectively shuffle paper, data, and information around, constructing narratives and grandiose stories to give their lives and labor meaning. Rao notes Andy Bernard, Dwight Schrute, and Michael Scott as members of the Clueless group.
The Losers are those who have given up capitalist striving in favor of checking boxes within their organizational hierarchies: the checked-out, "I am just here for a paycheck" crowd. Stanley Hudson and Kevin Malone are Losers.
Of course, these titles/classifications are somewhat tongue-in-cheek. The Sociopaths are not clinical sociopaths, although one does assume this is where many clinical sociopaths congregate over time. The Losers are not uncool individuals, but rather those who have made a bargain with their employers not to be bothered too much so long as they check the boxes laid out for them.
As Rao notes, the MacLeod firm follows a natural cycle. A Sociopath with an idea recruits enough Losers to kick off the cycle and build Version 1. Inevitably, as Version 1 ships, a Clueless layer is brought in to manage the increase in production created by a growing number of Losers. This Clueless layer functions as a cushion between the capital holders/decision makers and those closest to production.
As the firm grows over time through these cycles, the Clueless layer becomes so large that it makes the firm unsustainable. Eventually, the Clueless layer takes over and collapses the company as the Sociopaths and Losers both make their exits; living closer to reality, they can most freely move between organizations.
I posit that what we are seeing now with the cloud data warehouse/Modern Data Stack and the high-headcount teams that support it is a manifestation of Stage 4 of the MacLeod Life Cycle of the Firm, in which a middleware layer of Clueless has become so unprofitable, so bloated, and so distanced from both production and meaningful decision-making that at many tech companies it is poised to collapse on itself.
Below is my take on the Modern Data Stack/Team equivalent of the MacLeod Firm.
The Losers are the Data Producers — the application teams and product engineering teams tasked with running some set of applications or systems or services that produce data.
The Clueless are the Modern Data Stack Data Teams tasked with processing this data for the Sociopaths to use.
The Sociopaths are the Data Consumers: the executives, department heads, and buyers of commercialized data who run P&Ls and allocate capital, dialing incremental budget up or down to support the continued movement of data throughout the organization.
Here is the idealized version of how the Modern Data Stack works, per a vendor. It has all the classic marks.
Sources -> Ingestion Layer -> Storage/Warehouse of many tables -> BI or Operations or AI/ML.
But what I have yet to see laid out in any vendor/VC/staff aug firm diagram is the breakdown of humans involved.
As a first pass at this, here is the Modern Data Stack turned 90 degrees, compared against the MacLeod Hierarchy.
Congratulations, you've built an increasingly complex layer of human middleware making tables and tables and tables on a centralized cloud data warehouse, chasing down KPIs for the business (Sociopaths) while dependent on application owners who are not incentivized to care about the quality of the data they produce (Losers).
Much is revealed about the inherently pathological nature of organizations once the flow of data is presented vertically, with stages tied to outcomes.
Each of these stages represents a transfer from one group to another, a shift in custody of the data. Truly understanding how bloat accumulates, however, requires an understanding of the handoffs of data among these groups.
The Data Contract
Understanding and examining the Loser-to-Clueless handoff
Ask any Modern Data Stack leader about their top three biggest problems, and I guarantee you one of the top three, if not number one, is the handoff between applications and the data team.
Fundamentally, this is a Garbage In, Garbage Out problem. Put most simply: if the data in a source is not of quality, or if it changes in definition or structure (be it a Salesforce object owned by the RevOps team, or a set of Postgres tables owned by an application team), and that data is replicated, copied, or otherwise integrated into some kind of centralized store like Snowflake or BigQuery, the result is incorrect data that does not reflect the reality it is supposed to represent.
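To make the handoff failure concrete, here is a minimal sketch of a drift check that compares a source table's live schema against the columns downstream models expect, assuming a Postgres source as in the example above. The table name, column list, and connection details are all hypothetical; a real version would live in whatever orchestrator runs the ingestion.

```python
import psycopg2  # assumes a Postgres source, per the example above

# Columns the downstream warehouse models expect from this (hypothetical) table.
# In practice this would be versioned alongside the models, not hardcoded.
EXPECTED_COLUMNS = {
    "order_id": "bigint",
    "customer_id": "bigint",
    "order_total": "numeric",
    "created_at": "timestamp without time zone",
}

def check_source_schema(conn, table: str = "orders") -> list[str]:
    """Return human-readable drift findings for `table`."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT column_name, data_type FROM information_schema.columns "
            "WHERE table_name = %s",
            (table,),
        )
        live = dict(cur.fetchall())

    findings = []
    for col, dtype in EXPECTED_COLUMNS.items():
        if col not in live:
            findings.append(f"{table}.{col} is missing (renamed or dropped upstream?)")
        elif live[col] != dtype:
            findings.append(f"{table}.{col} changed type: {dtype} -> {live[col]}")
    return findings

if __name__ == "__main__":
    conn = psycopg2.connect("dbname=app")  # hypothetical connection
    for finding in check_source_schema(conn):
        print("SCHEMA DRIFT:", finding)  # alert before the replication job runs
```

The tooling is beside the point; without some check at this seam, the first detection mechanism is a broken dashboard two layers downstream.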
The Losers, the application teams, made a change to check off their box for feature deployment.
The Sociopaths, the executives, notice the dashboard or data product is broken.
The Clueless, the centralized data team members, run around and either make some kind of 'patch' or 'workaround' in SQL in the cloud data warehouse, or go back to the application team and charge them for failing to adhere to the standards of joint accountability that were put in place.
The latter almost never happens. The Modern Data Stack Data Team, naturally positioned in the middle of this dynamic, makes point fixes and accrues the technical debt and the incremental human-capital and cloud costs of taking these actions.
They partially lose credibility in front of the Sociopaths.
They become gofers patching up technical debt and making even more tables in the cloud data warehouse. The time to compile and run through incremental SQL models from source to store to dashboard increases. More compute is eaten to deliver on the baseline.
The Losers don’t care. The Sociopaths just want their set of dashboards fixed.
When this cycle iterates 2, 3, 10, 20, 50 or more times, it leads to hiring more people into the Clueless layer. Just as middle management bloats in the MacLeod organization, so too do data teams, and for the same reason: the inherent disconnect between the Sociopaths and the Losers.
This kicks off a never-ending cycle.
A job requisition is written up. Needed: Data/Analytics Engineer; must be good with dbt and Airflow and know cloud data warehouses. $180,000 per year plus benefits.
Then, because the number of human heads and tables in the warehouse has increased, the Clueless layer may need an observability tool to help them be less clueless about what is happening in/at/around/to the increasing number of tables and SQL they have made.
They may also need a catalog, a lineage tool separate from how they are running their DAGs, and a testing tool to make sure the PRs they create result in properly tested new tables.
At a certain point, usually as early as 4–5 people in my experience, you are just adding human heads and increasingly complex DAGs. Each nth hire covers the tech debt and communication challenges created by the previous heads added to make graphs of SQL on the cloud data warehouse.
The solution, it is said, is enforcing data contracts. This at least limits the downtime and patching resulting from the Loser-to-Clueless handoff.
Many who have run data platform teams have at least framed the problem of data contracts, and offered some framework of a solution with practical examples and advice.
James Densmore lays out a thoughtful methodology here.
Chad Sanderson lays out the problem in great detail and offers a similar solution here.
These are all steps in the right direction, and I think they at least form a paradigm that can be applied across multiple organizations.
But what both of these miss is a carrot or a stick for what happens when a contract is violated.
Fundamentally, Data Contracts are a question of how a middleware Clueless team incentivizes a Loser team to play by rules the Clueless team wishes to establish, in order to appease Sociopath Data Consumers who have neither the time nor the technical know-how to sort through contract negotiations between technical teams that work for the same company. In practice, the contracts are data schema manifested as acceptance criteria or a 'Definition of Done' on a GitHub PR or similar.
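As a sketch of what that acceptance criteria could look like mechanically, here is a hypothetical CI check that fails a PR when a proposed schema no longer satisfies an agreed contract file. The file locations and JSON shape are assumptions for illustration, not any standard.

```python
import json
import sys

def validate_contract(contract_path: str, proposed_path: str) -> int:
    """Exit non-zero when the proposed schema violates the contract,
    so CI can block the PR that would introduce the break."""
    with open(contract_path) as f:
        contract = json.load(f)  # e.g. {"fields": {"customer_id": "int64", ...}}
    with open(proposed_path) as f:
        proposed = json.load(f)

    violations = []
    for field, dtype in contract["fields"].items():
        if field not in proposed["fields"]:
            violations.append(f"removed contracted field: {field}")
        elif proposed["fields"][field] != dtype:
            violations.append(
                f"type change on {field}: {dtype} -> {proposed['fields'][field]}"
            )
    # New fields are deliberately allowed: additive changes don't break consumers.

    for v in violations:
        print(f"CONTRACT VIOLATION: {v}")
    return 1 if violations else 0

if __name__ == "__main__":
    sys.exit(validate_contract("contracts/orders.json", "schemas/orders.json"))
```

A check like this is the easy half. It codifies the rules, but it still offers no carrot or stick when the producing team decides the red X on their PR is someone else's problem.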
In seeing this play out a number of times in my career, I truly believe that the only organizations where Data Contracts are primed to work are those that directly sell or otherwise capitalize their data: offering it to data brokers, hedge funds, manufacturers who wish to understand the full sell-through of the supply chains they sell into, and similar.
If the Loser-to-Clueless handoff results in datasets that are directly capitalized or otherwise reflected as assets, then it can work. Failure to adhere to the Data Contract is a risk to revenue, and therefore accountability begins to exist.
If the Loser-to-Clueless handoff exists at an organization where the Clueless Data Team is a pure cost center, forget it.
The Almighty Dashboard
Understanding and examining the Clueless-to-Sociopath handoff
Congratulations.
You’ve gotten your Data Contracts in place.
But who makes the dashboards?
Someone has to make or collect dashboards, analyses, and visualizations, and on the Modern Data Stack, dashboard creation happens in one of several ways.
One path is for analysts (dashboard makers) to report to specific business units. This way, these people report into the P&Ls managed by the Sociopaths. This is a great stepping stone opportunity for many young people who aspire to be in the Sociopath class one day.
One path is to have analysts report to the Clueless, the team making tables, tables, tables in Snowflake (or whatever), writing SQL on SQL on SQL. This is a great stepping stone for people who aspire to 'get more technical' over time and likely wish to align themselves with the Clueless layer.
However, for an analyst this is often quite bad career-wise should you wish to go the Sociopath route. Your resume will be filled with names of vendors and languages and is unlikely to contain information about business outcomes. There is nothing wrong with this, per se; it is just a fundamentally different outcome, and young or new data professionals should understand it when evaluating which type of analyst or data scientist team they would report into.
Another path altogether is to have various business people have their ops people make dashboards. This is usually deemed 'self-service analytics,' but really most of the Sociopaths are not going to do very much self-service.
This one is great: it usually makes the least sense of them all.
Many times, the same human beings are both the application owners (e.g., Salesforce Ops, Customer Service Ops) and now the dashboard makers too, with the Clueless team running a data warehouse in the middle; the same application owners and ops people function as both Data Producers and Data Consumers.
The spreading human middleware layer
At the core of most data infrastructure and data engineering work is the fact that these teams and the humans that comprise them are stuck in the middle, by default doomed to MacLeod Cluelessness as soon as they start a new job on Day 1. They are most often not Data Producers, and act in service of Data Consumers, effectively human middleware tasked with managing technology middleware — the integrations, schema, data contracts, and interoperability among systems.
In fact, even the leaders of these groups often fail. Many Chief Data Officers fail. HBR has written about this. Gartner said in 2019 that only half succeed.
In today’s tech organizations, many of which don’t have a defined CDO but do have some functional ‘Head of Data’ in a VP or Director title, we too see a number of issues, the primary one being human middleware turning into human bloatware.
In just the past few weeks several articles have come out about challenges within the Modern Data Stack.
This one, from dbt Labs, a vendor in the space, shows that their own dogfooded dbt project has over 1,700 SQL models in Snowflake and had been running full window-function table scans on expensive cloud data warehouse compute, 4x a day, unnecessarily adding tens of thousands of dollars a year to their cloud bill and significantly impacting throughput. They also maintain a team of their own internal analytics engineers, which grows as time passes.
This one, from a data employee of loyalty platform ShopBack, shows an increasing number of heads (6 to 18) as well as an increasing number of SQL sequences (~700 added over 6 months) created for analytics purposes.
In both cases, the businesses are not just venture-backed but have gone through several rounds of heavy financing in a short period of time.
The cycle plays out in a similar way at these types of companies, almost every time.
Cash is injected. This means more employees of all sorts are hired. More employees increase the demand for reports and analytics. More demand means more human middleware is created in the Clueless layer once you adopt the Modern Data Stack paradigm of throwing everything into your cloud data warehouse of choice.
More Clueless human middleware creates more tables, tables, tables feeding many more reports and KPIs and metrics. The team has to buy new products and hire new people to manage the complexity. Then a new round of financing occurs. The data team hires more people and buys more products. The organization hires more people, which creates more demand for KPIs and metrics, which results in more reports.
The cycle repeats until the money runs out.
All this effectively does is throw human bodies at the problem to piece together solutions, some of which may never have needed to exist in the first place.
The excess costs are passed on to cloud vendors and the services around them, where increasing numbers of Clueless-layer employees make increasing numbers of tables and SQL sequences to chase down increasing numbers of metrics and KPIs and reports for increasing numbers of Sociopaths, based on increasing numbers of Losers owning more applications and generating more data.
The MacLeod Cycle compresses in time when fueled by venture capital markup after markup.
Instead of taking several years to accrue increasing human middleware and middle management within organizations, this now occurs over months or a year or two.
While a bootstrapped or traditional business sees employee growth as a function of profitability and potentially debt (“don’t hire until it hurts”), many VC-backed digital companies run through this lifecycle more quickly (“always be hiring”), with heavy cash thrown at these businesses, round after round, sometimes with mere months separating checks.
In the past few years of extended quantitative easing and effectively zero interest rates in the US, this has led to business after business creating data functions with essentially free money to shoot data around the organization.
Is data centralization actually the enemy?
One of the biggest scams of the Modern Data Stack is the idea that all data needs to come through a centralized repository, to a centralized team, in order to produce some kind of outcome.
In reality, most business decisions are made at the line-of-business level, on data that can remain isolated or ‘siloed’ within the team and application ownership responsible for it.
Why does the Sales Operations team need a report to run through Snowflake or similar when all they need to report on is dials per SDR or campaign attribution, data already sitting in the jointly integrated Salesforce and Marketo instances they use?
Right now, the narrative says bring in everything raw from each individual application or upstream database, then fold it all together in SQL inside the cloud data warehouse.
Then, hire a team to manage all the SQL joins inside the cloud data warehouse.
Then, the data is deemed correct by the time it arrives at the end of the cloud data warehouse process.
Then, hire analysts to make the outputs of reporting and BI.
This is all extremely capital inefficient.
There is no way most of these companies that have adopted this operating paradigm can continue paying for this at scale, given that many in software are facing downrounds and layoffs and many in consumer/consumer finance have taken on debt or may soon have to in order to finance continued operations.
The headcount costs are too high, and the incremental cloud costs produced by headcount chasing down the nth KPI or metric or analysis or report are too high.
Fundamentally, the Modern Data Stack/cloud data warehouse story is one of the late 2010s/early 2020s, with essentially free money, a grow-grow-grow mentality, and very little attention to profitability.
Just a few years ago, dbt was marketed as a tool used by “companies like Casper, InVision, Away Travel and many more…”
Casper was never even remotely profitable; it IPO'd at half its peak valuation and was eventually taken private post-IPO at yet another markdown, as many armchair MBAs note in countless videos and articles.
Away has gone through all kinds of dramatics over the last two years.
The list goes on and on. Feel free to look up your favorite Modern Data Stack point-solution tool in its archives or on the Wayback Machine and see how the companies featured around 2018–2019 are doing now.
Controlling human middleware bloat has got to be the path going forward
Almost nowhere at any Modern Data Stack company I've consulted for is the 'Head of Data' meaningfully compensated for their budget management. It happens, but it is rare, even though this is how many larger businesses operate.
Marketing sends out 10x the volume of emails and SMS messages in November and December, and then people scratch their heads as to why their Fivetran (or similar) credits were toasted and the business must spend another $20–30k to re-up early. Well, this is because Fivetran and similar row-based pricing models charge on the volume of data ingested. If you've hooked up all of your email systems and you're generating 10x the volume, your costs are going to go up on a volume-based model.
New IoT or log data gets ingested into a centralized data warehouse for a new initiative. This is put together quickly, ingested, stored, and now each day 500 million new rows (and growing!) start filling tables. Of course, this is then joined into the existing DAGs of SQL, and just like that, the cloud data warehouse bill has gone up $7K a month in the first month and may end up costing an incremental $100K this year, depending on its place in the graph of SQL and the reports and analyses derived from it.
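Both scenarios reduce to back-of-envelope arithmetic that almost nobody runs before wiring up a new source. Here is a sketch; every rate below is a hypothetical placeholder, since real ingestion and warehouse pricing is tiered and vendor-specific:

```python
# Back-of-envelope cost growth for a new high-volume source.
# All rates are hypothetical placeholders, not any vendor's actual pricing.

ROWS_PER_DAY = 500_000_000     # the IoT/log example above
INGEST_COST_PER_M_ROWS = 0.25  # $/million rows on volume-based ingestion (assumed)
DAG_COMPUTE_MULTIPLIER = 1.0   # warehouse compute spent re-joining those rows (assumed)

monthly_ingest = ROWS_PER_DAY * 30 / 1_000_000 * INGEST_COST_PER_M_ROWS
monthly_total = monthly_ingest * (1 + DAG_COMPUTE_MULTIPLIER)

print(f"ingestion alone:       ${monthly_ingest:>9,.0f}/month")
print(f"plus DAG reprocessing: ${monthly_total:>9,.0f}/month")
print(f"annualized:            ${monthly_total * 12:>9,.0f}/year")
# A 10x seasonal spike (the November-December email example) scales
# roughly linearly on row-based pricing: multiply everything by 10.
```

Even with made-up rates, the output lands in the same neighborhood as the numbers above: roughly $7.5K a month and $90K a year. Five minutes of this math before the source is wired up is the difference between a planned line item and a surprise re-up.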
All of these problems (and there are many more I can name) are fundamentally driven by the lack of fiscal responsibility and discipline around how these data teams operate.
As a solution and to stay ahead of likely-upcoming recession pressures, I recommend the following:
Executive leadership should set tight budgets around spend for their data teams, with data leadership compensated based on adherence to the budgets they manage. If 25–30% of a data leader's total compensation package depends on staying within spend limits, this will immediately build a culture of cost management. As a tradeoff, this may mean fewer requests from people asking questions of the data team are serviced, and fewer systems get ingested into the cloud data warehouse. This is a good thing: lower-level asks will run through operational teams and only directionally significant requests will be served.
Ensure that data team individual contributors can see the fiscal impact of building the nth pipeline or writing additional SQL; a sketch of what this visibility could look like follows these recommendations. In about 50% of the organizations I have worked with, the individual contributors building pipelines, SQL, and analyses on the cloud have no insight into the costs of their actions, due to role restrictions or user-permissions configuration. I believe this is a mistake, and I have heard from many of these people that they at least want to know the costs of their efforts. If you have spinach in your teeth, it is better that someone lets you know than that you walk around with spinach in your teeth all day, not knowing about it.
Develop a culture of 'firing' dashboards, DAG nodes, and pipelines that serve no purpose, serve a limited purpose, or have otherwise gone unused; the sketch below touches on this as well. In most organizations with spiraling cloud costs this can be a major cost saver, especially when combined with properly materializing the DAG nodes that feed significant daily query usage.
Finally, one of the biggest drivers of costs, if not in cloud spend then in headcount, is servicing finance requests through the cloud data warehouse. I have seen this time and time again working with VC-backed digital companies, and this 'Shadow Finance,' and how to avoid it, merits another article entirely. There should not be data engineers or analytics engineers or analysts or any other titled person joining disparate transaction and order system data in SQL in the cloud data warehouse, then dumping it out to BI dashboards for someone in finance to download or, even worse, Reverse ETLing it into a financial system of record when it never entered the warehouse correctly reconciled in the first place. This can be the main driver of data team costs at some companies, with dedicated data employees using SQL on the cloud to clean and join data into datasets used to close the books, pay taxes, or perform any other financial operation. Let finance resolve their own entities in their own systems, then bring the results into the cloud data warehouse if so desired, not the other way around.
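For the second and third recommendations, here is a minimal sketch of what per-person cost visibility and a list of 'firing' candidates could look like on Snowflake, using its ACCOUNT_USAGE views. The credit rates are placeholders, elapsed time is only a crude proxy for spend, and the exact view behavior should be checked against current Snowflake documentation; treat this as directional, not a billing system.

```python
import snowflake.connector  # pip install snowflake-connector-python

# $/credit and credits/hour are contract- and warehouse-specific; placeholders here.
DOLLARS_PER_CREDIT = 3.00
CREDITS_PER_HOUR = 8  # e.g. a Large warehouse

# Recommendation 2: show each individual contributor roughly what their queries cost.
SPEND_BY_USER = """
    SELECT user_name,
           SUM(total_elapsed_time) / 1000 / 3600 AS approx_warehouse_hours
    FROM snowflake.account_usage.query_history
    WHERE start_time > DATEADD('day', -30, CURRENT_TIMESTAMP())
    GROUP BY user_name
    ORDER BY approx_warehouse_hours DESC
"""

# Recommendation 3: tables nobody has read in 90 days are firing candidates.
# ACCESS_HISTORY requires Enterprise edition or higher.
STALE_TABLES = """
    SELECT obj.value:objectName::string AS table_name,
           MAX(query_start_time)        AS last_read
    FROM snowflake.account_usage.access_history,
         LATERAL FLATTEN(input => base_objects_accessed) obj
    GROUP BY 1
    HAVING MAX(query_start_time) < DATEADD('day', -90, CURRENT_TIMESTAMP())
"""

def report(conn) -> None:
    cur = conn.cursor()
    print("approximate 30-day spend by user:")
    for user, hours in cur.execute(SPEND_BY_USER):
        # Elapsed time ignores concurrency and idle warehouses, but it is
        # visible to the person running the queries, which is the point.
        cost = float(hours) * CREDITS_PER_HOUR * DOLLARS_PER_CREDIT
        print(f"  {user:<30} ~${cost:,.0f}")
    print("firing candidates (no reads in 90 days):")
    for table, last_read in cur.execute(STALE_TABLES):
        print(f"  {table} (last read {last_read:%Y-%m-%d})")

if __name__ == "__main__":
    report(snowflake.connector.connect())  # credentials via config, omitted here
```

One known gap: a table that has never been read at all will not appear in ACCESS_HISTORY, so the truly dead weight needs a separate pass against the information schema. The sketch is a starting point for the conversation, not an audit.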
Ultimately, shoveling data from System A to System B, normalizing it, and denormalizing it has almost no inherent value. As the data bloat bubble bursts and we enter a recession, remember that the best teams are the ones that drive the most direct impact, not the ones that make big binders of config.