It was the second game of a double-header, and the Washington Nationals had a problem. Not on the field, of course: The soon-to-be World Series champions were performing beautifully. But as they waited out a rain delay, something went awry behind the scenes. A task scheduler deep within the team’s analytics infrastructure stopped running.
The scheduler was in charge of collecting and aggregating game-time data for the Nationals’ analytics team. Like many tools of its kind, this one was based on cron, a decades-old workhorse for scheduling at regular intervals. Cron works particularly well when work needs to start on a specific day, hour, or minute. It works particularly poorly — or not at all — when work needs to start at the same time as, say, a rain-delayed baseball game. Despite the data team’s best efforts to add custom logic to the simple scheduler, the circumstances of the double-header confused it … and it simply stopped scheduling new work.
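To make the mismatch concrete: cron can say “run at 1:05 p.m. every day,” but it has no way to say “run whenever the rain-delayed game actually starts.” The sketch below shows the kind of event-driven wait the team effectively needed, with a hypothetical game_has_started check standing in for a real schedule or status lookup.

```python
import time

def wait_for_first_pitch(game_has_started, poll_seconds=60):
    """Block until the game is underway, then return so data collection can begin.

    game_has_started: a zero-argument callable (hypothetical here) reporting whether
    the delayed game has actually started -- a condition that cron's fixed
    "minute hour day month weekday" syntax simply cannot express.
    """
    while not game_has_started():
        time.sleep(poll_seconds)  # keep polling through the rain delay
```

In practice that predicate would query a schedule or status feed; the point is only that the trigger is an event, not a clock time.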
It wasn’t until the next day that an analyst realized the discrepancy when the data — critical numbers that formed the very basis of the team’s post-game analytics and recommendations — didn’t include a particularly memorable play. There were no warnings or red lights, because the process simply hadn’t run in the first place. And so a new, time-consuming activity was added to the data analytics stack: manually checking the database each morning to make sure everything had functioned properly.
This is not a story of catastrophic failure. In fact, I’m certain any engineer reading this can think of countless ways to solve this particular issue. But few engineers would find it a good use of time to sit around brainstorming every edge case in advance — nor is it even possible to proactively anticipate the billions of potential failures. As it is, there are enough pressing issues for engineers to worry about without dreaming up new errors.
The problem here, therefore, wasn’t the fact that an error occurred. There will always be errors, even in the most sophisticated infrastructures. The real problem was how limited the team’s options were to address it. Faced with a critical business issue and a deceptive cause, they were forced to waste time, effort, and talent making sure this one unexpected quirk wouldn’t rear its head again.
So, what would be a better solution? I think it’s something akin to risk management for code or, more succinctly, negative engineering. Negative engineering is the time-consuming and sometimes frustrating work that engineers undertake to ensure the success of their primary objectives. If positive engineering is taken to mean the day-to-day work that engineers do to deliver productive, expected outcomes, then negative engineering is the insurance that protects those outcomes by defending them from an infinity of possible failures.
After all, we must account for failure, even in a well-designed system. Most modern software incorporates some degree of major error anticipation or, at the very least, error resilience. Negative engineering frameworks, meanwhile, go a step further: They allow users to work with failure, rather than against it. Failure actually becomes a first-class part of the application.
You might think about negative engineering like auto insurance. Purchasing auto insurance won’t prevent you from getting into an accident, but it can dramatically reduce the burden of doing so. Similarly, having proper instrumentation, observability, and even orchestration of code can provide analogous benefits when something goes wrong.
“Insurance as code” may seem like a strange concept, but it’s a perfectly appropriate description of how negative engineering tools deliver value: They insure the outcomes that positive engineering tools are used to achieve. That’s why features like scheduling or retries that seem toy-like — that is, overly simple or rudimentary — can be critically important: They’re the means by which users input their expectations into an insurance framework. The simpler they are (in other words, the easier it is to take advantage of them), the lower the cost of the insurance.
In applications, for example, retrying failed code is a critical action. Each step a user takes is reflected somewhere in code; if that code’s execution is interrupted, the user’s experience is fundamentally broken. Imagine how frustrated you’d be if every now and then, an application simply refused to add items to your cart, navigate to a certain page, or charge your credit card. The truth is, these minor refusals happen surprisingly often, but users never know because of systems dedicated to intercepting those errors and running the erroneous code again.
To engineers, these retry mechanisms may seem relatively simple: “just” isolate the code block that had an error, and execute it a second time. To users, they form the difference between a product that achieves its purpose and one that never earns their trust.
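As a rough sketch of the idea (not any particular product’s implementation), a retry wrapper might look something like the following, with the flaky charge_card call invented purely for illustration:

```python
import random
import time
from functools import wraps

def retry(times=3, delay=0.5, backoff=2.0):
    """Re-run a function when it raises, waiting a little longer after each failure."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            wait = delay
            for attempt in range(1, times + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == times:
                        raise  # out of attempts: surface the original error
                    time.sleep(wait)
                    wait *= backoff
        return wrapper
    return decorator

@retry(times=3, delay=0.5)
def charge_card(order_id):
    # Stand-in for a flaky external call that occasionally times out.
    if random.random() < 0.3:
        raise ConnectionError("payment gateway timed out")
    return f"order {order_id} charged"
```

The caller only ever sees a successful charge or, after the final attempt, the original error; the intermediate failures are absorbed by the wrapper.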
In mission-critical analytics pipelines, the importance of trapping and retrying erroneous code is magnified, as is the need for a similarly sophisticated approach to negative engineering. In this domain, errors don’t result in users missing items from their carts, but in businesses forming strategies from bad data. Ideally, these companies could quickly modify their code to identify and mitigate failure cases. The more difficult it is to adopt the right tools or techniques, the higher the “integration tax” for engineering teams that want to implement them. This tax is equivalent to paying a high premium for insurance.
But what does it mean to go beyond just a feature and provide insurance-like value? Consider the mundane activity of scheduling: A tool that schedules something to run at 9 a.m. is a cheap commodity, but a tool that warns you that your 9 a.m. process failed to run is a critical piece of infrastructure. Elevating commodity features by using them to drive defensive insights is a major advantage of using a negative engineering framework. In a sense, these “trivial” features become the means of delivering instructions to the insurance layer. By better expressing what they expect to happen, engineers can be more informed about any deviation from that plan.
To take this a step further, consider what it means to “identify failure” at all. If a process is running on a machine that crashes, it may not even have the chance to notify anyone about its own failure before it’s wiped out of existence. A system that can only capture error messages will never even find out it failed. In contrast, a framework that has a clear expectation of success can infer that the process failed when that expectation isn’t met. This enables a new degree of confidence by creating logic around the absence of expected success rather than waiting for observable failures.
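Here is a minimal sketch of that inversion, assuming a hypothetical record of each run’s last success rather than any specific framework’s API:

```python
from datetime import datetime, timedelta
from typing import Optional

def check_expected_run(last_success: Optional[datetime],
                       expected_by: datetime,
                       grace: timedelta = timedelta(minutes=15),
                       now: Optional[datetime] = None) -> str:
    """Classify a scheduled run by the absence of expected success, not by error messages."""
    now = now or datetime.now()
    if last_success is not None and last_success >= expected_by:
        return "ok"
    if now > expected_by + grace:
        # Nothing reported success by the deadline: treat the silence itself as a
        # failure, even if the process died before it could emit a single error.
        return "missing"
    return "pending"
```

The crashed machine never has to report anything; the framework’s unmet expectation does the reporting for it.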
It’s in vogue for large companies to proclaim the sophistication of their data stacks. But the truth is that most teams — even those performing sophisticated analytics — employ relatively simple stacks that are the product of a series of pragmatic decisions made under significant resource constraints. These engineers don’t have the luxury of time to both achieve their business objectives and contemplate every failure mode.
What’s more, engineers hate dealing with failure, and no one actually expects their own code to fail. Compounded with the fact that negative engineering issues often arise from the most mundane features — retries, scheduling, and the like — it is easy to understand why engineering teams might decide to sweep this sort of work under the rug or treat it as Someone Else’s Problem. It might not seem worth the time and effort.
To the extent that engineering teams do recognize the issue, one of the most common approaches I’ve seen in practice is to produce a sculpture of band-aids and duct tape: the compounded sum of a million tiny patches made without regard for overarching design. And trembling under the weight of that monolith is an overworked, under-resourced team of data engineers who spend all of their time monitoring and triaging their colleagues’ failed workflows.
FAANG-inspired universal data platforms have been pitched as a solution to this problem, but fail to recognize the incredible cost of deploying far-reaching solutions at businesses still trying to achieve engineering stability. After all, none of them come packaged with FAANG-scale engineering teams. To avoid a high integration tax, companies should instead balance the potential benefits of a particular approach against the inconvenience of implementing it.
But here’s the rub: The tasks associated with negative engineering often arise from outside the software’s primary purpose, or in relation to external systems: rate-limited APIs, malformed data, unexpected nulls, worker crashes, missing dependencies, queries that time out, version mismatches, missed schedules, and so on. In fact, since engineers almost always account for the most obvious sources of error in their own code, these problems are more likely to come from an unexpected or external source.
It’s easy to dismiss the damaging potential of minor errors by failing to recognize how they will manifest in inscrutable ways, at inconvenient times, or on the screen of someone ill-prepared to interpret them correctly. A small issue in one vendor’s API, for instance, may trigger a major crash in an internal database. A single row of malformed data could dramatically skew the summary statistics that drive business decisions. Minor data issues can result in “butterfly effect” cascades of disproportionate damage.
The following story was originally shared with me as a challenge, as if to ask, “Great, but how could a negative engineering system possibly help with this problem?” Here’s the scenario: Another data team — this time at a high-growth startup — was managing an advanced analytics stack when their entire infrastructure suddenly and completely failed. Someone noticed that a report was full of errors, and when the team of five engineers began looking into it, a flood of error messages greeted them at almost every layer of their stack.
Starting with the broken dashboard and working backward, the team discovered one cryptic error after another, as if each step of the pipeline was not only unable to perform its job, but was actually throwing up its hands in utter confusion. The team finally realized this was because each stage was passing its own failure to the next stage as if it were expected data, resulting in unpredictable failures as each step attempted to process a fundamentally unprocessable input.
It would take three days of digital archaeology before the team discovered the catalyst: the credit card attached to one of its SaaS vendors had expired. The vendor’s API was accessed relatively early in the pipeline, and the resulting billing error cascaded violently through every subsequent stage, ultimately contaminating the dashboard. Within minutes of that insight, the team resolved the problem.
Once again, a trivial external catalyst wreaked havoc on a business, resulting in extraordinary impact. In hindsight, the situation was so straightforward that I was asked not to share the name of the company or the vendor in question. (And let any engineer who has never struggled with a simple problem cast the first stone!) Nothing about this situation is complex or even difficult, conditional on being aware of the root problem and having the ability to resolve it. In fact, despite its seemingly unusual nature, this is actually a fairly typical negative engineering situation.
A negative engineering framework can’t magically solve a problem as idiosyncratic as this one — at least, not by updating the credit card — but it cancontainit. A properly instrumented workflow would have identified the root failure and prevented downstream tasks from executing at all, knowing they could only result in subsequent errors. In addition to dependency management, the impact of having clear observability is similarly extraordinary: In all, the team wasted 15 person-days triaging this problem. Having instant insight into the root error could have reduced the entire outage and its resolution to a few minutes at most, representing a productivity gain of over 99 percent.
Remember: All they had to do was punch in a new credit card number.
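To picture what that containment might look like, here is a toy dependency-aware runner, with hypothetical task names standing in for the real pipeline. The only behavior that matters is that downstream tasks are skipped rather than handed a broken input:

```python
def run_pipeline(tasks):
    """tasks: list of (name, callable, list-of-upstream-names), in execution order."""
    results = {}  # name -> "success" | "failed" | "skipped"
    for name, func, upstream in tasks:
        if any(results.get(dep) != "success" for dep in upstream):
            results[name] = "skipped"  # contain the failure instead of cascading it
            continue
        try:
            func()
            results[name] = "success"
        except Exception as exc:
            results[name] = "failed"
            print(f"root failure in {name}: {exc}")  # one clear signal at the source
    return results

def fetch_vendor_data():
    # Hypothetical stand-in for the SaaS call that hit the expired-card billing error.
    raise RuntimeError("402 Payment Required")

pipeline = [
    ("fetch_vendor_data", fetch_vendor_data, []),
    ("transform", lambda: None, ["fetch_vendor_data"]),
    ("publish_dashboard", lambda: None, ["transform"]),
]

print(run_pipeline(pipeline))
# {'fetch_vendor_data': 'failed', 'transform': 'skipped', 'publish_dashboard': 'skipped'}
```

The root billing error surfaces once, at its source, instead of masquerading as a different cryptic failure at every downstream stage.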
“Negative engineering” by any other name is still just as frustrating — and it’s had many other names. I recently spoke with a former IBM engineer who told me that, back in the ’90s, one of IBM’s Redbooks stated that the “happy path” for any piece of software comprised less than 20 percent of its code; the rest was dedicated to error handling and resilience. This mirrors the proportion of time that modern engineers report spending on triaging negative engineering issues — up to an astounding 90 percent of their working hours.
It seems almost implausible: How can data scientists and engineers grappling with the most sophisticated analytics in the world be wasting so much time on trivial issues? But that’s exactly the nature of this type of problem. Seemingly simple issues can have unexpectedly time-destructive ramifications when they spread unchecked.
For this reason, companies can find enormous leverage in focusing on negative engineering. Given the choice of reducing model development time by 5% or reducing time spent tracking down errors by 5%, most companies would naively choose model development because of its perceived business value. But in a world where engineers spend 90% of their time on negative engineering issues, focusing on reducing errors could be 10 times as impactful. Consider that cutting negative engineering time by just 10 percentage points — from 90% of hours down to 80% — would double productive time, from 10% to 20%. That’s an extraordinary gain from a relatively minor action, perfectly mirroring the way such frameworks work.
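The arithmetic is simple enough to check in a couple of lines:

```python
# Productive time is whatever isn't spent on negative engineering.
negative_before, negative_after = 0.90, 0.80   # a ten-percentage-point reduction

productive_before = 1 - negative_before        # 0.10
productive_after = 1 - negative_after          # 0.20

print(productive_after / productive_before)    # 2.0 -> productive capacity doubles
```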
Instead of tiny errors bubbling up as major roadblocks, taking small steps to combat negative engineering issues can result in huge productivity wins.