A Primer on Resilient Intelligent Systems

Editor’s note: Dan Shiebler is a speaker for ODSC West this November 1st-3rd. Be sure to check out his talk, “Resilient Machine Learning,” there!

The world around us is shaped by intelligent systems that solve problems. These systems process large amounts of raw and structured data to make intelligent decisions. Recommendation, detection, creation, content discovery, and many more workflows are increasingly driven by larger and more complex intelligent systems.

Any such system needs to decide where to draw the line between hand-engineered heuristics and machine learning. Historically, hand-engineered heuristics have been the central driver of system success. However, the dramatic growth in the availability of open data on the internet has enabled machine learning approaches to drive immense value. The most effective intelligent systems balance both pillars.

Unfortunately, intelligent systems can be notoriously unreliable. 

At Abnormal Security, we design intelligent systems to recognize cyberattacks. Our systems process vast quantities of raw data to unearth malicious behavior, and our customers rely on us to protect them at all times, every day, no matter what. One of our core values is #customer-obsession, and we take this commitment very seriously. We are therefore deeply invested in building resilient intelligent systems. This requires carefully inspecting data, developing a deep understanding of system behavior, and reducing risk through rigorous engineering.

Today, I will go over several common challenges that arise when designing resilient intelligent systems. I will then dive into how we have overcome these challenges at Abnormal Security.

Any important software system that interacts with humans will need to cope with adversarial actors. Intelligent systems that consume user behaviors as inputs are particularly susceptible. A resilient intelligent system will therefore need to resist manipulation and attack. 

Some adversarial actions are very basic, such as submitting different permutations of messages to see which will avoid blockage. Other attacks can be quite complex. For example, attackers might attempt to poison counting data or training labels by submitting fictitious positive or negative feedback. Attackers may also use fake accounts to produce fake behavioral signals that pollute system data.

At Abnormal Security, we pride ourselves on the resilience of our detection system to attacker manipulation. Our sophisticated user behavioral models enable us to spot attacker behavior that stands out from normal customer behaviors without relying on matching particular attacker patterns. This makes it extremely difficult for attackers to get past our systems or poison our data distributions.

In addition, we use data augmentation to predict the manipulations that attackers may apply and defend against them. This enables our detection engine to learn as much as possible from each attack that we see.
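As an illustration, here is a minimal sketch of what this kind of augmentation can look like: generating lookalike permutations of a message of the sort an attacker might submit. The character map, probabilities, and function names are hypothetical, not a description of our production pipeline.

```python
import random

# Hypothetical homoglyph map: a few Latin characters and their common lookalikes.
HOMOGLYPHS = {"a": "а", "e": "е", "o": "0", "i": "1", "l": "1"}

def augment_message(text: str, n_variants: int = 5, swap_prob: float = 0.1) -> list[str]:
    """Generate attacker-style permutations of a message for training data."""
    variants = []
    for _ in range(n_variants):
        chars = []
        for ch in text:
            if ch.lower() in HOMOGLYPHS and random.random() < swap_prob:
                chars.append(HOMOGLYPHS[ch.lower()])
            else:
                chars.append(ch)
        variants.append("".join(chars))
    return variants

# Each caught attack can be expanded into many near-duplicates, so a model
# learns the underlying pattern rather than memorizing the exact string.
print(augment_message("Please verify your account immediately"))
```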

Most importantly, the detection team carefully inspects every attack that manages to sneak through our defenses. This enables us to proactively close the chinks in our armor and stay one step ahead of attackers.

Every intelligent system engineering team needs to manage the mismatch between online and offline data.

Data scientists and machine learning engineers typically think of their data as static rows in a table. A request object enters the system, the system extracts features from that object, and the model runs on these features to generate a prediction. If we log the values of these features then we can test our models offline. 

In reality, the situation can be quite different. The features on a particular request object are not static, and may change as different models run on the object. The logged request is a static view of a dynamic entity, and it may or may not be appropriate to treat it as a representation of what the model will see online.

For example, suppose multiple models will run on the same logged request and append their predictions to the object. Each model will therefore “see” a version of the object with a different set of data attached. No static snapshot of the object will represent this. 
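A toy example, with hypothetical model names and fields, makes the problem concrete: each model mutates the shared request object, so no single logged snapshot reflects what every model saw.

```python
# Toy illustration: model B sees the request *after* model A has appended its score.
request = {"sender": "alice@example.com", "body_length": 812}

def run_model_a(req: dict) -> None:
    # Model A sees only the raw request fields.
    req["model_a_score"] = 0.12

def run_model_b(req: dict) -> None:
    # Model B sees the raw fields plus model A's prediction.
    req["model_b_score"] = 0.87 if req["model_a_score"] > 0.1 else 0.05

run_model_a(request)
run_model_b(request)

# Logging the final object captures what model B saw, but not what model A saw.
print(request)
```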

At Abnormal, we embrace this dynamism. In addition to logging raw requests and feature values, we also maintain a sophisticated simulation system to reconstruct online behavior offline. This system uses the same code paths as our production engine and allows us to mimic our online environment offline.

One of the great bogeymen of intelligent system design is the feedback loop. In many circumstances, the data that feeds back into an intelligent system depends on the predictions it generates. This insidious effect can give rise to extremely complicated system dynamics.

Classic examples include recommendation systems that only collect feedback on the items they choose to surface and spam filters that never receive labels for the messages they block.

These effects can cause serious damage to machine learning systems by tying the labeled data that is used to train a model to the predictions of previous versions of that model. However, this is not the only way that feedback loops can disrupt a system.

Many intelligent systems consume aggregate counting signals like the number of previous times that a user has seen content of type X or the number of spammy emails that a particular account has sent. These counts can be heavily influenced by the previous predictions of the intelligent system itself.

At Abnormal, we have designed sophisticated systems to protect our intelligent systems from bias introduced by feedback loops. For example, we have designed importance sampling-based pipelines that select data points for labeling in a way that allows us to reconstruct serving distributions. This enables us to train and evaluate machine learning models on data distributions that are not biased by serving distributions. As another example, we accumulate our aggregate signals over a variety of stratifications of the data, which limits how much the system's own decisions bias these features.
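To make the importance-sampling idea concrete, here is a minimal sketch with an assumed sampling policy (it is not our actual pipeline): each data point is selected for labeling with a known probability, and the inverse of that probability is stored as a weight so that offline metrics can be reweighted back to the serving distribution.

```python
import random

def sampling_probability(event: dict) -> float:
    # Hypothetical policy: label most high-score events, very few low-score ones.
    return 0.9 if event["model_score"] > 0.5 else 0.01

def select_for_labeling(events: list[dict]) -> list[dict]:
    selected = []
    for event in events:
        p = sampling_probability(event)
        if random.random() < p:
            # The inverse sampling probability becomes the example's weight when
            # training or computing metrics on the serving distribution.
            selected.append({**event, "importance_weight": 1.0 / p})
    return selected

events = [{"model_score": random.random()} for _ in range(10_000)]
labeled_pool = select_for_labeling(events)
# The weighted count approximates the full traffic volume despite biased sampling.
print(sum(e["importance_weight"] for e in labeled_pool), "vs", len(events))
```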

Every complex system will eventually fail. Large software systems can fail in an enormous number of ways, including bugs, unexpected load, vendor issues, or even hardware failures.

Anytime these issues occur, the data flowing through an intelligent system may become polluted. Important signals may suddenly disappear, and downstream components will need to cope with their absence.

The most dangerous kind of outage is one in which components fail in unexpected ways. Such outages can expose system paths without good default behavior, thereby causing major problems.

At Abnormal, our customers count on us to protect them 24/7. It is completely unacceptable for a system outage to substantially reduce the efficacy of our detection engine. 

We defend our system uptime with multiple layers of fallbacks. First, we maintain both synchronous and asynchronous processing paths in order to ensure low latency and a high tolerance for fluctuations in load. Next, our detection engine uses a layered approach. We maintain a collection of simple and complex rules and models that consume signals from a variety of different upstream sources. Since different components rely on different upstream signals the overall engine is resilient to system failures and outages.
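The sketch below illustrates the layering idea with two hypothetical layers and made-up thresholds: each layer depends on a different upstream signal, so an outage disables one layer rather than the whole verdict.

```python
from typing import Optional

def rule_layer(signals: dict) -> Optional[float]:
    if "sender_reputation" not in signals:
        return None  # reputation service is down; skip this layer
    return 1.0 if signals["sender_reputation"] < 0.2 else 0.0

def model_layer(signals: dict) -> Optional[float]:
    if "text_embedding_score" not in signals:
        return None  # embedding service unavailable; skip this layer
    return signals["text_embedding_score"]

def score(signals: dict) -> float:
    layer_scores = [s for s in (rule_layer(signals), model_layer(signals)) if s is not None]
    # Fall back to a conservative default only if every layer is unavailable.
    return sum(layer_scores) / len(layer_scores) if layer_scores else 0.5

print(score({"sender_reputation": 0.1}))  # embedding outage: rule layer still fires
print(score({"sender_reputation": 0.6, "text_embedding_score": 0.9}))
```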

Finally, at Abnormal we hold our ML models to a very high standard. Rather than only use clean data to train and evaluate our models, we force our ML models to deal with noisy, missing, or corrupted data during the training process. Mike Tyson once said “everybody has a plan until they get punched in the face.” This mantra rings true in ML as well. A model that only ever sees clean data during training will fail spectacularly when the data it sees in production gets corrupted. By allowing our ML models to get “punched in the face” during the training process we can better equip them to deal with unexpected adversity.
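In its simplest form, this can be done by corrupting training batches on the fly. The sketch below, with assumed drop and noise rates, randomly zeroes out features and adds noise so the model learns to cope with degraded inputs; it is an illustration of the idea rather than our training code.

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt_batch(X: np.ndarray, drop_prob: float = 0.1, noise_std: float = 0.05) -> np.ndarray:
    """Simulate missing and noisy signals in a training batch."""
    X = X.copy()
    drop_mask = rng.random(X.shape) < drop_prob
    X[drop_mask] = 0.0                         # simulate a missing upstream signal
    X += rng.normal(0.0, noise_std, X.shape)   # simulate noisy measurements
    return X

X_train = rng.random((4, 6))
print(corrupt_batch(X_train))
```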

Suppose you own a product that uses an intelligent system. If your product is successful, the set of people who use your product will grow. This will force the intelligent system that drives your product to encounter new customers or users who may behave very differently from anything previously seen. Now suppose that you add new features to your product. This can cause existing customers or users to use the product differently.

These effects can translate into large and sudden shifts in the distribution of data that your intelligent system will need to handle. To cope with them, it will need to adapt to new user data as quickly as possible.

At one extreme, consider a large machine learning model that uses a categorical feature representation of users. When making an inference about a particular user, this model only has access to data up until the last time the model was retrained. This can severely limit the model’s ability to adapt to user-level data distribution changes. 

Now consider instead supplying that same model with a user representation that describes recent user behavior, through features like “number of products purchased in the last hour” or “sentiment of the last message sent.” Such a model can adapt as quickly as the data feeding into that representation refreshes.
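A toy version of such a representation, with hypothetical event types and window sizes, computes aggregates over a recent window at inference time, so the features shift as soon as behavior shifts.

```python
from datetime import datetime, timedelta, timezone

def behavioral_features(events: list[dict], now: datetime) -> dict:
    """Summarize a user's recent behavior instead of using a static identity."""
    last_hour = [e for e in events if now - e["timestamp"] < timedelta(hours=1)]
    return {
        "purchases_last_hour": sum(1 for e in last_hour if e["type"] == "purchase"),
        "messages_last_hour": sum(1 for e in last_hour if e["type"] == "message"),
    }

now = datetime.now(timezone.utc)
events = [
    {"type": "purchase", "timestamp": now - timedelta(minutes=20)},
    {"type": "message", "timestamp": now - timedelta(hours=3)},
]
print(behavioral_features(events, now))  # reflects behavior from minutes ago
```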

At Abnormal Security we build models of customer behavior and design feature representations that don’t rely on rigid components. Instead, we represent each customer and user in terms of their historical data distribution. Distributional changes cause this representation to change, which enables our system performance to automatically adapt and improve.

In order for machine learning to continue to drive impact in new applications, we will need to address these resilience challenges directly. We start with testing. ML is software, and good tests are an irreplaceable tool for building a resilient system. We will explore how to design end-to-end simulations to assess our models’ resilience. Next, certain feature encoding strategies are more resilient to sudden distribution shifts than others. For example, models trained with clever default values or hashed, bucketized features can be particularly resilient to localized feature outages. We will discuss the dynamics that drive this phenomenon. Finally, an ML model is a reflection of the task we train it to solve. By cleverly introducing noise into the training process, we can build models that perform well even during software incidents. We will dig into the best strategies for engineering tasks that yield robust and resilient models.
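As a small taste of the feature-encoding point, here is a minimal feature-hashing sketch with an assumed bucket count and missing-value convention: categorical values are hashed into a fixed number of buckets, so an unseen or missing value still maps to a well-defined bucket instead of breaking a rigid vocabulary lookup.

```python
import hashlib
from typing import Optional

N_BUCKETS = 1024  # assumed bucket count for illustration

def hash_bucket(feature_name: str, value: Optional[str]) -> int:
    if value is None:
        value = "__missing__"  # an explicit default keeps outages well-defined
    digest = hashlib.md5(f"{feature_name}={value}".encode()).hexdigest()
    return int(digest, 16) % N_BUCKETS

print(hash_bucket("sender_domain", "example.com"))
print(hash_bucket("sender_domain", None))  # a missing signal still gets a bucket
```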

As the Head of Machine Learning at Abnormal Security, Dan Shiebler builds cybercrime detection algorithms to keep people and businesses safe. Before joining Abnormal, Dan worked at Twitter, first as an ML researcher working on recommendation systems and then as the head of web ads machine learning. Before Twitter, Dan built smartphone sensor algorithms at TrueMotion and computer vision systems at the Serre Lab.
