
The Data Daily

Shapley Values: A Gateway Drug To Data Science Experiments

Your model is a hypothesis. Trained models are the beginning of Data Science experiments. When I started explaining this concept 3 years ago, it was probably the most controversial claim in a long list of controversial things I say. Then and now, the thought of deploying a hypothesis to production is scary.

Yet that is what most businesses have: hypotheses running in production. It’s the reason model monitoring is so critical. Models need constant attention and improvement because they are incomplete. A large scale model alone is an unsupported hypothesis and has little chance of achieving the reliability businesses are looking for.

Without explainability frameworks, large scale models are opaque. Across businesses, models are neither supported nor understood; teams don’t know how a model works or whether it works at all. Model maintenance frameworks keep the wheels from falling off, but you can see the flaws in legacy Data Science methodology. The legacy approach of train, test, statistically validate, and ship to production will lead to disaster.

Experimentation is required. That starts with running model architectures against each other with different feature sets. Why?

I am looking for models that have found something interesting and worth exploring further. Outside of basic accuracy metrics, how do I know the model has found something interesting? I’ve been setting this up with many of my prior posts, so I’m going to bring a lot of them full circle.

I start out looking for places where the existing approach, be it an expert system or an in-place model, fails. These classes of problems present the greatest opportunity for improvement and the lowest risk of failure. The current approach doesn’t work, so even a model that performs questionably is still an improvement upon failure.

I explained the statistical experiment process in an earlier post, and I am going to pick up from there. The statistical experiment has given me an excellent model of my problem space. I have a good predictor of the class(es) causing failure and some data on intervention effectiveness. Well-defined classes are critical to this next step.

I explained the need for a Data Librarian’s work around data discovery in a prior post as well. This process is critical because it is unlikely my current dataset is comprehensive enough to learn a function which closely approximates the behavior I’m trying to predict or the system I am attempting to model. I need to test out different combinations of features and integrate new features into the hypothesis generation stage.

I have trained several models, and some appear to perform well on my classes of interest. Those may have found something interesting and worth exploring. Now I need a framework to evaluate what they may have found. This is where model explainability comes into the picture. The features are the interesting details. Based on the model, there is a good likelihood that those features are at least related to the behavior I’m interested in.

In simple linear models, each feature’s contribution is easy to read directly from the function’s coefficients. In large scale models, the function is too complex for that. Enter Shapley values.
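
As a minimal sketch of that first case, a linear model’s learned coefficients can be read off directly. The data and feature names below are hypothetical:

```python
# Minimal sketch: for a linear model, each feature's contribution is just its
# learned coefficient. Data and feature names are hypothetical.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))                 # three made-up features
y = 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=500)

model = LinearRegression().fit(X, y)
for name, coef in zip(["sqft", "age", "lot_size"], model.coef_):
    print(f"{name}: {coef:+.3f}")             # contribution per unit of the feature
```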

Machine Learning is several different types of math hiding under one trench coat. Shapley values come to us from cooperative game theory. Cooperative games are exactly what they sound like: players form a team, or coalition, and that team can generate a better outcome or gain than any player could alone. Coalitions of people are a lot like coalitions of features. In most teams, individuals contribute differently to the outcome of the game. Shapley values help determine the impact of an individual player or feature on the outcome or prediction.
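
For a non-linear model, the open source shap package is a common way to estimate those contributions. Here is a sketch on hypothetical data; treat it as an illustration rather than a recipe, and verify the API against the shap version you install:

```python
# Sketch: approximate Shapley values for a tree ensemble using the shap package.
# The dataset, labels, and feature count are hypothetical placeholders.
import numpy as np
import shap
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 4))
y = (X[:, 0] + 0.5 * X[:, 2] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

model = GradientBoostingClassifier().fit(X, y)

explainer = shap.TreeExplainer(model)    # efficient Shapley estimates for tree models
shap_values = explainer.shap_values(X)   # one value per feature per prediction

# Mean absolute Shapley value per feature ~ each "player's" average contribution.
print(np.abs(shap_values).mean(axis=0))
```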

As with all borrowed math, the implementation comes with hidden assumptions. Shapley values bake in an assumption of feature independence, which is a really bad assumption for most real datasets. As a result, Shapley values alone need improvement.

One reason understanding the class is so important is that Shapley values have reliability issues for a given prediction but do very well across a class of predictions. However, when features are highly correlated, Shapley values become unreliable, which defeats the purpose. In early experimental runs, where we don’t yet understand how features may be correlated, this can lead to a lot of unnecessary experimental dead ends.
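
One cheap guardrail in those early runs is to flag highly correlated feature pairs before leaning on per-prediction Shapley values. A sketch, where the feature table and the 0.8 cutoff are hypothetical:

```python
# Sketch: flag feature pairs whose correlation undermines the independence
# assumption behind vanilla Shapley values. Threshold and inputs are hypothetical.
import pandas as pd

def flag_correlated_features(features: pd.DataFrame, threshold: float = 0.8):
    corr = features.corr().abs()
    cols = list(corr.columns)
    return [
        (cols[i], cols[j], round(float(corr.iloc[i, j]), 3))
        for i in range(len(cols))
        for j in range(i + 1, len(cols))
        if corr.iloc[i, j] >= threshold
    ]

# Usage with a hypothetical feature table:
# print(flag_correlated_features(features_df))
```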

New versions of Shapley values and versions that integrate with other explainability frameworks have been developed over the last 4 years. There are frameworks integrating causal methods now as well. Each improvement has led to a better understanding of the model’s hypothesis, which leads to experiments with a higher likelihood of success.

For large scale models with lots of features, the calculation of Shapley values is computationally intensive. Really intensive. It cannot be done on a laptop and requires distributed compute resources and frameworks. Microsoft open sourced SynapseML, which supports SHAP running on Spark.
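
A sketch of what that looks like with SynapseML’s tabular SHAP explainer on Spark. Treat the class and parameter names as assumptions to check against the SynapseML release you install; fitted_pipeline, train_df, and instances_df are hypothetical placeholders:

```python
# Sketch: distributed SHAP with SynapseML on Spark. fitted_pipeline, train_df,
# and instances_df are hypothetical; verify TabularSHAP's parameters against
# the SynapseML version you are using.
from pyspark.sql.functions import broadcast, rand
from synapse.ml.explainers import TabularSHAP

feature_cols = ["f1", "f2", "f3"]             # hypothetical feature columns

shap_explainer = TabularSHAP(
    inputCols=feature_cols,
    outputCol="shapValues",
    numSamples=5000,
    model=fitted_pipeline,                    # a fitted Spark ML model or pipeline
    targetCol="probability",
    targetClasses=[1],
    backgroundData=broadcast(train_df.orderBy(rand()).limit(100).cache()),
)

shap_df = shap_explainer.transform(instances_df)   # Shapley values computed across the cluster
```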

This is one of the pieces of infrastructure necessary to support Data Science experiments, and few experiment management frameworks offer it.

The strength of the relationship between feature and prediction forms the basis for the hypothesis behind my model. The model believes these features are related to the prediction for this class. It appears to perform well, which supports that belief and makes it worth further exploration. Are the features really related, though?

I am going to build an experiment to support or refute those relationships. Each experiment is expensive, and many are not feasible. For any experiment to move forward, the return in model reliability must be significant.

Remember, the experiment does not improve model quality. It creates a body of evidence to support the model’s hypothesis. In most cases, the experiment proves the model is flawed and indicates new lines of exploration. Experimental results often support the Data Librarian’s data discovery efforts.

Both experiment result types have value. For business critical or product critical functionality, model reliability requirements justify the cost of the experiments. For functionality with safety nets, as I’ve discussed in a prior post, reliability requirements cannot justify the costs.

Shapley values, and improvements on them, show us the relationship between the prediction and the features. The experiment shows us the relationship between the features and real world outcomes. Let’s look at some examples.

For housing prices, there are features that contribute to the value. If those features were static, home prices would be a known quantity, easily predicted. However, people are involved in the marketplace. People are not rational actors, so home prices have volatility based not only on features that contribute to home value but also on features that contribute to buyer and seller perceptions of value.

In my Zillow post, the business did a lot of validation on the features that contribute to home value but not on the behavioral features. The home value experiments are cheap and accessible. The behavioral experiments are expensive and often inaccessible. As a result, they did not experimentally support their models, and the models were not reliable enough to support the business model. In this case, the cost of experimentation was justified by the model reliability requirements.

For customer churn, it is easier to run the experiments associated with the causes of churn than the experiments associated with why the interventions that prevent churn actually work. The business has a lot of observational data about customers who leave or stop buying. Generating a hypothesis to explain why customers leave and validating it experimentally is inexpensive.

The business has less data on customers who are projected to churn and are retained by an intervention. It’s a common attribution problem to prove the customer was going to leave and stayed because of the intervention. It’s even harder to segment the features associated with an intervention in a way that allows us to find out what about that intervention caused retention. For most businesses, this line of experimentation is infeasible without significant expenditure.

You can find a set of interventions that reduce most classes of churn with basic trial and error. Analytics provides a lot of value when it’s used to direct which intervention to try. This is our statistical experimental framework. The interventions are not well supported, and the reason one intervention works better than some random intervention is not well understood.

The complex churn predictor and the analytical intervention predictor reduce overall churn by a certain amount. To justify the intervention line of experimentation, I would have to give the business a reasonable return. How much is churn costing right now? Is a 25% reduction (or whatever the approach supports) in churn enough to offset the costs?
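
A back-of-the-envelope version of that calculation, with made-up numbers:

```python
# Back-of-the-envelope check with hypothetical numbers: does the supported
# churn reduction offset the cost of the intervention experiments?
annual_churn_cost = 4_000_000      # hypothetical revenue lost to churn per year
supported_reduction = 0.25         # hypothetical reduction the approach supports
experiment_cost = 600_000          # hypothetical cost of the experiment line

expected_return = annual_churn_cost * supported_reduction
print(f"Expected annual return: ${expected_return:,.0f}")
print(f"Net of experiment cost: ${expected_return - experiment_cost:,.0f}")
print("Justified" if expected_return > experiment_cost else "Not justified")
```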

If I can think up a creative, low cost experiment, I might be able to make the case. Often that experiment is made possible by a Data Librarian discovering or creating a novel behavioral dataset. If I discover a way to indirectly observe retention behaviors, it can facilitate a low cost experiment.

Experimental design is a critical capability because lowering the cost of, or figuring out how to execute, a previously unfeasible experiment can lead to a lot of value for the business. Those are the types of models that create a competitive advantage which is difficult to duplicate. However, none of that is possible unless the Data Science team has a process to evaluate models. Step 1 in the experimental design process is to come up with a promising hypothesis. Machine Learning provides a framework to streamline hypothesis generation, and Shapley values provide a framework for evaluation.

With an experimental process and research phase, I can explain the risks around any model. This allows the business to decide which experiments must move forward and which are not justified. It also allows the business to understand reliability guarantees versus reliability requirements. Take Zillow. They had a very good understanding of their reliability requirements but did not make the connection to reliability guarantees. Their decision to implement a business model was flawed because of it. Sometimes businesses get lucky but, in this case, it didn’t work out.

In the churn example, any intervention set that prevents churn better than what’s in place now is worth deploying. The risk of not understanding why the interventions work is minimal. The current cost of experimental runs is too high, but it’s something the team will be thinking about. If we can find a lower cost framework, it is worth revisiting. Again, the business has enough information to make a quality decision.

Sometimes you live with the risks. Sometimes you mitigate them. Sometimes you abandon the project.
