Experimentation in Data Science
Aug 7
Today I am going to talk about experimentation in data science: why it is so important, and some of the techniques we might consider using when AB testing is not appropriate. Experiments are designed to identify causal relationships between variables, and this is a really important concept in many fields and particularly relevant for data scientists today. Let’s say we are a data scientist working in a product team. In all likelihood, a large part of our role will be to identify whether new features will have a positive impact on the metrics we care about, i.e. if we introduce a new feature making it easier for users to recommend our app to their friends, will this improve user growth? These are the types of questions that product teams are interested in, and experiments can help provide an answer. However, causality is rarely easy to identify, and there are many situations where we will need to think a bit deeper about the design of our experiments so we do not make incorrect inferences. When this is the case, we can often use techniques taken from econometrics, and I will discuss some of these below. Hopefully, by the end, you will have a better understanding of when these techniques apply and how to use them effectively.
Most people reading this have probably heard of AB testing, as it is an extremely common method of experimentation used in industry to understand the impact of changes we make to our product. It could be as simple as changing the layout on a web page or the colour of a button and measuring the effect this change has on a key metric such as click-through rate. I won't get into the specifics too much here as I want to focus more on the alternative techniques, but for those interested in learning more about AB testing, the following course on Udacity provides a really good overview. In general, we can take two different approaches to AB testing: the frequentist approach and the Bayesian approach, each of which has its own advantages and disadvantages.
I would say frequentist AB testing is by far the most common type of AB testing done, and it follows directly from the principles of frequentist statistics. The goal is to measure the causal effect of our treatment by seeing if the difference in our metric between the A and B groups is statistically significant at some significance level; 5 or 1 per cent is typically chosen. More specifically, we need to define a null and alternative hypothesis and determine whether or not we can reject the null. Depending on the type of metric we choose we might use a different statistical test, but chi-square and t-tests are commonly used in practice. The frequentist methodology does have some limitations: I think it is a bit harder to interpret and explain compared to the Bayesian approach, yet because the underlying maths is more complex in the Bayesian setting, the Bayesian approach is not as commonly used. A key point about the frequentist approach is that the parameter or metric we compute is treated as a constant, so there is no probability distribution associated with it.
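As a sketch of what this looks like in practice, suppose we compare click-through rates between two variants using made-up counts; statsmodels provides a two-proportion z-test for exactly this comparison (the numbers below are purely illustrative):

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical results: clicks and impressions for variants A and B
clicks = np.array([520, 580])        # successes in A and B
impressions = np.array([10_000, 10_000])

# Two-sided z-test for the difference in click-through rates
stat, p_value = proportions_ztest(clicks, impressions)

print(f"CTR A: {clicks[0] / impressions[0]:.3f}, CTR B: {clicks[1] / impressions[1]:.3f}")
print(f"z = {stat:.3f}, p-value = {p_value:.4f}")
if p_value < 0.05:
    print("Reject the null: the difference is significant at the 5% level")
else:
    print("Fail to reject the null at the 5% level")
```

With these particular counts the difference turns out not to be significant at the 5 per cent level, which is exactly the kind of binary conclusion the frequentist framework gives you.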
The key difference in the Bayesian approach is that our metric is a random variable and therefore has a probability distribution. This is quite useful as we can now incorporate uncertainty about our estimates and make probabilistic statements, which are often much more intuitive to people than the frequentist interpretation. Another advantage of the Bayesian approach is that we may reach a conclusion faster than with a frequentist test, as we do not necessarily need to assign equal numbers of users to each variant, meaning it can converge to a solution using fewer resources. Choosing which approach to take will obviously depend on the individual situation and is largely up to the data scientist. Whichever method you choose, both are nonetheless powerful ways to identify causal effects.
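A minimal sketch of the Bayesian version of the same comparison, using made-up counts and a uniform Beta prior (a common, though not the only, choice for rates). The payoff is a directly probabilistic statement: the probability that B beats A.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data: clicks and impressions for each variant
clicks_a, n_a = 520, 10_000
clicks_b, n_b = 580, 10_000

# With a Beta(1, 1) prior, the posterior for each CTR is Beta(successes + 1, failures + 1)
samples_a = rng.beta(clicks_a + 1, n_a - clicks_a + 1, size=100_000)
samples_b = rng.beta(clicks_b + 1, n_b - clicks_b + 1, size=100_000)

# Probability that B's true CTR exceeds A's — a statement the frequentist
# approach cannot make directly
prob_b_better = (samples_b > samples_a).mean()
print(f"P(B > A) = {prob_b_better:.3f}")
```

Notice that the same data that failed a 5 per cent significance test can still yield a high probability that B is better, which is often a more useful framing for product decisions.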
When AB testing doesn’t cut it
In many cases, however, AB testing is just not a suitable technique for identifying causality. For example, for AB testing to be valid we must have random selection into both the A and B groups. This is not always possible, as some interventions may target individuals for a specific reason, making them fundamentally different from other users. In other words, selection into each group is non-random. I will discuss and provide code for a specific example of this that I ran into recently a bit later in the post.
Another reason AB testing may not be valid is when we have confounding. In this situation, looking at correlations between variables may be misleading: we want to know if X causes Y, but it may be the case that some other variable Z drives both. This makes it impossible to disentangle the effect of X alone on Y, and very difficult to infer anything about causality. This is often called omitted variable bias and will result in us either over- or underestimating the true impact of X on Y. In addition, it may not be feasible from a business standpoint to design a randomised experiment, as it may cost too much money or be seen as unfair if we gave some users new features while withholding them from others. In these circumstances, we must rely on quasi-random experiments.
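A quick simulation makes the confounding problem concrete. Below, Z drives both X and Y while X has no causal effect on Y at all, yet the naive correlation between X and Y is large; the coefficients are made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

# Z is a confounder driving both X and Y; X has NO true effect on Y
z = rng.normal(size=n)
x = 2 * z + rng.normal(size=n)
y = 3 * z + rng.normal(size=n)

# The naive correlation is strong even though X does not cause Y
naive = np.corrcoef(x, y)[0, 1]
print(f"corr(X, Y) = {naive:.2f}")

# Removing Z's influence from X (here using the known coefficient of 2;
# in practice you would estimate it with a regression) kills the correlation
x_resid = x - 2 * z
partial = np.corrcoef(x_resid, y)[0, 1]
print(f"corr(X | Z, Y) = {partial:.2f}")
```

This is exactly the omitted variable bias described above: leave Z out and you wrongly conclude X affects Y; control for it and the apparent effect disappears.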
Ok, so we have discussed some of the reasons we may not be able to apply AB testing but what can we do instead? This is where econometrics comes in. Compared to machine learning, econometrics is much more focused on causality and as a result, economists/social scientists have developed a breadth of statistical techniques aimed at understanding the causal impact of one variable on another. Below I will show some of the techniques that data scientists could and should borrow from econometrics to avoid making incorrect inferences from experiments that suffer from the problems mentioned above.
Let's say we are a company with an app like Medium, where users can subscribe to topics to get updates on content related to those topics. Now imagine your product owner comes to you with a hypothesis: subscribing to more topics will make users more engaged because they will be getting more relevant and interesting new content to read. This sounds like a reasonable hypothesis and we may think it is likely to be true, but how can we actually measure this using an experiment?
The first thing you might consider doing is a traditional AB test, randomly splitting users into two groups. This will be very difficult, however, as users will already have differing numbers of topic subscriptions and we cannot really randomly assign topic subscriptions to users. The main issue is that users who are subscribed to more topics are likely to be inherently more engaged than other users for some unobserved reason. This is essentially an endogeneity problem: there is some unseen factor driving engagement for those users.
Luckily for us, economics has the answer. We can use something known as instrumental variables to help us solve this endogeneity problem. If we can find a variable (an instrument) that is correlated with our endogenous variable (number of subscriptions) but affects our dependent variable (engagement) only through that channel, and not through the unseen factor, we can use it to purge the effect of the unseen factor and get an unbiased causal estimate of the impact of the number of subscriptions on engagement. But what can we use as an instrument? This is often the biggest issue in practice, as valid instruments are quite difficult to find in many settings. However, if we are at a company that conducts a lot of AB testing, we can use this to our advantage. We need two assumptions to hold for a valid instrument:
Strong first stage: the instrument must be strongly correlated with the endogenous variable (the number of subscriptions).
Exclusion restriction: the instrument must affect engagement only through its effect on the number of subscriptions.
Let's say in the past we ran an AB test that randomly split users into two groups and showed the B group a reminder on the home page to subscribe to topics. Let’s assume that this AB test showed a positive impact: users who received the treatment ended up subscribing to more topics than the control group. We can use this experiment as an instrument to help us identify causality in our engagement question. Why does this work? Well, the first assumption is satisfied because the AB test had a statistically significant positive impact on the number of subscriptions. The second assumption is also met because assignment was random, so being treated is uncorrelated with the unobserved drivers of engagement and affects engagement only through increasing the number of subscriptions.
Now, to actually implement this we can use something known as two-stage least squares. The good news is that this is pretty easy to do and just involves estimating two regressions. As the name implies, there are two stages. In the first stage, we regress the number of subscriptions on the treated variable (a dummy variable identifying whether a user was in the treatment or control group for the AB test). This gives us the predicted number of subscriptions for those in the treatment versus those in the control. In the second stage, we then regress engagement on the predicted values from the first stage. This gives us an unbiased estimate of the causal impact of the number of subscriptions on engagement.
Regress the number of subscriptions on the treatment dummy: get predicted values
Regress engagement on the predicted values.
You could do this in Python using the statsmodels library and it would look similar to the code below.
import statsmodels.formula.api as smf

# First stage: regress the number of subscriptions on the treatment dummy
reg1 = smf.ols(formula="number_of_subscriptions ~ treated", data=data).fit()
data["pred"] = reg1.predict()

# Second stage: regress engagement on the predicted number of subscriptions
reg2 = smf.ols(formula="engagement ~ pred", data=data).fit()
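One caveat with running the two stages by hand is that the second-stage standard errors are not quite right; dedicated packages such as linearmodels (IV2SLS) correct for this. To see that the point estimate works, here is a self-contained sketch on simulated data (all coefficients and variable names are made up): an unobserved "enthusiasm" factor confounds subscriptions and engagement, so naive OLS is biased while 2SLS recovers the true effect.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 20_000

# Unobserved enthusiasm drives both subscriptions and engagement
enthusiasm = rng.normal(size=n)
treated = rng.integers(0, 2, size=n)  # random AB assignment: our instrument
subs = 1 + 0.5 * treated + enthusiasm + rng.normal(size=n)
engagement = 2.0 * subs + 3 * enthusiasm + rng.normal(size=n)  # true effect is 2.0

data = pd.DataFrame(
    {"treated": treated, "number_of_subscriptions": subs, "engagement": engagement}
)

# Naive OLS: biased upwards by the unobserved confounder
ols = smf.ols("engagement ~ number_of_subscriptions", data=data).fit()

# 2SLS: first stage, then regress engagement on predicted subscriptions
first = smf.ols("number_of_subscriptions ~ treated", data=data).fit()
data["pred"] = first.predict()
second = smf.ols("engagement ~ pred", data=data).fit()

print(f"OLS estimate:  {ols.params['number_of_subscriptions']:.2f} (biased)")
print(f"2SLS estimate: {second.params['pred']:.2f} (close to the true 2.0)")
```

The OLS coefficient comes out well above 2 because it absorbs the enthusiasm effect, while the 2SLS estimate lands near the true value.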
Regression Discontinuity Design (RDD)
RDD is another technique taken from economics and is particularly suited to situations where we have a continuous variable with a natural cut-off point: those just below the cut-off do not get the treatment and those just above it do. The idea is that these two groups of people are very similar, so the only systematic difference between them is whether they were treated or not. Because the groups are considered very similar, this assignment essentially approximates randomised selection. In particular, any difference in the Y variable between the two groups at the cut-off is attributed to the treatment.
The classic example of RDD, which will hopefully make the idea clearer, asks: what is the impact of receiving social recognition (a certificate of merit) for outstanding academic achievement? Does it lead to better outcomes compared to those who did not receive this recognition? The reason we can apply RDD here is that certificates of merit were only given to individuals who scored above a certain threshold on a test. We can then simply compare the average outcome between the two groups at the threshold to see if there is a statistically significant effect.
It may help to visualise RDD. Below is a graph which shows the average outcome below and above the threshold. Essentially, all we are doing is measuring the difference between the two blue lines either side of the cut-off point. There are many more examples of RDD in economics that will hopefully give you a flavour of the different use cases, and for those who want more background on these topics, you can check out the following free course.
Figure: example RDD with the cut-off threshold at 50 on the x-axis
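To make this concrete, here is a sketch of an RDD estimate on simulated data mirroring the certificate example: scores are simulated, the certificate is "awarded" at 50, and we fit a local linear regression around the cut-off. All numbers (including the true effect of 3.0) are made up.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 5_000

# Simulated test scores and a later outcome; the certificate is awarded at
# scores >= 50 and adds a true effect of 3.0
score = rng.uniform(0, 100, size=n)
treated = (score >= 50).astype(int)
outcome = 10 + 0.1 * score + 3.0 * treated + rng.normal(size=n)

df = pd.DataFrame({"score": score, "treated": treated, "outcome": outcome})

# Local linear regression: centre the running variable at the cut-off, allow
# different slopes on each side, and use only observations near the cut-off
df["centred"] = df["score"] - 50
window = df[df["centred"].abs() < 10]
rdd = smf.ols("outcome ~ treated + centred + treated:centred", data=window).fit()

print(f"Estimated jump at the cutoff: {rdd.params['treated']:.2f} (true value: 3.0)")
```

The coefficient on `treated` is the estimated jump at the threshold, i.e. the difference between the two lines at the cut-off in the figure above. The choice of window width is a real design decision in practice, trading off bias against variance.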
Difference in Differences (Diff in Diff)
I will go into a bit more detail on Diff in Diff as I recently used this technique on a project. The problem is one that many data scientists might come across in their daily work: preventing churn. It is very common for companies to try to identify customers who are likely to churn and then design interventions to prevent it. Identifying churn is a machine learning problem and I won't go into that here. What we care about is whether we can come up with a way to measure the effectiveness of our churn intervention. Being able to empirically measure the effectiveness of our decisions is incredibly important, as we want to quantify the effects of our features (usually in monetary terms), and it is a vital part of making informed decisions for the business. One such intervention would be to send an email to those at risk of churning, reminding them of their account and in effect trying to make them more engaged with our product. This is the basis of the problem we will be looking at here.
Ok, so we want to know whether this email campaign was effective, so how can we do this? Well, the first thing we need to do is come up with some sort of metric that we would expect to be impacted by the campaign. Thinking about it logically, if we prevent users from churning then they will keep using our product and generate revenue for the company, so let's use average revenue as our metric. The metric you pick will largely depend on the problem: if we didn't expect the intervention to affect revenue in a relatively direct way, then revenue would probably not be the best choice for this experiment. To sum up, our hypothesis states that sending this email to people at risk of churning will prevent them churning, leading to improved average revenue from those customers. If this hypothesis is true, we expect average revenue in the treatment group to be higher than in the control group (those who did not receive the email).
Now you might think this sounds like a regular old AB test, but there is one major problem with that approach: the A and B groups are fundamentally different because selection into the groups was not random. Users in group B are those with a higher likelihood of churning, whereas users in group A are less likely to churn, based on the results of our machine learning model. We are not comparing apples with apples, which will cause our results to be biased. The technical term for this is selection bias.
We can, however, deal with this using Diff in Diff. Essentially, it compares the difference in the metric between the control and treatment groups before and after the treatment takes place. Doing this allows us to control for pre-existing differences between the groups, reducing the selection bias. In equation form, where a is the control group, b is the treatment group, t is the pre-intervention period and t+1 is the post-intervention period:

DD = (Ȳ_b,t+1 − Ȳ_b,t) − (Ȳ_a,t+1 − Ȳ_a,t)

Figure 1: diff in diff estimator
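The estimator in figure 1 is just arithmetic on four group means. A minimal sketch with made-up average revenue numbers:

```python
import pandas as pd

# Hypothetical average revenue per group and period
means = pd.DataFrame({
    "group": ["control", "control", "treatment", "treatment"],
    "period": ["pre", "post", "pre", "post"],
    "avg_revenue": [10.0, 10.5, 8.0, 9.5],
})

pivot = means.pivot(index="group", columns="period", values="avg_revenue")

# Diff in Diff: change in the treatment group minus change in the control group
dd = (pivot.loc["treatment", "post"] - pivot.loc["treatment", "pre"]) \
   - (pivot.loc["control", "post"] - pivot.loc["control", "pre"])
print(f"DD estimate: {dd:.2f}")  # (9.5 - 8.0) - (10.5 - 10.0) = 1.00
```

The control group's change stands in for what would have happened to the treatment group without the intervention; subtracting it strips out the pre-existing level difference between the groups.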
The good thing about this method is that it is pretty easy to estimate and we can do it using a regression framework. This also provides the added benefit of being able to control for other characteristics, such as age and location, reducing the likelihood of omitted variable bias. One thing to be aware of is that certain assumptions must be met for the technique to be valid. Probably the most important of these is the parallel trends assumption. Simply put, it states that if there were no intervention, the difference between the treatment and control groups would be constant over time. This is hard to test formally, and often the best way is to just visually inspect the trends over time.
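As a minimal sketch of that visual check, with made-up pre-period averages, we can tabulate each group's trend and look at the gap between them; in practice you would plot `trends` and eyeball whether the lines move in parallel before the intervention:

```python
import pandas as pd

# Hypothetical monthly average revenue per group over the pre-treatment period
pre = pd.DataFrame({
    "month": [1, 2, 3, 1, 2, 3],
    "group": ["control"] * 3 + ["treatment"] * 3,
    "avg_revenue": [10.0, 10.2, 10.4, 8.0, 8.2, 8.4],
})

# If the gap between the two lines is roughly constant over the pre-period,
# the parallel trends assumption looks plausible
trends = pre.pivot(index="month", columns="group", values="avg_revenue")
gap = trends["control"] - trends["treatment"]
print(gap)
```

In this toy data the gap is exactly constant, which is the pattern that would make us comfortable proceeding with Diff in Diff.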
Next up, I am going to walk through a quick example of Diff in Diff in Python to demonstrate how you can estimate this type of model in practice. Note: unfortunately, because the data I was using was sensitive, I won't be able to dig into the original dataset, but here is a similar dataset with customer demographic and transaction information.
Diff in Diff Case Study
Continuing on from the explanation above, we are now going to implement Diff in Diff in Python to show you how straightforward it is to estimate in practice. In the following code, I am assuming that we have two main sources of data: a transactions table recording each user's transactions, and a user table containing information on customers such as their age and location. Our goal is to estimate an equation of the following form:
Y = β₀ + β₁·Treat + β₂·Time + δ·(Treat × Time) + γX + ε

Figure 2: Diff in Diff regression equation
Y is our average revenue metric
Treat = 1 if the users are in the treatment group and 0 otherwise
Time = 1 if we are post-treatment and 0 for pre-treatment
δ is our Diff in Diff estimator (DD) from figure 1 above
X is a matrix of our control variables
Keeping with the equation above, we first need to define the treat and time dummy variables, which split users into control and treatment and into before and after the treatment. We can then define our DD interaction term by multiplying these two features together. The final result is a data frame containing our users' characteristics (X), the two dummy variables and our interaction term, which is what we are most interested in. Now, all we have to do is regress our average revenue metric on these variables and see if δ is statistically significant at our chosen significance level (5 per cent). The statsmodels library makes implementing the regression very simple. For brevity, I have left out any data cleaning and feature engineering, but this is an important part of the process and something you should do as part of the normal workflow.
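The variable construction step described above might look like this in pandas (the column names and values here are hypothetical):

```python
import pandas as pd

# Hypothetical user-period panel: one row per user per period
panel = pd.DataFrame({
    "user_id": [1, 1, 2, 2],
    "in_treatment_group": [True, True, False, False],
    "period": ["pre", "post", "pre", "post"],
    "revenue": [8.0, 9.5, 10.0, 10.5],
})

# Dummy variables for treatment-group membership and the post-treatment period
panel["treated"] = panel["in_treatment_group"].astype(int)
panel["time_dummy"] = (panel["period"] == "post").astype(int)

# The DD interaction term is simply the product of the two dummies: it is 1
# only for treated users in the post-treatment period
panel["DD"] = panel["treated"] * panel["time_dummy"]
print(panel[["treated", "time_dummy", "DD"]])
```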
import statsmodels.formula.api as smf

# X stands in for our control variables, e.g. 'age + location'
reg1 = smf.ols(formula='revenue ~ treated + time_dummy + DD + X', data=data)
res = reg1.fit()
Figure 3: Diff in Diff Results
We can see from the results in figure 3 that our DD estimator is positive and significant at the 5 per cent level, indicating that our treatment has had a positive impact. More specifically, the p-value tells us that the probability of observing an effect at least this large if the null hypothesis (the intervention having no effect) were true is very small. We can therefore reject the null hypothesis and conclude that our email campaign has been successful in reducing customer churn and, as a result, average revenue is higher for that group because they continue to use our product.
Ok, guys, that's all for this post. Hopefully, this highlights the relevance of undertaking experiments in data science and how econometric techniques can be particularly useful in identifying causality when we face problems such as endogeneity and sample selection bias. Below are some useful links for anyone wanting to dive into these concepts a bit further.
References and Useful Resources: