It goes without saying that you need to understand the dataset you are working with in order to perform effective analyses and develop sound models.
However, what is not often said, with regard to data leakage, is that you should refrain from studying the distributions or basic statistics of your dataset until after you split your data into train-validation-test groups (more on splitting your dataset later). If you examine your data before you split it, you will gain insights about rows that might end up in your test group, which is effectively data leakage.
Data cleaning is a necessary step in most, if not all, data science projects. One common data cleaning task is to handle rows with duplicate information. In the context of data leakage, it is important to remove duplicate rows so that duplicates do not end up in both train and test groups.
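As a minimal sketch with pandas (the file name is a placeholder for your own data source), duplicates can be dropped before the split:

```python
import pandas as pd

# Hypothetical file name; substitute your own data source.
df = pd.read_csv("data.csv")

# Drop exact duplicate rows so the same record cannot end up in
# both the train and test groups after splitting.
df = df.drop_duplicates().reset_index(drop=True)
```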
The important next step in preventing data leakage is to select features, or independent variables, that are not direct proxies for the target variable and that were available at the time of prediction.
If a feature essentially restates the target variable, it will inflate your model's apparent ability to predict the target. For example, if your target variable is household income and you include household expenses as a feature, it stands to reason that your model would appear better at predicting household income, because household expenses are a fairly direct indicator of income. Such a feature gives the answer away rather than capturing a genuine predictive relationship, so it is best to omit it from your model in order to prevent data leakage.
Predicting the past with future data is a form of data leakage. Time series data, features that were not available at the time of prediction, and categorical features that encode future information are all leakage threats related to the temporal order (sequential timing) of the data.
When dealing with time series data, the challenge is making sure not to leak information about the future into the past, for example by including 2019 data in a model that was trained to make predictions about 2018. One option to overcome data leakage with time series data is to split chronologically, so that the train, validation, and test groups each cover a distinct period of time.
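As a minimal sketch, assuming a pandas DataFrame `df` with a `year` column (the cutoff years are illustrative):

```python
# Split chronologically so the model never trains on data from
# after the period it is asked to predict.
train = df[df["year"] <= 2017]
validation = df[df["year"] == 2018]
test = df[df["year"] == 2019]
```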
Another option is to remove all data/features not available at the time of prediction. For example, information about whether a customer will be late on a loan payment might not be available until after the customer is actually late and the loan has already been approved. Therefore, if information about late payment status was included in a model that decided whether or not to approve a loan to that customer, there would be data leakage that would render the model unrealistic in a real-world setting.
Additionally, it is worth examining categorical features that encode information about the future. For example, a customer could have been labeled a “big spender” based on a high frequency of purchases, but this label would cause data leakage if it were fed to a model trying to predict which customers will become frequent return customers.
For more information about preventing data leakage when working with time series data, check out Rafael Pierre’s “Data Leakage, Part I: Think You Have a Great Machine Learning Model? Think Again.” And for a more in-depth explanation of temporal ordering, check out Devin Soni’s “Data Leakage in Machine Learning.”
Splitting data into two groups, train and test, is standard practice. However, I recommend taking it a step further and splitting data into three groups: train, validation, and test. With three groups, you have a validation group to help tune model hyperparameters as well as a final test group to use after the model is tuned.
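Here is a minimal sketch of a three-way split with scikit-learn, assuming a feature matrix `X` and target `y` (the 60/20/20 ratios are illustrative):

```python
from sklearn.model_selection import train_test_split

# First carve off the final test set (20% of all rows).
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)
# Then split the remainder into train and validation
# (0.25 of the remaining 80% yields a 20% validation set).
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42
)
```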
After splitting, do not perform exploratory data analysis on the validation and test sets — only the train set! Any additional features or model updates generated by insights from examining the validation or test sets are instances of data leakage.
Partitioning is an important step to consider when splitting a dataset into train, validation, and test groups if there are multiple rows from the same source. Partitioning involves grouping that source’s rows and including them in only one of the split sets; otherwise, data from that source would be leaked across multiple sets.
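A sketch using scikit-learn’s GroupShuffleSplit, assuming pandas objects and a hypothetical `customer_id` column identifying each row’s source:

```python
from sklearn.model_selection import GroupShuffleSplit

# Keep all rows from one customer together in a single split.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.20, random_state=42)
train_idx, test_idx = next(splitter.split(X, y, groups=df["customer_id"]))
X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
```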
Many machine learning algorithms require normalization. However, it is important to normalize AFTER splitting the data. If you normalize before splitting, the mean and standard deviation used to normalize will be computed from the full dataset rather than the training subset alone, thereby leaking information about the validation and test sets into the train set.
After splitting the data into train, validation, and test sets, the correct approach is to fit the normalization on the train set, then apply the train set’s mean and standard deviation when normalizing the validation and test sets, as in the sketch below.
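A minimal sketch with scikit-learn’s StandardScaler, assuming the splits from earlier:

```python
from sklearn.preprocessing import StandardScaler

# Fit the scaler on the train set only, so its mean and standard
# deviation never include validation or test information.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# Reuse the train-set statistics on the other splits.
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)
```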
However, if you plan to use Grid Search Cross Validation to tune hyperparameters, scaling the train set before cross-validation will cause data leakage, because cross-validation further divides the train set into additional train and test folds. It is recommended to use Pipeline with GridSearchCV so that preprocessors such as StandardScaler are fit within each fold. Below is a sketch of code that uses a pipeline to scale data appropriately with GridSearchCV.
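This sketch uses LogisticRegression as a stand-in estimator, and the parameter grid is illustrative:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# The pipeline re-fits StandardScaler inside each cross-validation
# fold, so the scaler never sees that fold's held-out data.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# The "model__" prefix routes each parameter to its pipeline step.
param_grid = {"model__C": [0.01, 0.1, 1, 10]}

grid = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)
```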
Last, but not least, it is important to have a healthy skepticism when assessing model performance.
Multiple sources warn data scientists to be wary when a model has high performance scores, because excellent scores can indicate data leakage. What follows are six points to consider when assessing a model’s performance; the answers to the questions they raise can signal whether or not data leakage is occurring.
When assessing your model, first think about the machine learning algorithm being used. Is it a weak algorithm performing surprisingly well on a complex problem? Has the algorithm been used on similar data or problems in the past? How does its performance in those cases compare to your current model’s performance?
Look back to your baseline model and watch for performance that is “too good to be true.”
If your algorithm returns feature importances, try to come up with a logical explanation for why each important feature matters. If no logical reason exists, there is potential for data leakage. Also try taking features out of your model one at a time; if removing a feature causes a dramatic reduction in performance, that feature may contain leakage.
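One way to sketch the remove-one-feature check, assuming a scikit-learn estimator named `model` and the pandas splits from earlier:

```python
from sklearn.base import clone

# Score a model trained on all features, then retrain without each
# feature in turn; a dramatic drop when one feature is removed
# suggests that feature may be doing too much of the work.
baseline = clone(model).fit(X_train, y_train).score(X_val, y_val)
for col in X_train.columns:
    reduced = clone(model).fit(X_train.drop(columns=col), y_train)
    score = reduced.score(X_val.drop(columns=col), y_val)
    print(f"{col}: {baseline - score:+.3f} drop when removed")
```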
It is recommended to use multiple measures when assessing a model’s performance. If you rely on a single metric (e.g., accuracy), that metric could be hiding data leakage that other metrics would uncover.
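For example, assuming a hypothetical binary classifier `model` that implements predict_proba, several metrics can be checked side by side:

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, roc_auc_score)

# Near-perfect scores across every metric at once are a common
# symptom of data leakage.
y_pred = model.predict(X_val)
y_prob = model.predict_proba(X_val)[:, 1]
print("accuracy :", accuracy_score(y_val, y_pred))
print("precision:", precision_score(y_val, y_pred))
print("recall   :", recall_score(y_val, y_pred))
print("ROC AUC  :", roc_auc_score(y_val, y_prob))
```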
After fitting a model and measuring performance, it is worth exploring the distributions of the train, validation, and test groups to discover differences or patterns that could explain the results.
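A quick sketch, assuming the pandas splits from earlier:

```python
# Compare summary statistics of the same features across splits;
# a feature whose distribution shifts sharply between groups can
# help explain surprising scores.
for name, split in [("train", X_train), ("validation", X_val), ("test", X_test)]:
    print(name)
    print(split.describe().loc[["mean", "std"]])
```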
As a final point, if you have the resources, it is always advantageous to collect more data and test the model’s performance on new validation sets that are as close to reality as possible.