What is data leakage? What causes data leakage? And how do I know if there is data leakage in my experiment/model?
What is data leakage?
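Data leakage happens when information from the test dataset is shared with the training dataset, or vice versa. Because the model has effectively seen the data it is evaluated on, its evaluation results look better than they really are. It usually occurs during the preprocessing and feature engineering stages, for example when a scaler or imputer is fitted on the full dataset before it is split into training and test sets.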
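To make the preprocessing case concrete, here is a minimal sketch using only the Python standard library. The toy feature values, the split point, and the helper names `fit_scaler` and `transform` are assumptions made up for illustration. It contrasts scaling statistics computed on all rows (leaky) with statistics computed on the training rows only.

```python
import statistics

def fit_scaler(values):
    """Return (mean, std) computed from the given values only."""
    return statistics.fmean(values), statistics.pstdev(values)

def transform(values, mean, std):
    """Standardize values with previously fitted statistics."""
    return [(v - mean) / std for v in values]

# Hypothetical toy feature; the last two rows are held out as the test set.
feature = [1.0, 2.0, 3.0, 4.0, 100.0, 120.0]
train, test = feature[:4], feature[4:]

# Leaky: the statistics are computed on train + test together,
# so the test rows influence how the training rows are scaled.
leaky_mean, leaky_std = fit_scaler(feature)

# Leak-free: fit on the training rows only, then reuse the same
# statistics to transform both the training and the test rows.
clean_mean, clean_std = fit_scaler(train)
train_scaled = transform(train, clean_mean, clean_std)
test_scaled = transform(test, clean_mean, clean_std)

print("leaky mean:", leaky_mean)
print("clean mean:", clean_mean)
```

The same rule applies to any fitted preprocessing step (imputation, encoding, feature selection): split first, fit on the training part, and only apply the fitted step to the test part.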
How would I know if there is data leakage in my model?
There is a high chance of data leakage in your model if it produces great results during evaluation (based on accuracy and other metrics) but gives you bad results when fed new data in production.
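One rough way to turn this symptom into a check is to compare the offline metric with the same metric measured on fresh production data. The sketch below is only an illustration: the helper names, the toy labels, and the `max_gap` threshold of 0.10 are assumptions, not an established rule. A large gap can also come from overfitting or from a shift in the data, so treat it as a prompt to inspect your pipeline rather than as proof of leakage.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def leakage_suspected(offline_acc, production_acc, max_gap=0.10):
    """Flag a suspiciously large drop from offline to production accuracy.

    max_gap is an arbitrary, hypothetical threshold; tune it for your problem.
    """
    return (offline_acc - production_acc) > max_gap

# Hypothetical labels and predictions from the offline test set.
offline_acc = accuracy([1, 0, 1, 1], [1, 0, 1, 1])

# Hypothetical labels and predictions collected in production.
production_acc = accuracy([1, 0, 1, 1, 0], [0, 0, 1, 0, 1])

if leakage_suspected(offline_acc, production_acc):
    print("Large offline/production gap: check the pipeline for leakage.")
```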
Another case is leakage of the model’s target, meaning that one of the model’s inputs is effectively a cheat sheet for the target.
For example, say that you want to classify whether a store will have more than 50 customers on a specific day, using a number of columns as inputs. However, say that you also have a column containing the total number of customers (which you would realistically not have access to in production). This causes data leakage: the model would likely perform really well, but what it actually learns is to cheat by simply focusing on the “total customers” column.
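The store example can be sketched as follows. The rows, the column names, and the helper `reproduces_target` are hypothetical and exist only to illustrate the idea: if a single column reproduces the target on its own, it is a strong hint that the column leaks the target and should be dropped before training.

```python
# Hypothetical store data; "total_customers" is only known after the day ends.
rows = [
    {"day_of_week": 0, "is_holiday": 0, "total_customers": 32},
    {"day_of_week": 5, "is_holiday": 0, "total_customers": 71},
    {"day_of_week": 6, "is_holiday": 1, "total_customers": 95},
    {"day_of_week": 2, "is_holiday": 0, "total_customers": 48},
]

# Target: did the store have more than 50 customers that day?
labels = [r["total_customers"] > 50 for r in rows]

def reproduces_target(rows, labels, column, threshold):
    """True if thresholding a single column matches every label."""
    return all((r[column] > threshold) == y for r, y in zip(rows, labels))

# The leaky column predicts the target perfectly on its own.
print(reproduces_target(rows, labels, "total_customers", 50))

# Drop columns that would not be available at prediction time.
LEAKY_COLUMNS = {"total_customers"}
features = [
    {k: v for k, v in r.items() if k not in LEAKY_COLUMNS} for r in rows
]
```

A useful question to ask for every input column is whether its value would actually be known at the moment the prediction has to be made.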