While information leakage arises largely from temporal effects in data science techniques, bias errors often result from complexities in the underlying data distributions. Two kinds of bias are common when prototyping machine learning models: reporting bias and selection bias.
Reporting bias, which is common but often overlooked, relates to the provenance of the training data available to data scientists. Early in C3 AI’s history, for example, we developed machine learning algorithms to detect customer fraud. In one customer deployment, it was clear to us that the algorithms were significantly underperforming in one particular geography, a remote island. When we examined the situation further, we realized there was substantial reporting bias in the data set from the island. Every historical investigation performed on the island had turned out to be a fraud case, skewing the data distributions from that island.
It turns out that because of the island’s remoteness, investigators wanted to be confident that a case was fraudulent before they would travel there. Because AI/ML algorithms are inherently greedy, exploiting any pattern in the training data that improves their objective, in this example the algorithm maximized its apparent performance by incorrectly assigning every customer on the island a high fraud score.
Because the frequency of events, properties, and outcomes in the training set from that island differed from their real-world frequency, our teams had to counteract the implicit bias caused by the selective fraud inspections on the island.
Reporting bias is common where humans are engaged in the initiation, sampling, or recording of data used for eventual machine learning model training.
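One simple way to surface this kind of skew is to compare label rates across groups before training. The following is a minimal sketch, not the method used in the deployment described above; the field names `region` and `is_fraud`, the thresholds, and the toy records are all illustrative assumptions.

```python
from collections import defaultdict


def label_rate_by_group(records, group_key="region", label_key="is_fraud"):
    """Return {group: (n_cases, positive_rate)} for a list of labeled records."""
    counts = defaultdict(lambda: [0, 0])  # group -> [n_cases, n_positive]
    for rec in records:
        tally = counts[rec[group_key]]
        tally[0] += 1
        tally[1] += int(bool(rec[label_key]))
    return {group: (n, pos / n) for group, (n, pos) in counts.items()}


def flag_suspect_groups(rates, overall_rate, tolerance=0.4, min_cases=5):
    """List groups whose positive rate differs from the overall rate by more
    than `tolerance` (an arbitrary, assumed threshold), ignoring tiny groups."""
    return sorted(
        group
        for group, (n, rate) in rates.items()
        if n >= min_cases and abs(rate - overall_rate) > tolerance
    )


# Hypothetical investigation records: every island case is labeled as fraud.
records = (
    [{"region": "mainland", "is_fraud": i < 4} for i in range(20)]
    + [{"region": "island", "is_fraud": True} for _ in range(6)]
)

rates = label_rate_by_group(records)
overall_rate = sum(r["is_fraud"] for r in records) / len(records)
suspects = flag_suspect_groups(rates, overall_rate)
print(rates)
print(suspects)
```

A flagged group is a prompt to ask how its records were collected, not proof of bias; a region can legitimately have a different fraud rate.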
Selection bias, another common bias in machine learning training data, arises from how data are selected for training models. It is imperative that teams ensure their training data are representative of the real-world situation in which the model is to perform. For example, AI/ML models that seek to predict customer attrition for a bank may need to carefully consider the demographics of the population. Attrition for high-net-worth individuals is likely to have substantially different characteristics from attrition for lower-net-worth individuals. A model trained on one group would likely perform quite poorly against the other.
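A gap of this kind is easy to miss when only an aggregate metric is reported. As a rough sketch, evaluation results can be broken out by segment; the segment names, labels, and predictions below are made-up toy values, and `accuracy_by_segment` is a hypothetical helper rather than a library function.

```python
from collections import defaultdict


def accuracy_by_segment(y_true, y_pred, segments):
    """Return {segment: accuracy} so weak segments are not hidden by the average."""
    if not (len(y_true) == len(y_pred) == len(segments)):
        raise ValueError("y_true, y_pred, and segments must have equal length")
    hits = defaultdict(int)
    totals = defaultdict(int)
    for truth, pred, seg in zip(y_true, y_pred, segments):
        totals[seg] += 1
        hits[seg] += int(truth == pred)
    return {seg: hits[seg] / totals[seg] for seg in totals}


# Toy attrition labels (1 = customer left) for two assumed customer segments.
segments = ["high_net_worth"] * 4 + ["lower_net_worth"] * 4
y_true = [1, 0, 1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 0, 0, 1]

per_segment = accuracy_by_segment(y_true, y_pred, segments)
overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(per_segment, overall)
```

In this toy case the overall accuracy looks moderate while one segment is served far worse than the other, which is the pattern to look for.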
Selection bias is common in situations where prototyping teams are narrowly focused on solving a specific problem without regard to how the solution will be used and how the data sets will generalize. A machine learning modeler must ensure that the training data properly represent the population, or take alternative steps to mitigate the introduction of bias into the model.
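The text does not name a specific mitigation. One widely used option is to reweight training examples so that each segment counts in proportion to its share of the target population. The sketch below assumes the population shares are known from an outside source; the segment names and numbers are hypothetical.

```python
from collections import Counter


def segment_shares(segments):
    """Return {segment: fraction of examples} for a list of segment labels."""
    counts = Counter(segments)
    total = sum(counts.values())
    return {seg: n / total for seg, n in counts.items()}


def reweighting_factors(train_segments, population_shares):
    """Per-segment weight = population share / training share.

    Segments absent from the training data cannot be fixed by reweighting,
    so they raise an error instead of being silently ignored.
    """
    train_shares = segment_shares(train_segments)
    missing = sorted(set(population_shares) - set(train_shares))
    if missing:
        raise ValueError(f"segments missing from training data: {missing}")
    return {
        seg: population_shares[seg] / train_shares[seg]
        for seg in population_shares
    }


# Assumed example: training data over-represent high-net-worth customers.
train_segments = ["high_net_worth"] * 80 + ["lower_net_worth"] * 20
population_shares = {"high_net_worth": 0.1, "lower_net_worth": 0.9}  # assumed

weights = reweighting_factors(train_segments, population_shares)
print(weights)
```

Weights like these can typically be passed to a training routine that accepts per-example weights. Reweighting only rebalances segments that are present; when a segment is missing entirely, the remedy is to collect more data.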