Thinking traps

Human beings are subject to cognitive biases by virtue of being human, including rationalization and confirmation bias. Alberto Cairo writes, "Rationalization is the human brain's default mode."1 Very often, people expect or want a particular result, then look for data or evidence to support that result.

When working with or evaluating data and models, which can come from many different sources, ask about potential sources of bias. For example:

  • Who is funding this model or study? What is the market or commercial application?
  • What kinds of incentives exist for the people involved in collecting data?
  • What kinds of incentives exist for the researchers training the model or conducting the study, including publication and tenure?
  • Who is licensing the model or publishing the study, and what are their incentives?

Descriptive statistics

Mean (sum of values divided by count), median (middle value, when values are ordered), and mode (most frequent value) are often helpful in getting a sense of the shape of one's dataset. If the median and mean are far apart, for example, there may be fairly extreme and asymmetric values in the set.

The range, which is the difference between the highest and lowest values, and the variance, which is the mean squared difference between each value and the set's mean, also provide useful information on the spread and shape of the dataset.
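
As a concrete illustration, here is a minimal Python sketch, using the standard library's statistics module and a small made-up list of values, that computes each of these statistics:

    import statistics

    values = [2, 3, 3, 4, 5, 5, 5, 9, 40]  # small, hypothetical dataset

    mean = statistics.mean(values)           # sum of values divided by count
    median = statistics.median(values)       # middle value, when values are ordered
    mode = statistics.mode(values)           # most frequent value
    value_range = max(values) - min(values)  # highest value minus lowest
    variance = statistics.pvariance(values)  # mean squared difference from the mean

    print(mean, median, mode, value_range, variance)
    # The mean (about 8.4) is far from the median (5), hinting at the extreme,
    # asymmetric value (40) in this toy set.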

Prior to training a model on your data, also ask if the dataset is imbalanced and, if so, whether that imbalance should be addressed.
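
Checking for imbalance can be as simple as counting labels before training. The following minimal sketch uses made-up labels:

    from collections import Counter

    labels = ["negative"] * 950 + ["positive"] * 50  # hypothetical labels

    counts = Counter(labels)
    total = sum(counts.values())
    for label, count in counts.items():
        print(f"{label}: {count} ({count / total:.1%})")
    # A 19:1 skew like this one may call for rebalancing, for example by
    # downsampling the majority class and upweighting the downsampled examples.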

Probable improbabilities and p-values

Given enough time and enough chances, the occurrence of an improbable event becomes very probable. See the hypothetical Baltimore stockbroker scam for one example.

By scientific consensus, a result is considered statistically significant (and therefore publishable) when the p-value is less than .05. That means there is a less than 5% chance that the same result, or a more extreme one, would occur under the null hypothesis, that is, purely as a result of chance. More colloquially, a result is publishable only if there is a 1-in-20 chance or less of seeing it (or something more extreme) when nothing real is going on. Viewed from the other direction, and more alarmingly: when there is no real effect, about one experiment in twenty will still produce a result that appears significant, and it is that spurious result, rather than the nineteen unremarkable ones, that tends to get published. In a 2005 paper, "Why Most Published Research Findings Are False," John Ioannidis laid out multiple factors, from statistical to financial, contributing to the publication of spurious results.

For example, given the strong incentives to publish, researchers sometimes fudge p-values hovering just above .05 so that they fall below that threshold. Other times, published study results, which are naturally skewed toward unexpected and unusual findings, turn out not to be replicable (and are therefore possibly the outcome of chance), which has led to a crisis of confidence in multiple fields. It has also led to the creation of organizations dedicated to testing reproducibility.
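
To see how easily chance alone produces an apparently significant result, consider the short simulation below. The group sizes, number of experiments, and choice of a two-sample t-test from SciPy are arbitrary choices for illustration:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(seed=0)
    num_experiments = 10_000
    false_positives = 0

    for _ in range(num_experiments):
        # Both groups come from the same distribution: there is no real effect.
        group_a = rng.normal(loc=0.0, scale=1.0, size=30)
        group_b = rng.normal(loc=0.0, scale=1.0, size=30)
        _, p_value = stats.ttest_ind(group_a, group_b)
        if p_value < 0.05:
            false_positives += 1

    print(f"Spurious 'significant' results: {false_positives / num_experiments:.1%}")
    # Expect roughly 5%, even though no experiment measured a real effect.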

In the field of ML, models are only considered state-of-the-art if they meet or exceed the benchmark scores of most other competitive models. It's possible that similar pressures arise around model evaluation scores, which can be artificially boosted by benchmark leakage.2

P-values can be useful in feature selection for regression models. ANOVA (Analysis of Variance) is a statistical method that compares variance within groups to variance between groups, returning an F-statistic and p-value for each feature. Choosing the most significant features, those with the lowest p-values, can reduce the number of features a model has to consider without losing much predictive power. This both saves compute and avoids the problem of too many features, discussed in a later section. See scikit's Feature selection guide for details.
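
The sketch below shows one way this could look with scikit-learn, using a synthetic dataset from make_regression purely for illustration. For regression targets, f_regression supplies the per-feature F-statistics and p-values; for classification targets, f_classif performs the ANOVA-style within-group versus between-group comparison described above:

    from sklearn.datasets import make_regression
    from sklearn.feature_selection import SelectKBest, f_regression

    # Synthetic data: 20 candidate features, only 5 of which are informative.
    X, y = make_regression(n_samples=500, n_features=20, n_informative=5,
                           noise=10.0, random_state=0)

    # One F-statistic and one p-value per feature.
    f_statistics, p_values = f_regression(X, y)

    # Keep the k features with the strongest (lowest p-value) relationship to y.
    selector = SelectKBest(score_func=f_regression, k=5)
    X_reduced = selector.fit_transform(X, y)

    selected = selector.get_support(indices=True)
    print("Selected feature indices:", selected)
    print("Their p-values:", p_values[selected])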

The multiple comparisons problem

The significance-threshold problem is especially severe in situations where many comparisons against the null hypothesis are being conducted at the same time. This is a particular issue for fMRI studies.

In an fMRI, each voxel (volume unit) of the brain is independently tested for statistically significant activity and highlighted if so. This leads to something on the order of 100,000 independent significance tests being conducted at once. At a p=.05 significance threshold, statistical theory expects approximately 5,000 false positives to appear in a single fMRI scan.3

The problem is probably best illustrated by the 2009 Bennett et al. poster, "Neural correlates of interspecies perspective taking in the post-mortem Atlantic Salmon," which won an Ig Nobel Prize. The researchers showed 15 photographs of human beings in highly emotional situations to a dead salmon in an fMRI machine, asking the dead salmon to determine what emotions the pictured human beings were experiencing. They located a statistically significant cluster of active voxels in the salmon's brain cavity and concluded, tongue in cheek, that the dead salmon was indeed engaging in perspective-taking. More seriously, the researchers were calling attention to the multiple comparisons problem in fMRI and similar imaging situations, and the need for mitigations.

One obvious, coarse-grained solution is to lower the threshold p-value that indicates significance. The inherent tradeoff is between sensitivity (capturing all true positives) and specificity (identifying all true negatives). A discussion of sensitivity, also called the true positive rate, can be found in the Classification module of Machine Learning Crash Course.

Another mitigation is controlling the family-wise error rate (FWER), which is the probability of at least one false positive. Another is controlling the false discovery rate (FDR), or the expected proportion of false positives to all positives. See Evidence in Governance and Politics' guide to the multiple comparisons problem, as well as Lindquist and Mejia's "Zen and the art of multiple comparisons," for explanations of these methods and a few walkthroughs. In the situation with the dead salmon, controlling for FDR and FWER showed that no voxels were, in fact, statistically significant.
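
To see these corrections in action, the sketch below simulates 100,000 null "voxel" tests with no real signal anywhere, then applies Bonferroni (FWER) and Benjamini-Hochberg (FDR) corrections via statsmodels. The data and test choices are illustrative assumptions, not the Bennett et al. procedure:

    import numpy as np
    from scipy import stats
    from statsmodels.stats.multitest import multipletests

    rng = np.random.default_rng(seed=0)
    num_voxels = 100_000

    # Each "voxel" gets a one-sample t-test of 20 pure-noise measurements against zero.
    measurements = rng.normal(loc=0.0, scale=1.0, size=(num_voxels, 20))
    _, p_values = stats.ttest_1samp(measurements, popmean=0.0, axis=1)

    uncorrected = int(np.sum(p_values < 0.05))
    fwer_reject, _, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")
    fdr_reject, _, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")

    print(f"Uncorrected 'significant' voxels: {uncorrected}")             # roughly 5,000
    print(f"After FWER (Bonferroni) control: {fwer_reject.sum()}")        # usually 0
    print(f"After FDR (Benjamini-Hochberg) control: {fdr_reject.sum()}")  # usually 0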

Training ML models on scans from fMRI and other imaging methods is increasingly popular, both in the area of medical diagnosis4 and in reconstructing images from brain activity.5 If these models are trained on a sufficiently large dataset, that might reduce the likelihood of issues arising from the multiple comparisons problem. However, particularly in the realm of diagnosis, the model may make inaccurate inferences about new individual scans if 20% of the "active" voxels are in fact false positives. Note that the diagnostic fMRI classification models described in Li and Zhao have ~70-85% accuracy.

Too many variables in regression analysis

The multiple comparisons problem extends to multiple regression analysis. Regression analysis, or linear regression, is the backbone of many numerical predictive models. Regression analysis uses one of several methods, such as ordinary least squares, to find the regression coefficients that best describe how each variable affects the outcome. For example, researchers can ask how age and smoking affect lung cancer rates by representing each factor as a variable in a regression analysis of cancer incidence in smokers and nonsmokers of various ages; the resulting regression coefficients describe the linear relationships between those variables and lung cancer rates. A linear regression ML model works in much the same way, which is why it is highly interpretable compared to other types of ML models.

It can be tempting to include all possible variables in a regression analysis, not least because omitting a critical factor can lead to its contribution being overlooked. However, adding too many variables to a regression analysis increases the odds that an irrelevant variable will appear statistically significant. If we add eighteen more irrelevant variables to our analysis, like "movies watched" and "dogs owned," it's likely that one of those irrelevant variables, by pure chance, will appear to be associated with higher lung cancer rates.6
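
The simulation below sketches this trap with made-up data: one genuinely predictive variable ("smoking") plus 18 irrelevant random variables, fit with ordinary least squares via statsmodels:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(seed=1)
    n = 200

    smoking = rng.normal(size=n)
    irrelevant = rng.normal(size=(n, 18))             # "movies watched", "dogs owned", ...
    cancer_rate = 2.0 * smoking + rng.normal(size=n)  # only smoking truly matters

    X = sm.add_constant(np.column_stack([smoking, irrelevant]))
    results = sm.OLS(cancer_rate, X).fit()

    # p-values for the 18 irrelevant variables (skipping the intercept and smoking).
    irrelevant_p = results.pvalues[2:]
    print(f"Irrelevant variables with p < .05: {(irrelevant_p < 0.05).sum()} of 18")
    # With 18 chances at a 5% threshold, there is roughly a 60% chance that at
    # least one irrelevant variable appears "significant" by pure chance.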

In the ML context, the analogous situation is giving too many features to the model, which can result in overfitting, among other problems.

Inferences and decision-making

One way to sidestep some of these thinking traps is to treat statistics and ML models, which are derived from statistics, as tools for making decisions rather than for answering questions. This was the position taken by Jerzy Neyman and Egon Sharpe Pearson.7

In this framework, data, data statistics, and derivatives, including ML models, are best suited for making probabilistic predictions, disproving universal statements, improving and focusing research questions, and assisting in decision-making. They are not well suited for making affirmative claims about truth.

According to David Ritter, decisions based on correlations from even gigantic amounts of data should be based on two factors:

  • "Confidence that the correlation will reliably recur in the future," which should be based both on how frequently that correlation has occurred in the past and an accurate understanding of what is causing that correlation.
  • The risks and rewards of acting.8

Similarly, not all research questions may be well suited for AI. Anastassia Fedyk offers two criteria for an AI-appropriate problem:

  • The problem requires prediction, not understanding causal relationships.
  • The data being fed to AI contains all that needs to be known about the problem; that is, the problem is self-contained.9

References

Bennett, Craig M., Abigail A. Baird, Michael B. Miller, and George L. Wolford. "Neural correlates of interspecies perspective taking in the post-mortem Atlantic Salmon: An argument for multiple comparisons correction." Neuroimage (2009).

Cairo, Alberto. How Charts Lie: Getting Smarter about Visual Information. NY: W.W. Norton, 2019.

Davenport, Thomas H. "A Predictive Analytics Primer." In HBR Guide to Data Analytics Basics for Managers (Boston: HBR Press, 2018) 81-86.

Ellenberg, Jordan. How Not to Be Wrong: The Power of Mathematical Thinking. NY: Penguin, 2014.

Fedyk, Anastassia. "Can Machine Learning Solve Your Business Problem?" In HBR Guide to Data Analytics Basics for Managers (Boston: HBR Press, 2018) 111-119.

Gallo, Amy. "A Refresher on Statistical Significance." In HBR Guide to Data Analytics Basics for Managers (Boston: HBR Press, 2018) 121-129.

Huff, Darrell. How to Lie with Statistics. NY: W.W. Norton, 1954.

Ioannidis, John P.A. "Why Most Published Research Findings Are False." PLoS Medicine 2, no. 8 (2005): e124.

Jones, Ben. Avoiding Data Pitfalls. Hoboken, NJ: Wiley, 2020.

Li, Jiangxue and Peize Zhao. "Deep learning applications in fMRI – a Review Work." ICBBB 2023 (Tokyo, Japan, January 13–16, 2023): 75-80. https://doi.org/10.1145/3586139.3586150

Lindquist, Martin A. and Amanda Mejia. "Zen and the art of multiple comparisons." Psychosomatic Medicine 77 no. 2 (Feb-Mar 2015): 114–125. doi: 10.1097/PSY.0000000000000148.

Ritter, David. "When to Act on a Correlation, and When Not To." In HBR Guide to Data Analytics Basics for Managers (Boston: HBR Press, 2018) 103-109.

Takagi, Yu and Shinji Nishimoto. "High-resolution image reconstruction with latent diffusion models from human brain activity." 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (Vancouver, BC, Canada, 2023): 14453-14463. doi: 10.1109/CVPR52729.2023.01389.

Wheelan, Charles. Naked Statistics: Stripping the Dread from the Data. NY: W.W. Norton, 2013.

Zhou, Kun, Yutao Zhu, Zhipeng Chen, Wentong Chen, Wayne Xin Zhao, Xu Chen, Yankai Lin, Ji-Rong Wen, and Jiawei Han. "Don't Make Your LLM an Evaluation Benchmark Cheater." arXiv:2311.01964 cs.CL.


  1. Cairo 182. 

  2. Zhou et al. 

  3. Lindquist and Mejia. 

  4. Li and Zhao 77-78. 

  5. Takagi and Nishimoto. 

  6. Wheelan 221. 

  7. Ellenberg 159. 

  8. Ritter 104. 

  9. Fedyk 113.