Thoughts On The Two Cultures Of Statistical Modeling
Accuracy beats interpretability, and other takeaways from “Statistical Modeling: The Two Cultures” by Leo Breiman
In the paper “Statistical Modeling: The Two Cultures,” Leo Breiman, developer of random forests as well as bagging and boosted ensembles, describes two contrasting approaches to modeling in statistics:
- Data Modeling: choose a simple (typically linear) model based on intuition about the data-generating mechanism. The emphasis is on model interpretability; validation, if done at all, is through goodness-of-fit.
- Algorithmic Modeling: treat the data-generating mechanism as unknown and choose whichever model achieves the highest predictive accuracy on validation data, with no consideration for model explainability (see the sketch after this list).
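
To make the contrast concrete, here is a minimal sketch of the two workflows applied to the same data. The use of scikit-learn, the synthetic dataset, and all parameter values are illustrative choices of mine, not anything from Breiman's paper.

```python
# Minimal sketch contrasting the two cultures on the same data.
# scikit-learn, the synthetic dataset, and all parameters are
# illustrative assumptions, not from Breiman's paper.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Data modeling culture: assume a linear data-generating mechanism,
# judge the fitted model by in-sample goodness-of-fit (R^2 here),
# and read the coefficients as an interpretation of the mechanism.
linear = LinearRegression().fit(X_train, y_train)
print("linear in-sample R^2:", linear.score(X_train, y_train))
print("linear coefficients:", linear.coef_)

# Algorithmic modeling culture: treat the mechanism as unknown and
# judge the model purely by predictive accuracy on held-out data.
forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)
print("forest held-out R^2:", r2_score(y_test, forest.predict(X_test)))
```

The first workflow produces a set of coefficients a statistician can interpret; the second produces only a held-out accuracy number, which is exactly the measure the algorithmic culture cares about.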
At the time of writing in 2001, Breiman estimated that 98% of statisticians belonged to the data modeling culture while 2% (himself included) belonged to the algorithmic modeling culture. The paper is written as a call to arms for statisticians to stop relying solely on data modeling, which he argues leads to “misleading conclusions” and “irrelevant theory,” and to embrace algorithmic modeling to solve novel real-world problems arising from massive data sets. Breiman was an academic, working as a statistician at Berkeley for 21 years, but he had previously spent 13 years as a freelance consultant, which gave him a well-formed perspective on how statistics can be useful in industry.