I'm just a short way into Donald Sull and Kathleen Eisenhardt's excellent book, Simple Rules, and came across a passage that struck me as interesting given the breathless hype we're hearing around Big Data these days.
"Why can simpler models outperform more complex ones? When underlying cause-and-effect relationships are poorly understood, decision makers often look for patterns in historical data under the assumption that past events are a good indicator of future trends. The obvious problem with this approach is that the future may be genuinely different from the past. But a second problem is subtler. Historical data includes not only useful signal, but also noise - happenstance correlations between variables that do not reveal an enduring cause-and-effect relationship. Fitting a model too closely to historical data hardwires error into the model, which is known as overfitting. The result is a precise prediction of the past that may tell us little about what the future holds. Throwing more data and computing horsepower into the mix doesn't necessarily resolve this problem, because big data mixed with little theory is a recipe for overfitting. IBM recently released a study, based on a hundred years of data, showing that the increases in height of women's heels were a leading indicator of economic downturns. The flat shoes favoured by 1920s flappers gave way to high heels during the Depression, 1960s sandals to platform shoes during the 1970s oil crisis, and the low heels of the grunge look were replaced by stilettos as the dot com bubble burst. The correlation worked, until it didn't. In the aftermath of the 2008 financial crisis, heel height trended downward. If you crunch a lot of numbers without a good theory, you find correlations - the problem is, they may be spurious."
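The overfitting Sull and Eisenhardt describe is easy to demonstrate. Here's a minimal sketch (my own illustration, not from the book) using NumPy: the underlying relationship is a simple line, but a flexible degree-9 polynomial chases the noise in the "historical" sample and predicts the past almost perfectly, while doing worse than the simple model on fresh data from the same process.

```python
# A sketch of overfitting on synthetic data. All names and parameters
# here are illustrative choices, not anything from the book or IBM study.
import numpy as np

rng = np.random.default_rng(0)

# True relationship is linear; the "historical data" adds random noise.
x_train = np.linspace(0, 1, 10)
y_train = 2 * x_train + rng.normal(0, 0.2, x_train.size)

# "Future" data drawn from the same underlying process.
x_test = np.linspace(0, 1, 50)
y_test = 2 * x_test + rng.normal(0, 0.2, x_test.size)

def mse(coeffs, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

simple = np.polyfit(x_train, y_train, deg=1)    # matches the true model
complex_ = np.polyfit(x_train, y_train, deg=9)  # flexible enough to fit noise

print("train error, simple :", mse(simple, x_train, y_train))
print("train error, complex:", mse(complex_, x_train, y_train))
print("test error,  simple :", mse(simple, x_test, y_test))
print("test error,  complex:", mse(complex_, x_test, y_test))
```

The complex model's training error is near zero, which is the "precise prediction of the past" in the quote; on the test data the simple model wins. More data points of the same noisy kind don't change the lesson without a theory about which relationships are real.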
Given our already troublesome predisposition to jump to conclusions and to confuse correlation with causation, we now need to add overfitting to the list of things to watch for.