Someone is once again raining on the big data parade, urging us to think carefully before jumping on the bandwagon. In FT Magazine, writer Tim Harford asks, “Big Data: Are We Making a Big Mistake?” He points to Google’s much-lauded Google Flu Trends as an emblematic example in the field. That project tracks increases in certain search terms, like “flu symptoms” or “pharmacies near me”, by point of origin. From those data points, its algorithm estimates the spread of the disease. In fact, it does so with only one day’s delay, compared to a week or more for the CDC’s analysis based on doctors’ reports.
The thing is, this successful project is also an example of the blind faith many are putting in the results of data analysis. The scientists behind it aren’t afraid to admit they don’t know which search terms are most fruitful or how, exactly, the algorithm constructs its correlations; it’s all about the results. Correlation over causation, as Harford puts it. However, Google Flu Trends hit a speed bump in 2012: it greatly overestimated the flu’s spread, unnecessarily alarming the public. Correlation is much, much easier to determine than causation, but we must not let ourselves believe it is just as good.
The article cautions:
“Cheerleaders for big data have made four exciting claims, each one reflected in the success of Google Flu Trends: that data analysis produces uncannily accurate results; that every single data point can be captured, making old statistical sampling techniques obsolete; that it is passé to fret about what causes what, because statistical correlation tells us what we need to know; and that scientific or statistical models aren’t needed because, to quote ‘The End of Theory’, a provocative essay published in Wired in 2008, ‘with enough data, the numbers speak for themselves’.
“Unfortunately, these four articles of faith are at best optimistic oversimplifications. At worst, according to David Spiegelhalter, Winton Professor of the Public Understanding of Risk at Cambridge university, they can be ‘complete bollocks. Absolute nonsense.’”
Another quote from Spiegelhalter summarizes the problem with letting ourselves be seduced by big data’s promise of certainty: “There are a lot of small data problems that occur in big data. They don’t disappear because you’ve got lots of the stuff. They get worse.” The article goes on to discuss in detail the statistical flaws behind big data’s promises. It is an important read for anyone facing the alluring shimmer of the big data trend.
Cynthia Murrell, April 25, 2014