March 15, 2014
Run a query for Google Flu Trends on Google. The results point to the Google Flu Trends Web site at http://bit.ly/1ny9j58. The graphs and charts seem authoritative. I find the colors and legends difficult to figure out, but Google knows best. Or does it?
A spate of stories have appeared in New Scientist, Smithsonian, and Time that pick up the threat that Google Flu Trends does not work particularly well. The Science Magazine podcast presents a quite interesting interview with David Lazar, one of the authors of “The Parable of Google Flu: Traps in Big Data Analysis.”
The point of the Lazar article and the greedy recycling of the analysis is that algorithms can be incorrect. What is interesting is the surprise that creeps into the reports of Google’s infallible system being dead wrong.
For example, Smithsonian Magazine’s “Why Google Flu Trends Can’t Track the Flu (Yet)” states, “The vaunted big data project falls victim to periodic tweaks in Google’s own search algorithms.” The write continues:
A huge proportion of the search terms that correlate with CDC data on flu rates, it turns out, are caused not by people getting the flu, but by a third factor that affects both searching patterns and flu transmission: winter. In fact, the developers of Google Flu Trends reported coming across particular terms—those related to high school basketball, for instance—that were correlated with flu rates over time but clearly had nothing to do with the virus. Over time, Google engineers manually removed many terms that correlate with flu searches but have nothing to do with flu, but their model was clearly still too dependent on non-flu seasonal search trends—part of the reason why Google Flu Trends failed to reflect the 2009 epidemic of H1N1, which happened during summer. Especially in its earlier versions, Google Flu Trends was “part flu detector, part winter detector.”
Oh, oh. Feedback loops, thresholds, human bias—Quite a surprise apparently.
Time Magazine’s “Google’s Flu Project Shows the Failings of Big Data” realizes:
GFT and other big data methods can be useful, but only if they’re paired with what the Science researchers call “small data”—traditional forms of information collection. Put the two together, and you can get an excellent model of the world as it actually is. Of course, if big data is really just one tool of many, not an all-purpose path to omniscience, that would puncture the hype just a bit. You won’t get a SXSW panel with that kind of modesty.
Scientific American’s “Why Big Data Isn’t Necessarily Better Data” points out:
Google itself concluded in a study last October that its algorithm for flu (as well as for its more recently launched Google Dengue Trends) were “susceptible to heightened media coverage” during the 2012-2013 U.S. flu season. “We review the Flu Trends model each year to determine how we can improve—our last update was made in October 2013 in advance of the 2013-2014 flu season,” according to a Google spokesperson. “We welcome feedback on how we can continue to refine Flu Trends to help estimate flu levels.”
The word “hubris” turns up in a number of articles about this “surprising” suggestion that algorithms drift.
Forget Google and its innocuous and possibly ineffectual flu data. The coverage of the problems with the Google Big Data demonstration have significance for those who bet big money that predictive systems can tame big data. For companies licensing Autonomy- or Recommind-type search and retrieval systems, the flap over flu trends makes clear that algorithmic methods require baby sitting; that is, humans have to be involved and that involvement may introduce outputs that wander off track. If you have used a predictive search system, you probably have encountered off center, irrelevant results. The question “Why did the system display this document?” is one indication that predictive search may deliver a load of fresh bagels when you wanted a load of mulch.
For systems that do “pre crime” or predictive analyses related to sensitive matters, uninformed “end users” can accept what a system outputs and take action. This is the modern version of “Ready, Fire, Aim.” Some of these actions are not quite as innocuous as over-estimating flu outbreaks. Uninformed humans without knowledge of context and biases in the data and numerical recipes can find themselves mired in a swamp, not parked at the local Starbuck’s.
And what about Google? The flu analyses illustrate one thing: Google can fool itself in its effort to sell ads. Accuracy is not the point of Google or many other online information retrieval services.
Painful? Well, taking two aspirins won’t cure this particular problem. My suggestion? Come to grips with rigorous data analysis, algorithm behaviors, and old fashioned fact checking. Big Data and fancy graphics are not, by themselves, solutions to the clouds of unknowing that swirl through marketing hyperbole. There is a free lunch if one wants to eat from trash bins.
Stephen E Arnold, March 15, 2014
March 13, 2014
Phi Beta Iota: Google’s culture is severely flawed because it presumes internal intelligence and external idiocy, and it simply does not get the fact that human brains are vastly more powerful than any algorithm. Google started with a crime (the theft of Yahoo’s search engine) and went downhill from there. Google is a perfect example of what E. O. Wilson would call the stunted sciences — sciences without the humanities (his book, CONSILIENCE: The Unity of Knowledge, strives to answer the question “why do the sciences need the humanities?” Put another way, “Why do Larry Page and Vint Cerf need someone like Robert Steele?” Here is a quote from another review of this book:
“As J. von Neumann & H.H. Goldstine said: `a mathematical formulation necessarily represents only a theory of some phase (aspects) of reality, and not reality itself.’ ” Luc REYNAERT