Stephen E. Arnold: From Bad Search to Predictive Analytics Snake Oil

IO Impotency
Stephen E. Arnold

From Search to Prediction 

The economic vise is closing on some search and content processing vendors. There are some significant repositionings underway. We are working on three at this time, and, believe me, the vendors are doing more than changing the color of the logo.

As we work on our projects, we have been aware of the emergence of a new buzzword closely allied to text mining, metatagging, and analytics. The word is “predictive” and we are seeing it in a number of different contexts.

Wikipedia marches through applications, statistical techniques, and tools. You can find that 5,000-word article at

My problem with the application of “predictive” to everything from biosciences as in “predictive bioscience” to conferences as in “Predictive Analytics World” is that it sounds so darned good. Yet few know much about the numerical recipes upon which predictive operations rest. Do you recall “the axiom of choice”? Even more disturbing is that most of the professionals with whom I work do not recognize that selecting a different mathematical procedure can generate quite different results. In effect, the “prediction” is more an expression of what the algorithm generates than a manifestation of what the data may mean in the real world. A 70% “score” may mean that 30% of the output is wrong.
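
To make the point about procedure choice concrete, here is a minimal sketch in Python. The weekly counts are made up for illustration, and the three forecasting functions are simple textbook methods chosen as assumptions, not anything a particular vendor ships. Each method sees the same five numbers and "predicts" a different sixth.

```python
from statistics import mean

def forecast_mean(series):
    """Predict the next value as the average of everything seen so far."""
    return mean(series)

def forecast_last(series):
    """Predict the next value as a repeat of the most recent value."""
    return series[-1]

def forecast_linear(series):
    """Fit a least-squares straight line and extend it one step ahead."""
    n = len(series)
    xs = range(n)
    x_bar = mean(xs)
    y_bar = mean(series)
    slope = (
        sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, series))
        / sum((x - x_bar) ** 2 for x in xs)
    )
    intercept = y_bar - slope * x_bar
    return intercept + slope * n

# Hypothetical data: number of queries logged in five consecutive weeks.
weekly_queries = [10, 12, 15, 13, 20]

forecasts = {
    "mean": forecast_mean(weekly_queries),
    "last value": forecast_last(weekly_queries),
    "linear trend": forecast_linear(weekly_queries),
}

for name, value in forecasts.items():
    print(f"{name:>12}: {value:.1f}")
```

On this toy series the three answers are roughly 14, 20, and 20.3. None of them is "the" prediction. Each number reflects the procedure that produced it at least as much as it reflects the data.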

The New York Times, Sunday, July 8, 2012, ran “When the Crowd Isn’t Wise.” The story points out that experts and prediction markets are equally lousy. No big surprise. I have found the notion of predicting the future a matter of data management and luck, not wisdom.

Now that predictive analytics are one of the tools some search vendors include in their shopping cart, will the phrase “promise more than they can deliver” plague these companies? My thought is that at this time search and content processing vendors are struggling to generate revenues and contain costs. In order to keep their jobs, some search vendor CEOs are grabbing for any straw available. But most of these outfits are not Google, and many lack Google’s pool of scientific and mathematical talent. Google is in the predictive analytics game and at a high level. The company filed a predictive search query patent document earlier this year. See “Google Files Predictive Search Query and Results Patent.”

But marketers often don’t get involved in details. Analytics, no problem. Predictive analytics? Even better.

What can predictive methods provide to a user looking for information to elucidate a business issue? If the issue is one that crops up again and again, monitoring user behavior and analyzing log files using basic counts, not fancy math, can provide some extremely useful insights. For example, the marketing department runs X queries across Y topics every week. The most frequent queries can inform personalization routines. Open source software like AWStats or WebLog Expert Lite can be helpful. The math is comprehensible to most college graduates.
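
The "basic counts" approach can be sketched in a few lines of Python. The tab-separated log format, the sample entries, and the `top_queries` helper below are all hypothetical, invented for illustration. A real search log will have its own layout.

```python
from collections import Counter

# Hypothetical query log, one entry per line: date<TAB>department<TAB>query.
SAMPLE_LOG = """\
2012-07-16\tmarketing\tcompetitor pricing
2012-07-16\tmarketing\ttrade show schedule
2012-07-17\tsales\tcompetitor pricing
2012-07-17\tmarketing\tCompetitor Pricing
2012-07-18\tmarketing\tlogo guidelines
2012-07-19\tmarketing\ttrade show schedule
2012-07-20\tmarketing\tcompetitor pricing
"""

def top_queries(log_text, department, n=3):
    """Return the n most frequent queries for one department as (query, count) pairs."""
    counts = Counter()
    for line in log_text.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        _date, dept, query = parts
        if dept == department:
            # Normalize case and whitespace so trivial variants count together.
            counts[query.strip().lower()] += 1
    return counts.most_common(n)

marketing_top = top_queries(SAMPLE_LOG, "marketing", n=2)
for query, count in marketing_top:
    print(f"{count:3d}  {query}")
```

Nothing here goes beyond tallying and sorting, which is the point. A ranked list like this can feed a personalization routine without any statistical model at all.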

But moving from basic counts to exponential family random graph models or graph limits may be tough for a person with a BA in Liberal Arts.

The vendors ignore this reality. Writing collateral is easy; for example:

  • Point-and-click access to sophisticated analytic methods
  • Automated reports which provide answers to real-world questions
  • Dynamic graphics that show real-time trends in streaming data.

Does anyone think that making decisions based on reports canned half a world away by a programmer with zero knowledge of the licensee’s “real” need will be reliable? How does a search vendor with expertise in indexing text make the leap into predictive models without stubbing a toe?

I am troubled by the arrogance of search and content processing vendors who use mathematical malarkey to try to rejuvenate a gasping product line. The number of PhDs in mathematics awarded in the US has been stuck in the 1,200 to 1,600 per year range for a while. When I worked at Halliburton Nuclear, the number of individuals who could perform the algorithm selection required for nuclear applications was very, very small. Would you believe fewer than 20 candidates per year?

We maintain a list of about 200 search and content processing vendors. Many of these are pitching the analytic story. My reaction is that vendors who get too far from their core competence are likely to find themselves in the same predicament that forced Convera, Delphes, Entopia, and other search vendors to go out of business.

To sum up, if prediction outfits had algorithms that delivered answers, would these companies be pitching software or would these firms be dealing in stocks and investing in real winners?

Stephen E Arnold, July 24, 2012

Phi Beta Iota: Governments and corporations have spent trillions on collecting information, only 2% of which is visible on the Internet, and much less if one counts secret technical collection. No one has invested seriously in desktop or back-office analytics, oil and pharmaceutical endeavors notwithstanding. Put another way: predictive analytics today applies ignorant algorithms to a fraction of the information available for machine processing, and completely neglects both the development of an analytic model that is holistic in nature, with documented cause and effect parameters fully integrated, and the development of desktop analytic tool-kits that let humans combine the human brain with machine-speed processing. We remain a dumb society and a dumb nation.

1989 Webb (US) CATALYST: Computer-Aided Tools for the Analysis of Science & Technology

Paul Fernhout: Open Letter to the Intelligence Advanced Programs Research Agency (IARPA)

Worth a Look: 1989 All-Source Fusion Analytic Workstation–The Four Requirements Documents