Stephen E. Arnold: Content Processing SUCKS


Next Generation Content Processing: Tail Fins and Big Data

Posted: 18 Aug 2013 03:11 PM PDT

[Note: I wrote this for Homeland Security Today. It will appear when the site works out its production problems. As background, check out “The Defense Department Thinks Troves of Personal Data Pose a National Security Threat.” If the Big Data systems worked as marketers said, would success stories not provide ample evidence of the value of these next-generation systems?]

Next-generation content processing seems, like wine, to improve with age. Over the last four years, smart software has been enhanced by design. What is your impression of the eye-popping interfaces from high-profile vendors like Algilex, Cybertap, Digital Reasoning, IBM i2, Palantir, Recorded Future, and similar firms? (A useful list is available from Carahsoft at http://goo.gl/v853TK.)

I am reminded of the design trends for tail fins and chrome for US automobiles in the 1950s and 1960s. Technology moved forward in these two decades, but soaring fins and chrome brightwork advanced more quickly. The basics of the automobile remained unchanged. Even today’s most advanced models perform the same functions as the Kings of Chrome of an earlier era. Eye candy has been enhanced with creature comforts. But the basics of today’s automobile would be recognized and easily used by a driver from Chubby Checker’s era. The refrain “Let’s twist again like we did last summer” applies to most of the advanced software used by law enforcement and the intelligence community.

Image: The tailfin of a 1959 Cadillac. Although bold, the tailfins of the 1959 Plymouth Fury and the limited production Superbird and Dodge Daytona dwarfed GM’s excesses. Source: https://en.wikipedia.org/wiki/File:Cadillac1001.jpg

Try this simple test. Here are screenshots from five next-generation content processing systems. Can you match the graphics with the vendor?

The companies whose visual outputs appear below are listed first. Easy enough: just like one of those primary school exercises, simply match the interface with the company.

The vendors represented are:

A Digital Reasoning (founded in 2000 and funded in part by SilverLake; the company positions itself as providing automated understanding, as did Autonomy, founded in 1996)

B IBM i2 (industry leader since the mid 1990s)

C Palantir (founded a decade ago with $300 million in funding from Founders Fund, Glynn Capital Management, and others)

D Quid (a start up funded in part by Atomico, SV Angel, and others)

E Recorded Future (funded in part by In-Q-Tel and Google, founded by the developer of Spotfire)

Display 1 from http://goo.gl/YoX8v8

Display 2 from http://goo.gl/ZUBjrp

Display 3 from http://goo.gl/BZ9WHH

Display 4 from http://goo.gl/OhpgVR

Display 5 from http://goo.gl/PKHsJ

Here are the answers: Display 1 is from Palantir. Display 2 is from Recorded Future. Display 3 is from IBM i2, one of the true innovators in the field. Display 4 is from the 14-year-old start up Digital Reasoning. Display 5 is from Quid, a hot start up in San Francisco.

The visual snapshots of these interfaces are like the tail fins on the 1959 Cadillac. The interfaces suggest the future. The graphics are bold. The meaning of the outputs is not immediately evident. What was the purpose of the tail fin? In these PowerPoint-ready screen displays, as with the tail fin, the underlying information is subordinate to eye candy.

An analyst with an expert’s knowledge of the underlying data and appropriate mathematical training can figure out what the interfaces “show.” In my experience, the reality of today’s workplace is that there are too few analysts with the requisite expertise to configure the systems, verify the data, check the index freshness, and verify that the reports generate outputs in which the user can have confidence. And once the outputs are in hand, isn’t more work needed to make certain that the mathematical procedures have performed as desired?

I am approaching 70 years of age. I have sat through many briefings. I know firsthand that certain presenters focus on the visual impact of their graphics. The “plumbing” is kept in the basement.

Each of these systems can be useful in certain situations. However, the systems pivot on valid data, correct configuration of the system, and application of appropriate mathematical techniques. The math is quite important. I have decades of experience watching professionals’ eyes glaze over when the discussion shifts to the effort required to validate data, normalize data, select mathematical procedures, configure the thresholds and settings for those procedures, set up the reports, and do the work needed to make certain the outputs hit the degree of confidence the user or situation requires.

A number of years ago I was working in Washington, DC. The firm with which I had a relationship responded to a request for proposal from a government agency. The basic idea was simple to articulate. The agency wanted a way for a staff member to plug in a name and learn via a green, yellow, or red flag the degree of risk associated with a person.

We used some basic methods in our system, mindful of http://www.cs.uvm.edu/~icdm/algorithms/10Algorithms-08.pdf. Like other content processing firms, we tapped:

Industry standard database
Standard word lists and custom-built controlled term lists
Expectation maximization routines
Clustering
Latent semantic indexing
Fuzzification methods
k-Means methods
k-Nearest Neighbor methods
Bayesian methods
Graphics inspired by work at the Georgia Institute of Technology, the University of Maryland, and elsewhere.

Our graphics were based on a familiar metaphor. Plug in a name and the system would display a traffic light that glowed red, yellow, or green. Red indicated that the entity tested required immediate attention. A green light was the okay signal. A yellow meant that a person required additional research or investigation.

Image: An intuitive interface based on mathematical methods that could be computed within available resources of CPU cycles, bandwidth, and index refresh capabilities. Users could plug in a name and get an indication of the action the entity warranted.
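To make the traffic light metaphor concrete, here is a minimal sketch in Python of how a numeric risk score might be mapped to a flag. It is an illustration only, not the code from the project described here. The 0-to-1 score range, the two threshold values, and the names Flag and flag_for_score are all assumptions made for this example.

```python
from enum import Enum


class Flag(Enum):
    GREEN = "green"    # okay signal
    YELLOW = "yellow"  # additional research or investigation required
    RED = "red"        # immediate attention required


# Hypothetical cut-offs. The article does not state the settings that
# the original system used.
YELLOW_THRESHOLD = 0.4
RED_THRESHOLD = 0.7


def flag_for_score(risk_score: float) -> Flag:
    """Map a risk score in [0, 1] (an assumed scale) to a traffic light flag."""
    if not 0.0 <= risk_score <= 1.0:
        raise ValueError("risk_score must be between 0 and 1")
    if risk_score >= RED_THRESHOLD:
        return Flag.RED
    if risk_score >= YELLOW_THRESHOLD:
        return Flag.YELLOW
    return Flag.GREEN
```

The hard part is not this mapping. It is producing a score that deserves confidence and choosing thresholds that fit the situation, which is the work that tends to stay in the basement.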

The system used specified data sets and worked reasonably well. Issues we identified were related to the names of the entities in the test data set, errors introduced via user mistyping or misspelling a name, and rejected records during content processing due to inconsistencies in provided data format.
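The mistyped and misspelled name problem can be illustrated with a short sketch. It uses Python's standard library: unicodedata to strip accents and difflib.get_close_matches to find near matches. This is one simple approach under stated assumptions, not the method the project used. The function names normalize_name and closest_names, the 0.8 similarity cutoff, and the sample names in the usage note are hypothetical.

```python
import difflib
import unicodedata


def normalize_name(name: str) -> str:
    """Lowercase a name, strip accents, and collapse runs of whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return " ".join(without_accents.lower().split())


def closest_names(query, known_names, cutoff=0.8, limit=3):
    """Return up to `limit` known names that resemble the query.

    `cutoff` is an assumed similarity floor between 0 and 1. Lower values
    tolerate more typing errors but return more false matches.
    """
    # Map each normalized form back to the name as stored.
    by_normalized = {normalize_name(n): n for n in known_names}
    hits = difflib.get_close_matches(
        normalize_name(query), list(by_normalized), n=limit, cutoff=cutoff
    )
    return [by_normalized[h] for h in hits]
```

With these settings, a query such as closest_names("Jon Smyth", ["John Smith", "Jane Doe"]) should surface "John Smith" and skip "Jane Doe". Character similarity alone will not resolve transliteration variants or aliases, which is one reason name handling remained an issue.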

How different are today’s state-of-the-art systems from our early work? The answer, based on my team’s experience and research, is, “Not much.”

The reason is that the guts of many of today’s extraordinarily sophisticated and expensive systems are based on the same procedures my team used in the red-yellow-green project. Those procedures have to deal with the limits of what can be computed in practical time, a question mathematicians frame as P versus NP. See http://en.wikipedia.org/wiki/P_versus_NP_problem.

The reason for the lack of technical progress in the face of increased pressure for fast decisions and flows of Big Data is that yesterday’s and today’s computer systems are up against some hard limits. Without a major breakthrough in how problems can be attacked by mathematicians, systems use the same methods. Quantum computers and nano-computers remain in the future for many government entities.

In my opinion, the differences among and between systems are now superficial, like those 1950s tail fins.

What I mean is that the cosmetics of the system are more important than finding new ways to use more sophisticated mathematical procedures. Perhaps a sudden computing innovation will hasten the arrival of science-fiction style computing.

Until then, most government entities rely on the same chunks of mathematics. The marketers assert many important differences. The outputs of the systems are subject to the computational constraints I mentioned. The fancy interfaces are offered with assurances that no programmer is needed to analyze data that are processed automatically by smart software. Today’s software, the pitch goes, understands meaning. In my experience, marketers pay scant attention to what the mathematical building blocks actually do. The focus is on closing the sale. A little sugar and a dollop of science fiction are more palatable to the promotion-hungry or to a procurement team with modest math knowledge and subject matter expertise.

I have one question: “If these 2013 systems worked, why are the developers fighting to make sales to a government entity?” If I had a 2013 system that could understand information and identify important facts overlooked by others, I would play the stock market or use the system to pick horses at the Kentucky Derby.

If anyone knows of a company implementing this type of business plan, please, let me know.

Stephen E Arnold, August 15, 2013