Berto Jongman: Washington Post Discovers Deep Web — and the World Bank's Unindexed PDFs — PBI Technical Team Comments

Berto Jongman

Only fifteen years after Abe Lederman said the same thing at OSS!

The solutions to all our problems may be buried in PDFs that nobody reads

What if someone had already figured out the answers to the world’s most pressing policy problems, but those solutions were buried deep in a PDF, somewhere nobody will ever read them?

According to a recent report by the World Bank, that scenario is not so far-fetched. The bank is one of those high-minded organizations — Washington is full of them — that release hundreds, maybe thousands, of reports a year on policy issues big and small. Many of these reports are long and highly technical, and just about all of them get released to the world as a PDF report posted to the organization’s Web site.

The World Bank recently decided to ask an important question: Is anyone actually reading these things? They dug into their Web site traffic data and came to the following conclusions: Nearly one-third of their PDF reports had never been downloaded, not even once. Another 40 percent of their reports had been downloaded fewer than 100 times. Only 13 percent had seen more than 250 downloads in their lifetimes. Since most World Bank reports have a stated objective of informing public debate or government policy, this seems like a pretty lousy track record.

Read full article.

Phi Beta Iota: The Post has published something useful. We held it back for a day while we queried our technical team. The Deep Web is not new. Deep Web Technologies, led by Abe Lederman, remains the best in the world, and is also completely ignored by the US Government and the US Intelligence Community. There are some aspects of this discussion that we want to bring forward:

01 PDFs are only as good as the source diversity and integrity, the analytic processing, and the ostensibly expert analysis that goes into them, and the degree to which their findings — not necessarily the report in full — are disseminated and have effect. At this time the World Bank and virtually all other organizations fail on all four fronts.

02 Google Search is a very shallow service (and one that has been corrupted by paid advertising, such that Google Search will often show you what someone else wants you to see, not what you need to see), but Google also has reserves of information. Google has indexed material that is not visible to the public, including proprietary and secret information accessed via its Enterprise services, information that was never supposed to escape those confines. If there is a financial demand for Deep Web access, Google has considerable capabilities.

03 PDFs can be indexed in full text. PDFs should be indexed in full text. Indeed, we would go so far as to suggest that among the greatest failings of the entire web infrastructure has been the absence of the combination of precision URLs for each and every document, and full text indexing of all forms of knowledge as a responsibility of the provider. This is something that could be franchised and leased, with free versions for edu and org.

04 Abe Lederman has thought about this more than any other person on the planet (at least in English), and has pointed out to us that “the long tail” represents an exponential opportunity for repurposing knowledge. As Amazon has discovered, the greatest value is not in the convenience of the now, but the convenience of the hard to find. In gross generalization, the web and the use of information on the web stop at the 20% that can “see” information as it is presented, and use it then. The other 80% remain unwitting of the existence, relevance, and potential ease of access of that information. If that information were more broadly available, this would have monetization implications all around.

05 All of the above refers to the digital world marketplace of knowledge. It does not cover the other 80% of knowledge, the knowledge that is in analog form or known by humans that must be found, approached, and the tailored knowledge elicited in near real time. That is the full spectrum Human Intelligence (HUMINT) challenge that no one is taking seriously, in part because no one is actually committed to providing decision-support for every Cabinet Department, every Congressional oversight committee, state and local officials, and so on across the eight tribes. Deep Web Technologies, in our view, is the foundation for doing what the secret world cannot do: get a grip on useful knowledge across all mission areas in all languages and mediums. There are other pieces, with geospatial tiling of all information in all languages and mediums being among them, that no one, anywhere, is focusing on. Our mission at Earth Intelligence Network is to focus on these issues, and help anyone who wants help to get it right.
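
To make point 03 concrete, here is a minimal sketch of what provider-side full-text indexing keyed to precise document URLs could look like. It is an illustration under stated assumptions, not a description of any existing system: the text is assumed to have already been extracted from each PDF (that step is not shown), the example.org URLs and report texts are made up, and the function names are our own.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Build an inverted index mapping each term to the URLs containing it.

    `documents` maps a document URL to its already-extracted full text.
    """
    index = defaultdict(set)
    for url, text in documents.items():
        for term in tokenize(text):
            index[term].add(url)
    return index


def search(index, query):
    """Return the URLs of documents that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results


# Hypothetical documents: invented URLs and invented text.
documents = {
    "https://example.org/reports/water-policy.pdf":
        "Irrigation subsidies and rural water policy",
    "https://example.org/reports/energy-access.pdf":
        "Rural energy access and subsidies reform",
}
index = build_index(documents)
```

Calling `search(index, "rural subsidies")` returns both URLs, while `search(index, "water")` returns only the water policy report. The point of the sketch is that every hit resolves to one exact document address, which is the combination of precise URLs and full-text indexing argued for in point 03.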

