Stephen E. Arnold: Open Source HTTP Crawler (Norconex)

Stephen E. Arnold

Norconex Offers Open Source HTTP Crawler

Most commercial enterprise search vendors offer their own HTTP crawler, and several of those crawlers are open source. One new entry to the field stands out, though, for its unusual blend of web and enterprise search functionality. In the post “Norconex Gives Back to Open-Source,” Norconex describes its crawler and associated libraries:

“The Norconex HTTP Collector is an HTTP Crawler meant to give the greatest flexibility possible for developers and integrators. It makes it easy for Java developers to add custom features, so no one will get stuck again when dealing with odd requirements, difficult websites, or close-source crawler limitations. . . . The HTTP collector can be used stand-alone or embedded as a library in your own software.

“Norconex may release other collectors for various data sources in the future. In the meantime, we have encapsulated the document parsing process and sending of parsed data to your target search engine or repository into two separate libraries. We are releasing them as Norconex Importer and Norconex Committer.”

Norconex tells us that it focused on keeping configuration simple and on providing features that some existing crawlers lack. The enterprise search firm was founded in 2007 and is based in Ottawa, Canada.

Cynthia Murrell, July 16, 2013
