Open web spider platform. It uses Akka Cluster together with Distributed PubSub for distributed processing.
The webspider-demo module contains a simple web application that starts one task scheduler node and a couple of web processing nodes, and exposes its interface at http://localhost:8080/
- extract text from HTML/PDF documents
- process only documents whose names or content types match the given patterns
- extract data using XPath expressions from XHTML pages or from HTML pages that are not well-formed
- maintain website graph (links between ancestor / successor pages)
- process websites behind authentication (HTTP Basic/Digest, form-based authentication)
- handle failures and resume processing from the point where the application was aborted
- provide an extension API for document type handlers and protocol handlers
- concurrent processing of website pages
- minimize traffic by using bzip/gzip encoding when possible and by avoiding downloading the same link more than once
- HTTP(S)
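
Two of the features listed above, processing only documents that match given patterns and never downloading the same link twice, can be illustrated with a small self-contained sketch. This is not the platform's actual API: the class `LinkFrontier` and its methods `offer` and `poll` are hypothetical names chosen for illustration, and the sketch assumes absolute links, an in-memory set of visited links, and patterns matched against the URL path only.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Queue;
import java.util.Set;
import java.util.regex.Pattern;

/** Hypothetical crawl frontier; an illustration only, not part of the project. */
final class LinkFrontier {
    private final Set<String> seen = new HashSet<>();      // canonical keys already accepted
    private final Queue<URI> pending = new ArrayDeque<>(); // links waiting to be fetched
    private final List<Pattern> includePatterns;

    LinkFrontier(List<Pattern> includePatterns) {
        this.includePatterns = List.copyOf(includePatterns);
    }

    /** Builds a key under which trivially different spellings of a link compare equal. */
    private static String canonicalKey(URI uri, String path) {
        String scheme = uri.getScheme() == null ? "" : uri.getScheme().toLowerCase(Locale.ROOT);
        String host = uri.getHost() == null ? "" : uri.getHost().toLowerCase(Locale.ROOT);
        String port = uri.getPort() == -1 ? "" : ":" + uri.getPort();
        String query = uri.getRawQuery() == null ? "" : "?" + uri.getRawQuery();
        // The fragment is deliberately left out: it never reaches the server.
        return scheme + "://" + host + port + path + query;
    }

    /** Queues the link and returns true only if it matches a pattern and is new. */
    boolean offer(String link) {
        URI uri;
        try {
            uri = new URI(link).normalize(); // resolves "." and ".." path segments
        } catch (URISyntaxException e) {
            return false;                    // malformed links are skipped
        }
        if (uri.isOpaque()) {
            return false;                    // e.g. mailto: links have no path to fetch
        }
        String path = uri.getRawPath().isEmpty() ? "/" : uri.getRawPath();
        boolean matches = includePatterns.stream().anyMatch(p -> p.matcher(path).find());
        if (!matches || !seen.add(canonicalKey(uri, path))) {
            return false;                    // filtered out, or already seen
        }
        pending.add(uri);
        return true;
    }

    /** Returns the next link to fetch, or null when nothing is pending. */
    URI poll() {
        return pending.poll();
    }

    public static void main(String[] args) {
        LinkFrontier frontier =
                new LinkFrontier(List.of(Pattern.compile("\\.(html|pdf)$")));
        System.out.println(frontier.offer("http://Example.com/docs/./a.html#top")); // true
        System.out.println(frontier.offer("http://example.com/docs/a.html"));       // false, duplicate
        System.out.println(frontier.offer("http://example.com/logo.png"));          // false, no pattern match
    }
}
```

In a clustered deployment the set of visited links would have to be shared or partitioned between nodes and persisted to survive restarts; the in-memory `HashSet` here only demonstrates the idea on a single node.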