
polipus's Introduction


Polipus

A distributed web crawler written in Ruby and backed by Redis. This project was presented at RubyDay 2013: http://www.slideshare.net/francescolaurita/roll-your-own-web-crawler-rubyday

Features

  • Easy to use
  • Distributed and scalable
  • It uses a smart/fast and space-efficient probabilistic data structure to determine whether a URL should be visited or not (see the sketch after this list)
  • It doesn't exhaust your Redis server
  • Plays nicely with MongoDB, even though MongoDB is not strictly required
  • Easy to write your own page storage strategy
  • Focused crawling made easy
  • Heavily inspired by Anemone https://github.com/chriskite/anemone/
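
The "probabilistic data structure" mentioned above is a Bloom-filter-style visited-URL set. Polipus keeps its real tracker in Redis; the following is only an illustrative, in-memory sketch of the idea (all names are made up for the example):

    require 'digest'

    # Tiny Bloom filter: k salted hashes map each URL to k bit positions.
    # False positives are possible (an unseen URL may be reported as already
    # seen), false negatives are not.
    class TinyBloomFilter
      def initialize(bits = 1 << 20, hashes = 3)
        @bits   = bits
        @hashes = hashes
        @field  = Array.new(bits, false)
      end

      def add(url)
        positions(url).each { |i| @field[i] = true }
      end

      def include?(url)
        positions(url).all? { |i| @field[i] }
      end

      private

      # Derive the bit positions from independent salted SHA-1 digests.
      def positions(url)
        (0...@hashes).map do |salt|
          Digest::SHA1.hexdigest("#{salt}:#{url}").to_i(16) % @bits
        end
      end
    end

    tracker = TinyBloomFilter.new
    tracker.add("http://rubygems.org/")
    tracker.include?("http://rubygems.org/") # => true
    tracker.include?("http://example.com/")  # => false (with high probability)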

Supported Ruby Interpreters

  • MRI 1.9.x >= 1.9.1
  • MRI 2.0.0
  • MRI 2.1.2
  • JRuby 1.9 mode
  • Rubinius

Survival code example

require "polipus"

Polipus.crawler("rubygems","http://rubygems.org/") do |crawler|
  # In-place page processing
  crawler.on_page_downloaded do |page|
    # A nokogiri object
    puts "Page title: '#{page.doc.css('title').text}' Page url: #{page.url}"
  end
end

Installation

$ gem install polipus

Testing

$ bundle install
$ rake

Contributing to polipus

  • Check out the latest master to make sure the feature hasn't been implemented or the bug hasn't been fixed yet.
  • Check out the issue tracker to make sure someone hasn't already requested it and/or contributed it.
  • Fork the project.
  • Start a feature/bugfix branch.
  • Commit and push until you are happy with your contribution.
  • Make sure to add tests for it. This is important so I don't break it in a future version unintentionally.
  • Install RuboCop and make sure it is happy.
  • Please try not to mess with the Rakefile, version, or history. If you want to have your own version, or it is otherwise necessary, that is fine, but please isolate the change to its own commit so I can cherry-pick around it.

Copyright

Copyright (c) 2013 Francesco Laurita. See LICENSE.txt for further details.

polipus's People

Contributors

hendricius, janpieper, lepek, nofxx, parallel588, pcboy, proffard, stefanofontanelli, taganaka, tmaier


polipus's Issues

invalid byte sequence in US-ASCII (ArgumentError)

/usr/lib/ruby/1.9.1/uri/common.rb:304:in `gsub': invalid byte sequence in US-ASCII (ArgumentError)
from /usr/lib/ruby/1.9.1/uri/common.rb:304:in `escape'
from ./vendor/bundle/ruby/1.9.1/gems/polipus-0.3.2/lib/polipus/page.rb:169:in `to_absolute'

Note:

export LANG=en_US.UTF-8
export LANGUAGE=en_US.UTF-8
export LC_ALL=en_US.UTF-8

doesn't help here

Support for other charsets than UTF-8

I have some issues crawling Japanese websites with SHIFT-JIS encoding.
Nokogiri is not doing any automatic charset conversion to UTF-8.

I fixed it by rewriting the Page#doc method and using kconv.

    require 'kconv'

    def doc
      return @doc if @doc

      noko_en_id = {
        Kconv::UTF8 => 'UTF-8',
        Kconv::EUC => 'EUC-JP',
        Kconv::SJIS => 'SHIFT-JIS',
        Kconv::ASCII => 'ASCII',
        Kconv::JIS => 'ISO-2022-JP'
      }[Kconv.guess(@body || '')]

      @doc = Nokogiri::HTML(@body, nil, noko_en_id) if @body && html? rescue nil                                                                            
    end

Gzip decoded body not used anywhere

In HTTP#fetch_pages you try to decode the gzipped content of a page.

https://github.com/taganaka/polipus/blob/master/lib/polipus/http.rb#L34-L39

          body = response.body.dup
          if response.to_hash.fetch('content-encoding', [])[0] == 'gzip'
            gzip = Zlib::GzipReader.new(StringIO.new(body))    
            body = gzip.read
          end
          pages << Page.new(location, :body          => response.body.dup,

but body is not used anywhere afterwards; :body should get its value.

In general, I'm not sure if this is necessary at all, since http://www.ruby-doc.org/stdlib-2.1.1/libdoc/net/http/rdoc/Net/HTTP.html#class-Net::HTTP-label-Compression states this is done by Net::HTTP automatically.
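
For reference, a self-contained sketch of what the reporter suggests (this is not Polipus's actual code; the point is simply that the gunzipped string, not response.body, should be handed to Page.new):

    require 'zlib'
    require 'stringio'

    # Return the decoded body of a Net::HTTP response: gunzip it when the
    # server reports gzip content-encoding, otherwise pass the raw body through.
    def decoded_body(response)
      body = response.body.dup
      if response.to_hash.fetch('content-encoding', [])[0] == 'gzip'
        body = Zlib::GzipReader.new(StringIO.new(body)).read
      end
      body
    end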

Change logging format

Polipus: [worker #2] Page [http://www.example.com/89237459832475] downloaded

In iTerm 2, I can CMD+click on the link, but it then tries to open http://www.example.com/89237459832475%5D.

Therefore, I suggest changing the log message to

Polipus: [worker #2] Page (http://www.example.com/89237459832475) downloaded

or

Polipus: [worker #2] Page <http://www.example.com/89237459832475> downloaded

I could provide a pull request if you want.

Cannot install on JRuby 1.7.13. Error with bson_ext-1.9.2

For some reason I'm not able to install the polipus gem on JRuby 1.7.13. I tried both Windows 8.0 and Ubuntu 12.04 and got the same error message.

Gem::Installer::ExtensionBuildError: ERROR: Failed to build gem native extension.

C:/torquebox-3.1.1/jruby/bin/jruby.exe extconf.rb 

NotImplementedError: C extension support is not enabled. Pass -Xcext.enabled=true to JRuby or set JRUBY_OPTS.

(root) at C:/torquebox-3.1.1/jruby/lib/ruby/shared/mkmf.rb:8
require at org/jruby/RubyKernel.java:1065
(root) at C:/torquebox-3.1.1/jruby/lib/ruby/shared/rubygems/core_ext/kernel_require.rb:1
(root) at extconf.rb:1

Gem files will remain installed in C:/torquebox-3.1.1/jruby/lib/ruby/gems/shared/gems/bson_ext-1.9.2 for inspection.
Results logged to C:/torquebox-3.1.1/jruby/lib/ruby/gems/shared/gems/bson_ext-1.9.2/ext/cbson/gem_make.out
An error occurred while installing bson_ext (1.9.2), and Bundler cannot
continue.
Make sure that gem install bson_ext -v '1.9.2' succeeds before bundling.

RegularExpression To Follow a Link

When I use a regex like the one below, it won't crawl.

crawler.follow_links_like(/show.php\?id=[A-Z]/) 

However, if I remove the id parameter, it works.

crawler.follow_links_like(/show.php/) 

Please let me know whether regexes matching dynamic parameters are supported. Many thanks!

Robots.txt option

Hello,

Is there an option for the robots.txt file, something like obey_robots: true/false?

Thanks.
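
Recent Polipus releases do expose an option for this. A hedged sketch, assuming the option is named obey_robots_txt (check the OPTS of your installed version):

    require "polipus"

    # obey_robots_txt is an assumed option name here, not confirmed by this issue.
    options = { obey_robots_txt: true }

    Polipus.crawler("rubygems", "http://rubygems.org/", options) do |crawler|
      crawler.on_page_downloaded do |page|
        puts "Visited: #{page.url}"
      end
    end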

Whitelist start urls?

If you use #follow_links_like and the given start URLs do not match the configured regexps, the crawler stops working. Is there a reason why the start URLs aren't whitelisted?

start_urls = [ "http://www.example.com/foo/bar" ]
Polipus.crawler("dummy", start_urls, options) do |crawler|
  crawler.follow_links_like(/\/bar\/foo/)
end

The links on the start page match the given regexp.

Support for headless crawling

Does it make sense to have support for headless crawling built into the framework? A lot of websites these days are single-page apps, and crawling them with the current framework won't work.

We could try to do this using PhantomJS or capybara-webkit. I've been able to do a headless crawl using capybara-webkit and Poltergeist before.

What are your thoughts on this?

Internet connection lost; Page still stored and processed

Today I lost my internet connection and got the following error message in my logs:

W, [2014-05-18T18:44:44.684461 #13198]  WARN -- Polipus: Page http://www.example.com/foobar has error: No route to host - connect(2) for "www.example.com" port 80

Polipus should not proceed with processing a page in this case.
It should requeue the page and move on to the next one.
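
A workaround sketch at the application level, using the on_page_error / add_to_queue hooks that appear in the SocketError issue further down (Polipus itself does not requeue automatically here):

    require "polipus"

    Polipus.crawler("example", "http://www.example.com/") do |crawler|
      crawler.on_page_error do |page|
        page.storable = false      # don't persist the failed fetch
        crawler.add_to_queue(page) # put it back on the queue for a later retry
      end
    end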

Fails when response["Set-Cookie"] is nil

TypeError: no implicit conversion of nil into String
/usr/local/opt/rbenv/versions/2.1.1/lib/ruby/gems/2.1.0/gems/http-cookie-1.0.2/lib/http/cookie/scanner.rb:20:in `initialize'
/usr/local/opt/rbenv/versions/2.1.1/lib/ruby/gems/2.1.0/gems/http-cookie-1.0.2/lib/http/cookie/scanner.rb:20:in `initialize'
/usr/local/opt/rbenv/versions/2.1.1/lib/ruby/gems/2.1.0/gems/http-cookie-1.0.2/lib/http/cookie.rb:281:in `new'
/usr/local/opt/rbenv/versions/2.1.1/lib/ruby/gems/2.1.0/gems/http-cookie-1.0.2/lib/http/cookie.rb:281:in `block in parse'
/usr/local/opt/rbenv/versions/2.1.1/lib/ruby/gems/2.1.0/gems/http-cookie-1.0.2/lib/http/cookie.rb:280:in `tap'
/usr/local/opt/rbenv/versions/2.1.1/lib/ruby/gems/2.1.0/gems/http-cookie-1.0.2/lib/http/cookie.rb:280:in `parse'
/usr/local/opt/rbenv/versions/2.1.1/lib/ruby/gems/2.1.0/gems/http-cookie-1.0.2/lib/http/cookie_jar.rb:191:in `parse'
/usr/local/opt/rbenv/versions/2.1.1/lib/ruby/gems/2.1.0/gems/polipus-0.3.0/lib/polipus/http.rb:180:in `get_response'
/usr/local/opt/rbenv/versions/2.1.1/lib/ruby/gems/2.1.0/gems/polipus-0.3.0/lib/polipus/http.rb:149:in `get'
/usr/local/opt/rbenv/versions/2.1.1/lib/ruby/gems/2.1.0/gems/polipus-0.3.0/lib/polipus/http.rb:47:in `fetch_pages'
/usr/local/opt/rbenv/versions/2.1.1/lib/ruby/gems/2.1.0/gems/polipus-0.3.0/lib/polipus.rb:183:in `block (3 levels) in takeover'
/usr/local/opt/rbenv/versions/2.1.1/lib/ruby/gems/2.1.0/gems/redis-queue-0.0.4/lib/redis/queue.rb:56:in `block in process'
/usr/local/opt/rbenv/versions/2.1.1/lib/ruby/gems/2.1.0/gems/redis-queue-0.0.4/lib/redis/queue.rb:54:in `loop'
/usr/local/opt/rbenv/versions/2.1.1/lib/ruby/gems/2.1.0/gems/redis-queue-0.0.4/lib/redis/queue.rb:54:in `process'
/usr/local/opt/rbenv/versions/2.1.1/lib/ruby/gems/2.1.0/gems/polipus-0.3.0/lib/polipus.rb:158:in `block (2 levels) in takeover'
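
A minimal sketch of a guard at the call site, assuming a response object and cookie jar like the ones used in Polipus::HTTP#get_response (the method and variable names below are illustrative, not the actual fix):

    require 'http/cookie_jar'

    # http-cookie's scanner raises TypeError on nil, so only parse the header
    # when it is actually present.
    def store_cookies(cookie_jar, response, request_uri)
      set_cookie = response['Set-Cookie']
      cookie_jar.parse(set_cookie, request_uri) unless set_cookie.nil?
    end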

Unicode pages do not work anymore on 0.5.0

I was able to crawl Unicode pages with 0.4.0, but after upgrading to 0.5.0 only some English characters end up in a crawled page. Please let me know if there are any settings I have to change.

URL patching

Hello, I have the pattern "%E2%80%93" in some URL strings and need to replace it with "%96" before a Page is saved. Some websites use strange characters in their URLs, and I discovered that some of those characters must be replaced, otherwise the URL cannot be visited. I believe these URLs are stored as links on a page.

Please let me know if there is a way to rewrite URLs based on a regex pattern before a page is stored.
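
A hypothetical helper for the rewrite described above (where exactly it would be hooked into Polipus, before enqueueing or before storage, is not shown here):

    # Replace the encoded en dash with the byte sequence the reporter needs.
    def patch_url(url)
      url.gsub('%E2%80%93', '%96')
    end

    patch_url('http://example.com/a%E2%80%93b') # => "http://example.com/a%96b"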

Make it work with Mongo 2.x

I added a new mongo2_store, a mongo2_queue and a few other methods to make it work with the new MongoDB 2.x driver.

I wonder whether you are interested in a pull request for this, supporting both the mongo 1.x and mongo 2.x Ruby drivers, or whether your plan is to replace the current mongo implementation and update it in favor of the new driver.

SocketError could mean, domain is gone or no internet connection

When you try to resolve a domain which does not exist, Polipus creates an error page with SocketError.

In that case the page really does not exist anymore, so it's like a 404 error, just at the DNS level.

But at the same time, a SocketError is also raised if the internet connection is lost for any reason.

So to be sure the site is gone, we would need a method like this:

    # If an external host is still reachable, our connection is fine and the
    # crawled site really is gone; if the HEAD request itself fails, our own
    # internet connection is down and the page should not be treated as gone.
    def internet_connection_available?
      Excon.head('http://www.google.com')
      logger.debug { 'Webpage not available anymore' }
      true
    rescue Excon::Errors::SocketError
      logger.error { 'Internet connection lost' }
      false
    end

Or maybe even better, something like this: http://stackoverflow.com/questions/2385186/check-if-internet-connection-exists-with-ruby/22837368#22837368

I use it like this:

        crawler.on_page_error do |page|
          page.storable = false
          webpage_gone = page.error.is_a?(SocketError) && internet_connection_available?
          crawler.add_to_queue(page) unless page.not_found? || webpage_gone
        end

Shall we add something for this case directly to Polipus?

How to setup in a cluster environment?

This is really an awesome project; however, I couldn't figure out how to set it up in a cluster environment. Can you provide a simple cluster (separate machines) setup?
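
A hedged sketch of a multi-machine setup: every node runs the same script and points at one shared Redis (and, optionally, one shared page store). The :workers and :redis_options keys are taken from common Polipus examples and should be checked against the OPTS of your installed version:

    require "polipus"
    require "socket"

    options = {
      workers: 10,                                          # crawler threads per node (assumed option name)
      redis_options: { host: "redis.internal", port: 6379 } # shared Redis for queue + URL tracker (assumed option name)
    }

    Polipus.crawler("rubygems", "http://rubygems.org/", options) do |crawler|
      crawler.on_page_downloaded do |page|
        puts "Fetched #{page.url} on #{Socket.gethostname}"
      end
    end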

Cannot use with mongoid ~> 4.0.0

https://github.com/taganaka/polipus

Bundler could not find compatible versions for gem "bson":

In snapshot (Gemfile.lock):
  bson (1.9.2)

In Gemfile:
  mongoid (> 4.0.0) ruby depends on
    moped (> 2.0.0) ruby depends on
      bson (~> 2.2) ruby

Running bundle update will rebuild your snapshot from scratch, using only
the gems in your Gemfile, which may resolve the conflict.

[Question] Using MongoDB as 'cache' aside of Redis

Hi. I have a question related to the excellent presentation from RubyDay 2013. I'd like to crawl really big sites (up to ~100M pages) and I'm thinking about the best approach. I'm giving Redis as much memory as possible, setting maxmemory and maxmemory-policy to noeviction, and I've also enabled the MongoDB backend.

Does it work the way I think it should, i.e. when the used Redis memory gets close to the maxmemory value, part of the data is moved over to Mongo? If not, how can it work on a huge number of pages? Thanks!

OK, I found the 'queue_overflow_adapter' example. That should be it!

Incremental Crawling

Hi, I have a setup like the one below and it works fine the first time: all pages are crawled as intended. However, when I run it a second time expecting to pick up new updates, it hangs with a message saying the page is "already stored". When I set the cleaner option to true it wipes out the entire database and starts from scratch, which is exactly what I want to avoid. Obviously the page is already stored, but shouldn't the crawler still look for updates?

Polipus::Plugin.register Polipus::Plugin::Cleaner, reset:false
starting_urls = ["http://www.abc.com/home/"]

[worker #0] Page [http://www.abc.com/home/] already stored.

Thread seems to hang in HTTP Call

Hi!

It seems one of our threads is stuck in an HTTP call. I think the function is:

https://github.com/taganaka/polipus/blob/master/lib/polipus/http.rb#L170

It looks like the connection is never closed. Any idea what this could be?

Thanks!

Here is a full stacktrace:

/home/deployer/.rbenv/versions/2.0.0-p353/lib/ruby/2.0.0/net/protocol.rb:155:in `rescue in rbuf_fill'
/home/deployer/.rbenv/versions/2.0.0-p353/lib/ruby/2.0.0/net/protocol.rb:152:in `rbuf_fill'
/home/deployer/.rbenv/versions/2.0.0-p353/lib/ruby/2.0.0/net/protocol.rb:134:in `readuntil'
/home/deployer/.rbenv/versions/2.0.0-p353/lib/ruby/2.0.0/net/protocol.rb:144:in `readline'
/home/deployer/.rbenv/versions/2.0.0-p353/lib/ruby/2.0.0/net/http/response.rb:39:in `read_status_line'
/home/deployer/.rbenv/versions/2.0.0-p353/lib/ruby/2.0.0/net/http/response.rb:28:in `read_new'
/home/deployer/.rbenv/versions/2.0.0-p353/lib/ruby/2.0.0/net/http.rb:1406:in `block in transport_request'
/home/deployer/.rbenv/versions/2.0.0-p353/lib/ruby/2.0.0/net/http.rb:1403:in `catch'
/home/deployer/.rbenv/versions/2.0.0-p353/lib/ruby/2.0.0/net/http.rb:1403:in `transport_request'
/home/deployer/.rbenv/versions/2.0.0-p353/lib/ruby/2.0.0/net/http.rb:1376:in `request'
/home/deployer/.rbenv/versions/2.0.0-p353/lib/ruby/gems/2.0.0/gems/rest-client-1.6.7/lib/restclient/net_http_ext.rb:51:in `request'
/home/deployer/.rbenv/versions/2.0.0-p353/lib/ruby/gems/2.0.0/bundler/gems/polipus-b4a3ce1be226/lib/polipus/http.rb:149:in `get_response'
/home/deployer/.rbenv/versions/2.0.0-p353/lib/ruby/gems/2.0.0/bundler/gems/polipus-b4a3ce1be226/lib/polipus/http.rb:123:in `get'
/home/deployer/.rbenv/versions/2.0.0-p353/lib/ruby/gems/2.0.0/bundler/gems/polipus-b4a3ce1be226/lib/polipus/http.rb:32:in `fetch_pages'
/home/deployer/.rbenv/versions/2.0.0-p353/lib/ruby/gems/2.0.0/bundler/gems/polipus-b4a3ce1be226/lib/polipus.rb:179:in `block (3 levels) in takeover'
/home/deployer/.rbenv/versions/2.0.0-p353/lib/ruby/gems/2.0.0/gems/redis-queue-0.0.3/lib/redis/queue.rb:56:in `block in process'
/home/deployer/.rbenv/versions/2.0.0-p353/lib/ruby/gems/2.0.0/gems/redis-queue-0.0.3/lib/redis/queue.rb:54:in `loop'
/home/deployer/.rbenv/versions/2.0.0-p353/lib/ruby/gems/2.0.0/gems/redis-queue-0.0.3/lib/redis/queue.rb:54:in `process'
/home/deployer/.rbenv/versions/2.0.0-p353/lib/ruby/gems/2.0.0/bundler/gems/polipus-b4a3ce1be226/lib/polipus.rb:154:in `block (2 levels) in takeover'

#queue_overflow_adapter and #overflow_adapter; Same thing?

There's an option :queue_overflow_adapter in Polipus::PolipusCrawler::OPTS, and there is also a method #overflow_adapter.

The latter is defined twice.

overflow_adapter is not documented and does not seem to be used anywhere. I think it is redundant.
