
polipus's Introduction


Polipus

A distributed web crawler written in Ruby and backed by Redis. This project was presented at RubyDay 2013: http://www.slideshare.net/francescolaurita/roll-your-own-web-crawler-rubyday

Features

  • Easy to use
  • Distributed and scalable
  • It uses a smart/fast and space-efficient probabilistic data structure to determine whether a URL should be visited or not (see the sketch after this list)
  • It doesn't exhaust your Redis server
  • Plays nicely with MongoDB, even though MongoDB is not strictly required
  • Easy to write your own page storage strategy
  • Focused crawling made easy
  • Heavily inspired by Anemone https://github.com/chriskite/anemone/
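
The "probabilistic data structure" mentioned above is a Bloom-filter-style visited-URL set. Polipus keeps its real tracker in Redis; the following is only an illustrative, in-memory sketch of the idea (all names are made up for the example):

    require 'digest'

    # Tiny Bloom filter: k salted hashes map each URL to k bit positions.
    # False positives are possible (an unseen URL may be reported as already
    # seen), false negatives are not.
    class TinyBloomFilter
      def initialize(bits = 1 << 20, hashes = 3)
        @bits   = bits
        @hashes = hashes
        @field  = Array.new(bits, false)
      end

      def add(url)
        positions(url).each { |i| @field[i] = true }
      end

      def include?(url)
        positions(url).all? { |i| @field[i] }
      end

      private

      # Derive the bit positions from independent salted SHA-1 digests.
      def positions(url)
        (0...@hashes).map do |salt|
          Digest::SHA1.hexdigest("#{salt}:#{url}").to_i(16) % @bits
        end
      end
    end

    tracker = TinyBloomFilter.new
    tracker.add("http://rubygems.org/")
    tracker.include?("http://rubygems.org/") # => true
    tracker.include?("http://example.com/")  # => false (with high probability)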

Supported Ruby Interpreters

  • MRI 1.9.x >= 1.9.1
  • MRI 2.0.0
  • MRI 2.1.2
  • JRuby 1.9 mode
  • Rubinius

Survival code example

require "polipus"

Polipus.crawler("rubygems","http://rubygems.org/") do |crawler|
  # In-place page processing
  crawler.on_page_downloaded do |page|
    # A nokogiri object
    puts "Page title: '#{page.doc.css('title').text}' Page url: #{page.url}"
  end
end

Installation

$ gem install polipus

Testing

$ bundle install
$ rake

Contributing to polipus

  • Check out the latest master to make sure the feature hasn't been implemented or the bug hasn't been fixed yet.
  • Check out the issue tracker to make sure someone hasn't already requested it and/or contributed it.
  • Fork the project.
  • Start a feature/bugfix branch.
  • Commit and push until you are happy with your contribution.
  • Make sure to add tests for it. This is important so I don't break it in a future version unintentionally.
  • Install RuboCop and make sure it is happy.
  • Please try not to mess with the Rakefile, version, or history. If you want to have your own version, or it is otherwise necessary, that is fine, but please isolate the change to its own commit so I can cherry-pick around it.

Copyright

Copyright (c) 2013 Francesco Laurita. See LICENSE.txt for further details.

polipus's People

Contributors

hendricius, janpieper, lepek, nofxx, parallel588, pcboy, proffard, stefanofontanelli, taganaka, tmaier


polipus's Issues

invalid byte sequence in US-ASCII (ArgumentError)

/usr/lib/ruby/1.9.1/uri/common.rb:304:in `gsub': invalid byte sequence in US-ASCII (ArgumentError)
from /usr/lib/ruby/1.9.1/uri/common.rb:304:in `escape'
from ./vendor/bundle/ruby/1.9.1/gems/polipus-0.3.2/lib/polipus/page.rb:169:in `to_absolute'

Note:

export LANG=en_US.UTF-8
export LANGUAGE=en_US.UTF-8
export LC_ALL=en_US.UTF-8

doesn't help here

Support for other charsets than UTF-8

I have some issues crawling Japanese websites with SHIFT-JIS encoding.
Nokogiri is not doing any automatic charset conversion to UTF-8.

I fixed it by rewriting the Page#doc method and using kconv.

    require 'kconv'

    def doc
      return @doc if @doc

      noko_en_id = {
        Kconv::UTF8 => 'UTF-8',
        Kconv::EUC => 'EUC-JP',
        Kconv::SJIS => 'SHIFT-JIS',
        Kconv::ASCII => 'ASCII',
        Kconv::JIS => 'ISO-2022-JP'
      }[Kconv.guess(@body || '')]

      @doc = Nokogiri::HTML(@body, nil, noko_en_id) if @body && html? rescue nil                                                                            
    end

Gzip decoded body not used anywhere

In HTTP#fetch_pages you try to decode the gzipped content of a page.

https://github.com/taganaka/polipus/blob/master/lib/polipus/http.rb#L34-L39

          body = response.body.dup
          if response.to_hash.fetch('content-encoding', [])[0] == 'gzip'
            gzip = Zlib::GzipReader.new(StringIO.new(body))    
            body = gzip.read
          end
          pages << Page.new(location, :body          => response.body.dup,

but body is not used anywhere afterwards; :body should get its value.

In general, I'm not sure if this is necessary at all, since http://www.ruby-doc.org/stdlib-2.1.1/libdoc/net/http/rdoc/Net/HTTP.html#class-Net::HTTP-label-Compression states this is done by Net::HTTP automatically.
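
For reference, a self-contained sketch of what the reporter suggests (this is not Polipus's actual code; the point is simply that the gunzipped string, not response.body, should be handed to Page.new):

    require 'zlib'
    require 'stringio'

    # Return the decoded body of a Net::HTTP response: gunzip it when the
    # server reports gzip content-encoding, otherwise pass the raw body through.
    def decoded_body(response)
      body = response.body.dup
      if response.to_hash.fetch('content-encoding', [])[0] == 'gzip'
        body = Zlib::GzipReader.new(StringIO.new(body)).read
      end
      body
    end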

Change logging format

Polipus: [worker #2] Page [http://www.example.com/89237459832475] downloaded

In iTerm 2, I can CMD+click on the link, but it then tries to open http://www.example.com/89237459832475%5D.

Therefore, I suggest changing the log message to

Polipus: [worker #2] Page (http://www.example.com/89237459832475) downloaded

or

Polipus: [worker #2] Page <http://www.example.com/89237459832475> downloaded

I could provide a pull request if you want.

Cannot install on JRuby 1.7.13. Error with bson_ext-1.9.2

For some reason I'm not able to install the polipus gem on JRuby 1.7.13. I tried both Windows 8.0 and Ubuntu 12.04 and got the same error message.

Gem::Installer::ExtensionBuildError: ERROR: Failed to build gem native extension.

C:/torquebox-3.1.1/jruby/bin/jruby.exe extconf.rb 

NotImplementedError: C extension support is not enabled. Pass -Xcext.enabled=true to JRuby or set JRUBY_OPTS.

(root) at C:/torquebox-3.1.1/jruby/lib/ruby/shared/mkmf.rb:8
require at org/jruby/RubyKernel.java:1065
(root) at C:/torquebox-3.1.1/jruby/lib/ruby/shared/rubygems/core_ext/kernel_require.rb:1
(root) at extconf.rb:1

Gem files will remain installed in C:/torquebox-3.1.1/jruby/lib/ruby/gems/shared/gems/bson_ext-1.9.2 for inspection.
Results logged to C:/torquebox-3.1.1/jruby/lib/ruby/gems/shared/gems/bson_ext-1.9.2/ext/cbson/gem_make.out
An error occurred while installing bson_ext (1.9.2), and Bundler cannot
continue.
Make sure that gem install bson_ext -v '1.9.2' succeeds before bundling.

RegularExpression To Follow a Link

When I use a regex like the one below, it won't crawl.

crawler.follow_links_like(/show.php\?id=[A-Z]/) 

However, if I remove the id parameter, it works.

crawler.follow_links_like(/show.php/) 

Please let me know whether regexes matching dynamic parameters are supported. Many thanks!

Robots.txt option

Hello,

Is there an option for the robots.txt file, something like obey_robots: true/false?

Thanks.
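
Recent Polipus releases do expose an option for this. A hedged sketch, assuming the option is named obey_robots_txt (check the OPTS of your installed version):

    require "polipus"

    # obey_robots_txt is an assumed option name here, not confirmed by this issue.
    options = { obey_robots_txt: true }

    Polipus.crawler("rubygems", "http://rubygems.org/", options) do |crawler|
      crawler.on_page_downloaded do |page|
        puts "Visited: #{page.url}"
      end
    end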

Whitelist start urls?

If you use #follow_links_like and the given start URLs do not match the configured regexps, the crawler stops working. Is there a reason why the start URLs aren't whitelisted?

start_urls = [ "http://www.example.com/foo/bar" ]
Polipus.crawler("dummy", start_urls, options) do |crawler|
  crawler.follow_links_like(/\/bar\/foo/)
end

The links on the start page match the given regexp.

Support for headless crawling

Does it make sense to have support for headless crawling built into the framework? A lot of websites these days are single-page apps, and crawling them with the current framework won't work.

We could try to do this using PhantomJS or capybara-webkit. I've been able to do a headless crawl using capybara-webkit and Poltergeist before.

What are your thoughts on this?

Internet connection lost; Page still stored and processed

Today I lost my internet connection and got the following error message in my logs:

W, [2014-05-18T18:44:44.684461 #13198]  WARN -- Polipus: Page http://www.example.com/foobar has error: No route to host - connect(2) for "www.example.com" port 80

Polipus should not proceed with processing a page in this case.
It should requeue the page and move on to the next one.
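
A workaround sketch at the application level, using the on_page_error / add_to_queue hooks that appear in the SocketError issue further down (Polipus itself does not requeue automatically here):

    require "polipus"

    Polipus.crawler("example", "http://www.example.com/") do |crawler|
      crawler.on_page_error do |page|
        page.storable = false      # don't persist the failed fetch
        crawler.add_to_queue(page) # put it back on the queue for a later retry
      end
    end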

Fails when response["Set-Cookie"] is nil

TypeError: no implicit conversion of nil into String
/usr/local/opt/rbenv/versions/2.1.1/lib/ruby/gems/2.1.0/gems/http-cookie-1.0.2/lib/http/cookie/scanner.rb:20:in `initialize'
/usr/local/opt/rbenv/versions/2.1.1/lib/ruby/gems/2.1.0/gems/http-cookie-1.0.2/lib/http/cookie/scanner.rb:20:in `initialize'
/usr/local/opt/rbenv/versions/2.1.1/lib/ruby/gems/2.1.0/gems/http-cookie-1.0.2/lib/http/cookie.rb:281:in `new'
/usr/local/opt/rbenv/versions/2.1.1/lib/ruby/gems/2.1.0/gems/http-cookie-1.0.2/lib/http/cookie.rb:281:in `block in parse'
/usr/local/opt/rbenv/versions/2.1.1/lib/ruby/gems/2.1.0/gems/http-cookie-1.0.2/lib/http/cookie.rb:280:in `tap'
/usr/local/opt/rbenv/versions/2.1.1/lib/ruby/gems/2.1.0/gems/http-cookie-1.0.2/lib/http/cookie.rb:280:in `parse'
/usr/local/opt/rbenv/versions/2.1.1/lib/ruby/gems/2.1.0/gems/http-cookie-1.0.2/lib/http/cookie_jar.rb:191:in `parse'
/usr/local/opt/rbenv/versions/2.1.1/lib/ruby/gems/2.1.0/gems/polipus-0.3.0/lib/polipus/http.rb:180:in `get_response'
/usr/local/opt/rbenv/versions/2.1.1/lib/ruby/gems/2.1.0/gems/polipus-0.3.0/lib/polipus/http.rb:149:in `get'
/usr/local/opt/rbenv/versions/2.1.1/lib/ruby/gems/2.1.0/gems/polipus-0.3.0/lib/polipus/http.rb:47:in `fetch_pages'
/usr/local/opt/rbenv/versions/2.1.1/lib/ruby/gems/2.1.0/gems/polipus-0.3.0/lib/polipus.rb:183:in `block (3 levels) in takeover'
/usr/local/opt/rbenv/versions/2.1.1/lib/ruby/gems/2.1.0/gems/redis-queue-0.0.4/lib/redis/queue.rb:56:in `block in process'
/usr/local/opt/rbenv/versions/2.1.1/lib/ruby/gems/2.1.0/gems/redis-queue-0.0.4/lib/redis/queue.rb:54:in `loop'
/usr/local/opt/rbenv/versions/2.1.1/lib/ruby/gems/2.1.0/gems/redis-queue-0.0.4/lib/redis/queue.rb:54:in `process'
/usr/local/opt/rbenv/versions/2.1.1/lib/ruby/gems/2.1.0/gems/polipus-0.3.0/lib/polipus.rb:158:in `block (2 levels) in takeover'
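
A minimal sketch of a guard at the call site, assuming a response object and cookie jar like the ones used in Polipus::HTTP#get_response (the method and variable names below are illustrative, not the actual fix):

    require 'http/cookie_jar'

    # http-cookie's scanner raises TypeError on nil, so only parse the header
    # when it is actually present.
    def store_cookies(cookie_jar, response, request_uri)
      set_cookie = response['Set-Cookie']
      cookie_jar.parse(set_cookie, request_uri) unless set_cookie.nil?
    end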

Unicode pages do not work anymore on 0.5.0

I was able to crawl Unicode pages with 0.4.0, but after upgrading to 0.5.0 only some English characters end up in a crawled page. Please let me know if there are any settings I have to change.

URL patching

Hello, I have the pattern "%E2%80%93" in some URL strings and need to replace it with "%96" before a Page is saved. Some websites use strange characters in their URLs, and I discovered that some of those characters must be replaced, otherwise the URL cannot be visited. I believe these URLs are stored as links on a page.

Please let me know if there is a way to rewrite URLs based on a regex pattern before a page is stored.
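
A hypothetical helper for the rewrite described above (where exactly it would be hooked into Polipus, before enqueueing or before storage, is not shown here):

    # Replace the encoded en dash with the byte sequence the reporter needs.
    def patch_url(url)
      url.gsub('%E2%80%93', '%96')
    end

    patch_url('http://example.com/a%E2%80%93b') # => "http://example.com/a%96b"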

Make it work with Mongo 2.x

I added a new mongo2_store, a mongo2_queue and a few other methods to make it work with the new MongoDB 2.x driver.

I wonder whether you are interested in a pull request for this, supporting both the mongo 1.x and mongo 2.x Ruby drivers, or whether your plan is to replace the current mongo implementation and update it in favor of the new driver.

SocketError could mean, domain is gone or no internet connection

When you try to resolve a domain which does not exist, Polipus creates an error page with SocketError.

In that case the page really does not exist anymore, so it's like a 404 error, just at the DNS level.

But at the same time, a SocketError is also raised if the internet connection is lost for any reason.

So to be sure the site is gone, we would need a method like this:

    # If an external host is still reachable, our connection is fine and the
    # crawled site really is gone; if the HEAD request itself fails, our own
    # internet connection is down and the page should not be treated as gone.
    def internet_connection_available?
      Excon.head('http://www.google.com')
      logger.debug { 'Webpage not available anymore' }
      true
    rescue Excon::Errors::SocketError
      logger.error { 'Internet connection lost' }
      false
    end

Or maybe even better, something like this: http://stackoverflow.com/questions/2385186/check-if-internet-connection-exists-with-ruby/22837368#22837368

I use it like this:

        crawler.on_page_error do |page|
          page.storable = false
          webpage_gone = page.error.is_a?(SocketError) && internet_connection_available?
          crawler.add_to_queue(page) unless page.not_found? || webpage_gone
        end

Shall we add something for this case directly to Polipus?

How to setup in a cluster environment?

This is really an awesome project; however, I couldn't figure out how to set it up in a cluster environment. Can you provide a simple cluster (separate machines) setup?
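
A hedged sketch of a multi-machine setup: every node runs the same script and points at one shared Redis (and, optionally, one shared page store). The :workers and :redis_options keys are taken from common Polipus examples and should be checked against the OPTS of your installed version:

    require "polipus"
    require "socket"

    options = {
      workers: 10,                                          # crawler threads per node (assumed option name)
      redis_options: { host: "redis.internal", port: 6379 } # shared Redis for queue + URL tracker (assumed option name)
    }

    Polipus.crawler("rubygems", "http://rubygems.org/", options) do |crawler|
      crawler.on_page_downloaded do |page|
        puts "Fetched #{page.url} on #{Socket.gethostname}"
      end
    end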

Cannot use with mongoid ~> 4.0.0

https://github.com/taganaka/polipus

Bundler could not find compatible versions for gem "bson":

In snapshot (Gemfile.lock):
  bson (1.9.2)

In Gemfile:
  mongoid (> 4.0.0) ruby depends on
    moped (> 2.0.0) ruby depends on
      bson (~> 2.2) ruby

Running bundle update will rebuild your snapshot from scratch, using only
the gems in your Gemfile, which may resolve the conflict.

[Question] Using MongoDB as 'cache' aside of Redis

Hi. I have a question related to the excellent presentation from RubyDay 2013. I'd like to crawl really big sites (up to ~100M pages) and I'm thinking about the best approach. I'm giving Redis as much memory as possible, setting maxmemory and maxmemory-policy to noeviction, and I've also enabled the MongoDB backend.

Does it work the way I think it should, i.e. when the used Redis memory gets close to the maxmemory value, part of the data is moved over to Mongo? If not, how can it work on a huge number of pages? Thanks!

OK, I found the 'queue_overflow_adapter' example. That should be it!

Incremental Crawling

Hi, I have a setup like the one below and it works fine the first time: all pages are crawled as intended. However, when I run it a second time expecting to pick up new updates, it hangs with a message saying the page is "already stored". When I set the cleaner option to true it wipes out the entire database and starts from scratch, which is exactly what I want to avoid. Obviously the page is already stored, but shouldn't the crawler still look for updates?

Polipus::Plugin.register Polipus::Plugin::Cleaner, reset:false
starting_urls = ["http://www.abc.com/home/"]

[worker #0] Page [http://www.abc.com/home/] already stored.

Thread seems to hang in HTTP Call

Hi!

It seems one of our threads is stuck in an HTTP call. I think the function is:

https://github.com/taganaka/polipus/blob/master/lib/polipus/http.rb#L170

It looks like the connection is never closed. Any idea what this could be?

Thanks!

Here is a full stacktrace:

/home/deployer/.rbenv/versions/2.0.0-p353/lib/ruby/2.0.0/net/protocol.rb:155:in `rescue in rbuf_fill'
/home/deployer/.rbenv/versions/2.0.0-p353/lib/ruby/2.0.0/net/protocol.rb:152:in `rbuf_fill'
/home/deployer/.rbenv/versions/2.0.0-p353/lib/ruby/2.0.0/net/protocol.rb:134:in `readuntil'
/home/deployer/.rbenv/versions/2.0.0-p353/lib/ruby/2.0.0/net/protocol.rb:144:in `readline'
/home/deployer/.rbenv/versions/2.0.0-p353/lib/ruby/2.0.0/net/http/response.rb:39:in `read_status_line'
/home/deployer/.rbenv/versions/2.0.0-p353/lib/ruby/2.0.0/net/http/response.rb:28:in `read_new'
/home/deployer/.rbenv/versions/2.0.0-p353/lib/ruby/2.0.0/net/http.rb:1406:in `block in transport_request'
/home/deployer/.rbenv/versions/2.0.0-p353/lib/ruby/2.0.0/net/http.rb:1403:in `catch'
/home/deployer/.rbenv/versions/2.0.0-p353/lib/ruby/2.0.0/net/http.rb:1403:in `transport_request'
/home/deployer/.rbenv/versions/2.0.0-p353/lib/ruby/2.0.0/net/http.rb:1376:in `request'
/home/deployer/.rbenv/versions/2.0.0-p353/lib/ruby/gems/2.0.0/gems/rest-client-1.6.7/lib/restclient/net_http_ext.rb:51:in `request'
/home/deployer/.rbenv/versions/2.0.0-p353/lib/ruby/gems/2.0.0/bundler/gems/polipus-b4a3ce1be226/lib/polipus/http.rb:149:in `get_response'
/home/deployer/.rbenv/versions/2.0.0-p353/lib/ruby/gems/2.0.0/bundler/gems/polipus-b4a3ce1be226/lib/polipus/http.rb:123:in `get'
/home/deployer/.rbenv/versions/2.0.0-p353/lib/ruby/gems/2.0.0/bundler/gems/polipus-b4a3ce1be226/lib/polipus/http.rb:32:in `fetch_pages'
/home/deployer/.rbenv/versions/2.0.0-p353/lib/ruby/gems/2.0.0/bundler/gems/polipus-b4a3ce1be226/lib/polipus.rb:179:in `block (3 levels) in takeover'
/home/deployer/.rbenv/versions/2.0.0-p353/lib/ruby/gems/2.0.0/gems/redis-queue-0.0.3/lib/redis/queue.rb:56:in `block in process'
/home/deployer/.rbenv/versions/2.0.0-p353/lib/ruby/gems/2.0.0/gems/redis-queue-0.0.3/lib/redis/queue.rb:54:in `loop'
/home/deployer/.rbenv/versions/2.0.0-p353/lib/ruby/gems/2.0.0/gems/redis-queue-0.0.3/lib/redis/queue.rb:54:in `process'
/home/deployer/.rbenv/versions/2.0.0-p353/lib/ruby/gems/2.0.0/bundler/gems/polipus-b4a3ce1be226/lib/polipus.rb:154:in `block (2 levels) in takeover'

#queue_overflow_adapter and #overflow_adapter; Same thing?

There's an option :queue_overflow_adapter in Polipus::PolipusCrawler::OPTS, and there is also a method #overflow_adapter.

The latter is defined twice.

overflow_adapter is not documented and does not seem to be used anywhere. I think it is redundant.
