calculatedcontent / cloud-crawler

Distributed Ruby web crawler, backed by Redis

Ruby 100.00%

cloud-crawler's Issues

bloomfilter should be replaced with bloomd

speaks for itself:
replace the Redis bloom filter with bloomd
needs a Chef recipe

save_batch not efficient for local long job

batch job does not optimize save_batch on a long job
if the job is not queued up

dsl instance_eval call should be tested and optimized

instance_eval is not a great way to evaluate a DSL over and over
instance_eval should/could be replaced with a singleton method that optimizes performance
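
A minimal sketch of the idea, not cloud-crawler's actual code: `PageContext`, `record`, and `handle_page` are hypothetical names, and the DSL is assumed to arrive as a block. The first loop re-evaluates the block against the object on every call; the second binds it once as a singleton method and then calls it like any other method.

```ruby
# Hypothetical stand-in for the object a DSL block runs against.
class PageContext
  attr_reader :seen

  def initialize
    @seen = []
  end

  def record(url)
    @seen << url
  end
end

# A DSL block of the kind a user might hand to the crawler (assumed shape).
handler = proc { |url| record(url) }
urls = %w[http://a.example/ http://b.example/]

# Approach 1: evaluate the block in the context of the object on every call.
ctx_eval = PageContext.new
urls.each { |url| ctx_eval.instance_exec(url, &handler) }

# Approach 2: bind the block once as a singleton method, then call it directly.
ctx_bound = PageContext.new
ctx_bound.define_singleton_method(:handle_page, &handler)
urls.each { |url| ctx_bound.handle_page(url) }
```

Both contexts end up with the same recorded URLs. Whether the second form is actually faster for this project would need to be benchmarked, which is what the issue asks for.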

Make CC AWS region independent

bin/restart_workers.rb has us-west-1 hardcoded (other places might have it too)
the region should be configurable instead
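
One possible shape for this, as a sketch only: read the region from the environment and fall back to the current value. The helper name `aws_region` and the choice of the `AWS_REGION` variable are assumptions, not something the project defines.

```ruby
# Current hardcoded value, kept only as a fallback.
DEFAULT_REGION = "us-west-1"

# Hypothetical helper: returns the configured region, or the default
# when the variable is unset or blank. `env` is injectable for testing.
def aws_region(env = ENV)
  region = env["AWS_REGION"].to_s.strip
  region.empty? ? DEFAULT_REGION : region
end
```

Scripts such as bin/restart_workers.rb could then call `aws_region` wherever the literal string appears today.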

qid / jid not created in batch job

http connection pool expects a qid and/or jid
these ids should be auto-created on a submit and/or during a crawl
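
A sketch of auto-creating the ids, assuming job options are passed around as a hash with `:qid` and `:jid` keys (an assumption; the real option names may differ). `ensure_ids` is a hypothetical helper.

```ruby
require "securerandom"

# Hypothetical helper: fills in :qid and :jid when the caller did not
# supply them, and leaves caller-provided ids untouched.
def ensure_ids(opts = {})
  ids = opts.dup
  ids[:qid] ||= SecureRandom.uuid
  ids[:jid] ||= SecureRandom.uuid
  ids
end
```

Calling this once at submit time and again at the start of a crawl would be harmless, since existing ids are preserved.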

Driver should provide convenience loader for long curl jobs

Driver should provide a convenience loader for long curl jobs

long running batch jobs might be terminated

a long-running batch job could be terminated if it runs too long
need to warn the user

batch crawl job uses qid?

why is batch crawl job using a qid to pool the http connections?

normalize urls before putting into bloomfilter

pierre's idea
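
A sketch of what normalization could look like using only Ruby's standard `uri` library. The rules chosen here (lowercase scheme and host, drop the fragment, use "/" for an empty path) are assumptions about what "normalize" should mean for this crawler, and `normalize_url` is a hypothetical name.

```ruby
require "uri"

# Hypothetical helper: maps equivalent spellings of a URL to one string
# so the bloom filter sees each page only once.
def normalize_url(raw)
  uri = URI.parse(raw.strip)
  uri.scheme = uri.scheme.downcase if uri.scheme
  uri.host = uri.host.downcase if uri.host
  uri.fragment = nil                                # fragments never reach the server
  uri.path = "/" if uri.path.nil? || uri.path.empty?
  uri.to_s                                          # to_s omits the scheme's default port
end
```

The normalized string, rather than the raw link, would then be the key inserted into and checked against the bloom filter.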

Could not run the test crawl

Started redis-server.
bundle exec ./test/test_crawl.rb -u http://calculatedcontent.com gives the error below.
/cloud-crawler/cloud-crawler/vendor/bundle/ruby/2.1.0/bundler/gems/sourcify-5767bd2a0c09/lib/sourcify/proc/parser/scanner.rb:19:in `process': Sourcify::NoMatchingProcError (Sourcify::NoMatchingProcError)
from cloud-crawler/cloud-crawler/vendor/bundle/ruby/2.1.0/bundler/gems/sourcify-5767bd2a0c09/lib/sourcify/proc/parser.rb:40:in `extracted_source'
from /cloud-crawler/cloud-crawler/vendor/bundle/ruby/2.1.0/bundler/gems/sourcify-5767bd2a0c09/lib/sourcify/proc/parser.rb:22:in `sexp'
from /cloud-crawler/cloud-crawler/vendor/bundle/ruby/2.1.0/bundler/gems/sourcify-5767bd2a0c09/lib/sourcify/proc/parser.rb:17:in `source'
from /cloud-crawler/cloud-crawler/vendor/bundle/ruby/2.1.0/bundler/gems/sourcify-5767bd2a0c09/lib/sourcify/proc/methods/to_source.rb:39:in `to_source'
from /cloud-crawler/cloud-crawler/lib/cloud-crawler/driver.rb:234:in `crawl'
from /cloud-crawler/cloud-crawler/lib/cloud-crawler/driver.rb:49:in `standalone_crawl'
from ./test/test_crawl.rb:27:in '

I am using Ruby version 2.1.1.

http connection pool requires qid

http connection pool was hacked into batch_crawl_job to test
should be implemented in batch job properly

on migrating to ruby 2.1

rvm use 2.1
bundle install
bundle exec rake test

127 examples, 9 failures

these failures need to be fixed before moving forward with ruby 2.1

save page store not in batch job

batch job should save page store, not batch_curl and batch_crawl

CC script for just getting all urls, nothing else

add a convenience method to Driver, or a standard example script,
to allow crawling all urls in a site and saving them
this is actually a specific kind of DSL job for a long crawl
and should be optimized for just this

Could not find active_support-3.0.0 in any of the sources while running bundle install

While running bundle install I get the following error.

Fetching gem metadata from http://rubygems.org/........
Fetching additional metadata from http://rubygems.org/..
Resolving dependencies...
Could not find active_support-3.0.0 in any of the sources

I have Ruby 2.1.1p76.
