calculatedcontent / cloud-crawler
Distributed Ruby Web Crawler, backed up by Redis
speaks for itself
replace redis bloom filter with bloomd
need chef recipe
batch job does not optimize save_batch on a long job
if the job is not queued up
instance_eval is not a great way to evaluate a DSL over and over
instance_eval should/could be replaced with a singleton method that optimizes performance
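A minimal sketch of the idea, not the cloud-crawler implementation: `PageContext`, `record`, and `on_page` are hypothetical names invented for illustration. It contrasts re-evaluating a DSL block against a context on every call with binding the block once as a singleton method and then calling it like any other method.

```ruby
# Hypothetical stand-in for the object a DSL block is evaluated against.
class PageContext
  attr_reader :seen

  def initialize
    @seen = []
  end

  # A DSL "verb" the block is expected to call.
  def record(url)
    @seen << url
  end
end

# A user-supplied DSL block (hypothetical).
handler = proc { |url| record(url) }
urls = %w[http://example.com/a http://example.com/b]

# Approach 1: re-evaluate the block in the context for every page.
ctx_eval = PageContext.new
urls.each { |u| ctx_eval.instance_exec(u, &handler) }

# Approach 2: bind the block once as a singleton method, then call it.
ctx_bound = PageContext.new
ctx_bound.define_singleton_method(:on_page, &handler)
urls.each { |u| ctx_bound.on_page(u) }
```

Both contexts end up with the same recorded URLs; whether the second form is measurably faster for this project would need to be benchmarked.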
bin/restart_workers.rb has us-west-1 hardcoded (other places might have it too)
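One possible way to avoid the hardcoded region, sketched under assumptions: the helper name `crawler_region` is hypothetical, and reading the `AWS_REGION` environment variable is an assumed convention rather than something the project currently does. The existing value is kept as the fallback.

```ruby
# Hypothetical helper: look up the region in the environment and
# fall back to the value that is currently hardcoded.
DEFAULT_REGION = 'us-west-1'

def crawler_region(env = ENV)
  env.fetch('AWS_REGION', DEFAULT_REGION)
end
```

Scripts such as bin/restart_workers.rb could then call the helper instead of embedding the region string.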
http connection pool expects a qid and/or jid
these ids should be auto-created on a submit and/or during a crawl
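A rough sketch of auto-creating the ids, assuming options are passed as a hash with `:qid` and `:jid` keys. The helper name `with_default_ids` and the option layout are hypothetical, not the project's actual API.

```ruby
require 'securerandom'

# Hypothetical helper: fill in :qid and :jid when the caller did not
# supply them, leaving any caller-provided value untouched.
def with_default_ids(opts = {})
  opts = opts.dup
  opts[:qid] ||= SecureRandom.uuid
  opts[:jid] ||= SecureRandom.uuid
  opts
end
```

A submit or crawl entry point could pass its options through such a helper before they reach the connection pool.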
Driver should provide a convenience loader for long curl jobs
a long running batch job could be terminated if it is too long
need to warn user
why is batch crawl job using a qid to pool the http connections?
pierre's idea
Started redis-server
Running bundle exec ./test/test_crawl.rb -u http://calculatedcontent.com gives the following error:
/cloud-crawler/cloud-crawler/vendor/bundle/ruby/2.1.0/bundler/gems/sourcify-5767bd2a0c09/lib/sourcify/proc/parser/scanner.rb:19:in `process': Sourcify::NoMatchingProcError (Sourcify::NoMatchingProcError)
from cloud-crawler/cloud-crawler/vendor/bundle/ruby/2.1.0/bundler/gems/sourcify-5767bd2a0c09/lib/sourcify/proc/parser.rb:40:in `extracted_source'
from /cloud-crawler/cloud-crawler/vendor/bundle/ruby/2.1.0/bundler/gems/sourcify-5767bd2a0c09/lib/sourcify/proc/parser.rb:22:in `sexp'
from /cloud-crawler/cloud-crawler/vendor/bundle/ruby/2.1.0/bundler/gems/sourcify-5767bd2a0c09/lib/sourcify/proc/parser.rb:17:in `source'
from /cloud-crawler/cloud-crawler/vendor/bundle/ruby/2.1.0/bundler/gems/sourcify-5767bd2a0c09/lib/sourcify/proc/methods/to_source.rb:39:in `to_source'
from /cloud-crawler/cloud-crawler/lib/cloud-crawler/driver.rb:234:in `crawl'
from /cloud-crawler/cloud-crawler/lib/cloud-crawler/driver.rb:49:in `standalone_crawl'
from ./test/test_crawl.rb:27:in
I am using ruby version 2.1.1.
http connection pool was hacked into batch_crawl_job to test
should be implemented in batch job properly
rvm use 2.1
bundle install
bundle exec rake test
127 examples, 9 failures
these errors need to be flushed out before moving forward with ruby 2.1
batch job should save page store, not batch_curl and batch_crawl
add a convenience method to Driver or a standard example script
to allow crawling all urls in a site and saving them
this is actually a specific kind of DSL/job for a long crawl
should be optimized for just this
While running bundle install I get the following error:
Fetching gem metadata from http://rubygems.org/........
Fetching additional metadata from http://rubygems.org/..
Resolving dependencies...
Could not find active_support-3.0.0 in any of the sources
I have ruby 2.1.1p76