istresearch / scrapy-cluster

This Scrapy project uses Redis and Kafka to create a distributed on demand scraping cluster.

Home Page: http://scrapy-cluster.readthedocs.io/

License: MIT License

Languages: Python 90.45%, HTML 8.50%, Shell 1.05%
Topics: python, scrapy, kafka, redis, scraping, distributed

scrapy-cluster's Introduction

Scrapy Cluster

This Scrapy project uses Redis and Kafka to create a distributed on demand scraping cluster.

The goal is to distribute seed URLs among many waiting spider instances, whose requests are coordinated via Redis. Any other crawls those trigger, as a result of frontier expansion or depth traversal, will also be distributed among all workers in the cluster.

The input to the system is a set of Kafka topics and the output is a set of Kafka topics. Raw HTML and assets are crawled interactively, spidered, and output to the log. For easy local development, you can also disable the Kafka portions and work with the spider entirely via Redis, although this is not recommended due to the serialization of the crawl requests.

Dependencies

Please see the requirements.txt within each sub project for Pip package dependencies.

Other important components required to run the cluster include Redis, Zookeeper, and Kafka.

Core Concepts

This project tries to bring together a bunch of new concepts to Scrapy and large scale distributed crawling in general. Some bullet points include:

  • The spiders are dynamic and on demand, meaning that they allow the arbitrary collection of any web page that is submitted to the scraping cluster
  • Scale Scrapy instances across a single machine or multiple machines
  • Coordinate and prioritize their scraping effort for desired sites
  • Persist data across scraping jobs
  • Execute multiple scraping jobs concurrently
  • Allows in-depth access to information about your scraping job, what is upcoming, and how the sites are ranked
  • Allows you to arbitrarily add/remove/scale your scrapers from the pool without loss of data or downtime
  • Utilizes Apache Kafka as a data bus for any application to interact with the scraping cluster (submit jobs, get info, stop jobs, view results)
  • Allows for coordinated throttling of crawls from independent spiders on separate machines, but behind the same IP Address
  • Enables completely different spiders to yield crawl requests to each other, giving flexibility to how the crawl job is tackled

Scrapy Cluster test environment

To set up a pre-canned Scrapy Cluster test environment, make sure you have Docker and Docker Compose installed.

Steps to launch the test environment:

  1. Build your containers (or omit --build to pull from Docker Hub):
docker-compose up -d --build
  2. Tail Kafka to view your future results:
docker-compose exec kafka_monitor python kafkadump.py dump -t demo.crawled_firehose -ll INFO
  3. From another terminal, feed a request to Kafka:
curl localhost:5343/feed -H "content-type:application/json" -d '{"url": "http://dmoztools.net", "appid":"testapp", "crawlid":"abc123"}'
  4. Validate you've got data!
# wait a couple of seconds; your terminal from step 2 should dump JSON data
{u'body': '...content...', u'crawlid': u'abc123', u'links': [], u'encoding': u'utf-8', u'url': u'http://dmoztools.net', u'status_code': 200, u'status_msg': u'OK', u'response_url': u'http://dmoztools.net', u'request_headers': {u'Accept-Language': [u'en'], u'Accept-Encoding': [u'gzip,deflate'], u'Accept': [u'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'], u'User-Agent': [u'Scrapy/1.5.0 (+https://scrapy.org)']}, u'response_headers': {u'X-Amz-Cf-Pop': [u'IAD79-C3'], u'Via': [u'1.1 82c27f654a5635aeb67d519456516244.cloudfront.net (CloudFront)'], u'X-Cache': [u'RefreshHit from cloudfront'], u'Vary': [u'Accept-Encoding'], u'Server': [u'AmazonS3'], u'Last-Modified': [u'Mon, 20 Mar 2017 16:43:41 GMT'], u'Etag': [u'"cf6b76618b6f31cdec61181251aa39b7"'], u'X-Amz-Cf-Id': [u'y7MqDCLdBRu0UANgt4KOc6m3pKaCqsZP3U3ZgIuxMAJxoml2HTPs_Q=='], u'Date': [u'Tue, 22 Dec 2020 21:37:05 GMT'], u'Content-Type': [u'text/html']}, u'timestamp': u'2020-12-22T21:37:04.736926', u'attrs': None, u'appid': u'testapp'}

Documentation

Please check out the official Scrapy Cluster documentation for more information on how everything works!

Branches

The master branch of this repository contains the latest stable release code for Scrapy Cluster 1.2.

The dev branch contains bleeding edge code and is currently working towards Scrapy Cluster 1.3. Please note that not everything may be documented, finished, tested, or finalized but we are happy to help guide those who are interested.

scrapy-cluster's Issues

Vagrant test environment missing cryptography method

Hi. I am in the process of spinning up the test environment and have run into a few issues, some of which I was able to work around, but one of which I have not been able to.

First, the successful workarounds:

  • I had to manually change the Kafka and Zookeeper download URLs in order to spin up the VM; the URLs that were included were inaccessible.
  • I had to manually install certain packages before being able to successfully install the project requirements. (sudo apt-get install build-essential libssl-dev libffi-dev python-dev)

Unfortunately, now that I'm running the automated tests I keep getting an error about a certain crypto method not being present. Everything I'm finding online says the problem is due to a missing dependency on libssl-dev, but I already have that package installed, so I'm at a loss.

Here is the output I'm seeing at the end of running ./run_offline_tests.sh:

Traceback (most recent call last):
  File "tests/tests_offline.py", line 14, in <module>
    from crawling.redis_dupefilter import RFPDupeFilter
  File "/vagrant/crawler/crawling/redis_dupefilter.py", line 1, in <module>
    from scrapy.dupefilters import BaseDupeFilter
  File "/home/vagrant/.conda/envs/sc/lib/python2.7/site-packages/scrapy/__init__.py", line 48, in <module>
    from scrapy.spiders import Spider
  File "/home/vagrant/.conda/envs/sc/lib/python2.7/site-packages/scrapy/spiders/__init__.py", line 10, in <module>
    from scrapy.http import Request
  File "/home/vagrant/.conda/envs/sc/lib/python2.7/site-packages/scrapy/http/__init__.py", line 15, in <module>
    from scrapy.http.response.html import HtmlResponse
  File "/home/vagrant/.conda/envs/sc/lib/python2.7/site-packages/scrapy/http/response/html.py", line 8, in <module>
    from scrapy.http.response.text import TextResponse
  File "/home/vagrant/.conda/envs/sc/lib/python2.7/site-packages/scrapy/http/response/text.py", line 13, in <module>
    from scrapy.utils.response import get_base_url
  File "/home/vagrant/.conda/envs/sc/lib/python2.7/site-packages/scrapy/utils/response.py", line 12, in <module>
    from twisted.web import http
  File "/home/vagrant/.conda/envs/sc/lib/python2.7/site-packages/twisted/web/http.py", line 92, in <module>
    from twisted.internet import interfaces, reactor, protocol, address
  File "/home/vagrant/.conda/envs/sc/lib/python2.7/site-packages/twisted/internet/reactor.py", line 38, in <module>
    from twisted.internet import default
  File "/home/vagrant/.conda/envs/sc/lib/python2.7/site-packages/twisted/internet/default.py", line 56, in <module>
    install = _getInstallFunction(platform)
  File "/home/vagrant/.conda/envs/sc/lib/python2.7/site-packages/twisted/internet/default.py", line 44, in _getInstallFunction
    from twisted.internet.epollreactor import install
  File "/home/vagrant/.conda/envs/sc/lib/python2.7/site-packages/twisted/internet/epollreactor.py", line 24, in <module>
    from twisted.internet import posixbase
  File "/home/vagrant/.conda/envs/sc/lib/python2.7/site-packages/twisted/internet/posixbase.py", line 18, in <module>
    from twisted.internet import error, udp, tcp
  File "/home/vagrant/.conda/envs/sc/lib/python2.7/site-packages/twisted/internet/tcp.py", line 29, in <module>
    from twisted.internet._newtls import (
  File "/home/vagrant/.conda/envs/sc/lib/python2.7/site-packages/twisted/internet/_newtls.py", line 21, in <module>
    from twisted.protocols.tls import TLSMemoryBIOFactory, TLSMemoryBIOProtocol
  File "/home/vagrant/.conda/envs/sc/lib/python2.7/site-packages/twisted/protocols/tls.py", line 41, in <module>
    from OpenSSL.SSL import Error, ZeroReturnError, WantReadError
  File "/home/vagrant/.conda/envs/sc/lib/python2.7/site-packages/OpenSSL/__init__.py", line 8, in <module>
    from OpenSSL import rand, crypto, SSL
  File "/home/vagrant/.conda/envs/sc/lib/python2.7/site-packages/OpenSSL/rand.py", line 11, in <module>
    from OpenSSL._util import (
  File "/home/vagrant/.conda/envs/sc/lib/python2.7/site-packages/OpenSSL/_util.py", line 7, in <module>
    binding = Binding()
  File "/home/vagrant/.conda/envs/sc/lib/python2.7/site-packages/cryptography/hazmat/bindings/openssl/binding.py", line 114, in __init__
    self._ensure_ffi_initialized()
  File "/home/vagrant/.conda/envs/sc/lib/python2.7/site-packages/cryptography/hazmat/bindings/openssl/binding.py", line 126, in _ensure_ffi_initialized
    cls._modules,
  File "/home/vagrant/.conda/envs/sc/lib/python2.7/site-packages/cryptography/hazmat/bindings/utils.py", line 31, in load_library_for_binding
    lib = ffi.verifier.load_library()
  File "/home/vagrant/.conda/envs/sc/lib/python2.7/site-packages/cffi/verifier.py", line 101, in load_library
    return self._load_library()
  File "/home/vagrant/.conda/envs/sc/lib/python2.7/site-packages/cffi/verifier.py", line 211, in _load_library
    return self._vengine.load_library()
  File "/home/vagrant/.conda/envs/sc/lib/python2.7/site-packages/cffi/vengine_cpy.py", line 155, in load_library
    raise ffiplatform.VerificationError(error)
cffi.ffiplatform.VerificationError: importing '/home/vagrant/.conda/envs/sc/lib/python2.7/site-packages/cryptography/_Cryptography_cffi_a269d620xd5c405b7.so': /home/vagrant/.conda/envs/sc/lib/python2.7/site-packages/cryptography/_Cryptography_cffi_a269d620xd5c405b7.so: undefined symbol: EC_GFp_nistp224_method

Any suggestions would be much appreciated. Thanks!

Offload virtual machine deployment

The Vagrant VM scdev is finicky, uses a custom set of ansible scripts to do the deployment, and runs under a conda virtual environment.

Both real and virtual deployments of Scrapy Cluster should be supported by Ansible Symphony (https://github.com/istresearch/ansible-symphony) and use the standardized Ubuntu 14.04 box.

This will eliminate the ansible folder at the top level of the project.

Improved Debugging

It is a pain to write print statements throughout the cluster, only to remove them so they are not in the code release. Just like in Scrapy, if DEBUG = True we should dump out logger statements about what is going on in each of the three main components.

As a second thought, we could also have a flag for JSON_LOGS so the same human readable logs are turned into JSON structures for piping elsewhere (like ElasticSearch).
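
A minimal sketch of how this could look, assuming hypothetical DEBUG and JSON_LOGS settings in each component's settings.py and reusing the existing scutils LogFactory (the parameters here are taken from the reproduction snippet in the LogFactory issue further down this page):

from scutils.log_factory import LogFactory

# hypothetical settings.py values, not yet part of the project
DEBUG = True
JSON_LOGS = False

logger = LogFactory.get_instance(
    name='kafka-monitor',
    dir='/tmp/',
    file='kafka_monitor.log',
    json=JSON_LOGS,                      # structured output for piping into ElasticSearch
    level='DEBUG' if DEBUG else 'INFO',  # verbose statements only while debugging
    stdout=True
)

logger.debug("processing incoming Kafka request")
logger.info("plugin loaded")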

_get_bin takes hours with queue size 1M.

I'm scraping etsy.com and the queue size has grown to more than 1M. When I query for info/statistics it gets stuck in the _get_bin function in the scrapy-cluster/redis-monitor/plugins/info_monitor.py file. The redis-monitor also uses about 500MB of memory at that moment.

  1. What is the best way to keep the queue size small?
  2. Perhaps _get_bin should be rewritten in a more efficient way to calculate statistics in the database (see the sketch below).
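
For question 2, a hedged illustration (not the actual _get_bin code) of how the per-queue numbers could be computed without pulling every member out of Redis, assuming redis-py and the spider:domain:queue sorted-set layout used by the cluster:

import redis

r = redis.Redis(host='localhost', port=6379)

def queue_summary(key):
    # count and score bounds straight from the sorted set,
    # instead of iterating a million members in Python
    total = r.zcard(key)
    if total == 0:
        return {'total': 0}
    _, low = r.zrange(key, 0, 0, withscores=True)[0]
    _, high = r.zrange(key, -1, -1, withscores=True)[0]
    return {'total': total, 'low_priority': low, 'high_priority': high}

print(queue_summary('link:etsy.com:queue'))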

Status

Noticed master hasn't had an update in quite some time. Is this project currently still under development?

Sample Kibana and Logstash configs

Once all of the JSON logging is finalized and we are towards the end of development for SC 1.1, create some sample logstash configurations and Kibana dashboards for visualizing the cluster. This would pair nicely with documentation that used Kibana logs and graphs for examples, verification, pictures, etc.

Concurrent requests from a spider to a domain

Hi madisonb,

Currently I am working with a single-spider cluster (1 machine).

Scrapy supports concurrent requests to a domain from a single spider. I have tried to emulate this in scrapy-cluster but with no luck. Even if I increase the queue hits, it just increases the total number of requests it can take from Redis within the window. But the requests are still processed one at a time, which ultimately becomes a bottleneck.

The only way I could find to parallelise requests is to spawn two identical spiders with different names and divide the seed URLs between them.

I may be wrong, but please tell me if there is any way to parallelise requests within a machine.

Add Examples Folder in Utils

We have a bunch of example scripts in the documentation for how to use the scutils package; add the formal code to a utils/examples folder and update the documentation.

Zookeeper dependency?

Hey guys!

We met at the Memex Summer workshop last time. I was looking through this code and didn't find any dependency on Zookeeper. Am I missing something? If not, it probably makes sense to remove it from the docs.

Modularize via Plugins/Middleware

Everything needs to move to a plugin based system, similar to how Scrapy operates with its middleware components. It should be called something distinct (like plugins) so as to not be confused with Scrapy Middlewares.

This will allow us to easily extend functionality of kafka-monitor and redis-monitor into an agnostic framework which translates and operates on the data based on the cluster's needs. Already testing this functionality out via kafka-monitor and loving it.

Queue epiphany here.

Integration Tests

Scrapy Cluster needs an integration test that you can run to see if you have everything set up properly. This would be very helpful for new users of the application, which has so many moving parts.

Currently the integration test is just 'getting it working on live data'.

Improve Unit Tests

At time of writing our code coverage is 57%. We should strive to get more coverage and at least get to 65 or 70%.

Also, unit tests do not exist for the following major areas:

  • Zookeeper Watcher
  • Argparse Helper
  • Redis Queue
  • Wandering Spider
  • Scrapy Pipelines

Statistics Collection without 3rd Party Tools

It would be nice to collect metrics and make some simple stats available without having to use yet another big data app (like ELK). Since we are already using Redis, let's use it to collect total and rolling metric counts of incoming Kafka requests, crawls, and monitor actions.

This also will need one new plugin each for the kafka monitor and redis monitor.

References so we don't forget:
Rolling time windows hit counters: https://opensourcehacker.com/2014/07/09/rolling-time-window-counters-with-redis-and-mitigating-botnet-driven-login-attacks/

  • Useful for keeping hit counts of say: 15 mins, 1 hr, 6 hr, 12 hr, 24 hr, 7 days (anything more is probably wasted space for the usefulness of the metric)

Unique item counts: http://redis.io/topics/data-types-intro#hyperloglogs

  • Fixed size data store to count the total number of unique hits seen by something. Useful for seeing 'total' cluster stats.

I think both of these would be useful and should be considered part of future development.
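
A rough sketch of both ideas using redis-py; the key names and windows are illustrative, and this is not the eventual stats collector API:

import time
import uuid

import redis

r = redis.Redis(host='localhost', port=6379)

def rolling_hit(key, window_seconds):
    """Record one hit and return the count inside the rolling window."""
    now = time.time()
    pipe = r.pipeline()
    pipe.zadd(key, {str(uuid.uuid4()): now})              # one member per hit, scored by time
    pipe.zremrangebyscore(key, 0, now - window_seconds)   # expire hits outside the window
    pipe.zcard(key)
    return pipe.execute()[-1]

print(rolling_hit('sc:stats:kafka-monitor:900', 900))     # 15 minute window

# fixed-size unique counter via HyperLogLog for 'total' cluster stats
r.pfadd('sc:stats:crawlids', 'abc123')
print(r.pfcount('sc:stats:crawlids'))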

Multiple Spiders in Single Process

Scrapy 1.0 allows us to run full crawler instances within a process thanks to its internal API.

We should be able to dynamically run X number of spiders defined in the settings.py file, per spider type. So if you have a link_spider.py and a example_spider.py, in your settings file you define how many of each type you want to run, and the overarching process spawns them all.

This is currently not an issue if you use a process manager like Supervisord to spin up multiple instances of scrapy runspider blah, but would be nice to simply wrap everything into a single running instance.
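
A sketch of what this could look like with Scrapy's CrawlerProcess; SPIDER_INSTANCES is a hypothetical setting that does not exist yet:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

settings = get_project_settings()
process = CrawlerProcess(settings)

# hypothetical setting, e.g. SPIDER_INSTANCES = {'link': 2, 'wandering': 1}
for spider_name, count in settings.getdict('SPIDER_INSTANCES').items():
    for _ in range(count):
        process.crawl(spider_name)  # spawn N crawlers of each spider type

process.start()  # run them all in one process; blocks until finished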

Zookeeper Domain blacklist

Once #39 is completed it would be nice to be able to add domains to a crawler blacklist through the API. The blacklist would propagate to all crawlers immediately thanks to Zookeeper.

The implementation of this would be to load the blacklist into a list in the scheduler and then ensure we do not pull from that particular queue when looking for a new crawl. We can spawn a Redis Monitor job to truncate the queue, and lastly if any new crawls of depth > 0 are active the blacklist would not allow the request to be added back to Redis.

This helps eliminate any extra Redis calls we would need to make.
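
A rough sketch of the scheduler-side check, assuming the blacklist is a set of domains kept in sync with Zookeeper (the function and argument names here are illustrative only):

def allowed_queue_keys(queue_keys, blacklist):
    """Yield queue keys the scheduler may pull from, skipping blacklisted domains.

    queue_keys look like '<spiderid>:<domain>:queue'; blacklist is the set of
    domains propagated from Zookeeper.
    """
    for key in queue_keys:
        domain = key.split(':')[1]
        if domain in blacklist:
            continue  # never pull work for a blacklisted domain
        yield key

for key in allowed_queue_keys(['link:dmoztools.net:queue', 'link:blocked.com:queue'],
                              {'blocked.com'}):
    print(key)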

Plugin for Zookeeper Crawler Control

It would be nice to have a plugin to dynamically control the crawler domain specific configuration controlled within Zookeeper. This should be done in the Redis monitor with a connection and ability to manipulate the ZK yaml file, instead of using the manual file pusher right now.

UI for displaying information about Cluster

We need a small stand-alone web UI that ties in with the REST components in #24 to visualize the data generated by the cluster. You should also be able to submit API requests to the cluster.

Preferably this web UI and the REST services live together and are deployed as a single running process.

Reduce potential Redis key collisions

Scrapy Cluster may not be the only process that is operating within a Redis Instance. We should add a unique identifier to the beginning of every key used so that doing *:*:queue does not collide with any other potential key being used in the cluster.

I propose using sc: as the identifier, so that every single thing scrapy cluster uses is easily distinguishable from other keys in use. The new query would be sc:*:*:queue.

Slow Scheduler Memory Build Up

The Distributed Scheduler keeps every domain queue it has ever seen in memory, so we can do extremely fast look up loops against all known domain keys. In turn, we only update the domain queues once every X seconds for new domains that we have never seen before. With every new domain we see, we create an object in memory representing the way to access the redis based queue.

The problem is that these objects inside of the scheduler are never deleted. If we see new domains every time we check, eventually we will get to a point where we run out of available memory on the host because we simply keep adding more and more objects to our lookup dictionary.

The solution to this problem can be solved in a number of ways:

  1. An expiring timer on every dictionary key, in memory, and if there has not been a successful pop() from that queue within X seconds it gets cleaned from memory.
  2. A LRU cache based setup where if the key count exceeds a certain threshold (like 10,000) we begin to delete keys least recently used every time we add new ones. This may also involve both a timestamp and a count of the number of times a key has been used.
  3. Delete all keys every X seconds, whether it is every day, hour, etc. This is very naive.

I am in favor of option 1. Implementation would then be:

  • replace the ThrottledQueue class object stored at queue_dict[final_key] with a tuple of (ThrottledQueue, timestamp)
  • when looping over the throttled queues, if pop() succeeds update the timestamp, else check if the timestamp diff is greater than our delete threshold.
  • if the diff is greater than the threshold, remove the key from the queue dictionary

If you set the threshold to something like 1 year (large), you would expect the queue to grow until all available memory is used. If you set the threshold to 10 minutes (small), you would expect the queue dictionary to only grow to the size of all known domains within the past 10 minutes.

I think this solves the slow memory growth we see when crawling millions of different domains over time.
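
A sketch of option 1, assuming queue_dict is changed to store (ThrottledQueue, last_access_timestamp) tuples and the delete threshold comes from a new setting; the names here are illustrative:

import time

def pop_and_expire(queue_dict, ttl_seconds, handle_request):
    """One pass over the known domain queues: pop what we can, refresh
    timestamps on success, and drop queues idle for longer than the TTL."""
    now = time.time()
    for key in list(queue_dict.keys()):
        queue, last_access = queue_dict[key]
        item = queue.pop()
        if item is not None:
            queue_dict[key] = (queue, now)   # successful pop refreshes the timestamp
            handle_request(item)
        elif now - last_access > ttl_seconds:
            del queue_dict[key]              # idle past the threshold, reclaim the memory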

Elastic Moderated Throttled Queue

The current RedisThrottledQueue, when used under moderation, causes a slight drift in the actual processing of X number of hits in Y time. This delta d is then added for each window, so that the successful number of X hits in Y time is really X hits in Y + d time. This delay is not normally an issue, but crops up with really high velocity moderated keys (i.e. 60 hits in 60 seconds ends up taking ~61 or ~62 seconds). This may be due to network latency or an improper implementation.

We should have an ability to 'catch back up' or fix the moderation implementation so that the numbers line up with exactly what is defined. This may involve adding more items to the throttle_time key, or setting a minimum moderation value so that the catch up does not cause huge spikes in domain hits.

tests_offline.py path dependency issue

When running tests_offline.py manually using python tests_offline.py, it will fail unless it is run from the utils directory. run_offline_tests.sh works because it changes to the utils folder before running.

This can be solved by using the current directory instead of using "test" in the code.

Lines 167, 168:

                dir='test', level='INFO', stdout=False, file='test.log')
        self.test_file = 'test/test'

can be changed to:

                dir='.', level='INFO', stdout=False, file='test.log')
        self.test_file = './test'

and the test will work from any directory.
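
An alternative sketch that anchors the paths to the test module itself instead of the working directory (mirroring the fragment style above; os.path-based and untested against the actual test file):

import os

# directory containing tests_offline.py, regardless of where it is launched from
HERE = os.path.dirname(os.path.abspath(__file__))

                dir=HERE, level='INFO', stdout=False, file='test.log')
        self.test_file = os.path.join(HERE, 'test')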

Migration Script to update from 1.0 to 1.1

Issue #2 creates a redis database mismatch in the way we handle the queue for Scrapy Cluster. We will need to write a small 'upgrade' script to help with migration from a single queue -> domain based queues

Plugin for Queue Statistics API

We should have a series of plugins to gather basic queue stats, like the ones we currently dump to the logs, to visualize the basic Redis queue backlog. This is not covered by the current Info API nor the current Stats API.

For example, the Stats API call would be stats: queue and you would get back the number of domains and the total backlog for each spider type.

Pass "spiderid" param to feed function and got "invalid json received" error

From http://scrapy-cluster.readthedocs.org/en/latest/topics/kafkamonitor.html?highlight=feed#scraper-schema-json I found I could pass "spiderid" to the feed function. So I did this: python kafka-monitor.py feed -s settings_crawling.py '{"url": "http://istresearch3.com", "appid":"testapp", "crawlid":"ABC123", "spiderid":"aaaa"}'. I then got the "invalid json received" error; I've checked the kafka-monitor.py source code and don't know why.
My purpose is to put different links into one single DB and let different spiders (e.g. spidera gets jobs from "aaaa:queue" and spiderb gets jobs from "bbbb:queue") work their own job queues. Am I doing this the right way?

Integrate Travis CI for offline tests

Since the offline tests don't have external dependencies (kafka, redis, etc), we could automatically run them upon any new commits or pull requests to the repo. Travis CI is easy to set up and is free for open source repos. I can also add a badge to show that the offline tests have successfully passed or failed.

Production ready ?

I am setting up my own cluster for scraping using individual components such as:

  • Supervisor
  • Scrapy
  • Kafka for messaging
  • Celery for queueing and RabbitMQ/Redis as broker
  • Flask for REST
  • etc.

I'm trying to stitch all the above parts together to make my crawling system. I came across this repo and the specs look awesome.

  • Zookeeper
  • Kafka
  • Redis
  • Crawler(s) Scrapy
  • Kafka Monitor
  • Redis Monitor

And from reading the documentation, I must admit every component seems to be stitched together nicely.

The problem is that the documentation seems promising, but after searching online I could not find anyone using it for real-world applications. My question is very simple: is it really production ready?

Anyone who is using this in production, or has thought about using it, please help me with a comment. I know this question is very opinionated, but any advice would help.

I thought this was the most appropriate place to ask this question, as I could not find people using it anywhere else. If this sounds inappropriate here, feel free to close this.

Do help

LogFactory rolling log doesn't actually roll

The LogFactory logger instance doesn't actually roll the log when it hits the max size, completely negating any added benefit of the ConcurrentRotatingFileHandler we use under the hood.

from scutils.log_factory import LogFactory

logger = LogFactory.get_instance(
        json=False,
        stdout=False,
        level='DEBUG',
        name='test',
        dir='/tmp/',
        file='test.log',
        bytes='5MB',
        backups=2
)

while True:
    logger.info("test log")

This should write 3 log files total: 2 backups of ~5MB each plus the main log file. It does no such thing; it continues to write to the same log file, which grows as large as the system can handle.
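
As a point of comparison while debugging, a minimal sketch of the expected rollover behaviour using the standard library's RotatingFileHandler; this is what the LogFactory example above should match, not a fix for the issue itself:

import logging
from logging.handlers import RotatingFileHandler

handler = RotatingFileHandler('/tmp/test.log',
                              maxBytes=5 * 1024 * 1024,  # roll at ~5MB
                              backupCount=2)             # keep test.log.1 and test.log.2
logger = logging.getLogger('test')
logger.setLevel(logging.INFO)
logger.addHandler(handler)

while True:
    logger.info("test log")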

Domain based queueing

It would be nice if the spider queue was split up based on the domain you are crawling, in order to see a better picture into the cluster. This affects all three core components of the system, as there is much more work to manage them.

Current code is in beta and needs integration/testing in this main project.

1.1 Troubles

Having a bit of trouble getting started. Below I've included commands and their outputs (note: some outputs are truncated):

python kafka_monitor.py run
2015-12-06 19:59:00,030 [kafka-monitor] INFO: Kafka Monitor Stats Dump:
{
    "fail_21600": 0,
    "fail_3600": 0,
    "fail_43200": 0,
    "fail_604800": 0,
....
    "plugin_StatsHandler_lifetime": 0,
    "total_21600": 13,
    "total_3600": 13,
    "total_43200": 13,
    "total_604800": 13,
    "total_86400": 13,
    "total_900": 1,
    "total_lifetime": 13
}

python redis_monitor.py
....
    "total_604800": 6,
    "total_86400": 6,
    "total_900": 0,
    "total_lifetime": 6
}
2015-12-06 20:02:39,862 [redis-monitor] INFO: Crawler Stats Dump:
{
    "total_spider_count": 0
}


scrapy runspider crawling/spiders/link_spider.py
2015-12-06 19:56:46,817 [scrapy-cluster] INFO: Changed Public IP: None -> 52.91.192.73

(scrapy_dev)ubuntu@ip-172-31-7-147:~/scrapy-cluster/kafka-monitor$ python kafka_monitor.py feed '{"url": "http://dmoz.org", "appid":"testapp", "crawlid":"abc1234", "maxdepth":1}'
No override settings found
2015-12-06 19:58:44,573 [kafka-monitor] INFO: Feeding JSON into demo.incoming
{
    "url": "http://dmoz.org",
    "maxdepth": 1,
    "crawlid": "abc1234",
    "appid": "testapp"
}
2015-12-06 19:58:44,580 [kafka-monitor] INFO: Successly fed item to Kafka

python kafkadump.py dump -t demo.crawled_firehose


(scrapy_dev)ubuntu@ip-172-31-7-147:~/scrapy-cluster/kafka-monitor$ python kafkadump.py dump -t demo.outbound_firehose
No override settings found
2015-12-06 19:35:31,640 [kafkadump] INFO: Connected to localhost:9092
{u'server_time': 1449430706, u'crawlid': u'abc1234', u'total_pending': 0, u'total_domains': 0, u'spiderid': u'link', u'appid': u'testapp', u'domains': {}, u'uuid': u'someuuid'}

I haven't changed any of the default settings and I'm currently using the dev branch. However, I don't think my setup is working. I was expecting some updates in dump -t demo.crawled_firehose. So while I think I've successfully fed a URL to be crawled, Scrapy isn't doing the crawl? Any ideas?

Dockerization

We should use Docker Swarm and Docker Compose in order to be able to spin up and spin down Scrapy Cluster and all its dependencies.

Impacts #14 #18 #26

Python 3 Support

With Scrapy soon supporting Python 3, we should consider supporting it as well. At first glance, most of the functionality changes do not affect the code here, but I am sure more work needs to be done.

How to get amount of crawled pages for specific crawl request?

Thank you for your hard work on this project.
It seems that the current stats per crawl are limited to the number of pending items only.
Is there a way to get more advanced stats that include the number of successfully parsed pages and the number of failures? I need this per crawl request.

{u'server_time': 1458001694, u'crawlid': u'tc2', u'total_pending': 9160, u'total_domains': 1, u'spiderid': u'link', u'appid': u'apid2', u'domains': {u'dmoz.org': {u'low_priority': -29, u'high_priority': -19, u'total': 9160}}, u'uuid': u'd2afgh'}

Feeding speed is slow, how to speed up?

Great job and thanks a lot; this project is the only one on GitHub that I could get to work correctly.

I've hit one problem: the feeding speed is slow. I have pushed more than 100,000 records to the queue and it's really a nightmare; is there any way to speed it up?

Rest Services for API requests

We need a set of REST services to be able to pass crawl requests into the Kafka API to be processed by the Kafka Monitor. Ideally this uses something small like Flask and runs on a server that has Kafka access only. The REST services should not bypass the whole Kafka/Redis Monitor architecture, but provide a front-end REST endpoint for submitting to and reading from Kafka.

This API should allow the passthrough of any JSON that needs to flow into a Kafka Monitor plugin, and in cases where there is an expected response, should return the JSON response from Kafka. At that point it behaves just like a REST service.

Note that the REST endpoint should not try to serve streaming data from the firehose, but rather very specific requests.
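
A minimal sketch of the passthrough idea, assuming Flask and kafka-python; the /feed route, port, and topic mirror the examples earlier on this page but are not a final API, and the response-handling described above is omitted:

import json

from flask import Flask, jsonify, request
from kafka import KafkaProducer

app = Flask(__name__)
producer = KafkaProducer(bootstrap_servers='localhost:9092',
                         value_serializer=lambda d: json.dumps(d).encode('utf-8'))

@app.route('/feed', methods=['POST'])
def feed():
    # pass the JSON straight through to the Kafka Monitor's incoming topic
    producer.send('demo.incoming', request.get_json(force=True))
    producer.flush()
    return jsonify({'status': 'submitted'})

if __name__ == '__main__':
    app.run(port=5343)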

Scrapy Cluster 1.1 Documentation

Need more documentation for features added in Scrapy Cluster 1.1. This issue should be worked on after all other issues for the milestone have been completed.

New docs may include:

  • unit testing
  • how to extend or make a new spider
  • more pictures
  • updates to existing pages that are now obsolete

1.1 Final Documentation Completion:

Introduction:

  • Overview
  • Quick Start

Kafka Monitor:

  • Kafka Monitor: Design
  • Kafka Monitor: Quick Start
  • Kafka Monitor: API
  • Kafka Monitor: Plugins
  • Kafka Monitor: Settings

Crawler:

  • Crawler: Design
  • Crawler: Quick Start
  • Crawler: Controlling
  • Crawler: Extension
  • Crawler: Settings

Redis Monitor:

  • Redis Monitor: Design
  • Redis Monitor: Quick Start
  • Redis Monitor: Plugins
  • Redis Monitor: Settings

Utilities:

  • Argparse Helper
  • LogFactory
  • Method Timer
  • Redis Queue
  • Redis Throttled Queue
  • Settings Wrapper
  • Stats Collector
  • Zookeeper Watcher

Advanced Topics:

  • Upgrade Scrapy Cluster
  • Integration with ELK
  • Crawling Responsibly
  • Production Setup
  • DNS Cache
  • Response Time
  • Kafka Topics
  • Redis Keys
  • Other Distributed Scrapy Projects

Misc:

  • FAQ
  • Troubleshooting
  • Contributing
  • Change Log
  • License

Improve documentation

Improve documentation and FAQ's/troubleshooting for the following:

  • How to yield requests from your spider, this probably needs to go in the Wandering Spider example
  • Explain how different spiders can talk to each other using the request.meta['spiderid'] field in their request. This is an undocumented feature but works.
  • Update docs for new kafka consumer and producer settings
  • Update docs for close() plugin functionality in redis monitor
  • Update documentation for running both offline and online unit tests (since we switched to nose for offline unit tests)
  • Ensure changelog is up to date
  • Update migration docs when 1.2 is completed
  • Add redis keys docs for actions conducted that are not scrapes
  • Update documentation about production scheduler domain ttls
  • Update production deployment documentation about proper Stats configurations (remove 1 week, 24 hrs)
  • Add compatibility for VM provisioning with centos 7 as well as ubuntu
  • Add Kafka monitor distributed documentation and diagrams
  • Add Redis monitor distributed documentation and diagrams
  • Finalize docker documentation
  • Correct requirements.txt usage in Overview and Quickstart
  • Update spider documentation with new middlewares for "how to create your own spider", include what things someone needs to do if they don't want to use the RedisSpider base class
  • Add Arachnado to comparisons for other distributed scrapy projects

Switch from Kafka-Python to PyKafka

PyKafka (https://github.com/Parsely/pykafka) supports consumer groups and will allow us to scale the Kafka Monitor horizontally so there is not a single point of failure on the API incoming requests.

This should apply to everything using Kafka Python, so crawlers and redis monitor too. Ultimately we want to remove the kafka-python dependency.
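
A sketch of the balanced (consumer-group) consumption that PyKafka enables, which is what lets several Kafka Monitor instances share the incoming topic; hosts, topic, and group names are illustrative:

from pykafka import KafkaClient

client = KafkaClient(hosts="localhost:9092")
topic = client.topics[b"demo.incoming"]

consumer = topic.get_balanced_consumer(
    consumer_group=b"sc-kafka-monitor",
    zookeeper_connect="localhost:2181",  # group membership is coordinated in Zookeeper
    auto_commit_enable=True,
)

for message in consumer:
    if message is not None:
        print(message.value)  # hand the raw JSON off to the plugin framework here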

Scrapy Cluster Pip Packaging

Scrapy Cluster should have three or four distinct pip packages that allow a user to run pip install scrapy-cluster to get all available packages set up, or to allow individual component management like pip install sc-kafka-monitor, pip install sc-redis-monitor, pip install sc-crawler, or pip install sc-utils.

This requires significant development effort as it will change the way the project is laid out and used by the end user, since our project structure is dependent on all files being available from the git repo. It will function more like Scrapy, when the end user is responsible for creating the folder structure, and can call something like sc-crawler run to load both scrapy settings and scrapy cluster settings.

See also #13

Cluster Speed Control

The cluster runs wide open at the moment, crawling as fast as it can through the queue. It would be nice if we could control the spiders across different machines or ip addresses in order to play nicer with domains.

Current code is in beta and needs integration into the project.

Redis Monitor Locks for processing

In order to scale the Redis Monitor horizontally, we need to be able to ensure that only a single redis monitor process is operating on a key. This involves adding an additional RedLock locking mechanism around the key that is being processed.

This eliminates the Redis Monitor single point of failure.
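
A bare-bones sketch of the idea using a plain Redis SET NX EX lock; a real implementation would use a proper RedLock library, check lock ownership before releasing, and renew long-held locks:

import redis

r = redis.Redis(host='localhost', port=6379)

def process_key_exclusively(key, instance_id, handle_key):
    lock_key = 'lock:' + key
    # only the first redis monitor instance to set the lock wins this key
    if not r.set(lock_key, instance_id, nx=True, ex=30):
        return False
    try:
        handle_key(key)
    finally:
        r.delete(lock_key)   # naive release; see RedLock for the safe version
    return True

process_key_exclusively('some:redis-monitor:key', 'redis-monitor-1', print)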

Persistent DNS cache

One of the sites we are crawling has alerted us to an issue where we were hitting a load balancer with a 1h TTL record over a week after they changed the DNS.

It appears that the default operation for Scrapy is to use an in-memory DNS cache, which apparently never gets flushed. Since the spiders are long running, you run into the reported issue.

I'm going to try setting DNSCACHE_ENABLED to False in my settings and see if that improves things. If it does, we should set that in settings.py so no one else runs into this problem.
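
The change being tested is a one-line addition to the crawler's Scrapy settings, using Scrapy's built-in DNSCACHE_ENABLED setting:

# in the crawler's Scrapy settings.py (proposed)
DNSCACHE_ENABLED = False  # defer to the OS resolver so DNS TTLs are honoured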

Discussion: Docker vs Pip vs Virtual Machine

I would like to open up a discussion for Scrapy Cluster as to how it can be easier to work with.

As of right now, SC 1.1 (almost ready) allows you to do local development on a single Virtual Machine (at time of writing on the dev branch). This single VM is just to do local testing and should not be used in production.

This leaves you with a production deployment where a user must manually stand up Zookeeper, Kafka, and Redis at their desired scale and deploy SC to the various machines they want it to run on. This is done either manually by copying files, or (potentially) via pip packages for the 3 main components. Ansible can help here to a degree, but is quirky across OS setups and is not always as modular.

Docker would give you both the flexibility of deploying to arbitrary Docker servers and the ease of standing up your cluster. If we bundled the 3 main components as Docker containers, then with a bit of tweaking I think an easily scalable solution is possible. Using something like Mesos, Tutum, Compose, or just plain Docker makes it really easy to run everything.

The downside to this is that it may be difficult for users to add custom spiders, pipelines, and middleware to their Scrapy based project, especially if heavy customization is going on and the spiders are deployed to a lot of different servers. ...Not to mention that if the user adds custom plugins to the Kafka Monitor or Redis Monitor, would they then need to bundle their own Docker container?

So the question is, what route seems like the most flexible, while allowing both local development and production scale deployment? What is the future of deploying distributed apps and how can we make SC extendable, flexible, deployable, and dev friendly?

Scrapy Cluster Utils Packaging

It would be nice if we could bundle the utils folder into its own package so that we no longer need symlinks or to duplicate code when scrapy cluster is deployed across multiple machines.

This issue should also cover the documentation of said sc-utils package and getting it on pypi.
