istresearch / scrapy-cluster

This Scrapy project uses Redis and Kafka to create a distributed on demand scraping cluster.

Home Page: http://scrapy-cluster.readthedocs.io/

License: MIT License

Languages: Python 90.45%, HTML 8.50%, Shell 1.05%
Topics: python, scrapy, kafka, redis, scraping, distributed

scrapy-cluster's Introduction

Scrapy Cluster

This Scrapy project uses Redis and Kafka to create a distributed on demand scraping cluster.

The goal is to distribute seed URLs among many waiting spider instances, whose requests are coordinated via Redis. Any other crawls those trigger, as a result of frontier expansion or depth traversal, will also be distributed among all workers in the cluster.

The input to the system is a set of Kafka topics and the output is a set of Kafka topics. Raw HTML and assets are crawled interactively, spidered, and output to the log. For easy local development, you can also disable the Kafka portions and work with the spider entirely via Redis, although this is not recommended due to the serialization of the crawl requests.

Dependencies

Please see the requirements.txt within each sub project for Pip package dependencies.

Other important components required to run the cluster include Redis, Zookeeper, and Kafka.

Core Concepts

This project tries to bring together a bunch of new concepts to Scrapy and large scale distributed crawling in general. Some bullet points include:

  • The spiders are dynamic and on demand, meaning that they allow the arbitrary collection of any web page that is submitted to the scraping cluster
  • Scale Scrapy instances across a single machine or multiple machines
  • Coordinate and prioritize their scraping effort for desired sites
  • Persist data across scraping jobs
  • Execute multiple scraping jobs concurrently
  • Allows in-depth access to information about your scraping job, what is upcoming, and how the sites are ranked
  • Allows you to arbitrarily add/remove/scale your scrapers from the pool without loss of data or downtime
  • Utilizes Apache Kafka as a data bus for any application to interact with the scraping cluster (submit jobs, get info, stop jobs, view results)
  • Allows for coordinated throttling of crawls from independent spiders on separate machines, but behind the same IP Address
  • Enables completely different spiders to yield crawl requests to each other, giving flexibility to how the crawl job is tackled

Scrapy Cluster test environment

To set up a pre-canned Scrapy Cluster test environment, make sure you have Docker and Docker Compose installed.

Steps to launch the test environment:

  1. Build your containers (or omit --build to pull from Docker Hub):
docker-compose up -d --build
  2. Tail Kafka to view your future results:
docker-compose exec kafka_monitor python kafkadump.py dump -t demo.crawled_firehose -ll INFO
  3. From another terminal, feed a request to Kafka:
curl localhost:5343/feed -H "content-type:application/json" -d '{"url": "http://dmoztools.net", "appid":"testapp", "crawlid":"abc123"}'
  4. Validate you've got data!
# wait a couple of seconds; your terminal from step 2 should dump JSON data
{u'body': '...content...', u'crawlid': u'abc123', u'links': [], u'encoding': u'utf-8', u'url': u'http://dmoztools.net', u'status_code': 200, u'status_msg': u'OK', u'response_url': u'http://dmoztools.net', u'request_headers': {u'Accept-Language': [u'en'], u'Accept-Encoding': [u'gzip,deflate'], u'Accept': [u'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'], u'User-Agent': [u'Scrapy/1.5.0 (+https://scrapy.org)']}, u'response_headers': {u'X-Amz-Cf-Pop': [u'IAD79-C3'], u'Via': [u'1.1 82c27f654a5635aeb67d519456516244.cloudfront.net (CloudFront)'], u'X-Cache': [u'RefreshHit from cloudfront'], u'Vary': [u'Accept-Encoding'], u'Server': [u'AmazonS3'], u'Last-Modified': [u'Mon, 20 Mar 2017 16:43:41 GMT'], u'Etag': [u'"cf6b76618b6f31cdec61181251aa39b7"'], u'X-Amz-Cf-Id': [u'y7MqDCLdBRu0UANgt4KOc6m3pKaCqsZP3U3ZgIuxMAJxoml2HTPs_Q=='], u'Date': [u'Tue, 22 Dec 2020 21:37:05 GMT'], u'Content-Type': [u'text/html']}, u'timestamp': u'2020-12-22T21:37:04.736926', u'attrs': None, u'appid': u'testapp'}

Documentation

Please check out the official Scrapy Cluster documentation for more information on how everything works!

Branches

The master branch of this repository contains the latest stable release code for Scrapy Cluster 1.2.

The dev branch contains bleeding edge code and is currently working towards Scrapy Cluster 1.3. Please note that not everything may be documented, finished, tested, or finalized but we are happy to help guide those who are interested.

scrapy-cluster's Issues

Vagrant test environment missing cryptography method

Hi. I am in the process of spinning up the test environment and have run into a few issues, some of which I was able to work around, but one of which I have not been able to.

First, the successful workarounds:

  • I had to manually change the Kafka and Zookeeper download URLs in order to spin up the VM; the URLs that were included were inaccessible.
  • I had to manually install certain packages before being able to successfully install the project requirements. (sudo apt-get install build-essential libssl-dev libffi-dev python-dev)

Unfortunately, now that I'm running the automated tests I keep getting an error about a certain crypto method not being present. Everything I'm finding online says the problem is due to a missing dependency on libssl-dev, but I already have that package installed, so I'm at a loss.

Here is the output I'm seeing at the end of running ./run_offline_tests.sh:

Traceback (most recent call last):
  File "tests/tests_offline.py", line 14, in <module>
    from crawling.redis_dupefilter import RFPDupeFilter
  File "/vagrant/crawler/crawling/redis_dupefilter.py", line 1, in <module>
    from scrapy.dupefilters import BaseDupeFilter
  File "/home/vagrant/.conda/envs/sc/lib/python2.7/site-packages/scrapy/__init__.py", line 48, in <module>
    from scrapy.spiders import Spider
  File "/home/vagrant/.conda/envs/sc/lib/python2.7/site-packages/scrapy/spiders/__init__.py", line 10, in <module>
    from scrapy.http import Request
  File "/home/vagrant/.conda/envs/sc/lib/python2.7/site-packages/scrapy/http/__init__.py", line 15, in <module>
    from scrapy.http.response.html import HtmlResponse
  File "/home/vagrant/.conda/envs/sc/lib/python2.7/site-packages/scrapy/http/response/html.py", line 8, in <module>
    from scrapy.http.response.text import TextResponse
  File "/home/vagrant/.conda/envs/sc/lib/python2.7/site-packages/scrapy/http/response/text.py", line 13, in <module>
    from scrapy.utils.response import get_base_url
  File "/home/vagrant/.conda/envs/sc/lib/python2.7/site-packages/scrapy/utils/response.py", line 12, in <module>
    from twisted.web import http
  File "/home/vagrant/.conda/envs/sc/lib/python2.7/site-packages/twisted/web/http.py", line 92, in <module>
    from twisted.internet import interfaces, reactor, protocol, address
  File "/home/vagrant/.conda/envs/sc/lib/python2.7/site-packages/twisted/internet/reactor.py", line 38, in <module>
    from twisted.internet import default
  File "/home/vagrant/.conda/envs/sc/lib/python2.7/site-packages/twisted/internet/default.py", line 56, in <module>
    install = _getInstallFunction(platform)
  File "/home/vagrant/.conda/envs/sc/lib/python2.7/site-packages/twisted/internet/default.py", line 44, in _getInstallFunction
    from twisted.internet.epollreactor import install
  File "/home/vagrant/.conda/envs/sc/lib/python2.7/site-packages/twisted/internet/epollreactor.py", line 24, in <module>
    from twisted.internet import posixbase
  File "/home/vagrant/.conda/envs/sc/lib/python2.7/site-packages/twisted/internet/posixbase.py", line 18, in <module>
    from twisted.internet import error, udp, tcp
  File "/home/vagrant/.conda/envs/sc/lib/python2.7/site-packages/twisted/internet/tcp.py", line 29, in <module>
    from twisted.internet._newtls import (
  File "/home/vagrant/.conda/envs/sc/lib/python2.7/site-packages/twisted/internet/_newtls.py", line 21, in <module>
    from twisted.protocols.tls import TLSMemoryBIOFactory, TLSMemoryBIOProtocol
  File "/home/vagrant/.conda/envs/sc/lib/python2.7/site-packages/twisted/protocols/tls.py", line 41, in <module>
    from OpenSSL.SSL import Error, ZeroReturnError, WantReadError
  File "/home/vagrant/.conda/envs/sc/lib/python2.7/site-packages/OpenSSL/__init__.py", line 8, in <module>
    from OpenSSL import rand, crypto, SSL
  File "/home/vagrant/.conda/envs/sc/lib/python2.7/site-packages/OpenSSL/rand.py", line 11, in <module>
    from OpenSSL._util import (
  File "/home/vagrant/.conda/envs/sc/lib/python2.7/site-packages/OpenSSL/_util.py", line 7, in <module>
    binding = Binding()
  File "/home/vagrant/.conda/envs/sc/lib/python2.7/site-packages/cryptography/hazmat/bindings/openssl/binding.py", line 114, in __init__
    self._ensure_ffi_initialized()
  File "/home/vagrant/.conda/envs/sc/lib/python2.7/site-packages/cryptography/hazmat/bindings/openssl/binding.py", line 126, in _ensure_ffi_initialized
    cls._modules,
  File "/home/vagrant/.conda/envs/sc/lib/python2.7/site-packages/cryptography/hazmat/bindings/utils.py", line 31, in load_library_for_binding
    lib = ffi.verifier.load_library()
  File "/home/vagrant/.conda/envs/sc/lib/python2.7/site-packages/cffi/verifier.py", line 101, in load_library
    return self._load_library()
  File "/home/vagrant/.conda/envs/sc/lib/python2.7/site-packages/cffi/verifier.py", line 211, in _load_library
    return self._vengine.load_library()
  File "/home/vagrant/.conda/envs/sc/lib/python2.7/site-packages/cffi/vengine_cpy.py", line 155, in load_library
    raise ffiplatform.VerificationError(error)
cffi.ffiplatform.VerificationError: importing '/home/vagrant/.conda/envs/sc/lib/python2.7/site-packages/cryptography/_Cryptography_cffi_a269d620xd5c405b7.so': /home/vagrant/.conda/envs/sc/lib/python2.7/site-packages/cryptography/_Cryptography_cffi_a269d620xd5c405b7.so: undefined symbol: EC_GFp_nistp224_method

Any suggestions would be much appreciated. Thanks!

Offload virtual machine deployment

The Vagrant VM scdev is finicky, uses a custom set of ansible scripts to do the deployment, and runs under a conda virtual environment.

Both real and virtual deployments of Scrapy Cluster should be supported by Ansible Symphony (https://github.com/istresearch/ansible-symphony) and use the standardized Ubuntu 14.04 box.

This will eliminate the ansible folder at the top level of the project.

Improved Debugging

It is a pain to write print statements throughout the cluster, only to remove them so they are not in the code release. Just like in Scrapy, if DEBUG = True we should dump out logger statements about what is going on in each of the three main components.

As a second thought, we could also have a flag for JSON_LOGS so the same human readable logs are turned into JSON structures for piping elsewhere (like ElasticSearch).
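
A minimal sketch of how this could look, assuming hypothetical DEBUG and JSON_LOGS settings in each component's settings.py and reusing the existing scutils LogFactory (the parameters here are taken from the reproduction snippet in the LogFactory issue further down this page):

from scutils.log_factory import LogFactory

# hypothetical settings.py values, not yet part of the project
DEBUG = True
JSON_LOGS = False

logger = LogFactory.get_instance(
    name='kafka-monitor',
    dir='/tmp/',
    file='kafka_monitor.log',
    json=JSON_LOGS,                      # structured output for piping into ElasticSearch
    level='DEBUG' if DEBUG else 'INFO',  # verbose statements only while debugging
    stdout=True
)

logger.debug("processing incoming Kafka request")
logger.info("plugin loaded")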

_get_bin takes hours with queue size 1M.

I'm scraping etsy.com and the queue size has grown to more than 1M. When I query for info/statistics it gets stuck in the _get_bin function in the scrapy-cluster/redis-monitor/plugins/info_monitor.py file. The redis-monitor also uses about 500MB of memory at that moment.

  1. What is the best way to keep the queue size small?
  2. Perhaps _get_bin should be rewritten in a more efficient way to calculate statistics in the database (see the sketch below).
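
For question 2, a hedged illustration (not the actual _get_bin code) of how the per-queue numbers could be computed without pulling every member out of Redis, assuming redis-py and the spider:domain:queue sorted-set layout used by the cluster:

import redis

r = redis.Redis(host='localhost', port=6379)

def queue_summary(key):
    # count and score bounds straight from the sorted set,
    # instead of iterating a million members in Python
    total = r.zcard(key)
    if total == 0:
        return {'total': 0}
    _, low = r.zrange(key, 0, 0, withscores=True)[0]
    _, high = r.zrange(key, -1, -1, withscores=True)[0]
    return {'total': total, 'low_priority': low, 'high_priority': high}

print(queue_summary('link:etsy.com:queue'))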

Status

Noticed master hasn't had an update in quite some time. Is this project currently still under development?

Sample Kibana and Logstash configs

Once all of the JSON logging is finalized and we are towards the end of development for SC 1.1, create some sample logstash configurations and Kibana dashboards for visualizing the cluster. This would pair nicely with documentation that used Kibana logs and graphs for examples, verification, pictures, etc.

Concurrent requests from a spider to a domain

Hi madisonb,

Currently I am working with a single-spider cluster (1 machine).

Scrapy supports concurrent requests to a domain from a single spider. I have tried to emulate this in scrapy-cluster but with no luck. Even if I increase the queue hits, it just increases the total number of requests it can take from Redis within the window. But the requests are still processed one at a time, which ultimately becomes a bottleneck.

The only way I could find to parallelise requests is to spawn two identical spiders with different names and divide the seed URLs between them.

I may be wrong, but please tell me if there is any way to parallelise requests within a machine.

Add Examples Folder in Utils

We have a bunch of example scripts in the documentation for how to use the scutils package; add the formal code to a utils/examples folder and update the documentation.

Zookeeper dependency?

Hey guys!

We met at the Memex Summer workshop last time. I was looking through this code and didn't find any dependency on Zookeeper. Am I missing something? If not, it probably makes sense to remove it from the docs.

Modularize via Plugins/Middleware

Everything needs to move to a plugin based system, similar to how Scrapy operates with its middleware components. It should be called something distinct (like plugins) so as to not be confused with Scrapy Middlewares.

This will allow us to easily extend functionality of kafka-monitor and redis-monitor into an agnostic framework which translates and operates on the data based on the cluster's needs. Already testing this functionality out via kafka-monitor and loving it.

Queue epiphany here.

Integration Tests

Scrapy Cluster needs an integration test that you can run to see if you have everything set up properly. This would be very helpful for new users of the application, which has so many moving parts.

Currently the integration test is just 'getting it working on live data'.

Improve Unit Tests

At time of writing our code coverage is 57%. We should strive to get more coverage and at least get to 65 or 70%.

Also, unit tests do not exist for the following major areas:

  • Zookeeper Watcher
  • Argparse Helper
  • Redis Queue
  • Wandering Spider
  • Scrapy Pipelines

Statistics Collection without 3rd Party Tools

It would be nice to collect metrics and make some simple stats available without having to use yet another big data app (like ELK). Since we are already using Redis, let's use it to collect total and rolling metric counts of incoming Kafka requests, crawls, and monitor actions.

This also will need one new plugin each for the kafka monitor and redis monitor.

References so we don't forget:
Rolling time windows hit counters: https://opensourcehacker.com/2014/07/09/rolling-time-window-counters-with-redis-and-mitigating-botnet-driven-login-attacks/

  • Useful for keeping hit counts of say: 15 mins, 1 hr, 6 hr, 12 hr, 24 hr, 7 days (anything more is probably wasted space for the usefulness of the metric)

Unique item counts: http://redis.io/topics/data-types-intro#hyperloglogs

  • Fixed size data store to count the total number of unique hits seen by something. Useful for seeing 'total' cluster stats.

I think both of these would be useful and should be considered part of future development.
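
A rough sketch of both ideas using redis-py; the key names and windows are illustrative, and this is not the eventual stats collector API:

import time
import uuid

import redis

r = redis.Redis(host='localhost', port=6379)

def rolling_hit(key, window_seconds):
    """Record one hit and return the count inside the rolling window."""
    now = time.time()
    pipe = r.pipeline()
    pipe.zadd(key, {str(uuid.uuid4()): now})              # one member per hit, scored by time
    pipe.zremrangebyscore(key, 0, now - window_seconds)   # expire hits outside the window
    pipe.zcard(key)
    return pipe.execute()[-1]

print(rolling_hit('sc:stats:kafka-monitor:900', 900))     # 15 minute window

# fixed-size unique counter via HyperLogLog for 'total' cluster stats
r.pfadd('sc:stats:crawlids', 'abc123')
print(r.pfcount('sc:stats:crawlids'))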

Multiple Spiders in Single Process

Scrapy 1.0 allows us to run full crawler instances within a process thanks to its internal API.

We should be able to dynamically run X number of spiders defined in the settings.py file, per spider type. So if you have a link_spider.py and a example_spider.py, in your settings file you define how many of each type you want to run, and the overarching process spawns them all.

This is currently not an issue if you use a process manager like Supervisord to spin up multiple instances of scrapy runspider blah, but would be nice to simply wrap everything into a single running instance.
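
A sketch of what this could look like with Scrapy's CrawlerProcess; SPIDER_INSTANCES is a hypothetical setting that does not exist yet:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

settings = get_project_settings()
process = CrawlerProcess(settings)

# hypothetical setting, e.g. SPIDER_INSTANCES = {'link': 2, 'wandering': 1}
for spider_name, count in settings.getdict('SPIDER_INSTANCES').items():
    for _ in range(count):
        process.crawl(spider_name)  # spawn N crawlers of each spider type

process.start()  # run them all in one process; blocks until finished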

Zookeeper Domain blacklist

Once #39 is completed it would be nice to be able to add domains to a crawler blacklist through the API. The blacklist would propagate to all crawlers immediately thanks to Zookeeper.

The implementation of this would be to load the blacklist into a list in the scheduler and then ensure we do not pull from that particular queue when looking for a new crawl. We can spawn a Redis Monitor job to truncate the queue, and lastly if any new crawls of depth > 0 are active the blacklist would not allow the request to be added back to Redis.

This helps eliminate any extra Redis calls we would need to make.
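
A rough sketch of the scheduler-side check, assuming the blacklist is a set of domains kept in sync with Zookeeper (the function and argument names here are illustrative only):

def allowed_queue_keys(queue_keys, blacklist):
    """Yield queue keys the scheduler may pull from, skipping blacklisted domains.

    queue_keys look like '<spiderid>:<domain>:queue'; blacklist is the set of
    domains propagated from Zookeeper.
    """
    for key in queue_keys:
        domain = key.split(':')[1]
        if domain in blacklist:
            continue  # never pull work for a blacklisted domain
        yield key

for key in allowed_queue_keys(['link:dmoztools.net:queue', 'link:blocked.com:queue'],
                              {'blocked.com'}):
    print(key)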

Plugin for Zookeeper Crawler Control

It would be nice to have a plugin to dynamically control the crawler domain specific configuration controlled within Zookeeper. This should be done in the Redis monitor with a connection and ability to manipulate the ZK yaml file, instead of using the manual file pusher right now.

UI for displaying information about Cluster

We need a small stand-alone web UI that ties in with the REST components in #24 to visualize the data generated by the cluster. You should also be able to submit API requests to the cluster.

Preferably this web UI and the REST services live together and are deployed as a single running process.

Reduce potential Redis key collisions

Scrapy Cluster may not be the only process that is operating within a Redis Instance. We should add a unique identifier to the beginning of every key used so that doing *:*:queue does not collide with any other potential key being used in the cluster.

I propose using sc: as the identifier, so that every single thing scrapy cluster uses is easily distinguishable from other keys in use. The new query would be sc:*:*:queue.

Slow Scheduler Memory Build Up

The Distributed Scheduler keeps every domain queue it has ever seen in memory, so we can do extremely fast look up loops against all known domain keys. In turn, we only update the domain queues once every X seconds for new domains that we have never seen before. With every new domain we see, we create an object in memory representing the way to access the redis based queue.

The problem is that these objects inside of the scheduler are never deleted. If we see new domains every time we check, eventually we will get to a point where we run out of available memory on the host because we simply keep adding more and more objects to our lookup dictionary.

The solution to this problem can be solved in a number of ways:

  1. An expiring timer on every dictionary key, in memory, and if there has not been a successful pop() from that queue within X seconds it gets cleaned from memory.
  2. A LRU cache based setup where if the key count exceeds a certain threshold (like 10,000) we begin to delete keys least recently used every time we add new ones. This may also involve both a timestamp and a count of the number of times a key has been used.
  3. Delete all keys every X seconds, whether it is every day, hour, etc. This is very naive.

I am in favor of option 1. Implementation would then be:

  • replace the ThrottledQueue class object stored at queue_dict[final_key] with a tuple of (ThrottledQueue, timestamp)
  • when looping over the throttled queues, if pop() succeeds update the timestamp, else check if the timestamp diff is greater than our delete threshold.
  • if the diff is greater than the threshold, remove the key from the queue dictionary

If you set the threshold to something like 1 year (large), you would expect the queue to grow until all available memory is used. If you set the threshold to 10 minutes (small), you would expect the queue dictionary to only grow to the size of all known domains within the past 10 minutes.

I think this solves the slow memory growth we see when crawling millions of different domains over time.
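
A sketch of option 1, assuming queue_dict is changed to store (ThrottledQueue, last_access_timestamp) tuples and the delete threshold comes from a new setting; the names here are illustrative:

import time

def pop_and_expire(queue_dict, ttl_seconds, handle_request):
    """One pass over the known domain queues: pop what we can, refresh
    timestamps on success, and drop queues idle for longer than the TTL."""
    now = time.time()
    for key in list(queue_dict.keys()):
        queue, last_access = queue_dict[key]
        item = queue.pop()
        if item is not None:
            queue_dict[key] = (queue, now)   # successful pop refreshes the timestamp
            handle_request(item)
        elif now - last_access > ttl_seconds:
            del queue_dict[key]              # idle past the threshold, reclaim the memory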

Elastic Moderated Throttled Queue

The current RedisThrottledQueue, when used under moderation, causes a slight drift in the actual processing of X number of hits in Y time. This delta d is then added for each window, so that the successful number of X hits in Y time is really X hits in Y + d time. This delay is not normally an issue, but crops up with really high velocity moderated keys (i.e. 60 hits in 60 seconds ends up taking ~61 or ~62 seconds). This may be due to network latency or an improper implementation.

We should have an ability to 'catch back up' or fix the moderation implementation so that the numbers line up with exactly what is defined. This may involve adding more items to the throttle_time key, or setting a minimum moderation value so that the catch up does not cause huge spikes in domain hits.

tests_offline.py path dependency issue

When running tests_offline.py manually using python tests_offline.py, it will fail unless it is run from the utils directory. run_offline_tests.sh works because it changes to the utils folder before running.

This can be solved by using the current directory instead of using "test" in the code.

Lines 167, 168:

                dir='test', level='INFO', stdout=False, file='test.log')
        self.test_file = 'test/test'

can be changed to:

                dir='.', level='INFO', stdout=False, file='test.log')
        self.test_file = './test'

and the test will work from any directory.
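
An alternative sketch that anchors the paths to the test module itself instead of the working directory (mirroring the fragment style above; os.path-based and untested against the actual test file):

import os

# directory containing tests_offline.py, regardless of where it is launched from
HERE = os.path.dirname(os.path.abspath(__file__))

                dir=HERE, level='INFO', stdout=False, file='test.log')
        self.test_file = os.path.join(HERE, 'test')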

Migration Script to update from 1.0 to 1.1

Issue #2 creates a redis database mismatch in the way we handle the queue for Scrapy Cluster. We will need to write a small 'upgrade' script to help with migration from a single queue -> domain based queues

Plugin for Queue Statistics API

We should have a series of plugins to gather basic queue stats, like the ones we currently dump to the logs, to visualize the basic Redis queue backlog. This is not covered by the current Info API nor the current Stats API.

For example, the Stats API call would be stats: queue and you would get back the number of domains and the total backlog for each spider type.

Pass "spiderid" param to feed function and got "invalid json received" error

From http://scrapy-cluster.readthedocs.org/en/latest/topics/kafkamonitor.html?highlight=feed#scraper-schema-json I found I could pass "spiderid" to the feed function. So I did this: python kafka-monitor.py feed -s settings_crawling.py '{"url": "http://istresearch3.com", "appid":"testapp", "crawlid":"ABC123", "spiderid":"aaaa"}'. I then got the "invalid json received" error; I've checked the kafka-monitor.py source code and don't know why.
My purpose is to put different links into one single DB and let different spiders (e.g. spidera gets jobs from "aaaa:queue" and spiderb gets jobs from "bbbb:queue") work their own job queues. Am I doing this the right way?

Integrate Travis CI for offline tests

Since the offline tests don't have external dependencies (kafka, redis, etc), we could automatically run them upon any new commits or pull requests to the repo. Travis CI is easy to set up and is free for open source repos. I can also add a badge to show that the offline tests have successfully passed or failed.

Production ready ?

I am setting up my own cluster for scraping using individual components such as:

  • Supervisor
  • Scrapy
  • Kafka for messaging
  • Celery for queueing and RabbitMQ/Redis as broker
  • Flask for REST
  • etc.

I'm trying to stitch all the above parts together to make my crawling system. I came across this repo and the specs look awesome.

  • Zookeeper
  • Kafka
  • Redis
  • Crawler(s) Scrapy
  • Kafka Monitor
  • Redis Monitor

And from reading the documentation, I must admit every component seems to be stitched together nicely.

The problem is that the documentation seems promising, but after searching online I could not find anyone using it for real-world applications. My question is very simple: is it really production ready?

Anyone who is using this in production, or has thought about using it, please help me with a comment. I know this question is very opinionated, but any advice would help.

I thought this was the most appropriate place to ask this question, as I could not find people using it anywhere else. If this sounds inappropriate here, feel free to close this.

Do help

LogFactory rolling log doesn't actually roll

The LogFactory logger instance doesn't actually roll the log when it hits the max size, completely negating any added benefit of the ConcurrentRotatingFileHandler we use under the hood.

from scutils.log_factory import LogFactory

logger = LogFactory.get_instance(
        json=False,
        stdout=False,
        level='DEBUG',
        name='test',
        dir='/tmp/',
        file='test.log',
        bytes='5MB',
        backups=2
)

while True:
    logger.info("test log")

This should write 3 log files total: 2 backups of ~5MB each plus the main log file. It does no such thing; it continues to write to the same log file, which grows as large as the system can handle.
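
As a point of comparison while debugging, a minimal sketch of the expected rollover behaviour using the standard library's RotatingFileHandler; this is what the LogFactory example above should match, not a fix for the issue itself:

import logging
from logging.handlers import RotatingFileHandler

handler = RotatingFileHandler('/tmp/test.log',
                              maxBytes=5 * 1024 * 1024,  # roll at ~5MB
                              backupCount=2)             # keep test.log.1 and test.log.2
logger = logging.getLogger('test')
logger.setLevel(logging.INFO)
logger.addHandler(handler)

while True:
    logger.info("test log")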

Domain based queueing

It would be nice if the spider queue was split up based on the domain you are crawling, in order to see a better picture into the cluster. This affects all three core components of the system, as there is much more work to manage them.

Current code is in beta and needs integration/testing in this main project.

1.1 Troubles

Having a bit of trouble getting started. Below I've included commands and their outputs (note: some outputs are truncated):

python kafka_monitor.py run
2015-12-06 19:59:00,030 [kafka-monitor] INFO: Kafka Monitor Stats Dump:
{
    "fail_21600": 0,
    "fail_3600": 0,
    "fail_43200": 0,
    "fail_604800": 0,
....
    "plugin_StatsHandler_lifetime": 0,
    "total_21600": 13,
    "total_3600": 13,
    "total_43200": 13,
    "total_604800": 13,
    "total_86400": 13,
    "total_900": 1,
    "total_lifetime": 13
}

python redis_monitor.py
....
    "total_604800": 6,
    "total_86400": 6,
    "total_900": 0,
    "total_lifetime": 6
}
2015-12-06 20:02:39,862 [redis-monitor] INFO: Crawler Stats Dump:
{
    "total_spider_count": 0
}


scrapy runspider crawling/spiders/link_spider.py
2015-12-06 19:56:46,817 [scrapy-cluster] INFO: Changed Public IP: None -> 52.91.192.73

(scrapy_dev)ubuntu@ip-172-31-7-147:~/scrapy-cluster/kafka-monitor$ python kafka_monitor.py feed '{"url": "http://dmoz.org", "appid":"testapp", "crawlid":"abc1234", "maxdepth":1}'
No override settings found
2015-12-06 19:58:44,573 [kafka-monitor] INFO: Feeding JSON into demo.incoming
{
    "url": "http://dmoz.org",
    "maxdepth": 1,
    "crawlid": "abc1234",
    "appid": "testapp"
}
2015-12-06 19:58:44,580 [kafka-monitor] INFO: Successly fed item to Kafka

python kafkadump.py dump -t demo.crawled_firehose


(scrapy_dev)ubuntu@ip-172-31-7-147:~/scrapy-cluster/kafka-monitor$ python kafkadump.py dump -t demo.outbound_firehose
No override settings found
2015-12-06 19:35:31,640 [kafkadump] INFO: Connected to localhost:9092
{u'server_time': 1449430706, u'crawlid': u'abc1234', u'total_pending': 0, u'total_domains': 0, u'spiderid': u'link', u'appid': u'testapp', u'domains': {}, u'uuid': u'someuuid'}

I haven't changed any of the default settings and I'm currently using the dev branch. However, I don't think my setup is working. I was expecting some updates in dump -t demo.crawled_firehose. So while I think I've successfully fed a URL to be crawled, Scrapy isn't doing the crawl? Any ideas?

Dockerization

We should use Docker Swarm and Docker Compose in order to be able to spin up and spin down Scrapy Cluster and all its dependencies.

Impacts #14 #18 #26

Python 3 Support

With Scrapy soon supporting Python 3, we should consider supporting it as well. At first glance, most of the functionality changes do not affect the code here, but I am sure more work needs to be done.

How to get amount of crawled pages for specific crawl request?

Thank you for your hard work on this project.
It seems that the current stats per crawl are limited to the number of pending items only.
Is there a way to get more advanced stats that include the number of successfully parsed pages and the number of failures? I need this per crawl request.

{u'server_time': 1458001694, u'crawlid': u'tc2', u'total_pending': 9160, u'total_domains': 1, u'spiderid': u'link', u'appid': u'apid2', u'domains': {u'dmoz.org': {u'low_priority': -29, u'high_priority': -19, u'total': 9160}}, u'uuid': u'd2afgh'}

Feeding speed is slow, how to speed up?

Great job and thanks a lot; this project is the only one on GitHub that I could get to work correctly.

I've hit one problem: the feeding speed is slow. I have pushed more than 100,000 records to the queue and it's really a nightmare; is there any way to speed it up?

Rest Services for API requests

We need a set of REST services to be able to pass crawl requests into the Kafka API to be processed by the Kafka Monitor. Ideally this uses something small like Flask and runs on a server that has Kafka access only. The REST services should not bypass the whole Kafka/Redis Monitor architecture, but provide a front-end REST endpoint for submitting to and reading from Kafka.

This API should allow the passthrough of any JSON that needs to flow into a Kafka Monitor plugin, and in cases where there is an expected response, should return the JSON response from Kafka. At that point it behaves just like a REST service.

Note that the REST endpoint should not try to serve streaming data from the firehose, but rather very specific requests.
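
A minimal sketch of the passthrough idea, assuming Flask and kafka-python; the /feed route, port, and topic mirror the examples earlier on this page but are not a final API, and the response-handling described above is omitted:

import json

from flask import Flask, jsonify, request
from kafka import KafkaProducer

app = Flask(__name__)
producer = KafkaProducer(bootstrap_servers='localhost:9092',
                         value_serializer=lambda d: json.dumps(d).encode('utf-8'))

@app.route('/feed', methods=['POST'])
def feed():
    # pass the JSON straight through to the Kafka Monitor's incoming topic
    producer.send('demo.incoming', request.get_json(force=True))
    producer.flush()
    return jsonify({'status': 'submitted'})

if __name__ == '__main__':
    app.run(port=5343)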

Scrapy Cluster 1.1 Documentation

Need more documentation for features added in Scrapy Cluster 1.1. This issue should be worked on after all other issues for the milestone have been completed.

New docs may include:

  • unit testing
  • how to extend or make a new spider
  • more pictures
  • updates to existing pages that are now obsolete

1.1 Final Documentation Completion:

Introduction:

  • Overview
  • Quick Start

Kafka Monitor:

  • Kafka Monitor: Design
  • Kafka Monitor: Quick Start
  • Kafka Monitor: API
  • Kafka Monitor: Plugins
  • Kafka Monitor: Settings

Crawler:

  • Crawler: Design
  • Crawler: Quick Start
  • Crawler: Controlling
  • Crawler: Extension
  • Crawler: Settings

Redis Monitor:

  • Redis Monitor: Design
  • Redis Monitor: Quick Start
  • Redis Monitor: Plugins
  • Redis Monitor: Settings

Utilities:

  • Argparse Helper
  • LogFactory
  • Method Timer
  • Redis Queue
  • Redis Throttled Queue
  • Settings Wrapper
  • Stats Collector
  • Zookeeper Watcher

Advanced Topics:

  • Upgrade Scrapy Cluster
  • Integration with ELK
  • Crawling Responsibly
  • Production Setup
  • DNS Cache
  • Response Time
  • Kafka Topics
  • Redis Keys
  • Other Distributed Scrapy Projects

Misc:

  • FAQ
  • Troubleshooting
  • Contributing
  • Change Log
  • License

Improve documentation

Improve documentation and FAQ's/troubleshooting for the following:

  • How to yield requests from your spider, this probably needs to go in the Wandering Spider example
  • Explain how different spiders can talk to each other using the request.meta['spiderid'] field in their request. This is an undocumented feature but works.
  • Update docs for new kafka consumer and producer settings
  • Update docs for close() plugin functionality in redis monitor
  • Update documentation for running both offline and online unit tests (since we switched to nose for offline unit tests)
  • Ensure changelog is up to date
  • Update migration docs when 1.2 is completed
  • Add redis keys docs for actions conducted that are not scrapes
  • Update documentation about production scheduler domain ttls
  • Update production deployment documentation about proper Stats configurations (remove 1 week, 24 hrs)
  • Add compatibility for VM provisioning with centos 7 as well as ubuntu
  • Add Kafka monitor distributed documentation and diagrams
  • Add Redis monitor distributed documentation and diagrams
  • Finalize docker documentation
  • Correct requirements.txt usage in Overview and Quickstart
  • Update spider documentation with new middlewares for "how to create your own spider", include what things someone needs to do if they don't want to use the RedisSpider base class
  • Add Arachnado to comparisons for other distributed scrapy projects

Switch from Kafka-Python to PyKafka

PyKafka (https://github.com/Parsely/pykafka) supports consumer groups and will allow us to scale the Kafka Monitor horizontally so there is not a single point of failure on the API incoming requests.

This should apply to everything using Kafka Python, so crawlers and redis monitor too. Ultimately we want to remove the kafka-python dependency.
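
A sketch of the balanced (consumer-group) consumption that PyKafka enables, which is what lets several Kafka Monitor instances share the incoming topic; hosts, topic, and group names are illustrative:

from pykafka import KafkaClient

client = KafkaClient(hosts="localhost:9092")
topic = client.topics[b"demo.incoming"]

consumer = topic.get_balanced_consumer(
    consumer_group=b"sc-kafka-monitor",
    zookeeper_connect="localhost:2181",  # group membership is coordinated in Zookeeper
    auto_commit_enable=True,
)

for message in consumer:
    if message is not None:
        print(message.value)  # hand the raw JSON off to the plugin framework here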

Scrapy Cluster Pip Packaging

Scrapy Cluster should have three or four distinct pip packages that allow a user to run pip install scrapy-cluster to get all available packages set up, or to allow individual component management like pip install sc-kafka-monitor, pip install sc-redis-monitor, pip install sc-crawler, or pip install sc-utils.

This requires significant development effort as it will change the way the project is laid out and used by the end user, since our project structure is dependent on all files being available from the git repo. It will function more like Scrapy, when the end user is responsible for creating the folder structure, and can call something like sc-crawler run to load both scrapy settings and scrapy cluster settings.

See also #13

Cluster Speed Control

The cluster runs wide open at the moment, crawling as fast as it can through the queue. It would be nice if we could control the spiders across different machines or ip addresses in order to play nicer with domains.

Current code is in beta and needs integration into the project.

Redis Monitor Locks for processing

In order to scale the Redis Monitor horizontally, we need to be able to ensure that only a single redis monitor process is operating on a key. This involves adding an additional RedLock locking mechanism around the key that is being processed.

This eliminates the Redis Monitor single point of failure.
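
A bare-bones sketch of the idea using a plain Redis SET NX EX lock; a real implementation would use a proper RedLock library, check lock ownership before releasing, and renew long-held locks:

import redis

r = redis.Redis(host='localhost', port=6379)

def process_key_exclusively(key, instance_id, handle_key):
    lock_key = 'lock:' + key
    # only the first redis monitor instance to set the lock wins this key
    if not r.set(lock_key, instance_id, nx=True, ex=30):
        return False
    try:
        handle_key(key)
    finally:
        r.delete(lock_key)   # naive release; see RedLock for the safe version
    return True

process_key_exclusively('some:redis-monitor:key', 'redis-monitor-1', print)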

Persistent DNS cache

One of the sites we are crawling has alerted us to an issue where we were hitting a load balancer with a 1h TTL record over a week after they changed the DNS.

It appears that the default operation for Scrapy is to use an in-memory DNS cache, which apparently never gets flushed. Since the spiders are long running, you run into the reported issue.

I'm going to try setting DNSCACHE_ENABLED to False in my settings and see if that improves things. If it does, we should set that in settings.py so no one else runs into this problem.
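
The change being tested is a one-line addition to the crawler's Scrapy settings, using Scrapy's built-in DNSCACHE_ENABLED setting:

# in the crawler's Scrapy settings.py (proposed)
DNSCACHE_ENABLED = False  # defer to the OS resolver so DNS TTLs are honoured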

Discussion: Docker vs Pip vs Virtual Machine

I would like to open up a discussion for Scrapy Cluster as to how it can be easier to work with.

As of right now, SC 1.1 (almost ready) allows you to do local development on a single Virtual Machine (at time of writing on the dev branch). This single VM is just to do local testing and should not be used in production.

This leaves you with a production deployment where a user must manually stand up Zookeeper, Kafka, and Redis at their desired scale and deploy SC to the various machines they want it to run on. This is done either manually by copying files, or (potentially) via pip packages for the 3 main components. Ansible can help here to a degree, but is quirky across OS setups and is not always as modular.

Docker would give you both the flexibility of deploying to arbitrary Docker servers and the ease of standing up your cluster. If we bundled the 3 main components as Docker containers, then with a bit of tweaking I think an easily scalable solution is possible. Using something like Mesos, Tutum, Compose, or just plain Docker makes it really easy to run everything.

The downside to this is that it may be difficult for users to add custom spiders, pipelines, and middleware to their Scrapy based project, especially if heavy customization is going on and the spiders are deployed to a lot of different servers. ...Not to mention that if the user adds custom plugins to the Kafka Monitor or Redis Monitor, would they then need to bundle their own Docker container?

So the question is, what route seems like the most flexible, while allowing both local development and production scale deployment? What is the future of deploying distributed apps and how can we make SC extendable, flexible, deployable, and dev friendly?

Scrapy Cluster Utils Packaging

It would be nice if we could bundle the utils folder into its own package so that we no longer need symlinks or to duplicate code when scrapy cluster is deployed across multiple machines.

This issue should also cover the documentation of said sc-utils package and getting it on pypi.
