Giter Site home page Giter Site logo

jchenga / scrapy-distributed Goto Github PK

View Code? Open in Web Editor NEW

This project forked from insutanto/scrapy-distributed

0.0 0.0 0.0 74 KB

A series of distributed components for Scrapy. Including RabbitMQ-based components, Kafka-based components, and RedisBloom-based components for Scrapy.

Python 100.00%

scrapy-distributed's Introduction

Scrapy-Distributed

Scrapy-Distributed is a series of components for you to develop a distributed crawler base on Scrapy in an easy way.

Now! Scrapy-Distributed has supported RabbitMQ Scheduler, Kafka Scheduler and RedisBloom DupeFilter. You can use either of those in your Scrapy's project very easily.

Features

  • RabbitMQ Scheduler
    • Support custom declare a RabbitMQ's Queue. Such as passivedurableexclusiveauto_delete, and all other options.
  • RabbitMQ Pipeline
    • Support custom declare a RabbitMQ's Queue for the items of spider. Such as passivedurableexclusiveauto_delete, and all other options.
  • Kafaka Scheduler
    • Support custom declare a Kafka's Topic. Such as num_partitionsreplication_factor and will support other options.
  • RedisBloom DupeFilter
    • Support custom the keyerrorRatecapacityexpansion and auto-scaling(noScale) of a bloom filter.

Requirements

  • Python >= 3.6
  • Scrapy >= 1.8.0
  • Pika >= 1.0.0
  • RedisBloom >= 0.2.0
  • Redis >= 3.0.1
  • kafka-python >= 1.4.7

TODO

  • RabbitMQ Item Pipeline
  • Support Delayed Message in RabbitMQ Scheduler
  • Support Scheduler Serializer
  • Custom Interface for DupeFilter
  • RocketMQ Scheduler
  • RocketMQ Item Pipeline
  • SQLAlchemy Item Pipeline
  • Mongodb Item Pipeline
  • Kafka Scheduler
  • Kafka Item Pipeline

Usage

Step 0:

pip install scrapy-distributed

OR

git clone https://github.com/Insutanto/scrapy-distributed.git && cd scrapy-distributed
&& python setup.py install

There is a simple demo in examples/simple_example. Here is the fast way to use Scrapy-Distributed.

# pull and run a RabbitMQ container.
docker run -d --name rabbitmq -p 0.0.0.0:15672:15672 -p 0.0.0.0:5672:5672 rabbitmq:3
# enable rabbitmq_management
docker exec -it <rabbitmq-container-id> /bin/bash
cd /etc/rabbitmq/
rabbitmq-plugins enable rabbitmq_management

# pull and run a RedisBloom container.
docker run -d --name redis-redisbloom -p 6379:6379 redislabs/rebloom:latest

cd examples/rabbitmq_example
python run_simple_example.py
# make sure you have a Kafka running on localhost:9092
# pull and run a RedisBloom container.
docker run -d --name redis-redisbloom -p 6379:6379 redislabs/rebloom:latest

cd examples/kafka_example
python run_simple_example.py

RabbitMQ Support

If you don't have the required environment for tests:

# pull and run a RabbitMQ container.
docker run -d --name rabbitmq -p 0.0.0.0:15672:15672 -p 0.0.0.0:5672:5672 rabbitmq:3
# enable rabbitmq_management
docker exec -it <rabbitmq-container-id> /bin/bash
cd /etc/rabbitmq/
rabbitmq-plugins enable rabbitmq_management
# pull and run a RedisBloom container.
docker run -d --name redis-redisbloom -p 6379:6379 redislabs/rebloom:latest

Step 1:

Only by change SCHEDULERDUPEFILTER_CLASS and add some configs, you can get a distributed crawler in a moment.

SCHEDULER = "scrapy_distributed.schedulers.DistributedScheduler"
SCHEDULER_QUEUE_CLASS = "scrapy_distributed.queues.amqp.RabbitQueue"
RABBITMQ_CONNECTION_PARAMETERS = "amqp://guest:guest@localhost:5672/example/?heartbeat=0"
DUPEFILTER_CLASS = "scrapy_distributed.dupefilters.redis_bloom.RedisBloomDupeFilter"
BLOOM_DUPEFILTER_REDIS_URL = "redis://:@localhost:6379/0"
BLOOM_DUPEFILTER_REDIS_HOST = "localhost"
BLOOM_DUPEFILTER_REDIS_PORT = 6379
REDIS_BLOOM_PARAMS = {
    "redis_cls": "redisbloom.client.Client"
}
BLOOM_DUPEFILTER_ERROR_RATE = 0.001
BLOOM_DUPEFILTER_CAPACITY = 100_0000

# disable the RedirectMiddleware, because the RabbitMiddleware can handle those redirect request.
DOWNLOADER_MIDDLEWARES = {
    ...
    "scrapy.downloadermiddlewares.redirect.RedirectMiddleware": None,
    "scrapy_distributed.middlewares.amqp.RabbitMiddleware": 542
}

# add RabbitPipeline, it will push your items to rabbitmq's queue. 
ITEM_PIPELINES = {
    ...
   'scrapy_distributed.pipelines.amqp.RabbitPipeline': 301,
}


Step 2:

scrapy crawl <your_spider>

Kafka Support

Step 1:

SCHEDULER = "scrapy_distributed.schedulers.DistributedScheduler"
SCHEDULER_QUEUE_CLASS = "scrapy_distributed.queues.kafka.KafkaQueue"
KAFKA_CONNECTION_PARAMETERS = "localhost:9092"
DUPEFILTER_CLASS = "scrapy_distributed.dupefilters.redis_bloom.RedisBloomDupeFilter"
BLOOM_DUPEFILTER_REDIS_URL = "redis://:@localhost:6379/0"
BLOOM_DUPEFILTER_REDIS_HOST = "localhost"
BLOOM_DUPEFILTER_REDIS_PORT = 6379
REDIS_BLOOM_PARAMS = {
    "redis_cls": "redisbloom.client.Client"
}
BLOOM_DUPEFILTER_ERROR_RATE = 0.001
BLOOM_DUPEFILTER_CAPACITY = 100_0000

DOWNLOADER_MIDDLEWARES = {
    ...
   "scrapy_distributed.middlewares.kafka.KafkaMiddleware": 542
}

Step 2:

scrapy crawl <your_spider>

Reference Project

scrapy-rabbitmq-link(scrapy-rabbitmq-link)

scrapy-redis(scrapy-redis)

scrapy-distributed's People

Contributors

insutanto avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.