
A series of distributed components for Scrapy. Including RabbitMQ-based components, Kafka-based components, and RedisBloom-based components for Scrapy.

scrapy-distributed's Introduction

Scrapy-Distributed

Scrapy-Distributed is a series of components that let you develop a distributed crawler based on Scrapy in an easy way.

Scrapy-Distributed currently supports a RabbitMQ Scheduler, a Kafka Scheduler, and a RedisBloom DupeFilter. You can use any of them in your Scrapy project very easily.

Features

  • RabbitMQ Scheduler
    • Supports custom declaration of a RabbitMQ queue, with options such as passive, durable, exclusive, auto_delete, and all other queue options (see the pika sketch after this list).
  • RabbitMQ Pipeline
    • Supports custom declaration of a RabbitMQ queue for the spider's items, with the same options: passive, durable, exclusive, auto_delete, and so on.
  • Kafka Scheduler
    • Supports custom declaration of a Kafka topic, with options such as num_partitions and replication_factor; other options will be supported later.
  • RedisBloom DupeFilter
    • Supports customizing the key, errorRate, capacity, expansion, and auto-scaling (noScale) of a Bloom filter.
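The RabbitMQ declare options above are the standard arguments of an AMQP queue declaration. As an illustration only (this is plain pika, not Scrapy-Distributed's own API), here is a minimal sketch of what they mean; the queue name "example.requests" is hypothetical:

import pika

# Illustration only: plain pika, not Scrapy-Distributed's API.
# The queue name "example.requests" is hypothetical.
connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(
    queue="example.requests",
    passive=False,      # True: only check that the queue exists, don't create it
    durable=True,       # survive broker restarts
    exclusive=False,    # True: restrict the queue to this connection
    auto_delete=False,  # True: delete the queue when the last consumer disconnects
)
connection.close()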

Requirements

  • Python >= 3.6
  • Scrapy >= 1.8.0
  • Pika >= 1.0.0
  • RedisBloom >= 0.2.0
  • Redis >= 3.0.1
  • kafka-python >= 1.4.7

TODO

  • RabbitMQ Item Pipeline
  • Support Delayed Message in RabbitMQ Scheduler
  • Support Scheduler Serializer
  • Custom Interface for DupeFilter
  • RocketMQ Scheduler
  • RocketMQ Item Pipeline
  • SQLAlchemy Item Pipeline
  • Mongodb Item Pipeline
  • Kafka Scheduler
  • Kafka Item Pipeline

Usage

Step 0:

pip install scrapy-distributed

OR

git clone https://github.com/Insutanto/scrapy-distributed.git \
    && cd scrapy-distributed \
    && python setup.py install

There is a simple demo in examples/simple_example. Here is the quickest way to use Scrapy-Distributed.

If you don't have the required environment for the RabbitMQ example:

# pull and run a RabbitMQ container.
docker run -d --name rabbitmq -p 0.0.0.0:15672:15672 -p 0.0.0.0:5672:5672 rabbitmq:3-management

# pull and run a RedisBloom container.
docker run -d --name redisbloom -p 6379:6379 redis/redis-stack

cd examples/rabbitmq_example
python run_simple_example.py

Or you can use docker compose:

docker compose -f ./docker-compose.dev.yaml up -d
cd examples/rabbitmq_example
python run_simple_example.py

If you don't have the required environment for the Kafka example:

# make sure you have a Kafka running on localhost:9092
# pull and run a RedisBloom container.
docker run -d --name redisbloom -p 6379:6379 redis/redis-stack

cd examples/kafka_example
python run_simple_example.py

Or you can use docker compose:

docker compose -f ./docker-compose.dev.yaml up -d
cd examples/kafka_example
python run_simple_example.py

RabbitMQ Support

If you don't have the required environment for tests:

# pull and run a RabbitMQ container.
docker run -d --name rabbitmq -p 0.0.0.0:15672:15672 -p 0.0.0.0:5672:5672 rabbitmq:3-management

# pull and run a RedisBloom container.
docker run -d --name redisbloom -p 6379:6379 redis/redis-stack

Or you can use docker compose:

docker compose -f ./docker-compose.dev.yaml up -d

Step 1:

Just by changing SCHEDULER and DUPEFILTER_CLASS and adding some configs, you can get a distributed crawler in a moment.

SCHEDULER = "scrapy_distributed.schedulers.DistributedScheduler"
SCHEDULER_QUEUE_CLASS = "scrapy_distributed.queues.amqp.RabbitQueue"
RABBITMQ_CONNECTION_PARAMETERS = "amqp://guest:guest@localhost:5672/example/?heartbeat=0"
DUPEFILTER_CLASS = "scrapy_distributed.dupefilters.redis_bloom.RedisBloomDupeFilter"
BLOOM_DUPEFILTER_REDIS_URL = "redis://:@localhost:6379/0"
BLOOM_DUPEFILTER_REDIS_HOST = "localhost"
BLOOM_DUPEFILTER_REDIS_PORT = 6379
REDIS_BLOOM_PARAMS = {
    "redis_cls": "redisbloom.client.Client"
}
BLOOM_DUPEFILTER_ERROR_RATE = 0.001
BLOOM_DUPEFILTER_CAPACITY = 1_000_000

# disable the RedirectMiddleware, because the RabbitMiddleware can handle those redirect requests.
DOWNLOADER_MIDDLEWARES = {
    ...
    "scrapy.downloadermiddlewares.redirect.RedirectMiddleware": None,
    "scrapy_distributed.middlewares.amqp.RabbitMiddleware": 542
}

# add RabbitPipeline; it will push your items to RabbitMQ's queue.
ITEM_PIPELINES = {
    ...
    "scrapy_distributed.pipelines.amqp.RabbitPipeline": 301,
}
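
The BLOOM_DUPEFILTER_* settings above correspond to RedisBloom's Bloom-filter commands. As a rough sketch of what errorRate and capacity mean, using the redisbloom.client.Client named in REDIS_BLOOM_PARAMS (the key name "example:bloom" is hypothetical, not the dupefilter's real key):

from redisbloom.client import Client

# Hypothetical key name, used here only to illustrate the parameters.
rb = Client(host="localhost", port=6379)
rb.bfCreate("example:bloom", 0.001, 1_000_000)  # errorRate=0.001, capacity=1,000,000
print(rb.bfAdd("example:bloom", "https://example.com"))     # 1 on first insert
print(rb.bfExists("example:bloom", "https://example.com"))  # 1: probably seen before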


Step 2:

scrapy crawl <your_spider>
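
Since scheduling now goes through RabbitMQ, you can run the same scrapy crawl command on several machines pointed at the same broker and they will share the request queue. To check how many requests are waiting, here is a minimal sketch with pika; the queue name "example.requests" is hypothetical, so use whatever queue your RabbitQueue scheduler declares, and reuse your RABBITMQ_CONNECTION_PARAMETERS URL:

import pika

# Hypothetical queue name; use whatever queue your RabbitQueue scheduler declares.
QUEUE = "example.requests"

# Reuse the RABBITMQ_CONNECTION_PARAMETERS URL from settings.py.
connection = pika.BlockingConnection(
    pika.URLParameters("amqp://guest:guest@localhost:5672/example/?heartbeat=0")
)
channel = connection.channel()
# passive=True only inspects the queue (it must already exist) and reports its depth.
pending = channel.queue_declare(queue=QUEUE, passive=True).method.message_count
print(f"{QUEUE}: {pending} pending requests")
connection.close()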

Kafka Support

Step 1:

SCHEDULER = "scrapy_distributed.schedulers.DistributedScheduler"
SCHEDULER_QUEUE_CLASS = "scrapy_distributed.queues.kafka.KafkaQueue"
KAFKA_CONNECTION_PARAMETERS = "localhost:9092"
DUPEFILTER_CLASS = "scrapy_distributed.dupefilters.redis_bloom.RedisBloomDupeFilter"
BLOOM_DUPEFILTER_REDIS_URL = "redis://:@localhost:6379/0"
BLOOM_DUPEFILTER_REDIS_HOST = "localhost"
BLOOM_DUPEFILTER_REDIS_PORT = 6379
REDIS_BLOOM_PARAMS = {
    "redis_cls": "redisbloom.client.Client"
}
BLOOM_DUPEFILTER_ERROR_RATE = 0.001
BLOOM_DUPEFILTER_CAPACITY = 1_000_000

DOWNLOADER_MIDDLEWARES = {
    ...
   "scrapy_distributed.middlewares.kafka.KafkaMiddleware": 542
}

Step 2:

scrapy crawl <your_spider>
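
As with RabbitMQ, you can run scrapy crawl on several machines against the same Kafka brokers. To peek at the requests flowing through the topic, here is a minimal sketch with kafka-python; the topic name "example.requests" is hypothetical, and with no group_id the consumer commits no offsets, so it won't take messages away from the crawlers:

from kafka import KafkaConsumer

# Hypothetical topic name; use the topic your KafkaQueue scheduler declares.
TOPIC = "example.requests"

consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",  # start from the beginning of the topic
    consumer_timeout_ms=2000,      # stop iterating after 2s without new messages
)
for record in consumer:
    print(record.offset, record.value[:80])
consumer.close()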

Reference Projects

  • scrapy-rabbitmq-link
  • scrapy-redis


scrapy-distributed's Issues

dynamic web crawlers

Could you add support for crawling dynamic web pages to the project? It would need simulated clicks and handling of anti-crawler mechanisms. Thanks!

Implementation proposal

Hi @Insutanto

You're doing nice work in this repo. I have the same desire: different message queues should be supported in Scrapy.

Old implementations of this idea, and the one you have here, share a common disadvantage: for every type of queue you need to implement a separate scheduler. Besides the amount of work required, such implementations can't benefit from work done on improving scheduling. I am talking mostly about scrapy/scrapy#3520. The reason for going distributed (at least for me) is having a lot of domains in a single crawl. Not using DownloaderAwarePriorityQueue makes crawling slower (roughly 10 times slower) according to the benchmarks in the mentioned PR.

To overcome this, I developed and merged in scrapy/scrapy#3884 a separation between the scheduler's logic and the external message queue.

It would be great for your project and the Scrapy community if you switched from a scheduler-based to a queue-based approach.

More details and discussion can be found in scrapy/scrapy#4326. An example of such an implementation for Redis can be found at https://github.com/whalebot-helmsman/scrapy/blob/redis/scrapy/squeues.py#L101-L173.

There is also a PR for an external queue protocol: scrapy/scrapy#4783

Congratulations

As for distributed crawlers, I think the core is just friendly URL management with reasonable scheduling rules, with semaphores acting as the baton that directs them. This is a really great open source project; looking forward to it.

Or we could try to use Raft.

First feedback from users

Hi @Insutanto

Great! Your work is really impressive. However, I would like to add some suggestions.

First of all, I wanted to open the RabbitMQ management console (http://127.0.0.1:15672), but it didn't work.
I fixed it by entering the RabbitMQ container, enabling the rabbitmq_management plugin, and refreshing the console. The commands are:

docker exec -it <rabbitmq-container-id> /bin/bash
cd /etc/rabbitmq/
rabbitmq-plugins enable rabbitmq_management

Secondly, to delete the key from Redis, the commands are:

docker exec -it <redis-container-id> /bin/bash
redis-cli
keys *
del <key_name>

Thirdly, I added image and file download support to the rabbitmq example. The code is:

PATH: examples/rabbitmq_example/simple_example/settings.py

ITEM_PIPELINES = {
   'simple_example.pipelines.SimpleExamplePipeline': 201,
   'scrapy_distributed.pipelines.amqp.RabbitPipeline': 200,
   'simple_example.pipelines.ImagePipeline': 202,
   'simple_example.pipelines.MyFilesPipeline': 203,
}

FILES_STORE = './test_data/example_common/files_dir'
IMAGES_STORE = './test_data/example_common/images_dir'

PATH: examples/rabbitmq_example/simple_example/pipelines.py

from scrapy.pipelines.images import ImagesPipeline
from scrapy.pipelines.files import FilesPipeline
from scrapy.exceptions import DropItem
from scrapy.http import Request


class ImagePipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield Request(image_url, meta={'item': item, 'index': item['image_urls'].index(image_url)})

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        return item

    def file_path(self, request, response=None, info=None):
        item = request.meta['item']
        image_guid = request.url.split('/')[-1]
        filename = './{}/{}/{}'.format(item["url"].replace("/", "_"), item['title'], image_guid)
        return filename


class MyFilesPipeline(FilesPipeline):

    def get_media_requests(self, item, info):
        for file_url in item['file_urls']:
            yield Request(file_url, meta={'item': item, 'index': item['file_urls'].index(file_url)})

    def item_completed(self, results, item, info):
        file_paths = [x['path'] for ok, x in results if ok]
        if not file_paths:
            raise DropItem("Item contains no files")
        return item

    def file_path(self, request, response=None, info=None):
        item = request.meta['item']
        file_guid = request.url.split('/')[-1]
        filename = './{}/{}/{}'.format(item["url"].replace("/", "_"), item['title'], file_guid)
        return filename

PATH: examples/rabbitmq_example/simple_example/items.py

class CommonExampleItem(scrapy.Item):

    # define the fields for your item here like:
    title = scrapy.Field()
    url = scrapy.Field()
    content = scrapy.Field()
    image_urls = scrapy.Field()
    images = scrapy.Field()
    file_urls = scrapy.Field()
    files = scrapy.Field()

PATH: examples/rabbitmq_example/simple_example/spiders/example.py

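    # Note: this snippet assumes "import re", "from scrapy import Request" and the
    # CommonExampleItem defined above are imported at the top of the spider module.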
    def parse(self, response):
        self.logger.info(f"parse response, url: {response.url}")
        for link in response.xpath("//a/@href").extract():
            if not link.startswith('http'):
                link = response.url + link
            yield Request(url=link)
        item = CommonExampleItem()
        item['url'] = response.url
        item['title'] = response.xpath("//title/text()").extract_first()
        item["content"] = response.text

        image_urls = []
        for image_url in response.xpath('//a/img/@src').extract():
            if image_url.endswith(('jpg', 'png')):
                if not image_url.startswith('http'):
                    image_url = re.match("(.*?//.*?)/", response.url).group(1) + image_url
                    image_urls.append(image_url)
                else:
                    image_urls.append(image_url)
        item['image_urls'] = image_urls

        file_urls = []
        for file_url in response.xpath(
                r"//a[re:match(@href,'.*(\.docx|\.doc|\.xlsx|\.pdf|\.xls|\.zip)$')]/@href").extract():
            if not file_url.startswith('http'):
                file_url = re.match("(.*?//.*?)/", response.url).group(1) + file_url
                file_urls.append(file_url)
            else:
                file_urls.append(file_url)
        item['file_urls'] = file_urls

        yield item

Finally, I hope the author can add a dynamic web crawling tutorial. Thanks again!
