
A series of distributed components for Scrapy. Including RabbitMQ-based components, Kafka-based components, and RedisBloom-based components for Scrapy.

scrapy-distributed's Introduction

Scrapy-Distributed

Scrapy-Distributed is a series of components that let you develop a distributed crawler based on Scrapy in an easy way.

Scrapy-Distributed currently supports a RabbitMQ Scheduler, a Kafka Scheduler, and a RedisBloom DupeFilter. You can use any of them in your Scrapy project very easily.

Features

  • RabbitMQ Scheduler
    • Supports custom declaration of a RabbitMQ queue, with options such as passive, durable, exclusive, auto_delete, and all other queue options (see the pika sketch after this list).
  • RabbitMQ Pipeline
    • Supports custom declaration of a RabbitMQ queue for the spider's items, with the same options: passive, durable, exclusive, auto_delete, and so on.
  • Kafka Scheduler
    • Supports custom declaration of a Kafka topic, with options such as num_partitions and replication_factor; other options will be supported later.
  • RedisBloom DupeFilter
    • Supports customizing the key, errorRate, capacity, expansion, and auto-scaling (noScale) of a Bloom filter.
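The RabbitMQ declare options above are the standard arguments of an AMQP queue declaration. As an illustration only (this is plain pika, not Scrapy-Distributed's own API), here is a minimal sketch of what they mean; the queue name "example.requests" is hypothetical:

import pika

# Illustration only: plain pika, not Scrapy-Distributed's API.
# The queue name "example.requests" is hypothetical.
connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(
    queue="example.requests",
    passive=False,      # True: only check that the queue exists, don't create it
    durable=True,       # survive broker restarts
    exclusive=False,    # True: restrict the queue to this connection
    auto_delete=False,  # True: delete the queue when the last consumer disconnects
)
connection.close()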

Requirements

  • Python >= 3.6
  • Scrapy >= 1.8.0
  • Pika >= 1.0.0
  • RedisBloom >= 0.2.0
  • Redis >= 3.0.1
  • kafka-python >= 1.4.7

TODO

  • RabbitMQ Item Pipeline
  • Support Delayed Message in RabbitMQ Scheduler
  • Support Scheduler Serializer
  • Custom Interface for DupeFilter
  • RocketMQ Scheduler
  • RocketMQ Item Pipeline
  • SQLAlchemy Item Pipeline
  • Mongodb Item Pipeline
  • Kafka Scheduler
  • Kafka Item Pipeline

Usage

Step 0:

pip install scrapy-distributed

OR

git clone https://github.com/Insutanto/scrapy-distributed.git \
    && cd scrapy-distributed \
    && python setup.py install

There is a simple demo in examples/simple_example. Here is the quickest way to use Scrapy-Distributed.

If you don't have the required environment for the RabbitMQ example:

# pull and run a RabbitMQ container.
docker run -d --name rabbitmq -p 0.0.0.0:15672:15672 -p 0.0.0.0:5672:5672 rabbitmq:3-management

# pull and run a RedisBloom container.
docker run -d --name redisbloom -p 6379:6379 redis/redis-stack

cd examples/rabbitmq_example
python run_simple_example.py

Or you can use docker compose:

docker compose -f ./docker-compose.dev.yaml up -d
cd examples/rabbitmq_example
python run_simple_example.py

If you don't have the required environment for the Kafka example:

# make sure you have a Kafka running on localhost:9092
# pull and run a RedisBloom container.
docker run -d --name redisbloom -p 6379:6379 redis/redis-stack

cd examples/kafka_example
python run_simple_example.py

Or you can use docker compose:

docker compose -f ./docker-compose.dev.yaml up -d
cd examples/kafka_example
python run_simple_example.py

RabbitMQ Support

If you don't have the required environment for tests:

# pull and run a RabbitMQ container.
docker run -d --name rabbitmq -p 0.0.0.0:15672:15672 -p 0.0.0.0:5672:5672 rabbitmq:3-management

# pull and run a RedisBloom container.
docker run -d --name redisbloom -p 6379:6379 redis/redis-stack

Or you can use docker compose:

docker compose -f ./docker-compose.dev.yaml up -d

Step 1:

Just by changing SCHEDULER and DUPEFILTER_CLASS and adding some configs, you can get a distributed crawler in a moment.

SCHEDULER = "scrapy_distributed.schedulers.DistributedScheduler"
SCHEDULER_QUEUE_CLASS = "scrapy_distributed.queues.amqp.RabbitQueue"
RABBITMQ_CONNECTION_PARAMETERS = "amqp://guest:guest@localhost:5672/example/?heartbeat=0"
DUPEFILTER_CLASS = "scrapy_distributed.dupefilters.redis_bloom.RedisBloomDupeFilter"
BLOOM_DUPEFILTER_REDIS_URL = "redis://:@localhost:6379/0"
BLOOM_DUPEFILTER_REDIS_HOST = "localhost"
BLOOM_DUPEFILTER_REDIS_PORT = 6379
REDIS_BLOOM_PARAMS = {
    "redis_cls": "redisbloom.client.Client"
}
BLOOM_DUPEFILTER_ERROR_RATE = 0.001
BLOOM_DUPEFILTER_CAPACITY = 1_000_000

# disable the RedirectMiddleware, because the RabbitMiddleware can handle those redirect requests.
DOWNLOADER_MIDDLEWARES = {
    ...
    "scrapy.downloadermiddlewares.redirect.RedirectMiddleware": None,
    "scrapy_distributed.middlewares.amqp.RabbitMiddleware": 542
}

# add RabbitPipeline; it will push your items to RabbitMQ's queue.
ITEM_PIPELINES = {
    ...
    "scrapy_distributed.pipelines.amqp.RabbitPipeline": 301,
}
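
The BLOOM_DUPEFILTER_* settings above correspond to RedisBloom's Bloom-filter commands. As a rough sketch of what errorRate and capacity mean, using the redisbloom.client.Client named in REDIS_BLOOM_PARAMS (the key name "example:bloom" is hypothetical, not the dupefilter's real key):

from redisbloom.client import Client

# Hypothetical key name, used here only to illustrate the parameters.
rb = Client(host="localhost", port=6379)
rb.bfCreate("example:bloom", 0.001, 1_000_000)  # errorRate=0.001, capacity=1,000,000
print(rb.bfAdd("example:bloom", "https://example.com"))     # 1 on first insert
print(rb.bfExists("example:bloom", "https://example.com"))  # 1: probably seen before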


Step 2:

scrapy crawl <your_spider>
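
Since scheduling now goes through RabbitMQ, you can run the same scrapy crawl command on several machines pointed at the same broker and they will share the request queue. To check how many requests are waiting, here is a minimal sketch with pika; the queue name "example.requests" is hypothetical, so use whatever queue your RabbitQueue scheduler declares, and reuse your RABBITMQ_CONNECTION_PARAMETERS URL:

import pika

# Hypothetical queue name; use whatever queue your RabbitQueue scheduler declares.
QUEUE = "example.requests"

# Reuse the RABBITMQ_CONNECTION_PARAMETERS URL from settings.py.
connection = pika.BlockingConnection(
    pika.URLParameters("amqp://guest:guest@localhost:5672/example/?heartbeat=0")
)
channel = connection.channel()
# passive=True only inspects the queue (it must already exist) and reports its depth.
pending = channel.queue_declare(queue=QUEUE, passive=True).method.message_count
print(f"{QUEUE}: {pending} pending requests")
connection.close()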

Kafka Support

Step 1:

SCHEDULER = "scrapy_distributed.schedulers.DistributedScheduler"
SCHEDULER_QUEUE_CLASS = "scrapy_distributed.queues.kafka.KafkaQueue"
KAFKA_CONNECTION_PARAMETERS = "localhost:9092"
DUPEFILTER_CLASS = "scrapy_distributed.dupefilters.redis_bloom.RedisBloomDupeFilter"
BLOOM_DUPEFILTER_REDIS_URL = "redis://:@localhost:6379/0"
BLOOM_DUPEFILTER_REDIS_HOST = "localhost"
BLOOM_DUPEFILTER_REDIS_PORT = 6379
REDIS_BLOOM_PARAMS = {
    "redis_cls": "redisbloom.client.Client"
}
BLOOM_DUPEFILTER_ERROR_RATE = 0.001
BLOOM_DUPEFILTER_CAPACITY = 1_000_000

DOWNLOADER_MIDDLEWARES = {
    ...
   "scrapy_distributed.middlewares.kafka.KafkaMiddleware": 542
}

Step 2:

scrapy crawl <your_spider>
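
As with RabbitMQ, you can run scrapy crawl on several machines against the same Kafka brokers. To peek at the requests flowing through the topic, here is a minimal sketch with kafka-python; the topic name "example.requests" is hypothetical, and with no group_id the consumer commits no offsets, so it won't take messages away from the crawlers:

from kafka import KafkaConsumer

# Hypothetical topic name; use the topic your KafkaQueue scheduler declares.
TOPIC = "example.requests"

consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",  # start from the beginning of the topic
    consumer_timeout_ms=2000,      # stop iterating after 2s without new messages
)
for record in consumer:
    print(record.offset, record.value[:80])
consumer.close()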

Reference Projects

  • scrapy-rabbitmq-link
  • scrapy-redis


scrapy-distributed's Issues

dynamic web crawlers

Could you add support for crawling dynamic web pages to the project? It would need simulated clicks and handling of anti-crawler mechanisms. Thanks!

Implementation proposal

Hi @Insutanto

You're doing nice work in this repo. I have the same desire: different message queues should be supported in Scrapy.

Old implementations of this idea, and the one you have here, share a common disadvantage: for every type of queue you need to implement a separate scheduler. Besides the amount of work required, such implementations can't benefit from work done on improving scheduling. I am talking mostly about scrapy/scrapy#3520. The reason for going distributed (at least for me) is having a lot of domains in a single crawl. Not using DownloaderAwarePriorityQueue makes crawling slower (roughly 10 times slower) according to the benchmarks in the mentioned PR.

To overcome this, I developed and merged in scrapy/scrapy#3884 a separation between the scheduler's logic and the external message queue.

It would be great for your project and the Scrapy community if you switched from a scheduler-based to a queue-based approach.

More details and discussion can be found in scrapy/scrapy#4326. An example of such an implementation for Redis can be found at https://github.com/whalebot-helmsman/scrapy/blob/redis/scrapy/squeues.py#L101-L173.

There is also a PR for an external queue protocol: scrapy/scrapy#4783

Congratulations

As for distributed crawlers, I think the core is just friendly URL management with reasonable scheduling rules, with semaphores acting as the baton that directs them. This is a really great open source project; looking forward to it.

Or we could try to use Raft.

First feedback from users

Hi @Insutanto

Great! Your work is really impressive. However, I would like to add some suggestions.

First of all, I wanted to open the RabbitMQ management console (http://127.0.0.1:15672), but it didn't work.
I fixed it by entering the RabbitMQ container, enabling the rabbitmq_management plugin, and refreshing the console. The commands are:

docker exec -it <rabbitmq-container-id> /bin/bash
cd /etc/rabbitmq/
rabbitmq-plugins enable rabbitmq_management

Secondly, to delete the key from Redis, the commands are:

docker exec -it <redis-container-id> /bin/bash
redis-cli
keys *
del <key_name>

Thirdly, I added image and file download support to the rabbitmq example. The code is:

PATH: examples/rabbitmq_example/simple_example/settings.py

ITEM_PIPELINES = {
   'simple_example.pipelines.SimpleExamplePipeline': 201,
   'scrapy_distributed.pipelines.amqp.RabbitPipeline': 200,
   'simple_example.pipelines.ImagePipeline': 202,
   'simple_example.pipelines.MyFilesPipeline': 203,
}

FILES_STORE = './test_data/example_common/files_dir'
IMAGES_STORE = './test_data/example_common/images_dir'

PATH: examples/rabbitmq_example/simple_example/pipelines.py

from scrapy.pipelines.images import ImagesPipeline
from scrapy.pipelines.files import FilesPipeline
from scrapy.exceptions import DropItem
from scrapy.http import Request


class ImagePipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield Request(image_url, meta={'item': item, 'index': item['image_urls'].index(image_url)})

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        return item

    def file_path(self, request, response=None, info=None):
        item = request.meta['item']
        image_guid = request.url.split('/')[-1]
        filename = './{}/{}/{}'.format(item["url"].replace("/", "_"), item['title'], image_guid)
        return filename


class MyFilesPipeline(FilesPipeline):

    def get_media_requests(self, item, info):
        for file_url in item['file_urls']:
            yield Request(file_url, meta={'item': item, 'index': item['file_urls'].index(file_url)})

    def item_completed(self, results, item, info):
        file_paths = [x['path'] for ok, x in results if ok]
        if not file_paths:
            raise DropItem("Item contains no files")
        return item

    def file_path(self, request, response=None, info=None):
        item = request.meta['item']
        file_guid = request.url.split('/')[-1]
        filename = './{}/{}/{}'.format(item["url"].replace("/", "_"), item['title'], file_guid)
        return filename

PATH: examples/rabbitmq_example/simple_example/items.py

class CommonExampleItem(scrapy.Item):

    # define the fields for your item here like:
    title = scrapy.Field()
    url = scrapy.Field()
    content = scrapy.Field()
    image_urls = scrapy.Field()
    images = scrapy.Field()
    file_urls = scrapy.Field()
    files = scrapy.Field()

PATH: examples/rabbitmq_example/simple_example/spiders/example.py

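    # Note: this snippet assumes "import re", "from scrapy import Request" and the
    # CommonExampleItem defined above are imported at the top of the spider module.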
    def parse(self, response):
        self.logger.info(f"parse response, url: {response.url}")
        for link in response.xpath("//a/@href").extract():
            if not link.startswith('http'):
                link = response.url + link
            yield Request(url=link)
        item = CommonExampleItem()
        item['url'] = response.url
        item['title'] = response.xpath("//title/text()").extract_first()
        item["content"] = response.text

        image_urls = []
        for image_url in response.xpath('//a/img/@src').extract():
            if image_url.endswith(('jpg', 'png')):
                if not image_url.startswith('http'):
                    image_url = re.match("(.*?//.*?)/", response.url).group(1) + image_url
                    image_urls.append(image_url)
                else:
                    image_urls.append(image_url)
        item['image_urls'] = image_urls

        file_urls = []
        for file_url in response.xpath(
                r"//a[re:match(@href,'.*(\.docx|\.doc|\.xlsx|\.pdf|\.xls|\.zip)$')]/@href").extract():
            if not file_url.startswith('http'):
                file_url = re.match("(.*?//.*?)/", response.url).group(1) + file_url
                file_urls.append(file_url)
            else:
                file_urls.append(file_url)
        item['file_urls'] = file_urls

        yield item

Finally, I hope the author can add a dynamic web crawling tutorial. Thanks again!
