
spiderman's Introduction

Hi there 👋


spiderman's People

Contributors: dependabot[bot], turboway


spiderman's Issues

Error after running for a while

The spider crashes with the error below after running for some time, and then it stops and cannot continue:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/scrapy/core/downloader/middleware.py", line 44, in process_request
return (yield download_func(request=request, spider=spider))
twisted.web._newclient.ResponseNeverReceived: [<twisted.python.failure.Failure twisted.internet.error.ConnectionDone: Connection was closed cleanly.>]
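`ConnectionDone` means the remote server closed the connection cleanly, which often happens when the client is too aggressive or a keep-alive idles out. Scrapy's built-in RetryMiddleware already retries Twisted connection errors, so tuning retry and throttling settings is the usual first step. A sketch of settings that may help (the values are illustrative, not this project's shipped config):

```python
# settings.py sketch -- illustrative values, not spiderman's actual config.
RETRY_ENABLED = True                  # RetryMiddleware retries Twisted connection errors
RETRY_TIMES = 5                       # retry each failed request up to 5 times
DOWNLOAD_DELAY = 0.5                  # slow down: servers often drop overly fast clients
CONCURRENT_REQUESTS_PER_DOMAIN = 4    # fewer parallel connections per host
```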

Adding a second task with the same URL: the second task never runs

With EXTENSIONS set to empty and tasks added via ScheduledRequest, if two tasks use the same URL, the second one is never executed.

Tracing the second task, execution reaches the schedule function in Scrapy's engine.py and calls self.signals.send_catch_log; after that I lose track of where the code goes.

My guess is that some step deduplicates requests and filters out the second task.

Which file and which function do this filtering?

Spider configuration:
'EXTENSIONS': {},

    def get_callback(self, callback):
        # URL dedup flag: True = skip dedup (dont_filter), False = dedup
        callback_dt = {
            'list': (self.list_parse, True),
            'detail': (self.detail_parse, True),
        }
        return callback_dt.get(callback)
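Since this project sets DUPEFILTER_CLASS to scrapy_redis.dupefilter.RFPDupeFilter, the filtering happens when the scheduler enqueues a request: enqueue_request calls the dupefilter's request_seen, and a request whose fingerprint was already stored is silently dropped unless the Request was created with dont_filter=True. A simplified pure-Python sketch of that check-and-drop logic (a plain set stands in for the redis set the real RFPDupeFilter uses):

```python
import hashlib

# Simplified sketch of a fingerprint-based dupefilter. The real
# scrapy_redis.dupefilter.RFPDupeFilter keeps fingerprints in a redis set;
# here a local Python set stands in for it.
class ToyDupeFilter:
    def __init__(self):
        self.fingerprints = set()

    def request_seen(self, url: str) -> bool:
        fp = hashlib.sha1(url.encode("utf-8")).hexdigest()
        if fp in self.fingerprints:
            return True          # duplicate -> scheduler silently drops it
        self.fingerprints.add(fp)
        return False

df = ToyDupeFilter()
print(df.request_seen("http://example.com/a"))  # False: first task is enqueued
print(df.request_seen("http://example.com/a"))  # True: second task is dropped
```

To let the same URL run twice, either set dont_filter=True on the Request or clear the dupefilter's redis key between runs.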

Kafka monitoring script crashes with StopIteration

Traceback (most recent call last):
File "C:\Users\chengzhang\AppData\Local\Programs\Python\Python39\lib\site-packages\kafka\consumer\fetcher.py", line 445, in _message_generator
raise StopIteration('Subscription needs partition assignment')
StopIteration: Subscription needs partition assignment

Does consumer.subscribe() need extra arguments? In the example screenshot, topics=['zhifang'] is the only argument.
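The extra argument is not the problem: this StopIteration is raised by kafka-python's message iterator when the consumer has no partition assignment yet (for example during a consumer-group rebalance). Using poll() instead of the iterator avoids it, because poll() simply returns an empty batch until partitions are assigned. A hedged sketch (the broker address and group id are assumptions, not this project's config):

```python
import json

def decode_record(value: bytes) -> dict:
    # helper: the monitored messages appear to be JSON-encoded
    return json.loads(value.decode("utf-8"))

if __name__ == "__main__":
    # Hedged sketch: 'localhost:9092' and 'sp-monitor' are assumptions.
    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        'zhifang',                           # subscribe via the constructor
        bootstrap_servers='localhost:9092',  # assumed broker address
        group_id='sp-monitor',               # assumed consumer group
    )
    while True:
        # poll() waits out partition assignment instead of raising StopIteration
        batches = consumer.poll(timeout_ms=1000)
        for records in batches.values():
            for record in records:
                print(decode_record(record.value))
```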

Bloom filter enabled, but the database contains duplicates

The bloom filter is enabled (screenshot), per-spider concurrency is 5, and URL dedup is configured (screenshot).

After starting a job with 5 spiders and 5 pages, the database contains duplicate rows, so the filtering apparently failed. On a single physical machine, results are only correct when one spider runs.
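"Correct with one spider, duplicates with five" is the classic symptom of a non-atomic check-then-add: if the bloom membership test and the bit-setting are two separate operations, two concurrent spiders can both observe a URL as "new" before either marks it. The fix is to test and set in one atomic step (with redis, a Lua script or a pipeline under a lock). A toy in-process bloom filter illustrating the atomic `seen_and_add` shape (a sketch, not this project's actual filter):

```python
import hashlib

class ToyBloom:
    """Tiny bloom-filter sketch: one bit array, k hash positions per item."""

    def __init__(self, size=1 << 16, hashes=4):
        self.size = size
        self.hashes = hashes
        self.bits = bytearray(size // 8)

    def _positions(self, item: str):
        for i in range(self.hashes):
            digest = hashlib.sha1(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def seen_and_add(self, item: str) -> bool:
        # Test all bits AND set them in a single call, so two callers
        # cannot both observe "not seen" for the same item.
        seen = True
        for pos in self._positions(item):
            byte, bit = divmod(pos, 8)
            if not self.bits[byte] & (1 << bit):
                seen = False
            self.bits[byte] |= 1 << bit
        return seen

bf = ToyBloom()
print(bf.seen_and_add("http://example.com/1"))  # False: first sighting
print(bf.seen_and_add("http://example.com/1"))  # True: filtered
```

With separate check() followed by add() calls, five concurrent spiders can all pass check() inside the race window, which matches the behaviour described above.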

About crawling the demo

2023-05-07 23:35:36 [spiderman.model.standalone] ERROR: spider run failed: 2023-05-07 23:35:36 [scrapy.utils.log] INFO: Scrapy 2.6.2 started (bot: SP)
2023-05-07 23:35:36 [scrapy.utils.log] INFO: Versions: lxml 4.9.1.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.21.0, Twisted 22.10.0, Python 3.8.5 (default, Sep 3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 23.1.1 (OpenSSL 3.0.8 7 Feb 2023), cryptography 39.0.1, Platform Windows-10-10.0.19041-SP0
2023-05-07 23:35:36 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'SP',
'DUPEFILTER_CLASS': 'scrapy_redis.dupefilter.RFPDupeFilter',
'LOG_FILE': 'D:/GitHub/spiderman/SP_log/20230507/zhifang.log',
'LOG_LEVEL': 'INFO',
'NEWSPIDER_MODULE': 'SP.spiders',
'RETRY_HTTP_CODES': [500, 502, 503, 504, 522, 524, 408, 400, 403, 404],
'RETRY_TIMES': 3,
'SCHEDULER': 'scrapy_redis.scheduler.Scheduler',
'SPIDER_MODULES': ['SP.spiders'],
'USER_AGENT': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, '
'like Gecko) Chrome/62.0.3202.75 Safari/537.36'}
Unhandled error in Deferred:

Hello! How should I fix this error when running the demo?

How can the time between the crawl command and the first request be shortened?

I wrapped spiderman with an external API: on receiving a request it runs a crawl and returns the result. Testing shows the end-to-end time is too long: the target is under 500 ms, but it currently takes around 5 seconds, and I am not sure why.

Where the crawl command is launched:
[SP_JOBS>>job.py>>SPJob>>run]
p = subprocess.Popen(cmd, shell=True)

Where the Request is created:
[SP>>spiders>>SPRedisSpider.py>>SPRedisSpider>>make_request_from_data]
return Request(**params)

Profiling shows roughly 2-4 seconds elapse between launching the crawl command and the first request. Can this time be reduced?
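Most of those 2-4 seconds are fixed per-process startup cost: subprocess.Popen spawns a fresh Python interpreter, which then imports scrapy and twisted, starts the reactor, and connects to redis before the first Request can be built. You can measure the floor yourself by timing a bare interpreter spawn (illustration only):

```python
import subprocess
import sys
import time

def spawn_seconds(code: str) -> float:
    """Time how long a fresh Python subprocess takes to run `code`."""
    start = time.perf_counter()
    subprocess.run([sys.executable, "-c", code], check=True)
    return time.perf_counter() - start

print(f"bare interpreter: {spawn_seconds('pass'):.3f}s")
# Importing scrapy in the child typically adds another second or more:
#   spawn_seconds("import scrapy")  # only if scrapy is installed
```

This per-crawl floor cannot be compressed much; it only disappears if the spider process is kept alive and fed new tasks while running, instead of being relaunched per request.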

Demo runs but crawls nothing

I checked the logs and the code. On startup the redis start URL is added correctly, but after the spider starts, make_request_from_data never seems to be called, so nothing is crawled. Could you take a look?

Also, installing from requirements.txt fails whether or not I use a virtual environment.

Python 3.10
Manjaro Linux
Command: python SP_JOBS/zhifang_job.py

How to start all spiders first, then dispatch URLs to individual spiders

The current workflow is make_job first, then generate the launch command scrapy crawl xxx, which starts the spider and begins crawling.

As a result, every crawl has to wait for spider startup.

That startup takes about 5 seconds, which cannot meet real-time crawling needs.

I wrapped spiderman with a Sanic REST API: on each request it runs job = xxx_job, job.make_job, job.crawl, i.e. it starts a spider and then crawls. That startup is the killer; it simply takes too long.

I have spent several days on this, but could not work out how to start the spiders first and have them receive tasks afterwards.

Could you suggest a way to restructure this: start all spiders once, have them wait for tasks, and begin crawling and returning results the moment a task arrives?

Thanks!
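scrapy-redis is built for exactly this pattern: a spider started once blocks on its redis key and calls make_request_from_data whenever new data is pushed, so the API layer only needs an LPUSH, with no process launch on the request path. A hedged sketch of the producer side (the key name "zhifang:start_urls", the payload shape, and the redis address are assumptions to match against what make_request_from_data actually expects):

```python
import json

def build_task(url: str, callback: str = "list") -> str:
    # The payload shape is an assumption; match the fields that
    # make_request_from_data parses in this project.
    return json.dumps({"url": url, "callback": callback})

if __name__ == "__main__":
    # Hedged sketch: key name and redis address are assumptions.
    import redis

    r = redis.Redis(host="localhost", port=6379)
    # An already-running spider blocked on this list wakes immediately.
    r.lpush("zhifang:start_urls", build_task("http://example.com/page/1"))
```

With this split, the 5-second startup is paid once at deployment time; each API request only costs one redis round-trip plus the crawl itself.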

Converting to a Scrapy request fails

At runtime make_request_from_data is never called and Scrapy never starts, even though the data is present in redis. What is going on?

How to run multiple spiders in one process

Currently python SP_JOBS/zhifang_job.py starts only one kind of spider; running several requires launching multiple processes. Two questions:

  1. Can they run inside a single process?
  2. Of the two approaches, which is more efficient?
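Scrapy's CrawlerProcess can schedule several spiders on one Twisted reactor. Efficiency-wise, one process shares one event loop and the GIL, so CPU-heavy parsing favours separate processes, while mostly I/O-bound crawling is fine in one. A hedged sketch using Scrapy's public API (the spider names are placeholders standing in for this project's spiders):

```python
def spider_names():
    # Placeholder names; substitute the spiders registered in SP.spiders.
    return ["zhifang", "another_spider"]

if __name__ == "__main__":
    # Hedged sketch using Scrapy's public CrawlerProcess API.
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    process = CrawlerProcess(get_project_settings())
    for name in spider_names():
        process.crawl(name)   # schedule each spider on the shared reactor
    process.start()           # blocks until every spider finishes
```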

Question about using cookies

Problem: while learning to crawl Steam I ran into the age-verification gate, and the usual fix is to send cookies. But configuring them in the framework seems to have no effect. Is my format wrong? (screenshot)

How should the cookie format and the cookie pool be set up?

def __init__(self):
    super().__init__(spider_name=zhifang_Spider.name)
    self.delete()  # for dedup / incremental crawling, comment out this line
    self.headers = {
        # custom request headers go here if the site has anti-crawling measures
    }
    self.cookies = (
        # for multi-account crawling, put multiple cookie strings here
    )
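One common pitfall: Scrapy's Request(cookies=...) expects a dict (or a list of dicts), while a browser's "copy cookies" gives a single "name=value; name2=value2" string. Converting the string and rotating over several accounts is one way to build a cookie pool (a sketch with hypothetical helper names; the sessionid/birthtime values are illustrative):

```python
from itertools import cycle

def cookie_string_to_dict(raw: str) -> dict:
    """Turn 'a=1; b=2' (browser copy-paste format) into {'a': '1', 'b': '2'}."""
    out = {}
    for part in raw.split(";"):
        if "=" in part:
            name, _, value = part.strip().partition("=")
            out[name] = value
    return out

# A tiny cookie pool: rotate through multiple account cookie strings.
pool = cycle([
    cookie_string_to_dict("sessionid=abc; birthtime=568022401"),
    cookie_string_to_dict("sessionid=def; birthtime=568022401"),
])
print(next(pool))  # {'sessionid': 'abc', 'birthtime': '568022401'}
```

Each scheduled Request can then take `cookies=next(pool)`, spreading traffic across accounts.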

Two warnings from the framework

1. On startup:
[py.warnings] WARNING: /home/donney/.local/lib/python3.10/site-packages/scrapy/utils/request.py:232: ScrapyDeprecationWarning: '2.6' is a deprecated value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting.

It is also the default value. In other words, it is normal to get this warning if you have not defined a value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting. This is so for backward compatibility reasons, but it will change in a future version of Scrapy.

See the documentation of the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting for information on how to handle this deprecation.
return cls(crawler)

2. While running:
2023-02-17 15:30:32 [py.warnings] WARNING: /home/donney/.local/lib/python3.10/site-packages/scrapy_redis/spiders.py:197: ScrapyDeprecationWarning: Passing a 'spider' argument to ExecutionEngine.crawl is deprecated
self.crawler.engine.crawl(req, spider=self)

Could you take a look? I tried to fix these without success.
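The first warning is informational, as its own text says: Scrapy 2.6+ defaults REQUEST_FINGERPRINTER_IMPLEMENTATION to '2.6' for backward compatibility. It can be silenced by opting into the newer algorithm, with the caveat that scrapy_redis persists fingerprints, so any existing dupefilter state in redis stops matching after the switch. The second warning comes from inside scrapy_redis itself and goes away only with a scrapy_redis upgrade, not a settings change. Settings sketch:

```python
# settings.py -- opt into the newer fingerprint algorithm (Scrapy >= 2.6).
# Caveat: fingerprints already stored in redis by scrapy_redis become stale
# after this change, so expect some re-crawling of previously seen URLs.
REQUEST_FINGERPRINTER_IMPLEMENTATION = '2.7'
```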

Question about using Splash

Background: the Steam spider gets stuck on the age-verification page.
Process: after a lot of research, I found that rendering the page with Splash and running a Lua script solves it. (screenshot)

But when I moved the code into the framework it had no effect and produced this warning:
2023-02-17 17:27:27 [py.warnings] WARNING: /home/donney/.local/lib/python3.10/site-packages/scrapy_redis/dupefilter.py:115: ScrapyDeprecationWarning: Call to deprecated function scrapy.utils.request.request_fingerprint().

If you are using this function in a Scrapy component, and you are OK with users of your component changing the fingerprinting algorithm through settings, use crawler.request_fingerprinter.fingerprint() instead in your Scrapy component (you can get the crawler object from the 'from_crawler' class method).

Otherwise, consider using the scrapy.utils.request.fingerprint() function instead.

Either way, the resulting fingerprints will be returned as bytes, not as a string, and they will also be different from those generated by 'request_fingerprint()'. Before you switch, make sure that you understand the consequences of this (e.g. cache invalidation) and are OK with them; otherwise, consider implementing your own function which returns the same fingerprints as the deprecated 'request_fingerprint()' function.
return request_fingerprint(request)
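The warning itself is harmless (it is scrapy_redis calling a deprecated fingerprint helper), but "no effect" usually means the scrapy-splash middlewares are not wired in. The wiring below follows the scrapy-splash README; SPLASH_URL is an assumption for a local instance, and note that spiderman already sets DUPEFILTER_CLASS to scrapy_redis's RFPDupeFilter, so the two dupefilter settings need reconciling rather than blind overwriting:

```python
# settings.py -- typical scrapy-splash wiring per its README.
SPLASH_URL = 'http://localhost:8050'   # assumed local Splash instance
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
# Without a Splash-aware dupefilter, fingerprints ignore Splash arguments and
# rendered pages are deduplicated incorrectly. This conflicts with the
# project's scrapy_redis dupefilter, so the two must be reconciled.
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
```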

No matching numpy version found

Installing the dependencies per the instructions fails because pip cannot find the pinned numpy version:

Could not find a version that satisfies the requirement numpy==1.18.4

About the proxy middleware

After enabling the proxy middleware and entering the username and password in the file, every request returns 407. (screenshot)
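HTTP 407 is "Proxy Authentication Required": the proxy never received valid credentials. In Scrapy, credentials embedded in request.meta['proxy'] as http://user:pass@host:port are picked up by the built-in HttpProxyMiddleware; alternatively the Proxy-Authorization header can be set explicitly with HTTP basic auth. A stdlib-only sketch of building that header (user/pass and the proxy address are placeholders):

```python
import base64

def proxy_auth_header(user: str, password: str) -> str:
    """Build the HTTP basic-auth value for the Proxy-Authorization header."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"

print(proxy_auth_header("user", "pass"))  # Basic dXNlcjpwYXNz

# In a Scrapy downloader middleware (sketch; address is a placeholder):
#   request.meta['proxy'] = 'http://proxy.example.com:8080'
#   request.headers['Proxy-Authorization'] = proxy_auth_header('user', 'pass')
```

If 407 persists with correct credentials, check whether the provider whitelists by source IP instead of (or in addition to) user/password.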

Wrong INFO output in distributed mode

I configured one slave and crawling succeeds, but the PyCharm console keeps printing "slave: localhost.localdomain spider run succeeded". When debugging, the SSH hostname really is the slave host I configured, yet msg_out shows localhost. Why?
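"localhost.localdomain" is most likely just what the slave's operating system reports as its own hostname: if the status message is built on the slave with something like socket.gethostname(), it prints the OS-level name, not the host string from your SSH configuration. Setting the slave's hostname properly (e.g. hostnamectl set-hostname slave-01) or logging the configured host string would fix the label. A sketch (how the configured host is threaded through is an assumption about spiderman's internals):

```python
import socket

def node_label(configured_host=None):
    # Prefer the host string from the cluster config (assumption about how
    # spiderman could pass it through); fall back to the OS-reported name,
    # which is where "localhost.localdomain" comes from.
    return configured_host or socket.getfqdn()

print(node_label("slave-01"))  # slave-01
print(node_label())            # whatever the local OS reports
```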
