turboway / spiderman
A general-purpose distributed crawler framework based on scrapy-redis
License: MIT License
After running for a while, the spider raises the error below and then stops and cannot continue:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/scrapy/core/downloader/middleware.py", line 44, in process_request
return (yield download_func(request=request, spider=spider))
twisted.web._newclient.ResponseNeverReceived: [<twisted.python.failure.Failure twisted.internet.error.ConnectionDone: Connection was closed cleanly.>]
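Not an official answer from the maintainer, but for context: `ResponseNeverReceived` subclasses twisted's `ResponseFailed`, which Scrapy's built-in RetryMiddleware retries by default, so a crawl that halts entirely usually points at retry or timeout settings. A minimal settings sketch (the values are illustrative, not recommendations):

```python
# Illustrative retry-related settings for settings.py / custom_settings.
# Values are examples only, not maintainer recommendations.
custom_settings = {
    'RETRY_ENABLED': True,    # let RetryMiddleware retry failed downloads
    'RETRY_TIMES': 5,         # retry each request up to 5 times
    'DOWNLOAD_TIMEOUT': 30,   # fail slow connections instead of hanging
    'DOWNLOAD_DELAY': 0.5,    # slow down to reduce server-side disconnects
}
```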
With EXTENSIONS set to empty and tasks added via ScheduledRequest, if two tasks use the same URL, the second one never runs.
Tracing the second task, execution reaches the schedule function in Scrapy's engine.py, which calls self.signals.send_catch_log.
After that I can't tell where the code goes.
My guess is that some step has a dedup filter that drops the second task.
Which file and which function does this filtering happen in?
The spider settings are as follows:
'EXTENSIONS': {},
def get_callback(self, callback):
    # URL dedup setting: True = don't dedup, False = dedup
    callback_dt = {
        'list': (self.list_parse, True),
        'detail': (self.detail_parse, True),
    }
    return callback_dt.get(callback)
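For context, the second element of each tuple appears to map to the Request's `dont_filter` flag (True bypasses scrapy-redis's RFPDupeFilter, so identical URLs are not dropped). A minimal stdlib sketch of how the tuple might be unpacked when building a request — the names, URL, and the `False` case are hypothetical:

```python
# Hypothetical sketch of unpacking get_callback's (callback, dont_filter) tuple.
callback_dt = {
    'list': ('list_parse', True),      # True -> dont_filter=True, URL not deduped
    'detail': ('detail_parse', False), # False -> duplicate URLs are filtered out
}

callback, dont_filter = callback_dt['list']
# These kwargs would then be passed on, e.g. scrapy.Request(**request_params).
request_params = {
    'url': 'https://example.com/list',  # hypothetical URL
    'callback': callback,
    'dont_filter': dont_filter,
}
```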
Traceback (most recent call last):
File "C:\Users\chengzhang\AppData\Local\Programs\Python\Python39\lib\site-packages\kafka\consumer\fetcher.py", line 445, in _message_generator
raise StopIteration('Subscription needs partition assignment')
StopIteration: Subscription needs partition assignment
Does consumer.subscribe() need additional arguments? In the example screenshot there is also only this one argument, topics=['zhifang',].
2023-05-07 23:35:36 [spiderman.model.standalone] ERROR: spider run failed: 2023-05-07 23:35:36 [scrapy.utils.log] INFO: Scrapy 2.6.2 started (bot: SP)
2023-05-07 23:35:36 [scrapy.utils.log] INFO: Versions: lxml 4.9.1.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.21.0, Twisted 22.10.0, Python 3.8.5 (default, Sep 3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 23.1.1 (OpenSSL 3.0.8 7 Feb 2023), cryptography 39.0.1, Platform Windows-10-10.0.19041-SP0
2023-05-07 23:35:36 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'SP',
'DUPEFILTER_CLASS': 'scrapy_redis.dupefilter.RFPDupeFilter',
'LOG_FILE': 'D:/GitHub/spiderman/SP_log/20230507/zhifang.log',
'LOG_LEVEL': 'INFO',
'NEWSPIDER_MODULE': 'SP.spiders',
'RETRY_HTTP_CODES': [500, 502, 503, 504, 522, 524, 408, 400, 403, 404],
'RETRY_TIMES': 3,
'SCHEDULER': 'scrapy_redis.scheduler.Scheduler',
'SPIDER_MODULES': ['SP.spiders'],
'USER_AGENT': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, '
'like Gecko) Chrome/62.0.3202.75 Safari/537.36'}
Unhandled error in Deferred:
Hello author! How should I resolve this error when running the demo?
I wrapped spiderman with an external API: on receiving a request it runs a crawl and returns the results.
While testing I found the time from request to response is too long. I expect under 500 ms, but it currently stretches to about 5 seconds, and I don't know what's causing it.
Where the crawl command is issued:
[SP_JOBS>>job.py>>SPJob>>run]
p = subprocess.Popen(cmd, shell=True)
Where the Request is built:
[SP>>spiders>>SPRedisSpider.py>>SPRedisSpider>>make_request_from_data]
return Request(**params)
During a crawl, the time from issuing the spider command to the first request is about 2-4 seconds. Can this be shortened?
The current task flow is: make_job first, then generate the launch command scrapy crawl xxx, start the spider, and begin crawling.
So every crawl has to wait for the spider to start up first.
This startup takes about 5 seconds, which can't meet the need for on-demand crawling.
I wrapped spiderman with a Sanic RESTful API: on each API request I run job = xxx_job, job.make_job, job.crawl, which is the process of starting the spider and beginning the crawl. This startup is the killer; it takes far too long.
I've been digging into this for days, but with my limited skill I haven't managed to start the spiders first and feed them tasks afterwards.
Could you suggest a way to restructure this: start all spiders once and have them wait for tasks, then begin crawling and return results as soon as a task arrives?
Thanks!
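For what it's worth, scrapy-redis spiders are designed around exactly this model: the spider process stays alive and idles on its redis_key list, and pushing a serialized task onto that key triggers a crawl immediately, with no per-request startup cost. The pattern can be sketched with stdlib queues standing in for Redis (all names are hypothetical, and the "crawl" is a stand-in):

```python
import queue
import threading

task_queue = queue.Queue()  # stands in for the spider's redis_key list
results = []

def spider_worker():
    # Long-running worker: starts once, then blocks waiting for tasks,
    # like a scrapy-redis spider idling on its redis_key.
    while True:
        task = task_queue.get()  # blocks until a task is pushed
        if task is None:         # sentinel to shut down
            break
        results.append(f"crawled:{task}")  # stand-in for the actual crawl
        task_queue.task_done()

worker = threading.Thread(target=spider_worker)
worker.start()

# The API layer only pushes tasks; no process startup on the request path.
task_queue.put("https://example.com/page1")
task_queue.put("https://example.com/page2")
task_queue.join()     # wait until both tasks are processed
task_queue.put(None)  # shut the worker down
worker.join()
```

With real scrapy-redis, the API layer would lpush the task payload to the spider's redis_key instead of calling job.crawl on every request.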
It can store more information than the CSV format, and reading and writing are also faster than CSV.
At runtime, make_request_from_data is never called and Scrapy doesn't start. What could cause this? Redis does contain data.
If I remove some of the program's features, such as the Kafka part, will it affect the overall operation?
Currently, running python SP_JOBS/zhifang_job.py only starts one type of spider.
To start several types, I have to run multiple processes.
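One workaround, assuming the job scripts can each run independently: launch every SP_JOBS script as its own child process from a single driver script. A sketch (the script names are hypothetical placeholders):

```python
import subprocess
import sys

def launch_all(scripts, python=sys.executable):
    # Start one child process per job script so several spider
    # types run in parallel from a single driver.
    return [subprocess.Popen([python, s]) for s in scripts]

# Hypothetical job scripts; replace with your actual SP_JOBS entries, e.g.:
# procs = launch_all(['SP_JOBS/zhifang_job.py', 'SP_JOBS/other_job.py'])
# for p in procs:
#     p.wait()
```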
The issue is as follows:
def __init__(self):
    super().__init__(spider_name=zhifang_Spider.name)
    self.delete()  # for dedup / incremental crawling, comment out this line
    self.headers = {
        # if the site has anti-bot measures, customize request headers here
    }
    self.cookies = (
        # for multi-account crawling, customize multiple cookie strings here
    )
1. At startup:
[py.warnings] WARNING: /home/donney/.local/lib/python3.10/site-packages/scrapy/utils/request.py:232: ScrapyDeprecationWarning: '2.6' is a deprecated value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting.
It is also the default value. In other words, it is normal to get this warning if you have not defined a value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting. This is so for backward compatibility reasons, but it will change in a future version of Scrapy.
See the documentation of the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting for information on how to handle this deprecation.
return cls(crawler)
2. While running:
2023-02-17 15:30:32 [py.warnings] WARNING: /home/donney/.local/lib/python3.10/site-packages/scrapy_redis/spiders.py:197: ScrapyDeprecationWarning: Passing a 'spider' argument to ExecutionEngine.crawl is deprecated
self.crawler.engine.crawl(req, spider=self)
Could you please take a look? I tried to fix it but couldn't.
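On warning 1, the message itself says it is expected when the setting is unset. If you want to silence it by opting into the newer fingerprinting, a settings sketch follows — but be aware this changes request fingerprints, so dedup keys stored by earlier runs in Redis will no longer match:

```python
# settings.py - opt into the newer request-fingerprinting implementation.
# Note: changing this invalidates fingerprints stored by previous runs.
REQUEST_FINGERPRINTER_IMPLEMENTATION = '2.7'
```

Warning 2 is emitted from inside scrapy_redis/spiders.py itself, so it is harmless to your spider code and would likely only go away with a scrapy_redis upgrade.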
Background: the Steam spider used to get stuck on the age-verification page.
Process: after researching extensively, I found that rendering the gated page with Splash and running a Lua script solves it.
But when I tried moving that code into the framework, it had no effect and produced warnings:
2023-02-17 17:27:27 [py.warnings] WARNING: /home/donney/.local/lib/python3.10/site-packages/scrapy_redis/dupefilter.py:115: ScrapyDeprecationWarning: Call to deprecated function scrapy.utils.request.request_fingerprint().
If you are using this function in a Scrapy component, and you are OK with users of your component changing the fingerprinting algorithm through settings, use crawler.request_fingerprinter.fingerprint() instead in your Scrapy component (you can get the crawler object from the 'from_crawler' class method).
Otherwise, consider using the scrapy.utils.request.fingerprint() function instead.
Either way, the resulting fingerprints will be returned as bytes, not as a string, and they will also be different from those generated by 'request_fingerprint()'. Before you switch, make sure that you understand the consequences of this (e.g. cache invalidation) and are OK with them; otherwise, consider implementing your own function which returns the same fingerprints as the deprecated 'request_fingerprint()' function.
return request_fingerprint(request)
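For context on the Splash side (this is from scrapy-splash's usual setup, not from this repo): scrapy-splash is wired in through settings plus SplashRequest objects carrying the Lua script. A hedged configuration sketch, assuming a local Splash instance; note that this project already sets DUPEFILTER_CLASS to scrapy_redis's RFPDupeFilter, so combining the two dupefilters needs care, and the deprecation warnings above come from scrapy_redis's dupefilter, not from Splash:

```python
# Illustrative scrapy-splash settings (assumes Splash on localhost:8050).
SPLASH_URL = 'http://localhost:8050'
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
```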
Installing the dependencies per the instructions, a matching numpy version can't be found:
Could not find a version that satisfies the requirement numpy==1.18.4
Python: v3.8.6
OS: OSX 11.0.1
I set up a Slave and the crawl succeeds, but the PyCharm console keeps printing "slave: localhost.localdomain spider run succeeded". When debugging, the SSH hostname I connect to is indeed the slave host I configured, yet msg_out shows localhost. Why is that?