xwang20 / weibocrawler
A cookie-free Weibo crawler that can continuously crawl the profile information, posts, and post comments and reposts of one or more Sina Weibo users.
License: MIT License
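Hello, I found that when crawling Weibo comments: (1) not all comments on a post are retrieved; (2) many blank rows are returned; (3) the result (that is, the number of comments crawled) differs from run to run. I suspect the comment API returns an empty response with some probability, so a request has to be repeated several times before it yields a result. Have you considered this problem?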
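If the poster's guess is right and the comment endpoint sometimes answers with an empty payload, one way to work around it is to repeat the request a bounded number of times. The following is only a minimal sketch under that assumption: `fetch_with_retry`, its parameters, and the idea that a usable response carries a non-empty `data` field are all hypothetical and not taken from the project.

```python
import time
from typing import Callable, Optional


def fetch_with_retry(
    fetch: Callable[[], Optional[dict]],
    max_attempts: int = 5,
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[dict]:
    """Call fetch() until it returns a payload with a non-empty 'data' field.

    Assumption: an "empty" response is one whose decoded JSON has no
    'data' key (or an empty one). Returns None if every attempt fails.
    """
    for attempt in range(1, max_attempts + 1):
        payload = fetch()
        if payload and payload.get("data"):
            return payload
        if attempt < max_attempts:
            # Linear backoff: wait a little longer after each failure.
            sleep(delay * attempt)
    return None
```

In a Scrapy spider the same idea would normally be expressed by re-yielding the request with `dont_filter=True` and a retry counter in `meta`, rather than by blocking with `time.sleep`.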
Hello! I ran into the following problem when running the repost crawl command:
2022-03-03 14:53:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://m.weibo.cn/api/statuses/repostTimeline?id=4742865873797609&page=1> (referer: None)
2022-03-03 14:53:17 [scrapy.core.scraper] ERROR: Spider error processing <GET https://m.weibo.cn/api/statuses/repostTimeline?id=4742865873797609&page=1> (referer: None)
Traceback (most recent call last):
  File "C:\Users\17243\Desktop\WeiboCrawler\venv\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
    yield next(it)
  File "C:\Users\17243\Desktop\WeiboCrawler\venv\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 30, in process_spider_output
    for x in result:
  File "C:\Users\17243\Desktop\WeiboCrawler\venv\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "C:\Users\17243\Desktop\WeiboCrawler\venv\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "C:\Users\17243\Desktop\WeiboCrawler\venv\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "C:\Users\17243\Desktop\WeiboCrawler\WeiboCrawler\spiders\repost.py", line 23, in parse
    all_page = js['data']['max']
KeyError: 'data'
My initial analysis is that this happens because login is required. How can it be solved?
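Whatever the root cause turns out to be, the traceback shows that `parse` indexes `js['data']['max']` without checking that `data` is present. A hedged sketch of a more defensive lookup is below; `extract_max_page` is a hypothetical helper, and the shape of the response when `data` is missing is an assumption, since the post does not show the response body.

```python
import json
from typing import Optional


def extract_max_page(body: str) -> Optional[int]:
    """Return data.max from a repostTimeline response, or None if absent.

    Assumption: a response without a usable 'data' object (for example,
    when the endpoint refuses the request) should be treated as
    "no pages" instead of raising KeyError.
    """
    try:
        js = json.loads(body)
    except ValueError:
        # The body was not valid JSON at all.
        return None
    data = js.get("data") if isinstance(js, dict) else None
    if not isinstance(data, dict):
        return None
    max_page = data.get("max")
    return max_page if isinstance(max_page, int) else None
```

A spider could call this helper with `response.text`, log the raw body when it returns `None`, and skip the item. That would at least make the actual server message visible.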
Does the keyword configuration still work? Following the instructions, I restored the commented-out code and set keyword, but it still has no effect.
Could you add an option to save the results directly to CSV? As a beginner, I find databases hard to work with.
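Scrapy's built-in feed export can already write CSV from the command line (`scrapy crawl <spider> -o out.csv`), provided the spider yields items. For finer control, an item pipeline can write rows itself. The sketch below is an assumption-laden illustration, not part of the project: `CsvExportPipeline` and the default file name `items.csv` are hypothetical, and the column set is taken from the first item seen.

```python
import csv


class CsvExportPipeline:
    """Hypothetical Scrapy item pipeline that appends each item to a CSV file."""

    def __init__(self, path="items.csv"):
        self.path = path
        self.file = None
        self.writer = None

    def open_spider(self, spider):
        # utf-8-sig adds a BOM so that Excel detects the encoding of Chinese text.
        self.file = open(self.path, "w", newline="", encoding="utf-8-sig")

    def process_item(self, item, spider):
        row = dict(item)
        if self.writer is None:
            # Columns are fixed by the first item; later extra keys are ignored
            # and missing keys are written as empty strings.
            self.writer = csv.DictWriter(
                self.file, fieldnames=list(row.keys()), extrasaction="ignore"
            )
            self.writer.writeheader()
        self.writer.writerow(row)
        return item

    def close_spider(self, spider):
        if self.file:
            self.file.close()
```

To try it, the class would be registered under `ITEM_PIPELINES` in the project's settings, in place of or alongside the database pipeline.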
Is it possible to crawl only the images in comments and name the files by date?
While installing the requirements:
C:\weibo\WeiboCrawler-main\WeiboCrawler>python -m pip install -r requirements.txt
WARNING: Ignoring invalid distribution -qdm (c:\miniconda3\lib\site-packages)
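Hello, why is only one comment under the post crawled after I change the Weibo ID? What else needs to be modified? I'm a beginner and can't follow the code.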
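The ID of the reposter or commenter is already saved. How can I modify the code so that the reposter's or commenter's nickname is saved as well?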
Hello, I want to capture the exact time of each comment when crawling comments. Following code I found online, I added this line to comment.py:
commentItem['created_at']= standardize_date(comment['created_at']).strftime('%Y-%m-%d %H:%M:%S')
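Some results now include hours, minutes, and seconds, but they are all 00; other results still contain only the year, month, and day. I don't know what went wrong. Any help would be appreciated 🙏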
Should I change it to true in the settings file and then set cookie=xxxx.....? I'm new to coding and would appreciate an answer.
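The weibo_id is a string such as Hd1N2qpta (just an example), but the project uses the converted decimal number directly. Having to find separate code to do the conversion myself is a bit of a hassle.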
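(Also, could you provide code for this case: a topic or keyword search returns many posts, and I would like to crawl the repost relationships of all of those posts.)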
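(This is what I meant in the email I sent you; I did not explain it clearly there. My proxy node is working again, so I can log in to GitHub now. I feel very lucky to have found such a good project (o゚v゚)ノ)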
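For the ID conversion mentioned in this post, a commonly described scheme treats the short string as base-62 text: it is split from the right into 4-character groups, each group is decoded to a number, and every group except the leftmost is zero-padded to 7 digits. The sketch below assumes that scheme and that alphabet ordering (digits, then lowercase, then uppercase). Neither is confirmed by the project, and `mid_to_id` and `id_to_mid` are hypothetical names.

```python
# Assumed base-62 alphabet: digits, lowercase letters, uppercase letters.
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"


def base62_decode(text: str) -> int:
    number = 0
    for ch in text:
        number = number * 62 + ALPHABET.index(ch)
    return number


def base62_encode(number: int) -> str:
    if number == 0:
        return "0"
    digits = []
    while number:
        number, rem = divmod(number, 62)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))


def mid_to_id(mid: str) -> str:
    """Convert a short base-62 string to its decimal ID (assumed scheme)."""
    parts = []
    for end in range(len(mid), 0, -4):  # 4-character groups, right to left
        start = max(end - 4, 0)
        chunk = str(base62_decode(mid[start:end]))
        if start > 0:  # every group except the leftmost is padded to 7 digits
            chunk = chunk.zfill(7)
        parts.append(chunk)
    return "".join(reversed(parts))


def id_to_mid(decimal_id: str) -> str:
    """Inverse of mid_to_id: 7-digit groups become 4-character groups."""
    parts = []
    for end in range(len(decimal_id), 0, -7):
        start = max(end - 7, 0)
        chunk = base62_encode(int(decimal_id[start:end]))
        if start > 0:
            chunk = chunk.rjust(4, "0")
        parts.append(chunk)
    return "".join(reversed(parts))
```

Calling `mid_to_id("Hd1N2qpta")` would give the decimal form under these assumptions. It is worth checking one known pair by hand against the site before relying on the result.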
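When crawling reposts, I found that the number of reposts I can retrieve for each post is capped at a fixed figure of only a little over a hundred (up to page 10 of the post's reposts). Adding a cookie made no difference. How can I crawl the reposts completely?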