
xwang20 / weibocrawler

140.0 140.0 25.0 1.96 MB

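A cookie-free Weibo crawler that can continuously crawl one or more Sina Weibo users' profile information, their posts, and the comments and reposts on those posts.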

License: MIT License

Python 100.00%
crawler social-media weibo weibo-spider


weibocrawler's Issues

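Problems in the comment crawler

Hello, I found the following when crawling Weibo comments: (1) not all comments on a post are crawled; (2) many blank rows are returned; (3) the result (that is, the number of comments crawled) differs on every run. I suspect the comment API returns an empty response some of the time, so a request has to be repeated several times before it yields data. Have you considered this problem?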
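If the reporter's guess is right and the comment endpoint sometimes answers with an empty payload, one way to test it is to repeat the request a few times before giving up. The sketch below is not the project's code: `fetch_with_retry`, its parameters, and the assumption that a usable response carries a non-empty `data` field are all hypothetical, and the actual HTTP call is passed in as a function so the helper stays independent of any client library.

```python
import time
from typing import Callable, Optional


def fetch_with_retry(
    fetch: Callable[[], Optional[dict]],
    max_attempts: int = 3,
    delay: float = 1.0,
) -> Optional[dict]:
    """Call `fetch` until it returns a payload with a non-empty 'data' field.

    `fetch` is a hypothetical callable that performs one request and
    returns the decoded JSON body (or None on failure).
    """
    for attempt in range(1, max_attempts + 1):
        payload = fetch()
        # Assumption: an empty or missing 'data' field means "try again".
        if payload and payload.get("data"):
            return payload
        if attempt < max_attempts:
            time.sleep(delay)  # back off briefly before the next attempt
    return None  # every attempt came back empty
```

Counting how often the helper needs a second or third attempt would also show whether empty responses explain the run-to-run differences.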

Problem encountered when crawling reposts

Hello author! When I ran the repost crawl command, I ran into the following problem:
2022-03-03 14:53:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://m.weibo.cn/api/statuses/repostTimeline?id=4742865873797609&page=1> (referer: None)
2022-03-03 14:53:17 [scrapy.core.scraper] ERROR: Spider error processing <GET https://m.weibo.cn/api/statuses/repostTimeline?id=4742865873797609&page=1> (referer: None)
Traceback (most recent call last):
  File "C:\Users\17243\Desktop\WeiboCrawler\venv\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
    yield next(it)
  File "C:\Users\17243\Desktop\WeiboCrawler\venv\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 30, in process_spider_output
    for x in result:
  File "C:\Users\17243\Desktop\WeiboCrawler\venv\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "C:\Users\17243\Desktop\WeiboCrawler\venv\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "C:\Users\17243\Desktop\WeiboCrawler\venv\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "C:\Users\17243\Desktop\WeiboCrawler\WeiboCrawler\spiders\repost.py", line 23, in parse
    all_page = js['data']['max']
KeyError: 'data'
My initial analysis is that this happens because the endpoint requires a login. Could you advise how to solve it?
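The traceback shows that `js['data']['max']` raises `KeyError` when the response has no `data` field. Whatever the root cause turns out to be, the lookup can be made defensive so the spider skips the page instead of crashing. This is a minimal sketch rather than the project's fix: `max_page_or_none` is a hypothetical helper, and the shape of the data-less response (for example `{"ok": 0}`) is an assumption.

```python
import json
from typing import Optional


def max_page_or_none(body: str) -> Optional[int]:
    """Return js['data']['max'] if present, otherwise None.

    `body` is the raw response text. A body that is not valid JSON,
    or that lacks a 'data' object, yields None instead of an exception.
    """
    try:
        js = json.loads(body)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None
    if not isinstance(js, dict):
        return None
    data = js.get("data")
    if not isinstance(data, dict):
        return None  # assumption: e.g. a login-required or rate-limited reply
    return data.get("max")
```

In the spider's `parse` method, a `None` result could be logged together with the response body, which would confirm whether the endpoint really is asking for a login.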

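Keyword search problem

Does the keyword configuration still work? Following the instructions, I restored the commented-out code and set keyword, but it still has no effect.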

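Request: write directly to CSV

Could you add an option to save results directly to CSV? As a beginner, I find working with a database rather difficult.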

Storage problem with crawled data

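Hello, the code runs normally when I use this project, but the crawled data is not saved promptly: the [user.csv] file only contains data after the program has finished. How should I modify the code?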
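One common reason a CSV file stays empty until the program exits is output buffering: rows sit in memory until the file is closed. The sketch below shows a pipeline that writes each item as it arrives and flushes immediately. It follows Scrapy's item pipeline method names (`open_spider`, `process_item`, `close_spider`), but the class name `FlushingCsvPipeline` and the default path are hypothetical, and it is not the project's own storage code.

```python
import csv


class FlushingCsvPipeline:
    """Write each item to a CSV file and flush after every row."""

    def __init__(self, path="user.csv"):
        self.path = path  # hypothetical default; adjust per spider
        self.file = None
        self.writer = None

    def open_spider(self, spider):
        # newline="" lets the csv module control line endings itself.
        self.file = open(self.path, "w", newline="", encoding="utf-8")

    def process_item(self, item, spider):
        row = dict(item)
        if self.writer is None:
            # The columns are taken from the first item seen.
            self.writer = csv.DictWriter(
                self.file, fieldnames=list(row), extrasaction="ignore"
            )
            self.writer.writeheader()
        self.writer.writerow(row)
        self.file.flush()  # make the row visible on disk right away
        return item

    def close_spider(self, spider):
        if self.file is not None:
            self.file.close()
```

A class like this would be enabled through Scrapy's `ITEM_PIPELINES` setting, using whatever module path it is saved under. Flushing on every row trades some throughput for visibility, which is usually acceptable for a crawler that is rate-limited anyway.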

Comment crawling problem

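Hello, when I run comment.py it fails to crawl anything for many posts, returning not a single comment, and for the posts that do work it only gets about 10% of the comments. This feels rather limiting. Do you have any suggestions?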

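Problem getting the exact time of the data

Hello, I want to capture the exact time of each comment when crawling comments. Following code I found online, I added this to comment.py: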

commentItem['created_at']= standardize_date(comment['created_at']).strftime('%Y-%m-%d %H:%M:%S')

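Some results show hours, minutes and seconds, but they are all 00; other results still contain only year-month-day. I don't know what went wrong. Any help would be appreciated 🙏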
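A time of 00:00:00 suggests that the time of day is already lost before `strftime` is called, for instance if `standardize_date` returns a date-only value. One way to check is to parse the raw `created_at` string directly. The sketch below assumes the API returns timestamps in the form `Thu Mar 03 14:53:17 +0800 2022`; that format, and the helper name `format_created_at`, are assumptions, and relative strings such as "5 minutes ago" are not handled.

```python
from datetime import datetime

# Assumed raw format: weekday, month, day, time, UTC offset, year.
RAW_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def format_created_at(raw: str) -> str:
    """Convert an assumed raw 'created_at' string to 'YYYY-MM-DD HH:MM:SS'.

    Raises ValueError if `raw` does not match RAW_FORMAT, which makes
    unexpected formats visible instead of silently dropping the time.
    """
    parsed = datetime.strptime(raw, RAW_FORMAT)
    return parsed.strftime("%Y-%m-%d %H:%M:%S")
```

Note that `%a` and `%b` are locale-dependent, so this expects English day and month abbreviations. Logging any value that raises `ValueError` would reveal which other formats the endpoint returns.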

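Great project! Could you also add a conversion from a Weibo URL to its mid?

The weibo_id in a URL is a string such as Hd1N2qpta (just an example), whereas the project uses the converted decimal number directly. Having to find separate code for the conversion is a bit of a hassle.

(In addition, could you provide code for this: a topic or keyword search returns many posts; crawl the repost relationships of all of those posts.)
(This is what I meant in the email I sent you; I did not express it clearly at the time. My proxy node is working again, so I can log in to GitHub now. I feel very lucky to have come across such a good project (o゚v゚)ノ)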
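The conversion requested above is commonly described as a chunked base-62 decoding: the string is split from the right into groups of four characters, and each group decodes to a block of seven decimal digits. The sketch below follows that community-documented scheme; it is not taken from this project, and the alphabet order and the function name `mid_to_id` are assumptions that should be checked against a post whose numeric id is known.

```python
# Assumed alphabet order: digits, lowercase letters, uppercase letters.
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"


def base62_decode(text: str) -> int:
    """Decode a base-62 string into an integer."""
    number = 0
    for char in text:
        number = number * 62 + ALPHABET.index(char)
    return number


def mid_to_id(mid: str) -> str:
    """Convert a URL-style mid (e.g. 'Hd1N2qpta') to a decimal id string."""
    parts = []
    # Walk the string from the right in groups of four characters.
    for end in range(len(mid), 0, -4):
        start = max(end - 4, 0)
        digits = str(base62_decode(mid[start:end]))
        if start > 0:
            # Every group except the leftmost is padded to seven digits.
            digits = digits.zfill(7)
        parts.append(digits)
    return "".join(reversed(parts))
```

Calling `mid_to_id` on the last path segment of a post URL yields the decimal form that the spiders take as input, provided the assumed scheme matches what Weibo actually uses.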

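Problem encountered when crawling reposts

When crawling reposts, I found that the number of reposts I can get for each post is capped at a fixed figure of just over a hundred (up to page 10 of the reposts), and adding a cookie makes no difference. How can I crawl the complete set of reposts?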
