
weibo_terminator_workflow's Introduction

Weibo Terminator Work Flow


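This project is a reboot of an earlier project (linked here), which will continue to be updated. This is the working version of weibo terminator, with some optimizations over the previous version. The end goal is to scrape corpora together, for applications including sentiment analysis, dialogue corpora, public-opinion risk control, and big-data analysis.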

UPDATE 2017-5-16

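Updates:

  • Adjusted the logic for obtaining cookies on first run: if the program does not detect any cookies it exits, so that it does not crash later when it can no longer scrape more content;
  • Added the WeiBoScraperM class. It is still under construction, and PRs implementing it are welcome. This class scrapes from a different Weibo domain, namely the mobile domain;

Please pull to get the update.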

UPDATE 2017-5-15

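After some small modifications and PRs from several contributors, the code has changed slightly. Nearly all of the changes fix bugs or tighten up logic, as follows:

  1. Fixed the save error. Anyone who cloned the code at the time of the first push should pull;
  2. The error WeiboScraper has not attribute weibo_content is fixed in the new code;

@Fence submitted a PR with the following changes:

  1. The fixed 30s rest was replaced with a random duration; the parameters can be set yourself
  2. Added big_v_ids_file, which records the ids of celebrities whose fans have already been saved; it is in txt format so that contributors can add or delete entries by hand
  3. Both functions now scrape from page+1, so that resuming from a checkpoint does not re-scrape the last page that was already scraped
  4. Changed "save after scraping all weibos and comments of one id" to "save as soon as one weibo and all of its comments have been scraped"
  5. (Optional) Moved the file-saving code into separate functions, since saving is needed in 2 and 3 places respectively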
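
The randomized rest from item 1 can be sketched as follows. This is a minimal illustration, not the code from the PR: the function name `random_pause`, its parameters, and the default bounds are all assumptions chosen for the example.

```python
import random
import time


def random_pause(min_seconds=10.0, max_seconds=40.0, sleep=time.sleep):
    """Sleep for a random duration in [min_seconds, max_seconds] and return it.

    The bounds are illustrative defaults (hypothetical, not the project's values).
    The `sleep` argument exists only so the pause can be replaced in tests.
    """
    if min_seconds < 0 or max_seconds < min_seconds:
        raise ValueError("expected 0 <= min_seconds <= max_seconds")
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay
```

Calling `random_pause()` between page requests spreads the requests out irregularly, which is the point of replacing a fixed 30s interval.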

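You can run git pull origin master to get the updated version. You are also welcome to keep asking me for a uuid; I will publish the list in contirbutor.txt at regular intervals. I am currently working on merging, cleaning, and classifying the data; once the merge is finished I will distribute the large dataset to everyone.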

Improve

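The following improvements were made over the previous version:

  • Fewer distractions and straight to the point: given an id, it gets all of that user's weibos, the weibo count, the fan count, the content of every weibo, and the comments;
  • Unlike the previous version, the idea this time is to save all data into three pickle files, stored as dictionaries, which makes it easy to resume an interrupted scrape;
  • An id that has already been scraped is not scraped again: the scraper remembers scraped ids, and each id is marked as scraped once all of its content has been fetched;
  • Weibo content and weibo comments are kept separate. If scraping weibo content is interrupted, the next run does not start over but continues from the page where it stopped;
  • More importantly, each id is scraped independently of the others. You can pull the weibo content of any id you want straight out of the pickle file and process it however you like;
  • A new strategy against anti-scraping measures was also tested. The delay mechanism works well, though fully unattended operation is not yet possible.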
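
The checkpointing idea in the list above (a dictionary in a pickle file, one entry per id, resuming from the page after the last saved one) can be sketched like this. It is a simplified illustration under assumptions: the helper names, the entry layout (`last_page`, `pages`, `done`), and the `fetch_page` callback are hypothetical and do not reproduce the project's actual file format.

```python
import os
import pickle


def load_state(path):
    """Return the dict stored in a pickle file, or an empty dict if it does not exist yet."""
    if not os.path.exists(path):
        return {}
    with open(path, "rb") as f:
        return pickle.load(f)


def save_state(path, state):
    """Write the dict to disk, creating the parent directory if needed."""
    directory = os.path.dirname(path)
    if directory:
        os.makedirs(directory, exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(state, f)


def scrape_user(user_id, total_pages, fetch_page, path):
    """Scrape pages for one id, saving after each page so a rerun can resume."""
    state = load_state(path)
    entry = state.setdefault(user_id, {"last_page": 0, "pages": [], "done": False})
    if entry["done"]:
        return entry  # this id was fully scraped on an earlier run
    # Start at last_page + 1 so the last saved page is not fetched again.
    for page in range(entry["last_page"] + 1, total_pages + 1):
        entry["pages"].append(fetch_page(user_id, page))
        entry["last_page"] = page
        save_state(path, state)
    entry["done"] = True
    save_state(path, state)
    return entry
```

Because every id has its own entry, an interruption while scraping one id leaves the entries of other ids untouched, and any single id can be read back with `load_state(path)[user_id]`.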

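Even more importantly, the scraper is much smarter in this version: while scraping each id, it automatically fetches the ids of all of that id's fans! In effect, what I give you are seed ids, which belong to celebrities, companies, or big-V media accounts, and from these seed ids you can obtain thousands upon thousands of further seed ids! Suppose a celebrity has 34,000 fans: the first pass gives you 34,000 ids. Then keep scraping from those child ids; if each child id has 100 fans, the second pass gives you 3.4 million ids! Is that enough? Of course not!

This project will never stop! It will keep going until we have collected enough corpora!

(In practice, of course, we cannot obtain all fans, but what we do get is enough.)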
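
The expansion from seed ids to fan ids amounts to a breadth-first traversal with deduplication. The sketch below is an illustration only: `expand_ids`, `get_fan_ids`, and `max_depth` are hypothetical names, and the fan lookup is passed in as a callback rather than taken from the project. The figure of 3.4 million in the example above is the upper bound 34,000 × 100, reached only if no fan id repeats.

```python
from collections import deque


def expand_ids(seed_ids, get_fan_ids, max_depth=2):
    """Breadth-first expansion from seed ids; returns every id discovered, each once."""
    seen = set(seed_ids)
    queue = deque((uid, 0) for uid in seed_ids)
    while queue:
        uid, depth = queue.popleft()
        if depth >= max_depth:
            continue  # do not look up fans beyond the requested depth
        for fan_id in get_fan_ids(uid):
            if fan_id not in seen:  # skip ids found through another account
                seen.add(fan_id)
                queue.append((fan_id, depth + 1))
    return seen
```

With `max_depth=2` this corresponds to the two passes described above: fans of the seeds, then fans of those fans.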


Work Flow

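This version is aimed at contributors, and our workflow is very simple:

  1. Get a uuid. A uuid maps to 2-3 ids in distribute_ids.pkl; these are our seed ids. You could also fetch all of the ids directly, but to avoid duplicated work I suggest you request a uuid from me and handle only your own. When you finish scraping, send me the final files; after I clean up and deduplicate them, I will distribute the final large corpus to everyone.
  2. Run python3 main.py uuid. Note that the fan ids are scraped only after the ids assigned to the uuid have been scraped;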
  3. Done!

Discuss

Here are the discussion groups again; everyone is welcome to join:

QQ
AI智能自然语言处理: 476464663
Tensorflow智能聊天Bot: 621970965
GitHub深度学习开源交流: 263018023

On WeChat you can add me as a friend: jintianiloveu

Copyright

(c) 2017 Jin Fagang & Tianmu Inc. & weibo_terminator authors LICENSE Apache 2.0

weibo_terminator_workflow's People

Contributors

af1ynch, chenleilei, lucasjinreal, xiaochao


weibo_terminator_workflow's Issues

No such file or directory: './scraped_corpus/weibo_content.pkl'

Hello, I tried this project but could not get it to save files.
First: I changed the beginning of the mission function in main.py to scrap('5979819802')
Second: git pull origin
Third: python3 main.py -i 5979819802
The result is shown in these figures:
1495161669 1
1495161006 1
What should I do?
BTW, I would like to apply for a uuid and have already messaged you on WeChat.

Error caused by CAPTCHA: error, account id xxx is not valid.

preparing cookies for account {'id': 'xxxxx', 'password': 'xxxxx'}
loading PhantomJS from /Users/ailias/Downloads/phantomjs-2.1.1-macosx/bin/phantomjs
opening weibo login page, this is first done for prepare for cookies. be patience to waite load complete.
100%|███████████████████████████████████████████████████████████████| 40/40 [00:20<00:00, 1.99it/s]
account id: XXXXXX
account password: XXXXXXXX
error, account id XXXXXX is not valid, pass this account, you can edit it and then update cookies.

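This problem is a login failure caused by the login page requiring a CAPTCHA. Can it be solved? When I log in with my own account and no CAPTCHA is required, the login succeeds.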

A reminder for configuring accounts.py

Just a reminder:
When you add your own weibo account to accounts.py to get your own cookies, the "id" is your weibo login name (usually your email or phone number), not the numeric id that weibo assigns to your account, such as 273993327.

When you scrape weibos from other users, such as big-V accounts, you put their ids (the string of digits or letters, for example 273993327) in the id file.
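
As a rough illustration of the point above, an entry in accounts.py might look like the sketch below. The dict keys `id` and `password` match the account dict printed in the login log of the CAPTCHA issue; the variable name `accounts` and the placeholder values are assumptions, so check the real accounts.py for the exact layout.

```python
# accounts.py (sketch): the variable name `accounts` is an assumption.
accounts = [
    {
        # Your weibo LOGIN name (email or phone number),
        # not the numeric id weibo assigns to your profile.
        "id": "your_login_email_or_phone",
        "password": "your_password",
    },
]
```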

ERROR:root:file must have 'read' and 'readline' attributes

When I run main.py, login succeeds,
but when it reaches "getting content and comment..."
it throws the following error:

getting content and comment...
ERROR:root:file must have 'read' and 'readline' attributes
some error above not catch, return to dispatch center, resign for new account..
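
One common cause of this message is passing something other than an open file object to `pickle.load`, for example a path string. Whether that is what happens inside the project is an assumption, since the log shows no traceback. The sketch below reproduces the error on a temporary file and shows the usual fix; the file name and sample data are made up for the example.

```python
import os
import pickle
import tempfile

# Write a small pickle file to a temporary directory (illustrative data only).
path = os.path.join(tempfile.mkdtemp(), "weibo_content.pkl")
with open(path, "wb") as f:
    pickle.dump({"5979819802": ["example weibo"]}, f)

# Passing the path string instead of an open file raises a TypeError.
message = None
try:
    pickle.load(path)
except TypeError as exc:
    message = str(exc)

# Opening the file in binary mode first is the usual fix.
with open(path, "rb") as f:
    data = pickle.load(f)
```

If the project's loading code receives a file name where a file object is expected, wrapping the call in `with open(path, "rb") as f:` as shown would be the first thing to check.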
