Giter Site home page Giter Site logo

jasonweibocrawler's Introduction

#JasonWeiboCrawler

###提供cookies爬取特定用户或关键词的爬虫

author:Jason Wood

email:[email protected]

项目介绍


开发环境:

  • Mac OSX 10.11.3
  • Python 2.7

依赖库:

  • lxml
  • urllib
  • request

安装方法: sudo pip install lxml urllib request


使用方法

python main.py [email protected] (password) (u/k) (displayID/keyword) (m/s) (r/t) [(trycount) (thread control)]

参数解释:

  1. argv[0]:文件名(必填)
  2. argv[1]:新浪微博登录名(通常为邮箱,必填)
  3. argv[2]:登录密码(必填)
  4. argv[3]:爬虫模式。u表示爬取用户微博,k表示爬取关键词(必填)
  5. argv[4]:若爬取用户微博,则提供用户的displayID,该ID在该用户主页上有显示,注意displayID不是登录账号;若爬取关键词,则提供所需关键词,中英文都可,emoji/形如[xx]的微博表情不可(必填)
  6. argv[5]:线程模式,m代表多线程,s代表单线程(必填)
  7. argv[6]:判别爬虫模式,"r"表示displayID参数为博主ID,"t"表示displayID参数为其他人ID(应对2016.3.21新浪反爬机制改版)
  8. argv[7]:尝试阈值,同一尝试次数超过该阈值时线程退出,防止程序陷入死循环(选填,默认为20)
  9. argv[8]:多线程控制,仅在argv[5]为m时有用,用于指定多线程中单个线程下载的微博页面数(选填,默认为10)

本来这个程序不怎么需要输出信息的,但是开发前期觉得不输出点什么感觉程序好像陷入了死循环,所以就每爬取一个页面就显示完成进度。

另外,程序最后会输出两行,一行显示True或者False。True表示爬取过程很顺利,没有丢失页面,否则显示False;第二行是Done!表示程序结束。

解析的结果将存在/Weibo_parsed/文件夹里,以用户displayID为命名的.txt文件;关键词结果存在/keyword/文件夹里,以关键词为命名的.txt文件。而/Weibo_raw/是过渡文件夹,用来存爬取成功但未解析的微博页面文件,程序自动删除,若希望保留,只需注释掉main.py第83行的clean()方法调用即可。


写在后面:

使用他人的劳动成果时最好先征得他人的同意,但是如果你觉得我比较帅,就可以不用征得我的同意啦hhh 如果您觉得好用,请给与我鼓励,如果您觉得不好用,或者爬虫已经失效,请发邮件告诉我。

jasonweibocrawler's People

Watchers

James Cloos avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.