zhihu_spider

A large-scale crawler for Zhihu users

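  • (1) Use Python's requests module to fetch the HTML pages. Be sure to replace the cookie with your own so that the requests look more like they come from a browser.
  • (2) Use XPath to extract the key information from the HTML (name, occupation, place of residence, followees, and so on).
  • (3) Use Redis as the queue, which handles concurrency and large data volumes well (and can be distributed).
  • (4) Use breadth-first search (BFS) so that the program keeps expanding and continuously discovers new users.
  • (5) Store the data in a NoSQL database, MongoDB (efficient, lightweight, and supports concurrent access).
  • (6) Use Python's process pool module to speed up crawling.
  • (7) Use the csv, pandas, and matplotlib modules for data processing (still needs improvement).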
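
The breadth-first expansion in steps (3) and (4) can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: it uses an in-memory deque as a stand-in for the Redis queue, and the function name bfs_crawl, the fetch_followees callback, and the FAKE_GRAPH data are all hypothetical. In a real run the callback would fetch a profile page over HTTP and extract the followees with XPath.

```python
from collections import deque
from typing import Callable, Iterable, List


def bfs_crawl(seed: str,
              fetch_followees: Callable[[str], Iterable[str]],
              max_users: int = 100) -> List[str]:
    """Visit users breadth-first, starting from `seed`.

    `fetch_followees` is a hypothetical callback that returns the users
    followed by the given user.
    """
    queue = deque([seed])  # in-memory stand-in for the Redis queue
    seen = {seed}          # users already queued, so nobody is visited twice
    visited = []
    while queue and len(visited) < max_users:
        user = queue.popleft()
        visited.append(user)
        for followee in fetch_followees(user):
            if followee not in seen:
                seen.add(followee)
                queue.append(followee)
    return visited


# Hypothetical follow graph used here instead of real HTTP requests.
FAKE_GRAPH = {
    "alice": ["bob", "carol"],
    "bob": ["carol", "dave"],
    "carol": ["alice"],
    "dave": [],
}

order = bfs_crawl("alice", lambda user: FAKE_GRAPH.get(user, []))
print(order)  # ['alice', 'bob', 'carol', 'dave']
```

Keeping the seen set separate from the queue is what stops the crawl from looping when users follow each other. With a shared Redis instance, both structures could live on the server so that several worker processes can cooperate.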

Contact the author

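  • For details, see my blog: http://blog.csdn.net/nk_test/article/details/51330971
  • A parameter must be specified at run time: print_data_out prints the results to the screen; store_data_to_mongo stores them in the MongoDB database. The project also depends on Redis, MongoDB, and several Python modules; please install them yourself.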
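
One way the run-time parameter could select the output mode is sketched below. Only the two parameter names come from this README; the select_handler function, the script name spider.py, the MongoDB URI, and the database and collection names are assumptions made for illustration. The project's real entry point may look different.

```python
from typing import Callable, Dict, List


def print_data_out(record: dict) -> None:
    """Print one scraped record to the screen."""
    print(record)


def store_data_to_mongo(record: dict) -> None:
    """Insert one scraped record into MongoDB.

    Assumes a local MongoDB server; the database name "zhihu" and the
    collection name "users" are hypothetical.
    """
    from pymongo import MongoClient  # imported lazily: only needed in this mode
    client = MongoClient("mongodb://localhost:27017")
    client["zhihu"]["users"].insert_one(record)


HANDLERS: Dict[str, Callable[[dict], None]] = {
    "print_data_out": print_data_out,
    "store_data_to_mongo": store_data_to_mongo,
}


def select_handler(argv: List[str]) -> Callable[[dict], None]:
    """Map the single command-line parameter to an output function."""
    if len(argv) != 2 or argv[1] not in HANDLERS:
        raise SystemExit(f"usage: {argv[0]} print_data_out|store_data_to_mongo")
    return HANDLERS[argv[1]]


# Example with a hypothetical script name and a made-up record.
handler = select_handler(["spider.py", "print_data_out"])
handler({"name": "example user"})
```

In an actual script, select_handler(sys.argv) would be called once at startup and the returned function applied to every record the crawler produces.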

Data showcase:

[three images]
