CC98

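#Crawler

Set the ID of the board to crawl. The crawler then follows that board's listing and, for every thread on the board, records each floor's author, post time, floor number, and post content. This thread information is stored in a MongoDB database.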
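The storage step can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the field names, the `save_thread` and `open_collection` helpers, the connection URI, and the database and collection names are all assumptions, and the floors are assumed to have already been parsed out of the HTML.

```python
from datetime import datetime
from typing import Iterable, Tuple

# One parsed floor: (floor number, author, post time, content).
Floor = Tuple[int, str, datetime, str]


def build_post_document(board_id: int, thread_id: int, floor: int,
                        author: str, posted_at: datetime, content: str) -> dict:
    """Return one MongoDB document for a single floor of a thread.

    The field names are hypothetical; the README does not specify a schema.
    """
    return {
        "board_id": board_id,
        "thread_id": thread_id,
        "floor": floor,
        "author": author,
        "posted_at": posted_at,
        "content": content,
    }


def save_thread(collection, board_id: int, thread_id: int,
                floors: Iterable[Floor]) -> int:
    """Store every floor of one thread and return how many were stored."""
    docs = [build_post_document(board_id, thread_id, *f) for f in floors]
    if docs:  # pymongo's insert_many rejects an empty list
        collection.insert_many(docs)
    return len(docs)


def open_collection(uri: str = "mongodb://localhost:27017",
                    db: str = "cc98", name: str = "posts"):
    """Open a collection; URI, database and collection names are assumed."""
    from pymongo import MongoClient  # imported lazily so the rest runs without it
    return MongoClient(uri)[db][name]
```

With a MongoDB server running, `save_thread(open_collection(), board_id, thread_id, floors)` would write one document per floor. Keeping one document per floor, rather than one per thread, is just one possible layout.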

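#Hot-word statistics

Threads longer than 30 pages are selected and segmented into words, and their hot words are counted. Uninformative hot words are then filtered out, and each thread's hot words are stored in the MongoDB database.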
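A minimal sketch of the counting and filtering step, under stated assumptions: the function names, the stop-word set, the `min_len` cutoff, and the `top_n` limit are illustrative and not taken from the project. The tokenizer is passed in as a parameter; `jieba.lcut` is the jieba call that returns a list of tokens.

```python
from collections import Counter
from typing import Callable, Iterable, List, Tuple


def is_long_thread(page_count: int, min_pages: int = 30) -> bool:
    """True for threads with more than `min_pages` pages."""
    return page_count > min_pages


def hot_words(posts: Iterable[str],
              tokenize: Callable[[str], Iterable[str]],
              stopwords=frozenset(),
              top_n: int = 20,
              min_len: int = 2) -> List[Tuple[str, int]]:
    """Count word frequencies over all posts of one thread.

    Words in `stopwords` and words shorter than `min_len` characters are
    dropped; both filters are assumptions about what counts as "useless".
    """
    counts = Counter()
    for text in posts:
        for word in tokenize(text):
            word = word.strip()
            if len(word) >= min_len and word not in stopwords:
                counts[word] += 1
    return counts.most_common(top_n)


def jieba_tokenize(text: str) -> List[str]:
    """Segment text with jieba (imported lazily so the rest runs without it)."""
    import jieba
    return jieba.lcut(text)
```

For a thread that passes `is_long_thread`, something like `hot_words(post_texts, jieba_tokenize, stopwords=my_stopwords)` would produce the `(word, count)` pairs to store alongside the thread.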

#Dependencies

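  1. beautifulsoup4
    Parses HTML pages; used to locate and extract the information to be stored.
    pip install beautifulsoup4
    
  2. lxml
    Third-party parser used by Beautiful Soup.
    pip install lxml
    
  3. pymongo
    Python interface to MongoDB.
    pip install pymongo
    
  4. jieba
    Used for Chinese word segmentation.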
    pip install jieba
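
After installing, a quick check along these lines can confirm that all four packages are importable. This is a sketch: the `REQUIRED` mapping and the `missing_dependencies` helper are illustrative names. Note that beautifulsoup4 is imported as `bs4`.

```python
from importlib.util import find_spec

# Importable module name -> pip package name, for the dependencies listed above.
REQUIRED = {
    "bs4": "beautifulsoup4",
    "lxml": "lxml",
    "pymongo": "pymongo",
    "jieba": "jieba",
}


def missing_dependencies(required=REQUIRED):
    """Return the pip names of packages whose module cannot be found."""
    return [pip_name for module, pip_name in required.items()
            if find_spec(module) is None]
```

Calling `missing_dependencies()` returns an empty list when everything is installed; otherwise it lists the packages still to be installed with pip.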
