
newsspider's Introduction

Supported sites:

  • Toutiao (今日头条)
  • NetEase News (网易新闻)
  • Tencent News (腾讯新闻)

Main features

  • News crawling
  • Index building
  • Front-end search

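Overall structure

Running

One-step startup

Run start.sh in the project directory to start crawling, indexing, and search together. The project_root path can be changed in tools/Global.py; by default, all processed data is stored under that directory.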

Run all spiders at once

git clone https://github.com/lzjqsdd/NewsSpider.git
cd NewsSpider/news_spider
scrapy crawlall

Run a single spider

scrapy crawl [toutiao|netease|tencent]
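To run a chosen subset of spiders from a script instead of typing each command, a minimal sketch along these lines might help. It only wraps the `scrapy crawl` command shown above with the standard-library `subprocess` module. The helper names `build_command` and `run_spiders` and the default `project_dir` are assumptions for illustration, not part of the project.

```python
import subprocess

# Spider names taken from the command above.
SPIDERS = ["toutiao", "netease", "tencent"]


def build_command(spider):
    """Return the argv list for crawling one spider (hypothetical helper)."""
    if spider not in SPIDERS:
        raise ValueError(f"unknown spider: {spider}")
    return ["scrapy", "crawl", spider]


def run_spiders(spiders, project_dir="NewsSpider/news_spider"):
    """Run each spider in turn; return a dict of spider name -> exit code."""
    results = {}
    for name in spiders:
        # cwd is assumed to be the Scrapy project directory (where scrapy.cfg lives).
        completed = subprocess.run(build_command(name), cwd=project_dir)
        results[name] = completed.returncode
    return results
```

Calling `run_spiders(["toutiao", "netease"])` from the directory that contains the clone would crawl those two sites one after the other. A non-zero exit code in the returned dict indicates that the corresponding `scrapy crawl` run failed.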

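Data and notes

  • The crawled news is stored as UTF-8 text; it is not garbled.
  • NetEase News content from 2015 uses a different format from 2016 content. It can still be crawled, but the XPath parsing has to be adjusted.
  • With the default parameters, about 130,000 news items can be crawled:
    • title.json (without news content)
    • news.json (with news content); the default output option can be changed in setting.py
    • news2db.py can write the JSON files into a sqlite3 database
  • All data-related configuration can be changed in tool/Global.py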
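The JSON-to-sqlite3 step can be sketched roughly as follows. This is not the project's news2db.py, only an illustration under stated assumptions: the input is assumed to hold one JSON object per line, and the field names `title` and `content`, the table name `news`, and the helper names are hypothetical. Reading with `encoding="utf-8"` and parsing with `json.loads` yields readable Chinese text whether the file stores raw UTF-8 or `\uXXXX` escapes.

```python
import json
import sqlite3


def load_news(json_path):
    """Yield one dict per non-empty line of a JSON-lines file (assumed format)."""
    with open(json_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def write_to_sqlite(records, db_path):
    """Insert records into a 'news' table and return the number of rows written."""
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "CREATE TABLE IF NOT EXISTS news (title TEXT, content TEXT)"
            )
            # Field names are assumptions; missing fields become empty strings.
            rows = [(r.get("title", ""), r.get("content", "")) for r in records]
            conn.executemany(
                "INSERT INTO news (title, content) VALUES (?, ?)", rows
            )
        return len(rows)
    finally:
        conn.close()
```

A call such as `write_to_sqlite(load_news("news.json"), "news.db")` would load the whole file into the database in a single transaction, so a malformed line leaves the table unchanged.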

TODO

  • Similar-news recommendation
  • Ranking algorithm

Demo

