
ducgt / gushiwen-spider


This project forked from frankzhangv5/gushiwen-spider


A spider built on scrapy + scrapy-redis + scrapy-splash that crawls poems from gushiwen.org

License: GNU General Public License v3.0

Python 92.67% Lua 7.33%

gushiwen-spider

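1. Introduction

This project is a spider built on scrapy + scrapy-redis + scrapy-splash that crawls poems from gushiwen.org. During the crawl, links that still need to be parsed are stored in redis, and the poem data is persisted to a mysql database.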

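2. Crawl approach

First, the links of all dynasties are extracted from the "dynasty" category and stored in redis. Next, for each dynasty link from the previous step, the links of all pages of that dynasty are crawled and stored in redis. Then, for each page link, the links of all poems on that page are crawled and stored in redis. For each poem link, the required data is extracted from the poem page and stored in the mysql database; when the poem page contains author information, the author link is stored in redis. Finally, each author page is crawled, and the author information is extracted and stored in the mysql database.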
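The flow above can be sketched as a small in-memory simulation. This is only an illustration of how links move from one stage to the next: the `FAKE_SITE` dictionary, the queue names, and the `crawl` function are hypothetical, the deques stand in for redis lists, and the returned lists stand in for mysql tables. The real project fetches and parses gushiwen.org pages instead.

```python
from collections import deque

# Hypothetical miniature "site": each URL maps to the links (or record) found on it.
FAKE_SITE = {
    "start": ["dynasty/tang"],
    "dynasty/tang": ["tang/page1", "tang/page2"],
    "tang/page1": ["poem/1"],
    "tang/page2": ["poem/2"],
    "poem/1": {"title": "Poem 1", "poet": "poet/a"},
    "poem/2": {"title": "Poem 2", "poet": None},  # poem page without author info
    "poet/a": {"name": "Poet A"},
}


def crawl(site):
    """Walk the five stages; deques stand in for redis, lists for mysql."""
    names = ("dynasty_links", "page_links", "poem_links", "poet_links")
    queues = {name: deque() for name in names}
    poems, poets = [], []

    # Stage 1 (dynasty): collect the dynasty links.
    queues["dynasty_links"].extend(site["start"])

    # Stage 2 (page): for each dynasty link, collect its page links.
    while queues["dynasty_links"]:
        queues["page_links"].extend(site[queues["dynasty_links"].popleft()])

    # Stage 3 (list): for each page link, collect the poem links on it.
    while queues["page_links"]:
        queues["poem_links"].extend(site[queues["page_links"].popleft()])

    # Stage 4 (poem): store the poem; queue the author link if there is one.
    while queues["poem_links"]:
        record = site[queues["poem_links"].popleft()]
        poems.append(record["title"])
        if record["poet"]:
            queues["poet_links"].append(record["poet"])

    # Stage 5 (poet): store the author information.
    while queues["poet_links"]:
        poets.append(site[queues["poet_links"].popleft()]["name"])

    return poems, poets
```

Calling `crawl(FAKE_SITE)` returns the stored poem titles and author names. Each stage only reads from the queue filled by the stage before it, which is the dependency the spiders have on each other.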

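3. Spider list

No.  Spider name  Description
1    dynasty      Crawls the links of all dynasties and stores them in redis
2    page         Crawls the links of all pages of each dynasty and stores them in redis
3    list         Crawls the links of the poems on each page and stores them in redis
4    poem         Crawls the required information from each poem page and stores it in the database
5    poet         Crawls the required information from each author page and stores it in the database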

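4. Spider execution order

Because each spider's input links depend on the results of the previous spider, the spiders should be run in ascending order of the numbers in the table above. However, each spider waits when it has no input links, so running all the spiders at the same time also works.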
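One way to start all five spiders at once is a small launcher script. This is a sketch, assuming the `scrapy` command is on the PATH and the script is run from the project root (the directory containing `scrapy.cfg`); the helper names `build_commands` and `run_concurrently` are hypothetical and not part of the project.

```python
import subprocess

# Spider names in the order given by the table in section 3.
SPIDERS = ["dynasty", "page", "list", "poem", "poet"]


def build_commands(spiders=SPIDERS):
    """Return one `scrapy crawl <name>` command per spider."""
    return [["scrapy", "crawl", name] for name in spiders]


def run_concurrently(commands):
    """Start every spider as its own process, then wait for all of them."""
    procs = [subprocess.Popen(cmd) for cmd in commands]
    return [proc.wait() for proc in procs]
```

Running `run_concurrently(build_commands())` starts the spiders in table order. Since idle spiders wait for input, the processes started this way may keep running until they are stopped manually.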

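5. Notes

  • scrapy-splash depends on a docker image; start the docker service by following the instructions on its github page
  • scrapy-redis is only a redis client, so redis-server must be installed beforehand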
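For orientation, the two dependencies are usually wired into a Scrapy project through `settings.py`. The fragment below is a sketch based on the setting names documented by scrapy-splash and scrapy-redis; the host names, ports, and the choice of dupefilter are assumptions, not values taken from this project.

```python
# settings.py fragment (sketch; all values are assumptions)

# Address of the Splash service started from the docker image.
# 8050 is the port used in the scrapy-splash instructions.
SPLASH_URL = "http://localhost:8050"

# Middlewares listed in the scrapy-splash documentation.
DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}
SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}

# scrapy-redis: keep the request queue and the duplicate filter in redis.
# scrapy-splash documents its own dupefilter as well; how this project
# combines the two is not stated in the source.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Address of the redis-server instance that must already be running.
REDIS_URL = "redis://localhost:6379"
```

If Splash or redis-server runs on another machine, only `SPLASH_URL` and `REDIS_URL` need to change.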

6. Extensions

  • The project can be deployed to the web with scrapyd and scrapyd-client
  • A simpler option is to deploy to the web with Gerapy
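Once the project has been deployed to a scrapyd server, a spider run can be scheduled through scrapyd's `schedule.json` endpoint. The sketch below assumes scrapyd is listening on its default port 6800 on localhost and that the project was deployed under the hypothetical name `gushiwen`; the helper names are illustrative only.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_schedule_request(spider, project="gushiwen",
                           base_url="http://localhost:6800"):
    """Build the POST request that asks scrapyd to run one spider."""
    data = urlencode({"project": project, "spider": spider}).encode("utf-8")
    return Request(f"{base_url}/schedule.json", data=data, method="POST")


def schedule(spider, **kwargs):
    """Send the request and return scrapyd's JSON reply as text."""
    with urlopen(build_schedule_request(spider, **kwargs)) as resp:
        return resp.read().decode("utf-8")
```

For example, `schedule("dynasty")` would ask the server to start the first spider from the table in section 3.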
