
wanfangdata's Introduction

WanFangData

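Python 3 compatibility branch

The syntax has been updated so the project runs under Python 3, but the main code structure and content are unchanged. The original proxy component has been replaced with FP-Server, which I wrote recently.

To crawl content successfully, you need to modify the spider code by hand so that it matches the site's current structure.

Parts whose compatibility changes are complete:

  • WFSpider

Environment setup:

  • Install the MongoDB database
  • Install the Python dependencies (Windows may require additional packages)
    pip install -r requirements.txt
    
  • Install and run FP-Server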

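Crawler

☝️ This crawler targets only the "Journals" module of the Wanfang Data Knowledge Service Platform website. To crawl other modules, WFbase and WFindex need some modification.

The crawler is built on Scrapy and MongoDB.

The spiders directory contains 5 spiders. Run them in this order:

  • WFbase crawls the top-level journal categories

  • WFindex crawls the journal index, that is, the second-level categories (about three thousand)

  • WFcore crawls the article pages (for my own needs I crawled only 2017; as of June that came to 230,000 records, and a conservative estimate is that 2016 would exceed one million)

The other two spiders are corrective patches. They have already been merged into WFcore and can be ignored.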
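The run order above can be sketched as a small driver script. This is a minimal sketch, not part of the project: it assumes each spider's Scrapy `name` matches the names listed (WFbase, WFindex, WFcore), that the standard `scrapy crawl` command is on the PATH, and that the script is started from the project root. `build_commands` and `run_all` are hypothetical helper names.

```python
import subprocess

# Spider names as listed in the README; assumed to match each spider's
# `name` attribute in the spiders directory.
SPIDER_ORDER = ("WFbase", "WFindex", "WFcore")


def build_commands(spiders=SPIDER_ORDER):
    """Return one `scrapy crawl <name>` command per spider, in run order."""
    return [["scrapy", "crawl", name] for name in spiders]


def run_all(dry_run=True):
    """Run each spider to completion before starting the next one."""
    for cmd in build_commands():
        if dry_run:
            # Only show what would be executed.
            print(" ".join(cmd))
        else:
            # check=True stops the sequence if a spider exits with an error.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_all(dry_run=True)
```

Running the spiders as separate processes keeps each stage isolated, so a failure in one stage stops the sequence before the next begins.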

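Settings:

  • TARGET_YEAR = ['2017'] Target years (filtered by journal publication date); the crawl runs in reverse order, starting from the most recent issue

  • USE_PROXY = 0 Whether to use a proxy

  • DOWNLOAD_DELAY = 1 Without a proxy this must not be lower than 1; otherwise the server rejects your IP and access is blocked for an hour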
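As an illustration of how these three settings interact, here is a minimal sketch. The setting names and values come from the list above; `check_delay` and `select_issues` are hypothetical helpers written for this example and are not taken from the project's code. Issues are modeled as `(year, issue_number)` tuples, which is also an assumption.

```python
# Values as listed in the settings section.
TARGET_YEAR = ['2017']
USE_PROXY = 0
DOWNLOAD_DELAY = 1


def check_delay(use_proxy, download_delay):
    """Hypothetical guard: reject a delay below 1 when no proxy is used."""
    if not use_proxy and download_delay < 1:
        raise ValueError("DOWNLOAD_DELAY must be >= 1 when USE_PROXY is off")
    return download_delay


def select_issues(issues, target_years):
    """Keep issues from the target years, most recent first.

    `issues` is an iterable of (year, issue_number) tuples (assumed shape).
    """
    wanted = set(target_years)
    return sorted((i for i in issues if i[0] in wanted), reverse=True)


if __name__ == "__main__":
    check_delay(USE_PROXY, DOWNLOAD_DELAY)
    sample = [("2016", 12), ("2017", 1), ("2017", 6)]
    print(select_issues(sample, TARGET_YEAR))
```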

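Web

The crawled data is presented through a web interface built on Flask.

Main features:

  1. Search by title, author, and so on

    • Home page (screenshot: home)
    • Search results (screenshot: searchresult.png)
    • Item detail (screenshot: item2)
  2. Enter any time range and field to view the Top 100 popular authors and keywords, along with their article counts

    • Popular query (screenshot: popular.png)
    • Query results (screenshot: popular_res2.png)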
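The Top 100 query described above could be expressed as a MongoDB aggregation pipeline. The following is a sketch under stated assumptions, not the project's actual query: the document fields `pub_date`, `field`, and `authors` (a list of names) are hypothetical, as is the function name `top_authors_pipeline`.

```python
from datetime import datetime


def top_authors_pipeline(start, end, field=None, limit=100):
    """Build an aggregation pipeline counting articles per author.

    Field names (`pub_date`, `field`, `authors`) are assumed for illustration.
    """
    match = {"pub_date": {"$gte": start, "$lt": end}}
    if field:
        match["field"] = field
    return [
        {"$match": match},                # restrict to time range and field
        {"$unwind": "$authors"},          # one document per (article, author)
        {"$group": {"_id": "$authors", "articles": {"$sum": 1}}},
        {"$sort": {"articles": -1}},      # most prolific authors first
        {"$limit": limit},                # Top 100 by default
    ]


if __name__ == "__main__":
    pipeline = top_authors_pipeline(datetime(2017, 1, 1), datetime(2017, 7, 1))
    print(pipeline)
```

With pymongo, such a pipeline would be passed to `collection.aggregate(pipeline)`; the collection name depends on how the spiders store their items. Swapping `authors` for a keywords field would give the keyword ranking in the same way.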

wanfangdata's People

Contributors

dependabot[bot], karmenzind


wanfangdata's Issues

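The crawler can no longer fetch anything

I just got this project running, and it looks like Wanfang has added anti-crawling measures. Connections are refused outright.
How should the code be changed to deal with this?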
