scrapy_baidu_image

A Scrapy spider for crawling Baidu Images.

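Notes

1. The spider searches through the address 'http://image.baidu.com/search/flip?tn=baiduimage&ie=utf-8&word=月季花' (here the keyword is 月季花, "Chinese rose"). This address returns page numbers, which makes it fairly easy to walk through all the images.

2. Baidu Images fetches its data dynamically via ajax, so the Scrapy spider uses chromedriver to render the JavaScript.

3. The image search keyword and the image path are stored in MongoDB; this is implemented in baidu.pipelines.MongoDBPipeline.

4. The configuration disables robots.txt handling and changes USER_AGENT, and a referer is added when images are fetched. These steps are required to get past Baidu's crawler blocking and hotlink protection.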
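To make notes 1, 2 and 4 more concrete, here is a minimal sketch of the spider-side pieces. It is not the code from imageSpider.py. The helper names (`build_page_url`, `image_request_headers`, `render_with_chrome`) are hypothetical, and the `pn` offset parameter with a page size of 20 is an assumption that the source does not confirm; the original spider may simply follow the page-number links returned by the page.

```python
from urllib.parse import urlencode

BASE_URL = "http://image.baidu.com/search/flip"


def build_page_url(keyword, page=0, page_size=20):
    """Return the flip-search URL for a keyword and a zero-based page index."""
    params = {"tn": "baiduimage", "ie": "utf-8", "word": keyword}
    if page > 0:
        # Assumption: the result offset is passed as `pn`, page_size results per page.
        params["pn"] = page * page_size
    return BASE_URL + "?" + urlencode(params)


def image_request_headers(page_url, user_agent):
    """Headers for an image download; the Referer points back at the result page."""
    return {"Referer": page_url, "User-Agent": user_agent}


def render_with_chrome(url):
    """Load a page in headless Chrome and return the HTML after JavaScript has run."""
    # Imported lazily: requires the selenium package and a chromedriver binary.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()
```

In a Scrapy project, a function like `render_with_chrome` would typically be called from a downloader middleware so that the spider receives the rendered HTML, and the headers would be passed to the image requests.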

Main code locations:

/baidu/spiders/imageSpider.py

/baidu/pipelines.py
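For orientation, a MongoDB item pipeline of the kind described in note 3 might look roughly like the sketch below. This is not the actual content of /baidu/pipelines.py. The setting names `MONGO_URI` and `MONGO_DATABASE`, the collection name `images`, and the item fields `keyword` and `image_path` are all assumptions made for illustration.

```python
class MongoDBPipeline:
    """Store each item's search keyword and image path in MongoDB (sketch)."""

    def __init__(self, mongo_uri, mongo_db, collection_name="images"):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db
        self.collection_name = collection_name  # hypothetical collection name
        self.client = None
        self.collection = None

    @classmethod
    def from_crawler(cls, crawler):
        # MONGO_URI and MONGO_DATABASE are assumed setting names.
        return cls(
            crawler.settings.get("MONGO_URI", "mongodb://localhost:27017"),
            crawler.settings.get("MONGO_DATABASE", "baidu"),
        )

    def open_spider(self, spider):
        # Imported lazily so the class can be loaded without pymongo installed.
        import pymongo

        self.client = pymongo.MongoClient(self.mongo_uri)
        self.collection = self.client[self.mongo_db][self.collection_name]

    def close_spider(self, spider):
        if self.client is not None:
            self.client.close()

    def process_item(self, item, spider):
        # Copy only the two fields into a new document; field names are assumptions.
        self.collection.insert_one(
            {"keyword": item["keyword"], "image_path": item["image_path"]}
        )
        return item
```

Scrapy calls `open_spider` and `close_spider` once per crawl and `process_item` once per scraped item, so the connection is opened a single time and reused.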

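I wrote this crawler mainly for my own neural network research and am sharing it here. It is a decent, reusable tool with a concise implementation. It still worked as of January 2019, and I plan to update it once a month.

Finally, please crawl data politely: do not send requests too frequently, so that you do not put pressure on the other party's servers.

This write-up is provided for learning purposes only.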
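As an illustration of the configuration in note 4 combined with the request to crawl politely, a settings.py fragment could look like the sketch below. The setting names are standard Scrapy settings, but every value here (the delay, the concurrency limit, the user-agent string, the pipeline priority) is an assumption and not taken from the project.

```python
# settings.py (sketch; all values are illustrative assumptions)

# Note 4: do not fetch or obey robots.txt, and present a browser-like user agent.
ROBOTSTXT_OBEY = False
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
)

# Polite crawling: wait between requests and keep concurrency low.
DOWNLOAD_DELAY = 2                  # seconds between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 1  # one request at a time per domain
AUTOTHROTTLE_ENABLED = True         # adapt the delay to server response times

# Note 3: enable the MongoDB pipeline; the priority 300 is an arbitrary choice.
ITEM_PIPELINES = {
    "baidu.pipelines.MongoDBPipeline": 300,
}
```

Raising `DOWNLOAD_DELAY` slows the crawl but lowers the load on the remote server.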

Contributors

blueapplehe
