
blurrye / sogoupic


A spider for Sogou Pictures (http://pic.sogou.com/). The crawler uses the scrapy framework, constructs Ajax requests, retrieves JSON data, and stores it in a mongodb database.


sogoupic's Introduction

Introduction

Built on the scrapy framework; results are stored in mongodb.

Usage

Set the following parameters in settings.py, then run the spider.

# Custom variables
# Category
CATEGORY = '美女'
# Tag
TAG = '全部'
# Start page (zero-based); each page holds 15 images
START = 0
# End page
END = 260
# mongodb
MONGO_URI = 'localhost'
MONGO_DATABASE = 'sogoupic'

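Brief analysis

Sogou Pictures home page: http://pic.sogou.com/

Press F12 to open the browser's web developer tools and switch to the network monitor. It is easy to see that the page loads its data via Ajax, and the request URLs follow this pattern: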

http://pic.sogou.com/pics/channel/getAllRecomPicByTag.jsp?category=美女&tag=女神&start=0&len=15
http://pic.sogou.com/pics/channel/getAllRecomPicByTag.jsp?category=美女&tag=女神&start=15&len=15
http://pic.sogou.com/pics/channel/getAllRecomPicByTag.jsp?category=美女&tag=女神&start=30&len=15
http://pic.sogou.com/pics/channel/getAllRecomPicByTag.jsp?category=美女&tag=女神&start=45&len=15
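The pattern above suggests that `start` advances by `len` (15) for each page. A minimal sketch of generating these URLs from the page settings might look like the following. The helper names `build_url` and `page_urls` are hypothetical and not taken from the repository, and treating the end page as inclusive is an assumption.

```python
from urllib.parse import urlencode

BASE_URL = "http://pic.sogou.com/pics/channel/getAllRecomPicByTag.jsp"
PAGE_SIZE = 15  # images per request, matching len=15 in the URLs above


def build_url(category, tag, page, page_size=PAGE_SIZE):
    """Return the Ajax URL for a zero-based page number (hypothetical helper)."""
    params = {
        "category": category,
        "tag": tag,
        "start": page * page_size,  # offset of the first image on this page
        "len": page_size,
    }
    return f"{BASE_URL}?{urlencode(params)}"


def page_urls(category, tag, start_page, end_page):
    """Yield one URL per page; treating end_page as inclusive is an assumption."""
    for page in range(start_page, end_page + 1):
        yield build_url(category, tag, page)


if __name__ == "__main__":
    for url in page_urls("美女", "女神", 0, 3):
        print(url)
```

Note that `urlencode` percent-encodes the non-ASCII category and tag values as UTF-8, so the printed URLs show escaped bytes rather than the literal characters seen in the browser.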

The response contains JSON data; a subset of its fields is selected and saved.

e.g.

{
    "title" : "清新萌妹子可爱迷人",
    "tags" : [ 
        "迷人", 
        "妹纸", 
        "可爱", 
        "小清新"
    ],
    "width" : 620,
    "height" : 930,
    "size" : 89441,
    "page_url" : "http://www.mm4493.com/meitu/46166_10.html",
    "ori_pic_url" : "http://www.mm4493.com/d/file/p/2016-01-04/2c4799934e8f33d80891cb7389e58e86.jpg",
    "pic_url" : "http://img03.sogoucdn.com/app/a/100520021/82e035a5ea76c4900afd54cf5c609eb3"
}
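Records like the one above could be written to mongodb by a scrapy item pipeline. The sketch below follows the general pipeline pattern shown in the scrapy documentation and reads the `MONGO_URI` and `MONGO_DATABASE` settings described earlier. It is not the repository's actual code: the class name `MongoPipeline` and the collection name `pictures` are assumptions made for illustration.

```python
class MongoPipeline:
    """Sketch of a scrapy item pipeline that writes each item to mongodb."""

    collection_name = "pictures"  # hypothetical collection name

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db
        self.client = None
        self.db = None

    @classmethod
    def from_crawler(cls, crawler):
        # scrapy calls this hook and exposes settings.py via crawler.settings.
        return cls(
            mongo_uri=crawler.settings.get("MONGO_URI"),
            mongo_db=crawler.settings.get("MONGO_DATABASE"),
        )

    def open_spider(self, spider):
        # Imported here so the class can be defined without pymongo installed.
        import pymongo

        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # dict(item) works for both plain dicts and scrapy.Item instances.
        self.db[self.collection_name].insert_one(dict(item))
        return item
```

To take effect, a pipeline like this would also need to be registered under `ITEM_PIPELINES` in settings.py.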
