
siren's Introduction

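Introduction

siren is a configuration-driven crawler system whose configuration and parsing rules are written in YAML. Thanks to YAML's syntax, a crawler can be defined easily, without writing a lot of code.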

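Prerequisites

To use siren, you need to know CSS selectors or XPath well enough to describe the content you want to extract. You also need to know regular expressions well enough to do simple filtering and replacement.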
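
As a rough illustration of these two skills, the sketch below selects links with an XPath expression and then filters them with a regular expression. It uses only Python's standard library (`xml.etree.ElementTree` supports a limited XPath subset) on a made-up, well-formed page fragment. The function name `select_links` and the sample page are assumptions for illustration, not siren's own code.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment used only for illustration.
PAGE = """
<html><body>
  <a href="1.htm">Chapter 1</a>
  <a href="2.htm">Chapter 2</a>
  <a href="about.html">About</a>
</body></html>
"""

def select_links(page, pattern):
    """Return the href of every <a> element whose href matches pattern."""
    root = ET.fromstring(page)
    # XPath: every <a> element, at any depth, that has an href attribute.
    hrefs = [a.get("href") for a in root.findall(".//a[@href]")]
    # Regex filter: keep only the values the pattern matches.
    return [h for h in hrefs if re.search(pattern, h)]

chapter_links = select_links(PAGE, r"[0-9]+\.htm")
print(chapter_links)
```

Real-world HTML is rarely well-formed XML, so an actual crawler would normally rely on a tolerant HTML parser instead of `ElementTree`.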

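To use siren well, you may also need to understand the robots.txt protocol. Respect the site owner's wishes and fetch data politely: be a well-mannered (if slightly obsessive) crawler.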
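
One way to check robots.txt rules before fetching a page is Python's standard `urllib.robotparser` module. The sketch below is independent of siren: the robots.txt content, the user agent name `my-crawler`, and the URLs are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

def build_robot_rules(text):
    """Parse robots.txt text into a RobotFileParser without network access."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser

rules = build_robot_rules(ROBOTS_TXT)
allowed = rules.can_fetch("my-crawler", "http://example.com/novel/1.htm")
blocked = rules.can_fetch("my-crawler", "http://example.com/private/x.htm")
delay = rules.crawl_delay("my-crawler")  # seconds requested between fetches
print(allowed, blocked, delay)
```

A crawl delay like this one could inform the `interval` setting of a crawler configuration, although whether siren reads robots.txt itself is not stated here.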

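How it works

siren maintains a crawl queue. While the crawler is running, it takes one request from the queue at a time and checks it against the matching rules.

When a rule matches, the crawler executes an action: for example, downloading the URL and calling Python code to process it, or parsing the downloaded HTML and then calling Python code.

What sets siren apart is a set of predefined handlers, called parsers. Through configuration alone, they can process results directly, without any Python code.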
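
The queue-and-dispatch loop described above can be sketched roughly as follows. This is not siren's implementation: matching rules against the URL with a regular expression, the `(regex, action)` rule format, and the in-memory `SITE` dictionary standing in for the network are all assumptions made for the example.

```python
import re
from collections import deque

def crawl(start_url, rules, fetch):
    """Pop requests from a queue and run the action of the first matching rule.

    rules: list of (regex, action) pairs; action(url, body) returns new URLs.
    fetch: callable returning the body for a URL (injected so that the
           sketch runs without network access).
    """
    queue = deque([start_url])
    visited = []
    while queue:
        url = queue.popleft()
        visited.append(url)
        for pattern, action in rules:
            if re.search(pattern, url):
                # The action may enqueue follow-up requests.
                queue.extend(action(url, fetch(url)))
                break  # only the first matching rule fires in this sketch
    return visited

# Hypothetical in-memory "site": the index page lists two chapter pages.
SITE = {
    "index.htm": "1.htm 2.htm",
    "1.htm": "chapter one",
    "2.htm": "chapter two",
}
collected = {}

def follow_links(url, body):
    """Action: treat every whitespace-separated token as a link to follow."""
    return body.split()

def store_page(url, body):
    """Action: keep the page body and enqueue nothing."""
    collected[url] = body
    return []

RULES = [(r"^index\.htm$", follow_links), (r"^[0-9]+\.htm$", store_page)]
order = crawl("index.htm", RULES, SITE.__getitem__)
print(order, collected)
```

In this sketch the actions are Python callables. The point of the predefined parsers is that such actions can be described in configuration instead.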

Example

name: wenku8
timeout: 10
interval: 5
result: novel:result
output: output.txt
patterns:

  - name: main
    desc: table of content
    parsers:
      - css: a
        attr: href
        is: "[0-9]+\\.htm"
        call: node

  - name: node
    desc: node
    parsers:
      - css: div#title
        text: yes
        result: title
      - css: div#content
        html2text: yes
        result: content
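
To make the example more concrete, here is a minimal sketch of how parsers like these might be interpreted. It is a guess at the semantics, not siren's code: it assumes `css` selects elements, `attr` picks an attribute, `is` filters by regular expression, `call` hands the value to another pattern, and `result` names an output field. The `PATTERNS` dictionary mirrors the two patterns as plain Python data, the helper names and sample pages are hypothetical, and `html2text` is approximated by plain text extraction.

```python
import re
import xml.etree.ElementTree as ET

# The two example patterns as plain Python data (YAML 1.1 loaders read
# `yes` as True). Key semantics below are assumptions, see the lead-in.
PATTERNS = {
    "main": [{"css": "a", "attr": "href", "is": r"[0-9]+\.htm", "call": "node"}],
    "node": [
        {"css": "div#title", "text": True, "result": "title"},
        {"css": "div#content", "html2text": True, "result": "content"},
    ],
}

def css_to_xpath(css):
    """Translate 'tag' or 'tag#id' only; real CSS selectors are far richer."""
    tag, _, elem_id = css.partition("#")
    return f".//{tag}[@id='{elem_id}']" if elem_id else f".//{tag}"

def apply_pattern(name, page):
    """Return (results, calls): extracted fields and (pattern, value) follow-ups."""
    root = ET.fromstring(page)
    results, calls = {}, []
    for parser in PATTERNS[name]:
        for node in root.findall(css_to_xpath(parser["css"])):
            if "attr" in parser:
                value = node.get(parser["attr"], "")
            else:
                # `text` and `html2text` are both reduced to plain text here.
                value = "".join(node.itertext()).strip()
            if "is" in parser and not re.search(parser["is"], value):
                continue  # the regex filter rejected this value
            if "call" in parser:
                calls.append((parser["call"], value))
            if "result" in parser:
                results[parser["result"]] = value
    return results, calls

# Hypothetical, well-formed sample pages.
TOC = '<html><body><a href="1.htm">One</a><a href="about.html">About</a></body></html>'
CHAPTER = '<html><body><div id="title">One</div><div id="content"><p>Text</p></div></body></html>'

toc_results, toc_calls = apply_pattern("main", TOC)
node_results, node_calls = apply_pattern("node", CHAPTER)
print(toc_calls, node_results)
```

Under these assumptions, the `main` pattern yields follow-up requests for the `node` pattern, and `node` yields the `title` and `content` fields.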

Configuration

See config for details.

Getting started

See guide.

TODO

  • do something

    • bilibili
    • bt.ktxp.com
    • jd
  • regex

  • js runner

  • Store cookies in Redis to speed up reads and writes.

  • Queue loop prevention (in Redis): keep a list of URLs that have already been crawled.

  • parser in css or xpath
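
The loop-prevention item in the list above can be sketched with an in-memory set standing in for Redis. This is only an illustration of the idea, not planned siren code. The function `crawl_once` and the link graph are hypothetical.

```python
from collections import deque

def crawl_once(start, links_of):
    """Breadth-first walk that never enqueues the same URL twice."""
    seen = {start}          # URLs already queued or crawled
    queue = deque([start])
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        for nxt in links_of(url):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return order

# Hypothetical link graph containing a loop: a.htm -> b.htm -> a.htm.
GRAPH = {"a.htm": ["b.htm"], "b.htm": ["a.htm", "c.htm"], "c.htm": []}
visit_order = crawl_once("a.htm", GRAPH.__getitem__)
print(visit_order)
```

A shared store such as a Redis set could play the role of `seen` when several crawler processes need to see the same list.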

License

Copyright (C) 2012 Shell Xu

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program.  If not, see <http://www.gnu.org/licenses/>.

