code4craft / webmagic

A scalable web crawler framework for Java.

Home Page: http://webmagic.io/

License: Apache License 2.0

Languages: Java 76.77%, JavaScript 0.17%, HTML 22.73%, Shell 0.02%, Python 0.04%, Ruby 0.08%, Groovy 0.05%, Kotlin 0.13%
Topics: crawler, java, scraping, framework

webmagic's Introduction


Readme in Chinese


A scalable crawler framework. It covers the whole lifecycle of a crawler: downloading, URL management, content extraction, and persistence. It simplifies the development of specific crawlers.

Features:

  • Simple core with high flexibility.
  • Simple API for HTML extraction.
  • Annotated POJOs to customize a crawler; no configuration needed.
  • Multi-threading and distributed crawling support.
  • Easy to integrate.

Install:

Add dependencies to your pom.xml:

<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-core</artifactId>
    <version>${webmagic.version}</version>
</dependency>
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-extension</artifactId>
    <version>${webmagic.version}</version>
</dependency>

WebMagic uses slf4j with the slf4j-log4j12 binding. If you use your own slf4j implementation, exclude slf4j-log4j12, e.g. on the webmagic-extension dependency:

<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-extension</artifactId>
    <version>${webmagic.version}</version>
    <exclusions>
        <exclusion>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-log4j12</artifactId>
        </exclusion>
    </exclusions>
</dependency>

Get Started:

First crawler:

Write a class that implements PageProcessor. For example, here is a crawler for GitHub repository information.

public class GithubRepoPageProcessor implements PageProcessor {

    private Site site = Site.me().setRetryTimes(3).setSleepTime(1000);

    @Override
    public void process(Page page) {
        page.addTargetRequests(page.getHtml().links().regex("(https://github\\.com/\\w+/\\w+)").all());
        page.putField("author", page.getUrl().regex("https://github\\.com/(\\w+)/.*").toString());
        page.putField("name", page.getHtml().xpath("//h1[@class='public']/strong/a/text()").toString());
        if (page.getResultItems().get("name")==null){
            //skip this page
            page.setSkip(true);
        }
        page.putField("readme", page.getHtml().xpath("//div[@id='readme']/tidyText()"));
    }

    @Override
    public Site getSite() {
        return site;
    }

    public static void main(String[] args) {
        Spider.create(new GithubRepoPageProcessor()).addUrl("https://github.com/code4craft").thread(5).run();
    }
}
  • page.addTargetRequests(links)

    Add URLs to crawl.

You can also use the annotation mode:

@TargetUrl("https://github.com/\\w+/\\w+")
@HelpUrl("https://github.com/\\w+")
public class GithubRepo {

    @ExtractBy(value = "//h1[@class='public']/strong/a/text()", notNull = true)
    private String name;

    @ExtractByUrl("https://github\\.com/(\\w+)/.*")
    private String author;

    @ExtractBy("//div[@id='readme']/tidyText()")
    private String readme;

    public static void main(String[] args) {
        OOSpider.create(Site.me().setSleepTime(1000)
                , new ConsolePageModelPipeline(), GithubRepo.class)
                .addUrl("https://github.com/code4craft").thread(5).run();
    }
}

Docs and samples:

Documents: http://webmagic.io/docs/

The architecture of webmagic (inspired by Scrapy):


There are more examples in the webmagic-samples package.

License:

Licensed under the Apache 2.0 license.

Thanks:

To write webmagic, I referred to the projects below:

Mailing list:

https://groups.google.com/forum/#!forum/webmagic-java

http://list.qq.com/cgi-bin/qf_invite?id=023a01f505246785f77c5a5a9aff4e57ab20fcdde871e988

QQ Group: 373225642 542327088

Related Project:

  • Gather Platform

    A web console based on WebMagic for Spider configuration and management.

webmagic's People

Contributors: ayushi250317, bingoko, carl-don-it, ccliangbo, code4craft, conchz, d0ngw, dd-ray, edwardsbean, fengwuze, francoisgib, friddle, gzhy, harikrishna553, hooyantsing, hsqlu, jimmy-zha, johnsonsbaby, jsmaster008, jwlyn, ouyanghuangzheng, simpleexpress, snyk-bot, sutra, thebirdandfish, xbynet, ywooer, yxssfxwzy, zhugw, zyfxgo


webmagic's Issues

Add annotation for result formatter

Sometimes we need to add text before or after an extracted result, such as "ID"+id for a String field id, or to fill more than one result into a single field, such as name1+"|"+name2 for a field name. Add an annotation like this:

@Formatter(value=@ExtractBy("//div[@id='id']"),template="ID%s")
private String id;

UrlUtils.getCharset bug

When fetching http://www.gdwest.com/a/151/ and using UrlUtils.getCharset on the content, the returned charset is gb2312" (with a trailing quote) instead of gb2312.

Suggested code:

private static final Pattern patternForCharset = Pattern.compile(
        "<meta\\s+http-equiv=[\"']content-type[\"']\\s+content\\s*=\\s*[\"'][a-z]+/[a-z]+\\s*;\\s*charset=([a-z\\d\\-]+)[\"'>]",
        Pattern.CASE_INSENSITIVE);

public static String getCharset(String content) {
    Matcher matcher = patternForCharset.matcher(content);
    if (matcher.find()) {
        String charset = matcher.group(1);
        // only return charsets the JVM actually supports
        if (Charset.isSupported(charset)) {
            return charset;
        }
    }
    return null;
}

ExtractByUrl: add a way to limit the extraction range

The framework is convenient and pleasant to use; something to be proud of. Below is a small problem I ran into while using it.
When crawling paginated content, I use @ExtractByUrl to extract pagekey and page from the link. With a bare regex it extracts a lot of useless matches and still loses what I need.
Perhaps my pattern is simply wrong, but it would help to have something like @TargetUrl's sourceRegion to restrict the matching range, or some other way to extract pagekey and page; a hypothetical sketch follows.
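A purely illustrative sketch of the requested API (a sourceRegion attribute on @ExtractByUrl does not exist in webmagic; the attribute and pattern are hypothetical):

// Hypothetical: restrict the regex to the query string before extracting the page number.
@ExtractByUrl(value = "page=(\\d+)", sourceRegion = "\\?.*")
private String page;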

HttpClientPool does not have an HttpClient pool

BUG: Every call to generateClient() creates a new PoolingClientConnectionManager.
FIX: Reuse the PoolingClientConnectionManager.

private HttpClient generateClient(Site site) {
    // a new connection manager is created on every call, defeating pooling
    PoolingClientConnectionManager connectionManager = new PoolingClientConnectionManager(schemeRegistry);
}
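A minimal sketch of the fix against the HttpClient 4.2-era API the issue uses (field placement is illustrative): build the pooled manager once and share it across clients.

// Create the pooled connection manager once and reuse it for every generated client.
private final PoolingClientConnectionManager connectionManager =
        new PoolingClientConnectionManager(schemeRegistry);

private HttpClient generateClient(Site site) {
    return new DefaultHttpClient(connectionManager); // reuse, don't recreate
}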

Remove multi in ExtractBy

Remove multi from ExtractBy and use the class of the field instead: if the field is a List, multi defaults to true. A sketch follows.
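A hypothetical page model under that rule (the selectors and fields are illustrative):

// multi is inferred from the field type rather than an annotation attribute
@ExtractBy("//div[@class='tag']/text()")
private List<String> tags;   // List field: extract all matches

@ExtractBy("//h1/text()")
private String title;        // String field: extract the first match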

Spider does not exit on success

Sometimes the threads do not exit after the crawl completes in multi-thread mode.
I suspect a problem in the wait-notify mechanism:

if (request == null) {
    if (threadAlive.get() == 0 && exitWhenComplete) {
        break;
    }
    // wait until a new url is added
    waitNewUrl();
}
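One common remedy for this kind of lost-notification bug is to re-check the exit condition under the lock and wait with a bounded timeout, so a missed signal can never block a thread forever. A minimal sketch (the lock, condition, and timeout fields are assumptions, not webmagic's actual code):

private void waitNewUrl() {
    newUrlLock.lock();
    try {
        // double-check the exit condition after acquiring the lock
        if (threadAlive.get() == 0 && exitWhenComplete) {
            return;
        }
        // bounded wait: a lost notification costs at most one timeout period
        newUrlCondition.await(emptySleepTime, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
    } finally {
        newUrlLock.unlock();
    }
}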

Add HTTPS support

As the title says, crawling HTTPS sites currently fails with:
Caused by: org.apache.http.HttpException: Scheme 'https' not registered.
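A sketch of the era-appropriate fix, using the HttpClient 4.2 API the downloader was built on: register the https scheme alongside http in the SchemeRegistry.

// Register https so the pooled connection manager can open TLS connections.
SchemeRegistry schemeRegistry = new SchemeRegistry();
schemeRegistry.register(new Scheme("http", 80, PlainSocketFactory.getSocketFactory()));
schemeRegistry.register(new Scheme("https", 443, SSLSocketFactory.getSocketFactory()));
PoolingClientConnectionManager connectionManager = new PoolingClientConnectionManager(schemeRegistry);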

WebMagic-Avalon project plan

The goal of the WebMagic-Avalon project is to build a configurable, manageable crawler plus a platform for sharing configurations and scripts, reducing the work for experienced developers and letting people unfamiliar with Java run a crawler easily.

Part 1: webmagic-scripts

Goal: allow crawlers to be written as simple scripts, so that common scenarios can be covered by shareable scripts.
For example, to scrape GitHub repository data, you could write a script like this (JavaScript):

https://github.com/code4craft/webmagic/tree/master/webmagic-scripts

This is partially implemented, but the end result is still experimental; everyone is welcome to take part and give feedback. Since designing a stable scripting language takes deep domain experience, the decision was made to build the back end first.

Part 2: webmagic-panel

A back end that loads scripts and manages crawlers. Under development.

Part 3: webmagic-market

A site for sharing, searching, and downloading scripts. Planned.

How to contribute

webmagic is currently maintained by the author in his spare time, purely for sharing and self-improvement; there is no revenue and no plan to commercialize.

Contributions are welcome in the following forms:

  1. Suggest improvements to webmagic itself, via the mailing list, QQ, OSChina, or (preferred) a GitHub issue.
  2. Join the design discussion for the WebMagic-Avalon plan, including product design and technology choices, by replying to this issue.
  3. Contribute code: fork the repository, make your changes, and submit a pull request. Please work from as recent a version as possible and describe what you changed. Once a pull request is accepted, I will add you as a committer.

the enhanced code

@Extract(value = 
        { 
        @ExtractBy(value = "//meta[@itemprop='name']/@content", type = ExprType.XPATH),
        @ExtractBy(value = "h1.productTitle", type = ExprType.CSS, isOuterHtml = false),
        @ExtractBy(value = "h1.productTitle p span", type = ExprType.CSS) 
        }, op = OP.OR)
private String productName;

@Extract(value = 
        { 
        @ExtractBy(value = "R3_ITEM\\.setId\\(['\"](\\d+)['\"]\\)", type = ExprType.REGEX),
        @ExtractBy(value = "var\\s+DefaultItem\\s*=\\s*\\{\\s*itemId\\s*:\\s*(\\d+)\\s*,", type = ExprType.REGEX),
        @ExtractBy(value = "//input[@name='product_id']/@value")
        }, op = OP.OR)
private String productId;

@Extract(value =
        {
        @ExtractBy(value = "table.SpecTable", type = ExprType.CSS),
        @ExtractBy(value = "Walmart No\\.:</td>\\s*<td.+?>(\\d+)</td>", type = ExprType.REGEX)
        }, op = OP.AND)
private String channelSKU;

Improve distributed support

There is a Redis-based distributed implementation, but it is fairly basic and does not persist the extra information carried by a Request. Finish this feature, and consider supporting other stores as well; a sketch of the existing Redis setup follows.
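For reference, a minimal sketch of wiring up the existing RedisScheduler from webmagic-extension (the Redis host is a placeholder):

// Share the URL queue through Redis so several spider instances can cooperate.
Spider.create(new GithubRepoPageProcessor())
        .setScheduler(new RedisScheduler("127.0.0.1"))
        .addUrl("https://github.com/code4craft")
        .thread(5)
        .run();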

Write crawler by configuration

Create a crawler dynamically from configuration, so that a developer only has to write XML to build a crawler.

Scheduled for 0.3.0.

Update httpclient to 4.3.1

httpclient 4.3.1 has a changed API with more features. Update to httpclient 4.3.1 and make use of them.

why cycleretry cannot work in QueueScheduler

This project is amazing! But I have a tiny question.

I think the push method of QueueScheduler should be as follows:

public synchronized void push(Request request, Task task) {
    // requests re-pushed for cycle retry should bypass the duplicate-URL check
    if (request.getExtra(Request.CYCLE_TRIED_TIMES) != null || urls.add(request)) {
        queue.add(request);
    }
}

Hope to receive your reply! Thanks!

Make XPath syntax errors friendlier

When an XPath expression is wrong, the exception does not say where; ideally it would point at the offending character.
org.htmlcleaner.XPatherException: Error in evaluating XPath expression!
at org.htmlcleaner.XPather.throwStandardException(XPather.java:111)
at org.htmlcleaner.XPather.evaluateAgainst(XPather.java:172)
at org.htmlcleaner.XPather.evaluateAgainst(XPather.java:170)
at org.htmlcleaner.XPather.evaluateAgainst(XPather.java:170)
at org.htmlcleaner.XPather.evaluateAgainstNode(XPather.java:98)
at org.htmlcleaner.TagNode.evaluateXPath(TagNode.java:457)
Hmm, it does feel fairly hard to do, though...

Move selectors to a separate project

Move the selectors to a separate project. After that, webmagic-core will contain only the core crawler components, and webmagic-extractor can work on its own as an html/text extractor.

Annotation extractor does not work

Exception in thread "main" java.lang.IllegalArgumentException: String input must not be null
  at org.jsoup.helper.Validate.notNull(Validate.java:26)
  at org.jsoup.parser.TreeBuilder.initialiseParse(TreeBuilder.java:24)
  at org.jsoup.parser.TreeBuilder.parse(TreeBuilder.java:40)
  at org.jsoup.parser.HtmlTreeBuilder.parse(HtmlTreeBuilder.java:37)
  at org.jsoup.parser.Parser.parse(Parser.java:90)
  at org.jsoup.Jsoup.parse(Jsoup.java:58)

Write spider by script

Bring other JVM languages in and write spiders (mainly the PageProcessor) as scripts. Support JRuby and JavaScript first as an experiment.

oschina.js

title = $("div.BlogTitle h1"),
content = $("div.BlogContent")
urls("http://my\\.oschina\\.net/flashsword/blog/\\d+")

oschina.rb

title = css "div.BlogTitle h1"
content = css "div.BlogContent"
urls "http://my\\.oschina\\.net/flashsword/blog/\\d+"

Add annotation-based crawler configuration

So far, writing a PageProcessor has meant writing code by hand, which doesn't feel much like a framework.
Since Java programs all use a model anyway, consider annotating the model. This sacrifices some flexibility, but it is clearer and easier to maintain.

Downloader threads hang on timeout

The socket timeout is not set in 0.4.0~0.4.1, so when the connection succeeds but the data is never received completely, the thread hangs waiting for it. Fix by calling setSocketTimeout.

Auto type conversion for fields

In the annotation module, a page-model field can currently only be a String or List<String>. Add support for more basic types with automatic conversion, such as int/Integer, long/Long, double/Double, boolean/Boolean. A hypothetical example follows.
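A hypothetical page model once the conversion lands (the class, fields, and selectors are illustrative):

public class ProductPage {

    @ExtractBy("//span[@class='price']/text()")
    private double price;    // extracted text auto-converted to a primitive

    @ExtractBy("//span[@class='stock']/text()")
    private Integer stock;   // boxed types converted as well
}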

http proxy support?

Hi Sir,

Can you tell me how to add an http proxy for page access?
I have the following code, but it does not take effect:

System.setProperty("http.proxyHost", "proxyserver");
System.setProperty("http.proxyPort", "8080");

Thanks
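Apache HttpClient ignores the JVM http.proxy* system properties unless explicitly configured to use them, which is why the snippet above has no effect. Later webmagic releases (around 0.5.x) let the proxy be set on Site; a hedged sketch, with placeholder host and port:

// Set the proxy on Site rather than via JVM system properties.
// (HttpHost is org.apache.http.HttpHost; "proxyserver"/8080 are placeholders.)
private Site site = Site.me()
        .setHttpProxy(new HttpHost("proxyserver", 8080))
        .setRetryTimes(3)
        .setSleepTime(1000);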

Add jsonpath in annotation mode for json results

Add JsonPath to the annotation mode for JSON extraction:

public class AppStore {

    @ExtractBy(type = ExtractBy.Type.JsonPath, value = "$..trackName")
    private String trackName;

    @ExtractBy(type = ExtractBy.Type.JsonPath, value = "$..description")
    private String description;

    public static void main(String[] args) {
        AppStore appStore = OOSpider.create(Site.me(), AppStore.class).<AppStore>get("http://itunes.apple.com/lookup?id=653350791&country=cn&entity=software");
        System.out.println(appStore.trackName);
        System.out.println(appStore.description);
    }
}

Improve AJAX crawling

There is currently a Selenium-based implementation, but some refinements remain, such as disabling image loading to speed up rendering. Consider combining WebKit's ghostdriver with Selenium.

Further gzip support

Add Accept-Encoding: gzip to the HTTP request headers to cut the transport size when the server supports gzip; a sketch follows.
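A minimal sketch using Site's per-site header API (available in later webmagic releases; the downloader must still decompress the response):

// Request gzip-compressed responses for every page fetched for this site.
private Site site = Site.me()
        .addHeader("Accept-Encoding", "gzip");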

Parsing html after page.getHtml()

In 0.4.0, the html is parsed immediately after download. Sometimes we don't want to parse it at all, but there is no way to opt out. Defer parsing until page.getHtml() is actually called; a sketch follows.
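A minimal sketch of the lazy-parsing idea inside Page (the field and constructor usage are assumptions, not the actual implementation):

// Parse the raw text on first access instead of right after download.
private volatile Html html;

public Html getHtml() {
    if (html == null) {
        synchronized (this) {
            if (html == null) {
                html = new Html(rawText); // parse on demand
            }
        }
    }
    return html;
}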
