code4craft / webmagic

A scalable web crawler framework for Java.

Home Page: http://webmagic.io/

License: Apache License 2.0

Languages: Java 76.77%, JavaScript 0.17%, HTML 22.73%, Shell 0.02%, Python 0.04%, Ruby 0.08%, Groovy 0.05%, Kotlin 0.13%
Topics: crawler, java, scraping, framework

webmagic's Introduction


Readme in Chinese


A scalable crawler framework. It covers the whole lifecycle of a crawler: downloading, URL management, content extraction, and persistence. It simplifies the development of specific crawlers.

Features:

  • Simple core with high flexibility.
  • Simple API for HTML extraction.
  • Annotated POJOs to customize a crawler; no configuration needed.
  • Multi-threading and distributed crawling support.
  • Easy to integrate.

Install:

Add dependencies to your pom.xml:

<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-core</artifactId>
    <version>${webmagic.version}</version>
</dependency>
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-extension</artifactId>
    <version>${webmagic.version}</version>
</dependency>

WebMagic uses slf4j with the slf4j-log4j12 binding. If you use your own slf4j implementation, exclude slf4j-log4j12, e.g. on the webmagic-extension dependency:

<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-extension</artifactId>
    <version>${webmagic.version}</version>
    <exclusions>
        <exclusion>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-log4j12</artifactId>
        </exclusion>
    </exclusions>
</dependency>

Get Started:

First crawler:

Write a class that implements PageProcessor. For example, here is a crawler for GitHub repository information.

public class GithubRepoPageProcessor implements PageProcessor {

    private Site site = Site.me().setRetryTimes(3).setSleepTime(1000);

    @Override
    public void process(Page page) {
        page.addTargetRequests(page.getHtml().links().regex("(https://github\\.com/\\w+/\\w+)").all());
        page.putField("author", page.getUrl().regex("https://github\\.com/(\\w+)/.*").toString());
        page.putField("name", page.getHtml().xpath("//h1[@class='public']/strong/a/text()").toString());
        if (page.getResultItems().get("name")==null){
            //skip this page
            page.setSkip(true);
        }
        page.putField("readme", page.getHtml().xpath("//div[@id='readme']/tidyText()"));
    }

    @Override
    public Site getSite() {
        return site;
    }

    public static void main(String[] args) {
        Spider.create(new GithubRepoPageProcessor()).addUrl("https://github.com/code4craft").thread(5).run();
    }
}
  • page.addTargetRequests(links)

    Add URLs to crawl.

You can also use the annotation mode:

@TargetUrl("https://github.com/\\w+/\\w+")
@HelpUrl("https://github.com/\\w+")
public class GithubRepo {

    @ExtractBy(value = "//h1[@class='public']/strong/a/text()", notNull = true)
    private String name;

    @ExtractByUrl("https://github\\.com/(\\w+)/.*")
    private String author;

    @ExtractBy("//div[@id='readme']/tidyText()")
    private String readme;

    public static void main(String[] args) {
        OOSpider.create(Site.me().setSleepTime(1000)
                , new ConsolePageModelPipeline(), GithubRepo.class)
                .addUrl("https://github.com/code4craft").thread(5).run();
    }
}

Docs and samples:

Documents: http://webmagic.io/docs/

The architecture of webmagic (inspired by Scrapy):


There are more examples in the webmagic-samples package.

License:

Licensed under the Apache 2.0 license.

Thanks:

To write webmagic, I referred to the projects below:

Mailing list:

https://groups.google.com/forum/#!forum/webmagic-java

http://list.qq.com/cgi-bin/qf_invite?id=023a01f505246785f77c5a5a9aff4e57ab20fcdde871e988

QQ Group: 373225642 542327088

Related Project:

  • Gather Platform

    A web console based on WebMagic for Spider configuration and management.

webmagic's People

Contributors: ayushi250317, bingoko, carl-don-it, ccliangbo, code4craft, conchz, d0ngw, dd-ray, edwardsbean, fengwuze, francoisgib, friddle, gzhy, harikrishna553, hooyantsing, hsqlu, jimmy-zha, johnsonsbaby, jsmaster008, jwlyn, ouyanghuangzheng, simpleexpress, snyk-bot, sutra, thebirdandfish, xbynet, ywooer, yxssfxwzy, zhugw, zyfxgo


webmagic's Issues

Add annotation for result formatter

Sometimes we need to add text before or after an extracted result, such as "ID"+id for a String field id, or to fill more than one result into a single field, such as name1+"|"+name2 for a field name. Add an annotation like this:

@Formatter(value=@ExtractBy("//div[@id='id']"),template="ID%s")
private String id;

UrlUtils.getCharset bug

When fetching http://www.gdwest.com/a/151/ and using UrlUtils.getCharset on the content, the returned charset is gb2312" (with a trailing quote) instead of gb2312.

Suggested code:

private static final Pattern patternForCharset = Pattern.compile(
        "<meta\\s+http-equiv=[\"']content-type[\"']\\s+content\\s*=\\s*[\"'][a-z]+/[a-z]+\\s*;\\s*charset=([a-z\\d\\-]+)[\"'>]",
        Pattern.CASE_INSENSITIVE);

public static String getCharset(String content) {
    Matcher matcher = patternForCharset.matcher(content);
    if (matcher.find()) {
        String charset = matcher.group(1);
        // only return charsets the JVM actually supports
        if (Charset.isSupported(charset)) {
            return charset;
        }
    }
    return null;
}

ExtractByUrl: add a way to limit the extraction range

The framework is convenient and pleasant to use; something to be proud of. Below is a small problem I ran into while using it.
When crawling paginated content, I use @ExtractByUrl to extract pagekey and page from the link. With a bare regex it extracts a lot of useless matches and still loses what I need.
Perhaps my pattern is simply wrong, but it would help to have something like @TargetUrl's sourceRegion to restrict the matching range, or some other way to extract pagekey and page; a hypothetical sketch follows.
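A purely illustrative sketch of the requested API (a sourceRegion attribute on @ExtractByUrl does not exist in webmagic; the attribute and pattern are hypothetical):

// Hypothetical: restrict the regex to the query string before extracting the page number.
@ExtractByUrl(value = "page=(\\d+)", sourceRegion = "\\?.*")
private String page;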

HttpClientPool does not have an HttpClient pool

BUG: Every call to generateClient() creates a new PoolingClientConnectionManager.
FIX: Reuse the PoolingClientConnectionManager.

private HttpClient generateClient(Site site) {
    // a new connection manager is created on every call, defeating pooling
    PoolingClientConnectionManager connectionManager = new PoolingClientConnectionManager(schemeRegistry);
}
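A minimal sketch of the fix against the HttpClient 4.2-era API the issue uses (field placement is illustrative): build the pooled manager once and share it across clients.

// Create the pooled connection manager once and reuse it for every generated client.
private final PoolingClientConnectionManager connectionManager =
        new PoolingClientConnectionManager(schemeRegistry);

private HttpClient generateClient(Site site) {
    return new DefaultHttpClient(connectionManager); // reuse, don't recreate
}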

Remove multi in ExtractBy

Remove multi from ExtractBy and use the class of the field instead: if the field is a List, multi defaults to true. A sketch follows.
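A hypothetical page model under that rule (the selectors and fields are illustrative):

// multi is inferred from the field type rather than an annotation attribute
@ExtractBy("//div[@class='tag']/text()")
private List<String> tags;   // List field: extract all matches

@ExtractBy("//h1/text()")
private String title;        // String field: extract the first match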

Spider does not exit on success

Sometimes the threads do not exit after the crawl completes in multi-thread mode.
I suspect a problem in the wait-notify mechanism:

if (request == null) {
    if (threadAlive.get() == 0 && exitWhenComplete) {
        break;
    }
    // wait until a new url is added
    waitNewUrl();
}
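One common remedy for this kind of lost-notification bug is to re-check the exit condition under the lock and wait with a bounded timeout, so a missed signal can never block a thread forever. A minimal sketch (the lock, condition, and timeout fields are assumptions, not webmagic's actual code):

private void waitNewUrl() {
    newUrlLock.lock();
    try {
        // double-check the exit condition after acquiring the lock
        if (threadAlive.get() == 0 && exitWhenComplete) {
            return;
        }
        // bounded wait: a lost notification costs at most one timeout period
        newUrlCondition.await(emptySleepTime, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
    } finally {
        newUrlLock.unlock();
    }
}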

Add HTTPS support

As the title says, crawling HTTPS sites currently fails with:
Caused by: org.apache.http.HttpException: Scheme 'https' not registered.
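A sketch of the era-appropriate fix, using the HttpClient 4.2 API the downloader was built on: register the https scheme alongside http in the SchemeRegistry.

// Register https so the pooled connection manager can open TLS connections.
SchemeRegistry schemeRegistry = new SchemeRegistry();
schemeRegistry.register(new Scheme("http", 80, PlainSocketFactory.getSocketFactory()));
schemeRegistry.register(new Scheme("https", 443, SSLSocketFactory.getSocketFactory()));
PoolingClientConnectionManager connectionManager = new PoolingClientConnectionManager(schemeRegistry);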

WebMagic-Avalon project plan

The goal of the WebMagic-Avalon project is to build a configurable, manageable crawler plus a platform for sharing configurations and scripts, reducing the work for experienced developers and letting people unfamiliar with Java run a crawler easily.

Part 1: webmagic-scripts

Goal: allow crawlers to be written as simple scripts, so that common scenarios can be covered by shareable scripts.
For example, to scrape GitHub repository data, you could write a script like this (JavaScript):

https://github.com/code4craft/webmagic/tree/master/webmagic-scripts

This is partially implemented, but the end result is still experimental; everyone is welcome to take part and give feedback. Since designing a stable scripting language takes deep domain experience, the decision was made to build the back end first.

Part 2: webmagic-panel

A back end that loads scripts and manages crawlers. Under development.

Part 3: webmagic-market

A site for sharing, searching, and downloading scripts. Planned.

How to contribute

webmagic is currently maintained by the author in his spare time, purely for sharing and self-improvement; there is no revenue and no plan to commercialize.

Contributions are welcome in the following forms:

  1. Suggest improvements to webmagic itself, via the mailing list, QQ, OSChina, or (preferred) a GitHub issue.
  2. Join the design discussion for the WebMagic-Avalon plan, including product design and technology choices, by replying to this issue.
  3. Contribute code: fork the repository, make your changes, and submit a pull request. Please work from as recent a version as possible and describe what you changed. Once a pull request is accepted, I will add you as a committer.

the enhanced code

@Extract(value = 
        { 
        @ExtractBy(value = "//meta[@itemprop='name']/@content", type = ExprType.XPATH),
        @ExtractBy(value = "h1.productTitle", type = ExprType.CSS, isOuterHtml = false),
        @ExtractBy(value = "h1.productTitle p span", type = ExprType.CSS) 
        }, op = OP.OR)
private String productName;

@Extract(value = 
        { 
        @ExtractBy(value = "R3_ITEM\\.setId\\(['\"](\\d+)['\"]\\)", type = ExprType.REGEX),
        @ExtractBy(value = "var\\s+DefaultItem\\s*=\\s*\\{\\s*itemId\\s*:\\s*(\\d+)\\s*,", type = ExprType.REGEX),
        @ExtractBy(value = "//input[@name='product_id']/@value")
        }, op = OP.OR)
private String productId;

@Extract(value =
        {
        @ExtractBy(value = "table.SpecTable", type = ExprType.CSS),
        @ExtractBy(value = "Walmart No\\.:</td>\\s*<td.+?>(\\d+)</td>", type = ExprType.REGEX)
        }, op = OP.AND)
private String channelSKU;

Improve distributed support

There is a Redis-based distributed implementation, but it is fairly basic and does not persist the extra information carried by a Request. Finish this feature, and consider supporting other stores as well; a sketch of the existing Redis setup follows.
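For reference, a minimal sketch of wiring up the existing RedisScheduler from webmagic-extension (the Redis host is a placeholder):

// Share the URL queue through Redis so several spider instances can cooperate.
Spider.create(new GithubRepoPageProcessor())
        .setScheduler(new RedisScheduler("127.0.0.1"))
        .addUrl("https://github.com/code4craft")
        .thread(5)
        .run();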

Write crawler by configuration

Create a crawler dynamically from configuration, so that a developer only has to write XML to build a crawler.

Scheduled for 0.3.0.

Update httpclient to 4.3.1

httpclient 4.3.1 has a changed API with more features. Update to httpclient 4.3.1 and make use of them.

why cycleretry cannot work in QueueScheduler

This project is amazing! But I have a tiny question.

I think the push method of QueueScheduler should be as follows:

public synchronized void push(Request request, Task task) {
    // requests re-pushed for cycle retry should bypass the duplicate-URL check
    if (request.getExtra(Request.CYCLE_TRIED_TIMES) != null || urls.add(request)) {
        queue.add(request);
    }
}

Hope to receive your reply! Thanks!

Make XPath syntax errors friendlier

When an XPath expression is wrong, the exception does not say where; ideally it would point at the offending character.
org.htmlcleaner.XPatherException: Error in evaluating XPath expression!
at org.htmlcleaner.XPather.throwStandardException(XPather.java:111)
at org.htmlcleaner.XPather.evaluateAgainst(XPather.java:172)
at org.htmlcleaner.XPather.evaluateAgainst(XPather.java:170)
at org.htmlcleaner.XPather.evaluateAgainst(XPather.java:170)
at org.htmlcleaner.XPather.evaluateAgainstNode(XPather.java:98)
at org.htmlcleaner.TagNode.evaluateXPath(TagNode.java:457)
Hmm, it does feel fairly hard to do, though...

Move selectors to a separate project

Move the selectors to a separate project. After that, webmagic-core will contain only the core crawler components, and webmagic-extractor can work on its own as an html/text extractor.

Annotation extractor does not work

Exception in thread "main" java.lang.IllegalArgumentException: String input must not be null
  at org.jsoup.helper.Validate.notNull(Validate.java:26)
  at org.jsoup.parser.TreeBuilder.initialiseParse(TreeBuilder.java:24)
  at org.jsoup.parser.TreeBuilder.parse(TreeBuilder.java:40)
  at org.jsoup.parser.HtmlTreeBuilder.parse(HtmlTreeBuilder.java:37)
  at org.jsoup.parser.Parser.parse(Parser.java:90)
  at org.jsoup.Jsoup.parse(Jsoup.java:58)

Write spider by script

Bring other JVM languages in and write spiders (mainly the PageProcessor) as scripts. Support JRuby and JavaScript first as an experiment.

oschina.js

title = $("div.BlogTitle h1"),
content = $("div.BlogContent")
urls("http://my\\.oschina\\.net/flashsword/blog/\\d+")

oschina.rb

title = css "div.BlogTitle h1"
content = css "div.BlogContent"
urls "http://my\\.oschina\\.net/flashsword/blog/\\d+"

Add annotation-based crawler configuration

So far, writing a PageProcessor has meant writing code by hand, which doesn't feel much like a framework.
Since Java programs all use a model anyway, consider annotating the model. This sacrifices some flexibility, but it is clearer and easier to maintain.

Downloader threads hang on timeout

The socket timeout is not set in 0.4.0~0.4.1, so when the connection succeeds but the data is never received completely, the thread hangs waiting for it. Fix by calling setSocketTimeout.

Auto type conversion for fields

In the annotation module, a page-model field can currently only be a String or List<String>. Add support for more basic types with automatic conversion, such as int/Integer, long/Long, double/Double, boolean/Boolean. A hypothetical example follows.
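A hypothetical page model once the conversion lands (the class, fields, and selectors are illustrative):

public class ProductPage {

    @ExtractBy("//span[@class='price']/text()")
    private double price;    // extracted text auto-converted to a primitive

    @ExtractBy("//span[@class='stock']/text()")
    private Integer stock;   // boxed types converted as well
}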

http proxy support?

Hi Sir,

Can you tell me how to add an http proxy for page access?
I have the following code, but it does not take effect:

System.setProperty("http.proxyHost", "proxyserver");
System.setProperty("http.proxyPort", "8080");

Thanks
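Apache HttpClient ignores the JVM http.proxy* system properties unless explicitly configured to use them, which is why the snippet above has no effect. Later webmagic releases (around 0.5.x) let the proxy be set on Site; a hedged sketch, with placeholder host and port:

// Set the proxy on Site rather than via JVM system properties.
// (HttpHost is org.apache.http.HttpHost; "proxyserver"/8080 are placeholders.)
private Site site = Site.me()
        .setHttpProxy(new HttpHost("proxyserver", 8080))
        .setRetryTimes(3)
        .setSleepTime(1000);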

Add jsonpath in annotation mode for json results

Add JsonPath to the annotation mode for JSON extraction:

public class AppStore {

    @ExtractBy(type = ExtractBy.Type.JsonPath, value = "$..trackName")
    private String trackName;

    @ExtractBy(type = ExtractBy.Type.JsonPath, value = "$..description")
    private String description;

    public static void main(String[] args) {
        AppStore appStore = OOSpider.create(Site.me(), AppStore.class).<AppStore>get("http://itunes.apple.com/lookup?id=653350791&country=cn&entity=software");
        System.out.println(appStore.trackName);
        System.out.println(appStore.description);
    }
}

Improve AJAX crawling

There is currently a Selenium-based implementation, but some refinements remain, such as disabling image loading to speed up rendering. Consider combining WebKit's ghostdriver with Selenium.

Further gzip support

Add Accept-Encoding: gzip to the HTTP request headers to cut the transport size when the server supports gzip; a sketch follows.
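A minimal sketch using Site's per-site header API (available in later webmagic releases; the downloader must still decompress the response):

// Request gzip-compressed responses for every page fetched for this site.
private Site site = Site.me()
        .addHeader("Accept-Encoding", "gzip");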

Parsing html after page.getHtml()

In 0.4.0, the html is parsed immediately after download. Sometimes we don't want to parse it at all, but there is no way to opt out. Defer parsing until page.getHtml() is actually called; a sketch follows.
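A minimal sketch of the lazy-parsing idea inside Page (the field and constructor usage are assumptions, not the actual implementation):

// Parse the raw text on first access instead of right after download.
private volatile Html html;

public Html getHtml() {
    if (html == null) {
        synchronized (this) {
            if (html == null) {
                html = new Html(rawText); // parse on demand
            }
        }
    }
    return html;
}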
