
webcollector's Introduction

WebCollector

WebCollector is an open-source web crawler framework based on Java. It provides simple interfaces for crawling the Web, so you can set up a multi-threaded web crawler in less than 5 minutes.

In addition to a general-purpose crawler framework, WebCollector also integrates CEPF, a state-of-the-art web content extraction algorithm proposed by Wu et al.:

  • Wu GQ, Hu J, Li L, Xu ZH, Liu PC, Hu XG, Wu XD. Online Web news extraction via tag path feature fusion. Ruan Jian Xue Bao/Journal of Software, 2016,27(3):714-735 (in Chinese). http://www.jos.org.cn/1000-9825/4868.htm

HomePage

https://github.com/CrawlScript/WebCollector

Installation

Using Maven

<dependency>
    <groupId>cn.edu.hfut.dmic.webcollector</groupId>
    <artifactId>WebCollector</artifactId>
    <version>2.73-alpha</version>
</dependency>

Without Maven

WebCollector jars are available on the HomePage.

  • webcollector-version-bin.zip contains core jars.

Example Index

The annotation-based versions are named DemoAnnotatedxxxxxx.java.

Basic

CrawlDatum and MetaData

Http Request and Javascript

NextFilter

Quickstart

Let's crawl some news from the GitHub blog. This demo prints the titles and contents extracted from the news pages.

Automatically Detecting URLs

DemoAutoNewsCrawler.java:

import cn.edu.hfut.dmic.webcollector.model.CrawlDatums;
import cn.edu.hfut.dmic.webcollector.model.Page;
import cn.edu.hfut.dmic.webcollector.plugin.rocks.BreadthCrawler;

/**
 * Crawling news from github news
 *
 * @author hu
 */
public class DemoAutoNewsCrawler extends BreadthCrawler {
    /**
     * @param crawlPath crawlPath is the path of the directory which maintains
     *                  information of this crawler
     * @param autoParse if autoParse is true, BreadthCrawler will automatically extract
     *                  links which match the regex rules from the page
     */
    public DemoAutoNewsCrawler(String crawlPath, boolean autoParse) {
        super(crawlPath, autoParse);
        /*start pages*/
        this.addSeed("https://blog.github.com/");
        for(int pageIndex = 2; pageIndex <= 5; pageIndex++) {
            String seedUrl = String.format("https://blog.github.com/page/%d/", pageIndex);
            this.addSeed(seedUrl);
        }

        /*fetch url like "https://blog.github.com/2018-07-13-graphql-for-octokit/" */
        this.addRegex("https://blog.github.com/[0-9]{4}-[0-9]{2}-[0-9]{2}-[^/]+/");
        /*do not fetch jpg|png|gif*/
        //this.addRegex("-.*\\.(jpg|png|gif).*");
        /*do not fetch url contains #*/
        //this.addRegex("-.*#.*");

        setThreads(50);
        getConf().setTopN(100);

        //enable resumable mode
        //setResumable(true);
    }

    @Override
    public void visit(Page page, CrawlDatums next) {
        String url = page.url();
        /*if page is news page*/
        if (page.matchUrl("https://blog.github.com/[0-9]{4}-[0-9]{2}-[0-9]{2}[^/]+/")) {

            /*extract title and content of news by css selector*/
            String title = page.select("h1[class=lh-condensed]").first().text();
            String content = page.selectText("div.content.markdown-body");

            System.out.println("URL:\n" + url);
            System.out.println("title:\n" + title);
            System.out.println("content:\n" + content);

            /*If you want to add urls to crawl,add them to nextLink*/
            /*WebCollector automatically filters links that have been fetched before*/
            /*If autoParse is true and the link you add to nextLinks does not match the
              regex rules, the link will also be filtered.*/
            //next.add("http://xxxxxx.com");
        }
    }

    public static void main(String[] args) throws Exception {
        DemoAutoNewsCrawler crawler = new DemoAutoNewsCrawler("crawl", true);
        /*start crawl with depth of 4*/
        crawler.start(4);
    }

}

Manually Detecting URLs

DemoManualNewsCrawler.java:

import cn.edu.hfut.dmic.webcollector.model.CrawlDatums;
import cn.edu.hfut.dmic.webcollector.model.Page;
import cn.edu.hfut.dmic.webcollector.plugin.rocks.BreadthCrawler;

/**
 * Crawling news from github news
 *
 * @author hu
 */
public class DemoManualNewsCrawler extends BreadthCrawler {
    /**
     * @param crawlPath crawlPath is the path of the directory which maintains
     *                  information of this crawler
     * @param autoParse if autoParse is true, BreadthCrawler will automatically extract
     *                  links which match the regex rules from the page
     */
    public DemoManualNewsCrawler(String crawlPath, boolean autoParse) {
        super(crawlPath, autoParse);
        // add 5 start pages and set their type to "list"
        //"list" is not a reserved word, you can use other string instead
        this.addSeedAndReturn("https://blog.github.com/").type("list");
        for(int pageIndex = 2; pageIndex <= 5; pageIndex++) {
            String seedUrl = String.format("https://blog.github.com/page/%d/", pageIndex);
            this.addSeed(seedUrl, "list");
        }

        setThreads(50);
        getConf().setTopN(100);

        //enable resumable mode
        //setResumable(true);
    }

    @Override
    public void visit(Page page, CrawlDatums next) {
        String url = page.url();

        if (page.matchType("list")) {
            /*if type is "list"*/
            /*detect content page by css selector and mark their types as "content"*/
            next.add(page.links("h1.lh-condensed>a")).type("content");
        }else if(page.matchType("content")) {
            /*if type is "content"*/
            /*extract title and content of news by css selector*/
            String title = page.select("h1[class=lh-condensed]").first().text();
            String content = page.selectText("div.content.markdown-body");

            //read title_prefix and content_length_limit from configuration
            title = getConf().getString("title_prefix") + title;
            content = content.substring(0, getConf().getInteger("content_length_limit"));

            System.out.println("URL:\n" + url);
            System.out.println("title:\n" + title);
            System.out.println("content:\n" + content);
        }

    }

    public static void main(String[] args) throws Exception {
        DemoManualNewsCrawler crawler = new DemoManualNewsCrawler("crawl", false);

        crawler.getConf().setExecuteInterval(5000);

        crawler.getConf().set("title_prefix","PREFIX_");
        crawler.getConf().set("content_length_limit", 20);

        /*start crawl with depth of 4*/
        crawler.start(4);
    }

}

CrawlDatum

CrawlDatum is an important data structure in WebCollector; it corresponds to the URL of a webpage. Both crawled URLs and detected URLs are maintained as CrawlDatums.

There are some differences between a CrawlDatum and a plain URL:

  • A CrawlDatum contains a key and a url. The key is the url by default. You can set the key manually with CrawlDatum.key("xxxxx"), so CrawlDatums with the same url may have different keys. This is very useful in tasks such as crawling data through an API, which often requests different data from the same url with different POST parameters (see the sketch after this list).
  • A CrawlDatum may contain metadata, which can carry extra information besides the url.
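For illustration, here is a minimal sketch of both points, using the fluent key()/meta() calls shown elsewhere in this document (the API url, key names, and metadata values are hypothetical):

import cn.edu.hfut.dmic.webcollector.model.CrawlDatum;
import cn.edu.hfut.dmic.webcollector.model.CrawlDatums;

public class CrawlDatumKeyExample {
    // detect two API requests that share one url but carry different keys and metadata
    public void detect(CrawlDatums next) {
        CrawlDatum firstPage = new CrawlDatum("https://api.example.com/search") // hypothetical API url
                .key("search_page_1")   // the key overrides the default (the url itself)
                .meta("page_num", "1"); // metadata travels with the CrawlDatum
        CrawlDatum secondPage = new CrawlDatum("https://api.example.com/search")
                .key("search_page_2")   // same url, different key, so both are kept
                .meta("page_num", "2");
        next.add(firstPage);
        next.add(secondPage);
    }
}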

Manually Detecting URLs

In both void visit(Page page, CrawlDatums next) and void execute(Page page, CrawlDatums next), the second parameter CrawlDatums next is a container into which you should put the detected URLs:

//add one detected URL
next.add("detected URL");
//add one detected URL and set its type
next.add("detected URL", "type");
//add one detected URL
next.add(new CrawlDatum("detected URL"));
//add detected URLs
next.add("detected URL list");
//add detected URLs and set their type
next.add("detected URL list", "type");
//add detected URLs
next.add(new CrawlDatums("detected URL list"));

//add one detected URL and return the added URL(CrawlDatum)
//and set its key and type
next.addAndReturn("detected URL").key("key").type("type");
//add detected URLs and return the added URLs(CrawlDatums)
//and set their type and meta info
next.addAndReturn("detected URL list").type("type").meta("page_num",10);

//add detected URL and return next
//and modify the type and meta info of all the CrawlDatums in next,
//including the added URL
next.add("detected URL").type("type").meta("page_num", 10);
//add detected URLs and return next
//and modify the type and meta info of all the CrawlDatums in next,
//including the added URLs
next.add("detected URL list").type("type").meta("page_num", 10);

You don't need to worry about filtering duplicate URLs; the crawler filters them automatically.

Plugins

Plugins provide a large part of the functionality of WebCollector. There are several kinds of plugins:

  • Executor: Plugins which define how to download webpages, how to parse them, and how to detect new CrawlDatums (urls)
  • DBManager: Plugins which maintain the crawling history
  • GeneratorFilter: Plugins which generate the CrawlDatums (urls) that will be crawled
  • NextFilter: Plugins which filter the CrawlDatums (urls) detected by the crawler

BreadthCrawler and RamCrawler are the most commonly used crawlers; both extend AutoParseCrawler. The following plugins only work in crawlers which extend AutoParseCrawler:

  • Requester: Plugins which define how to do http request
  • Visitor: Plugins which define how to parse webpages and how to detect new CrawlDatums(urls)

Plugins can be mounted as follows:

crawler.setRequester(xxxxx);
crawler.setVisitor(xxxxx);
crawler.setNextFilter(xxxxx);
crawler.setGeneratorFilter(xxxxx);
crawler.setExecutor(xxxxx);
crawler.setDBManager(xxxxx);

AutoParseCrawler is also an Executor plugin, a Requester plugin, and a Visitor plugin. By default it uses itself as the Executor, Requester, and Visitor plugin. So if you want to write a plugin for an AutoParseCrawler, you have two options:

  • Just override the corresponding methods of your AutoParseCrawler. For example, if you are using BreadthCrawler, all you have to do is override the Page getResponse(CrawlDatum crawlDatum) method.
  • Create a new class which implements the Requester interface and implement its Page getResponse(CrawlDatum crawlDatum) method. Instantiate the class and use crawler.setRequester(the instance) to mount the plugin on the crawler.

Customizing Requester Plugin

Creating a Requester plugin is easy. You just need to create a new class which implements the Requester interface and implement its Page getResponse(CrawlDatum crawlDatum) method. OkHttpRequester is a Requester plugin provided by WebCollector; you can find its code here: OkHttpRequester.class.

Most of the time, you don't need to write a Requester plugin from scratch. Extending OkHttpRequester is a convenient way to create one, as sketched below.
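For instance, a custom requester that adds a User-Agent header might look like the following. This is only a sketch: it assumes OkHttpRequester exposes a createRequestBuilder(CrawlDatum) hook built on okhttp3.Request.Builder and lives under cn.edu.hfut.dmic.webcollector.plugin.net; check the OkHttpRequester source linked above for the exact package and signatures.

import cn.edu.hfut.dmic.webcollector.model.CrawlDatum;
import cn.edu.hfut.dmic.webcollector.plugin.net.OkHttpRequester; // assumed package path
import okhttp3.Request;

public class MyRequester extends OkHttpRequester {
    // assumed hook: called before each request to build the okhttp3 request
    @Override
    public Request.Builder createRequestBuilder(CrawlDatum crawlDatum) {
        return super.createRequestBuilder(crawlDatum)
                .header("User-Agent", "Mozilla/5.0 (compatible; MyCrawler/1.0)"); // custom header
    }
}

Mount it with crawler.setRequester(new MyRequester()); as shown in the plugin list above.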

Configuration Details

The configuration mechanism of WebCollector was redesigned in version 2.70. The example DemoManualNewsCrawler.java above also shows how to use configuration to customize your crawler.

Before version 2.70, configuration was maintained by static variables in the class cn.edu.hfut.dmic.webcollector.util.Config, so it was cumbersome to assign different configurations to different crawlers.

Since version 2.70, each crawler can have its own configuration. You can use crawler.getConf() to get it or crawler.setConf(xxx) to set it. By default, all crawlers share a singleton default configuration, which can be obtained with Configuration.getDefault(). So in the example DemoManualNewsCrawler.java above, crawler.getConf().set("xxx", "xxx") affects the default configuration, which may be used by other crawlers.

If you want to change the configuration of a crawler without affecting other crawlers, you should manually create a configuration and specify it to the crawler. For example:

Configuration conf = Configuration.copyDefault();

conf.set("test_string_key", "test_string_value");
conf.setReadTimeout(1000 * 5);

crawler.setConf(conf);

// after setConf, getConf() returns this crawler's own configuration
crawler.getConf().set("test_int_key", 10);
crawler.getConf().setConnectTimeout(1000 * 5);

Configuration.copyDefault() is recommended because it creates a copy of the singleton default configuration, which contains some necessary key-value pairs, whereas new Configuration() creates an empty configuration.

Resumable Crawling

If you want to stop a crawler and continue crawling the next time, you should do two things:

  • Add crawler.setResumable(true) to your code.
  • Don't delete the history directory generated by the crawler, which is specified by the crawlPath parameter.

When you call crawler.start(depth), the crawler deletes the crawling history unless resumable is set to true (it is false by default). So if you forget to call crawler.setResumable(true) before the first time you start your crawler, it doesn't matter, because there is no history directory yet.
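Reusing the DemoAutoNewsCrawler from the quickstart above, a resumable run looks like this (a sketch; "crawl" is whatever crawlPath you chose):

public class DemoResumableRun {
    public static void main(String[] args) throws Exception {
        // "crawl" is the crawlPath; keep this directory between runs
        DemoAutoNewsCrawler crawler = new DemoAutoNewsCrawler("crawl", true);
        crawler.setResumable(true); // keep the history so the next run continues where this one stopped
        crawler.start(4);
    }
}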

Content Extraction

WebCollector can automatically extract the content of news webpages:

News news = ContentExtractor.getNewsByHtml(html, url);
News news = ContentExtractor.getNewsByHtml(html);
News news = ContentExtractor.getNewsByUrl(url);

String content = ContentExtractor.getContentByHtml(html, url);
String content = ContentExtractor.getContentByHtml(html);
String content = ContentExtractor.getContentByUrl(url);

Element contentElement = ContentExtractor.getContentElementByHtml(html, url);
Element contentElement = ContentExtractor.getContentElementByHtml(html);
Element contentElement = ContentExtractor.getContentElementByUrl(url);
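A minimal end-to-end sketch is shown below; the package path (cn.edu.hfut.dmic.contentextractor) and the getTitle()/getContent() accessors on News are assumptions based on the content-extractor module, so verify them against your WebCollector version.

import cn.edu.hfut.dmic.contentextractor.ContentExtractor; // assumed package path
import cn.edu.hfut.dmic.contentextractor.News;

public class DemoContentExtraction {
    public static void main(String[] args) throws Exception {
        // fetch the page and extract its title and main content with CEPF
        News news = ContentExtractor.getNewsByUrl("https://blog.github.com/2018-07-13-graphql-for-octokit/");
        System.out.println("title:\n" + news.getTitle());     // assumed accessor
        System.out.println("content:\n" + news.getContent()); // assumed accessor
    }
}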


webcollector's People

Contributors

briefcopy, cyburs, hujunxianligong, mdzz9527


webcollector's Issues

A few questions

1. When adding URL regex rules, is it possible to use different rules at each depth?
For example, rule A at depth=1, then rule B at depth=2 while clearing rule A.
If I want this behavior, how should I extend the existing functionality?

2. When the crawled content is paginated, does the depth increase by 1 for every page fetched?
If I want to crawl 100 pages, will the depth be 100?
Does the depth in crawler.start(depth) need to be set to 100?

3. Which method does the crawler use to get the next page to visit? I couldn't find it.

Crawler exits before finishing all seed URLs

long start = System.currentTimeMillis();
TonghuashunWebCollector crawler = new TonghuashunWebCollector("D:/var");
Map<String, BasicInfo> biMap = BasicInfoCache.getMarketStockMap();
int count = 0;

for (Map.Entry<String, BasicInfo> bi : biMap.entrySet()) {
    // the API identifies the exchange by a prefix
    String prefix = "";
    if (bi.getValue().getExchange_name().length() > 1) {
        prefix = bi.getValue().getExchange_name() + "_" + bi.getValue().getClass_code();
    }
    StringBuffer sb = new StringBuffer();
    sb.append(prefix);
    // once MAX_NUM entries are collected, build a url and push it
    String url = PREFIX_URL_ + sb.toString() + SUFFIX_URL_ + "?" + sb.toString();
    crawler.addSeed(url);
    logger.info(url);
    count++;
    if (count == 10) {
        break;
    }
}
crawler.setThreads(5);
crawler.setMaxRetry(2);
crawler.start(1);

System.out.println(System.currentTimeMillis() - start);

The code is above, but with these settings the crawler only fetched a few URLs and then exited.

Setting the timeout and the number of retries

Some pages cannot be loaded in one try or load slowly, so setting a timeout would speed things up.
Others are server-side problems where a single request may never return the document, so a retry is needed.

There doesn't seem to be such a setting here, is there?
Or how should it be configured to meet these needs?

Resumable crawling seems to re-crawl everything each time

Hi, after crawling for a while I stop the crawler, and when I restart it the same pages are crawled again.
The same titles are printed every time. crawler.setResumable(true) behaves the same whether set to true or false.
Is it a problem with my setup? Thanks for any advice. Part of the code is attached:

@Override
public void visit(Page page, CrawlDatums next) {
    System.out.println("<<<<<<<<<<visiting:" + page.getUrl() + "\tdepth=" + page.getMetaData("depth"));
    String title = page.select("title").first().text();
    System.out.println(title);
}

/* afterVisit runs after visit; super.afterVisit performs the automatic
   link extraction based on the regex rules, so the call must be kept */
@Override
public void afterVisit(Page page, CrawlDatums next) {
    super.afterVisit(page, next); 

    // if the current page's depth is x, the follow-up tasks parsed from it get depth x+1
    int depth;
    // if the depth meta was forgotten when adding the seeds, this keeps the program from failing
    if(page.getMetaData("depth")==null){
        depth=1;
    }else{
        depth=Integer.valueOf(page.getMetaData("depth"));
    }
    depth++;
    for(CrawlDatum datum:next){
        datum.putMetaData("depth", depth+"");
    }
}

public static void main(String[] args) throws Exception {
    Example crawler=new Example("example_crawler", true);

    crawler.addSeed(new CrawlDatum("http://news.sohu.com/")
           .putMetaData("depth", "1"));

    crawler.addRegex(".*news.sohu.com/.*.shtml");
    crawler.setRetryInterval(1000);
    crawler.setVisitInterval(1000*1);
    crawler.setThreads(1);
    crawler.setResumable(true);

    crawler.start(3);
}

Database connection question

I can't find a method for releasing the database connection; I'm not sure whether one has been added.

Does the latest version set cookies on redirects?

I'm not using the latest version, and I recently ran into a redirect-logic problem. The scenario: simulated login with automatic retrieval of the authorization code. After logging in on the target page (call it A), the site redirects to the user home page (call it B) and at the same time returns a Set-Cookie response header. When redirecting to B, if the cookie from Set-Cookie is not put into the "Cookie" request header, the redirect repeats forever.

So I added the following code in httpRequest.getResponse():

String newCookie = con.getHeaderField("Set-Cookie");
if (StringUtils.isNotBlank(newCookie)) {
      this.setCookie(newCookie);
      response.addHeader("Cookie", newCookie);
      this.crawlDatum.putMetaData("Cookie", newCookie);
}

This avoids the dead loop where a successful login, lacking the cookie, keeps redirecting back to the login page, and the same cookie can keep being used in subsequent crawls until a new Set-Cookie appears.
If needed I can send you my rewritten httpRequest for review, or push it to my fork.

Crawling does not continue after crawlDatums.add(datum)

I call crawlDatums.add(datum) several times.
After the second round of parsing (inside the findThirdUrl method) I add data again, but the crawler does not continue. Why is that? The rough code is below:

@Override
public void visit(Page page, CrawlDatums crawlDatums) {
    if (FIRST.equals(page.meta(TYPE))) {
        System.out.println("match url-->" + page.getUrl());
        findSecondUrl(page, crawlDatums);
    } else if (SECOND.equals(page.meta(TYPE))) {
        findThirdUrl(page, crawlDatums);
    }
}

private void findThirdUrl(Page page, CrawlDatums crawlDatums) {
    Elements seasons = page.select(***);
    for (int i = 0; i < seasons.size(); i++) {
        String text = seasons.get(i).text();
        CrawlDatum datum = new CrawlDatum();
        datum.setUrl(***);
        datum.meta(TYPE, THIRD);
        // continue crawling
        // TODO: has no effect
        crawlDatums.add(datum);
    }
}

Can the internal links of crawled pages be rewritten?

For example, after I crawl a site to my local machine, how can the links inside the HTML pages be rewritten as local links, changing absolute paths into local relative paths? This must be a fairly common problem.

DemoDepthCrawler does not seem to work correctly

When I run DemoDepthCrawler, the depth is not obtained correctly. When crawling the second level, the log looks like this:
total time: 2 seconds
2016-04-14 09:43:47 INFO cn.edu.hfut.dmic.webcollector.crawler.Crawler - start depth 2
2016-04-14 09:43:47 INFO cn.edu.hfut.dmic.webcollector.fetcher.Fetcher - open generator:cn.edu.hfut.dmic.webcollector.plugin.berkeley.BerkeleyGenerator
2016-04-14 09:43:47 INFO cn.edu.hfut.dmic.webcollector.fetcher.Fetcher - init segmentWriter:cn.edu.hfut.dmic.webcollector.plugin.berkeley.BerkeleyDBManager
2016-04-14 09:43:48 INFO cn.edu.hfut.dmic.webcollector.fetcher.Fetcher - -activeThreads=50, spinWaiting=0, fetchQueue.size=84
visiting:http://news.hfut.edu.cn/show-1-29679-1.html depth=null
2016-04-14 09:43:48 INFO cn.edu.hfut.dmic.webcollector.fetcher.Fetcher - done: http://news.hfut.edu.cn/show-1-29679-1.html
visiting:http://news.hfut.edu.cn/show-1-28410-1.html depth=null
visiting:http://news.hfut.edu.cn/show-1-29726-1.html depth=null
2016-04-14 09:43:48 INFO cn.edu.hfut.dmic.webcollector.fetcher.Fetcher - done: http://news.hfut.edu.cn/show-1-28410-1.html
2016-04-14 09:43:48 INFO cn.edu.hfut.dmic.webcollector.fetcher.Fetcher - done: http://news.hfut.edu.cn/show-1-29726-1.html
visiting:http://news.hfut.edu.cn/show-1-12727-1.html depth=null
visiting:http://news.hfut.edu.cn/show-1-28790-1.html depth=null

Re-running a URL after an exception: this code does not have the intended effect

/**
 * If an exception occurs, crawl this page one more time
 * 
 * @param page
 * @param next
 */
private void againCrawl(Page page, CrawlDatums next) {
    CrawlDatum datum = page.getCrawlDatum();
    Map<String, String> metaData = datum.getMetaData();
    if (metaData.isEmpty()) {
        metaData.put("againCrawled", "yes");
        next.add(datum);
    }
}

How do I use a proxy in the new version?

There used to be a proxyGenerator; why has it been removed? Looking at the source code:

public HttpResponse getResponse(CrawlDatum crawlDatum) throws Exception {
    HttpRequest request = new HttpRequest(crawlDatum);
    return request.getResponse();
}

It seems hard-coded. How do I set a proxy? Thanks.

A question about the regex rules

/*do not fetch jpg|png|gif*/
        this.addRegex("-.*\\.(jpg|png|gif).*");
        /*do not fetch url contains #*/
        this.addRegex("-.*#.*");

Why do these regexes mean "do not fetch" (exclusion) rather than patterns to be matched?

Regex bug in ContentCollector's time parsing

The time-parsing regexes in the getTime and getDate methods of the source can match strings like 2015ab08cd20mn. Why not use a stricter regex, such as:
(1-9)-/年-/月日?[\s| ]*(?:(1[0-9]|2[0-3]|0?[0-9]):时[:分]?(?:([1-5][0-9]|0?[0-9])秒?)?(?:.[0-9]{3})?)?

Certificate problem with HTTPS sites

Environment:
WebCollector 2.0.9
jdk1.8.0_65
Linux

Test code:
After obtaining the certificate, the program adds the following statement:
System.setProperty("javax.net.ssl.**", **);
......

Test result:
Connecting directly with Jsoup works fine. When using WebCollector, the following error is reported.

Log output:
javax.net.ssl.SSLHandshakeException: java.security.cert.CertificateException: Certificates does not conform to algorithm constraints

Does BreadthCrawler have timeout control?

I looked through the crawler-related classes and found no setting related to HTTP timeouts. Will an exception be thrown when an HTTP timeout occurs? If not, which class should I extend to implement timeout control?

DB error when crawling a large amount of data

I found this while analyzing the logs. My logic returns the links of all pages at the first level.
java.lang.IllegalStateException: Can't call Database.put Database was closed.
at com.sleepycat.je.Database.checkOpen(Database.java:1863)
at com.sleepycat.je.Database.put(Database.java:1168)
at com.itsecu.crawler.collector.webcollector.fetcher.SegmentWriter.wrtieLinks(SegmentWriter.java:89)
at com.itsecu.crawler.collector.webcollector.fetcher.Fetcher$FetcherThread.run(Fetcher.java:345)

visit throws an exception and I don't want the URL re-crawled; what should I do?

Page page = getPage(crawlDatum);
crawlDatum.incrRetry(page.getRetry());
crawlDatum.setFetchTime(System.currentTimeMillis());

CrawlDatums next = new CrawlDatums();
if (visit(crawlDatum, page, next)) {
    try {
        /* write the fetch info */
        dbManager.wrtieFetchSegment(crawlDatum);

I see that your retry logic is in getPage. Fetching the page does not fail for me; the error happens in visit, so dbManager.wrtieFetchSegment(crawlDatum) is never called and the retry count is never incremented. The URL stays in the not-yet-crawled state, so it keeps being crawled over and over. Is that right?

BerkeleyDB error: LOG_FILE_NOT_FOUND: Log file missing

com.sleepycat.je.EnvironmentFailureException: Environment invalid because of previous exception: (JE 5.0.73) /home/data/bing fetchTarget of 0x0/0x361 parent IN=5 IN class=com.sleepycat.je.tree.BIN lastFullVersion=0x2/0x14468 lastLoggedVersion=0x2/0x14468 parent.getDirty()=true state=0 LOG_FILE_NOT_FOUND: Log file missing, log is likely invalid. Environment is invalid and must be closed

Have you run into this problem before? The crawler reports this error when multiple threads are started. I'm using your latest version.

Too many levels cause failures; how can I make sure all the data gets crawled?

Hi, I'm implementing crawling of paginated content.
The seed is the first page, which returns the link of the next page, and so on until the last page.
But I find that once the crawl reaches 20-some levels it stops fetching and an exception is thrown.
I don't know whether it's an IP restriction on the target site or a flaw in your framework, but either way it throws an exception.
On the first page I store the timestamp of the first item, and later runs only crawl newer content, stopping once the stored timestamp is reached.
So if, say, page 10 throws an exception, I can never get the data after page 10.
Do you have a good way to make sure all the data gets crawled?
Thank you. I plan to study your framework carefully and then write up my experience to help promote this great tool.

In real-world use, a page-count limit parameter is needed.

In real-world crawling there should be a page-count limit. For example, when crawling a site such as bbs.fobshanghai.com, once you have fetched a certain number of pages the site activates a defense mechanism, and everything you fetch from then on is endless, automatically generated garbage pages. It effectively hijacks the crawler so that you can never get out. There are many such sites on the web that try every trick to hijack search-engine crawlers, with the ultimate goal of diverting traffic to themselves.

In this situation the crawler needs a page-count limit parameter; without it, a large automated crawl will get trapped as soon as it hits one of these sites. Something like:
crawler.setmaxpages(10000);
