
webcollector's Introduction

WebCollector

WebCollector is an open-source web crawler framework based on Java. It provides simple interfaces for crawling the Web, so you can set up a multi-threaded web crawler in less than 5 minutes.

In addition to a general-purpose crawler framework, WebCollector also integrates CEPF, a state-of-the-art web content extraction algorithm proposed by Wu et al.:

  • Wu GQ, Hu J, Li L, Xu ZH, Liu PC, Hu XG, Wu XD. Online Web news extraction via tag path feature fusion. Ruan Jian Xue Bao/Journal of Software, 2016,27(3):714-735 (in Chinese). http://www.jos.org.cn/1000-9825/4868.htm

HomePage

https://github.com/CrawlScript/WebCollector

Installation

Using Maven

<dependency>
    <groupId>cn.edu.hfut.dmic.webcollector</groupId>
    <artifactId>WebCollector</artifactId>
    <version>2.73-alpha</version>
</dependency>

Without Maven

WebCollector jars are available on the HomePage.

  • webcollector-version-bin.zip contains core jars.

Example Index

The annotation-based versions are named DemoAnnotatedxxxxxx.java.

Basic

CrawlDatum and MetaData

Http Request and Javascript

NextFilter

Quickstart

Let's crawl some news from the GitHub blog. This demo prints the titles and contents extracted from the news pages.

Automatically Detecting URLs

DemoAutoNewsCrawler.java:

import cn.edu.hfut.dmic.webcollector.model.CrawlDatums;
import cn.edu.hfut.dmic.webcollector.model.Page;
import cn.edu.hfut.dmic.webcollector.plugin.rocks.BreadthCrawler;

/**
 * Crawling news from github news
 *
 * @author hu
 */
public class DemoAutoNewsCrawler extends BreadthCrawler {
    /**
     * @param crawlPath crawlPath is the path of the directory which maintains
     *                  information of this crawler
     * @param autoParse if autoParse is true, BreadthCrawler will automatically extract
     *                  links which match the regex rules from the page
     */
    public DemoAutoNewsCrawler(String crawlPath, boolean autoParse) {
        super(crawlPath, autoParse);
        /*start pages*/
        this.addSeed("https://blog.github.com/");
        for(int pageIndex = 2; pageIndex <= 5; pageIndex++) {
            String seedUrl = String.format("https://blog.github.com/page/%d/", pageIndex);
            this.addSeed(seedUrl);
        }

        /*fetch url like "https://blog.github.com/2018-07-13-graphql-for-octokit/" */
        this.addRegex("https://blog.github.com/[0-9]{4}-[0-9]{2}-[0-9]{2}-[^/]+/");
        /*do not fetch jpg|png|gif*/
        //this.addRegex("-.*\\.(jpg|png|gif).*");
        /*do not fetch url contains #*/
        //this.addRegex("-.*#.*");

        setThreads(50);
        getConf().setTopN(100);

        //enable resumable mode
        //setResumable(true);
    }

    @Override
    public void visit(Page page, CrawlDatums next) {
        String url = page.url();
        /*if page is news page*/
        if (page.matchUrl("https://blog.github.com/[0-9]{4}-[0-9]{2}-[0-9]{2}[^/]+/")) {

            /*extract title and content of news by css selector*/
            String title = page.select("h1[class=lh-condensed]").first().text();
            String content = page.selectText("div.content.markdown-body");

            System.out.println("URL:\n" + url);
            System.out.println("title:\n" + title);
            System.out.println("content:\n" + content);

            /*If you want to add urls to crawl,add them to nextLink*/
            /*WebCollector automatically filters links that have been fetched before*/
            /*If autoParse is true and the link you add to nextLinks does not match the
              regex rules, the link will also be filtered.*/
            //next.add("http://xxxxxx.com");
        }
    }

    public static void main(String[] args) throws Exception {
        DemoAutoNewsCrawler crawler = new DemoAutoNewsCrawler("crawl", true);
        /*start crawl with depth of 4*/
        crawler.start(4);
    }

}

Manually Detecting URLs

DemoManualNewsCrawler.java:

import cn.edu.hfut.dmic.webcollector.model.CrawlDatums;
import cn.edu.hfut.dmic.webcollector.model.Page;
import cn.edu.hfut.dmic.webcollector.plugin.rocks.BreadthCrawler;

/**
 * Crawling news from github news
 *
 * @author hu
 */
public class DemoManualNewsCrawler extends BreadthCrawler {
    /**
     * @param crawlPath crawlPath is the path of the directory which maintains
     *                  information of this crawler
     * @param autoParse if autoParse is true, BreadthCrawler will automatically extract
     *                  links which match the regex rules from the page
     */
    public DemoManualNewsCrawler(String crawlPath, boolean autoParse) {
        super(crawlPath, autoParse);
        // add 5 start pages and set their type to "list"
        //"list" is not a reserved word, you can use other string instead
        this.addSeedAndReturn("https://blog.github.com/").type("list");
        for(int pageIndex = 2; pageIndex <= 5; pageIndex++) {
            String seedUrl = String.format("https://blog.github.com/page/%d/", pageIndex);
            this.addSeed(seedUrl, "list");
        }

        setThreads(50);
        getConf().setTopN(100);

        //enable resumable mode
        //setResumable(true);
    }

    @Override
    public void visit(Page page, CrawlDatums next) {
        String url = page.url();

        if (page.matchType("list")) {
            /*if type is "list"*/
            /*detect content page by css selector and mark their types as "content"*/
            next.add(page.links("h1.lh-condensed>a")).type("content");
        }else if(page.matchType("content")) {
            /*if type is "content"*/
            /*extract title and content of news by css selector*/
            String title = page.select("h1[class=lh-condensed]").first().text();
            String content = page.selectText("div.content.markdown-body");

            //read title_prefix and content_length_limit from configuration
            title = getConf().getString("title_prefix") + title;
            content = content.substring(0, getConf().getInteger("content_length_limit"));

            System.out.println("URL:\n" + url);
            System.out.println("title:\n" + title);
            System.out.println("content:\n" + content);
        }

    }

    public static void main(String[] args) throws Exception {
        DemoManualNewsCrawler crawler = new DemoManualNewsCrawler("crawl", false);

        crawler.getConf().setExecuteInterval(5000);

        crawler.getConf().set("title_prefix","PREFIX_");
        crawler.getConf().set("content_length_limit", 20);

        /*start crawl with depth of 4*/
        crawler.start(4);
    }

}

CrawlDatum

CrawlDatum is an important data structure in WebCollector; it corresponds to the URL of a webpage. Both crawled URLs and detected URLs are maintained as CrawlDatums.

There are some differences between a CrawlDatum and a plain URL:

  • A CrawlDatum contains a key and a url. The key is the url by default. You can set the key manually with CrawlDatum.key("xxxxx"), so CrawlDatums with the same url may have different keys. This is very useful in tasks such as crawling data through an API, which often requests different data from the same url with different POST parameters (see the sketch after this list).
  • A CrawlDatum may contain metadata, which can carry extra information besides the url.
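For illustration, here is a minimal sketch of both points, using the fluent key()/meta() calls shown elsewhere in this document (the API url, key names, and metadata values are hypothetical):

import cn.edu.hfut.dmic.webcollector.model.CrawlDatum;
import cn.edu.hfut.dmic.webcollector.model.CrawlDatums;

public class CrawlDatumKeyExample {
    // detect two API requests that share one url but carry different keys and metadata
    public void detect(CrawlDatums next) {
        CrawlDatum firstPage = new CrawlDatum("https://api.example.com/search") // hypothetical API url
                .key("search_page_1")   // the key overrides the default (the url itself)
                .meta("page_num", "1"); // metadata travels with the CrawlDatum
        CrawlDatum secondPage = new CrawlDatum("https://api.example.com/search")
                .key("search_page_2")   // same url, different key, so both are kept
                .meta("page_num", "2");
        next.add(firstPage);
        next.add(secondPage);
    }
}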

Manually Detecting URLs

In both void visit(Page page, CrawlDatums next) and void execute(Page page, CrawlDatums next), the second parameter CrawlDatums next is a container into which you should put the detected URLs:

//add one detected URL
next.add("detected URL");
//add one detected URL and set its type
next.add("detected URL", "type");
//add one detected URL
next.add(new CrawlDatum("detected URL"));
//add detected URLs
next.add("detected URL list");
//add detected URLs and set their type
next.add("detected URL list", "type");
//add detected URLs
next.add(new CrawlDatums("detected URL list"));

//add one detected URL and return the added URL(CrawlDatum)
//and set its key and type
next.addAndReturn("detected URL").key("key").type("type");
//add detected URLs and return the added URLs(CrawlDatums)
//and set their type and meta info
next.addAndReturn("detected URL list").type("type").meta("page_num",10);

//add detected URL and return next
//and modify the type and meta info of all the CrawlDatums in next,
//including the added URL
next.add("detected URL").type("type").meta("page_num", 10);
//add detected URLs and return next
//and modify the type and meta info of all the CrawlDatums in next,
//including the added URLs
next.add("detected URL list").type("type").meta("page_num", 10);

You don't need to worry about filtering duplicate URLs; the crawler filters them automatically.

Plugins

Plugins provide a large part of the functionality of WebCollector. There are several kinds of plugins:

  • Executor: Plugins which define how to download webpages, how to parse them, and how to detect new CrawlDatums (urls)
  • DBManager: Plugins which maintain the crawling history
  • GeneratorFilter: Plugins which generate the CrawlDatums (urls) that will be crawled
  • NextFilter: Plugins which filter the CrawlDatums (urls) detected by the crawler

BreadthCrawler and RamCrawler are the most commonly used crawlers; both extend AutoParseCrawler. The following plugins only work in crawlers which extend AutoParseCrawler:

  • Requester: Plugins which define how to do http request
  • Visitor: Plugins which define how to parse webpages and how to detect new CrawlDatums(urls)

Plugins can be mounted as follows:

crawler.setRequester(xxxxx);
crawler.setVisitor(xxxxx);
crawler.setNextFilter(xxxxx);
crawler.setGeneratorFilter(xxxxx);
crawler.setExecutor(xxxxx);
crawler.setDBManager(xxxxx);

AutoParseCrawler is also an Executor plugin, a Requester plugin, and a Visitor plugin. By default it uses itself as the Executor, Requester, and Visitor plugin. So if you want to write a plugin for an AutoParseCrawler, you have two options:

  • Just override the corresponding methods of your AutoParseCrawler. For example, if you are using BreadthCrawler, all you have to do is override the Page getResponse(CrawlDatum crawlDatum) method.
  • Create a new class which implements the Requester interface and implement its Page getResponse(CrawlDatum crawlDatum) method. Instantiate the class and use crawler.setRequester(the instance) to mount the plugin on the crawler.

Customizing Requester Plugin

Creating a Requester plugin is easy. You just need to create a new class which implements the Requester interface and implement its Page getResponse(CrawlDatum crawlDatum) method. OkHttpRequester is a Requester plugin provided by WebCollector; you can find its code here: OkHttpRequester.class.

Most of the time, you don't need to write a Requester plugin from scratch. Extending OkHttpRequester is a convenient way to create one, as sketched below.
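For instance, a custom requester that adds a User-Agent header might look like the following. This is only a sketch: it assumes OkHttpRequester exposes a createRequestBuilder(CrawlDatum) hook built on okhttp3.Request.Builder and lives under cn.edu.hfut.dmic.webcollector.plugin.net; check the OkHttpRequester source linked above for the exact package and signatures.

import cn.edu.hfut.dmic.webcollector.model.CrawlDatum;
import cn.edu.hfut.dmic.webcollector.plugin.net.OkHttpRequester; // assumed package path
import okhttp3.Request;

public class MyRequester extends OkHttpRequester {
    // assumed hook: called before each request to build the okhttp3 request
    @Override
    public Request.Builder createRequestBuilder(CrawlDatum crawlDatum) {
        return super.createRequestBuilder(crawlDatum)
                .header("User-Agent", "Mozilla/5.0 (compatible; MyCrawler/1.0)"); // custom header
    }
}

Mount it with crawler.setRequester(new MyRequester()); as shown in the plugin list above.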

Configuration Details

The configuration mechanism of WebCollector was redesigned in version 2.70. The example DemoManualNewsCrawler.java above also shows how to use configuration to customize your crawler.

Before version 2.70, configuration was maintained by static variables in the class cn.edu.hfut.dmic.webcollector.util.Config, so it was cumbersome to assign different configurations to different crawlers.

Since version 2.70, each crawler can have its own configuration. You can use crawler.getConf() to get it or crawler.setConf(xxx) to set it. By default, all crawlers share a singleton default configuration, which can be obtained with Configuration.getDefault(). So in the example DemoManualNewsCrawler.java above, crawler.getConf().set("xxx", "xxx") affects the default configuration, which may be used by other crawlers.

If you want to change the configuration of a crawler without affecting other crawlers, you should manually create a configuration and specify it to the crawler. For example:

Configuration conf = Configuration.copyDefault();

conf.set("test_string_key", "test_string_value");
conf.setReadTimeout(1000 * 5);

crawler.setConf(conf);

// after setConf, getConf() returns this crawler's own configuration
crawler.getConf().set("test_int_key", 10);
crawler.getConf().setConnectTimeout(1000 * 5);

Configuration.copyDefault() is recommended because it creates a copy of the singleton default configuration, which contains some necessary key-value pairs, whereas new Configuration() creates an empty configuration.

Resumable Crawling

If you want to stop a crawler and continue crawling the next time, you should do two things:

  • Add crawler.setResumable(true) to your code.
  • Don't delete the history directory generated by the crawler, which is specified by the crawlPath parameter.

When you call crawler.start(depth), the crawler deletes the crawling history unless resumable is set to true (it is false by default). So if you forget to call crawler.setResumable(true) before the first time you start your crawler, it doesn't matter, because there is no history directory yet.
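Reusing the DemoAutoNewsCrawler from the quickstart above, a resumable run looks like this (a sketch; "crawl" is whatever crawlPath you chose):

public class DemoResumableRun {
    public static void main(String[] args) throws Exception {
        // "crawl" is the crawlPath; keep this directory between runs
        DemoAutoNewsCrawler crawler = new DemoAutoNewsCrawler("crawl", true);
        crawler.setResumable(true); // keep the history so the next run continues where this one stopped
        crawler.start(4);
    }
}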

Content Extraction

WebCollector can automatically extract the content of news webpages:

News news = ContentExtractor.getNewsByHtml(html, url);
News news = ContentExtractor.getNewsByHtml(html);
News news = ContentExtractor.getNewsByUrl(url);

String content = ContentExtractor.getContentByHtml(html, url);
String content = ContentExtractor.getContentByHtml(html);
String content = ContentExtractor.getContentByUrl(url);

Element contentElement = ContentExtractor.getContentElementByHtml(html, url);
Element contentElement = ContentExtractor.getContentElementByHtml(html);
Element contentElement = ContentExtractor.getContentElementByUrl(url);
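A minimal end-to-end sketch is shown below; the package path (cn.edu.hfut.dmic.contentextractor) and the getTitle()/getContent() accessors on News are assumptions based on the content-extractor module, so verify them against your WebCollector version.

import cn.edu.hfut.dmic.contentextractor.ContentExtractor; // assumed package path
import cn.edu.hfut.dmic.contentextractor.News;

public class DemoContentExtraction {
    public static void main(String[] args) throws Exception {
        // fetch the page and extract its title and main content with CEPF
        News news = ContentExtractor.getNewsByUrl("https://blog.github.com/2018-07-13-graphql-for-octokit/");
        System.out.println("title:\n" + news.getTitle());     // assumed accessor
        System.out.println("content:\n" + news.getContent()); // assumed accessor
    }
}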


webcollector's People

Contributors

briefcopy, cyburs, hujunxianligong, mdzz9527


webcollector's Issues

A few questions

1. When adding URL regex rules, is it possible to use different rules at each depth?
For example, rule A at depth=1, then rule B at depth=2 while clearing rule A.
If I want this behavior, how should I extend the existing functionality?

2. When the crawled content is paginated, does the depth increase by 1 for every page fetched?
If I want to crawl 100 pages, will the depth be 100?
Does the depth in crawler.start(depth) need to be set to 100?

3. Which method does the crawler use to get the next page to visit? I couldn't find it.

Crawler exits before finishing all seed URLs

long start = System.currentTimeMillis();
TonghuashunWebCollector crawler = new TonghuashunWebCollector("D:/var");
Map<String, BasicInfo> biMap = BasicInfoCache.getMarketStockMap();
int count = 0;

for (Map.Entry<String, BasicInfo> bi : biMap.entrySet()) {
    // the API identifies the exchange by a prefix
    String prefix = "";
    if (bi.getValue().getExchange_name().length() > 1) {
        prefix = bi.getValue().getExchange_name() + "_" + bi.getValue().getClass_code();
    }
    StringBuffer sb = new StringBuffer();
    sb.append(prefix);
    // once MAX_NUM entries are collected, build a url and push it
    String url = PREFIX_URL_ + sb.toString() + SUFFIX_URL_ + "?" + sb.toString();
    crawler.addSeed(url);
    logger.info(url);
    count++;
    if (count == 10) {
        break;
    }
}
crawler.setThreads(5);
crawler.setMaxRetry(2);
crawler.start(1);

System.out.println(System.currentTimeMillis() - start);

The code is above, but with these settings the crawler only fetched a few URLs and then exited.

Setting the timeout and the number of retries

Some pages cannot be loaded in one try or load slowly, so setting a timeout would speed things up.
Others are server-side problems where a single request may never return the document, so a retry is needed.

There doesn't seem to be such a setting here, is there?
Or how should it be configured to meet these needs?

Resumable crawling seems to re-crawl everything each time

Hi, after crawling for a while I stop the crawler, and when I restart it the same pages are crawled again.
The same titles are printed every time. crawler.setResumable(true) behaves the same whether set to true or false.
Is it a problem with my setup? Thanks for any advice. Part of the code is attached:

@Override
public void visit(Page page, CrawlDatums next) {
    System.out.println("<<<<<<<<<<visiting:" + page.getUrl() + "\tdepth=" + page.getMetaData("depth"));
    String title = page.select("title").first().text();
    System.out.println(title);
}

/* afterVisit runs after visit; super.afterVisit performs the automatic
   link extraction based on the regex rules, so the call must be kept */
@Override
public void afterVisit(Page page, CrawlDatums next) {
    super.afterVisit(page, next); 

    // if the current page's depth is x, the follow-up tasks parsed from it get depth x+1
    int depth;
    // if the depth meta was forgotten when adding the seeds, this keeps the program from failing
    if(page.getMetaData("depth")==null){
        depth=1;
    }else{
        depth=Integer.valueOf(page.getMetaData("depth"));
    }
    depth++;
    for(CrawlDatum datum:next){
        datum.putMetaData("depth", depth+"");
    }
}

public static void main(String[] args) throws Exception {
    Example crawler=new Example("example_crawler", true);

    crawler.addSeed(new CrawlDatum("http://news.sohu.com/")
           .putMetaData("depth", "1"));

    crawler.addRegex(".*news.sohu.com/.*.shtml");
    crawler.setRetryInterval(1000);
    crawler.setVisitInterval(1000*1);
    crawler.setThreads(1);
    crawler.setResumable(true);

    crawler.start(3);
}

Database connection question

I can't find a method for releasing the database connection; I'm not sure whether one has been added.

Does the latest version set cookies on redirects?

I'm not using the latest version, and I recently ran into a redirect-logic problem. The scenario: simulated login with automatic retrieval of the authorization code. After logging in on the target page (call it A), the site redirects to the user home page (call it B) and at the same time returns a Set-Cookie response header. When redirecting to B, if the cookie from Set-Cookie is not put into the "Cookie" request header, the redirect repeats forever.

So I added the following code in httpRequest.getResponse():

String newCookie = con.getHeaderField("Set-Cookie");
if (StringUtils.isNotBlank(newCookie)) {
      this.setCookie(newCookie);
      response.addHeader("Cookie", newCookie);
      this.crawlDatum.putMetaData("Cookie", newCookie);
}

This avoids the dead loop where a successful login, lacking the cookie, keeps redirecting back to the login page, and the same cookie can keep being used in subsequent crawls until a new Set-Cookie appears.
If needed I can send you my rewritten httpRequest for review, or push it to my fork.

Crawling does not continue after crawlDatums.add(datum)

I call crawlDatums.add(datum) several times.
After the second round of parsing (inside the findThirdUrl method) I add data again, but the crawler does not continue. Why is that? The rough code is below:

@Override
public void visit(Page page, CrawlDatums crawlDatums) {
    if (FIRST.equals(page.meta(TYPE))) {
        System.out.println("match url-->" + page.getUrl());
        findSecondUrl(page, crawlDatums);
    } else if (SECOND.equals(page.meta(TYPE))) {
        findThirdUrl(page, crawlDatums);
    }
}

private void findThirdUrl(Page page, CrawlDatums crawlDatums) {
    Elements seasons = page.select(***);
    for (int i = 0; i < seasons.size(); i++) {
        String text = seasons.get(i).text();
        CrawlDatum datum = new CrawlDatum();
        datum.setUrl(***);
        datum.meta(TYPE, THIRD);
        // continue crawling
        // TODO: has no effect
        crawlDatums.add(datum);
    }
}

Can the internal links of crawled pages be rewritten?

For example, after I crawl a site to my local machine, how can the links inside the HTML pages be rewritten as local links, changing absolute paths into local relative paths? This must be a fairly common problem.

DemoDepthCrawler does not seem to work correctly

When I run DemoDepthCrawler, the depth is not obtained correctly. When crawling the second level, the log looks like this:
total time: 2 seconds
2016-04-14 09:43:47 INFO cn.edu.hfut.dmic.webcollector.crawler.Crawler - start depth 2
2016-04-14 09:43:47 INFO cn.edu.hfut.dmic.webcollector.fetcher.Fetcher - open generator:cn.edu.hfut.dmic.webcollector.plugin.berkeley.BerkeleyGenerator
2016-04-14 09:43:47 INFO cn.edu.hfut.dmic.webcollector.fetcher.Fetcher - init segmentWriter:cn.edu.hfut.dmic.webcollector.plugin.berkeley.BerkeleyDBManager
2016-04-14 09:43:48 INFO cn.edu.hfut.dmic.webcollector.fetcher.Fetcher - -activeThreads=50, spinWaiting=0, fetchQueue.size=84
visiting:http://news.hfut.edu.cn/show-1-29679-1.html depth=null
2016-04-14 09:43:48 INFO cn.edu.hfut.dmic.webcollector.fetcher.Fetcher - done: http://news.hfut.edu.cn/show-1-29679-1.html
visiting:http://news.hfut.edu.cn/show-1-28410-1.html depth=null
visiting:http://news.hfut.edu.cn/show-1-29726-1.html depth=null
2016-04-14 09:43:48 INFO cn.edu.hfut.dmic.webcollector.fetcher.Fetcher - done: http://news.hfut.edu.cn/show-1-28410-1.html
2016-04-14 09:43:48 INFO cn.edu.hfut.dmic.webcollector.fetcher.Fetcher - done: http://news.hfut.edu.cn/show-1-29726-1.html
visiting:http://news.hfut.edu.cn/show-1-12727-1.html depth=null
visiting:http://news.hfut.edu.cn/show-1-28790-1.html depth=null

Re-running a URL after an exception: this code does not have the intended effect

/**
 * If an exception occurs, crawl this page one more time
 * 
 * @param page
 * @param next
 */
private void againCrawl(Page page, CrawlDatums next) {
    CrawlDatum datum = page.getCrawlDatum();
    Map<String, String> metaData = datum.getMetaData();
    if (metaData.isEmpty()) {
        metaData.put("againCrawled", "yes");
        next.add(datum);
    }
}

How do I use a proxy in the new version?

There used to be a proxyGenerator; why has it been removed? Looking at the source code:

public HttpResponse getResponse(CrawlDatum crawlDatum) throws Exception {
    HttpRequest request = new HttpRequest(crawlDatum);
    return request.getResponse();
}

It seems hard-coded. How do I set a proxy? Thanks.

A question about the regex rules

/*do not fetch jpg|png|gif*/
        this.addRegex("-.*\\.(jpg|png|gif).*");
        /*do not fetch url contains #*/
        this.addRegex("-.*#.*");

Why do these regexes mean "do not fetch" (exclusion) rather than patterns to be matched?

Regex bug in ContentCollector's time parsing

The time-parsing regexes in the getTime and getDate methods of the source can match strings like 2015ab08cd20mn. Why not use a stricter regex, such as:
(1-9)-/年-/月日?[\s| ]*(?:(1[0-9]|2[0-3]|0?[0-9]):时[:分]?(?:([1-5][0-9]|0?[0-9])秒?)?(?:.[0-9]{3})?)?

Certificate problem with HTTPS sites

Environment:
WebCollector 2.0.9
jdk1.8.0_65
Linux

Test code:
After obtaining the certificate, the program adds the following statement:
System.setProperty("javax.net.ssl.**", **);
......

Test result:
Connecting directly with Jsoup works fine. When using WebCollector, the following error is reported.

Log output:
javax.net.ssl.SSLHandshakeException: java.security.cert.CertificateException: Certificates does not conform to algorithm constraints

Does BreadthCrawler have timeout control?

I looked through the crawler-related classes and found no setting related to HTTP timeouts. Will an exception be thrown when an HTTP timeout occurs? If not, which class should I extend to implement timeout control?

DB error when crawling a large amount of data

I found this while analyzing the logs. My logic returns the links of all pages at the first level.
java.lang.IllegalStateException: Can't call Database.put Database was closed.
at com.sleepycat.je.Database.checkOpen(Database.java:1863)
at com.sleepycat.je.Database.put(Database.java:1168)
at com.itsecu.crawler.collector.webcollector.fetcher.SegmentWriter.wrtieLinks(SegmentWriter.java:89)
at com.itsecu.crawler.collector.webcollector.fetcher.Fetcher$FetcherThread.run(Fetcher.java:345)

visit throws an exception and I don't want the URL re-crawled; what should I do?

Page page = getPage(crawlDatum);
crawlDatum.incrRetry(page.getRetry());
crawlDatum.setFetchTime(System.currentTimeMillis());

CrawlDatums next = new CrawlDatums();
if (visit(crawlDatum, page, next)) {
    try {
        /* write the fetch info */
        dbManager.wrtieFetchSegment(crawlDatum);

I see that your retry logic is in getPage. Fetching the page does not fail for me; the error happens in visit, so dbManager.wrtieFetchSegment(crawlDatum) is never called and the retry count is never incremented. The URL stays in the not-yet-crawled state, so it keeps being crawled over and over. Is that right?

BerkeleyDB error: LOG_FILE_NOT_FOUND: Log file missing

com.sleepycat.je.EnvironmentFailureException: Environment invalid because of previous exception: (JE 5.0.73) /home/data/bing fetchTarget of 0x0/0x361 parent IN=5 IN class=com.sleepycat.je.tree.BIN lastFullVersion=0x2/0x14468 lastLoggedVersion=0x2/0x14468 parent.getDirty()=true state=0 LOG_FILE_NOT_FOUND: Log file missing, log is likely invalid. Environment is invalid and must be closed

Have you run into this problem before? The crawler reports this error when multiple threads are started. I'm using your latest version.

Too many levels cause failures; how can I make sure all the data gets crawled?

Hi, I'm implementing crawling of paginated content.
The seed is the first page, which returns the link of the next page, and so on until the last page.
But I find that once the crawl reaches 20-some levels it stops fetching and an exception is thrown.
I don't know whether it's an IP restriction on the target site or a flaw in your framework, but either way it throws an exception.
On the first page I store the timestamp of the first item, and later runs only crawl newer content, stopping once the stored timestamp is reached.
So if, say, page 10 throws an exception, I can never get the data after page 10.
Do you have a good way to make sure all the data gets crawled?
Thank you. I plan to study your framework carefully and then write up my experience to help promote this great tool.

In real-world use, a page-count limit parameter is needed.

In real-world crawling there should be a page-count limit. For example, when crawling a site such as bbs.fobshanghai.com, once you have fetched a certain number of pages the site activates a defense mechanism, and everything you fetch from then on is endless, automatically generated garbage pages. It effectively hijacks the crawler so that you can never get out. There are many such sites on the web that try every trick to hijack search-engine crawlers, with the ultimate goal of diverting traffic to themselves.

In this situation the crawler needs a page-count limit parameter; without it, a large automated crawl will get trapped as soon as it hits one of these sites. Something like:
crawler.setmaxpages(10000);
