
netcarry's Introduction

netcarry

A web page carrier: it fetches and parses specific pages. Deep crawling is not supported yet.

## Fetch program execution demo

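import java.io.IOException;
import java.util.Arrays;

public class CarrayMainDemo {
    public static void main(String[] args) throws IOException {
        // Path where results are saved
        String savePath = "";
        // Log directory
        String logPath = "logDir";
        // Entry URLs of the pages to fetch
        String[] carryUrls = {};
        // Fetch settings, including connection configuration, thread count, and so on
        NetCarryConfig config = new NetCarryConfig();
        config.setFetchThreadNumber(5);
        // Wire up the executor, the result collector, and the parsers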
        PageFetchExecutor<String> main = new PageFetchExecutor<String>();
        FetchCollector<String> collector = new FetchCollector<String>(1000, savePath);
        PageParserDemo pageParser = new PageParserDemo(collector);
        NextPageParserA nextPageParsers = new NextPageURLParserDemo(Integer.MAX_VALUE);
        main.execute(logPath, Arrays.asList(carryUrls), config, pageParser, new NextPageParserA[]{nextPageParsers});
    }
}
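
The demo above hands a list of entry URLs and a thread count to `PageFetchExecutor`. The README does not show how the executor schedules work, so the following is only a minimal sketch of one plausible approach, a fixed-size thread pool that fans the URLs out to workers. `FetchPoolSketch`, `fetchAll`, and the `fetcher` function are hypothetical names introduced for illustration; they are not part of netcarry.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical stand-in, NOT netcarry's PageFetchExecutor.
final class FetchPoolSketch {
    /**
     * Applies fetcher to every URL using a pool of `threads` workers.
     * Results are returned in the same order as the input URLs.
     */
    static List<String> fetchAll(List<String> urls, int threads,
                                 Function<String, String> fetcher) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> pending = new ArrayList<>();
            for (String url : urls) {
                Callable<String> task = () -> fetcher.apply(url);
                pending.add(pool.submit(task));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> future : pending) {
                try {
                    results.add(future.get()); // blocks until this page is done
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while fetching", e);
                } catch (ExecutionException e) {
                    throw new IllegalStateException("fetch failed", e.getCause());
                }
            }
            return results;
        } finally {
            pool.shutdown(); // stop accepting new tasks; submitted ones have finished
        }
    }
}
```

In this sketch the thread count plays the role that `setFetchThreadNumber(5)` plays in the demo, and `fetcher` would be replaced by real HTTP download code.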

## Demo: parsing the pages to fetch next

public class NextPageURLParserDemo extends NextPageParserA {
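    /**
     * The original "deep" parameter was poorly defined: it is no longer treated as a crawl depth,
     * but as the global maximum number of pages to fetch.
     * @param maxPages global maximum number of pages to fetch
     */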
    public NextPageURLParserDemo(int maxPages) {
        super(maxPages);
    }

    /**
     * Whether this page qualifies as a source of further pages to fetch.
     */
    @Override
    public boolean needParserThisPage(String url) {
        return false;
    }

    /**
     * Parses out which pages linked from this page should be fetched.
     */
    @Override
    protected List<PageMeta> parser(String url, Document document) {
        return null;
    }
}
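
The demo above leaves both methods as stubs. As a rough illustration of how they could be filled in, the sketch below extracts absolute links from list pages and enforces a global page budget like the one `maxPages` describes. It is a self-contained stand-in, not a subclass of `NextPageParserA`: it works on a raw HTML string instead of `Document`, returns plain URL strings instead of `PageMeta`, and the `/list` URL rule and the `example.com` addresses are assumptions made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for a NextPageParserA subclass.
final class NextPageSketch {
    // Naive absolute-link matcher; a real parser would use an HTML library.
    private static final Pattern HREF = Pattern.compile("href=\"(https?://[^\"]+)\"");

    // Global budget shared by every page this parser sees.
    private final AtomicInteger remaining;

    NextPageSketch(int maxPages) {
        this.remaining = new AtomicInteger(maxPages);
    }

    /** Assumed rule: only list pages are mined for further links. */
    boolean needParserThisPage(String url) {
        return url.contains("/list");
    }

    /** Returns the links to fetch next, stopping once the budget is used up. */
    List<String> parser(String url, String html) {
        List<String> next = new ArrayList<>();
        if (!needParserThisPage(url)) {
            return next;
        }
        Matcher m = HREF.matcher(html);
        while (m.find() && tryClaim()) {
            next.add(m.group(1));
        }
        return next;
    }

    /** Atomically takes one slot from the budget; false when none is left. */
    private boolean tryClaim() {
        while (true) {
            int left = remaining.get();
            if (left <= 0) {
                return false;
            }
            if (remaining.compareAndSet(left, left - 1)) {
                return true;
            }
        }
    }
}
```

Passing `Integer.MAX_VALUE`, as the execution demo does, makes the budget effectively unlimited. Returning an empty list rather than `null` is a choice of this sketch; the README does not say which one the executor expects.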

## Page parsing demo: the content the user actually cares about

public class PageParserDemo extends FetchParser<String> {
    /**
     * The collector gathers and stores the page parsing results.
     * 
     * @param collector the result collector
     */
    public PageParserDemo(FetchCollector<String> collector) {
        super(collector);
    }

    /**
     * Whether this URL matches the parsing rules.
     */
    @Override
    public boolean needParser(String url) {
        return false;
    }

    /**
     * Parses the page.
     * 
     * @param page information about this page's parent page
     */
    @Override
    protected List<String> parser(PageMetas page, Document document) {
        return null;
    }

}
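
To make the two stubbed methods above more concrete, here is a minimal sketch of a parser that accepts only detail-page URLs and extracts the page title, handing the result to an in-memory list that plays the collector's role. It is a stand-alone illustration rather than a real `FetchParser<String>` subclass: it takes raw HTML instead of `PageMetas` and `Document`, and the URL pattern, the `handle` method, and the `results` accessor are assumptions introduced for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for a FetchParser<String> subclass.
final class TitleParserSketch {
    // Assumed rule: only URLs shaped like /item/<digits> are parsed.
    private static final Pattern DETAIL_URL =
            Pattern.compile("https?://example\\.com/item/\\d+");
    private static final Pattern TITLE =
            Pattern.compile("<title>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // In-memory list standing in for FetchCollector<String>.
    private final List<String> collected = new ArrayList<>();

    /** Whether this URL matches the parsing rules. */
    boolean needParser(String url) {
        return DETAIL_URL.matcher(url).matches();
    }

    /** Extracts the trimmed page title, or nothing if the page has none. */
    List<String> parser(String html) {
        Matcher m = TITLE.matcher(html);
        if (!m.find()) {
            return Collections.emptyList();
        }
        return Collections.singletonList(m.group(1).trim());
    }

    /** Parses the page only when its URL qualifies, then collects the results. */
    void handle(String url, String html) {
        if (needParser(url)) {
            collected.addAll(parser(html));
        }
    }

    List<String> results() {
        return collected;
    }
}
```

The real `FetchCollector` is constructed with `1000` and a save path in the execution demo; what the number means is not documented here, so the sketch does not try to reproduce it.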

about me

netcarry's People

Contributors

yinyayun
