
hemin1003 / java-spider


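A hands-on Java crawler framework built as a secondary development on top of the webmagic framework. It can already crawl news content from Tencent, Sohu, and Toutiao (integrated as a separate feature). Used together with Elasticsearch, it crawls automatically and is in live production use.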

JavaScript 14.63% Java 85.37%
spider scraper webmagic elasticsearch

java-spider's Introduction

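Java Crawler Framework in Practice

A hands-on Java crawler framework built as a secondary development on top of the webmagic framework. It can already crawl news content from Tencent, Sohu, and Toutiao (integrated as a separate feature; see the tutorial link). Used together with Elasticsearch, it crawls automatically and is currently in trial use in production.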
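
As one illustration of how crawled articles might be handed to Elasticsearch, the sketch below builds a request body for the Elasticsearch `_bulk` endpoint using only the JDK. This README does not show the project's own indexing code, so everything here is an assumption: the class `BulkBodyBuilder`, the index name, the type name, and the field names are all hypothetical.

```java
import java.util.Map;

// Hypothetical helper for illustration; not taken from java-spider.
final class BulkBodyBuilder {

    /** Escapes a string so it can sit inside a JSON string literal. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    /**
     * Builds a newline-delimited JSON body: for each document, one "index"
     * action line followed by one source line. The body ends with a newline,
     * as the bulk format requires.
     */
    static String build(String index, String type,
                        Map<String, Map<String, String>> docsById) {
        StringBuilder body = new StringBuilder();
        for (Map.Entry<String, Map<String, String>> doc : docsById.entrySet()) {
            body.append("{\"index\":{\"_index\":\"").append(escape(index))
                .append("\",\"_type\":\"").append(escape(type))
                .append("\",\"_id\":\"").append(escape(doc.getKey()))
                .append("\"}}\n");
            body.append('{');
            boolean first = true;
            for (Map.Entry<String, String> field : doc.getValue().entrySet()) {
                if (!first) {
                    body.append(',');
                }
                body.append('"').append(escape(field.getKey())).append("\":\"")
                    .append(escape(field.getValue())).append('"');
                first = false;
            }
            body.append("}\n");
        }
        return body.toString();
    }
}
```

The resulting string would be sent with `POST /_bulk` and the content type `application/x-ndjson`. All field values are treated as strings here to keep the sketch short; a real indexer would more likely use a JSON library and an Elasticsearch client.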

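Back-office management and statistics system source code

Demo system address: http://182.92.82.188:8280/manage/login.jsp

Demo account/password: test1001/a12345678

Back-office system source code: https://github.com/hemin1003/aylson-parent

Questions are welcome. You can add my personal QQ 469580884, or join QQ group 751925591, to discuss.

My blog

Personal domain

Thanks

If you find this useful, you can buy me a coffee: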
    



Reference project material:

Welcome to Gather Platform, a data collection and analysis platform


Readme in English

For detailed usage, see the online documentation.


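Gather Platform is a data collection and search platform built on the Webmagic core, with a web interface for task configuration and task management. It provides the following features:

  • Data collection driven by configured templates, with support for Ajax pages
  • Automatic detection of the page's main text and automatic extraction of the article's publication time when no collection template is configured
  • Dynamic field extraction and static field injection
  • Management of collected data, including search, create/read/update/delete, and re-extraction of data against a new template
  • NLP processing of collected data, including keyword extraction, summary extraction, and named-entity extraction
  • Related-article recommendation and analysis of the relationships between people and places in articles

Deployment takes 5 minutes, and a crawler can be set up in half a minute to start collecting data.

A fully functional crawler can be built without writing any code.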
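
The template-free extraction of an article's publication time, listed among the features above, can be approximated with a regular-expression scan over the page text. The sketch below is only an illustration under that assumption and is not Gather Platform's actual algorithm; the class `PublishTimeExtractor` and its pattern are hypothetical.

```java
import java.time.DateTimeException;
import java.time.LocalDateTime;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Gather Platform or webmagic.
final class PublishTimeExtractor {

    // Accepts forms such as "2018-03-05 14:30" and "2018/3/5 09:05:12", plus the
    // Chinese form that uses year/month/day marker characters (written as
    // Unicode escapes so the source stays ASCII-only).
    private static final Pattern DATE_TIME = Pattern.compile(
        "(\\d{4})[-/\u5e74](\\d{1,2})[-/\u6708](\\d{1,2})\u65e5?\\s*"
        + "(\\d{1,2}):(\\d{2})(?::(\\d{2}))?");

    /** Returns the first valid date-time found in the text, if any. */
    static Optional<LocalDateTime> extract(String text) {
        Matcher m = DATE_TIME.matcher(text);
        while (m.find()) {
            try {
                int second = m.group(6) == null ? 0 : Integer.parseInt(m.group(6));
                return Optional.of(LocalDateTime.of(
                    Integer.parseInt(m.group(1)),
                    Integer.parseInt(m.group(2)),
                    Integer.parseInt(m.group(3)),
                    Integer.parseInt(m.group(4)),
                    Integer.parseInt(m.group(5)),
                    second));
            } catch (DateTimeException e) {
                // Impossible values such as month 13: ignore and keep scanning.
            }
        }
        return Optional.empty();
    }
}
```

Taking the first valid match is a deliberate simplification. A production extractor would probably also weigh where the match appears on the page, for example near the title or inside metadata tags.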
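
To give a feel for what a configuration-driven template might do internally, the sketch below models a template as a set of dynamic fields (extracted from each page) plus static fields (injected unchanged into every record). It is a simplified illustration, not Gather Platform's implementation: the class `FieldTemplate` and its methods are hypothetical, and plain regular expressions stand in for the XPath or CSS selectors a real crawler would more likely use.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical template model for illustration only.
final class FieldTemplate {
    private final Map<String, Pattern> dynamicFields = new LinkedHashMap<>();
    private final Map<String, String> staticFields = new LinkedHashMap<>();

    /** Registers a field whose value is captured from each page by a regex. */
    FieldTemplate dynamic(String name, String regex) {
        dynamicFields.put(name, Pattern.compile(regex, Pattern.DOTALL));
        return this;
    }

    /** Registers a field whose value is the same for every record. */
    FieldTemplate fixed(String name, String value) {
        staticFields.put(name, value);
        return this;
    }

    /** Applies the template to one page and returns the extracted record. */
    Map<String, String> apply(String html) {
        // Static fields go in first; a dynamic field with the same name overrides.
        Map<String, String> record = new LinkedHashMap<>(staticFields);
        for (Map.Entry<String, Pattern> entry : dynamicFields.entrySet()) {
            Matcher m = entry.getValue().matcher(html);
            if (m.find()) {
                // Use the first capturing group if present, else the whole match.
                String value = m.groupCount() >= 1 ? m.group(1) : m.group();
                record.put(entry.getKey(), value.trim());
            }
        }
        return record;
    }
}
```

For example, `new FieldTemplate().dynamic("title", "<h1>(.*?)</h1>").fixed("source", "example-site")` applied to a page yields a record containing the page's title and the constant source label. Fields whose pattern does not match are simply left out of the record.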


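Supported on all platforms: Windows/Mac/Linux

The system requires the following dependencies:

  • JDK 8 or later
  • Tomcat 8.3 or later

Optional dependency:

  • Elasticsearch 5.0

Deployment instructions, usage, the secondary-development manual, and the FAQ have all moved to the online documentation.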


java-spider's Issues

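Can crawlers be configured automatically?

Sorry, I have not read the source code. Gather Platform is no longer available, and this project appears to be based on it, so can it configure crawlers automatically, similar to 火车头 (LocoySpider)? The demo address can no longer be reached. Also, how is this crawler meant to be used? Thanks.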
