
Free Web Scraping Tool with Java

Home Page: http://www.newcrawler.com

Shell 0.42% HTML 35.86% CSS 6.11% Java 0.29% JavaScript 57.22% Dockerfile 0.09%
spider crawler docker scraping

newcrawler's Introduction

NewCrawler

Free Web Scraping Tool

NewCrawler Quick Start

www.newcrawler.com

Linux

Installing software packages on CentOS / Fedora servers:

x86

curl -fsSL https://raw.githubusercontent.com/speed/newcrawler/master/install_i586.sh | sh

x64

curl -fsSL https://raw.githubusercontent.com/speed/newcrawler/master/install_x86_64.sh | sh

Installing software packages on Ubuntu / Debian servers:

x86

curl -fsSL https://raw.githubusercontent.com/speed/newcrawler/master/install_Debian_i586.sh | sh

x64

curl -fsSL https://raw.githubusercontent.com/speed/newcrawler/master/install_Debian_x86_64.sh | sh

Installing NewCrawler and Chrome software packages on CentOS / Fedora servers:

x64

curl -fsSL https://raw.githubusercontent.com/speed/newcrawler/master/install_NewCrawler_Chrome_MySQL_x86_64.sh | sh

	# OS version and NewCrawler directory
	
	[root@localhost ~]# rpm -q centos-release
	centos-release-7-0.1406.el7.centos.2.5.x86_64

	[root@localhost ~]# ls
	install.sh  newcrawler

	[root@localhost ~]# ls newcrawler
	db  jetty  jre  phantomjs  start.sh  stop.sh  war

Switch the database to MySQL, or keep the default file database

# Edit 'war/WEB-INF/classes/datanucleus.properties'

javax.jdo.option.ConnectionURL=jdbc:mysql://127.0.0.1:3306/newcrawler?characterEncoding=UTF-8
javax.jdo.option.ConnectionUserName=root
javax.jdo.option.ConnectionPassword=123456

Windows

x86

https://github.com/speed/windows-32bit-jetty-jre

x64

https://github.com/speed/windows-64bit-jetty-jre

Google App Engine

https://github.com/speed/newcrawler-gae-shell

Docker

docker pull newcrawler/spider

docker run -itd --net=host -p 8500:8500 --name=newcrawler newcrawler/spider

docker logs -f newcrawler

Docker (Aliyun registry)

docker run -itd --net=host -p 8500:8500 --name=newcrawler registry.cn-shenzhen.aliyuncs.com/speed/spider

Start NewCrawler

sh newcrawler/start.sh &

http://127.0.0.1:8500

Shut down NewCrawler

sh newcrawler/stop.sh

Upgrade NewCrawler

sh newcrawler/upgrade.sh

Install Chrome

https://github.com/speed/selenium

newcrawler's Issues

Exception


2012-09-16 23:50:59.694
com.soso.common.util.log.Logger error: class java.lang.NullPointerException null
    at java.util.regex.Matcher.getTextLength(Matcher.java:1269)
    at java.util.regex.Matcher.reset(Matcher.java:308)
    at java.util.regex.Matcher.<init>(Matcher.java:228)
    at java.util.regex.Pattern.matcher(Pattern.java:905)
    at com.soso.common.util.other.RegexUtil.replaceAll(RegexUtil.java:21)
    at com.soso.spider.service.impl.CrawlService.getNextLink(CrawlService.java:138)
    at com.soso.spider.service.impl.CrawlService.search(CrawlService.java:130)
    at com.soso.spider.service.impl.CrawlHtmlTestServiceImpl.crawl(CrawlHtmlTestServiceImpl.java:50)
    at com.soso.spider.hessian.api.impl.CrawlDataApiImpl.test(CrawlDataApiImpl.java:63)
    at sun.reflect.GeneratedMethodAccessor104.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:616)

Original issue reported on code.google.com by [email protected] on 17 Sep 2012 at 6:53
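
The trace above ends in `Matcher.getTextLength`, which is where `Pattern.matcher` fails when it is handed a null input. The source of `RegexUtil.replaceAll` is not shown, so the following is only a sketch of one possible guard: the class name `NullSafeRegex` and the choice to return an empty string for null input are assumptions, not the project's actual fix.

```java
import java.util.regex.Pattern;

// Hypothetical stand-in for com.soso.common.util.other.RegexUtil (real source not shown).
final class NullSafeRegex {
    private NullSafeRegex() {}

    /**
     * Replaces every match of regex in input.
     * Assumption: a null input yields "" instead of the NullPointerException
     * that Pattern.matcher(null) would raise.
     */
    static String replaceAll(String input, String regex, String replacement) {
        if (input == null) {
            return "";
        }
        return Pattern.compile(regex).matcher(input).replaceAll(replacement);
    }

    public static void main(String[] args) {
        System.out.println(replaceAll("page=1&page=2", "page=\\d+", "page=N"));
        System.out.println(replaceAll(null, "page=\\d+", "page=N").isEmpty());
    }
}
```

Whether a missing page body should be treated as empty text or reported as a crawl failure is a design decision this sketch does not settle.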

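Change the TITLE (help text) of the next-page expression

Current text:
Use '{tag name}' or '{index}' to read tag values. Format: ${1}?pageNo=${2}&param1={3}&param2={4} ...

Change to:
Use '${tag name}' or '${index}' to read tag values.
Format: ${1}?pageNo=${2}&param1={3}&param2={4} ...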

Original issue reported on code.google.com by [email protected] on 22 Oct 2012 at 6:20
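
To illustrate the `${...}` placeholder syntax this issue describes, here is a minimal sketch of how such an expression could be expanded. The class `NextPageTemplate`, the method `expand`, and the decision to leave unknown placeholders untouched are all assumptions; NewCrawler's real implementation is not shown in the issue.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of NewCrawler.
final class NextPageTemplate {
    // Matches ${name} or ${index}; the key is captured in group 1.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^}]+)\\}");

    private NextPageTemplate() {}

    /** Replaces each ${key} with values.get(key); unknown keys are left as written. */
    static String expand(String template, Map<String, String> values) {
        Matcher m = PLACEHOLDER.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = values.get(m.group(1));
            String replacement = (value != null) ? value : m.group();
            // quoteReplacement keeps '$' and '\' in the value from being read as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> values = new HashMap<>();
        values.put("1", "http://example.com/list"); // example URL, not from the issue
        values.put("2", "3");
        System.out.println(expand("${1}?pageNo=${2}", values));
    }
}
```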

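Add a publish action after a collection rule is updated

Add a publish action that follows a collection rule update. The publish button is greyed out by default and becomes available only after an update.

An updated collection rule does not take effect immediately; it takes effect only once it is published.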

Data publishing rule adjustments

Covers pushing data (Push) and pulling data (Pull).
The JSON format binds to property1, property2 ... propertyN by default.
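
One reading of the default binding is that collected field values are published as a JSON object keyed `property1` through `propertyN`, in field order. The sketch below assumes exactly that; the class `PropertyJson` and its hand-rolled string escaping are illustrative only and do not come from NewCrawler.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper showing the assumed property1..propertyN layout.
final class PropertyJson {
    private PropertyJson() {}

    /** Builds {"property1":"v1","property2":"v2",...} from the values in order. */
    static String toJson(List<String> values) {
        StringBuilder sb = new StringBuilder("{");
        for (int i = 0; i < values.size(); i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append("\"property").append(i + 1).append("\":").append(quote(values.get(i)));
        }
        return sb.append('}').toString();
    }

    // Minimal JSON string escaping: quotes, backslashes, and control characters.
    private static String quote(String s) {
        if (s == null) {
            return "null";
        }
        StringBuilder sb = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            if (c == '"' || c == '\\') {
                sb.append('\\').append(c);
            } else if (c < 0x20) {
                sb.append(String.format("\\u%04x", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.append('"').toString();
    }

    public static void main(String[] args) {
        System.out.println(toJson(Arrays.asList("Title", "Body text")));
    }
}
```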

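Add detection methods to the URL availability check


Detection methods:
1. Ping: send an HTTP request and read the response code. When the server responds with an error (e.g. 404, 500 ...), the system sends an email notification.
2. Collection rule check: match the page content against the collection rules. When no data is matched or a connection I/O exception occurs, the system sends an email notification.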

Original issue reported on code.google.com by [email protected] on 22 Aug 2012 at 4:24
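
The "Ping" method above can be sketched with the JDK's `HttpURLConnection`. Everything here is an assumption for illustration: the class `PingCheck`, the 5-second timeouts, treating any status of 400 or above as an error, and treating an I/O failure as a reason to notify. Sending the email itself is left out.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical sketch of the "Ping" detection method; not NewCrawler's code.
final class PingCheck {
    private PingCheck() {}

    /** Assumption: 4xx and 5xx (and invalid codes below 100) count as errors. */
    static boolean isErrorStatus(int status) {
        return status < 100 || status >= 400;
    }

    /** Sends a GET request and returns the HTTP response code. */
    static int fetchStatus(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setConnectTimeout(5000); // timeouts chosen arbitrarily for the sketch
        conn.setReadTimeout(5000);
        try {
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }

    /** True when a notification should be sent for this URL. */
    static boolean shouldNotify(String url) {
        try {
            return isErrorStatus(fetchStatus(url));
        } catch (IOException e) {
            return true; // assumption: an unreachable server is also worth reporting
        }
    }

    public static void main(String[] args) {
        // No network access here; only the status classification is demonstrated.
        for (int status : new int[] {200, 302, 404, 500}) {
            System.out.println(status + " -> notify: " + isErrorStatus(status));
        }
    }
}
```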

Show crawler counts on the world map, using bubble size instead of region highlighting

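Change the logic for deleting expired data


1. Add a "Data expiration days" field to collection site management
2. In scheduled tasks, rename "Delete old collected data" to "Delete expired data"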

Original issue reported on code.google.com by [email protected] on 22 Aug 2012 at 4:50
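
A per-site "data expiration days" setting suggests an expiry test along the following lines. This is a sketch under assumptions: the class `ExpiryRule`, the use of `java.time`, and the rule that a non-positive day count disables expiry are not stated in the issue.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical expiry check for the "Delete expired data" scheduled task.
final class ExpiryRule {
    private ExpiryRule() {}

    /**
     * Returns true when a record collected at collectedAt is older than expireDays at time now.
     * Assumption: expireDays <= 0 means the site never expires its data.
     */
    static boolean isExpired(Instant collectedAt, int expireDays, Instant now) {
        if (expireDays <= 0) {
            return false;
        }
        return collectedAt.plus(Duration.ofDays(expireDays)).isBefore(now);
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2012-08-22T00:00:00Z");
        System.out.println(isExpired(now.minus(Duration.ofDays(31)), 30, now));
        System.out.println(isExpired(now.minus(Duration.ofDays(29)), 30, now));
    }
}
```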

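Change the help text for the collection rule index

OLD: The collection index is very important and directly affects the collection results.
NEW: The collection index must be numeric; the system collects the fields in index order.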

Original issue reported on code.google.com by [email protected] on 21 Aug 2012 at 8:38

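Feature improvements

1. Crawl all similar URLs across a whole site from a single entry URL (2012.05.26)
2. Allow collected data to be edited and then published; features: data editing, manual publishing (2012.05.26)
  Manual publishing completed: 2012.06.01
3. URL storage rules: add a "must contain" rule field (2012.05.27)
  Completed: 2012.06.01
4. Allow the crawler to filter out duplicate collected data once while crawling (2012.05.31)
  Completed: 2012.06.02
5. Option not to save collected data, to reduce GAE datastore quota usage (2012.05.31)
  Completed: 2012.06.03
6. Make the crawl rate configurable; limiting the crawler app's concurrency also reduces the number of instances started and lowers Frontend Instance Hours quota usage (2012.06.05)
  Completed: 2012.06.11
7. Allow selecting specific rules for a collection test (2012.05.31)
  Completed: 2012.06.11
8. Export collected URLs and collected data to Excel (2012.06.12)
9. Implement OCR (2012.06.12)
  Completed: 2012.06.12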

Original issue reported on code.google.com by [email protected] on 21 Aug 2012 at 8:48
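
Item 4 in the list above (filtering duplicates while crawling) can be illustrated with a set of already-seen keys. This is a minimal in-memory sketch; the class `SeenFilter` and the idea of keying on a string such as a URL or content hash are assumptions, and the sketch is neither persistent nor thread-safe.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical duplicate filter; how NewCrawler identifies duplicates is not stated.
final class SeenFilter {
    private final Set<String> seen = new HashSet<>();

    /** Returns true the first time a key is offered and false for every repeat. */
    boolean firstTime(String key) {
        return seen.add(key);
    }

    public static void main(String[] args) {
        SeenFilter filter = new SeenFilter();
        System.out.println(filter.firstTime("http://example.com/a")); // true
        System.out.println(filter.firstTime("http://example.com/a")); // false
    }
}
```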

Plugin support


1. Fetch request plugin
2. Collection filter plugin
3. File save plugin
4. Data publishing plugin

Original issue reported on code.google.com by [email protected] on 26 Sep 2012 at 8:51

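HTML excluded tags


1. HTML tag exclusion should be case-insensitive
2. Only the tags should be removed; the text content should be kept
3. Nested tags must be matched correctly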

Original issue reported on code.google.com by [email protected] on 28 Aug 2012 at 4:28
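
The three requirements above can be met by removing opening and closing tags independently, so nesting needs no special handling. The following is a regex-based sketch, not NewCrawler's implementation: the class `TagExcluder` is hypothetical, and a regex will mis-handle attribute values that contain `>`, where a real HTML parser would be safer.

```java
import java.util.regex.Pattern;

// Hypothetical helper illustrating case-insensitive tag removal that keeps text.
final class TagExcluder {
    private TagExcluder() {}

    /** Removes <tag ...> and </tag> in any letter case; the enclosed text stays. */
    static String excludeTag(String html, String tagName) {
        // \b stops "b" from also matching "<br>" or "<body>".
        Pattern tag = Pattern.compile(
                "</?" + Pattern.quote(tagName) + "\\b[^>]*>",
                Pattern.CASE_INSENSITIVE);
        return tag.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(excludeTag("<DIV>a<div class=\"x\">b</div>c</Div>", "div"));
    }
}
```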

Quick start

Create a task: "name, URL, list page / content page"
Next step

Visual XPath extraction
DONE

Data export API

Common encodings


<select xml:lang="en" dir="ltr" name="db_collation" id="select_db_collation">
<option value=""></option>
<optgroup label="armscii8" title="ARMSCII-8 Armenian">
<option value="armscii8_bin" title="Armenian, binary">armscii8_bin</option>
<option value="armscii8_general_ci" title="Armenian, case-insensitive">armscii8_general_ci</option>
</optgroup>
<optgroup label="ascii" title="US ASCII">
<option value="ascii_bin" title="West European (multilingual), binary">ascii_bin</option>
<option value="ascii_general_ci" title="West European (multilingual), case-insensitive">ascii_general_ci</option>
</optgroup>
<optgroup label="binary" title="Binary pseudo charset">
<option value="binary" title="Binary">binary</option>
</optgroup>
<optgroup label="cp1250" title="Windows Central European">
<option value="cp1250_bin" title="Central European (multilingual), binary">cp1250_bin</option>
<option value="cp1250_croatian_ci" title="Croatian, case-insensitive">cp1250_croatian_ci</option>
<option value="cp1250_general_ci" title="Central European (multilingual), case-insensitive">cp1250_general_ci</option>
</optgroup>
<optgroup label="cp1251" title="Windows Cyrillic">
<option value="cp1251_bin" title="Cyrillic (multilingual), binary">cp1251_bin</option>
<option value="cp1251_bulgarian_ci" title="Bulgarian, case-insensitive">cp1251_bulgarian_ci</option>
<option value="cp1251_general_ci" title="Cyrillic (multilingual), case-insensitive">cp1251_general_ci</option>
<option value="cp1251_general_cs" title="Cyrillic (multilingual), case-sensitive">cp1251_general_cs</option>
<option value="cp1251_ukrainian_ci" title="Ukrainian, case-insensitive">cp1251_ukrainian_ci</option>
</optgroup>
<optgroup label="cp1256" title="Windows Arabic">
<option value="cp1256_bin" title="Arabic, binary">cp1256_bin</option>
<option value="cp1256_general_ci" title="Arabic, case-insensitive">cp1256_general_ci</option>
</optgroup>
<optgroup label="cp1257" title="Windows Baltic">
<option value="cp1257_bin" title="Baltic (multilingual), binary">cp1257_bin</option>
<option value="cp1257_general_ci" title="Baltic (multilingual), case-insensitive">cp1257_general_ci</option>
<option value="cp1257_lithuanian_ci" title="Lithuanian, case-insensitive">cp1257_lithuanian_ci</option>
</optgroup>
<optgroup label="cp850" title="DOS West European">
<option value="cp850_bin" title="West European (multilingual), binary">cp850_bin</option>
<option value="cp850_general_ci" title="West European (multilingual), case-insensitive">cp850_general_ci</option>
</optgroup>
<optgroup label="cp852" title="DOS Central European">
<option value="cp852_bin" title="Central European (multilingual), binary">cp852_bin</option>
<option value="cp852_general_ci" title="Central European (multilingual), case-insensitive">cp852_general_ci</option>
</optgroup>
<optgroup label="cp866" title="DOS Russian">
<option value="cp866_bin" title="Russian, binary">cp866_bin</option>
<option value="cp866_general_ci" title="Russian, case-insensitive">cp866_general_ci</option>
</optgroup>
<optgroup label="dec8" title="DEC West European">
<option value="dec8_bin" title="West European (multilingual), binary">dec8_bin</option>
<option value="dec8_swedish_ci" title="Swedish, case-insensitive">dec8_swedish_ci</option>
</optgroup>
<optgroup label="gbk" title="GBK Simplified Chinese">
<option value="gbk_bin" title="Simplified Chinese, binary">gbk_bin</option>
<option value="gbk_chinese_ci" title="Simplified Chinese, case-insensitive" selected="selected">gbk_chinese_ci</option>
</optgroup>
<optgroup label="geostd8" title="GEOSTD8 Georgian">
<option value="geostd8_bin" title="Georgian, binary">geostd8_bin</option>
<option value="geostd8_general_ci" title="Georgian, case-insensitive">geostd8_general_ci</option>
</optgroup>
<optgroup label="greek" title="ISO 8859-7 Greek">
<option value="greek_bin" title="Greek, binary">greek_bin</option>
<option value="greek_general_ci" title="Greek, case-insensitive">greek_general_ci</option>
</optgroup>
<optgroup label="hebrew" title="ISO 8859-8 Hebrew">
<option value="hebrew_bin" title="Hebrew, binary">hebrew_bin</option>
<option value="hebrew_general_ci" title="Hebrew, case-insensitive">hebrew_general_ci</option>
</optgroup>
<optgroup label="hp8" title="HP West European">
<option value="hp8_bin" title="West European (multilingual), binary">hp8_bin</option>
<option value="hp8_english_ci" title="English, case-insensitive">hp8_english_ci</option>
</optgroup>
<optgroup label="keybcs2" title="DOS Kamenicky Czech-Slovak">
<option value="keybcs2_bin" title="Czech-Slovak, binary">keybcs2_bin</option>
<option value="keybcs2_general_ci" title="Czech-Slovak, case-insensitive">keybcs2_general_ci</option>
</optgroup>
<optgroup label="koi8r" title="KOI8-R Relcom Russian">
<option value="koi8r_bin" title="Russian, binary">koi8r_bin</option>
<option value="koi8r_general_ci" title="Russian, case-insensitive">koi8r_general_ci</option>
</optgroup>
<optgroup label="koi8u" title="KOI8-U Ukrainian">
<option value="koi8u_bin" title="Ukrainian, binary">koi8u_bin</option>
<option value="koi8u_general_ci" title="Ukrainian, case-insensitive">koi8u_general_ci</option>
</optgroup>
<optgroup label="latin1" title="cp1252 West European">
<option value="latin1_bin" title="West European (multilingual), binary">latin1_bin</option>
<option value="latin1_danish_ci" title="Danish, case-insensitive">latin1_danish_ci</option>
<option value="latin1_general_ci" title="West European (multilingual), case-insensitive">latin1_general_ci</option>
<option value="latin1_general_cs" title="West European (multilingual), case-sensitive">latin1_general_cs</option>
<option value="latin1_german1_ci" title="German (dictionary), case-insensitive">latin1_german1_ci</option>
<option value="latin1_german2_ci" title="German (phone book), case-insensitive">latin1_german2_ci</option>
<option value="latin1_spanish_ci" title="Spanish, case-insensitive">latin1_spanish_ci</option>
<option value="latin1_swedish_ci" title="Swedish, case-insensitive">latin1_swedish_ci</option>
</optgroup>
<optgroup label="latin2" title="ISO 8859-2 Central European">
<option value="latin2_bin" title="Central European (multilingual), binary">latin2_bin</option>
<option value="latin2_croatian_ci" title="Croatian, case-insensitive">latin2_croatian_ci</option>
<option value="latin2_general_ci" title="Central European (multilingual), case-insensitive">latin2_general_ci</option>
<option value="latin2_hungarian_ci" title="Hungarian, case-insensitive">latin2_hungarian_ci</option>
</optgroup>
<optgroup label="latin5" title="ISO 8859-9 Turkish">
<option value="latin5_bin" title="Turkish, binary">latin5_bin</option>
<option value="latin5_turkish_ci" title="Turkish, case-insensitive">latin5_turkish_ci</option>
</optgroup>
<optgroup label="latin7" title="ISO 8859-13 Baltic">
<option value="latin7_bin" title="Baltic (multilingual), binary">latin7_bin</option>
<option value="latin7_estonian_cs" title="Estonian, case-sensitive">latin7_estonian_cs</option>
<option value="latin7_general_ci" title="Baltic (multilingual), case-insensitive">latin7_general_ci</option>
<option value="latin7_general_cs" title="Baltic (multilingual), case-sensitive">latin7_general_cs</option>
</optgroup>
<optgroup label="macce" title="Mac Central European">
<option value="macce_bin" title="Central European (multilingual), binary">macce_bin</option>
<option value="macce_general_ci" title="Central European (multilingual), case-insensitive">macce_general_ci</option>
</optgroup>
<optgroup label="macroman" title="Mac West European">
<option value="macroman_bin" title="West European (multilingual), binary">macroman_bin</option>
<option value="macroman_general_ci" title="West European (multilingual), case-insensitive">macroman_general_ci</option>
</optgroup>
<optgroup label="swe7" title="7bit Swedish">
<option value="swe7_bin" title="Swedish, binary">swe7_bin</option>
<option value="swe7_swedish_ci" title="Swedish, case-insensitive">swe7_swedish_ci</option>
</optgroup>
<optgroup label="utf8" title="UTF-8 Unicode">
<option value="utf8_bin" title="Unicode (multilingual), binary">utf8_bin</option>
<option value="utf8_czech_ci" title="Czech, case-insensitive">utf8_czech_ci</option>
<option value="utf8_danish_ci" title="Danish, case-insensitive">utf8_danish_ci</option>
<option value="utf8_esperanto_ci" title="Esperanto, case-insensitive">utf8_esperanto_ci</option>
<option value="utf8_estonian_ci" title="Estonian, case-insensitive">utf8_estonian_ci</option>
<option value="utf8_general_ci" title="Unicode (multilingual), case-insensitive">utf8_general_ci</option>
<option value="utf8_hungarian_ci" title="Hungarian, case-insensitive">utf8_hungarian_ci</option>
<option value="utf8_icelandic_ci" title="Icelandic, case-insensitive">utf8_icelandic_ci</option>
<option value="utf8_latvian_ci" title="Latvian, case-insensitive">utf8_latvian_ci</option>
<option value="utf8_lithuanian_ci" title="Lithuanian, case-insensitive">utf8_lithuanian_ci</option>
<option value="utf8_persian_ci" title="Persian, case-insensitive">utf8_persian_ci</option>
<option value="utf8_polish_ci" title="Polish, case-insensitive">utf8_polish_ci</option>
<option value="utf8_roman_ci" title="West European, case-insensitive">utf8_roman_ci</option>
<option value="utf8_romanian_ci" title="Romanian, case-insensitive">utf8_romanian_ci</option>
<option value="utf8_slovak_ci" title="Slovak, case-insensitive">utf8_slovak_ci</option>
<option value="utf8_slovenian_ci" title="Slovenian, case-insensitive">utf8_slovenian_ci</option>
<option value="utf8_spanish2_ci" title="Traditional Spanish, case-insensitive">utf8_spanish2_ci</option>
<option value="utf8_spanish_ci" title="Spanish, case-insensitive">utf8_spanish_ci</option>
<option value="utf8_swedish_ci" title="Swedish, case-insensitive">utf8_swedish_ci</option>
<option value="utf8_turkish_ci" title="Turkish, case-insensitive">utf8_turkish_ci</option>
<option value="utf8_unicode_ci" title="Unicode (multilingual), case-insensitive">utf8_unicode_ci</option>
</optgroup>
</select>

Original issue reported on code.google.com by [email protected] on 27 Sep 2012 at 10:20

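Site pagination rule changes

1. Rename "Current page regex" to "Next page number"
  The next page number can be read with a regular expression; the system takes the content of the first capture group as the next page number.
  Example: <a title="下一页" href="/\?page=(\d+)" (here "下一页" is the link title "Next page" on the target site)
2. Rename "Pagination link regex" to "Pagination link"
  Format: "http://blog.02ta.com/?page=(*)", where "(*)" marks the position of the page number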

Original issue reported on code.google.com by [email protected] on 22 Aug 2012 at 1:43
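
The two settings described in this issue combine naturally: the first capture group of the "Next page number" regex supplies the number, and it replaces `(*)` in the "Pagination link" format. The sketch below shows that combination under assumptions: the class `PaginationRule` and method `nextPageUrl` are hypothetical, and the demo uses an ASCII link title ("Next") rather than the Chinese title from the issue.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper combining the two pagination settings; not NewCrawler's code.
final class PaginationRule {
    private PaginationRule() {}

    /**
     * Finds the next page number with nextPageRegex (first capture group)
     * and substitutes it for "(*)" in linkFormat.
     * Returns empty when the regex does not match or has no capture group.
     */
    static Optional<String> nextPageUrl(String html, String nextPageRegex, String linkFormat) {
        Matcher m = Pattern.compile(nextPageRegex).matcher(html);
        if (!m.find() || m.groupCount() < 1) {
            return Optional.empty();
        }
        return Optional.ofNullable(m.group(1)).map(page -> linkFormat.replace("(*)", page));
    }

    public static void main(String[] args) {
        String html = "<a title=\"Next\" href=\"/?page=7\">Next</a>";
        String regex = "<a title=\"Next\" href=\"/\\?page=(\\d+)\"";
        System.out.println(
                nextPageUrl(html, regex, "http://blog.02ta.com/?page=(*)").orElse("no next page"));
    }
}
```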

BUG


1. Collection site management: remove the "Auto collect" field
2. Data collection rules: rename "Offset tag" to "Collection offset"

Original issue reported on code.google.com by [email protected] on 3 Sep 2012 at 2:42

Style change

Style for the "Keep me logged in" checkbox:
margin: 0;
vertical-align: bottom;

Original issue reported on code.google.com by [email protected] on 12 Oct 2012 at 8:16
