
gecco's People

Contributors

aimilin6688, baiqirui, bearswallow, chncaption, cvedetect, dependabot[bot], letsky, ls9527, mconintet, patcon, shangjian, xiaomaoguai, xtuhcy

gecco's Issues

How to set the crawler's timeout

int timeout = 1000;
if (context != null) {
    currDownloader = context.getDownloader();
    before = context.getBeforeDownload();
    after = context.getAfterDownload();
    timeout = context.getTimeout();
} else {
    currDownloader = engine.getSpiderBeanFactory().getDownloaderFactory().defaultDownloader();
}
if (before != null) {
    before.process(request);
}
HttpResponse response = currDownloader.download(request, timeout);

This code appears to download the page through the downloader, but the timeout is hard-coded, which is inconvenient. Could an interface be exposed so that users can set it themselves?

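Could a changelog be published?

Could you add a changelog for each release, listing which features were added and which were changed? That would make it easier for everyone to upgrade promptly.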

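Using proxy IPs in the crawler

When a crawler is used frequently, some sites take countermeasures. Will support for proxy IPs be considered in the code?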

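Error when creating multiple GeccoEngine instances in non-blocking mode

GeccoEngine.create()
                .classpath("xxx")
                .pipelineFactory(springPipelineFactory)
                .interval(1000)
                .start(request).start();

Creating several engines as shown above produces the following exception: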

Exception in thread "GeccoEngine" java.lang.IllegalStateException: zip file closed
	at java.util.zip.ZipFile.ensureOpen(ZipFile.java:669)
	at java.util.zip.ZipFile.access$200(ZipFile.java:60)
	at java.util.zip.ZipFile$ZipEntryIterator.hasNext(ZipFile.java:493)
	at java.util.zip.ZipFile$ZipEntryIterator.hasMoreElements(ZipFile.java:488)
	at java.util.jar.JarFile$JarEntryIterator.hasNext(JarFile.java:253)
	at java.util.jar.JarFile$JarEntryIterator.hasMoreElements(JarFile.java:262)
	at org.reflections.vfs.ZipDir$1$1.computeNext(ZipDir.java:30)
	at org.reflections.vfs.ZipDir$1$1.computeNext(ZipDir.java:26)
	at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:145)
	at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:140)
	at org.reflections.Reflections.scan(Reflections.java:243)
	at org.reflections.Reflections.scan(Reflections.java:202)
	at org.reflections.Reflections.<init>(Reflections.java:123)
	at org.reflections.Reflections.<init>(Reflections.java:168)
	at org.reflections.Reflections.<init>(Reflections.java:141)
	at com.geccocrawler.gecco.monitor.GeccoJmx.export(GeccoJmx.java:16)
	at com.geccocrawler.gecco.GeccoEngine.run(GeccoEngine.java:250)

I do not want to configure a starts.json file at startup. Is there a way to avoid the error? If starts.json is not configured, an exception is thrown.

Thanks!

What does this ERROR mean? Thanks

I get the following error log:
ERROR com.geccocrawler.gecco.spider.Spider - cant't match url : https://someurl
and Gecco shuts down after that:
2287 [GeccoEngine] DEBUG org.apache.http.impl.conn.PoolingHttpClientConnectionManager - Connection manager is shutting down
2287 [GeccoEngine] DEBUG org.apache.http.impl.conn.PoolingHttpClientConnectionManager - Connection manager shut down
2295 [GeccoEngine] INFO com.geccocrawler.gecco.GeccoEngine - close gecco!
What is happening here, and how can I solve it?

NullPointerException

public String getContent(InputStream instream, long contentLength, String charset) throws IOException {
    try {
        if (instream == null) {
            return null;
        }
        int i = (int)contentLength;
        if (i < 0) {
            i = 4096;
        }
        Reader reader = new InputStreamReader(instream, charset);
        CharArrayBuffer buffer = new CharArrayBuffer(i);
        char[] tmp = new char[1024];
        int l;
        while((l = reader.read(tmp)) != -1) {
            buffer.append(tmp, 0, l);
        }
        return buffer.toString();
    } finally {
        instream.reset();
    }
}

If the InputStream argument is null, instream.reset() in the finally block throws a NullPointerException.
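
One possible fix, sketched here under assumptions: the null check is moved ahead of the try block so that the finally clause is never reached with a null stream. The class name NullSafeContentReader is hypothetical, and StringBuilder stands in for CharArrayBuffer so that the sketch needs only the JDK.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;

// Hypothetical helper class; not part of Gecco.
final class NullSafeContentReader {

    // Reads the whole stream as text and then rewinds it, as the quoted method does.
    static String getContent(InputStream instream, long contentLength, String charset) throws IOException {
        if (instream == null) {
            return null; // checked before try, so finally never sees a null stream
        }
        try {
            int capacity = (int) contentLength;
            if (capacity < 0) {
                capacity = 4096; // unknown length: fall back to a default buffer size
            }
            Reader reader = new InputStreamReader(instream, charset);
            StringBuilder buffer = new StringBuilder(capacity);
            char[] tmp = new char[1024];
            int n;
            while ((n = reader.read(tmp)) != -1) {
                buffer.append(tmp, 0, n);
            }
            return buffer.toString();
        } finally {
            instream.reset(); // safe: instream is known to be non-null here
        }
    }
}
```

As in the quoted code, reset() still requires a stream that supports it, such as a ByteArrayInputStream.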

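URL regex matching and dynamically generated fields

  1. URL regex matching
    Does the URL support regular-expression matching? In my own tests it does not.

  2. Dynamically generated fields
    I declared a List collection of Integer: .listField("storiesId", Integer.class)
    Stepping through with the debugger, the value parsed from the page is an Integer array. But when beanMap.put assigns it, a type-conversion exception is thrown (Object cannot be cast to Integer). Changing the dynamic field type to String also throws a conversion exception. It only ran successfully after I changed the type to Object. Can collection fields only be declared as Object?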

Proxy IPs

How can proxy IPs be updated dynamically? Any hints would be appreciated.
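
As a generic illustration of the idea, and not a description of Gecco's own proxy support, a proxy list can be swapped at runtime by publishing a new immutable list through a volatile field while readers rotate over whatever snapshot they see. The class RotatingProxyPool and the "host:port" input format are assumptions made for this sketch.

```java
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper class; not part of Gecco.
final class RotatingProxyPool {
    private volatile List<Proxy> proxies = Collections.emptyList();
    private final AtomicLong counter = new AtomicLong();

    // Replaces the whole proxy list in one step; entries are "host:port" strings.
    void update(List<String> hostPorts) {
        List<Proxy> fresh = new ArrayList<>();
        for (String hostPort : hostPorts) {
            int colon = hostPort.lastIndexOf(':');
            String host = hostPort.substring(0, colon);
            int port = Integer.parseInt(hostPort.substring(colon + 1));
            fresh.add(new Proxy(Proxy.Type.HTTP, InetSocketAddress.createUnresolved(host, port)));
        }
        proxies = Collections.unmodifiableList(fresh);
    }

    // Returns the next proxy in round-robin order, or NO_PROXY when the list is empty.
    Proxy next() {
        List<Proxy> snapshot = proxies; // read once so a concurrent update cannot change the size mid-call
        if (snapshot.isEmpty()) {
            return Proxy.NO_PROXY;
        }
        int index = (int) Math.floorMod(counter.getAndIncrement(), (long) snapshot.size());
        return snapshot.get(index);
    }
}
```

A background task could call update(...) whenever a fresh list is fetched, while download threads call next() before each request.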

Content after # in the URL cannot be matched

matchUrl = "https://www.xxx.com/currencies/{codeName}/#/?coin_code={codeID}"

The address used is: https://www.xxx.com/currencies/dogecoin/#/?coin_code=1

But the log reports:

[ ERROR] [2018-03-14 11:13:29] com.geccocrawler.gecco.spider.Spider [98] - cant't match url : https://www.xxx.com/currencies/dogecoin/
[ INFO ] [2018-03-14 11:13:29] com.geccocrawler.gecco.GeccoEngine [355] - close gecco!

Could the author please take a look?
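
The URL in the error log stops at the last slash, which is consistent with the part after # having been dropped before matching. That part is the URI fragment, which is not sent to the server. The sketch below uses only java.net.URI to show how the reported address splits up. Where exactly Gecco discards the fragment is not stated in the issue, and the class FragmentDemo is hypothetical.

```java
import java.net.URI;

// Hypothetical demo class; not part of Gecco.
final class FragmentDemo {

    // Returns the URL without its fragment, i.e. everything before '#'.
    static String withoutFragment(String url) {
        int hash = url.indexOf('#');
        return hash < 0 ? url : url.substring(0, hash);
    }

    public static void main(String[] args) {
        String url = "https://www.xxx.com/currencies/dogecoin/#/?coin_code=1";
        URI uri = URI.create(url);
        System.out.println("path     = " + uri.getPath());
        System.out.println("query    = " + uri.getQuery());    // the "?coin_code=1" text belongs to the fragment
        System.out.println("fragment = " + uri.getFragment());
        System.out.println("matched  = " + withoutFragment(url));
    }
}
```

Because coin_code sits inside the fragment rather than the query, a matchUrl pattern that relies on it would have nothing to match against once the fragment is gone.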

Cross-domain ajax calls

If the ajax endpoint to be accessed is only reachable via JSONP, the returned data has the form callback(data). How should this be handled?

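Many packages cannot be found

I cloned the source code locally and running MyGithub.java fails; for example, "com.alibaba.fastjson.JSON" cannot be found. I then manually added the following to pom.xml:

<dependency>
        <groupId>com.alibaba</groupId>
        <artifactId>fastjson</artifactId>
        <version>1.2.12</version>
</dependency>

The symbol still cannot be found.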

Cannot set User-Agent; Read timed out

It would be good to add automatic retries in a loop after a read failure.
ERROR com.geccocrawler.gecco.spider.Spider:run(107) - com.geccocrawler.gecco.downloader.DownloadException: java.net.SocketTimeoutException: Read timed out
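
A minimal sketch of the requested retry behaviour, written as a generic wrapper rather than as a Gecco feature: the download is passed in as a Callable and is retried only on SocketTimeoutException. The class RetryingDownload, its parameters, and the linear backoff are assumptions. In the log above the timeout arrives wrapped in a DownloadException, so real code would have to unwrap the cause first.

```java
import java.net.SocketTimeoutException;
import java.util.concurrent.Callable;

// Hypothetical helper class; not part of Gecco.
final class RetryingDownload {

    // Calls the task up to maxAttempts times, sleeping a little longer after each timeout.
    static <T> T withRetries(Callable<T> task, int maxAttempts, long backoffMillis) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        SocketTimeoutException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (SocketTimeoutException e) {
                last = e; // remember the failure and try again
                if (attempt < maxAttempts) {
                    Thread.sleep(backoffMillis * attempt);
                }
            }
        }
        throw last; // every attempt timed out
    }
}
```

Any other exception type propagates immediately, so only read timeouts trigger another attempt.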

Execution order in HtmlRender

Is there a particular reason for the order of JSVarFieldRender and AjaxFieldRender in the HtmlRender class?
Currently AjaxFieldRender runs first and JSVarFieldRender runs second.

Why does the URL fail to match?

[Spider-0] WARN o.a.h.c.p.ResponseProcessCookies - Invalid cookie header: "Set-Cookie: logged_in=no; domain=.github.com; path=/; expires=Fri, 19 Jun 2037 08:57:04 -0000; secure; HttpOnly". Invalid 'expires' attribute: Fri, 19 Jun 2037 08:57:04 -0000
[Spider-0] ERROR c.g.gecco.spider.Spider - cant't match url : https://github.com/xtuhcy/gecco
I ran the Quick start and hit this URL mismatch. What is the cause? Thanks.

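How can two pieces of data that belong to one record be merged?

For example, I want to crawl product information. The list page holds part of the information and the detail page holds another part, and both parts belong to the same product. How can I merge them? The handlers for the two requests do not call each other, so how can values be passed between them?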
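
One framework-independent way to approach this, sketched under assumptions: both handlers hand their partial data to a shared, thread-safe store keyed by a product identifier, and whichever part arrives second receives the merged record. The class PartialRecordMerger and the use of string maps for the fields are hypothetical. Gecco may offer its own way to pass values between requests, which the issue does not describe.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper class; not part of Gecco. Assumes exactly two parts per key.
final class PartialRecordMerger {
    private final Map<String, Map<String, String>> pending = new ConcurrentHashMap<>();

    // Stores the first part for a key and returns null; when the second part
    // arrives, removes the stored part and returns both merged into one map.
    Map<String, String> offer(String key, Map<String, String> part) {
        Map<String, String> first = pending.putIfAbsent(key, new HashMap<>(part));
        if (first == null) {
            return null; // this was the first part; wait for the other one
        }
        pending.remove(key);
        Map<String, String> merged = new HashMap<>(first);
        merged.putAll(part);
        return merged;
    }
}
```

The list-page pipeline and the detail-page pipeline would each call offer(productId, fields) and persist the record only when a non-null result comes back.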

Multithreading setting has no effect

GeccoEngine.create()
        .classpath("com.wanshifu.customer.support.logger")
        .thread(8)
        .loop(false)
        .interval(1)
        .start("http://baojia.3hk.cn")
        .run();


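HtmlBean not found after adding the Maven dependency

Hello. My project has added this framework as a dependency, but when I write my own bean that implements HtmlBean, I am told the class does not exist, and the related annotations cannot be used either. The Maven project shows no dependency errors. Why is this?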

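JsonFieldRender rendering data returned by an ajax request

The data returned by this ajax request appears to be a server-rendered HTML fragment, so I wrote a CustomFieldRender to parse the data in it.
The problem is that on every request, Object json = JSON.parse(jsonStr) in JsonFieldRender throws c.g.gecco.spider.render.RenderException, which is a little annoying.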

Does com.geccocrawler.gecco.dynamic.HtlmBean exist in DynamicGecco?

Running DynamicJD from the tests reports:
javassist.NotFoundException: com.geccocrawler.gecco.webmagic.HtmlBean
	at javassist.ClassPool.get(ClassPool.java:452)
	at com.geccocrawler.gecco.dynamic.JavassistDynamicBean.<init>(JavassistDynamicBean.java:76)
	at com.geccocrawler.gecco.dynamic.DynamicGecco.html(DynamicGecco.java:24)
	at com.geccocrawler.gecco.demo.dynamic.DynamicJD.main(DynamicJD.java:21)
[2018.09.17 10:20:03] ERROR com.geccocrawler.gecco.dynamic.JavassistDynamicBean:<init>(84) - create class com.geccocrawler.gecco.dynamic.HtlmBeanLtBhDd238270934710728 error.
javassist.NotFoundException: com.geccocrawler.gecco.webmagic.HtmlBean
	at javassist.ClassPool.get(ClassPool.java:452)
	at com.geccocrawler.gecco.dynamic.JavassistDynamicBean.<init>(JavassistDynamicBean.java:76)
	at com.geccocrawler.gecco.dynamic.DynamicGecco.html(DynamicGecco.java:24)
	at com.geccocrawler.gecco.demo.dynamic.DynamicJD.main(DynamicJD.java:24)
[2018.09.17 10:20:03] ERROR com.geccocrawler.gecco.dynamic.JavassistDynamicBean:<init>(84) - create class com.geccocrawler.gecco.dynamic.HtlmBeanbsSPeq238270939146017 error.
javassist.NotFoundException: com.geccocrawler.gecco.webmagic.HtmlBean
	at javassist.ClassPool.get(ClassPool.java:452)
	at com.geccocrawler.gecco.dynamic.JavassistDynamicBean.<init>(JavassistDynamicBean.java:76)
	at com.geccocrawler.gecco.dynamic.DynamicGecco.html(DynamicGecco.java:24)
	at com.geccocrawler.gecco.demo.dynamic.DynamicJD.main(DynamicJD.java:31)
[2018.09.17 10:20:07] ERROR com.geccocrawler.gecco.dynamic.JavassistDynamicBean:<init>(84) - create class com.geccocrawler.gecco.dynamic.HtlmBeanpbZIxS238274843082106 error.
javassist.NotFoundException: com.geccocrawler.gecco.webmagic.HtmlBean
	at javassist.ClassPool.get(ClassPool.java:452)
	at com.geccocrawler.gecco.dynamic.JavassistDynamicBean.<init>(JavassistDynamicBean.java:76)
	at com.geccocrawler.gecco.dynamic.DynamicGecco.html(DynamicGecco.java:24)
	at com.geccocrawler.gecco.demo.dynamic.DynamicJD.main(DynamicJD.java:39)
[2018.09.17 10:20:08] ERROR com.geccocrawler.gecco.dynamic.JavassistDynamicBean:<init>(84) - create class com.geccocrawler.gecco.dynamic.HtlmBeanbVTiPG238276603663875 error.
javassist.NotFoundException: com.geccocrawler.gecco.webmagic.HtmlBean
	at javassist.ClassPool.get(ClassPool.java:452)
	at com.geccocrawler.gecco.dynamic.JavassistDynamicBean.<init>(JavassistDynamicBean.java:76)
	at com.geccocrawler.gecco.dynamic.DynamicGecco.html(DynamicGecco.java:24)
	at com.geccocrawler.gecco.demo.dynamic.DynamicJD.main(DynamicJD.java:47)
[2018.09.17 10:20:10] ERROR com.geccocrawler.gecco.dynamic.JavassistDynamicBean:<init>(84) - create class com.geccocrawler.gecco.dynamic.JsonBeanErCvwv238278235622259 error.
javassist.NotFoundException: com.geccocrawler.gecco.webmagic.JsonBean
	at javassist.ClassPool.get(ClassPool.java:452)
	at com.geccocrawler.gecco.dynamic.JavassistDynamicBean.<init>(JavassistDynamicBean.java:79)
	at com.geccocrawler.gecco.dynamic.DynamicGecco.json(DynamicGecco.java:28)
	at com.geccocrawler.gecco.demo.dynamic.DynamicJD.main(DynamicJD.java:56)
[2018.09.17 10:20:11] ERROR com.geccocrawler.gecco.dynamic.JavassistDynamicBean:<init>(84) - create class com.geccocrawler.gecco.dynamic.JsonBeanyZOiDd238278987743392 error.
javassist.NotFoundException: com.geccocrawler.gecco.webmagic.JsonBean
	at javassist.ClassPool.get(ClassPool.java:452)
	at com.geccocrawler.gecco.dynamic.JavassistDynamicBean.<init>(JavassistDynamicBean.java:79)
	at com.geccocrawler.gecco.dynamic.DynamicGecco.json(DynamicGecco.java:28)
	at com.geccocrawler.gecco.demo.dynamic.DynamicJD.main(DynamicJD.java:62)
[2018.09.17 10:20:12] ERROR com.geccocrawler.gecco.dynamic.JavassistDynamicBean:<init>(84) - create class com.geccocrawler.gecco.dynamic.HtlmBeanPaQYRZ238280131554757 error.
javassist.NotFoundException: com.geccocrawler.gecco.webmagic.HtmlBean
	at javassist.ClassPool.get(ClassPool.java:452)
	at com.geccocrawler.gecco.dynamic.JavassistDynamicBean.<init>(JavassistDynamicBean.java:76)
	at com.geccocrawler.gecco.dynamic.DynamicGecco.html(DynamicGecco.java:24)
	at com.geccocrawler.gecco.demo.dynamic.DynamicJD.main(DynamicJD.java:69)

Threads do not terminate properly even when started in blocking mode

Related issue: #41

GeccoEngine.create()
                    .classpath("xxx")
                    .pipelineFactory(springPipelineFactory)
                    .interval(1000)
                    .thread(thread)
                    .start(request).run();

System.out.println("here");

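On each subsequent start the threads keep accumulating until the server's maximum thread count is reached, and the process finally crashes.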

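How to track down where the problem is

After the crawler had been running for about 12 hours, it suddenly reported an error. I opened the page to check, and the page elements show nothing wrong. How should I debug this?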

2018-01-22 13:13:17.827  WARN 8178 --- [r.gecco.spring0] c.g.gecco.downloader.AbstractDownloader  : inputstream org.apache.http.conn.EofSensorInputStream don't to byte inputstream!
2018-01-22 13:13:17.828 ERROR 8178 --- [r.gecco.spring0] com.geccocrawler.gecco.spider.Spider     : http://www.mzitu.com/116664/42 ERROR : java.lang.NullPointerExceptionnull
2018-01-22 13:13:21.979  INFO 8178 --- [    GeccoEngine] com.geccocrawler.gecco.GeccoEngine       : close gecco!

How to handle a JS callback

try{userInfoCallBack({"errcode":0,"msg":"success.","userdata":{"renderJDDate":[{"r":0,"msg":{"pin":"jlinxy","userLevel":62,"nickname":"Story_龙鳞","yunMidImageUrl":"http://storage.jd.com/i.imageUpload/6a6c696e787931333738363137353733393435_mid.jpg","levelName":"金牌用户","verifyEmail":"1258****@qq.com","verifyMobile":"176*****197"}}]}});}catch(e){}
JSON.parse(jsonStr) throws an exception on this response.
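
A response like the one above is JavaScript rather than JSON, so the payload has to be cut out of the callback wrapper before it is parsed. The following is a minimal sketch of one way to do that with plain string scanning; the class JsonpUnwrap is hypothetical and not part of Gecco. It takes the text between the first opening parenthesis and its matching closing parenthesis, skipping parentheses that appear inside double-quoted strings.

```java
// Hypothetical helper class; not part of Gecco.
final class JsonpUnwrap {

    // Returns the argument of the first function call in the body,
    // or the trimmed body itself when it already looks like plain JSON.
    static String extractJson(String body) {
        String text = body.trim();
        if (text.startsWith("{") || text.startsWith("[")) {
            return text; // already plain JSON
        }
        int open = text.indexOf('(');
        if (open < 0) {
            return text; // no wrapper found
        }
        int depth = 0;
        boolean inString = false;
        boolean escaped = false;
        for (int i = open; i < text.length(); i++) {
            char c = text.charAt(i);
            if (inString) {
                if (escaped) {
                    escaped = false;        // character after a backslash is taken literally
                } else if (c == '\\') {
                    escaped = true;
                } else if (c == '"') {
                    inString = false;
                }
            } else if (c == '"') {
                inString = true;
            } else if (c == '(') {
                depth++;
            } else if (c == ')') {
                depth--;
                if (depth == 0) {
                    return text.substring(open + 1, i).trim();
                }
            }
        }
        throw new IllegalArgumentException("unbalanced parentheses in callback wrapper");
    }
}
```

The extracted string can then be handed to a JSON parser. The same approach covers both a bare callback(data) response and the try{...}catch(e){} form shown in this issue.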

cant't match url

[2018.11.02 23:37:10] ERROR com.geccocrawler.gecco.spider.Spider:run(77) - cant't match url : http://www.dianping.com/citylist

Process finished with exit code 0

After downloading the source code, the example does not run and reports the error above. Has anyone else run into this?

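Enhancing and extending Spider and Scheduler

Hello. Will extension points for Spider and Scheduler be added later? From what I see in the source code, the relevant objects are currently created with new directly in GeccoEngine, so extending them requires modifying the GeccoEngine class.
What I mainly want is an extension similar to ScheduledThreadPoolExecutor or Netty's EventLoopGroup, so that crawl tasks can be added more freely and conveniently.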

Listener has no effect when started with run

GeccoEngine.create()
        .classpath("xxx")
        .pipelineFactory(springPipelineFactory)
        .interval(1000)
        .thread(thread)
        .setEventListener(listener)
        .start(request).run();

The listener's methods are never called.

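Data stream in the returned HttpResponse is lost

Once String content = EntityUtils.toString(responseEntity, charset); has executed, the stream set earlier by resp.setRaw(responseEntity.getContent()); has already been consumed, so no content can be read from it any more.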
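
A common way around a one-shot stream, sketched here with JDK classes only, is to read the bytes once into memory and then derive both the text and any number of fresh streams from that buffer. The class BufferedBody is hypothetical. How the result would be wired back into HttpResponse is not covered by the issue and is left out.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;

// Hypothetical helper class; not part of Gecco or HttpClient.
final class BufferedBody {
    private final byte[] bytes;

    private BufferedBody(byte[] bytes) {
        this.bytes = bytes;
    }

    // Drains the source stream exactly once and keeps its bytes in memory.
    static BufferedBody readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);
        }
        return new BufferedBody(out.toByteArray());
    }

    // Decodes the buffered bytes; can be called any number of times.
    String asString(Charset charset) {
        return new String(bytes, charset);
    }

    // Returns a new independent stream over the same bytes on every call.
    InputStream asStream() {
        return new ByteArrayInputStream(bytes);
    }
}
```

The trade-off is memory: the whole body is held in a byte array, which is usually acceptable for HTML pages but less so for large downloads.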

Connection timed out

My computer has normal internet access and can reach GitHub, but running the demo produces a connection timeout.
ERROR com.geccocrawler.gecco.spider.Spider:run(109) - https://github.com/xtuhcy/gecco DOWNLOAD ERROR :org.apache.http.conn.HttpHostConnectException: Connect to github.com:443 [github.com/192.30.255.112, github.com/192.30.255.113] failed: Connection timed out: connect
What is the problem here?

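Custom Spider to handle DownloadServerException

The existing Spider class only logs DownloadServerException, but I want to put the failed URL back into the crawl queue.
I hope GeccoEngine can support setting a custom Spider; the current run method reassigns spiders (see the code in the image below).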


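Parameter encoding: encode and decode problem

At line 87 of the UrlMatcher class in version 1.2.5:
value = URLDecoder.decode(value, "UTF-8");
The encoding is hard-coded. It should use the Charset that the user set in HttpRequest;
otherwise mismatched encodings will produce garbled text.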
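
The effect of the mismatch can be reproduced with the JDK alone. The sketch below percent-encodes a two-character Chinese string as GBK and then decodes it once with the matching charset and once with the UTF-8 fallback. The class DecodeWithRequestCharset and its decode helper are hypothetical and only illustrate the suggested behaviour of preferring the request's charset.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical helper class; not part of Gecco.
final class DecodeWithRequestCharset {

    // Decodes with the charset configured on the request, falling back to UTF-8 when none is set.
    static String decode(String value, String requestCharset) throws UnsupportedEncodingException {
        String charset = (requestCharset == null || requestCharset.isEmpty()) ? "UTF-8" : requestCharset;
        return URLDecoder.decode(value, charset);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String original = "\u4e2d\u6587";                    // two Chinese characters
        String encoded = URLEncoder.encode(original, "GBK"); // what a GBK site would put in its URLs
        System.out.println("GBK decode matches:   " + original.equals(decode(encoded, "GBK")));
        System.out.println("UTF-8 decode matches: " + original.equals(decode(encoded, null)));
    }
}
```

Decoding GBK bytes as UTF-8 yields replacement characters instead of the original text, which is the garbling the issue describes.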

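Problem detecting the charset

When crawling a GBK site (for example 51job), the text comes out garbled. I debugged it, and the problem is in the download method of the HttpClientDownloader class: after if(status == 200) (around line 180), request.getCharset() is null for some reason, and contentType only yields text/html without charset=gbk, so the default utf-8 is used and the text is garbled. Please confirm.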
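
When neither the request nor the Content-Type header names a charset, a common fallback is to look for a charset declaration in the first few kilobytes of the page itself. The sketch below shows that lookup order under assumptions: the class CharsetSniffer, the 4096-byte window, and the regular expression are illustrative choices and not Gecco's actual implementation.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; not part of Gecco.
final class CharsetSniffer {
    // Matches charset=gbk, charset="gbk" and charset='gbk', ignoring case.
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*[\"']?([\\w-]+)", Pattern.CASE_INSENSITIVE);

    // Lookup order: explicit request charset, Content-Type header, <meta> tag in the body, then UTF-8.
    static String detect(String requestCharset, String contentType, byte[] body) {
        if (requestCharset != null && !requestCharset.isEmpty()) {
            return requestCharset;
        }
        String fromHeader = find(contentType);
        if (fromHeader != null) {
            return fromHeader;
        }
        // ISO-8859-1 maps every byte to one char, so ASCII markup survives whatever the real charset is.
        int length = Math.min(body.length, 4096);
        String head = new String(body, 0, length, StandardCharsets.ISO_8859_1);
        String fromMeta = find(head);
        return fromMeta != null ? fromMeta : "UTF-8";
    }

    private static String find(String text) {
        if (text == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(text);
        return m.find() ? m.group(1) : null;
    }
}
```

For a page served as plain text/html whose markup declares gbk in a meta tag, detect(null, "text/html", bytes) would return the declared name instead of the UTF-8 default.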

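Threads not terminated correctly?

There are roughly 5,000 to 10,000 links, started as follows:

List<HttpRequest> requests;
GeccoEngine.create()
                .pipelineFactory(springPipelineFactory)
                .classpath("xxx")
                .interval(2000)
                .thread(thread)
                .start(requests).start();

After running for a while, the project reports an error:

java.lang.OutOfMemoryError: Unable to create new native thread

I inspected the system threads.

For links that have already been crawled, the run method has not ended. Why does the crawler not stop its threads once it has finished? Thanks.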
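
For comparison, this is how bounded worker threads are usually released with the standard java.util.concurrent API: a fixed-size pool is shut down after the work is submitted and then awaited. The class BoundedCrawlRunner is a hypothetical, generic sketch. It does not show how GeccoEngine manages its threads, which the issue leaves open.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical helper class; not part of Gecco.
final class BoundedCrawlRunner {

    // Runs all tasks on a fixed-size pool and releases the pool's threads afterwards.
    // Returns true if every task finished within the timeout.
    static boolean runAll(List<Runnable> tasks, int threads, long timeoutSeconds) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            for (Runnable task : tasks) {
                pool.submit(task);
            }
        } finally {
            pool.shutdown(); // accept no new work; already submitted tasks still run
        }
        if (!pool.awaitTermination(timeoutSeconds, TimeUnit.SECONDS)) {
            pool.shutdownNow(); // interrupt tasks that are still running
            return false;
        }
        return true;
    }
}
```

With this pattern the thread count never exceeds the pool size, and no worker thread outlives the batch it was created for.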

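Parsing of List properties fails

If the List content is each row of a table, no data is parsed.
For example, with HTML content such as ` 1 2 3 4 `, jsoup returns no data: jsoup turns the text into ` 1 2 3 4 `, so nothing can be parsed. I hope the author can fix this. Thanks.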
