xtuhcy / gecco
Easy to use lightweight web crawler (易用的轻量化网络爬虫)
License: MIT License
int timeout = 1000;
if (context != null) {
    currDownloader = context.getDownloader();
    before = context.getBeforeDownload();
    after = context.getAfterDownload();
    timeout = context.getTimeout();
} else {
    currDownloader = engine.getSpiderBeanFactory().getDownloaderFactory().defaultDownloader();
}
if (before != null) {
    before.process(request);
}
HttpResponse response = currDownloader.download(request, timeout);
This code appears to download the page through the downloader, but the fallback timeout is hard-coded, which is inconvenient. Could an interface be added so users can set it themselves?
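One way this could look (a hypothetical sketch, not gecco's actual API; the names DownloadConfig and resolveTimeout are illustrative): keep 1000 ms only as a last-resort fallback, let users override the default once via a builder-style setter, and let a per-context timeout still take precedence.

```java
// Hypothetical sketch of a user-configurable default timeout replacing the
// hard-coded 1000 ms fallback. Not part of gecco's real API.
public class DownloadConfig {
    private int defaultTimeoutMillis = 1000; // the current hard-coded fallback

    // builder-style setter so users can override the default once at setup
    public DownloadConfig defaultTimeout(int millis) {
        this.defaultTimeoutMillis = millis;
        return this;
    }

    // a per-request (context) timeout still wins over the configured default
    public int resolveTimeout(Integer contextTimeout) {
        return contextTimeout != null ? contextTimeout : defaultTimeoutMillis;
    }
}
```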
Could you add a changelog for each version, listing what was added and what was changed? That would make it easier for everyone to upgrade promptly.
When crawling heavily, some sites take countermeasures. Is proxy IP support being considered for the code?
GeccoEngine.create()
.classpath("xxx")
.pipelineFactory(springPipelineFactory)
.interval(1000)
.start(request).start();
Creating multiple engines as above throws the following exception:
Exception in thread "GeccoEngine" java.lang.IllegalStateException: zip file closed
at java.util.zip.ZipFile.ensureOpen(ZipFile.java:669)
at java.util.zip.ZipFile.access$200(ZipFile.java:60)
at java.util.zip.ZipFile$ZipEntryIterator.hasNext(ZipFile.java:493)
at java.util.zip.ZipFile$ZipEntryIterator.hasMoreElements(ZipFile.java:488)
at java.util.jar.JarFile$JarEntryIterator.hasNext(JarFile.java:253)
at java.util.jar.JarFile$JarEntryIterator.hasMoreElements(JarFile.java:262)
at org.reflections.vfs.ZipDir$1$1.computeNext(ZipDir.java:30)
at org.reflections.vfs.ZipDir$1$1.computeNext(ZipDir.java:26)
at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:145)
at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:140)
at org.reflections.Reflections.scan(Reflections.java:243)
at org.reflections.Reflections.scan(Reflections.java:202)
at org.reflections.Reflections.<init>(Reflections.java:123)
at org.reflections.Reflections.<init>(Reflections.java:168)
at org.reflections.Reflections.<init>(Reflections.java:141)
at com.geccocrawler.gecco.monitor.GeccoJmx.export(GeccoJmx.java:16)
at com.geccocrawler.gecco.GeccoEngine.run(GeccoEngine.java:250)
At startup I don't want to configure a starts.json file. Is there a way to avoid the error? If starts.json is not configured, an exception is thrown. Thanks!
Sometimes, to avoid crawling duplicates, a custom Scheduler is needed. How do I implement and register one?
I get the following error log:
"ERROR com.geccocrawler.gecco.spider.Spider - cant't match url : https://someurl
and Gecco shuts down after that:
2287 [GeccoEngine] DEBUG org.apache.http.impl.conn.PoolingHttpClientConnectionManager - Connection manager is shutting down
2287 [GeccoEngine] DEBUG org.apache.http.impl.conn.PoolingHttpClientConnectionManager - Connection manager shut down
2295 [GeccoEngine] INFO com.geccocrawler.gecco.GeccoEngine - close gecco!
What's happening here, and how can I solve it?
If the InputStream parameter is null, instream.reset() in the finally block throws a NullPointerException.
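A minimal defensive guard for the reported NPE (an illustrative sketch; the real fix would live in gecco's downloader code): check for null before resetting, and only reset streams that support mark/reset.

```java
import java.io.IOException;
import java.io.InputStream;

public class StreamUtil {
    // Reset the stream defensively: a null or non-resettable stream becomes a
    // no-op instead of a NullPointerException in the finally block.
    public static void safeReset(InputStream in) {
        if (in == null) {
            return; // the reported case: null input parameter
        }
        try {
            if (in.markSupported()) {
                in.reset();
            }
        } catch (IOException ignored) {
            // a failed reset should not mask the original exception
        }
    }
}
```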
URL regex matching
Does URL matching support regular expressions? In my own testing it does not.
Dynamically generated field issue
I declare a List of Integer with .listField("storiesId", Integer.class). Stepping through the parse with a debugger, the value extracted from the page is an Integer array, but the beanMap.put assignment throws a ClassCastException ("Object can not cast Integer"). Changing the dynamic field's type to String also throws a conversion exception; only Object runs successfully. Can collection fields only be declared as Object?
How can the proxy IP be updated dynamically? Could you give a hint?
matchUrl = "https://www.xxx.com/currencies/{codeName}/#/?coin_code={codeID}"
The URL used is: https://www.xxx.com/currencies/dogecoin/#/?coin_code=1
but it reports:
[ ERROR] [2018-03-14 11:13:29] com.geccocrawler.gecco.spider.Spider [98] - cant't match url : https://www.xxx.com/currencies/dogecoin/
[ INFO ] [2018-03-14 11:13:29] com.geccocrawler.gecco.GeccoEngine [355] - close gecco!
Could the author please take a look?
If the ajax endpoint can only be accessed as JSONP, the response body has the form callback(data). How should this be handled?
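A common approach (a sketch, assuming the callback wrapper is the outermost pair of parentheses) is to strip the JSONP padding before handing the body to the JSON parser:

```java
public class Jsonp {
    // Strip a JSONP wrapper like callback({...}) down to the JSON payload.
    // Assumes the wrapper is the outermost '(' ... ')' pair; bodies that are
    // not JSONP are returned unchanged.
    public static String strip(String body) {
        int open = body.indexOf('(');
        int close = body.lastIndexOf(')');
        if (open < 0 || close <= open) {
            return body; // not a JSONP response
        }
        return body.substring(open + 1, close);
    }
}
```

The stripped string can then be parsed normally, e.g. with fastjson's JSON.parse.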
I cloned the source locally, but running MyGithub.java fails, e.g. "com.alibaba.fastjson.JSON" cannot be found. I then manually added the following to pom.xml:
<dependency>
<groupId>com.alibaba</groupId>
<artifactId>fastjson</artifactId>
<version>1.2.12</version>
</dependency>
but the symbol is still not found.
It would be nice to add automatic retries after a failed download.
ERROR com.geccocrawler.gecco.spider.Spider:run(107) - com.geccocrawler.gecco.downloader.DownloadException: java.net.SocketTimeoutException: Read timed out
In the HtmlRender class, what is the reasoning behind the ordering of JSVarFieldRender and AjaxFieldRender? Currently AjaxFieldRender runs first and JSVarFieldRender second.
[Spider-0] WARN o.a.h.c.p.ResponseProcessCookies - Invalid cookie header: "Set-Cookie: logged_in=no; domain=.github.com; path=/; expires=Fri, 19 Jun 2037 08:57:04 -0000; secure; HttpOnly". Invalid 'expires' attribute: Fri, 19 Jun 2037 08:57:04 -0000
[Spider-0] ERROR c.g.gecco.spider.Spider - cant't match url : https://github.com/xtuhcy/gecco
Running the Quick Start produces this URL-mismatch error. What could be the cause? Thanks!
For example, when crawling product data, the list page has part of the information and the detail page has the rest, both belonging to the same product. How do I merge the two? The two requests' handler methods have no call relationship, so how do I pass values between them?
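One general way to merge the two partial records (an illustrative Java pattern, not a built-in gecco feature; Product and merge are hypothetical names) is a shared map keyed by product id: each handler writes its half, and the record is emitted once both halves are present.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class ProductMerger {
    // Hypothetical partial record: one half from the list page, one half from
    // the detail page, matched by product id.
    public static class Product {
        public String id;
        public String listInfo;
        public String detailInfo;
        boolean complete() { return listInfo != null && detailInfo != null; }
    }

    private final ConcurrentMap<String, Product> pending = new ConcurrentHashMap<>();

    // Called by either handler; returns the merged product once both halves
    // have arrived, otherwise null.
    public synchronized Product merge(String id, String listInfo, String detailInfo) {
        Product p = pending.computeIfAbsent(id, k -> new Product());
        p.id = id;
        if (listInfo != null) p.listInfo = listInfo;
        if (detailInfo != null) p.detailInfo = detailInfo;
        if (p.complete()) {
            pending.remove(id); // done: stop tracking this product
            return p;
        }
        return null;
    }
}
```

In gecco terms, the list-page pipeline and the detail-page pipeline would both call merge(...) and persist only the non-null result; passing values forward on the follow-up request itself, if the framework exposes request attributes, is the other common route.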
Login should be a core crawler feature. Is there a wiki or API for it?
Hello, I added this framework to my project, but when I write my own bean implementing HtmlBean, the class is reported as missing and the corresponding annotations cannot be used either. Maven shows no dependency errors. Why is that?
This ajax request returns a server-rendered HTML fragment, so I wrote a CustomFieldRender to parse the data myself. The problem is that on every request, JsonFieldRender's Object json = JSON.parse(jsonStr) throws c.g.gecco.spider.render.RenderException, which is annoying.
Nice work!
Running DynamicJD from the tests reports:
javassist.NotFoundException: com.geccocrawler.gecco.webmagic.HtmlBean
at javassist.ClassPool.get(ClassPool.java:452)
at com.geccocrawler.gecco.dynamic.JavassistDynamicBean.&lt;init&gt;(JavassistDynamicBean.java:76)
at com.geccocrawler.gecco.dynamic.DynamicGecco.html(DynamicGecco.java:24)
at com.geccocrawler.gecco.demo.dynamic.DynamicJD.main(DynamicJD.java:21)
[2018.09.17 10:20:03] ERROR com.geccocrawler.gecco.dynamic.JavassistDynamicBean:(84) - create class com.geccocrawler.gecco.dynamic.HtlmBeanLtBhDd238270934710728 error.
javassist.NotFoundException: com.geccocrawler.gecco.webmagic.HtmlBean
at javassist.ClassPool.get(ClassPool.java:452)
at com.geccocrawler.gecco.dynamic.JavassistDynamicBean.&lt;init&gt;(JavassistDynamicBean.java:76)
at com.geccocrawler.gecco.dynamic.DynamicGecco.html(DynamicGecco.java:24)
at com.geccocrawler.gecco.demo.dynamic.DynamicJD.main(DynamicJD.java:24)
[2018.09.17 10:20:03] ERROR com.geccocrawler.gecco.dynamic.JavassistDynamicBean:(84) - create class com.geccocrawler.gecco.dynamic.HtlmBeanbsSPeq238270939146017 error.
javassist.NotFoundException: com.geccocrawler.gecco.webmagic.HtmlBean
at javassist.ClassPool.get(ClassPool.java:452)
at com.geccocrawler.gecco.dynamic.JavassistDynamicBean.&lt;init&gt;(JavassistDynamicBean.java:76)
at com.geccocrawler.gecco.dynamic.DynamicGecco.html(DynamicGecco.java:24)
at com.geccocrawler.gecco.demo.dynamic.DynamicJD.main(DynamicJD.java:31)
[2018.09.17 10:20:07] ERROR com.geccocrawler.gecco.dynamic.JavassistDynamicBean:(84) - create class com.geccocrawler.gecco.dynamic.HtlmBeanpbZIxS238274843082106 error.
javassist.NotFoundException: com.geccocrawler.gecco.webmagic.HtmlBean
at javassist.ClassPool.get(ClassPool.java:452)
at com.geccocrawler.gecco.dynamic.JavassistDynamicBean.&lt;init&gt;(JavassistDynamicBean.java:76)
at com.geccocrawler.gecco.dynamic.DynamicGecco.html(DynamicGecco.java:24)
at com.geccocrawler.gecco.demo.dynamic.DynamicJD.main(DynamicJD.java:39)
[2018.09.17 10:20:08] ERROR com.geccocrawler.gecco.dynamic.JavassistDynamicBean:(84) - create class com.geccocrawler.gecco.dynamic.HtlmBeanbVTiPG238276603663875 error.
javassist.NotFoundException: com.geccocrawler.gecco.webmagic.HtmlBean
at javassist.ClassPool.get(ClassPool.java:452)
at com.geccocrawler.gecco.dynamic.JavassistDynamicBean.&lt;init&gt;(JavassistDynamicBean.java:76)
at com.geccocrawler.gecco.dynamic.DynamicGecco.html(DynamicGecco.java:24)
at com.geccocrawler.gecco.demo.dynamic.DynamicJD.main(DynamicJD.java:47)
[2018.09.17 10:20:10] ERROR com.geccocrawler.gecco.dynamic.JavassistDynamicBean:(84) - create class com.geccocrawler.gecco.dynamic.JsonBeanErCvwv238278235622259 error.
javassist.NotFoundException: com.geccocrawler.gecco.webmagic.JsonBean
at javassist.ClassPool.get(ClassPool.java:452)
at com.geccocrawler.gecco.dynamic.JavassistDynamicBean.&lt;init&gt;(JavassistDynamicBean.java:79)
at com.geccocrawler.gecco.dynamic.DynamicGecco.json(DynamicGecco.java:28)
at com.geccocrawler.gecco.demo.dynamic.DynamicJD.main(DynamicJD.java:56)
[2018.09.17 10:20:11] ERROR com.geccocrawler.gecco.dynamic.JavassistDynamicBean:(84) - create class com.geccocrawler.gecco.dynamic.JsonBeanyZOiDd238278987743392 error.
javassist.NotFoundException: com.geccocrawler.gecco.webmagic.JsonBean
at javassist.ClassPool.get(ClassPool.java:452)
at com.geccocrawler.gecco.dynamic.JavassistDynamicBean.&lt;init&gt;(JavassistDynamicBean.java:79)
at com.geccocrawler.gecco.dynamic.DynamicGecco.json(DynamicGecco.java:28)
at com.geccocrawler.gecco.demo.dynamic.DynamicJD.main(DynamicJD.java:62)
[2018.09.17 10:20:12] ERROR com.geccocrawler.gecco.dynamic.JavassistDynamicBean:(84) - create class com.geccocrawler.gecco.dynamic.HtlmBeanPaQYRZ238280131554757 error.
javassist.NotFoundException: com.geccocrawler.gecco.webmagic.HtmlBean
at javassist.ClassPool.get(ClassPool.java:452)
at com.geccocrawler.gecco.dynamic.JavassistDynamicBean.&lt;init&gt;(JavassistDynamicBean.java:76)
at com.geccocrawler.gecco.dynamic.DynamicGecco.html(DynamicGecco.java:24)
at com.geccocrawler.gecco.demo.dynamic.DynamicJD.main(DynamicJD.java:69)
Could the logging facade be switched to slf4j?
Btw, I see some System.out calls in the code; could those be changed to logging as well?
If that's fine, I can make the change and open an MR.
Related issue: #41
GeccoEngine.create()
.classpath("xxx")
.pipelineFactory(springPipelineFactory)
.interval(1000)
.thread(thread)
.start(request).run();
System.out.println("here");
On each subsequent start, threads keep accumulating until the server's thread limit is reached and the process crashes.
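The usual pattern to avoid unbounded thread growth (a general Java sketch, not gecco's API; CrawlRunner is an illustrative name) is to create the worker pool once and submit each new batch to the same pool, instead of building a fresh engine, and therefore fresh threads, per run:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class CrawlRunner {
    // One shared, bounded pool for the whole application lifetime.
    private final ExecutorService pool;

    public CrawlRunner(int threads) {
        this.pool = Executors.newFixedThreadPool(threads);
    }

    // Each batch reuses the same threads instead of spawning new ones.
    public void submitBatch(Runnable batch) {
        pool.submit(batch);
    }

    // Shut the pool down exactly once, when the application exits.
    public void shutdown() throws InterruptedException {
        pool.shutdown();
        pool.awaitTermination(30, TimeUnit.SECONDS);
    }
}
```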
After the crawler had run for about 12 hours it suddenly errored; opening the page and inspecting it, the page elements look fine. How should I debug this?
2018-01-22 13:13:17.827 WARN 8178 --- [r.gecco.spring0] c.g.gecco.downloader.AbstractDownloader : inputstream org.apache.http.conn.EofSensorInputStream don't to byte inputstream!
2018-01-22 13:13:17.828 ERROR 8178 --- [r.gecco.spring0] com.geccocrawler.gecco.spider.Spider : http://www.mzitu.com/116664/42 ERROR : java.lang.NullPointerExceptionnull
2018-01-22 13:13:21.979 INFO 8178 --- [ GeccoEngine] com.geccocrawler.gecco.GeccoEngine : close gecco!
How can the parsed information (HtmlBean) be saved to a file? Through a Pipeline?
try{userInfoCallBack({"errcode":0,"msg":"success.","userdata":{"renderJDDate":[{"r":0,"msg":{"pin":"jlinxy","userLevel":62,"nickname":"Story_龙鳞","yunMidImageUrl":"http://storage.jd.com/i.imageUpload/6a6c696e787931333738363137353733393435_mid.jpg","levelName":"金牌用户","verifyEmail":"1258****@qq.com","verifyMobile":"176*****197"}}]}});}catch(e){}
JSON.parse(jsonStr) throws an exception on this body.
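Because this response also has a try { ... } catch (e) {} wrapper, simply taking the outermost parentheses is not enough: the last ')' belongs to catch(e). A sketch that instead extracts the first balanced JSON object (assuming no unbalanced braces inside string values):

```java
public class JsonExtract {
    // Extract the first balanced {...} object from a JSONP/try-catch wrapped
    // body, e.g.  try{cb({"a":1});}catch(e){}  ->  {"a":1}
    // Caveat: naive brace counting; '{' or '}' inside JSON string values
    // would confuse it.
    public static String firstJsonObject(String body) {
        int paren = body.indexOf('(');                    // start of the callback call
        int start = body.indexOf('{', paren < 0 ? 0 : paren); // first brace inside it
        if (start < 0) {
            return body;
        }
        int depth = 0;
        for (int i = start; i < body.length(); i++) {
            char c = body.charAt(i);
            if (c == '{') {
                depth++;
            } else if (c == '}' && --depth == 0) {
                return body.substring(start, i + 1);
            }
        }
        return body; // unbalanced: return as-is
    }
}
```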
[2018.11.02 23:37:10] ERROR com.geccocrawler.gecco.spider.Spider:run(77) - cant't match url : http://www.dianping.com/citylist
Process finished with exit code 0
After downloading the source, running the examples fails with the error above. Has anyone else run into this?
How do I configure a proxy?
Hello, will extension points for Spider and Scheduler be added later? Looking at the source, GeccoEngine constructs these objects directly with new, so extending them currently requires modifying the GeccoEngine class itself.
Haha, I mainly want to add an extension mechanism similar to ScheduledThreadPoolExecutor or Netty's EventLoopGroup, to add crawl tasks more freely and conveniently.
GeccoEngine.create()
.classpath("xxx")
.pipelineFactory(springPipelineFactory)
.interval(1000)
.thread(thread)
.setEventListener(listener)
.start(request).run();
The listener's methods are never called.
After String content = EntityUtils.toString(responseEntity, charset) executes, the stream previously passed to resp.setRaw(responseEntity.getContent()) has already been consumed, so no data can be read from it anymore.
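A way to avoid the double consumption (a sketch of the general pattern, not gecco's exact code): drain the entity into a byte array exactly once, then build both the decoded string and the raw stream from that buffer.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class EntityBuffer {
    // Drain a stream into a reusable byte array so it can be "read" any
    // number of times afterwards.
    public static byte[] toBytes(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }
}
```

The downloader could then call toBytes(responseEntity.getContent()) once, build the String with new String(raw, charset), and hand resp.setRaw(new ByteArrayInputStream(raw)) a replayable copy (hypothetical wiring).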
My machine has normal internet access and can reach github.com, but running the demo produces a connection timeout.
ERROR com.geccocrawler.gecco.spider.Spider:run(109) - https://github.com/xtuhcy/gecco DOWNLOAD ERROR :org.apache.http.conn.HttpHostConnectException: Connect to github.com:443 [github.com/192.30.255.112, github.com/192.30.255.113] failed: Connection timed out: connect
What is the problem here?
I'd like to download the reference manual, but http://www.geccocrawler.com/ is currently unreachable.
Calling start again after pause has no effect. How do I fix this?
I can't find a use case that downloads images to local disk via @image(download="/yours/path/img").
Where is ProductDetail used?
At line 87 of the UrlMatcher class in version 1.2.5:
value = URLDecoder.decode(value, "UTF-8");
the encoding is hard-coded. It should use the Charset the user set on the HttpRequest; otherwise the encoding mismatch produces garbled text.
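The fix could look like this (a sketch; the request-charset accessor is assumed): fall back to UTF-8 only when the request does not specify a charset.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class UrlDecode {
    // Decode using the request's charset when available, UTF-8 otherwise,
    // so URL parameter decoding matches the page encoding.
    public static String decode(String value, String requestCharset)
            throws UnsupportedEncodingException {
        String cs = (requestCharset == null || requestCharset.isEmpty())
                ? "UTF-8" : requestCharset;
        return URLDecoder.decode(value, cs);
    }
}
```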
[Important notice] [Security alert] fastjson remote code execution vulnerability
fastjson recently disclosed a code-execution vulnerability: when a user submits a carefully crafted malicious serialized payload to the server, a flaw in fastjson's deserialization can lead to remote arbitrary code execution. Affected versions: 1.2.24 and earlier.
Running the com.geccocrawler.gecco.demo.ajax.JDDetail demo from the source extracts data fine, but moving the same code into my own project, extraction no longer works. Is extra configuration needed? Thanks.
When crawling GBK-encoded sites (e.g. 51job), the output is garbled. Debugging shows the problem is in HttpClientDownloader's download method, after if(status == 200) (around line 180): request.getCharset() is null for some reason, and contentType only yields text/html without charset=gbk, so the default utf-8 is used and the text comes out garbled. Please confirm.
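A charset-resolution fallback chain could be sketched like this (illustrative, not gecco's code): an explicit request charset first, then the Content-Type header, then a default.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharsetDetect {
    private static final Pattern CHARSET =
            Pattern.compile("charset=([\\w-]+)", Pattern.CASE_INSENSITIVE);

    // Resolve the response charset: an explicit request charset wins, then the
    // Content-Type header (e.g. "text/html; charset=gbk"), then the default.
    public static String resolve(String requestCharset, String contentType,
                                 String defaultCharset) {
        if (requestCharset != null && !requestCharset.isEmpty()) {
            return requestCharset;
        }
        if (contentType != null) {
            Matcher m = CHARSET.matcher(contentType);
            if (m.find()) {
                return m.group(1);
            }
        }
        return defaultCharset;
    }
}
```

When the header carries only text/html, as in this report, a further step would be sniffing the &lt;meta charset&gt; declaration from the first bytes of the body before falling back to UTF-8.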
There are roughly 5-10k links, used as follows:
List<HttpRequest> requests;
GeccoEngine.create()
.pipelineFactory(springPipelineFactory)
.classpath("xxx")
.interval(2000)
.thread(thread)
.start(requests).start();
After running for a while, the project errors out:
java.lang.OutOfMemoryError: Unable to create new native thread
Looking at the system threads, the run methods for links that have already finished are never terminated. Why don't the threads stop once crawling completes? Thanks.
If the list content is each row of a table, the data fails to parse!
For example: `