gravitylabs / goose

Html Content / Article Extractor in Scala - open sourced from Gravity Labs

Home Page: http://gravity.com

License: Apache License 2.0

Scala 100.00%

goose's Issues

Bad Java examples of content extraction

Hi,

I'm trying to use Goose in a Java environment and am having some trouble. I used the example found on the front page (goose.extractContent...) but found that it creates a thread every time I call this method. To remove these threads I have to stop the global Goose.crawlingActor() object, which of course prevents access from any other thread; this is not acceptable. I found an example in the test directory, StaticHTMLTest.java, which is exactly what I want, but it's all commented out. Trying to instantiate ContentExtractor as shown in this example produces the error "com.gravity.goose.extractors.ContentExtractor is abstract; cannot be instantiated". How can I just run the content extractor without starting threads (or with only one extra thread)? Will more valid examples be included in the source tree soon?

Thanks

NullPointerException error when parsing content

Calling:

GooseContentParser contentParser = new GooseContentParser(new Configuration());
contentParser.parseContentUsingGoose("http://www.tulsaworld.com/site/articlepath.aspx?articleid=20111118_61_A16_Opposi344152&rss_lnk=7");

Causes a NullPointerException. I reported the link on the demo page as it fails to load this article as well.

nytimes.com getting bad redir / canonical link?

I haven't made the jump to Scala, I'm on basically the last Java version, so feel free to tell me to do my own debugging (I have some customizations that would need porting), but I'm seeing an issue with nytimes urls. For example, from today's front page:

http://www.nytimes.com/2011/09/20/arts/design/preserving-the-american-folk-art-museums-place-in-new-york.html?ref=arts

Is somehow being redirected behind the scenes to www10.nytimes.com which always lands me on the login screen if I use that canonical url. It's correctly extracting the content and top image, etc, but it's got the wrong domain and canonical URL, which makes me think it's something nyt changes in their paywall bouncing redirs?

Is it just me? Maybe I'll just rewrite the canonical url and forget about it for now.

Great work as always!

ImageUtils appears not to release http connections

After a while my log becomes full of these messages and it just hangs. This didn't appear to be an issue in the previous version:
INFO com.gravity.goose.images.ImageUtils - org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection
....

at com.gravity.goose.images.ImageUtils$.fetchEntity(ImageUtils.scala:267)
at com.gravity.goose.images.ImageUtils$.storeImageToLocalFile(ImageUtils.scala:172)
at com.gravity.goose.images.UpgradedImageIExtractor.getLocallyStoredImage(UpgradedImageIExtractor.scala:465)
at com.gravity.goose.images.UpgradedImageIExtractor$$anonfun$com$gravity$goose$images$UpgradedImageIExtractor$$findImagesThatPassByteSizeTest$1.apply(UpgradedImageIExtractor.scala:348)
at com.gravity.goose.images.UpgradedImageIExtractor$$anonfun$com$gravity$goose$images$UpgradedImageIExtractor$$findImagesThatPassByteSizeTest$1.apply(UpgradedImageIExtractor.scala:341)
at scala.collection.Iterator$class.foreach(Iterator.scala:652)

However, previously I was calling ImageExtractor directly, and now I'm using com.gravity.goose.Goose.extractContent, which appears to be calling UpgradedImageIExtractor:

com.gravity.goose.images.UpgradedImageIExtractor.com$gravity$goose$images$UpgradedImageIExtractor$$findImagesThatPassByteSizeTest(UpgradedImageIExtractor.scala:341)

Please put Goose under an Open Source License

Would help a lot.

Jar does not work in Android - tmp folder not found

I tested this jar for goose:
https://www.dropbox.com/s/h0tu7bhl834ylnz/goose.jar from this thread:
https://github.com/jiminoc/goose/issues/59 by qnex.
It works in a normal Java project, but in Android I get this error:

10-05 12:45:05.858: E/AndroidRuntime(1825): FATAL EXCEPTION: main
10-05 12:45:05.858: E/AndroidRuntime(1825): java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
10-05 12:45:05.858: E/AndroidRuntime(1825):     at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:638)
10-05 12:45:05.858: E/AndroidRuntime(1825):     at dalvik.system.NativeStart.main(Native Method)
10-05 12:45:05.858: E/AndroidRuntime(1825): Caused by: java.lang.reflect.InvocationTargetException
10-05 12:45:05.858: E/AndroidRuntime(1825):     at java.lang.reflect.Method.invokeNative(Native Method)
10-05 12:45:05.858: E/AndroidRuntime(1825):     at java.lang.reflect.Method.invoke(Method.java:507)
10-05 12:45:05.858: E/AndroidRuntime(1825):     at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:880)
10-05 12:45:05.858: E/AndroidRuntime(1825):     ... 2 more
10-05 12:45:05.858: E/AndroidRuntime(1825): Caused by: java.lang.Exception: /tmp/goose directory does not seem to exist, you need to set this for image processing downloads
10-05 12:45:05.858: E/AndroidRuntime(1825):     at com.gravity.goose.Goose.initializeEnvironment(Goose.scala:68)
10-05 12:45:05.858: E/AndroidRuntime(1825):     at com.gravity.goose.Goose.<init>(Goose.scala:31)
10-05 12:45:05.858: E/AndroidRuntime(1825):     at com.example.goose.MainActivity$1.onClick(MainActivity.java:27)
10-05 12:45:05.858: E/AndroidRuntime(1825):     at android.view.View.performClick(View.java:2485)
10-05 12:45:05.858: E/AndroidRuntime(1825):     at android.view.View$PerformClick.run(View.java:9081)
10-05 12:45:05.858: E/AndroidRuntime(1825):     at android.os.Handler.handleCallback(Handler.java:587)
10-05 12:45:05.858: E/AndroidRuntime(1825):     at android.os.Handler.dispatchMessage(Handler.java:92)
10-05 12:45:05.858: E/AndroidRuntime(1825):     at android.os.Looper.loop(Looper.java:130)
10-05 12:45:05.858: E/AndroidRuntime(1825):     at android.app.ActivityThread.main(ActivityThread.java:3770)
10-05 12:45:05.858: E/AndroidRuntime(1825):     ... 5 more

Can someone help me with that? :)

Fails to get correct content from some pages.

https://github.com/jiminoc/goose
http://37signals.com/svn/posts/3113-how-key-based-cache-expiration-works

How to obtain the binary data of topImage ?

Hi,

I modified TalkToMeGoose.scala to output more fields and plan to use it from the command line. However, when I do:

println(article.topImage)

I get

com.gravity.goose.images.Image@36db492

Which is not a file path, url or binary data. I looked in /tmp but couldn't find the image file in there.

Can you please show me how to get the image file?

Thanks

Not extracted the exact content

http://allthingsd.com/20111123/nokia-siemens-to-cut-17000-jobs/
http://allthingsd.com/20111123/tripadvisor-makes-itself-available-offline-to-help-travelers-avoid-roaming-charges/

The tag holding the actual content scores lower than an irrelevant tag, and link density is also higher in the original content.

Goose in php

Hello,

What is the best way to use Goose from a web service running on PHP?

Thanks,
Rui Gaspar

Slate.com does not parse properly with Goose

Using topNode, Goose throws away parts of the text that do not score well. Each paragraph is a separate div, and Goose removes those that are too short.

nytimes.com extraction problems

I see that nytimes.com is on your list of sites that still need unit testing.

I've successfully installed Goose and have run the unit tests without trouble.

Here are two issues I found while trying to extract text from nytimes.com:

When I run the code as is on any nytimes article (try: http://www.nytimes.com/2010/12/20/opinion/20cohen.html), I get the following output:

INFO [main] (HtmlFetcher.java:203) - Initializing HttpClient
INFO [main] (DefaultRequestDirector.java:491) - I/O exception (java.net.SocketException) caught when processing request: Connection reset
INFO [main] (DefaultRequestDirector.java:498) - Retrying request
INFO [main] (DefaultRequestDirector.java:491) - I/O exception (java.net.SocketException) caught when processing request: Connection reset
INFO [main] (DefaultRequestDirector.java:498) - Retrying request
INFO [main] (DefaultRequestDirector.java:491) - I/O exception (java.net.SocketException) caught when processing request: Connection reset
INFO [main] (DefaultRequestDirector.java:498) - Retrying request
WARN [main] (HtmlFetcher.java:132) - Connection reset
INFO [main] (HtmlFetcher.java:159) - starting...
INFO [main] (HtmlFetcher.java:161) - HTMLRESULT is empty or null

I'm not a Java guy, but when I tried to do something similar in Python, I discovered that setting the User-agent header causes nytimes.com to respond with a 301 status and never return a 200. If I comment out line 244 in HtmlFetcher.java in Goose so that the User-agent header is not set, and then run the code, it successfully gets a response with the article text. But then I see the second problem:

The extracted content omits the first paragraph of the article.

Support for double-byte character languages?

I've tried to use the Goose API to parse Chinese, Japanese, and Korean web pages, and none of them work. Is there any plan to support this?

Add main class to run goose from the command line

It would be nice to have a command-line runnable version of Goose for doing extractions on the fly, with output that can be piped to other Unix commands, such as

goose http://site.com/article1 > article1.txt

suggested by John (https://github.com/b79)

Bad image returned from CNN page

http://www.cnn.com/2012/11/07/politics/why-romney-lost/index.html returns http://i.cdn.turner.com/cnn/.e/img/3.0/mosaic/bttn_close.gif as its main image, probably because it's the first image inside the cnn_strycntntlft div.

This page has a good og:image tag, but that is overridden by the site-specific data.

badlink test fails when compiling with 'mvn compile package'

It appears that goose.extractContent(url) is returning a valid com.gravity.goose.Article object, however it appears that it is empty (at least article.cleanedArticleText returns nothing). The assert that fails expects that extractContent(url) will return null rather than an object.

image confidence score not a good reflection of quality

The image confidence score is purely based on the number of images retrieved, but has no relationship to the relative scores of the images. Confidence score should reflect the difference in the scores of the various images retrieved, as well as the absolute score of the image.

For example, if several images have a similar score, we are not confident which one is correct, but if 10 images have poor scores and one has a good score, we can be fairly confident.
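
One way to read the suggestion above is sketched below. This is only an illustration, not Goose's scoring code: the class name, the `maxPossible` parameter, and the formula (relative margin over the runner-up multiplied by the top score's absolute quality) are all assumptions chosen to match the two properties the issue asks for.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Goose.
class ImageConfidence {
    /**
     * Returns a value in [0, 1]: the top score's relative margin over the
     * runner-up, multiplied by the top score's share of maxPossible.
     */
    static double confidence(double[] scores, double maxPossible) {
        if (scores.length == 0 || maxPossible <= 0) return 0.0;
        double[] sorted = scores.clone();
        Arrays.sort(sorted); // ascending, so the best score is last
        double top = sorted[sorted.length - 1];
        if (top <= 0) return 0.0;
        double second = sorted.length > 1 ? Math.max(sorted[sorted.length - 2], 0.0) : 0.0;
        double margin = (top - second) / top;              // near 0 when scores are similar
        double quality = Math.min(top / maxPossible, 1.0); // absolute score of the winner
        return margin * quality;
    }

    public static void main(String[] args) {
        // Several similar scores: low confidence.
        System.out.println(confidence(new double[] {8.0, 7.9, 7.8}, 10.0));
        // One good score among poor ones: high confidence.
        System.out.println(confidence(new double[] {9.0, 1.0, 1.0, 0.5}, 10.0));
    }
}
```

With a formula of this shape the number of images no longer matters by itself; only the gap between the best candidate and the rest, and how good the best candidate is, move the result.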

Content Extractor not working in this URL

http://www.accountancyage.com/aa/analysis/2111729/institutes-ifrs-bang
DefaultDocumentCleaner.cleanBadTags detects "content_print" as a naughty ID and removes it. The actual news is under this div tag.

[id~=(" + regExRemoveNodes + ")] matches "print" in id="content_print" and removes the element.
Would it be better to use an exact word-boundary match instead of a fuzzy match?
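
The difference between the two matching styles can be shown with plain java.util.regex. This is a minimal sketch, not the cleaner's actual pattern: the class and method names are made up, and "print" stands in for the much longer regExRemoveNodes alternation.

```java
import java.util.regex.Pattern;

// Hypothetical illustration of fuzzy vs. word-boundary matching on an id value.
class NaughtyIdMatcher {
    // Fuzzy: "print" anywhere inside the id counts as a hit.
    private static final Pattern FUZZY = Pattern.compile("print");
    // Word boundary: "print" must not be glued to letters, digits, or underscores.
    private static final Pattern BOUNDARY = Pattern.compile("\\bprint\\b");

    static boolean fuzzyMatch(String id) {
        return FUZZY.matcher(id).find();
    }

    static boolean boundaryMatch(String id) {
        return BOUNDARY.matcher(id).find();
    }

    public static void main(String[] args) {
        System.out.println(fuzzyMatch("content_print"));    // true: the div would be removed
        System.out.println(boundaryMatch("content_print")); // false: '_' is a word character
        System.out.println(boundaryMatch("print"));         // true
    }
}
```

Note that `\b` treats a hyphen as a boundary, so an id such as "content-print" would still match; a stricter rule would have to compare whole id values or tokens.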

DocumentCleaner can change order of content.

When parsing some articles, some paragraphs can appear in the wrong order after the DocumentCleaner has gone through them.

The problem is in the method convertDivsToParagraphs in src/main/scala/com/gravity/goose/cleaners/DocumentCleaner.scala.

When the child elements of a div are parsed for text nodes, all the text nodes are appended to a single text string, all the text nodes are later removed, and a new paragraph node is added as the first child of the div. This can mess up the order if the div contains multiple other elements such as paragraphs or spans. The problem is that the text nodes are collected into a single paragraph that is always added as the first child. Some text nodes can be interwoven between other paragraph nodes, which has the effect that text that originally appeared after a paragraph now appears in front of it.

You can see the effect when cleaning this page: http://danielspicar.github.com/goose-bug.html

Note how TextNode 1 and TextNode 2 are merged into one paragraph at the beginning of the div container. This has the effect that TextNode 2 now appears before Paragraph 1 (if you don't see TextNode 2, it's now part of the text in the first paragraph).

In the original page TextNode 2 followed after Paragraph 1.
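
One possible way to keep the order is to wrap each run of adjacent text nodes where it stands instead of collecting everything into a single leading paragraph. The sketch below is not the Goose implementation and does not use jsoup; `Child` is a hypothetical stand-in for a DOM child node, used only to show the ordering logic.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of a div's children; not Goose or jsoup code.
class OrderPreservingWrap {
    static final class Child {
        final boolean isText;  // true for a bare text node, false for an element
        final String content;  // the text, or the element's markup
        Child(boolean isText, String content) {
            this.isText = isText;
            this.content = content;
        }
    }

    /** Wraps each run of consecutive text nodes in its own paragraph, in place. */
    static List<String> wrapTextRuns(List<Child> children) {
        List<String> out = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        for (Child c : children) {
            if (c.isText) {
                String text = c.content.trim();
                if (text.isEmpty()) continue; // ignore whitespace-only text nodes
                if (run.length() > 0) run.append(' ');
                run.append(text);
            } else {
                flush(run, out);    // close the pending text run before the element
                out.add(c.content); // the element keeps its original position
            }
        }
        flush(run, out);
        return out;
    }

    private static void flush(StringBuilder run, List<String> out) {
        if (run.length() > 0) {
            out.add("<p>" + run + "</p>");
            run.setLength(0);
        }
    }
}
```

For the test page described above, the input order TextNode 1, Paragraph 1, TextNode 2 would come out as three paragraphs in that same order.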

NPE thrown from DocumentCleaner

Running goose against the url "http://www.apo-rot.de/indexdetails.html?_filterartnr=1997030&partnerid=preisfuerst"
we receive the following NPE:

java.lang.NullPointerException
at com.gravity.goose.cleaners.DocumentCleaner$class.cleanBadTags(DocumentCleaner.scala:132)
at com.gravity.goose.cleaners.DocumentCleaner$class.clean(DocumentCleaner.scala:51)
at com.gravity.goose.cleaners.StandardDocumentCleaner.clean(StandardDocumentCleaner.scala:26)
at com.gravity.goose.Crawler$$anonfun$crawl$1$$anonfun$apply$1$$anonfun$apply$2.apply(Crawler.scala:71)
at com.gravity.goose.Crawler$$anonfun$crawl$1$$anonfun$apply$1$$anonfun$apply$2.apply(Crawler.scala:48)

How can I use it in Java?

Hello,
How can I use this in java since Android does not support Scala?
thanks.

Ability to use goose with a network proxy

At a glance it seems that Goose is ignoring the network proxy configuration set via System properties (http.proxyHost and http.proxyPort).

Is there a way to configure that? If so, can someone please point me to the documentation?
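
For reference, the two properties can be read explicitly with the JDK alone. The sketch below only shows that step; the class name is made up, and how the resulting proxy would be handed to Goose's HTTP client is left out because no such hook is described here.

```java
import java.net.InetSocketAddress;
import java.net.Proxy;

// Hypothetical helper: turns the standard JVM proxy properties into a java.net.Proxy.
class ProxyFromSystemProperties {
    static Proxy resolve() {
        String host = System.getProperty("http.proxyHost");
        if (host == null || host.isEmpty()) {
            return Proxy.NO_PROXY; // nothing configured
        }
        // The JDK documents 80 as the default when http.proxyPort is not set.
        int port = Integer.parseInt(System.getProperty("http.proxyPort", "80"));
        // createUnresolved avoids a DNS lookup at configuration time.
        return new Proxy(Proxy.Type.HTTP, InetSocketAddress.createUnresolved(host, port));
    }

    public static void main(String[] args) {
        // Try with: java -Dhttp.proxyHost=proxy.example.test -Dhttp.proxyPort=3128 ...
        System.out.println(resolve());
    }
}
```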

Thanks

Ability to determine String replacements for article titles

Goose needs a way to filter out janky article titles where the title may have multiple delimiters such as
Breaking News: KCAL05: This just in - some guy won a million bucks

It gets confusing where to separate the titles from the prefix. It would be nice to have a text file that you can add special cases to where you can put in the text to replace with blanks
example:

domain replace
kcal9.com Breaking News: KCAL05:
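
A rough sketch of how such a rule file could be applied is below. Nothing here exists in Goose: the class, the method names, and the one-rule-per-line format (domain, whitespace, then the text to blank out) are assumptions based on the example above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical per-domain title cleanup; not part of Goose.
class TitleReplacements {
    private final Map<String, List<String>> byDomain = new HashMap<>();

    /** Parses one rule line: the domain, whitespace, then the text to remove. */
    void addRule(String line) {
        String trimmed = line.trim();
        int split = trimmed.indexOf(' ');
        if (split < 0) return; // no replacement text on this line
        String domain = trimmed.substring(0, split).toLowerCase();
        String text = trimmed.substring(split + 1).trim();
        byDomain.computeIfAbsent(domain, d -> new ArrayList<>()).add(text);
    }

    /** Removes every registered string for the domain and trims the result. */
    String clean(String domain, String title) {
        String result = title;
        for (String text : byDomain.getOrDefault(domain.toLowerCase(), new ArrayList<>())) {
            result = result.replace(text, "");
        }
        return result.trim();
    }

    public static void main(String[] args) {
        TitleReplacements rules = new TitleReplacements();
        rules.addRule("kcal9.com Breaking News: KCAL05:");
        System.out.println(rules.clean("kcal9.com",
            "Breaking News: KCAL05: This just in - some guy won a million bucks"));
    }
}
```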

Cannot extract any content

I created a simple Scala project to test out Goose, but I can't seem to extract any content. My code is below. I will say that this is my first time using Scala, so I expect that to be related. Any help here is greatly appreciated.

import com.gravity.goose._
import org.jsoup.nodes._

object Reader {
  def main(args: Array[String]) {
    val url = args(0)
    val html = ""
    println("Retrieving URL : " + url)

    val goose = new Goose(new Configuration)
    val article = goose.extractContent(url, html)
    println(article.cleanedArticleText)
  }
}

Main image choice seems odd for this page.

http://blog.posterous.com/big-news

Akka repository

Hey,

It seems that the Akka repository has moved to this: http://repo.akka.io/releases/
I couldn't build with the repository you used in the pom file.

/Amir

Handling Redirection

Well not sure whether it is an issue or not.
I am using goose to extract the text but I also needed it to return the redirected final url. So I changed the htmlFetcher static method to do it. My question would be, does it make goose slow? I have not done any performance testing with it. Wanted to ask you guys before using the changes I have made.

I would be happy to give a patch if you guys think it would be a value add.

Another change which I made was on image fetching. You guys had the code to download the images and then search for image tags for back-up. It was slow and I needed the image more for aesthetic purpose and changed the code and added extra boolean parameter to decide whether to use the downloading code or just use image tag parsing.

Thanks,
Sharath

Chooses incorrect image on http://hosted.ap.org

Love this as a Java alternative to Readability and am currently using it in my Android app. One thing I have noticed, though, is that for every article on http://hosted.ap.org it chooses this image (http://hosted.ap.org/specials/images/ap_photo_promo.jpg) as the main image.

Upload artifacts to a Maven repository

Specifying Goose as a Maven dependency in a project fails to fetch the artifacts. Please upload the artifacts to Maven Central, or publish the Maven repository URL in the README if it is something other than Maven Central.

Code snippets omitted

I noticed the code snippets are omitted from the following articles:

http://hacks.mozilla.org/2010/01/offline-web-applications/
http://www.nczonline.net/blog/2009/01/13/speed-up-your-javascript-part-1/

Dispatch library

Hi,

I just wondered if you have seen the Dispatch library and would consider it for your HTTP access instead of Apache HttpClient.

Best,
Amir

All formatting lost from a page

http://timesofindia.indiatimes.com/india/Govt-uses-special-powers-to-slash-cancer-drug-price-by-97/articleshow/12240143.cms

Content not extracted from "article" tag

I've noticed that Goose doesn't work for an Italian site because that site uses the "article" tag.
An example url is the following:
http://www.repubblica.it/economia/2012/05/12/news/giovani_anziani_asili_nido_e_soldi_per_il_sud_ecco_il_progetto_del_governo_per_l_equit-34962952/
Note that I work with Italian stopwords.

I solved it (for this site) by adding
nodesToCheck.addAll(doc.getElementsByTag("article"))
here: https://github.com/jiminoc/goose/blob/master/src/main/scala/com/gravity/goose/extractors/ContentExtractor.scala#L387

and changing:
if (e.tagName != "p" && e.tagName != "article")
here
https://github.com/jiminoc/goose/blob/master/src/main/scala/com/gravity/goose/extractors/ContentExtractor.scala#L509

Can this solution work for every site, or will it mess something up?

Thank you

Find page 2

It would be a great feature to be able to find the next page and/or the single page version automatically.

Cannot handle web pages written in Traditional Chinese

Example url: http://violetiva.pixnet.net/blog/post/28652839-%5B%E9%A3%9F%E8%A8%98%5D-%E5%8F%B0%E5%8C%97%E2%80%A7%E6%9D%8F%E5%AD%90%E8%B1%AC%E6%8E%92-

The content is successfully extracted using the viewtext.org API,
e.g., http://viewtext.org/article?url=http://violetiva.pixnet.net/blog/post/28652839-%5B%E9%A3%9F%E8%A8%98%5D-%E5%8F%B0%E5%8C%97%E2%80%A7%E6%9D%8F%E5%AD%90%E8%B1%AC%E6%8E%92-

but I get nothing using the Goose API. Is there something wrong?

Failing to extract list items

Fails to get the list items near the start of:

http://www.mikealrogers.com/posts/the-way-of-node.html

No support for articles in Chinese?

Hi, I was just trying goose out on some Chinese language news sites, and it doesn't appear to be able to pull any article text. Examples:

http://news.xhby.net/system/2011/10/03/011788372.shtml
http://news.iqilu.com/shehui/huahuashijie/20111003/565892.html

Will your algorithm work on Chinese with a minor fix or does it need to be a latin language?

Thanks,
Joel

Possible Memory Leak

Hi,
I'm trying to test Goose with 11M urls and I always run out of memory because of jsoup after a few pages. In the stack I can always see that jsoup accumulates a lot of data. Is there any way to clean the extractor after a page has been processed?

Upload project to a maven repo please

Upgrading to HttpComponents 4.1

Re : https://github.com/jiminoc/goose/blob/master/src/main/java/com/jimplush/goose/network/HtmlFetcher.java#L135
Can you tell me the significance of the number 15728640 bytes or 15 MB ?

Is there any reason that the header Accept-Encoding (gzip or deflate) is not used ?

Are there any other considerations I should keep in mind while I go ahead with the upgrade ?

Thanks.

Order of phrases completely reversed, 5-4-3-2-1, example

I tried this page:

http://www.telegraph.co.uk/foodanddrink/foodanddrinknews/8808120/Worlds-hottest-chilli-contest-leaves-two-in-hospital.html

And got this result (the numbers between square brackets show the original order of the phrases) :

[5] Participants were required to sign a legal disclaimer prior to taking part in the competition, and two members of the British Red Cross were on hand, but they could not cope with the nature of the injuries sustained.

[4] Paramedics attended the event on Saturday - the busiest day of the week for the ambulance service - costing the service several hundred pounds.

[3] Today, the Scottish Ambulance Service said it wanted the restaurant to review the way the event was managed.

[2] One participant, Curie Kim was so ill after sampling the "Kismot Killer" that she had to be taken by ambulance to the Edinburgh Royal Infirmary twice in a matter of hours.

[1] Emergency services were called to Kismot Restaurant's curry-eating challenge, on St Leonards Place, Edinburgh, after competitors started writhing on the floor in agony, vomiting and fainting during the contest.

[6] Curry house owner Abdul Ali admitted that he would have to "tone down" the contest, but said the challenge had raised hundreds of pounds for charity CHAS.

[7] He added that half of the 20 people who took part in the challenge had dropped out after witnessing the first 10 diners vomiting, collapsing, sweating and panting.

[8] Previously the restaurant's Kismot Killer dish has caused diners to suffer nose bleeds and one elderly man had to go to hospital.
...

Problems with ImageFetchException

I have a pipeline that uses Goose to download text from news articles. Sometimes a bad article stops the pipeline because Goose can't download its images and raises an exception for every image every 2 minutes, as in the following log example.

May 7, 2012 3:12:13 AM com.gravity.goose.utils.Logging$class info
INFO: com.gravity.goose.network.ImageFetchException: com.gravity.goose.network.ImageFetchException ==> Failed to fetch image file from imgSrc: http://rss.feedsportal.com/flickr/foto/2011_01/5350399008_96bfb1d665_s.jpg
May 7, 2012 3:14:13 AM com.gravity.goose.utils.Logging$class info
INFO: com.gravity.goose.network.ImageFetchException: com.gravity.goose.network.ImageFetchException ==> Failed to fetch image file from imgSrc: http://rss.feedsportal.com/flickr/foto/2011_09/6165889887_57c08896b9_s.jpg
May 7, 2012 3:16:13 AM com.gravity.goose.utils.Logging$class info
INFO: com.gravity.goose.network.ImageFetchException: com.gravity.goose.network.ImageFetchException ==> Failed to fetch image file from imgSrc: http://rss.feedsportal.com/flickr/foto/2010_10/4935053263_ef10f461c5_s.jpg

I wish to catch the exception and stop processing that news article, but I can't catch the exception... can someone help me?

MinBytesForImages duplicated

Value exists in Configuration, but also exists in UpgradedImageIExtractor, which references its local version

Need the ability to set a user agent

Currently Goose defaults to a Mozilla user agent; we need the ability to provide alternative user agents.

*note working on this today.

Exception handling problems in HtmlFetcher

Several problems with this class:

Line 192 catches an exception from httpget.abort() and swallows it
Several exceptions after line 152 are logged at trace level, not even info or warn

These issues are making it difficult to diagnose problems with my library.

Move all logging to debug level instead of info level

Info level should be for administrators/user information, not debugging information. It will help clean up the logs

Selecting an element with a regular expression

I want to select elements whose class starts with a specified string. For example:

[img class="test-124" src="..." ]
[img class="test-child-121" src="..." ]
[img class="test-123" src="..." ]
[img class="test-122" src="..." ]

I want to select all of the above elements.

How can I do that?
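
If the elements come from jsoup, its selector syntax has an attribute-prefix form, so something like img[class^=test-] may be a starting point; note that it tests the whole attribute value, not each class name. The plain-Java sketch below checks each class token instead. It is only an illustration, and the class and method names are made up.

```java
import java.util.regex.Pattern;

// Hypothetical helper: tests whether any class name in a class attribute starts with a prefix.
class ClassPrefixMatcher {
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    static boolean hasClassWithPrefix(String classAttr, String prefix) {
        if (classAttr == null) return false;
        // A class attribute may hold several space-separated class names.
        for (String token : WHITESPACE.split(classAttr.trim())) {
            if (!token.isEmpty() && token.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(hasClassWithPrefix("test-124", "test-"));       // true
        System.out.println(hasClassWithPrefix("test-child-121", "test-")); // true
        System.out.println(hasClassWithPrefix("mytest-1", "test-"));       // false
    }
}
```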

regards,
Jaya

Fix is to ensure that a new ImageExtractor class is created on each call to extractContent(). I have put a patch together below which fixes the problem.

index 3a287c1..f16d9ba 100644
--- a/src/main/java/com/jimplush/goose/ContentExtractor.java
+++ b/src/main/java/com/jimplush/goose/ContentExtractor.java
@@ -64,8 +64,6 @@ public class ContentExtractor {
   // once we have our topNode then we want to format that guy for output to the user
   private OutputFormatter outputFormatter;

-  private ImageExtractor imageExtractor;
-

   /**
    * you can optionally pass in a configuration object here that will allow you to override the settings
@@ -121,7 +119,7 @@ public class ContentExtractor {

       if (config.isEnableImageFetching()) {
         HttpClient httpClient = HtmlFetcher.getHttpClient();
-        imageExtractor = getImageExtractor(httpClient, urlToCrawl);
+        ImageExtractor imageExtractor = getImageExtractor(httpClient, urlToCrawl);
         article.setTopImage(imageExtractor.getBestImage(doc, article.getTopNode()));

       }
@@ -170,12 +168,8 @@ public class ContentExtractor {

   private ImageExtractor getImageExtractor(HttpClient httpClient, String urlToCrawl) {

-    if (imageExtractor == null) {
       BestImageGuesser bestImageGuesser = new BestImageGuesser(this.config, httpClient, urlToCrawl);
