goose's Issues

Bad Java examples of content extraction

Hi,

I'm trying to use Goose in a Java environment and am having some trouble. I used the example found on the front page (goose.extractContent...) but found that it creates a thread every time I call this method. To remove these threads I have to stop the global Goose.crawlingActor() object, which of course prevents access from any other thread; this is not acceptable. I found an example in the test directory, StaticHTMLTest.java, which is exactly what I want, but it's all commented out. Trying to instantiate ContentExtractor as shown in this example produces the error "com.gravity.goose.extractors.ContentExtractor is abstract; cannot be instantiated". How can I just run the content extractor without starting threads (or with only one extra thread)? Will more valid examples be included in the source tree soon?

Thanks

nytimes.com getting bad redir / canonical link?

I haven't made the jump to Scala, I'm on basically the last Java version, so feel free to tell me to do my own debugging (I have some customizations that would need porting), but I'm seeing an issue with nytimes urls. For example, from today's front page:

http://www.nytimes.com/2011/09/20/arts/design/preserving-the-american-folk-art-museums-place-in-new-york.html?ref=arts

Is somehow being redirected behind the scenes to www10.nytimes.com which always lands me on the login screen if I use that canonical url. It's correctly extracting the content and top image, etc, but it's got the wrong domain and canonical URL, which makes me think it's something nyt changes in their paywall bouncing redirs?

Is it just me? Maybe I'll just rewrite the canonical url and forget about it for now.

Great work as always!
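
A rough sketch of the "rewrite the canonical url" workaround mentioned above, using only the JDK. The class and method names are hypothetical and not part of Goose, and the rule that any "www<digits>." host is a mirror of the plain "www." host is an assumption based on the single www10.nytimes.com case in this report.

```java
import java.net.URI;

// Hypothetical helper, not part of Goose.
final class CanonicalUrlFixer {

    // Rewrites hosts such as www10.nytimes.com to www.nytimes.com and leaves
    // everything else (path, query, fragment) exactly as it was.
    static String normalizeHost(String url) {
        URI uri = URI.create(url);
        String host = uri.getHost();
        if (host == null || uri.getScheme() == null) {
            return url; // relative or opaque URL: leave untouched
        }
        // Assumption: a numbered "wwwNN." prefix is a mirror of "www."
        String fixedHost = host.replaceFirst("^www\\d+\\.", "www.");

        StringBuilder sb = new StringBuilder();
        sb.append(uri.getScheme()).append("://").append(fixedHost);
        if (uri.getPort() != -1) {
            sb.append(':').append(uri.getPort());
        }
        // Raw components are used so existing percent-encoding is preserved.
        // Note: user-info (user:password@) is intentionally dropped.
        if (uri.getRawPath() != null) {
            sb.append(uri.getRawPath());
        }
        if (uri.getRawQuery() != null) {
            sb.append('?').append(uri.getRawQuery());
        }
        if (uri.getRawFragment() != null) {
            sb.append('#').append(uri.getRawFragment());
        }
        return sb.toString();
    }
}
```

This only papers over the symptom; it does not explain why the redirect to the numbered host happens in the first place.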

ImageUtil appears not to release http connections

After a while my log fills up with these messages and the process just hangs. This didn't appear to be an issue in the previous version:
INFO com.gravity.goose.images.ImageUtils - org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection
....

at com.gravity.goose.images.ImageUtils$.fetchEntity(ImageUtils.scala:267)
at com.gravity.goose.images.ImageUtils$.storeImageToLocalFile(ImageUtils.scala:172)
at com.gravity.goose.images.UpgradedImageIExtractor.getLocallyStoredImage(UpgradedImageIExtractor.scala:465)
at com.gravity.goose.images.UpgradedImageIExtractor$$anonfun$com$gravity$goose$images$UpgradedImageIExtractor$$findImagesThatPassByteSizeTest$1.apply(UpgradedImageIExtractor.scala:348)
at com.gravity.goose.images.UpgradedImageIExtractor$$anonfun$com$gravity$goose$images$UpgradedImageIExtractor$$findImagesThatPassByteSizeTest$1.apply(UpgradedImageIExtractor.scala:341)
at scala.collection.Iterator$class.foreach(Iterator.scala:652)

However, I was previously calling ImageExtractor directly; now I'm using com.gravity.goose.Goose.extractContent, and it appears to be calling UpgradedImageIExtractor:

com.gravity.goose.images.UpgradedImageIExtractor.com$gravity$goose$images$UpgradedImageIExtractor$$findImagesThatPassByteSizeTest(UpgradedImageIExtractor.scala:341)

Jar does not work in Android - tmp folder not found

I tested this jar for goose:
https://www.dropbox.com/s/h0tu7bhl834ylnz/goose.jar from this thread:
https://github.com/jiminoc/goose/issues/59 by qnex.
It works in a normal java project but in Android I get this error:

10-05 12:45:05.858: E/AndroidRuntime(1825): FATAL EXCEPTION: main
10-05 12:45:05.858: E/AndroidRuntime(1825): java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
10-05 12:45:05.858: E/AndroidRuntime(1825):     at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:638)
10-05 12:45:05.858: E/AndroidRuntime(1825):     at dalvik.system.NativeStart.main(Native Method)
10-05 12:45:05.858: E/AndroidRuntime(1825): Caused by: java.lang.reflect.InvocationTargetException
10-05 12:45:05.858: E/AndroidRuntime(1825):     at java.lang.reflect.Method.invokeNative(Native Method)
10-05 12:45:05.858: E/AndroidRuntime(1825):     at java.lang.reflect.Method.invoke(Method.java:507)
10-05 12:45:05.858: E/AndroidRuntime(1825):     at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:880)
10-05 12:45:05.858: E/AndroidRuntime(1825):     ... 2 more
10-05 12:45:05.858: E/AndroidRuntime(1825): Caused by: java.lang.Exception: /tmp/goose directory does not seem to exist, you need to set this for image processing downloads
10-05 12:45:05.858: E/AndroidRuntime(1825):     at com.gravity.goose.Goose.initializeEnvironment(Goose.scala:68)
10-05 12:45:05.858: E/AndroidRuntime(1825):     at com.gravity.goose.Goose.<init>(Goose.scala:31)
10-05 12:45:05.858: E/AndroidRuntime(1825):     at com.example.goose.MainActivity$1.onClick(MainActivity.java:27)
10-05 12:45:05.858: E/AndroidRuntime(1825):     at android.view.View.performClick(View.java:2485)
10-05 12:45:05.858: E/AndroidRuntime(1825):     at android.view.View$PerformClick.run(View.java:9081)
10-05 12:45:05.858: E/AndroidRuntime(1825):     at android.os.Handler.handleCallback(Handler.java:587)
10-05 12:45:05.858: E/AndroidRuntime(1825):     at android.os.Handler.dispatchMessage(Handler.java:92)
10-05 12:45:05.858: E/AndroidRuntime(1825):     at android.os.Looper.loop(Looper.java:130)
10-05 12:45:05.858: E/AndroidRuntime(1825):     at android.app.ActivityThread.main(ActivityThread.java:3770)
10-05 12:45:05.858: E/AndroidRuntime(1825):     ... 5 more

Can someone help me with that? :)

How to obtain the binary data of topImage ?

Hi,

I modified TalkToMeGoose.scala to output more fields and plan to use it from the command line. However, when I do:

println(article.topImage)

I get

com.gravity.goose.images.Image@36db492

Which is not a file path, url or binary data. I looked in /tmp but couldn't find the image file in there.

Can you please show me how to get the image file?

Thanks

Goose in php

Hello,

What is the best way to put the goose in a webservice running over php?

Thanks,
Rui Gaspar

nytimes.com extraction problems

I see that nytimes.com is on your list of sites that still need unit testing.

I've successfully installed Goose and have run the unit tests without trouble.

Here are two issues I found while trying to extract text from nytimes.com:

  1. When I run the code as is on any nytimes article (try: http://www.nytimes.com/2010/12/20/opinion/20cohen.html), I get the following output:
INFO [main] (HtmlFetcher.java:203) - Initializing HttpClient
INFO [main] (DefaultRequestDirector.java:491) - I/O exception (java.net.SocketException) caught when processing request: Connection reset
INFO [main] (DefaultRequestDirector.java:498) - Retrying request
INFO [main] (DefaultRequestDirector.java:491) - I/O exception (java.net.SocketException) caught when processing request: Connection reset
INFO [main] (DefaultRequestDirector.java:498) - Retrying request
INFO [main] (DefaultRequestDirector.java:491) - I/O exception (java.net.SocketException) caught when processing request: Connection reset
INFO [main] (DefaultRequestDirector.java:498) - Retrying request
WARN [main] (HtmlFetcher.java:132) - Connection reset
INFO [main] (HtmlFetcher.java:159) - starting...
INFO [main] (HtmlFetcher.java:161) - HTMLRESULT is empty or null

I'm not a Java guy, but when I tried to do something similar in Python, I discovered that setting the User-agent header causes the response from nytimes.com to send a 301 status and never reaches a 200 status. If I comment out line 244 in HtmlFetcher.java in Goose in order to not set the User-agent header and run the code, it successfully gets a response with article text. But then I see the 2nd problem:

  2. The extracted content omits the first paragraph of the article.

badlink test fails when compiling with 'mvn compile package'

It appears that goose.extractContent(url) is returning a valid com.gravity.goose.Article object, however it appears that it is empty (at least article.cleanedArticleText returns nothing). The assert that fails expects that extractContent(url) will return null rather than an object.

image confidence score not a good reflection of quality

The image confidence score is purely based on the number of images retrieved, but has no relationship to the relative scores of the images. Confidence score should reflect the difference in the scores of the various images retrieved, as well as the absolute score of the image.

For example, if several images have a similar score, we cannot be confident which one is correct, but if 10 images have poor scores and one has a good score, we can be fairly confident.
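
One way the proposal above could look, as a minimal sketch rather than Goose's actual scoring code. The class name, the assumption that scores are non-negative, and the `maxScore` parameter (the best score an image could possibly reach) are all hypothetical. The result multiplies the relative lead of the best image over the runner-up by the best image's absolute quality.

```java
import java.util.Arrays;

// Hypothetical sketch, not part of Goose.
final class ImageConfidence {

    // Returns a value in [0, 1]: high only when the best image both scores well
    // in absolute terms and clearly beats the second-best image.
    static double confidence(double[] scores, double maxScore) {
        if (scores.length == 0 || maxScore <= 0) {
            return 0.0;
        }
        double[] sorted = scores.clone();
        Arrays.sort(sorted); // ascending, so the best score is last
        double best = sorted[sorted.length - 1];
        if (best <= 0) {
            return 0.0;
        }
        double runnerUp = sorted.length > 1 ? Math.max(0.0, sorted[sorted.length - 2]) : 0.0;
        double margin = (best - runnerUp) / best;          // relative lead over the runner-up
        double absolute = Math.min(1.0, best / maxScore);  // absolute quality of the winner
        return margin * absolute;
    }
}
```

With this formula, several images with similar scores give a confidence near zero, while one good image among many poor ones gives a confidence close to the winner's absolute quality.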

DocumentCleaner can change order of content.

When parsing some articles, some paragraphs can appear in the wrong order after the DocumentCleaner has gone through them.

The cause is in src/main/scala/com/gravity/goose/cleaners/DocumentCleaner.scala, in the method convertDivsToParagraphs.

When the child elements of a div are parsed for text nodes, all the text nodes are appended to a single text string. The text nodes are later removed, and a new paragraph node is added as the first child of the div. This can mess up the order if the div contains other elements such as paragraphs or spans. The problem is that the text nodes are collected into a single paragraph that is always added as the first child. Text nodes can be interleaved with other paragraph nodes, so text that originally appeared after a paragraph now appears in front of it.

You can see the effect when cleaning this page: http://danielspicar.github.com/goose-bug.html

Note how TextNode 1 and TextNode 2 are merged into one paragraph at the beginning of the div container. This has the effect that TextNode 2 now appears before Paragraph 1 (if you don't see TextNode 2, it's now part of the text in the first paragraph).

In the original page TextNode 2 followed after Paragraph 1.
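
A minimal sketch of an order-preserving alternative, written against a simplified stand-in for the DOM rather than jsoup or Goose's real convertDivsToParagraphs. The `Child` type and method names are hypothetical. Each run of adjacent text nodes becomes its own paragraph at the position where the run occurred, instead of one paragraph at the front of the div.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not Goose code.
final class OrderPreservingWrap {

    // Simplified stand-in for a DOM child: either a bare text node or an element.
    static final class Child {
        final boolean isText;
        final String content;

        Child(boolean isText, String content) {
            this.isText = isText;
            this.content = content;
        }
    }

    // Wraps each run of consecutive text nodes in <p>...</p> in place,
    // leaving element children where they were.
    static List<String> wrapTextRuns(List<Child> children) {
        List<String> out = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        for (Child c : children) {
            if (c.isText) {
                String text = c.content.trim();
                if (text.isEmpty()) {
                    continue; // ignore whitespace-only text nodes
                }
                if (run.length() > 0) {
                    run.append(' ');
                }
                run.append(text);
            } else {
                flush(run, out); // close the current text run before the element
                out.add(c.content);
            }
        }
        flush(run, out);
        return out;
    }

    private static void flush(StringBuilder run, List<String> out) {
        if (run.length() > 0) {
            out.add("<p>" + run + "</p>");
            run.setLength(0);
        }
    }
}
```

Applied to the structure on the test page, TextNode 2 would stay between Paragraph 1 and whatever follows it, rather than being merged with TextNode 1.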

NPE thrown from DocumentCleaner

Running goose against the url "http://www.apo-rot.de/indexdetails.html?_filterartnr=1997030&partnerid=preisfuerst"
we receive the following NPE:

java.lang.NullPointerException
at com.gravity.goose.cleaners.DocumentCleaner$class.cleanBadTags(DocumentCleaner.scala:132)
at com.gravity.goose.cleaners.DocumentCleaner$class.clean(DocumentCleaner.scala:51)
at com.gravity.goose.cleaners.StandardDocumentCleaner.clean(StandardDocumentCleaner.scala:26)
at com.gravity.goose.Crawler$$anonfun$crawl$1$$anonfun$apply$1$$anonfun$apply$2.apply(Crawler.scala:71)
at com.gravity.goose.Crawler$$anonfun$crawl$1$$anonfun$apply$1$$anonfun$apply$2.apply(Crawler.scala:48)

Ability to use goose with a network proxy

At a glance it seems that Goose ignores the network proxy configured via System properties (http.proxyHost and http.proxyPort).

Is there a way to configure that? If so, can someone please point me to the documentation?

Thanks
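
For reference, a small JDK-only sketch of what honoring those two properties involves. The class name is hypothetical, and how the resulting proxy would be handed to Goose's HTTP client is not shown, since that depends on Goose internals not described in this report. The default port of 80 follows the JDK's documented default for http.proxyPort.

```java
import java.net.InetSocketAddress;
import java.net.Proxy;

// Hypothetical helper, not part of Goose.
final class ProxyFromSystemProperties {

    // Builds a java.net.Proxy from http.proxyHost / http.proxyPort,
    // or returns Proxy.NO_PROXY when no proxy host is configured.
    static Proxy resolve() {
        String host = System.getProperty("http.proxyHost");
        if (host == null || host.isEmpty()) {
            return Proxy.NO_PROXY;
        }
        int port = Integer.parseInt(System.getProperty("http.proxyPort", "80"));
        // createUnresolved avoids a DNS lookup at configuration time
        return new Proxy(Proxy.Type.HTTP, InetSocketAddress.createUnresolved(host, port));
    }
}
```

A plain `java.net.URL` connection can use the result through `url.openConnection(proxy)`.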

Ability to determine String replacements for article titles

Goose needs a way to filter out janky article titles where the title may have multiple delimiters such as
Breaking News: KCAL05: This just in - some guy won a million bucks

It gets confusing where to separate the title from the prefix. It would be nice to have a text file for special cases, where each entry gives a domain and the text to strip from titles on that domain.
example:

domain replace
kcal9.com Breaking News: KCAL05:
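
A minimal sketch of how such a replacement file could be applied, assuming each line is "<domain> <text to remove>" as in the example above. The class and method names are hypothetical and not part of Goose; the caller is expected to skip the header line.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not part of Goose.
final class TitleReplacements {
    private final Map<String, List<String>> byDomain = new HashMap<>();

    // Parses one line of the form "<domain> <text to remove>".
    void addLine(String line) {
        int space = line.indexOf(' ');
        if (space <= 0) {
            return; // no domain/text separator: ignore the line
        }
        String domain = line.substring(0, space).toLowerCase();
        String text = line.substring(space + 1).trim();
        if (!text.isEmpty()) {
            byDomain.computeIfAbsent(domain, d -> new ArrayList<>()).add(text);
        }
    }

    // Removes every configured string for the domain from the title.
    String clean(String domain, String title) {
        String result = title;
        List<String> texts = byDomain.getOrDefault(domain.toLowerCase(), Collections.emptyList());
        for (String text : texts) {
            result = result.replace(text, "");
        }
        return result.trim();
    }
}
```

Titles from domains with no entry pass through unchanged apart from trimming.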

Cannot extract any content

I created a simple Scala project to test out Goose, but I can't seem to extract any content. My code is below. This is my first time using Scala, so I expect the problem is related to that. Any help here is greatly appreciated.

import com.gravity.goose._
import org.jsoup.nodes._

object Reader {
  def main(args: Array[String]) {
    val url = args(0)
    val html = ""
    println("Retrieving URL : " + url)

    val goose = new Goose(new Configuration)
    val article = goose.extractContent(url, html)
    println(article.cleanedArticleText)
  }
}

Handling Redirection

Well not sure whether it is an issue or not.
I am using goose to extract the text but I also needed it to return the redirected final url. So I changed the htmlFetcher static method to do it. My question would be, does it make goose slow? I have not done any performance testing with it. Wanted to ask you guys before using the changes I have made.

I would be happy to give a patch if you guys think it would be a value add.

Another change I made was to image fetching. You had code that downloads the images and falls back to searching for image tags. It was slow, and I needed the image mostly for aesthetic purposes, so I added an extra boolean parameter that decides whether to use the downloading code or only the image tag parsing.

Thanks,
Sharath

Upload artifacts to a Maven repository

Specifying Goose as a Maven dependency in a project fails to fetch the artifacts. Please upload the artifacts to Maven Central, or publish the Maven repository URL in the README if it is something other than Maven Central.

Dispatch library

Hi,

I just wondered if you have seen the Dispatch library and would consider it for your HTTP access instead of Apache HttpClient.

Best,
Amir

Content not extracted from "article" tag

I've noticed that Goose doesn't work for an Italian site because that site uses the "article" tag.
An example url is the following:
http://www.repubblica.it/economia/2012/05/12/news/giovani_anziani_asili_nido_e_soldi_per_il_sud_ecco_il_progetto_del_governo_per_l_equit-34962952/
Note that I work with Italian stopwords.

I solved this (for this site) by adding
nodesToCheck.addAll(doc.getElementsByTag("article"))
here: https://github.com/jiminoc/goose/blob/master/src/main/scala/com/gravity/goose/extractors/ContentExtractor.scala#L387

and changing:
if (e.tagName != "p" && e.tagName != "article")
here
https://github.com/jiminoc/goose/blob/master/src/main/scala/com/gravity/goose/extractors/ContentExtractor.scala#L509

Will this solution work for every site, or could it break something?

Thank you

Find page 2

It would be a great feature to be able to find the next page and/or the single page version automatically.

Cannot deal with web pages whose language is Traditional Chinese

Possible Memory Leak

Hi,
I'm trying to test Goose with 11M URLs and I always run out of memory because of jsoup after a few pages. In the stack I can always see that jsoup accumulates a lot of data. Is there any way to clean up the extractor after a page has been processed?

Order of phrases completely reversed, 5-4-3-2-1, example

I tried this page:

http://www.telegraph.co.uk/foodanddrink/foodanddrinknews/8808120/Worlds-hottest-chilli-contest-leaves-two-in-hospital.html

And got this result (the numbers between square brackets show the original order of the phrases) :

[5] Participants were required to sign a legal disclaimer prior to taking part in the competition, and two members of the British Red Cross were on hand, but they could not cope with the nature of the injuries sustained.

[4] Paramedics attended the event on Saturday - the busiest day of the week for the ambulance service - costing the service several hundred pounds.

[3] Today, the Scottish Ambulance Service said it wanted the restaurant to review the way the event was managed.

[2] One participant, Curie Kim was so ill after sampling the "Kismot Killer" that she had to be taken by ambulance to the Edinburgh Royal Infirmary twice in a matter of hours.

[1] Emergency services were called to Kismot Restaurant's curry-eating challenge, on St Leonards Place, Edinburgh, after competitors started writhing on the floor in agony, vomiting and fainting during the contest.

[6] Curry house owner Abdul Ali admitted that he would have to "tone down" the contest, but said the challenge had raised hundreds of pounds for charity CHAS.

[7] He added that half of the 20 people who took part in the challenge had dropped out after witnessing the first 10 diners vomiting, collapsing, sweating and panting.

[8] Previously the restaurant's Kismot Killer dish has caused diners to suffer nose bleeds and one elderly man had to go to hospital.
...

Problems with ImageFetchException

I have a pipeline that uses Goose to download text from news articles. Sometimes a bad article stalls the pipeline because Goose can't download its images and logs an exception for each image every 2 minutes, as in the following log example.

May 7, 2012 3:12:13 AM com.gravity.goose.utils.Logging$class info
INFO: com.gravity.goose.network.ImageFetchException: com.gravity.goose.network.ImageFetchException ==> Failed to fetch image file from imgSrc: http://rss.feedsportal.com/flickr/foto/2011_01/5350399008_96bfb1d665_s.jpg
May 7, 2012 3:14:13 AM com.gravity.goose.utils.Logging$class info
INFO: com.gravity.goose.network.ImageFetchException: com.gravity.goose.network.ImageFetchException ==> Failed to fetch image file from imgSrc: http://rss.feedsportal.com/flickr/foto/2011_09/6165889887_57c08896b9_s.jpg
May 7, 2012 3:16:13 AM com.gravity.goose.utils.Logging$class info
INFO: com.gravity.goose.network.ImageFetchException: com.gravity.goose.network.ImageFetchException ==> Failed to fetch image file from imgSrc: http://rss.feedsportal.com/flickr/foto/2010_10/4935053263_ef10f461c5_s.jpg

I would like to catch the exception and stop processing that article, but I can't catch it. Can someone help me?
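
One possible workaround, sketched with the JDK only: since the exception is logged inside Goose and never reaches the caller, the whole extraction call can instead be given a time budget. The class and method names are hypothetical, and the `task` argument stands in for whatever call the pipeline makes into Goose.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of Goose.
final class BoundedExtraction {

    // Runs the task on a worker thread and returns the fallback value if the
    // task fails or does not finish within the timeout.
    static <T> T callWithTimeout(Callable<T> task, long timeoutMillis, T fallback) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupts the worker thread
            return fallback;
        } catch (ExecutionException e) {
            return fallback;     // the task itself threw an exception
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return fallback;
        } finally {
            executor.shutdownNow();
        }
    }
}
```

Cancellation relies on thread interruption, and blocking socket reads do not always respond to it, so a stuck download may keep its thread alive in the background. If images are not needed at all, the Java-era configuration appears to have an image-fetching switch (the `config.isEnableImageFetching()` check visible in the patch further down this page), which would avoid the downloads entirely.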

MinBytesForImages duplicated

The value exists in Configuration but also in UpgradedImageIExtractor, which references its own local version.

Exception handling problems in HtmlFetcher

Several problems with this class:

Line 192 catches an exception from httpget.abort() and swallows it.
Several exceptions after line 152 are logged at trace level, not even info or warn.

These issues are making it difficult to diagnose problems with my library.

Selecting an element with a regular expression

I want to select elements whose class starts with a specified string. For example:

[img class="test-124" src="..." ]
[img class="test-child-121" src="..." ]
[img class="test-123" src="..." ]
[img class="test-122" src="..." ]

I want to select all of the above elements.

How can I do that?

regards,
Jaya
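
A JDK-only sketch of the matching rule being asked for, operating on class attribute values that have already been read from the elements. The class and method names are hypothetical. If the document is parsed with jsoup, its selector syntax also has an attribute-prefix form, so a query like `img[class^=test-]` may cover the simple case; note that it matches the start of the whole attribute value rather than each class name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Goose or jsoup.
final class ClassPrefixFilter {

    // Returns the class attribute values in which at least one
    // whitespace-separated class name starts with the given prefix.
    static List<String> matching(List<String> classAttributes, String prefix) {
        // (^|\s) anchors the prefix to the start of a class name;
        // Pattern.quote keeps characters like '-' or '.' literal.
        Pattern startsWithPrefix = Pattern.compile("(^|\\s)" + Pattern.quote(prefix));
        List<String> hits = new ArrayList<>();
        for (String attr : classAttributes) {
            if (startsWithPrefix.matcher(attr).find()) {
                hits.add(attr);
            }
        }
        return hits;
    }
}
```

For the four example elements above, calling `matching` with the prefix `test-` keeps all four class values, while a class such as `contest-1` would be rejected.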

recipes

I've noticed that on recipe sites, like foodtv.com, the ingredients section gets cut off.

akka dependency?

Why is akka listed in the Maven pom dependencies? I can run mvn test successfully without it in the pom.

Multiple calls to extractContent() result in the same best image being selected

When using goose through proose there is one static instance of ContentExtractor and there are multiple calls to extractContent().

However, when you do this the return from getTopImage() always gives the same answer - the best image from the first url passed to extractContent().

Fix is to ensure that a new ImageExtractor class is created on each call to extractContent(). I have put a patch together below which fixes the problem.

index 3a287c1..f16d9ba 100644
--- a/src/main/java/com/jimplush/goose/ContentExtractor.java
+++ b/src/main/java/com/jimplush/goose/ContentExtractor.java
@@ -64,8 +64,6 @@ public class ContentExtractor {
   // once we have our topNode then we want to format that guy for output to the user
   private OutputFormatter outputFormatter;

-  private ImageExtractor imageExtractor;
-

   /**
    * you can optionally pass in a configuration object here that will allow you to override the settings
@@ -121,7 +119,7 @@ public class ContentExtractor {

       if (config.isEnableImageFetching()) {
         HttpClient httpClient = HtmlFetcher.getHttpClient();
-        imageExtractor = getImageExtractor(httpClient, urlToCrawl);
+        ImageExtractor imageExtractor = getImageExtractor(httpClient, urlToCrawl);
         article.setTopImage(imageExtractor.getBestImage(doc, article.getTopNode()));

       }
@@ -170,12 +168,8 @@ public class ContentExtractor {

   private ImageExtractor getImageExtractor(HttpClient httpClient, String urlToCrawl) {

-    if (imageExtractor == null) {
       BestImageGuesser bestImageGuesser = new BestImageGuesser(this.config, httpClient, urlToCrawl);
