arthurn / linter

A Java project to process URLs and return interesting metadata

Java 100.00%

linter's Issues

Broken preview images for FeedZilla

FeedZilla links have broken preview images. The image URL points to the site root, http://www.feedzilla.com, because the page declares:

<meta property="og:image" content="/" />

We should probably just ignore og:image values that do not point to an actual image.
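A minimal sketch of such a filter, assuming "no actual image" means the og:image value is missing, blank, unparsable, or resolves to the bare site root. The class and method names are hypothetical and do not come from Linter's code.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper, not part of Linter's actual code.
final class OgImageFilter {
    /**
     * Resolves an og:image value against the page URL and returns the
     * absolute URL, or null when the value cannot point at an actual image
     * (missing, blank, unparsable, or resolving to the site root).
     */
    static String usableImageUrl(String pageUrl, String ogImageContent) {
        if (ogImageContent == null || ogImageContent.trim().isEmpty()) {
            return null;
        }
        try {
            URI resolved = new URI(pageUrl).resolve(ogImageContent.trim());
            String path = resolved.getPath();
            // A path of "" or "/" is the site root, as in the FeedZilla case.
            if (path == null || path.isEmpty() || path.equals("/")) {
                return null;
            }
            return resolved.toString();
        } catch (URISyntaxException | IllegalArgumentException e) {
            return null;
        }
    }
}
```

With this check, the FeedZilla value "/" is rejected and the caller can fall back to another image source.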

Linter should not follow infinite loop redirects

Check to see if the last URL is the same as the current. If it is, stop because that means we are on the verge of an infinite loop.

A good one to test (for some reason?)

http://vince-gill.bixxy.info/term/george+strait+listen

Investigate additional meta data for image hosting

Recognizing Tweets w/pics as images

Tweets that include Twitter's new built-in picture feature should be recognized as images (similar to TwitPic or yFrog).

Linter fails on relative Location redirects

When the Location header is a relative URL (e.g. Location: /#!/item/13msf), Linter fails with java.net.MalformedURLException: no protocol
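One possible fix is to resolve the Location value against the URL that produced the redirect rather than parsing it on its own. The sketch below uses java.net.URI.resolve for that; the class name, method name, and the example.com host in the usage note are illustrative assumptions.

```java
import java.net.URI;

// Hypothetical helper, not part of Linter's actual code.
final class LocationResolver {
    /**
     * Returns the absolute redirect target. Absolute Location values are
     * returned unchanged; relative ones are resolved against currentUrl.
     */
    static String resolveLocation(String currentUrl, String locationHeader) {
        return URI.create(currentUrl).resolve(locationHeader.trim()).toString();
    }
}
```

For example, resolving "/#!/item/13msf" against "http://example.com/some/page" yields "http://example.com/#!/item/13msf", which can then be opened without the "no protocol" error.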

Linter fails to detect infinite loop w/ relative URL redirects

URL: http://t.co/UxIS7vV
Trace:

2011-08-28 14:08:53,026 [main] TRACE LintedPage: Discovered redirect to http://www.dazeddigital.com/music/article/11159/1/just-tell-the-truth-matthew-herbert
2011-08-28 14:08:53,027 [main] TRACE LintedPage: Following http://www.dazeddigital.com/music/article/11159/1/just-tell-the-truth-matthew-herbert...
2011-08-28 14:08:53,501 [main] TRACE LintedPage: Relative URL redirect. Appending prefix: http://www.dazeddigital.com
2011-08-28 14:08:53,501 [main] TRACE LintedPage: Discovered redirect to http://www.dazeddigital.com/music/article/11159/1/just-tell-the-truth-matthew-herbert

Need to put the infinite loop detection code after appending the domain prefix to relative URLs.
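A sketch of that ordering: resolve the Location value first, then run the loop check on the resolved URL. Note that this example keeps a set of every URL visited, which is a broader check than the "same as the last URL" comparison described above, because it also catches cycles longer than one hop. All names here are hypothetical.

```java
import java.net.URI;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of Linter's actual code.
final class RedirectTracker {
    private final Set<String> seen = new HashSet<>();
    private String current;

    RedirectTracker(String startUrl) {
        current = startUrl;
        seen.add(startUrl);
    }

    /**
     * Resolves the Location value against the current URL, then checks for a
     * loop. Returns false (and does not advance) when the resolved URL has
     * already been visited.
     */
    boolean follow(String locationHeader) {
        String next = URI.create(current).resolve(locationHeader).toString();
        if (!seen.add(next)) {
            return false; // already visited: following it would loop
        }
        current = next;
        return true;
    }

    String current() {
        return current;
    }
}
```

Replaying the trace above, the relative redirect on dazeddigital.com resolves to the page that issued it, so follow returns false and the caller can stop.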

Scraping massive files causes OOM

If a non-HTML link is entered (e.g. http://outlawzmedia.net/Mixtapes/Outlawz-Killuminati-2K11.zip), Linter should not attempt to download & scrape the whole thing.

Perhaps checking content-type HTTP header would be sufficient to know whether or not to process the link?
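A minimal sketch of such a check. It assumes that only text/html and application/xhtml+xml should be scraped; that list is an assumption, not something this issue specifies. The class and method names are hypothetical.

```java
import java.util.Locale;

// Hypothetical helper, not part of Linter's actual code.
final class ContentTypeGate {
    /**
     * True when a Content-Type header value denotes an HTML page.
     * Parameters such as "; charset=UTF-8" are ignored. A missing header
     * is treated as "do not scrape".
     */
    static boolean isHtml(String contentTypeHeader) {
        if (contentTypeHeader == null) {
            return false;
        }
        String mediaType = contentTypeHeader.split(";", 2)[0]
                .trim()
                .toLowerCase(Locale.ROOT);
        return mediaType.equals("text/html")
                || mediaType.equals("application/xhtml+xml");
    }
}
```

The header value could come from URLConnection.getContentType() before the body is read, so a .zip link served as application/zip would be skipped without downloading it.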

NullPointerException on downloading page

Strange NPE:

LintedPage: [http://www.mytruspot.com/?t=live] Unable to download page: java.lang.NullPointerException

No other details given

NYtimes provider

Not sure of its priority, since the erroneous Link Item has < 100 mentions since we started running 1.5 months ago, but NYtimes behaves almost identically to Facebook: some articles go to the actual article, some articles go to the NYTimes home page, some articles go to a NYtimes login page.

SSL handshake exceptions for HTTPS links

We don't do SSL properly:

2011-09-15 13:52:32,540 [pool-1-thread-2] ERROR LintedPage: IO Exception [https://www.stargroup1.com/blog/qr-codes-or-sms-mobile-marketing-battle-heats]: javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target

Linter should convert relative fav icon URLs to absolute

If the favicon link tag includes a relative href, e.g. "/webincludes/images/favicon.ico", Linter should convert it to an absolute URL by prefixing the domain. This can be done when we add the provider URL and name in issue #2.

HttpURLConnection ignores timeout value provided

connection.setConnectTimeout(LintedPage.HTTP_CONNECT_TIMEOUT);

With HTTP_CONNECT_TIMEOUT set to 10 seconds, the connection still seems to behave as if the timeout were 0 (infinite):

2011-08-30 10:46:52,617 [main] INFO  Linter: Running Linter
2011-08-30 10:46:52,620 [main] INFO  LintedPage: Processing URL: http://ow.ly/67AA9
2011-08-30 10:46:52,620 [main] DEBUG LintedPage: Expanding any shortened URLs...
2011-08-30 10:46:52,620 [main] TRACE LintedPage: Following http://ow.ly/67AA9...
2011-08-30 10:46:55,312 [main] TRACE LintedPage: Discovered redirect to http://www.sonymusic.co.jp/Music/Arch/SR/emirimiyamoto/index.html
2011-08-30 10:46:55,312 [main] TRACE LintedPage: Following http://www.sonymusic.co.jp/Music/Arch/SR/emirimiyamoto/index.html...
** stall **
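One possible explanation, which the trace alone does not confirm: setConnectTimeout only bounds the time taken to establish the connection. Waiting for response data is governed by the separate read timeout, which defaults to 0 (infinite) on URLConnection. A sketch that sets both follows; the class name, method name, and parameter names are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical helper, not part of Linter's actual code.
final class TimeoutConfig {
    /**
     * Opens (but does not yet connect) an HTTP connection with both the
     * connect timeout and the read timeout set, in milliseconds.
     */
    static HttpURLConnection open(String url, int connectMs, int readMs) {
        try {
            HttpURLConnection connection =
                    (HttpURLConnection) URI.create(url).toURL().openConnection();
            connection.setConnectTimeout(connectMs); // bounds connection setup
            connection.setReadTimeout(readMs);       // bounds each blocking read
            return connection;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

If the stall happens after the socket is connected, a read that exceeds the limit should then fail with java.net.SocketTimeoutException instead of blocking indefinitely.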

Increase page connection timeout

Seeing a lot of 404s (FileNotFoundException) for URLs that load fine in my browser. I think Linter is timing out before it can connect. Maybe raise the timeout to 5 or 6 seconds?

IOException can return HTTP 200, should show exception anyway

While scraping content, we handle IOExceptions:

        } catch (IOException ioe) {
            try {
                _parseError = "HTTP ERROR " + Integer.toString(connection.getResponseCode());
            } catch (IOException e) {
                _parseError = " Unable to download page: " + e;
            }
            logger.error(logPrefix + " " + _parseError);
            return;
        }

Even if we can get the response code successfully, we should still show the exception details.
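A sketch of one way to keep the exception details in both branches. The message format is an assumption chosen for illustration, and the class and method names are hypothetical.

```java
import java.io.IOException;
import java.net.HttpURLConnection;

// Hypothetical helper, not part of Linter's actual code.
final class ParseErrors {
    /** Builds a message that carries both the status code and the cause. */
    static String describe(int responseCode, IOException cause) {
        return "HTTP ERROR " + responseCode + " (" + cause + ")";
    }

    /**
     * Mirrors the catch block above: tries to read the response code, but
     * always includes the original exception in the resulting message.
     */
    static String describe(HttpURLConnection connection, IOException cause) {
        try {
            return describe(connection.getResponseCode(), cause);
        } catch (IOException e) {
            return "Unable to download page: " + e + " (original: " + cause + ")";
        }
    }
}
```

Inside the existing catch block, _parseError could then be set from describe(connection, ioe), so an HTTP 200 paired with an IOException would still show what went wrong.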

YFrog Videos marked as Photos

Linter is incorrectly identifying YFrog videos as photos.

On determining types:

Videos are easily determined by the presence of video meta tags og:video, video_src, or some variant. These tags are present on all major video sharing sites.
Images are much more difficult to detect as the primary content of a web page, as most pages have an image of some sort and dimensions cannot be reliably determined ahead of time. No consistent meta data tags indicate an image-sharing service or gallery, and the line between Image and Link is much more arbitrary. Currently, image type is determined by matching the url for known popular image providers.

We currently identify YFrog photos with this pattern:
http://yfrog\.com/.*

This should be modified to ignore video variants.
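A sketch of a narrower pattern. It rests on an unverified assumption: that yFrog video links can be told apart by the final character of the URL (here 'z' or 'f'). That suffix set should be checked against yFrog's actual URL scheme before use. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Linter's actual code.
final class YfrogMatcher {
    // ASSUMPTION: video URLs end in 'z' or 'f'. Verify before relying on it.
    // The negative lookahead rejects any URL whose last character is in
    // that set; everything else under yfrog.com still matches.
    private static final Pattern PHOTO =
            Pattern.compile("http://yfrog\\.com/(?!.*[zf]$).*");

    static boolean isPhoto(String url) {
        return PHOTO.matcher(url).matches();
    }
}
```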

Null Pointer Exception removing URL parameters after bad url UnknownHostException

2011-11-02 17:29:26,582 [pool-1-thread-3] ERROR LintedPage: IO Exception [http://www.cnn.comhttp://news.blogs.cnn.com/2011/11/02/syria-agrees-to-end-crackdown-on-demonstrators-arab-league-says/]: java.net.UnknownHostException: www.cnn.comhttp
2011-11-02 17:29:26,582 [pool-1-thread-3] ERROR TweetReceiver: Error during message consumption:
java.lang.NullPointerException
at org.linter.URLParser.removeParameters(Unknown Source)
at org.linter.LintedPage.removeDestinationUrlParamters(Unknown Source)
at com.crowdspoke.TweetMuncherRunnable.processUrl(Unknown Source)
at com.crowdspoke.TweetMuncherRunnable.processTweet(Unknown Source)
at com.crowdspoke.TweetMuncherRunnable.onTweetReceived(Unknown Source)
at com.crowdspoke.msg.TweetReceiver.startReceiving(Unknown Source)
at com.crowdspoke.TweetMuncherRunnable.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

Do not download/scrape content with bad content-length

Skip the download and scrape if content-length is not set, or is > 1048576 bytes (1 MB).
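The rule can be sketched as a single predicate. The class, constant, and method names are hypothetical; the 1048576-byte limit is the one stated in this issue.

```java
// Hypothetical helper, not part of Linter's actual code.
final class ContentLengthGate {
    /** Upper bound from the issue: 1 MB. */
    static final long MAX_BYTES = 1_048_576L;

    /**
     * True only when the length is known and within the limit.
     * A negative value means the Content-Length header was not set.
     */
    static boolean shouldDownload(long contentLength) {
        return contentLength >= 0 && contentLength <= MAX_BYTES;
    }
}
```

The argument could come from URLConnection.getContentLengthLong(), which returns -1 when the length is not known.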

Preview Image Relative URLs

Verify that no Preview Images contain relative URLs. This is covered in AlgorithmicImageSelector, but it may have somehow fallen through (no provider?), or the original page provided a relative url in its og:image tag.

yFrog and TwitPics should have previews pre-vote

It would be incredibly useful to see preview images for yFrog, TwitPic, and similar content items in the Live Stream before they've received their first up vote.

Project readme and OSS publication

Write up project goals, how to help, etc.

Need to publicize 1.0 when it is ready

Fix key names

Standardize key name formatting. We have favIconUrl, provider_url, and preview-image-url.
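The issue does not say which convention should win. As an illustration only, the sketch below normalizes all three styles to snake_case; that choice is an assumption, as are the class and method names.

```java
import java.util.Locale;

// Hypothetical helper, not part of Linter's actual code.
final class KeyNames {
    /** Converts camelCase and kebab-case keys to snake_case. */
    static String toSnakeCase(String key) {
        String withUnderscores = key
                .replace('-', '_')
                // Insert "_" at each lowercase-or-digit to uppercase boundary.
                .replaceAll("([a-z0-9])([A-Z])", "$1_$2");
        return withUnderscores.toLowerCase(Locale.ROOT);
    }
}
```

Under this convention, favIconUrl becomes fav_icon_url, preview-image-url becomes preview_image_url, and provider_url is left unchanged.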

NullPointerException on scraping for something

2011-08-08 17:57:52,432 [pool-1-thread-1] ERROR TweetReceiver: Error during message consumption: 
java.lang.NullPointerException
        at net.htmlparser.jericho.CharacterReference.decodeCollapseWhiteSpace(CharacterReference.java:310)
        at net.htmlparser.jericho.CharacterReference.decodeCollapseWhiteSpace(CharacterReference.java:306)
        at org.linter.LintedPage.scrapeMetadata(Unknown Source)
        at com.crowdspoke.TweetMuncherRunnable.processUrl(Unknown Source)
        at com.crowdspoke.TweetMuncherRunnable.processTweet(Unknown Source)
        at com.crowdspoke.TweetMuncherRunnable.onTweetReceived(Unknown Source)
        at com.crowdspoke.msg.TweetReceiver.startReceiving(Unknown Source)
        at com.crowdspoke.TweetMuncherRunnable.run(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)

Need to enable TRACE logging on the dev environment to see which lookup triggers the NPE. Still, it fails gracefully.

Move ServiceParserYoutubeAPI, should derive from ServiceParser

AlgorithmicImageSelector: NumberFormatException determining dimensions

    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
    at java.lang.Integer.parseInt(Integer.java:470)
    at java.lang.Integer.parseInt(Integer.java:499)
    at org.linter.AlgorithmicImageItem.dimensionStringToInt(Unknown Source)
    at org.linter.AlgorithmicImageItem.setWidth(Unknown Source)
    at org.linter.AlgorithmicImageSelector.parseAllImages(Unknown Source)
    at org.linter.AlgorithmicImageSelector.getPreviewUrl(Unknown Source)
    at org.linter.ServiceParserAlgorithmic.parsePreviewImage(Unknown Source)
    at org.linter.ServiceParserAlgorithmic.parse(Unknown Source)
    at org.linter.LintedPage.scrapeMetadata(Unknown Source)
    at com.crowdspoke.TweetMuncherRunnable.processUrl(Unknown Source)
    at com.crowdspoke.TweetMuncherRunnable.processTweet(Unknown Source)
    at com.crowdspoke.TweetMuncherRunnable.onTweetReceived(Unknown Source)
    at com.crowdspoke.msg.TweetReceiver.startReceiving(Unknown Source)
    at com.crowdspoke.TweetMuncherRunnable.run(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)
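The trace does not show the offending input, but width and height attributes in real pages often contain values such as "100px", "50%", or an empty string, any of which would make Integer.parseInt throw. A defensive sketch, with hypothetical names and an assumed sentinel of -1 for "unknown":

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Linter's actual code.
final class Dimensions {
    static final int UNKNOWN = -1;

    // Accepts plain integers with an optional "px" unit. Limiting the value
    // to nine digits keeps it inside the int range, so parseInt cannot fail.
    private static final Pattern PIXELS =
            Pattern.compile("^\\s*(\\d{1,9})\\s*(px)?\\s*$", Pattern.CASE_INSENSITIVE);

    /** Returns the pixel value, or UNKNOWN for null or non-pixel input. */
    static int parseDimension(String raw) {
        if (raw == null) {
            return UNKNOWN;
        }
        Matcher m = PIXELS.matcher(raw);
        return m.matches() ? Integer.parseInt(m.group(1)) : UNKNOWN;
    }
}
```

Percentages are deliberately treated as unknown here, since they do not give a pixel size that could be used for scoring.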

Set Default Provider Name and URL

Determining the default Provider Name/URL from the URL is performed in both the Live Stream (client-side, javascript) and Top Info (server-side, ruby) prior to Embedly lookup. Moving this to Linter will avoid having two separate implementations.

Create test suite

Test cases:

Should not download non text/html content
Should not allow circular redirects
Should allow relative redirects
Should handle bad title/favicon/description HTML
Should follow normal redirects to the end
Should work on 't.co' (which does different things depending on user-agent)
Should not allow circular redirects when next redirect is relative URL (http://t.co/UxIS7vV)

URLs with odd characters fail to process

"Odd" characters include ' and #

See the TweetMuncher beta-0.9.1 logs for HTTP ERROR 404.

Exception parsing image extension

java.lang.StringIndexOutOfBoundsException: String index out of range: -1
at java.lang.String.substring(String.java:1931)
at java.lang.String.substring(String.java:1904)
at org.linter.AlgorithmicImageItem.getExtension(Unknown Source)
at org.linter.AlgorithmicImageSelector.increaseScoreByFormat(Unknown Source)
at org.linter.AlgorithmicImageSelector.getPreviewUrl(Unknown Source)
at org.linter.ServiceParserAlgorithmic.parsePreviewImage(Unknown Source)
at org.linter.ServiceParserAlgorithmic.parse(Unknown Source)
at org.linter.LintedPage.scrapeMetadata(Unknown Source)
at com.crowdspoke.TweetMuncherRunnable.processUrl(Unknown Source)
at com.crowdspoke.TweetMuncherRunnable.processTweet(Unknown Source)
at com.crowdspoke.TweetMuncherRunnable.onTweetReceived(Unknown Source)
at com.crowdspoke.msg.TweetReceiver.startReceiving(Unknown Source)
at com.crowdspoke.TweetMuncherRunnable.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
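"String index out of range: -1" is consistent with an indexOf or lastIndexOf result of -1 being passed straight to substring, for instance when an image URL has no "." in it. That is an assumption, since the source of getExtension is not shown here. A guarded sketch with hypothetical names:

```java
import java.util.Locale;

// Hypothetical helper, not part of Linter's actual code.
final class ImageExtensions {
    /**
     * Returns the lowercase file extension of the URL's last path segment,
     * or "" when there is none. Query strings and fragments are ignored.
     */
    static String extensionOf(String url) {
        if (url == null) {
            return "";
        }
        int end = url.length();
        int query = url.indexOf('?');
        if (query >= 0) {
            end = query;
        }
        int fragment = url.indexOf('#');
        if (fragment >= 0 && fragment < end) {
            end = fragment;
        }
        String path = url.substring(0, end);
        int slash = path.lastIndexOf('/');
        int dot = path.lastIndexOf('.');
        // No dot, a dot that belongs to an earlier segment (such as the
        // host name), or a trailing dot all mean there is no extension.
        if (dot < 0 || dot < slash || dot == path.length() - 1) {
            return "";
        }
        return path.substring(dot + 1).toLowerCase(Locale.ROOT);
    }
}
```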

Check for null attribute values when scraping

_description = CharacterReference.decodeCollapseWhiteSpace(descElement.getAttributeValue("content"));

We should always check getAttributeValue() for null before calling decodeCollapseWhiteSpace, even though these NPEs are handled gracefully by commit 40288d5.
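A small null-safe wrapper is one way to apply that check everywhere attributes are scraped. The sketch below takes the decoder as a parameter so that it stands alone without the Jericho library; the class and method names are hypothetical, and it assumes Java 8 or later.

```java
import java.util.function.UnaryOperator;

// Hypothetical helper, not part of Linter's actual code.
final class NullSafeDecode {
    /**
     * Applies the decoder only when the attribute value is present,
     * otherwise returns null without calling it.
     */
    static String decodeOrNull(String attributeValue, UnaryOperator<String> decoder) {
        return attributeValue == null ? null : decoder.apply(attributeValue);
    }
}
```

With Jericho on the classpath, the line above could become decodeOrNull(descElement.getAttributeValue("content"), CharacterReference::decodeCollapseWhiteSpace).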

Handle 404/403/500 errors directly instead of waiting on the IOException

2011-08-16 14:54:50,097 [pool-1-thread-1] ERROR LintedPage: [http://hypem.com/item/1c6tt?awesm=awe.sm_5RAvk&utm_campaign=&utm_medium=awe.sm-twitter&utm_source=t.co&utm_content=autotweet] Unable to download page: java.io.IOException: Server returned HTTP response code: 403 for URL: http://hypem.com/item/1c6tt?awesm=awe.sm_5RAvk&utm_campaign=&utm_medium=awe.sm-twitter&utm_source=t.co&utm_content=autotweet
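A sketch of checking the status before reading the body, so that 403, 404, and 500 responses are reported without going through the IOException path. Treating every status of 400 or above as an error is an assumption, and the names are hypothetical.

```java
import java.io.IOException;
import java.net.HttpURLConnection;

// Hypothetical helper, not part of Linter's actual code.
final class StatusCheck {
    /** Assumption: any 4xx or 5xx status counts as an error. */
    static boolean isError(int status) {
        return status >= 400;
    }

    /**
     * Returns an error message for error statuses, or null when the
     * response body can be read normally.
     */
    static String errorFor(HttpURLConnection connection) throws IOException {
        int status = connection.getResponseCode();
        return isError(status) ? "HTTP ERROR " + status : null;
    }
}
```

Calling errorFor before getInputStream() lets the caller log the status and return early instead of waiting for the exception.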

t.co links failing

For some reason, Linter is not following t.co links to their destination site. Next to examine.
