
linter's People

Contributors

arthurn, lastzactionhero

linter's Issues

Linter should not follow infinite loop redirects

Check to see if the last URL is the same as the current. If it is, stop because that means we are on the verge of an infinite loop.

A good one to test (for some reason?)

http://vince-gill.bixxy.info/term/george+strait+listen

Linter fails to detect infinite loop w/ relative URL redirects

URL: http://t.co/UxIS7vV
Trace:

2011-08-28 14:08:53,026 [main] TRACE LintedPage: Discovered redirect to http://www.dazeddigital.com/music/article/11159/1/just-tell-the-truth-matthew-herbert
2011-08-28 14:08:53,027 [main] TRACE LintedPage: Following http://www.dazeddigital.com/music/article/11159/1/just-tell-the-truth-matthew-herbert...
2011-08-28 14:08:53,501 [main] TRACE LintedPage: Relative URL redirect. Appending prefix: http://www.dazeddigital.com
2011-08-28 14:08:53,501 [main] TRACE LintedPage: Discovered redirect to http://www.dazeddigital.com/music/article/11159/1/just-tell-the-truth-matthew-herbert

Need to put the infinite loop detection code after appending the domain prefix to relative URLs.
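A minimal sketch of that ordering, assuming a hypothetical `RedirectTracker` helper (not Linter's actual code): the `Location` value is resolved against the current URL first, and only the resulting absolute URL is compared. It uses a visited set rather than only the previous URL, which is a slight generalization of the check described above.

```java
import java.net.URI;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper for illustration; names are not from Linter. */
final class RedirectTracker {
    private final Set<String> visited = new HashSet<>();
    private URI current;

    RedirectTracker(String startUrl) {
        current = URI.create(startUrl).normalize();
        visited.add(current.toString());
    }

    /**
     * Resolves a Location header (absolute or relative) against the current
     * URL, then checks for a loop. Returns the absolute URL to follow, or
     * null if that URL was already visited.
     * Throws IllegalArgumentException if the header is not a valid URI.
     */
    String next(String locationHeader) {
        // Resolve first, so "/path" and "http://host/path" compare equal.
        URI target = current.resolve(locationHeader).normalize();
        if (!visited.add(target.toString())) {
            return null; // already seen: following it would loop
        }
        current = target;
        return target.toString();
    }
}
```

Because an absolute `Location` resolves to itself, this approach also avoids prepending a domain prefix to a URL that already has one.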

Scraping massive files causes OOM

If a non-HTML link is entered (e.g. http://outlawzmedia.net/Mixtapes/Outlawz-Killuminati-2K11.zip), Linter should not attempt to download & scrape the whole thing.

Perhaps checking content-type HTTP header would be sufficient to know whether or not to process the link?
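A rough sketch of such a check, assuming a hypothetical `ContentTypes.isHtml` helper. Treating a missing header as non-HTML and accepting `application/xhtml+xml` are both assumptions, not decisions made in this issue.

```java
import java.util.Locale;

/** Hypothetical helper for illustration; not part of Linter. */
final class ContentTypes {
    /** Returns true when a Content-Type header value names an HTML document. */
    static boolean isHtml(String contentType) {
        if (contentType == null) {
            return false; // header missing: assumed policy is to skip the body
        }
        // Drop parameters such as "; charset=UTF-8" before comparing.
        String mediaType = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return mediaType.equals("text/html") || mediaType.equals("application/xhtml+xml");
    }
}
```

The value could come from `HttpURLConnection.getContentType()`, which returns null when the header is absent. Servers can mislabel content, so a cap on the number of bytes read may still be worth adding.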

NYTimes provider

Not sure of its priority, since the erroneous Link Item has < 100 mentions since we started running 1.5 months ago, but NYTimes behaves almost identically to Facebook: some articles go to the actual article, some go to the NYTimes home page, and some go to a NYTimes login page.

SSL handshake exceptions for HTTPS links

We don't do SSL properly:

2011-09-15 13:52:32,540 [pool-1-thread-2] ERROR LintedPage: IO Exception [https://www.stargroup1.com/blog/qr-codes-or-sms-mobile-marketing-battle-heats]: javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target

HttpURLConnection ignores timeout value provided

connection.setConnectTimeout(LintedPage.HTTP_CONNECT_TIMEOUT);

with HTTP_CONNECT_TIMEOUT set at 10 seconds, it still seems to get into a case with a timeout of 0 (infinite):

2011-08-30 10:46:52,617 [main] INFO  Linter: Running Linter
2011-08-30 10:46:52,620 [main] INFO  LintedPage: Processing URL: http://ow.ly/67AA9
2011-08-30 10:46:52,620 [main] DEBUG LintedPage: Expanding any shortened URLs...
2011-08-30 10:46:52,620 [main] TRACE LintedPage: Following http://ow.ly/67AA9...
2011-08-30 10:46:55,312 [main] TRACE LintedPage: Discovered redirect to http://www.sonymusic.co.jp/Music/Arch/SR/emirimiyamoto/index.html
2011-08-30 10:46:55,312 [main] TRACE LintedPage: Following http://www.sonymusic.co.jp/Music/Arch/SR/emirimiyamoto/index.html...
** stall **
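One possible explanation, not confirmed by the trace: `setConnectTimeout` only bounds connection establishment, while the read timeout is separate and defaults to 0 (infinite). A sketch that sets both, where `HTTP_READ_TIMEOUT` and the `Timeouts` class are hypothetical names and the 10-second read value is an assumption:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;

/** Hypothetical helper for illustration; not part of Linter. */
final class Timeouts {
    static final int HTTP_CONNECT_TIMEOUT = 10_000; // milliseconds (10 s, as in the issue)
    static final int HTTP_READ_TIMEOUT = 10_000;    // milliseconds (assumed value)

    /** Creates a connection with both timeouts set; nothing is sent yet. */
    static HttpURLConnection open(String url) {
        try {
            // openConnection() only builds the object; no network traffic happens here.
            HttpURLConnection connection =
                    (HttpURLConnection) URI.create(url).toURL().openConnection();
            connection.setConnectTimeout(HTTP_CONNECT_TIMEOUT); // bounds the TCP connect
            connection.setReadTimeout(HTTP_READ_TIMEOUT);       // bounds each blocking read
            return connection;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With the read timeout set, a stalled server should surface as a `java.net.SocketTimeoutException` instead of blocking the thread.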

Increase page connection timeout

Seeing a lot of 404's (FileNotFoundException) that work when I load them in my browser. I think it's just timing out before it can connect. Maybe 5 or 6 sec?

IOException can return HTTP 200, should show exception anyway

While scraping content, we handle IOExceptions:

        } catch (IOException ioe) {
            try {
                _parseError = "HTTP ERROR " + Integer.toString(connection.getResponseCode());
            } catch (IOException e) {
                _parseError = " Unable to download page: " + e;
            }
            logger.error(logPrefix + " " + _parseError);
            return;
        }

Even if we can get the response code successfully, we should still show the exception details.
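A sketch of a message that always carries the exception, assuming a hypothetical `ParseErrors.describe` helper. The status code is passed as null when `getResponseCode()` itself threw.

```java
import java.io.IOException;

/** Hypothetical helper for illustration; not part of Linter. */
final class ParseErrors {
    /**
     * Builds the parse error text. statusCode is null when the response
     * code could not be read; the original exception is always appended.
     */
    static String describe(Integer statusCode, IOException cause) {
        String status = (statusCode == null)
                ? "Unable to download page"
                : "HTTP " + statusCode;
        return status + ": " + cause;
    }
}
```

In the catch block above this would be called with the outer `ioe` as the cause, so a 200 response that still failed mid-download reports why.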

YFrog Videos marked as Photos

Linter is incorrectly identifying YFrog videos as photos.

On determining types:

  • Videos are easily determined by the presence of video meta tags og:video, video_src, or some variant. These tags are present on all major video sharing sites.
  • Images are much more difficult to detect as the primary content of a web page, as most pages have an image of some sort and dimensions cannot be reliably determined ahead of time. No consistent meta data tags indicate an image-sharing service or gallery, and the line between Image and Link is much more arbitrary. Currently, image type is determined by matching the url for known popular image providers.

We currently identify YFrog photos with this pattern:
http://yfrog\.com/.*

This should be modified to ignore video variants.
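An alternative to changing the pattern, sketched here as a different technique from the one proposed above: check for video meta tags before falling back to the provider URL match. This assumes YFrog video pages actually carry one of those tags, which has not been verified. `TypeDetector` and its return values are hypothetical.

```java
import java.util.Set;
import java.util.regex.Pattern;

/** Hypothetical classifier for illustration; not part of Linter. */
final class TypeDetector {
    // Tag names taken from the issue text; "some variant" is not covered.
    private static final Set<String> VIDEO_TAGS = Set.of("og:video", "video_src");
    private static final Pattern YFROG = Pattern.compile("http://yfrog\\.com/.*");

    /** metaTagNames: names of the meta tags found on the scraped page. */
    static String detect(String url, Set<String> metaTagNames) {
        for (String tag : metaTagNames) {
            if (VIDEO_TAGS.contains(tag)) {
                return "video"; // video tags take precedence over URL matching
            }
        }
        if (YFROG.matcher(url).matches()) {
            return "photo";
        }
        return "link";
    }
}
```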

Null Pointer Exception removing URL parameters after bad url UnknownHostException

2011-11-02 17:29:26,582 [pool-1-thread-3] ERROR LintedPage: IO Exception [http://www.cnn.comhttp://news.blogs.cnn.com/2011/11/02/syria-agrees-to-end-crackdown-on-demonstrators-arab-league-says/]: java.net.UnknownHostException: www.cnn.comhttp
2011-11-02 17:29:26,582 [pool-1-thread-3] ERROR TweetReceiver: Error during message consumption:
java.lang.NullPointerException
at org.linter.URLParser.removeParameters(Unknown Source)
at org.linter.LintedPage.removeDestinationUrlParamters(Unknown Source)
at com.crowdspoke.TweetMuncherRunnable.processUrl(Unknown Source)
at com.crowdspoke.TweetMuncherRunnable.processTweet(Unknown Source)
at com.crowdspoke.TweetMuncherRunnable.onTweetReceived(Unknown Source)
at com.crowdspoke.msg.TweetReceiver.startReceiving(Unknown Source)
at com.crowdspoke.TweetMuncherRunnable.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
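The trace does not show which value was null. Assuming the destination URL was never set after the UnknownHostException, a null guard along these lines would avoid the NPE. This `removeParameters` is a hypothetical stand-in, not the actual `URLParser` method.

```java
/** Hypothetical stand-in for illustration; not Linter's URLParser. */
final class UrlParams {
    /**
     * Strips the query string (and anything after it) from a URL.
     * Returns null for null input instead of throwing.
     */
    static String removeParameters(String url) {
        if (url == null) {
            return null; // nothing was resolved, so there is nothing to strip
        }
        int queryStart = url.indexOf('?');
        return queryStart < 0 ? url : url.substring(0, queryStart);
    }
}
```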

Preview Image Relative URLs

Verify that no Preview Images contain relative URLs. This is covered in AlgorithmicImageSelector, but it may have somehow fallen through (no provider?), or the original page provided a relative URL in its og:image tag.

Fix key names

Standardize key name formatting. We have favIconUrl, provider_url, and preview-image-url.

NullPointerException on scraping for something

2011-08-08 17:57:52,432 [pool-1-thread-1] ERROR TweetReceiver: Error during message consumption: 
java.lang.NullPointerException
        at net.htmlparser.jericho.CharacterReference.decodeCollapseWhiteSpace(CharacterReference.java:310)
        at net.htmlparser.jericho.CharacterReference.decodeCollapseWhiteSpace(CharacterReference.java:306)
        at org.linter.LintedPage.scrapeMetadata(Unknown Source)
        at com.crowdspoke.TweetMuncherRunnable.processUrl(Unknown Source)
        at com.crowdspoke.TweetMuncherRunnable.processTweet(Unknown Source)
        at com.crowdspoke.TweetMuncherRunnable.onTweetReceived(Unknown Source)
        at com.crowdspoke.msg.TweetReceiver.startReceiving(Unknown Source)
        at com.crowdspoke.TweetMuncherRunnable.run(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)

Need to enable trace logging on the dev environment to see which lookup triggers the NPE. Still, it fails gracefully.

AlgorithmicImageSelector: NumberFormatException determining dimensions

    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
    at java.lang.Integer.parseInt(Integer.java:470)
    at java.lang.Integer.parseInt(Integer.java:499)
    at org.linter.AlgorithmicImageItem.dimensionStringToInt(Unknown Source)
    at org.linter.AlgorithmicImageItem.setWidth(Unknown Source)
    at org.linter.AlgorithmicImageSelector.parseAllImages(Unknown Source)
    at org.linter.AlgorithmicImageSelector.getPreviewUrl(Unknown Source)
    at org.linter.ServiceParserAlgorithmic.parsePreviewImage(Unknown Source)
    at org.linter.ServiceParserAlgorithmic.parse(Unknown Source)
    at org.linter.LintedPage.scrapeMetadata(Unknown Source)
    at com.crowdspoke.TweetMuncherRunnable.processUrl(Unknown Source)
    at com.crowdspoke.TweetMuncherRunnable.processTweet(Unknown Source)
    at com.crowdspoke.TweetMuncherRunnable.onTweetReceived(Unknown Source)
    at com.crowdspoke.msg.TweetReceiver.startReceiving(Unknown Source)
    at com.crowdspoke.TweetMuncherRunnable.run(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)
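The trace omits the offending input, so the cause is a guess: width/height attributes in real pages are often not plain integers (for example `100px`, `50%`, or empty). A tolerant parser might look like this, where `Dimensions.parse` and the `UNKNOWN` sentinel are hypothetical:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not Linter's AlgorithmicImageItem. */
final class Dimensions {
    static final int UNKNOWN = -1;

    // Up to 9 digits (always fits in an int), optionally followed by "px".
    private static final Pattern PIXELS =
            Pattern.compile("^\\s*(\\d{1,9})\\s*(px)?\\s*$", Pattern.CASE_INSENSITIVE);

    /** Returns the pixel value, or UNKNOWN for null, empty, "%", or other input. */
    static int parse(String value) {
        if (value == null) {
            return UNKNOWN;
        }
        Matcher m = PIXELS.matcher(value);
        return m.matches() ? Integer.parseInt(m.group(1)) : UNKNOWN;
    }
}
```

Returning a sentinel rather than throwing lets the selector skip or down-rank an image with unknown dimensions instead of aborting the whole parse.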

Set Default Provider Name and URL

Determining the default Provider Name/URL from the URL is performed in both the Live Stream (client-side, JavaScript) and Top Info (server-side, Ruby) prior to Embedly lookup. Moving this to Linter will avoid having two separate implementations.

Create test suite

Test cases:

  1. Should not download non text/html content
  2. Should not allow circular redirects
  3. Should allow relative redirects
  4. Should handle bad title/favicon/description HTML
  5. Should follow normal redirects to the end
  6. Should work on 't.co' (which does different things depending on user-agent)
  7. Should not allow circular redirects when next redirect is relative URL (http://t.co/UxIS7vV)

Exception parsing image extension

java.lang.StringIndexOutOfBoundsException: String index out of range: -1
at java.lang.String.substring(String.java:1931)
at java.lang.String.substring(String.java:1904)
at org.linter.AlgorithmicImageItem.getExtension(Unknown Source)
at org.linter.AlgorithmicImageSelector.increaseScoreByFormat(Unknown Source)
at org.linter.AlgorithmicImageSelector.getPreviewUrl(Unknown Source)
at org.linter.ServiceParserAlgorithmic.parsePreviewImage(Unknown Source)
at org.linter.ServiceParserAlgorithmic.parse(Unknown Source)
at org.linter.LintedPage.scrapeMetadata(Unknown Source)
at com.crowdspoke.TweetMuncherRunnable.processUrl(Unknown Source)
at com.crowdspoke.TweetMuncherRunnable.processTweet(Unknown Source)
at com.crowdspoke.TweetMuncherRunnable.onTweetReceived(Unknown Source)
at com.crowdspoke.msg.TweetReceiver.startReceiving(Unknown Source)
at com.crowdspoke.TweetMuncherRunnable.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
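An index of -1 in `substring` usually means an `indexOf`/`lastIndexOf` lookup found nothing, most likely an image URL with no `.` in its path. A defensive sketch, with `Extensions.of` as a hypothetical replacement for `getExtension`:

```java
import java.util.Locale;

/** Hypothetical helper for illustration; not Linter's AlgorithmicImageItem. */
final class Extensions {
    /** Returns the lower-cased file extension of a URL's path, or "" if none. */
    static String of(String url) {
        if (url == null) {
            return "";
        }
        // Skip "scheme://host" so the host's dots are not mistaken for an extension.
        int scheme = url.indexOf("://");
        int pathStart = scheme < 0 ? 0 : url.indexOf('/', scheme + 3);
        if (pathStart < 0) {
            return ""; // absolute URL with no path
        }
        // Cut off the query string and fragment, whichever comes first.
        int end = url.length();
        for (char c : new char[] {'?', '#'}) {
            int i = url.indexOf(c, pathStart);
            if (i >= 0 && i < end) {
                end = i;
            }
        }
        String path = url.substring(pathStart, end);
        int dot = path.lastIndexOf('.');
        if (dot < 0 || dot < path.lastIndexOf('/') || dot == path.length() - 1) {
            return ""; // no dot in the last path segment, or a trailing dot
        }
        return path.substring(dot + 1).toLowerCase(Locale.ROOT);
    }
}
```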

Check for null attribute values when scraping

_description = CharacterReference.decodeCollapseWhiteSpace(descElement.getAttributeValue("content"));

We should always check the result of getAttributeValue() for null before calling decodeCollapseWhiteSpace, even though these NPEs are already handled gracefully as of commit 40288d5.
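One way to centralize the check is a small wrapper that skips the decoder for null input. `SafeDecode.decodeOrNull` is a hypothetical helper; the decoder is passed in so the sketch stays independent of Jericho.

```java
import java.util.function.UnaryOperator;

/** Hypothetical helper for illustration; not part of Linter. */
final class SafeDecode {
    /** Applies the decoder only when the raw attribute value is present. */
    static String decodeOrNull(String raw, UnaryOperator<String> decoder) {
        return raw == null ? null : decoder.apply(raw);
    }
}
```

Assuming Jericho's static method accepts the attribute string, the line above could become `decodeOrNull(descElement.getAttributeValue("content"), CharacterReference::decodeCollapseWhiteSpace)`. This does not cover the case where `descElement` itself is null.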

Handle 404/403/500 errors directly instead of waiting on the IOException

2011-08-16 14:54:50,097 [pool-1-thread-1] ERROR LintedPage: [http://hypem.com/item/1c6tt?awesm=awe.sm_5RAvk&utm_campaign=&utm_medium=awe.sm-twitter&utm_source=t.co&utm_content=autotweet] Unable to download page: java.io.IOException: Server returned HTTP response code: 403 for URL: http://hypem.com/item/1c6tt?awesm=awe.sm_5RAvk&utm_campaign=&utm_medium=awe.sm-twitter&utm_source=t.co&utm_content=autotweet
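A sketch of checking the status up front, assuming a hypothetical `HttpStatus.errorFor` helper that would be called with `connection.getResponseCode()` before `getInputStream()`. `getResponseCode()` returns -1 when the response is not valid HTTP, so that case is handled too.

```java
/** Hypothetical helper for illustration; not part of Linter. */
final class HttpStatus {
    /**
     * Returns an error message for 4xx/5xx responses and for an
     * unparseable response (-1), or null when the body can be read.
     */
    static String errorFor(int code) {
        if (code == -1) {
            return "Invalid HTTP response";
        }
        if (code >= 400 && code < 600) {
            return "HTTP ERROR " + code;
        }
        return null;
    }
}
```

When the result is non-null, the scraper could log it and return without ever requesting the input stream, so the IOException path is left for genuine I/O failures.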

t.co links failing

For some reason, Linter is not following t.co links to their destination site. Next to examine.
