arthurn / linter
A Java project to process URLs and return interesting metadata
FeedZilla links have broken preview images. URL points to http://www.feedzilla.com
<meta property="og:image" content="/" />
Probably just ignore og:images without an actual image.
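A minimal sketch of that filter, assuming the og:image content value is available as a plain string. The class and method names are hypothetical, and "no actual image" is interpreted here as an empty value or one whose path is empty or just "/":

```java
import java.net.URI;
import java.net.URISyntaxException;

final class OgImageFilter {
    /** Returns true when the og:image value plausibly names an image resource. */
    static boolean isUsableOgImage(String content) {
        if (content == null) {
            return false;
        }
        String value = content.trim();
        if (value.isEmpty()) {
            return false;
        }
        try {
            String path = new URI(value).getPath();
            // An empty path or "/" names a site root, not an image file.
            return path != null && !path.isEmpty() && !path.equals("/");
        } catch (URISyntaxException e) {
            // Unparseable values are treated as unusable.
            return false;
        }
    }
}
```

With this check, the FeedZilla value "/" is rejected and the caller can fall back to whatever image selection it would use when no og:image is present.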
Check to see if the last URL is the same as the current. If it is, stop because that means we are on the verge of an infinite loop.
A good one to test (for some reason?)
http://vince-gill.bixxy.info/term/george+strait+listen
Tweets that include Twitter's new built-in picture feature should be recognized as images (similar to TwitPic or yFrog).
When the Location header is a relative location (e.g. Location: /#!/item/13msf), Linter fails with java.net.MalformedURLException: no protocol
URL: http://t.co/UxIS7vV
Trace:
2011-08-28 14:08:53,026 [main] TRACE LintedPage: Discovered redirect to http://www.dazeddigital.com/music/article/11159/1/just-tell-the-truth-matthew-herbert
2011-08-28 14:08:53,027 [main] TRACE LintedPage: Following http://www.dazeddigital.com/music/article/11159/1/just-tell-the-truth-matthew-herbert...
2011-08-28 14:08:53,501 [main] TRACE LintedPage: Relative URL redirect. Appending prefix: http://www.dazeddigital.com
2011-08-28 14:08:53,501 [main] TRACE LintedPage: Discovered redirect to http://www.dazeddigital.com/music/article/11159/1/just-tell-the-truth-matthew-herbert
Need to put the infinite loop detection code after appending the domain prefix to relative URLs.
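One way to get that ordering, sketched with java.net.URI: resolve the Location value against the current URL first, then compare. RedirectResolver and its methods are hypothetical names, not existing Linter code. URI.resolve returns an absolute Location unchanged, so this approach would also avoid gluing a domain prefix onto a URL that already has one.

```java
import java.net.URI;

final class RedirectResolver {
    /**
     * Resolves a Location header value against the URL that returned it.
     * URI.create throws IllegalArgumentException for syntactically invalid input.
     */
    static String resolve(String currentUrl, String location) {
        // Absolute locations come back unchanged; relative ones inherit
        // scheme and host from currentUrl.
        return URI.create(currentUrl).resolve(location.trim()).toString();
    }

    /** True when the resolved redirect target is the URL we are already on. */
    static boolean isLoop(String currentUrl, String location) {
        return resolve(currentUrl, location).equals(currentUrl);
    }
}
```

For the trace above, a relative Location of /music/article/11159/1/just-tell-the-truth-matthew-herbert resolves to the URL that was just fetched, so isLoop reports true and the redirect chain can stop there.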
If a non-HTML link is entered (e.g. http://outlawzmedia.net/Mixtapes/Outlawz-Killuminati-2K11.zip), Linter should not attempt to download & scrape the whole thing.
Perhaps checking the Content-Type HTTP header would be sufficient to know whether or not to process the link?
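A sketch of such a check, assuming the header value is read with HttpURLConnection.getContentType(). ContentTypeCheck is a hypothetical helper, and the set of accepted MIME types is an assumption:

```java
import java.util.Locale;

final class ContentTypeCheck {
    /** Returns true when the Content-Type header value names an HTML document. */
    static boolean isHtml(String contentType) {
        if (contentType == null) {
            // Header missing: reported as non-HTML here; the caller may choose otherwise.
            return false;
        }
        // Drop parameters such as "; charset=UTF-8" before comparing.
        int semi = contentType.indexOf(';');
        String mime = (semi >= 0 ? contentType.substring(0, semi) : contentType)
                .trim()
                .toLowerCase(Locale.ROOT);
        return mime.equals("text/html") || mime.equals("application/xhtml+xml");
    }
}
```

Issuing the request with setRequestMethod("HEAD") before reading the header would avoid transferring the body at all, at the cost of one extra round trip for pages that do turn out to be HTML.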
Strange NPE:
LintedPage: [http://www.mytruspot.com/?t=live] Unable to download page: java.lang.NullPointerException
No other details given
Not sure of its priority, since the erroneous Link Item has < 100 mentions since we started running 1.5 months ago, but NYTimes behaves almost identically to Facebook: some articles go to the actual article, some go to the NYTimes home page, and some go to a NYTimes login page.
We don't do SSL properly:
2011-09-15 13:52:32,540 [pool-1-thread-2] ERROR LintedPage: IO Exception [https://www.stargroup1.com/blog/qr-codes-or-sms-mobile-marketing-battle-heats]: javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
If the tag includes a relative href, e.g. "/webincludes/images/favicon.ico", Linter should convert that to absolute by prefixing the domain. This can be accomplished when we do provider URL and name in issue #2.
connection.setConnectTimeout(LintedPage.HTTP_CONNECT_TIMEOUT);
With HTTP_CONNECT_TIMEOUT set at 10 seconds, it still seems to get into a case with a timeout of 0 (infinite):
2011-08-30 10:46:52,617 [main] INFO Linter: Running Linter
2011-08-30 10:46:52,620 [main] INFO LintedPage: Processing URL: http://ow.ly/67AA9
2011-08-30 10:46:52,620 [main] DEBUG LintedPage: Expanding any shortened URLs...
2011-08-30 10:46:52,620 [main] TRACE LintedPage: Following http://ow.ly/67AA9...
2011-08-30 10:46:55,312 [main] TRACE LintedPage: Discovered redirect to http://www.sonymusic.co.jp/Music/Arch/SR/emirimiyamoto/index.html
2011-08-30 10:46:55,312 [main] TRACE LintedPage: Following http://www.sonymusic.co.jp/Music/Arch/SR/emirimiyamoto/index.html...
** stall **
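One possible explanation, not confirmed by the log: setConnectTimeout only bounds connection establishment. If the server accepts the connection and then never sends a response, the wait is governed by the read timeout, which defaults to 0 (infinite) on HttpURLConnection. A sketch that sets both, where HTTP_READ_TIMEOUT and its value are hypothetical:

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

final class TimeoutConfig {
    static final int HTTP_CONNECT_TIMEOUT = 10000; // milliseconds, 10 seconds as described above
    static final int HTTP_READ_TIMEOUT = 10000;    // milliseconds, hypothetical value

    /** Creates (but does not yet connect) a connection with both timeouts set. */
    static HttpURLConnection open(String url) throws IOException {
        HttpURLConnection connection = (HttpURLConnection) new URL(url).openConnection();
        connection.setConnectTimeout(HTTP_CONNECT_TIMEOUT);
        // Bounds each blocking read once the connection is established.
        connection.setReadTimeout(HTTP_READ_TIMEOUT);
        return connection;
    }
}
```

With a read timeout in place, a stalled server surfaces as a java.net.SocketTimeoutException instead of blocking the thread indefinitely.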
Seeing a lot of 404's (FileNotFoundException) that work when I load them in my browser. I think it's just timing out before it can connect. Maybe 5 or 6 sec?
While scraping content, we handle IOExceptions:
} catch (IOException ioe) {
try {
_parseError = "HTTP ERROR " + Integer.toString(connection.getResponseCode());
} catch (IOException e) {
_parseError = " Unable to download page: " + e;
}
logger.error(logPrefix + " " + _parseError);
return;
}
Even if we can get the response code successfully, we should still show the exception details.
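A sketch of a message builder that keeps both pieces of information. ParseErrorMessage is a hypothetical helper, and the exact message layout is an assumption:

```java
import java.io.IOException;

final class ParseErrorMessage {
    /**
     * Builds the parse error text.
     * Pass a negative responseCode when no status code could be obtained.
     */
    static String describe(int responseCode, IOException cause) {
        if (responseCode >= 0) {
            // Keep the status code and the original exception together.
            return "HTTP ERROR " + responseCode + " (" + cause + ")";
        }
        return "Unable to download page: " + cause;
    }
}
```

In the catch block above, the outer exception (ioe) would be passed as the cause in both branches, so the original failure is never dropped from the log line.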
Linter is incorrectly identifying YFrog videos as photos.
On determining types:
We currently identify YFrog photos with this pattern:
http://yfrog\.com/.*
This should be modified to ignore video variants.
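A sketch of one way to narrow the pattern. It rests on an unverified assumption: that yFrog video links can be recognised by the last character of the URL, with 'f' and 'z' used here as placeholder video suffixes. Those suffixes should be checked against yFrog's own URL documentation before relying on this.

```java
import java.util.regex.Pattern;

final class YfrogType {
    // ASSUMPTION: URLs ending in 'f' or 'z' are videos. The negative lookbehind
    // at the end of the pattern rejects any URL whose last character is one of them.
    private static final Pattern YFROG_PHOTO =
            Pattern.compile("http://yfrog\\.com/.*(?<![fz])");

    /** Returns true for yFrog URLs that do not end in an assumed video suffix. */
    static boolean isPhoto(String url) {
        return url != null && YFROG_PHOTO.matcher(url).matches();
    }
}
```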
2011-11-02 17:29:26,582 [pool-1-thread-3] ERROR LintedPage: IO Exception [http://www.cnn.comhttp://news.blogs.cnn.com/2011/11/02/syria-agrees-to-end-crackdown-on-demonstrators-arab-league-says/]: java.net.UnknownHostException: www.cnn.comhttp
2011-11-02 17:29:26,582 [pool-1-thread-3] ERROR TweetReceiver: Error during message consumption:
java.lang.NullPointerException
at org.linter.URLParser.removeParameters(Unknown Source)
at org.linter.LintedPage.removeDestinationUrlParamters(Unknown Source)
at com.crowdspoke.TweetMuncherRunnable.processUrl(Unknown Source)
at com.crowdspoke.TweetMuncherRunnable.processTweet(Unknown Source)
at com.crowdspoke.TweetMuncherRunnable.onTweetReceived(Unknown Source)
at com.crowdspoke.msg.TweetReceiver.startReceiving(Unknown Source)
at com.crowdspoke.TweetMuncherRunnable.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
If content-length is not set, or is > 1048576 bytes (1MB)
Verify that no Preview Images contain relative URLs. This is covered in AlgorithmicImageSelector, but it may have somehow fallen through (no provider?), or the original page provided a relative url in its og:image tag.
It would be incredibly useful to see preview images for yFrog, TwitPic, and similar content items in the Live Stream before they've received their first up vote.
Write up project goals, how to help, etc.
Need to publicize 1.0 when it is ready
Standardize key name formatting. We have favIconUrl, provider_url, and preview-image-url.
2011-08-08 17:57:52,432 [pool-1-thread-1] ERROR TweetReceiver: Error during message consumption:
java.lang.NullPointerException
at net.htmlparser.jericho.CharacterReference.decodeCollapseWhiteSpace(CharacterReference.java:310)
at net.htmlparser.jericho.CharacterReference.decodeCollapseWhiteSpace(CharacterReference.java:306)
at org.linter.LintedPage.scrapeMetadata(Unknown Source)
at com.crowdspoke.TweetMuncherRunnable.processUrl(Unknown Source)
at com.crowdspoke.TweetMuncherRunnable.processTweet(Unknown Source)
at com.crowdspoke.TweetMuncherRunnable.onTweetReceived(Unknown Source)
at com.crowdspoke.msg.TweetReceiver.startReceiving(Unknown Source)
at com.crowdspoke.TweetMuncherRunnable.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Need to set trace on dev environment and see what it's looking for that gets NPE. Still, it fails gracefully.
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
at java.lang.Integer.parseInt(Integer.java:470)
at java.lang.Integer.parseInt(Integer.java:499)
at org.linter.AlgorithmicImageItem.dimensionStringToInt(Unknown Source)
at org.linter.AlgorithmicImageItem.setWidth(Unknown Source)
at org.linter.AlgorithmicImageSelector.parseAllImages(Unknown Source)
at org.linter.AlgorithmicImageSelector.getPreviewUrl(Unknown Source)
at org.linter.ServiceParserAlgorithmic.parsePreviewImage(Unknown Source)
at org.linter.ServiceParserAlgorithmic.parse(Unknown Source)
at org.linter.LintedPage.scrapeMetadata(Unknown Source)
at com.crowdspoke.TweetMuncherRunnable.processUrl(Unknown Source)
at com.crowdspoke.TweetMuncherRunnable.processTweet(Unknown Source)
at com.crowdspoke.TweetMuncherRunnable.onTweetReceived(Unknown Source)
at com.crowdspoke.msg.TweetReceiver.startReceiving(Unknown Source)
at com.crowdspoke.TweetMuncherRunnable.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
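The trace shows Integer.parseInt rejecting a width value, but not which value. A defensive sketch, assuming the culprit is a non-numeric attribute such as "120px", "100%", or an empty string. Dimensions.dimensionToInt is a hypothetical replacement, not the existing dimensionStringToInt:

```java
final class Dimensions {
    /**
     * Parses the leading digits of an HTML width/height attribute.
     * Returns -1 when the value is missing, relative (percent), or not numeric.
     */
    static int dimensionToInt(String raw) {
        if (raw == null) {
            return -1;
        }
        String s = raw.trim();
        if (s.endsWith("%")) {
            // Percentages are relative to the container, so no pixel size is known.
            return -1;
        }
        int end = 0;
        // Cap at 9 digits so the result always fits in an int.
        while (end < s.length() && end < 9 && s.charAt(end) >= '0' && s.charAt(end) <= '9') {
            end++;
        }
        if (end == 0) {
            return -1;
        }
        return Integer.parseInt(s.substring(0, end));
    }
}
```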
Determining the default Provider Name/URL from the URL is performed in both the Live Stream (client-side, javascript) and Top Info (server-side, ruby) prior to Embedly lookup. Moving this to Linter will avoid having two separate implementations.
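A minimal sketch of what a single Linter-side implementation might look like. The rules used here (provider URL is scheme plus host, provider name is the host without a leading "www.") are assumptions; the existing JavaScript and Ruby implementations may differ. DefaultProvider is a hypothetical class.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class DefaultProvider {
    final String name;
    final String url;

    private DefaultProvider(String name, String url) {
        this.name = name;
        this.url = url;
    }

    /** Derives a default provider from a page URL, or returns null when it cannot. */
    static DefaultProvider fromUrl(String pageUrl) {
        if (pageUrl == null) {
            return null;
        }
        try {
            URI uri = new URI(pageUrl);
            String host = uri.getHost();
            if (uri.getScheme() == null || host == null) {
                return null; // relative or opaque URL: nothing to derive from
            }
            // ASSUMPTION: strip a leading "www." for the display name only.
            String name = host.startsWith("www.") ? host.substring(4) : host;
            return new DefaultProvider(name, uri.getScheme() + "://" + host);
        } catch (URISyntaxException e) {
            return null;
        }
    }
}
```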
Test cases:
http://t.co/UxIS7vV
"Odd" characters include ' and #
See the TweetMuncher beta-0.9.1 logs for HTTP ERROR 404
java.lang.StringIndexOutOfBoundsException: String index out of range: -1
at java.lang.String.substring(String.java:1931)
at java.lang.String.substring(String.java:1904)
at org.linter.AlgorithmicImageItem.getExtension(Unknown Source)
at org.linter.AlgorithmicImageSelector.increaseScoreByFormat(Unknown Source)
at org.linter.AlgorithmicImageSelector.getPreviewUrl(Unknown Source)
at org.linter.ServiceParserAlgorithmic.parsePreviewImage(Unknown Source)
at org.linter.ServiceParserAlgorithmic.parse(Unknown Source)
at org.linter.LintedPage.scrapeMetadata(Unknown Source)
at com.crowdspoke.TweetMuncherRunnable.processUrl(Unknown Source)
at com.crowdspoke.TweetMuncherRunnable.processTweet(Unknown Source)
at com.crowdspoke.TweetMuncherRunnable.onTweetReceived(Unknown Source)
at com.crowdspoke.msg.TweetReceiver.startReceiving(Unknown Source)
at com.crowdspoke.TweetMuncherRunnable.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
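The index of -1 suggests, though the trace does not confirm it, that getExtension passes the result of a failed lastIndexOf straight to substring, for example when an image URL has no dot in its path. A guarded sketch, where Extensions.getExtension is a hypothetical stand-in for the existing method:

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

final class Extensions {
    /** Returns the lower-cased file extension of the URL path, or "" when there is none. */
    static String getExtension(String url) {
        if (url == null) {
            return "";
        }
        String path;
        try {
            // getPath() excludes the query string and fragment.
            path = new URI(url).getPath();
        } catch (URISyntaxException e) {
            return "";
        }
        if (path == null) {
            return "";
        }
        int slash = path.lastIndexOf('/');
        int dot = path.lastIndexOf('.');
        // No dot in the last path segment, or a trailing dot: no extension.
        if (dot <= slash || dot == path.length() - 1) {
            return "";
        }
        return path.substring(dot + 1).toLowerCase(Locale.ROOT);
    }
}
```

Returning an empty string lets a caller such as increaseScoreByFormat treat the image as having an unknown format instead of aborting the whole scrape.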
_description = CharacterReference.decodeCollapseWhiteSpace(descElement.getAttributeValue("content"));
We should always check for a null getAttributeValue() before calling decodeCollapseWhiteSpace, even though these NPEs are handled gracefully by commit 40288d5.
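A sketch of the guard. To keep it runnable without the Jericho jar, decodeCollapseWhiteSpace below is a local stand-in that only collapses whitespace; in Linter the call would be CharacterReference.decodeCollapseWhiteSpace. MetaContent and safeDecode are hypothetical names.

```java
final class MetaContent {
    /** Stand-in for Jericho's decoder: collapses whitespace only, decodes nothing. */
    static String decodeCollapseWhiteSpace(String s) {
        return s.trim().replaceAll("\\s+", " ");
    }

    /** Returns null for a missing attribute instead of passing null to the decoder. */
    static String safeDecode(String attributeValue) {
        if (attributeValue == null) {
            return null;
        }
        return decodeCollapseWhiteSpace(attributeValue);
    }
}
```

The assignment would then read _description = safeDecode(descElement.getAttributeValue("content")), leaving _description null when the meta tag has no content attribute.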
2011-08-16 14:54:50,097 [pool-1-thread-1] ERROR LintedPage: [http://hypem.com/item/1c6tt?awesm=awe.sm_5RAvk&utm_campaign=&utm_medium=awe.sm-twitter&utm_source=t.co&utm_content=autotweet] Unable to download page: java.io.IOException: Server returned HTTP response code: 403 for URL: http://hypem.com/item/1c6tt?awesm=awe.sm_5RAvk&utm_campaign=&utm_medium=awe.sm-twitter&utm_source=t.co&utm_content=autotweet
For some reason, Linter is not following t.co links to their destination site. Next to examine.