node-feedparser's Introduction

Feedparser - Robust RSS, Atom, and RDF feed parsing in Node.js

Feedparser is for parsing RSS, Atom, and RDF feeds in node.js.

It has a couple of features you don't usually see in other feed parsers:

  1. It resolves relative URLs (such as those seen in Tim Bray's "ongoing" feed).
  2. It properly handles XML namespaces (including those in unusual feeds that define a non-default namespace for the main feed elements).

Installation

npm install feedparser

Usage

This example is just to briefly demonstrate basic concepts.

Please also review the complete example for a thorough working example that is a suitable starting point for your app.

var FeedParser = require('feedparser');
var fetch = require('node-fetch'); // for fetching the feed

var req = fetch('http://somefeedurl.xml');
var feedparser = new FeedParser(); // pass an options object here if needed (see "options" below)

req.then(function (res) {
  if (res.status !== 200) {
    throw new Error('Bad status code');
  }
  else {
    // The response `body` -- res.body -- is a stream
    res.body.pipe(feedparser);
  }
}, function (err) {
  // handle any request errors
});

feedparser.on('error', function (error) {
  // always handle errors
});

feedparser.on('readable', function () {
  // This is where the action is!
  var stream = this; // `this` is `feedparser`, which is a stream
  var meta = this.meta; // **NOTE** the "meta" is always available in the context of the feedparser instance
  var item;

  while ((item = stream.read())) {
    console.log(item);
  }
});

You can also check out this nice working implementation that demonstrates one way to handle all the hard and annoying stuff. 😃

options

  • normalize - Set to false to override Feedparser's default behavior, which is to parse feeds into an object that contains the generic properties patterned after (although not identical to) the RSS 2.0 format, regardless of the feed's format.

  • addmeta - Set to false to override Feedparser's default behavior, which is to add the feed's meta information to each article.

  • feedurl - The url (string) of the feed. FeedParser is very good at resolving relative urls in feeds. But some feeds use relative urls without declaring the xml:base attribute any place in the feed. This is perfectly valid, but we don't know the feed's url before we start parsing the feed and trying to resolve those relative urls. If we discover the feed's url, we will go back and resolve the relative urls we've already seen, but this takes a little time (not much). If you want to be sure we never have to re-resolve relative urls (or if FeedParser is failing to properly resolve relative urls), you should set the feedurl option (see the example after this list). Otherwise, feel free to ignore this option.

  • resume_saxerror - Set to false to override Feedparser's default behavior, which is to emit any SAXError on error and then automatically resume parsing. In my experience, SAXErrors are not usually fatal, so this is usually helpful behavior. If you want total control over handling these errors and optionally aborting parsing the feed, use this option.
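
For example, a minimal sketch of constructing a parser with options (the values shown are illustrative):

var FeedParser = require('feedparser');

var feedparser = new FeedParser({
  normalize: true,                        // the default
  addmeta: false,                         // don't copy the feed meta onto each article
  feedurl: 'http://example.com/feed.xml', // helps resolve relative urls
  resume_saxerror: true                   // the default
});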

Examples

See the examples directory.

API

Transform Stream

Feedparser is a transform stream operating in "object mode": XML in -> JavaScript objects out. Each readable chunk is an object representing an article in the feed.

Events Emitted

  • meta - called with feed meta when it has been parsed
  • error - called with error whenever there is a Feedparser error of any kind (SAXError, Feedparser error, etc.)
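
A minimal pair of listeners (a sketch using the events above):

feedparser.on('meta', function (meta) {
  console.log('Feed title: %s', meta.title);
});

feedparser.on('error', function (error) {
  // SAXErrors and feedparser errors both arrive here
});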

What is the parsed output produced by feedparser?

Feedparser parses each feed into a meta portion (emitted on the meta event) and one or more articles (emitted on the data event, or readable via read() after the readable event is emitted).

Regardless of the format of the feed, the meta and each article contain a uniform set of generic properties patterned after (although not identical to) the RSS 2.0 format, as well as all of the properties originally contained in the feed. So, for example, an Atom feed may have a meta.description property, but it will also have a meta['atom:subtitle'] property.

The purpose of the generic properties is to provide the user a uniform interface for accessing a feed's information without needing to know the feed's format (i.e., RSS versus Atom) or having to worry about handling the differences between the formats. However, the original information is also there, in case you need it. In addition, Feedparser supports some popular namespace extensions (or portions of them), such as portions of the itunes, media, feedburner and pheedo extensions. So, for example, if a feed article contains either an itunes:image or media:thumbnail, the url for that image will be contained in the article's image.url property.

All generic properties are "pre-initialized" to null (or to empty arrays or objects for certain properties). This should save you from having to do a lot of checking for undefined, for example when you are using Jade templates.

In addition, all properties (and namespace prefixes) use only lowercase letters, regardless of how they were capitalized in the original feed. ("xmlUrl" and "pubDate" also are still used to provide backwards compatibility.) This decision places ease-of-use over purity -- hopefully, you will never need to think about whether you should camelCase "pubDate" ever again.

The title and description properties of meta and the title property of each article have any HTML stripped if you let feedparser normalize the output. If you really need the HTML in those elements, there are always the originals: e.g., meta['atom:subtitle']['#'].
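
For example, a sketch contrasting the two (atom:subtitle is only present when the source feed is Atom):

feedparser.on('meta', function (meta) {
  console.log(meta.description);             // normalized, HTML stripped
  if (meta['atom:subtitle']) {
    console.log(meta['atom:subtitle']['#']); // original value, HTML intact
  }
});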

List of meta properties

  • title
  • description
  • link (website link)
  • xmlurl (the canonical link to the feed, as specified by the feed)
  • date (most recent update)
  • pubdate (original published date)
  • author
  • language
  • image (an Object containing url and title properties)
  • favicon (a link to the favicon -- only provided by Atom feeds)
  • copyright
  • generator
  • categories (an Array of Strings)

List of article properties

  • title
  • description (frequently, the full article content)
  • summary (frequently, an excerpt of the article content)
  • link
  • origlink (when FeedBurner or Pheedo puts a special tracking url in the link property, origlink contains the original link)
  • permalink (when an RSS feed has a guid field and the isPermaLink attribute is not set to false, permalink contains the value of guid)
  • date (most recent update)
  • pubdate (original published date)
  • author
  • guid (a unique identifier for the article)
  • comments (a link to the article's comments section)
  • image (an Object containing url and title properties)
  • categories (an Array of Strings)
  • source (an Object containing url and title properties pointing to the original source for an article; see the RSS Spec for an explanation of this element)
  • enclosures (an Array of Objects, each representing a podcast or other enclosure and having a url property and possibly type and length properties)
  • meta (an Object containing all the feed meta properties; especially handy when using the EventEmitter interface to listen to article emissions)

Help

  • Don't be afraid to report an issue.
  • You can drop by Gitter, too.

Contributors

View all the contributors.

Although node-feedparser no longer shares any code with node-easyrss, it was the original inspiration and a starting point.

License

(The MIT License)

Copyright (c) 2011-2020 Dan MacTough and contributors

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the 'Software'), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

node-feedparser's Issues

VERY IMPORTANT! Do not override Array.prototype!

See https://github.com/danmactough/node-feedparser/blob/master/lib/utils.js#L70

Please do not change the Array prototype. It is very bad and causes major headaches that are very difficult to debug! No exception here.

Look at what this does:

var x = [1, 2, 3];
for(var i in x) console.log(i); //Prints 0, 1, 2
require('feedparser');
for(var i in x) console.log(i); //Prints 0, 1, 2, and unique!! NOOOOO!!!!

Please fix this at your earliest convenience. Thanks!
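
For reference, defining the helper as non-enumerable would avoid polluting for..in loops -- a sketch, not necessarily the fix the library adopted:

Object.defineProperty(Array.prototype, 'unique', {
  value: function () {
    return this.filter(function (el, i, arr) { return arr.indexOf(el) === i; });
  },
  writable: true,
  configurable: true,
  enumerable: false // keeps `for (var i in [1, 2, 3])` printing only 0, 1, 2
});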

"Unexpected End" (uncaught error)

Occasionally, I'm getting the following uncaught error from Feedparser, causing my node app to crash. Of course, I've bound the .on('error') handler, but this is still not caught.

Tue Nov 06 2012 04:06:32 GMT-0800 (PST): Uncaught exception: Error: Unexpected end
Line: 0
Column: 0
Char:
at error (/home/webapps/SGPArray/releases/20121105192815/api/node_modules/feedparser/node_modules/sax/lib/sax.js:352:8)
at end (/home/webapps/SGPArray/releases/20121105192815/api/node_modules/feedparser/node_modules/sax/lib/sax.js:359:32)
at Object.SAXParser.end (/home/webapps/SGPArray/releases/20121105192815/api/node_modules/feedparser/node_modules/sax/lib/sax.js:137:24)
at SAXStream.end (/home/webapps/SGPArray/releases/20121105192815/api/node_modules/feedparser/node_modules/sax/lib/sax.js:209:16)
at Request.onend (stream.js:66:10)
at Request.EventEmitter.emit (events.js:123:20)
at IncomingMessage.Request.start.self.req.self.httpModule.request.buffer (/home/webapps/SGPArray/releases/20121105192815/api/node_modules/feedparser/node_modules/request/main.js:517:14)
at IncomingMessage.exports.callback.args.(anonymous function) (/home/webapps/SGPArray/releases/20121105192815/api/node_modules/nodetime/lib/proxy.js:81:20)
at IncomingMessage.EventEmitter.emit (events.js:123:20)
at IncomingMessage._emitEnd (http.js:366:10)

Not working with Facebook RSS

var feedparser = require('feedparser');
var request = require('request');

function callback (article) {
  console.log('Got article: %s', JSON.stringify(article));
}

feedparser.parseUrl('https://www.facebook.com/feeds/page.php?id=158797964140471&format=rss20')
  .on('article', callback);

I got the following error:

events.js:71
        throw arguments[1]; // Unhandled 'error' event
                       ^
Error: Remote server did not respond with a feed
    at Request.FeedParser.parseUrl.handleResponse (/Users/lusi/SkyDrive/Projects/Web/arryone/node_modules/feedparser/main.js:1169:13)
    at Request.EventEmitter.emit (events.js:96:17)
    at ClientRequest.<anonymous> (/Users/lusi/SkyDrive/Projects/Web/arryone/node_modules/feedparser/node_modules/request/main.js:521:12)
    at ClientRequest.g (events.js:192:14)
    at ClientRequest.EventEmitter.emit (events.js:96:17)
    at HTTPParser.parserOnIncomingClient [as onIncoming] (http.js:1462:7)
    at HTTPParser.parserOnHeadersComplete [as onHeadersComplete] (http.js:111:23)
    at Socket.socketOnData [as ondata] (http.js:1367:20)
    at TCP.onread (net.js:403:27)

Exception while parsing.

The following test data generates an exception, most likely due to the duplicate "admin:generatorAgent" tag.

<rss version="2.0"
     xmlns:dc="http://purl.org/dc/elements/1.1/"
     xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
     xmlns:admin="http://webns.net/mvcb/"
     xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
     xmlns:content="http://purl.org/rss/1.0/modules/content/">

    <channel>
        <title>Lesnoticies.com </title>
        <link>http://www.lesnoticies.com</link>
        <description>Periódicu dixital n'asturianu</description>
        <dc:language>es-es</dc:language>
        <dc:creator>Publicaciones Ámbitu S.L.</dc:creator>
        <dc:rights>Copyright 2012</dc:rights>
        <admin:generatorAgent rdf:resource="http://www.codeigniter.com/" />

    <dc:rights>Copyright 2012</dc:rights>
    <admin:generatorAgent rdf:resource="http://www.codeigniter.com/" />
            <item>
          <title>Mayu acabó n&#39;Asturies con 1.777 persones menos nes llistes del paru</title>
          <link>http://www.lesnoticies.com/lesnoticies/noticies/mayu&#45;acabo&#45;n&#39;asturies&#45;con&#45;1.777&#45;persones&#45;menos&#45;nes&#45;llistes&#45;del&#45;paru/9745</link>
          <guid>i feel unique</guid>
          <description>test</description>
      <pubDate>Mon, 04 Jun 2012 15:16:40 +0200</pubDate>
        </item>
        </channel>
</rss>

Thanks for the lib!

Crash using parseString

This works:

    var feedparser = require('feedparser');
    feedparser.parseUrl("http://code.google.com/feeds/p/trophyim/updates/basic", function (err, data) {
      console.log(err, data);
    });

this doesn't:

  var request = require('request');
  var feedparser = require('feedparser');

  var reqObj = {'uri': "http://code.google.com/feeds/p/trophyim/updates/basic"};

  request(reqObj, function (err, response, body) {
    feedparser.parseString(body, function (err, data) {
      console.log(err, data);
    });
  });

it crashes with the error:

TypeError: Cannot use 'in' operator to search for '#' in false
    at FeedParser.handleText (/Users/gbird/projects/reader/node_modules/feedparser/main.js:340:28)
    at SAXStream.EventEmitter.emit (events.js:95:17)
    at Object.me._parser.(anonymous function) [as ontext] (/Users/gbird/projects/reader/node_modules/sax/lib/sax.js:220:15)
    at emit (/Users/gbird/projects/reader/node_modules/sax/lib/sax.js:589:33)
    at closeText (/Users/gbird/projects/reader/node_modules/sax/lib/sax.js:599:24)
    at end (/Users/gbird/projects/reader/node_modules/sax/lib/sax.js:625:3)
    at Object.SAXParser.end (/Users/gbird/projects/reader/node_modules/sax/lib/sax.js:137:24)
    at SAXStream.end (/Users/gbird/projects/reader/node_modules/sax/lib/sax.js:209:16)
    at Function.FeedParser.parseString (/Users/gbird/projects/reader/node_modules/feedparser/main.js:1032:6)
    at Request._callback (/Users/gbird/projects/reader/test.js:18:17)

parseUrl craps on a null url

Probably needs to take a null url and act gracefully. Thanks, Bob

parser = feedparser.parseUrl(null);
FeedParser.parseUrl.url=null
TypeError: Cannot use 'in' operator to search for 'href' in null
at Function.FeedParser.parseUrl (C:\Users\Mike\node\feedparser\node_modules\feedparser\main.js:1218:21)
at repl:1:21
at REPLServer.self.eval (repl.js:109:21)
at rli.on.self.bufferedCmd (repl.js:258:20)
at REPLServer.self.eval (repl.js:116:5)
at Interface. (repl.js:248:12)
at Interface.EventEmitter.emit (events.js:96:17)
at Interface._onLine (readline.js:200:10)
at Interface._line (readline.js:518:8)
at Interface._ttyWrite (readline.js:736:14)

Charset encoding

I can't seem to find a way to get hold of the charset encoding of a feed unless I download and parse the XML doc myself. Is there a way to

  • get feedparser to return xml encoding as a meta property, or
  • force it to return all text as UTF-8 encoding?
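
One workaround, pending built-in support, is to transcode the response before piping it to the parser -- a sketch assuming the charset is known ahead of time and using the iconv-lite module:

var request = require('request');
var iconv = require('iconv-lite');
var FeedParser = require('feedparser');

var feedparser = new FeedParser();
request('http://example.com/latin1-feed.xml') // hypothetical ISO-8859-1 feed
  .pipe(iconv.decodeStream('ISO-8859-1'))     // decode the raw bytes into UTF-8 strings
  .pipe(feedparser);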

Resolve all relative URIs

We do a pretty good job with annoying relative URIs in Atom feeds, but we could do even better.

The feed unwittingly doesn't declare xml:base, but still uses relative URIs. Let's follow the Python feedparser's lead as closely as possible:
http://packages.python.org/feedparser/resolving-relative-links.html#how-relative-uris-are-resolved

If no xml:base is specified, the feed has a default base URI defined in the Content-Location HTTP header.

If no Content-Location HTTP header is present, the URL used to retrieve the feed itself is the default base URI for all relative links within the feed. If the feed was retrieved via an HTTP redirect (any HTTP 3xx status code), then the final URL of the feed is the default base URI.
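
Condensed into code, that resolution order looks like this (a sketch; the function and parameter names are illustrative):

// Choose the base URI for resolving relative links in a feed.
function chooseBaseUri(xmlBase, contentLocationHeader, finalRequestUrl) {
  if (xmlBase) return xmlBase;                              // 1. xml:base declared in the feed
  if (contentLocationHeader) return contentLocationHeader;  // 2. Content-Location HTTP header
  return finalRequestUrl;                                   // 3. URL the feed was fetched from (after any 3xx redirects)
}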

Handle unexpected arrays

Feed producers are so bad. Example: http://www.tabletitans.com/feed

Hey, let's put two identical dc:date elements in every item! Wtf not?! We're inexplicably using dc bullshit in an RSS 2.0 feed anyway!

Maybe utils.get can be enhanced to handle all cases of arrays appearing where there should be a string and just use the first element of the array.
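
A sketch of what that enhancement might look like (illustrative, not the library's actual utils.get):

function get(node, subkey) {
  subkey = subkey || '#';
  if (Array.isArray(node)) node = node[0]; // duplicate elements: just take the first
  if (node && subkey in node) return node[subkey];
  return null;
}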

sax unexpected end error when piping 304 response to parser.stream

var FeedParser = require('feedparser');
var parser = new FeedParser();
// The following modules are used in the examples below
var request = require('request');

parser.on('article', function (article){
    console.log('Got article: %s', JSON.stringify(article));
});

var reqObj = {uri: 'http://cyber.law.harvard.edu/rss/examples/rss2sample.xml',
              headers: { 'If-Modified-Since': 'Fri, 06 Apr 2007 15:11:55 GMT',
                         'If-None-Match': '"d46a5b-9e0-42d731ba304c0"'
                       }
             };

request(reqObj).pipe(parser.stream);

Based on the example code (with modified reqObj to pass the conditional headers).
Throws an unexpected end error.
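
One way to sidestep this until it's fixed is to check the status code before piping -- a sketch:

request(reqObj).on('response', function (res) {
  if (res.statusCode === 200) {
    this.pipe(parser.stream); // `this` is the request
  } else {
    this.abort(); // 304 Not Modified: no body to parse
  }
});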

Error with RDF format.

When I parsed an RDF feed, I found that I got the type name as rdf:RDF, not rdf:rdf, and I couldn't get any information from the parser's result.

In http://www.w3.org/TR/rdf-primer/ they usually use rdf:RDF, so the library needs to deal with this situation. For now I just fixed it by adding n['#name'] === 'rdf:RDF' to the condition in the handleOpenTag function.

The test feed is http://www.w3.org/News/news.rss.

Pass data on ending callback

It could be useful to have a way to pass data on the ending callback.
I got stuck at this point, as I am doing multiple feed parsing at the same time and I need to distinguish which data comes from which feed.

function myCallback (error, meta, articles, [options]) ?

Not async

Hi, I was testing this library and I found that when I call obj.parseFile, execution is blocked until that process finishes.

I think it would be a great improvement to make it async, because although it uses an EventEmitter to send results, the execution is being blocked.

Just an idea =)

Use feed link for xml:base if no other option

What if a feed has relative URIs but doesn't give xml:base? We already fall back to the xmlUrl, but when we drop parseUrl support, we could get a feed with no xmlUrl, as it's not a required element.

Fall back to the channel:link.

Other fallback candidates?

Issue with parseString in the new API?

Hi there,

(I'm trying out this module for the first time)

Running this very simple example:

var feedparser = require('feedparser');
feedparser.parseString(string)
  .on('article', console.log);

The Issue: The event handler (console.log) never gets invoked.

As I understand it, that's because the string is already parsed by the time on() gets to register the event handler. To prove it, I modified FeedParser.parseString() in the feedparser source with a setTimeout like so:

FeedParser.parseString = function (string, options, callback) {
  var fp = feedparser(options, callback);
  setTimeout(function() {
    fp.stream
      .on('error', fp.handleError.bind(fp))
      .end(string, Buffer.isBuffer(string) ? null : 'utf8'); // Accommodate a Buffer in addition to a String
  },100)
  return fp;
};

and indeed, this made things work. Am I misunderstanding something?

PS: I'm running feedparser v0.10.6 with node.js v0.8.8.

Text child of "author" element not parsed

When I load a feed that contains bare text content in an author subelement of an item, the author property of the resulting article is null.

I believe this is because, at main.js#858, only name, email, and uri child elements of the author element are loaded. This is correct for Atom / RFC4287 feeds.

But in RSS 2.0, a bare text child of the author element is specified. I think the correct approach is going to be to accept the text child of author, in the case where no other subelements are found.

This construct was seen in the wild at http://www.tnr.com/rss/articles
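
A sketch of that fallback (the node shape shown is illustrative of how feedparser represents elements, with text content under '#'):

function itemAuthor(el) {
  // Prefer the structured Atom children (name, email).
  if (el.name && el.name['#']) return el.name['#'];
  if (el.email && el.email['#']) return el.email['#'];
  // RSS 2.0 allows a bare text child: <author>editor@example.com (Editor Name)</author>
  if (typeof el['#'] === 'string' && el['#'].trim()) return el['#'].trim();
  return null;
}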

Cannot parse Facebook feeds - callback error is null

Hello!

Just discovered the Facebook page-related feeds (maybe others too..?!) cannot be parsed, either in RSS 2.0 or Atom 1.0.

Examples (with or without https):
http://www.facebook.com/feeds/page.php?format=rss20&id=121431088001283
http://www.facebook.com/feeds/page.php?format=atom10&id=121431088001283

When the feed is finished being parsed, the callback provides the following values (JSON stringified):

  • error = null
  • meta = {"#ns":[],"@":[]}
  • articles = []
  • and there is no exception thrown.

Additionally, when the URL returns a 404 error, the behavior is the same (example with http://www.google.com/go404.xml), except that if the 404 page contains some xmlns declarations, they will be populated in #ns this way: meta = {"#ns":[{"xmlns":"http://www.w3.org/1999/xhtml"}],"@":[]}.

It's possible to check if something goes wrong using if (meta.xmlUrl === undefined) // do something but it is impossible to know what the problem is.

Parsing Engadget RSS feed continuously causes feedparser to freeze

Parsing feeds using setInterval works with other feeds out there, but the Engadget feed somehow causes feedparser to freeze. There could, of course, be other feeds that do not work as well.

I've not had time to look at the feedparser code, but I think I found a way to reproduce this:

  1. Create file including the code below:
var feedparser = require('feedparser');
var request = require('request');

function callback (article) {
  console.log(Date.now()+': Got article: ' + article.guid);
}

function parseEngadget() {
    feedparser.parseUrl('http://www.engadget.com/rss.xml').on('article', callback);
}

setInterval(parseEngadget, 1000);
  2. Run it. Notice that every second, article guids are printed to the console.
  3. After a while (30 minutes to a few hours), no new lines are printed to the console.
  4. Change the feed url in the code to something else (I tested with http://yle.fi/uutiset/rss/paauutiset.rss).
  5. Test again and notice that the freeze does not occur even after a long time.

Difference between parseUrl and parseString

If I do:

    var feedparser = require('feedparser');
    feedparser.parseUrl("http://github.com/feeds/jeffmcfadden/commits/magicframework/master", function (err, data) {
      console.log(err, data);
    });

it works!

If I use 'request' to get the data and use parseString instead:

  var request = require('request');
  var feedparser = require('feedparser');

  var reqObj = {'uri': "http://github.com/feeds/jeffmcfadden/commits/magicframework/master"};

  request(reqObj, function (err, response, body) {
    feedparser.parseString(body, function (err, data) {
      console.log(err, data);
    });
  });

It dies!

with the error:

TypeError: Parameter 'url' must be a string, not object
at Url.parse (url.js:118:11)
at urlParse (url.js:112:5)
at Url.resolve (url.js:406:29)
at Object.urlResolve as resolve
at /Users/gbird/projects/reader/node_modules/feedparser/utils.js:163:39
at Array.forEach (native)
at /Users/gbird/projects/reader/node_modules/feedparser/utils.js:161:19
at Array.forEach (native)
at resolveLevel (/Users/gbird/projects/reader/node_modules/feedparser/utils.js:152:9)
at Object.reresolve (/Users/gbird/projects/reader/node_modules/feedparser/utils.js:173:10)

This is on Node 0.10.0

Example request in README is wrong

Where it says:

    var reqObj = {'uri': 'http://cyber.law.harvard.edu/rss/examples/rss2sample.xml',
                  'If-Modified-Since' : <your cached 'lastModified' value>,
                  'If-None-Match' : <your cached 'etag' value>};

it should say:

    var reqObj = {'uri': 'http://cyber.law.harvard.edu/rss/examples/rss2sample.xml',
                  'headers': {'If-Modified-Since' : <your cached 'lastModified' value>,
                              'If-None-Match' : <your cached 'etag' value>}};

Inconsistent events

Using feedparser 0.9.10, there is a small issue when using the parseUrl() function.
If the supplied URL triggers an error such as ETIMEOUT or ENOENT, that error is emitted as an 'error' event. However there is no 'end' event emitted afterwards, which is inconsistent with how parse errors are handled; they first emit 'error', then later an 'end' event follows.

Small code sample I have used to provoke the situation:

var FeedParser = require('feedparser');
var parser = new FeedParser();

parser.on('error', function onParserError(error) {
    console.log("ERROR: " + error);
});

parser.on('end', function onParserEnd(articles) {
    console.log("END: " + articles.length);
});

// Generates ENOENT error, but no 'end' event:
parser.parseUrl("http://invalid.feed.url/whatever.xml");

The only workarounds I can think of at the moment is either parse the error message for know request errors (ugly!) or do the request myself and only use feedparser for parsing a successful response (e.g. parseString()).

Problem with ISO-8859-1 feeds

Hi,

I'm tearing my hair out... ^^ I have an encoding problem with ISO-8859-1 feeds and feedparser.

For exemple with :
http://fr.canoe.ca/rss/feed/nouvelles/aujourdhui.xml

The title "Blessures très sévères · Un automobiliste percute un tracteur"
becomes "Blessures tr�s s�v�res � Un automobiliste percute un tracteur"

I do simply feedparser.parseUrl(myUrl)
and .on('article') -> console.log(article.title)

I tried to play with encoding, iconv, etc. but without success.
I can't find where the problem is. Is it a bug, or am I forgetting something?

node v0.10.0 issue with url parse

try the code below

var parser = require('feedparser');
var url = 'http://com-stol.ru/?feed=rss2';
parser.parseUrl(
  url,
  { addmeta: false, feedurl: true },
  function (e, meta, articles) {
    console.log('got %d articles from %s feed', articles.length, url);
  }
);

results:

url.js:118
throw new TypeError("Parameter 'url' must be a string, not " + typeof url)
^
TypeError: Parameter 'url' must be a string, not boolean
at Url.parse (url.js:118:11)
at urlParse (url.js:112:5)
at Object.urlResolve as resolve
at Object.resolve (/Users/vseledkin/WORK/feedstore/node_modules/feedparser/utils.js:111:14)
at FeedParser. (/Users/vseledkin/WORK/feedstore/node_modules/feedparser/main.js:388:26)
at Array.forEach (native)
at FeedParser.handleAttributes (/Users/vseledkin/WORK/feedstore/node_modules/feedparser/main.js:370:22)
at FeedParser.handleOpenTag (/Users/vseledkin/WORK/feedstore/node_modules/feedparser/main.js:170:19)
at SAXStream.EventEmitter.emit (events.js:95:17)

I assume this has something to do with URL API changes in node v0.10.0.

parseUrl called in succession will only invoke the callback on the last call.

I'm not sure if you consider this a bug or not (I do), but there are two ways to call parseUrl in succession:

This is the way you would think would work; alas, it will only call dostuff once, with the articles from url3:

var parser = new FeedParser();
parser.parseUrl(url, function (err, meta, articles) { dostuff(articles); });
parser.parseUrl(url2, function (err, meta, articles) { dostuff(articles); });
parser.parseUrl(url3, function (err, meta, articles) { dostuff(articles); });

This is the way that actually works

var parser = new FeedParser();
parser.parseUrl(url, function (err, meta, articles) { dostuff(articles); });
var parser2 = new FeedParser();
parser2.parseUrl(url2, function (err, meta, articles) { dostuff(articles); });
var parser3 = new FeedParser();
parser3.parseUrl(url3, function (err, meta, articles) { dostuff(articles); });

Because of the way the parser works, it will constantly overwrite its stream with the latest call to parseUrl. My suggestion would be something like having per-url streams.

Date.parse() falls down a lot

Hey Dan,

Thanks for this project! I'm using it for http://magnet.io and it's saving me a lot of effort.

Maybe the project should include moment.js or another date parsing library for the pubdate parsing. There's a lot of feeds where the various date entries (dc:date, pubdate, etc.) just aren't detected correctly and thus return a null. I ran into this example last night:

http://www.spinner.ca/rsscanada.xml

The date is formatted like:

"2013-03-17T12:44:00 00:00"

Which is almost but not quite what Javascript likes - that space at the end causes Date.parse() to fail, where a "-" would be fine. There's lots and lots of other badly formatted dates, but this shows how touchy Date.parse() can be. Moment seems to be able to parse it well enough, though really something more like Simplepie's date parsing script might be called for:

https://github.com/simplepie/simplepie/blob/master/library/SimplePie/Parse/Date.php

If I get around to converting that to Javascript, I'll send you a pull request. ;-)

As a workaround now, when I see a null, I dive into the meta object and see if moment.js can parse the date and it seems to be working well enough, but it should probably be in your library directly.

Thanks again!

-Russ
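
A sketch of that workaround (assuming moment is installed; the raw property name varies by feed, and dc:date here is illustrative):

var moment = require('moment');

function bestDate(article) {
  if (article.pubdate) return article.pubdate; // feedparser managed to parse it
  var raw = article['dc:date'] && article['dc:date']['#']; // raw value preserved from the feed
  if (!raw) return null;
  var m = moment(raw);
  return m.isValid() ? m.toDate() : null;
}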

Buffer overrun in sax causes crash

When trying to parseUrl(“http://altdevblogaday.org/feed/”) an exception (Error: Max buffer length exceeded: procInstBody) will be thrown by sax.
The error is actually reported by feedparsers error-event, but the whole node process exits immediately afterwards, instead of emitting the end-event.
I am pretty sure this feed is somehow wrong or at least a corner case, since it is the only one out of many hundreds that causes feedparser trouble.

You can circumvent the whole problem by increasing sax.MAX_BUFFER_LENGTH, but maybe there is a way that feedparser could handle this error more gracefully.
It is especially bad when using feedparser in conjunction with async.queue, because it changes the behaviour from crashing the process to the queue never being drained (because feedparser's end event is simply not emitted after the error).

versions
node: 0.8.15
current feedparser 0.10.8

GeoRSS?

Any chance of getting GeoRSS fields supported? (Or maybe they already are and it's just my feed that's the problem?)

media:content is added to enclosures array

In the FeedParser.prototype.handleItem switch statement, media:content is treated as an enclosure. This does not seem correct: for instance, if the same url is in both places, it ends up twice in the enclosures array. Since I don't know why you did it, I am confused.

Versions <=0.9.1: not clear when parseFile's callback is triggered

Hi danmactough,
Can you explain here or in the documentation when parseFile's callback is triggered, or what event we should monitor to assess that all articles have been parsed?

My expectation was that the callback is called when all articles have been parsed, but some strange behaviour of my code suggests differently. It may be my mistake, but it would be good to hear how you designed that, too.

Thanks!

G.

Author/creator missing in result

Hello,

Do you plan to include the author (RSS) or dc:creator (Atom) in the resulting article?
Since these are standard fields in at least both RSS and Atom (not sure about RDF), they should be included. If you don't have that much time, I can also try to change it myself and send you a pull request, but that could take some time, since node.js is a language I currently don't understand; I only use it for now :)

Thanks!

Giving bad URL causes bad behavior

with query = "http://lifehacker.com/"

var parser = new FeedParser().parseUrl(query, function (err, meta, art) {
//...
});

we have:

TypeError: Cannot call method 'trim' of undefined
at /usr/home/_/data/www/_**/node_modules/feedparser/main.js:101:60

Handling feed modifications

Hi,
Some little newbie questions:

  1. How do you correctly handle If-Modified-Since? It seems the server (Google Calendar) always sends me back the full feed. (See the sketch after this list.)
  2. How do you monitor changes? Should I use something like 'ncb000gt / node-cron' to trigger a request every X minutes?
  3. Is there a smart way to handle only new feed elements? Or should I parse dates until I get something? Any good date parser?
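
On question 1, a sketch of a conditional GET with the request module (the header values shown are hypothetical, saved from a previous response):

var request = require('request');

// Values saved from a previous response's headers:
var lastModified = 'Fri, 06 Apr 2007 15:11:55 GMT';
var etag = '"d46a5b-9e0-42d731ba304c0"';

request({
  uri: 'http://example.com/feed.xml',
  headers: {
    'If-Modified-Since': lastModified,
    'If-None-Match': etag
  }
}, function (err, res, body) {
  if (!err && res.statusCode === 304) {
    // Nothing has changed since the last fetch -- skip parsing
  }
});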

Incorrect origlink when parsing atom feeds containing <link rel="canonical">

When I tried to parse a feed containing the lines

<link rel="alternate" href="http://feedproxy.google.com/~r/smittenkitchen/~3/EeHjPgqUpmA/" />
<link rel="canonical" href="http://smittenkitchen.com/2012/07/blackberry-gin-fizz/" />

(and no other link tags)

The link field contained "http://feedproxy.google.com/~r/smittenkitchen/~3/EeHjPgqUpmA/" (Probably the desired value)
But the origlink field contained null, instead of "http://smittenkitchen.com/2012/07/blackberry-gin-fizz/"

Here is a sample feed for replication of the issue: http://www.google.com/reader/atom/feed/http://feeds.feedburner.com/smittenkitchen?n=1

I can fix it and submit a pull request if my assessment of the issue seems correct

Discovering Pubsubhubbub

Hello,

I'm trying to figure out the best way to extract any pubsubhubbub reference from a given feed. I could just listen for the response event, but that would exclude functions like parseString. Plus it seems kinda stupid when feedparser is already parsing the feed. :)

For those who do not know: hubs are advertised as a <link> tag with a rel="hub" attribute, e.g. <link rel="hub" href="http://hubby.com/?subscribe" />.

I tried just putting some code in handleOpenTag:

  if (node.name === n['#prefix'] + ':link') {
    if (n['@']['rel'] === 'hub') {
      this.meta['#hub'] = n['@']['href'];
    }
  }

Neglecting the fact that the colon is hard coded and stuff, that works for the feeds I've tested. It's not very elegant though and probably breaks on lots of feeds. :P

Is this something that you as the author @danmactough feels like it could fit within feedparser? I'd be willing to make a patch and submit a pull requests. I just wanted to raise the issue and ask for some pointers. :)

Wrong example with parseUrl and object

Hi,

I think this example cannot work:

// Or you could try letting feedparser handle working with request (experimental)
feedparser.parseUrl(reqObj)
  .on('response', function (response){
    // do something like save the HTTP headers for a future request
  })
  .on('article', callback);

Because, in your code, you are only sending the url information to request:

var req = {
  uri: url,
  headers: { 'Accept-Encoding': 'identity' }
};

request(req)
  .on('error', fp.handleError.bind(fp))
  .on('response', handleResponse)
  .pipe(fp.stream)
  ;

Thank you for your work! :)

Edwin
