node-feedparser's Introduction

Feedparser - Robust RSS, Atom, and RDF feed parsing in Node.js

Feedparser is for parsing RSS, Atom, and RDF feeds in node.js.

It has a couple of features you don't usually see in other feed parsers:

  1. It resolves relative URLs (such as those seen in Tim Bray's "ongoing" feed).
  2. It properly handles XML namespaces (including those in unusual feeds that define a non-default namespace for the main feed elements).

Installation

npm install feedparser

Usage

This example is just to briefly demonstrate basic concepts.

Please also review the complete example for a thorough working example that is a suitable starting point for your app.

var FeedParser = require('feedparser');
var fetch = require('node-fetch'); // for fetching the feed

var req = fetch('http://somefeedurl.xml');
var feedparser = new FeedParser(); // pass an options object here if needed (see "options" below)

req.then(function (res) {
  if (res.status !== 200) {
    throw new Error('Bad status code');
  }
  else {
    // The response `body` -- res.body -- is a stream
    res.body.pipe(feedparser);
  }
}, function (err) {
  // handle any request errors
});

feedparser.on('error', function (error) {
  // always handle errors
});

feedparser.on('readable', function () {
  // This is where the action is!
  var stream = this; // `this` is `feedparser`, which is a stream
  var meta = this.meta; // **NOTE** the "meta" is always available in the context of the feedparser instance
  var item;

  while ((item = stream.read())) {
    console.log(item);
  }
});

You can also check out this nice working implementation that demonstrates one way to handle all the hard and annoying stuff. 😃

options

  • normalize - Set to false to override Feedparser's default behavior, which is to parse feeds into an object that contains the generic properties patterned after (although not identical to) the RSS 2.0 format, regardless of the feed's format.

  • addmeta - Set to false to override Feedparser's default behavior, which is to add the feed's meta information to each article.

  • feedurl - The url (string) of the feed. FeedParser is very good at resolving relative urls in feeds. But some feeds use relative urls without declaring the xml:base attribute any place in the feed. This is perfectly valid, but we don't know the feed's url before we start parsing the feed and trying to resolve those relative urls. If we discover the feed's url, we will go back and resolve the relative urls we've already seen, but this takes a little time (not much). If you want to be sure we never have to re-resolve relative urls (or if FeedParser is failing to properly resolve relative urls), you should set the feedurl option (see the example after this list). Otherwise, feel free to ignore this option.

  • resume_saxerror - Set to false to override Feedparser's default behavior, which is to emit any SAXError on error and then automatically resume parsing. In my experience, SAXErrors are not usually fatal, so this is usually helpful behavior. If you want total control over handling these errors and optionally aborting parsing the feed, use this option.
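
For example, a minimal sketch of constructing a parser with options (the values shown are illustrative):

var FeedParser = require('feedparser');

var feedparser = new FeedParser({
  normalize: true,                        // the default
  addmeta: false,                         // don't copy the feed meta onto each article
  feedurl: 'http://example.com/feed.xml', // helps resolve relative urls
  resume_saxerror: true                   // the default
});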

Examples

See the examples directory.

API

Transform Stream

Feedparser is a transform stream operating in "object mode": XML in -> JavaScript objects out. Each readable chunk is an object representing an article in the feed.

Events Emitted

  • meta - called with feed meta when it has been parsed
  • error - called with error whenever there is a Feedparser error of any kind (SAXError, Feedparser error, etc.)
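
A minimal pair of listeners (a sketch using the events above):

feedparser.on('meta', function (meta) {
  console.log('Feed title: %s', meta.title);
});

feedparser.on('error', function (error) {
  // SAXErrors and feedparser errors both arrive here
});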

What is the parsed output produced by feedparser?

Feedparser parses each feed into a meta portion (emitted on the meta event) and one or more articles (emitted on the data event, or readable via read() after the readable event is emitted).

Regardless of the format of the feed, the meta and each article contain a uniform set of generic properties patterned after (although not identical to) the RSS 2.0 format, as well as all of the properties originally contained in the feed. So, for example, an Atom feed may have a meta.description property, but it will also have a meta['atom:subtitle'] property.

The purpose of the generic properties is to provide the user a uniform interface for accessing a feed's information without needing to know the feed's format (i.e., RSS versus Atom) or having to worry about handling the differences between the formats. However, the original information is also there, in case you need it. In addition, Feedparser supports some popular namespace extensions (or portions of them), such as portions of the itunes, media, feedburner and pheedo extensions. So, for example, if a feed article contains either an itunes:image or media:thumbnail, the url for that image will be contained in the article's image.url property.

All generic properties are "pre-initialized" to null (or to empty arrays or objects for certain properties). This should save you from having to do a lot of checking for undefined, for example when you are using Jade templates.

In addition, all properties (and namespace prefixes) use only lowercase letters, regardless of how they were capitalized in the original feed. ("xmlUrl" and "pubDate" also are still used to provide backwards compatibility.) This decision places ease-of-use over purity -- hopefully, you will never need to think about whether you should camelCase "pubDate" ever again.

The title and description properties of meta and the title property of each article have any HTML stripped if you let feedparser normalize the output. If you really need the HTML in those elements, there are always the originals: e.g., meta['atom:subtitle']['#'].
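
For example, a sketch contrasting the two (atom:subtitle is only present when the source feed is Atom):

feedparser.on('meta', function (meta) {
  console.log(meta.description);             // normalized, HTML stripped
  if (meta['atom:subtitle']) {
    console.log(meta['atom:subtitle']['#']); // original value, HTML intact
  }
});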

List of meta properties

  • title
  • description
  • link (website link)
  • xmlurl (the canonical link to the feed, as specified by the feed)
  • date (most recent update)
  • pubdate (original published date)
  • author
  • language
  • image (an Object containing url and title properties)
  • favicon (a link to the favicon -- only provided by Atom feeds)
  • copyright
  • generator
  • categories (an Array of Strings)

List of article properties

  • title
  • description (frequently, the full article content)
  • summary (frequently, an excerpt of the article content)
  • link
  • origlink (when FeedBurner or Pheedo puts a special tracking url in the link property, origlink contains the original link)
  • permalink (when an RSS feed has a guid field and the isPermaLink attribute is not set to false, permalink contains the value of guid)
  • date (most recent update)
  • pubdate (original published date)
  • author
  • guid (a unique identifier for the article)
  • comments (a link to the article's comments section)
  • image (an Object containing url and title properties)
  • categories (an Array of Strings)
  • source (an Object containing url and title properties pointing to the original source for an article; see the RSS Spec for an explanation of this element)
  • enclosures (an Array of Objects, each representing a podcast or other enclosure and having a url property and possibly type and length properties)
  • meta (an Object containing all the feed meta properties; especially handy when using the EventEmitter interface to listen to article emissions)

Help

  • Don't be afraid to report an issue.
  • You can drop by Gitter, too.

Contributors

View all the contributors.

Although node-feedparser no longer shares any code with node-easyrss, it was the original inspiration and a starting point.

License

(The MIT License)

Copyright (c) 2011-2020 Dan MacTough and contributors

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the 'Software'), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

node-feedparser's Issues

VERY IMPORTANT! Do not override Array.prototype!

See https://github.com/danmactough/node-feedparser/blob/master/lib/utils.js#L70

Please do not change the Array prototype. It is very bad and causes major headaches that are very difficult to debug! No exception here.

Look at what this does:

var x = [1, 2, 3];
for(var i in x) console.log(i); //Prints 0, 1, 2
require('feedparser');
for(var i in x) console.log(i); //Prints 0, 1, 2, and unique!! NOOOOO!!!!

Please fix this at your earliest convenience. Thanks!
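
For reference, defining the helper as non-enumerable would avoid polluting for..in loops -- a sketch, not necessarily the fix the library adopted:

Object.defineProperty(Array.prototype, 'unique', {
  value: function () {
    return this.filter(function (el, i, arr) { return arr.indexOf(el) === i; });
  },
  writable: true,
  configurable: true,
  enumerable: false // keeps `for (var i in [1, 2, 3])` printing only 0, 1, 2
});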

"Unexpected End" (uncaught error)

Occasionally, I'm getting the following uncaught error from Feedparser, causing my node app to crash. Of course, I've bound the .on('error') handler, but this is still not caught.

Tue Nov 06 2012 04:06:32 GMT-0800 (PST): Uncaught exception: Error: Unexpected end
Line: 0
Column: 0
Char:
at error (/home/webapps/SGPArray/releases/20121105192815/api/node_modules/feedparser/node_modules/sax/lib/sax.js:352:8)
at end (/home/webapps/SGPArray/releases/20121105192815/api/node_modules/feedparser/node_modules/sax/lib/sax.js:359:32)
at Object.SAXParser.end (/home/webapps/SGPArray/releases/20121105192815/api/node_modules/feedparser/node_modules/sax/lib/sax.js:137:24)
at SAXStream.end (/home/webapps/SGPArray/releases/20121105192815/api/node_modules/feedparser/node_modules/sax/lib/sax.js:209:16)
at Request.onend (stream.js:66:10)
at Request.EventEmitter.emit (events.js:123:20)
at IncomingMessage.Request.start.self.req.self.httpModule.request.buffer (/home/webapps/SGPArray/releases/20121105192815/api/node_modules/feedparser/node_modules/request/main.js:517:14)
at IncomingMessage.exports.callback.args.(anonymous function) (/home/webapps/SGPArray/releases/20121105192815/api/node_modules/nodetime/lib/proxy.js:81:20)
at IncomingMessage.EventEmitter.emit (events.js:123:20)
at IncomingMessage._emitEnd (http.js:366:10)

Not working with Facebook RSS

var feedparser = require('feedparser');
var request = require('request');

function callback (article) {
  console.log('Got article: %s', JSON.stringify(article));
}

feedparser.parseUrl('https://www.facebook.com/feeds/page.php?id=158797964140471&format=rss20')
  .on('article', callback);

I got the following error:

events.js:71
        throw arguments[1]; // Unhandled 'error' event
                       ^
Error: Remote server did not respond with a feed
    at Request.FeedParser.parseUrl.handleResponse (/Users/lusi/SkyDrive/Projects/Web/arryone/node_modules/feedparser/main.js:1169:13)
    at Request.EventEmitter.emit (events.js:96:17)
    at ClientRequest.<anonymous> (/Users/lusi/SkyDrive/Projects/Web/arryone/node_modules/feedparser/node_modules/request/main.js:521:12)
    at ClientRequest.g (events.js:192:14)
    at ClientRequest.EventEmitter.emit (events.js:96:17)
    at HTTPParser.parserOnIncomingClient [as onIncoming] (http.js:1462:7)
    at HTTPParser.parserOnHeadersComplete [as onHeadersComplete] (http.js:111:23)
    at Socket.socketOnData [as ondata] (http.js:1367:20)
    at TCP.onread (net.js:403:27)

Exception while parsing.

The following test data generates an exception, most likely due to the duplicate "admin:generatorAgent" tag.

<rss version="2.0"
     xmlns:dc="http://purl.org/dc/elements/1.1/"
     xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
     xmlns:admin="http://webns.net/mvcb/"
     xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
     xmlns:content="http://purl.org/rss/1.0/modules/content/">

    <channel>
        <title>Lesnoticies.com </title>
        <link>http://www.lesnoticies.com</link>
        <description>Periódicu dixital n'asturianu</description>
        <dc:language>es-es</dc:language>
        <dc:creator>Publicaciones Ámbitu S.L.</dc:creator>
        <dc:rights>Copyright 2012</dc:rights>
        <admin:generatorAgent rdf:resource="http://www.codeigniter.com/" />

    <dc:rights>Copyright 2012</dc:rights>
    <admin:generatorAgent rdf:resource="http://www.codeigniter.com/" />
            <item>
          <title>Mayu acabó n&#39;Asturies con 1.777 persones menos nes llistes del paru</title>
          <link>http://www.lesnoticies.com/lesnoticies/noticies/mayu&#45;acabo&#45;n&#39;asturies&#45;con&#45;1.777&#45;persones&#45;menos&#45;nes&#45;llistes&#45;del&#45;paru/9745</link>
          <guid>i feel unique</guid>
          <description>test</description>
      <pubDate>Mon, 04 Jun 2012 15:16:40 +0200</pubDate>
        </item>
        </channel>
</rss>

Thanks for the lib!

Crash using parseString

This works:

    var feedparser = require('feedparser');
    feedparser.parseUrl("http://code.google.com/feeds/p/trophyim/updates/basic", function (err, data) {
      console.log(err, data);
    });

this doesn't:

  var request = require('request');
  var feedparser = require('feedparser');

  var reqObj = {'uri': "http://code.google.com/feeds/p/trophyim/updates/basic"};

  request(reqObj, function (err, response, body) {
    feedparser.parseString(body, function (err, data) {
      console.log(err, data);
    });
  });

it crashes with the error:

TypeError: Cannot use 'in' operator to search for '#' in false
    at FeedParser.handleText (/Users/gbird/projects/reader/node_modules/feedparser/main.js:340:28)
    at SAXStream.EventEmitter.emit (events.js:95:17)
    at Object.me._parser.(anonymous function) [as ontext] (/Users/gbird/projects/reader/node_modules/sax/lib/sax.js:220:15)
    at emit (/Users/gbird/projects/reader/node_modules/sax/lib/sax.js:589:33)
    at closeText (/Users/gbird/projects/reader/node_modules/sax/lib/sax.js:599:24)
    at end (/Users/gbird/projects/reader/node_modules/sax/lib/sax.js:625:3)
    at Object.SAXParser.end (/Users/gbird/projects/reader/node_modules/sax/lib/sax.js:137:24)
    at SAXStream.end (/Users/gbird/projects/reader/node_modules/sax/lib/sax.js:209:16)
    at Function.FeedParser.parseString (/Users/gbird/projects/reader/node_modules/feedparser/main.js:1032:6)
    at Request._callback (/Users/gbird/projects/reader/test.js:18:17)

parseUrl craps on a null url

Probably needs to take a null url and act gracefully. Thanks, Bob

parser = feedparser.parseUrl(null);
FeedParser.parseUrl.url=null
TypeError: Cannot use 'in' operator to search for 'href' in null
at Function.FeedParser.parseUrl (C:\Users\Mike\node\feedparser\node_modules\feedparser\main.js:1218:21)
at repl:1:21
at REPLServer.self.eval (repl.js:109:21)
at rli.on.self.bufferedCmd (repl.js:258:20)
at REPLServer.self.eval (repl.js:116:5)
at Interface. (repl.js:248:12)
at Interface.EventEmitter.emit (events.js:96:17)
at Interface._onLine (readline.js:200:10)
at Interface._line (readline.js:518:8)
at Interface._ttyWrite (readline.js:736:14)

Charset encoding

I can't seem to find a way to get hold of the charset encoding of a feed unless I download and parse the XML doc myself. Is there a way to

  • get feedparser to return xml encoding as a meta property, or
  • force it to return all text as UTF-8 encoding?
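
One workaround, pending built-in support, is to transcode the response before piping it to the parser -- a sketch assuming the charset is known ahead of time and using the iconv-lite module:

var request = require('request');
var iconv = require('iconv-lite');
var FeedParser = require('feedparser');

var feedparser = new FeedParser();
request('http://example.com/latin1-feed.xml') // hypothetical ISO-8859-1 feed
  .pipe(iconv.decodeStream('ISO-8859-1'))     // decode the raw bytes into UTF-8 strings
  .pipe(feedparser);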

Resolve all relative URIs

We do a pretty good job with annoying relative URIs in Atom feeds, but we could do even better.

The feed unwittingly doesn't declare xml:base, but still uses relative URIs. Let's follow the Python feedparser's lead as closely as possible:
http://packages.python.org/feedparser/resolving-relative-links.html#how-relative-uris-are-resolved

If no xml:base is specified, the feed has a default base URI defined in the Content-Location HTTP header.

If no Content-Location HTTP header is present, the URL used to retrieve the feed itself is the default base URI for all relative links within the feed. If the feed was retrieved via an HTTP redirect (any HTTP 3xx status code), then the final URL of the feed is the default base URI.
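
Condensed into code, that resolution order looks like this (a sketch; the function and parameter names are illustrative):

// Choose the base URI for resolving relative links in a feed.
function chooseBaseUri(xmlBase, contentLocationHeader, finalRequestUrl) {
  if (xmlBase) return xmlBase;                              // 1. xml:base declared in the feed
  if (contentLocationHeader) return contentLocationHeader;  // 2. Content-Location HTTP header
  return finalRequestUrl;                                   // 3. URL the feed was fetched from (after any 3xx redirects)
}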

Handle unexpected arrays

Feed producers are so bad. Example: http://www.tabletitans.com/feed

Hey, let's put two identical dc:date elements in every item! Wtf not?! We're inexplicably using dc bullshit in an RSS 2.0 feed anyway!

Maybe utils.get can be enhanced to handle all cases of arrays appearing where there should be a string and just use the first element of the array.
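
A sketch of what that enhancement might look like (illustrative, not the library's actual utils.get):

function get(node, subkey) {
  subkey = subkey || '#';
  if (Array.isArray(node)) node = node[0]; // duplicate elements: just take the first
  if (node && subkey in node) return node[subkey];
  return null;
}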

sax unexpected end error when piping 304 response to parser.stream

var FeedParser = require('feedparser');
var parser = new FeedParser();
// The following modules are used in the examples below
var request = require('request');

parser.on('article', function (article){
    console.log('Got article: %s', JSON.stringify(article));
});

var reqObj = {uri: 'http://cyber.law.harvard.edu/rss/examples/rss2sample.xml',
              headers: { 'If-Modified-Since': 'Fri, 06 Apr 2007 15:11:55 GMT',
                         'If-None-Match': '"d46a5b-9e0-42d731ba304c0"'
                       }
             };

request(reqObj).pipe(parser.stream);

Based on the example code (with modified reqObj to pass the conditional headers).
Throws an unexpected end error.
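
One way to sidestep this until it's fixed is to check the status code before piping -- a sketch:

request(reqObj).on('response', function (res) {
  if (res.statusCode === 200) {
    this.pipe(parser.stream); // `this` is the request
  } else {
    this.abort(); // 304 Not Modified: no body to parse
  }
});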

Error with RDF format.

When I parsed an RDF feed, I found that I got the type name as rdf:RDF, not rdf:rdf, and I couldn't get any information from the parser's result.

In http://www.w3.org/TR/rdf-primer/ they usually use rdf:RDF, so the library needs to deal with this situation. For now I just fixed it by adding n['#name'] === 'rdf:RDF' to the condition in the handleOpenTag function.

The test feed is http://www.w3.org/News/news.rss.

Pass data on ending callback

It could be useful to have a way to pass data on the ending callback.
I got stuck at this point, as I am doing multiple feed parsing at the same time and I need to distinguish which data comes from which feed.

function myCallback (error, meta, articles, [options]) ?

Not async

Hi, I was testing this library and I found that when I call obj.parseFile, execution is blocked until that process finishes.

I think it would be a great improvement to make it async, because although it uses an EventEmitter to send results, the execution is being blocked.

Just an idea =)

Use feed link for xml:base if no other option

What if a feed has relative URIs but doesn't give xml:base? We already fall back to the xmlUrl, but when we drop parseUrl support, we could get a feed with no xmlUrl, as it's not a required element.

Fall back to the channel:link.

Other fallback candidates?

Issue with parseString in the new API?

Hi there,

(I'm trying out this module for the first time)

Running this very simple example:

var feedparser = require('feedparser');
feedparser.parseString(string)
  .on('article', console.log);

The Issue: The event handler (console.log) never gets invoked.

As I understand it, that's because the string is already parsed by the time on() gets to register the event handler. To prove it, I modified FeedParser.parseString() in the feedparser source with a setTimeout like so:

FeedParser.parseString = function (string, options, callback) {
  var fp = feedparser(options, callback);
  setTimeout(function() {
    fp.stream
      .on('error', fp.handleError.bind(fp))
      .end(string, Buffer.isBuffer(string) ? null : 'utf8'); // Accommodate a Buffer in addition to a String
  },100)
  return fp;
};

and indeed, this made things work. Am I misunderstanding something?

PS: I'm running feedparser v0.10.6 with node.js v0.8.8.

Text child of "author" element not parsed

When I load a feed that contains bare text content in an author subelement of an item, the author property of the resulting article is null.

I believe this is because, at main.js#858, only name, email, and uri child elements of the author element are loaded. This is correct for Atom / RFC4287 feeds.

But in RSS 2.0, a bare text child of the author element is specified. I think the correct approach is going to be to accept the text child of author, in the case where no other subelements are found.

This construct was seen in the wild at http://www.tnr.com/rss/articles
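
A sketch of that fallback (the node shape shown is illustrative of how feedparser represents elements, with text content under '#'):

function itemAuthor(el) {
  // Prefer the structured Atom children (name, email).
  if (el.name && el.name['#']) return el.name['#'];
  if (el.email && el.email['#']) return el.email['#'];
  // RSS 2.0 allows a bare text child: <author>editor@example.com (Editor Name)</author>
  if (typeof el['#'] === 'string' && el['#'].trim()) return el['#'].trim();
  return null;
}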

Cannot parse Facebook feeds - callback error is null

Hello!

Just discovered the Facebook page-related feeds (maybe others too..?!) cannot be parsed, either in RSS 2.0 or Atom 1.0.

Examples (with or without https):
http://www.facebook.com/feeds/page.php?format=rss20&id=121431088001283
http://www.facebook.com/feeds/page.php?format=atom10&id=121431088001283

When the feed is finished being parsed, the callback provides the following values (JSON stringified):

  • error = null
  • meta = {"#ns":[],"@":[]}
  • articles = []
  • and there is no exception thrown.

Additionally, when the URL returns a 404 error, the behavior is the same (example with http://www.google.com/go404.xml), except that if the 404 page contains some xmlns declarations, they will be populated in #ns this way: meta = {"#ns":[{"xmlns":"http://www.w3.org/1999/xhtml"}],"@":[]}.

It's possible to check if something goes wrong using if (meta.xmlUrl === undefined) // do something but it is impossible to know what the problem is.

Parsing Engadget RSS feed continuously causes feedparser to freeze

Parsing feeds using setInterval works with other feeds out there, but the Engadget feed somehow causes feedparser to freeze. There could, of course, be other feeds that do not work as well.

I've not had time to look at the feedparser code, but I think I found a way to reproduce this:

  1. Create file including the code below:
var feedparser = require('feedparser');
var request = require('request');

function callback (article) {
  console.log(Date.now()+': Got article: ' + article.guid);
}

function parseEngadget() {
    feedparser.parseUrl('http://www.engadget.com/rss.xml').on('article', callback);
}

setInterval(parseEngadget, 1000);
  2. Run it. Notice that every second, article guids are printed to the console.
  3. After a while (30 minutes to a few hours), no new lines are printed to the console.
  4. Change the feed url in the code to something else (I tested with http://yle.fi/uutiset/rss/paauutiset.rss).
  5. Test again and notice that the freeze does not occur even after a long time.

Difference between parseUrl and parseString

If I do:

    var feedparser = require('feedparser');
    feedparser.parseUrl("http://github.com/feeds/jeffmcfadden/commits/magicframework/master", function (err, data) {
      console.log(err, data);
    });

it works!

If I use 'request' to get the data and use parseString instead:

  var request = require('request');
  var feedparser = require('feedparser');

  var reqObj = {'uri': "http://github.com/feeds/jeffmcfadden/commits/magicframework/master"};

  request(reqObj, function (err, response, body) {
    feedparser.parseString(body, function (err, data) {
      console.log(err, data);
    });
  });

It dies!

with the error:

TypeError: Parameter 'url' must be a string, not object
at Url.parse (url.js:118:11)
at urlParse (url.js:112:5)
at Url.resolve (url.js:406:29)
at Object.urlResolve as resolve
at /Users/gbird/projects/reader/node_modules/feedparser/utils.js:163:39
at Array.forEach (native)
at /Users/gbird/projects/reader/node_modules/feedparser/utils.js:161:19
at Array.forEach (native)
at resolveLevel (/Users/gbird/projects/reader/node_modules/feedparser/utils.js:152:9)
at Object.reresolve (/Users/gbird/projects/reader/node_modules/feedparser/utils.js:173:10)

This is on Node 0.10.0

Example request in README is wrong

Where it says:

    var reqObj = {'uri': 'http://cyber.law.harvard.edu/rss/examples/rss2sample.xml',
                  'If-Modified-Since' : <your cached 'lastModified' value>,
                  'If-None-Match' : <your cached 'etag' value>};

it should say:

    var reqObj = {'uri': 'http://cyber.law.harvard.edu/rss/examples/rss2sample.xml',
                  'headers': {'If-Modified-Since' : <your cached 'lastModified' value>,
                              'If-None-Match' : <your cached 'etag' value>}};

Inconsistent events

Using feedparser 0.9.10, there is a small issue when using the parseUrl() function.
If the supplied URL triggers an error such as ETIMEOUT or ENOENT, that error is emitted as an 'error' event. However there is no 'end' event emitted afterwards, which is inconsistent with how parse errors are handled; they first emit 'error', then later an 'end' event follows.

Small code sample I have used to provoke the situation:

var FeedParser = require('feedparser');
var parser = new FeedParser();

parser.on('error', function onParserError(error) {
    console.log("ERROR: " + error);
});

parser.on('end', function onParserEnd(articles) {
    console.log("END: " + articles.length);
});

// Generates ENOENT error, but no 'end' event:
parser.parseUrl("http://invalid.feed.url/whatever.xml");

The only workarounds I can think of at the moment is either parse the error message for know request errors (ugly!) or do the request myself and only use feedparser for parsing a successful response (e.g. parseString()).

Problem with ISO-8859-1 feeds

Hi,

I'm tearing my hair out... ^^ I have an encoding problem with ISO-8859-1 feeds and feedparser.

For exemple with :
http://fr.canoe.ca/rss/feed/nouvelles/aujourdhui.xml

The title "Blessures très sévères · Un automobiliste percute un tracteur"
becomes "Blessures tr�s s�v�res � Un automobiliste percute un tracteur"

I do simply feedparser.parseUrl(myUrl)
and .on('article') -> console.log(article.title)

I tried to play with encoding, iconv, etc. but without success.
I can't find where the problem is. Is it a bug, or am I forgetting something?

node v0.10.0 issue with url parse

try the code below

var parser = require('feedparser');
var url = 'http://com-stol.ru/?feed=rss2';
parser.parseUrl(
  url,
  { addmeta: false, feedurl: true },
  function (e, meta, articles) {
    console.log('got %d articles from %s feed', articles.length, url);
  }
);

results:

url.js:118
throw new TypeError("Parameter 'url' must be a string, not " + typeof url)
^
TypeError: Parameter 'url' must be a string, not boolean
at Url.parse (url.js:118:11)
at urlParse (url.js:112:5)
at Object.urlResolve as resolve
at Object.resolve (/Users/vseledkin/WORK/feedstore/node_modules/feedparser/utils.js:111:14)
at FeedParser. (/Users/vseledkin/WORK/feedstore/node_modules/feedparser/main.js:388:26)
at Array.forEach (native)
at FeedParser.handleAttributes (/Users/vseledkin/WORK/feedstore/node_modules/feedparser/main.js:370:22)
at FeedParser.handleOpenTag (/Users/vseledkin/WORK/feedstore/node_modules/feedparser/main.js:170:19)
at SAXStream.EventEmitter.emit (events.js:95:17)

I assume this has something to do with URL API changes in node v0.10.0.

parseUrl called in succession will only invoke the callback on the last call.

I'm not sure if you consider this a bug or not (I do), but there are two ways to call parseUrl in succession:

This is the way you would think would work; alas, it will only call dostuff once, with the articles from url3:

var parser = new FeedParser();
parser.parseUrl(url, function (err, meta, articles) { dostuff(articles); });
parser.parseUrl(url2, function (err, meta, articles) { dostuff(articles); });
parser.parseUrl(url3, function (err, meta, articles) { dostuff(articles); });

This is the way that actually works

var parser = new FeedParser();
parser.parseUrl(url, function (err, meta, articles) { dostuff(articles); });
var parser2 = new FeedParser();
parser2.parseUrl(url2, function (err, meta, articles) { dostuff(articles); });
var parser3 = new FeedParser();
parser3.parseUrl(url3, function (err, meta, articles) { dostuff(articles); });

Because of the way the parser works, it will constantly overwrite its stream with the latest call to parseUrl. My suggestion would be something like having per-url streams.

Date.parse() falls down a lot

Hey Dan,

Thanks for this project! I'm using it for http://magnet.io and it's saving me a lot of effort.

Maybe the project should include moment.js or another date parsing library for the pubdate parsing. There's a lot of feeds where the various date entries (dc:date, pubdate, etc.) just aren't detected correctly and thus return a null. I ran into this example last night:

http://www.spinner.ca/rsscanada.xml

The date is formatted like:

"2013-03-17T12:44:00 00:00"

Which is almost but not quite what Javascript likes - that space at the end causes Date.parse() to fail, where a "-" would be fine. There's lots and lots of other badly formatted dates, but this shows how touchy Date.parse() can be. Moment seems to be able to parse it well enough, though really something more like Simplepie's date parsing script might be called for:

https://github.com/simplepie/simplepie/blob/master/library/SimplePie/Parse/Date.php

If I get around to converting that to Javascript, I'll send you a pull request. ;-)

As a workaround now, when I see a null, I dive into the meta object and see if moment.js can parse the date and it seems to be working well enough, but it should probably be in your library directly.

Thanks again!

-Russ
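
A sketch of that workaround (assuming moment is installed; the raw property name varies by feed, and dc:date here is illustrative):

var moment = require('moment');

function bestDate(article) {
  if (article.pubdate) return article.pubdate; // feedparser managed to parse it
  var raw = article['dc:date'] && article['dc:date']['#']; // raw value preserved from the feed
  if (!raw) return null;
  var m = moment(raw);
  return m.isValid() ? m.toDate() : null;
}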

Buffer overrun in sax causes crash

When trying to parseUrl(“http://altdevblogaday.org/feed/”) an exception (Error: Max buffer length exceeded: procInstBody) will be thrown by sax.
The error is actually reported by feedparsers error-event, but the whole node process exits immediately afterwards, instead of emitting the end-event.
I am pretty sure this feed is somehow wrong or at least a corner case, since it is the only one out of many hundreds that causes feedparser trouble.

You can circumvent the whole problem by increasing sax.MAX_BUFFER_LENGTH, but maybe there is a way that feedparser could handle this error more gracefully.
It is especially bad when using feedparser in conjunction with async.queue, because it changes the behaviour from crashing the process to the queue never being drained (because feedparser's end event is simply not emitted after the error).

versions
node: 0.8.15
current feedparser 0.10.8

GeoRSS?

Any chance of getting GeoRSS fields supported? (Or maybe they already are and it's just my feed that's the problem?)

media:content is added to enclosures array

In the FeedParser.prototype.handleItem switch statement, media:content is treated as an enclosure. This does not seem correct: for instance, if the same url is in both places, it ends up twice in the enclosures array. Since I don't know why you did it, I am confused.

Versions <=0.9.1: not clear when parseFile's callback is triggered

Hi danmactough,
Can you explain here or in the documentation when parseFile's callback is triggered, or what event we should monitor to assess that all articles have been parsed?

My expectation was that the callback is called when all articles have been parsed, but some strange behaviour of my code suggests differently. It may be my mistake, but it would be good to hear how you designed that, too.

Thanks!

G.

Author/creator missing in result

Hello,

Do you plan to include the author (RSS) or dc:creator (Atom) in the resulting article?
Since these are standard fields in at least both RSS and Atom (not sure about RDF), they should be included. If you don't have that much time, I can also try to change it myself and send you a pull request, but that could take some time, since node.js is a language I currently don't understand; I only use it for now :)

Thanks!

Giving bad URL causes bad behavior

with query = "http://lifehacker.com/"

var parser = new FeedParser().parseUrl(query, function (err, meta, art) {
//...
});

we have:

TypeError: Cannot call method 'trim' of undefined
at /usr/home/_/data/www/_**/node_modules/feedparser/main.js:101:60

Handling feed modifications

Hi,
Some little newbie questions:

  1. How do you correctly handle If-Modified-Since? It seems the server (Google Calendar) always sends me back the full feed. (See the sketch after this list.)
  2. How do you monitor changes? Should I use something like 'ncb000gt / node-cron' to trigger a request every X minutes?
  3. Is there a smart way to handle only new feed elements? Or should I parse dates until I get something? Any good date parser?
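
On question 1, a sketch of a conditional GET with the request module (the header values shown are hypothetical, saved from a previous response):

var request = require('request');

// Values saved from a previous response's headers:
var lastModified = 'Fri, 06 Apr 2007 15:11:55 GMT';
var etag = '"d46a5b-9e0-42d731ba304c0"';

request({
  uri: 'http://example.com/feed.xml',
  headers: {
    'If-Modified-Since': lastModified,
    'If-None-Match': etag
  }
}, function (err, res, body) {
  if (!err && res.statusCode === 304) {
    // Nothing has changed since the last fetch -- skip parsing
  }
});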

Incorrect origlink when parsing atom feeds containing <link rel="canonical">

When I tried to parse a feed containing the lines

<link rel="alternate" href="http://feedproxy.google.com/~r/smittenkitchen/~3/EeHjPgqUpmA/" />
<link rel="canonical" href="http://smittenkitchen.com/2012/07/blackberry-gin-fizz/" />

(and no other link tags)

The link field contained "http://feedproxy.google.com/~r/smittenkitchen/~3/EeHjPgqUpmA/" (Probably the desired value)
But the origlink field contained null, instead of "http://smittenkitchen.com/2012/07/blackberry-gin-fizz/"

Here is a sample feed for replication of the issue: http://www.google.com/reader/atom/feed/http://feeds.feedburner.com/smittenkitchen?n=1

I can fix it and submit a pull request if my assessment of the issue seems correct

Discovering Pubsubhubbub

Hello,

I'm trying to figure out the best way to extract any pubsubhubbub reference from a given feed. I could just listen for the response event, but that would exclude functions like parseString. Plus it seems kinda stupid when feedparser is already parsing the feed. :)

For those who do not know: hubs are advertised as a <link> tag with a rel="hub" attribute, e.g. <link rel="hub" href="http://hubby.com/?subscribe" />.

I tried just putting some code in handleOpenTag:

  if (node.name === n['#prefix'] + ':link') {
    if (n['@']['rel'] === 'hub') {
      this.meta['#hub'] = n['@']['href'];
    }
  }

Neglecting the fact that the colon is hard coded and stuff, that works for the feeds I've tested. It's not very elegant though and probably breaks on lots of feeds. :P

Is this something that you as the author @danmactough feels like it could fit within feedparser? I'd be willing to make a patch and submit a pull requests. I just wanted to raise the issue and ask for some pointers. :)

Wrong example with parseUrl and object

Hi,

I think this example cannot work:

// Or you could try letting feedparser handle working with request (experimental)
feedparser.parseUrl(reqObj)
  .on('response', function (response){
    // do something like save the HTTP headers for a future request
  })
  .on('article', callback);

Because, in your code, you are only sending the url information to request:

var req = {
  uri: url,
  headers: { 'Accept-Encoding': 'identity' }
};

request(req)
  .on('error', fp.handleError.bind(fp))
  .on('response', handleResponse)
  .pipe(fp.stream)
  ;

Thank you for your work! :)

Edwin
