feed-rs's Introduction

feed-rs

A simple feed parser (RSS, Atom, JSON Feed)

The parser library is in feed-rs, with a simple test tool in testurls

feed-rs's Issues

Fails to parse logo for some feeds

When parsing https://www.spreaker.com/show/4273892/episodes/feed, the channel.logo field is always None, even though the feed has an image tag:

        <image>
            <url>https://d3wo5wojvuv7l.cloudfront.net/t_rss_itunes_square_1400/images.spreaker.com/original/0dcd53afca70854beb456079fa25d3f1.jpg</url>
            <title>Lwowska Fala | Radio Katowice</title>
            <link>https://www.spreaker.com/show/lwowska-fala-radio-katowice</link>
        </image>

Other feeds with the same <image> structure have their logo parsed correctly.

I'm using feed-rs version 0.6.1.
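
A minimal reproduction sketch, assuming the feed XML has been saved locally (the file name is made up); it just checks the logo field on the parsed feed-rs model:

use std::fs::File;
use std::io::BufReader;

fn main() {
    // Hypothetical local copy of the Spreaker feed above
    let file = File::open("spreaker.xml").unwrap();
    let feed = feed_rs::parser::parse(BufReader::new(file)).unwrap();

    // Expected: Some(Image { uri: "https://d3wo5wojvuv7l.cloudfront.net/...", .. })
    // Observed with feed-rs 0.6.1: None
    println!("{:?}", feed.logo);
}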

JSON Feed support

I know this one is a bit more exotic, but since the purpose of this crate is one-size-fits-all feed parsing, it would be nice to have.

Of course this should be prioritized pretty low, since it's rarely used in the wild.

Tests from the crates.io archive fail

Hello,

I'm packaging feed-rs as one of the dependencies for Fedora, and we run cargo test on all crates we package. However, it seems that the test files (the fixtures folder) are not shipped with the crate. Is there a reason not to include them?

If there is, could you also exclude src/util/test.rs so that cargo test won't run it?

Thanks for your cooperation!

Golem RSS feed Utf8Error

Sending a GET request with reqwest, taking the raw bytes from https://rss.golem.de/rss.php?feed=RSS1.0, and feeding them to feed-rs yields the following error while parsing the description of the first entry:

Err value: Error { pos: 1:1, kind: Utf8(Utf8Error { valid_up_to: 0, error_len: Some(1) }) }

Getting the response body with .text() works fine.

I prepared a small code sample

use std::io::BufReader;

#[tokio::main]
async fn main() {
    let rss_response = reqwest::get("https://rss.golem.de/rss.php?feed=RSS1.0").await.unwrap();
    //let rss_response = rss_response.text().await.unwrap();
    //let feed = feed_rs::parser::parse(rss_response.as_bytes()).unwrap();
    let rss_response = rss_response.bytes().await.unwrap().to_vec();
    let feed = feed_rs::parser::parse(BufReader::new(rss_response.as_slice())).unwrap();
    println!("{:?}", feed.title);
}

The description it chokes on is

<description>Envelope, ein Umschlag aus Papier für das eigene Telefon. Allerdings können Nutzer damit den Bildschirm nicht mehr erkennen und nur noch ein Nummernpad und wenige Tasten verwenden. Einen Bastelbogen und die App stellt das Team kostenlos zur Verfügung. (&lt;a href=&quot;https://www.golem.de/specials/smartphone/&quot;&gt;Smartphone&lt;/a&gt;, &lt;a href=&quot;https://www.golem.de/specials/google/&quot;&gt;Google&lt;/a&gt;) &lt;img src=&quot;https://cpx.golem.de/cpx.php?class=17&amp;amp;aid=146212&amp;amp;page=1&amp;amp;ts=1579701600&quot; alt=&quot;&quot; width=&quot;1&quot; height=&quot;1&quot; /&gt;</description>

The only difference I can think of is that reqwest somehow fixes a malformed sequence in the feed when using .text().

From the reqwest documentation for .text(): this method decodes the response body with BOM sniffing and with malformed sequences replaced with the REPLACEMENT CHARACTER. The encoding is determined from the charset parameter of the Content-Type header, and defaults to utf-8 if not present.

What is the real license of the project?

Cargo.toml says it is MIT OR Apache-2.0, but the LICENSE file contains only the MIT license text…

I would appreciate a new release after you decide which one is correct and make the necessary corrections.

Thanks!

InvalidDateTime errors

I did some testing and found the following feeds caused an InvalidDateTime error of one sort or another:

thread 'main' panicked at 'unable to parse http://www.breakingthin.gs/rss.xml: ParseError(InvalidDateTime(ParseError(Impossible)))', src/libcore/result.rs:1165:5
thread 'main' panicked at 'unable to parse http://donmelton.com/rss.xml: ParseError(InvalidDateTime(ParseError(NotEnough)))', src/libcore/result.rs:1165:5
thread 'main' panicked at 'unable to parse http://www.evanmiller.org/news.xml: ParseError(InvalidDateTime(ParseError(Impossible)))', src/libcore/result.rs:1165:5
thread 'main' panicked at 'unable to parse http://www.sealiesoftware.com/blog/rss.xml: ParseError(InvalidDateTime(ParseError(OutOfRange)))', src/libcore/result.rs:1165:5
thread 'main' panicked at 'unable to parse http://sourceforge.net/p/objectivelib/news/feed: ParseError(InvalidDateTime(ParseError(NotEnough)))', src/libcore/result.rs:1165:5
thread 'main' panicked at 'unable to parse http://feeds.feedburner.com/silverback: ParseError(InvalidDateTime(ParseError(Impossible)))', src/libcore/result.rs:1165:5
thread 'main' panicked at 'unable to parse http://feeds.feedburner.com/simpledesktops: ParseError(InvalidDateTime(ParseError(NotEnough)))', src/libcore/result.rs:1165:5
thread 'main' panicked at 'unable to parse http://theredditblog.disqus.com/lesswrong_the_coolest_use_of_reddit_source_weve_found_to_date/latest.rss: ParseError(InvalidDateTime(ParseError(NotEnough)))', src/libcore/result.rs:1165:5
thread 'main' panicked at 'unable to parse http://www.xkcd.com/rss.xml: ParseError(InvalidDateTime(ParseError(NotEnough)))', src/libcore/result.rs:1165:5
thread 'main' panicked at 'unable to parse http://www.mechanicalgirl.com/feeds/all/: ParseError(InvalidDateTime(ParseError(NotEnough)))', src/libcore/result.rs:1165:5
thread 'main' panicked at 'unable to parse http://cdevroe.com/status/feed: ParseError(InvalidDateTime(ParseError(Invalid)))', src/libcore/result.rs:1165:5
thread 'main' panicked at 'unable to parse http://feeds.feedburner.com/danielmall-articles: ParseError(InvalidDateTime(ParseError(Impossible)))', src/libcore/result.rs:1165:5
thread 'main' panicked at 'unable to parse http://alex.amiran.it/rss.xml: ParseError(InvalidDateTime(ParseError(NotEnough)))', src/libcore/result.rs:1165:5
thread 'main' panicked at 'unable to parse https://feeds.feedburner.com/oshogbo: ParseError(InvalidDateTime(ParseError(Invalid)))', src/libcore/result.rs:1165:5
thread 'main' panicked at 'unable to parse https://www.forrestthewoods.com/rss.xml: ParseError(InvalidDateTime(ParseError(Invalid)))', src/libcore/result.rs:1165:5
thread 'main' panicked at 'unable to parse http://osblog.stephenmarz.com/feed.rss: ParseError(InvalidDateTime(ParseError(Impossible)))', src/libcore/result.rs:1165:5
thread 'main' panicked at 'unable to parse https://myrrlyn.net/blog.rss: ParseError(InvalidDateTime(ParseError(NotEnough)))', src/libcore/result.rs:1165:5
thread 'main' panicked at 'unable to parse https://blog.burntsushi.net/index.xml: ParseError(InvalidDateTime(ParseError(NotEnough)))', src/libcore/result.rs:1165:5
thread 'main' panicked at 'unable to parse https://interrupt.memfault.com/blog/feed.xml: ParseError(InvalidDateTime(ParseError(Invalid)))', src/libcore/result.rs:1165:5
thread 'main' panicked at 'unable to parse https://duncan.bayne.id.au/feed.xml: ParseError(InvalidDateTime(ParseError(Invalid)))', src/libcore/result.rs:1165:5
thread 'main' panicked at 'unable to parse https://groups.google.com/forum/feed/pagedout-notifications/msgs/rss.xml?num=15: ParseError(InvalidDateTime(ParseError(NotEnough)))', src/libcore/result.rs:1165:5
thread 'main' panicked at 'unable to parse http://www.shervinemami.info/rss.xml: ParseError(InvalidDateTime(ParseError(Impossible)))', src/libcore/result.rs:1165:5
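
These all come from strict timestamp parsing failing the whole feed. For illustration only (this is not feed-rs's actual date handling), a lenient fallback that tries the common formats with chrono and degrades to None instead of erroring might look like this:

use chrono::{DateTime, Utc};

// Try the common feed timestamp formats and fall back to None
// instead of failing the whole parse
fn parse_timestamp_lenient(text: &str) -> Option<DateTime<Utc>> {
    let trimmed = text.trim();
    DateTime::parse_from_rfc2822(trimmed)
        .or_else(|_| DateTime::parse_from_rfc3339(trimmed))
        .ok()
        .map(|t| t.with_timezone(&Utc))
}

fn main() {
    // RFC 2822, as used by RSS 2.0
    assert!(parse_timestamp_lenient("Tue, 06 Aug 2019 05:01:15 +0000").is_some());
    // A malformed value no longer aborts parsing; it just yields None
    assert!(parse_timestamp_lenient("not a date").is_none());
}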

Fails to parse content for Ghost RSS feeds.

Version: 1.0.0

It fails to parse the content from https://blog.cloudflare.com/rss/. The feed appears to be using <content:encoded>, and it looks like feed_rs has support for that, but for some reason it isn't getting picked up.

Example entry:

Entry {
    id: "615d65eebd615902a7538284",
    title: Some(
        Text {
            content_type: "text/plain",
            src: None,
            content: "Staging TLS Certificates: Make every deployment a safe deployment",
        },
    ),
    updated: Some(
        2021-10-07T16:56:53Z,
    ),
    authors: [
        Person {
            name: "Dina Kozlov",
            uri: None,
            email: None,
        },
    ],
    content: None,
    links: [
        Link {
            href: "https://blog.cloudflare.com/staging-tls-certificate-every-deployment-safe-deployment/",
            rel: None,
            media_type: None,
            href_lang: None,
            title: None,
            length: None,
        },
    ],
    summary: Some(
        Text {
            content_type: "text/plain",
            src: None,
            content: "We are excited to announce that Enterprise customers now have the ability to test custom uploaded certificates in a staging environment before pushing them to production. ",
        },
    ),
    categories: [
        Category {
            term: "TLS",
            scheme: None,
            label: None,
        },
        Category {
            term: "SSL",
            scheme: None,
            label: None,
        },
    ],
    contributors: [],
    published: Some(
        2021-10-06T12:56:13Z,
    ),
    source: None,
    rights: None,
    media: [
        MediaObject {
            title: None,
            content: [
                MediaContent {
                    url: Some(
                        Url {
                            scheme: "https",
                            cannot_be_a_base: false,
                            username: "",
                            password: None,
                            host: Some(
                                Domain(
                                    "blog.cloudflare.com",
                                ),
                            ),
                            port: None,
                            path: "/content/images/2021/10/staging-tls-certificate-every-deployment-safe-deployment-OG-1.png",
                            query: None,
                            fragment: None,
                        },
                    ),
                    content_type: None,
                    height: None,
                    width: None,
                    duration: None,
                    size: None,
                    rating: None,
                },
            ],
            duration: None,
            thumbnails: [],
            texts: [],
            description: None,
            community: None,
            credits: [],
        },
    ],
},
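
A small sketch of how the missing body shows up from the caller's side, assuming a local copy of the Cloudflare feed (the file name is made up); content:encoded is expected to end up in entry.content.body:

fn main() {
    // Hypothetical local copy of the Cloudflare blog feed
    let xml = std::fs::read("cloudflare.xml").unwrap();
    let feed = feed_rs::parser::parse(xml.as_slice()).unwrap();

    for entry in feed.entries {
        // content:encoded should land in entry.content.body; for this feed
        // it is None while summary is populated
        let body = entry.content.and_then(|c| c.body);
        println!("{} -> body present: {}", entry.id, body.is_some());
    }
}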

Configurable parsing

Use configuration to control:

  • whitespace trimming
  • processing of common namespaces (e.g. Dublin Core)

RSS `content:encoded` does not get parsed as an entry body

Hi (great library, I really appreciate the effort that went into creating a unified feed parser!). After using feed-rs for a bit and wanting to switch a project over to it, I noticed that the entry bodies of certain feeds weren't being included; they always ended up unwrapping to None. After looking at the feed data, I think it's because the content:encoded element isn't being parsed into the model as a body.

I looked at the tests to see if it was covered (and thus if I was doing something wrong) and noticed that while there is a sample file that has content:encoded in it:

<content:encoded><![CDATA[<p><img class='size-full alignleft' title='Earthquake location 37.102S, 21.9072W' alt='Earthquake location 37.102S, 21.9072W' src='http://www.earthquakenewstoday.com/wp-content/uploads/35_20.jpg' width='146' height='146' />A minor earthquake with magnitude 3.5 (ml/mb) was detected on Tuesday, 8 kilometers (5 miles) from Aris in Greece. Global date and time of event UTC/GMT: 06/08/19 / 2019-08-06 01:46:56 / August 6, 2019 @ 1:46 am. The earthquake was roughly at a depth of 10 km (6 miles). The 3.5-magnitude earthquake was detected at 03:46:56 / 3:46 am (local time epicenter). Event id: us60005146. Ids that are associated to the earthquake: us60005146. Exact location of event, depth 10 km, 21.9072&deg; East, 37.102&deg; North. </p>
<p>Closest city/cities or villages, with min 5000 pop, to hypocenter/epicentrum was Pýrgos, Trípoli, Zacháro. Epicenter of the event was 20 km (12 miles) from Kalamáta (c. 51 100 pop), 62 km (38 miles) from Trípoli (c. 26 600 pop), 76 km (47 miles) from Pýrgos (c. 22 400 pop), 46 km (29 miles) from Spárti (c. 16 200 pop), 29 km (18 miles) from Filiatrá (c. 7 000 pop), 11 km (7 miles) from Messíni (c. 6 800 pop). Nearby country/countries that might be effected, Greece (c. 11 000 000 pop). </p>
<p>Each year there are an estimated 130,000 minor earthquakes in the world. Earthquakes 3.0 to 4.0 are often felt, but only causes minor damage. In the past 24 hours, there have been one, in the last 10 days one, in the past 30 days one and in the last 365 days sixty-seven earthquakes of magnitude 3.0 or greater that have been detected nearby. </p>
<h3>Did you feel the quake?</h3>
<p>Were you asleep? Was it difficult to stand and/or walk? Leave a comment or report about shaking, activity and damage at your city, home and country. The information in this article comes from the USGS Earthquake Notification Service. Read more about the earthquake, Seismometer information, Distances, Parameters, Date-Time, Location and details about this quake, detected near: 8 km W of Aris, Greece.</p>
<p>Copyright &copy; 2019 <a href='http://www.earthquakenewstoday.com/'>earthquakenewstoday.com</a> All rights reserved.</p>
]]></content:encoded>

The body for it isn't being tested:

.entry(Entry::default()
.title(Text::new("Minor earthquake, 3.5 mag was detected near Aris in Greece".into()))
.link(Link::new("\n http://www.earthquakenewstoday.com/2019/08/06/minor-earthquake-3-5-mag-was-detected-near-aris-in-greece/\n ".into()))
.published_rfc2822("Tue, 06 Aug 2019 05:01:15 +0000")
.category(Category::new("Earthquake breaking news".into()))
.category(Category::new("Minor World Earthquakes Magnitude -3.9".into()))
.category(Category::new("Spárti".into()))
.id("\n http://www.earthquakenewstoday.com/2019/08/06/minor-earthquake-3-5-mag-was-detected-near-aris-in-greece/\n ")
.summary(Text::new("\n A minor earthquake magnitude 3.5 (ml/mb) strikes near Kalamáta, Trípoli, Pýrgos, Spárti, Filiatrá, Messíni, Greece on Tuesday. The temblor has occurred at 03:46:56/3:46 am (local time epicenter) at a depth of 10 km (6 miles). How did you react? Did you feel it?".into()))
.updated(actual.updated));

Get feed type

Is it possible to determine the type of a parsed feed (RSS, Atom, JSON)?
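
Recent feed-rs versions expose a feed_type field on the Feed model; assuming that field is available, a check might look like this:

use feed_rs::model::{Feed, FeedType};

// Assumes the feed_type field added in later feed-rs releases
fn describe(feed: &Feed) -> &'static str {
    match feed.feed_type {
        FeedType::Atom => "Atom",
        FeedType::JSON => "JSON Feed",
        _ => "RSS",
    }
}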

Failed build with latest `serde`

The latest version of serde made the serde::export module actually private, and now feed-rs fails to build.

    |
10  | use serde::export::Formatter;
    |            ^^^^^^ private module
    |

Parse MediaRSS feeds in RSS 2.0

YouTube seems to use MediaRSS in an Atom feed, and that is also what feed-rs supports, but looking at https://www.rssboard.org/media-rss, it says (my emphasis):

This is version 1.5.1 of the Media RSS specification, a namespace for RSS 2.0 published on Dec. 11, 2009

And indeed there are feeds like https://s.ch9.ms/Shows/Azure-Friday/feed/mp4high that use MediaRSS in RSS 2.0. That feed also has itunes tags, so finding a good way of merging the two might be a bit tricky.

A workable approach might be to always prefer MediaRSS over itunes.

UTF-8 getting decoded a second time

Okay, I'm not sure if this can even be fixed without slowing down the parsing quite a bit, so possibly the solution is to close this issue and put a "don't do this" warning somewhere.

I'll use testurls as an example since it also triggers the issue.

The text() method of reqwest already tries to find the encoding in the HTTP headers and decodes the text to UTF-8. Now we have a UTF-8 encoded XML string that contains something like <?xml version="1.0" encoding="ISO-8859-1"?>.

So feed-rs, or rather quick-xml, does what it is told: it interprets the bytes as ISO-8859-1 and tries to convert them to UTF-8. Sadly, that just scrambles the already-decoded characters again.

A stripped-down version of testurls:

use feed_rs::parser;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let xml = reqwest::blocking::get("https://rss.golem.de/rss.php?feed=RSS1.0")?.text()?;

    match parser::parse(xml.as_bytes()) {
        Ok(_feed) => println!("ok"),
        Err(error) => println!("failed: {:?}\n{:?}\n-------------------------------------------------------------", error, xml),
    }

    Ok(())
}

results in:

Some(Text { content_type: "text/plain", src: None, content: "Cloud-Computing: Cloudical kündigt kompletten Cloud-Open-Source-Stack an" })

Using raw bytes() instead of decoded text() works as expected

use feed_rs::parser;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let xml = reqwest::blocking::get("https://rss.golem.de/rss.php?feed=RSS1.0")?.bytes()?;

    match parser::parse(xml.as_ref()) {
        Ok(_feed) => println!("ok"),
        Err(error) => println!("failed: {:?}\n{:?}\n-------------------------------------------------------------", error, xml),
    }

    Ok(())
}
Some(Text { content_type: "text/plain", src: None, content: "Cloud-Computing: Cloudical kündigt kompletten Cloud-Open-Source-Stack an" })

Btw: that also explains why you couldn't reproduce #10 with testurls. Everything is starting to make sense to me.
I'll switch to bytes() in my code and see what opinions others have regarding this issue.

Resolve Relative URIs

Some feeds provide relative URIs for things like enclosures. I recently implemented some basic code to resolve these URIs in my feed reader based on feed-rs. The bug report I got was refreshingly detailed and helpful, and it linked to a step-by-step description of how a big Python feed parser handles this problem:

https://pythonhosted.org/feedparser/resolving-relative-links.html#how-relative-uris-are-resolved

But since the first few steps rely on knowledge of the feed's XML, they should probably be implemented in feed-rs. Later steps, which use fields of the HTTP header, can't be implemented in feed-rs and need to be handled by the library/program that uses it.

What is your opinion on the issue?

If URIs get resolved but some of them fail because there is not enough information in the XML itself, should the resulting Link struct indicate that it is a partial URL? Or should the calling library just watch for url::ParseError::RelativeUrlWithoutBase?
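
For reference, the caller-side half of this with the url crate, assuming the base is known from the HTTP request or xml:base (the URLs below are made up):

use url::Url;

fn main() -> Result<(), url::ParseError> {
    // The base would come from xml:base or the feed's own URL, which the
    // caller usually knows from the HTTP request
    let base = Url::parse("https://example.com/podcast/feed.xml")?;
    let enclosure = base.join("episodes/001.mp3")?;
    assert_eq!(enclosure.as_str(), "https://example.com/podcast/episodes/001.mp3");

    // Without a base, a relative href fails with RelativeUrlWithoutBase
    assert_eq!(
        Url::parse("episodes/001.mp3").unwrap_err(),
        url::ParseError::RelativeUrlWithoutBase
    );
    Ok(())
}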

[Bad Feed] No ID & no Links

Got a bug report for a particularly bad feed: https://gitlab.com/news-flash/news_flash_gtk/-/issues/213

https://feeds.feedburner.com/ingreso_dival

The issue is that the items have neither an ID nor a link, so random IDs are generated, which breaks updating the feed.

A solution would be to combine the feed URL with the title and hash that to generate an ID. If you think that is an acceptable approach, or have a better idea, I can create a PR.

Btw: is there ever a use case for randomly generated IDs? For me it has caused more headaches than it has solved problems, but that's just my experience.
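
A sketch of the proposed fallback (illustrative only, not feed-rs's current behaviour), deriving a stable ID from the feed URL plus the entry title:

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// DefaultHasher is not guaranteed stable across Rust versions; a real
// implementation would use a fixed hash function instead
fn stable_id(feed_url: &str, title: &str) -> String {
    let mut hasher = DefaultHasher::new();
    feed_url.hash(&mut hasher);
    title.hash(&mut hasher);
    format!("{:x}", hasher.finish())
}

fn main() {
    let a = stable_id("https://feeds.feedburner.com/ingreso_dival", "Some headline");
    let b = stable_id("https://feeds.feedburner.com/ingreso_dival", "Some headline");
    assert_eq!(a, b); // re-parsing the same feed yields the same ID
}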

Fails to build out-of-box on a new project

On my machine, if I create a new project with cargo new and add feed-rs = "0.4" as my only dependency, I get the following error:

   Compiling feed-rs v0.4.1
error[E0603]: module `export` is private
   --> C:\.cargo\registry\src\github.com-1ecc6299db9ec823\feed-rs-0.4.1\src\xml\mod.rs:10:12
    |
10  | use serde::export::Formatter;
    |            ^^^^^^ private module
    |
note: the module `export` is defined here
   --> C:\.cargo\registry\src\github.com-1ecc6299db9ec823\serde-1.0.120\src\lib.rs:275:5
    |
275 | use self::__private as export;
    |     ^^^^^^^^^^^^^^^^^^^^^^^^^

error: aborting due to previous error

For more information about this error, try `rustc --explain E0603`.
error: could not compile `feed-rs`

Maybe this was due to an update to serde?

Support namespaces in feeds

  • Ignore elements from unknown namespaces
  • Automatically handle elements from namespaces such as Dublin Core (e.g. date, which is used in RSS 1.0 feeds such as Slashdot's).

Some fields may not be read correctly

I'm using feed_rs to parse the content of the Brothers Brick website.

Items in their RSS feed are defined this way:

<title>Cadet Thrawn outwits his opponents in the metallurgy lab</title>
<link>http://feedproxy.google.com/~r/TheBrothersBrick/~3/eWF3-ZnktaM/</link>
<comments>https://www.brothers-brick.com/2019/02/08/cadet-thrawn-outwits-his-opponents-in-the-metallurgy-lab/#respond</comments>
<pubDate>Fri, 08 Feb 2019 14:00:19 +0000</pubDate>
<guid isPermaLink="false">https://www.brothers-brick.com/?p=171427</guid>
<description><![CDATA[This detailed scene by CRCT Productions depicts the famous Grand Admiral Thrawn in his early days as an Imperial cadet.]]></description>
		<content:encoded><![CDATA[CONTENT IS REDACTED as GitHub interprets it as valid HTML]]></content:encoded>
	<wfw:commentRss>https://www.brothers-brick.com/2019/02/08/cadet-thrawn-outwits-his-opponents-in-the-metallurgy-lab/feed/</wfw:commentRss>
<slash:comments>0</slash:comments>

171427
<feedburner:origLink>https://www.brothers-brick.com/2019/02/08/cadet-thrawn-outwits-his-opponents-in-the-metallurgy-lab/</feedburner:origLink>

And the content can't be read. The full RSS feed is at https://www.brothers-brick.com/feed/

xml:base for content

This is a follow-up to #104, since this case was not covered by #107.

As mentioned in the previous issue: some feeds have relative URLs as part of their HTML content. A good example is https://insanity.industries/index.xml with:

<item xml:base="https://insanity.industries/post/pareto-optimal-compression/">
...
<content:encoded><![CDATA[
    ...
    <figure>
        <a href="example.svg">
		    <img src="example.svg"
    	         alt="Compression results for different hypothetical compression algorithms, including the Pareto frontier indicated in blue."/>
        </a><figcaption><p>Compression results for different hypothetical compression algorithms, including the Pareto frontier indicated in blue.</p>
            </figcaption>
    </figure>
    ...
    ]]></content:encoded>
    ...
</item>

@markpritchard responded:

I'm comfortable switching all the URLs in the RSS/Atom content to absolute by applying xml:base but I wouldn't want to add an HTML parser to feed-rs by default. Might be worth playing around with as a feature (I've never done that in Rust ... might be interesting to learn).

Overeager whitespace trimming

I'm not sure if this is an issue with xml-rs, but let's go one level deeper at a time.
Apparently feed-rs trims the spaces before and after links in the planet.gnome.org Atom feed. I tracked the issue down to element_source.rs:L30; setting trim_whitespace(false) "fixes" the issue.

Do you think turning whitespace trimming off is a sensible solution for this bug? Should this be reported to xml-rs?

Original report with image illustrating the issue: https://gitlab.com/news-flash/news_flash_base/-/issues/9
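
For reference, the xml-rs setting in question; a small illustration (the feed snippet is made up) of what flipping it changes for character data:

use xml::{EventReader, ParserConfig};

fn main() {
    let xml = r#"<feed><title>  Planet GNOME  </title></feed>"#;
    // feed-rs currently trims whitespace in element_source.rs; passing
    // trim_whitespace(false) to xml-rs preserves the padding in Characters events
    let config = ParserConfig::new().trim_whitespace(false);
    let reader = EventReader::new_with_config(xml.as_bytes(), config);
    for event in reader {
        println!("{:?}", event.unwrap());
    }
}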

Consistent GUID for feeds

Similar to #11, feeds get randomly generated IDs for RSS 1.0 and 2.0. Atom parses an ID if it's there, but could also end up with a random ID (not sure if that happens in the real world).
It would be cool to have a solution for this similar to #13.

RSS 1.0 no consistent GUID

The RSS 1.0 parser uses the auto-generated GUID from Entry::default(). This means parsing a feed a second time for new articles results in duplicate articles.
One could use the URL to check for duplicates, but IMO it would be nicer if feed_rs generated a consistent ID for each item, probably based on the URL of the article.

Panics on invalid XML

The library currently panics on the following input:

<?xml version="1.0" encoding="UTF-8" ?>
<rss version="2.0"
	
>
<channel>
<title>Reuters: Most Read Articles</title>
<link>https://www.reuters.com</link>
<description>Reuters.com is your source for breaking news, business, financial and investing news, including personal finance and stocks.  Reuters is the leading global provider of news, financial information and technology solutions to the world's media, financial institutions, businesses and individuals.</description>
<image>
	<title>Reuters News</title>
	<width>120</width>
	<height>35</height>
	<link>https://www.reuters.com</link>
	<url>https://www.reuters.com/resources_v2/images/reuters125.png</url>
</image>
<language>en-us</language>
<lastBuildDate>Sat, 21 Mar 2020 06:29:51 -0400</lastBuildDate>
<copyright>All rights reserved. Users may download and print extracts of content from this website for their own personal and non-commercial use only. Republication or redistribution of Reuters content, including by framing or similar means, is expressly prohibited without the prior written consent of Reuters. Reuters and the Reuters sphere logo are registered trademarks or trademarks of the Reuters group of companies around the world. &#169; Reuters 2020</copyright>
<!--Property passed to loop is null--><!-- Exception rendering module on server  -->

This is being served by the following URL (at the time of writing): http://feeds.reuters.com/reuters/MostRead

The panic is due to an unwrap on src/util/element_source.rs:250:9, and can be verified with the testurls example:

$ echo "http://feeds.reuters.com/reuters/MostRead" | cargo run --bin testurls
    Finished dev [unoptimized + debuginfo] target(s) in 0.06s
     Running `target/debug/testurls`
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Error { pos: 22:1, kind: Syntax("Unexpected end of stream: still inside the root element") }', /home/guilherme/Projects/feed-rs/feed-rs/src/util/element_source.rs:250:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
http://feeds.reuters.com/reuters/MostRead

Other inputs that will also panic:

Not parsing HEATED rss feed correctly

I am a user of NewsFlash GTK, which uses feed-rs. I've reported an issue against their repo for not parsing the HEATED RSS feed correctly. You can find the original issue here.

It seems that it tries to parse the content:encoded as an image. This is the output the NewsFlash dev sees when trying to parse an entry from HEATED:

Entry {
    id: "https://heated.world/p/twitters-big-oil-ad-loophole",
    title: Some(Text { content_type: "text/plain", src: None, content: "Twitter\'s Big Oil ad loophole" }),
    updated: Some(2021-02-03T05:17:19Z),
    authors: [Person { name: "Emily Atkin", uri: None, email: None }],
    content: Some(Content {
        body: None,
        content_type: "image/jpeg",
        length: Some(0),
        src: Some(Link { href: "https://cdn.substack.com/image/fetch/h_600,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F8c4a2557-4e1b-417f-8b87-6c954a50576c_1072x1102.png", rel: None, media_type: None, href_lang: None, title: None, length: None })
    }),
    links: [Link { href: "https://heated.world/p/twitters-big-oil-ad-loophole", rel: None, media_type: None, href_lang: None, title: None, length: None }],
    summary: Some(Text {
        content_type: "text/plain",
        src: None,
        content: "Climate groups can\'t pay Twitter to spread political content. But the oil industry can, and it\'s ramping up its efforts alongside Biden\'s climate push."
    }),
    categories: [],
    contributors: [],
    published: Some(2021-02-02T12:01:03Z),
    source: None,
    rights: None,
    media: []
}

You can find the RSS feed here.

Implement Clone on feed_rs::model::Entry?

I'm trying to use feed_rs to make some aggregator-like tool with some filtering capabilities.

For this to work, I came up with the idea of getting entries from different feeds, merging them into a single "stream", and applying the filters I need, but I got stuck because I can't concatenate the entries (feed_rs::model::Feed::entries, which is a Vec), as feed_rs::model::Entry does not implement the Clone trait.

Is this intentional? Is there a good reason why this is the way it is, or has it simply never been a problem before?
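
As a side note: merging by moving the entries out of each parsed feed doesn't require Clone at all; Clone only becomes necessary if the same Entry has to live in several collections at once. A minimal sketch:

use feed_rs::model::{Entry, Feed};

// Merge entries from several parsed feeds into one stream by moving them,
// which needs no Clone implementation
fn merge(feeds: Vec<Feed>) -> Vec<Entry> {
    let mut all = Vec::new();
    for feed in feeds {
        all.extend(feed.entries);
    }
    // Hypothetical filter: keep only entries that have a title
    all.retain(|e| e.title.is_some());
    all
}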

Build errors with 0.2.2

Sadly the 0.2.2 update introduced some build errors:

error[E0599]: no method named `map_or` found for type `std::result::Result<chrono::datetime::DateTime<chrono::offset::fixed::FixedOffset>, chrono::format::ParseError>` in the current scope
  --> feed-rs/src/parser/util.rs:47:10
   |
47 |         .map_or(None, |t| Some(t.with_timezone(&Utc)))
   |          ^^^^^^ help: there is a method with a similar name: `map_err`

error[E0599]: no method named `map_or` found for type `std::result::Result<chrono::datetime::DateTime<chrono::offset::fixed::FixedOffset>, chrono::format::ParseError>` in the current scope
  --> feed-rs/src/parser/util.rs:54:10
   |
54 |         .map_or(None, |t| Some(t.with_timezone(&Utc)))
   |          ^^^^^^ help: there is a method with a similar name: `map_err`

error: aborting due to 2 previous errors
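
The error looks like a toolchain issue rather than a chrono one: Result::map_or was only stabilized in Rust 1.41, so older compilers reject it. On such toolchains the same conversion can be written without it; a sketch:

use chrono::{DateTime, Utc};

// Equivalent to `.map_or(None, |t| Some(t.with_timezone(&Utc)))`
// without relying on Result::map_or
fn to_utc(text: &str) -> Option<DateTime<Utc>> {
    DateTime::parse_from_rfc2822(text)
        .ok()
        .map(|t| t.with_timezone(&Utc))
}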

Convert text to UTF-8

I'm not sure if this is something that should be implemented in feed-rs itself or rather in xml-rs.
I think we all know Rust strings are UTF-8, while the XML documents of feeds can be in a number of different encodings. The text encoding used is declared in the XML prolog like this:

<?xml version="1.0" encoding="ISO-8859-1" ?>

xml-rs picks up that information but, so far, doesn't do anything with it. Its documentation for the parsed encoding says:

XML document encoding.

If the XML declaration is not present or does not contain an encoding attribute, this defaults to "UTF-8". This field is currently used for no purpose other than information.

So the question is: should the text be converted to UTF-8 in xml-rs already, or is feed-rs the right place to handle this? Even if it eventually should be part of xml-rs's functionality, is a temporary solution in feed-rs a good idea?

Obligatory link to the related bug report in an application using feed-rs:
https://gitlab.com/news-flash/news_flash/-/issues/35
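
One way the conversion could be done (in either crate, or by the caller) is with encoding_rs, using the label from the XML declaration. A sketch, assuming the label has already been extracted from the prolog:

use encoding_rs::Encoding;

// Decode raw feed bytes to UTF-8 using a label such as "ISO-8859-1";
// extracting the label from the <?xml ... encoding="..."?> prolog is
// left out here
fn decode(bytes: &[u8], label: &str) -> Option<String> {
    let encoding = Encoding::for_label(label.as_bytes())?;
    let (text, _, had_errors) = encoding.decode(bytes);
    if had_errors { None } else { Some(text.into_owned()) }
}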

Create a separate field for feed_url

Can you please create a separate field in the feed model for the feed URL?

Currently, the feed model has a links field. It's not obvious at which position the feed URL is located for a JSON Feed. JSON feeds have feed_url and homepage_url.

Another option is to get rid of the links field and create a link field that contains the feed URL.
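
With the current model, one caller-side workaround is to prefer the link with rel="self" (assuming the feed's own URL appears there, as it does for well-formed Atom feeds) and fall back to the first link. A sketch, not a feed-rs API:

use feed_rs::model::Feed;

// Prefer the rel="self" link, otherwise fall back to the first link
fn feed_url(feed: &Feed) -> Option<&str> {
    feed.links
        .iter()
        .find(|l| l.rel.as_deref() == Some("self"))
        .or_else(|| feed.links.first())
        .map(|l| l.href.as_str())
}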

Cargo fmt

Would it be okay to run cargo fmt on everything? Right now rustfmt produces a relatively large diff for me locally.
