feed-rs's Introduction

feed-rs

A simple feed parser (RSS, Atom, JSON Feed)

The parser library is in feed-rs, with a simple test tool in testurls

feed-rs's Issues

Fails to parse logo for some feeds

When parsing https://www.spreaker.com/show/4273892/episodes/feed, the channel.logo field is always None, even though the feed has an image tag:

        <image>
            <url>https://d3wo5wojvuv7l.cloudfront.net/t_rss_itunes_square_1400/images.spreaker.com/original/0dcd53afca70854beb456079fa25d3f1.jpg</url>
            <title>Lwowska Fala | Radio Katowice</title>
            <link>https://www.spreaker.com/show/lwowska-fala-radio-katowice</link>
        </image>

Other feeds with the same <image> structure have their logo parsed correctly.

I'm using feed-rs version 0.6.1.
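
A minimal reproduction sketch, assuming the feed XML has been saved locally (the file name is made up); it just checks the logo field on the parsed feed-rs model:

use std::fs::File;
use std::io::BufReader;

fn main() {
    // Hypothetical local copy of the Spreaker feed above
    let file = File::open("spreaker.xml").unwrap();
    let feed = feed_rs::parser::parse(BufReader::new(file)).unwrap();

    // Expected: Some(Image { uri: "https://d3wo5wojvuv7l.cloudfront.net/...", .. })
    // Observed with feed-rs 0.6.1: None
    println!("{:?}", feed.logo);
}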

JSON Feed support

I know this one is a bit more exotic, but since the purpose of this crate is one-size-fits-all feed parsing, it would be nice to have.

Of course this should be prioritized pretty low, since it's rarely used in the wild.

Tests from the crates.io archive fail

Hello,

I'm packaging feed-rs as one of the dependencies for Fedora, and we run cargo test on all crates we package. However, it seems that the test files (the fixtures folder) are not shipped with the crate. Is there a reason not to include them?

If there is, could you also exclude src/util/test.rs so that cargo test won't run it?

Thanks for your cooperation!

Golem RSS feed Utf8Error

Sending a GET request with reqwest, taking the raw bytes from https://rss.golem.de/rss.php?feed=RSS1.0, and feeding them to feed-rs yields the following error while parsing the description of the first entry:

Err value: Error { pos: 1:1, kind: Utf8(Utf8Error { valid_up_to: 0, error_len: Some(1) }) }

Getting the response body with .text() works fine.

I prepared a small code sample

use std::io::BufReader;

#[tokio::main]
async fn main() {
    let rss_response = reqwest::get("https://rss.golem.de/rss.php?feed=RSS1.0").await.unwrap();
    //let rss_response = rss_response.text().await.unwrap();
    //let feed = feed_rs::parser::parse(rss_response.as_bytes()).unwrap();
    let rss_response = rss_response.bytes().await.unwrap().to_vec();
    let feed = feed_rs::parser::parse(BufReader::new(rss_response.as_slice())).unwrap();
    println!("{:?}", feed.title);
}

The description it chokes on is

<description>Envelope, ein Umschlag aus Papier für das eigene Telefon. Allerdings können Nutzer damit den Bildschirm nicht mehr erkennen und nur noch ein Nummernpad und wenige Tasten verwenden. Einen Bastelbogen und die App stellt das Team kostenlos zur Verfügung. (&lt;a href=&quot;https://www.golem.de/specials/smartphone/&quot;&gt;Smartphone&lt;/a&gt;, &lt;a href=&quot;https://www.golem.de/specials/google/&quot;&gt;Google&lt;/a&gt;) &lt;img src=&quot;https://cpx.golem.de/cpx.php?class=17&amp;amp;aid=146212&amp;amp;page=1&amp;amp;ts=1579701600&quot; alt=&quot;&quot; width=&quot;1&quot; height=&quot;1&quot; /&gt;</description>

The only difference I can think of is that reqwest somehow fixes a malformed sequence in the feed when using .text().

From the reqwest documentation for .text(): this method decodes the response body with BOM sniffing and with malformed sequences replaced with the REPLACEMENT CHARACTER. The encoding is determined from the charset parameter of the Content-Type header, and defaults to utf-8 if not present.

What is the real license of the project?

Cargo.toml says it is MIT OR Apache-2.0, but the LICENSE file contains only the MIT license text…

I would appreciate a new release after you decide which one is correct and make the necessary corrections.

Thanks!

InvalidDateTime errors

I did some testing and found the following feeds caused an InvalidDateTime error of one sort or another:

thread 'main' panicked at 'unable to parse http://www.breakingthin.gs/rss.xml: ParseError(InvalidDateTime(ParseError(Impossible)))', src/libcore/result.rs:1165:5
thread 'main' panicked at 'unable to parse http://donmelton.com/rss.xml: ParseError(InvalidDateTime(ParseError(NotEnough)))', src/libcore/result.rs:1165:5
thread 'main' panicked at 'unable to parse http://www.evanmiller.org/news.xml: ParseError(InvalidDateTime(ParseError(Impossible)))', src/libcore/result.rs:1165:5
thread 'main' panicked at 'unable to parse http://www.sealiesoftware.com/blog/rss.xml: ParseError(InvalidDateTime(ParseError(OutOfRange)))', src/libcore/result.rs:1165:5
thread 'main' panicked at 'unable to parse http://sourceforge.net/p/objectivelib/news/feed: ParseError(InvalidDateTime(ParseError(NotEnough)))', src/libcore/result.rs:1165:5
thread 'main' panicked at 'unable to parse http://feeds.feedburner.com/silverback: ParseError(InvalidDateTime(ParseError(Impossible)))', src/libcore/result.rs:1165:5
thread 'main' panicked at 'unable to parse http://feeds.feedburner.com/simpledesktops: ParseError(InvalidDateTime(ParseError(NotEnough)))', src/libcore/result.rs:1165:5
thread 'main' panicked at 'unable to parse http://theredditblog.disqus.com/lesswrong_the_coolest_use_of_reddit_source_weve_found_to_date/latest.rss: ParseError(InvalidDateTime(ParseError(NotEnough)))', src/libcore/result.rs:1165:5
thread 'main' panicked at 'unable to parse http://www.xkcd.com/rss.xml: ParseError(InvalidDateTime(ParseError(NotEnough)))', src/libcore/result.rs:1165:5
thread 'main' panicked at 'unable to parse http://www.mechanicalgirl.com/feeds/all/: ParseError(InvalidDateTime(ParseError(NotEnough)))', src/libcore/result.rs:1165:5
thread 'main' panicked at 'unable to parse http://cdevroe.com/status/feed: ParseError(InvalidDateTime(ParseError(Invalid)))', src/libcore/result.rs:1165:5
thread 'main' panicked at 'unable to parse http://feeds.feedburner.com/danielmall-articles: ParseError(InvalidDateTime(ParseError(Impossible)))', src/libcore/result.rs:1165:5
thread 'main' panicked at 'unable to parse http://alex.amiran.it/rss.xml: ParseError(InvalidDateTime(ParseError(NotEnough)))', src/libcore/result.rs:1165:5
thread 'main' panicked at 'unable to parse https://feeds.feedburner.com/oshogbo: ParseError(InvalidDateTime(ParseError(Invalid)))', src/libcore/result.rs:1165:5
thread 'main' panicked at 'unable to parse https://www.forrestthewoods.com/rss.xml: ParseError(InvalidDateTime(ParseError(Invalid)))', src/libcore/result.rs:1165:5
thread 'main' panicked at 'unable to parse http://osblog.stephenmarz.com/feed.rss: ParseError(InvalidDateTime(ParseError(Impossible)))', src/libcore/result.rs:1165:5
thread 'main' panicked at 'unable to parse https://myrrlyn.net/blog.rss: ParseError(InvalidDateTime(ParseError(NotEnough)))', src/libcore/result.rs:1165:5
thread 'main' panicked at 'unable to parse https://blog.burntsushi.net/index.xml: ParseError(InvalidDateTime(ParseError(NotEnough)))', src/libcore/result.rs:1165:5
thread 'main' panicked at 'unable to parse https://interrupt.memfault.com/blog/feed.xml: ParseError(InvalidDateTime(ParseError(Invalid)))', src/libcore/result.rs:1165:5
thread 'main' panicked at 'unable to parse https://duncan.bayne.id.au/feed.xml: ParseError(InvalidDateTime(ParseError(Invalid)))', src/libcore/result.rs:1165:5
thread 'main' panicked at 'unable to parse https://groups.google.com/forum/feed/pagedout-notifications/msgs/rss.xml?num=15: ParseError(InvalidDateTime(ParseError(NotEnough)))', src/libcore/result.rs:1165:5
thread 'main' panicked at 'unable to parse http://www.shervinemami.info/rss.xml: ParseError(InvalidDateTime(ParseError(Impossible)))', src/libcore/result.rs:1165:5
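
These all come from strict timestamp parsing failing the whole feed. For illustration only (this is not feed-rs's actual date handling), a lenient fallback that tries the common formats with chrono and degrades to None instead of erroring might look like this:

use chrono::{DateTime, Utc};

// Try the common feed timestamp formats and fall back to None
// instead of failing the whole parse
fn parse_timestamp_lenient(text: &str) -> Option<DateTime<Utc>> {
    let trimmed = text.trim();
    DateTime::parse_from_rfc2822(trimmed)
        .or_else(|_| DateTime::parse_from_rfc3339(trimmed))
        .ok()
        .map(|t| t.with_timezone(&Utc))
}

fn main() {
    // RFC 2822, as used by RSS 2.0
    assert!(parse_timestamp_lenient("Tue, 06 Aug 2019 05:01:15 +0000").is_some());
    // A malformed value no longer aborts parsing; it just yields None
    assert!(parse_timestamp_lenient("not a date").is_none());
}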

Fails to parse content for Ghost RSS feeds.

Version: 1.0.0

It fails to parse the content from https://blog.cloudflare.com/rss/. The feed appears to be using <content:encoded>, and it looks like feed_rs has support for that, but for some reason it isn't getting picked up.

Example entry:

Entry {
    id: "615d65eebd615902a7538284",
    title: Some(
        Text {
            content_type: "text/plain",
            src: None,
            content: "Staging TLS Certificates: Make every deployment a safe deployment",
        },
    ),
    updated: Some(
        2021-10-07T16:56:53Z,
    ),
    authors: [
        Person {
            name: "Dina Kozlov",
            uri: None,
            email: None,
        },
    ],
    content: None,
    links: [
        Link {
            href: "https://blog.cloudflare.com/staging-tls-certificate-every-deployment-safe-deployment/",
            rel: None,
            media_type: None,
            href_lang: None,
            title: None,
            length: None,
        },
    ],
    summary: Some(
        Text {
            content_type: "text/plain",
            src: None,
            content: "We are excited to announce that Enterprise customers now have the ability to test custom uploaded certificates in a staging environment before pushing them to production. ",
        },
    ),
    categories: [
        Category {
            term: "TLS",
            scheme: None,
            label: None,
        },
        Category {
            term: "SSL",
            scheme: None,
            label: None,
        },
    ],
    contributors: [],
    published: Some(
        2021-10-06T12:56:13Z,
    ),
    source: None,
    rights: None,
    media: [
        MediaObject {
            title: None,
            content: [
                MediaContent {
                    url: Some(
                        Url {
                            scheme: "https",
                            cannot_be_a_base: false,
                            username: "",
                            password: None,
                            host: Some(
                                Domain(
                                    "blog.cloudflare.com",
                                ),
                            ),
                            port: None,
                            path: "/content/images/2021/10/staging-tls-certificate-every-deployment-safe-deployment-OG-1.png",
                            query: None,
                            fragment: None,
                        },
                    ),
                    content_type: None,
                    height: None,
                    width: None,
                    duration: None,
                    size: None,
                    rating: None,
                },
            ],
            duration: None,
            thumbnails: [],
            texts: [],
            description: None,
            community: None,
            credits: [],
        },
    ],
},
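
A small sketch of how the missing body shows up from the caller's side, assuming a local copy of the Cloudflare feed (the file name is made up); content:encoded is expected to end up in entry.content.body:

fn main() {
    // Hypothetical local copy of the Cloudflare blog feed
    let xml = std::fs::read("cloudflare.xml").unwrap();
    let feed = feed_rs::parser::parse(xml.as_slice()).unwrap();

    for entry in feed.entries {
        // content:encoded should land in entry.content.body; for this feed
        // it is None while summary is populated
        let body = entry.content.and_then(|c| c.body);
        println!("{} -> body present: {}", entry.id, body.is_some());
    }
}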

Configurable parsing

Use configuration to control:

  • whitespace trimming
  • processing of common namespaces (e.g. Dublin Core)

RSS `content:encoded` does not get parsed as an entry body

Hi (great library, I really appreciate the effort that went into creating a unified feed parser!). After using feed-rs for a bit and wanting to switch a project over to it, I noticed that the entry bodies of certain feeds weren't being included; they always ended up unwrapping to None. After looking at the feed data, I think it's because the content:encoded element isn't being parsed into the model as a body.

I looked at the tests to see if it was covered (and thus if I was doing something wrong) and noticed that while there is a sample file that has content:encoded in it:

<content:encoded><![CDATA[<p><img class='size-full alignleft' title='Earthquake location 37.102S, 21.9072W' alt='Earthquake location 37.102S, 21.9072W' src='http://www.earthquakenewstoday.com/wp-content/uploads/35_20.jpg' width='146' height='146' />A minor earthquake with magnitude 3.5 (ml/mb) was detected on Tuesday, 8 kilometers (5 miles) from Aris in Greece. Global date and time of event UTC/GMT: 06/08/19 / 2019-08-06 01:46:56 / August 6, 2019 @ 1:46 am. The earthquake was roughly at a depth of 10 km (6 miles). The 3.5-magnitude earthquake was detected at 03:46:56 / 3:46 am (local time epicenter). Event id: us60005146. Ids that are associated to the earthquake: us60005146. Exact location of event, depth 10 km, 21.9072&deg; East, 37.102&deg; North. </p>
<p>Closest city/cities or villages, with min 5000 pop, to hypocenter/epicentrum was Pýrgos, Trípoli, Zacháro. Epicenter of the event was 20 km (12 miles) from Kalamáta (c. 51 100 pop), 62 km (38 miles) from Trípoli (c. 26 600 pop), 76 km (47 miles) from Pýrgos (c. 22 400 pop), 46 km (29 miles) from Spárti (c. 16 200 pop), 29 km (18 miles) from Filiatrá (c. 7 000 pop), 11 km (7 miles) from Messíni (c. 6 800 pop). Nearby country/countries that might be effected, Greece (c. 11 000 000 pop). </p>
<p>Each year there are an estimated 130,000 minor earthquakes in the world. Earthquakes 3.0 to 4.0 are often felt, but only causes minor damage. In the past 24 hours, there have been one, in the last 10 days one, in the past 30 days one and in the last 365 days sixty-seven earthquakes of magnitude 3.0 or greater that have been detected nearby. </p>
<h3>Did you feel the quake?</h3>
<p>Were you asleep? Was it difficult to stand and/or walk? Leave a comment or report about shaking, activity and damage at your city, home and country. The information in this article comes from the USGS Earthquake Notification Service. Read more about the earthquake, Seismometer information, Distances, Parameters, Date-Time, Location and details about this quake, detected near: 8 km W of Aris, Greece.</p>
<p>Copyright &copy; 2019 <a href='http://www.earthquakenewstoday.com/'>earthquakenewstoday.com</a> All rights reserved.</p>
]]></content:encoded>

The body for it isn't being tested:

.entry(Entry::default()
.title(Text::new("Minor earthquake, 3.5 mag was detected near Aris in Greece".into()))
.link(Link::new("\n http://www.earthquakenewstoday.com/2019/08/06/minor-earthquake-3-5-mag-was-detected-near-aris-in-greece/\n ".into()))
.published_rfc2822("Tue, 06 Aug 2019 05:01:15 +0000")
.category(Category::new("Earthquake breaking news".into()))
.category(Category::new("Minor World Earthquakes Magnitude -3.9".into()))
.category(Category::new("Spárti".into()))
.id("\n http://www.earthquakenewstoday.com/2019/08/06/minor-earthquake-3-5-mag-was-detected-near-aris-in-greece/\n ")
.summary(Text::new("\n A minor earthquake magnitude 3.5 (ml/mb) strikes near Kalamáta, Trípoli, Pýrgos, Spárti, Filiatrá, Messíni, Greece on Tuesday. The temblor has occurred at 03:46:56/3:46 am (local time epicenter) at a depth of 10 km (6 miles). How did you react? Did you feel it?".into()))
.updated(actual.updated));

Get feed type

Is it possible to determine the type of a parsed feed (RSS, Atom, JSON)?
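
Recent feed-rs versions expose a feed_type field on the Feed model; assuming that field is available, a check might look like this:

use feed_rs::model::{Feed, FeedType};

// Assumes the feed_type field added in later feed-rs releases
fn describe(feed: &Feed) -> &'static str {
    match feed.feed_type {
        FeedType::Atom => "Atom",
        FeedType::JSON => "JSON Feed",
        _ => "RSS",
    }
}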

Failed build with latest `serde`

The latest version of serde made the serde::export module actually private, and now feed-rs fails to build.

    |
10  | use serde::export::Formatter;
    |            ^^^^^^ private module
    |

Parse MediaRSS feeds in RSS 2.0

YouTube seems to use MediaRSS in an Atom feed, and that is also what feed-rs supports, but looking at https://www.rssboard.org/media-rss, it says (my emphasis):

This is version 1.5.1 of the Media RSS specification, a namespace for RSS 2.0 published on Dec. 11, 2009

And indeed there are feeds like https://s.ch9.ms/Shows/Azure-Friday/feed/mp4high that use MediaRSS in RSS 2.0. That feed also has itunes tags, so finding a good way of merging the two might be a bit tricky.

A workable approach might be to always prefer MediaRSS over itunes.

UTF-8 getting decoded a second time

Okay, I'm not sure if this can even be fixed without slowing down the parsing quite a bit, so possibly the solution is to close this issue and put a "don't do this" warning somewhere.

I'll use testurls as an example since it also triggers the issue.

The text() method of reqwest already tries to find the encoding in the HTTP headers and decodes the text to UTF-8. Now we have a UTF-8 encoded XML string that contains something like <?xml version="1.0" encoding="ISO-8859-1"?>.

So feed-rs, or rather quick-xml, does what it is told: it interprets the bytes as ISO-8859-1 and tries to convert them to UTF-8. Sadly, that just scrambles the already-decoded characters again.

A stripped-down version of testurls:

use feed_rs::parser;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let xml = reqwest::blocking::get("https://rss.golem.de/rss.php?feed=RSS1.0")?.text()?;

    match parser::parse(xml.as_bytes()) {
        Ok(_feed) => println!("ok"),
        Err(error) => println!("failed: {:?}\n{:?}\n-------------------------------------------------------------", error, xml),
    }

    Ok(())
}

results in:

Some(Text { content_type: "text/plain", src: None, content: "Cloud-Computing: Cloudical kündigt kompletten Cloud-Open-Source-Stack an" })

Using raw bytes() instead of decoded text() works as expected

use feed_rs::parser;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let xml = reqwest::blocking::get("https://rss.golem.de/rss.php?feed=RSS1.0")?.bytes()?;

    match parser::parse(xml.as_ref()) {
        Ok(_feed) => println!("ok"),
        Err(error) => println!("failed: {:?}\n{:?}\n-------------------------------------------------------------", error, xml),
    }

    Ok(())
}
Some(Text { content_type: "text/plain", src: None, content: "Cloud-Computing: Cloudical kündigt kompletten Cloud-Open-Source-Stack an" })

Btw: that also explains why you couldn't reproduce #10 with testurls. Everything is starting to make sense to me.
I'll switch to bytes() in my code and see what opinions others have regarding this issue.

Resolve Relative URIs

Some feeds provide relative URIs for things like enclosures. I recently implemented some basic code to resolve these URIs in my feed reader based on feed-rs. The bug report I got was refreshingly detailed and helpful, and it linked to a step-by-step description of how a big Python feed parser handles this problem:

https://pythonhosted.org/feedparser/resolving-relative-links.html#how-relative-uris-are-resolved

But since the first few steps rely on knowledge of the feed's XML, they should probably be implemented in feed-rs. Later steps, which use fields of the HTTP header, can't be implemented in feed-rs and need to be handled by the library/program that uses it.

What is your opinion on the issue?

If URIs get resolved but some of them fail because there is not enough information in the XML itself, should the resulting Link struct indicate that it is a partial URL? Or should the calling library just watch for url::ParseError::RelativeUrlWithoutBase?
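
For reference, the caller-side half of this with the url crate, assuming the base is known from the HTTP request or xml:base (the URLs below are made up):

use url::Url;

fn main() -> Result<(), url::ParseError> {
    // The base would come from xml:base or the feed's own URL, which the
    // caller usually knows from the HTTP request
    let base = Url::parse("https://example.com/podcast/feed.xml")?;
    let enclosure = base.join("episodes/001.mp3")?;
    assert_eq!(enclosure.as_str(), "https://example.com/podcast/episodes/001.mp3");

    // Without a base, a relative href fails with RelativeUrlWithoutBase
    assert_eq!(
        Url::parse("episodes/001.mp3").unwrap_err(),
        url::ParseError::RelativeUrlWithoutBase
    );
    Ok(())
}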

[Bad Feed] No ID & no Links

Got a bug report for a particularly bad feed: https://gitlab.com/news-flash/news_flash_gtk/-/issues/213

https://feeds.feedburner.com/ingreso_dival

The issue is that the items have neither an ID nor a link, so random IDs are generated, which breaks updating the feed.

A solution would be to combine the feed URL with the title and hash that to generate an ID. If you think that is an acceptable approach, or have a better idea, I can create a PR.

Btw: is there ever a use case for randomly generated IDs? For me it has caused more headaches than it has solved problems, but that's just my experience.
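
A sketch of the proposed fallback (illustrative only, not feed-rs's current behaviour), deriving a stable ID from the feed URL plus the entry title:

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// DefaultHasher is not guaranteed stable across Rust versions; a real
// implementation would use a fixed hash function instead
fn stable_id(feed_url: &str, title: &str) -> String {
    let mut hasher = DefaultHasher::new();
    feed_url.hash(&mut hasher);
    title.hash(&mut hasher);
    format!("{:x}", hasher.finish())
}

fn main() {
    let a = stable_id("https://feeds.feedburner.com/ingreso_dival", "Some headline");
    let b = stable_id("https://feeds.feedburner.com/ingreso_dival", "Some headline");
    assert_eq!(a, b); // re-parsing the same feed yields the same ID
}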

Fails to build out-of-box on a new project

On my machine, if I create a new project with cargo new and add feed-rs = "0.4" as my only dependency, I get the following error:

   Compiling feed-rs v0.4.1
error[E0603]: module `export` is private
   --> C:\.cargo\registry\src\github.com-1ecc6299db9ec823\feed-rs-0.4.1\src\xml\mod.rs:10:12
    |
10  | use serde::export::Formatter;
    |            ^^^^^^ private module
    |
note: the module `export` is defined here
   --> C:\.cargo\registry\src\github.com-1ecc6299db9ec823\serde-1.0.120\src\lib.rs:275:5
    |
275 | use self::__private as export;
    |     ^^^^^^^^^^^^^^^^^^^^^^^^^

error: aborting due to previous error

For more information about this error, try `rustc --explain E0603`.
error: could not compile `feed-rs`

Maybe this was due to an update to serde?

Support namespaces in feeds

  • Ignore elements from unknown namespaces
  • Automatically handle elements from namespaces such as Dublin Core (e.g. date, which is used in RSS 1.0 feeds such as Slashdot's).

Some fields may not be read correctly

I'm using feed_rs to parse the content of the Brothers Brick website.

Items in their RSS feed are defined this way:

<title>Cadet Thrawn outwits his opponents in the metallurgy lab</title>
<link>http://feedproxy.google.com/~r/TheBrothersBrick/~3/eWF3-ZnktaM/</link>
<comments>https://www.brothers-brick.com/2019/02/08/cadet-thrawn-outwits-his-opponents-in-the-metallurgy-lab/#respond</comments>
<pubDate>Fri, 08 Feb 2019 14:00:19 +0000</pubDate>
<guid isPermaLink="false">https://www.brothers-brick.com/?p=171427</guid>
<description><![CDATA[This detailed scene by CRCT Productions depicts the famous Grand Admiral Thrawn in his early days as an Imperial cadet.]]></description>
		<content:encoded><![CDATA[CONTENT IS REDACTED as GitHub interprets it as valid HTML]]></content:encoded>
	<wfw:commentRss>https://www.brothers-brick.com/2019/02/08/cadet-thrawn-outwits-his-opponents-in-the-metallurgy-lab/feed/</wfw:commentRss>
<slash:comments>0</slash:comments>

171427
<feedburner:origLink>https://www.brothers-brick.com/2019/02/08/cadet-thrawn-outwits-his-opponents-in-the-metallurgy-lab/</feedburner:origLink>

And the content can't be read. The full RSS feed is at https://www.brothers-brick.com/feed/

xml:base for content

This is a follow-up to #104, since this case was not covered by #107.

As mentioned in the previous issue: some feeds have relative URLs as part of their HTML content. A good example is https://insanity.industries/index.xml with:

<item xml:base="https://insanity.industries/post/pareto-optimal-compression/">
...
<content:encoded><![CDATA[
    ...
    <figure>
        <a href="example.svg">
		    <img src="example.svg"
    	         alt="Compression results for different hypothetical compression algorithms, including the Pareto frontier indicated in blue."/>
        </a><figcaption><p>Compression results for different hypothetical compression algorithms, including the Pareto frontier indicated in blue.</p>
            </figcaption>
    </figure>
    ...
    ]]></content:encoded>
    ...
</item>

@markpritchard responded:

I'm comfortable switching all the URLs in the RSS/Atom content to absolute by applying xml:base but I wouldn't want to add an HTML parser to feed-rs by default. Might be worth playing around with as a feature (I've never done that in Rust ... might be interesting to learn).

Overeager whitespace trimming

I'm not sure if this is an issue with xml-rs, but let's go one level deeper at a time.
Apparently feed-rs trims the spaces before and after links in the planet.gnome.org Atom feed. I tracked the issue down to element_source.rs:L30; setting trim_whitespace(false) "fixes" the issue.

Do you think turning whitespace trimming off is a sensible solution for this bug? Should this be reported to xml-rs?

Original report with image illustrating the issue: https://gitlab.com/news-flash/news_flash_base/-/issues/9
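
For reference, the xml-rs setting in question; a small illustration (the feed snippet is made up) of what flipping it changes for character data:

use xml::{EventReader, ParserConfig};

fn main() {
    let xml = r#"<feed><title>  Planet GNOME  </title></feed>"#;
    // feed-rs currently trims whitespace in element_source.rs; passing
    // trim_whitespace(false) to xml-rs preserves the padding in Characters events
    let config = ParserConfig::new().trim_whitespace(false);
    let reader = EventReader::new_with_config(xml.as_bytes(), config);
    for event in reader {
        println!("{:?}", event.unwrap());
    }
}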

Consistent GUID for feeds

Similar to #11, feeds get randomly generated IDs for RSS 1.0 and 2.0. Atom parses an ID if it's there, but could also end up with a random ID (not sure if that happens in the real world).
It would be cool to have a solution for this similar to #13.

RSS 1.0 no consistent GUID

The RSS 1.0 parser uses the auto-generated GUID from Entry::default(). This means parsing a feed a second time for new articles results in duplicate articles.
One could use the URL to check for duplicates, but IMO it would be nicer if feed_rs generated a consistent ID for each item, probably based on the URL of the article.

Panics on invalid XML

The library currently panics on the following input:

<?xml version="1.0" encoding="UTF-8" ?>
<rss version="2.0"
	
>
<channel>
<title>Reuters: Most Read Articles</title>
<link>https://www.reuters.com</link>
<description>Reuters.com is your source for breaking news, business, financial and investing news, including personal finance and stocks.  Reuters is the leading global provider of news, financial information and technology solutions to the world's media, financial institutions, businesses and individuals.</description>
<image>
	<title>Reuters News</title>
	<width>120</width>
	<height>35</height>
	<link>https://www.reuters.com</link>
	<url>https://www.reuters.com/resources_v2/images/reuters125.png</url>
</image>
<language>en-us</language>
<lastBuildDate>Sat, 21 Mar 2020 06:29:51 -0400</lastBuildDate>
<copyright>All rights reserved. Users may download and print extracts of content from this website for their own personal and non-commercial use only. Republication or redistribution of Reuters content, including by framing or similar means, is expressly prohibited without the prior written consent of Reuters. Reuters and the Reuters sphere logo are registered trademarks or trademarks of the Reuters group of companies around the world. &#169; Reuters 2020</copyright>
<!--Property passed to loop is null--><!-- Exception rendering module on server  -->

This is being served by the following URL (at the time of writing): http://feeds.reuters.com/reuters/MostRead

The panic is due to an unwrap on src/util/element_source.rs:250:9, and can be verified with the testurls example:

$ echo "http://feeds.reuters.com/reuters/MostRead" | cargo run --bin testurls
    Finished dev [unoptimized + debuginfo] target(s) in 0.06s
     Running `target/debug/testurls`
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Error { pos: 22:1, kind: Syntax("Unexpected end of stream: still inside the root element") }', /home/guilherme/Projects/feed-rs/feed-rs/src/util/element_source.rs:250:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
http://feeds.reuters.com/reuters/MostRead

Other inputs that will also panic:

Not parsing HEATED rss feed correctly

I am a user of NewsFlash GTK, which uses feed-rs. I've reported an issue against their repo for not parsing the HEATED RSS feed correctly. You can find the original issue here.

It seems that it tries to parse the content:encoded as an image. This is the output the NewsFlash dev sees when trying to parse an entry from HEATED:

Entry {
    id: "https://heated.world/p/twitters-big-oil-ad-loophole",
    title: Some(Text { content_type: "text/plain", src: None, content: "Twitter\'s Big Oil ad loophole" }),
    updated: Some(2021-02-03T05:17:19Z),
    authors: [Person { name: "Emily Atkin", uri: None, email: None }],
    content: Some(Content {
        body: None,
        content_type: "image/jpeg",
        length: Some(0),
        src: Some(Link { href: "https://cdn.substack.com/image/fetch/h_600,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F8c4a2557-4e1b-417f-8b87-6c954a50576c_1072x1102.png", rel: None, media_type: None, href_lang: None, title: None, length: None })
    }),
    links: [Link { href: "https://heated.world/p/twitters-big-oil-ad-loophole", rel: None, media_type: None, href_lang: None, title: None, length: None }],
    summary: Some(Text {
        content_type: "text/plain",
        src: None,
        content: "Climate groups can\'t pay Twitter to spread political content. But the oil industry can, and it\'s ramping up its efforts alongside Biden\'s climate push."
    }),
    categories: [],
    contributors: [],
    published: Some(2021-02-02T12:01:03Z),
    source: None,
    rights: None,
    media: []
}

You can find the RSS feed here.

Implement Clone on feed_rs::model::Entry?

I'm trying to use feed_rs to make some aggregator-like tool with some filtering capabilities.

For this to work, I came up with the idea of getting entries from different feeds, merging them into a single "stream", and applying the filters I need, but I got stuck because I can't concatenate the entries (feed_rs::model::Feed::entries, which is a Vec), as feed_rs::model::Entry does not implement the Clone trait.

Is this intentional? Is there a good reason why this is the way it is, or has it simply never been a problem before?
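
As a side note: merging by moving the entries out of each parsed feed doesn't require Clone at all; Clone only becomes necessary if the same Entry has to live in several collections at once. A minimal sketch:

use feed_rs::model::{Entry, Feed};

// Merge entries from several parsed feeds into one stream by moving them,
// which needs no Clone implementation
fn merge(feeds: Vec<Feed>) -> Vec<Entry> {
    let mut all = Vec::new();
    for feed in feeds {
        all.extend(feed.entries);
    }
    // Hypothetical filter: keep only entries that have a title
    all.retain(|e| e.title.is_some());
    all
}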

Build errors with 0.2.2

Sadly the 0.2.2 update introduced some build errors:

error[E0599]: no method named `map_or` found for type `std::result::Result<chrono::datetime::DateTime<chrono::offset::fixed::FixedOffset>, chrono::format::ParseError>` in the current scope
  --> feed-rs/src/parser/util.rs:47:10
   |
47 |         .map_or(None, |t| Some(t.with_timezone(&Utc)))
   |          ^^^^^^ help: there is a method with a similar name: `map_err`

error[E0599]: no method named `map_or` found for type `std::result::Result<chrono::datetime::DateTime<chrono::offset::fixed::FixedOffset>, chrono::format::ParseError>` in the current scope
  --> feed-rs/src/parser/util.rs:54:10
   |
54 |         .map_or(None, |t| Some(t.with_timezone(&Utc)))
   |          ^^^^^^ help: there is a method with a similar name: `map_err`

error: aborting due to 2 previous errors
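
The error looks like a toolchain issue rather than a chrono one: Result::map_or was only stabilized in Rust 1.41, so older compilers reject it. On such toolchains the same conversion can be written without it; a sketch:

use chrono::{DateTime, Utc};

// Equivalent to `.map_or(None, |t| Some(t.with_timezone(&Utc)))`
// without relying on Result::map_or
fn to_utc(text: &str) -> Option<DateTime<Utc>> {
    DateTime::parse_from_rfc2822(text)
        .ok()
        .map(|t| t.with_timezone(&Utc))
}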

Convert text to UTF-8

I'm not sure if this is something that should be implemented in feed-rs itself or rather in xml-rs.
I think we all know Rust strings are UTF-8, while the XML documents of feeds can be in a number of different encodings. The text encoding used is declared in the XML prolog like this:

<?xml version="1.0" encoding="ISO-8859-1" ?>

xml-rs picks up that information but, so far, doesn't do anything with it. Its documentation for the parsed encoding says:

XML document encoding.

If the XML declaration is not present or does not contain an encoding attribute, this defaults to "UTF-8". This field is currently used for no purpose other than information.

So the question is: should the text be converted to UTF-8 in xml-rs already, or is feed-rs the right place to handle this? Even if it eventually should be part of xml-rs's functionality, is a temporary solution in feed-rs a good idea?

Obligatory link to the related bug report in an application using feed-rs:
https://gitlab.com/news-flash/news_flash/-/issues/35
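
One way the conversion could be done (in either crate, or by the caller) is with encoding_rs, using the label from the XML declaration. A sketch, assuming the label has already been extracted from the prolog:

use encoding_rs::Encoding;

// Decode raw feed bytes to UTF-8 using a label such as "ISO-8859-1";
// extracting the label from the <?xml ... encoding="..."?> prolog is
// left out here
fn decode(bytes: &[u8], label: &str) -> Option<String> {
    let encoding = Encoding::for_label(label.as_bytes())?;
    let (text, _, had_errors) = encoding.decode(bytes);
    if had_errors { None } else { Some(text.into_owned()) }
}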

Create a separate field for feed_url

Can you please create a separate field in the feed model for the feed URL?

Currently, the feed model has a links field. It's not obvious at which position the feed URL is located for a JSON Feed. JSON feeds have feed_url and homepage_url.

Another option is to get rid of the links field and create a link field that contains the feed URL.
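
With the current model, one caller-side workaround is to prefer the link with rel="self" (assuming the feed's own URL appears there, as it does for well-formed Atom feeds) and fall back to the first link. A sketch, not a feed-rs API:

use feed_rs::model::Feed;

// Prefer the rel="self" link, otherwise fall back to the first link
fn feed_url(feed: &Feed) -> Option<&str> {
    feed.links
        .iter()
        .find(|l| l.rel.as_deref() == Some("self"))
        .or_else(|| feed.links.first())
        .map(|l| l.href.as_str())
}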

Cargo fmt

Would it be okay to run cargo fmt on everything? Right now rustfmt produces a relatively large diff for me locally.
