A simple feed parser (RSS, Atom, JSON Feed)
The parser library is in feed-rs, with a simple test tool in testurls
Home Page: https://crates.io/crates/feed-rs
When parsing https://www.spreaker.com/show/4273892/episodes/feed, the channel.logo field is always None, even though the feed has an image tag:
<image>
<url>https://d3wo5wojvuv7l.cloudfront.net/t_rss_itunes_square_1400/images.spreaker.com/original/0dcd53afca70854beb456079fa25d3f1.jpg</url>
<title>Lwowska Fala | Radio Katowice</title>
<link>https://www.spreaker.com/show/lwowska-fala-radio-katowice</link>
</image>
Other feeds with the same <image> structure have their logo parsed correctly.
I'm using feed-rs version 0.6.1.
I know this one is a bit more exotic, but since the purpose of this crate is to be one-size-fits-all for feeds, it would be nice to have.
Of course this should be prioritized pretty low, since it's rarely used in the wild.
Hello,
I'm packaging feed-rs as a dependency for Fedora, and we run cargo test on all crates we package. However, it seems that the test files (the fixture folder) are not shipped. Is there a reason not to include them?
If there is, could you also exclude src/util/test.rs so that cargo test won't run it?
Thanks for your cooperation!
Sending a GET request with reqwest, getting the raw bytes from https://rss.golem.de/rss.php?feed=RSS1.0, and feeding them to feed-rs yields the following error when parsing the description of the first entry:
Err value: Error { pos: 1:1, kind: Utf8(Utf8Error { valid_up_to: 0, error_len: Some(1) }) }
Getting the response body with .text() works fine.
I prepared a small code sample:
use std::io::BufReader;

#[tokio::main]
async fn main() {
    let rss_response = reqwest::get("https://rss.golem.de/rss.php?feed=RSS1.0").await.unwrap();
    //let rss_response = rss_response.text().await.unwrap();
    //let feed = feed_rs::parser::parse(rss_response.as_bytes()).unwrap();
    let rss_response = rss_response.bytes().await.unwrap().to_vec();
    let feed = feed_rs::parser::parse(BufReader::new(rss_response.as_slice())).unwrap();
    println!("{:?}", feed.title);
}
The description it chokes on is
<description>Envelope, ein Umschlag aus Papier für das eigene Telefon. Allerdings können Nutzer damit den Bildschirm nicht mehr erkennen und nur noch ein Nummernpad und wenige Tasten verwenden. Einen Bastelbogen und die App stellt das Team kostenlos zur Verfügung. (<a href="https://www.golem.de/specials/smartphone/">Smartphone</a>, <a href="https://www.golem.de/specials/google/">Google</a>) <img src="https://cpx.golem.de/cpx.php?class=17&amp;aid=146212&amp;page=1&amp;ts=1579701600" alt="" width="1" height="1" /></description>
The only difference I can think of is that reqwest somehow fixes a malformed sequence in the feed when using .text(). From the reqwest documentation:
This method decodes the response body with BOM sniffing and with malformed sequences replaced with the REPLACEMENT CHARACTER. The encoding is determined from the charset parameter of the Content-Type header, and defaults to utf-8 if not present.
Cargo.toml says it is MIT OR Apache-2.0, but the LICENSE file contains only the MIT license text…
I would appreciate a new release after you decide which one is correct and make the necessary corrections.
Thanks!
I did some testing and found that the following feeds caused an InvalidDateTime error of one sort or another:
thread 'main' panicked at 'unable to parse http://www.breakingthin.gs/rss.xml: ParseError(InvalidDateTime(ParseError(Impossible)))', src/libcore/result.rs:1165:5
thread 'main' panicked at 'unable to parse http://donmelton.com/rss.xml: ParseError(InvalidDateTime(ParseError(NotEnough)))', src/libcore/result.rs:1165:5
thread 'main' panicked at 'unable to parse http://www.evanmiller.org/news.xml: ParseError(InvalidDateTime(ParseError(Impossible)))', src/libcore/result.rs:1165:5
thread 'main' panicked at 'unable to parse http://www.sealiesoftware.com/blog/rss.xml: ParseError(InvalidDateTime(ParseError(OutOfRange)))', src/libcore/result.rs:1165:5
thread 'main' panicked at 'unable to parse http://sourceforge.net/p/objectivelib/news/feed: ParseError(InvalidDateTime(ParseError(NotEnough)))', src/libcore/result.rs:1165:5
thread 'main' panicked at 'unable to parse http://feeds.feedburner.com/silverback: ParseError(InvalidDateTime(ParseError(Impossible)))', src/libcore/result.rs:1165:5
thread 'main' panicked at 'unable to parse http://feeds.feedburner.com/simpledesktops: ParseError(InvalidDateTime(ParseError(NotEnough)))', src/libcore/result.rs:1165:5
thread 'main' panicked at 'unable to parse http://theredditblog.disqus.com/lesswrong_the_coolest_use_of_reddit_source_weve_found_to_date/latest.rss: ParseError(InvalidDateTime(ParseError(NotEnough)))', src/libcore/result.rs:1165:5
thread 'main' panicked at 'unable to parse http://www.xkcd.com/rss.xml: ParseError(InvalidDateTime(ParseError(NotEnough)))', src/libcore/result.rs:1165:5
thread 'main' panicked at 'unable to parse http://www.mechanicalgirl.com/feeds/all/: ParseError(InvalidDateTime(ParseError(NotEnough)))', src/libcore/result.rs:1165:5
thread 'main' panicked at 'unable to parse http://cdevroe.com/status/feed: ParseError(InvalidDateTime(ParseError(Invalid)))', src/libcore/result.rs:1165:5
thread 'main' panicked at 'unable to parse http://feeds.feedburner.com/danielmall-articles: ParseError(InvalidDateTime(ParseError(Impossible)))', src/libcore/result.rs:1165:5
thread 'main' panicked at 'unable to parse http://alex.amiran.it/rss.xml: ParseError(InvalidDateTime(ParseError(NotEnough)))', src/libcore/result.rs:1165:5
thread 'main' panicked at 'unable to parse https://feeds.feedburner.com/oshogbo: ParseError(InvalidDateTime(ParseError(Invalid)))', src/libcore/result.rs:1165:5
thread 'main' panicked at 'unable to parse https://www.forrestthewoods.com/rss.xml: ParseError(InvalidDateTime(ParseError(Invalid)))', src/libcore/result.rs:1165:5
thread 'main' panicked at 'unable to parse http://osblog.stephenmarz.com/feed.rss: ParseError(InvalidDateTime(ParseError(Impossible)))', src/libcore/result.rs:1165:5
thread 'main' panicked at 'unable to parse https://myrrlyn.net/blog.rss: ParseError(InvalidDateTime(ParseError(NotEnough)))', src/libcore/result.rs:1165:5
thread 'main' panicked at 'unable to parse https://blog.burntsushi.net/index.xml: ParseError(InvalidDateTime(ParseError(NotEnough)))', src/libcore/result.rs:1165:5
thread 'main' panicked at 'unable to parse https://interrupt.memfault.com/blog/feed.xml: ParseError(InvalidDateTime(ParseError(Invalid)))', src/libcore/result.rs:1165:5
thread 'main' panicked at 'unable to parse https://duncan.bayne.id.au/feed.xml: ParseError(InvalidDateTime(ParseError(Invalid)))', src/libcore/result.rs:1165:5
thread 'main' panicked at 'unable to parse https://groups.google.com/forum/feed/pagedout-notifications/msgs/rss.xml?num=15: ParseError(InvalidDateTime(ParseError(NotEnough)))', src/libcore/result.rs:1165:5
thread 'main' panicked at 'unable to parse http://www.shervinemami.info/rss.xml: ParseError(InvalidDateTime(ParseError(Impossible)))', src/libcore/result.rs:1165:5
Original issue: https://gitlab.com/news-flash/news_flash_gtk/-/issues/13
The problem here is a website offering several different feeds (RSS 2.0) that all use the same link: <link>https://3dnews.ru/</link>.
A possible solution could be to hash not only the first available link but also the title.
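To illustrate the suggestion, here is a minimal sketch using the standard library hasher. The entry_id helper is hypothetical, not feed-rs's actual ID scheme:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hypothetical sketch: hash the link and the title together so entries
// that share a link still get distinct IDs.
fn entry_id(link: &str, title: &str) -> String {
    let mut hasher = DefaultHasher::new();
    link.hash(&mut hasher);
    title.hash(&mut hasher);
    format!("{:x}", hasher.finish())
}

fn main() {
    let a = entry_id("https://3dnews.ru/", "Article A");
    let b = entry_id("https://3dnews.ru/", "Article B");
    // Same link, different titles -> different IDs
    assert_ne!(a, b);
    println!("{a} {b}");
}
```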
Version: 1.0.0
It fails to parse the content from https://blog.cloudflare.com/rss/. This appears to use <content:encoded>, and it looks like feed_rs has support for that, but for some reason it isn't being picked up.
Example entry:
Entry {
id: "615d65eebd615902a7538284",
title: Some(
Text {
content_type: "text/plain",
src: None,
content: "Staging TLS Certificates: Make every deployment a safe deployment",
},
),
updated: Some(
2021-10-07T16:56:53Z,
),
authors: [
Person {
name: "Dina Kozlov",
uri: None,
email: None,
},
],
content: None,
links: [
Link {
href: "https://blog.cloudflare.com/staging-tls-certificate-every-deployment-safe-deployment/",
rel: None,
media_type: None,
href_lang: None,
title: None,
length: None,
},
],
summary: Some(
Text {
content_type: "text/plain",
src: None,
content: "We are excited to announce that Enterprise customers now have the ability to test custom uploaded certificates in a staging environment before pushing them to production. ",
},
),
categories: [
Category {
term: "TLS",
scheme: None,
label: None,
},
Category {
term: "SSL",
scheme: None,
label: None,
},
],
contributors: [],
published: Some(
2021-10-06T12:56:13Z,
),
source: None,
rights: None,
media: [
MediaObject {
title: None,
content: [
MediaContent {
url: Some(
Url {
scheme: "https",
cannot_be_a_base: false,
username: "",
password: None,
host: Some(
Domain(
"blog.cloudflare.com",
),
),
port: None,
path: "/content/images/2021/10/staging-tls-certificate-every-deployment-safe-deployment-OG-1.png",
query: None,
fragment: None,
},
),
content_type: None,
height: None,
width: None,
duration: None,
size: None,
rating: None,
},
],
duration: None,
thumbnails: [],
texts: [],
description: None,
community: None,
credits: [],
},
],
},
Use configuration to control strict parsing:
E.g. https://github.com/feed-rs/feed-rs/releases.atom currently fails with
ParseError(MissingContent("content.type"))
It appears other feed readers simply default to the current time when the timestamp cannot be parsed.
Hi (great library, I really appreciate the effort in creating a unified feed parser!). After using feed-rs for a bit and wanting to swap over to it in a project, I noticed that the entry bodies of certain feeds weren't being included; they always ended up unwrapping to None. After looking at the feed data, I think it's because the content:encoded element isn't being parsed into the model as a body.
I looked at the tests to see if it was covered (and thus whether I was doing something wrong) and noticed that while there is a sample file that has content:encoded in it:
feed-rs/feed-rs/fixture/rss_2.0_example_4.xml
Lines 49 to 55 in 949c3ea
The body for it isn't being tested:
feed-rs/feed-rs/src/parser/rss2/tests.rs
Lines 116 to 126 in 949c3ea
Is it possible to determine the type of a parsed feed (RSS, Atom, JSON)?
The latest version of serde made the serde::export module truly private, and now feed-rs fails to build:
|
10 | use serde::export::Formatter;
| ^^^^^^ private module
|
Feed URL: https://alistapart.com/main/feed
The XML is dirty, full of redundant '\t' and '\n' characters. I think trimming whitespace from the fields before processing should resolve most issues, but I'm not sure about that.
YouTube seems to use Media RSS in an Atom feed, and that is also what feed-rs supports. But looking at https://www.rssboard.org/media-rss, it says (my emphasis):
This is version 1.5.1 of the Media RSS specification, a namespace for RSS 2.0 published on Dec. 11, 2009
And indeed there are feeds like https://s.ch9.ms/Shows/Azure-Friday/feed/mp4high that have Media RSS in RSS 2.0. That feed also has itunes tags, so finding a good way of merging those might be a bit tricky.
A workable approach might be to always prefer Media RSS over itunes.
An issue reported to me some time ago: https://gitlab.com/news-flash/news_flash_gtk/-/issues/154#note_446893398
This unwrap() can cause an error; in the case above the error is UnquotedValue(32).
My suggestion is to use filter_map instead and ignore the attributes that currently unwrap to an error. If that is an acceptable solution, I can prepare a PR.
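A minimal sketch of the suggested filter_map approach. The Result-of-pairs type below is a stand-in for quick-xml's actual attribute API:

```rust
// Sketch of the proposed fix: collect only the attributes that parsed
// successfully and drop the ones that would currently cause an unwrap panic.
fn collect_attrs(raw: Vec<Result<(String, String), String>>) -> Vec<(String, String)> {
    raw.into_iter().filter_map(Result::ok).collect()
}

fn main() {
    let raw = vec![
        Ok(("href".to_string(), "https://example.com".to_string())),
        Err("UnquotedValue(32)".to_string()), // malformed attribute
        Ok(("rel".to_string(), "alternate".to_string())),
    ];
    let attrs = collect_attrs(raw);
    // The malformed attribute was skipped instead of panicking
    assert_eq!(attrs.len(), 2);
    println!("{attrs:?}");
}
```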
Okay, I'm not sure if this can even be fixed without slowing down the parsing quite a bit. So possibly the solution is to close this issue and put a warning somewhere: "Don't do this".
I'll use testurls as an example, since it also triggers the issue.
The text() method of reqwest already tries to find the encoding in the HTTP headers and decodes the body to UTF-8. Now we have a UTF-8 encoded XML string which still contains something like <?xml version="1.0" encoding="ISO-8859-1"?>.
So feed-rs, or rather quick-xml, does what it is told: it interprets the bytes as ISO-8859-1 and tries to convert them to UTF-8. Sadly, that just scrambles the nicely decoded characters again.
A stripped down version of testurls:

use feed_rs::parser;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let xml = reqwest::blocking::get("https://rss.golem.de/rss.php?feed=RSS1.0")?.text()?;
    match parser::parse(xml.as_bytes()) {
        Ok(_feed) => println!("ok"),
        Err(error) => println!("failed: {:?}\n{:?}\n-------------------------------------------------------------", error, xml),
    }
    Ok(())
}
results in:
Some(Text { content_type: "text/plain", src: None, content: "Cloud-Computing: Cloudical kündigt kompletten Cloud-Open-Source-Stack an" })
Using the raw bytes() instead of the decoded text() works as expected:
use feed_rs::parser;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let xml = reqwest::blocking::get("https://rss.golem.de/rss.php?feed=RSS1.0")?.bytes()?;
    match parser::parse(xml.as_ref()) {
        Ok(_feed) => println!("ok"),
        Err(error) => println!("failed: {:?}\n{:?}\n-------------------------------------------------------------", error, xml),
    }
    Ok(())
}
Some(Text { content_type: "text/plain", src: None, content: "Cloud-Computing: Cloudical kündigt kompletten Cloud-Open-Source-Stack an" })
By the way: that also explains why you couldn't reproduce #10 with testurls. Everything is starting to make sense to me.
I'll switch to bytes() in my code and see what opinions others have regarding this issue.
Feed URL: http://feeds.feedburner.com/RockPaperShotgun
I'm not sure why this one isn't working. There is a description tag in the XML, and it contains escaped HTML, but feed-rs returns None for it.
Some feeds provide relative URIs for things like enclosures. I recently implemented some basic code to resolve these URIs in my feed reader based on feed-rs. The bug report I got was refreshingly detailed and helpful, and it linked a step-by-step description of how a big Python feed parser handles this problem:
https://pythonhosted.org/feedparser/resolving-relative-links.html#how-relative-uris-are-resolved
Since the first few steps rely on knowledge of the feed's XML, they should probably be implemented in feed-rs. The later steps, which use fields of the HTTP header, can't be implemented in feed-rs and need to be handled by the library/program that makes use of it.
What is your opinion on the issue?
If URIs get resolved but some of them fail because there is not enough information in the XML itself, should the resulting Link struct indicate that it is a partial URL? Or should the calling library just watch for url::ParseError::RelativeUrlWithoutBase?
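As a rough illustration of the XML-side step, here is a deliberately naive sketch. It is not how the url crate or feedparser actually resolve references; it assumes the base ends with a slash and ignores ../ segments, query strings, and fragments. The base URL is hypothetical:

```rust
// Naive sketch: treat an href as already absolute when it contains a
// scheme separator, otherwise join it onto the base. A real implementation
// should use a proper URL library (e.g. the url crate) instead.
fn resolve(base: &str, href: &str) -> String {
    if href.contains("://") {
        href.to_string()
    } else {
        format!("{base}{href}")
    }
}

fn main() {
    let base = "https://example.com/feed/"; // hypothetical xml:base
    assert_eq!(resolve(base, "episode.mp3"), "https://example.com/feed/episode.mp3");
    // Absolute hrefs pass through unchanged
    assert_eq!(resolve(base, "https://cdn.example.com/a.mp3"), "https://cdn.example.com/a.mp3");
    println!("ok");
}
```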
Got a bug report for a particularly bad feed: https://gitlab.com/news-flash/news_flash_gtk/-/issues/213
https://feeds.feedburner.com/ingreso_dival
The issue is that the items have neither an ID nor a link, so random IDs get generated, which breaks updating the feed.
A solution would be to combine the feed URL with the title and hash that to generate an ID. If you think that is an acceptable approach, or have a better idea, I can create a PR.
By the way: is there ever a use case for having randomly generated IDs? It has caused more headaches for me than it solved problems. But that's just my experience.
On my machine, if I create a new project with cargo new and add feed-rs = "0.4" as my only dependency, I get the following error:
Compiling feed-rs v0.4.1
error[E0603]: module `export` is private
--> C:\.cargo\registry\src\github.com-1ecc6299db9ec823\feed-rs-0.4.1\src\xml\mod.rs:10:12
|
10 | use serde::export::Formatter;
| ^^^^^^ private module
|
note: the module `export` is defined here
--> C:\.cargo\registry\src\github.com-1ecc6299db9ec823\serde-1.0.120\src\lib.rs:275:5
|
275 | use self::__private as export;
| ^^^^^^^^^^^^^^^^^^^^^^^^^
error: aborting due to previous error
For more information about this error, try `rustc --explain E0603`.
error: could not compile `feed-rs`
Maybe this was due to a serde update?
Reddit generates invalid RSS feeds. Can you make the parser skip invalid items?
date, which is used in RSS 1.0 feeds such as Slashdot's).
https://www.tjrs.jus.br/site_php/noticias/news_rss.php
This feed fails with ParseError(NoFeedRoot)
I'm using feed_rs to parse content from the Brothers Brick website.
Items in their RSS feed are defined this way:
<title>Cadet Thrawn outwits his opponents in the metallurgy lab</title>
<link>http://feedproxy.google.com/~r/TheBrothersBrick/~3/eWF3-ZnktaM/</link>
<comments>https://www.brothers-brick.com/2019/02/08/cadet-thrawn-outwits-his-opponents-in-the-metallurgy-lab/#respond</comments>
<pubDate>Fri, 08 Feb 2019 14:00:19 +0000</pubDate>
<guid isPermaLink="false">https://www.brothers-brick.com/?p=171427</guid>
<description><![CDATA[This detailed scene by CRCT Productions depicts the famous Grand Admiral Thrawn in his early days as an Imperial cadet.]]></description>
<content:encoded><![CDATA[CONTENT IS REDACTED as GitHub interprets it as valid HTML]]></content:encoded>
<wfw:commentRss>https://www.brothers-brick.com/2019/02/08/cadet-thrawn-outwits-his-opponents-in-the-metallurgy-lab/feed/</wfw:commentRss>
<slash:comments>0</slash:comments>
<post-id xmlns="com-wordpress:feed-additions:1">171427</post-id>
<feedburner:origLink>https://www.brothers-brick.com/2019/02/08/cadet-thrawn-outwits-his-opponents-in-the-metallurgy-lab/</feedburner:origLink>
And the content can't be read. The full RSS feed is at https://www.brothers-brick.com/feed/
Original report: https://gitlab.com/news-flash/news_flash_base/-/issues/24
The issue: the date format is slightly off,
2014-12-29T14:53:35+0200
instead of
2014-12-29T14:53:35+02:00
Imo feed-rs should try a few common formats before giving up on parsing a date.
The atom feed is: https://feeds.feedburner.com/sallar
Maybe something like RFC2822_FIXES for RFC 3339 would make sense here. Sadly, my knowledge of regular expressions is somewhat limited.
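The fix could also be done without regular expressions. Here is a hypothetical pre-processing sketch (fix_offset is not feed-rs's actual code): rewrite a trailing numeric offset like "+0200" into the RFC 3339 form "+02:00" before handing the timestamp to the date parser.

```rust
// Hypothetical helper: if the string ends in a sign followed by four
// digits, insert a colon between hours and minutes; otherwise return
// the input unchanged.
fn fix_offset(ts: &str) -> String {
    if ts.len() >= 5 {
        let (head, tail) = ts.split_at(ts.len() - 5);
        let bytes = tail.as_bytes();
        if (bytes[0] == b'+' || bytes[0] == b'-')
            && bytes[1..].iter().all(u8::is_ascii_digit)
        {
            return format!("{head}{}:{}", &tail[..3], &tail[3..]);
        }
    }
    ts.to_string()
}

fn main() {
    assert_eq!(fix_offset("2014-12-29T14:53:35+0200"), "2014-12-29T14:53:35+02:00");
    // Already-valid timestamps are left untouched
    assert_eq!(fix_offset("2014-12-29T14:53:35+02:00"), "2014-12-29T14:53:35+02:00");
    println!("ok");
}
```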
This is a follow-up to #104, since this case was not covered by #107.
As mentioned in the previous issue, some feeds have relative URLs as part of their HTML content. A good example is https://insanity.industries/index.xml with:
<item xml:base="https://insanity.industries/post/pareto-optimal-compression/">
...
<content:encoded><![CDATA[
...
<figure>
<a href="example.svg">
<img src="example.svg"
alt="Compression results for different hypothetical compression algorithms, including the Pareto frontier indicated in blue."/>
</a><figcaption><p>Compression results for different hypothetical compression algorithms, including the Pareto frontier indicated in blue.</p>
</figcaption>
</figure>
...
]]></content:encoded>
...
</item>
@markpritchard responded
I'm comfortable switching all the URLs in the RSS/Atom content to absolute by applying xml:base but I wouldn't want to add an HTML parser to feed-rs by default. Might be worth playing around with as a feature (I've never done that in Rust ... might be interesting to learn).
I'm not sure if this is an issue with xml-rs, but let's go one level deeper at a time.
Apparently feed-rs trims the spaces before and after links in the planet.gnome.org Atom feed. I tracked the issue down to element_source.rs:L30. Setting trim_whitespace(false) "fixes" the issue.
Do you think turning whitespace trimming off is a sensible solution for this bug? Should this be reported to xml-rs?
Original report with image illustrating the issue: https://gitlab.com/news-flash/news_flash_base/-/issues/9
Used in freedesktop-sdk 19.08 per #30.
We are currently tied to Rust 1.39.0 due to it shipping in freedesktop 19.08.
Is this still a requirement, @jangernert, or can we bump to the latest stable (1.48.0)?
Every feed item in https://spezialgelagert.de/feed/podcast/ has two MediaObjects. One just contains the MP3 URL from the <enclosure>; the other MediaObject has all the information from the itunes tags. I originally envisioned that those would be merged.
The RSS 1.0 parser uses the auto-generated GUID from Entry::default(). This means parsing a feed a second time for new articles results in duplicate articles.
One could use the URL to check for duplicates, but imo it would be nicer if feed_rs generated a consistent ID for each item, probably based on the URL of the article.
I have an aging website whose backing store is a whole pile of Atom Entry Documents, and I'm contemplating replacing the Python script that processes them with a Rust program. But it doesn't look like your library supports these kinds of documents: https://github.com/feed-rs/feed-rs/blob/master/feed-rs/src/parser/mod.rs#L180 doesn't look for entry root elements.
The library is great!
I think the only piece missing for me is support for the media namespaces.
The spec is here: https://www.rssboard.org/media-rss
Sadly, lots of RSS/Atom feeds use it, including YouTube Atom feeds: https://www.youtube.com/feeds/videos.xml?user=kkszysiu
I think having support for it would be great.
I will try to submit a PR that adds at least a basic implementation in the coming days.
The library currently panics on the following input:
<?xml version="1.0" encoding="UTF-8" ?>
<rss version="2.0"
>
<channel>
<title>Reuters: Most Read Articles</title>
<link>https://www.reuters.com</link>
<description>Reuters.com is your source for breaking news, business, financial and investing news, including personal finance and stocks. Reuters is the leading global provider of news, financial information and technology solutions to the world's media, financial institutions, businesses and individuals.</description>
<image>
<title>Reuters News</title>
<width>120</width>
<height>35</height>
<link>https://www.reuters.com</link>
<url>https://www.reuters.com/resources_v2/images/reuters125.png</url>
</image>
<language>en-us</language>
<lastBuildDate>Sat, 21 Mar 2020 06:29:51 -0400</lastBuildDate>
<copyright>All rights reserved. Users may download and print extracts of content from this website for their own personal and non-commercial use only. Republication or redistribution of Reuters content, including by framing or similar means, is expressly prohibited without the prior written consent of Reuters. Reuters and the Reuters sphere logo are registered trademarks or trademarks of the Reuters group of companies around the world. © Reuters 2020</copyright>
<!--Property passed to loop is null--><!-- Exception rendering module on server -->
This is being served by the following url (at the time of writing): http://feeds.reuters.com/reuters/MostRead
The panic is due to an unwrap at src/util/element_source.rs:250:9, and can be verified with the testurls example:
$ echo "http://feeds.reuters.com/reuters/MostRead" | cargo run --bin testurls
Finished dev [unoptimized + debuginfo] target(s) in 0.06s
Running `target/debug/testurls`
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Error { pos: 22:1, kind: Syntax("Unexpected end of stream: still inside the root element") }', /home/guilherme/Projects/feed-rs/feed-rs/src/util/element_source.rs:250:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
http://feeds.reuters.com/reuters/MostRead
Other inputs that will also panic:
I am a user of NewsFlash GTK, which uses feed-rs. I've reported an issue against their repo for not parsing the HEATED RSS feed correctly. You can find the original issue here.
It seems that feed-rs tries to parse the content:encoded as an image. This is the output the NewsFlash dev sees when trying to parse an entry from HEATED:
Entry {
id: "https://heated.world/p/twitters-big-oil-ad-loophole",
title: Some(Text { content_type: "text/plain", src: None, content: "Twitter\'s Big Oil ad loophole" }),
updated: Some(2021-02-03T05:17:19Z),
authors: [Person { name: "Emily Atkin", uri: None, email: None }],
content: Some(Content {
body: None,
content_type: "image/jpeg",
length: Some(0),
src: Some(Link { href: "https://cdn.substack.com/image/fetch/h_600,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F8c4a2557-4e1b-417f-8b87-6c954a50576c_1072x1102.png", rel: None, media_type: None, href_lang: None, title: None, length: None })
}),
links: [Link { href: "https://heated.world/p/twitters-big-oil-ad-loophole", rel: None, media_type: None, href_lang: None, title: None, length: None }],
summary: Some(Text {
content_type: "text/plain",
src: None,
content: "Climate groups can\'t pay Twitter to spread political content. But the oil industry can, and it\'s ramping up its efforts alongside Biden\'s climate push."
}),
categories: [],
contributors: [],
published: Some(2021-02-02T12:01:03Z),
source: None,
rights: None,
media: []
}
You can find the RSS feed here.
It looks like there was some internal refactoring - the Cargo.toml specifies to include "LICENSE", but the license file was apparently renamed to "LICENSE-MIT":
https://github.com/feed-rs/feed-rs/blob/master/feed-rs/Cargo.toml#L9
I'm trying to use feed_rs to build an aggregator-like tool with some filtering capabilities.
For this to work, I came up with the idea of getting entries from different feeds, merging them into a single "stream", and applying the filters I need. But I got stuck because I can't concatenate the feed_rs::model.entries vectors, as feed_rs::model::Entry does not implement the Clone trait.
Is this intentional? Is there a good reason why this is so, or is it just something that was never problematic?
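For what it's worth, merging by move doesn't require Clone. An illustrative sketch with a stand-in Entry type (deriving Clone on the real model is still the actual request):

```rust
// Stand-in for feed_rs::model::Entry, which does not implement Clone.
struct Entry {
    id: String,
}

fn main() {
    let feed_a = vec![Entry { id: "a1".into() }];
    let feed_b = vec![Entry { id: "b1".into() }];
    // Moving the entries out of each Vec and chaining them needs no Clone;
    // Clone is only required when the originals must be kept as well.
    let merged: Vec<Entry> = feed_a.into_iter().chain(feed_b).collect();
    assert_eq!(merged.len(), 2);
    println!("{}", merged.len());
}
```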
Just stumbled across the trailers.apple.com RSS 2.0 feed, which doesn't provide a guid for articles. So the same issue as #11 arises.
Sorry to open up so many issues, but I'm using this crate quite a lot :)
Sadly, the 0.2.2 update introduced some build errors:
error[E0599]: no method named `map_or` found for type `std::result::Result<chrono::datetime::DateTime<chrono::offset::fixed::FixedOffset>, chrono::format::ParseError>` in the current scope
--> feed-rs/src/parser/util.rs:47:10
|
47 | .map_or(None, |t| Some(t.with_timezone(&Utc)))
| ^^^^^^ help: there is a method with a similar name: `map_err`
error[E0599]: no method named `map_or` found for type `std::result::Result<chrono::datetime::DateTime<chrono::offset::fixed::FixedOffset>, chrono::format::ParseError>` in the current scope
--> feed-rs/src/parser/util.rs:54:10
|
54 | .map_or(None, |t| Some(t.with_timezone(&Utc)))
| ^^^^^^ help: there is a method with a similar name: `map_err`
error: aborting due to 2 previous errors
<content:encoded> seems to be supported for the other formats already. I came across an RSS 1.0 feed that also uses it:
https://planet.freedesktop.org/rss10.xml
https://gitlab.com/news-flash/news_flash/-/issues/43
I'm not sure if this is something that should be implemented in feed-rs itself or rather in xml-rs.
I think we all know Rust strings are UTF-8, while the XML documents of feeds can be in a number of different encodings. The text encoding used is declared in the XML declaration, like this: <?xml version="1.0" encoding="ISO-8859-1" ?>.
xml-rs picks up that information, but so far doesn't do anything with it. From the xml-rs documentation:
XML document encoding.
If XML declaration is not present or does not contain encoding attribute, defaults to "UTF-8". This field is currently used for no other purpose than informational.
So the question is: should the text be converted to UTF-8 in xml-rs already, or is feed-rs the right place to handle this? Even if it eventually should be part of xml-rs's functionality, is a temporary solution in feed-rs a good idea?
Obligatory link to the related bug report in an application using feed-rs:
https://gitlab.com/news-flash/news_flash/-/issues/35
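A rough sketch of the declaration-sniffing step. declared_encoding is a hypothetical helper; quick-xml and xml-rs expose this differently, and a real implementation would then transcode with something like the encoding_rs crate:

```rust
// Naive sketch: pull the encoding attribute out of an XML declaration,
// defaulting to UTF-8 when it is absent. Not robust against whitespace
// around '=' or other declaration oddities.
fn declared_encoding(decl: &str) -> String {
    decl.split("encoding=")
        .nth(1)
        .and_then(|rest| {
            let quote = rest.chars().next()?; // opening ' or "
            rest[1..].split(quote).next().map(str::to_string)
        })
        .unwrap_or_else(|| "UTF-8".to_string())
}

fn main() {
    assert_eq!(
        declared_encoding(r#"<?xml version="1.0" encoding="ISO-8859-1" ?>"#),
        "ISO-8859-1"
    );
    // No encoding attribute -> default to UTF-8
    assert_eq!(declared_encoding(r#"<?xml version="1.0"?>"#), "UTF-8");
    println!("ok");
}
```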
Can you please create a separate field in the feed model for the feed URL?
Currently the feed model has a links field, and it's not obvious at which position the feed URL is located for a JSON feed. JSON feeds have feed_url and homepage_url.
Another option is to get rid of the links field and create a link field which contains the feed URL.
Would it be okay to cargo fmt everything? Right now there is a relatively large diff for me locally.