Giter Site home page Giter Site logo

Comments (5)

mmcdole avatar mmcdole commented on June 2, 2024 5

I'm been kind of mulling over what options we might want in this config object, that have been requested in the past, or might be beneficial to give control over.

These are what I've come up with. Let me know your thoughts on these if you get a chance.

type ParseOptions struct {
	// MaxItems specifies the maximum number of feed items to parse.
	// The default value is 0, which means no limit.
	MaxItems int

	// ParseDates determines if the feed parser will attempt to parse dates into `time.Time` objects.
	// The default value is true.
	ParseDates bool

	// KeepOriginalFeed specifies if the parser should retain the raw feed in the `Feed` struct's `RawFeed` field.
	// The default value is false.
	KeepOriginalFeed bool

	// StrictMode determines if strict parsing rules should be applied.
	// If set to true, all strictness settings are enabled.
	// The default value is false, and all StrictnessOptions default to their least strict settings.
	StrictMode bool

	// StrictnessOptions holds the options for controlling the strictness of the parsing.
	StrictnessOptions StrictnessOptions

	// RequestOptions holds the options for the HTTP request.
	RequestOptions RequestOptions
}

type StrictnessOptions struct {
	// StripInvalidCharacters specifies if invalid feed characters should be stripped out.
	// The default value is true.
	StripInvalidCharacters bool

	// AutoCloseTags specifies if the parser should automatically close unclosed tags.
	// The default value is true.
	AutoCloseTags bool

	// AllowUndisclosedXMLNamespaces specifies if the parser should allow undisclosed XML namespaces.
	// The default value is true.
	AllowUndisclosedXMLNamespaces bool

	// AllowIncorrectDateFormats specifies if the parser should allow incorrect date formats.
	// The default value is true.
	AllowIncorrectDateFormats bool

	// AllowUnescapedMarkup specifies if the parser should allow unescaped / naked markup in feed elements.
	// The default value is true.
	AllowUnescapedMarkup bool
}

type RequestOptions struct {
	// Timeout is the maximum amount of time to wait for the request to complete.
	// The default value is 0, which means no timeout.
	Timeout time.Duration

	// UserAgent is the value of the `User-Agent` header to be sent in the request.
	// This header is used to identify the client software and version to the server.
	// The default value is "gofeed/2.0.0" (the current version of gofeed).
	UserAgent string

	// IfNoneMatch is the value of the `If-None-Match` header to be sent in the request.
	// This header is used to make a conditional request for a resource, and the server will return the resource only if it does not match the provided entity tag.
	// If the server has a matching `Etag`, it will not respond with the full feed, and instead return a `304 Not Modified` response.
	// The default value is an empty string, which means no `If-None-Match` header will be sent.
	IfNoneMatch string

	// IfModifiedSince is the value of the `If-Modified-Since` header to be sent in the request.
	// This header is used to make a conditional request for a resource, and the server will return the resource only if it has been modified since the provided time.
	// If the server determines that the resource has not been modified since the provided `If-Modified-Since` time, it will not respond with the full feed, and instead return a `304 Not Modified` response.
	// The default value is an empty string, which means no `If-Modified-Since` header will be sent.
	IfModifiedSince string
}

I also realized that the universal feed parser's Detector is reading in the entire feed. This is a bug/mistake, I want to only read in a small buffered amount and let the feed be parsed as a buffered reader. So, I'll need to clean that up. That might help in the scenario you describe above with a very large feed and your intention to only read a subset of it? If you were using the universal parser at least.

from gofeed.

imirzadeh avatar imirzadeh commented on June 2, 2024 3

I thought about the options. Overall, I like your proposal!

I have been working on extracting feeds for over a year now (for a personal project). Given the challenges I faced, I thought I share my suggestions:

1- This is very subjective, but I think the request fetching is out of the scope of gofeed, which is concerned with parsing. The fact that it can parse from urls is fantastic, but I think if someone needs a very advanced fetching mechanism, they can implement them themselves. For instance, some RSS feeds have huge body sizes and require special headers, auth, etc. Overall, I think parsing differs from fetching, and it may not be worth it if it makes the code/API more complex. But this is just my opinion, obviously :)

2- Regarding the ParseOptions and StrictnessOptions, I pretty much like the proposal. Another highly optional strictness field could be HTML sanitization for security reasons, but I think users can do this as a post-processing step, and I don't think is critical at all. One other suggestion could be using a skip list where some tags could be skipped if it improves the performance. I'm not sure if it does since given my benchmark, most of the gofeed's time is spent on decoding elements.

3- Related to the previous suggestion, I think gofeed can have separate utility functions that are helpful when working with feeds. For instance, we can detect RSS links based on some heuristics. Other utilities could be URL resolution and HTML sanitization. I may be able to help with these.

All in all, I think the proposal is very nice! Thank you again for sharing gofeed with us! :)

from gofeed.

mmcdole avatar mmcdole commented on June 2, 2024 3

Thanks @imirzadeh .

I think you are right. I'll leave the ParseURL functions, but will encourage people who want to have custom behavior to use their own client implementation.

from gofeed.

mmcdole avatar mmcdole commented on June 2, 2024 2

This is a great idea.

I also am keen on introducing a config object, and putting some other things in there as well like user agent, proxy, strictness settings.

We could put it behind a setter, as you suggested, or cut a v2 breaking change and clean up the interface a bit.

from gofeed.

imirzadeh avatar imirzadeh commented on June 2, 2024

Update

I did a little bit of digging and found out implementing this is actually not as easy at it seems.
For the NYTimes feed above, with some of hacks, I managed to improve the performance significantly when I only I stopped parsing after seeing 10 RSS items:


BenchmarkGoFeed-All  	               8	 139441776 ns/op	141147331 B/op	  819102 allocs/op
BenchmarkGoFeed-10+Skip   	      10	 107165758 ns/op	101558910 B/op	  526856 allocs/op
BenchmarkGoFeed-10+Stop   	     273	   4106181 ns/op	 64155609 B/op	    5919 allocs/op

So in the results above we compare:

  • BenchmarkGoFeed-All : The current/default implementation. parses everything (1752 items)
  • BenchmarkGoFeed-10+Skip: Modified implementation where I changed parseChannel() so that after length of items is more than 10, i p.Skip() for the item. Obviously still every element is going to be decoded so the performance gain (~20%) is negligible.
  • BenchmarkGoFeed-10+Stop: Modified implementation where after the length of parsed items is larger than 10, I stopped parsing altogether. For this, I had to disable some checks (expecting endtags for channel/rss) since I did not continue parsing xml elements/tokens. But this implementation is order-of-magnitude faster when we don't need all the items!

from gofeed.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.