rinusser / feedtrough
A feed cache/aggregator with basic clustering support.
License: GNU General Public License v3.0
Currently there is a Source implementation for RSS/Atom feeds and another one generating deterministic data for tests.
Add an example source that parses content from a public website, to show that this is possible.
Make sure scraping doesn't violate the website's terms of usage.
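A website-scraping source could be sketched roughly as below, using only the standard library. The project's actual Source base class and item model may differ; the parser here is a hypothetical stand-in that extracts page headlines.

```python
from html.parser import HTMLParser

# Hypothetical sketch: the real project's Source interface may differ;
# extracting <h2> headlines is an assumption for illustration only.

class _HeadlineParser(HTMLParser):
    """Collects the text of all <h2> elements on a page."""
    def __init__(self):
        super().__init__()
        self._in_h2 = False
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_h2 = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_h2 = False

    def handle_data(self, data):
        if self._in_h2 and data.strip():
            self.headlines.append(data.strip())

def parse_headlines(html: str) -> list:
    """Turns raw page HTML into a list of headline strings."""
    parser = _HeadlineParser()
    parser.feed(html)
    return parser.headlines
```

Checking the site's robots.txt and terms of usage before fetching would belong in the same source.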
Currently feeds are configured in sources.txt, with no way of managing them once the application is running.
Add a REST management API and a CLI management interface.
Requires #11 to be done first.
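The management API's routing could look roughly like this framework-agnostic sketch; the endpoint paths and the in-memory source table are assumptions, not the project's actual design.

```python
import json

# Hypothetical sketch of a management API routing layer; the real
# endpoints, framework and storage are undecided.

_sources = {}    # id -> feed URL
_next_id = [1]

def handle_request(method: str, path: str, body: str = "") -> tuple:
    """Returns (status_code, response_body) for a management request."""
    if method == "GET" and path == "/sources":
        return 200, json.dumps(_sources)
    if method == "POST" and path == "/sources":
        source_id = str(_next_id[0])
        _next_id[0] += 1
        _sources[source_id] = json.loads(body)["url"]
        return 201, json.dumps({"id": source_id})
    if method == "DELETE" and path.startswith("/sources/"):
        source_id = path.rsplit("/", 1)[1]
        if _sources.pop(source_id, None) is not None:
            return 204, ""
        return 404, ""
    return 404, ""
```

A CLI management interface could reuse the same handlers, keeping one code path for both.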
At this point there should be an application with testable parts. Add support for automated tests.
Make sure there's a way to test clustering on a single machine, including the hard shutdown of nodes during inopportune times.
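A first unit test could target the deterministic source, sketched here with unittest; the DummySource stand-in below only mirrors the idea, its real interface may differ.

```python
import unittest

# Hypothetical sketch: this DummySource stand-in mirrors the project's
# deterministic test source, not its actual implementation.

class DummySource:
    """Generates the same items on every call, so tests can assert on them."""
    def get_items(self):
        return [{"guid": "dummy-%d" % i, "title": "item %d" % i}
                for i in range(3)]

class DummySourceTest(unittest.TestCase):
    def test_items_are_deterministic(self):
        source = DummySource()
        self.assertEqual(source.get_items(), source.get_items())
        self.assertEqual(source.get_items()[0]["guid"], "dummy-0")
```

Cluster tests would additionally need to spawn real node processes and hard-kill them (e.g. via subprocess and Popen.kill()) to simulate an unclean shutdown.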
Create the basic application structure:
Do NOT yet:
Currently there are a few container classes (e.g. domain/*) with public fields that need to be set after object instantiation.
Add constructors with optional parameters wherever it makes sense, then update the rest of the code to use them.
A first example is already live (albeit not yet in widespread use) in domain/Feed.py.
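Following the pattern already started in domain/Feed.py, a container class could move from post-instantiation field assignment to optional constructor parameters; the field names below are assumptions.

```python
# Hypothetical sketch of the constructor pattern; the real container
# classes in domain/* may have different fields.

class Item:
    def __init__(self, id=None, guid=None, title=None, url=None):
        self.id = id
        self.guid = guid
        self.title = title
        self.url = url

# before: item = Item(); item.title = "..."   (easy to forget a field)
# after:  all fields set in one expression
item = Item(guid="abc-123", title="Example entry")
```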
At this point the application should work standalone. Add clustering support:
The basic framework should have in-memory storage at this point. Research local storage options, then implement one that supports clustering without too much admin overhead.
Do not implement clustering yet, but make sure the storage layer will support it.
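One way to keep the storage layer cluster-ready without implementing clustering is to shape the interface around it now; the bulk-replace hook below anticipates a full cluster sync. The method names are assumptions.

```python
from abc import ABC, abstractmethod

# Hypothetical sketch of a storage interface shaped for later clustering;
# method names are assumptions.

class Storage(ABC):
    @abstractmethod
    def put_item(self, feed_id, item): ...

    @abstractmethod
    def get_items(self, feed_id): ...

    @abstractmethod
    def replace_all(self, feeds_with_items):
        """Atomically swap the entire content set, e.g. on cluster sync."""

class InMemoryStorage(Storage):
    def __init__(self):
        self._data = {}

    def put_item(self, feed_id, item):
        self._data.setdefault(feed_id, []).append(item)

    def get_items(self, feed_id):
        return list(self._data.get(feed_id, []))

    def replace_all(self, feeds_with_items):
        self._data = dict(feeds_with_items)
```

A later SQLite- or file-backed implementation would then only need to fill in the same interface.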
Currently DummySource generates item IDs that are being checked in tests. Sources generating those IDs conflict with autogenerated storage IDs, resulting in the overwriting of other feeds' items.
Remove item IDs from DummySource-generated feed items: all sources should add items with item.id=None. Make sure to update tests, maybe use item.guid for deterministic checks instead.
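The intended ID flow could be sketched like this: sources always leave item.id as None, the storage layer assigns it, and tests key on item.guid. The class shapes are assumptions about the real domain model.

```python
# Hypothetical sketch of the ID flow; the real Item/Storage classes
# may differ.

class Item:
    def __init__(self, guid, title, id=None):
        self.id = id          # stays None until storage assigns it
        self.guid = guid      # source-provided, stable across runs
        self.title = title

class Storage:
    def __init__(self):
        self._next_id = 1
        self.items = {}

    def add_item(self, item):
        if item.id is not None:
            raise ValueError("sources must not pre-assign item IDs")
        item.id = self._next_id   # storage owns ID generation
        self._next_id += 1
        self.items[item.id] = item
```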
Currently, when the application is started, feeds are read from sources.txt only. The feeds are assigned sequential IDs, without any comparison against previously assigned (and stored) IDs. If sources.txt is changed and feeds are reordered for whatever reason, the previously stored contents will be corrupted.
Change the runner startup: stored feeds and sources.txt should be loaded/parsed separately. Then:
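The matching step could be sketched as below: stored feeds keep their IDs by URL, and only genuinely new sources get fresh IDs. Matching on URL is an assumption about what identifies a feed.

```python
# Hypothetical sketch: match stored feeds against sources.txt entries by
# URL, so reordering the file cannot shuffle IDs.

def merge_feeds(stored, configured_urls):
    """stored: dict of feed_id -> url; returns url -> feed_id, with
    new IDs appended after the highest existing one."""
    url_to_id = {url: feed_id for feed_id, url in stored.items()}
    next_id = max(stored, default=0) + 1
    for url in configured_urls:
        if url not in url_to_id:
            url_to_id[url] = next_id
            next_id += 1
    return url_to_id
```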
Currently the (only) scheduler waits for update intervals to pass, then reads each source in sequence.
Change this to a multithreaded model: each source should be read in a separate thread, so slow sources don't keep everyone else waiting. Make sure the new approach handles a mixture of slow and fast sources gracefully, maybe even allow fast sources to update while a slow source is still processing a previous iteration.
At this point the application should have some feed/item content stored, but no way of presenting it to a news reader. Implement feed generation.
Think about which type of feed to implement: is there any point in using Atom over RSS?
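For RSS 2.0 (widely supported by readers, and sufficient if no Atom-only features are needed), output generation could be sketched with the standard library; the item fields are assumptions about the stored data model.

```python
import xml.etree.ElementTree as ET

# Hypothetical sketch of RSS 2.0 generation; item field names are
# assumptions about the stored data model.

def build_rss(feed_title, feed_link, items):
    """Serializes a feed and its items to an RSS 2.0 XML string."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = feed_title
    ET.SubElement(channel, "link").text = feed_link
    for item in items:
        el = ET.SubElement(channel, "item")
        ET.SubElement(el, "title").text = item["title"]
        ET.SubElement(el, "guid").text = item["guid"]
    return ET.tostring(rss, encoding="unicode")
```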
Currently there are a few print() statements spread around the code, especially in complex areas.
Since there are automated tests now, these statements are no longer required; replace them with proper logging. Make sure log output can be suppressed in the test runner.
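A minimal sketch of the replacement, using the standard logging module with the usual one-logger-per-module convention; the function is a hypothetical stand-in.

```python
import logging

# Sketch: module-level logger replaces scattered print() calls;
# process_feed() is a hypothetical stand-in.

log = logging.getLogger(__name__)

def process_feed(url):
    log.debug("fetching %s", url)    # was: print("fetching", url)
    return url.startswith("http")

# In the test runner, silence everything below CRITICAL:
logging.disable(logging.CRITICAL)
```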
The documentation could be improved:
Currently there are a few hard-coded settings, like the log verbosity.
Add a CLI argument parser with at least a log verbosity option. See what other useful options can be added.
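An argparse sketch: only the verbosity flag comes from the task itself, the sources-file flag is an assumption about what else might be useful.

```python
import argparse
import logging

# Sketch of a CLI parser; --sources is an assumed extra option.

def parse_args(argv):
    parser = argparse.ArgumentParser(description="feedtrough")
    parser.add_argument("--verbosity", default="WARNING",
                        choices=["DEBUG", "INFO", "WARNING", "ERROR"],
                        help="log level")
    parser.add_argument("--sources", default="sources.txt",
                        help="path to the feed sources file")
    return parser.parse_args(argv)

args = parse_args(["--verbosity", "DEBUG"])
logging.basicConfig(level=getattr(logging, args.verbosity))
```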
Currently SQLiteStorage's methods are being called from multiple threads. Only the scheduler thread performs writes, but this will change in the future.
Make the storage handler thread-safe, but don't change it to asynchronous writes. This probably requires a locking mechanism.
Keep in mind there will be multiple types of write access, even full content overhauls (on cluster sync). Consider whether it will cause problems to make reads wait too while a write lock is in place.
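A conservative locking sketch: one RLock guards every method, so reads do wait on active writes, which keeps full-content overhauls atomic at the cost of some read latency. The method names are assumptions about SQLiteStorage's interface.

```python
import threading

# Hypothetical sketch of coarse locking; the real SQLiteStorage methods
# may differ. A single RLock makes readers wait on writers too.

class ThreadSafeStorage:
    def __init__(self):
        self._lock = threading.RLock()
        self._items = {}

    def put_item(self, item_id, item):
        with self._lock:          # writers block readers and other writers
            self._items[item_id] = item

    def get_item(self, item_id):
        with self._lock:          # readers also wait on an active write
            return self._items.get(item_id)

    def replace_all(self, items):
        with self._lock:          # full overhaul (cluster sync) is atomic
            self._items = dict(items)
```

If read-wait latency turns out to matter, a reader-writer lock would be the next step, but the simple lock keeps correctness easy to reason about.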
Currently there's almost no documentation in the code.
Fix that: at least document all the public parts of the API.
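A sketch of the docstring style the public API could use; the function below is a hypothetical stand-in, not an existing one.

```python
# Hypothetical stand-in showing the documentation style, not a real
# project function.

def update_feed(feed_id: int, force: bool = False) -> int:
    """Fetch new items for one feed and store them.

    :param feed_id: storage ID of the feed to update
    :param force: if True, update even if the interval hasn't elapsed
    :returns: number of newly stored items
    """
    return 0  # placeholder body; real logic would live in the scheduler
```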
Currently there's a dummy source creating deterministic but useless feed data.
Add support for reading RSS/Atom feeds. This source needs to be configurable from outside Python code, so add some kind of application configuration mechanism.
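The configuration mechanism could start as a plain-text file parser; the "interval|url" line format below is an assumption, the real sources.txt format may differ.

```python
# Hypothetical sketch of a sources.txt parser; the "interval|url"
# line format is an assumption.

def parse_sources(text):
    """Returns a list of (update_interval_seconds, feed_url) tuples."""
    sources = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue                      # skip blanks and comments
        interval, url = line.split("|", 1)
        sources.append((int(interval), url))
    return sources
```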