rinusser / feedtrough

a feed cache/aggregator with basic clustering support

License: GNU General Public License v3.0

Python 100.00%

python rss rss-downloader rss-generator

feedtrough's Issues

add example web scraper source

Currently there is a Source implementation for RSS/Atom feeds and another one generating deterministic data for tests.

Add an example source that parses content from a public website, to show that it's possible.

Make sure scraping doesn't violate the website's terms of use.
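
As a rough illustration of what the parsing half of such a source could look like, here is a minimal sketch using only the standard library's `html.parser`. The markup pattern (`<h2><a href="...">`), the class name `HeadlineParser` and the function `parse_headlines` are assumptions made for this example, not part of feedtrough. Fetching the page is left out on purpose; a real source would first have to check the target site's terms of use.

```python
from html.parser import HTMLParser


class HeadlineParser(HTMLParser):
    """Collects (title, link) pairs from <h2><a href="...">title</a></h2> markup.

    Hypothetical example: the markup pattern is an assumption, real sites differ.
    """

    def __init__(self):
        super().__init__()
        self.items = []
        self._in_h2 = False
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_h2 = True
        elif tag == "a" and self._in_h2:
            # Only links inside a headline are of interest.
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.items.append(("".join(self._text).strip(), self._href))
            self._href = None
        elif tag == "h2":
            self._in_h2 = False


def parse_headlines(html):
    """Return a list of (title, link) tuples found in the given HTML string."""
    parser = HeadlineParser()
    parser.feed(html)
    parser.close()
    return parser.items


SAMPLE = '<h2><a href="/a">First</a></h2><p>x</p><h2><a href="/b">Second</a></h2>'
print(parse_headlines(SAMPLE))
```

Each tuple could then be turned into a feed item by the source, in the same way the RSS/Atom source produces items.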

add management interface

Currently feeds are configured in sources.txt, with no way of managing them once the application is running.

Add a REST management API and a CLI management interface.

Requires #11 to be done first.

add unit/integration test framework

At this point there should be an application with testable parts. Add support for automated tests.

Make sure there's a way to test clustering on a single machine, including the hard shutdown of nodes during inopportune times.

create basic aggregation framework

Create the basic application structure:

implement data types for feeds, feed items etc.
implement service structure: interfaces for storage, scheduler, source etc. and basic implementations (memory storage, standalone scheduler and feed source)

Do NOT yet:

implement clustering
add HTTP support (maybe add some kind of temporary output instead)

add/use container class constructors

Currently there are a few container classes (e.g. domain/*) with public fields that need to be set after object instantiation.

Add constructors with optional parameters wherever it makes sense, then update the rest of the code to use them.

A first example is already live (albeit not yet in widespread use) in domain/Feed.py.
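
The pattern could look roughly like the sketch below. The field names (`id`, `type`, `url`, `title`) are assumptions chosen for illustration; the actual fields in domain/Feed.py may differ.

```python
class Feed:
    """Container class with an all-optional constructor (hypothetical fields)."""

    def __init__(self, id=None, type=None, url=None, title=None):
        self.id = id
        self.type = type
        self.url = url
        self.title = title

    def __repr__(self):
        return f"Feed(id={self.id!r}, type={self.type!r}, url={self.url!r}, title={self.title!r})"


# Before: create an empty object, then set the public fields one by one.
old_style = Feed()
old_style.type = "rss"
old_style.url = "https://example.org/feed.xml"

# After: pass the known values to the constructor, leave the rest at None.
new_style = Feed(type="rss", url="https://example.org/feed.xml")
print(new_style)
```

Because every parameter is optional, existing code that sets fields after instantiation keeps working while it is migrated.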

add cluster support

At this point the application should work standalone. Add clustering support:

a new cluster scheduler should coordinate nodes; only one node needs to be active at any time
the storage layer should either replicate itself across the cluster, or the scheduler should facilitate node storage sync'ing
cluster should expect individual nodes not to be available most of the time
data should be synchronized on node reconnect - worst case should be supported: merging multiple nodes each having new data

add permanent local storage

The basic framework should have in-memory storage at this point. Research local storage options, then implement one that supports clustering without too much admin overhead.

Do not implement clustering yet, but make sure the storage layer will support it.

remove DummySource's item IDs

Currently DummySource generates item IDs that are being checked in tests. Sources generating those IDs conflict with autogenerated storage IDs, resulting in the overwriting of other feeds' items.

Remove item IDs from DummySource-generated feed items: all sources should add items with item.id=None. Make sure to update tests, maybe use item.guid for deterministic checks instead.
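
A minimal sketch of the intended behavior, assuming a simplified `FeedItem` class and a `DummySource.read()` method (both hypothetical stand-ins for the real classes): items leave the source with `id=None`, and tests compare the deterministic `guid` values instead.

```python
class FeedItem:
    """Simplified stand-in for the real feed item class (assumed fields)."""

    def __init__(self, guid=None, title=None, id=None):
        self.id = id        # assigned by the storage layer only
        self.guid = guid    # stable identifier provided by the source
        self.title = title


class DummySource:
    """Generates deterministic items without touching item.id."""

    def __init__(self, feed_id, count=3):
        self.feed_id = feed_id
        self.count = count

    def read(self):
        return [
            FeedItem(guid=f"dummy-{self.feed_id}-{n}", title=f"Dummy item {n}")
            for n in range(self.count)
        ]


items = DummySource(feed_id=1, count=2).read()
print([item.guid for item in items])
```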

runner: read feeds from storage

Currently, when the application is started, feeds are read from sources.txt only. The feeds are assigned sequential IDs, without any comparison against previously assigned (and stored) IDs. If sources.txt is changed and feeds are moved to a different position for whatever reason, the previously stored contents will end up attached to the wrong feeds.

Change the runner startup: stored feeds and sources.txt should be loaded/parsed separately. Then:

for each stored feed, only add it to the list of active feeds if feed type+URL are in sources.txt
for each feed from sources.txt: if type+URL not in stored feeds, add to active list
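
The two rules above can be sketched as a single function. The `Feed` tuple and the name `select_active_feeds` are assumptions for this example; new feeds get `id=None` here on the assumption that the storage layer assigns the ID later.

```python
from collections import namedtuple

# Hypothetical minimal feed record: only the fields the matching needs.
Feed = namedtuple("Feed", ["id", "type", "url"])


def select_active_feeds(stored_feeds, configured):
    """Combine stored feeds with the (type, url) pairs parsed from sources.txt.

    Stored feeds stay active (and keep their IDs) only if they are still
    configured; configured feeds unknown to storage are added without an ID.
    """
    configured_keys = set(configured)
    active = [f for f in stored_feeds if (f.type, f.url) in configured_keys]

    stored_keys = {(f.type, f.url) for f in stored_feeds}
    seen = set()
    for key in configured:
        if key not in stored_keys and key not in seen:
            active.append(Feed(None, *key))
            seen.add(key)  # ignore duplicate lines in sources.txt
    return active


stored = [Feed(1, "rss", "https://a.example/feed"), Feed(2, "rss", "https://b.example/feed")]
configured = [("rss", "https://c.example/feed"), ("rss", "https://a.example/feed")]
print(select_active_feeds(stored, configured))
```

In this example feed 1 keeps its ID even though it moved to the second line of the configuration, feed 2 is dropped from the active list, and the new feed is added without an ID.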

add multithreaded source reading

Currently the (only) scheduler waits for update intervals to pass, then reads each source in sequence.

Change this to a multithreaded model: each source should be read in a separate thread, so slow sources don't keep everyone else waiting. Make sure the new approach handles a mixture of slow and fast sources gracefully, maybe even allow fast sources to update while a slow source is still processing a previous iteration.
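
One possible shape for this, sketched with the standard `threading` module: every source gets its own polling loop, so a slow source only delays itself. The names `run_source` and `demo`, the callback signature and the intervals are assumptions for illustration, not the project's scheduler API.

```python
import threading
import time
from collections import Counter


def run_source(name, read, interval, on_items, stop_event):
    """Poll one source in a loop until stop_event is set."""
    while not stop_event.is_set():
        try:
            on_items(name, read())
        except Exception as exc:  # a failing source must not kill its thread
            print(f"source {name} failed: {exc}")
        stop_event.wait(interval)  # sleeps, but returns early on shutdown


def demo(duration=0.3):
    """Run one slow and one fast source side by side, return reads per source."""
    counts = Counter()
    lock = threading.Lock()

    def on_items(name, items):
        with lock:  # the callback is invoked from several threads
            counts[name] += 1

    def slow_read():
        time.sleep(0.1)  # simulates a slow remote server
        return ["slow item"]

    def fast_read():
        return ["fast item"]

    stop_event = threading.Event()
    threads = [
        threading.Thread(target=run_source, args=(name, read, 0.01, on_items, stop_event))
        for name, read in (("slow", slow_read), ("fast", fast_read))
    ]
    for thread in threads:
        thread.start()
    time.sleep(duration)
    stop_event.set()
    for thread in threads:
        thread.join()
    return counts


if __name__ == "__main__":
    print(demo())
```

The fast source completes many more iterations than the slow one in the same time span, which is the behavior the issue asks for. With many sources a thread pool might be preferable to one thread per source; that trade-off is left open here.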

add feed presentation

At this point the application should have some feed/item content stored, but no way of presenting it to a news reader. Implement feed generation.

Think about which type of feed to implement: is there any point in using Atom over RSS?
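
If RSS ends up being the choice, generation could be sketched with the standard library's `xml.etree.ElementTree` as below. The function name `build_rss` and the dict-based item format are assumptions for this example; the element names follow RSS 2.0, where a channel needs a title, a link and a description.

```python
import xml.etree.ElementTree as ET


def build_rss(title, link, description, items):
    """Render a minimal RSS 2.0 document as a string.

    `items` is a list of dicts with "title", "link" and "guid" keys
    (a hypothetical format chosen for this sketch).
    """
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    for tag, text in (("title", title), ("link", link), ("description", description)):
        ET.SubElement(channel, tag).text = text

    for item in items:
        node = ET.SubElement(channel, "item")
        for tag in ("title", "link", "guid"):
            ET.SubElement(node, tag).text = item[tag]

    # ElementTree escapes special characters such as & and < in the text.
    return ET.tostring(rss, encoding="unicode")


xml_text = build_rss(
    "Example feed",
    "https://example.org/",
    "Aggregated items",
    [{"title": "Tom & Jerry", "link": "https://example.org/1", "guid": "item-1"}],
)
print(xml_text)
```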

add logging facility

Currently there are a few print() statements spread around the code, especially in complex areas.

Since there are automated tests now, these statements are no longer required, so replace them with proper logging. Make sure log output can be suppressed in the test runner.
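
A small sketch of both halves, using the standard `logging` module: a named logger replaces `print()`, and `logging.disable()` silences output while tests run. The logger name `"feedtrough"` and the helper names are assumptions for this example.

```python
import logging

# Hypothetical logger name; modules could also use logging.getLogger(__name__).
logger = logging.getLogger("feedtrough")


def read_source(url):
    """Example of a former print() call turned into a log statement."""
    logger.debug("reading source %s", url)
    return []


def suppress_logging():
    """Call from the test runner (or setUp) to silence all log output."""
    logging.disable(logging.CRITICAL)


def restore_logging():
    """Call from tearDown to re-enable logging."""
    logging.disable(logging.NOTSET)


suppress_logging()
logger.error("this message is dropped while logging is disabled")
restore_logging()
```

`logging.disable(logging.CRITICAL)` suppresses messages of every standard level across all loggers, so it does not depend on how handlers are configured.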

improve documentation

The documentation could be improved:

add build information (Sphinx dependency, Makefile explanations etc.)
set up gh-pages to host the generated HTML
add copyright/license headers to source files

add CLI arguments

Currently there are a few hard-coded settings, like the log verbosity.

Add a CLI argument parser that supports at least the log verbosity. See what other useful options can be added.
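
A minimal sketch with the standard `argparse` module. The flags `-v/--verbose` and `-q/--quiet` and the mapping to log levels are assumptions for illustration; the issue itself only asks for some way to set log verbosity.

```python
import argparse
import logging


def parse_args(argv=None):
    """Parse command line arguments; argv=None means sys.argv[1:]."""
    parser = argparse.ArgumentParser(prog="feedtrough", description="feed cache/aggregator")
    parser.add_argument("-v", "--verbose", action="count", default=0,
                        help="increase log verbosity (may be repeated)")
    parser.add_argument("-q", "--quiet", action="store_true",
                        help="only log errors")
    return parser.parse_args(argv)


def log_level(args):
    """Translate the parsed flags into a logging level."""
    if args.quiet:
        return logging.ERROR
    return {0: logging.WARNING, 1: logging.INFO}.get(args.verbose, logging.DEBUG)


if __name__ == "__main__":
    logging.basicConfig(level=log_level(parse_args()))
```

Passing `argv` explicitly keeps the parser testable without touching `sys.argv`.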

make SQLiteStorage thread-safe

Currently SQLiteStorage's methods are being called from multiple threads. Only the scheduler thread performs writes, but this will change in the future.

Make the storage handler thread-safe, but don't change it to asynchronous writes. This probably requires a locking mechanism.

Keep in mind there will be multiple types of write access, even full content overhauls (on cluster sync). Check whether it causes problems if reads also have to wait while a write lock is held.
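
One straightforward approach is a single lock around every storage call, sketched below with `sqlite3` and `threading`. The class name `LockedSQLiteStorage`, the table layout and the method names are assumptions for this example and do not reflect the real SQLiteStorage interface. Writes stay synchronous, and reads wait for the same lock, which is exactly the trade-off the last paragraph asks to evaluate.

```python
import sqlite3
import threading


class LockedSQLiteStorage:
    """Serializes all access to one shared SQLite connection (sketch only)."""

    def __init__(self, path=":memory:"):
        self._lock = threading.RLock()
        # check_same_thread=False lets other threads use this connection;
        # the lock below is what actually makes that safe.
        self._conn = sqlite3.connect(path, check_same_thread=False)
        with self._lock, self._conn:
            self._conn.execute(
                "CREATE TABLE IF NOT EXISTS items ("
                "id INTEGER PRIMARY KEY, feed_id INTEGER, guid TEXT, title TEXT)"
            )

    def add_item(self, feed_id, guid, title):
        # "with self._conn" commits on success and rolls back on an exception.
        with self._lock, self._conn:
            cursor = self._conn.execute(
                "INSERT INTO items (feed_id, guid, title) VALUES (?, ?, ?)",
                (feed_id, guid, title),
            )
            return cursor.lastrowid

    def get_items(self, feed_id):
        with self._lock:  # reads wait for writers, too
            return self._conn.execute(
                "SELECT id, guid, title FROM items WHERE feed_id = ? ORDER BY id",
                (feed_id,),
            ).fetchall()


def write_many(storage, feed_id, count):
    """Helper for the demo: insert `count` items for one feed."""
    for n in range(count):
        storage.add_item(feed_id, f"guid-{feed_id}-{n}", f"item {n}")


storage = LockedSQLiteStorage()
writers = [threading.Thread(target=write_many, args=(storage, feed_id, 25)) for feed_id in range(4)]
for writer in writers:
    writer.start()
for writer in writers:
    writer.join()
print(sum(len(storage.get_items(feed_id)) for feed_id in range(4)))
```

If blocking reads turn out to be a problem, a reader/writer lock or one connection per thread would be alternatives worth comparing.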

add code documentation

Currently there's almost no documentation in the code.

Fix that: at least document all the public parts of the API.

add RSS/Atom feed source

Currently there's a dummy source creating deterministic but useless feed data.

Add support for reading RSS/Atom feeds. This source needs to be configurable from outside Python code, so add some kind of application configuration mechanism.
