feedtrough's Introduction

Synopsis

A caching RSS/Atom proxy server, written in Python, with basic clustering support. Other data sources can be implemented.

The sources are hosted on GitHub.

This is a work in progress, upcoming changes are outlined in the repository's issues.

General

RSS/Atom feeds frequently don't offer items going back far enough: if you check a feed e.g. once a week but the feed only lists items from the last 24 hours, you're going to miss items. Other aggregation services are available, but they either raise privacy concerns or don't provide RSS/Atom feeds to news reader clients.

It's straightforward to implement other data sources, such as web scrapers, local monitoring etc.

Requirements

  • Python 3.5+ (tested with Python 3.5, 3.6 and 3.7)
  • feedparser (tested with 5.2.1)
  • PyRSS2Gen (tested with 1.1)

Installation

Just download the sources and make sure Python, feedparser (pip install feedparser) and PyRSS2Gen (pip install pyrss2gen) are installed.

Usage

Run run.py. The command-line interface supports a few arguments: running run.py -h will show the help screen.

Configuration

The list of feeds is read from sources.txt. There is an example configuration, along with documentation, in sources.txt.example.

Tests

This application includes a test suite; you can run it with:

run-tests.py

Tests are written with Python's built-in unittest package; there are currently no other test dependencies.
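
New tests follow the usual unittest shape. A minimal illustrative module (the class name and assertion below are made up, not taken from the actual suite):

  import unittest

  class FeedMergeTest(unittest.TestCase):
      def test_duplicate_urls_are_not_added_twice(self):
          # hypothetical check against a merge helper
          urls = ["http://example.com/a.xml", "http://example.com/a.xml"]
          self.assertEqual(len(set(urls)), 1)

  if __name__ == "__main__":
      unittest.main()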

Legal

Copyright

Copyright (C) 2018 Richard Nusser

License

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.

feedtrough's Issues

runner: read feeds from storage

Currently, when the application is started, feeds are read from sources.txt only. The feeds are assigned sequential IDs, without any comparison against previously assigned (and stored) IDs. If sources.txt is changed and feeds are moved around for whatever reason, the previously stored contents will be corrupted.

Change the runner startup: stored feeds and sources.txt should be loaded/parsed separately, then merged (a sketch of this merge follows the list):

  1. for each stored feed, only add it to the list of active feeds if the feed's type+URL are in sources.txt
  2. for each feed from sources.txt: if its type+URL are not in the stored feeds, add it to the active list
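
A minimal sketch of that merge, assuming feed objects expose type and url attributes (the function and attribute names are illustrative, not the actual API):

  def merge_feeds(stored_feeds, configured_feeds):
      """Keep stored feeds still listed in sources.txt, then add new ones."""
      configured_keys = {(f.type, f.url) for f in configured_feeds}
      stored_keys = {(f.type, f.url) for f in stored_feeds}
      # 1. stored feeds survive only if their type+URL is still configured
      active = [f for f in stored_feeds if (f.type, f.url) in configured_keys]
      # 2. configured feeds not seen before are added as new entries
      active += [f for f in configured_feeds if (f.type, f.url) not in stored_keys]
      return active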

add logging facility

Currently there are a few print() statements spread around the code, especially in complex areas.

Now that there are automated tests, these statements are no longer required; replace them with proper logging. Make sure log output can be suppressed in the test runner.
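
A minimal sketch using the standard logging module (the logger name and function are illustrative):

  import logging

  log = logging.getLogger("feedtrough")

  def read_source(url):
      log.debug("reading source %s", url)  # replaces a print() call

  # in the test runner, one call suppresses all output below CRITICAL:
  logging.disable(logging.CRITICAL)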

add code documentation

Currently there's almost no documentation in the code.

Fix that: at least document all the public parts of the API.
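
For the public parts, plain docstrings would do. A hypothetical example (the class follows the memory storage mentioned elsewhere on this page, but the implementation here is made up):

  class MemoryStorage:
      """Keeps feeds in a plain dict; contents are lost on shutdown."""

      def __init__(self):
          self._feeds = {}

      def get_feed(self, feed_id):
          """Return the stored Feed with the given ID, or None if unknown."""
          return self._feeds.get(feed_id)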

add multithreaded source reading

Currently the (only) scheduler waits for update intervals to pass, then reads each source in sequence.

Change this to a multithreaded model: each source should be read in a separate thread, so slow sources don't keep the others waiting. Make sure the new approach handles a mixture of slow and fast sources gracefully, maybe even allowing fast sources to update while a slow source is still processing a previous iteration.
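
A per-source thread sketch using the standard threading module; the update method and interval attribute are assumptions about the Source API:

  import threading
  import time

  def start_source_threads(sources):
      for source in sources:
          t = threading.Thread(target=_run_source, args=(source,), daemon=True)
          t.start()

  def _run_source(source):
      # each source loops independently, so a slow source never blocks
      # the update intervals of the fast ones
      while True:
          source.update()              # assumed update method
          time.sleep(source.interval)  # assumed per-source interval in seconds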

add cluster support

At this point the application should work standalone. Add clustering support:

  • a new cluster scheduler should coordinate nodes; only one node needs to be active at any time
  • the storage layer should either replicate itself across the cluster, or the scheduler should facilitate syncing the nodes' storage
  • the cluster should expect individual nodes to be unavailable most of the time
  • data should be synchronized on node reconnect; the worst case, merging multiple nodes that each have new data, should be supported

add CLI arguments

Currently there are a few hard-coded settings, like the log verbosity.

Add a CLI argument parser covering at least the log verbosity, and see what other useful options can be added.
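
A minimal argparse sketch for the verbosity option (the flag name is illustrative, not the actual CLI):

  import argparse
  import logging

  parser = argparse.ArgumentParser(description="caching RSS/Atom proxy server")
  parser.add_argument("-v", "--verbosity", default="WARNING",
                      choices=["DEBUG", "INFO", "WARNING", "ERROR"],
                      help="log verbosity level")
  args = parser.parse_args()
  logging.basicConfig(level=getattr(logging, args.verbosity))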

add permanent local storage

The basic framework should have in-memory storage at this point. Research local storage options, then implement one that supports clustering without too much admin overhead.

Do not implement clustering yet, but make sure the storage layer will support it.
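
The built-in sqlite3 module is one option with little admin overhead. A minimal sketch (the schema and method names are made up for illustration):

  import sqlite3

  class SQLiteStorage:
      def __init__(self, path="feedtrough.db"):
          self._db = sqlite3.connect(path)
          self._db.execute("CREATE TABLE IF NOT EXISTS items ("
                           "id INTEGER PRIMARY KEY, feed_id INTEGER, "
                           "guid TEXT, title TEXT)")

      def add_item(self, feed_id, guid, title):
          self._db.execute(
              "INSERT INTO items (feed_id, guid, title) VALUES (?, ?, ?)",
              (feed_id, guid, title))
          self._db.commit()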

add feed presentation

At this point the application should have some feed/item content stored, but no way of presenting it to a news reader. Implement feed generation.

Think about which type of feed to implement: is there any point in using Atom over RSS?
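
PyRSS2Gen is already a dependency, so generating RSS 2.0 could look roughly like this (all field values are illustrative):

  import datetime
  import PyRSS2Gen

  rss = PyRSS2Gen.RSS2(
      title="feedtrough: example feed",
      link="http://localhost:8000/feeds/1",  # hypothetical proxy URL
      description="cached copy of an upstream feed",
      lastBuildDate=datetime.datetime.utcnow(),
      items=[PyRSS2Gen.RSSItem(
          title="example item",
          link="http://example.com/item/1",
          guid=PyRSS2Gen.Guid("http://example.com/item/1"),
          pubDate=datetime.datetime.utcnow())])

  xml = rss.to_xml("utf-8")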

add example web scraper source

Currently there is a Source implementation for RSS/Atom feeds and another one generating deterministic data for tests.

Add an example source that parses content from a public website, to show that this is possible.

Make sure scraping doesn't violate the website's terms of use.
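
Using only the standard library would keep the dependency list unchanged. A minimal sketch that collects <h2> headlines from a hypothetical page (the URL and tag structure are made up):

  from html.parser import HTMLParser
  from urllib.request import urlopen

  class HeadlineParser(HTMLParser):
      """Collects the text contents of all <h2> elements."""

      def __init__(self):
          super().__init__()
          self.headlines = []
          self._in_h2 = False

      def handle_starttag(self, tag, attrs):
          if tag == "h2":
              self._in_h2 = True

      def handle_endtag(self, tag):
          if tag == "h2":
              self._in_h2 = False

      def handle_data(self, data):
          if self._in_h2:
              self.headlines.append(data.strip())

  html = urlopen("http://example.com/news").read().decode("utf-8")
  parser = HeadlineParser()
  parser.feed(html)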

add RSS/Atom feed source

Currently there's a dummy source creating deterministic but useless feed data.

Add support for reading RSS/Atom feeds. This source needs to be configurable from outside Python code, so add some kind of application configuration mechanism.
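
feedparser is already a dependency; reading a feed then boils down to something like this (mapping the entries into feedtrough's own types is left out):

  import feedparser

  parsed = feedparser.parse("http://example.com/feed.xml")
  print(parsed.feed.get("title", ""))
  for entry in parsed.entries:
      # .get() keeps missing fields from raising on sparse feeds
      print(entry.get("title", ""), entry.get("link", ""))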

add/use container class constructors

Currently there are a few container classes (e.g. domain/*) with public fields that need to be set after object instantiation.

Add constructors with optional parameters wherever it makes sense, then update the rest of the code to use them.

A first example is already live (albeit not yet in widespread use) in domain/Feed.py.
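
A sketch of the pattern (the field names are illustrative; the live example is in domain/Feed.py):

  class Feed:
      def __init__(self, id=None, url=None, title=None):
          self.id = id
          self.url = url
          self.title = title

  # fields can now be set at instantiation instead of afterwards:
  feed = Feed(url="http://example.com/feed.xml", title="Example")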

make SQLiteStorage thread-safe

Currently SQLiteStorage's methods are being called from multiple threads. Only the scheduler thread performs writes, but this will change in the future.

Make the storage handler thread-safe, but don't change it to asynchronous writes. This probably requires a locking mechanism.

Keep in mind there will be multiple types of write access, even full content overhauls (on cluster sync). Consider whether there will be problems if reads also have to wait while a write lock is in place.
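
A minimal locking sketch with threading.Lock, keeping writes synchronous; here reads also wait while a write holds the lock (the method bodies are placeholders):

  import threading

  class SQLiteStorage:
      def __init__(self):
          self._lock = threading.Lock()

      def put_item(self, item):
          with self._lock:  # writers exclude each other and all readers
              pass          # ... perform the INSERT here ...

      def get_items(self, feed_id):
          with self._lock:  # reads wait while a write is in progress
              return []     # ... perform the SELECT here ...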

create basic aggregation framework

Create the basic application structure:

  • implement data types for feeds, feed items etc.
  • implement the service structure: interfaces for storage, scheduler, source etc., and basic implementations (memory storage, standalone scheduler and feed source); a sketch of such interfaces follows below

Do NOT yet:

  • implement clustering
  • add HTTP support (maybe add some kind of temporary output instead)
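
A sketch of what those interfaces could look like with abstract base classes (the method names are assumptions, not the actual API):

  from abc import ABC, abstractmethod

  class Storage(ABC):
      """Implemented by all storage backends (memory, SQLite, ...)."""

      @abstractmethod
      def put_feed(self, feed):
          """Persist a feed and its items."""

      @abstractmethod
      def get_feeds(self):
          """Return all stored feeds."""

  class Source(ABC):
      """Implemented by all data sources (RSS/Atom, scrapers, ...)."""

      @abstractmethod
      def read(self):
          """Fetch and return the current feed items."""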

add management interface

Currently feeds are configured in sources.txt, with no way of managing them once the application is running.

Add a REST management API and a CLI management interface.

Requires #11 to be done first.

improve documentation

The documentation could be improved:

  • add build information (Sphinx dependency, Makefile explanations etc.)
  • set up gh-pages to host the generated HTML
  • add copyright/license headers to source files

remove DummySource's item IDs

Currently DummySource generates item IDs that are checked in tests. Sources that generate their own IDs conflict with autogenerated storage IDs, resulting in other feeds' items being overwritten.

Remove item IDs from DummySource-generated feed items: all sources should add items with item.id=None. Make sure to update tests, maybe use item.guid for deterministic checks instead.

add unit/integration test framework

At this point there should be an application with testable parts. Add support for automated tests.

Make sure there's a way to test clustering on a single machine, including hard shutdowns of nodes at inopportune times.
