feed_seeker's Introduction

This is the source code for the Media Cloud core system. Media Cloud, a joint project of the Berkman Center for Internet & Society at Harvard University and the Center for Civic Media at MIT, is an open source, open data platform that allows researchers to answer complex quantitative and qualitative questions about the content of online media.

For more information on Media Cloud, go to mediacloud.org.

Note: Most users prefer to use Media Cloud's API and public tools to query our data instead of running their own Media Cloud instance.

The code in this repository will be of interest to those users who wish to run their own Media Cloud instance and users of the public tools who want to understand how Media Cloud is implemented.

The Media Cloud code here does three things:

  • Runs a web app that allows you to manage a set of media sources and their feeds.

  • Periodically crawls the feeds set up within the web app and downloads any new stories found within them.

  • Extracts the substantive text from the downloaded story content (minus the ads, navigation, comments, etc.) and associates a set of tags with each story based on that extracted text.

For very brief installation instructions, see INSTALL.markdown.

Please send us a note at [email protected] if you are using any of this code or if you have any questions. We are very interested in knowing who's using the code and for what.

Build Status

CI workflow badge: Pull, build, push, test

History of the Project

Print newspapers are declaring bankruptcy nationwide. High-profile blogs are proliferating. Media companies are exploring new production techniques and business models in a landscape that is increasingly dominated by the Internet. In the midst of this upheaval, it is difficult to know what is actually happening to the shape of our news. Beyond one-off anecdotes or painstaking manual content analysis, there are few ways to examine the emerging news ecosystem.

The idea for Media Cloud emerged through a series of discussions between faculty and friends of the Berkman Center. The conversations would follow a predictable pattern: one person would ask a provocative question about what was happening in the media landscape, someone else would suggest interesting follow-on inquiries, and everyone would realize that a good answer would require heavy number crunching. Nobody had the time to develop a huge infrastructure and download all the news just to answer a single question. However, there were eventually enough of these questions that we decided to build a tool for everyone to use.

Some of the early driving questions included:

  • Do bloggers introduce storylines into mainstream media or the other way around?
  • What parts of the world are being covered or ignored by different media sources?
  • Where do stories begin?
  • How are competing terms for the same event used in different publications?
  • Can we characterize the overall mix of coverage for a given source?
  • How do patterns differ between local and national news coverage?
  • Can we track news cycles for specific issues?
  • Do online comments shape the news?

Media Cloud offers a way to quantitatively examine all of these challenging questions by collecting and analyzing the news stream of tens of thousands of online sources.

Using Media Cloud, academic researchers, journalism critics, policy advocates, media scholars, and others can examine which media sources cover which stories, what language different media outlets use in conjunction with different stories, and how stories spread from one media outlet to another.

Sponsors

Media Cloud is made possible by the generous support of the Ford Foundation, the Open Society Foundations, and the John D. and Catherine T. MacArthur Foundation.

Collaborators

Past and present collaborators include Morningside Analytics, Betaworks, and Bit.ly.

License

Media Cloud is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

Media Cloud is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more details.

You should have received a copy of the GNU Affero General Public License along with Media Cloud. If not, see <http://www.gnu.org/licenses/>.


feed_seeker's Issues

requests.exceptions.MissingSchema Error

When scanning www.nytimes.com (spider level 2), after discovering http://takingnote.blogs.nytimes.com/feed, the script errors out with a requests.exceptions.MissingSchema error. It appears the script is requesting a protocol-relative URL (one that begins with //) without an http/https scheme.

Traceback (most recent call last):
  File "./scan_feeds.py", line 7, in <module>
    for url in generate_feed_urls('http://www.nytimes.com',spider=2):
  File "/usr/local/lib/python3.6/dist-packages/feed_seeker/feed_seeker.py", line 420, in generate_feed_urls
    for feed in FeedSeeker(url, html).generate_feed_urls(spider=spider, max_links=max_links):
  File "/usr/local/lib/python3.6/dist-packages/feed_seeker/feed_seeker.py", line 200, in generate_feed_urls
    for url, _ in self._generate_feed_urls(spider=spider, max_links=max_links):
  File "/usr/local/lib/python3.6/dist-packages/feed_seeker/feed_seeker.py", line 252, in _generate_feed_urls
    for url, seen in spider_seeker._generate_feed_urls(**kwargs):
  File "/usr/local/lib/python3.6/dist-packages/feed_seeker/feed_seeker.py", line 252, in _generate_feed_urls
    for url, seen in spider_seeker._generate_feed_urls(**kwargs):
  File "/usr/local/lib/python3.6/dist-packages/feed_seeker/feed_seeker.py", line 227, in _generate_feed_urls
    if not self.html:
  File "/usr/local/lib/python3.6/dist-packages/feed_seeker/feed_seeker.py", line 152, in html
    self._html = self.fetcher(self.url)
  File "/usr/local/lib/python3.6/dist-packages/feed_seeker/feed_seeker.py", line 104, in default_fetch_function
    response = session.get(url)
  File "/usr/local/lib/python3.6/dist-packages/requests/sessions.py", line 546, in get
    return self.request('GET', url, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/requests/sessions.py", line 519, in request
    prep = self.prepare_request(req)
  File "/usr/local/lib/python3.6/dist-packages/requests/sessions.py", line 462, in prepare_request
    hooks=merge_hooks(request.hooks, self.hooks),
  File "/usr/local/lib/python3.6/dist-packages/requests/models.py", line 313, in prepare
    self.prepare_url(url, params)
  File "/usr/local/lib/python3.6/dist-packages/requests/models.py", line 387, in prepare_url
    raise MissingSchema(error)
requests.exceptions.MissingSchema: Invalid URL '//www.nytimes.com/2017/01/10/watching/faq.html': No schema supplied. Perhaps you meant http:////www.nytimes.com/2017/01/10/watching/faq.html
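
One way to guard against this, sketched here as an illustration rather than feed_seeker's actual fix, is to resolve every discovered link against the page it was found on before fetching, so protocol-relative links like //www.nytimes.com/... pick up a scheme. The function name resolve_link is hypothetical:

from urllib.parse import urljoin, urlparse


def resolve_link(page_url, link):
    """Resolve a possibly relative or protocol-relative link against its source page.

    '//www.nytimes.com/2017/01/10/watching/faq.html' found on
    'http://www.nytimes.com' becomes
    'http://www.nytimes.com/2017/01/10/watching/faq.html', so requests
    always receives a URL with a scheme.
    """
    absolute = urljoin(page_url, link)
    if not urlparse(absolute).scheme:
        # Last-resort fallback if no scheme could be inferred at all.
        absolute = 'http://' + absolute.lstrip('/')
    return absolute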

Requirements txt error

Hi,

When I try to install the package, this is the error message I receive.

Collecting feed_seeker
  Using cached https://files.pythonhosted.org/packages/fa/5e/ec1666a581b15829bbf2d4f83013e47f8ad89d99de82f905c73f13befe15/feed_seeker-1.0.0.tar.gz
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/private/var/folders/5w/s8hbw79j3vqc5cj4bg_gw7bh0000gn/T/pip-install-282Sdn/feed-seeker/setup.py", line 13, in <module>
        with open(requirements_file) as f:
      File "/Users/furkan/latestPY/lib/python2.7/codecs.py", line 896, in open
        file = __builtin__.open(filename, mode, buffering)
    IOError: [Errno 2] No such file or directory: '/private/var/folders/5w/s8hbw79j3vqc5cj4bg_gw7bh0000gn/T/pip-install-282Sdn/feed-seeker/requirements.txt'

----------------------------------------

Command "python setup.py egg_info" failed with error code 1 in /private/var/folders/5w/s8hbw79j3vqc5cj4bg_gw7bh0000gn/T/pip-install-282Sdn/feed-seeker/

feed_seeker scans the same links repeatedly

The current implementation appears to have no memory of previously searched URLs. This can cause large circular loops that never terminate, increasing scan time and bandwidth usage.
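
A minimal sketch of the bookkeeping that would prevent re-scanning, assuming a breadth-first spider; seen, fetch_html, and extract_links are hypothetical names rather than feed_seeker's actual internals:

def spider(start_url, fetch_html, extract_links, max_depth=2):
    """Breadth-first crawl that remembers every URL it has already visited."""
    seen = set()
    frontier = [(start_url, 0)]
    while frontier:
        url, depth = frontier.pop(0)
        if url in seen or depth > max_depth:
            continue  # already scanned (breaks circular loops) or too deep
        seen.add(url)
        html = fetch_html(url)
        yield url, html
        for link in extract_links(html):
            if link not in seen:
                frontier.append((link, depth + 1))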

evaluate and improve accuracy of feed discovery

This task is to develop a training and evaluation set that lets us validate the accuracy of feed discovery, and then to improve the heuristics of the feed_seeker module to increase that accuracy.

The goal of feed discovery is to find the smallest possible set of feeds that return all of the stories from a given media source. So if a source has a single 'all stories' feed, we would ideally return just that feed. If a source has no 'all stories' feed but has a bunch of feeds for independent sections, we should return all of those feeds. Getting a set of feeds that represents all stories for a source takes priority over returning the minimal number of feeds, but if there are shortcuts that return a single encompassing feed with high accuracy, we should use them.

To judge the accuracy of the feed discovery process, use this set of manually discovered feeds:

mediacloud/backend#333

Run the feed_seeker code to generate a set of feeds for each of the sources in the above set. Record how long it takes feed_seeker to discover the feeds for each media source.

Then do a best-effort comparison of the manual feeds vs. the feed_seeker feeds for each source. For each source, indicate whether the feed_seeker feeds contain all, most, some, or none of the stories for the source. If there is no feed for the source, indicate 'no feed'. Separately indicate whether the feed_seeker results included feeds that return stories that do not belong to the source (for example, if running feed_seeker on nytimes.com returns a Washington Post RSS feed). Generate precision and recall numbers for each of the above collections based on this evaluation.

Just use a best-guess eyeball estimate to determine the all/most/some/none score for each source. We don't need to directly compare lists of individual stories; just eyeball the set of stories in the manually discovered feeds vs. the stories in the feeds returned by feed_seeker and make your best estimate of feed_seeker's coverage.
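
For the precision and recall numbers mentioned above, one plausible feed-level definition, assuming each feed_seeker feed is hand-labeled as belonging to the source or not and each manually discovered feed is marked as covered or missed (the labeling is the eyeball judgment described above):

def precision_recall(feed_seeker_feeds, relevant_feeds, manual_feeds, covered_manual_feeds):
    """Feed-level precision and recall for one source or collection.

    precision: fraction of feed_seeker's feeds that actually belong to the source
    recall: fraction of the manually discovered feeds that feed_seeker covered
    All arguments are sets of canonical feed URLs.
    """
    precision = len(relevant_feeds) / len(feed_seeker_feeds) if feed_seeker_feeds else 0.0
    recall = len(covered_manual_feeds) / len(manual_feeds) if manual_feeds else 0.0
    return precision, recall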

After generating the initial accuracy metrics, try to improve feed_seeker to do a better job of discovering feeds. A couple of things that I suspect will improve the feed_seeker performance are:

  • try to guess using URL semantics first instead of last (e.g., for nytimes.com guess nytimes.com/rss) and just use that single feed if it is parseable and not empty (or maybe require some minimum number of stories before assuming it is a full feed); see the sketch after this list.

  • use feedly.com to search for possible RSS feeds. Last time I checked, feedly provided this functionality in its free, unauthenticated API. I commonly find in feedly RSS feeds that are neither published anywhere visible on a given site nor turned up by an 'rss nytimes.com' Google search. I assume that's because feedly is essentially a crowd-sourced discovery platform: it only takes one person to find an RSS feed, and then it is in the system for everyone to find.
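
A sketch of the first idea above, assuming the feedparser library is acceptable for checking whether a guessed URL is a parseable, non-empty feed; the candidate paths and the minimum-story threshold are illustrative guesses, not settled choices:

import feedparser

# Common feed locations to try before falling back to spidering (illustrative guesses).
CANDIDATE_PATHS = ('/rss', '/feed', '/rss.xml', '/atom.xml', '/feeds/all.rss')


def guess_feed_by_url(base_url, min_entries=10):
    """Return the first guessed feed URL that parses and has enough stories, else None."""
    for path in CANDIDATE_PATHS:
        candidate = base_url.rstrip('/') + path
        parsed = feedparser.parse(candidate)
        # bozo flags malformed feeds; require a minimum number of entries before
        # trusting the guess as a 'full' feed for the source.
        if not parsed.bozo and len(parsed.entries) >= min_entries:
            return candidate
    return None

For example, guess_feed_by_url('https://www.nytimes.com') would try https://www.nytimes.com/rss first and, if that parses with enough stories, skip the much slower spidering step.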

The weakness of this approach is that we will be overfitting the heuristics to the particular set of 50 feeds above. To get a true accuracy score, we'll need to repeat the evaluation process with a new set of randomly sampled sources. Let's just do the initial evaluation and improvements first and then consider whether we want to do another full evaluation.

Incorporate Feedly Support into feed_seeker

feedly.com offers a powerful API for discovering RSS feeds based on query terms (site URL, search term, etc.). This should be incorporated into feed_seeker to potentially increase the number of RSS feeds found for a given source.

Info: https://developer.feedly.com/v3/search/

Example Search:

curl 'https://cloud.feedly.com/v3/search/feeds/?query=nytimes.com&count=500'

Note: This endpoint does not require an API key.
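
A sketch of what that lookup might look like with requests, assuming the response JSON contains a results list whose entries carry a feedId of the form feed/<url> (see the Feedly search docs linked above; treat this as an assumption to verify):

import requests

FEEDLY_SEARCH_URL = 'https://cloud.feedly.com/v3/search/feeds/'


def feedly_feed_urls(query, count=500):
    """Query Feedly's unauthenticated search endpoint and yield candidate feed URLs."""
    response = requests.get(FEEDLY_SEARCH_URL,
                            params={'query': query, 'count': count},
                            timeout=30)
    response.raise_for_status()
    for result in response.json().get('results', []):
        feed_id = result.get('feedId', '')
        # Feedly prefixes feed URLs with 'feed/', e.g. 'feed/https://example.com/rss'.
        if feed_id.startswith('feed/'):
            yield feed_id[len('feed/'):]

These candidates could then be run through the same is_feed() style validation feed_seeker already applies to discovered links.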

Errors out with "requests.exceptions.TooManyRedirects: Exceeded 30 redirects."

Code used:

from feed_seeker import find_feed_url, generate_feed_urls

fh = open("nytimes_test_spider2.csv","w")

for url in generate_feed_urls('https://www.nytimes.com',spider=2):
    fh.write(url + "\n")

Possible Fix:

https://github.com/mitmedialab/feed_seeker/blob/0644e61dd8413b5f6eda76284509b17d5f5b23c8/feed_seeker/feed_seeker.py#L111

Add:

TooManyRedirects
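
A sketch of the kind of change that fix suggests, assuming the default fetch function currently lets TooManyRedirects propagate; here a redirect loop on one link is treated like any other unfetchable page (this is an illustration, not the library's actual code at the line linked above):

import requests


def default_fetch_function(url, timeout=30):
    """Fetch a page, treating redirect loops and other request failures as empty pages."""
    session = requests.Session()
    try:
        response = session.get(url, timeout=timeout)
        response.raise_for_status()
        return response.text
    except requests.exceptions.RequestException:
        # RequestException covers TooManyRedirects (as well as timeouts and
        # connection errors); one bad link should not abort the whole crawl.
        return ''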

Error:

Traceback (most recent call last):
  File "./scan_feeds.py", line 7, in <module>
    for url in generate_feed_urls('https://www.nytimes.com',spider=2):
  File "/usr/local/lib/python3.6/dist-packages/feed_seeker/feed_seeker.py", line 420, in generate_feed_urls
    for feed in FeedSeeker(url, html).generate_feed_urls(spider=spider, max_links=max_links):
  File "/usr/local/lib/python3.6/dist-packages/feed_seeker/feed_seeker.py", line 200, in generate_feed_urls
    for url, _ in self._generate_feed_urls(spider=spider, max_links=max_links):
  File "/usr/local/lib/python3.6/dist-packages/feed_seeker/feed_seeker.py", line 252, in _generate_feed_urls
    for url, seen in spider_seeker._generate_feed_urls(**kwargs):
  File "/usr/local/lib/python3.6/dist-packages/feed_seeker/feed_seeker.py", line 252, in _generate_feed_urls
    for url, seen in spider_seeker._generate_feed_urls(**kwargs):
  File "/usr/local/lib/python3.6/dist-packages/feed_seeker/feed_seeker.py", line 241, in _generate_feed_urls
    if FeedSeeker(url).is_feed():
  File "/usr/local/lib/python3.6/dist-packages/feed_seeker/feed_seeker.py", line 283, in is_feed
    if any(self.soup.find(tag) for tag in invalid_tags):
  File "/usr/local/lib/python3.6/dist-packages/feed_seeker/feed_seeker.py", line 283, in <genexpr>
    if any(self.soup.find(tag) for tag in invalid_tags):
  File "/usr/local/lib/python3.6/dist-packages/feed_seeker/feed_seeker.py", line 159, in soup
    self._soup = BeautifulSoup(self.html, 'lxml')
  File "/usr/local/lib/python3.6/dist-packages/feed_seeker/feed_seeker.py", line 152, in html
    self._html = self.fetcher(self.url)
  File "/usr/local/lib/python3.6/dist-packages/feed_seeker/feed_seeker.py", line 104, in default_fetch_function
    response = session.get(url)
  File "/usr/local/lib/python3.6/dist-packages/requests/sessions.py", line 546, in get
    return self.request('GET', url, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/requests/sessions.py", line 533, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.6/dist-packages/requests/sessions.py", line 668, in send
    history = [resp for resp in gen] if allow_redirects else []
  File "/usr/local/lib/python3.6/dist-packages/requests/sessions.py", line 668, in <listcomp>
    history = [resp for resp in gen] if allow_redirects else []
  File "/usr/local/lib/python3.6/dist-packages/requests/sessions.py", line 165, in resolve_redirects
    raise TooManyRedirects('Exceeded %s redirects.' % self.max_redirects, response=resp)
requests.exceptions.TooManyRedirects: Exceeded 30 redirects.

fix timeout bug

The timeout parameter seems not to work: when run against the Daily Mail, the module ran for four hours without finishing. The timeout should be a hard stop on the length of the whole process.
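
A minimal sketch of one way a hard wall-clock stop could be wrapped around the whole run, assuming the caller is willing to stop consuming results once a deadline passes (the wrapper name and parameters are illustrative, not feed_seeker's API):

import time


def with_deadline(generator, max_seconds):
    """Yield items from generator until a hard wall-clock deadline is reached.

    The check happens between items, so a single very slow fetch can still
    overrun; a truly hard stop would need the timeout enforced inside the
    fetcher itself or the work run in a killable subprocess.
    """
    deadline = time.monotonic() + max_seconds
    for item in generator:
        yield item
        if time.monotonic() > deadline:
            break

For example, for url in with_deadline(generate_feed_urls('https://www.dailymail.co.uk', spider=1), max_seconds=600): ... would stop yielding new feeds after roughly ten minutes.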
