ecprice / newsdiffs
Automatic scraper that tracks changes in news articles over time.
License: Other
I'm trying to scrape various pages, and some of them can't be accessed; it seems they're blocking non-browser requests. I've stumbled across this snippet, but I don't know how and where to put it:
https://stackoverflow.com/questions/13303449/urllib2-httperror-http-error-403-forbidden
Any pointers would be appreciated!
Best regards
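For what it's worth, here is a minimal sketch of the idea behind that snippet, assuming the project's Python 2 / urllib2 stack (grab_url() in parsers/baseparser.py would be the natural place for it): send a browser-like User-Agent header with each request. The fetch() helper below is hypothetical.

import urllib2

def fetch(url):
    # Hypothetical helper: many sites refuse urllib2's default user agent,
    # so present a browser-like one instead.
    req = urllib2.Request(url, headers={
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36',
    })
    return urllib2.urlopen(req, timeout=5).read()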
It is also 1.1 MB of HTML, suggesting the fix is to limit the number of results returned?
I'm new to Python and may be missing something, but I have followed the installation instructions and can't run the crawler; I just get the following error:
DatabaseError: no such table: Articles
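In case it helps: that error usually means the database tables were never created. Assuming the setup described elsewhere in these issues, running the sync and migration commands before the scraper should create them:

$ python website/manage.py syncdb
$ python website/manage.py migrate frontend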
An item I noticed for the NYT scraper (and probably the CNN one too, although I don't use it): the feeder_pat line needs to be updated for 2020.
As of now it reads:
feeder_pat = '^https?://www.nytimes.com/201'
... to catch articles post Jan 1, it needs to be updated to:
feeder_pat = '^https?://www.nytimes.com/202'
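An alternative sketch (my suggestion, untested): match any four-digit year, so the pattern doesn't need touching each decade. This assumes NYT article URLs always begin with the year:

# Matches /2019/, /2020/, /2031/, etc.; assumes the year always follows the host.
feeder_pat = r'^https?://www\.nytimes\.com/\d{4}/'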
I've updated this on my fork, but I have at least one other update to the parser that others may or may not want, so I was hesitant to submit the pull request. I wanted to document it here though in case others were wondering why the system isn't catching new articles post Jan 1.
Thanks much,
Vishnu
2013-06-08 22:05:50.307:ERROR:CalledProcessError when updating http://edition.cnn.com/2013/06/06/travel/worlds-best-beach-bars/index.html
2013-06-08 22:05:50.313:ERROR:"fatal: Invalid object name 'HEAD'.\n"
2013-06-08 22:05:50.315:ERROR:Traceback (most recent call last):
File "/home/jdeer/website/frontend/management/commands/scraper.py", line 355, in update_versions
update_article(article)
File "/home/jdeer/website/frontend/management/commands/scraper.py", line 301, in update_article
article)
File "/home/jdeer/website/frontend/management/commands/scraper.py", line 208, in add_to_git_repo
previous = run_git_command(['show', 'HEAD:'+filename])
File "/home/jdeer/website/frontend/management/commands/scraper.py", line 145, in run_git_command
stderr=subprocess.STDOUT)
File "/usr/lib/python2.7/subprocess.py", line 575, in check_output
raise CalledProcessError(retcode, cmd, output=output)
CalledProcessError: Command '['git', 'show', u'HEAD:edition.cnn.com/2013/06/06/travel/worlds-best-beach-bars/index.html']' returned non-zero exit status 128
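"fatal: Invalid object name 'HEAD'" is what git prints when the repository has no commits yet, so git show HEAD:<file> exits with status 128. A possible guard for add_to_git_repo (a sketch against the scraper.py shown in the traceback, using its existing run_git_command helper):

import subprocess

try:
    previous = run_git_command(['show', 'HEAD:' + filename])
except subprocess.CalledProcessError:
    # No commits yet (or the file is not in HEAD): treat as no previous version.
    previous = None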
requirements.txt is missing south and html5lib
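A minimal requirements.txt covering the gap might look like this (the Django pin is an assumption, taken from the "specify Django 1.4" suggestion elsewhere in these issues):

Django==1.4
South
html5lib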
Poking around to build my own parser, I followed the tips in the README, and I keep getting TypeErrors when I attempt to run the BBC parser on a story, and ditto for CNN. (Others seem fine, although Tagesschau returns no story URLs from test_parser.py tagesschau.TagesschauParser.)
For BBC, which is the parser used in the README, I tried it with the URL from the README and with a fresh URL fetched by test_parser.py bbc.BBCParser; I get the same error either way:
ryantate@ryantate:~/dist/python/newsdiffs$ python parsers/test_parser.py bbc.BBCParser http://www.bbc.co.uk/news/uk-21649494
Traceback (most recent call last):
File "parsers/test_parser.py", line 29, in <module>
print unicode(parsed_article)
File "/home/ryantate/dist/python/newsdiffs/parsers/baseparser.py", line 138, in __unicode__
self.body,)))
TypeError: sequence item 0: expected string or Unicode, NoneType found
ryantate@ryantate:~/dist/python/newsdiffs$ python parsers/test_parser.py bbc.BBCParser http://www.bbc.co.uk/news/technology-34044506
Traceback (most recent call last):
File "parsers/test_parser.py", line 29, in <module>
print unicode(parsed_article)
File "/home/ryantate/dist/python/newsdiffs/parsers/baseparser.py", line 138, in __unicode__
self.body,)))
TypeError: sequence item 0: expected string or Unicode, NoneType found
ryantate@ryantate:~/dist/python/newsdiffs$
CNN:
ryantate@ryantate:~/dist/python/newsdiffs$ python parsers/test_parser.py cnn.CNNParser http://edition.cnn.com/2015/08/24/sport/vincenzo-nibali-tour-of-spain/index.html
Traceback (most recent call last):
File "parsers/test_parser.py", line 29, in <module>
print unicode(parsed_article)
File "/home/ryantate/dist/python/newsdiffs/parsers/baseparser.py", line 138, in __unicode__
self.body,)))
TypeError: sequence item 0: expected string or Unicode, NoneType found
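The traceback means the first item being joined in __unicode__ is None, i.e. the parser failed to extract that field, most likely because the site's markup changed. A defensive sketch for baseparser.py (the field names here are my guess at what gets joined):

def __unicode__(self):
    # Substitute u'' for any field the parser failed to find, instead of
    # letting the join raise a TypeError on None.
    fields = (self.date, self.title, self.byline, self.body)
    return u'\n'.join(f if f is not None else u'' for f in fields)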
In the last few weeks, the NYT scraper has stopped working; it appears they are now blocking requests with the default user agent of "python-requests".
This can be fixed by editing ~/parsers/baseparser.py and adding code to the grab_url section as follows. It is set up here to rotate randomly among 10 different user agents. The section between opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj)) and retry = False is what has been added. Of course, you can add whatever user agents you want here.
import cookielib
import random
import socket
import time
import urllib2

def grab_url(url, max_depth=5, opener=None):
    if opener is None:
        cj = cookielib.CookieJar()
        opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    user_agent_list = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36 OPR/72.0.3815.400',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:83.0) Gecko/20100101 Firefox/83.0',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9',
        'Googlebot/2.1 (+http://www.google.com/bot.html)',
        'UCWEB/2.0 (compatible; Googlebot/2.1; +google.com/bot.html)',
        'facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)',
        'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko; Google Page Speed Insights) Chrome/41.0.2272.118 Safari/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:77.0) Gecko/20100101 Firefox/77.0',
        'Mozilla/5.0 (X11; Linux x86_64)',
    ]
    # Pick one agent at random per attempt; a loop over random.choice would
    # be redundant, since only the last assignment would take effect. Each
    # retry re-enters grab_url and so gets a fresh random agent.
    user_agent = random.choice(user_agent_list)
    opener.addheaders = [('User-Agent', user_agent)]
    retry = False
    try:
        text = opener.open(url, timeout=5).read()
        if '<title>NY Times Advertisement</title>' in text:
            retry = True
    except socket.timeout:
        retry = True
    if retry:
        if max_depth == 0:
            raise Exception('Too many attempts to download %s' % url)
        time.sleep(0.5)
        return grab_url(url, max_depth - 1, opener)
    return text
I've updated this on my fork, but I have several other updates that are somewhat specific that others may or may not want, so I was hesitant to submit the pull request. I wanted to document it here though in case others were wondering why the system isn't catching new articles for NYT.
Thanks,
Vishnu
What's the actual roadmap for newsdiffs?
Are we supposed to submit parsers, so you will host them all and act as "universal" newsdiff, or are we supposed to fork and self-host, in particular for non-English sources?
I think a sort of official roadmap should be added to (or linked from) the readme.
I appreciate this project is pretty old, but I'm hoping someone may be able to help!
I am running python ./website/manage.py scraper as outlined in the README, but it is throwing an SSL error:
URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:727)>
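If this is an older Python 2.7 without current CA certificates, one workaround for local testing (a sketch; it disables certificate verification entirely, so don't use it for anything sensitive) is to open URLs with an unverified SSL context:

import ssl
import urllib2

# Available on Python 2.7.9+; skips certificate verification.
context = ssl._create_unverified_context()
text = urllib2.urlopen('https://www.nytimes.com/', context=context).read()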
Has anyone successfully deployed Newsdiffs on Heroku?
I noticed on/about May 8 2018, the NYT scraper no longer seems to work. I'm not so great with digesting HTML and Python, but it looks like the way NYT articles are encoded and how the different fields are tagged has changed. I will try to play with this and see if I can figure it out, but if anyone has any expertise here, any assistance would be much appreciated.
Perhaps the README should specify Django 1.4.
Or better yet, there could be a requirements.txt file for pip install -r requirements.txt.
README.md doesn't specify the conditions this 'bot' uses to decide whether it will access a paywalled site that has a restrictive robots.txt in place.
I just found your program via netzpolitik.org, and I wondered which license you use.
Is your code available under a free license, like GPL, BSD or MIT?
As Python code, it would make a great base to build upon!
Canadian news sites tend to go trending easily because of the digital culture in Canada; it would be helpful to see revisions of Canadian news.
The CBC, Tyee, National Post, Globe and Mail, Toronto Star and Toronto Sun would be a good starting collection.
It would be great to query articles by time via the Memento-Datetime HTTP header. See the Memento intro and SiteStory for an implementation.
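For anyone unfamiliar with the protocol: a Memento client asks a TimeGate for the version of a page nearest a given datetime via the Accept-Datetime request header, and responses carry a Memento-Datetime header. A sketch of such a request (the TimeGate URL here is hypothetical; newsdiffs exposes no such endpoint today):

import urllib2

# Hypothetical TimeGate endpoint sitting in front of the article history.
req = urllib2.Request(
    'http://newsdiffs.org/timegate/http://www.nytimes.com/some-article',
    headers={'Accept-Datetime': 'Thu, 01 Aug 2013 00:00:00 GMT'})
response = urllib2.urlopen(req)
print response.headers.get('Memento-Datetime')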
I received an email from a local newspaper last night. They noticed the scraper (it only took them five months!), and they've asked me to rate-limit the process. I've put a time.sleep(0.3) at the end of baseparser.py's grab_url(), which should keep them happy.
I'm opening this issue so people know about the problem.
WikiLeaks deletes its tweets and makes small but significant changes to the text on its website not infrequently. These differences have news value [1, 2].
WikiLeaks (rightfully, in my opinion) presents itself as a highly ethical news organization building an objective historical record of what humanity does. In that light, Julian Assange has stated [3] his concern with news organizations changing the material on their websites without alerting readers and has also asked [4] US news outlets to make corrections about him on their websites more prominent.
The same attention should be afforded to WikiLeaks itself. Especially as WikiLeaks is a new media organization with global impact, tracking its changes would fit right into NewsDiffs' mission.
[1] https://twitter.com/DouglasLucas/status/366712539612061697
[3] http://www.e-flux.com/journal/in-conversation-with-julian-assange-part-i/
I saw a tweet about this article: http://mobile.nytimes.com/2015/08/14/us/politics/joe-biden-on-beach-vacation-wades-further-into-16-bid.html?referrer=&_r=0, but Newsdiffs doesn't know anything about it. Because I'm clever, I tried this instead: http://www.nytimes.com/2015/08/14/us/politics/joe-biden-on-beach-vacation-wades-further-into-16-bid.html, which worked.
But the system should be smarter about suggesting similar URLs, no?
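A sketch of what "smarter" could mean (a hypothetical helper, not in the repo): canonicalize URLs before lookup, mapping known mirror hosts to their main host and dropping tracking query strings:

import urlparse

CANONICAL_HOSTS = {'mobile.nytimes.com': 'www.nytimes.com'}

def canonicalize(url):
    parts = urlparse.urlsplit(url)
    host = CANONICAL_HOSTS.get(parts.netloc, parts.netloc)
    # Drop query strings like ?referrer=&_r=0 that don't change the article.
    return urlparse.urlunsplit((parts.scheme, host, parts.path, '', ''))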
I'm getting "Alas! We don't seem to know anything about this article. Sorry! :(" on pages for an article that definitely has recorded changes.
I'm not sure what information would be helpful for my own instance, but I have a local instance running, and at http://127.0.0.1:8000/article-history/www.sfgate.com/bayarea/article/As-Napa-cheers-quake-recovery-some-question-6462726.php I'm seeing this:
Article Change Log
http://www.sfgate.com/bayarea/article/As-Napa-cheers-quake-recovery-some-question-6462726.php
Alas! We don't seem to know anything about this article. Sorry! :(
But, if I come in through browse: http://127.0.0.1:8000/diff/150/162/www.sfgate.com/bayarea/article/As-Napa-cheers-quake-recovery-some-question-6462726.php I definitely see changes that I can compare.
I see the same thing happening to a URL that is tracked in my instance and on newsdiffs.org: http://127.0.0.1:8000/article-history/www.washingtonpost.com/local/montgomery-county-housing-application-moves-online/2015/08/20/05d93a2a-468f-11e5-846d-02792f854297_story.html gives the "Alas!" error, even though if I come in through browse I see the (relatively minor) changes. I'd love any help figuring out how I can troubleshoot this.
ModuleNotFoundError: No module named 'website'
Windows OS
For testing purposes I just cloned the repo and changed all the necessary database-related settings. After that, I get an import error:
(django_env)sh@sh--work:~/Repos/newsdiffs$ python website/manage.py syncdb
Traceback (most recent call last):
File "website/manage.py", line 17, in <module>
execute_from_command_line(sys.argv)
File "/home/sh/django_env/local/lib/python2.7/site-packages/django/core/management/__init__.py", line 453, in execute_from_command_line
utility.execute()
File "/home/sh/django_env/local/lib/python2.7/site-packages/django/core/management/__init__.py", line 392, in execute
self.fetch_command(subcommand).run_from_argv(self.argv)
File "/home/sh/django_env/local/lib/python2.7/site-packages/django/core/management/__init__.py", line 263, in fetch_command
app_name = get_commands()[subcommand]
File "/home/sh/django_env/local/lib/python2.7/site-packages/django/core/management/__init__.py", line 109, in get_commands
apps = settings.INSTALLED_APPS
File "/home/sh/django_env/local/lib/python2.7/site-packages/django/conf/__init__.py", line 52, in __getattr__
self._setup(name)
File "/home/sh/django_env/local/lib/python2.7/site-packages/django/conf/__init__.py", line 47, in _setup
self._wrapped = Settings(settings_module)
File "/home/sh/django_env/local/lib/python2.7/site-packages/django/conf/__init__.py", line 132, in __init__
raise ImportError("Could not import settings '%s' (Is it on sys.path?): %s" % (self.SETTINGS_MODULE, e))
ImportError: Could not import settings 'website.settings' (Is it on sys.path?): No module named website.settings
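A likely cause, judging from the traceback: running python website/manage.py puts the website/ directory on sys.path rather than the repository root, so the website.settings module can't be imported. Running from the repo root with the root itself on PYTHONPATH may fix it (an assumption, not verified on Windows):

$ cd ~/Repos/newsdiffs
$ PYTHONPATH=. python website/manage.py syncdb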
I’ve followed the guide (to my knowledge), but I get this error:
$ python website/manage.py scraper
DatabaseError: table Articles has no column named git_dir
Here is what syncdb returns:
$ python website/manage.py syncdb
Syncing...
Creating tables ...
Installing custom SQL ...
Installing indexes ...
Installed 0 object(s) from 0 fixture(s)
Synced:
> django.contrib.contenttypes
> django.contrib.sessions
> django.contrib.sites
> south
Not synced (use migrations):
- frontend
$ python website/manage.py migrate frontend
Running migrations for frontend:
- Nothing to migrate.
- Loading initial data for frontend.
Installed 0 object(s) from 0 fixture(s)
I've made sure $PYTHONPATH and $DJANGO_SETTINGS_MODULE are set correctly; still the same error.
Any ideas?
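One observation that may help: frontend is managed by South, so a new column like git_dir only arrives via a migration, and "Nothing to migrate" while the column is missing suggests the recorded migration history is out of sync with the actual schema. A blunt workaround (an assumption, and it discards existing data) is to delete the database and rebuild it:

$ python website/manage.py syncdb
$ python website/manage.py migrate frontend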
print_link = soup.findAll('a', href=re.compile('http://dyn.politico.com/printstory.cfm.*'))[0].get('href')
IndexError: list index out of range
Indeed, Politico's pages don't have this kind of print_link anymore.
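A defensive rewrite of that line (a sketch; the parser would still need a fallback for fetching the article without its print version):

# Don't assume the print link still exists on the page.
links = soup.findAll('a', href=re.compile('http://dyn.politico.com/printstory.cfm.*'))
print_link = links[0].get('href') if links else None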
I had to comment out the continue in the RJECTING case inside get_articles(), because somehow my URLs were arriving in get_articles() with a leading and trailing space. I'm using a custom parser, but the site otherwise works fine for me; this custom parser works fine, and nytimes works fine, with or without my change.
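A less invasive fix (a sketch) would be to strip the whitespace inside get_articles() before the rejection check, rather than disabling the check entirely:

url = url.strip()  # drop stray leading/trailing spaces before matching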