ecprice / newsdiffs
Automatic scraper that tracks changes in news articles over time.
License: Other
I'm trying to scrape various pages, and some of them can't be accessed; it seems they're blocking non-browser requests. I've stumbled across this snippet, but I don't know how and where to put it:
https://stackoverflow.com/questions/13303449/urllib2-httperror-http-error-403-forbidden
Any pointers would be appreciated!
Best regards
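For what it's worth, here is a minimal sketch of the idea behind that snippet, assuming the project's Python 2 / urllib2 stack (grab_url() in parsers/baseparser.py would be the natural place for it): send a browser-like User-Agent header with each request. The fetch() helper below is hypothetical.

import urllib2

def fetch(url):
    # Hypothetical helper: many sites refuse urllib2's default user agent,
    # so present a browser-like one instead.
    req = urllib2.Request(url, headers={
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36',
    })
    return urllib2.urlopen(req, timeout=5).read()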
It is also 1.1 MB of HTML, suggesting the fix is to limit the number of results returned?
I'm new to Python and may be missing something, but I have followed the installation instructions and can't run the crawler; I just get the following error:
DatabaseError: no such table: Articles
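In case it helps: that error usually means the database tables were never created. Assuming the setup described elsewhere in these issues, running the sync and migration commands before the scraper should create them:

$ python website/manage.py syncdb
$ python website/manage.py migrate frontend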
An item I noticed for the NYT scraper (and probably the CNN one too, although I don't use it): the feeder_pat line needs to be updated for 2020.
As of now it reads:
feeder_pat = '^https?://www.nytimes.com/201'
... to catch articles post Jan 1, it needs to be updated to:
feeder_pat = '^https?://www.nytimes.com/202'
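An alternative sketch (my suggestion, untested): match any four-digit year, so the pattern doesn't need touching each decade. This assumes NYT article URLs always begin with the year:

# Matches /2019/, /2020/, /2031/, etc.; assumes the year always follows the host.
feeder_pat = r'^https?://www\.nytimes\.com/\d{4}/'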
I've updated this on my fork, but I have at least one other update to the parser that others may or may not want, so I was hesitant to submit the pull request. I wanted to document it here though in case others were wondering why the system isn't catching new articles post Jan 1.
Thanks much,
Vishnu
2013-06-08 22:05:50.307:ERROR:CalledProcessError when updating http://edition.cnn.com/2013/06/06/travel/worlds-best-beach-bars/index.html
2013-06-08 22:05:50.313:ERROR:"fatal: Invalid object name 'HEAD'.\n"
2013-06-08 22:05:50.315:ERROR:Traceback (most recent call last):
File "/home/jdeer/website/frontend/management/commands/scraper.py", line 355, in update_versions
update_article(article)
File "/home/jdeer/website/frontend/management/commands/scraper.py", line 301, in update_article
article)
File "/home/jdeer/website/frontend/management/commands/scraper.py", line 208, in add_to_git_repo
previous = run_git_command(['show', 'HEAD:'+filename])
File "/home/jdeer/website/frontend/management/commands/scraper.py", line 145, in run_git_command
stderr=subprocess.STDOUT)
File "/usr/lib/python2.7/subprocess.py", line 575, in check_output
raise CalledProcessError(retcode, cmd, output=output)
CalledProcessError: Command '['git', 'show', u'HEAD:edition.cnn.com/2013/06/06/travel/worlds-best-beach-bars/index.html']' returned non-zero exit status 128
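"fatal: Invalid object name 'HEAD'" is what git prints when the repository has no commits yet, so git show HEAD:<file> exits with status 128. A possible guard for add_to_git_repo (a sketch against the scraper.py shown in the traceback, using its existing run_git_command helper):

import subprocess

try:
    previous = run_git_command(['show', 'HEAD:' + filename])
except subprocess.CalledProcessError:
    # No commits yet (or the file is not in HEAD): treat as no previous version.
    previous = None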
requirements.txt is missing south and html5lib
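A minimal requirements.txt covering the gap might look like this (the Django pin is an assumption, taken from the "specify Django 1.4" suggestion elsewhere in these issues):

Django==1.4
South
html5lib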
Poking around to build my own parser, I followed the tips in the README, and I keep getting TypeErrors when I attempt to run the BBC parser on a story, and ditto for CNN. (Others seem fine, although Tagesschau returns no story URLs from test_parser.py tagesschau.TagesschauParser.)
For BBC, which is the parser used in the README, I tried it with the URL from the README and with a fresh URL fetched by test_parser.py bbc.BBCParser; I get the same error either way:
ryantate@ryantate:~/dist/python/newsdiffs$ python parsers/test_parser.py bbc.BBCParser http://www.bbc.co.uk/news/uk-21649494
Traceback (most recent call last):
File "parsers/test_parser.py", line 29, in <module>
print unicode(parsed_article)
File "/home/ryantate/dist/python/newsdiffs/parsers/baseparser.py", line 138, in __unicode__
self.body,)))
TypeError: sequence item 0: expected string or Unicode, NoneType found
ryantate@ryantate:~/dist/python/newsdiffs$ python parsers/test_parser.py bbc.BBCParser http://www.bbc.co.uk/news/technology-34044506
Traceback (most recent call last):
File "parsers/test_parser.py", line 29, in <module>
print unicode(parsed_article)
File "/home/ryantate/dist/python/newsdiffs/parsers/baseparser.py", line 138, in __unicode__
self.body,)))
TypeError: sequence item 0: expected string or Unicode, NoneType found
ryantate@ryantate:~/dist/python/newsdiffs$
CNN:
ryantate@ryantate:~/dist/python/newsdiffs$ python parsers/test_parser.py cnn.CNNParser http://edition.cnn.com/2015/08/24/sport/vincenzo-nibali-tour-of-spain/index.html
Traceback (most recent call last):
File "parsers/test_parser.py", line 29, in <module>
print unicode(parsed_article)
File "/home/ryantate/dist/python/newsdiffs/parsers/baseparser.py", line 138, in __unicode__
self.body,)))
TypeError: sequence item 0: expected string or Unicode, NoneType found
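The traceback means the first item being joined in __unicode__ is None, i.e. the parser failed to extract that field, most likely because the site's markup changed. A defensive sketch for baseparser.py (the field names here are my guess at what gets joined):

def __unicode__(self):
    # Substitute u'' for any field the parser failed to find, instead of
    # letting the join raise a TypeError on None.
    fields = (self.date, self.title, self.byline, self.body)
    return u'\n'.join(f if f is not None else u'' for f in fields)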
In the last few weeks, the NYT scraper has stopped working; it appears they are now blocking requests with the default user agent of "python-requests".
This can be fixed by editing ~/parsers/baseparser.py and adding code to the grab_url section as follows. It is set up here to rotate randomly among 10 different user agents. The section between opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj)) and retry = False is what has been added. Of course, you can add whatever user agents you want here.
import cookielib
import random
import socket
import time
import urllib2

def grab_url(url, max_depth=5, opener=None):
    if opener is None:
        cj = cookielib.CookieJar()
        opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    user_agent_list = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36 OPR/72.0.3815.400',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:83.0) Gecko/20100101 Firefox/83.0',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9',
        'Googlebot/2.1 (+http://www.google.com/bot.html)',
        'UCWEB/2.0 (compatible; Googlebot/2.1; +google.com/bot.html)',
        'facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)',
        'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko; Google Page Speed Insights) Chrome/41.0.2272.118 Safari/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:77.0) Gecko/20100101 Firefox/77.0',
        'Mozilla/5.0 (X11; Linux x86_64)',
    ]
    # Pick one agent at random per attempt; a loop over random.choice would
    # be redundant, since only the last assignment would take effect. Each
    # retry re-enters grab_url and so gets a fresh random agent.
    user_agent = random.choice(user_agent_list)
    opener.addheaders = [('User-Agent', user_agent)]
    retry = False
    try:
        text = opener.open(url, timeout=5).read()
        if '<title>NY Times Advertisement</title>' in text:
            retry = True
    except socket.timeout:
        retry = True
    if retry:
        if max_depth == 0:
            raise Exception('Too many attempts to download %s' % url)
        time.sleep(0.5)
        return grab_url(url, max_depth - 1, opener)
    return text
I've updated this on my fork, but I have several other updates that are somewhat specific that others may or may not want, so I was hesitant to submit the pull request. I wanted to document it here though in case others were wondering why the system isn't catching new articles for NYT.
Thanks,
Vishnu
What's the actual roadmap for newsdiffs?
Are we supposed to submit parsers, so you will host them all and act as "universal" newsdiff, or are we supposed to fork and self-host, in particular for non-English sources?
I think a sort of official roadmap should be added to (or linked from) the readme.
I appreciate this project is pretty old, but I'm hoping someone may be able to help!
I am running python ./website/manage.py scraper as outlined in the README, but it is throwing an SSL error:
URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:727)>
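If this is an older Python 2.7 without current CA certificates, one workaround for local testing (a sketch; it disables certificate verification entirely, so don't use it for anything sensitive) is to open URLs with an unverified SSL context:

import ssl
import urllib2

# Available on Python 2.7.9+; skips certificate verification.
context = ssl._create_unverified_context()
text = urllib2.urlopen('https://www.nytimes.com/', context=context).read()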
Has anyone successfully deployed Newsdiffs on Heroku?
I noticed on/about May 8 2018, the NYT scraper no longer seems to work. I'm not so great with digesting HTML and Python, but it looks like the way NYT articles are encoded and how the different fields are tagged has changed. I will try to play with this and see if I can figure it out, but if anyone has any expertise here, any assistance would be much appreciated.
Perhaps the README should specify Django 1.4.
Or better yet, there could be a requirements.txt file for pip install -r requirements.txt.
README.md doesn't specify the conditions this 'bot' uses to decide whether it will access a paywalled site that has a restrictive robots.txt in place.
I just found your program via netzpolitik.org, and I wondered which license you use.
Is your code available under a free license, like GPL, BSD or MIT?
As Python code, it would make a great base to build upon!
Canadian news sites tend to go trending easily because of the digital culture in Canada; it would be helpful to see revisions of Canadian news.
The CBC, Tyee, National Post, Globe and Mail, Toronto Star and Toronto Sun would be a good starting collection.
It would be great to query articles by time via the Memento-Datetime HTTP header. See the Memento intro and SiteStory for an implementation.
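For anyone unfamiliar with the protocol: a Memento client asks a TimeGate for the version of a page nearest a given datetime via the Accept-Datetime request header, and responses carry a Memento-Datetime header. A sketch of such a request (the TimeGate URL here is hypothetical; newsdiffs exposes no such endpoint today):

import urllib2

# Hypothetical TimeGate endpoint sitting in front of the article history.
req = urllib2.Request(
    'http://newsdiffs.org/timegate/http://www.nytimes.com/some-article',
    headers={'Accept-Datetime': 'Thu, 01 Aug 2013 00:00:00 GMT'})
response = urllib2.urlopen(req)
print response.headers.get('Memento-Datetime')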
I received an email from a local newspaper last night. They noticed the scraper (it only took them five months!), and they've asked me to rate-limit the process. I've put a time.sleep(0.3) at the end of baseparser.py's grab_url(), which should keep them happy.
I'm opening this issue so people know about the problem.
WikiLeaks deletes its tweets and makes small but significant changes to the text on its website not infrequently. These differences have news value [1, 2].
WikiLeaks (rightfully, in my opinion) presents itself as a highly ethical news organization building an objective historical record of what humanity does. In that light, Julian Assange has stated [3] his concern with news organizations changing the material on their websites without alerting readers and has also asked [4] US news outlets to make corrections about him on their websites more prominent.
The same attention should be afforded to WikiLeaks itself. Especially as WikiLeaks is a new media organization with global impact, tracking its changes would fit right into NewsDiffs' mission.
[1] https://twitter.com/DouglasLucas/status/366712539612061697
[3] http://www.e-flux.com/journal/in-conversation-with-julian-assange-part-i/
I saw a tweet about this article: http://mobile.nytimes.com/2015/08/14/us/politics/joe-biden-on-beach-vacation-wades-further-into-16-bid.html?referrer=&_r=0, but Newsdiffs doesn't know anything about it. Because I'm clever, I tried this instead: http://www.nytimes.com/2015/08/14/us/politics/joe-biden-on-beach-vacation-wades-further-into-16-bid.html, which worked.
But the system should be smarter about suggesting similar URLs, no?
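A sketch of what "smarter" could mean (a hypothetical helper, not in the repo): canonicalize URLs before lookup, mapping known mirror hosts to their main host and dropping tracking query strings:

import urlparse

CANONICAL_HOSTS = {'mobile.nytimes.com': 'www.nytimes.com'}

def canonicalize(url):
    parts = urlparse.urlsplit(url)
    host = CANONICAL_HOSTS.get(parts.netloc, parts.netloc)
    # Drop query strings like ?referrer=&_r=0 that don't change the article.
    return urlparse.urlunsplit((parts.scheme, host, parts.path, '', ''))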
I'm getting "Alas! We don't seem to know anything about this article. Sorry! :(" on pages for an article that definitely has recorded changes.
I'm not sure what information would be helpful for my own instance, but I have a local instance running, and at http://127.0.0.1:8000/article-history/www.sfgate.com/bayarea/article/As-Napa-cheers-quake-recovery-some-question-6462726.php I'm seeing this:
Article Change Log
http://www.sfgate.com/bayarea/article/As-Napa-cheers-quake-recovery-some-question-6462726.php
Alas! We don't seem to know anything about this article. Sorry! :(
But, if I come in through browse: http://127.0.0.1:8000/diff/150/162/www.sfgate.com/bayarea/article/As-Napa-cheers-quake-recovery-some-question-6462726.php I definitely see changes that I can compare.
I see the same thing happening to a URL that is tracked in my instance and on newsdiffs.org: http://127.0.0.1:8000/article-history/www.washingtonpost.com/local/montgomery-county-housing-application-moves-online/2015/08/20/05d93a2a-468f-11e5-846d-02792f854297_story.html gives the "Alas!" error, even though if I come in through browse I see the (relatively minor) changes. I'd love any help figuring out how I can troubleshoot this.
ModuleNotFoundError: No module named 'website'
Windows OS
For testing purposes I just cloned the repo and changed all the necessary database-related settings. After that, I get an import error:
(django_env)sh@sh--work:~/Repos/newsdiffs$ python website/manage.py syncdb
Traceback (most recent call last):
File "website/manage.py", line 17, in <module>
execute_from_command_line(sys.argv)
File "/home/sh/django_env/local/lib/python2.7/site-packages/django/core/management/__init__.py", line 453, in execute_from_command_line
utility.execute()
File "/home/sh/django_env/local/lib/python2.7/site-packages/django/core/management/__init__.py", line 392, in execute
self.fetch_command(subcommand).run_from_argv(self.argv)
File "/home/sh/django_env/local/lib/python2.7/site-packages/django/core/management/__init__.py", line 263, in fetch_command
app_name = get_commands()[subcommand]
File "/home/sh/django_env/local/lib/python2.7/site-packages/django/core/management/__init__.py", line 109, in get_commands
apps = settings.INSTALLED_APPS
File "/home/sh/django_env/local/lib/python2.7/site-packages/django/conf/__init__.py", line 52, in __getattr__
self._setup(name)
File "/home/sh/django_env/local/lib/python2.7/site-packages/django/conf/__init__.py", line 47, in _setup
self._wrapped = Settings(settings_module)
File "/home/sh/django_env/local/lib/python2.7/site-packages/django/conf/__init__.py", line 132, in __init__
raise ImportError("Could not import settings '%s' (Is it on sys.path?): %s" % (self.SETTINGS_MODULE, e))
ImportError: Could not import settings 'website.settings' (Is it on sys.path?): No module named website.settings
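A likely cause, judging from the traceback: running python website/manage.py puts the website/ directory on sys.path rather than the repository root, so the website.settings module can't be imported. Running from the repo root with the root itself on PYTHONPATH may fix it (an assumption, not verified on Windows):

$ cd ~/Repos/newsdiffs
$ PYTHONPATH=. python website/manage.py syncdb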
I’ve followed the guide (to my knowledge), but I get this error:
$ python website/manage.py scraper
DatabaseError: table Articles has no column named git_dir
Here is what syncdb returns:
$ python website/manage.py syncdb
Syncing...
Creating tables ...
Installing custom SQL ...
Installing indexes ...
Installed 0 object(s) from 0 fixture(s)
Synced:
> django.contrib.contenttypes
> django.contrib.sessions
> django.contrib.sites
> south
Not synced (use migrations):
- frontend
$ python website/manage.py migrate frontend
Running migrations for frontend:
- Nothing to migrate.
- Loading initial data for frontend.
Installed 0 object(s) from 0 fixture(s)
I've made sure $PYTHONPATH and $DJANGO_SETTINGS_MODULE are set correctly; still the same error.
Any ideas?
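One observation that may help: frontend is managed by South, so a new column like git_dir only arrives via a migration, and "Nothing to migrate" while the column is missing suggests the recorded migration history is out of sync with the actual schema. A blunt workaround (an assumption, and it discards existing data) is to delete the database and rebuild it:

$ python website/manage.py syncdb
$ python website/manage.py migrate frontend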
print_link = soup.findAll('a', href=re.compile('http://dyn.politico.com/printstory.cfm.*'))[0].get('href')
IndexError: list index out of range
Indeed, Politico's pages don't have this kind of print_link anymore.
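A defensive rewrite of that line (a sketch; the parser would still need a fallback for fetching the article without its print version):

# Don't assume the print link still exists on the page.
links = soup.findAll('a', href=re.compile('http://dyn.politico.com/printstory.cfm.*'))
print_link = links[0].get('href') if links else None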
I had to comment out the continue in the RJECTING case inside get_articles(), because somehow my URLs were arriving in get_articles() with a leading and trailing space. I'm using a custom parser, but the site otherwise works fine for me; this custom parser works fine, and nytimes works fine, with or without my change.
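A less invasive fix (a sketch) would be to strip the whitespace inside get_articles() before the rejection check, rather than disabling the check entirely:

url = url.strip()  # drop stray leading/trailing spaces before matching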