
newsdiffs's Issues

Articles SQLite table

I'm new to Python and may be missing something, but I have followed the installation instructions and can't run the crawler; I just get the following error:

DatabaseError: no such table: Articles
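
For what it's worth, the Articles table is created by the Django setup commands from the installation instructions; a sketch of the usual sequence, assuming South is installed (it appears in other issues below):

$ python website/manage.py syncdb
$ python website/manage.py migrate frontend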

Update for "2020"

Here's an item I noticed for the NYT scraper (and probably the CNN one too, although I don't use it): the feeder_pat line needs to be updated for 2020.

As of now it reads:
feeder_pat = '^https?://www.nytimes.com/201'

To catch articles published after Jan 1, 2020, it needs to be updated to:
feeder_pat = '^https?://www.nytimes.com/202'
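
A more future-proof alternative (an assumption on my part, not something in the repository) is to match any four-digit year instead of a hard-coded decade:

feeder_pat = r'^https?://www\.nytimes\.com/\d{4}/'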

I've updated this on my fork, but I have at least one other update to the parser that others may or may not want, so I was hesitant to submit a pull request. I wanted to document it here, though, in case others were wondering why the system isn't catching new articles after Jan 1.

Thanks much,
Vishnu

Clean Ubuntu install gets errors...

2013-06-08 22:05:50.307:ERROR:CalledProcessError when updating http://edition.cnn.com/2013/06/06/travel/worlds-best-beach-bars/index.html
2013-06-08 22:05:50.313:ERROR:"fatal: Invalid object name 'HEAD'.\n"
2013-06-08 22:05:50.315:ERROR:Traceback (most recent call last):
File "/home/jdeer/website/frontend/management/commands/scraper.py", line 355, in update_versions
update_article(article)
File "/home/jdeer/website/frontend/management/commands/scraper.py", line 301, in update_article
article)
File "/home/jdeer/website/frontend/management/commands/scraper.py", line 208, in add_to_git_repo
previous = run_git_command(['show', 'HEAD:'+filename])
File "/home/jdeer/website/frontend/management/commands/scraper.py", line 145, in run_git_command
stderr=subprocess.STDOUT)
File "/usr/lib/python2.7/subprocess.py", line 575, in check_output
raise CalledProcessError(retcode, cmd, output=output)
CalledProcessError: Command '['git', 'show', u'HEAD:edition.cnn.com/2013/06/06/travel/worlds-best-beach-bars/index.html']' returned non-zero exit status 128
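
The "fatal: Invalid object name 'HEAD'" message is git reporting that the articles repository has no commits yet, so git show HEAD:<file> has nothing to resolve. A minimal sketch of a fix, assuming the scraper's git repository lives in an articles/ directory (the path is an assumption):

$ cd articles
$ git init
$ git commit --allow-empty -m "initial commit"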

TypeError: sequence item 0: expected string or Unicode, NoneType found

Poking around to build my own parser, I followed the tips in the README, and I keep getting TypeErrors when I attempt to run the BBC parser on a story, and ditto for CNN. (Others seem fine, although Tagesschau returns no story URLs from test_parser.py tagesschau.TagesschauParser.)

For BBC, which is the one used in the README, I tried it with the URL from the README and with a fresh URL fetched by test_parser.py bbc.BBCParser; same error either way:

ryantate@ryantate:~/dist/python/newsdiffs$ python parsers/test_parser.py bbc.BBCParser http://www.bbc.co.uk/news/uk-21649494
Traceback (most recent call last):
  File "parsers/test_parser.py", line 29, in <module>
    print unicode(parsed_article)
  File "/home/ryantate/dist/python/newsdiffs/parsers/baseparser.py", line 138, in __unicode__
    self.body,)))
TypeError: sequence item 0: expected string or Unicode, NoneType found
ryantate@ryantate:~/dist/python/newsdiffs$ python parsers/test_parser.py bbc.BBCParser http://www.bbc.co.uk/news/technology-34044506
Traceback (most recent call last):
  File "parsers/test_parser.py", line 29, in <module>
    print unicode(parsed_article)
  File "/home/ryantate/dist/python/newsdiffs/parsers/baseparser.py", line 138, in __unicode__
    self.body,)))
TypeError: sequence item 0: expected string or Unicode, NoneType found
ryantate@ryantate:~/dist/python/newsdiffs$ 

CNN:

ryantate@ryantate:~/dist/python/newsdiffs$ python parsers/test_parser.py cnn.CNNParser http://edition.cnn.com/2015/08/24/sport/vincenzo-nibali-tour-of-spain/index.html
Traceback (most recent call last):
  File "parsers/test_parser.py", line 29, in <module>
    print unicode(parsed_article)
  File "/home/ryantate/dist/python/newsdiffs/parsers/baseparser.py", line 138, in __unicode__
    self.body,)))
TypeError: sequence item 0: expected string or Unicode, NoneType found
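
This TypeError means the parser returned None for at least one article field, usually because the site's markup changed and the extraction code no longer matches anything. A sketch of a more informative failure mode for baseparser.py's __unicode__, assuming the joined fields are date, title, byline, and body (only self.body is visible in the traceback, so the full list is an assumption):

def __unicode__(self):
    # Name the fields the parser failed to extract instead of letting
    # the join raise a bare TypeError on a None value.
    fields = [('date', self.date), ('title', self.title),
              ('byline', self.byline), ('body', self.body)]
    missing = [name for name, value in fields if value is None]
    if missing:
        raise ValueError('parser returned None for: %s' % ', '.join(missing))
    return u'\n'.join(value for name, value in fields)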

Update for NYT / User Agent change

In recent weeks, the NYT scraper has stopped working; it appears that they are now blocking requests that use the default "python-requests" user agent.

This can be changed by editing ~/parsers/baseparser.py and adding code as follows to the grab_url function. It is set up here to rotate randomly among 10 different user agents; the section between opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj)) and retry = False is what has been added. Of course, you can add whatever user agents you want here.

# Imports this snippet relies on (baseparser.py already uses most of these;
# random may need to be added):
import cookielib
import random
import socket
import time
import urllib2

def grab_url(url, max_depth=5, opener=None):
    if opener is None:
        cj = cookielib.CookieJar()
        opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
        user_agent_list = [
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36 OPR/72.0.3815.400',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:83.0) Gecko/20100101 Firefox/83.0',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9',
            'Googlebot/2.1 (+http://www.google.com/bot.html)',
            'UCWEB/2.0 (compatible; Googlebot/2.1; +google.com/bot.html)',
            'facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)',
            'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko; Google Page Speed Insights) Chrome/41.0.2272.118 Safari/537.36',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:77.0) Gecko/20100101 Firefox/77.0',
            'Mozilla/5.0 (X11; Linux x86_64)',
        ]
        # Pick one user agent at random for this opener.
        user_agent = random.choice(user_agent_list)
        opener.addheaders = [('User-Agent', user_agent)]
    retry = False
    try:
        text = opener.open(url, timeout=5).read()
        if '<title>NY Times Advertisement</title>' in text:
            retry = True
    except socket.timeout:
        retry = True
    if retry:
        if max_depth == 0:
            raise Exception('Too many attempts to download %s' % url)
        time.sleep(0.5)
        return grab_url(url, max_depth-1, opener)
    return text

I've updated this on my fork, but I have several other updates that are somewhat specific and that others may or may not want, so I was hesitant to submit a pull request. I wanted to document it here, though, in case others were wondering why the system isn't catching new NYT articles.

Thanks,
Vishnu

Roadmap...?

What's the actual roadmap for newsdiffs?
Are we supposed to submit parsers, so that you host them all and act as a "universal" newsdiff, or are we supposed to fork and self-host, particularly for non-English sources?
I think a sort of official roadmap should be added to (or linked from) the README.

Heroku?

Has anyone successfully deployed Newsdiffs on Heroku?

NYT scraper no longer works

I noticed that on or about May 8, 2018, the NYT scraper stopped working. I'm not so great at digesting HTML and Python, but it looks like the way NYT articles are encoded and how the different fields are tagged has changed. I will try to play with this and see if I can figure it out, but if anyone has expertise here, any assistance would be much appreciated.

Breaks on Django 1.5

Perhaps the README should specify 1.4.

Or, better yet, there could be a requirements.txt file for pip install -r requirements.txt.
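
A minimal requirements.txt sketch; the entries are assumptions pieced together from this tracker (Django 1.4, South migrations, BeautifulSoup-based parsers), not an authoritative list:

Django>=1.4,<1.5
South
BeautifulSoup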

License?

I just found your program via netzpolitik.org, and I wondered which license you use.

Is your code available under a free license, like GPL, BSD or MIT?

As Python code, that would make it a great base to build upon!

Add Canadian news

Canadian news sites tend to trend easily because of Canada's digital culture; it would be helpful to see revisions of Canadian news.

The CBC, Tyee, National Post, Globe and Mail, Toronto Star and Toronto Sun would be a good starting collection.

rate-limiting the scraper

I received an email from a local newspaper last night. They noticed the scraper (it only took them five months!), and they've asked me to rate-limit the process. I've put a time.sleep(0.3) at the end of baseparser.py's grab_url(), which should keep them happy.
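
For reference, a sketch of where that sleep lands; the fetch step below is a placeholder standing in for grab_url()'s real download logic:

import time

def grab_url(url, max_depth=5, opener=None):
    text = fetch(url, opener)  # placeholder for the existing download code
    time.sleep(0.3)            # pause so successive requests are spaced out
    return text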

I'm opening this issue so people know about the problem.

Track WikiLeaks changes

WikiLeaks deletes its tweets and makes small but significant changes to the text on its website not infrequently. These differences have news value [1, 2].

WikiLeaks (rightfully, in my opinion) presents itself as a highly ethical news organization building an objective historical record of what humanity does. In that light, Julian Assange has stated [3] his concern about news organizations changing the material on their websites without alerting readers, and he has also asked [4] US news outlets to make corrections about him more prominent on their websites.

The same attention should be afforded to WikiLeaks itself. Especially as WikiLeaks is a new media organization with global impact, tracking its changes would fit right into NewsDiffs's mission.

[1] https://twitter.com/DouglasLucas/status/366712539612061697

[2] http://livewire.talkingpointsmemo.com/entry/wikileaks-spokesman-insists-snowden-statement-is-genuineh

[3] http://www.e-flux.com/journal/in-conversation-with-julian-assange-part-i/

[4] http://www.imediaethics.org/News/3130/_assange_not_charged__mcclatchy_corrects_record_in_4_newspapers__editors_not_informed.php

Change log pleading ignorance, but it knows.

I'm getting "Alas! We don't seem to know anything about this article. Sorry! :(" on pages for an article that definitely has recorded changes.

I'm not sure what information would be helpful here, but I have a local instance running, and at http://127.0.0.1:8000/article-history/www.sfgate.com/bayarea/article/As-Napa-cheers-quake-recovery-some-question-6462726.php I'm seeing this:

Article Change Log
http://www.sfgate.com/bayarea/article/As-Napa-cheers-quake-recovery-some-question-6462726.php

Alas! We don't seem to know anything about this article. Sorry! :(

But, if I come in through browse: http://127.0.0.1:8000/diff/150/162/www.sfgate.com/bayarea/article/As-Napa-cheers-quake-recovery-some-question-6462726.php I definitely see changes that I can compare.

I see the same thing happening to a URL that is tracked in my instance and on newsdiffs.org: http://127.0.0.1:8000/article-history/www.washingtonpost.com/local/montgomery-county-housing-application-moves-online/2015/08/20/05d93a2a-468f-11e5-846d-02792f854297_story.html gives the "Alas!" error, even though if I come in through browse I see the (relatively minor) changes. I'd love any help figuring out how I can troubleshoot this.
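
One way to narrow this down is to compare the URL stored in the database against the one the article-history view receives; a sketch using the Django shell, assuming the model is frontend.models.Article (the model path is an assumption):

$ python website/manage.py shell
>>> from frontend.models import Article
>>> for a in Article.objects.filter(url__icontains='As-Napa-cheers-quake-recovery'):
...     print repr(a.url)  # stray whitespace or a scheme mismatch shows up here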

ImportError: Could not import settings 'website.settings' (Is it on sys.path?): No module named website.settings

For testing purposes, I just cloned the repo and changed all the necessary database-related settings. After that, I get an import error:

(django_env)sh@sh--work:~/Repos/newsdiffs$ python website/manage.py syncdb
Traceback (most recent call last):
  File "website/manage.py", line 17, in <module>
    execute_from_command_line(sys.argv)
  File "/home/sh/django_env/local/lib/python2.7/site-packages/django/core/management/__init__.py", line 453, in execute_from_command_line
    utility.execute()
  File "/home/sh/django_env/local/lib/python2.7/site-packages/django/core/management/__init__.py", line 392, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/home/sh/django_env/local/lib/python2.7/site-packages/django/core/management/__init__.py", line 263, in fetch_command
    app_name = get_commands()[subcommand]
  File "/home/sh/django_env/local/lib/python2.7/site-packages/django/core/management/__init__.py", line 109, in get_commands
    apps = settings.INSTALLED_APPS
  File "/home/sh/django_env/local/lib/python2.7/site-packages/django/conf/__init__.py", line 52, in __getattr__
    self._setup(name)
  File "/home/sh/django_env/local/lib/python2.7/site-packages/django/conf/__init__.py", line 47, in _setup
    self._wrapped = Settings(settings_module)
  File "/home/sh/django_env/local/lib/python2.7/site-packages/django/conf/__init__.py", line 132, in __init__
    raise ImportError("Could not import settings '%s' (Is it on sys.path?): %s" % (self.SETTINGS_MODULE, e))
ImportError: Could not import settings 'website.settings' (Is it on sys.path?): No module named website.settings
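
This usually means the repository root is not on sys.path, so Python cannot find the website package. A sketch of one way around it, run from the repository root:

$ cd ~/Repos/newsdiffs
$ export PYTHONPATH=$PWD
$ export DJANGO_SETTINGS_MODULE=website.settings
$ python website/manage.py syncdb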

DatabaseError: table Articles has no column named git_dir

  • Fix DatabaseError
  • Fix scraper stalling

I’ve followed the guide (to my knowledge), but I get this error:

$ python website/manage.py scraper
DatabaseError: table Articles has no column named git_dir

Here is what syncdb returns:

$ python website/manage.py syncdb
Syncing...
Creating tables ...
Installing custom SQL ...
Installing indexes ...
Installed 0 object(s) from 0 fixture(s)

Synced:
 > django.contrib.contenttypes
 > django.contrib.sessions
 > django.contrib.sites
 > south

Not synced (use migrations):
 - frontend
$ python website/manage.py migrate frontend
Running migrations for frontend:
- Nothing to migrate.
 - Loading initial data for frontend.
Installed 0 object(s) from 0 fixture(s)

I’ve made sure $PYTHONPATH and $DJANGO_SETTINGS_MODULE are set correctly; same error.

Any ideas?
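
One possibility (an assumption, not a confirmed diagnosis): the Articles table was created by an earlier revision of the code, and South already considers the frontend migrations applied, so nothing runs and the git_dir column never gets added. Either delete the database and redo syncdb plus migrate, or add the column by hand; a sketch of the latter, assuming a SQLite database named newsdiffs.db and a varchar column (both assumptions; check what frontend/models.py actually expects):

$ sqlite3 newsdiffs.db "ALTER TABLE Articles ADD COLUMN git_dir varchar(255) NOT NULL DEFAULT 'articles'"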

politico scraper fails on "print.cfm"

    print_link = soup.findAll('a', href=re.compile('http://dyn.politico.com/printstory.cfm.*'))[0].get('href')
IndexError: list index out of range

Indeed, Politico's pages don't have this kind of print link anymore.
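
Since the print link is gone, the parser could at least fail with a clearer message; a sketch of a guard around the lookup (names taken from the traceback above):

links = soup.findAll('a', href=re.compile('http://dyn.politico.com/printstory.cfm.*'))
if not links:
    raise Exception('Politico print-story link not found; the page layout may have changed')
print_link = links[0].get('href')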

frontend/views.py REJECTING valid urls

I had to comment out the continue in the REJECTING case inside get_articles() because somehow my URLs were arriving inside get_articles() with a leading and trailing space. I'm using a custom parser, but otherwise the site works fine for me. This custom parser works fine, and nytimes works fine, with or without my change.
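
A less invasive workaround than removing the continue is to normalize the URL before the check; a one-line sketch (the variable name inside get_articles() is an assumption):

url = url.strip()  # drop the stray leading/trailing whitespace before the REJECTING test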
