news-please

news-please is an open source, easy-to-use news crawler that extracts structured information from almost any news website. It can recursively follow internal hyperlinks and read RSS feeds to fetch both recent and older, archived articles. You only need to provide the root URL of the news website. news-please combines the power of multiple state-of-the-art libraries and tools, such as scrapy, Newspaper, and readability. It also features a library mode, which allows developers to use the crawling and extraction functionality within their own programs.

Extracted information

  • headline
  • lead paragraph
  • main content (textual)
  • main image
  • author's name
  • publication date
  • language

Features

  • works out of the box: install with pip, add URLs of your pages, run :-)
  • execute it conveniently with the CLI or use it as a library within your own software
  • runs on your favorite Python version (2.7+ and 3+)

CLI mode

  • stores extracted results in JSON files or Elasticsearch (other storages can be added easily)
  • simple but extensive configuration (if you want to tweak the results)
  • revisions: crawl articles multiple times and track changes

Library mode

  • crawl and extract information for a list of article URLs.

Getting started

It's super easy, we promise!

Installation

$ pip install news-please

Use within your own code (as a library)

Library mode does not support continuous crawling or full website extraction. If you need either of these, use the CLI mode instead.

from newsplease import NewsPlease
article = NewsPlease.from_url('https://www.nytimes.com/2017/02/23/us/politics/cpac-stephen-bannon-reince-priebus.html?hp')
print(article.title)

A sample of an extracted article can be found here (as a JSON file).

or if you want to crawl multiple articles at a time

NewsPlease.from_urls([url1, url2, ...])

or if you have a file containing all URLs (each line containing a single URL)

NewsPlease.from_file(path)

or if you have a record from a WARC file

NewsPlease.from_warc(warc_record)

In library mode, news-please will attempt to download and extract information from each URL. The functions described above are blocking, i.e., they return only once all URLs have been attempted. The resulting list contains all articles that have been extracted successfully.
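Because these calls block until every URL has been attempted, a long URL list yields no results until the very end. One possible workaround, shown here only as a minimal sketch, is to split the list into smaller batches and call the blocking function once per batch. The helper names batched and crawl_in_batches are hypothetical and not part of news-please; the fetch argument stands for any blocking function that accepts a list of URLs.

```python
from typing import Any, Callable, Iterator, List, Sequence


def batched(urls: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `urls` with at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(urls), batch_size):
        yield list(urls[start:start + batch_size])


def crawl_in_batches(
    urls: Sequence[str],
    fetch: Callable[[List[str]], Any],
    batch_size: int = 10,
) -> Iterator[Any]:
    """Call the blocking `fetch` once per batch and yield each result.

    `fetch` is passed in by the caller, so this sketch makes no assumption
    about the type of the value it returns.
    """
    for batch in batched(urls, batch_size):
        yield fetch(batch)
```

With news-please installed, NewsPlease.from_urls could be passed as fetch, so that each batch's result becomes available as soon as that batch has finished.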

Run the crawler (via the CLI)

$ news-please

news-please will then start crawling a few example pages. To terminate the process, simply press CTRL+C. news-please will then shut down within 5-60 seconds. You can also press CTRL+C twice, which will immediately kill the process (not recommended, though).

The results are stored by default in JSON files in the data folder. In the default configuration, news-please also stores the original HTML files.
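To inspect the stored results afterwards, the JSON files can be read back with the standard library. The following is a minimal sketch under two assumptions: the path of the data folder is supplied by you, and each file holds one JSON object with a title field (the field name mirrors article.title from the library example and may differ in your version). The function names are hypothetical.

```python
import json
from pathlib import Path
from typing import Dict, Iterator, List, Optional


def iter_articles(data_dir: str) -> Iterator[Dict]:
    """Yield the parsed content of every .json file below `data_dir`."""
    for path in sorted(Path(data_dir).rglob("*.json")):
        try:
            with path.open(encoding="utf-8") as handle:
                record = json.load(handle)
        except (OSError, json.JSONDecodeError):
            continue  # skip unreadable or partially written files
        if isinstance(record, dict):
            yield record


def list_titles(data_dir: str) -> List[Optional[str]]:
    """Collect the `title` field (assumed name) of each stored article."""
    return [record.get("title") for record in iter_articles(data_dir)]
```

Other files in the folder, such as the stored HTML, are ignored because only names ending in .json are matched.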

Crawl other pages

Of course, you want to crawl other websites. Simply go into the sitelist.hjson file and add the root URLs of the news outlets' webpages of your choice.
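If you maintain many root URLs, the entries can also be generated rather than typed by hand. The sketch below builds a plain JSON document, which is also valid Hjson. The key names base_urls and url are assumptions about the layout of sitelist.hjson; compare them with the file shipped in your config directory before using the output. The function name and the example URLs are hypothetical.

```python
import json
from typing import Iterable


def build_sitelist(root_urls: Iterable[str]) -> str:
    """Return a JSON document (also valid Hjson) listing each root URL once."""
    seen = set()
    entries = []
    for url in root_urls:
        url = url.strip()
        if url and url not in seen:
            seen.add(url)
            # "url" is the assumed per-site key; other per-site options are omitted.
            entries.append({"url": url})
    # "base_urls" is the assumed top-level key.
    return json.dumps({"base_urls": entries}, indent=2)


print(build_sitelist(["https://www.example.com/", "https://news.example.org/"]))
```

Duplicate and empty entries are dropped while the original order is preserved, so the output can be regenerated from a growing URL list without manual cleanup.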

Elasticsearch

news-please also supports export to Elasticsearch. Using Elasticsearch also enables the versioning feature. First, enable it in config.cfg in the config directory, which is ~/news-please/config by default but can be changed to a custom location with the -c parameter. If the directory does not exist, a default configuration directory is created at the specified location.

[Scrapy]

ITEM_PIPELINES = {
                   'newsplease.pipeline.pipelines.ArticleMasterExtractor':100,
                   'newsplease.pipeline.pipelines.ElasticSearchStorage':350
                 }

That's it, unless your Elasticsearch database is not located at http://localhost:9200 or uses a different username/password or CA-certificate authentication. In these cases, you will also need to change the following.

[Elasticsearch]

host = localhost
port = 9200

...

# Credentials used for authentication (supports CA-certificates):

use_ca_certificates = False           # If True, authentication is performed
ca_cert_path = '/path/to/cacert.pem'
client_cert_path = '/path/to/client_cert.pem'
client_key_path = '/path/to/client_key.pem'
username = 'root'
secret = 'password'
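To double-check the values you entered, the section can be read with Python's standard configparser. This is only an illustrative sketch and not how news-please itself loads its configuration. It assumes the section is named Elasticsearch as above, that values may or may not be wrapped in quotes, and that inline comments have been removed. The function name and the SAMPLE text are hypothetical.

```python
import configparser
from typing import Any, Dict

# Reduced sample of the section shown above, without inline comments.
SAMPLE = """
[Elasticsearch]
host = localhost
port = 9200
use_ca_certificates = False
username = 'root'
"""


def _unquote(value: str) -> str:
    """Strip surrounding whitespace and single or double quotes."""
    return value.strip().strip("'\"")


def read_elasticsearch_settings(config_text: str) -> Dict[str, Any]:
    """Parse the [Elasticsearch] section into plain Python values."""
    # interpolation=None keeps '%' characters (e.g. in passwords) literal.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    section = parser["Elasticsearch"]
    return {
        "host": _unquote(section.get("host", "localhost")),
        "port": int(_unquote(section.get("port", "9200"))),
        "use_ca_certificates": _unquote(
            section.get("use_ca_certificates", "False")
        ).lower() == "true",
        "username": _unquote(section.get("username", "")),
    }


print(read_elasticsearch_settings(SAMPLE))
```

Printing the parsed values makes stray quotes or typos in the section easy to spot before starting a crawl.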

What's next?

We have collected a bunch of useful information for both users and developers. As a user, you will most likely only deal with two files: sitelist.hjson (to define sites to be crawled) and config.cfg (probably only rarely, in case you want to tweak the configuration).

Wiki and documentation

You can find more information on usage and development in our wiki!

Acknowledgements

This project would not have been possible without the contributions of the following students (ordered alphabetically):

  • Moritz Bock
  • Michael Fried
  • Jonathan Hassler
  • Markus Klatt
  • Kevin Kress
  • Sören Lachnit
  • Marvin Pafla
  • Franziska Schlor
  • Matt Sharinghousen
  • Claudio Spener
  • Moritz Steinmaier

How to cite

If you are using news-please, please cite our paper (ResearchGate):

@InProceedings{Hamborg2017,
  author     = {{H}amborg, {F}elix and {M}euschke, {N}orman and {B}reitinger, {C}orinna and {G}ipp, {B}ela},
  title      = {{news-please}: {A} {G}eneric {N}ews {C}rawler and {E}xtractor},
  year       = {2017},
  booktitle  = {{P}roceedings of the 15th {I}nternational {S}ymposium of {I}nformation {S}cience},
  location   = {Berlin},
  editor     = {Gaede, Maria and Trkulja, Violeta and Petra, Vivien},
  pages      = {218--223},
  month      = {March}
}

You can find more information on this and other news projects on our website.

Contribution

You want to contribute? Great, we are always happy for any support on this project! Simply send a pull request or drop us an email: [email protected]. By contributing to this project, you agree that your contributions will be licensed under the project's license (see below).

License

The project is licensed under the Apache License 2.0. Make sure that you use news-please in compliance with applicable law. The news-please logo is courtesy of Mario Hamborg.

Copyright 2016 The news-please team
