codelucas / newspaper

newspaper3k is a news, full-text, and article metadata extraction library for Python 3. Advanced docs:

Home Page: https://goo.gl/VX41yK

License: MIT License

Topics: python, news, crawler, crawling, scraper, news-aggregator

newspaper's Introduction

Newspaper3k: Article scraping & curation


Inspired by requests for its simplicity and powered by lxml for its speed:

"Newspaper is an amazing python library for extracting & curating articles." -- tweeted by Kenneth Reitz, Author of requests

"Newspaper delivers Instapaper style article extraction." -- The Changelog

Newspaper is a Python 3 library! Or, view our deprecated and buggy Python 2 branch

A Glance:

>>> from newspaper import Article

>>> url = 'http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/'
>>> article = Article(url)
>>> article.download()

>>> article.html
'<!DOCTYPE HTML><html itemscope itemtype="http://...'
>>> article.parse()

>>> article.authors
['Leigh Ann Caldwell', 'John Honway']

>>> article.publish_date
datetime.datetime(2013, 12, 30, 0, 0)

>>> article.text
'Washington (CNN) -- Not everyone subscribes to a New Year's resolution...'

>>> article.top_image
'http://someCDN.com/blah/blah/blah/file.png'

>>> article.movies
['http://youtube.com/path/to/link.com', ...]
>>> article.nlp()

>>> article.keywords
['New Years', 'resolution', ...]

>>> article.summary
'The study shows that 93% of people ...'
>>> import newspaper

>>> cnn_paper = newspaper.build('http://cnn.com')

>>> for article in cnn_paper.articles:
>>>     print(article.url)
http://www.cnn.com/2013/11/27/justice/tucson-arizona-captive-girls/
http://www.cnn.com/2013/12/11/us/texas-teen-dwi-wreck/index.html
...

>>> for category in cnn_paper.category_urls():
>>>     print(category)

http://lifestyle.cnn.com
http://cnn.com/world
http://tech.cnn.com
...

>>> cnn_article = cnn_paper.articles[0]
>>> cnn_article.download()
>>> cnn_article.parse()
>>> cnn_article.nlp()
...
>>> import requests
>>> from newspaper import fulltext

>>> html = requests.get(...).text
>>> text = fulltext(html)

Newspaper can extract and detect languages seamlessly. If no language is specified, Newspaper will attempt to auto detect a language.

>>> from newspaper import Article
>>> url = 'http://www.bbc.co.uk/zhongwen/simp/chinese_news/2012/12/121210_hongkong_politics.shtml'

>>> a = Article(url, language='zh') # Chinese

>>> a.download()
>>> a.parse()

>>> print(a.text[:150])
香港行政长官梁振英在各方压力下就其大宅的违章建
筑(僭建)问题到立法会接受质询,并向香港民众道歉。
梁振英在星期二(12月10日)的答问大会开始之际
在其演说中道歉,但强调他在违章建筑问题上没有隐瞒的
意图和动机。 一些亲北京阵营议员欢迎梁振英道歉,
且认为应能获得香港民众接受,但这些议员也质问梁振英有

>>> print(a.title)
港特首梁振英就住宅违建事件道歉

If you are certain that an entire news source is in one language, go ahead and use the same API :)

>>> import newspaper
>>> sina_paper = newspaper.build('http://www.sina.com.cn/', language='zh')

>>> for category in sina_paper.category_urls():
>>>     print(category)
http://health.sina.com.cn
http://eladies.sina.com.cn
http://english.sina.com
...

>>> article = sina_paper.articles[0]
>>> article.download()
>>> article.parse()

>>> print(article.text)
新浪武汉汽车综合 随着汽车市场的日趋成熟,
传统的“集全家之力抱得爱车归”的全额购车模式已然过时,
另一种轻松的新兴 车模式――金融购车正逐步成为时下消费者购
买爱车最为时尚的消费理念,他们认为,这种新颖的购车
模式既能在短期内
...

>>> print(article.title)
两年双免0手续0利率 科鲁兹掀背金融轻松购_武汉车市_武汉汽
车网_新浪汽车_新浪网

Support our library

It takes only one click

Docs

Check out The Docs for full and detailed guides using newspaper.

Interested in adding a new language for us? Refer to: Docs - Adding new languages

Features

  • Multi-threaded article download framework (see the sketch after this list)
  • News url identification
  • Text extraction from html
  • Top image extraction from html
  • All image extraction from html
  • Keyword extraction from text
  • Summary extraction from text
  • Author extraction from text
  • Google trending terms extraction
  • Works in 10+ languages (English, Chinese, German, Arabic, ...)
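
As a sketch of the multi-threaded download framework from the list above, using the news_pool helper newspaper provides for downloading several sources concurrently. The source URLs and thread count below are illustrative assumptions, not recommendations:

>>> import newspaper
>>> from newspaper import news_pool

>>> slate_paper = newspaper.build('http://slate.com')
>>> tc_paper = newspaper.build('http://techcrunch.com')
>>> espn_paper = newspaper.build('http://espn.com')

>>> papers = [slate_paper, tc_paper, espn_paper]
>>> news_pool.set(papers, threads_per_source=2)  # 3 sources x 2 threads each
>>> news_pool.join()

>>> # Every article in each source now has its .html populated and ready to parse.
>>> slate_paper.articles[10].html
'<html> ...'

The languages that ship with newspaper can be listed at runtime: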
>>> import newspaper
>>> newspaper.languages()

Your available languages are:
input code      full name

  ar              Arabic
  be              Belarusian
  bg              Bulgarian
  da              Danish
  de              German
  el              Greek
  en              English
  es              Spanish
  et              Estonian
  fa              Persian
  fi              Finnish
  fr              French
  he              Hebrew
  hi              Hindi
  hr              Croatian
  hu              Hungarian
  id              Indonesian
  it              Italian
  ja              Japanese
  ko              Korean
  lt              Lithuanian
  mk              Macedonian
  nb              Norwegian (Bokmål)
  nl              Dutch
  no              Norwegian
  pl              Polish
  pt              Portuguese
  ro              Romanian
  ru              Russian
  sl              Slovenian
  sr              Serbian
  sv              Swedish
  sw              Swahili
  th              Thai
  tr              Turkish
  uk              Ukrainian
  vi              Vietnamese
  zh              Chinese

Get it now

Run ✅ pip3 install newspaper3k

NOT ⛔ pip3 install newspaper

On Python 3 you must install newspaper3k, not newspaper. newspaper is our Python 2 library. Although installing newspaper3k is simple with pip, you will run into fixable issues if you are trying to install on Ubuntu.

If you are on Debian / Ubuntu, install using the following:

  • Install pip3 command needed to install newspaper3k package:

    $ sudo apt-get install python3-pip
  • Python development version, needed for Python.h:

    $ sudo apt-get install python3-dev
  • lxml requirements:

    $ sudo apt-get install libxml2-dev libxslt-dev
  • For PIL to recognize .jpg images:

    $ sudo apt-get install libjpeg-dev zlib1g-dev libpng12-dev

NOTE: If you have problems installing libpng12-dev, try installing libpng-dev instead.

  • Download NLP related corpora:

    $ curl https://raw.githubusercontent.com/codelucas/newspaper/master/download_corpora.py | python3
  • Install the distribution via pip:

    $ pip3 install newspaper3k

If you are on OS X, install using the following; you may use either homebrew or macports:

$ brew install libxml2 libxslt

$ brew install libtiff libjpeg webp little-cms2

$ pip3 install newspaper3k

$ curl https://raw.githubusercontent.com/codelucas/newspaper/master/download_corpora.py | python3

Otherwise, install with the following:

NOTE: You will still most likely need to install the following libraries via your package manager

  • PIL: libjpeg-dev zlib1g-dev libpng12-dev
  • lxml: libxml2-dev libxslt-dev
  • Python development headers: python3-dev
$ pip3 install newspaper3k

$ curl https://raw.githubusercontent.com/codelucas/newspaper/master/download_corpora.py | python3

Donations

Your donations are greatly appreciated! They free me up to work on this project more: adding new features, fixing bugs, and addressing concerns with the library.

Development

If you'd like to contribute and hack on the newspaper project, feel free to clone a development version of this repository locally:

git clone git://github.com/codelucas/newspaper.git

Once you have a copy of the source, you can embed it in your Python package, or install it into your site-packages easily:

$ pip3 install -r requirements.txt
$ python3 setup.py install

Feel free to give our test suite a shot; everything is mocked!

$ python3 tests/unit_tests.py

Planning on tweaking our full-text algorithm? Add the fulltext parameter:

$ python3 tests/unit_tests.py fulltext

Demo

View a working online demo here: http://newspaper-demo.herokuapp.com

This is another working online demo: http://newspaper.chinazt.cc/

LICENSE

Authored and maintained by Lucas Ou-Yang.

Parse.ly sponsored some work on newspaper, specifically focused on automatic extraction.

Newspaper uses a lot of python-goose's parsing code. View their license here.

Please feel free to email & contact me if you run into issues or just would like to talk about the future of this library and news extraction in general!

newspaper's People

Contributors

0x0ece, 0xteo, alon7, amorgun, andrecandersen, cantino, codelucas, fengkaijia, igor-shevchenko, jacquerie, jjanczyszyn, jmaroeder, karls, krzd, logandhead, louyangsc, megatron-me-uk, mehuleo, muratcorlu, owaaa, parhammmm, paul-english, pgrzesik, t-io, techaddict, torbenbrodt, tseste, walkoss, whereswardy, yprez


newspaper's Issues

AttributeError: 'module' object has no attribute 'build'

import newspaper
cnn_paper = newspaper.build('http://cnn.com', memoize_articles=False)

for article in cnn_paper.articles:
    print article.url

The code above works sometimes and other times it doesn't. I'm working in a virtualenv; however, all the required libraries are also installed system-wide.

Traceback (most recent call last):
  File "/Users/Shapath/Developer/Python/Newspaper/Newspaper/Parser/newspaperparser.py", line 1, in <module>
    import newspaper
  File "/Users/Shapath/Developer/Python/Newspaper/lib/python2.7/site-packages/newspaper/__init__.py", line 10, in <module>
    from .article import Article, ArticleException
  File "/Users/Shapath/Developer/Python/Newspaper/lib/python2.7/site-packages/newspaper/article.py", line 16, in <module>
    from . import images
  File "/Users/Shapath/Developer/Python/Newspaper/lib/python2.7/site-packages/newspaper/images.py", line 20, in <module>
    from . import urls
  File "/Users/Shapath/Developer/Python/Newspaper/lib/python2.7/site-packages/newspaper/urls.py", line 17, in <module>
    from .packages.tldextract import tldextract
  File "/Users/Shapath/Developer/Python/Newspaper/lib/python2.7/site-packages/newspaper/packages/tldextract/__init__.py", line 1, in <module>
    from .tldextract import extract, TLDExtract
  File "/Users/Shapath/Developer/Python/Newspaper/lib/python2.7/site-packages/newspaper/packages/tldextract/tldextract.py", line 37, in <module>
    import pkg_resources
  File "/Users/Shapath/Developer/Python/Newspaper/lib/python2.7/site-packages/pkg_resources.py", line 76, in <module>
    import parser
  File "/Users/Shapath/Developer/Python/Newspaper/Newspaper/Parser/parser.py", line 4, in <module>
    cnn_paper = newspaper.build('http://cnn.com')

Doesn't work with Arabic news sites

I tried newspaper 0.0.6 with a bunch of Arabic websites and it didn't seem to fetch any articles.

In [38]: newspaper.version.version_info
Out[38]: (0, 0, 6)

In [39]: alarabiya = newspaper.build('http://www.alarabiya.net/', language='ar')

In [40]: tahrirnews = newspaper.build('http://tahrirnews.com/', language='ar')

In [41]: ahram = newspaper.build('http://www.ahram.org.eg/', language='ar')

In [42]: almasryalyoum = newspaper.build('http://www.almasryalyoum.com/', language='ar')

In [43]: for src in (alarabiya, tahrirnews, ahram, almasryalyoum):
   ....:     print(src.size())
   ....:     
0
0
0
0

Cache folder settings should be optional/configurable.

Newspaper currently creates its cache-folder under ~/.newspaper_scraper. This should be a configurable option and should be able to be disabled altogether for those not using the 'memoized' functionality of newspaper.
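
Until the cache directory is configurable, one possible workaround (a sketch that only avoids the memoization behaviour, not the directory itself) is to disable article caching per build:

>>> import newspaper

>>> # memoize_articles=False skips the "already seen" article cache for this build;
>>> # it does not change where ~/.newspaper_scraper is created.
>>> cnn_paper = newspaper.build('http://cnn.com', memoize_articles=False)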

article.movies missing 'http:'

I've noticed results from Article.movies are missing the protocol prefix.

>>> import newspaper
>>> url = 'http://www.rockpapershotgun.com/2014/07/24/top-down-tracy-third-eye-crime/'
>>> a = newspaper.Article(url)
>>> a.download()
>>> a.parse()
>>> a.movies
['//www.youtube.com/embed/LgNLRT6QyQE', '//www.youtube.com/embed/jsqVLa1yy1M']
$ pip show newspaper

---
Name: newspaper
Version: 0.0.7
Location: /home/adamgriffiths/.anaconda/envs/collected-redux/lib/python2.7/site-packages
Requires: lxml, requests, nltk, Pillow, cssselect, BeautifulSoup
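
A possible workaround until this is fixed, sketched below on the assumption that scheme-relative URLs should simply inherit the article URL's scheme:

>>> from urllib.parse import urljoin

>>> # Resolve protocol-relative movie URLs against the article's own URL.
>>> [urljoin(a.url, m) for m in a.movies]
['http://www.youtube.com/embed/LgNLRT6QyQE', 'http://www.youtube.com/embed/jsqVLa1yy1M']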

Huge internationalization / API revamp underway!

So far the newspaper library does a decent job with basic API calls, but for a lot of the foreign-language features and configuration details it is still a bit clunky. But do not worry, in the next 48 hours a HUGE revamp will be done on this library.

I'll make it very seamless to change languages, auto detect languages.

I will fix Chinese and Arabic extraction (right now they are broken because I was incorrectly using the requests library: response.content vs response.text for foreign articles).

I will also add a few more languages to the suite.

Multithread & gevent framework built into newspaper

I will add this feature tonight or tomorrow. Opening an issue for it because it is so important. Multithreading has always existed in newspaper but there hasn't been a public API for it.

Downloading multiple articles concurrently is super useful and newspaper has an effective setup to do so.

Parsing Raw HTML

Is it possible to send raw html directly to the Article.parse() function without it being downloaded by Article.download()?
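
One way to do this, sketched under the assumption that Article.download() accepts an input_html argument (it does in recent newspaper3k releases; older versions may differ). The URL and HTML below are placeholders:

>>> from newspaper import Article

>>> raw_html = '<html><body><p>HTML fetched elsewhere...</p></body></html>'
>>> article = Article(url='http://example.com/some-article')
>>> article.download(input_html=raw_html)  # no network request is made when input_html is supplied
>>> article.parse()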

Can't install newspaper

It looks like something has broken during the refactor perhaps. Essentially, I'm unable to install newspaper either from a local directory or via git using pip.

setup.py specifies the newspaper.data package as a dependency, but the data/ directory doesn't exist any more and the install therefore fails.

Retain HTML markup for extracted article

I currently use Boilerpipe to do article extraction in order to generate Kindle MOBI files to send to my Kindle. I'm wondering if it's possible to feature-request the ability to do something similar in Newspaper: in that the article text extraction retains a minimal set of markup around it, enough to give the text structure as far as HTML is concerned. This makes forward conversion to other formats a lot easier, and allows the ability to retain certain markup that can only be expressed using HTML (such as images in situ and code fragments).
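
Newspaper's configuration has a keep_article_html option that sounds close to this request: when enabled, article.article_html holds the extracted node with a minimal set of allowed tags (p, h1-h6, img, a, and so on). A sketch, with the URL and output as placeholder assumptions:

>>> from newspaper import Article

>>> article = Article('http://example.com/some-article', keep_article_html=True)
>>> article.download()
>>> article.parse()
>>> article.article_html
'<div><h1>...</h1><p>...</p></div>'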

Having issues installing due to lxml

I'm not sure if this is an OS X 10.10 or possibly even an Xcode 6.1 command line tools issue, but I'm having some trouble installing this.

Once I get to the pip install newspaper step, it errors out while building for lxml. Any thoughts on what could be going wrong?

Robs-MacBook-Air:~ rob$ pip install newspaper
Requirement already satisfied (use --upgrade to upgrade): newspaper in /Library/Python/2.7/site-packages
Downloading/unpacking lxml (from newspaper)
  Downloading lxml-3.4.0.tar.gz (3.5MB): 3.5MB downloaded
  Running setup.py (path:/private/var/folders/1l/yvmzyjz95_g11p540_g9sz100000gn/T/pip_build_rob/lxml/setup.py) egg_info for package lxml
    /System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/distutils/dist.py:267: UserWarning: Unknown distribution option: 'bugtrack_url'
      warnings.warn(msg)
    Building lxml version 3.4.0.
    Building without Cython.
    Using build configuration of libxslt 1.1.28

    warning: no previously-included files found matching '*.py'
Downloading/unpacking requests (from newspaper)
  Downloading requests-2.4.3-py2.py3-none-any.whl (459kB): 459kB downloaded
Downloading/unpacking nltk (from newspaper)
  Downloading nltk-3.0.0.tar.gz (962kB): 962kB downloaded
  Running setup.py (path:/private/var/folders/1l/yvmzyjz95_g11p540_g9sz100000gn/T/pip_build_rob/nltk/setup.py) egg_info for package nltk

    warning: no files found matching 'Makefile' under directory '*.txt'
    warning: no previously-included files matching '*~' found anywhere in distribution
Downloading/unpacking Pillow (from newspaper)
  Downloading Pillow-2.6.1.tar.gz (7.3MB): 7.3MB downloaded
  Running setup.py (path:/private/var/folders/1l/yvmzyjz95_g11p540_g9sz100000gn/T/pip_build_rob/Pillow/setup.py) egg_info for package Pillow

    warning: no files found matching '*.yaml'
    warning: no files found matching '*.bdf' under directory 'Images'
    warning: no files found matching '*.fli' under directory 'Images'
    warning: no files found matching '*.gif' under directory 'Images'
    warning: no files found matching '*.icns' under directory 'Images'
    warning: no files found matching '*.ico' under directory 'Images'
    warning: no files found matching '*.jpg' under directory 'Images'
    warning: no files found matching '*.pbm' under directory 'Images'
    warning: no files found matching '*.pil' under directory 'Images'
    warning: no files found matching '*.png' under directory 'Images'
    warning: no files found matching '*.ppm' under directory 'Images'
    warning: no files found matching '*.psd' under directory 'Images'
    warning: no files found matching '*.tar' under directory 'Images'
    warning: no files found matching '*.webp' under directory 'Images'
    warning: no files found matching '*.xpm' under directory 'Images'
    warning: no files found matching 'README' under directory 'Sane'
    warning: no files found matching 'README' under directory 'Scripts'
    warning: no files found matching '*.icm' under directory 'Tests'
    warning: no files found matching '*.txt' under directory 'Tk'
Downloading/unpacking cssselect (from newspaper)
  Downloading cssselect-0.9.1.tar.gz
  Running setup.py (path:/private/var/folders/1l/yvmzyjz95_g11p540_g9sz100000gn/T/pip_build_rob/cssselect/setup.py) egg_info for package cssselect

    no previously-included directories found matching 'docs/_build'
Downloading/unpacking BeautifulSoup (from newspaper)
  Downloading BeautifulSoup-3.2.1.tar.gz
  Running setup.py (path:/private/var/folders/1l/yvmzyjz95_g11p540_g9sz100000gn/T/pip_build_rob/BeautifulSoup/setup.py) egg_info for package BeautifulSoup

Installing collected packages: lxml, requests, nltk, Pillow, cssselect, BeautifulSoup
  Running setup.py install for lxml
    /System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/distutils/dist.py:267: UserWarning: Unknown distribution option: 'bugtrack_url'
      warnings.warn(msg)
    Building lxml version 3.4.0.
    Building without Cython.
    Using build configuration of libxslt 1.1.28
    building 'lxml.etree' extension
    cc -fno-strict-aliasing -fno-common -dynamic -arch x86_64 -arch i386 -g -Os -pipe -fno-common -fno-strict-aliasing -fwrapv -DENABLE_DTRACE -DMACOSX -DNDEBUG -Wall -Wstrict-prototypes -Wshorten-64-to-32 -F/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.10.Internal.sdk/System/Library/PrivateFrameworks -framework CrashReporterSupport -DNDEBUG -g -fwrapv -Os -Wall -Wstrict-prototypes -DENABLE_DTRACE -arch x86_64 -arch i386 -pipe -I/usr/include/libxml2 -I/private/var/folders/1l/yvmzyjz95_g11p540_g9sz100000gn/T/pip_build_rob/lxml/src/lxml/includes -I/System/Library/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c src/lxml/lxml.etree.c -o build/temp.macosx-10.10-intel-2.7/src/lxml/lxml.etree.o -w -flat_namespace
    cc -bundle -undefined dynamic_lookup -arch x86_64 -arch i386 -Wl,-F. -F/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.10.Internal.sdk/System/Library/PrivateFrameworks -framework CrashReporterSupport build/temp.macosx-10.10-intel-2.7/src/lxml/lxml.etree.o -lxslt -lexslt -lxml2 -lz -lm -o build/lib.macosx-10.10-intel-2.7/lxml/etree.so
    ld: warning: directory not found for option '-F/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.10.Internal.sdk/System/Library/PrivateFrameworks'
    ld: framework not found CrashReporterSupport
    clang: error: linker command failed with exit code 1 (use -v to see invocation)
    error: command 'cc' failed with exit status 1
    Complete output from command /usr/bin/python -c "import setuptools, tokenize;__file__='/private/var/folders/1l/yvmzyjz95_g11p540_g9sz100000gn/T/pip_build_rob/lxml/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /var/folders/1l/yvmzyjz95_g11p540_g9sz100000gn/T/pip-dac3OE-record/install-record.txt --single-version-externally-managed --compile:
    /System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/distutils/dist.py:267: UserWarning: Unknown distribution option: 'bugtrack_url'

  warnings.warn(msg)

Building lxml version 3.4.0.

Building without Cython.

Using build configuration of libxslt 1.1.28

running install

running build

running build_py

creating build

creating build/lib.macosx-10.10-intel-2.7

creating build/lib.macosx-10.10-intel-2.7/lxml

copying src/lxml/__init__.py -> build/lib.macosx-10.10-intel-2.7/lxml

copying src/lxml/_elementpath.py -> build/lib.macosx-10.10-intel-2.7/lxml

copying src/lxml/builder.py -> build/lib.macosx-10.10-intel-2.7/lxml

copying src/lxml/cssselect.py -> build/lib.macosx-10.10-intel-2.7/lxml

copying src/lxml/doctestcompare.py -> build/lib.macosx-10.10-intel-2.7/lxml

copying src/lxml/ElementInclude.py -> build/lib.macosx-10.10-intel-2.7/lxml

copying src/lxml/pyclasslookup.py -> build/lib.macosx-10.10-intel-2.7/lxml

copying src/lxml/sax.py -> build/lib.macosx-10.10-intel-2.7/lxml

copying src/lxml/usedoctest.py -> build/lib.macosx-10.10-intel-2.7/lxml

creating build/lib.macosx-10.10-intel-2.7/lxml/includes

copying src/lxml/includes/__init__.py -> build/lib.macosx-10.10-intel-2.7/lxml/includes

creating build/lib.macosx-10.10-intel-2.7/lxml/html

copying src/lxml/html/__init__.py -> build/lib.macosx-10.10-intel-2.7/lxml/html

copying src/lxml/html/_diffcommand.py -> build/lib.macosx-10.10-intel-2.7/lxml/html

copying src/lxml/html/_html5builder.py -> build/lib.macosx-10.10-intel-2.7/lxml/html

copying src/lxml/html/_setmixin.py -> build/lib.macosx-10.10-intel-2.7/lxml/html

copying src/lxml/html/builder.py -> build/lib.macosx-10.10-intel-2.7/lxml/html

copying src/lxml/html/clean.py -> build/lib.macosx-10.10-intel-2.7/lxml/html

copying src/lxml/html/defs.py -> build/lib.macosx-10.10-intel-2.7/lxml/html

copying src/lxml/html/diff.py -> build/lib.macosx-10.10-intel-2.7/lxml/html

copying src/lxml/html/ElementSoup.py -> build/lib.macosx-10.10-intel-2.7/lxml/html

copying src/lxml/html/formfill.py -> build/lib.macosx-10.10-intel-2.7/lxml/html

copying src/lxml/html/html5parser.py -> build/lib.macosx-10.10-intel-2.7/lxml/html

copying src/lxml/html/soupparser.py -> build/lib.macosx-10.10-intel-2.7/lxml/html

copying src/lxml/html/usedoctest.py -> build/lib.macosx-10.10-intel-2.7/lxml/html

creating build/lib.macosx-10.10-intel-2.7/lxml/isoschematron

copying src/lxml/isoschematron/__init__.py -> build/lib.macosx-10.10-intel-2.7/lxml/isoschematron

copying src/lxml/lxml.etree.h -> build/lib.macosx-10.10-intel-2.7/lxml

copying src/lxml/lxml.etree_api.h -> build/lib.macosx-10.10-intel-2.7/lxml

copying src/lxml/includes/c14n.pxd -> build/lib.macosx-10.10-intel-2.7/lxml/includes

copying src/lxml/includes/config.pxd -> build/lib.macosx-10.10-intel-2.7/lxml/includes

copying src/lxml/includes/dtdvalid.pxd -> build/lib.macosx-10.10-intel-2.7/lxml/includes

copying src/lxml/includes/etreepublic.pxd -> build/lib.macosx-10.10-intel-2.7/lxml/includes

copying src/lxml/includes/htmlparser.pxd -> build/lib.macosx-10.10-intel-2.7/lxml/includes

copying src/lxml/includes/relaxng.pxd -> build/lib.macosx-10.10-intel-2.7/lxml/includes

copying src/lxml/includes/schematron.pxd -> build/lib.macosx-10.10-intel-2.7/lxml/includes

copying src/lxml/includes/tree.pxd -> build/lib.macosx-10.10-intel-2.7/lxml/includes

copying src/lxml/includes/uri.pxd -> build/lib.macosx-10.10-intel-2.7/lxml/includes

copying src/lxml/includes/xinclude.pxd -> build/lib.macosx-10.10-intel-2.7/lxml/includes

copying src/lxml/includes/xmlerror.pxd -> build/lib.macosx-10.10-intel-2.7/lxml/includes

copying src/lxml/includes/xmlparser.pxd -> build/lib.macosx-10.10-intel-2.7/lxml/includes

copying src/lxml/includes/xmlschema.pxd -> build/lib.macosx-10.10-intel-2.7/lxml/includes

copying src/lxml/includes/xpath.pxd -> build/lib.macosx-10.10-intel-2.7/lxml/includes

copying src/lxml/includes/xslt.pxd -> build/lib.macosx-10.10-intel-2.7/lxml/includes

copying src/lxml/includes/etree_defs.h -> build/lib.macosx-10.10-intel-2.7/lxml/includes

copying src/lxml/includes/lxml-version.h -> build/lib.macosx-10.10-intel-2.7/lxml/includes

creating build/lib.macosx-10.10-intel-2.7/lxml/isoschematron/resources

creating build/lib.macosx-10.10-intel-2.7/lxml/isoschematron/resources/rng

copying src/lxml/isoschematron/resources/rng/iso-schematron.rng -> build/lib.macosx-10.10-intel-2.7/lxml/isoschematron/resources/rng

creating build/lib.macosx-10.10-intel-2.7/lxml/isoschematron/resources/xsl

copying src/lxml/isoschematron/resources/xsl/RNG2Schtrn.xsl -> build/lib.macosx-10.10-intel-2.7/lxml/isoschematron/resources/xsl

copying src/lxml/isoschematron/resources/xsl/XSD2Schtrn.xsl -> build/lib.macosx-10.10-intel-2.7/lxml/isoschematron/resources/xsl

creating build/lib.macosx-10.10-intel-2.7/lxml/isoschematron/resources/xsl/iso-schematron-xslt1

copying src/lxml/isoschematron/resources/xsl/iso-schematron-xslt1/iso_abstract_expand.xsl -> build/lib.macosx-10.10-intel-2.7/lxml/isoschematron/resources/xsl/iso-schematron-xslt1

copying src/lxml/isoschematron/resources/xsl/iso-schematron-xslt1/iso_dsdl_include.xsl -> build/lib.macosx-10.10-intel-2.7/lxml/isoschematron/resources/xsl/iso-schematron-xslt1

copying src/lxml/isoschematron/resources/xsl/iso-schematron-xslt1/iso_schematron_message.xsl -> build/lib.macosx-10.10-intel-2.7/lxml/isoschematron/resources/xsl/iso-schematron-xslt1

copying src/lxml/isoschematron/resources/xsl/iso-schematron-xslt1/iso_schematron_skeleton_for_xslt1.xsl -> build/lib.macosx-10.10-intel-2.7/lxml/isoschematron/resources/xsl/iso-schematron-xslt1

copying src/lxml/isoschematron/resources/xsl/iso-schematron-xslt1/iso_svrl_for_xslt1.xsl -> build/lib.macosx-10.10-intel-2.7/lxml/isoschematron/resources/xsl/iso-schematron-xslt1

copying src/lxml/isoschematron/resources/xsl/iso-schematron-xslt1/readme.txt -> build/lib.macosx-10.10-intel-2.7/lxml/isoschematron/resources/xsl/iso-schematron-xslt1

running build_ext

building 'lxml.etree' extension

creating build/temp.macosx-10.10-intel-2.7

creating build/temp.macosx-10.10-intel-2.7/src

creating build/temp.macosx-10.10-intel-2.7/src/lxml

cc -fno-strict-aliasing -fno-common -dynamic -arch x86_64 -arch i386 -g -Os -pipe -fno-common -fno-strict-aliasing -fwrapv -DENABLE_DTRACE -DMACOSX -DNDEBUG -Wall -Wstrict-prototypes -Wshorten-64-to-32 -F/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.10.Internal.sdk/System/Library/PrivateFrameworks -framework CrashReporterSupport -DNDEBUG -g -fwrapv -Os -Wall -Wstrict-prototypes -DENABLE_DTRACE -arch x86_64 -arch i386 -pipe -I/usr/include/libxml2 -I/private/var/folders/1l/yvmzyjz95_g11p540_g9sz100000gn/T/pip_build_rob/lxml/src/lxml/includes -I/System/Library/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c src/lxml/lxml.etree.c -o build/temp.macosx-10.10-intel-2.7/src/lxml/lxml.etree.o -w -flat_namespace

cc -bundle -undefined dynamic_lookup -arch x86_64 -arch i386 -Wl,-F. -F/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.10.Internal.sdk/System/Library/PrivateFrameworks -framework CrashReporterSupport build/temp.macosx-10.10-intel-2.7/src/lxml/lxml.etree.o -lxslt -lexslt -lxml2 -lz -lm -o build/lib.macosx-10.10-intel-2.7/lxml/etree.so

ld: warning: directory not found for option '-F/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.10.Internal.sdk/System/Library/PrivateFrameworks'

ld: framework not found CrashReporterSupport

clang: error: linker command failed with exit code 1 (use -v to see invocation)

error: command 'cc' failed with exit status 1

----------------------------------------
Cleaning up...
Command /usr/bin/python -c "import setuptools, tokenize;__file__='/private/var/folders/1l/yvmzyjz95_g11p540_g9sz100000gn/T/pip_build_rob/lxml/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /var/folders/1l/yvmzyjz95_g11p540_g9sz100000gn/T/pip-dac3OE-record/install-record.txt --single-version-externally-managed --compile failed with error code 1 in /private/var/folders/1l/yvmzyjz95_g11p540_g9sz100000gn/T/pip_build_rob/lxml
Storing debug log for failure in /Users/rob/Library/Logs/pip.log

Sites it doesn't work on

I've got a running list of URLs that newspaper doesn't work well on. Is there an open issue to catalogue these? In most cases, it's able to grab the list of articles from the home page but is completely unable to decipher each individual article into readable values.

For example, this link gets basically nothing:
http://www.empireonline.com/news/story.asp?NID=40344

article does not release_resources()

When running article.parse() I am running into memory issues with a large number of articles being processed.

Each time the function is called it eats up about 0.5MB of memory that is not released when the parsing is done.

I took a look at the parse() function in article.py and it looks like the release_resources() function still has a TODO to be properly implemented:

https://github.com/codelucas/newspaper/blob/master/newspaper/article.py#L355

I'm curious if you can give more detail about a proper implementation of this function so that parse() will release the memory once it is done with it.

Refactor codebase so newspaper is actually pythonic

Upon re-examining the code to this lib (which has many chunks taken from various locations of the open source community), I've come to the conclusion that it's a mess and really needs to be refactored.

Will try to make something happen this weekend.

Python venv only?

Hey, so I downloaded the library and tried writing a small Python program to run the scripts you mentioned in the tutorial. But nothing runs. Newspaper is found, but beyond that, nothing.

Error for proof:

Traceback (most recent call last):
  File "/newspaper.py", line 1, in <module>
    import newspaper
  File "/newspaper.py", line 3, in <module>
    cnn_paper = newspaper.build('http://cnn.com')
AttributeError: 'module' object has no attribute 'build'
[Finished in 0.0s with exit code 1]

KeyError when calling newspaper.languages()

Hi,

It's a minor thing, but I tried to see what languages are available by calling newspaper.languages() and it exited with a KeyError: 'nb' exception. It seems someone forgot to add this language to the language_dict inside print_available_languages() defined in newspaper/utils/__init__.py.

Doesn't work if content has a <strong> tag.

It doesn't work on http://mil.news.sina.com.cn/2014-01-25/1318761762.html.

Only the strong characters have been recognized; the result is recorded below:

>>> print a.text
中新网北京1月24日电 (记者 孙自法)“2013**科学年度新闻人物”评选结果24日晚在北京揭晓,领衔科研团队在国际上首次实现“量子反常霍尔效应”的薛其坤院士、神舟载人飞船系统总设计师张柏楠、**首艘航母关键配套导航系统领航人张崇猛、运-20总设计师唐长红院士等10名科技专家,从40位候选人中脱颖而出、成功当选。

这次评出的十大**科学年度新闻人物包括基础研究领域科学家3名、技术创新和科技成果转化杰出者3名、科技企业领军人物3名、科技传播者1名,他们分别是:

――**科学院院士、清华大学副校长薛其坤。2013年,他带领研究团队,在国际上首次实现“量子反常霍尔效应”,让**科学界站在了下一次信息革命的战略制高点。

――中科院院士、清华大学生命学院院长施一公。这位知名结构生物学家的科研小组2013年研究进展不断,包括“运用X-射线晶体学手段在细胞凋亡研究领域做出突出贡献,为开发新型抗癌、预防老年痴呆的药物提供重要线索”等。

――量子世界“追梦人”、**科学技术大学教授陈宇翱。2013年,凭借在光子、冷原子量子操纵和量子信息、量子模拟等领域的杰出贡献,他荣获2013年度“菲涅尔奖”。

――**航天科技集团空间技术研究院载人飞船系统总设计师张柏楠。2013年,他带领团队突破一系列关键技术,实现天宫一号与神舟十号手控交会对接,完成**载人天地往返运输系统的首次应用性飞行。

Docs for adding category sources

I've been playing around with newspaper today. Looks awesome. I've had trouble picking up a sufficient number of categories on a number of sites... might be a good idea to add docs for adding new categories to sources.

Problem in Brazilian sites

I got problems using the newspaper in Brazilian sites.
Following is an example:

import newspaper

info = newspaper.build('http://globoesporte.globo.com/futebol/times/sao-paulo')
len(info.articles)

It returned only 3 articles.

Sorry if I am using it wrongly.

Timegm error?

After installing the script, I get the following error when importing newspaper:
ImportError: cannot import name timegm

Any idea why this is happening?

Article.top_node == Article.clean_top_node

In Article.parse, top_node is overwritten with the cleaned node.
Then Article.clean_top_node is copied from this.
Both nodes are equal. I'm not sure what the reasons are, but it prevents extraction by external tools by hiding the extracted article HTML.

Preferably, Article.top_node should not be overwritten, and existing code should be modified to use clean_top_node where required.

.nlp() could not work

I have been following the example in the README and I encountered this:

>>> article = cnn_paper.articles[1]
>>> article.download()
>>> article.parse()
>>> article.nlp()
Traceback (most recent call last):
zipfile.BadZipfile: File is not a zip file
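
This usually points to a missing or corrupted NLTK data download rather than to newspaper itself. A possible fix (a sketch, assuming the punkt tokenizer data used by .nlp() is the culprit) is to re-fetch it, or to re-run the download_corpora.py script from the install instructions:

>>> import nltk
>>> nltk.download('punkt')  # re-download the sentence tokenizer data used during summarization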

article_html does not keep the img tags

When extracting the article node with its HTML using a.article_html, the <img> tags are not kept. I noticed that in the clean_html(cls, node) function, 'img' is allowed, so why is it not included in the article_html output?

    article_cleaner.allow_tags = ['a', 'span', 'p', 'br', 'strong', 'b',
            'em', 'i', 'tt', 'code', 'pre', 'blockquote', 'img', 'h1',
            'h2', 'h3', 'h4', 'h5', 'h6']
    article_cleaner.remove_unknown_tags = False

You must download and parse an article before parsing it

Here the stack trace:

[Parse lxml ERR] line 1045: Tag nav invalid
[Article parse ERR] http://www.cnet.com/products/apple-ipad-march-2012/
You must download and parse an article before parsing it!
Traceback (most recent call last):
  File "crawler.py", line 30, in <module>
    a.nlp()
  File "/root/.virtualenvs/cnet-crawler/local/lib/python2.7/site-packages/newspaper/article.py", line 276, in nlp
    raise ArticleException()
newspaper.article.ArticleException

I'm not using the concurrent version, and I'm not building a newspaper from a URL; rather, I have a list of all the articles and I build a new Article from each of them.

Memoize Articles - Not Printing

Articles not being parsed from Memoize?

import newspaper
cnn_paper = newspaper.build('http://cnn.com', memoize_articles=True)

for article in cnn_paper.articles:
    print article.url

It runs the first time as it is not cached and prints all the results; the second time nothing is printed -- blank.

python 3 support request

Python 3.4 + Windows + pip install =

SyntaxError: invalid syntax
Complete output from command python setup.py egg_info:
Traceback (most recent call last):

File "", line 16, in
\AppData\Local\Temp\pip_build_piyush\newspaper\setup.py", line 60
print ''

       ^

SyntaxError: invalid syntax

SyntaxError: invalid syntax

import newspaper
Traceback (most recent call last):
File "", line 1, in
File "newspaper/init.py", line 10, in
from .article import Article, ArticleException
File "newspaper/article.py", line 15, in
from . import nlp
File "newspaper/nlp.py", line 171
if (normalized > 1.0) #just in case

Bound for memory usage

First, thanks a lot for the great tool. I've been trying it out, and seems magic (except for some corner cases, websites for which it doesn't work, etc) but really cool :)

However, I tried it in a setting with scarce resources (1G of RAM), and I have the impression that memory keeps growing build after build until ... memory error. I deactivated memoize_articles, tried to empty the articles, and dereferenced the sources, but it looks like a bunch of other things are also memoized and kept in memory, with no way to deactivate them. What is the best way to handle this? How does newspaper handle the increase in memory usage build after build? Is there a limit?

Thanks again for the magic tool :)
raspooti

Hosted demo

Any chance of hosting a demo of newspaper in action so we can try it out before going through the setup steps?

It'd be nice to have a "try before you buy" before committing to the setup.

Character encoding detection

I noticed that to get unicode, you rely on the requests package's request.text attribute (in network.py->get_html). To get this, requests only uses the HTTP header encoding declaration (requests.utils.get_encoding_from_headers()), and reverts to ISO-8859-1 if it doesn't find one. This results in incorrect character encoding in a lot of cases.

You can use another function from requests to give you the encodings listed in the HTML: requests.utils.get_encodings_from_content() which will work to fill in the gaps. What I generally do is test the request object encoding first. If it's not ISO-8859-1, then it has been passed an encoding, and I return the request.text unicode. If it is, then I call the requests.utils.get_encodings_from_content() which parses via regex. It returns a list of suggested encodings from the content to try, which are generally correct.

In the final case, neither approach will work, an example is this page: http://boaforma.abril.com.br/fitness/todos-os-treinos/bikes-eletricas-759925.shtml

There is no HTTP header encoding, and an incorrect encoding declaration in the content: content="text/html; charset=uISO-8859-1. Here we could use chardet or fall back to the original ISO-8859-1 encoding that requests defaults to (it works in this case).

I'd be happy to add this to the code if desired so you can pull it. Would it be most appropriate to put this into the network.py file?

Edit: Also, I have a large collection of special snowflake links that provide decoding difficulties and edge cases that we could add to the test suite if necessary.
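
A sketch of the fallback described above. It assumes requests.utils.get_encodings_from_content is available (it has been deprecated in newer requests releases), and the helper name get_unicode_html is just for illustration, not a proposed network.py API:

import requests
from requests.utils import get_encodings_from_content

def get_unicode_html(url):
    """Fetch a page and choose a charset roughly as described above."""
    response = requests.get(url)
    # requests falls back to ISO-8859-1 only when the headers carry no charset,
    # so any other value means an encoding was explicitly declared.
    if response.encoding and response.encoding.lower() != 'iso-8859-1':
        return response.text
    # Otherwise look for <meta> charset declarations inside the document itself.
    declared = get_encodings_from_content(response.text)
    if declared:
        response.encoding = declared[0]
    # As a last resort one could also try response.apparent_encoding (chardet)
    # before accepting the ISO-8859-1 default.
    return response.text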

Port to Ruby

It would be lovely if I could port this to Ruby. I think one issue I'll have to deal with is replacing Beautiful Soup with Nokogiri. I need to sit down and go through all of the code to spot any issues that may arise.

Retain <a> tags in top article node?

I'm wondering if there's a way to retain a tags in article.top_node, or alternatively just extract all the urls from within the article's html. My hope is to be able to find when an article links to any number of other articles. I'm currently digging around in the source to find the best place to include this, but could use guidance. Thanks!

How to assign html content without downloading it?

Is it possible to assign an already-downloaded HTML string to the Article object without calling the download() method?

I want to use it in a Scrapy project where the HTML page is already downloaded, so I simply need to parse it.
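
A sketch of one approach, assuming Article.set_html() (the same method download() uses internally to store the fetched page) may be called directly; the URL is a placeholder and scrapy_response stands for whatever object already holds your HTML. The download(input_html=...) form shown under "Parsing Raw HTML" above should work as well.

>>> from newspaper import Article

>>> article = Article(url='http://example.com/already-fetched')
>>> article.set_html(scrapy_response.text)  # hand over the HTML Scrapy already downloaded
>>> article.parse()
>>> article.text
'...'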

issue with stopwords-tr.txt

Hello. I keep running into an error when parsing the downloaded articles. The error has to do with "Couldn't open file /home/.../newspaper/utils/../resources/text/stopwords-tr.txt". I am not sure where the issue is coming from, but my guess is that it might come from some updates in python-goose? Thanks.

Brazilian portuguese support

I would like to use newspaper on Brazilian sites. I tested it, but without success.
The articles are not extracted correctly.

Is newspaper internationalized and I am simply using it the wrong way, or can we add support for Brazilian Portuguese?

Thank you.

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 11: ordinal not in range(128)

File "/home/tim/Workspace/Development/hacks/pressmonitor/pressmon/articles/management/commands/collect_articles.py", line 27, in handle
article.nlp()
File "/home/tim/.virtualenvs/pressmon/lib/python2.7/site-packages/newspaper/article.py", line 307, in nlp
summary_sents = nlp.summarize(title=self.title, text=self.text)
File "/home/tim/.virtualenvs/pressmon/lib/python2.7/site-packages/newspaper/nlp.py", line 34, in summarize
sentences = split_sentences(text)
File "/home/tim/.virtualenvs/pressmon/lib/python2.7/site-packages/newspaper/nlp.py", line 146, in split_sentences
sentences = tokenizer.tokenize(text)
File "/home/tim/.virtualenvs/pressmon/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1270, in tokenize
return list(self.sentences_from_text(text, realign_boundaries))
File "/home/tim/.virtualenvs/pressmon/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1318, in sentences_from_text
return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
File "/home/tim/.virtualenvs/pressmon/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1309, in span_tokenize
return [(sl.start, sl.stop) for sl in slices]
File "/home/tim/.virtualenvs/pressmon/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1348, in _realign_boundaries
for sl1, sl2 in _pair_iter(slices):
File "/home/tim/.virtualenvs/pressmon/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 355, in _pair_iter
for el in it:
File "/home/tim/.virtualenvs/pressmon/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1324, in _slices_from_text
if self.text_contains_sentbreak(context):
File "/home/tim/.virtualenvs/pressmon/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1369, in text_contains_sentbreak
for t in self._annotate_tokens(self._tokenize_words(text)):
File "/home/tim/.virtualenvs/pressmon/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1504, in _annotate_second_pass
for t1, t2 in _pair_iter(tokens):
File "/home/tim/.virtualenvs/pressmon/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 354, in _pair_iter
prev = next(it)
File "/home/tim/.virtualenvs/pressmon/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 621, in _annotate_first_pass
for aug_tok in tokens:
File "/home/tim/.virtualenvs/pressmon/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 586, in _tokenize_words
for line in plaintext.split('\n'):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 11: ordinal not in range(128)

Closing quotation mark is removed from title

I work mostly with Russian articles. In Russian, «angled» quotes are the main variety of quotation marks. I noticed that if there are «angled» quotes in the title of an article, all closing quotation marks are removed from the extracted title, and it contains only the opening ones. I found that it happens here in ContentExtractor:

TITLE_REPLACEMENTS = ReplaceSequence().create(u"&raquo;").append(u"»")

...

return TITLE_REPLACEMENTS.replaceAll(title).strip()

As far as I understand, it's needed to remove » from titles where this character is used as a delimiter. Maybe it would make sense to modify the regular expressions so they do not remove right quotes that have matching left quotes before them?

Here's an example of a page that has broken quotation marks in its extracted title (Russian language).
