
newscrawler's Introduction

Bangladeshi Online Newspaper Crawler

Done

[Note: The Dhaka Tribune website is under development, so that crawler won't work]

How to

0. Before Beginning

1. Download and Install MongoDB

2. Download Stanford NER and configure it

3. Download and configure Elasticsearch & Kibana

1. Setting Up

1 (a) If you use a Linux distro, install these packages first

Using this command,

sudo apt-get install build-essential autoconf libtool pkg-config python-opengl python-imaging python-pyrex python-pyside.qtopengl idle-python2.7 qt4-dev-tools qt4-designer libqtgui4 libqtcore4 libqt4-xml libqt4-test libqt4-script libqt4-network libqt4-dbus python-qt4 python-qt4-gl libgle3 python-dev virtualenv libjpeg-dev libxml2-dev libxslt1-dev


sudo easy_install greenlet

sudo easy_install gevent

1 (b) Cloning the repository and creating virtual environment

Cloning

Open a cmd or terminal and enter the following command

git clone -b spider_only https://github.com/manashmndl/NewsCrawler.git

After a successful clone, output like this will be shown

Cloning into 'NewsCrawler'...
remote: Counting objects: 711, done.
remote: Compressing objects: 100% (7/7), done.
remote: Total 711 (delta 1), reused 0 (delta 0), pack-reused 704
Receiving objects: 100% (711/711), 2.03 MiB | 78.00 KiB/s, done.
Resolving deltas: 100% (376/376), done.
Checking connectivity... done.

Virtual Environment Creation

  • Enter the directory using cd command,
cd NewsCrawler
  • Create a virtual environment

[I'm using Python 2.7, not Anaconda]

virtualenv -p /usr/bin/python2.7 venv

On success, the output looks like this:

Running virtualenv with interpreter /usr/bin/python2.7
New python executable in /home/<username>/Downloads/project/NewsCrawler/venv/bin/python2.7
Also creating executable in /home/<username>/Downloads/project/NewsCrawler/venv/bin/python
Installing setuptools, pip, wheel...done.
  • Activate the environment
source venv/bin/activate

On successful activation, the environment name will appear in your prompt

(venv) rubel@rubel-rig ~/Downloads/project/NewsCrawler $ 

1 (c) Installing Dependencies

After activating the virtual environment, enter the following command to install all dependencies,

pip install -r requirements.txt

1 (d) Download NLTK Corpus

Run python from the virtual environment and do the following:

import nltk
nltk.download()

Then, from the downloader menu, download all the corpora used in the NLTK book.
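
If you prefer a non-interactive download, the same corpora can be fetched from code. A minimal sketch, assuming the NLTK "book" collection covers what the crawler needs:

import nltk
nltk.download('book')  # downloads the corpora bundled with the NLTK book collection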

1 (e) Checking if the spiders are ready or not

Enter the following command,

scrapy list

If it echoes these two spiders, then the spiders are ready to crawl! Start the MongoDB, Elasticsearch, and Kibana servers and start crawling.

dailystar
prothomalo

2. Configuring API and StanfordNER Path

Indicoio API Configuration

In the credentials_and_configs/keys.py file, change the API key. You can create an indico.io account and get your own API key.

Example,

INDICOIO_API_KEY = '8ee6432e7dc137740c40c0af8d7XXXXXX' # Replace the value with your own API Key
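
How the project wires this key in isn't shown here, but with the indicoio client the key is typically applied like this (a sketch, assuming keys.py is importable from the project root):

import indicoio
from credentials_and_configs.keys import INDICOIO_API_KEY

indicoio.config.api_key = INDICOIO_API_KEY  # authenticates subsequent indicoio calls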

StanfordNER Path

At credentials_and_configs/stanford_ner_path.py file, change the paths according to the downloaded NER and CLASSIFIER paths.

Example,

STANFORD_NER_PATH = r'C:\StanfordParser\stanford-ner-2015-12-09\stanford-ner.jar'  # Insert your path here
STANFORD_CLASSIFIER_PATH = r'C:\StanfordParser\stanford-ner-2015-12-09\classifiers\english.all.3class.distsim.crf.ser.gz'  # Insert your path here
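
These two paths are what NLTK's Stanford NER wrapper expects. A quick way to verify them is the sketch below (an assumption that the spiders tag entities via NLTK's StanfordNERTagger):

from nltk.tag import StanfordNERTagger
from credentials_and_configs.stanford_ner_path import STANFORD_NER_PATH, STANFORD_CLASSIFIER_PATH

# The classifier model comes first, the stanford-ner.jar second.
tagger = StanfordNERTagger(STANFORD_CLASSIFIER_PATH, STANFORD_NER_PATH)
print(tagger.tag('Dhaka is the capital of Bangladesh'.split()))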

3. Running the spiders

Open a command window / terminal at the root of the project folder, then run the commands in the next section to start scraping.

4. Crawling Instructions

Spider Names

  • The Daily Star -> dailystar
  • Prothom Alo -> prothomalo
  • Dhaka Tribune -> dhakatribune

Crawl 'em all

For Daily Star

scrapy crawl dailystar

For Prothom Alo

scrapy crawl prothomalo

For Dhaka Tribune

scrapy crawl dhakatribune

Crawling bounded by date time

If I want to scrape all of the news between 1 January 2016 and 1 February 2016, my command will look like this (dates are given as DD-MM-YYYY):

Daily Star

scrapy crawl dailystar -a start_date="01-01-2016" -a end_date="01-02-2016"

Prothom Alo

scrapy crawl prothomalo -a start_date="01-01-2016" -a end_date="01-02-2016"
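
Under the hood, Scrapy passes each -a name=value pair to the spider's __init__ as a keyword argument. The real spiders' argument handling may differ; this is only a minimal sketch of how such date arguments can be parsed:

from datetime import datetime

import scrapy


class DateBoundedSpider(scrapy.Spider):
    # Illustrative spider, not one of the real ones (dailystar / prothomalo).
    name = 'datebounded_example'

    def __init__(self, start_date=None, end_date=None, *args, **kwargs):
        super(DateBoundedSpider, self).__init__(*args, **kwargs)
        fmt = '%d-%m-%Y'  # matches the DD-MM-YYYY strings shown above
        self.start_date = datetime.strptime(start_date, fmt) if start_date else None
        self.end_date = datetime.strptime(end_date, fmt) if end_date else None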

Crawling Dhaka Tribune by page range

Dhaka Tribune

scrapy crawl dhakatribune -a start_page=0 -a end_page=10

Crawling with CSV/JSON output

If you want to collect all crawled data in a CSV or a JSON file, you can run one of these commands.

Daily Star [csv output]

scrapy crawl dailystar -a start_date="01-01-2016" -a end_date="01-02-2016" -o output_file_name.csv

Daily Star [json output]

scrapy crawl dailystar -a start_date="01-01-2016" -a end_date="01-02-2016" -o output_file_name.json

Dhaka Tribune [csv output]

scrapy crawl dhakatribune -a start_page=0 -a end_page=10 -o output_file_name.csv

Dhaka Tribune [json output]

scrapy crawl dhakatribune -a start_page=0 -a end_page=10 -o output_file_name.json

Prothom Alo [csv output]

scrapy crawl prothomalo -a start_date="01-01-2016" -a end_date="01-02-2016" -o output_file_name.csv

Prothom Alo [json output]

scrapy crawl prothomalo -a start_date="01-01-2016" -a end_date="01-02-2016" -o output_file_name.json

5. Data insertion into Elasticsearch and Kibana Visualization Instructions

  • Download and extract Kibana and Elasticsearch

Starting MongoDB Service

  • Open CMD/Terminal then type the following command
mongod 

It should give the following output

2016-12-03T03:00:38.986+0600 I CONTROL  [initandlisten] MongoDB starting : pid=7204 port=27017 dbpath=C:\data\db\ 64-bit host=DESKTOP-4PR51E6
2016-12-03T03:00:38.986+0600 I CONTROL  [initandlisten] targetMinOS: Windows 7/Windows Server 2008 R2
2016-12-03T03:00:38.987+0600 I CONTROL  [initandlisten] db version v3.2.7
.............
.............
2016-12-03T03:00:39.543+0600 I NETWORK  [HostnameCanonicalizationWorker] Starting hostname canonicalization worker
2016-12-03T03:00:39.543+0600 I FTDC     [initandlisten] Initializing full-time diagnostic data capture with directory 'C:/data/db/diagnostic.data'
2016-12-03T03:00:39.558+0600 I NETWORK  [initandlisten] waiting for connections on port 27017

Now you're all set to use the MongoDB service!

MongoDB Troubleshooting

  • Couldn't find the path

Add the MongoDB\Server\3.2\bin folder to your system path and then try again.

  • Data directory C:\data\db\ not found., terminating

Quite simple: create the C:\data\db folder and you're good to go.

Starting Elasticsearch Server

Go to the elasticsearch-5.0.0\bin folder, then run elasticsearch.bat on Windows.

Starting Kibana Server

Go to the kibana-5.0.0-windows-x86\bin folder and run kibana.bat on Windows.

All of your local servers and services should now be running. Start crawling using the scrapy crawl command and the data will be inserted automatically into the Mongo database and Elasticsearch; you can also get the output in either CSV or JSON format. You must start Elasticsearch before Kibana.
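
The project's actual pipeline code isn't shown in this README, but conceptually an item pipeline along these lines performs the automatic insertion (a sketch: the MongoDB database/collection names are assumptions, while newspaper_index/news match the Elasticsearch queries shown later):

import pymongo
from elasticsearch import Elasticsearch


class NewsStoragePipeline(object):
    # Hypothetical pipeline: pushes every scraped item to Elasticsearch and MongoDB.
    def open_spider(self, spider):
        self.mongo = pymongo.MongoClient('localhost', 27017)
        self.collection = self.mongo['news_db']['news']  # assumed db/collection names
        self.es = Elasticsearch(['localhost:9200'])

    def process_item(self, item, spider):
        doc = dict(item)
        self.es.index(index='newspaper_index', doc_type='news', body=doc)  # index in ES first
        self.collection.insert_one(doc)  # insert_one adds an _id to doc, so Mongo goes last
        return item

    def close_spider(self, spider):
        self.mongo.close()

Such a pipeline would be enabled through ITEM_PIPELINES in the Scrapy settings.py.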

Configuring Kibana for data acquisition and Visualization

The Kibana server listens on localhost:5601 by default, so open that URL in your browser.

  • Go to Management
  • Click on Index Patterns and then Add New
  • Untick Index contains time-based events and enter news* in the Index name or pattern text input, then click Create
  • Go to Discover and select the news* index

newscrawler's People

Contributors

arefindk, manashmandal


newscrawler's Issues

Image folder naming

It seems like the image folder now has a name newspaper_name_crawled_date_crawled_date but it should be newspaper_name_published_date_crawled_date.


Remove date field

We already have crawled_date and published_date. We don't need the date field anymore because it seems to be the same date as crawled_date.

Keep all newspapers in the same collection in mongodb

You are creating different collections for different newspapers. Put them into the same collection, as you are doing in Elasticsearch. The newspapers can easily be identified by the newspaper name inserted with every document.

Configuring Elasticsearch

Elasticsearch install problem on Windows

Open the file \elasticsearch-5.0.0\elasticsearch-5.0.0\config\jvm.options and add -Xss1m after
-Xms2g -Xmx2g
then save and run elasticsearch-service.bat again; the service should install this time.

Strip in Title

2017-02-23 14:55:24 [scrapy.core.scraper] ERROR: Spider error processing <GET http://www.thedailystar.net/news-detail-17301> (referer: http://www.thedailystar.net/newspaper?date=2008-01-01)
Traceback (most recent call last):
  File "/media/storage/syed_databases/newspaper_crawl/NewsCrawler/venv/local/lib/python2.7/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/media/storage/syed_databases/newspaper_crawl/NewsCrawler/venv/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
    for x in result:
  File "/media/storage/syed_databases/newspaper_crawl/NewsCrawler/venv/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/referer.py", line 22, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/media/storage/syed_databases/newspaper_crawl/NewsCrawler/venv/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/media/storage/syed_databases/newspaper_crawl/NewsCrawler/venv/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/media/storage/syed_databases/newspaper_crawl/NewsCrawler/NewsCrawler/spiders/DailyStarCrawler.py", line 168, in parseNews
    news_item = self.getTitle(news_item, response)
  File "/media/storage/syed_databases/newspaper_crawl/NewsCrawler/NewsCrawler/spiders/DailyStarCrawler.py", line 278, in getTitle
    title = response.xpath("//h1/text()").extract_first().strip()
AttributeError: 'NoneType' object has no attribute 'strip'
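
A likely fix, judging only from this traceback: extract_first() returns None when the XPath matches nothing, so supply a default before calling strip().

# Defensive version of the line at DailyStarCrawler.py:278
title = response.xpath("//h1/text()").extract_first(default='').strip()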

[Updates]

  • Note down the dependencies
  • Stanford NER Tagger Location: config [DONE]
  • API Key: config [DONE]
  • Tell all the things in README
  • Both Unique and list [DONE]
  • Problem in datetime [DONE]
  • Add MongoDB

.gitignore
-> config file

Querying elasticsearch

http://okfnlabs.org/blog/2013/07/01/elasticsearch-query-tutorial.html

{u'_shards': {u'failed': 0, u'successful': 5, u'total': 5},
 u'hits': {u'hits': [], u'max_score': None, u'total': 0},
 u'timed_out': False,
 u'took': 11}

In [26]: r['hits']
Out[26]: {u'hits': [], u'max_score': None, u'total': 0}

In [27]: q = {
    ...:     "query": {
    ...:         "term": {
    ...:             "newspaper" : "Daily Star"
    ...:         }
    ...:     }
    ...: }

In [28]: r = client.search(index="newspaper_index", doc_type="news", body=q)

In [29]: r
Out[29]:
{u'_shards': {u'failed': 0, u'successful': 5, u'total': 5},
 u'hits': {u'hits': [], u'max_score': None, u'total': 0},
 u'timed_out': False,
 u'took': 2}
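
One likely reason for the empty hits above (an assumption, since the field mapping isn't shown): a term query is not analyzed, so "Daily Star" will never match an analyzed text field that was indexed as the tokens daily and star. A match query usually behaves as expected:

q = {'query': {'match': {'newspaper': 'Daily Star'}}}
r = client.search(index='newspaper_index', doc_type='news', body=q)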

NoClassDefFoundError

Assorted issues before first run

1. Update the description in readme.md after you verify them:

You might need to update the NLTK corpora to properly run the newspaper library:
python -m nltk.downloader all

2. Update the description in readme.md after you verify it:

To fix the lxml error while installing newspaper using pip on Ubuntu:
sudo apt-get install python-dev libxml2-dev libxslt1-dev zlib1g-dev
http://stackoverflow.com/questions/5178416/pip-install-lxml-error

3. Update the description in readme.md after you verify it:

To fix the "jpeg is required" error while installing indicoio:

sudo apt-get install libtiff5-dev libjpeg8-dev zlib1g-dev libfreetype6-dev liblcms2-dev libwebp-dev tcl8.6-dev tk8.6-dev python-tk

4. Update the description in readme.md after you verify it:

You need to update the default nltk 2.x to 3.x so that nltk.tag works

pip install -U nltk

5. Update the description in readme.md after you verify it:

You forgot to add this in the requirements

pip install wget

6. Update the description in readme.md after you verify it:

Make sure adding a single path name would suffice; here is another place, NewsCrawler/NewsCrawler/credentials, where you need the path name.

7. Update the description in readme.md after you verify it:

Adding the path and API key in the root folder does not make any difference; it seems that the actual configuration files are the ones in the NewsCrawler/NewsCrawler/credentials_and_configs directory.

There is a problem with indentation in all the crawler files; it was resolved when I used Sublime to convert all indentation to tabs.

8. You forgot to import re, but used it in the middle of the code in the dailystar and prothomalo files; it was imported in the Dhaka Tribune crawler.

9. where are you saving the jpeg files?

10. DhakaTribune crawler not working

11. Most importantly, you saved the published date as a string; it should be a date, not a string. Also rename the date variable to date_crawled and the published date to date_published.

12. Change ner_list_location to ner_locations and change ner_location to ner_unique_locations.

13. What does the shoulder keyword stand for inside the data inserted in Elasticsearch?

Error creating the id

2017-02-23 14:57:36 [scrapy.core.scraper] ERROR: Spider error processing <GET http://www.thedailystar.net/news-detail-17367> (referer: http://www.thedailystar.net/newspaper?date=2008-01-01)
Traceback (most recent call last):
  File "/media/storage/syed_databases/newspaper_crawl/NewsCrawler/venv/local/lib/python2.7/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/media/storage/syed_databases/newspaper_crawl/NewsCrawler/venv/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
    for x in result:
  File "/media/storage/syed_databases/newspaper_crawl/NewsCrawler/venv/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/referer.py", line 22, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/media/storage/syed_databases/newspaper_crawl/NewsCrawler/venv/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/media/storage/syed_databases/newspaper_crawl/NewsCrawler/venv/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/media/storage/syed_databases/newspaper_crawl/NewsCrawler/NewsCrawler/spiders/DailyStarCrawler.py", line 213, in parseNews
    news_item = self.get_id(news_item, response)
  File "/media/storage/syed_databases/newspaper_crawl/NewsCrawler/NewsCrawler/spiders/DailyStarCrawler.py", line 116, in get_id
    news_title = str(news_item['title']).lower().strip().encode('ascii', 'ignore').replace(' ', '_')
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2018' in position 0: ordinal not in range(128)
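
A possible fix, sketched from the traceback alone: in Python 2, wrapping a unicode title in str() forces the ASCII codec and raises exactly this error, whereas encoding the unicode value directly and ignoring non-ASCII characters avoids it.

# Sketched fix for get_id: encode the unicode title first, then work on the bytes.
news_title = news_item['title'].encode('ascii', 'ignore').lower().strip().replace(' ', '_')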

Changing id of the news articles

Formula for the id of a news article

Newspaper Name + published date of the article + crawled date

This will be used as the image_folder name as well
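
A minimal sketch of that formula (the function name, date formats, and separator are assumptions):

def make_article_id(newspaper_name, date_published, date_crawled):
    # e.g. make_article_id('Daily Star', '01-01-2016', '03-12-2016')
    #      -> 'daily_star_01-01-2016_03-12-2016'
    parts = [newspaper_name, date_published, date_crawled]
    return '_'.join(str(p).lower().replace(' ', '_') for p in parts)

# The same string doubles as the article's image folder name.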
