This repository holds the code for the scrapers built under the project "Scrape the Planet".
- Framework used for scraping: Scrapy
- Language used for scraping: Python 3.x
Minutes of the meetings: http://bit.ly/scrapeThePlanet
Clone the repository (or download it), then follow these steps to set up an environment for the spiders.
python3 -m venv VENV_NAME
Windows: VENV_NAME\Scripts\activate
Linux: source VENV_NAME/bin/activate
Navigate to the repository root and install the dependencies: pip3 install -r requirements.txt
- Requirements:
- scrapy
- requests
- python-dateutil
Note: you can comment out the following lines in settings.py to disable the item pipeline.
ITEM_PIPELINES = {
'scrapeNews.pipelines.ScrapenewsPipeline': 300
}
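For reference, a Scrapy item pipeline is just a class with a few well-known methods. The real ScrapenewsPipeline lives in scrapeNews/pipelines.py; the sketch below only illustrates the interface Scrapy expects (the method names are Scrapy's, the bodies are assumptions):

```python
# Minimal sketch of a Scrapy-style item pipeline (illustrative only; the
# project's actual pipeline is ScrapenewsPipeline in scrapeNews/pipelines.py).
class MinimalPipeline:
    def open_spider(self, spider):
        # Called once when the spider starts, e.g. to open a DB connection.
        pass

    def process_item(self, item, spider):
        # Called for every scraped item; must return the item
        # (or raise scrapy.exceptions.DropItem to discard it).
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes, e.g. to close the connection.
        pass
```

Commenting out ITEM_PIPELINES simply means process_item is never invoked, so items are scraped but not stored.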
Install PostgreSQL on Debian:
sudo apt-get install postgresql postgresql-contrib
Default PostgreSQL configuration (paths shown for version 9.5):
- config:
/etc/postgresql/9.5/main
- data:
/var/lib/postgresql/9.5/main
- socket:
/var/run/postgresql
- port:
5432
Create a user (note: your USERNAME and PASSWORD must contain only lowercase characters):
sudo -i -u postgres
createuser YOUR_USERNAME --interactive --pwprompt
Setup Database:
- Create a file scrapeNews/envConfig.py and write the following inside it:
USERNAME = 'YOUR_USERNAME'
PASSWORD = 'YOUR_PASSWORD'
SPIDER_DETAILS = [
    # Add your spiders in the following format:
    # {'site_id': YOUR_SPIDER_ID, 'site_name': 'YOUR_SPIDER_NAME', 'site_url': 'YOUR_SPIDER_URL'}
    {'site_id': 100, 'site_name': 'EXAMPLE_SPIDER_NAME', 'site_url': 'http://www.example.com'},
    {'site_id': 101, 'site_name': 'Indian Express', 'site_url': 'http://indianexpress.com/section/technology/'},
    {'site_id': 102, 'site_name': 'India TV', 'site_url': 'http://www.indiatvnews.com/business/tech/'},
    {'site_id': 103, 'site_name': 'Time', 'site_url': 'http://time.com/section/tech/'},
    {'site_id': 104, 'site_name': 'NDTV', 'site_url': 'https://www.ndtv.com/latest/page-1'},
    {'site_id': 105, 'site_name': 'Inshorts', 'site_url': 'http://www.inshorts.com/en/read/'},
    {'site_id': 106, 'site_name': 'Zee News', 'site_url': 'http://zeenews.india.com/india'},
    {'site_id': 107, 'site_name': 'News18', 'site_url': 'http://www.news18.com/news/'},
    {'site_id': 108, 'site_name': 'moneyControl', 'site_url': 'http://www.moneycontrol.com/news/business/'},
    {'site_id': 109, 'site_name': 'oneindia', 'site_url': 'https://www.oneindia.com/india/'},
    {'site_id': 110, 'site_name': 'oneindia(hindi)', 'site_url': 'https://hindi.oneindia.com/news/india/'},
    {'site_id': 111, 'site_name': 'firstpost(hindi)', 'site_url': 'https://hindi.firstpost.com/category/latest/'},
    {'site_id': 112, 'site_name': 'firstpost(sports)', 'site_url': 'http://www.firstpost.com/category/sports/'},
    {'site_id': 113, 'site_name': 'newsx', 'site_url': 'http://www.newsx.com/latest-news/'},
    {'site_id': 114, 'site_name': 'hindustantimes', 'site_url': 'http://www.hindustantimes.com/editors-pick/'},
    {'site_id': 115, 'site_name': 'asianage', 'site_url': 'http://www.asianage.com/newsmakers'}
]
- Run setupDB.py to set up the database.
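The contents of setupDB.py are not shown here; as a rough, self-contained sketch of what a setup script like it could do, the snippet below creates a site table and inserts the SPIDER_DETAILS rows. sqlite3 stands in for PostgreSQL so the example runs without a server, and the table and column names are assumptions:

```python
import sqlite3

# Stand-in for `from scrapeNews.envConfig import SPIDER_DETAILS` (two sample rows).
SPIDER_DETAILS = [
    {'site_id': 101, 'site_name': 'Indian Express',
     'site_url': 'http://indianexpress.com/section/technology/'},
    {'site_id': 104, 'site_name': 'NDTV',
     'site_url': 'https://www.ndtv.com/latest/page-1'},
]

# The real script would connect to PostgreSQL with the USERNAME/PASSWORD
# from envConfig; an in-memory SQLite database is used here for illustration.
conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE IF NOT EXISTS site_info '
    '(site_id INTEGER PRIMARY KEY, site_name TEXT, site_url TEXT)'
)
conn.executemany(
    'INSERT INTO site_info VALUES (:site_id, :site_name, :site_url)',
    SPIDER_DETAILS,
)
conn.commit()
```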
Note: run spiders from the folder containing scrapy.cfg:
scrapy crawl SPIDER_NAME
SPIDER_NAME List:
- indianExpressTech
- indiaTv
- timeTech
- ndtv
- inshorts
- zeeNews
- News18Spider
- moneyControl
- oneindia
- oneindiaHindi
- firstpostHindi
- firstpostSports
- newsx
- hindustantimes
- asianage
- newsNation [In development]
Options:
- To set the number of pages to be scraped, pass the pages spider argument (no spaces around the =):
scrapy crawl SPIDER_NAME -a pages=X
(X = number of pages). Applicable for:
- indianExpressTech
- indiaTv
- timeTech
- moneyControl
- oneindia
- oneindiaHindi
- firstpostHindi
- firstpostSports
- newsx
- asianage
- ndtv
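Internally, a spider can use the pages argument to decide how many listing pages to request. As a hypothetical sketch, the helper below builds listing URLs from a `page-N` pattern (matching the NDTV URL in SPIDER_DETAILS; other sites use different patterns, so both the helper name and the pattern are assumptions):

```python
# Illustrative only: turn a base listing URL and a `pages` count into the
# list of listing-page URLs a spider might crawl.
def make_page_urls(base_url, pages):
    """Return listing-page URLs for pages 1..pages (assumed `page-N` pattern)."""
    # Strip any trailing slash and any existing /page-N suffix.
    root = base_url.rstrip('/').rsplit('/page-', 1)[0]
    return [f'{root}/page-{n}' for n in range(1, int(pages) + 1)]
```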
Happy collaborating!!