
Newspaper Crawler Scripts

A set of scripts for crawling newspaper websites. The available scripts are listed below.

Setup

pip3 install -r requirements.txt

Todo

[ ] Extract common code into a decorator
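
A rough sketch of what that refactor could look like, assuming the common code is the fetch-and-retry logic; the names retried and fetch_page are hypothetical, not from the repo:

import functools
import time

import requests


def retried(times=3, delay=2):
    # hypothetical decorator: retry a flaky network call before giving up
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, times + 1):
                try:
                    return func(*args, **kwargs)
                except requests.RequestException:
                    if attempt == times:
                        raise
                    time.sleep(delay)
        return wrapper
    return decorator


@retried(times=3)
def fetch_page(url):
    # hypothetical shared helper that every crawler could reuse
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text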

Contribute

Scripts for more news websites are welcome. Please save the scraped text in UTF-8 encoding, and refer to the newspapers list file to pick one to scrape.

Latest Script

crawler-oneindia.py under malayalam/ has the latest code; you can use it as a template for future crawlers.

Directory structure

<newspaper_name>
  title.list --> acts as an index for the other directories.
  articles
  -- 2018
  ---- Dec
  ---- May
  -- 2017
  ---- Jun
  ---- Aug
  -- 2016
  ---- Oct
  ---- Jan
  abstracts
  -- 2018
  ---- Dec
  ---- May
  -- 2017
  ---- Jun
  ---- Aug
  -- 2016
  ---- Oct
  ---- Jan  
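
As a rough illustration of how a crawler might write into this layout (the helper below is hypothetical; the UTF-8 requirement comes from the Contribute section above):

import os

def save_article(newspaper, year, month, article_id, text):
    # e.g. <newspaper>/articles/2018/Dec/<article_id>.txt
    directory = os.path.join(newspaper, 'articles', str(year), month)
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, '{}.txt'.format(article_id))
    with open(path, 'w', encoding='utf-8') as f:
        f.write(text)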

Available scripts

Tamil

Site                 URL                              Script
Nakkheeran           http://nakkheeran.in/            tamil/crawler-nakkheeran.py
Dailythanthi         http://dailythanthi.com/         tamil/crawler-dailythanthi.py
Tamil The Hindu      http://tamil.thehindu.com/       tamil/crawler-tamil-hindu.py
Puthiyathalaimurai   http://puthiyathalaimurai.com/   tamil/crawler-puthiyathalaimurai.py
Dinamani             http://dinamani.com/             tamil/crawler-dinamani.py

Malayalam

Site           URL                               Script
Manorama       http://www.manoramaonline.com/    malayalam/crawler-manorama.py
Asianet News   https://www.asianetnews.com/      malayalam/crawler-asianet.py
One India      https://malayalam.oneindia.com/   malayalam/crawler-oneindia.py

Bengali

Site          URL                           Script
Anandabazar   https://www.anandabazar.com   Bengali/crawler-anandabazar.py
Aajkal        https://www.aajkaal.in        Bengali/crawler-aajkal.py

Konkani

Site             URL                                           Script
Konkani Kaniyo   http://konkani-kaniyo-in-nagri.blogspot.com   konkani/crawler-konkani-kaniyo.py

Marathi

Site               URL                                        Script
Lokmat             http://www.lokmat.com/                     marathi/crawler-lokmat.py
Maharashtratimes   https://maharashtratimes.indiatimes.com/   marathi/crawler-maharashtratimes.py
Loksatta           https://www.loksatta.com                   marathi/crawler-loksatta.py
ABPmajha           https://abpmajha.abplive.in                marathi/crawler-abpmajha.py

Contributors

adamshamsudeen, anoopmsivadas, aswindinesh, athj, danimg95, dependabot[bot], hardipinders, husain-zaidi, jaseemck, meain, nike47, pranshul972, pythagaurang, rudrakshk, sayoni26, simmranvermaa, srihari-palivela, subins2000, utkarsh1800, vanangamudi, vipulchodankar

Issues

Directory creation on Windows: error

The scripts build paths with '/', the Linux separator, whereas Windows uses '\'. Instead of hard-coding the separator for Windows and Linux separately, we'd like the scripts to use os.path.join, which picks the right separator for the OS they run on.
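
A minimal before/after sketch of the suggested change (the path pieces are illustrative):

import os

newspaper = 'dinamani'  # illustrative name

# brittle: hard-codes the Unix separator, so it breaks on Windows
unix_only = newspaper + '/articles/2018/Dec'

# portable: os.path.join picks the separator of the host OS
portable = os.path.join(newspaper, 'articles', '2018', 'Dec')
os.makedirs(portable, exist_ok=True)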

Should we migrate to Scrapy?

Scrapy behaves like a scriptable, command-line browser, and its async support is a compelling feature for crawling large news sites. But the current scripts use BeautifulSoup4, which is sufficient for our purposes: we are not going to scrape 24x7, and once scraping is done, it is a done deal.
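
For reference, the current approach is roughly this pattern; the URL and selectors here are placeholders, not taken from an actual crawler:

import requests
from bs4 import BeautifulSoup

# fetch one article page and extract its title and body text
response = requests.get('http://example.com/article')  # placeholder URL
soup = BeautifulSoup(response.text, 'html.parser')
heading = soup.find('h1')
title = heading.get_text(strip=True) if heading else ''
body = '\n'.join(p.get_text(strip=True) for p in soup.find_all('p'))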

@adamshamsudeen: your thoughts?

How are the crawler scripts run?

I was trying to write a scraper for dheshabhimani.com.

I copied a script from the tamil folder, but running it yields:

ModuleNotFoundError: No module named 'config'

How are the crawler scripts run? config.py is in the root folder. A bit of documentation on this would be helpful.
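
If config.py is importable only from the repository root, one likely fix is to run the script from the root with the root on the import path; the script name below is just an example:

PYTHONPATH=. python3 tamil/crawler-dinamani.py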

Limiting URLs for testing: make MAX_COUNT configurable via CLI arguments

I need to test whether the script works, i.e. the extraction of dates, headlines, and so on.

But it seems to download everything before the extraction part runs; it has been going for more than 30 minutes now.

Is there a way to limit the crawled URLs so that I can confirm the script works? I don't have the resources for a full crawl.

I'm using MultiThreadedCrawler2.
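
A possible shape for such a flag, using argparse; the flag name and default are assumptions:

import argparse

parser = argparse.ArgumentParser(description='Newspaper crawler')
parser.add_argument('--max-count', type=int, default=0,
                    help='stop after crawling this many URLs (0 means no limit)')
args = parser.parse_args()
MAX_COUNT = args.max_count  # crawlers would check this in their crawl loop

A script would then be invoked as, e.g., python3 malayalam/crawler-oneindia.py --max-count 50.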
