
Newspaper Crawler Scripts

A set of scripts for crawling newspaper websites. The available scripts are listed below.

Setup

pip3 install -r requirements.txt

Todo

[ ] Extract common code into a decorator
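
A rough sketch of what that refactor could look like, assuming the common code is the fetch-and-retry logic; the names retried and fetch_page are hypothetical, not from the repo:

import functools
import time

import requests


def retried(times=3, delay=2):
    # hypothetical decorator: retry a flaky network call before giving up
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, times + 1):
                try:
                    return func(*args, **kwargs)
                except requests.RequestException:
                    if attempt == times:
                        raise
                    time.sleep(delay)
        return wrapper
    return decorator


@retried(times=3)
def fetch_page(url):
    # hypothetical shared helper that every crawler could reuse
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text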

Contribute

Scripts for more news websites are welcome. Please save the scraped text in UTF-8 encoding, and refer to the newspapers list file to pick one to scrape.

Latest Script

crawler-oneindia.py under malayalam/ has the latest code; you can use it as a template for future crawlers.

Directory structure

<newspaper_name>
  title.list --> acts as an index for the other directories.
  articles
  -- 2018
  ---- Dec
  ---- May
  -- 2017
  ---- Jun
  ---- Aug
  -- 2016
  ---- Oct
  ---- Jan
  abstracts
  -- 2018
  ---- Dec
  ---- May
  -- 2017
  ---- Jun
  ---- Aug
  -- 2016
  ---- Oct
  ---- Jan  
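
As a rough illustration of how a crawler might write into this layout (the helper below is hypothetical; the UTF-8 requirement comes from the Contribute section above):

import os

def save_article(newspaper, year, month, article_id, text):
    # e.g. <newspaper>/articles/2018/Dec/<article_id>.txt
    directory = os.path.join(newspaper, 'articles', str(year), month)
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, '{}.txt'.format(article_id))
    with open(path, 'w', encoding='utf-8') as f:
        f.write(text)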

Available scripts

Tamil

Site                 URL                              Script
Nakkheeran           http://nakkheeran.in/            tamil/crawler-nakkheeran.py
Dailythanthi         http://dailythanthi.com/         tamil/crawler-dailythanthi.py
Tamil The Hindu      http://tamil.thehindu.com/       tamil/crawler-tamil-hindu.py
Puthiyathalaimurai   http://puthiyathalaimurai.com/   tamil/crawler-puthiyathalaimurai.py
Dinamani             http://dinamani.com/             tamil/crawler-dinamani.py

Malayalam

Site           URL                               Script
Manorama       http://www.manoramaonline.com/    malayalam/crawler-manorama.py
Asianet News   https://www.asianetnews.com/      malayalam/crawler-asianet.py
One India      https://malayalam.oneindia.com/   malayalam/crawler-oneindia.py

Bengali

Site          URL                           Script
Anandabazar   https://www.anandabazar.com   Bengali/crawler-anandabazar.py
Aajkal        https://www.aajkaal.in        Bengali/crawler-aajkal.py

Konkani

Site             URL                                           Script
Konkani Kaniyo   http://konkani-kaniyo-in-nagri.blogspot.com   konkani/crawler-konkani-kaniyo.py

Marathi

Site               URL                                        Script
Lokmat             http://www.lokmat.com/                     marathi/crawler-lokmat.py
Maharashtratimes   https://maharashtratimes.indiatimes.com/   marathi/crawler-maharashtratimes.py
Loksatta           https://www.loksatta.com                   marathi/crawler-loksatta.py
ABPmajha           https://abpmajha.abplive.in                marathi/crawler-abpmajha.py

Contributors

adamshamsudeen, anoopmsivadas, aswindinesh, athj, danimg95, dependabot[bot], hardipinders, husain-zaidi, jaseemck, meain, nike47, pranshul972, pythagaurang, rudrakshk, sayoni26, simmranvermaa, srihari-palivela, subins2000, utkarsh1800, vanangamudi, vipulchodankar

Issues

Directory creation on Windows: error

The scripts build paths with '/', the Linux separator, whereas Windows uses '\'. Instead of hard-coding the separator for Windows and Linux separately, we'd like the scripts to use os.path.join, which picks the right separator for the OS they run on.
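
A minimal before/after sketch of the suggested change (the path pieces are illustrative):

import os

newspaper = 'dinamani'  # illustrative name

# brittle: hard-codes the Unix separator, so it breaks on Windows
unix_only = newspaper + '/articles/2018/Dec'

# portable: os.path.join picks the separator of the host OS
portable = os.path.join(newspaper, 'articles', '2018', 'Dec')
os.makedirs(portable, exist_ok=True)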

Should we migrate to Scrapy?

Scrapy behaves like a scriptable, command-line browser, and its async support is a compelling feature for crawling large news sites. But the current scripts use BeautifulSoup4, which is sufficient for our purposes: we are not going to scrape 24x7, and once scraping is done, it is a done deal.
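
For reference, the current approach is roughly this pattern; the URL and selectors here are placeholders, not taken from an actual crawler:

import requests
from bs4 import BeautifulSoup

# fetch one article page and extract its title and body text
response = requests.get('http://example.com/article')  # placeholder URL
soup = BeautifulSoup(response.text, 'html.parser')
heading = soup.find('h1')
title = heading.get_text(strip=True) if heading else ''
body = '\n'.join(p.get_text(strip=True) for p in soup.find_all('p'))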

@adamshamsudeen: your thoughts?

How are the crawler scripts run?

I was trying to write a scraper for dheshabhimani.com.

I copied a script from the tamil folder, but running it yields:

ModuleNotFoundError: No module named 'config'

How are the crawler scripts run? config.py is in the root folder. A bit of documentation on this would be helpful.
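
If config.py is importable only from the repository root, one likely fix is to run the script from the root with the root on the import path; the script name below is just an example:

PYTHONPATH=. python3 tamil/crawler-dinamani.py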

Limiting URLs for testing: make MAX_COUNT configurable via CLI arguments

I need to test whether the script works, i.e. the extraction of dates, headlines, and so on.

But it seems to download everything before the extraction part runs; it has been going for more than 30 minutes now.

Is there a way to limit the crawled URLs so that I can confirm the script works? I don't have the resources for a full crawl.

I'm using MultiThreadedCrawler2.
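
A possible shape for such a flag, using argparse; the flag name and default are assumptions:

import argparse

parser = argparse.ArgumentParser(description='Newspaper crawler')
parser.add_argument('--max-count', type=int, default=0,
                    help='stop after crawling this many URLs (0 means no limit)')
args = parser.parse_args()
MAX_COUNT = args.max_count  # crawlers would check this in their crawl loop

A script would then be invoked as, e.g., python3 malayalam/crawler-oneindia.py --max-count 50.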
