
nit-in / pib

Download articles published by the Press Information Bureau, India. Follow the instructions below, or download the articles by month from the releases section.

Python 100.00%
python3 scrapy selenium pib sebi rbi upsc ssc news india

pib's Introduction

pib

PIB articles for the year - 2023


Dec, 2023
PIB_Daily - Dec 2023 PIB_Monthly - Dec 2023 PIB_Text - Dec 2023 PIB_Links - Dec 2023
Nov, 2023
PIB_Daily - Nov 2023 PIB_Monthly - Nov 2023 PIB_Text - Nov 2023 PIB_Links - Nov 2023
Oct, 2023
PIB_Daily - Oct 2023 PIB_Monthly - Oct 2023 PIB_Text - Oct 2023 PIB_Links - Oct 2023
Sep, 2023
PIB_Daily - Sep 2023 PIB_Monthly - Sep 2023 PIB_Text - Sep 2023 PIB_Links - Sep 2023
Aug, 2023
PIB_Daily - Aug 2023 PIB_Monthly - Aug 2023 PIB_Text - Aug 2023 PIB_Links - Aug 2023
Jul, 2023
PIB_Daily - Jul 2023 PIB_Monthly - Jul 2023 PIB_Text - Jul 2023 PIB_Links - Jul 2023
Jun, 2023
PIB_Daily - Jun 2023 PIB_Monthly - Jun 2023 PIB_Text - Jun 2023 PIB_Links - Jun 2023
May, 2023
PIB_Daily - May 2023 PIB_Monthly - May 2023 PIB_Text - May 2023 PIB_Links - May 2023
Apr, 2023
PIB_Daily - Apr 2023 PIB_Monthly - Apr 2023 PIB_Text - Apr 2023 PIB_Links - Apr 2023
Mar, 2023
PIB_Daily - Mar 2023 PIB_Monthly - Mar 2023 PIB_Text - Mar 2023 PIB_Links - Mar 2023
Feb, 2023
PIB_Daily - Feb 2023 PIB_Monthly - Feb 2023 PIB_Text - Feb 2023 PIB_Links - Feb 2023
Jan, 2023
PIB_Daily - Jan 2023 PIB_Monthly - Jan 2023 PIB_Text - Jan 2023 PIB_Links - Jan 2023

PIB articles for the year - 2022


Dec, 2022
PIB_Daily - Dec 2022 PIB_Monthly - Dec 2022 PIB_Text - Dec 2022 PIB_Links - Dec 2022
Nov, 2022
PIB_Daily - Nov 2022 PIB_Monthly - Nov 2022 PIB_Text - Nov 2022 PIB_Links - Nov 2022
Oct, 2022
PIB_Daily - Oct 2022 PIB_Monthly - Oct 2022 PIB_Text - Oct 2022 PIB_Links - Oct 2022
Sep, 2022
PIB_Daily - Sep 2022 PIB_Monthly - Sep 2022 PIB_Text - Sep 2022 PIB_Links - Sep 2022
Aug, 2022
PIB_Daily - Aug 2022 PIB_Monthly - Aug 2022 PIB_Text - Aug 2022 PIB_Links - Aug 2022
Jul, 2022
PIB_Daily - Jul 2022 PIB_Monthly - Jul 2022 PIB_Text - Jul 2022 PIB_Links - Jul 2022
Jun, 2022
PIB_Daily - Jun 2022 PIB_Monthly - Jun 2022 PIB_Text - Jun 2022 PIB_Links - Jun 2022
May, 2022
PIB_Daily - May 2022 PIB_Monthly - May 2022 PIB_Text - May 2022 PIB_Links - May 2022
Apr, 2022
PIB_Daily - Apr 2022 PIB_Monthly - Apr 2022 PIB_Text - Apr 2022 PIB_Links - Apr 2022
Mar, 2022
PIB_Daily - Mar 2022 PIB_Monthly - Mar 2022 PIB_Text - Mar 2022 PIB_Links - Mar 2022
Feb, 2022
PIB_Daily - Feb 2022 PIB_Monthly - Feb 2022 PIB_Text - Feb 2022 PIB_Links - Feb 2022
Jan, 2022
PIB_Daily - Jan 2022 PIB_Monthly - Jan 2022 PIB_Text - Jan 2022 PIB_Links - Jan 2022

PIB articles for the year - 2021


Dec, 2021
PIB_Daily - Dec 2021 PIB_Monthly - Dec 2021 PIB_Text - Dec 2021 PIB_Links - Dec 2021
Nov, 2021
PIB_Daily - Nov 2021 PIB_Monthly - Nov 2021 PIB_Text - Nov 2021 PIB_Links - Nov 2021
Oct, 2021
PIB_Daily - Oct 2021 PIB_Monthly - Oct 2021 PIB_Text - Oct 2021 PIB_Links - Oct 2021
Sep, 2021
PIB_Daily - Sep 2021 PIB_Monthly - Sep 2021 PIB_Text - Sep 2021 PIB_Links - Sep 2021
Aug, 2021
PIB_Daily - Aug 2021 PIB_Monthly - Aug 2021 PIB_Text - Aug 2021 PIB_Links - Aug 2021
Jul, 2021
PIB_Daily - Jul 2021 PIB_Monthly - Jul 2021 PIB_Text - Jul 2021 PIB_Links - Jul 2021
Jun, 2021
PIB_Daily - Jun 2021 PIB_Monthly - Jun 2021 PIB_Text - Jun 2021 PIB_Links - Jun 2021
May, 2021
PIB_Daily - May 2021 PIB_Monthly - May 2021 PIB_Text - May 2021 PIB_Links - May 2021
Apr, 2021
PIB_Daily - Apr 2021 PIB_Monthly - Apr 2021 PIB_Text - Apr 2021 PIB_Links - Apr 2021
Mar, 2021
PIB_Daily - Mar 2021 PIB_Monthly - Mar 2021 PIB_Text - Mar 2021 PIB_Links - Mar 2021
Feb, 2021
PIB_Daily - Feb 2021 PIB_Monthly - Feb 2021 PIB_Text - Feb 2021 PIB_Links - Feb 2021
Jan, 2021
PIB_Daily - Jan 2021 PIB_Monthly - Jan 2021 PIB_Text - Jan 2021 PIB_Links - Jan 2021

PIB articles for the year - 2020


Dec, 2020
PIB_Daily - Dec 2020 PIB_Monthly - Dec 2020 PIB_Text - Dec 2020 PIB_Links - Dec 2020
Nov, 2020
PIB_Daily - Nov 2020 PIB_Monthly - Nov 2020 PIB_Text - Nov 2020 PIB_Links - Nov 2020
Oct, 2020
PIB_Daily - Oct 2020 PIB_Monthly - Oct 2020 PIB_Text - Oct 2020 PIB_Links - Oct 2020
Sep, 2020
PIB_Daily - Sep 2020 PIB_Monthly - Sep 2020 PIB_Text - Sep 2020 PIB_Links - Sep 2020
Aug, 2020
PIB_Daily - Aug 2020 PIB_Monthly - Aug 2020 PIB_Text - Aug 2020 PIB_Links - Aug 2020
Jul, 2020
PIB_Daily - Jul 2020 PIB_Monthly - Jul 2020 PIB_Text - Jul 2020 PIB_Links - Jul 2020
Jun, 2020
PIB_Daily - Jun 2020 PIB_Monthly - Jun 2020 PIB_Text - Jun 2020 PIB_Links - Jun 2020
May, 2020
PIB_Daily - May 2020 PIB_Monthly - May 2020 PIB_Text - May 2020 PIB_Links - May 2020
Apr, 2020
PIB_Daily - Apr 2020 PIB_Monthly - Apr 2020 PIB_Text - Apr 2020 PIB_Links - Apr 2020
Mar, 2020
PIB_Daily - Mar 2020 PIB_Monthly - Mar 2020 PIB_Text - Mar 2020 PIB_Links - Mar 2020
Feb, 2020
PIB_Daily - Feb 2020 PIB_Monthly - Feb 2020 PIB_Text - Feb 2020 PIB_Links - Feb 2020
Jan, 2020
PIB_Daily - Jan 2020 PIB_Monthly - Jan 2020 PIB_Text - Jan 2020 PIB_Links - Jan 2020

Download articles from the Press Information Bureau, India. This might be helpful for candidates preparing for various government examinations.

Downloading Articles:

There are three ways to get the articles:

  1. Join the Telegram channel to get PIB articles daily.
  2. Clone this repo and run the spider by following the instructions given below.
  3. Download the articles directly from the releases section, or click the badge above for the year and month you want.

There are four kinds of files in the releases section for every month:

  1. Day wise PIB_Daily | MMM_YYYY : These zips contain the PIB articles for the date DD/MMM/YYYY
  2. Month wise PIB_Monthly | MMM_YYYY : These zips contain the PIB articles for the whole month MMM/YYYY
  3. Text files PIB_Text | MMM_YYYY : These zips contain the text extracted from the PDFs for the whole month MMM/YYYY
  4. Article link files PIB_LINKS | MMM_YYYY : These text files contain links to the articles for the date DD/MMM/YYYY
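
The MMM_YYYY part of the names above can be derived from a date. A minimal sketch, assuming the English three-letter month abbreviations used in the badges (Jan, Feb, ...); the helper name `month_label` is hypothetical and not part of the repo:

```python
from datetime import date

# English month abbreviations, fixed here so the result does not
# depend on the system locale.
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def month_label(day: date) -> str:
    """Return the MMM_YYYY label for the month containing `day` (hypothetical helper)."""
    return f"{MONTHS[day.month - 1]}_{day.year}"


print(month_label(date(2021, 6, 15)))  # Jun_2021
```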

How to use:

#Clone the repo with:
git clone https://github.com/nit-in/pib
#cd to the cloned repo
cd pib
#installing required packages
pip install -r requirements.txt
#when these steps are done, you are ready to run the spider and download the articles

#source the env file
source .env

#if you are using a shell other than bash, run this instead
bash --init-file .env

#run the spider
pib start_date end_date #(date format: yyyy-mm-dd)
#example => to download the articles from June 1st, 2021 to June 15th, 2021; use
pib 2021-06-01 2021-06-15

For an Entire Month

#For the month of Jan, 2021
pib_month Jan 2021

#For the month of Dec, 2020
pib_month Dec 2020

#For present day
pib_today

#For last day
pib_last_day
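
The date-range and whole-month commands above both come down to a list of days in yyyy-mm-dd format. The sketch below shows one way to compute such a list in Python. It is an illustration only, not the repo's actual code, and the function names `days_between` and `days_in_month` are hypothetical.

```python
import calendar
from datetime import date, timedelta

# English month abbreviations as used by the pib_month command (Jan, Dec, ...).
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def days_between(start: str, end: str) -> list:
    """Return every day from start to end (inclusive) as yyyy-mm-dd strings."""
    first = date.fromisoformat(start)
    last = date.fromisoformat(end)
    if last < first:
        raise ValueError("end_date must not be earlier than start_date")
    count = (last - first).days + 1
    return [(first + timedelta(days=i)).isoformat() for i in range(count)]


def days_in_month(month: str, year: int) -> list:
    """Return every day of the given month, e.g. days_in_month("Jan", 2021)."""
    number = MONTHS.index(month) + 1
    # monthrange returns (weekday of the first day, number of days in the month).
    last_day = calendar.monthrange(year, number)[1]
    return days_between(date(year, number, 1).isoformat(),
                        date(year, number, last_day).isoformat())


print(days_between("2021-06-01", "2021-06-15")[:2])  # ['2021-06-01', '2021-06-02']
print(len(days_in_month("Jan", 2021)))               # 31
```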

You can also fork the repo and run the download from GitHub Actions (you may have to add a repo secret named PIB first):

Actions > Select "PIB_INDIA_MINISTRY" > Run workflow
Then select the appropriate options and run the workflow.


To run locally

source .env
pib_min start_date end_date ministry_code

To find the code for the desired ministry, check the ministries.txt file.
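
Looking up a ministry code by hand can be tedious. The sketch below searches the lines of ministries.txt for a keyword. The layout of that file is not described in this README, so the function makes no assumption about it: it simply returns every line containing the keyword and leaves reading the code to you. The function name `find_ministry` is hypothetical.

```python
from pathlib import Path


def find_ministry(keyword: str, lines: list) -> list:
    """Return the lines that contain `keyword`, ignoring case."""
    needle = keyword.casefold()
    return [line.strip() for line in lines if needle in line.casefold()]


if __name__ == "__main__":
    path = Path("ministries.txt")  # file named in the README; run from the repo root
    if path.exists():
        lines = path.read_text(encoding="utf-8").splitlines()
        for match in find_ministry("health", lines):
            print(match)
```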

Read this if you are forking the repo

You may run into trouble with GitHub tokens when you run the workflow from a fork. To fix this, create a token and add it as a repo secret:

  1. Click your profile icon > Settings > Developer settings > Personal access tokens > Generate new token (classic).
  2. Name the token, set it to no expiration and, under scopes, select repo, workflow, write:packages and delete:packages.
  3. Generate the token and copy it.
  4. In the pib repo, go to Settings > Secrets and variables > Actions > New repository secret.
  5. Enter PIB (all in caps) as the name and paste the token you copied as the secret.

Any suggestions and improvements are welcome.


pib's Issues

Facing an issue

After running the spider, this error is shown:

RajatTanwar@X MINGW64 ~
$ cd pib

RajatTanwar@X MINGW64 ~/pib (master)
$ source .env

             To download PIB articles for a range of date do the following...

                             pib startDate endDate (format: yyyy-mm-dd)

                             for example to download articles from June 1st,2021 to Jun 15th,2021 enter

                             pib 2021-06-01 2021-07-01


            for Entire Month of 2021 Enter (for example month of January)
                             pib_month Jan 2021

RajatTanwar@X MINGW64 ~/pib (master)
$ pib 2021-06-01 2021-06-15
2021-06-01
Unhandled error in Deferred:

Traceback (most recent call last):
  File "c:\users\rajattanwar\appdata\local\programs\python\python37\lib\site-packages\scrapy\crawler.py", line 192, in crawl
    return self._crawl(crawler, *args, **kwargs)
  File "c:\users\rajattanwar\appdata\local\programs\python\python37\lib\site-packages\scrapy\crawler.py", line 196, in _crawl
    d = crawler.crawl(*args, **kwargs)
  File "c:\users\rajattanwar\appdata\local\programs\python\python37\lib\site-packages\twisted\internet\defer.py", line 1656, in unwindGenerator
    return _cancellableInlineCallbacks(gen)
  File "c:\users\rajattanwar\appdata\local\programs\python\python37\lib\site-packages\twisted\internet\defer.py", line 1571, in _cancellableInlineCallbacks
    _inlineCallbacks(None, g, status)
--- <exception caught here> ---
  File "c:\users\rajattanwar\appdata\local\programs\python\python37\lib\site-packages\twisted\internet\defer.py", line 1445, in _inlineCallbacks
    result = current_context.run(g.send, result)
  File "c:\users\rajattanwar\appdata\local\programs\python\python37\lib\site-packages\scrapy\crawler.py", line 87, in crawl
    self.engine = self._create_engine()
  File "c:\users\rajattanwar\appdata\local\programs\python\python37\lib\site-packages\scrapy\crawler.py", line 101, in _create_engine
    return ExecutionEngine(self, lambda _: self.stop())
  File "c:\users\rajattanwar\appdata\local\programs\python\python37\lib\site-packages\scrapy\core\engine.py", line 69, in __init__
    self.downloader = downloader_cls(crawler)
  File "c:\users\rajattanwar\appdata\local\programs\python\python37\lib\site-packages\scrapy\core\downloader\__init__.py", line 83, in __init__
    self.middleware = DownloaderMiddlewareManager.from_crawler(crawler)
  File "c:\users\rajattanwar\appdata\local\programs\python\python37\lib\site-packages\scrapy\middleware.py", line 53, in from_crawler
    return cls.from_settings(crawler.settings, crawler)
  File "c:\users\rajattanwar\appdata\local\programs\python\python37\lib\site-packages\scrapy\middleware.py", line 35, in from_settings
    mw = create_instance(mwcls, settings, crawler)
  File "c:\users\rajattanwar\appdata\local\programs\python\python37\lib\site-packages\scrapy\utils\misc.py", line 167, in create_instance
    instance = objcls.from_crawler(crawler, *args, **kwargs)
  File "c:\users\rajattanwar\appdata\local\programs\python\python37\lib\site-packages\scrapy_selenium\middlewares.py", line 71, in from_crawler
    browser_executable_path=browser_executable_path
  File "c:\users\rajattanwar\appdata\local\programs\python\python37\lib\site-packages\scrapy_selenium\middlewares.py", line 51, in __init__
    self.driver = driver_klass(**driver_kwargs)
  File "c:\users\rajattanwar\appdata\local\programs\python\python37\lib\site-packages\selenium\webdriver\chrome\webdriver.py", line 73, in __init__
    self.service.start()
  File "c:\users\rajattanwar\appdata\local\programs\python\python37\lib\site-packages\selenium\webdriver\common\service.py", line 76, in start
    stdin=PIPE)
  File "c:\users\rajattanwar\appdata\local\programs\python\python37\lib\subprocess.py", line 775, in __init__
    restore_signals, start_new_session)
  File "c:\users\rajattanwar\appdata\local\programs\python\python37\lib\subprocess.py", line 1178, in _execute_child
    startupinfo)
builtins.OSError: [WinError 193] %1 is not a valid Win32 application

2021-06-02
Unhandled error in Deferred:

builtins.OSError: [WinError 193] %1 is not a valid Win32 application

executing in Git-2.32.0.2-64-bit.exe
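
WinError 193 is raised by Windows when the file being launched is not a valid Windows executable. In the traceback above it occurs while Selenium starts the browser driver process. One possible cause (an assumption, not confirmed in this thread) is a driver binary built for another operating system. The sketch below inspects the first bytes of a file to tell a Windows executable from a Linux one. The function name `binary_kind` and the example path are hypothetical.

```python
from pathlib import Path


def binary_kind(path: str) -> str:
    """Classify a file by its leading magic bytes."""
    with open(path, "rb") as handle:
        magic = handle.read(4)
    if magic[:2] == b"MZ":
        return "windows"  # DOS/PE header used by Windows .exe files
    if magic == b"\x7fELF":
        return "linux"    # ELF header used by Linux binaries
    return "unknown"      # anything else, including scripts and text files


if __name__ == "__main__":
    driver = Path("chromedriver.exe")  # hypothetical location of the driver
    if driver.exists():
        print(driver, "->", binary_kind(str(driver)))
```

If the driver turns out not to be a Windows binary, replacing it with the Windows build of the same driver would be the first thing to try.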

info

Filter ministry wise and specific dates

Hi Nitin,
Good work in scraping PIB news articles. Is it possible to scrape ministry-wise and filter for specific dates?
For example, Ministry of Finance from 1 September 2023 to 1 October 2023. It would be a helpful feature. Thank you!

Issue in downloading the PDFs

I tried to run the command pib_min 2023-01-01 2023-01-31 31 and got the following log:

downloading /home/users/akhilvssg/pib/2023/Jan/31/Ministry_of_Health_and_Family_Welfare/Union_Minister_of_Health__Family_Welfare_Dr_Mansukh_Mandaviya_unveils_FSSAIs_Calendar__1895167.pdf ....

But when I looked in the folder, no PDFs had been downloaded; it is empty.

I followed the method described in the readme file, but I am not able to get the PDFs or the information for the specific data.
Please look into this issue and resolve it as early as possible.
Thank you
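
To confirm whether any PDFs were actually written, it can help to list them with their sizes instead of browsing folders. A minimal sketch, assuming the output lives under a year folder such as the pib/2023 directory seen in the log; the function name `list_pdfs` is hypothetical.

```python
from pathlib import Path


def list_pdfs(root: str) -> list:
    """Return (relative path, size in bytes) for every .pdf below `root`, sorted by path."""
    base = Path(root)
    if not base.is_dir():
        return []
    found = [(str(p.relative_to(base)), p.stat().st_size)
             for p in base.rglob("*.pdf") if p.is_file()]
    return sorted(found)


if __name__ == "__main__":
    # Hypothetical: run from inside the pib folder after a download.
    for name, size in list_pdfs("2023"):
        flag = "EMPTY" if size == 0 else ""
        print(f"{size:>10}  {name}  {flag}")
```

An empty result means nothing was saved, while zero-byte entries would suggest the files were created but the download itself failed.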
