
solertis / parliament-scraper

This project forked from opntec/parliament-scraper


Public Data Scraper for Parliament Data for the EU and other Parliaments

License: MIT License

Ruby 12.75% Python 67.06% Scala 20.18%


parliament-scaper

Public Data Scraper for Parliament Data for the EU and other Parliaments

Ruby Based Crawler Setup

  1. Install git (if not present already)
  2. Clone the project using git clone https://github.com/fossasia/parliament-scaper.git
  3. Install Ruby (version >= 2.1) and Bundler
  4. Run bundle install to install the required gems
  5. Run the script using ruby eu_scraper.rb or ./eu_scraper.rb
  6. Find the scraped questions in the docs/ folder

Technologies Used in Ruby crawler:

  1. Ruby - The Language
  2. Nokogiri - For HTML Parsing

Scala-based Asynchronous crawler Setup

  1. Install sbt, git, and the latest version of Scala (sbt will do the update for you)
  2. git clone https://github.com/DengYiping/parliament-scaper.git
  3. sbt run
  4. sbt first downloads the necessary dependencies automatically, then runs the script.

Technologies Used in Scala crawler:

  1. Scala: a functional programming language on the JVM
  2. Akka: an effective framework for asynchronous, non-blocking, and event-driven programming in Scala
  3. Spray-client: a lightweight HTTP client based on the Akka Actor model.

Python Based Crawler Setup

  1. Install the requirements for this crawler with pip install -r requirements.txt
  2. Run $ python eu_scraper.py

Technologies Used in Python Crawler:

  1. Requests library
  2. lxml library for DOM traversal
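
To illustrate the kind of DOM traversal this crawler relies on, here is a minimal sketch that collects link targets from an HTML fragment. It deliberately uses the standard library's html.parser as a stand-in for lxml so that it runs without extra dependencies; the names LinkCollector, extract_links, and SAMPLE_HTML, as well as the sample markup, are hypothetical and not taken from eu_scraper.py.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag encountered while parsing."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html_text):
    """Return all link targets found in an HTML string, in document order."""
    collector = LinkCollector()
    collector.feed(html_text)
    return collector.links


# Hypothetical sample markup, not taken from the parliament's site
SAMPLE_HTML = """
<ul>
  <li><a href="/questions/1.html">Question 1</a></li>
  <li><a href="/questions/2.html">Question 2</a></li>
  <li><a>No target</a></li>
</ul>
"""

if __name__ == "__main__":
    print(extract_links(SAMPLE_HTML))
```

In the real crawler the HTML would come from a Requests response, and lxml would likely be the faster choice for large pages.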

Python-async parser setup

  1. Create a virtual environment inside the python-async folder with virtualenv --python=python3.4 venv
  2. Activate your virtual environment with source venv/bin/activate
  3. Install all appropriate requirements with pip install -r requirements.txt
  4. Run the parser with $ python parser.py

Changing the parser behavior

  • Change YEARS_TO_PARSE in order to parse data from different years
  • Change FOLDER_TO_DOWNLOAD in order to change the name of the folder to download the data into.
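
As a rough sketch of how these two settings might drive the parser, the snippet below derives output locations from them. The values shown, the assumption that YEARS_TO_PARSE is a list of integers, the one-subfolder-per-year layout, and the helper names output_path and planned_folders are all illustrative guesses; the actual definitions in parser.py may differ.

```python
import os

# Hypothetical values; the real defaults in parser.py may differ
YEARS_TO_PARSE = [2014, 2015]
FOLDER_TO_DOWNLOAD = "data"


def output_path(year, filename, base_folder=FOLDER_TO_DOWNLOAD):
    """Build the path where a document from a given year would be stored."""
    return os.path.join(base_folder, str(year), filename)


def planned_folders(years=YEARS_TO_PARSE, base_folder=FOLDER_TO_DOWNLOAD):
    """List one output folder per configured year (assumed layout)."""
    return [os.path.join(base_folder, str(year)) for year in years]


if __name__ == "__main__":
    for folder in planned_folders():
        print(folder)
```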

Technologies Used in Python-async parser:

  1. Requests + requests-futures for async requests
  2. threading for async downloading
  3. beautifulsoup4 for DOM parsing
  4. tqdm for progress bar
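
The general pattern behind thread-based concurrent downloading can be sketched with the standard library alone. This example uses concurrent.futures.ThreadPoolExecutor rather than requests-futures, and a fake fetch function instead of real HTTP calls, so it runs offline; download_all and fake_fetch are hypothetical names and this is not the parser's actual code.

```python
from concurrent.futures import ThreadPoolExecutor


def download_all(urls, fetch, max_workers=4):
    """Fetch every URL concurrently and return a {url: result} mapping.

    `fetch` is any callable that takes a URL and returns its content.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with urls
        results = list(pool.map(fetch, urls))
    return dict(zip(urls, results))


def fake_fetch(url):
    """Stand-in for a real HTTP request, so the sketch runs offline."""
    return "content of " + url


if __name__ == "__main__":
    pages = download_all(["page-1", "page-2", "page-3"], fake_fetch)
    for url, body in pages.items():
        print(url, "->", body)
```

Passing the fetch function in as a parameter keeps the concurrency logic separate from the HTTP library, so a Requests-based function could be swapped in without changing download_all.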

Python-Based Scraper (pol's scraper)

This scraper uses the BeautifulSoup package to parse and extract data from the parliament's site. The script can also calculate how many pages it has to download, based on the number of questions to be scraped.

  1. Install the requirements with pip install -r requirements.txt
  2. Run $ python scraper.py
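
The page-count calculation mentioned above amounts to a ceiling division. A minimal sketch, assuming the site lists a fixed number of questions per page; the function name pages_needed and the per-page figure used in the example are hypothetical and not taken from scraper.py:

```python
def pages_needed(total_questions, per_page):
    """Return how many result pages must be fetched to cover all questions."""
    if per_page <= 0:
        raise ValueError("per_page must be positive")
    if total_questions <= 0:
        return 0
    # Integer ceiling division: a partially filled last page still counts
    return (total_questions + per_page - 1) // per_page


if __name__ == "__main__":
    # 95 questions at an assumed 10 per page
    print(pages_needed(95, 10))
```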

Scrape it all - Generic Scraper (pol's scraper 2)

This scraper uses the BeautifulSoup package to parse and extract data from the parliament's site. The script can also calculate how many pages it has to download, based on the number of docs to be scraped.

Generic Scraper - All years, All languages. Scrapes entire database.

  1. Install the requirements with pip install -r requirements.txt
  2. Run $ python scrape_it_all.py

parliament-scraper's People

Contributors

dengyiping, jigyasa-grover, mariobehling, polbaladas, pythad, rhnvrm, sampritipanda, slayerjain, tabesin, yasoob
