hansard_data's Introduction

Singapore Hansard Data

  • Parliament records scraped from 2005-01-12 to 2021-03-05 and stored in hansard_full.zip
  • Each session is captured in a JSON file
    • Sessions before 2012-09-10 use a different format: the entire HTML file is stored in the JSON instead
    • From 2012-09-10, JSON file contains the following:
      • Metadata including date of session, start time, session name
      • Attendance List
      • Permission for MPs to be absent
      • Assent to Bills Passed
      • Actual discussion by MPs with appropriate title and rich formatting
  • Some example Bills, Motions and Oral Answers included
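Because the JSON layout depends on the session date, it can help to decide up front which kind of file you are dealing with. The following is a minimal sketch rather than part of the repository: the function names are hypothetical, and `iter_sessions` assumes the archive simply contains `.json` files, which the description above does not confirm.

```python
import json
import zipfile
from datetime import date

# Dates taken from the description above.
FIRST_SESSION = date(2005, 1, 12)
LAST_SESSION = date(2021, 3, 5)
STRUCTURED_FROM = date(2012, 9, 10)


def session_format(session_date):
    """Return "html" for sessions before 2012-09-10, else "structured"."""
    if not FIRST_SESSION <= session_date <= LAST_SESSION:
        raise ValueError(f"{session_date} is outside the scraped range")
    return "structured" if session_date >= STRUCTURED_FROM else "html"


def iter_sessions(zip_path="hansard_full.zip"):
    """Yield (member name, parsed JSON) for every .json member of the archive.

    Assumption: sessions are stored as .json members inside the zip;
    the actual directory layout may differ.
    """
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if name.endswith(".json"):
                with archive.open(name) as handle:
                    yield name, json.load(handle)
```

For example, `session_format(date(2015, 1, 19))` returns `"structured"`, so that file should carry the metadata, attendance and discussion fields listed above, while a 2010 session would need its stored HTML parsed instead.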

Cleaning of data

The included format_hansard.py script extracts speeches and standardises the formatting of the JSON files, writing the formatted files to an output folder. By default, with no flags, all peripheral text is removed. Separate flags handle partial Hansard sessions (e.g. Bills, Motions and Oral Answers), and the -g flag enables more granular settings.


Annotation of data

A formatted_to_txt.py file is included to turn the formatted JSON file into text for annotation.

The Annotated folder contains a config.json file to be used with ner-annotator. The annotated files are also included and carry the _annotated.json suffix.

Annotation followed these steps:

  1. Download JSON file
  2. Format JSON file using format_hansard.py: python format_hansard.py step_1_file.json -f, which will give you a step_2_formatted.json file
  3. Convert to text: python formatted_to_txt.py step_2_formatted.json, which will give you a step_3_text.txt file
  4. Use ner_annotator: ner_annotator step_3_text.txt -c config.json -m hansard
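Steps 2 to 4 can be chained from a small driver script. This is only a sketch under assumptions: the helper names are hypothetical, the file names are the placeholder names used in the steps above, and how the two scripts actually name their output files is not confirmed here, so adjust the paths to whatever they produce.

```python
import subprocess


def build_commands(raw_json="step_1_file.json",
                   formatted_json="step_2_formatted.json",
                   text_file="step_3_text.txt",
                   config="config.json"):
    """Return the command lines for steps 2-4, in order."""
    return [
        # Step 2: format the raw JSON file (flag as shown in the steps above).
        ["python", "format_hansard.py", raw_json, "-f"],
        # Step 3: convert the formatted JSON file to plain text.
        ["python", "formatted_to_txt.py", formatted_json],
        # Step 4: open the text in the annotator with the provided config.
        ["ner_annotator", text_file, "-c", config, "-m", "hansard"],
    ]


def run_pipeline(**paths):
    """Run each step in turn, stopping at the first one that fails."""
    for command in build_commands(**paths):
        subprocess.run(command, check=True)
```

Calling `run_pipeline()` from the folder containing the scripts runs the three commands in sequence; `check=True` raises `subprocess.CalledProcessError` if any step exits with a non-zero status, so a formatting failure does not silently feed a stale file to the annotator.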

hansard_data's People

Contributors

nigelnnk

hansard_data's Issues

Request for scraper

Hi, I'm a Singaporean civil servant working on a data project to analyse parliamentary questions and replies for work. I happened upon this project while trying to build my own scraper, and was wondering if you could guide me on how you scraped Hansard? I would like to scrape to an updated period (until 2023) and was hoping I could borrow your old code and update it myself. Looked through the files but the format_hansard file didn't seem to be the one scraping? Appreciate your response! I can also drop my private/work email if it's easier to communicate there. Thank you!
