hansard_data's Introduction

Singapore Hansard Data

- Parliament records scraped from 2005-01-12 to 2021-03-05 are stored in hansard_full.zip
- Each session is captured in a JSON file
  - For sessions before 2012-09-10 the format differs: the JSON stores the entire HTML file instead
  - From 2012-09-10, each JSON file contains the following:
    - Metadata including date of session, start time, session name
    - Attendance List
    - Permission for MPs to be absent
    - Assent to Bills Passed
    - Actual discussion by MPs with appropriate title and rich formatting
- Some example Bills, Motions and Oral Answers are included
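
Because the JSON layout depends on the session date, it can help to decide up front which layout to expect. The sketch below is illustrative only and is not part of this repository: the helper names is_structured and list_json_members are hypothetical, and the internal layout of hansard_full.zip is an assumption (the README does not describe it), so the listing simply filters entries by the .json extension.

```python
import zipfile
from datetime import date

# Cut-over date stated in the README: sessions from this date onward
# use the structured JSON layout; earlier ones store raw HTML.
STRUCTURED_FROM = date(2012, 9, 10)


def is_structured(session_date: date) -> bool:
    """Return True if a session on this date should use the structured layout."""
    return session_date >= STRUCTURED_FROM


def list_json_members(archive) -> list:
    """List the .json entries of a zip archive (a path or a file-like object).

    Assumption: sessions are stored as individual .json entries.
    """
    with zipfile.ZipFile(archive) as zf:
        return sorted(n for n in zf.namelist() if n.lower().endswith(".json"))
```

A caller might run list_json_members("hansard_full.zip") and then branch on is_structured once the session date is known for each file.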

Cleaning of data

A format_hansard.py script is included to extract speeches and standardise the formatting of the JSON files. The formatted JSON files are written to an output folder. By default, with no flags set, all peripheral text is removed. Dedicated flags handle partial Hansard sessions (e.g. Bills, Motions and Oral Answers), and the -g flag enables more granular settings.
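
For sessions before 2012-09-10, where the JSON holds a whole HTML file, some form of tag stripping is needed before speeches can be extracted. The following is a minimal sketch of that idea using only the Python standard library. It is not how format_hansard.py works (its internals are not described here), and the names _TextCollector and html_to_text are hypothetical.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html: str) -> str:
    """Return the text of an HTML string with tags removed."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    # Collapse runs of whitespace so the result is single-spaced text.
    return " ".join(" ".join(collector.parts).split())
```

This flattens everything into one line of text; a real cleaner would also need to keep speaker boundaries, which this sketch does not attempt.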

Annotation of data

A formatted_to_txt.py script is included to convert a formatted JSON file into plain text for annotation.

The Annotated folder contains a config.json file for use with ner-annotator, along with the annotated files, whose names end in _annotated.json.

Annotation followed these steps:

1. Download JSON file
2. Format JSON file using format_hansard.py: python format_hansard.py step_1_file.json -f, which will give you a step_2_formatted.json file
3. Convert to text: python formatted_to_txt.py step_2_formatted.json, which will give you a step_3_text.txt file
4. Annotate with ner_annotator: ner_annotator step_3_text.txt -c config.json -m hansard
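
The three command-line steps above can be chained from a small driver script. This is a sketch only: build_commands and run_workflow are hypothetical helpers, not part of the repository, and the intermediate file names are passed in by the caller on the assumption that each script writes the output file named in the steps above.

```python
import subprocess


def build_commands(raw_json, formatted_json, text_file, config="config.json"):
    """Return the argv lists for the format, convert and annotate steps."""
    return [
        # Format the downloaded JSON (flag copied from the README example).
        ["python", "format_hansard.py", raw_json, "-f"],
        # Convert the formatted JSON to plain text.
        ["python", "formatted_to_txt.py", formatted_json],
        # Launch the annotator on the text file.
        ["ner_annotator", text_file, "-c", config, "-m", "hansard"],
    ]


def run_workflow(commands):
    """Run each command in order, stopping at the first failure."""
    for argv in commands:
        subprocess.run(argv, check=True)
```

For example, run_workflow(build_commands("step_1_file.json", "step_2_formatted.json", "step_3_text.txt")) would run the steps in sequence, provided the scripts and ner_annotator are available on the current path.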

hansard_data's Issues

Request for scraper

Hi, I'm a Singaporean civil servant working on a data project to analyse parliamentary questions and replies for work. I happened upon this project while trying to build my own scraper, and was wondering if you could guide me on how you scraped Hansard? I would like to scrape to an updated period (until 2023) and was hoping I could borrow your old code and update it myself. Looked through the files but the format_hansard file didn't seem to be the one scraping? Appreciate your response! I can also drop my private/work email if it's easier to communicate there. Thank you!

nus-cs3244-ml-singapore-7 / hansard_data