wri-dssg-omdena / policy-data-analyzer

Building a model to recognize incentives for landscape restoration in environmental policies from Latin America, the US and India. Bringing NLP to the world of policy analysis through an extensible framework that includes scraping, preprocessing, active learning and text analysis pipelines.

License: Other

Makefile 0.01% Jupyter Notebook 91.93% Python 1.22% R 0.08% CSS 5.10% JavaScript 0.02% HTML 1.61% Shell 0.01% Dockerfile 0.02%
nlp sbert sentence-transformers huggingface machine-learning text-classification document-classification scraping policy environmental

policy-data-analyzer's Introduction

The Policy Accelerator

This project contains the code for the paper Accelerating Incentives: Identifying economic and financial incentives for forest and landscape restoration in Latin American policy using Machine Learning, accepted at ICCP5.

In the long term, we are building a tool that can be extended to any use case related to policy analysis. More information on the architecture and implementation below.

About

DSSG Solve and Omdena are collaborating with the World Resources Institute to create a tool that can assist policy analysts in understanding regulations and incentives relating to forest and landscape restoration, how these policies are applied in practice, and the degree of alignment across ministries and levels of government.

So far, we have successfully built an end-to-end pipeline containing a model that can identify financial and economic incentives in policy documents from 5 Latin American countries: Chile, El Salvador, Guatemala, Mexico, and Peru. We presented our project to government officials from these countries and have received support and input from stakeholders in El Salvador and Chile. Going forward, we will receive additional input from stakeholders in other countries, including Mexico and India.

The modeling side has yielded promising results, and we will be presenting this progress at the 5th Conference on International Public Policy. The potential impact of this framework is quite large, as it can be extended to multiple countries and to different types of policy analysis. Very little has been done to apply ML to restoration, so this project is a great opportunity to pioneer a new application of data science to environmental efforts. More information in the Background, Motivation and Impact section.

Architecture

General Pipeline

Human-in-the-loop Annotation Pipeline

Classifier Pipeline

Results

Incentive Detection

Incentive Instrument Classification

Development

Getting Started

Requirements

  • Python >= 3.6
  • Miniconda or virtualenv (or any type of virtual environment tool)
  • pip

Contribution Guidelines

Steps to contribute to the master branch

On Github

  1. Create an issue for each new bug/feature/update that you want to contribute. In the issue description, be as detailed as possible with what the expected inputs and outputs should be, and if possible what the process to solve the issue will be.
  2. Assign someone to the issue and apply the relevant tags (documentation, enhancement, etc.)

On your local machine

  1. If you haven't already, accept the invite to be a member of wri-dssg! Then clone the repository using git clone https://github.com/wri-dssg/policy-data-collector.git
  2. If you're going to work on issue #69 which is about extracting text, then create a branch for that issue:
git checkout -b i69_text_extraction
  3. Once work is done, commit and push:
git push --set-upstream origin i69_text_extraction

Back on Github

  1. Once the issue is solved, open a Pull Request (PR) on Github to merge into the master branch, link the issue in the PR description, and assign people to review. If possible, open a PR about once a week to avoid merge conflicts.
  2. If the PR gets approved and merged, you can close the issue and delete the branch!

Docker, reproducibility and development

  • The project's Dockerfile can be used to set up a development environment which encapsulates all dependencies necessary to run each project component. The purpose of this environment is to facilitate collaboration and reproducibility, while being able to develop and work on the project locally.
  • Future dependencies should be added either to the Dockerfile or the requirements.txt with a comment on the purpose of the specific package.

Build the Docker image:

$ docker build -f Dockerfile -t policy_container . 

Create a Docker container by running the image:

$ docker run -ti --rm -p 8888:8888 --mount source=$(pwd),target=/app,type=bind policy_container:latest  
# $(pwd) should give you the absolute path to the project directory

Launch a Jupyter notebook from within the container:

$ jupyter notebook --port=8888 --no-browser --ip=0.0.0.0 --allow-root

FAQs

  • I want to create a new branch starting from an old branch, how do I do that?
    • Say you want to create branch_2 based on branch_1 (in other words, with branch_1 as a starting point), then you would:
    $ git checkout -b branch_2 branch_1    
    
  • I want to bring the changes from one branch into mine, to keep mine updated, how do I do that?
    • Say you want to merge branch_1 INTO branch_2, then you would:
    $ git checkout branch_2   # if you aren't in branch 2 already
    $ git merge branch_1
    
  • If I'm working with someone in the same issue, can I contribute/push to their branch?
    • Technically yes, but it would be safer to work on your own branch first (maybe divide the issue into smaller issues) and then open a PR to theirs once you feel ready to merge code. Alternatively, you could pair program and not worry about overwriting someone else's code :)

Project Organization

├── LICENSE
├── README.md          <- The top-level README for developers using this project.
|
├── src                <- Source code for use in this project. Code used across tasks.
│
├── tasks              <- Top level folder for all tasks and code
│   └── <task_name>        <- Folder to contain materials for one single task
│       ├── src              <- Source code for use in this task.
│       ├── input            <- Input files for this task
│       ├── output           <- Output files from the task
│       ├── notebooks        <- Place to store jupyter notebooks/R markdowns or any prototyping files (the drafts)
│       └── README.md        <- Basic instructions on how to replicate the results from the output/run the code in src
│
└── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
                         generated with `pip freeze > requirements.txt` (we will probably need to change this to include R information in the future)

Project structure based on the cookiecutter data science project template and the task as a quantum of workflow project template.


Background, Motivation and Impact

We are on the verge of the United Nations Decade for Ecosystem Restoration. The Decade starts in 2021 and ushers in a global effort to drive ecosystem restoration to support climate mitigation and adaptation, water and food security, biodiversity conservation and livelihood development. In order to prepare for the Decade, we must understand the enabling environment. However, understanding policies involves reading and analyzing thousands of pages of documentation across multiple sectors. Using NLP to mine policy documents would promote knowledge sharing between stakeholders and enable rapid identification of incentives, disincentives, perverse incentives and misalignment between policies. If a lack of incentives or disincentives were discovered, this would provide an opportunity to advocate for positive change. Creating a systematic analysis tool using NLP would enable a standardized approach to generate data that can support evidence-based change.

The viability of Nature Based Solutions projects is often impeded by the lack of positive incentives to adopt practices that conserve or restore land. Perverse incentives also encourage business-as-usual practices that have a heavy carbon footprint, degrade ecosystems, exploit workers or fail to generate decent livelihoods for rural communities.

Shifting incentives in a specific jurisdiction begins with a diagnosis of the country’s existing regulations, incentives and mandates across agencies. The aim is to gain a thorough understanding of current regulations and incentives that are relevant to forest and landscape restoration, the reality of how they are applied in practice and the degree of alignment or conflict across ministries and different levels of government. Shifting incentives at the international level may require such diagnostics across multiple countries, or voluntary standards and business practices. For this purpose, natural language processing technologies are needed to expedite systematic review of the legal and policy context in the relevant jurisdictions, as well as examples of innovative incentives from other contexts.

Success will be achieved as governments or market platforms create aligned incentives across sectoral silos, remove administrative bottlenecks, or reorient incentives in line with recommendations. To advocate for change, a systematic process of analyzing incentives is needed beyond manual policy analysis. Currently, manual policy analysis is the only method used to understand incentives, which is inadequate given the scale of the task.

Description taken from: DSSG Solve Project Description

Citation

This repository has been developed as part of the following paper. Please cite the following paper if you found the repository useful:

@article{DBLP:journals/corr/abs-2201-07105,
  author    = {Jordi Planas and
               Daniel Firebanks{-}Quevedo and
               Galina Naydenova and
               Ramansh Sharma and
               Cristina Taylor and
               Kathleen Buckingham and
               Rong Fang},
  title     = {Beyond modeling: {NLP} Pipeline for efficient environmental policy
               analysis},
  journal   = {CoRR},
  volume    = {abs/2201.07105},
  year      = {2022},
  url       = {https://arxiv.org/abs/2201.07105},
  eprinttype = {arXiv},
  eprint    = {2201.07105},
  timestamp = {Fri, 21 Jan 2022 13:57:15 +0100},
  biburl    = {https://dblp.org/rec/journals/corr/abs-2201-07105.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

policy-data-analyzer's People

Contributors

bcjg23, danncalle, dfhssilva, galiusha, mattesweeney, rongfang323, rsmath, thefirebanks

policy-data-analyzer's Issues

Extract text from Cristina's documents

  1. Read pdf files directly from OneDrive or from zip file
  2. Use OCR if needed for extracting text from image pdf files
  3. Optional: Structure data into single file (or database) for future reading

Create a general evaluator for the models

Script that:

  • Takes in as input the results from a given model run (as a JSON file containing sentences and their labels) and a dataset of labeled sentences
  • Compares the differences between the model outputs and the ground truth
  • Has different metrics (cosine similarity, accuracy, precision-recall curve, etc.)
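
A minimal sketch of what such an evaluator could look like is shown below. It is only an illustration: the JSON layout (a mapping from sentence text to label), the function names `load_labels` and `evaluate`, and the label name `incentive` are assumptions, not the project's actual schema or code. It covers accuracy, precision, recall and F1; the other metrics listed above would need model scores or embeddings.

```python
import json
from typing import Dict


def load_labels(path: str) -> Dict[str, str]:
    """Load a JSON file assumed to map sentence text to a label string."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def evaluate(
    predictions: Dict[str, str],
    ground_truth: Dict[str, str],
    positive_label: str = "incentive",  # hypothetical label name
) -> Dict[str, float]:
    """Compare model outputs with the ground truth on the sentences both share."""
    shared = [s for s in ground_truth if s in predictions]
    tp = fp = fn = correct = 0
    for sentence in shared:
        pred, gold = predictions[sentence], ground_truth[sentence]
        if pred == gold:
            correct += 1
        if pred == positive_label and gold == positive_label:
            tp += 1
        elif pred == positive_label:
            fp += 1
        elif gold == positive_label:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = correct / len(shared) if shared else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

Keeping `evaluate` free of file I/O means the same function can be called from a script (after `load_labels`) or directly from a notebook with in-memory dictionaries.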

Improve preprocessing component

The current method that we are using to split sentences yields a large number of wrongly split sentences.
We need to improve it so that we have a good final version when we want to use the fine-tuned transformers.

Automate hyperparameter optimization

  • Look for information on automatic hyperparameter tuning optimization and its viability for our project
  • Define hyperparameters to be optimized for
  • Test new methods like population-based optimization or bayesian optimization

Bayesian Optimization explained

Population Based Optimization

Huggingface + Ray + W&B implementation of hyperparameter tuning

Explore query strategies with sBERT

We have an initial setup of sBERT that lets us get sentence embeddings and then compute the cosine similarity between two sentences.
This allows for using this setup as a search engine to look for sentences which are similar to a certain query.
In this issue we want to analyse the output of the search as we use different query approaches.
There is a more sophisticated approach in https://towardsdatascience.com/building-a-search-engine-with-bert-and-tensorflow-c6fdc0186c8a that we will also explore to see if the performance improves. We will set a new branch Antyukhov-search-engine
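
The search idea described in this issue can be sketched as follows. This is a simplified illustration, not the repository's implementation: the function names are hypothetical, and the embeddings are plain lists of floats that would, in practice, come from an sBERT model. Only the ranking step is shown.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(
    query_embedding: Sequence[float],
    sentence_embeddings: Sequence[Sequence[float]],
    sentences: Sequence[str],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the top_k sentences most similar to the query, best match first."""
    scored = [
        (sentence, cosine_similarity(query_embedding, embedding))
        for sentence, embedding in zip(sentences, sentence_embeddings)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

Comparing query approaches then amounts to embedding each candidate query, calling `search` with the same sentence embeddings, and inspecting how the returned sentences differ.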

SBERT for classification

Find a way of using SBERT for label prediction without using cosine similarity, i.e., find another mapping function from sentence embedding to label.
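
One possible mapping function is a small classifier trained on top of the embeddings. The sketch below uses a plain logistic regression written from scratch; it is an assumption about how the issue could be approached, not the method the project settled on, and the function names, learning rate and epoch count are illustrative. Embeddings are again plain lists of floats standing in for SBERT outputs.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(
    embeddings: Sequence[Sequence[float]],
    labels: Sequence[int],  # 1 = positive class, 0 = negative class
    lr: float = 0.5,
    epochs: int = 200,
) -> Tuple[List[float], float]:
    """Fit weights and bias with batch gradient descent on the log loss."""
    dim = len(embeddings[0])
    n = len(embeddings)
    weights = [0.0] * dim
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(embeddings, labels):
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for i, xi in enumerate(x):
                grad_w[i] += error * xi
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict(embedding: Sequence[float], weights: Sequence[float], bias: float) -> int:
    """Map a sentence embedding to a label (1 or 0)."""
    score = sigmoid(sum(w * xi for w, xi in zip(weights, embedding)) + bias)
    return int(score >= 0.5)
```

In a real setup, a library classifier trained on the embedding matrix would likely replace this hand-written loop; the point is only that the label comes from a learned function of the embedding rather than from similarity to a query.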

Refactor preprocessing script

We can refactor src/preprocessing.py to make it slightly more readable and time-efficient, and to make it easier to add more transformations to the text.

Edit README

Make changes to the README so that it contains updated information about the architecture, results and description of the project. The end goal is to spread the link to this repo as much as possible, and we need to have a good and presentable description of the project.

data loading refactoring and new functions

This is an issue to improve the data loading tools. You can list your changes here:

  • Rename the function to load json from "load_file" to "load_json" in the src/utils.py
  • Add a function to list file names from a directory
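
The two changes above could look roughly like this. The name `load_json` comes from the issue; `list_file_names` and its `extension` parameter are hypothetical names chosen for illustration, and the actual code in src/utils.py may differ.

```python
import json
import os
from typing import Any, List


def load_json(path: str) -> Any:
    """Read a JSON file and return the parsed content."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def list_file_names(directory: str, extension: str = "") -> List[str]:
    """Return the sorted names of regular files in a directory.

    If `extension` is given (e.g. ".json"), only matching files are returned.
    Subdirectories are skipped.
    """
    return sorted(
        name
        for name in os.listdir(directory)
        if os.path.isfile(os.path.join(directory, name)) and name.endswith(extension)
    )
```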

Set up AWS S3 general pipelines

According to the process diagrams, for each language, there will be many different databases, in the main bucket wri-nlp-policy.

GOAL: For each of the folders/content inside, we need to create functions that allow for easy access and manipulation.

An example structure for the english documents would be:

  • /english_documents/raw_pdf/: Original/raw documents
  • /english_documents/text_files/: Text file version of the documents
    • /english_documents/text_files/new/: New documents ready to be processed (read)
    • /english_documents/text_files/processed/: Processed documents that have already gone through sentence extraction (write)
  • /english_documents/sentences/: JSON file containing sentences per documents (read AND write)
  • /english_documents/assisted_labeling/: Excel/CSV files for the assisted labeling part (read AND write)
  • /english_documents/metadata/: CSV files containing metadata for each country (file names, title of document, etc.) (write)
  • /english_documents/abbreviations/: Text files containing common abbreviations for each language (read)
  • Extra separate files:
    • /english_documents/english_queries.xlsx: Queries (Excel) (read)

There are more databases to add, such as the one for actual embeddings (if needed) and the highlights per each document. Since we haven't created them yet, these are not necessary to create links to.
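
As a starting point for the access functions, the folder layout above can be captured in a single helper that builds object keys. This is a sketch under assumptions: the names `FOLDERS` and `build_key` are hypothetical, the short folder aliases are invented for illustration, and the leading slash is dropped on the assumption that keys are stored without it. The bucket name is the one stated in this issue. Actual reads and writes would go through an S3 client, which is left out here.

```python
from typing import Dict

BUCKET = "wri-nlp-policy"  # main bucket named in this issue

# Hypothetical aliases for the folders listed above, relative to "<language>_documents/".
FOLDERS: Dict[str, str] = {
    "raw_pdf": "raw_pdf/",
    "new_text": "text_files/new/",
    "processed_text": "text_files/processed/",
    "sentences": "sentences/",
    "assisted_labeling": "assisted_labeling/",
    "metadata": "metadata/",
    "abbreviations": "abbreviations/",
}


def build_key(language: str, folder: str, file_name: str = "") -> str:
    """Build the object key for a file in one of the per-language folders."""
    if folder not in FOLDERS:
        raise KeyError(f"Unknown folder alias: {folder!r}")
    return f"{language}_documents/{FOLDERS[folder]}{file_name}"
```

Centralizing the layout in one mapping means a folder can be renamed in one place without touching every function that reads or writes to it.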

Weights & Biases experiments

Setup for Weights & Biases for our notebook and further experiments

We can use wandb for hyperparameter tuning and most importantly keeping track of experiments. With W&B free hosted service we can set this up efficiently.

Since the team version of wandb is paid (although a 30-day free trial is available), we will use the following project:

https://wandb.ai/ramanshsharma/WRI

Please find the API key to write to the public project in Slack.

Goals

  • Create a shared project on weights & biases for the team to work on.
  • Set up training and validation accuracy/loss, weighted/macro F1 score plotting
  • Set up automatic hyperparameter tuning

Helpful links

  1. Intro to Weights & Biases
  2. Official examples
  3. Organizing Hyperparameter Sweeps in PyTorch with W&B

Fix training loops and sentence transformer

  • Evaluation code should be refactored to only take care of calculating results, and storing of results should be done in a different area
  • Evaluation should be done on validation set, not test set
  • Add method to evaluate on test set

Identify best set of keywords and search terms to find relevant documents from Ecolex

After the text has been extracted from the policy PDFs:

  1. Find word embeddings suitable for Spanish documents, or any type of Spanish language model
  2. Use keyword analysis/topic modeling to gather insights from the text and improve further searches
  3. If possible, come up with a "similarity" or "distance" metric among relevant documents for easier filtering from non-relevant

Setup binary classifier

Fine-tune BERT with a multilingual transformer to discriminate is_incentive from is_not_incentive

Spiders for US legislation

  • Rethink the database schema
  • Improve the is_enforced field update
  • Extend the is_enforced field to all spiders
  • Include the SHA1 code as the file name
  • Check the publication date field so that it has a uniform format, whether date-time or text
  • Centralize the loading of the dictionaries in a specific folder
  • Centralize the output csv files into a specific folder.
  • Create spider for the US federal policies
  • Create spider for the US state official bulletins (Selected states)
  • Update dictionaries in English
  • Implement the SQL database again
  • Move date transformations to a function in init.py
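
Two of the items above, SHA1-based file names and a uniform date format, can be sketched with the standard library. The function names, the choice to hash the file content (rather than, say, the URL), the default extension and the list of accepted input formats are all assumptions made for illustration.

```python
import hashlib
from datetime import datetime
from typing import Sequence


def sha1_file_name(content: bytes, extension: str = ".pdf") -> str:
    """Name a downloaded file after the SHA1 hex digest of its content."""
    return hashlib.sha1(content).hexdigest() + extension


def normalize_date(
    raw: str,
    formats: Sequence[str] = ("%Y-%m-%d", "%d/%m/%Y", "%B %d, %Y"),  # assumed input formats
) -> str:
    """Convert a scraped date string to ISO format (YYYY-MM-DD).

    Tries each known format in turn and raises ValueError if none matches.
    """
    for fmt in formats:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {raw!r}")
```

Hashing the content gives identical documents the same file name regardless of which spider found them, which also makes duplicate detection straightforward.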

Code refactoring for the data augmentation notebook.

One important step in the project pipeline is to find a batch of prelabeled sentences that can be easily curated manually and later used for model fine-tuning.
As a first step, this was done in a notebook where different strategies were evaluated in different experimental setups.
Now this code should be cleaned and refactored so it can be integrated into the final pipeline.
