
qingfengwuhen / causal-text-embeddings

This project is forked from blei-lab/causal-text-embeddings.


Software and data for "Using Text Embeddings for Causal Inference"

License: MIT License


Introduction

This repository contains software and data for "Using Text Embeddings for Causal Inference" (arxiv.org/abs/1905.12741). The paper describes a method for causal inference with text documents. For example, does adding a theorem to a paper affect its chance of acceptance? The method adapts deep language models to address the causal problem.

This software builds on

  1. BERT: github.com/google-research/bert, and on
  2. PeerRead: github.com/allenai/PeerRead

We include pre-processed PeerRead arXiv data for convenience.

There is also a reference implementation in PyTorch.

TensorFlow 2

For new projects, we recommend building on the reference TensorFlow 2 implementation.

Requirements and setup

  1. You'll need to download a pre-trained BERT model (following the GitHub link above). We use uncased_L-12_H-768_A-12.
  2. Install TensorFlow 1.12 (a setup sketch follows this list).
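
A minimal setup sketch, assuming the standard BERT-Base download URL from the google-research/bert README and a CPU install of TensorFlow; adjust paths and the pip package to your environment:

    # download the uncased BERT-Base checkpoint used in the paper
    wget https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip
    # unpack it where the run scripts expect it (see BERT_BASE_DIR below)
    unzip uncased_L-12_H-768_A-12.zip -d bert/pre-trained/
    # install TensorFlow 1.12 (use tensorflow-gpu==1.12.0 on a GPU machine)
    pip install tensorflow==1.12.0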

Data

  1. We include a pre-processed copy of the PeerRead data for convenience. This data is a collection of arXiv papers submitted to computer science conferences, the accept/reject decisions for these papers, and their abstracts. The raw PeerRead data contains significantly more information. You can get the raw data by following the instructions at github.com/allenai/PeerRead. Running the pre-processing scripts in the PeerRead folder will recreate the included tfrecord file.

  2. The Reddit data can be downloaded at archive.org/details/reddit_posts_2018. This data includes all top-level reddit comments where the gender of the poster was annotated in some fashion. Each post has meta information (score, date, username, etc.) and includes the text of the first reply. The processed data used in the paper can be recreated by running the pre-processing scripts in the reddit folder (a sketch of where to place the download follows at the end of this section).

You can also re-collect the data from Google BigQuery. The SQL command to do this is in reddit/data_cleaning/BigQuery_get_data. Modifying this script allows you to change the collection parameters (e.g., the year, or which responses are included).
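
The Reddit pre-processing expects the download at a fixed location (see the Reddit instructions below). A minimal placement sketch, where <downloaded_file> is a placeholder for whatever the archive.org item is named and paths are relative to where you run the code:

    # put the archive.org download where the pre-processing script expects it
    mkdir -p dat/reddit
    mv <downloaded_file>.json dat/reddit/2018.json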

Reproducing the PeerRead experiments

The default settings for the code match the settings used for the experiments in the paper. These match the default settings used by BERT, except that

  1. we reduce batch size to allow training on a Titan X, and
  2. we adjust the learning rate to account for this.

You'll run the code from src as ./PeerRead/submit_scripts/run_model.sh. Before doing this, you'll need to edit run_classifier.sh to change BERT_BASE_DIR=../../bert/pre-trained/uncased_L-12_H-768_A-12 to BERT_BASE_DIR=[path to BERT_pre-trained]/uncased_L-12_H-768_A-12.
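
Putting the two steps together, a sketch (the bracketed BERT path is a placeholder for wherever you unpacked the checkpoint):

    cd src
    # 1. edit PeerRead/submit_scripts/run_classifier.sh so that
    #    BERT_BASE_DIR=[path to BERT_pre-trained]/uncased_L-12_H-768_A-12
    # 2. train the model
    ./PeerRead/submit_scripts/run_model.sh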

The flag --treatment=theorem_referenced controls the experiment. The flag --simulated=real controls whether to use the real effect or one of the semi-synthetic modes.

The effect estimates can be reproduced by running python -m result_processing.compute_ate. This takes in the predictions of the BERT model (in tsv format) and passes them to downstream estimators of the causal effect.
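
For example, run from src (as with the other python -m commands; any flags pointing at a particular predictions file are left at their defaults here):

    cd src
    # turn the BERT predictions (tsv) into downstream causal effect estimates
    python -m result_processing.compute_ate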

To reproduce the baselines, you'll need to produce a tsv for each simulated dataset you want to test on. To do this, you can run python -m PeerRead.dataset.array_from_dataset from src. The flag --beta1=1.0 controls the strength of the confounding. (The other flags control other simulation parameters not used in the paper.)
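
For example, to produce a tsv for a single simulated dataset with the confounding strength used in the paper:

    cd src
    # --beta1 controls the strength of confounding in the simulation
    python -m PeerRead.dataset.array_from_dataset --beta1=1.0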

Misc.

The experiments in the paper use a version of BERT that was further pre-trained on the PeerRead corpus using an unsupervised objective. This can be replicated with ./PeerRead/submit_scripts/run_classifier.sh. This takes about 24 hours on a single Titan Xp. To use a pre-trained BERT, uncomment the INIT_DIR options in run_classifier.sh.
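
A sketch of this step, run from src like the training script (timing as noted above):

    cd src
    # further unsupervised pre-training of BERT on the PeerRead corpus
    # (roughly 24 hours on a single Titan Xp)
    ./PeerRead/submit_scripts/run_classifier.sh
    # to start from the off-the-shelf BERT checkpoint instead,
    # uncomment the INIT_DIR options in run_classifier.sh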

Reproducing the Reddit experiment

  1. First, get the data following the instructions above and save it as dat/reddit/2018.json.
  2. Run the data pre-processing with python -m reddit.data_cleaning.process_reddit (sketched below).
  3. Once the data is processed, the instructions for running the experiments are essentially the same as for PeerRead.
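
A minimal sketch of steps 1 and 2, assuming the commands are run from src as in the PeerRead instructions:

    # step 1: the archive.org download has been saved as dat/reddit/2018.json
    # step 2: pre-process the Reddit data
    python -m reddit.data_cleaning.process_reddit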

Maintainers

Dhanya Sridhar and Victor Veitch

