
Modeling Latent Dimensions of Human Beliefs

Documentation for the code and data reproducing the results in the paper "Modeling Latent Dimensions of Human Beliefs".

A. Code and data structure

(1) Source code:

  • preprocess.py: code for preprocessing the tweet data, e.g., removing duplicates of long and highly repetitive texts, handling symbols (such as "<newline>", "&amp", "&lt"), adding tweet ids, and computing data statistics.
  • extract_BERT_embeddings.py: code for extracting BERT embeddings of tweets using the pre-trained BERT model. Each tweet in the data is mapped to a BERT embedding vector of size 1024.
  • nmf_algorithm.py: code for running NMF factorization, which factorizes the tweet embeddings into compressed latent embeddings of size 50.
  • We experimented with three model architectures, as described in the paper, implemented in these files:
    • gpt2_wrapper_arch1.py: architecture 1 described in the paper
    • gpt2_wrapper_arch2.py: architecture 2 described in the paper
    • gpt2_wrapper.py: architecture 3 described in the paper (also the main architecture)
  • train_gpt2.py: code for training the modified GPT-2 decoder model
  • inference_gpt2.py: code for generating texts of beliefs from the latent dimensions

(2) Data:

  • the_world_is_tweets.csv: collected tweets describing people's views about the world, each starting with "the world is...". The file has two columns, "id" and "sentence", referring to the index and the tweet text. Due to data privacy policy, this data is not made public.
  • tweets_labels.csv: tweets annotated with primals classes by experts, used for the prediction evaluation experiment.

(3) Other relevant files:

B. Steps to reproduce proposed model

Below are the steps to build the model described in the paper.

(1) Preprocessing texts:

  • Details: In this preprocessing step we lowercase all tweets (so they work well with the bert-large-uncased model), filter out repeated quotes (to avoid unoriginal tweets), and clean the data (replacing invalid symbols such as "&amp" or "&lt" with the correct characters). The following command specifies the arguments to run the code; an illustrative sketch of these cleaning steps is shown after the command.
  • Command:
python3 preprocess.py \
	--input_file the_world_is_tweets.csv \
	--output_file tweets_processed.csv
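
  • Illustrative sketch: a minimal Python version of these cleaning steps using pandas, assuming the input columns are "id" and "sentence" as described above. This is only an illustration; the exact rules in preprocess.py (e.g., how repeated quotes are detected) may differ.

import re
import pandas as pd

def clean_sentence(text):
    # lowercase so the text matches the bert-large-uncased vocabulary
    text = str(text).lower()
    # replace HTML escape remnants and the newline placeholder
    text = text.replace("&amp", "&").replace("&lt", "<").replace("&gt", ">")
    text = text.replace("<newline>", " ")
    # collapse repeated whitespace
    return re.sub(r"\s+", " ", text).strip()

df = pd.read_csv("the_world_is_tweets.csv")          # columns: id, sentence
df["sentence"] = df["sentence"].map(clean_sentence)
df = df.drop_duplicates(subset="sentence")           # drop repeated quotes
df.to_csv("tweets_processed.csv", index=False)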

(2) Extracting BERT embeddings:

  • Details: In this step, we feed the processed data to the BERT model to extract BERT embeddings. We used the large version of BERT, which has 24 transformer layers, 16 attention heads, and a hidden size of 1024. Details of the BERT model can be found in its original paper at [https://arxiv.org/abs/1810.04805]. The implementation and pretrained weights of the original BERT were downloaded from the Hugging Face Transformers library [https://huggingface.co/docs/transformers/model_doc/bert]. The embedding of each sentence is computed by averaging the hidden states of the last 4 BERT layers across all words in the sentence; a sketch of this aggregation is shown after the command.
  • Command:
python3 extract_BERT_embeddings.py \
	--input_file tweets_processed.csv \
	--output_file tweets_bert_embeddings.json \
	--model_type bert \
	--model_name_or_path bert-large-uncased \
	--do_lower_case \
 	--batch_size 8 \
	--block_size 32 \
	--text_column 1 \
	--id_column 0 \
	--layers "-1 -2 -3 -4" \
	--header True \
	--layers_aggregation mean 
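
  • Illustrative sketch: a hedged Python sketch of the aggregation described above (mean of the last 4 hidden layers, averaged over all tokens), using the Hugging Face Transformers API. extract_BERT_embeddings.py additionally handles batching and file I/O.

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-large-uncased")
model = BertModel.from_pretrained("bert-large-uncased", output_hidden_states=True)
model.eval()

def sentence_embedding(sentence):
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True, max_length=32)
    with torch.no_grad():
        outputs = model(**inputs)
    # hidden_states is a tuple of (num_layers + 1) tensors of shape (1, seq_len, 1024)
    last_four = torch.stack(outputs.hidden_states[-4:])    # layers -1, -2, -3, -4
    layer_mean = last_four.mean(dim=0)                      # average over the 4 layers
    return layer_mean.mean(dim=1).squeeze(0)                # average over tokens -> (1024,)

print(sentence_embedding("the world is a beautiful place").shape)  # torch.Size([1024])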

(3) Running NMF factorization:

  • Details: The code below preprocesses the data and runs the NMF algorithm to factorize the embeddings into 50 latent dimensions. We used the scikit-learn implementation of NMF [https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html]. The NMF hyper-parameters used are: tol=1e-4, max_iter=200, init="random", n_components=50; other hyper-parameters were kept at their defaults. An illustrative scikit-learn sketch is shown after the command.
  • Command:
python3 nmf_algorithm.py \
	--embeddings_file tweets_bert_embeddings.csv \
	--output_file tweets_nmf50.csv \
	--algorithm nmf \
	--n 50
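
  • Illustrative sketch: a minimal scikit-learn version of the factorization. NMF requires non-negative input, so this sketch shifts the embeddings by their minimum value; the exact preprocessing and embeddings file format handled by nmf_algorithm.py may differ.

import numpy as np
import pandas as pd
from sklearn.decomposition import NMF

# load the tweet embeddings (one 1024-dimensional row per tweet; file format assumed)
X = pd.read_csv("tweets_bert_embeddings.csv", header=None).values
X = X - X.min()                       # shift so all values are non-negative

nmf = NMF(n_components=50, init="random", tol=1e-4, max_iter=200, random_state=0)
W = nmf.fit_transform(X)              # (n_tweets, 50) compressed latent embeddings
H = nmf.components_                   # (50, 1024) latent dimension bases

np.savetxt("tweets_nmf50.csv", W, delimiter=",")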

(4) Training the modified GPT-2 model:

  • Details: Our method builds upon GPT-2, a widely used generative text model with strong performance. The GPT-2 version used in our paper is the base model, which has 12 transformer layers, 12 attention heads, and a hidden size of 768. As described in the main paper, we modified this model by adding transformation matrices that map the input vector of size 50 to the hidden vector size of 768. The total number of trainable parameters is 125 million. Details of GPT-2 can be found in its original paper at [https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf]. The implementation and pretrained weights of the original GPT-2 were downloaded from the Hugging Face Transformers library [https://huggingface.co/docs/transformers/model_doc/gpt2]. The modification of GPT-2 is implemented in the gpt2_wrapper.py file; a hedged sketch of this kind of latent-to-hidden projection is shown after the command. The train_gpt2.py file trains the model using the following command.
  • Command:
python3 train_gpt2.py \
	--train_data_file tweets_nmf50.csv \
	--output_dir trained_models/ \
	--gpt2_model_type gpt2 \
	--gpt2_model_name_or_path gpt2 \
	--latent_size 50 \
	--block_size 32 \
	--per_gpu_train_batch_size 16 \
	--gradient_accumulation_steps 1 \
	--do_train \
	--save_steps 500 \
	--num_train_epochs 5 \
	--overwrite_output_dir \
	--overwrite_cache
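
  • Illustrative sketch: a hedged Python sketch of the kind of modification described above, namely a learned transformation mapping the 50-dimensional latent vector into GPT-2's 768-dimensional hidden space, injected here as a prefix embedding. The actual architectures are defined in the gpt2_wrapper*.py files and may differ.

import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel

class LatentConditionedGPT2(nn.Module):
    def __init__(self, latent_size=50, model_name="gpt2"):
        super().__init__()
        self.gpt2 = GPT2LMHeadModel.from_pretrained(model_name)
        hidden_size = self.gpt2.config.n_embd                    # 768 for the base model
        self.latent_proj = nn.Linear(latent_size, hidden_size)   # 50 -> 768 transformation

    def forward(self, input_ids, latent, labels=None):
        token_embeds = self.gpt2.transformer.wte(input_ids)      # (batch, seq_len, 768)
        prefix = self.latent_proj(latent).unsqueeze(1)           # (batch, 1, 768)
        inputs_embeds = torch.cat([prefix, token_embeds], dim=1)
        if labels is not None:
            # ignore the prefix position in the language-modeling loss
            pad = torch.full((labels.size(0), 1), -100, dtype=labels.dtype, device=labels.device)
            labels = torch.cat([pad, labels], dim=1)
        return self.gpt2(inputs_embeds=inputs_embeds, labels=labels)

  In training, each tweet's 50-dimensional NMF vector would be passed as latent alongside its token ids, and the wrapped GPT-2 model returns the standard language-modeling loss.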

(5) Inference (generating texts) with the trained model:

  • Details: The inference_gpt2.py file generates texts from the latent dimensions using the model trained in the step above. The argument generate_num indicates how many sentences to generate for each dimension. The arguments temperature, top_p and top_k control the sampling used when generating texts (temperature scaling, nucleus/top-p sampling, and top-k sampling), which determines how broad or narrow the generated texts are statistically. The sketch after the command shows how these arguments map onto standard GPT-2 generation.
  • Command:
python3 inference_gpt2.py \
	--train_data_file tweets_nmf50.csv \
	--output_dir trained_models/ \
	--gpt2_model_type gpt2 \
	--gpt2_model_name_or_path gpt2 \
	--latent_size 50 \
	--generate_num 5 \
	--generate_length 32 \
	--temperature 0.2 \
	--top_p 0.9 \
	--top_k 10 \
	--overwrite_cache
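
  • Illustrative sketch: how the sampling arguments above map onto the Hugging Face generate() API. A plain GPT-2 model is used here for brevity; inference_gpt2.py generates from the latent-conditioned model trained in the previous step.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer.encode("the world is", return_tensors="pt")
with torch.no_grad():
    outputs = model.generate(
        input_ids,
        do_sample=True,
        max_length=32,              # generate_length
        num_return_sequences=5,     # generate_num
        temperature=0.2,
        top_p=0.9,
        top_k=10,
        pad_token_id=tokenizer.eos_token_id,
    )
for seq in outputs:
    print(tokenizer.decode(seq, skip_special_tokens=True))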
