yashsmehta / personality-prediction

Experiments for automated personality detection using Language Models and psycholinguistic features on various famous personality datasets including the Essays dataset (Big-Five)

License: MIT License

Python 100.00%

language-model personality-predicting pytorch bert-fine-tuning personality-detection

personality-prediction's Introduction

Automated Personality Prediction using Pre-Trained Language Models

This repository contains code for the paper Bottom-Up and Top-Down: Predicting Personality with Psycholinguistic and Language Model Features, published at the IEEE International Conference on Data Mining (ICDM) 2020.

The experiments are written in TensorFlow and PyTorch and explore automated personality detection using language models on the Essays dataset (labelled with the Big-Five personality traits) and the Kaggle MBTI dataset.

Setup

Clone the repository from GitHub, then create a new virtual environment (conda or venv):

git clone https://github.com/yashsmehta/personality-prediction.git
cd personality-prediction
conda create -n mvenv python=3.10

Install Poetry and use it to install the dependencies required to run the project:

curl -sSL https://install.python-poetry.org | python3 -
poetry install

Usage

First, run the LM extractor, which passes the dataset through the language model and stores the embeddings (of all layers) in a pickle file. Creating this 'new dataset' once saves a lot of compute time and makes it practical to search over the hyperparameters of the finetuning network. Before running the code, create a pkl_data folder in the repository folder. All arguments are optional; passing none runs the extractor with the default values.

python LM_extractor.py -dataset_type 'essays' -token_length 512 -batch_size 32 -embed 'bert-base' -op_dir 'pkl_data'
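The pkl_data folder mentioned above can also be created from Python before the extractor runs. This is a minimal sketch, assuming the output directory name pkl_data from the command above; the helper name ensure_output_dir is hypothetical and not part of the repository.

```python
from pathlib import Path


def ensure_output_dir(path="pkl_data"):
    """Create the output directory if it does not exist yet and return it."""
    out_dir = Path(path)
    # parents=True creates missing parent folders;
    # exist_ok=True makes the call safe to repeat.
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir
```

Calling ensure_output_dir() from the repository root creates pkl_data there; calling it again is a no-op.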

Next, run a finetuning model, which takes the extracted features from the pickle file as input and trains on them. We find a shallow MLP to be the best-performing one:

python finetune_models/MLP_LM.py

[Figures: Results Table; Language Models vs Psycholinguistic Traits]

Predicting personality on unseen text

Follow the steps below to predict personality (e.g. the Big-Five: OCEAN traits) on a new text/essay. First, train a finetuning model with the -save_model flag set:

python finetune_models/MLP_LM.py -save_model 'yes'

Then use the script below to predict on the unseen text:

python unseen_predictor.py

Running Time

LM_extractor.py

On an RTX 2080 GPU, the -embed 'bert-base' extractor takes about 2m 30s and 'bert-large' takes about 5m 30s.

On a CPU, the 'bert-base' extractor takes about 25m.

finetune_models/MLP_LM.py

On an RTX 2080 GPU, running for 15 epochs (with no cross-validation) takes between 5s and 60s, depending on the MLP architecture.

Literature

Deep Learning based Personality Prediction [Literature REVIEW] (Springer AIR Journal - 2020)

@article{mehta2020recent,
  title={Recent Trends in Deep Learning Based Personality Detection},
  author={Mehta, Yash and Majumder, Navonil and Gelbukh, Alexander and Cambria, Erik},
  journal={Artificial Intelligence Review},
  pages={2313--2339},
  year={2020},
  doi={10.1007/s10462-019-09770-z},
  url={https://link.springer.com/article/10.1007/s10462-019-09770-z},
  publisher={Springer}
}

Language Model Based Personality Prediction (ICDM - 2020)

If you find this repo useful for your research, please cite it using the following:

@inproceedings{mehta2020bottom,
  title={Bottom-up and top-down: Predicting personality with psycholinguistic and language model features},
  author={Mehta, Yash and Fatehi, Samin and Kazameini, Amirmohammad and Stachl, Clemens and Cambria, Erik and Eetemadi, Sauleh},
  booktitle={2020 IEEE International Conference on Data Mining (ICDM)},
  pages={1184--1189},
  year={2020},
  organization={IEEE}
}

License

The source code for this project is licensed under the MIT license.

personality-prediction's Issues

Python library version

Hello! Thank you for sharing!
I ran ‘pip -r requirements.txt’ but still cannot run the project successfully.
What are the specific versions of these python libraries?
Thanks!

finetuneNet.py

Hi, I am unable to find the file "finetuneNet.py".

Could you please tell me where it is?

Thanks.

Getting the accuracy scores

@yashsmehta @saminfatehir

Hi,

I wanted to know how the accuracy score was calculated for each trait.

In the output explog, I get an accuracy score for each of the 10 folds for each trait. I am trying to reproduce the results from the paper, so I wanted to know how the reported score for each trait is calculated: is it the average of the 10 accuracy scores, or the maximum?

Thanks 😊
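The two aggregation options raised in this question can be computed side by side. This is only a sketch of the arithmetic: it does not say which of the two the paper reports, the helper name summarize_folds is hypothetical, and the accuracy values are made up for illustration.

```python
from statistics import mean


def summarize_folds(fold_accuracies):
    """Return the mean and the maximum of a list of per-fold accuracies."""
    return {"mean": mean(fold_accuracies), "max": max(fold_accuracies)}


# Made-up accuracies for one trait over 10 folds (illustrative values only).
example = [0.58, 0.61, 0.57, 0.60, 0.59, 0.62, 0.56, 0.60, 0.58, 0.59]
summary = summarize_folds(example)
print(f"mean={summary['mean']:.3f} max={summary['max']:.3f}")
```

The mean describes typical performance across folds, while the maximum picks the single most favourable fold, so the two can differ noticeably on a small dataset.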

Errors when running LM_extractor.py

Hi,

I encountered some errors when running LM_extractor.py. The first one is the following:

Traceback (most recent call last):
  File "LM_extractor.py", line 111, in <module>
    map_dataset = MyMapDataset(dataset, tokenizer, token_length, DEVICE, mode)
  File "C:\Users\alexa\Desktop\code\personality-prediction-master\utils\data_utils.py", line 25, in __init__
    author_ids = torch.from_numpy(np.array(author_ids)).long().to(DEVICE)
UnboundLocalError: local variable 'author_ids' referenced before assignment

I think I fixed it by defining input_ids, targets, author_ids = [] in data_utils.py. However, after fixing this error, I got a second one:

Traceback (most recent call last):
  File "LM_extractor.py", line 115, in <module>
    shuffle=False,
  File "C:\Users\alexa\AppData\Local\Programs\Python\Python37\lib\site-packages\torch\utils\data\dataloader.py", line 272, in __init__
    batch_sampler = BatchSampler(sampler, batch_size, drop_last)
  File "C:\Users\alexa\AppData\Local\Programs\Python\Python37\lib\site-packages\torch\utils\data\sampler.py", line 217, in __init__
    "but got batch_size={}".format(batch_size))
ValueError: batch_size should be a positive integer value, but got batch_size=32

I don't quite know why this is happening, since 32 is indeed an integer. Could you help me solve this issue?

EDIT: I fixed this as well: changing batch_size=batch_size to batch_size=int(batch_size) in LM_extractor.py apparently solves the issue.
However, I have another issue: no pickle file is created. The only output of LM_extractor.py when running python LM_extractor.py -dataset_type 'essays' -token_length 512 -datafile 'data/essays/essays.csv' -batch_size 32 -embed 'bert-base' -op_dir 'pkl_data' is

usage: LM_extractor.py [-h] [-dataset_type DATASET_TYPE]
                       [-token_length TOKEN_LENGTH] [-batch_size BATCH_SIZE]
                       [-embed EMBED] [-op_dir OP_DIR] [-mode MODE]
                       [-embed_mode EMBED_MODE]
LM_extractor.py: error: unrecognized arguments: -datafile data/essays/essays.csv

This is strange, since I followed the advice suggested in another issue, and it did not work either. Looking into LM_extractor.py, it appears that the datafile parameter has been deleted.

Note that running python LM_extractor.py -dataset_type 'essays' -token_length 512 -batch_size 32 -embed 'bert-base' -op_dir 'pkl_data' without the datafile parameter does create a .pkl file, but in the wrong place (in the root folder rather than the target folder), and the file is only 1 KB, so I think something is wrong here.

EDIT2: Okay, I found a solution. The input paths needed to be fixed, which @bateman has done in his fork. The issue of pickle files not being created in the right place remains.
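The batch_size error reported in this issue is consistent with the value arriving as the string '32' rather than the integer 32: argparse returns strings unless a type is given. The following is a minimal sketch, assuming the extractor parses its flags with argparse (the usage message above suggests so). The parser is a stand-in for illustration, not the repository's code.

```python
import argparse


def build_parser():
    """Stand-in parser with two of the flags shown in the usage message."""
    parser = argparse.ArgumentParser()
    # type=int converts the command-line string to an integer during parsing.
    parser.add_argument("-batch_size", type=int, default=32)
    parser.add_argument("-token_length", type=int, default=512)
    return parser


args = build_parser().parse_args(["-batch_size", "32"])
print(type(args.batch_size).__name__, args.batch_size)
```

Declaring the type on the argument has the same effect as wrapping the value in int() at the call site, and it also rejects non-numeric input at parse time.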

Where the essays dataset came from

Hi,
Thanks for sharing, this helped me a lot.

As stated in your paper "In psychometric personality trait assessments, personality is measured in continuous scores, yet the available benchmark datasets mostly provide personality traits scores in artificially binned form only. Future studies should aim to use datasets that provide continuous scores on personality traits".

Following the citation in your paper, I read another paper, Linguistic styles: Language use as an individual difference. However, the original data is not provided in that paper. According to the paper's description, personality is supposed to be measured in continuous scores, yet in the essays dataset each trait is only labelled yes or no. I was wondering where the essays dataset came from and whether continuous scores are used to measure personality in the original dataset.

Thank you so much!!! Looking forward to your reply !!

Question of prediction result

Thank you for sharing!

I ran your project on my data successfully, but how can I see the prediction results for "O", "C", "E", "A" and "N", as in Table II of your paper? If it is convenient for you, please explain how to interpret the results generated by the code in 'finetune_models'.

Thank you! Looking forward to your reply !!

Can't clone from repo

I am trying to clone the repo using Google Colab, but it shows this error:
`Cloning into 'personality'...
Host key verification failed.
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.`

Is the keras model's constructor placed in a wrong place?

I found that your code places model = tf.keras.models.Sequential inside the K-fold loop. I think this means a new model is created every time the code enters a fold, rather than one model per trait.
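For context, creating a fresh model inside each fold is the usual way to keep cross-validation folds independent: a model reused across folds would carry over weights learned from samples that later serve as validation data. The following is a minimal standard-library sketch of that pattern. MajorityModel and k_fold_indices are hypothetical stand-ins for the Keras model and the splitter, and the labels are made up.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, val_indices) for consecutive, nearly equal folds."""
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, val
        start += size


class MajorityModel:
    """Toy classifier that predicts the most frequent training label."""

    def __init__(self):
        self.label = None

    def fit(self, labels):
        self.label = max(set(labels), key=labels.count)

    def accuracy(self, labels):
        return sum(1 for y in labels if y == self.label) / len(labels)


labels = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]  # made-up binary labels for one trait
fold_scores = []
for train_idx, val_idx in k_fold_indices(len(labels), n_folds=5):
    model = MajorityModel()  # fresh, untrained model for every fold
    model.fit([labels[i] for i in train_idx])
    fold_scores.append(model.accuracy([labels[i] for i in val_idx]))
print(fold_scores)
```

Each fold yields one score from a model that never saw that fold's validation samples, which is what makes the per-fold accuracies comparable.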

test prediction

How to get a personality prediction of an unseen text after training?

Installing requirements has typo

The install instruction reads pip -r requirments.txt; it should be pip install -r requirements.txt

Issue with selecting the best model in MLP_LM.py

In the script MLP_LM.py, variables best_model and best_accuracy should be defined inside of the loop on line 69. Otherwise, it may store the best-performing model from the previous personality trait.
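The fix described here amounts to resetting the tracking variables at the start of each trait's iteration. The following is a minimal sketch of that pattern, not the code of MLP_LM.py: the function name and the accuracy values are hypothetical, and only the best accuracy is tracked to keep the example short.

```python
def best_accuracy_per_trait(results):
    """Return the best fold accuracy for each trait.

    `results` maps a trait name to its list of per-fold accuracies.
    """
    best = {}
    for trait, fold_accuracies in results.items():
        # Reset the tracker for every trait so nothing carries over
        # from the trait processed before it.
        best_accuracy = float("-inf")
        for accuracy in fold_accuracies:
            if accuracy > best_accuracy:
                best_accuracy = accuracy
        best[trait] = best_accuracy
    return best


# Illustrative values only, not results from the paper.
results = {"O": [0.61, 0.63], "C": [0.55, 0.57]}
best = best_accuracy_per_trait(results)
print(best)
```

If best_accuracy were initialised once before the outer loop, the second trait in this example would never beat the first trait's best score and would inherit it.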

Add an option to display the personality of the essays' users

Hi,

First of all, thank you for your program, which is quite interesting and helpful.

I was wondering whether you could add an option to output (as a CSV file, for instance) the predicted personality for each user in essays.csv. For now, the model (I'm using MLP_combined_features.py, but I guess it would be feasible for all of them) only displays the accuracy score, and an option to view the predicted personalities would be of great use. It would also help people who would like to get the predicted personality of the users in their own dataset.

If you could implement this feature, it would be amazing. Thank you!
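One way such an export could look is sketched below, using only the standard library. This is a sketch under stated assumptions: the function write_predictions, the column name author_id and the binary 0/1 label per trait are all hypothetical choices, not part of the repository.

```python
import csv

TRAITS = ["O", "C", "E", "A", "N"]


def write_predictions(path, author_ids, predictions):
    """Write one row per author: the id followed by one label per trait."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["author_id"] + TRAITS)
        for author_id, labels in zip(author_ids, predictions):
            writer.writerow([author_id] + list(labels))
```

Calling write_predictions("predictions.csv", ids, preds) with a list of ids and a list of five-element label lists produces a file with one header row and one row per author.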

FileNotFoundError

Cloned from GH, installed requirements, executed LM_extractor.py as specified in the README.md file. Got this.

$ python LM_extractor.py -dataset_type 'essays' -token_length 512 -batch_size 32 -embed 'bert-base' -op_dir 'pkl_data'

running on cpu
essays : bert-base : 512 : 512_head : cls
Downloading: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 433/433 [00:00<00:00, 101kB/s]
Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████████| 440M/440M [00:13<00:00, 33.8MB/s]
Downloading: 100%|███████████████████████████████████████████████████████████████████████████████████████| 232k/232k [00:00<00:00, 492kB/s]
Traceback (most recent call last):
  File "LM_extractor.py", line 111, in <module>
    map_dataset = MyMapDataset(dataset, tokenizer, token_length, DEVICE, mode)
  File "/Users/fabio/git/personality-prediction/utils/data_utils.py", line 18, in __init__
    author_ids, input_ids, targets = dataset_processors.essays_embeddings(datafile, tokenizer, token_length, mode)
  File "/Users/fabio/git/personality-prediction/utils/dataset_processors.py", line 59, in essays_embeddings
    df = load_essays_df(datafile)
  File "/Users/fabio/git/personality-prediction/utils/dataset_processors.py", line 27, in load_essays_df
    with open(datafile, "rt") as csvf:
FileNotFoundError: [Errno 2] No such file or directory: '../data/essays/essays.csv'

Yet a small file is present at data/essays/essays.csv. Please advise.
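The traceback shows the relative path '../data/essays/essays.csv', which Python resolves against the current working directory rather than against the file that contains it. When the script is started from the repository root, that path points one level above the repository. The following is a minimal sketch of resolving a data path against a source file's own folder instead; the helper name resolve_from is hypothetical, and the /repo prefix in the usage note is a placeholder.

```python
from pathlib import Path


def resolve_from(base_file, relative_path):
    """Resolve `relative_path` against the folder that contains `base_file`."""
    return (Path(base_file).resolve().parent / relative_path).resolve()
```

For example, resolve_from("/repo/utils/dataset_processors.py", "../data/essays/essays.csv") points at data/essays/essays.csv under /repo regardless of the directory the script is launched from. Inside a module, passing __file__ as base_file gives the same behaviour.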

A question about affectivespace and senticnet of MBTI

Thank you for open-sourcing the code.

I want to extract the AffectiveSpace and SenticNet 5 features of the Kaggle MBTI dataset, but I don't know how to generate "kaggle_concept_count_final.p". This .p file is used in "load_features(dir, dataset)".

I visited https://www.sentic.net/downloads/ but did not find any related tutorial. How can I extract these two features?