The cls_sentence_representations from ajitrajasekharan

(1) Sentence representation using [CLS] vectors

Sentence representation using the [CLS] vector of a pre-trained model without fine-tuning.

(2) Vector clustering (independent of how the vectors were created)

This repository also contains code for clustering vectors. Clustering takes two input files: a vector file (text file) and a corresponding descriptor file that names the vectors. Additional parameters are zscore and max tail pick, which control how much to pick from the distribution tails.

Top k neighbor clustering for each vector is also an option.
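
As a rough illustration of top k neighbors, here is a minimal sketch that ranks every other vector by cosine similarity and keeps the k best. It is not the repository's code: the function names, the toy vectors and the toy descriptors are all assumptions made for the example. It is written in plain Python so it runs without dependencies.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as orthogonal to everything
    return dot / (norm_u * norm_v)

def top_k_neighbors(vectors, descriptors, k):
    """For each vector, list its k most similar other vectors as
    (similarity, descriptor) pairs, most similar first."""
    neighbors = []
    for i, u in enumerate(vectors):
        scored = [(cosine(u, v), descriptors[j])
                  for j, v in enumerate(vectors) if j != i]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        neighbors.append(scored[:k])
    return neighbors

# Toy data standing in for the vector file and the descriptor file (hypothetical).
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
descriptors = ["cat sat", "cat slept", "stocks fell", "stocks dropped"]

result = top_k_neighbors(vectors, descriptors, k=1)
for name, nearest in zip(descriptors, result):
    print(name, "->", nearest[0][1])
```

For real data, the toy lists would be replaced by the loaded vectors and the lines of the descriptor file.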

Installation

Set up a PyTorch environment with or without GPU support using the link https://github.com/ajitrajasekharan/multi_gpu_test

pip install gdown

./fetch_model.sh

Usage

phase1.sh test.txt output.txt

This outputs the neighbors of each sentence in test.txt (see the Note below on the modification to the HuggingFace PyTorch code needed for [CLS] vector harvesting).

Output files for phase1
- sent_vectors.npy: [CLS] vectors for each input sentence
- user-specified output file: contains the neighborhood of each input sentence

phase2.sh

This script essentially invokes

python sentence_dist.py -terms $input -vectors sent_vectors.npy -zscore 4

This script can be used either to examine the sentence vectors (option 0) or to create clusters (option 1). The cluster stats are also output.

The vectors can be created by any mechanism; they need not be the [CLS] vectors created by phase1. All one needs to provide is a text file that serves as a description of each vector and a numpy array of sentence vectors. The z-score parameter sets how many standard deviations from the mean is used as the criterion for clustering (even if the distribution is not normal, the z-score serves as a reasonable threshold for picking the elements of a cluster).
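
To make the z-score criterion concrete, here is a minimal sketch of one way a z-score threshold and a max tail pick could work together. It is an assumption for illustration, not the logic of sentence_dist.py: the function names and toy vectors are hypothetical, and the threshold of 2.0 is chosen only because the toy set is too small for any value to reach a z-score of 4.

```python
import math
import statistics

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def pick_tail(pivot_index, vectors, zscore, max_tail_pick):
    """Return indices of vectors whose similarity to the pivot lies at least
    `zscore` standard deviations above the mean similarity, most similar
    first, capped at `max_tail_pick` entries."""
    pivot = vectors[pivot_index]
    sims = [(cosine(pivot, v), j)
            for j, v in enumerate(vectors) if j != pivot_index]
    values = [s for s, _ in sims]
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0.0:
        return []  # all similarities identical: there is no tail
    tail = [(s, j) for s, j in sims if (s - mean) / std >= zscore]
    tail.sort(reverse=True)
    return [j for _, j in tail[:max_tail_pick]]

# Hypothetical toy vectors: index 1 is close to the pivot (index 0),
# the rest are nearly orthogonal to it.
vectors = [
    [1.0, 0.0], [0.95, 0.05], [0.0, 1.0], [0.1, 1.0],
    [-0.1, 1.0], [0.05, 1.0], [-0.05, 1.0],
]

cluster = pick_tail(0, vectors, zscore=2.0, max_tail_pick=3)
print(cluster)
```

A higher threshold keeps only the far tail of the similarity distribution, so clusters get smaller and tighter. The cap bounds the cluster size when many vectors pass the threshold.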

Output files for option 0
- cum_dist.txt: cumulative histogram of distribution
- zero_vec_counts.txt: orthogonal vector count
- tail_counts.txt: tail count of vectors

Output files for option 1
- sent_cluster_pivots.txt: sentence clusters (just indices of sentences)
- pivots.json: pivots of clusters
- inv_pivots.json: inverted pivots
- cluster_stats.json: cluster stats
- desc_clusters.txt: descriptive clusters (shows sentences for each cluster element)

Example clusters

Cluster stats

Note.

Phase1 [CLS] vector generation requires a code patch to the transformers file modeling_bert.py in order to work. The patch harvests [CLS] from the head, where there is a transform (the bias value is not used).
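
To show the kind of transform involved, here is a toy sketch of a dense layer followed by GELU and layer normalization. This is the shape of the prediction head transform in HuggingFace's modeling_bert.py for standard BERT configurations. The patch itself is not reproduced here, so treating this as the transform the patch harvests is an assumption. The weights below are made-up placeholders, not model parameters.

```python
import math

def gelu(x):
    """Exact GELU activation."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def dense(x, weight, bias):
    """y = W x + b, with `weight` stored as a list of output rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def layer_norm(x, gamma, beta, eps=1e-12):
    """Normalize x to zero mean and unit variance, then scale and shift."""
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [g * (xi - mean) / math.sqrt(var + eps) + b
            for xi, g, b in zip(x, gamma, beta)]

def head_transform(hidden, weight, bias, gamma, beta):
    """dense -> GELU -> LayerNorm, applied to one hidden vector."""
    activated = [gelu(h) for h in dense(hidden, weight, bias)]
    return layer_norm(activated, gamma, beta)

# Placeholder values (hypothetical): a 3-dimensional "[CLS]" vector from the
# top layer, an identity dense layer and a neutral layer norm.
cls_hidden = [0.5, -1.0, 2.0]
weight = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
bias = [0.0, 0.0, 0.0]
gamma = [1.0, 1.0, 1.0]
beta = [0.0, 0.0, 0.0]

transformed = head_transform(cls_hidden, weight, bias, gamma, beta)
print(transformed)
```

Even with an identity dense layer, the output differs from the input vector, which is why a vector taken from the top layer and one taken after the head transform are not interchangeable.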

Misc experiments

- graph_test.py: Confirms that the predictions of the model can be recapitulated using the MLM head's transform and bias. This illustrates that harvesting a vector from the top layer without including the head transform is not the same. This is particularly relevant when harvesting the [CLS] vector from the topmost layer as opposed to from the head (which includes an additional transform).
- extract_head_bias.py: The MLM head bias carries information analogous to a tfidf score for the vocab terms. This utility extracts the bias from a model. This could also have been done by examine_model.py.
- mag_test.py: Examines whether vector magnitudes carry any information. They don't seem to, unlike the bias values in the head, which do carry information (a tfidf of sorts for the vocab terms).
- att_mask.py: Examines whether the attention weights of terms in a sentence show a pattern in their dependency on other terms. This is examined for all layers.
- single_aminoseq_compare.py: Used to create representations for amino acid sequences. Use phase1.sh with the compare option skipped, then use this script to compare amino acid sequences using their [CLS] vectors. Example invocation: python single_aminoseq_compare.py -input ribosomal.txt -ref_input run1/ref_seqs.txt -ref_vecs run1/sent_vectors.txt -output final.txt -ngram 4

License

MIT License
