cls_sentence_representations's Introduction

(1) Sentence representation using [CLS] vectors

Sentence representation using the [CLS] vector of a pre-trained model without fine-tuning.

(2) Vector clustering (independent of how the vectors were created)

This repository also contains code for clustering vectors. The input to clustering is two files: a vector file (text file) and a corresponding descriptor file naming the vectors. Additional parameters are zscore and max tail pick, which control how much to pick from the distribution tails.

Top k neighbor clustering for each vector is also an option.
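As a rough illustration of the top k neighbor idea, the sketch below ranks every other vector by cosine similarity and keeps the k closest. It is not the repository's code: the function name `top_k_neighbors`, the toy vectors, and the choice of cosine similarity are assumptions made for the example.

```python
import numpy as np

def top_k_neighbors(vectors, k):
    """Return indices and cosine similarities of the k nearest other rows.

    Hypothetical helper for illustration; not part of this repository.
    """
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)  # guard against zero vectors
    sims = unit @ unit.T                          # pairwise cosine similarity
    np.fill_diagonal(sims, -np.inf)               # a vector is not its own neighbor
    order = np.argsort(-sims, axis=1)[:, :k]      # most similar first
    return order, np.take_along_axis(sims, order, axis=1)

# Toy stand-ins for the vector file and the descriptor file
descriptors = ["a", "b", "c", "d"]
vectors = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])

idx, sims = top_k_neighbors(vectors, k=1)
for name, nbrs in zip(descriptors, idx):
    print(name, "->", [descriptors[j] for j in nbrs])
```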

Installation

Set up a PyTorch environment with or without GPU support using the instructions at https://github.com/ajitrajasekharan/multi_gpu_test

pip install gdown

./fetch_model.sh

Usage

phase1.sh test.txt output.txt

This will output the neighbors for each sentence in test.txt (see the Note below on the modification to the HuggingFace PyTorch code for [CLS] vector harvesting).

  • Output files for phase1
    • sent_vectors.npy [CLS] vectors for each input sentence
    • user-specified output file: contains the neighborhood of each input sentence

phase2.sh

This script essentially invokes

python sentence_dist.py -terms $input -vectors sent_vectors.npy -zscore 4

This script can be used either to examine the sentence vectors (option 0) or to create clusters (option 1). The cluster stats are also output.

The vectors can be created by any mechanism; they need not be the [CLS] vectors created by phase1. All one needs to provide is a text file that serves as a description of each vector and a numpy array of sentence vectors. The z-score parameter sets how many standard deviations from the mean are used as the criterion for clustering (even if the distribution is not normal, the z-score serves as a reasonable threshold to pick elements of a cluster).
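The z-score threshold can be sketched as follows. This is a minimal illustration under stated assumptions, not the logic of sentence_dist.py: it assumes cosine similarity, takes the mean and standard deviation over all distinct pairs, and keeps the vectors whose similarity to a chosen pivot lies more than `zscore` standard deviations above the mean. The name `tail_neighbors` and the toy data are hypothetical.

```python
import numpy as np

def tail_neighbors(vectors, pivot, zscore):
    """Pick vectors in the upper tail of the similarity distribution for one pivot.

    Hypothetical helper for illustration; not part of this repository.
    """
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit.T
    # Distribution of similarities over all distinct pairs (upper triangle)
    pairwise = sims[np.triu_indices(len(vectors), k=1)]
    threshold = pairwise.mean() + zscore * pairwise.std()
    row = sims[pivot].copy()
    row[pivot] = -np.inf  # never pick the pivot itself
    return np.flatnonzero(row > threshold), threshold

# Toy data: 20 mutually orthogonal vectors, with vector 1 made a near-duplicate of vector 0
vectors = np.eye(20)
vectors[1] = vectors[0] + 0.1 * np.eye(20)[1]

picked, threshold = tail_neighbors(vectors, pivot=0, zscore=4)
print("threshold:", round(float(threshold), 4), "picked:", picked.tolist())
```

A larger zscore raises the threshold and yields smaller, tighter clusters; a smaller one admits more of the tail.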

  • Output files for option 0

    • cum_dist.txt Cumulative histogram of distribution
    • zero_vec_counts.txt orthogonal vector count
    • tail_counts.txt tail count of vectors
  • Output files for option 1

    • sent_cluster_pivots.txt Sentence clusters (just indices of sentences)
    • pivots.json Pivots of clusters
    • inv_pivots.json Inverted pivots
    • cluster_stats.json cluster stats
    • desc_clusters.txt descriptive clusters (shows sentences for each cluster element)

Example clusters

Cluster stats

Note.

Phase1 [CLS] vector generation requires a code patch to the transformers file modeling_bert.py in order to work. The patch harvests the [CLS] vector from the head, where there is a transform (the bias value is not used). See the patch.
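To show why a vector harvested from the head differs from the raw top-layer vector, the sketch below applies the kind of transform found in BERT's MLM head (a dense layer, GELU, then layer normalization) to a toy vector. It is an illustration only, not the patch itself: the weights are random, the hidden size is a toy value, and the function names are hypothetical.

```python
import math
import numpy as np

def gelu(x):
    # Exact GELU via the error function
    erf = np.vectorize(math.erf)
    return 0.5 * x * (1.0 + erf(x / math.sqrt(2.0)))

def head_transform(hidden, dense_w, dense_b, ln_gamma, ln_beta, eps=1e-12):
    """Dense -> GELU -> LayerNorm, as in an MLM head transform (no vocab projection or bias)."""
    h = gelu(hidden @ dense_w.T + dense_b)
    mean = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return ln_gamma * (h - mean) / np.sqrt(var + eps) + ln_beta

rng = np.random.default_rng(0)
hidden_size = 8  # toy value chosen for the example
cls_top_layer = rng.normal(size=hidden_size)          # stands in for the top-layer [CLS] vector
dense_w = rng.normal(size=(hidden_size, hidden_size)) # random stand-in weights
dense_b = rng.normal(size=hidden_size)
ln_gamma = np.ones(hidden_size)
ln_beta = np.zeros(hidden_size)

cls_from_head = head_transform(cls_top_layer, dense_w, dense_b, ln_gamma, ln_beta)
print("same as top-layer vector:", np.allclose(cls_top_layer, cls_from_head))
```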

Misc experiments

  • graph_test.py Confirms that the model's predictions can be reproduced using the MLM head's transform and bias. This illustrates that harvesting a vector from the top layer without applying the head transform does not give the same result. This is particularly relevant when harvesting the [CLS] vector from the topmost layer as opposed to from the head (which includes an additional transform).
  • extract_head_bias.py The MLM head bias carries information analogous to a tfidf score for the vocab terms. This utility extracts the bias from a model. This could also have been done by examine_model.py.
  • mag_test.py Examines whether vector magnitudes carry any information. They do not seem to, unlike the bias values in the head, which do carry information (a tfidf of sorts for the vocab terms).
  • att_mask.py Examines whether the attention weights of terms in a sentence show a pattern in their dependency on other terms. This is examined for all layers.
  • single_aminoseq_compare.py Used to create representations for amino acid sequences. Use phase1.sh with the compare option skipped, then use this script to compare amino acid sequences using their [CLS] vectors. Example invocation: python single_aminoseq_compare.py -input ribosomal.txt -ref_input run1/ref_seqs.txt -ref_vecs run1/sent_vectors.txt -output final.txt -ngram 4

License

MIT License

