ajitrajasekharan / bert_vector_clustering

Clustering learned BERT vectors for downstream tasks like unsupervised NER, unsupervised sentence embeddings etc.

License: MIT License

Python 71.90% Shell 1.24% Jupyter Notebook 26.86%

bert_vector_clustering's Issues

Label Cluster for new NER-Type in other Language

I don't quite get the directions in the README for creating a new bootstrapped labeling for unsupervised NER in a different language and with different labels/terms.

Step 1:
I emptied the files "labels.txt" and "bootstrap_entities.txt". Then I tried both of the following for a new bootstrapped labeling:

a) just ran with an empty seed word list
b) created a new bootstrap_entities.txt with new seed words (all part of my vocab.txt)

Then I called run.sh with Option=1 and Threshold=0 to generate the vectors and label them according to my seed words.

Upon finishing, a lot of files are written or updated, e.g. adaptive_debug_pivots.txt, inferred.txt, labels.txt, pivots.json, pivots.txt.

In the README it says:
"Cluster (run.sh with option 1 followed by 0) and then examine cluster pivots to label them.
Then rerun clustering and select candidates from inferred.txt. "

So it's not clear to me which file is meant by "examine cluster pivots".

At first I assumed I had to look at adaptive_debug_pivots.txt, so I started correcting labels in that file.

When I restarted clustering (with the same options as above: run.sh with option 1 followed by 0), the same outputs as in Step 1 were regenerated identically, so all my edits were simply overwritten.
inferred.txt is basically always empty.
So I must be doing something wrong.
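
Since a rerun regenerates the output files and overwrites hand edits, one workaround is to copy the edited files aside first. The following is only a minimal sketch: the helper `backup_files` and the `backups` directory are hypothetical names, not part of the repository, and the file locations are assumed to be wherever run.sh writes them.

```python
import shutil
import time
from pathlib import Path


def backup_files(paths, backup_root="backups"):
    """Copy each existing file into a timestamped directory.

    Returns the destination directory and the names of the files copied.
    Hypothetical helper, not part of bert_vector_clustering.
    """
    dest = Path(backup_root) / time.strftime("%Y%m%d-%H%M%S")
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in map(Path, paths):
        if path.is_file():  # silently skip files that do not exist
            shutil.copy2(path, dest / path.name)
            copied.append(path.name)
    return dest, copied


# Example call before rerunning run.sh (paths are assumptions):
# backup_files(["adaptive_debug_pivots.txt", "results/labels.txt"])
```

This only preserves the edits; it does not make the clustering step read them back in.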

Then I checked run.sh:

python dist_v2.py pwd 0 vocab.txt bert_vectors.txt 0 results/labels.txt results/stats_dict.txt preserve_1_2_grams.txt glue_words.txt bootstrap_entities.txt

and figured that bootstrap_entities.txt basically contains the pivot clusters. So I'm pretty much lost now.
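
To see which file sits at which argument position in the command above, the line can be split the way a shell would split it. This sketch only enumerates positions; what dist_v2.py does with each position is not stated in this thread, and `pwd` is taken literally as it appears in the post. The function name `argv_positions` is made up for illustration.

```python
import shlex

# The command exactly as quoted from run.sh in this issue.
COMMAND = (
    "python dist_v2.py pwd 0 vocab.txt bert_vectors.txt 0 "
    "results/labels.txt results/stats_dict.txt preserve_1_2_grams.txt "
    "glue_words.txt bootstrap_entities.txt"
)


def argv_positions(command):
    """Return (index, value) pairs as the script would see them in sys.argv."""
    tokens = shlex.split(command)
    # tokens[0] is the interpreter; sys.argv[0] is the script name.
    return list(enumerate(tokens[1:]))


for index, value in argv_positions(COMMAND):
    print(f"sys.argv[{index}] = {value}")
```

Under this split, results/labels.txt is sys.argv[6] and bootstrap_entities.txt is sys.argv[10].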

Could you please specify more precisely how I can iteratively improve the labeling of the generated clusters?

Expecting different labels.txt

I'm getting the following error when running run.sh:

Tokenize is set to : False
count of tokens in vocab.txt : 28996
Invalid line: ['GLU', 'the', 'the', '0.6', '0.06']
Traceback (most recent call last):
  File "dist_v2.py", line 878, in <module>
    main()
  File "dist_v2.py", line 835, in main
    b_embeds =BertEmbeds(sys.argv[1],sys.argv[2],sys.argv[3],sys.argv[4],True,True,sys.argv[6],sys.argv[7],sys.argv[8],sys.argv[9],sys.argv[10]) #True - for cache embeds; normalize - True
  File "dist_v2.py", line 160, in __init__
    self.labels_dict,self.lc_labels_dict = read_labels(labels_file)
  File "dist_v2.py", line 98, in read_labels
    assert(0)
AssertionError

For some reason the code expects to see 3 items in each line of labels.txt, but there are 5.
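
To find every offending line before rerunning, a small check along these lines may help. It is a sketch under two assumptions: that fields are whitespace-separated (the printed "Invalid line" list suggests a split) and that 3 is the expected count, as described above. `find_bad_lines` is a hypothetical helper, not a function from dist_v2.py.

```python
def find_bad_lines(path, expected_fields=3):
    """Return (line_number, fields) for each non-empty line whose
    whitespace-separated field count differs from expected_fields."""
    bad = []
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            fields = line.split()
            if fields and len(fields) != expected_fields:
                bad.append((line_number, fields))
    return bad


# Example call (path taken from the run.sh command in this thread):
# for number, fields in find_bad_lines("results/labels.txt"):
#     print(number, fields)
```

If most lines turn out to have 5 fields, the file format and the script version probably do not match, rather than a few lines being corrupt.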

error during ./run.sh

I executed the run.sh step and got an error:
inquired 0.51 0.0 ['inquired', 'asks']
Processing 28118 of 28996
***Singleton arr for term: MacKenzie

Has anyone seen this before?
The debug_pivots.txt file size currently sits at 554231.

What is the correct size of debug_pivots.txt when run to completion?
