
allenel's Introduction

allenel

evaluation & demo

  • evaluation
$ allennlp evaluate <path_to_model.tar.gz> <input test file> --include-package allenel --output-file <output path>
$ allennlp evaluate /home/junkyul/conda/allenel/models/Cmodel_15/model.tar.gz /home/junkyul/conda/neural-el_test/wiki.txt --include-package allenel --output-file /home/junkyul/conda/allenel/outputs/wiki.log
  • demo
$ python -m allennlp.service.server_simple --archive-path ./tests/fixtures/model.tar.gz --predictor el_linker --include-package allenel --title "EL Demo" --field-name context

allenel's People

Contributors

junkyul

Stargazers

Jacob Danovitch, Sihao Chen, izuna385

Watchers

James Cloos

allenel's Issues

fixes

trained weights or vocab.tar.gz
a Field that imports vectors from a file or other sources
read the pickle file and save it as another text file (extract the tar.gz, or see load, save)
this might also help with sparse processing?

model.tar.gz
see load_archive
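A minimal sketch of opening model.tar.gz in code with allennlp's load_archive; the path is a placeholder, and the allenel package is assumed to be importable so its registered reader/model/predictor classes resolve.

    import allenel  # noqa: F401  -- registers the custom reader/model/predictor
    from allennlp.models.archival import load_archive

    archive = load_archive("models/Cmodel_15/model.tar.gz")  # placeholder path
    model = archive.model      # the trained Model, weights already loaded
    config = archive.config    # the Params the archive was trained with

    # list the stored weight tensors, e.g. the learned embedding matrices
    for name, tensor in model.state_dict().items():
        print(name, tuple(tensor.shape))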

Embedding on CPU/GPU [bring only the relevant parts to the GPU...]
bag-of-words embedding

SRL model prediction; do additional processing while predicting labels or in a demo

minor things about reading the data set and vocab; CUDA memory error logs

create glove vocab file from pickle file

format
a token followed by space-separated numbers

len(glove_pickle_vocab) is 2196016

Read the pickle file (the indexer is the enumeration of tokens in a text file, one per line)
and re-write the vocab;
the vocab built by make-vocab has size 3826216 (much larger)
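A minimal sketch of the conversion, assuming the pickle maps token -> vector; the file names are placeholders.

    import pickle

    # Placeholder file names; the pickle is assumed to map token -> vector.
    with open("glove_vocab.pkl", "rb") as f:
        glove_pickle_vocab = pickle.load(f)
    print(len(glove_pickle_vocab))  # 2196016 expected

    # One line per token: the token followed by its space-separated numbers.
    # (For a plain vocab file, write only the token, one per line.)
    with open("glove_from_pickle.txt", "w", encoding="utf-8") as out:
        for token, vector in glove_pickle_vocab.items():
            out.write(token + " " + " ".join(str(x) for x in vector) + "\n")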

fix some tokens

allennlp version?

Hi,

Thanks for the great repo! I'm wondering which allennlp version you've tested the code with?

model with modified data set reader

forward receives

        fields = {
            "wid": wid_field,                                   # label
            "title": title_field,                               # label
            "types": type_field,                                # multi label for one hot
            "sentence": sentence_field,                         # text
            "sentence_left": sentence_left_field,               # text
            "sentence_right": sentence_right_field,             # text
            "mention": mention_surface_field,                   # meta
            "mention_normalized": mention_normalized_field,     # meta
            "coherences": coherences_field                      # multi label
        }
  • all are padded & batched tensors except the meta fields (a reader-side sketch of building these fields follows this list).

  • Use pre-trained embedding, GloVe

  • embeddings with one-hot vectors from MultiLabelFields (match dimensions, +1 for UNK)

  • DenseSparseAdam encountered a numerical error

  • how do I set an initializer in the config?

  • how do I set a regularizer in the config?
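A reader-side sketch of how these fields could be put together. The token indexer, namespaces, and the whitespace tokenization here are assumptions for illustration, not the repository's actual reader.

    from allennlp.data import Instance, Token
    from allennlp.data.fields import (LabelField, MetadataField,
                                      MultiLabelField, TextField)
    from allennlp.data.token_indexers import SingleIdTokenIndexer


    def text_to_instance(wid, title, types, sentence, left, right,
                         mention, mention_normalized, coherences):
        # Namespaces and whitespace tokenization are assumptions.
        indexers = {"tokens": SingleIdTokenIndexer(namespace="tokens")}

        def text_field(text):
            return TextField([Token(t) for t in text.split()], indexers)

        fields = {
            "wid": LabelField(wid, label_namespace="wid_labels"),
            "title": LabelField(title, label_namespace="title_labels"),
            "types": MultiLabelField(types, label_namespace="type_labels"),
            "sentence": text_field(sentence),
            "sentence_left": text_field(left),
            "sentence_right": text_field(right),
            "mention": MetadataField(mention),
            "mention_normalized": MetadataField(mention_normalized),
            "coherences": MultiLabelField(coherences,
                                          label_namespace="coherehce_labels"),
        }
        return Instance(fields)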

make-vocab on multiple name fields

name fields:
mention_surface: mention words shown in a sentence (context)
mention_sentence: a sentence
coherehce_labels: mention_surfaces shown together in a wiki page
type_labels: typically 1 or 2 of the 113 FIGER types for the mention_surface

  • DatasetReader reads training examples and returns Instance(fields) per example

  • An Instance is a map of fields, keyed by the names given inside the reader
    (an Instance is created by passing a dict of fields);
    each field is some container (TextField, LabelField, MetadataField, etc.)

  • The Vocabulary is also a mapping between text and indices, kept separately
    per namespace (the namespace is the key of a field/indexer).
    There is only one Vocabulary object in the program.

  • Once it is created from the fields that support vocabularies (almost all except the metadata ones?),
    the Vocabulary object should contain a namespace each for
    mention_surface, mention_sentence, coherehce_labels, type_labels, etc. (see the sketch below)
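A sketch of the single Vocabulary and its per-namespace mappings; `instances` is assumed to be the list of Instances produced by the reader, and the example token is a placeholder.

    from allennlp.data import Vocabulary

    # `instances` is assumed to come from the dataset reader.
    vocab = Vocabulary.from_instances(instances)

    # One Vocabulary object; one token<->index mapping per namespace.
    for namespace in ("mention_sentence", "type_labels", "coherehce_labels"):
        print(namespace, vocab.get_vocab_size(namespace))

    print(vocab.get_token_index("obama", namespace="mention_sentence"))
    print(vocab.get_token_from_index(0, namespace="type_labels"))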

$ allennlp make-vocab -h
usage: allennlp make-vocab [-h] -s SERIALIZATION_DIR [-o OVERRIDES]
                           [--include-package INCLUDE_PACKAGE]
                           param_path

Create a vocabulary from the specified dataset.

positional arguments:
  param_path            path to parameter file describing the model and its inputs

optional arguments:
  -h, --help            show this help message and exit
  -s SERIALIZATION_DIR, --serialization-dir SERIALIZATION_DIR
                        directory in which to save the vocabulary directory
  -o OVERRIDES, --overrides OVERRIDES
                        a JSON structure used to override the experiment configuration
  --include-package INCLUDE_PACKAGE
                        additional packages to include

testing on small data set

run the C model;
check the model;
learn the vocab;
for now, don't use the pre-trained GloVe embedding.

how long does it take?
entity dense + coherence dense vs.
sparse + sparse

type model

predictor from the learned model? evaluate?

epoch 5 results

Cmodel accuracy % (prior / posterior)
wiki: 84 (78.1 / 86.1)
ace2004: 91.9 (- / 93.1)
ace2005: 79.87 (81.1 / 83.7)
conll2012dev: 71.77 (70.9 / 83.4)
conll2012test: 68.6 (68.5 / 81.4)

CTEmodel accuracy
wiki: ??? (78.1 / 86.1)
ace2004: ??? (- / 93.1)
ace2005: ??? (81.1 / 83.7)
conll2012dev: ??? (70.9 / 83.4)
conll2012test: ??? (68.5 / 81.4)

Vocabulary related issues

  • pre-build the vocabulary from the training data;
    when testing, what happens if a token is not in the vocab?
    should it be replaced with @@unknown@@ by the indexer?

How does this happen? Does it touch the existing vocabulary?
Must I set extend to false?

  • use of pre-trained embeddings like GloVe:
    how do I use pre-trained word embeddings together with the data from instances?
    does simply setting extend to true make things work?
    what happens to the embedding? (the dimension of the matrix is controlled by allennlp)

What about the unknown and padding symbols? Are those tokens in the pre-trained embedding?

  • advantage of pre-trained embeddings... earlier convergence? If a special symbol is missing, what happens? Do start and end map to unknown when extend is off? (see the sketch below)
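A sketch of attaching a pre-trained GloVe file, continuing with the `vocab` object from the make-vocab notes above; the path, dimension, and namespace are assumptions. The matrix is sized from the vocabulary, rows for tokens found in the file are copied in, and everything else (including the special padding/unknown symbols, which are not in GloVe) keeps a random initialization.

    from allennlp.common import Params
    from allennlp.modules.token_embedders import Embedding

    # Path, dimension, and namespace are assumptions.
    glove_embedding = Embedding.from_params(vocab, Params({
        "embedding_dim": 300,
        "pretrained_file": "/path/to/glove.840B.300d.txt",
        "trainable": True,
        "vocab_namespace": "mention_sentence",
    }))
    # Rows for tokens that appear in the GloVe file are copied in; the rest
    # (e.g. @@PADDING@@, @@UNKNOWN@@) keep their random initialization.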

html demo


from the JSON dictionary view --> provide an HTML demo

1 input context -> many test cases

for each case:
rewrite the sentence
highlight the mention surface with its index?
top-3 hyperlinks to the wiki page https://en.wikipedia.org/?curid=<>
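A hypothetical helper showing that output format; the function name and the curids below are placeholders.

    # Hypothetical helper: format one test case for the HTML demo.
    def render_case(mention_index, mention_surface, top_curids):
        links = ["https://en.wikipedia.org/?curid={}".format(curid)
                 for curid in top_curids[:3]]
        return {"mention": "[{}] {}".format(mention_index, mention_surface),
                "links": links}

    print(render_case(0, "some mention", [111, 222, 333]))  # placeholder curids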

epoch 15 results

Cmodel accuracy (prior % / posterior %)
wiki: 0.8447937131630648 (78.1 / 86.1)
ace2004: 0.9233870967741935 (- / 93.1)
ace2005: 0.7951219512195122 (81.1 / 83.7)
conll2012dev: 0.7198788665368808 (70.9 / 83.4)
conll2012test: 0.6947565543071161 (68.5 / 81.4)

CTEmodel accuracy (prior % / posterior %)
wiki: 0.8447937131630648 (78.1 / 86.1)
ace2004: 0.9233870967741935 (- / 93.1)
ace2005: 0.8048780487804879 (81.1 / 83.7)
conll2012dev: 0.7196625567813109 (70.9 / 83.4)
conll2012test: 0.6952247191011236 (68.5 / 81.4)

pass prior probability from reader

when testing the model (by evaluate)

The ``evaluate`` subcommand can be used to
evaluate a trained model against a dataset
and report any metrics calculated by the model

it does not call the predictor but runs the model's forward and reports metrics;
therefore, if the metric does not compute a probability that reflects the prior,
there is no way to reproduce the result (see the sketch below).
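A sketch, not the repository's code, of what the forward/metric path would have to do with a prior handed over by the reader (e.g. as a padded array field): fold the prior into the score before the metric is updated, so that evaluate reproduces the prior-aware numbers.

    import torch

    def combine_with_prior(logits, prior):
        # logits: (batch, num_candidates) raw model scores
        # prior:  (batch, num_candidates) crosswiki prior probabilities,
        #         assumed to be passed through from the reader
        return torch.log_softmax(logits, dim=-1) + torch.log(prior + 1e-12)

    # Toy numbers: the prior flips the winning candidate from 0 to 1.
    logits = torch.tensor([[2.0, 1.8, 0.5]])
    prior = torch.tensor([[0.05, 0.90, 0.05]])
    print(combine_with_prior(logits, prior).argmax(dim=-1))  # tensor([1])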

MultiprocessDatasetReader Memory Issue

MultiprocessDatasetReader does not free memory on CentOS Linux 7.

iterator: cache_instances : False
iterator: max_instances_in_memory : default (1000)
dataset_reader: output_queue_size: default (1000?)
dataset_reader: num_workers: 1

Now use a plain DatasetReader instead, after cat-ing all 41 training files (~27 GB) into one.

A similar issue was reported; it might be related to the OS.

Errors while fixing

Embedding config file and name_embedding fix

bucket iterator:
after switching to ListField[LabelField],
num_tokens key error on ["coherences", "num_tokens"] because of the sorting_keys (see the sketch below)
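A sketch of the fix under that assumption: a ListField[LabelField] pads on "num_fields" rather than "num_tokens", so the old sorting key no longer exists; the batch size here is arbitrary.

    from allennlp.data.iterators import BucketIterator

    # Sort on a TextField's num_tokens and on the ListField's num_fields.
    iterator = BucketIterator(
        sorting_keys=[("sentence", "num_tokens"), ("coherences", "num_fields")],
        batch_size=32,
    )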

inside model.. forward

metric computation

prediction

run time or performance issues

  • CUDA out of memory; no clever memory management; use a larger GPU

  • optimizer (dense/sparse Adam) numerical error -> gradient clipping?

  • when is a sparse embedding better?
    I am now passing TextFields, so it is essentially passing the sparse indices of the coherences;
    to make the embedding matrix sparse --> set sparse to true
    and use the DenseSparseAdam optimizer

  • is it possible to run more epochs from a pre-saved model?
    one epoch takes > 10 hr; it would be better if the model could be trained incrementally from previous runs;
    fine-tune? what about the vocab?

evaluation

  • how to run
    $ allennlp evaluate model_archive evaluation_data_path
    --output-file FILE --cuda-device 0 --include-package

  • how to customize evaluation using model.tar.gz?
    the crosswiki pickle file is loaded because of the original config -> override that option?
    what are the weights?

embedding config issues

The configuration-file rules for TextFieldEmbedder, TokenEmbedder, and Embedding are complicated;
what is the correct way of using them? (see the sketch below)
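A sketch of the nesting written out as a Python dict; names, dimensions, and the path are assumptions, and depending on the allennlp version the indexer names may sit directly under text_field_embedder instead of under a token_embedders key.

    # TextFieldEmbedder config: one TokenEmbedder per token-indexer name; the
    # "embedding" type is the TokenEmbedder that wraps an Embedding layer.
    text_field_embedder_config = {
        "token_embedders": {
            "tokens": {
                "type": "embedding",
                "embedding_dim": 300,
                "pretrained_file": "/path/to/glove.840B.300d.txt",
                "trainable": True,
            }
        }
    }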
