
allenel's Introduction

allenel

evaluation & demo

  • evaluation
$ allennlp evaluate <path_to_model.tar.gz> <input test file> --include-package allenel --output-file <output path>
$ allennlp evaluate /home/junkyul/conda/allenel/models/Cmodel_15/model.tar.gz /home/junkyul/conda/neural-el_test/wiki.txt --include-package allenel --output-file /home/junkyul/conda/allenel/outputs/wiki.log
  • demo
$ python -m allennlp.service.server_simple --archive-path ./tests/fixtures/model.tar.gz --predictor el_linker --include-package allenel --title "EL Demo" --field-name context

allenel's People

Contributors

junkyul

Stargazers

Jacob Danovitch, Sihao Chen, izuna385

Watchers

James Cloos

allenel's Issues

fixes

trained weights or vocab.tar.gz
a Field that imports vectors from a file or other sources
read the pickle file and save it as another text file (extract the tar.gz, or see load, save)
this might also help with sparse processing?

model.tar.gz
see load_archive
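A minimal sketch of opening model.tar.gz in code with allennlp's load_archive; the path is a placeholder, and the allenel package is assumed to be importable so its registered reader/model/predictor classes resolve.

    import allenel  # noqa: F401  -- registers the custom reader/model/predictor
    from allennlp.models.archival import load_archive

    archive = load_archive("models/Cmodel_15/model.tar.gz")  # placeholder path
    model = archive.model      # the trained Model, weights already loaded
    config = archive.config    # the Params the archive was trained with

    # list the stored weight tensors, e.g. the learned embedding matrices
    for name, tensor in model.state_dict().items():
        print(name, tuple(tensor.shape))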

Embedding on CPU/GPU [bring only the relevant parts to the GPU...]
bag-of-words embedding

SRL model prediction; do additional processing while predicting labels or in a demo

minor things about reading the data set and vocab; CUDA memory error logs

create glove vocab file from pickle file

format
a token followed by space-separated numbers

len(glove_pickle_vocab) is 2196016

Read the pickle file (the indexer is the enumeration of tokens in a text file, one per line)
and re-write the vocab;
the vocab built by make-vocab has size 3826216 (much larger)
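A minimal sketch of the conversion, assuming the pickle maps token -> vector; the file names are placeholders.

    import pickle

    # Placeholder file names; the pickle is assumed to map token -> vector.
    with open("glove_vocab.pkl", "rb") as f:
        glove_pickle_vocab = pickle.load(f)
    print(len(glove_pickle_vocab))  # 2196016 expected

    # One line per token: the token followed by its space-separated numbers.
    # (For a plain vocab file, write only the token, one per line.)
    with open("glove_from_pickle.txt", "w", encoding="utf-8") as out:
        for token, vector in glove_pickle_vocab.items():
            out.write(token + " " + " ".join(str(x) for x in vector) + "\n")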

fix some tokens

allennlp version?

Hi,

Thanks for the great repo! I'm wondering which allennlp version you've tested the code with?

model with modified data set reader

forward receives

        fields = {
            "wid": wid_field,                                   # label
            "title": title_field,                               # label
            "types": type_field,                                # multi label for one hot
            "sentence": sentence_field,                         # text
            "sentence_left": sentence_left_field,               # text
            "sentence_right": sentence_right_field,             # text
            "mention": mention_surface_field,                   # meta
            "mention_normalized": mention_normalized_field,     # meta
            "coherences": coherences_field                      # multi label
        }
  • all are padded & batched tensors except the meta fields (a reader-side sketch of building these fields follows this list).

  • Use pre-trained embedding, GloVe

  • embeddings with one-hot vectors from MultiLabelFields (match dimensions, +1 for UNK)

  • DenseSparseAdam encountered a numerical error

  • how do I set an initializer in the config?

  • how do I set a regularizer in the config?
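A reader-side sketch of how these fields could be put together. The token indexer, namespaces, and the whitespace tokenization here are assumptions for illustration, not the repository's actual reader.

    from allennlp.data import Instance, Token
    from allennlp.data.fields import (LabelField, MetadataField,
                                      MultiLabelField, TextField)
    from allennlp.data.token_indexers import SingleIdTokenIndexer


    def text_to_instance(wid, title, types, sentence, left, right,
                         mention, mention_normalized, coherences):
        # Namespaces and whitespace tokenization are assumptions.
        indexers = {"tokens": SingleIdTokenIndexer(namespace="tokens")}

        def text_field(text):
            return TextField([Token(t) for t in text.split()], indexers)

        fields = {
            "wid": LabelField(wid, label_namespace="wid_labels"),
            "title": LabelField(title, label_namespace="title_labels"),
            "types": MultiLabelField(types, label_namespace="type_labels"),
            "sentence": text_field(sentence),
            "sentence_left": text_field(left),
            "sentence_right": text_field(right),
            "mention": MetadataField(mention),
            "mention_normalized": MetadataField(mention_normalized),
            "coherences": MultiLabelField(coherences,
                                          label_namespace="coherehce_labels"),
        }
        return Instance(fields)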

make-vocab on multiple name fields

name fields:
mention_surface: mention words shown in a sentence (context)
mention_sentence: a sentence
coherehce_labels: mention_surfaces shown together in a wiki page
type_labels: typically 1 or 2 of the 113 FIGER types for the mention_surface

  • DatasetReader reads training examples and returns Instance(fields) per example

  • An Instance is a map of fields, keyed by the names given inside the reader
    (an Instance is created by passing a dict of fields);
    each field is some container (TextField, LabelField, MetadataField, etc.)

  • The Vocabulary is also a mapping between text and indices, kept separately
    per namespace (the namespace is the key of a field/indexer).
    There is only one Vocabulary object in the program.

  • Once it is created from the fields that support vocabularies (almost all except the metadata ones?),
    the Vocabulary object should contain a namespace each for
    mention_surface, mention_sentence, coherehce_labels, type_labels, etc. (see the sketch below)
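A sketch of the single Vocabulary and its per-namespace mappings; `instances` is assumed to be the list of Instances produced by the reader, and the example token is a placeholder.

    from allennlp.data import Vocabulary

    # `instances` is assumed to come from the dataset reader.
    vocab = Vocabulary.from_instances(instances)

    # One Vocabulary object; one token<->index mapping per namespace.
    for namespace in ("mention_sentence", "type_labels", "coherehce_labels"):
        print(namespace, vocab.get_vocab_size(namespace))

    print(vocab.get_token_index("obama", namespace="mention_sentence"))
    print(vocab.get_token_from_index(0, namespace="type_labels"))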

$ allennlp make-vocab -h
usage: allennlp make-vocab [-h] -s SERIALIZATION_DIR [-o OVERRIDES]
                           [--include-package INCLUDE_PACKAGE]
                           param_path

Create a vocabulary from the specified dataset.

positional arguments:
  param_path            path to parameter file describing the model and its inputs

optional arguments:
  -h, --help            show this help message and exit
  -s SERIALIZATION_DIR, --serialization-dir SERIALIZATION_DIR
                        directory in which to save the vocabulary directory
  -o OVERRIDES, --overrides OVERRIDES
                        a JSON structure used to override the experiment configuration
  --include-package INCLUDE_PACKAGE
                        additional packages to include

testing on small data set

run the C model;
check the model;
learn the vocab;
for now, don't use the pre-trained GloVe embedding.

how long does it take?
entity dense + coherence dense vs.
sparse + sparse

type model

predictor from the learned model? evaluate?

epoch 5 results

Cmodel accuracy % (prior / posterior)
wiki: 84 (78.1 / 86.1)
ace2004: 91.9 (- / 93.1)
ace2005: 79.87 (81.1 / 83.7)
conll2012dev: 71.77 (70.9 / 83.4)
conll2012test: 68.6 (68.5 / 81.4)

CTEmodel accuracy
wiki: ??? (78.1 / 86.1)
ace2004: ??? (- / 93.1)
ace2005: ??? (81.1 / 83.7)
conll2012dev: ??? (70.9 / 83.4)
conll2012test: ??? (68.5 / 81.4)

Vocabulary related issues

  • pre-build the vocabulary from the training data;
    when testing, what happens if a token is not in the vocab?
    should it be replaced with @@unknown@@ by the indexer?

How does this happen? Does it touch the existing vocabulary?
Must I set extend to false?

  • use of pre-trained embeddings like GloVe:
    how do I use pre-trained word embeddings together with the data from instances?
    does simply setting extend to true make things work?
    what happens to the embedding? (the dimension of the matrix is controlled by allennlp)

What about the unknown and padding symbols? Are those tokens in the pre-trained embedding?

  • advantage of pre-trained embeddings... earlier convergence? If a special symbol is missing, what happens? Do start and end map to unknown when extend is off? (see the sketch below)
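A sketch of attaching a pre-trained GloVe file, continuing with the `vocab` object from the make-vocab notes above; the path, dimension, and namespace are assumptions. The matrix is sized from the vocabulary, rows for tokens found in the file are copied in, and everything else (including the special padding/unknown symbols, which are not in GloVe) keeps a random initialization.

    from allennlp.common import Params
    from allennlp.modules.token_embedders import Embedding

    # Path, dimension, and namespace are assumptions.
    glove_embedding = Embedding.from_params(vocab, Params({
        "embedding_dim": 300,
        "pretrained_file": "/path/to/glove.840B.300d.txt",
        "trainable": True,
        "vocab_namespace": "mention_sentence",
    }))
    # Rows for tokens that appear in the GloVe file are copied in; the rest
    # (e.g. @@PADDING@@, @@UNKNOWN@@) keep their random initialization.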

html demo


from the JSON dictionary view --> provide an HTML demo

1 input context -> many test cases

for each case:
rewrite the sentence
highlight the mention surface with its index?
top-3 hyperlinks to the wiki page https://en.wikipedia.org/?curid=<>
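A hypothetical helper showing that output format; the function name and the curids below are placeholders.

    # Hypothetical helper: format one test case for the HTML demo.
    def render_case(mention_index, mention_surface, top_curids):
        links = ["https://en.wikipedia.org/?curid={}".format(curid)
                 for curid in top_curids[:3]]
        return {"mention": "[{}] {}".format(mention_index, mention_surface),
                "links": links}

    print(render_case(0, "some mention", [111, 222, 333]))  # placeholder curids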

epoch 15 results

Cmodel accuracy (prior % / posterior %)
wiki: 0.8447937131630648 (78.1 / 86.1)
ace2004: 0.9233870967741935 (- / 93.1)
ace2005: 0.7951219512195122 (81.1 / 83.7)
conll2012dev: 0.7198788665368808 (70.9 / 83.4)
conll2012test: 0.6947565543071161 (68.5 / 81.4)

CTEmodel accuracy (prior % / posterior %)
wiki: 0.8447937131630648 (78.1 / 86.1)
ace2004: 0.9233870967741935 (- / 93.1)
ace2005: 0.8048780487804879 (81.1 / 83.7)
conll2012dev: 0.7196625567813109 (70.9 / 83.4)
conll2012test: 0.6952247191011236 (68.5 / 81.4)

pass prior probability from reader

when testing the model (by evaluate)

The ``evaluate`` subcommand can be used to
evaluate a trained model against a dataset
and report any metrics calculated by the model

it does not call the predictor but runs the model's forward and reports metrics;
therefore, if the metric does not compute a probability that reflects the prior,
there is no way to reproduce the result (see the sketch below).
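A sketch, not the repository's code, of what the forward/metric path would have to do with a prior handed over by the reader (e.g. as a padded array field): fold the prior into the score before the metric is updated, so that evaluate reproduces the prior-aware numbers.

    import torch

    def combine_with_prior(logits, prior):
        # logits: (batch, num_candidates) raw model scores
        # prior:  (batch, num_candidates) crosswiki prior probabilities,
        #         assumed to be passed through from the reader
        return torch.log_softmax(logits, dim=-1) + torch.log(prior + 1e-12)

    # Toy numbers: the prior flips the winning candidate from 0 to 1.
    logits = torch.tensor([[2.0, 1.8, 0.5]])
    prior = torch.tensor([[0.05, 0.90, 0.05]])
    print(combine_with_prior(logits, prior).argmax(dim=-1))  # tensor([1])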

MultiprocessDatasetReader Memory Issue

MultiprocessDatasetReader does not free memory on CentOS Linux 7.

iterator: cache_instances : False
iterator: max_instances_in_memory : default (1000)
dataset_reader: output_queue_size: default (1000?)
dataset_reader: num_workers: 1

Now use a plain DatasetReader instead, after cat-ing all 41 training files (~27 GB) into one.

A similar issue was reported; it might be related to the OS.

Errors while fixing

Embedding config file and name_embedding fix

bucket iterator:
after switching to ListField[LabelField],
num_tokens key error on ["coherences", "num_tokens"] because of the sorting_keys (see the sketch below)
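A sketch of the fix under that assumption: a ListField[LabelField] pads on "num_fields" rather than "num_tokens", so the old sorting key no longer exists; the batch size here is arbitrary.

    from allennlp.data.iterators import BucketIterator

    # Sort on a TextField's num_tokens and on the ListField's num_fields.
    iterator = BucketIterator(
        sorting_keys=[("sentence", "num_tokens"), ("coherences", "num_fields")],
        batch_size=32,
    )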

inside model.. forward

metric computation

prediction

run time or performance issues

  • CUDA out of memory; no clever memory management; use a larger GPU

  • optimizer (dense/sparse Adam) numerical error -> gradient clipping?

  • when is a sparse embedding better?
    I am now passing TextFields, so it is essentially passing the sparse indices of the coherences;
    to make the embedding matrix sparse --> set sparse to true
    and use the DenseSparseAdam optimizer

  • is it possible to run more epochs from a pre-saved model?
    one epoch takes > 10 hr; it would be better if the model could be trained incrementally from previous runs;
    fine-tune? what about the vocab?

evaluation

  • how to run
    $ allennlp evaluate model_archive evaluation_data_path
    --output-file FILE --cuda-device 0 --include-package

  • how to customize evaluation using model.tar.gz?
    the crosswiki pickle file is loaded because of the original config -> override that option?
    what are the weights?

embedding config issues

The configuration-file rules for TextFieldEmbedder, TokenEmbedder, and Embedding are complicated;
what is the correct way of using them? (see the sketch below)
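A sketch of the nesting written out as a Python dict; names, dimensions, and the path are assumptions, and depending on the allennlp version the indexer names may sit directly under text_field_embedder instead of under a token_embedders key.

    # TextFieldEmbedder config: one TokenEmbedder per token-indexer name; the
    # "embedding" type is the TokenEmbedder that wraps an Embedding layer.
    text_field_embedder_config = {
        "token_embedders": {
            "tokens": {
                "type": "embedding",
                "embedding_dim": 300,
                "pretrained_file": "/path/to/glove.840B.300d.txt",
                "trainable": True,
            }
        }
    }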
