junkyul / allenel
entity linking in allennlp
format
each line is a token followed by its space-separated vector values
len(glove_pickle_vocab) is 2196016
Read pickle file (indexer is the enumeration of tokens in a text file, one per line)
re-write vocab;
the vocab built by make-vocab size: 3826216 (much larger)
fix some tokens
trained weight or vocab.tar.gz
A field that imports vectors from a file or other sources
read the pickle file and save it as another text file (extract the tar.gz, or see load/save); a sketch follows below
this might also help sparse processing?
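A minimal sketch of the pickle-to-text rewrite, assuming the pickled object is a dict mapping token to row index in the GloVe matrix (file names are hypothetical):

import pickle

# assumption: the pickle holds token -> row index of the GloVe matrix
with open("glove_vocab.pkl", "rb") as f:
    glove_pickle_vocab = pickle.load(f)

# Vocabulary.set_from_file expects one token per line, line number = index
tokens = sorted(glove_pickle_vocab, key=glove_pickle_vocab.get)
with open("glove_vocab.txt", "w", encoding="utf-8") as f:
    for token in tokens:
        f.write(token + "\n")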
model.tar.gz
see load_archive
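A minimal sketch of loading the archive, assuming a local model.tar.gz:

from allennlp.models.archival import load_archive

archive = load_archive("model.tar.gz")  # unpacks config + weights + vocab
model = archive.model                   # the trained Model, set to eval mode
config = archive.config                 # the original experiment Params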
Embedding on CPU/GPU [bring only the relevant parts to the GPU...]
bag-of-words Embedding
SRL model prediction; do additional processing while predicting labels or in a demo (sketch below)
minor things about reading the dataset and vocab, CUDA memory error logs
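One way to hook in the extra processing is to wrap the loaded model in a Predictor and post-process its output; a sketch, assuming an SRL archive and the stock SRL predictor:

from allennlp.models.archival import load_archive
from allennlp.predictors import Predictor

archive = load_archive("srl-model.tar.gz")  # placeholder path
predictor = Predictor.from_archive(archive, "semantic-role-labeling")
output = predictor.predict_json({"sentence": "Bob visited Paris."})
# do the extra processing here, e.g. reformat output["verbs"] for a demo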
when testing the model (by evaluate)
The ``evaluate`` subcommand can be used to evaluate a trained model against a dataset and report any metrics calculated by the model.
it does not call the predictor but runs the model's forward and reports metrics;
therefore, if a metric does not compute the probability reflecting the prior,
there is no way to reproduce the result.
CUDA out of memory; no clever memory management; use a larger GPU
optimizer dense/sparse Adam numerical error -> gradient clipping?
when is a sparse embedding better?
I am now passing TextFields, so it is essentially passing sparse indices for coherence;
to make the embedding matrix sparse --> set sparse: true
optimizer: dense_sparse_adam (sketch below)
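A sketch of the two config fragments involved (dimension and clipping value are made up):

embedding_config = {
    "type": "embedding",
    "embedding_dim": 300,
    "sparse": True,  # gradients for this matrix come back as sparse tensors
}
trainer_config_fragment = {
    "optimizer": {"type": "dense_sparse_adam"},  # handles mixed dense/sparse grads
    "grad_clipping": 5.0,  # hedge against the numerical errors noted above
}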
possible to resume epochs from a pre-saved model?
1 epoch takes > 10 hr; better if the model can be trained incrementally from previous runs;
fine-tune? what about the vocab?
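The fine-tune subcommand may cover this; a sketch for the AllenNLP versions of that era (paths are placeholders, and the flag names are from memory, so verify against allennlp fine-tune -h):

$ allennlp fine-tune -m model.tar.gz -c experiment.json -s new_run_dir --extend-vocab

If available, --extend-vocab extends the saved vocabulary from the new training data instead of reusing it unchanged, which speaks to the vocab question above.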
from the JSON dictionary view --> provide an HTML demo
1 input context -> many test cases
for each case:
rewrite the sentence
highlight the mention surface with an index?
top 3 hyperlinks to wiki pages: https://en.wikipedia.org/?curid=<> (sketch below)
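A toy sketch of rendering those links (function name and inputs are hypothetical):

def wiki_links(top_wids):
    # top_wids: predicted Wikipedia page ids, best first
    base = "https://en.wikipedia.org/?curid={}"
    return [base.format(wid) for wid in top_wids[:3]]

# e.g. wrap the highlighted mention surface with <a href="..."> tags in the demo
print(wiki_links([111, 222, 333]))  # placeholder ids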
how does this happen? does it touch the existing vocabulary?
must I turn extend off (set it to false)?
unknown, padding? are those tokens in the pre-trained embedding? (AllenNLP reserves @@PADDING@@ and @@UNKNOWN@@ by default)
field names:
mention_surface: mention words shown in a sentence (context)
mention_sentence: a sentence
coherehce_labels: mention_surfaces shown together in a wiki page
type_labels: typically 1 or 2 of the 113 FIGER types for the mention_surface
DatasetReader reads training examples and returns one Instance(fields) per example
an Instance is a map of fields, keyed by the names given inside the reader
(an Instance is created by passing a dict of fields)
each field is some container (TextField, LabelField, MetadataField, etc.)
Vocabulary is likewise a set of mappings between text and index, where
each namespace (the key of a field) keys its own token-index dict.
There is only 1 Vocabulary object in the program.
once it is created for the Fields that support vocabularies (almost all except MetadataField?)
Vocabulary object should contain vocabularies for
mention_surface, mention_sentence, coherehce_labels, type_labels, etc
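A minimal sketch of the reader side, assuming the namespaces match the field names above:

from allennlp.data import Instance
from allennlp.data.fields import TextField, LabelField, MetadataField
from allennlp.data.token_indexers import SingleIdTokenIndexer
from allennlp.data.tokenizers import Token

indexers = {"tokens": SingleIdTokenIndexer(namespace="mention_sentence")}
fields = {
    "mention_sentence": TextField([Token(w) for w in "Bob visited Paris .".split()], indexers),
    "mention_surface": MetadataField("Paris"),  # metadata has no vocabulary
    "type_labels": LabelField("/location/city", label_namespace="type_labels"),
}
instance = Instance(fields)  # one Instance per training example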
$ allennlp make-vocab -h
usage: allennlp make-vocab [-h] -s SERIALIZATION_DIR [-o OVERRIDES]
                           [--include-package INCLUDE_PACKAGE]
                           param_path

Create a vocabulary from the specified dataset.

positional arguments:
  param_path            path to parameter file describing the model and its inputs

optional arguments:
  -h, --help            show this help message and exit
  -s SERIALIZATION_DIR, --serialization-dir SERIALIZATION_DIR
                        directory in which to save the vocabulary directory
  -o OVERRIDES, --overrides OVERRIDES
                        a JSON structure used to override the experiment configuration
  --include-package INCLUDE_PACKAGE
                        additional packages to include
MultiprocessDatasetReader does not free memory on CentOS Linux 7.
iterator: cache_instances: False
iterator: max_instances_in_memory: default (1000)
dataset_reader: output_queue_size: default (1000?)
dataset_reader: num_workers: 1
now using the plain DatasetReader instead, after cat-ing all 41 training files (~27 GB)
a similar issue was reported; it might be related to the OS
forward receives
fields = {
"wid": wid_field, # label
"title": title_field, # label
"types": type_field, # multi label for one hot
"sentence": sentence_field, # text
"sentence_left": sentence_left_field, # text
"sentence_right": sentence_right_field, # text
"mention": mention_surface_field, # meta
"mention_normalized": mention_normalized_field, # meta
"coherences": coherences_field # multi label
}
all are padded & batched tensors except the meta (MetadataField) entries.
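AllenNLP passes each field to forward as a keyword argument named after its key, so the signature mirrors the dict above; a sketch (types only, body elided):

from typing import Any, Dict, List
import torch

class ELModel(torch.nn.Module):  # stand-in for the allennlp Model subclass
    def forward(self,
                wid: torch.Tensor,
                title: torch.Tensor,
                types: torch.Tensor,                      # multi-hot label tensor
                sentence: Dict[str, torch.Tensor],        # TextFields arrive as
                sentence_left: Dict[str, torch.Tensor],   # dicts of index tensors
                sentence_right: Dict[str, torch.Tensor],
                mention: List[Any],                       # MetadataFields arrive
                mention_normalized: List[Any],            # as plain Python lists
                coherences: torch.Tensor) -> Dict[str, torch.Tensor]:
        ...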
use the pre-trained GloVe embedding
Embeddings with one-hot vectors built from MultiLabels (match the dimensions, + 1 for UNK)
DenseSparseAdam encountered numerical error
how to set the initializer in the config?
how to set the regularizer in the config? (sketch below)
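A sketch of the two config fragments, which AllenNLP reads as lists of [regex, params] pairs (regexes and values here are made up):

model_config_fragment = {
    "initializer": [
        ["entity_embedder.weight", {"type": "xavier_uniform"}],
    ],
    "regularizer": [
        [".*weight", {"type": "l2", "alpha": 1e-4}],
    ],
}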
Hi,
Thanks for the great repo! I'm wondering which allennlp version you've tested the code with?
C model accuracy (prior / posterior %)
wiki: 0.8447937131630648 (78.1 / 86.1)
ace2004: 0.9233870967741935 (- / 93.1)
ace2005: 0.7951219512195122 (81.1 / 83.7)
conll2012dev: 0.7198788665368808 (70.9 / 83.4)
conll2012test: 0.6947565543071161 (68.5 / 81.4)
CTE model accuracy (prior / posterior %)
wiki: 0.8447937131630648 (78.1 / 86.1)
ace2004: 0.9233870967741935 (- / 93.1)
ace2005: 0.8048780487804879 (81.1 / 83.7)
conll2012dev: 0.7196625567813109 (70.9 / 83.4)
conll2012test: 0.6952247191011236 (68.5 / 81.4)
run C model;
check model;
learn vocab;
for now, not using the pre-trained GloVe embedding.
how long?
entity: dense, coherence: dense vs.
entity: sparse, coherence: sparse
type model
a predictor from the learned model? evaluate?
TextFieldEmbedder, TokenEmbedder, and Embedding configuration-file rules are complicated;
what is the correct way of using them? (sketch below)
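One reading of the rules: the model's text_field_embedder holds token_embedders whose keys must match the token-indexer names used in the reader (not the field names); a sketch with made-up path and dimension:

text_field_embedder_config = {
    "token_embedders": {
        "tokens": {  # must match the indexer key "tokens" in the reader
            "type": "embedding",
            "embedding_dim": 300,
            "pretrained_file": "glove_vocab.txt.gz",  # placeholder path
            "trainable": False,
        }
    }
}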
C model accuracy % (prior / posterior)
wiki: 84 (78.1 / 86.1)
ace2004: 91.9 (- / 93.1)
ace2005: 79.87 (81.1 / 83.7)
conll2012dev: 71.77 (70.9 / 83.4)
conll2012test: 68.6 (68.5 / 81.4)
CTE model accuracy % (prior / posterior)
wiki: ??? (78.1 / 86.1)
ace2004: ??? (- / 93.1)
ace2005: ??? (81.1 / 83.7)
conll2012dev: ??? (70.9 / 83.4)
conll2012test: ??? (68.5 / 81.4)
fix the Embedding config file and the _embedding name
bucket iterator;
after switching to ListField[LabelField], a num_tokens KeyError on ["coherences", "num_tokens"],
because the sorting_keys still name num_tokens while a ListField pads with num_fields (sketch below)
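The padding key a ListField reports is num_fields, not num_tokens, so the sorting key has to change with the field type; a sketch (batch size is made up):

iterator_config = {
    "type": "bucket",
    "batch_size": 64,
    "sorting_keys": [["coherences", "num_fields"]],
}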
inside the model's forward:
metric computation
prediction
how to run
$ allennlp evaluate model_archive evaluation_data_path \
    --output-file FILE --cuda-device 0 --include-package PACKAGE
how to customize using model.tar.gz
the crosswiki pickle file gets loaded because of the original config -> override that option? (sketch below)
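The -o/--overrides flag should let the evaluation run skip it; a sketch, where the config key is hypothetical:

$ allennlp evaluate model.tar.gz eval_data.txt \
    -o '{"dataset_reader": {"crosswiki_path": null}}'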
what are the weights?