
adagram.jl's Introduction

AdaGram

The Adaptive Skip-gram (AdaGram) model is a nonparametric extension of the well-known Skip-gram model implemented in the word2vec software, able to learn multiple representations per word that capture different word meanings. This project implements AdaGram in the Julia language.

Installation

AdaGram is not in the Julia package registry yet, so it should be installed in the following way:

using Pkg
Pkg.add(PackageSpec(url="https://github.com/sbos/AdaGram.jl.git"))

Training a model

The most straightforward way to train a model is to use the train.sh script. If you run it with no parameters or with the --help option, it prints usage information:

usage: train.jl [--window WINDOW] [--workers WORKERS]
                [--min-freq MIN-FREQ] [--remove-top-k REMOVE-TOP-K]
                [--dim DIM] [--prototypes PROTOTYPES] [--alpha ALPHA]
                [--d D] [--subsample SUBSAMPLE] [--context-cut]
                [--epochs EPOCHS] [--init-count INIT-COUNT]
                [--stopwords STOPWORDS]
                [--sense-treshold SENSE-TRESHOLD] [--regex REGEX] [-h]
                train dict output

Here is the description of all parameters:

  • WINDOW is the half-context size. Useful values are 3-10.
  • WORKERS sets how many parallel processes will be used for training.
  • MIN-FREQ specifies the minimum word frequency below which a word will be ignored. Useful values are 5-50 depending on the corpus.
  • REMOVE-TOP-K additionally ignores the K most frequent words.
  • DIM is the dimensionality of the learned representations.
  • PROTOTYPES sets the maximum number of learned prototypes. This is the truncation level used in truncated stick-breaking, so the amount of memory used depends linearly on this number.
  • ALPHA is the parameter of the underlying Dirichlet process. Larger values of ALPHA lead to more meanings being discovered. Useful values are 0.05-0.2.
  • D is used together with ALPHA in the Pitman-Yor process; D=0 turns it into a Dirichlet process. We couldn't get reasonable results with PY, but left the option to change D.
  • SUBSAMPLE is a threshold for subsampling frequent words, similar to how this is done in word2vec.
  • --context-cut randomly decreases WINDOW during training, which increases training speed with almost no effect on the model's performance.
  • EPOCHS specifies the number of passes over the training text. Usually one epoch is enough; more epochs are typically required only on small corpora.
  • INIT-COUNT is used to initialize the variational stick-breaking distribution. All prototypes are assigned zero occurrences except the first one, which is assigned INIT-COUNT. A zero value means the first prototype gets all occurrences.
  • STOPWORDS is a path to a newline-separated file listing words that must be ignored during training.
  • SENSE-THRESHOLD sparsifies gradients and speeds up training. If the posterior probability of a prototype is below this threshold, it won't contribute to the parameters' gradients. (Note that the command-line flag is spelled --sense-treshold.)
  • REGEX filters out words from the provided dictionary that do not match it.
  • train — path to the training text (see Input format section below)
  • dict — path to the dictionary file (see Input format section below)
  • output — path for saving the trained model.
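
Putting the parameters together, a typical invocation might look like the following (the option values here are illustrative choices based on the ranges above, not recommendations; the file names are placeholders):

```
./train.sh --window 5 --workers 4 --min-freq 20 --dim 100 \
    --prototypes 5 --alpha 0.1 --d 0 --subsample 1e-4 \
    --epochs 1 train.txt dictionary.txt model.adagram
```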

Input format

Training text should be formatted as for word2vec. Words are case-sensitive and are assumed to be separated by space characters. All punctuation should be removed unless it is specifically intended to be preserved. You may use utils/tokenize.sh INPUT_FILE OUTPUT_FILE for simple tokenization with UNIX utilities.

In order to train a model you should also provide a dictionary file with word frequency statistics in the following format:

word1   34
word2   456
...
wordN   83

AdaGram assumes that the provided word frequencies were actually obtained from the training file. You may build a dictionary file using utils/dictionary.sh INPUT_FILE DICT_FILE.
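
If you prefer to stay in Julia, a dictionary file in the expected format can also be produced with a few lines of standard-library code (a minimal sketch; the file names are placeholders):

```julia
# Count word frequencies in the tokenized training file and write
# them out in the "word<TAB>count" format shown above.
counts = Dict{String,Int}()
open("train.txt") do f
    for line in eachline(f)
        for word in split(line)
            counts[word] = get(counts, word, 0) + 1
        end
    end
end
open("dictionary.txt", "w") do out
    for (word, count) in counts
        println(out, word, '\t', count)
    end
end
```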

Playing with a model

After the model is trained, you may use the learned word vectors in the same way as those learned by word2vec. However, since AdaGram learns several vectors for each word, you may first need to disambiguate a word using its context in order to determine which vector should be used.

First, load the model and the dictionary:

julia> using AdaGram
julia> vm, dict = load_model("PATH_TO_THE_MODEL");

To examine how many prototypes were learned for a word, use the expected_pi function:

julia> expected_pi(vm, dict.word2id["apple"])
30-element Array{Float64,1}:
 0.341832   
 0.658164   
 3.13843e-6
 2.84892e-7
 2.58649e-8
 2.34823e-9
 2.13192e-10
 1.93554e-11
 1.75725e-12
 ⋮          

This function returns an array of --prototypes elements containing the prior probability of each prototype. As one may see, in this example only the first two prototypes have probabilities significantly larger than zero, so we may conclude that only two meanings of the word "apple" were discovered. We may examine each prototype by looking at its 10 nearest neighbours:

julia> nearest_neighbors(vm, dict, "apple", 1, 10)
10-element Array{(Any,Any,Any),1}:
 ("almond",1,0.70396507f0)    
 ("cherry",2,0.69193166f0)    
 ("plum",1,0.690269f0)        
 ("apricot",1,0.6882005f0)    
 ("orange",4,0.6739181f0)     
 ("pecan",1,0.6662803f0)      
 ("pomegranate",1,0.6580653f0)
 ("blueberry",1,0.6509351f0)  
 ("pear",1,0.6484747f0)       
 ("peach",1,0.6313036f0)   
julia> nearest_neighbors(vm, dict, "apple", 2, 10)
10-element Array{(Any,Any,Any),1}:
 ("macintosh",1,0.79053026f0)     
 ("iifx",1,0.71349466f0)          
 ("iigs",1,0.7030192f0)           
 ("computers",1,0.6952761f0)      
 ("kaypro",1,0.6938647f0)         
 ("ipad",1,0.6914306f0)           
 ("pc",4,0.6801078f0)             
 ("ibm",1,0.66797054f0)           
 ("powerpc-based",1,0.66319686f0)
 ("ibm-compatible",1,0.66120595f0)
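
Since most of the probability mass concentrates on a few prototypes, a quick way to count the discovered senses is to threshold the prior (a sketch; the 1e-3 cutoff is an arbitrary choice for illustration, not part of the API):

```julia
# Number of prototypes of "apple" with non-negligible prior probability
priors = expected_pi(vm, dict.word2id["apple"])
nsenses = count(p -> p > 1e-3, priors)
```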

Now, if we provide a context for the word "apple", we may obtain the posterior probability of each prototype:

julia> disambiguate(vm, dict, "apple", split("new iphone was announced today"))
30-element Array{Float64,1}:
 1.27888e-5
 0.999987  
 0.0       
 0.0       
 0.0       
 0.0       
 0.0       
 0.0       
 0.0       
 ⋮     
julia> disambiguate(vm, dict, "apple", split("fresh tasty breakfast"))
30-element Array{Float64,1}:
 0.999977  
 2.30527e-5
 0.0       
 0.0       
 0.0       
 0.0       
 0.0       
 0.0       
 0.0       
 ⋮         

As one may see, the model correctly estimated the probabilities of each sense with quite high confidence. The vector corresponding to the second prototype of the word "apple" can be obtained from vm.In[:, 2, dict.word2id["apple"]] and then used as a context-aware feature of the word "apple".
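
Combining disambiguate with the vector lookup gives context-aware features in one step (a sketch assuming the model and dictionary are already loaded as above):

```julia
# Pick the most probable sense of "apple" in the given context
# and return the corresponding input vector.
context = split("new iphone was announced today")
probs = disambiguate(vm, dict, "apple", context)
sense = argmax(probs)                      # index of the winning prototype
vec = vm.In[:, sense, dict.word2id["apple"]]
```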

Please refer to the API documentation for more detailed usage information.

Future work

  • Full API documentation
  • C and Python bindings
  • Disambiguation into user-provided sense inventory

References

  1. Sergey Bartunov, Dmitry Kondrashkin, Anton Osokin, Dmitry Vetrov. Breaking Sticks and Ambiguities with Adaptive Skip-gram. arXiv preprint, 2015.
  2. Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality. In Proceedings of NIPS, 2013.

adagram.jl's People

Contributors: arspin, kondra, mirestrepo, napsternxg, sbos


adagram.jl's Issues

Export Model to plain text (with sense information) ?

Hi,
First of all, AdaGram seems to be very nice, and thanks for making everything available. I managed to export the final vectors to binary/text format using the 'write_word2vec' function; however, this function only writes one sense per word and not the multi-sense information. Does anyone have a small script to convert the binary model to a plain-text file including multiple senses and their sense probabilities? I'm obviously not really familiar with Julia.

Best regards,
Maximilian
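
A script along these lines may serve as a starting point (an untested sketch based on the model layout described in the README above; it assumes vm.In and expected_pi behave as documented there, and the 1e-3 cutoff for skipping unused prototypes is an arbitrary choice):

```julia
# Write every significant sense of every word as
# "word sense prob v1 v2 ..." lines in a plain-text file.
open("vectors.txt", "w") do out
    for (word, id) in dict.word2id
        priors = expected_pi(vm, id)
        for sense in 1:length(priors)
            priors[sense] < 1e-3 && continue   # skip unused prototypes
            vec = vm.In[:, sense, id]
            println(out, word, ' ', sense, ' ', priors[sense], ' ',
                    join(vec, ' '))
        end
    end
end
```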

Did someone succeed to run AdaGram on an Ubuntu server?

Greetings folks,

I've got a dirty memory access error (bus error) when running the Julia code that loads a large AdaGram model: vm, dict = AdaGram.load_model(MODEL_PATH). The kind of error that no one wants to have... But everything works well on my MacBook.

There could be many causes: first, the Julia installation (since, for performance, Julia is very close to C), and I've used Anaconda; the compiler (Clang); the OS; the CPU; and possibly other factors. I've also checked that there is enough RAM.

But first, I would like to know whether someone has succeeded in running AdaGram on a Linux Ubuntu server?


signal (7): Bus error while loading /home/ccoulombe/AdaGram.jl/disambiguate.jl, in expression starting on line 22
macro expansion at ./cartesian.jl:62 [inlined]
macro expansion at ./multidimensional.jl:431 [inlined]
_unsafe_batchsetindex! at ./multidimensional.jl:423
_setindex! at ./multidimensional.jl:372 [inlined]
setindex! at ./abstractarray.jl:840 [inlined]
#9 at /home/ccoulombe/anaconda3/envs/my_env/share/julia/site/v0.5/AdaGram/src/AdaGram.jl:64
#620 at ./multi.jl:1030
run_work_thunk at ./multi.jl:1001
run_work_thunk at ./multi.jl:1010 [inlined]
#617 at ./event.jl:68
unknown function (ip: 0x7f4b5c48b89f)
warning: parsing line table prologue at 0x00003c27 should have ended at 0x00003ffc but it ended at 0x00003ce4
jl_call_method_internal at (null):0 [inlined]
jl_apply_generic at (null):1950
warning: parsing line table prologue at 0x0000e1c8 should have ended at 0x0000e52f but it ended at 0x0000e285
jl_apply at (null):0 [inlined]
start_task at (null):254
unknown function (ip: 0xffffffffffffffff)
Allocations: 9998583 (Pool: 9997420; Big: 1163); GC: 17
./run.sh: line 8: 2712 Bus error (core dumped) LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$DIR/lib julia "$@"

AdaGram not working with new versions of Julia?

I tried building AdaGram on Julia v1.1 and it seems that there might be some incompatibilities. This is what I get while trying to clone AdaGram using Pkg:

julia> Pkg.clone("https://github.com/sbos/AdaGram.jl.git")
┌ Warning: Pkg.clone is only kept for legacy CI script reasons, please use `add`
└ @ Pkg.API /builddir/build/BUILD/julia/build/usr/share/julia/stdlib/v1.1/Pkg/src/API.jl:386
  Updating registry at `~/.julia/registries/General`
  Updating git-repo `https://github.com/JuliaRegistries/General.git`
Updating git-repo https://github.com/sbos/AdaGram.jl.git
[ Info: Assigning UUID 3697d9a4-0c04-5c6c-a718-eefb7df45f8e to AdaGram
[ Info: Path /home/sadm/.julia/dev/AdaGram exists and looks like the correct package, using existing path
Resolving package versions...
ERROR: Unsatisfiable requirements detected for package Devectorize [03c08e68]:
Devectorize [03c08e68] log:
├─possible versions are: [0.2.0-0.2.1, 0.3.0, 0.4.0-0.4.2] or uninstalled
├─restricted to versions 0.0.0-* by AdaGram [3697d9a4], leaving only versions [0.2.0-0.2.1, 0.3.0, 0.4.0-0.4.2]
│ └─AdaGram [3697d9a4] log:
│ ├─possible versions are: 0.0.0 or uninstalled
│ └─AdaGram [3697d9a4] is fixed to version 0.0.0
└─restricted by julia compatibility requirements to versions: uninstalled — no versions left

Could you please specify which Julia versions I should use to make AdaGram work?

Training from

Hi, thanks a lot for offering AdaGram.

  • How can one train a model from many different files? Does one have to provide a single pre-processed file with all text appended together?

  • Also, do you offer some clustering algorithm inside AdaGram, like word2vec does?

Thanks

Errors after learning, but it looks like write succeeded (on Julia 0.3)

...
From worker 3: 99.99% -6.6615 0.0000 0.0000 8.4/10.0 1.02 kwords/sec
From worker 5: 64000 words read, 1898744649/2428126864
Learning complete 223794613 / 2.23794609e8
WARNING: Forcibly interrupting busy workers
WARNING: Unable to terminate all workers
pure virtual method called
pure virtual method called
terminate called without an active exception
terminate called without an active exception

signal (6): Aborted

signal (6): Aborted
pure virtual method called
pure virtual method called
terminate called without an active exception
terminate called without an active exception

signal (6): Aborted

signal (6): Aborted
gsignal at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
gsignal at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
gsignal at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
gsignal at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
abort at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
abort at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
abort at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
abort at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
_ZN9__gnu_cxx27__verbose_terminate_handlerEv at /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (unknown line)
_ZN9__gnu_cxx27__verbose_terminate_handlerEv at /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (unknown line)
_ZN9__gnu_cxx27__verbose_terminate_handlerEv at /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (unknown line)
_ZN9__gnu_cxx27__verbose_terminate_handlerEv at /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (unknown line)
unknown function (ip: 2104030934)
unknown function (ip: -2059376938)
unknown function (ip: -1177553194)
unknown function (ip: -2007681322)
unknown function (ip: 2104030979)
unknown function (ip: -2059376893)
unknown function (ip: -1177553149)
unknown function (ip: -2007681277)
unknown function (ip: 2104033727)
unknown function (ip: -2059374145)
unknown function (ip: -1177550401)
unknown function (ip: -2007678529)
unknown function (ip: -1149159388)
unknown function (ip: 2132424740)
unknown function (ip: -2030983132)
unknown function (ip: -1979287516)
unknown function (ip: 2132425093)
unknown function (ip: -1149159035)
unknown function (ip: -2030982779)
unknown function (ip: -1979287163)
unknown function (ip: 2128580430)
unknown function (ip: 2132214647)
unknown function (ip: -1153003698)
unknown function (ip: -1983131826)
unknown function (ip: -2034827442)
unknown function (ip: 2132222806)
unknown function (ip: -1979497609)
unknown function (ip: -1149369481)
unknown function (ip: 2132223012)
unknown function (ip: -2031193225)
unknown function (ip: -1979489450)
unknown function (ip: -1149361322)
unknown function (ip: -1979489244)
unknown function (ip: -2031185066)
unknown function (ip: 2123683422)
unknown function (ip: -1149361116)
unknown function (ip: -2031184860)
unknown function (ip: -1988028834)
unknown function (ip: 2123683758)
unknown function (ip: -1157900706)
unknown function (ip: -1988028498)
jl_trampoline at /usr/bin/../lib/x86_64-linux-gnu/julia/libjulia.so (unknown line)
unknown function (ip: -2039724450)
jl_trampoline at /usr/bin/../lib/x86_64-linux-gnu/julia/libjulia.so (unknown line)
jl_apply_generic at /usr/bin/../lib/x86_64-linux-gnu/julia/libjulia.so (unknown line)
unknown function (ip: -1157900370)
unknown function (ip: -2039724114)
run_work_thunk at multi.jl:623
run_work_thunk at multi.jl:630
jl_apply_generic at /usr/bin/../lib/x86_64-linux-gnu/julia/libjulia.so (unknown line)
jlcall_run_work_thunk_20188 at (unknown line)
run_work_thunk at multi.jl:623
jl_trampoline at /usr/bin/../lib/x86_64-linux-gnu/julia/libjulia.so (unknown line)
jl_trampoline at /usr/bin/../lib/x86_64-linux-gnu/julia/libjulia.so (unknown line)
run_work_thunk at multi.jl:630
jl_apply_generic at /usr/bin/../lib/x86_64-linux-gnu/julia/libjulia.so (unknown line)
jlcall_run_work_thunk_20206 at (unknown line)
anonymous at task.jl:873
jl_apply_generic at /usr/bin/../lib/x86_64-linux-gnu/julia/libjulia.so (unknown line)
jl_apply_generic at /usr/bin/../lib/x86_64-linux-gnu/julia/libjulia.so (unknown line)
jl_apply_generic at /usr/bin/../lib/x86_64-linux-gnu/julia/libjulia.so (unknown line)
anonymous at task.jl:873
run_work_thunk at multi.jl:623
jl_handle_stack_switch at /usr/bin/../lib/x86_64-linux-gnu/julia/libjulia.so (unknown line)
run_work_thunk at multi.jl:623
run_work_thunk at multi.jl:630
run_work_thunk at multi.jl:630
jlcall_run_work_thunk_20205 at (unknown line)
jlcall_run_work_thunk_20203 at (unknown line)
jl_handle_stack_switch at /usr/bin/../lib/x86_64-linux-gnu/julia/libjulia.so (unknown line)
julia_trampoline at /usr/bin/../lib/x86_64-linux-gnu/julia/libjulia.so (unknown line)
jl_apply_generic at /usr/bin/../lib/x86_64-linux-gnu/julia/libjulia.so (unknown line)
jl_apply_generic at /usr/bin/../lib/x86_64-linux-gnu/julia/libjulia.so (unknown line)
anonymous at task.jl:873
julia_trampoline at /usr/bin/../lib/x86_64-linux-gnu/julia/libjulia.so (unknown line)
anonymous at task.jl:873
jl_handle_stack_switch at /usr/bin/../lib/x86_64-linux-gnu/julia/libjulia.so (unknown line)
jl_handle_stack_switch at /usr/bin/../lib/x86_64-linux-gnu/julia/libjulia.so (unknown line)
julia_trampoline at /usr/bin/../lib/x86_64-linux-gnu/julia/libjulia.so (unknown line)
julia_trampoline at /usr/bin/../lib/x86_64-linux-gnu/julia/libjulia.so (unknown line)
unknown function (ip: 4199613)
unknown function (ip: 4199613)
unknown function (ip: 4199613)
__libc_start_main at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 4199667)
unknown function (ip: 0)
__libc_start_main at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 4199667)
unknown function (ip: 0)
__libc_start_main at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 4199667)
unknown function (ip: 0)
unknown function (ip: 4199613)
__libc_start_main at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 4199667)
unknown function (ip: 0)

(This is what I got in the console.)

Disambiguate can't load the lib.

julia> disambiguate(vm, dict, "apple", split("cell phone"))
ERROR: error compiling var_update_z!: could not load module superlib: superlib: cannot open shared object file: No such file or directory
in var_update_z! at /home/user/.julia/v0.3/AdaGram/src/gradient.jl:93
in disambiguate at /home/user/.julia/v0.3/AdaGram/src/util.jl:248
in disambiguate at /home/user/.julia/v0.3/AdaGram/src/util.jl:259 (repeats 2 times)

I guess I need to add this lib to LD_LIBRARY_PATH myself, but please add this to the documentation.

Support for Julia 0.4

WARNING: Base.String is deprecated, use AbstractString instead.
likely near /home/buriy/.julia/v0.4/AdaGram/train.jl:7
in anonymous at /home/buriy/.julia/v0.4/ArgParse/src/ArgParse.jl:581
WARNING: Base.String is deprecated, use AbstractString instead.
likely near /home/buriy/.julia/v0.4/AdaGram/train.jl:7
in anonymous at /home/buriy/.julia/v0.4/ArgParse/src/ArgParse.jl:612
WARNING: require is deprecated, use using or import instead
in depwarn at deprecated.jl:73
[inlined code] from deprecated.jl:694
in require at no file:0
in include at ./boot.jl:261
in include_from_node1 at ./loading.jl:304
in process_options at ./client.jl:308
in _start at ./client.jl:411
while loading /home/buriy/.julia/v0.4/AdaGram/train.jl, in expression starting on line 88
WARNING: replacing module ArrayViews
WARNING: Method definition iscontiguous(Array) in module ArrayViews at /home/buriy/.julia/v0.4/ArrayViews/src/common.jl:38 overwritten in module ArrayViews at /home/buriy/.julia/v0.4/ArrayViews/src/common.jl:38.
WARNING: Method definition iscontiguous(DenseArray) in module ArrayViews at /home/buriy/.julia/v0.4/ArrayViews/src/common.jl:37 overwritten in module ArrayViews at /home/buriy/.julia/v0.4/ArrayViews/src/common.jl:37.
WARNING: Base.String is deprecated, use AbstractString instead.
likely near /home/buriy/.julia/v0.4/AdaGram/train.jl:92
Building dictionary... WARNING: int64(s::AbstractString) is deprecated, use parse(Int64,s) instead.
in depwarn at deprecated.jl:73
in int64 at deprecated.jl:50
in read_from_file at /home/buriy/.julia/v0.4/AdaGram/src/util.jl:9
in read_from_file at /home/buriy/.julia/v0.4/AdaGram/src/util.jl:24
in include at ./boot.jl:261
in include_from_node1 at ./loading.jl:304
in process_options at ./client.jl:308
in _start at ./client.jl:411
while loading /home/buriy/.julia/v0.4/AdaGram/train.jl, in expression starting on line 98
WARNING: int32(x) is deprecated, use Int32(x) instead.
in depwarn at deprecated.jl:73
in int32 at deprecated.jl:50
in build_huffman_tree at /home/buriy/.julia/v0.4/AdaGram/src/softmax.jl:41
in call at /home/buriy/.julia/v0.4/AdaGram/src/AdaGram.jl:91
in read_from_file at /home/buriy/.julia/v0.4/AdaGram/src/util.jl:30
in include at ./boot.jl:261
in include_from_node1 at ./loading.jl:304
in process_options at ./client.jl:308
in _start at ./client.jl:411
while loading /home/buriy/.julia/v0.4/AdaGram/train.jl, in expression starting on line 98
WARNING: int32(x) is deprecated, use Int32(x) instead.
in depwarn at deprecated.jl:73
in int32 at deprecated.jl:50
in build_huffman_tree at /home/buriy/.julia/v0.4/AdaGram/src/softmax.jl:57
in call at /home/buriy/.julia/v0.4/AdaGram/src/AdaGram.jl:91
in read_from_file at /home/buriy/.julia/v0.4/AdaGram/src/util.jl:30
in include at ./boot.jl:261
in include_from_node1 at ./loading.jl:304
in process_options at ./client.jl:308
in _start at ./client.jl:411
while loading /home/buriy/.julia/v0.4/AdaGram/train.jl, in expression starting on line 98
WARNING: int32(x) is deprecated, use Int32(x) instead.
in depwarn at deprecated.jl:73
in int32 at deprecated.jl:50
in pop_initialize! at /home/buriy/.julia/v0.4/AdaGram/src/softmax.jl:49
in build_huffman_tree at /home/buriy/.julia/v0.4/AdaGram/src/softmax.jl:61
in call at /home/buriy/.julia/v0.4/AdaGram/src/AdaGram.jl:91
in read_from_file at /home/buriy/.julia/v0.4/AdaGram/src/util.jl:30
in include at ./boot.jl:261
in include_from_node1 at ./loading.jl:304
in process_options at ./client.jl:308
in _start at ./client.jl:411
while loading /home/buriy/.julia/v0.4/AdaGram/train.jl, in expression starting on line 98
WARNING: int32(x) is deprecated, use Int32(x) instead.
in depwarn at deprecated.jl:73
in int32 at deprecated.jl:50
in path at /home/buriy/.julia/v0.4/AdaGram/src/softmax.jl:29
while loading /home/buriy/.julia/v0.4/AdaGram/train.jl, in expression starting on line 98
WARNING: uint8(x::AbstractFloat) is deprecated, use round(UInt8,x) instead.
in depwarn at deprecated.jl:73
in uint8 at deprecated.jl:50
in convert_huffman_tree at /home/buriy/.julia/v0.4/AdaGram/src/softmax.jl:77
in call at /home/buriy/.julia/v0.4/AdaGram/src/AdaGram.jl:92
in read_from_file at /home/buriy/.julia/v0.4/AdaGram/src/util.jl:30
in include at ./boot.jl:261
in include_from_node1 at ./loading.jl:304
in process_options at ./client.jl:308
in _start at ./client.jl:411
while loading /home/buriy/.julia/v0.4/AdaGram/train.jl, in expression starting on line 98
WARNING: float32(x) is deprecated, use Float32(x) instead.
in depwarn at deprecated.jl:73
in float32 at deprecated.jl:50
in call at /home/buriy/.julia/v0.4/AdaGram/src/AdaGram.jl:107
in read_from_file at /home/buriy/.julia/v0.4/AdaGram/src/util.jl:30
in include at ./boot.jl:261
in include_from_node1 at ./loading.jl:304
in process_options at ./client.jl:308
in _start at ./client.jl:411
while loading /home/buriy/.julia/v0.4/AdaGram/train.jl, in expression starting on line 98
WARNING: float32(x) is deprecated, use Float32(x) instead.
in depwarn at deprecated.jl:73
in float32 at deprecated.jl:50
in call at /home/buriy/.julia/v0.4/AdaGram/src/AdaGram.jl:108
in read_from_file at /home/buriy/.julia/v0.4/AdaGram/src/util.jl:30
in include at ./boot.jl:261
in include_from_node1 at ./loading.jl:304
in process_options at ./client.jl:308
in _start at ./client.jl:411
while loading /home/buriy/.julia/v0.4/AdaGram/train.jl, in expression starting on line 98
WARNING: push!(t::Associative,key,v) is deprecated, use setindex!(t,v,key) instead.
in depwarn at deprecated.jl:73
in push! at deprecated.jl:50
in call at /home/buriy/.julia/v0.4/AdaGram/src/AdaGram.jl:25
in read_from_file at /home/buriy/.julia/v0.4/AdaGram/src/util.jl:30
in include at ./boot.jl:261
in include_from_node1 at ./loading.jl:304
in process_options at ./client.jl:308
in _start at ./client.jl:411
while loading /home/buriy/.julia/v0.4/AdaGram/train.jl, in expression starting on line 98
Done!
WARNING: float64(x) is deprecated, use Float64(x) instead.
in depwarn at deprecated.jl:73
in float64 at deprecated.jl:50
in inplace_train_vectors! at /home/buriy/.julia/v0.4/AdaGram/src/gradient.jl:119
in include at ./boot.jl:261
in include_from_node1 at ./loading.jl:304
in process_options at ./client.jl:308
in _start at ./client.jl:411
while loading /home/buriy/.julia/v0.4/AdaGram/train.jl, in expression starting on line 107
WARNING: isblank(c::Char) is deprecated, use c == ' ' || c == '\t' instead.
in depwarn at deprecated.jl:73
in isblank at deprecated.jl:50
in align at /home/buriy/.julia/v0.4/AdaGram/src/textutil.jl:64
in anonymous at /home/buriy/.julia/v0.4/AdaGram/src/gradient.jl:133
in anonymous at multi.jl:889
in run_work_thunk at multi.jl:645
in run_work_thunk at multi.jl:654
in anonymous at task.jl:58
while loading no file, in expression starting on line 0
WARNING: isblank(c::Char) is deprecated, use c == ' ' || c == '\t' instead.
in depwarn at deprecated.jl:73
in isblank at deprecated.jl:50
in align at /home/buriy/.julia/v0.4/AdaGram/src/textutil.jl:67
in anonymous at /home/buriy/.julia/v0.4/AdaGram/src/gradient.jl:133
in anonymous at multi.jl:889
in run_work_thunk at multi.jl:645
in run_work_thunk at multi.jl:654
in anonymous at task.jl:58
while loading no file, in expression starting on line 0
WARNING: isblank(s::AbstractString) is deprecated, use all((c->begin
c == ' ' || c == '\t'
end),s) instead.
in depwarn at deprecated.jl:73
in isblank at deprecated.jl:50
in producer at /home/buriy/.julia/v0.4/AdaGram/src/textutil.jl:22
while loading no file, in expression starting on line 0
WARNING: isblank(s::AbstractString) is deprecated, use all((c->begin
c == ' ' || c == '\t'
end),s) instead.
in depwarn at deprecated.jl:73
in isblank at deprecated.jl:50
in producer at /home/buriy/.julia/v0.4/AdaGram/src/textutil.jl:22
while loading no file, in expression starting on line 0
From worker 2: 64000 words read, 743789/305658269
WARNING: isblank(c::Char) is deprecated, use c == ' ' || c == '\t' instead.
in depwarn at deprecated.jl:73
in isblank at deprecated.jl:50
in align at /home/buriy/.julia/v0.4/AdaGram/src/textutil.jl:64
in anonymous at /home/buriy/.julia/v0.4/AdaGram/src/gradient.jl:133
in anonymous at multi.jl:889
in run_work_thunk at multi.jl:645
in run_work_thunk at multi.jl:654
in anonymous at task.jl:58
while loading no file, in expression starting on line 0
WARNING: isblank(c::Char) is deprecated, use c == ' ' || c == '\t' instead.
in depwarn at deprecated.jl:73
in isblank at deprecated.jl:50
in align at /home/buriy/.julia/v0.4/AdaGram/src/textutil.jl:67
in anonymous at /home/buriy/.julia/v0.4/AdaGram/src/gradient.jl:133
in anonymous at multi.jl:889
in run_work_thunk at multi.jl:645
in run_work_thunk at multi.jl:654
in anonymous at task.jl:58
while loading no file, in expression starting on line 0
WARNING: isblank(c::Char) is deprecated, use c == ' ' || c == '\t' instead.
in depwarn at deprecated.jl:73
in isblank at deprecated.jl:50
in align at /home/buriy/.julia/v0.4/AdaGram/src/textutil.jl:64
in anonymous at /home/buriy/.julia/v0.4/AdaGram/src/gradient.jl:133
in anonymous at multi.jl:889
in run_work_thunk at multi.jl:645
in run_work_thunk at multi.jl:654
in anonymous at task.jl:58
while loading no file, in expression starting on line 0
WARNING: isblank(c::Char) is deprecated, use c == ' ' || c == '\t' instead.
in depwarn at deprecated.jl:73
in isblank at deprecated.jl:50
in align at /home/buriy/.julia/v0.4/AdaGram/src/textutil.jl:67
in anonymous at /home/buriy/.julia/v0.4/AdaGram/src/gradient.jl:133
in anonymous at multi.jl:889
in run_work_thunk at multi.jl:645
in run_work_thunk at multi.jl:654
in anonymous at task.jl:58
while loading no file, in expression starting on line 0
ERROR: LoadError: On worker 2:
UndefVarError: subtract! not defined
in inplace_train_vectors! at /home/buriy/.julia/v0.4/AdaGram/src/gradient.jl:35
in anonymous at /home/buriy/.julia/v0.4/AdaGram/src/gradient.jl:144
in anonymous at multi.jl:889
in run_work_thunk at multi.jl:645
in run_work_thunk at multi.jl:654
in anonymous at task.jl:58
in remotecall_fetch at multi.jl:731
in call_on_owner at multi.jl:776
in inplace_train_vectors! at /home/buriy/.julia/v0.4/AdaGram/src/gradient.jl:158
in include at ./boot.jl:261
in include_from_node1 at ./loading.jl:304
in process_options at ./client.jl:308
in _start at ./client.jl:411
while loading /home/buriy/.julia/v0.4/AdaGram/train.jl, in expression starting on line 107
WARNING: Forcibly interrupting busy workers
fatal error on 5: ERROR: InterruptException:
make: *** [vec/ru-news3/norm.model] Error 1

No method matching Array error

Hello,

I ran into the following error while running train.sh. Do you have any idea what I might have gotten wrong? I realize this project is likely way past its support lifetime, so it's understandable if it's not worth debugging :). I'm trying to compare AdaGram vs. Word2Vec on synonyms for my grad NLP final project, but without a fix I'll have to skip AdaGram.

ERROR: LoadError: MethodError: no method matching Array{T,N}(::Base.#RemoteRef, ::Int64)
Closest candidates are:
Array{T,N}{T}(!Matched::Type{T}, ::Int64) at boot.jl:330
Array{T,N}{T}(!Matched::Type{T}, ::Int64, !Matched::Int64) at boot.jl:331
Array{T,N}{T}(!Matched::Type{T}, ::Int64, !Matched::Int64, !Matched::Int64) at boot.jl:332
...
in #inplace_train_vectors!#18(::Int64, ::Float64, ::Void, ::Float64, ::Bool, ::Int64, ::Float64, ::Float64, ::AdaGram.#inplace_train_vectors!, ::AdaGram.VectorModel, ::AdaGram.Dictionary, ::String, ::Int64) at /h/user/.julia/v0.5/AdaGram/src/gradient.jl:152
in (::AdaGram.#kw##inplace_train_vectors!)(::Array{Any,1}, ::AdaGram.#inplace_train_vectors!, ::AdaGram.VectorModel, ::AdaGram.Dictionary, ::String, ::Int64) at ./:0
in include_from_node1(::String) at ./loading.jl:488
in process_options(::Base.JLOptions) at ./client.jl:262
in _start() at ./client.jl:318
while loading /h/user/julia/train.jl, in expression starting on line 105

My knowledge of Julia is very limited, so I am unable to do much debugging. Could this be related to the fact that I'm running Julia directly, without having installed it to the system?

exception on training - in "refs = Array(RemoteRef, nworkers())"

(macOS, MacBook Pro)
...
WARNING: Base.ASCIIString is deprecated, use String instead.
likely near /AdaGram.jl/src/AdaGram.jl:49
WARNING: Base.ASCIIString is deprecated, use String instead.
likely near /AdaGram.jl/src/AdaGram.jl:49
Building dictionary... Done!
ERROR: LoadError: MethodError: no method matching Array{T,N}(::Base.#RemoteRef, ::Int64)
Closest candidates are:
Array{T,N}{T}(!Matched::Type{T}, ::Int64) at boot.jl:330
Array{T,N}{T}(!Matched::Type{T}, ::Int64, !Matched::Int64) at boot.jl:331
Array{T,N}{T}(!Matched::Type{T}, ::Int64, !Matched::Int64, !Matched::Int64) at boot.jl:332
...
in #inplace_train_vectors!#18(::Int64, ::Float64, ::Void, ::Float64, ::Bool, ::Int64, ::Float64, ::Float64, ::AdaGram.#inplace_train_vectors!, ::AdaGram.VectorModel, ::AdaGram.Dictionary, ::String, ::Int64) at /AdaGram.jl/src/gradient.jl:152
in (::AdaGram.#kw##inplace_train_vectors!)(::Array{Any,1}, ::AdaGram.#inplace_train_vectors!, ::AdaGram.VectorModel, ::AdaGram.Dictionary, ::String, ::Int64) at ./:0
in include_from_node1(::String) at ./loading.jl:488
in include_from_node1(::String) at /Applications/Julia-0.5.app/Contents/Resources/julia/lib/julia/sys.dylib:?
in process_options(::Base.JLOptions) at ./client.jl:262
in _start() at ./client.jl:318
in _start() at /Applications/Julia-0.5.app/Contents/Resources/julia/lib/julia/sys.dylib:?
while loading /AdaGram.jl/train.jl, in expression starting on line 105

I got an AssertionError when checking the similarity.

Hello!!
Thank you for providing the great library.

I tried to check nearest_neighbors after I trained a model.
Then an AssertionError was raised here.
The reason is that all elements of in_vs are 0, so their norm is also 0.

The model that I produced has 5 meanings for each word.
I guess that the word I checked actually had fewer than 5 meanings. Is this interpretation correct?

In order to run the function, I used the code below instead of the @assert.

if isnan(sim[s, v]) sim[s, v] = -Inf end

The result was what I expected.

nearest_neighbors(vm, dict, "apple", 1, 10)
("peach", 1, 0.957842), ("plum", 1, 0.952987), ("cherry", 1, 0.94981), ("lemon", 5, 0.947042), ("pear", 1, 0.945165), ("sweet", 2, 0.943617), ("quince", 1, 0.942606), ("blackberry", 1, 0.94111), ("melon", 1, 0.940196), ("pomegranate", 1, 0.940196)
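
In isolation, the NaN guard can be sketched like this (a toy example, not AdaGram's actual code; an all-zero sense vector makes the cosine similarity 0/0 = NaN, and the guard pushes it to the bottom of the ranking):

```julia
using LinearAlgebra

# Toy cosine similarity; an all-zero "sense" vector yields 0/0 = NaN.
cosine(a, b) = dot(a, b) / (norm(a) * norm(b))

senses = [Float32[1, 0], Float32[0, 0]]  # second sense was never trained
query  = Float32[1, 1]

sim = [cosine(query, v) for v in senses]
for s in eachindex(sim)
    if isnan(sim[s])
        sim[s] = -Inf32  # unused senses sort last instead of poisoning the ranking
    end
end
```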

I'm using Julia version 1.1.1. I think your project would run correctly if this part were modified.

AdaGram process is failing to load large data in Docker container

I am able to load large data (40 GB) in a Docker container with other processes.
But with AdaGram I am not able to load more than 20 MB of data in the container.

If I load more than 20 MB, I get a bus error.

julia> using AdaGram
julia> AdaGram.load_model("/julia/full.embed")

signal (7): Bus error
_unsafe_batchsetindex! at ./multidimensional.jl:329
setindex! at ./abstractarray.jl:592
Bus error (core dumped)

I am able to load the 40 GB text file in the Julia shell in the Docker container.

Dirichlet process gone bad: stick is broken in wrong place

I have trained a model on the text8 corpus with the following config. (Please note that this example sometimes works and shows accurate results with other configs.)

./run.sh train.jl --epochs 5 --alpha 0.05 --prototypes 10 --min-freq 20 --remove-top-k 70 --window 5 text8 text8.dic text8.model

When I check the word apple, first the number of senses (meanings):

julia> expected_pi(vm, dict.word2id["apple"])
10-element Array{Float64,1}:
 0.197259
 0.216447
 0.58626
 3.24536e-5
 1.54719e-6
 7.37607e-8
 3.51647e-9
 1.67644e-10
 7.99224e-12
 4.00096e-13

We have 3 senses and 7 free slots - nothing unusual. Then I ask for the neighbors of each sense:

julia> nearest_neighbors(vm, dict, "apple", 1, 10)
10-element Array{Tuple{AbstractString,Int64,Float32},1}:
 ("macintosh",2,0.6276491f0)
 ("intel",2,0.5980226f0)
 ("ibm",2,0.59220535f0)
 ("compaq",1,0.5730073f0)
 ("inc",2,0.572671f0)
 ("store",2,0.56161773f0)
 ("raskin",1,0.56127656f0)
 ("corp",1,0.55665475f0)
 ("ceo",1,0.54154074f0)
 ("ceo",2,0.54141444f0)

julia> nearest_neighbors(vm, dict, "apple", 2, 10)
10-element Array{Tuple{AbstractString,Int64,Float32},1}:
 ("apples",1,0.76360685f0)
 ("sweet",1,0.70247304f0)
 ("juice",1,0.6916403f0)
 ("cakes",1,0.6847711f0)
 ("fermented",1,0.681853f0)
 ("olive",1,0.6792287f0)
 ("fruit",1,0.6718393f0)
 ("peas",1,0.6700381f0)
 ("berries",1,0.66832954f0)
 ("roasted",1,0.66814494f0)

julia> nearest_neighbors(vm, dict, "apple", 3, 10)
10-element Array{Tuple{AbstractString,Int64,Float32},1}:
 ("macintosh",1,0.9284175f0)
 ("computers",1,0.8870821f0)
 ("pc",1,0.88180965f0)
 ("compatible",1,0.8577318f0)
 ("amiga",1,0.83944887f0)
 ("ibm",1,0.8265453f0)
 ("desktop",1,0.8234609f0)
 ("portable",1,0.81334895f0)
 ("pcs",1,0.8022719f0)
 ("dos",1,0.8022494f0)

As you can see, the first and the third senses are actually the same. Why did AdaGram break them into 2 different senses?

Recompiling the code after change?

I am new to Julia. I wanted to change some part of your code and then use the resulting model, but I am not able to recompile the package. I found an answer on Stack Overflow which said:

julia --compile=yes src*.jl
But that did not help; the code still behaves as before.
Can you explain how to do that?

Errors during training on AWS instances.

Hi all, I'm trying to train an AdaGram model on an AWS instance (32 GB RAM, 16 CPU cores), but it shows the following errors; I am able to train the model on my local machine. I'm using Julia 0.4.5 and 13 GB of data.

Building dictionary...
signal (7): Bus error

signal (7): Bus error
_unsafe_batchsetindex! at ./multidimensional.jl:329
_unsafe_batchsetindex! at ./multidimensional.jl:329
setindex! at ./abstractarray.jl:592
setindex! at ./abstractarray.jl:592
anonymous at /home/ram/.julia/v0.4/AdaGram/src/AdaGram.jl:55
anonymous at /home/ram/.julia/v0.4/AdaGram/src/AdaGram.jl:55
Worker 3 terminated.
Worker 2 terminated.ERROR (unhandled task failure): EOFError: read end of file

ERROR (unhandled task failure): EOFError: read end of file
ArgumentError: stream is closed or unusable
in uv_write at ./stream.jl:948ArgumentError: stream is closed or unusable
in uv_write at ./stream.jl:948ArgumentError: stream is closed or unusable
in uv_write at ./stream.jl:948ERROR: LoadError: ProcessExitedException()
in fetch at ./channels.jl:47
in remotecall_wait at ./multi.jl:762

...and 1 other exceptions.

[inlined code] from ./task.jl:422
in SharedArray at ./sharedarray.jl:104
while loading /home/ram/.julia/v0.4/AdaGram/train.jl, in expression starting on line 96

Retraining the Pretrained Models

First of all, great paper, thanks for making the source code available!

I want to know: is there a way to retrain the pretrained model?

How to convert Array into DenseArray in Julia language

King - Man + Queen = Woman

nearest_neighbors(vm, dict, vm.In[:, 1, dict.word2id["king"]] - vm.In[:, 2, dict.word2id["man"]] + vm.In[:, 1, dict.word2id["queen"]], 1, 10)

Gives an error about a type mismatch, expected DenseArray{Float32, N}, got Array{Float32,1}:

ERROR: MethodError: `nearest_neighbors` has no method matching nearest_neighbors(::AdaGram.VectorModel, ::AdaGram.Dictionary, ::Array{Float32,1}, ::Int64, ::Int64)
Closest candidates are:
  nearest_neighbors(::AdaGram.VectorModel, ::AdaGram.Dictionary, ::DenseArray{Float32,N}, ::Integer)
  nearest_neighbors(::AdaGram.VectorModel, ::AdaGram.Dictionary, ::AbstractString, ::Int64, ::Integer)
  nearest_neighbors(::AdaGram.VectorModel, ::AdaGram.Dictionary, ::DenseArray{Float32,N})
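
For what it's worth, a plain Vector{Float32} already is a DenseArray in Julia, so the mismatch is probably about the argument count rather than the array type: judging by the candidate list, the vector-taking method accepts no sense index. A sketch (the subtype check is plain Julia; the corrected call is only a guess at the intended signature):

```julia
# A Vector{Float32} is itself a DenseArray{Float32,1}:
@assert Array{Float32,1} <: DenseArray{Float32,1}

# Hypothetical fix, dropping the trailing sense argument to match the
# four-argument candidate (untested against the actual AdaGram API):
# nearest_neighbors(vm, dict,
#     vm.In[:, 1, dict.word2id["king"]] - vm.In[:, 2, dict.word2id["man"]]
#         + vm.In[:, 1, dict.word2id["queen"]], 10)
```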

AssertionError: Tuple{HierarchicalSoftmaxNode,Int64}[]

Hello! I'm trying to train AdaGram models, but I do not have much experience with Julia. I've tried to follow the installation instructions, but I keep getting this error when running train.sh or train.jl on my test files:

~/.julia/packages/AdaGram/bGnM5$ julia train.jl ../test/corpus.txt ../test/vocab.txt ../test/model.pkl
Building dictionary... ERROR: LoadError: AssertionError: Tuple{HierarchicalSoftmaxNode,Int64}[]
Stacktrace:
[1] build_huffman_tree(::Array{Int64,1}) at /home/username/.julia/packages/AdaGram/bGnM5/src/softmax.jl:62
[2] VectorModel(::Array{Int64,1}, ::Int64, ::Int64, ::Float64, ::Float64) at /home/username/.julia/packages/AdaGram/bGnM5/src/AdaGram.jl:105
[3] #read_from_file#21(::Regex, ::typeof(read_from_file), ::String, ::Int64, ::Int64, ::Float64, ::Float64, ::Int64, ::Int64, ::Set{AbstractString}) at /home/username/.julia/packages/AdaGram/bGnM5/src/util.jl:32
[4] (::AdaGram.var"#kw##read_from_file")(::NamedTuple{(:regex,),Tuple{Regex}}, ::typeof(read_from_file), ::String, ::Int64, ::Int64, ::Float64, ::Float64, ::Int64, ::Int64, ::Set{AbstractString}) at ./none:0
[5] top-level scope at /home/username/.julia/packages/AdaGram/bGnM5/train.jl:97
[6] include at ./boot.jl:328 [inlined]
[7] include_relative(::Module, ::String) at ./loading.jl:1105
[8] include(::Module, ::String) at ./Base.jl:31
[9] exec_options(::Base.JLOptions) at ./client.jl:287
[10] _start() at ./client.jl:460
in expression starting at /home/username/.julia/packages/AdaGram/bGnM5/train.jl:97

Am I doing something wrong? I've tried several different versions of Julia, but keep getting this error, or sometimes a slightly different one involving an 'EOF error', depending on the version used.

Thank you!

disambiguate doesn't handle words not in vocabulary

To call disambiguate(), we currently need a context composed of words which exist in the AdaGram model. When this context contains a word that's not in the model vocabulary, disambiguate() raises a KeyError.
Would it be convenient to discard the unknown words and disambiguate with the rest of the context? In the extreme case where no words in the context exist in the model, we'd be using an empty list, which still returns the prior probabilities of the word.
This could be done with a warning announcing which word(s) from the context are being ignored.
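
A pre-filtering step along those lines could be sketched outside the library (hypothetical helper; word2id stands in for dict.word2id, and the example words are made up):

```julia
# Keep only in-vocabulary words, warning about what gets discarded.
function known_context(word2id::AbstractDict, context)
    unknown = [w for w in context if !haskey(word2id, w)]
    isempty(unknown) || @warn "ignoring out-of-vocabulary words" unknown
    return [w for w in context if haskey(word2id, w)]
end

word2id = Dict("new" => 1, "iphone" => 2, "today" => 3)
known_context(word2id, ["new", "iphone", "frobnicate", "today"])  # drops "frobnicate"
```

The filtered context could then be passed to disambiguate() unchanged.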

Alternate to /dev/shm

Building a dictionary with a dimensionality of 300 fails (encountering a bus error), as the /dev/shm memory available on the HPC system I am running on is very limited (256 MB). Is there an alternative to /dev/shm, something like /tmp ($TMPDIR), which has a higher memory limit?

superlib not found

As in this issue, I face the same problem even when using run.sh:

./run.sh train.jl --window=5 --workers=3 --alpha=0.1 tokenized_sentences.txt adagram_dict.txt test_model.p

ERROR: LoadError: On worker 2:
error compiling var_update_z!: could not load library "superlib"
dlopen(superlib.dylib, 1): image not found

Julia 0.5 macosx

What's the second element of the tuple returned by the nearest_neighbors() function?

I have a simple question. I've briefly checked the source code, but I can't figure out the meaning of the second element (a small integer) of the tuple returned by the nearest_neighbors() function.

What can I use it for?

julia> nearest_neighbors(vm, dict, "apple", 1, 10)
10-element Array{(Any,Any,Any),1}:

("almond", 1, 0.70396507f0)
("cherry", 2, 0.69193166f0)
("plum", 1, 0.690269f0)
("apricot", 1, 0.6882005f0)
("orange", 4, 0.6739181f0)
("pecan", 1, 0.6662803f0)
("pomegranate", 1, 0.6580653f0)
("blueberry", 1, 0.6509351f0)
("pear", 1, 0.6484747f0)
("peach", 1, 0.6313036f0)

Is it possible to get an idea of the recommended training parameters?

Is it possible to get an idea of the generally recommended parameters used to train models (only those that differ from the defaults)?

For example, DIM = 300, ALPHA = 0.2, MIN-FREQ = 20, SENSE-THRESHOLD = 1e-17

I'm particularly interested in the parameters that were used to train the huang_super_300D_0.2_min20_hs_t1e-17.model model.
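
Guessing purely from that filename and the usage text (paths and the run.sh wrapper are placeholders; the actual settings behind the pretrained model are not documented here), the invocation would look roughly like:

```shell
# Hypothetical invocation; only the flags implied by the model filename are set.
./run.sh train.jl --dim 300 --alpha 0.2 --min-freq 20 \
    --sense-treshold 1e-17 \
    corpus.txt model.dict model.out
```

Note that the flag is spelled --sense-treshold in the usage text.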

AbstractString not defined error when running code on Julia 0.4

Hi,
Thanks for releasing the code for Adaptive Skip-gram. I am facing an issue running the code, where I get an error saying AbstractString is not defined. Any ideas?

Thanks.

vvkulkarni@darwin:~/.julia/v0.4/AdaGram$ ./train.sh
ERROR: AbstractString not defined
in include at boot.jl:238
at /home/vvkulkarni/.julia/v0.4/AdaGram/train.jl:573

Pretrained Models

First of all, great paper, thanks for making the source code available!

In the paper, it says that 'all trained models are available at http://github.com/sbos/AdaGram.jl'. However, I could not find any pretrained models on GitHub, and the project homepage (http://bayesgroup.ru/adagram) gives a 404.

I was wondering whether the pretrained models are indeed available somewhere already, or if you have any plans to release them any time soon.

Thanks!

-- AnanS

disambiguation problem

I tried to follow the readme, but disambiguation doesn't work.

System: Mac OSX Yosemite, LLVM (clang) installed

clang --version
Apple LLVM version 6.0 (clang-600.0.57) (based on LLVM 3.5svn)
Target: x86_64-apple-darwin14.1.0
Thread model: posix

Here is the error message:

julia> Pkg.build("AdaGram")
INFO: Building AdaGram

julia> using AdaGram
Warning: could not import Base.add! into NumericExtensions

julia> vm, dict = load_model("../HugeModel");

julia> disambiguate(vm, dict, "apple", split("new iphone was announced today"))
ERROR: error compiling var_update_z!: could not load module superlib: dlopen(superlib.dylib, 1): image not found
 in var_update_z! at /Users/alex/.julia/v0.3/AdaGram/src/gradient.jl:93
 in disambiguate at /Users/alex/.julia/v0.3/AdaGram/src/util.jl:248
 in disambiguate at /Users/alex/.julia/v0.3/AdaGram/src/util.jl:259 (repeats 2 times)

julia> nearest_neighbors(vm, dict, "apple", 1, 10)
10-element Array{(Any,Any,Any),1}:
 ("macintosh",1,0.95542336f0)
 ("hardware",3,0.91449404f0)
 ("pc",1,0.90442455f0)
 ("microsoft",1,0.9027111f0)
 ("ibm",1,0.9026086f0)
 ("pcs",1,0.9017542f0)
 ("dos",1,0.901704f0)
 ("emulator",1,0.88945335f0)
 ("windows",3,0.8883148f0)
 ("computers",1,0.88593197f0)

build.sh fails when running without clang

I tried installing the package without clang, but the build.sh file fails. The reason is that the if condition in the build file is incorrect. I have made the required change and was able to install the package successfully. I will send a pull request.

rare words always end up to be ambiguous

I trained a model for Hungarian and found that, while the number of senses of words with frequency >= 90 follows the expected distribution, words with freq < 90 mostly have at least two senses. (I will reopen the issue after giving more details.)

Disambiguation out of vocabulary words

Currently, any out-of-vocabulary word causes an error:

julia> disambiguate(vm, dict, "python", split("tigris python molurus molurus reptile sanctuary albicans pakistan mnp python jamesonii jamesonii ordinatus coluber boaeformis python molurus"))
ERROR: key not found: "ordinatus"
 in disambiguate at /Users/alex/.julia/v0.3/AdaGram/src/util.jl:259 (repeats 2 times)

The expected behaviour: out-of-vocabulary words are filtered out.

unicode support?

On any input file with Unicode (e.g. UTF-8 text in Russian), the default train.sh gives:

ERROR: LoadError: On worker 2:
UnicodeError: invalid character index
in yieldto at ./task.jl:71
in wait at ./task.jl:371
in schedule_and_wait at task.jl:352
in consume at task.jl:259
in read_words at [...].julia/v0.4/AdaGram/src/textutil.jl:126
in anonymous at [...].julia/v0.4/AdaGram/src/gradient.jl:136
in anonymous at multi.jl:889
in run_work_thunk at multi.jl:645
in run_work_thunk at multi.jl:654
in anonymous at task.jl:58
in remotecall_fetch at multi.jl:731
in call_on_owner at multi.jl:776
in inplace_train_vectors! at [...].julia/v0.4/AdaGram/src/gradient.jl:158
in include at ./boot.jl:261
in include_from_node1 at ./loading.jl:304
in process_options at ./client.jl:308
in _start at ./client.jl:411
while loading [...]AdaGram.jl/train.jl, in expression starting on line 105

Since the Julia language is new to me, I'm wondering whether I can fix this behaviour without patching the AdaGram code. It looks like I would have to fix looped_word_iterator at src/textutil.jl:26, do_work at src/gradient.jl:24, and maybe somewhere else. Please help :)
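
One way to work around it without touching AdaGram might be to sanitize the corpus before training, dropping characters that are not valid Unicode (a hypothetical pre-processing pass; it will not help if the bug is purely in AdaGram's string indexing):

```julia
# Strip invalid characters so iteration never lands on a broken character index.
sanitize(s::AbstractString) = String(filter(isvalid, collect(s)))

# The corpus could then be rewritten line by line with something like:
# open("corpus.clean.txt", "w") do out
#     for line in eachline("corpus.txt")
#         println(out, sanitize(line))
#     end
# end

sanitize("привет мир")  # valid UTF-8 passes through unchanged
```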
