Attention Word Embeddings

Original README content:

The code is inspired by the following GitHub repository.

AWE is designed to learn rich word vector representations. It fuses the attention mechanism with the CBOW model of word2vec to address the limitations of the CBOW model. CBOW equally weights the context words when making the masked word prediction, which is inefficient, since some words have higher predictive value than others. We tackle this inefficiency by introducing our Attention Word Embedding (AWE) model. We also propose AWE-S, which incorporates subword information (code for which is in the fastText branch).
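As a rough illustration of the difference described above, the sketch below contrasts a CBOW-style context vector (a plain average) with an attention-weighted one (a softmax-weighted sum). It is not the AWE formulation: the function names, the toy vectors and the `scores` values are all hypothetical, and how AWE actually computes its attention scores is described in the paper, not here.

```python
import math
from typing import List

Vector = List[float]


def uniform_context(context: List[Vector]) -> Vector:
    """CBOW-style context vector: every context word gets the same weight 1/n."""
    n = len(context)
    dim = len(context[0])
    return [sum(vec[d] for vec in context) / n for d in range(dim)]


def softmax(scores: List[float]) -> List[float]:
    """Turn arbitrary scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_context(context: List[Vector], scores: List[float]) -> Vector:
    """Attention-style context vector: a weighted sum of the context words.

    `scores` is a hypothetical input with one relevance score per context word.
    """
    weights = softmax(scores)
    dim = len(context[0])
    return [sum(w * vec[d] for w, vec in zip(weights, context)) for d in range(dim)]


# Toy 2-dimensional embeddings for three context words (made-up values).
context = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = [2.0, 0.0, 0.0]  # the first word is assumed to be the most predictive

print(uniform_context(context))            # all words contribute equally
print(attention_context(context, scores))  # the first word dominates
```

When all scores are equal, the attention-weighted vector reduces to the plain average, so the uniform weighting can be seen as a special case of the weighted one.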

Details of this method and results can be found in our COLING paper.

Information on this fork

I originally forked the following repository in order to use it, and then found that many small things were broken or unclear.

I fixed some of the issues in train_cbow.py and dump_w2v_k_fmt.py so that they can be used with the latest torch and with your own dataset.

Training

Some parameters have defaults specific to the original developer. I strongly recommend setting outputdir, outputmodelname and dataset_path explicitly. I also strongly recommend changing --evaluation_... if you are using a small dataset, as the default split (0.0001) might make your evaluation results vary a lot (e.g. --validation_fraction 0.005).

# --dataset_path takes either a single file ending with .txt or a directory containing .txt files
python train_cbow.py \
   --n_epochs 5 \
   --batch_size 128 \
   --w2m_type acbow \
   --word_emb_dim 200 \
   --dataset_path DATASET_PATH \
   --context_size 10 \
   --mode cbow \
   --outputmodelname acbow.200.lemma.model \
   --outputdir ./models \
   --temp_path tempdir \
   --max_words 20000 # Your vocabulary size!
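To see why the default validation split can be too small, here is a back-of-the-envelope calculation. The example count of 100,000 is made up, and the `validation_size` helper is hypothetical: how train_cbow.py actually rounds and draws the split is not shown here, so treat this only as an order-of-magnitude check.

```python
def validation_size(n_examples: int, fraction: float) -> int:
    """Approximate number of held-out examples for a given validation fraction."""
    return round(n_examples * fraction)


n_examples = 100_000  # assumed size of a small dataset

default_split = validation_size(n_examples, 0.0001)  # the default fraction
larger_split = validation_size(n_examples, 0.005)    # the fraction suggested above

print(default_split, larger_split)
```

With these assumed numbers the default fraction leaves about 10 validation examples, while 0.005 leaves about 500, which gives a much less noisy validation score.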

The best model against the validation set is saved with the suffix _val.cbow_net.

You can then convert it into a standard vocabulary -> vector matrix using the dump_w2v_k_fmt.py script:

python dump_w2v_k_fmt.py --vocab FILE.vocab --cbow_net MODEL_val.cbow_net --output YOURCHOICEOFFILE

The output can then be loaded with Gensim and similar libraries as a standard word2vec-format file.
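For a quick sanity check of the dumped file without extra dependencies, the sketch below parses the plain-text word2vec layout: a `count dim` header line, then one `word v1 ... vdim` row per line. It assumes that dump_w2v_k_fmt.py writes the text variant rather than the binary one, which is not confirmed here; `read_word2vec_text` and the sample data are hypothetical.

```python
import io
from typing import Dict, List, TextIO


def read_word2vec_text(handle: TextIO) -> Dict[str, List[float]]:
    """Read word vectors from a text-format word2vec stream."""
    header = handle.readline().split()
    count, dim = int(header[0]), int(header[1])

    vectors: Dict[str, List[float]] = {}
    for line in handle:
        parts = line.split()
        if not parts:
            continue  # ignore blank lines
        word, values = parts[0], [float(x) for x in parts[1:]]
        if len(values) != dim:
            raise ValueError(f"expected {dim} values for {word!r}, got {len(values)}")
        vectors[word] = values

    if len(vectors) != count:
        raise ValueError(f"header announced {count} words, found {len(vectors)}")
    return vectors


# In-memory stand-in for the dumped file (made-up words and values).
sample = io.StringIO("2 3\nrex 0.1 0.2 0.3\nlex 0.4 0.5 0.6\n")
vectors = read_word2vec_text(sample)
print(len(vectors), vectors["rex"])
```

For a real file, pass `open(path, encoding="utf-8")` instead of the `StringIO` object. With Gensim installed, `KeyedVectors.load_word2vec_format(path, binary=False)` reads the same text layout and also provides similarity queries.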
