
LSTM-CRF Model for Named Entity Recognition (or Sequence Labeling)

This repository implements an LSTM-CRF model for named entity recognition. The model is the same as the one in Lample et al. (2016), except that we do not have the last tanh layer after the BiLSTM. We achieve SOTA performance on both the CoNLL-2003 and OntoNotes 5.0 English datasets (check our benchmark).
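
For orientation, here is a minimal sketch of the emission side of such a model (embeddings, a BiLSTM, and a linear layer producing per-token tag scores, with no tanh after the BiLSTM); a CRF layer on top then scores whole tag sequences. The class and dimensions below are illustrative, not the repository's actual code:

import torch.nn as nn

class BiLSTMEncoder(nn.Module):
    """Illustrative sketch: embeddings -> BiLSTM -> linear emission scores.
    A CRF layer (not shown) consumes these scores for sequence-level
    training and Viterbi decoding."""
    def __init__(self, vocab_size, num_tags, emb_dim=100, hidden_dim=200):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.bilstm = nn.LSTM(emb_dim, hidden_dim // 2, batch_first=True,
                              bidirectional=True)
        self.hidden2tag = nn.Linear(hidden_dim, num_tags)  # no tanh in between

    def forward(self, word_ids):
        embeds = self.embedding(word_ids)   # (batch, seq_len, emb_dim)
        lstm_out, _ = self.bilstm(embeds)   # (batch, seq_len, hidden_dim)
        return self.hidden2tag(lstm_out)    # per-token tag scores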

Requirements

  • Python >= 3.6 and PyTorch >= 0.4.1
  • AllenNLP package (if you use ELMo)

If you use conda:

git clone https://github.com/allanj/pytorch_lstmcrf.git

conda create -n pt_lstmcrf python=3.7
conda activate pt_lstmcrf
# check https://pytorch.org for the version suitable for your machine
conda install pytorch=1.3.0 torchvision cudatoolkit=10.0 -c pytorch -n pt_lstmcrf
pip install tqdm termcolor overrides allennlp

Usage

  1. Put the GloVe embedding file (glove.6B.100d.txt) under the data directory. (You can also use ELMo/BERT/Flair; check below.) Note that if the embedding file does not exist, we just randomly initialize the embeddings (see the sketch after this list).
  2. Simply run the following command and you can obtain results comparable to the benchmark above.
    python trainer.py
    If you want to use your first GPU device cuda:0 and train models on your own dataset with ELMo embeddings:
    python trainer.py --device cuda:0 --dataset YourData --context_emb elmo --model_folder saved_models
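
As a rough illustration of the embedding-loading behavior in step 1, the sketch below builds an embedding matrix from a GloVe file and falls back to random vectors for words not in the file (or for all words when the file is absent). The function name and the scaling constant are illustrative assumptions, not the repository's actual code:

import numpy as np

def build_embedding_matrix(vocab, glove_path="data/glove.6B.100d.txt", dim=100):
    """Hypothetical sketch: use GloVe vectors where available,
    randomly initialize the rest."""
    glove = {}
    try:
        with open(glove_path, encoding="utf-8") as f:
            for line in f:
                parts = line.rstrip().split(" ")
                glove[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    except FileNotFoundError:
        pass  # no embedding file at all: every word gets a random vector
    scale = np.sqrt(3.0 / dim)  # a common uniform-init scale for embeddings
    matrix = np.random.uniform(-scale, scale, (len(vocab), dim)).astype(np.float32)
    for idx, word in enumerate(vocab):
        if word in glove:
            matrix[idx] = glove[word]
    return matrix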
    
Training with your own data
  1. Create a folder YourData under the data directory.
  2. Put the train.txt, dev.txt and test.txt files under this directory (make sure the format is compatible, i.e., the first column is the word and the last column is the tag; see the example after this list). If you have a different format, simply modify the reader in config/reader.py.
  3. Change the dataset argument to YourData when you run trainer.py.
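
For reference, a compatible CoNLL-style file looks like the snippet below: one token per line with the tag in the last column, and a blank line separating sentences (these example lines are from CoNLL-2003):

EU B-ORG
rejects O
German B-MISC
call O
to O
boycott O
British B-MISC
lamb O
. O

Peter B-PER
Blackburn I-PER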

Using ELMo (and BERT) as contextualized word embeddings

There are two ways to import ELMo and BERT representations. We can either preprocess the input files into vectors and load them in the program, or run the ELMo/BERT model forward over the input tokens every time. The latter approach lets us fine-tune the parameters in ELMo and BERT, but its memory consumption is quite high. Since most practical use cases do not need fine-tuning, I simply implemented the first method.

  1. Run the script with python -m preprocess.get_elmo_vec YourData. This produces the vector files for your datasets.
  2. Run the main file with the command: python trainer.py --context_emb elmo. You are good to go.
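
Under the hood, the preprocessing step amounts to running ELMo over each pre-tokenized sentence once and caching the vectors to disk. A minimal sketch using AllenNLP's ElmoEmbedder follows; the output file name is an illustrative assumption, not the script's actual naming scheme:

import pickle
from allennlp.commands.elmo import ElmoEmbedder

elmo = ElmoEmbedder()  # downloads the default pretrained ELMo weights
sentences = [["EU", "rejects", "German", "call"]]  # pre-tokenized input
# each result has shape (3 layers, num_tokens, 1024)
vectors = [elmo.embed_sentence(tokens) for tokens in sentences]
with open("data/YourData/train.elmo.vec", "wb") as f:  # hypothetical cache path
    pickle.dump(vectors, f)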

Using BERT works in a similar manner. Let me know if you want further functionality. Note that we concatenate the ELMo and word embeddings (i.e., GloVe) in our model (check here). You may not need concatenation for BERT.
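
The concatenation itself is just a concat along the feature dimension before the BiLSTM; a minimal sketch with illustrative shapes:

import torch

word_emb = torch.randn(1, 5, 100)   # (batch, seq_len, GloVe dim)
elmo_emb = torch.randn(1, 5, 1024)  # (batch, seq_len, ELMo dim)
combined = torch.cat([word_emb, elmo_emb], dim=-1)  # (1, 5, 1124), fed to the BiLSTM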

Running with our pretrained English (with ELMo) Model

We trained an English LSTM-CRF (+ELMo) model on the CoNLL-2003 dataset. You can directly predict on a sentence with the following piece of code. (Note: we do not do tokenization.)

You can download the English model through this link.

from ner_predictor import NERPredictor
sentence = "This is an English model ."
# Or you can pass a list of sentences:
# sentence = ["This is an English model", "This is the second sentence"]
model_path = "english_model.tar.gz"
predictor = NERPredictor(model_path, cuda_device="cpu")  # use "cuda:0", "cuda:1", etc. for GPU
prediction = predictor.predict(sentence)
print(prediction)

Further Details and Extensions

  1. Benchmark Performance

Ongoing Plan

  • Support for ELMo as features
  • Interactive model where we can just import the model and decode a sentence
  • Make the code more modularized (separate the encoder and inference layers) and readable (by adding more comments)
  • Put the benchmark performance documentation into a separate markdown file
  • Integrate ELMo/BERT as a module instead of just features
  • Clean up the code for better organization (e.g., imports)

Contributors

A huge thanks to @yuchenlin for his contributions to this repo.
