SimAlign: Similarity Based Word Aligner



SimAlign is a high-quality word alignment tool that uses static and contextualized embeddings and does not require parallel training data.

For more details see the Paper.

Installation and Usage

Tested with Python 3.7, Transformers 2.3.0, and Torch 1.5.0. Networkx 2.4 is optional (only required for the Match algorithm). For the full list of dependencies, see setup.py. For installing Transformers, see their repo.

Download the repo for use, or alternatively install with pip:

pip install --upgrade git+https://github.com/cisnlp/simalign.git#egg=simalign

An example of using our code:

from simalign import SentenceAligner

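# Make an instance of our model.
# You can specify the embedding model and all alignment settings in the constructor.
# Here "mai" selects the three matching methods that appear in the expected output below (mwmf, inter, itermax).
myaligner = SentenceAligner(model="bert", token_type="bpe", matching_methods="mai")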

# The source and target sentences should be tokenized to words.
src_sentence = ["This", "is", "a", "test", "."]
trg_sentence = ["Das", "ist", "ein", "Test", "."]

# The output is a dictionary keyed by matching method.
# Each method maps to a list of pairs giving the indexes of aligned words (the alignments are zero-indexed).
alignments = myaligner.get_word_aligns(src_sentence, trg_sentence)

for matching_method in alignments:
    print(matching_method, ":", alignments[matching_method])

# Expected output:
# mwmf : [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]
# inter : [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]
# itermax : [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]

For more examples of how to use our code see example/align_example.py.
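
The index pairs returned above can be mapped back to the words they refer to, or serialized one pair per `i-j` token (a layout often called Pharaoh format). The following is a minimal sketch, not part of simalign: `to_word_pairs` and `to_pharaoh` are hypothetical helper names, and the `alignments` dictionary is copied from the expected output above so the snippet runs without loading a model.

```python
# Hypothetical helpers for illustration; they are not part of the simalign API.

def to_word_pairs(src_tokens, trg_tokens, pairs):
    """Map zero-indexed (src, trg) index pairs to the corresponding word pairs."""
    return [(src_tokens[i], trg_tokens[j]) for i, j in pairs]


def to_pharaoh(pairs):
    """Serialize index pairs as space-separated "i-j" tokens, sorted for stable output."""
    return " ".join(f"{i}-{j}" for i, j in sorted(pairs))


src_sentence = ["This", "is", "a", "test", "."]
trg_sentence = ["Das", "ist", "ein", "Test", "."]

# Copied from the expected output of the example above (assumption: same sentences).
alignments = {
    "mwmf": [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)],
    "inter": [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)],
    "itermax": [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)],
}

for method, pairs in alignments.items():
    print(method, ":", to_word_pairs(src_sentence, trg_sentence, pairs))
    print(method, "(i-j):", to_pharaoh(pairs))
```

In real use, `alignments` would be the dictionary returned by `myaligner.get_word_aligns(src_sentence, trg_sentence)`.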

Demo

An online demo is available here.

Publication

If you use the code, please cite

@article{sabet2020simalign,
  title={SimAlign: High Quality Word Alignments without Parallel Training Data using Static and Contextualized Embeddings},
  author={Sabet, Masoud Jalili and Dufter, Philipp and Sch{\"u}tze, Hinrich},
  journal={arXiv preprint arXiv:2004.08728},
  year={2020}
}

Feedback

Feedback and contributions are more than welcome! Just reach out to @masoudjs or @pdufter.

FAQ

Do I need parallel data to train the system?

No, no parallel training data is required.

Which languages can be aligned?

This depends on the underlying pretrained multilingual language model used. For example, if mBERT is used, it covers 104 languages.

Do I need GPUs for running this?

Each alignment simply requires a single forward pass in the pretrained language model. While this is certainly faster on GPU, it runs fine on CPU.

TODOs

  • Add tests

License

Copyright (C) 2020, Masoud Jalili Sabet, Philipp Dufter

Licensed under the terms of the GNU General Public License, version 3. A full copy of the license can be found in LICENSE.
