Entity Projection via MT for Cross-Lingual NER

This repository contains the implementation of the Translate-Match-Project (TMP) method described in this paper. We demonstrate that, using off-the-shelf Machine Translation (MT) systems and a few simple heuristics, significant gains can be made towards cross-lingual NER for medium-resource languages (see footnote 1).

Set-up

Basics

This code was written in Python 3.6. All commands and paths listed in this README assume that you are in the parent project directory, i.e., in cross_lingual_ner/. Create a data/ directory there to store all input and output files. Then create a dedicated environment (using virtualenv or conda) and install the packages listed in the requirements file with the following command:

pip install -r requirements.txt

Using Google Cloud Translation API

The TMP method uses the Google Cloud Translation API for two purposes:

  1. To translate sentences from source (language) to target.
  2. To translate each entity phrase in a source sentence to target.
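The second use requires the entity phrases of each source sentence to be collected before they can be sent for translation. A minimal sketch of that step follows. It assumes the source data is tokenized and BIO-tagged (for example B-PER, I-PER, O), which this README does not confirm, and the helper name extract_entity_phrases is hypothetical, not a function from this repository.

```python
from typing import List, Optional, Tuple


def extract_entity_phrases(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collect (phrase, entity_type) pairs from one BIO-tagged sentence.

    Hypothetical helper: the BIO tagging scheme is an assumption.
    """
    phrases: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None

    for token, tag in zip(tokens, tags):
        starts_entity = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_type
        )
        if starts_entity:
            # Close any open entity, then start a new one.
            if current_tokens:
                phrases.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-"):
            # Continue the entity that is currently open.
            current_tokens.append(token)
        else:
            # An "O" tag closes any open entity.
            if current_tokens:
                phrases.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None

    # Flush an entity that runs to the end of the sentence.
    if current_tokens:
        phrases.append((" ".join(current_tokens), current_type))
    return phrases


tokens = ["Barack", "Obama", "visited", "New", "York", "."]
tags = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC", "O"]
sentence = " ".join(tokens)  # sent to the API as a whole sentence (use 1)
phrases = extract_entity_phrases(tokens, tags)  # each phrase sent separately (use 2)
```

Under these assumptions, the full sentence and each extracted phrase would be translated by separate API requests.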

Please find below instructions to set up and use the API.

Setting up the API

Please follow the steps listed here to set up the API. Once setup is finished, you will have an API key that is required to authenticate every API request. Store this key (a string) in a plain text file (not JSON) in your project directory. Keep this key private to avoid unauthorized usage from your account.
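Since the key is stored as a bare string in a text file, loading it can be as simple as the sketch below. The function name read_api_key and the file name api_key.txt are hypothetical; the repository may read the key differently.

```python
from pathlib import Path


def read_api_key(path: str) -> str:
    """Return the API key stored as plain text in the file at `path`.

    Hypothetical helper, not part of this repository.
    """
    # Strip surrounding whitespace such as a trailing newline added by an editor.
    key = Path(path).read_text(encoding="utf-8").strip()
    if not key:
        raise ValueError(f"No API key found in {path}")
    return key


# Example (assumes a file named api_key.txt exists in the project directory):
# api_key = read_api_key("api_key.txt")
```

If the project is under version control, excluding the key file from commits is one way to keep the key private.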

Using the API programmatically

The following function in src/util/tmp.py accesses the Translation API:

from googleapiclient.discovery import build
.
.
.
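def get_google_translations(src_list, src_lang_code, tgt_lang_code, api_key):
    # Build a client for version 2 of the Translation API, authenticated with the API key.
    service = build('translate', 'v2', developerKey=api_key)
    # Send every string in src_list for translation in a single request.
    tgt_dict = service.translations().list(
        source=src_lang_code, target=tgt_lang_code, q=src_list).execute()
    # Keep only the translated text of each result.
    return [t['translatedText'] for t in tgt_dict['translations']]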

Note

The Google Cloud Translation service can occasionally return errors when the request rate exceeds the maximum rate allowed. To reduce such errors, the arguments batch_size and time_sleep (in main.py) are set to 128 and 10 (seconds), respectively. These arguments apply only to sentence translation. Entity phrases are translated on a sentence-by-sentence basis, without batching and without any wait time.
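The batching described in this note can be sketched as follows. This is an illustration, not the code in main.py: it assumes that batch_size is the number of sentences per request and that time_sleep is the pause between consecutive requests. The name translate_in_batches and the translate_fn parameter are hypothetical; translate_fn stands in for any function that maps a list of source strings to a list of translations.

```python
import time
from typing import Callable, List


def translate_in_batches(
    sentences: List[str],
    translate_fn: Callable[[List[str]], List[str]],
    batch_size: int = 128,
    time_sleep: float = 10.0,
) -> List[str]:
    """Translate `sentences` in fixed-size batches, pausing between requests."""
    translations: List[str] = []
    for start in range(0, len(sentences), batch_size):
        batch = sentences[start:start + batch_size]
        translations.extend(translate_fn(batch))
        # Wait before the next request, but not after the final batch.
        if start + batch_size < len(sentences):
            time.sleep(time_sleep)
    return translations


# Stand-in translator so the sketch runs without network access.
def fake_translate(batch: List[str]) -> List[str]:
    return [s.upper() for s in batch]


result = translate_in_batches(["a", "b", "c"], fake_translate, batch_size=2, time_sleep=0)
```

In real use, translate_fn could wrap a translation call with fixed language codes and API key, for example via functools.partial or a lambda.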

Despite these measures, such errors may still occur. If one does, note the index of the batch (while translating sentences) or of the sentence (while translating entity phrases) at which it occurred. If the error occurred while translating sentences, re-run the process with the argument sent_iter (in main.py) set to this index. If it occurred while translating entity phrases, do the same with the argument phrase_iter.

Both indices default to -1, so that all sentences and entity phrases are sent to the API for translation. When either is set to a positive integer, the batches (for sent_iter) or sentences (for phrase_iter) numbered lower than that index are not sent again for translation.
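The resume behaviour can be illustrated with a short sketch. It is not the logic in main.py: the generator name batches_to_send is hypothetical, and it assumes that batches are numbered from 0 in the order they are sent.

```python
from typing import Iterator, List, Tuple


def batches_to_send(
    sentences: List[str], batch_size: int, sent_iter: int = -1
) -> Iterator[Tuple[int, List[str]]]:
    """Yield (batch_index, batch) pairs, skipping batches numbered below sent_iter."""
    for batch_index, start in enumerate(range(0, len(sentences), batch_size)):
        if batch_index < sent_iter:
            # Already translated in an earlier run, so do not send it again.
            continue
        yield batch_index, sentences[start:start + batch_size]


sentences = [f"s{i}" for i in range(10)]

# First run: sent_iter is -1, so every batch is sent.
first_run = [index for index, _ in batches_to_send(sentences, batch_size=3)]

# Re-run after an error at batch 2: batches 0 and 1 are skipped.
resumed = list(batches_to_send(sentences, batch_size=3, sent_iter=2))
```

The same pattern would apply to phrase_iter, with sentence indices in place of batch indices.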

Getting annotated target data

Preprocess the source files

TRANSLATE from source to target

MATCH and PROJECT

Training a model in the target language

We used the code from...

Contact

Please send an email to [email protected] in case of any questions or suggestions related to the paper or the code.

Footnotes

1 We define medium-resource languages as those for which strong off-the-shelf MT systems exist but large annotated corpora for NER do not.

Contributors

alankarj
