This project forked from cisnlp/glotlid

Language Identification tool for more than 1600 languages (EMNLP 2023).

Home Page: https://arxiv.org/abs/2310.16248

License: Apache License 2.0

GlotLID

TL;DR

The repository introduces GlotLID, an open-source language identification model with support for more than 1600 languages.

How to use

Language Identification (Python)

You can use the model directly with the fasttext library to predict a language label:

# install dependencies (run in a shell; in a notebook, prefix each command with "!")
# pip install fasttext
# pip install huggingface_hub
import fasttext
from huggingface_hub import hf_hub_download

# download model
## cache_dir: path to the folder where the downloaded model will be stored/cached.
model_path = hf_hub_download(repo_id="cis-lmu/glotlid", filename="model.bin", cache_dir=None)

# load the model
model = fasttext.load_model(model_path)

# predict language label (call this function as many times as needed)
model.predict("Hello, world!")
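fasttext's `predict` returns a pair of labels and probabilities. The sketch below shows one way to unpack such a result, assuming labels follow the `__label__<iso code>_<script>` form (for example `__label__eng_Latn`). The helper name `parse_label` is hypothetical and not part of fasttext or GlotLID, and the sample values are made up for illustration.

```python
def parse_label(label: str, prefix: str = "__label__") -> tuple:
    """Split a label such as '__label__eng_Latn' into ('eng', 'Latn')."""
    tag = label[len(prefix):] if label.startswith(prefix) else label
    lang, _, script = tag.partition("_")
    return lang, script


# Shape of what model.predict(text, k=3) returns: (labels, probabilities).
# These values are illustrative only, not real model output.
labels = ("__label__eng_Latn", "__label__sco_Latn", "__label__deu_Latn")
probs = (0.91, 0.05, 0.01)

# Pair each parsed (language, script) tuple with its probability.
top_k = [(parse_label(label), float(p)) for label, p in zip(labels, probs)]
```

With a loaded model, the same unpacking would apply to `labels, probs = model.predict("Hello, world!", k=3)`.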

Sentence Vectors (Python)

You can also use the model with the fasttext library to get sentence vectors:

# install dependencies (run in a shell; in a notebook, prefix each command with "!")
# pip install fasttext
# pip install huggingface_hub
import fasttext
from huggingface_hub import hf_hub_download

# download model
## cache_dir: path to the folder where the downloaded model will be stored/cached.
model_path = hf_hub_download(repo_id="cis-lmu/glotlid", filename="model.bin", cache_dir=None)

# load the model
model = fasttext.load_model(model_path)

# get sentence vector of input sentence (call this function as many times as needed)
sent = "Hello, world!"  # example input sentence
embedding = model.get_sentence_vector(sent)
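One common use of sentence vectors is comparing two sentences. This is not something the repository prescribes; the sketch below is just a plain cosine similarity that accepts any two equal-length vectors, such as the outputs of `get_sentence_vector`. The function name and the toy vectors are illustrative assumptions.

```python
import math


def cosine_similarity(u, v) -> float:
    """Cosine similarity between two equal-length vectors (lists or arrays)."""
    dot = sum(float(a) * float(b) for a, b in zip(u, v))
    norm_u = math.sqrt(sum(float(a) ** 2 for a in u))
    norm_v = math.sqrt(sum(float(b) ** 2 for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


# Toy vectors standing in for model.get_sentence_vector(...) outputs.
a = [1.0, 0.0, 1.0]
b = [1.0, 0.0, 1.0]
c = [0.0, 1.0, 0.0]

same = cosine_similarity(a, b)       # identical direction
different = cosine_similarity(a, c)  # orthogonal vectors
```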

Versions

We always keep previous versions of GlotLID in our Hugging Face repository.

To access a specific version, simply append the version number to the filename.

For v1: model_v1.bin (introduced in the GlotLID paper and used in all experiments).

For v2: model_v2.bin (an edited version of v1 that covers more languages and removes noisy corpora identified in the analysis of v1).

  • It supports 1802 three-letter ISO codes (1847 three-letter ISO codes with script).
  • For 1626 three-letter ISO codes, v2 achieved an F1 of 0.996 and an FPR of 0.0002 on the test set.
    • These 1626 languages are selected based on a 0.5 F1 threshold and a 0.0005 FPR threshold for low-resource languages.
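A low FPR threshold matters because the share of noise in a filtered corpus depends on how rare the language is in the input. The sketch below works through that arithmetic, assuming perfect recall by default and made-up base rates; the function name `expected_precision` is hypothetical and the numbers are not results from the paper.

```python
def expected_precision(base_rate: float, fpr: float, recall: float = 1.0) -> float:
    """Fraction of sentences labeled as language L that really are L.

    base_rate: share of the input that is truly in language L (assumed).
    fpr: false positive rate of the classifier for L.
    recall: share of true L sentences that get labeled L (assumed 1.0 here).
    """
    true_pos = recall * base_rate
    false_pos = fpr * (1.0 - base_rate)
    return true_pos / (true_pos + false_pos)


# Same FPR of 0.0005, very different outcomes depending on the base rate.
common = expected_precision(base_rate=0.1, fpr=0.0005)    # roughly 0.996
rare = expected_precision(base_rate=0.0001, fpr=0.0005)   # roughly 0.17
```

Under these assumptions, a language that makes up 1 in 10,000 sentences ends up with a corpus that is mostly noise even at an FPR of 0.0005.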

model.bin always refers to the latest version (v2 now).
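The naming rule above can be sketched as a small helper. The function names `glotlid_filename` and `download_glotlid` are hypothetical; only the `model.bin` and `model_v<N>.bin` filenames and the `cis-lmu/glotlid` repository id come from this README.

```python
from typing import Optional


def glotlid_filename(version: Optional[int] = None) -> str:
    """Return 'model.bin' for the latest version, else 'model_v<N>.bin'."""
    if version is None:
        return "model.bin"
    if version < 1:
        raise ValueError("version must be a positive integer")
    return f"model_v{version}.bin"


def download_glotlid(version: Optional[int] = None) -> str:
    """Download the requested model file and return its local path."""
    # Imported lazily so the filename helper works without huggingface_hub.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id="cis-lmu/glotlid", filename=glotlid_filename(version))
```

For example, `download_glotlid(1)` would fetch the v1 model used in the paper's experiments, while `download_glotlid()` fetches whatever `model.bin` currently points to.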

Data Sources

See the list of data sources here.

You're welcome to open a pull request or an issue to contribute new resources to our data list. Even for the languages we already support, we're actively seeking additional resources to mitigate domain shift issues.

Benchmark

  • UDHR: access our clean version of UDHR here.
  • FLORES-200: devtest part of FLORES-200.

Evaluation

Code will be uploaded soon.

FAQ

  • If GlotLID predicts the wrong tag for a normal, long text, open an issue. However, keep the following in mind:

    • If the script of the input is not supported by our model, use GlotScript to verify that the script in the predicted lang_script label actually occurs in the sentence. If it does not, you need to write a function that returns 'und_mainscript' in these situations. GlotScript can identify both the main script and all scripts present in the sentence. We recommend using GlotLID in conjunction with GlotScript.
    • The threshold for high confidence can differ between languages, because not all languages are equally distant from each other. For one language, 0.6 is already high because it is very close to a similar language (such as dyu and bam), while for another, 0.9 might not be.
    • This model is primarily trained on longer sentences, so avoid using it on very short sentences. Other language identification models also perform poorly on short sentences unless you increase the n-gram size, which is computationally expensive.
    • In GlotLID, the false positive rate (FPR) for high-resource languages is much higher than for low-resource languages. However, even this higher FPR is still lower than that of a language identification model that only recognizes high-resource languages. We accept this trade-off because our main concern is keeping the FPR of low-resource languages low. The base frequency of high-resource languages is much higher than that of low-resource languages, so corpus cleanliness is not at risk for them. For a low-resource language with a low base frequency, however, even a small FPR can leave most of the corpus noisy.
  • If you want to add a language, provide the resource in an open issue, and we will add it. If you need the model urgently, we can expedite the process to less than a week (the training itself takes less than a day). Otherwise, the language will be included in the official release according to our schedule (which depends on new resources).

  • If you need a customized model with a subset of languages, let us know in an open issue.

  • If you want to collaborate, please send us an email (to: [email protected]) specifying the type of collaboration you need from us.

  • For any other requests, feel free to email us or open an issue.
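The script check and the per-language confidence thresholds from the FAQ can be combined in a small post-processing step. This is a minimal sketch under stated assumptions: the function name `resolve_tag`, the threshold values, and the choice to also return `und_<mainscript>` for low-confidence predictions are illustrative, not the authors' method. The scripts found in the text and the main script are passed in by the caller (for example, as obtained from GlotScript), so no GlotScript call is shown here.

```python
PREFIX = "__label__"


def resolve_tag(label, prob, scripts_in_text, main_script,
                thresholds, default_threshold=0.9):
    """Return the predicted 'lang_Script' tag, or 'und_<main_script>' if rejected.

    label: predicted label, e.g. '__label__bam_Latn'.
    prob: probability reported for that label.
    scripts_in_text: set of script codes detected in the sentence (caller-supplied).
    main_script: dominant script of the sentence (caller-supplied).
    thresholds: per-tag confidence thresholds; values below are made up.
    """
    tag = label[len(PREFIX):] if label.startswith(PREFIX) else label
    script = tag.partition("_")[2]
    undetermined = f"und_{main_script}"
    if script not in scripts_in_text:
        return undetermined  # predicted script does not occur in the sentence
    if prob < thresholds.get(tag, default_threshold):
        return undetermined  # below the threshold chosen for this language
    return tag


# Illustrative thresholds: lower for closely related languages such as dyu and bam.
thresholds = {"bam_Latn": 0.6, "dyu_Latn": 0.6}

accepted = resolve_tag("__label__bam_Latn", 0.65, {"Latn"}, "Latn", thresholds)
low_conf = resolve_tag("__label__eng_Latn", 0.65, {"Latn"}, "Latn", thresholds)
bad_script = resolve_tag("__label__eng_Latn", 0.99, {"Cyrl"}, "Cyrl", thresholds)
```

Real thresholds would need to be tuned per language on held-out data; the values here only mirror the 0.6 versus 0.9 contrast mentioned in the FAQ.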

Citation

If you find our model, code and list of data sources useful for your research, please cite:

@inproceedings{
  kargaran2023glotlid,
  title={GlotLID: Language Identification for Low-Resource Languages},
  author={Kargaran, Amir Hossein and Imani, Ayyoob and Yvon, Fran{\c{c}}ois and Sch{\"u}tze, Hinrich},
  booktitle={The 2023 Conference on Empirical Methods in Natural Language Processing},
  year={2023},
  url={https://openreview.net/forum?id=dl4e3EBz5j}
}
