segurac / lm-spanish

This project forked from plantl-gob-es/lm-spanish

Official source for Spanish language models and resources made at BSC-TEMU within the "Plan de las Tecnologías del Lenguaje" (Plan-TL).

License: MIT License

Spanish Language Models 💃🏻

This repository is part of the MarIA project.

Corpora 📃

Corpus  Number of documents  Number of tokens  Size (GB)
BNE     201,080,084          135,733,450,668   570
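
As a rough sanity check on the corpus figures, the sketch below derives the average document length and the average number of bytes per token. It assumes that "GB" means decimal gigabytes (10^9 bytes), which the table does not state; the variable names are illustrative only.

```python
# Figures copied from the BNE row of the corpora table.
documents = 201_080_084
tokens = 135_733_450_668
size_bytes = 570 * 10**9  # assumption: decimal gigabytes (10^9 bytes)

# Average number of tokens in one document.
tokens_per_document = tokens / documents

# Average number of bytes of raw text behind one token.
bytes_per_token = size_bytes / tokens

print(f"tokens per document: {tokens_per_document:.1f}")
print(f"bytes per token:     {bytes_per_token:.2f}")
```

Under that assumption, a document averages a few hundred tokens and a token corresponds to roughly four bytes of text.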

Models 🤖

Fine-tuned models 🧗🏼‍♀️🏇🏼🤽🏼‍♀️🏌🏼‍♂️🏄🏼‍♀️

Word embeddings 🔤

300-dimensional word embeddings trained with FastText:

Datasets 🗂️

Evaluation ✅

Dataset       Metric    RoBERTa-b  RoBERTa-l  BETO*      mBERT      BERTIN**   Electricidad***
MLDoc         F1        0.9664     0.9702     0.9714🔥   0.9617     0.9668     0.9565
CoNLL-NERC    F1        0.8851🔥   0.8823     0.8759     0.8691     0.8835     0.7954
CAPITEL-NERC  F1        0.8960     0.9051🔥   0.8772     0.8810     0.8856     0.8035
PAWS-X        F1        0.9020     0.9150🔥   0.8930     0.9000     0.8965     0.9045
UD-POS        F1        0.9907🔥   0.9904     0.9900     0.9886     0.9898     0.9818
CAPITEL-POS   F1        0.9846     0.9856🔥   0.9836     0.9839     0.9847     0.9816
SQAC          F1        0.7923     0.8202🔥   0.7923     0.7562     0.7678     0.7383
STS           Combined  0.8533🔥   0.8411     0.8159     0.8164     0.7945     0.8063
XNLI          Accuracy  0.8016     0.8263🔥   0.8130     0.7876     0.7890     0.7878

🔥 marks the best score in each row.

* A model based on BERT architecture.

** A model based on RoBERTa architecture.

*** A model based on Electra architecture.
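
For readers who want to work with these results programmatically, the following sketch copies the table into a dictionary and finds the best-scoring model per dataset. The names `MODELS`, `SCORES`, `best_model`, and `win_counts` are illustrative and not part of this repository.

```python
# Scores copied from the evaluation table above, in column order.
MODELS = ["RoBERTa-b", "RoBERTa-l", "BETO", "mBERT", "BERTIN", "Electricidad"]
SCORES = {
    "MLDoc":        [0.9664, 0.9702, 0.9714, 0.9617, 0.9668, 0.9565],
    "CoNLL-NERC":   [0.8851, 0.8823, 0.8759, 0.8691, 0.8835, 0.7954],
    "CAPITEL-NERC": [0.8960, 0.9051, 0.8772, 0.8810, 0.8856, 0.8035],
    "PAWS-X":       [0.9020, 0.9150, 0.8930, 0.9000, 0.8965, 0.9045],
    "UD-POS":       [0.9907, 0.9904, 0.9900, 0.9886, 0.9898, 0.9818],
    "CAPITEL-POS":  [0.9846, 0.9856, 0.9836, 0.9839, 0.9847, 0.9816],
    "SQAC":         [0.7923, 0.8202, 0.7923, 0.7562, 0.7678, 0.7383],
    "STS":          [0.8533, 0.8411, 0.8159, 0.8164, 0.7945, 0.8063],
    "XNLI":         [0.8016, 0.8263, 0.8130, 0.7876, 0.7890, 0.7878],
}


def best_model(dataset):
    """Return the name of the model with the highest score on a dataset."""
    row = SCORES[dataset]
    return MODELS[row.index(max(row))]


def win_counts():
    """Count, for each model, the datasets on which it scores highest."""
    counts = {name: 0 for name in MODELS}
    for dataset in SCORES:
        counts[best_model(dataset)] += 1
    return counts


for dataset in SCORES:
    print(f"{dataset}: {best_model(dataset)}")
print(win_counts())
```

The metrics differ across rows (F1, a combined score, accuracy), so the sketch compares scores only within a dataset and never across datasets.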

Usage example ⚗️

For RoBERTa-base:

from pprint import pprint

from transformers import AutoModelForMaskedLM, AutoTokenizer, FillMaskPipeline

# Load the tokenizer and the masked-language model from the Hugging Face Hub.
tokenizer_hf = AutoTokenizer.from_pretrained('PlanTL-GOB-ES/roberta-base-bne')
model = AutoModelForMaskedLM.from_pretrained('PlanTL-GOB-ES/roberta-base-bne')
model.eval()  # inference mode (disables dropout)

# Predict the most likely tokens for the <mask> position.
pipeline = FillMaskPipeline(model, tokenizer_hf)
text = "¡Hola <mask>!"
res_hf = pipeline(text)
pprint([r['token_str'] for r in res_hf])

For RoBERTa-large:

from pprint import pprint

from transformers import AutoModelForMaskedLM, AutoTokenizer, FillMaskPipeline

# Load the tokenizer and the masked-language model from the Hugging Face Hub.
tokenizer_hf = AutoTokenizer.from_pretrained('PlanTL-GOB-ES/roberta-large-bne')
model = AutoModelForMaskedLM.from_pretrained('PlanTL-GOB-ES/roberta-large-bne')
model.eval()  # inference mode (disables dropout)

# Predict the most likely tokens for the <mask> position.
pipeline = FillMaskPipeline(model, tokenizer_hf)
text = "¡Hola <mask>!"
res_hf = pipeline(text)
pprint([r['token_str'] for r in res_hf])
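
The examples above print only the predicted tokens. A minimal sketch of one way to also show each prediction's score follows, using the `pipeline("fill-mask", ...)` factory from `transformers` and its `top_k` call argument. The helper names `summarize` and `predict` are hypothetical, and the `transformers` import is deferred so that `summarize` can be used on its own.

```python
from typing import Dict, List

MODEL_ID = "PlanTL-GOB-ES/roberta-base-bne"  # model name from the examples above


def summarize(predictions: List[Dict], min_score: float = 0.0) -> List[str]:
    """Format fill-mask predictions as 'token (score)' strings, best first."""
    kept = [p for p in predictions if p["score"] >= min_score]
    kept.sort(key=lambda p: p["score"], reverse=True)
    return [f"{p['token_str'].strip()} ({p['score']:.3f})" for p in kept]


def predict(text: str, top_k: int = 5) -> List[Dict]:
    """Run a fill-mask pipeline on text; downloads the model on first use."""
    # Imported here so summarize() does not require transformers.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model=MODEL_ID)
    return fill_mask(text, top_k=top_k)
```

Calling `summarize(predict("¡Hola <mask>!"))` would return one formatted string per prediction. This sketch builds the pipeline on every call to `predict`; code that runs many predictions would normally create it once and reuse it.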

Other Spanish Language Models 👩‍👧‍👦

We are developing domain-specific language models:

Cite 📣

@article{gutierrezfandino2022,
	author = {Asier Gutiérrez-Fandiño and Jordi Armengol-Estapé and Marc Pàmies and Joan Llop-Palao and Joaquin Silveira-Ocampo and Casimiro Pio Carrino and Carme Armentano-Oller and Carlos Rodriguez-Penagos and Aitor Gonzalez-Agirre and Marta Villegas},
	title = {MarIA: Spanish Language Models},
	journal = {Procesamiento del Lenguaje Natural},
	volume = {68},
	number = {0},
	year = {2022},
	issn = {1989-7553},
	url = {http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6405},
	pages = {39--60}
}

Contact 📧

📋 We are interested in (1) extending our corpora to build larger models and (2) training and evaluating the models on other tasks.

For questions regarding this work, contact Asier Gutiérrez-Fandiño.
