TUNiB-Electra

We release several new versions of the ELECTRA model, which we call TUNiB-Electra. There are two motivations. First, all existing pre-trained Korean encoder models are monolingual: they have knowledge of Korean only. Our bilingual models are instead trained on a balanced corpus of Korean and English. Second, we wanted new off-the-shelf models trained on much more text. To this end, we collected Korean text from a variety of sources, including blog posts, comments, news, and web novels, totaling 100 GB.

You can use TUNiB-Electra with the Hugging Face transformers library.

What's New:

How to use

You can load any of the four models directly with the transformers library:

from transformers import AutoModel, AutoTokenizer

# Small Model (Korean-English bilingual model)
tokenizer = AutoTokenizer.from_pretrained('tunib/electra-ko-en-small')
model = AutoModel.from_pretrained('tunib/electra-ko-en-small')

# Base Model (Korean-English bilingual model)
tokenizer = AutoTokenizer.from_pretrained('tunib/electra-ko-en-base')
model = AutoModel.from_pretrained('tunib/electra-ko-en-base')

# Small Model (Korean-only model)
tokenizer = AutoTokenizer.from_pretrained('tunib/electra-ko-small')
model = AutoModel.from_pretrained('tunib/electra-ko-small')

# Base Model (Korean-only model)
tokenizer = AutoTokenizer.from_pretrained('tunib/electra-ko-base')
model = AutoModel.from_pretrained('tunib/electra-ko-base')
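
The four checkpoints differ only in size and in whether they are bilingual, so picking one can be wrapped in a small helper. The following is a minimal sketch, not part of the TUNiB-Electra release: `checkpoint_name` and `load_tunib_electra` are hypothetical names introduced here for illustration, and only the checkpoint strings come from the snippet above.

```python
# Checkpoint names copied from the loading snippet above.
# Keys are (size, bilingual) pairs.
CHECKPOINTS = {
    ("small", True): "tunib/electra-ko-en-small",
    ("base", True): "tunib/electra-ko-en-base",
    ("small", False): "tunib/electra-ko-small",
    ("base", False): "tunib/electra-ko-base",
}


def checkpoint_name(size: str, bilingual: bool) -> str:
    """Hypothetical helper: map a size/language choice to a checkpoint name."""
    try:
        return CHECKPOINTS[(size, bilingual)]
    except KeyError:
        raise ValueError(
            f"unknown size {size!r}; expected 'small' or 'base'"
        ) from None


def load_tunib_electra(size: str = "base", bilingual: bool = True):
    """Hypothetical helper: return (tokenizer, model) for the chosen variant."""
    # Imported lazily so checkpoint_name() works without transformers installed.
    from transformers import AutoModel, AutoTokenizer

    name = checkpoint_name(size, bilingual)
    return AutoTokenizer.from_pretrained(name), AutoModel.from_pretrained(name)
```

Calling `load_tunib_electra("small", bilingual=False)` would download the Korean-only small model on first use, so it needs network access and the transformers library.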

Tokenizer example

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained('tunib/electra-ko-en-base')
>>> tokenizer.tokenize("tunib is a natural language processing tech startup.")
['tun', '##ib', 'is', 'a', 'natural', 'language', 'processing', 'tech', 'startup', '.']
>>> tokenizer.tokenize("튜닙은 자연어처리 테크 스타트업입니다.")
['튜', '##닙', '##은', '자연', '##어', '##처리', '테크', '스타트업', '##입니다', '.']
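
In the output above, pieces that continue a word carry a `##` prefix. Assuming this is the usual WordPiece continuation marker, the pieces can be merged back into whole words with a few lines of plain Python. This is an illustrative sketch; `merge_wordpieces` is a hypothetical helper, not a function of the transformers library.

```python
def merge_wordpieces(tokens):
    """Join '##'-prefixed continuation pieces onto the preceding token."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            # Continuation piece: strip the marker and append to the last word.
            words[-1] += token[2:]
        else:
            # A piece without the marker starts a new word.
            words.append(token)
    return words


# Token lists copied from the tokenizer example above.
english = ['tun', '##ib', 'is', 'a', 'natural', 'language',
           'processing', 'tech', 'startup', '.']
korean = ['튜', '##닙', '##은', '자연', '##어', '##처리',
          '테크', '스타트업', '##입니다', '.']

print(merge_wordpieces(english))
# ['tunib', 'is', 'a', 'natural', 'language', 'processing', 'tech', 'startup', '.']
print(merge_wordpieces(korean))
# ['튜닙은', '자연어처리', '테크', '스타트업입니다', '.']
```

Punctuation stays a separate token, so the merged list is not an exact reconstruction of the input string.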

Results on Korean downstream tasks

Small Models

| Model | # Params | Avg. | NSMC (acc) | Naver NER (F1) | PAWS (acc) | KorNLI (acc) | KorSTS (spearman) | Question Pair (acc) | KorQuAD (Dev) (EM/F1) | Korean-Hate-Speech (Dev) (F1) |
|---|---|---|---|---|---|---|---|---|---|---|
| TUNiB-Electra-ko-small | 14M | 81.29 | 89.56 | 84.98 | 72.85 | 77.08 | 78.76 | 94.98 | 61.17 / 87.64 | 64.50 |
| TUNiB-Electra-ko-en-small | 18M | 81.44 | 89.28 | 85.15 | 75.75 | 77.06 | 77.61 | 93.79 | 80.55 / 89.77 | 63.13 |
| KoELECTRA-small-v3 | 14M | 82.58 | 89.36 | 85.40 | 77.45 | 78.60 | 80.79 | 94.85 | 82.11 / 91.13 | 63.07 |

Base Models

| Model | # Params | Avg. | NSMC (acc) | Naver NER (F1) | PAWS (acc) | KorNLI (acc) | KorSTS (spearman) | Question Pair (acc) | KorQuAD (Dev) (EM/F1) | Korean-Hate-Speech (Dev) (F1) |
|---|---|---|---|---|---|---|---|---|---|---|
| TUNiB-Electra-ko-base | 110M | 85.99 | 90.95 | 87.63 | 84.65 | 82.27 | 85.00 | 95.77 | 64.01 / 90.32 | 71.40 |
| TUNiB-Electra-ko-en-base | 133M | 85.34 | 90.59 | 87.25 | 84.90 | 80.43 | 83.81 | 94.85 | 83.09 / 92.06 | 68.83 |
| KoELECTRA-base-v3 | 110M | 85.92 | 90.63 | 88.11 | 84.45 | 82.24 | 85.53 | 95.25 | 84.83 / 93.45 | 67.61 |
| KcELECTRA-base | 124M | 84.75 | 91.71 | 86.90 | 74.80 | 81.65 | 82.65 | 95.78 | 70.60 / 90.11 | 74.49 |
| KoBERT-base | 90M | 81.92 | 89.63 | 86.11 | 80.65 | 79.00 | 79.64 | 93.93 | 52.81 / 80.27 | 66.21 |
| KcBERT-base | 110M | 79.79 | 89.62 | 84.34 | 66.95 | 74.85 | 75.57 | 93.93 | 60.25 / 84.39 | 68.77 |
| XLM-Roberta-base | 280M | 83.03 | 89.49 | 86.26 | 82.95 | 79.92 | 79.09 | 93.53 | 64.70 / 88.94 | 64.06 |

Results on English downstream tasks

Small Models

| Model | # Params | Avg. | CoLA (MCC) | SST (Acc) | MRPC (Acc) | STS (Spearman) | QQP (Acc) | MNLI (Acc) | QNLI (Acc) | RTE (Acc) |
|---|---|---|---|---|---|---|---|---|---|---|
| TUNiB-Electra-ko-en-small | 18M | 80.44 | 56.76 | 88.76 | 88.73 | 86.12 | 88.66 | 79.03 | 87.26 | 68.23 |
| ELECTRA-small | 13M | 79.71 | 55.6 | 91.1 | 84.9 | 84.6 | 88.0 | 81.6 | 88.3 | 63.6 |
| BERT-small | 13M | 74.06 | 27.8 | 89.7 | 83.4 | 78.8 | 87.0 | 77.6 | 86.4 | 61.8 |

Base Models

| Model | # Params | Avg. | CoLA (MCC) | SST (Acc) | MRPC (Acc) | STS (Spearman) | QQP (Acc) | MNLI (Acc) | QNLI (Acc) | RTE (Acc) |
|---|---|---|---|---|---|---|---|---|---|---|
| TUNiB-Electra-ko-en-base | 133M | 85.2 | 65.36 | 92.09 | 88.97 | 90.61 | 90.91 | 85.32 | 91.51 | 76.53 |
| ELECTRA-base | 110M | 85.7 | 64.6 | 96.0 | 88.1 | 90.2 | 89.5 | 88.5 | 93.1 | 75.2 |
| BERT-base | 110M | 80.8 | 52.1 | 93.5 | 84.8 | 85.8 | 89.2 | 84.6 | 90.5 | 66.4 |
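
The README does not say how the Avg. column is computed. For the English results, the TUNiB-Electra rows are consistent with an unweighted mean of the eight task scores; this is an inference from the numbers, not a documented method. A quick check, using a hypothetical `mean_score` helper and scores copied from the two TUNiB-Electra-ko-en rows:

```python
def mean_score(scores, ndigits=2):
    """Unweighted mean of task scores, rounded to ndigits decimals."""
    return round(sum(scores) / len(scores), ndigits)


# CoLA, SST, MRPC, STS, QQP, MNLI, QNLI, RTE (in table order).
ko_en_small = [56.76, 88.76, 88.73, 86.12, 88.66, 79.03, 87.26, 68.23]
ko_en_base = [65.36, 92.09, 88.97, 90.61, 90.91, 85.32, 91.51, 76.53]

print(mean_score(ko_en_small))    # 80.44, matches the Small Models table
print(mean_score(ko_en_base, 1))  # 85.2, matches the Base Models table
```

The same check is not applied to the Korean tables here, since it is unclear how the two KorQuAD numbers (EM and F1) enter the average.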

Pre-training data

Acknowledgement

The project was created with Cloud TPU support from the TensorFlow Research Cloud (TFRC) program.

Citation

If you find this code/model useful, please consider citing:

@misc{tunib-electra,
  author       = {Ha, Sangchun and Kim, Soohwan and Ryu, Myeonghyeon and
                  Keum, Bitna and Oh, Saechan and Ko, Hyunwoong and Park, Kyubyong},
  title        = {TUNiB-Electra},
  howpublished = {\url{https://github.com/tunib-ai/tunib-electra}},
  year         = {2021},
}

License

TUNiB-Electra is licensed under the terms of the Apache 2.0 License.

Copyright 2021 TUNiB Inc. http://www.tunib.ai All Rights Reserved.
