
Turkish Language Models

Introduction

In this repository, we publish Transformer-based Turkish language models and related tools.

New models, datasets, updates, and tutorials are on the way, so please stay tuned.

For any question or request, feel free to open an issue. We are an enthusiastic R&D team working to contribute to the Turkish NLP community, and we welcome your feedback.

You can also check our Zemberek-Python implementation. It is fully integrated with Python; neither Java nor a JVM is required. A short usage sketch follows.
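For example, sentence normalization looks roughly like the sketch below. The class names and the `create_with_defaults` constructor follow the library's README as we recall it; verify them against the repo before use.

```python
# A minimal sketch of Turkish sentence normalization with Zemberek-Python.
# The imports and constructors below are assumptions based on the library's
# README and should be checked against the current repo.
from zemberek import TurkishMorphology, TurkishSentenceNormalizer

morphology = TurkishMorphology.create_with_defaults()
normalizer = TurkishSentenceNormalizer(morphology)

# Fixes informal spellings, e.g. "Yrn okua gidicem" -> "yarın okula gideceğim".
print(normalizer.normalize("Yrn okua gidicem"))
```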

Updates

August 28, 2020 Initial version of this repo.

December 8, 2020 New models have been trained with a different corpus and different training parameters. Fine-tuning scores have been updated with optimized hyperparameters.

  • BERT-cased is newly added.
  • BERT-uncased is updated.

Language Models

ELECTRA, ALBERT, BERT

Trained Models

| Model | Vocabulary Size (Thousand) | Raw Training Text (GB) | Training Steps (Million) |
| --- | --- | --- | --- |
| loodos/electra-small-turkish-cased-discriminator | 32 | 200 | 1 |
| loodos/electra-small-turkish-uncased-discriminator | 32 | 40 | 1 |
| loodos/electra-base-turkish-uncased-discriminator | 32 | 40 | 1 |
| loodos/electra-base-turkish-64k-uncased-discriminator | 64 | 200 | 1 |
| loodos/bert-base-turkish-uncased | 32 | 40 | 5 |
| loodos/bert-base-turkish-cased | 32 | 40 | 5 |
| loodos/albert-base-turkish-uncased | 32 | 40 | 3 |

Pretraining Details

  • Our training dataset (200 GB of raw Turkish text in total) was collected from online blogs, free e-books, newspapers, Common Crawl corpora, Twitter, articles, Wikipedia, and similar sources.

  • We cannot share the entire dataset due to copyright issues, but we plan to release part of it. Check our website for updates.

  • You can use the filtered Common Crawl OSCAR corpus from here.

  • During pretraining and fine-tuning, we found a normalization issue specific to Turkish in both Google's and Hugging Face's repos. You can check the details here. We are waiting for Hugging Face to resolve it; until then, we strongly suggest running our TextNormalization module before tokenizing (see the sketch below).
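The core of the issue is Turkish's dotted/dotless "I": standard lowercasing maps "I" to "i", whereas Turkish maps "I" to "ı" and "İ" to "i". A minimal workaround sketch (not our full TextNormalization module, which covers more cases) looks like this:

```python
from transformers import AutoTokenizer

def turkish_lower(text: str) -> str:
    # Apply the Turkish-specific mappings first ("İ" -> "i", "I" -> "ı"),
    # then fall back to standard lowercasing for the remaining characters.
    return text.replace("İ", "i").replace("I", "ı").lower()

# Disable the tokenizer's built-in, locale-unaware lowercasing and
# lowercase the input ourselves before tokenizing.
tokenizer = AutoTokenizer.from_pretrained(
    "loodos/bert-base-turkish-uncased", do_lower_case=False
)
print(tokenizer.tokenize(turkish_lower("DIŞARIDA KAR YAĞIYOR")))
```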

Fine-tuning Results

Results were recorded on our supercomputer, Pantheon.

System Specs:

  • AMD Ryzen 3950x
  • 2 x Titan RTX
  • 128 GB RAM
  • 4 TB SSD and 30 TB HDD
  • MSI x570 Godlike Motherboard
  • Ubuntu 18.04, CUDA 10.1, Python 3.8.6, TensorFlow 2.3.1, PyTorch 1.7.0, Transformers 3.4.0

DBMDZ models are also known as BERTurk.

NER

Hugging Face's token-classification example is used for scoring.

Our training dataset is the same one used in "Mustafa Keskin, Banu Diri, “Otomatik Veri Etiketleme ile Varlık İsmi Tanıma”, 4th International Mediterranean Science and Engineering Congress (IMSEC 2019), 322-326."

You can contact Prof. Dr. Banu Diri to obtain the dataset and further details of the conference paper.

Hyperparameters (others are left at their defaults):

```
task_type: NER
max_seq_len: 512
learning_rate: 5e-5
num_train_epochs: 5
seed: 1
per_gpu_train_batch_size: depends on model
```

| Model | F1 score |
| --- | --- |
| loodos/electra-small-turkish-cased-discriminator | 0.63 |
| loodos/electra-small-turkish-uncased-discriminator | 0.81 |
| loodos/electra-base-turkish-uncased-discriminator | 0.87 |
| loodos/electra-base-turkish-64k-uncased-discriminator | 0.86 |
| loodos/bert-base-turkish-uncased | 0.89 |
| loodos/bert-base-turkish-cased | 0.90 |
| loodos/albert-base-turkish-uncased | 0.85 |
| dbmdz/bert-base-turkish-128k-cased | 0.89 |
| dbmdz/bert-base-turkish-128k-uncased | 0.80 |
| dbmdz/bert-base-turkish-uncased | 0.89 |
| dbmdz/bert-base-turkish-cased | 0.90 |
| dbmdz/electra-base-turkish-cased-discriminator | 0.89 |
| dbmdz/electra-small-turkish-cased-discriminator | 0.79 |
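As a rough sketch, plugging one of the models above into Hugging Face's token-classification setup starts as follows. The label scheme shown is a generic person/location/organization assumption for illustration, not necessarily the tag set of the dataset above.

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Hypothetical label scheme; the actual dataset's tags may differ.
labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]

model_name = "loodos/bert-base-turkish-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name, num_labels=len(labels)
)
# The classification head is freshly initialized here; it still needs to be
# fine-tuned on the NER dataset with the hyperparameters listed above.
```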

QA

Hugging Face's question-answering example is used for scoring.

The training dataset is TQuAD.

  • We modified it slightly to match the SQuAD v2 format and removed answers whose "answer_start" offsets were misleading (see the sketch below).
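The kind of filtering we applied can be sketched as follows, assuming the standard SQuAD v2 JSON layout: an answer is dropped when its "answer_start" offset does not actually point at the answer text within the context.

```python
import json

def drop_misaligned_answers(in_path: str, out_path: str) -> None:
    """Remove answers whose answer_start offset does not match the context."""
    with open(in_path, encoding="utf-8") as f:
        data = json.load(f)
    for article in data["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                qa["answers"] = [
                    a for a in qa["answers"]
                    if context[a["answer_start"]:a["answer_start"] + len(a["text"])]
                    == a["text"]
                ]
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)
```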

Hyperparameters (others are left at their defaults):

```
max_seq_len: 512
doc_stride: 128
learning_rate: 3e-5
num_train_epochs: 5
per_gpu_train_batch_size: depends on model
```

| Model | Exact Score | F1 Score |
| --- | --- | --- |
| loodos/electra-small-turkish-cased-discriminator | 28.57 | 47.00 |
| loodos/electra-small-turkish-uncased-discriminator | 36.09 | 57.68 |
| loodos/electra-base-turkish-uncased-discriminator | 54.82 | 73.75 |
| loodos/electra-base-turkish-64k-uncased-discriminator | 54.70 | 75.39 |
| loodos/bert-base-turkish-uncased | 58.40 | 75.86 |
| loodos/bert-base-turkish-cased | 58.29 | 76.37 |
| loodos/albert-base-turkish-uncased | 46.63 | 66.31 |
| dbmdz/bert-base-turkish-128k-uncased | 59.41 | 77.50 |
| dbmdz/bert-base-turkish-128k-cased | 60.53 | 77.49 |
| dbmdz/bert-base-turkish-uncased | 59.75 | 76.48 |
| dbmdz/bert-base-turkish-cased | 58.40 | 76.19 |
| dbmdz/electra-base-turkish-cased-discriminator | 57.39 | 77.51 |
| dbmdz/electra-small-turkish-cased-discriminator | 31.61 | 53.08 |

Sentiment Analysis

Our custom text-classification project is used for scoring. It uses TFAlbertForSequenceClassification, TFBertForSequenceClassification, and TFElectraForSequenceClassification from Hugging Face's Transformers.

The training dataset is Sentiment Analysist. Before training, we normalized each sentence in the dataset with Zemberek for better scoring. The samples were split 80% train / 20% test.

Hyperparameters (others are left at their defaults):

```
max_seq_len: 192
learning_rate: 1e-6
epsilon: 1e-7
num_train_epochs: 15
per_gpu_train_batch_size: depends on model
```

| Model | Test score |
| --- | --- |
| loodos/electra-small-turkish-cased-discriminator | 66.22 |
| loodos/electra-small-turkish-uncased-discriminator | 78.97 |
| loodos/electra-base-turkish-uncased-discriminator | 89.70 |
| loodos/electra-base-turkish-64k-uncased-discriminator | 88.17 |
| loodos/bert-base-turkish-uncased | 92.17 |
| loodos/bert-base-turkish-cased | 91.52 |
| loodos/albert-base-turkish-uncased | 88.59 |
| dbmdz/bert-base-turkish-128k-uncased | 91.92 |
| dbmdz/bert-base-turkish-128k-cased | 91.94 |
| dbmdz/bert-base-turkish-uncased | 91.02 |
| dbmdz/bert-base-turkish-cased | 91.67 |
| dbmdz/electra-base-turkish-cased-discriminator | 91.85 |
| dbmdz/electra-small-turkish-cased-discriminator | 83.70 |
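A minimal sketch of the corresponding Keras setup, using the learning rate and epsilon above. Data loading, Zemberek normalization, and the 80/20 split are omitted, and the binary label count is an assumption.

```python
import tensorflow as tf
from transformers import AutoTokenizer, TFBertForSequenceClassification

model_name = "loodos/bert-base-turkish-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# num_labels=2 assumes binary (positive/negative) sentiment labels.
model = TFBertForSequenceClassification.from_pretrained(model_name, num_labels=2)

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-6, epsilon=1e-7)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer=optimizer, loss=loss, metrics=["accuracy"])

# Tokenize with the max_seq_len above, then train with model.fit(...).
encodings = tokenizer(
    ["Bu film harikaydı!"], truncation=True, padding="max_length",
    max_length=192, return_tensors="tf",
)
```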

Citation

We are preparing a paper about our models and results. Until it is published, you can cite this repository.

Special Thanks

We are thankful to the TensorFlow Research Cloud (TFRC) for providing TPUs for pretraining our models, to DBMDZ for their detailed pretraining cheatsheet, and to Prof. Dr. Banu Diri for her guidance and for providing some of the datasets.
