Giter Site home page Giter Site logo

kor_sacrebleu's Introduction

Vacillating Human Correlation of SacreBLEU in Unprotected Languages

This repository provides datasets and codes for MT evaluation employed in the given publication, written by Ahrii Kim (김아리) and Jinhyeon Kim (김진현) and submitted at Preprints.org (1st ver.) and HumEval 2022 (final ver.).

Abstract

SacreBLEU, by incorporating a text normalizing step in the pipeline, has become a rising automatic evaluation metric in recent MT studies. With agglutinative languages such as Korean, however, the lexical-level metric cannot provide a conceivable result without a customized pre-tokenization. This paper endeavors to examine the influence of diversified tokenization schemes –word, morpheme, subword, character, and consonants & vowels (CV)– on the metric after its protective layer is peeled off.

By performing meta-evaluation with manually-constructed into-Korean resources, our empirical study demonstrates that the human correlation of the surface-based metric and other homogeneous ones (as an extension) vacillates greatly by the token type. Moreover, the human correlation of the metric often deteriorates due to some tokenization, with CV one of its culprits. Guiding through the proper usage of tokenizers for the given metric, we discover i) the feasibility of the character tokens and ii) the deficit of CV in the Korean MT evaluation.

Dataset

  • Base
    • Source Text: English from WMT 20 English III-type (2,048 sentences / 61 documents)
    • Reference Text: Korean* (manually created)
    • System Translation: 4 online APIs
  • Judgment
    • Human: Direct Assessment (DA) of adequacy & fluency
    • Automatic:
      • BLEU, TER, and ChrF from SacreBLEU (Post, 2018)
      • NLTK_BLEU (Papineni et al., 2002)
      • GLEU (Wu et al., 2016)
      • RIBES (Isozaki et al., 2010)
      • NIST
      • EED (Wang et al., 2016)
      • CharacTER (Stanchev et al., 2019)
        *For legal issue, a sample of the reference set is publicly available.

Tokenization

  1. Word Level
  2. Morpheme Level
  3. Subword Level
  4. Character Level
  5. CV Level

Examplary Tokens

table1

Type Tokens
Sentence "모델 레옹 데임은 아직 그 누구도 시도한 적 없는 방식으로 캣워크를 활보했다"
Word ['모델', '레옹', '데임은', '아직', '그', '누구도', '시도한', '적', '없는', '방식으로', '캣워크를', '활보했다']
MeCab-ko ['모델', '레옹', '데임', '은', '아직', '그', '누구', '도', '시도', '한', '적', '없', '는', '방식', '으로', '캣', '워크', '를', '활보', '했', '다']
Kiwi ['모델', '레옹', '데이', 'ᆷ', '은', '아직', '그', '누구', '도', '시도', '하', 'ᆫ', '적', '없', '는', '방식', '으로', '캣워크', '를', '활보', '하', '었', '다']
Khaiii ['모델', '레옹', '데임', '은', '아직', '그', '누구', '도', '시도', '하', 'ㄴ', '적', '없', '는', '방식', '으로', '캣워크', '를', '활보', '하', '였', '다']
SPM ['모델', '레', '옹', '데', '임', '은', '아직', '그', '누구', '도', '시도', '한', '적', '없', '는', '방식', '으로', '', '캣', '워크', '를', '활', '보', '했', '다']
Character ['모', '델', '레', '옹', '데', '임', '은', '아', '직', '그', '누', '구', '도', '시', '도', '한', '적', '없', '는', '방', '식', '으', '로', '캣', '워', '크', '를', '활', '보', '했', '다']
CV ['ㅁ', 'ㅗ', 'ㄷ', 'ㅔ', 'ㄹ', ' ', 'ㄹ', 'ㅔ', 'ㅇ', 'ㅗ', 'ㅇ', ' ', 'ㄷ', 'ㅔ', 'ㅇ', 'ㅣ', 'ㅁ', 'ㅇ', 'ㅡ', 'ㄴ', ' ', 'ㅇ', 'ㅏ', 'ㅈ', 'ㅣ', 'ㄱ', ' ', 'ㄱ', 'ㅡ', ' ', 'ㄴ', 'ㅜ', 'ㄱ', 'ㅜ', 'ㄷ', 'ㅗ', ' ', 'ㅅ', 'ㅣ', 'ㄷ', 'ㅗ', 'ㅎ', 'ㅏ', 'ㄴ', ' ', 'ㅈ', 'ㅓ', 'ㄱ', ' ', 'ㅇ', 'ㅓ', 'ㅄ', 'ㄴ', 'ㅡ', 'ㄴ', ' ', 'ㅂ', 'ㅏ', 'ㅇ', 'ㅅ', 'ㅣ', 'ㄱ', 'ㅇ', 'ㅡ', 'ㄹ', 'ㅗ', ' ', 'ㅋ', 'ㅐ', 'ㅅ', 'ㅇ', 'ㅝ', 'ㅋ', 'ㅡ', 'ㄹ', 'ㅡ', 'ㄹ', ' ', 'ㅎ', 'ㅘ', 'ㄹ', 'ㅂ', 'ㅗ', 'ㅎ', 'ㅐ', 'ㅆ', 'ㄷ', 'ㅏ']

Tools & Results

Using the tools

Before implementation, we remind you to have all tokenizers installed. To test all the tools applied for our experiment (rank clustering, bootstrap resampling, tokenized samples, and automatic metrics scores), use the following code:

# for instance, metric = gleu
$ python3 script.py gleu

The given two metrics are not included in the script above, as they aer copied from the original libraries. You can test them seperately with our example with the following code:

[EED]
$ python3 ./tool/metric/ExtendedEditDistance/EED.py \
                -ref data/ref_example.txt \
                -hyp data/hyp_example.txt

[CharacTER]
$ python3 ./tool/metric/CharacTER/CharacTER.py \
                -r data/ref_example.txt \
                -o data/hyp_example.txt

Reproducing the experiment results

The regeneration of the figures of our paper is with the code given below. The options are to choose either the segment or corpus level and to save the images using --save.

$ python3 tool/draw_graph.py 'corpus'

Computing time

You can check the time to tokenize our sample text with the following code.

$ bash tool/tokenizer.sh data/sample tsv "Ref Hyp" "Kkma Hannanum Okt Komoran Mecab Khaiii Kiwi Spm Syllable CV"

Citation

@inproceedings{kor-sacrebleu22,
    title={Vacillating Human Correlation of SacreBLEU in Unprotected Languages},
    author={Kim, Ahrii and Kim, Jinhyeon},
    booktitle={Proceedings of the Workshop on Human Evaluation of NLP Systems (HumEval)}
    year={2022}
}

License

Apache License Version 2.0

kor_sacrebleu's People

Contributors

trotacodigos avatar

Stargazers

Judy avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.