
Comments (3)

ArthurZucker commented on June 16, 2024

cc @itazap


itazap commented on June 16, 2024

Thanks for your detailed description and code samples, they are much appreciated! 🤗

The Cyrillic character 'б' being represented by multiple tokens such as ['Ð', '±'] is expected behavior; it follows from the way the tokenizer handles Unicode characters at the byte level.

In UTF-8, characters can be 1-4 bytes long, and Cyrillic letters like 'б' are encoded in 2 bytes (see 'б' = '0xd0 0xb1' in the Cyrillic table).

When the tokenizer represents these bytes, each one becomes a separate character from the Latin-1 range: '0xd0' maps to 'Ð' and '0xb1' maps to '±' (see the Latin table).
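
For illustration, here is a minimal Python sketch of that byte decomposition (standard library only; for these two bytes the byte-level display characters happen to coincide with Latin-1):

raw = 'б'.encode('utf-8')                  # b'\xd0\xb1' - two bytes
print([hex(b) for b in raw])               # ['0xd0', '0xb1']
print(bytes([raw[0]]).decode('latin-1'))   # 'Ð'
print(bytes([raw[1]]).decode('latin-1'))   # '±'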

So in the tokenizer training process each byte initially counts as a separate token, and in your sample the two bytes of 'б' may not co-occur often enough for the trainer to merge them into a single token. (See similar issues: 254203268)
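
One way to check whether those two byte tokens were merged during training is to look them up in the trained vocabulary (a small sketch; my_tokenizer is the retrained tokenizer from the issue):

vocab = my_tokenizer.get_vocab()
print('Ð±' in vocab)                 # True only if the two bytes of 'б' were merged
print('Ð' in vocab, '±' in vocab)    # the individual byte-level tokens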

You can verify that the weird-looking tokenization is working correctly using:

encoded = my_tokenizer.encode('б') # [128000, 272, 260]
decoded = my_tokenizer.decode(encoded) # 'б' - converts back to Cyrillic representation
print(decoded)
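
You can also look at the intermediate token strings instead of the ids (the leading special token depends on the tokenizer):

tokens = my_tokenizer.convert_ids_to_tokens(encoded)
print(tokens)  # e.g. a special token followed by 'Ð' and '±' for a byte-level tokenizer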

Let me know if you have more questions or experience issues!


Anna-Pinewood commented on June 16, 2024

@itazap, thank you for your quick response!
I'm still confused.
TL;DR
Byte-level encoding for the 'б' character is not what I expect, since it does appear in the training corpus several times. The same setup works as expected with the sentencepiece lib and with the tokenizers lib.

  1. If I understand correctly, you are talking about byte fallback. But I thought that only happens when the symbol or word is not present in the corpus at all! However, the 'б' letter appears in the toy corpus several times...
[screenshot of the toy corpus]
  2. Again, my understanding is that if vocab_size is big enough, the BPE merge tree should capture every separate word, because its depth is not limited: first the alphabet tokens, then all possible pairs, then all possible words. Yet that doesn't happen, even though vocab_size is 52000, which is more than enough for the toy corpus.
    my_tokenizer = old_tokenizer.train_new_from_iterator(toy_corpus_ru, 52000)

  3. There is example code for sentencepiece in this Colab notebook with examples of the unexpected behavior, and it works as expected: 'б' is encoded with a single token, 40. The same applies to tokenizers.Tokenizer (also in the notebook; see the sketch after this list). How come it works differently in transformers?

[screenshot from the Colab notebook]
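
For reference, a minimal sketch of the two paths being compared (assuming toy_corpus_ru is an iterable of Russian strings as in the notebook; the base checkpoint and trainer settings there may differ):

from tokenizers import Tokenizer, models, trainers
from transformers import AutoTokenizer

toy_corpus_ru = ["..."]  # hypothetical stand-in for the notebook's toy corpus

# tokenizers: a plain BPE trained directly on the corpus characters
tok = Tokenizer(models.BPE())
tok.train_from_iterator(toy_corpus_ru, trainer=trainers.BpeTrainer(vocab_size=52000))
print(tok.encode('б').ids)  # a single id in the notebook

# transformers: retraining an existing byte-level tokenizer
old_tokenizer = AutoTokenizer.from_pretrained("gpt2")  # hypothetical base; substitute the notebook's checkpoint
my_tokenizer = old_tokenizer.train_new_from_iterator(toy_corpus_ru, 52000)
print(my_tokenizer.encode('б'))  # several ids in the notebook, e.g. [128000, 272, 260]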

Thanks in advance for your help.

