
Comments (3)

ArthurZucker commented on June 16, 2024

cc @itazap


itazap commented on June 16, 2024

Thanks for your detailed description and code samples, they are much appreciated! 🤗

The Cyrillic character 'б' being represented by multiple tokens such as ['Ð', '±'] is expected behavior; it follows from the way the tokenizer handles Unicode characters at the byte level.

In UTF-8, characters can be 1-4 bytes long, and Cyrillic letters like 'б' are encoded in 2 bytes (see 'б' = '0xd0 0xb1' in the Cyrillic table).

When the tokenizer represents these bytes, each one becomes a separate character from the Latin-1 range: '0xd0' maps to 'Ð' and '0xb1' maps to '±' (see the Latin table).
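
For illustration, here is a minimal Python sketch of that byte decomposition (standard library only; for these two bytes the byte-level display characters happen to coincide with Latin-1):

raw = 'б'.encode('utf-8')                  # b'\xd0\xb1' - two bytes
print([hex(b) for b in raw])               # ['0xd0', '0xb1']
print(bytes([raw[0]]).decode('latin-1'))   # 'Ð'
print(bytes([raw[1]]).decode('latin-1'))   # '±'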

So in the tokenizer training process each byte initially counts as a separate token, and in your sample the two bytes of 'б' may not co-occur often enough for the trainer to merge them into a single token. (See similar issues: 254203268)
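
One way to check whether those two byte tokens were merged during training is to look them up in the trained vocabulary (a small sketch; my_tokenizer is the retrained tokenizer from the issue):

vocab = my_tokenizer.get_vocab()
print('Ð±' in vocab)                 # True only if the two bytes of 'б' were merged
print('Ð' in vocab, '±' in vocab)    # the individual byte-level tokens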

You can verify that the weird-looking tokenization is working correctly using:

encoded = my_tokenizer.encode('б') # [128000, 272, 260]
decoded = my_tokenizer.decode(encoded) # 'б' - converts back to Cyrillic representation
print(decoded)
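
You can also look at the intermediate token strings instead of the ids (the leading special token depends on the tokenizer):

tokens = my_tokenizer.convert_ids_to_tokens(encoded)
print(tokens)  # e.g. a special token followed by 'Ð' and '±' for a byte-level tokenizer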

Let me know if you have more questions or experience issues!


Anna-Pinewood commented on June 16, 2024

@itazap, thank you for your quick response!
I'm still confused.
TL;DR
Byte-level encoding for the 'б' character is not what I expect, since it does appear in the training corpus several times. The same setup works as expected with the sentencepiece lib and with the tokenizers lib.

  1. If I understand correctly, you are talking about byte fallback. But I thought that only happens when the symbol or word is not present in the corpus at all! However, the 'б' letter appears in the toy corpus several times...
[screenshot of the toy corpus]
  2. Again, my understanding is that if vocab_size is big enough, the BPE merge tree should capture every separate word, because its depth is not limited: first the alphabet tokens, then all possible pairs, then all possible words. Yet that doesn't happen, even though vocab_size is 52000, which is more than enough for the toy corpus.
    my_tokenizer = old_tokenizer.train_new_from_iterator(toy_corpus_ru, 52000)

  3. There is example code for sentencepiece in this Colab notebook with examples of the unexpected behavior, and it works as expected: 'б' is encoded with a single token, 40. The same applies to tokenizers.Tokenizer (also in the notebook; see the sketch after this list). How come it works differently in transformers?

[screenshot from the Colab notebook]
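
For reference, a minimal sketch of the two paths being compared (assuming toy_corpus_ru is an iterable of Russian strings as in the notebook; the base checkpoint and trainer settings there may differ):

from tokenizers import Tokenizer, models, trainers
from transformers import AutoTokenizer

toy_corpus_ru = ["..."]  # hypothetical stand-in for the notebook's toy corpus

# tokenizers: a plain BPE trained directly on the corpus characters
tok = Tokenizer(models.BPE())
tok.train_from_iterator(toy_corpus_ru, trainer=trainers.BpeTrainer(vocab_size=52000))
print(tok.encode('б').ids)  # a single id in the notebook

# transformers: retraining an existing byte-level tokenizer
old_tokenizer = AutoTokenizer.from_pretrained("gpt2")  # hypothetical base; substitute the notebook's checkpoint
my_tokenizer = old_tokenizer.train_new_from_iterator(toy_corpus_ru, 52000)
print(my_tokenizer.encode('б'))  # several ids in the notebook, e.g. [128000, 272, 260]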

Thanks in advance for your help.

