Comments (3)
cc @itazap
from transformers.
Thanks for your detailed description and code samples, it's much appreciated! 🤗
The Cyrillic character 'б' being represented by multiple tokens such as ['Ð', '±'] is expected behavior; it comes from the way the tokenizer handles Unicode characters at the byte level.
In UTF-8, a character can be 1-4 bytes long, and Cyrillic letters like 'б' are encoded in 2 bytes ('б' = 0xd0 0xb1; see the Cyrillic Unicode table).
When the byte-level tokenizer represents these bytes, each one is shown as a printable Latin-1 character (0xd0 → 'Ð', 0xb1 → '±'; see the Latin-1 table).
During tokenizer training, each byte starts out as a separate token, and byte pairs are only merged into larger tokens if they occur together often enough. In your sample, the б character might not appear often enough for its two bytes to be merged. (See similar issues: #254, #203, #268)
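The byte-to-character mapping described above can be checked with plain Python, without any tokenizer at all (a minimal sketch; decoding each byte as Latin-1 mirrors how byte-level BPE displays bytes as printable characters):

```python
# UTF-8 encodes the Cyrillic 'б' as two bytes: 0xd0 0xb1
raw = 'б'.encode('utf-8')
print(list(raw))  # [208, 177] == [0xd0, 0xb1]

# A byte-level BPE vocabulary shows each byte as a printable
# character; for these two bytes that is the Latin-1 character
# with the same code point.
shown = [bytes([b]).decode('latin-1') for b in raw]
print(shown)  # ['Ð', '±']

# Decoding the two bytes together restores the original character.
print(raw.decode('utf-8'))  # 'б'
```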
You can verify that the odd-looking tokenization still round-trips correctly:
encoded = my_tokenizer.encode('б')  # [128000, 272, 260]
decoded = my_tokenizer.decode(encoded)
print(decoded)  # 'б' - decodes back to the Cyrillic character
Let me know if you have more questions or experience issues!
@itazap, thank you for your quick response!
I'm still confused.
TL;DR
Byte-level encoding for the 'б' character is not expected, since it does appear in the training corpus several times. Encoding works as expected with the sentencepiece lib and with the tokenizers lib.
- If I understand correctly, you are talking about byte fallback. But I thought that happens only when a symbol or word is not present in the corpus at all! However, the 'б' letter appears in the toy corpus several times...
- Again, my understanding is that if vocab_size is big enough, the BPE merge tree should capture all possible separate words, because the tree depth is not limited. So first I get alphabet tokens, then all possible pairs, then all possible words. Somehow that doesn't happen, even though vocab_size is 52000, which is more than enough for the toy corpus:
my_tokenizer = old_tokenizer.train_new_from_iterator(toy_corpus_ru, 52000)
- There is example code for sentencepiece in this Colab notebook, with examples of the unexpected behavior, and it works as expected: 'б' is encoded with a single token, 40. The same applies to tokenizers.Tokenizer (it is also in the notebook). How come it works differently in transformers?
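For reference, the merging behavior in question can be sketched with a toy BPE trainer in plain Python (a self-contained sketch for illustration, not the transformers implementation; `bpe_merges` is a hypothetical helper). With a word repeated in the corpus, its byte pair ('Ð', '±') does get merged, which is what I would expect from the real trainer too:

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Toy BPE trainer: words are lists of symbols; repeatedly
    merge the most frequent adjacent pair seen at least twice."""
    merges = []
    words = [list(w) for w in words]
    for _ in range(num_merges):
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), freq = pairs.most_common(1)[0]
        if freq < 2:  # nothing repeats often enough to merge
            break
        merges.append((a, b))
        new_words = []
        for w in words:
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and w[i] == a and w[i + 1] == b:
                    out.append(a + b)
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            new_words.append(out)
        words = new_words
    return merges, words

# 'ба' at the byte level is the symbols ['Ð', '±', 'Ð', '°'];
# repeating the word in the corpus makes every pair frequent.
corpus = [['Ð', '±', 'Ð', '°'], ['Ð', '±', 'Ð', '°']]
merges, tokenized = bpe_merges(corpus, 10)
# with enough merges, each repeated word collapses to one token
print(merges)
print(tokenized)
```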
Thanks in advance for your help.