Comments (3)
If you tokenize the input properly (call tokenize before convert_tokens), it automatically falls back to subword/character-level(-like) embeddings.
You can add new words to the vocabulary, but you'll have to train the corresponding embeddings.
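To make the point about new words concrete, here is a minimal sketch in plain Python. It does not use the library's API: `vocab`, `embeddings`, `EMBED_DIM`, and `add_token` are hypothetical names for a toy lookup table. It only shows that a newly added word gets a fresh, randomly initialised vector, which carries no meaning until it is trained.

```python
import random

# Toy stand-ins, not the library's data structures (all names are hypothetical).
EMBED_DIM = 4
rng = random.Random(0)

vocab = {"<unk>": 0, "hello": 1, "world": 2}
# One embedding row per vocabulary entry.
embeddings = [[rng.uniform(-0.1, 0.1) for _ in range(EMBED_DIM)] for _ in vocab]


def add_token(token):
    """Add a token to the toy vocabulary and return its id.

    A new token gets a randomly initialised row; an existing token keeps its row.
    """
    if token in vocab:
        return vocab[token]
    vocab[token] = len(vocab)
    embeddings.append([rng.uniform(-0.1, 0.1) for _ in range(EMBED_DIM)])
    return vocab[token]


new_id = add_token("transformer")
print(new_id, embeddings[new_id])  # a random vector that still needs training
```

In a real model, the same idea means the embedding matrix has to grow by one row per added word, and those rows have to be learned during fine-tuning.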
from transformers.
Hi @siddsach,
Thanks for your kind words!
@artemisart is right: BPE progressively falls back to character-level embeddings for unseen words.
Hi, what do you mean by tokenizing the input properly (tokenize before convert_tokens)?
Could you share a tokenization sample (before and after), or some sample code if any? Thank you.
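As a rough before/after illustration of "tokenize before converting to ids", here is a self-contained sketch. It is not the library's tokenizer and not real BPE: `TOY_VOCAB` is a made-up vocabulary, and `tokenize` is a simple greedy longest-match splitter swapped in to mimic the fallback behaviour. The two function names mirror the steps mentioned in the thread, but they are toy stand-ins defined here.

```python
# Toy illustration only: a hypothetical vocabulary and a greedy longest-match
# splitter, NOT the learned BPE merges of any pretrained tokenizer.
TOY_VOCAB = ["<unk>", "token", "ize", "r", "s", "t", "o", "k",
             "e", "n", "i", "z", "a", "b"]
TOKEN_TO_ID = {tok: idx for idx, tok in enumerate(TOY_VOCAB)}


def tokenize(word):
    """Split a word into the longest known pieces, scanning left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it matches a vocabulary entry.
        while end > start and word[start:end] not in TOKEN_TO_ID:
            end -= 1
        if end == start:
            # Not even the single character is known.
            pieces.append("<unk>")
            start += 1
        else:
            pieces.append(word[start:end])
            start = end
    return pieces


def convert_tokens_to_ids(tokens):
    """Look up each token; anything outside the vocabulary maps to <unk>."""
    return [TOKEN_TO_ID.get(tok, TOKEN_TO_ID["<unk>"]) for tok in tokens]


# Before: the raw word is passed straight to the id lookup.
print(convert_tokens_to_ids(["tokenizers"]))            # [0]  (just <unk>)

# After: tokenize first, then convert the pieces.
print(tokenize("tokenizers"))                           # ['token', 'ize', 'r', 's']
print(convert_tokens_to_ids(tokenize("tokenizers")))    # [1, 2, 3, 4]

# A word with no multi-character piece in the toy vocabulary ends up as characters.
print(tokenize("zebra"))                                # ['z', 'e', 'b', 'r', 'a']
```

Skipping the tokenize step collapses the whole unseen word into a single unknown id, whereas tokenizing first keeps the information in subword or character pieces.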