Comments (3)
To clarify, are you saying that you want an option so that if the tokenizer would have to truncate an input, it just discards it entirely? Wouldn't it be better to handle this at the pre-processing stage, before you tokenize the data?
For example, I may have a sentence that exceeds the max length, but I only find that out by encoding it. If I filter at the pre-processing stage, I end up encoding my data twice, once to check the length and once again when I batch-tokenize it later, which wastes time when there is a lot of data. If the tokenizer method had a parameter so that a sentence whose encoding exceeds max_length is simply discarded, each sentence would only be encoded once.
Hey! This has not been requested much; I would recommend doing this manually, for example in your data collator: encode everything first, discard what's too long, then pad!
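A minimal sketch of that collator-style approach, under the assumption of a standard `AutoTokenizer` checkpoint (`bert-base-uncased` is just a placeholder, and `encode_and_filter` is a hypothetical helper, not a library API): encode everything once without truncation, drop the over-long examples, then pad the survivors with `tokenizer.pad`.

```python
from transformers import AutoTokenizer

# Placeholder checkpoint; substitute your own model's tokenizer.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def encode_and_filter(texts, max_length):
    """Encode every text once, drop those longer than max_length, pad the rest."""
    # No truncation and no padding here: we want the true encoded lengths.
    encoded = tokenizer(texts)
    # Keep only the examples whose token count fits within max_length.
    kept = [
        {key: encoded[key][i] for key in encoded.keys()}
        for i in range(len(texts))
        if len(encoded["input_ids"][i]) <= max_length
    ]
    # Pad the surviving examples to the longest one in the batch.
    return tokenizer.pad(kept, padding="longest", return_tensors="pt")

batch = encode_and_filter(
    ["a short sentence", "a sentence that might run past the limit"],
    max_length=16,
)
```

Filtering after encoding avoids tokenizing anything twice, which was the concern above; the trade-off is that the effective batch size shrinks whenever examples are dropped.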
Related Issues (20)
- RuntimeError: Failed to import transformers.integrations.bitsandbytes because of the following error (look up to see its traceback): cannot import name 'get_keys_to_not_convert' from 'transformers.integrations' HOT 1
- fp16 support for grounding dino HOT 1
- class Cache must not be a subclass of `torch.nn.Module` HOT 2
- Can't load Llama's tokenizer with add_prefix_space=True parameter. HOT 1
- tracker: move `prepare_inputs_for_generation` into the generation mixin 🧹 HOT 2
- `eval_on_start` triggers `AttributeError` in JupyterLab HOT 1
- Mamba slow implementation datatype mismatch HOT 1
- Causal language modeling should consider loss masks for pad tokens. HOT 2
- Confusion around correct bounding box format for DETR training HOT 6
- Q-GaLore Support HOT 2
- Cache updating when use_cache = False HOT 8
- split head_dim from hidden_size for llama like gemma or mistral
- Only last elements have expected outputs when doing batch inference HOT 5
- `resize_token_embeddings` with same tokenizer changing the model Embedding size HOT 6
- Jamba model: 'AttributeError: 'HybridMambaAttentionDynamicCache' object has no attribute '_modules'' HOT 3
- Difference in embedding weight initialization for randomly initialized T5 model
- confusing deprecation msg for `DynamicCache.seen_tokens` - no `cache_position` in this class HOT 1
- Integrate Liger (Linkedin GPU Efficient Runtime) Kernel to HuggingFace HOT 2
- Multiprocessing support HOT 4
- Speculative sampling does not maintain probability distribution of main model HOT 1