Comments (3)
To clarify, are you saying that you want an option so that if the tokenizer would have to truncate an input, it just discards it entirely? Wouldn't it be better to handle this at the pre-processing stage, before you tokenize the data?
For example, I may have a sentence that exceeds the max length, but I only find that out by encoding it. If I filter at the pre-processing stage, I end up encoding my data twice, once to check the length and once again when I batch-tokenize it later, which wastes time when there is a lot of data. If the tokenizer method had a parameter so that a sentence whose encoding exceeds max_length is simply discarded, each sentence would only be encoded once.
Hey! This has not been requested much; I would recommend doing this manually, for example in your data collator: encode everything first, discard what's too long, then pad!
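A minimal sketch of that collator-style approach, under the assumption of a standard `AutoTokenizer` checkpoint (`bert-base-uncased` is just a placeholder, and `encode_and_filter` is a hypothetical helper, not a library API): encode everything once without truncation, drop the over-long examples, then pad the survivors with `tokenizer.pad`.

```python
from transformers import AutoTokenizer

# Placeholder checkpoint; substitute your own model's tokenizer.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def encode_and_filter(texts, max_length):
    """Encode every text once, drop those longer than max_length, pad the rest."""
    # No truncation and no padding here: we want the true encoded lengths.
    encoded = tokenizer(texts)
    # Keep only the examples whose token count fits within max_length.
    kept = [
        {key: encoded[key][i] for key in encoded.keys()}
        for i in range(len(texts))
        if len(encoded["input_ids"][i]) <= max_length
    ]
    # Pad the surviving examples to the longest one in the batch.
    return tokenizer.pad(kept, padding="longest", return_tensors="pt")

batch = encode_and_filter(
    ["a short sentence", "a sentence that might run past the limit"],
    max_length=16,
)
```

Filtering after encoding avoids tokenizing anything twice, which was the concern above; the trade-off is that the effective batch size shrinks whenever examples are dropped.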
Related Issues (20)
- RuntimeError: Failed to import transformers.integrations.bitsandbytes because of the following error (look up to see its traceback): cannot import name 'get_keys_to_not_convert' from 'transformers.integrations' HOT 1
- fp16 support for grounding dino HOT 1
- class Cache must not be a subclass of `torch.nn.Module` HOT 2
- Can't load Llama's tokenizer with add_prefix_space=True parameter. HOT 1
- tracker: move `prepare_inputs_for_generation` into the generation mixin 🧹 HOT 2
- `eval_on_start` triggers `AttributeError` in JupyterLab HOT 1
- Mamba slow implementation datatype mismatch HOT 1
- Causal language modeling should consider loss masks for pad tokens. HOT 2
- Confusion around correct bounding box format for DETR training HOT 6
- Q-GaLore Support HOT 2
- Cache updating when use_cache = False HOT 8
- split head_dim from hidden_size for llama like gemma or mistral
- Only last elements have expected outputs when doing batch inference HOT 5
- `resize_token_embeddings` with same tokenizer changing the model Embedding size HOT 6
- Jamba model: 'AttributeError: 'HybridMambaAttentionDynamicCache' object has no attribute '_modules'' HOT 3
- Difference in embedding weight initialization for randomly initialized T5 model
- confusing deprecation msg for `DynamicCache.seen_tokens` - no `cache_position` in this class HOT 1
- Integrate Liger (Linkedin GPU Efficient Runtime) Kernel to HuggingFace HOT 2
- Multiprocessing support HOT 4
- Speculative sampling does not maintain probability distribution of main model HOT 1