Comments (3)
To help with debugging, here are the decoded outputs of each chunk:
for output in model_outputs:
    print(tokenizer.batch_decode(output['tokens']))
["<|startoftranscript|><|notimestamps|> DO IT! Just DO IT! Don't let your dreams be dreams. Yesterday, you said tomorrow, so just DO IT! MAKE YOUR DRIMS! CONTRO! JUST DO IT! Some people dream success while you're gonna wake up and work hard at it. Nothing is impossible.<|endoftext|>"]
["<|startoftranscript|><|notimestamps|> Some people dream success while you're gonna wake up and work hard at it. Nothing is impossible. You should get to the point where anyone else would quit and you're not gonna stop there. No, what are you waiting for? Do it! Just do it! Yes, you can! Just do it!<|endoftext|>"]
['<|startoftranscript|><|notimestamps|> Just do it! Yes you can! Just do it! If your tire is starting over, stop giving up.<|endoftext|>']
Indeed, the duplicated phrasing is at the word boundaries, so we can see where the algorithm messes up.
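To make the duplication easier to spot, here is a minimal sketch that measures how many words at the end of one decoded chunk are repeated at the start of the next. The helpers `strip_special_tokens`, `normalize`, and `word_overlap` are hypothetical names for illustration and are not part of transformers; the two strings are excerpts of the decoded outputs above.

```python
import re
import string


def strip_special_tokens(text):
    """Remove Whisper-style markers such as <|notimestamps|>."""
    return re.sub(r"<\|[^|]*\|>", "", text).strip()


def normalize(text):
    """Lowercase and strip punctuation so 'Yes,' and 'Yes' compare equal."""
    words = strip_special_tokens(text).split()
    cleaned = [w.strip(string.punctuation).lower() for w in words]
    return [w for w in cleaned if w]


def word_overlap(prev_text, next_text):
    """Longest run of words that both ends prev_text and starts next_text."""
    prev_words = normalize(prev_text)
    next_words = normalize(next_text)
    # Try the largest candidate overlap first and shrink until one matches.
    for size in range(min(len(prev_words), len(next_words)), 0, -1):
        if prev_words[-size:] == next_words[:size]:
            return prev_words[-size:]
    return []


# Excerpts from the second and third decoded chunks (illustrative only).
chunk_a = ("<|startoftranscript|><|notimestamps|> Do it! Just do it! "
           "Yes, you can! Just do it!<|endoftext|>")
chunk_b = ("<|startoftranscript|><|notimestamps|> Just do it! Yes you can! "
           "Just do it! If your tire is starting over, stop giving up.<|endoftext|>")

overlap = word_overlap(chunk_a, chunk_b)
print(len(overlap), "overlapping words:", " ".join(overlap))
```

This only compares normalized text, so it says nothing about timestamps; it is meant as a quick check of where consecutive chunks repeat each other.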
from transformers.
Thanks for pointing out this issue, @xenova!
I wonder if the problem comes from the chunking boundaries in model_outputs and not from how word-level timestamps are converted and concatenated in tokenizer._decode_asr, but I need to dig into this a bit.
Could you share the snippet you used to generate model_outputs so I can dive into what's going wrong here?
Thanks!
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.