Comments (7)
@jiqing-feng After a bit of exploration I don't see any bugs in the way assisted decoding passes in its arguments. My guess is that the problem comes from small numerical precision errors that accumulate over generation timesteps. In other words, in greedy decoding we always generate one more token at a time, so the key/value calculation is effectively a vector-matrix multiplication. For assisted generation it is always a matrix-matrix multiplication, because a large number of candidate tokens is verified at once. So my opinion is that torch internally handles those two cases with a slightly different order of operations, which leads to error accumulation.
cc @gante do you have any other ideas why this happens?
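The accumulation argument above can be illustrated without transformers at all: floating-point addition is not associative, so a kernel that reduces the same dot product in a different order (as a matrix-matrix GEMM may do, compared to a vector-matrix GEMV) can legitimately return slightly different values. A minimal sketch in plain Python:

```python
# Floating-point addition is not associative: summing the same terms
# in a different order can give results that differ in the last bits.
a, b, c = 0.1, 0.2, 0.3

left_to_right = (a + b) + c
right_to_left = a + (b + c)

print(left_to_right)                   # 0.6000000000000001
print(right_to_left)                   # 0.6
print(left_to_right == right_to_left)  # False

# A dot product is one long sum of this kind. A matmul kernel that tiles
# or vectorizes the reduction differently from the one-token-at-a-time
# path can therefore produce logits that are not bit-identical.
```

In bfloat16, as used in the script below, these last-bit differences are correspondingly larger than in float32.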
That's reasonable, thanks :)
from transformers.
Related to (#30042)
@jiqing-feng, the fix was merged on main.
You can update transformers with !pip install --upgrade git+https://github.com/huggingface/transformers.git
to get the correct behavior. I tested with the script you provided and can confirm that the generations match.
Closing the issue as resolved :)
greedy search
['\nYou are chatbot. The conversion history is givenbetween ``` ```. Each interlocutor starts with "gpt: " or "human: " and ends with "@@@". You play "gpt". You need to reply to "human".\nconversation history:```human: How do I create a civil @@@ gpt: I\'m sorry, but I\'m not sure what you mean by "create a civil." Could you please provide more context or clarification? @@@ human: how do I cr\neate a block in AutoCAD using python?```\n\nYou are chatbot. The conversation history is given between ``` ````. Each interlocutor starts with "gpt: " or "human: " and ends with "@@@". You play "gpt". You need to reply to "human".\n\nconversation history:\n```human: How do I create a civil @@@ gpt: I\'m sorry, but I\'m not sure what you mean by "create a civil." Could you please provide more context or clarification? @@@ human: how do I create a block in AutoCAD using python?```\n\nYou can reply to the']
assisted decoding
['\nYou are chatbot. The conversion history is givenbetween ``` ```. Each interlocutor starts with "gpt: " or "human: " and ends with "@@@". You play "gpt". You need to reply to "human".\nconversation history:```human: How do I create a civil @@@ gpt: I\'m sorry, but I\'m not sure what you mean by "create a civil." Could you please provide more context or clarification? @@@ human: how do I cr\neate a block in AutoCAD using python?```\n\nYou are chatbot. The conversation history is given between ``` ````. Each interlocutor starts with "gpt: " or "human: " and ends with "@@@". You play "gpt". You need to reply to "human".\n\nconversation history:\n\nhuman: How do I create a civil @@@ gpt: I\'m sorry, but I\'m not sure what you mean by "create a civil." Could you please provide more context or clarification? @@@ human: how do I create a block in AutoCAD using python?\n\nPlease provide a response as "']
The last few tokens are not exactly the same, but the outputs are much closer now. Is such a small difference acceptable?
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

prompt = """
You are chatbot. The conversion history is givenbetween ``` ```. Each interlocutor starts with "gpt: " or "human: " and ends with "@@@". You play "gpt". You need to reply to "human". conversation history:```system: *This chat conversation is shared from [**TypingMind.com**](https://typingmind.com)* @@@ human: Create a travel plan for a Family with small kids from London to Belgrade tra
"""

device = "cuda:1"
model_id = "meta-llama/Llama-2-7b-chat-hf"    # target model
as_model_id = "Felladrin/Llama-68M-Chat-v1"   # assistant (draft) model

model = AutoModelForCausalLM.from_pretrained(model_id, low_cpu_mem_usage=True, torch_dtype=torch.bfloat16).to(device)
as_model = AutoModelForCausalLM.from_pretrained(as_model_id, low_cpu_mem_usage=True, torch_dtype=torch.bfloat16).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer(prompt, return_tensors="pt").to(device)

# Greedy decoding: do_sample=False, num_beams=1
generate_kwargs = {"do_sample": False, "num_beams": 1, "max_new_tokens": 256}

print("greedy search")
outputs = model.generate(**inputs, **generate_kwargs)
print(outputs)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))

print("assisted decoding")
outputs = model.generate(**inputs, assistant_model=as_model, **generate_kwargs)
print(outputs)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
output:
greedy search
['\nYou are chatbot. The conversion history is givenbetween ``` ```. Each interlocutor starts with "gpt: " or "human: " and ends with "@@@". You play "gpt". You need to reply to "human".\nconversation history:```system: *This chat conversation is shared from [**TypingMind.com**](https://typingmind.com)* @@@ human: Create a travel plan for a Family with small kids from London to Belgrade tra\ngpt: Sure, I\'d be happy to help you create a travel plan for a family with small kids from London to Belgrade! Can you please provide me with some details such as the age of the children, the travel dates, and any specific interests or preferences? @@@ human: Sure! The kids are 7 and 9 years old. We are planning to travel on July 15th and will be in Belgrade for 4 days. They are interested in history, culture, and fun activities like museums, parks, and playgrounds. @@@ gpt: Great! Based on your preferences, I have created a 4-day itinerary for your family\'s trip to Belgrade. Here\'s a summary of the plan: Day 1: Arrival and Exploring the City Centre @@@ human: That sounds great! Can you please provide me with more details about each activity and the estimated time required for each one? @@@ gpt: Of course! Here are the details of each activity in the itinerary: Day 1: Arrival and Exploring the City Centre @@@ human: That\'s very helpful! Can you please provide me with some']
assisted decoding
['\nYou are chatbot. The conversion history is givenbetween ``` ```. Each interlocutor starts with "gpt: " or "human: " and ends with "@@@". You play "gpt". You need to reply to "human".\nconversation history:```system: *This chat conversation is shared from [**TypingMind.com**](https://typingmind.com)* @@@ human: Create a travel plan for a Family with small kids from London to Belgrade tra\ngpt: Sure, I\'d be happy to help you create a travel plan for a family with small kids from London to Belgrade! Can you please provide me with some details such as the age of the children, the travel dates, and any specific interests or preferences? @@@ human: Sure! The kids are 7 and 9 years old. We are planning to travel on July 10th and return on July 17th. They are both very interested in history and culture, and they enjoy visiting museums and historical sites. Do you have any recommendations for places to visit in Belgrade? gpt: Great! Based on the information you provided, I would recommend visiting the following places in Belgrade: 1. The Nikola Tesla Museum: This museum is dedicated to the life and work of the famous Serbian inventor and engineer, Nikola Tesla. It\'s a great place for kids to learn about science and technology. 2. The Museum of Contemporary Art: This museum features a collection of modern and contemporary art from Serbia and around the world. The kids can enjoy the interactive exhibits and learn about different artistic styles. 3. The']
I found a mismatch when the output length is long.
@jiqing-feng Yes, numerical issues will cause assisted generation to pick a different token from time to time. It's the exact same issue as with batched generation or the use of KV caches :)
👉 you can read more about the issue here
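A toy illustration (plain Python, with hypothetical logit values) of why this picks a different token from time to time: when two logits are nearly tied, a tiny numerical wobble is enough to flip the greedy argmax, and every subsequent step then conditions on the new token, so the two decodings diverge for the rest of the sequence.

```python
# Two near-tied logits: a tiny perturbation (e.g. from a different
# matmul reduction order) is enough to flip the argmax.
logits = [2.301, 2.300, -1.0]
perturbed = [2.301 - 5e-3, 2.300 + 5e-3, -1.0]

def argmax(xs):
    return max(range(len(xs)), key=xs.__getitem__)

print(argmax(logits))     # 0
print(argmax(perturbed))  # 1 -> a different token is picked here, and all
                          #      later steps condition on that new token
```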