
streaming-llm's Introduction

Efficient Streaming Language Models with Attention Sinks

[paper] [slides] [video]

[figure: schemes]

streamingllm_demo.mp4

TL;DR

We deploy LLMs for infinite-length inputs without sacrificing efficiency and performance.

News

  • [2024/02] StreamingLLM is covered by MIT News as a spotlight!
  • [2024/01] SwiftInfer, a TensorRT-based implementation, makes StreamingLLM more production-grade.
  • [2024/01] StreamingLLM is integrated into NVIDIA TensorRT-LLM!
  • [2023/12] StreamingLLM enables endless and efficient LLM generation on iPhone!
  • [2023/12] StreamingLLM is integrated into HuggingFace Transformers' main branch.
  • [2023/10] StreamingLLM is integrated into Intel Extension for Transformers.
  • [2023/10] Check out Attention Sinks, a third-party implementation to enable StreamingLLM on more Huggingface LLMs.

Abstract

Deploying Large Language Models (LLMs) in streaming applications such as multi-round dialogue, where long interactions are expected, is urgently needed but poses two major challenges. Firstly, during the decoding stage, caching previous tokens' Key and Value states (KV) consumes extensive memory. Secondly, popular LLMs cannot generalize to longer texts than the training sequence length. Window attention, where only the most recent KVs are cached, is a natural approach --- but we show that it fails when the text length surpasses the cache size. We observe an interesting phenomenon, namely attention sink, that keeping the KV of initial tokens will largely recover the performance of window attention. In this paper, we first demonstrate that the emergence of attention sink is due to the strong attention scores towards initial tokens as a "sink" even if they are not semantically important. Based on the above analysis, we introduce StreamingLLM, an efficient framework that enables LLMs trained with a finite length attention window to generalize to infinite sequence length without any fine-tuning. We show that StreamingLLM can enable Llama-2, MPT, Falcon, and Pythia to perform stable and efficient language modeling with up to 4 million tokens and more. In addition, we discover that adding a placeholder token as a dedicated attention sink during pre-training can further improve streaming deployment. In streaming settings, StreamingLLM outperforms the sliding window recomputation baseline by up to 22.2x speedup.

Usage

Environment Setup

conda create -yn streaming python=3.8
conda activate streaming

pip install torch torchvision torchaudio
pip install transformers==4.33.0 accelerate datasets evaluate wandb scikit-learn scipy sentencepiece

python setup.py develop
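
As a quick sanity check that the editable install worked (assuming the package name streaming_llm used by the example scripts and the tracebacks in the issues below):

python -c "import streaming_llm; print('ok')"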

Run Streaming Llama Chatbot

CUDA_VISIBLE_DEVICES=0 python examples/run_streaming_llama.py  --enable_streaming
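
As the logs in the issues below show, the demo loads lmsys/vicuna-13b-v1.3 by default; to point it at a local checkpoint instead, pass the --model_name_or_path flag used in several of the reports, for example:

CUDA_VISIBLE_DEVICES=0 python examples/run_streaming_llama.py --enable_streaming --model_name_or_path /path/to/Llama-2-7b-hf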

FAQ

  1. What does "working on infinite-length inputs" imply for LLMs?

    Handling infinite-length text with LLMs presents challenges. Notably, storing all previous Key and Value (KV) states demands significant memory, and models might struggle to generate text beyond their training sequence length. StreamingLLM addresses this by retaining only the most recent tokens and attention sinks, discarding intermediate tokens. This enables the model to generate coherent text from recent tokens without a cache reset — a capability not seen in earlier methods. (A minimal code sketch of this keep-sinks-plus-recent policy appears after this FAQ.)

  2. Is the context window of LLMs expanded?

    No. The context window remains unchanged. Only the most recent tokens and attention sinks are retained, discarding middle tokens. This means the model can only process the latest tokens. The context window remains constrained by its initial pre-training. For instance, if Llama-2 is pre-trained with a context window of 4096 tokens, then the maximum cache size for StreamingLLM on Llama-2 remains 4096.

  3. Can I input an extensive text, like a book, into StreamingLLM for summarization?

    While you can input a lengthy text, the model will only recognize the latest tokens. Thus, if a book is an input, StreamingLLM might only summarize the concluding paragraphs, which might not be very insightful. As emphasized earlier, we neither expand the LLMs' context window nor enhance their long-term memory. StreamingLLM's strength lies in generating fluent text from recent tokens without needing a cache refresh.

  4. What is the ideal use case for StreamingLLM?

    StreamingLLM is optimized for streaming applications, such as multi-round dialogues. It's ideal for scenarios where a model needs to operate continually without requiring extensive memory or dependency on past data. An example is a daily assistant based on LLMs. StreamingLLM would let the model function continuously, basing its responses on recent conversations without needing to refresh its cache. Earlier methods would either need a cache reset when the conversation length exceeded the training length (losing recent context) or recompute KV states from recent text history, which can be time-consuming.

  5. How does StreamingLLM relate to recent works on context extension?

    StreamingLLM is orthogonal to recent context extension methods and can be integrated with them. In StreamingLLM's context, "context extension" refers to the possibility of using a larger cache size to store more recent tokens. For a practical demonstration, refer to Figure 9 in our paper, where we implement StreamingLLM with models like LongChat-7B-v1.5-32K and Llama-2-7B-32K-Instruct.
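
To make the policy in FAQ 1 and 2 concrete, here is a minimal sketch of sink-plus-recent eviction. It is not the repository's StartRecentKVCache implementation; it assumes the KV sequence dimension sits at dim=2, which varies across models (this is what the k_seq_dim/v_seq_dim question in the issues below is about):

import torch

def evict_for_space(past_key_values, start_size=4, recent_size=2000):
    # Keep the first `start_size` (attention-sink) positions and the last
    # `recent_size` positions of every layer's KV cache; drop the middle.
    if past_key_values is None:
        return None
    new_past = []
    for k, v in past_key_values:
        seq_len = k.size(2)
        if seq_len <= start_size + recent_size:
            new_past.append((k, v))
            continue
        k = torch.cat([k[:, :, :start_size], k[:, :, -recent_size:]], dim=2)
        v = torch.cat([v[:, :, :start_size], v[:, :, -recent_size:]], dim=2)
        new_past.append((k, v))
    return new_past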

TODOs

We will release the code and data in the following order; please stay tuned!

  • Release core code of StreamingLLM, including Llama-2, MPT, Falcon, and Pythia.
  • Release perplexity evaluation code.
  • Release Streaming Llama Chatbot demo.
  • Release StreamEval dataset and evaluation code.

Citation

If you find StreamingLLM useful or relevant to your project and research, please kindly cite our paper:

@article{xiao2023streamingllm,
  title={Efficient Streaming Language Models with Attention Sinks},
  author={Xiao, Guangxuan and Tian, Yuandong and Chen, Beidi and Han, Song and Lewis, Mike},
  journal={arXiv},
  year={2023}
}

streaming-llm's People

Contributors

cosmojg, guangxuan-xiao, r2d4, songhan, tomaarsen


streaming-llm's Issues

Is the position of "kv_cache.evict_for_space" in the code wrong?

According to the paper, StreamingLLM should take effect during generation, but "kv_cache.evict_for_space" is called in "streaming_inference", not in "greedy_generate".

If "kv_cache.evict_for_space" only lives in "streaming_inference", how is this different from dense attention with position extrapolation?

Some questions about the paper

Hello, I have a few questions about your article and hope you can help: @Guangxuan-Xiao

  1. If the sequence is truncated to the K-1 tokens before the current token each time (with a maximum context window of K) and attention is re-computed over that sequence, won't the PPL suddenly increase a lot?

  2. In theory, when the cache is at its maximum length and the first 4 tokens serve as anchor points, inputting a new token only evicts the first token. Three anchor tokens still remain, so PPL should only degrade slightly, yet why does it increase so much?

  3. I would like to confirm whether StreamingLLM simply discards the KV of the evicted intermediate tokens from the normal inference KV cache. If so, it is essentially an improvement on the normal KV cache that accelerates inference: with continuous input, there is no need to clear the cache and recompute.

Questions on "streaming-llm" Paper

Firstly, I'd like to express my appreciation for your insightful paper and the open-source 'streaming-llm'. Your approach and experiments are truly commendable. I hope you don't mind; I would really appreciate some hints on the questions below.

  1. As mentioned in your paper, the attention scores of the initial tokens seem crucial. I was under the impression that Dense Attention might be more effective than Window Attention, especially since Dense Attention always keeps the KV of the initial tokens. Does Window Attention perform better simply because Dense Attention struggles with significantly longer input lengths, i.e., with length extrapolation?

  2. Regarding Figure 1-(b), the PPL of Window Attention seems to be the lowest, yet it carries an 'X' label. Should the PPL of Window Attention actually be larger, or should it not have the 'X' label?

I appreciate any clarifications you can provide.

Model paths randomly set

On the first run, the models are downloaded to a cache folder, for example:

C:\Users\super\.cache\huggingface\hub\models--lmsys--vicuna-13b-v1.3\snapshots\6566e9cb1787585d1147dcf4f9bc48f29e1328d2

How is the code expected to work afterwards?

Loading model from lmsys/vicuna-13b-v1.3 ...
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. If you see this, DO NOT PANIC! This is expected, and simply means that the legacy (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set legacy=False. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in huggingface/transformers#24565
Traceback (most recent call last):
File "examples/run_streaming_llama.py", line 122, in
main(args)
File "examples/run_streaming_llama.py", line 80, in main
model, tokenizer = load(model_name_or_path)
File "d:\trial\streaming-llm\streaming_llm\utils.py", line 58, in load
model = AutoModelForCausalLM.from_pretrained(
File "C:\Users\super\anaconda3\envs\streaming\lib\site-packages\transformers\models\auto\auto_factory.py", line 563, in from_pretrained
return model_class.from_pretrained(
File "C:\Users\super\anaconda3\envs\streaming\lib\site-packages\transformers\modeling_utils.py", line 3175, in from_pretrained
) = cls._load_pretrained_model(
File "C:\Users\super\anaconda3\envs\streaming\lib\site-packages\transformers\modeling_utils.py", line 3296, in _load_pretrained_model
raise ValueError(
ValueError: The current device_map had weights offloaded to the disk. Please provide an offload_folder for them. Alternatively, make sure you have safetensors installed if the model you are using offers the weights in this format.

Questions on the demo results

I would like to express my gratitude for your paper and code, which have been truly enlightening for me. I conducted the experiments following the instructions provided in the README. I would be grateful if you could provide some insights regarding the following query.

In the scenario where "enable_streaming" is enabled, the language model (LLM) performs well when presented with questions from the test dataset. It generates responses to each question seamlessly until completion.

However, in the case where "enable_streaming" is not enabled, the LLM initially responds well to a few questions at the beginning. However, as more questions are presented, its performance deteriorates, ultimately leading to an error: "torch.cuda.OutOfMemoryError: CUDA out of memory," which aligns with what is shown in the README video.

I can comprehend that the decline in performance in the second case may be attributed to attention sink's eviction. Nevertheless, I am uncertain about the reason behind the "CUDA out of memory" error.

I would greatly appreciate any insights or clarifications you could offer on this matter.

Metal Support

raise AssertionError("Torch not compiled with CUDA enabled") AssertionError: Torch not compiled with CUDA enabled

Any plans to get Metal support for us M2 users without CUDA? Thanks!

Question about evaluation results and demo

  1. I found the concept of "window attention" confusing. In Fig. 1 there are two types of window attention: (b) naive window attention and (c) window attention with re-computation. Fig. 3 shows that (c) re-computation behaves close to StreamingLLM on PPL, but Table 1 says "window" attention has poor PPL, so I guess Table 1 uses (b) naive window attention? Table 5 also says "window" attention fails on the ARC benchmark, so I guess that is (b) as well? Then Figure 10 says the speedup is benchmarked against (c) window attention with re-computation. Could you benchmark ALL results with BOTH "window attention" methods to make the comparison fair? Or did I miss anything?
  2. Looking at your demo video and https://github.com/mit-han-lab/streaming-llm/blob/main/examples/run_streaming_llama.py, I don't quite understand why the model generates erroneous tokens (when "model performance breaks") if streaming is not enabled. Since the prompts are actually processed by the model one by one (#L63), I would expect the model to either go OOM or keep generating good tokens. Where do the erroneous tokens come from?
  3. What is the exact pipeline of the ARC evaluation (Table 5)? Does the model "process q1 -> generate a1 with the evicted past cache of q1 -> process q2 with the evicted past cache of q1 and a1 -> generate a2 with the evicted past cache of q1, a1, and q2 -> ..." (which is what run_streaming_llama.py does), or "process [q1, q2, q3, ..., qn] -> generate [a1, a2, a3, ..., an]"?

Thanks in advance!

For LLMs already trained with window attention and BOS token

Nice work!

I am wondering whether this attention-sink trick is still needed for LLMs that have already been trained with window attention (e.g., Mistral). While I am curious about this, I still think the attention sink is the more general approach, since it can be applied to almost any LLM, whether or not it was trained with window attention.

In particular, for Llama or other LLMs with a BOS token, the attention sink can be viewed as a soft version of hard truncation of the farthest tokens, where the sink token plays much the same role as the BOS token and the position ids are also properly reorganized. This makes me wonder whether attention sinks would work well in long-context scenarios (e.g., LongEval). Although StreamEval seems to test long-context modeling ability, I don't see why StreamingLLM can outperform dense attention when the context length lies between the cache size and the pretrained length.

By the way, I am not entirely sure what window attention with re-computation does, or why it works.

How to answer a question in the middle of a long input

I wonder how streaming-llm answers questions located in the middle of a long input. Specifically, what is the entire decoding process? When it generates the first token, where do the tokens in the KV cache come from?

Are past_key_values repeatedly computed?

Hi! The attention sink is amazing for LLMs.
I am confused about past_key_values in streaming-llm.
I assumed past_key_values would be recomputed for every new input, but I noticed that streaming-llm stores past_key_values by turning use_cache on.
I tried my best to store past_key_values and reuse them for new input inference, but the output was very strange, whereas the output of streaming-llm is really good.
I would really like to know what you did to make reusing past_key_values work.
Thanks a lot!

While streaming with sinks, how does the framework change the positional encodings of the KV cache without having to multiply with the Key and Value matrices?

From Section 3.2 in the paper:

When determining the relative distance and adding positional information to tokens, StreamingLLM
focuses on positions within the cache rather than those in the original text. This distinction is crucial
for StreamingLLM’s performance. For instance, if the current cache has tokens [0, 1, 2, 3, 6, 7, 8]
and is in the process of decoding the 9th token, the positions assigned are [0, 1, 2, 3, 4, 5, 6, 7], rather
than the positions in the original text, which would be [0, 1, 2, 3, 6, 7, 8, 9].

How is it possible to change the positional embeddings of the cached tokens when the forward passes for the next iteration are performed?
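
One way to reconcile this, and roughly the idea behind the repository's pos_shift attention forward (the snippet below is a conceptual sketch, not the actual modify_llama.py code), is to store keys in the cache before rotary position embedding is applied, and then apply RoPE at attention time using positions within the cache (0, 1, ..., L-1) rather than positions in the original text:

import torch

def rotate_half(x):
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rope(x, cos, sin, position_ids):
    # x: [batch, heads, seq, head_dim]; cos/sin: [max_positions, head_dim]
    cos = cos[position_ids].unsqueeze(1)  # -> [batch, 1, seq, head_dim]
    sin = sin[position_ids].unsqueeze(1)
    return x * cos + rotate_half(x) * sin

def attend_with_cache_positions(q, cached_k, cos, sin):
    # q: [batch, heads, 1, head_dim] query of the current token (pre-RoPE).
    # cached_k: [batch, heads, cache_len, head_dim] keys stored WITHOUT RoPE,
    # with the current token's key already appended after eviction.
    bsz, _, cache_len, _ = cached_k.shape
    key_pos = torch.arange(cache_len).unsqueeze(0).expand(bsz, -1)
    query_pos = torch.full((bsz, 1), cache_len - 1)  # cache-relative position
    k = apply_rope(cached_k, cos, sin, key_pos)
    q = apply_rope(q, cos, sin, query_pos)
    return (q @ k.transpose(-1, -2)) / (q.size(-1) ** 0.5)

Because the cached keys never had RoPE baked in, re-assigning their positions after eviction only changes which cos/sin rows are gathered; no re-projection through the key or value weight matrices is needed.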

Pre-training with window attention

Thanks for your great work!
I want to ask how to use an initial (sink) token, as in Section 3.3, to pre-train a GPT-like model that uses window attention. For example, if the input length is 2048 and the window size is 512, the queries and keys are shaped (batch, len/window_size, window_size, dimension); should I prepend the initial token to every query window, making the shape (batch, len/window_size, window_size+1, dimension)?
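
For reference on the data side, the pre-training experiment in the paper (Section 3.3) prepends a single learnable placeholder ("sink") token to every training sample. A minimal sketch of that preprocessing step, with a hypothetical sink token id (whether the sink should also be injected into every attention window, as asked above, is a separate design choice):

import torch

SINK_ID = 32000  # hypothetical id of a sink token newly added to the vocabulary

def prepend_sink(input_ids, sink_id=SINK_ID):
    # input_ids: [batch, seq_len] training token ids.
    sink = torch.full((input_ids.size(0), 1), sink_id, dtype=input_ids.dtype)
    return torch.cat([sink, input_ids], dim=1)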

Question about re-computation

Hello. After reading the paper and code, I have a question about the Sliding Window with Re-computation method: is there any code or paper about this baseline for reference? Thank you.

How do you feed long texts to a model?

I naively tried adding examples to https://github.com/mit-han-lab/streaming-llm/blob/main/data/mt_bench.jsonl, including examples about 4k tokens long, without changing anything else in the script. I receive:

ASSISTANT: Token indices sequence length is longer than the specified maximum sequence length for this model (3905 > 2048). Running this sequence through the model will result in indexing errors
[degenerate output: long runs of repeated "d", "-", "d0", and similar fragments]

Did I misunderstand "infinite-length inputs without sacrificing efficiency and performance. "?

Comparison with SWA in Mistral

Hi @Guangxuan-Xiao, do you have any comparison with the sliding window attention used in Mistral? The paper only describes SWA with re-computation, which is not how it works in the newer models.

Sliding Window with Re-computation rebuilds the KV states from the L recent tokens for each new token.

Basically, this is not what they do in the Mistral model: they do not rebuild the KV states, they evict the oldest part of the cache in favor of the newest parts.
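
To make the distinction concrete, here is an illustrative sketch of the two behaviours; model below is a hypothetical callable returning (logits, past_key_values), not the actual Mistral or streaming-llm API:

def step_with_recompute(model, all_token_ids, window_size):
    # Sliding window with re-computation: re-encode the last L tokens from
    # scratch for every new token (roughly O(L^2) attention work per step).
    window = all_token_ids[-window_size:]
    logits, _ = model(window, past_key_values=None)
    return logits

def truncate_oldest(past_key_values, keep):
    # Keep only the most recent `keep` cache positions (seq dim assumed at dim=2).
    return [(k[:, :, -keep:], v[:, :, -keep:]) for k, v in past_key_values]

def step_with_eviction(model, new_token_id, past_key_values, window_size):
    # Mistral-style SWA: reuse the cache, feed only the new token, and evict
    # the oldest entries (roughly O(L) attention work per step).
    logits, past_key_values = model([new_token_id], past_key_values=past_key_values)
    return logits, truncate_oldest(past_key_values, keep=window_size)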

I'm (A Bit) Suspicious of Table 3.

Hi there, thanks for writing such an interesting paper. When I heard of your paper, I immediately had the thought that it might be related to Evan Miller's Attention is Off By One blog post. And I was right! I was excited to see your experiments, but I got confused when I came to Table 3, which describes the results of pre-training, from scratch, identical language models corresponding to vanilla Attention, Attention with a Zero Sink, and Attention with a Learnable Sink, with 0-4 sink tokens prepended.

Maybe it's because of some strange sense of "moral truth" I have about the Zero Sink, but I was a little surprised that it didn't do better experimentally. Then I looked closer, noticed your $0 + 1024$ experiments, and was a little confused by the results presented.

In the table description you say

Cache config x+y denotes adding x initial tokens with y recent tokens.

Based on that definition, shouldn't the $0 + 1024$ case with 0 sink tokens make the 3 formulations equivalent? Where do the wildly different perplexity results come from for that experiment? Perhaps I'm misunderstanding the description of this table.

Thank you for fielding my questions!

Question about initial tokens

Thanks for your awesome work. I have some questions about the concept of initial tokens and the implementation of learnable initial tokens.

  1. In Fig. 2, you reach the conclusion that existing LLMs tend to pay more attention to the first (four) tokens of each sequence, and that this holds even if those tokens are replaced with meaningless tokens like '\n'.

So, can I understand this phenomenon as the position embeddings of the first (four) tokens playing an important role in attracting attention weights during the generation of the following tokens?

  2. In my understanding of your experimental implementation, you add a learnable initial token at the beginning of each sequence to store the attention bias. I am curious whether this learnable initial token produces the same effect as the special BOS token that the tokenizer usually adds automatically to indicate the beginning of the sequence.

Thanks

Question about Table 1 in the paper

Hi, I have a question about the experimental results in Table 1: do the three sliding-window settings assign token positions as [0, 1, 2, 3, ..., window_size]?

Question on intuition of "attention sink" and "alibi PE"

Hi,

Thanks for the amazing work on streaming-llm. While reading the paper, I came up with this question on why applying "attention sink" also works with models with alibi position embedding.
One observation from ALiBi-based models is that the added relative bias results in very small attention scores for tokens at the beginning of a long sequence, since the attention score decays roughly exponentially with distance. Can you offer an explanation of why the attention sink still works in this scenario? Maybe I've missed something in the paper; thanks very much.

Confused about the four attention mechanisms and their performance in the paper

Nice idea, and it really works well! Thanks for your nice work, but I have some questions. The paper mentions four attention mechanisms: dense attention fails because the sequence length no longer matches the training length once the output grows beyond it, and window attention fails because it evicts the initial tokens' KV cache. For sliding window attention with re-computation and streaming attention, I have the following questions.

  • Sliding window attention with re-computation just recomputes the KV states from the L most recent tokens, so in theory it should have the same PPL as window attention, because both use the same tokens' KV (as the figure shows); it only saves KV memory. However, sliding window attention has better PPL in the figure. Am I misunderstanding sliding window attention?
  • Streaming attention uses the initial tokens' KV plus the L most recent tokens' KV, and the reason for keeping the initial tokens' KV is clear in the paper. Dense attention also uses these tokens' KV, so it should not suffer from the softmax-shift problem either; it even attends to more tokens, so dense attention should have better PPL thanks to the longer context, while streaming attention should only win on inference speed and output length because it uses fewer tokens. But in the figure, streaming attention has better PPL than dense attention.

Thanks for your nice work again. Hope to get a reply.

RuntimeError in run_streaming_llama.py When Using Accelerate with Streaming Llama Model on A4500 GPU

Description:
When running the run_streaming_llama.py script with the --enable_streaming flag, I encountered a RuntimeError related to CUDA and the Hugging Face Accelerate library.

Steps to Reproduce:
Set the environment variable: CUDA_VISIBLE_DEVICES=0
Run the following command:

python examples/run_streaming_llama.py --enable_streaming

Expected Behavior:
The script should run successfully and provide streaming inference results.

Actual Behavior:
The script crashes with the following error:

RuntimeError: handle_0 INTERNAL ASSERT FAILED at "../c10/cuda/driver_api.cpp":15, please report a bug to PyTorch.

GPU information

CUDA version: 11.7
GPU: NVIDIA RTX A4500
GPU driver information: 
NVRM version: NVIDIA UNIX x86_64 Kernel Module  525.78.01  Mon Dec 26 05:58:42 UTC 2022
GCC version:  gcc version 9.4.0 (Ubuntu 9.4.0-1ubuntu1~20.04.1)

OS information:

Distributor ID:	Ubuntu
Description:	Ubuntu 20.04.5 LTS
Release:	20.04
Codename:	focal

Full error log:

(streaming) zli@aes:~/streaming-llm$ CUDA_VISIBLE_DEVICES=0 python examples/run_streaming_llama.py  --enable_streaming
Loading model from lmsys/vicuna-13b-v1.3 ...
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. If you see this, DO NOT PANIC! This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:14<00:00,  4.74s/it]
Loading data from data/mt_bench.jsonl ...
StartRecentKVCache: 4, 2000

USER: Compose an engaging travel blog post about a recent trip to Hawaii, highlighting cultural experiences and must-see attractions.

ASSISTANT: 🌴🌺 Aloha from the beautiful islands of Hawaii! I recently had the opportunity to explore the diverse culture and stunning landscapes of this tropical paradise, and I can't wait to share my adventures with you.

🏝️ First things first, let's talk about the beaches. Hawaii is known for its pristine white sand and crystal clear waters, and I was lucky enough to visit some of the most breathtaking beaches on the islands. My favorite was Waimea Bay on the North Shore of Oahu. The waves were perfect for surfing, and the surrounding cliffs and lush greenery made for a postcard-worthy backdrop.

🌊 But Hawaii isn't just about the beaches. The culture of the islands is rich and vibrant, and I made it a point to immerse myself in it during my trip. I attended a traditional Hawaiian luau on the Big Island, where I learned about the history and customs of the native people. I also tried my hand at hula dancing and even got to sample some delicious Hawaiian cuisine, including poke bowls and shaved ice.

🏯 Another must-see attraction in Hawaii is the Hawaii Volcanoes National Park on the Big Island. The park is home to two active volcanoes, Kilauea and Mauna Loa, and offers a unique opportunity to see the power of nature up close. I hiked through the park and was amazed by the steam vents, lava tubes, and volcanic craters. It was truly a once-in-a-lifetime experience.

🌺 In addition to the natural wonders of Hawaii, the islands are also home to some of the most beautiful gardens and parks in the world. The Maui Tropical Plantation offered a glimpse into the island's agricultural history, while the Hilo Tropical Botanical Garden showcased the stunning flora and fauna of the area.

🏯 In conclusion, Hawaii is a destination that offers something for everyone. From the stunning beaches to the rich culture and unique attractions, there's no shortage of things to see and do on these beautiful islands. If you're planning a trip to Hawaii, be sure to add these must-see attractions to your itinerary. Mahalo for reading, and aloha!

USER: Rewrite your previous response. Start every sentence with the letter A.

ASSISTANT: 🌴🌺 Aloha from the beautiful islands of Hawaii! I recently had the opportunity to explore the diverse culture and stunning landscapes of this tropical paradise, and I can't wait to share my adventures with you.

🏝️ First things first, let's talk about the beaches. Hawaii is known for its pristine white sand and crystal clear waters, and I was lucky enough to visit some of the most breathtaking beaches on the islands. My favorite was Waimea Bay on the North Shore of Oahu. The waves were perfect for surfing, and the surrounding cliffs and lush greenery made for a postcard-worthy backdrop.

🌊 But Hawaii isn't just about the beaches. The culture of the islands is rich and vibrant, and I made it a point to immerse myself in it during my trip. I attended a traditional Hawaiian luau on the Big Island, where I learned about the history and customs of the native people. I also tried my hand at hula dancing and even got to sample some delicious Hawaiian cuisine, including poke bowls and shaved ice.

🌺 In addition to the natural wonders of Hawaii, the islands are also home to some of the most stunning gardens and parks in Traceback (most recent call last):
  File "examples/run_streaming_llama.py", line 122, in <module>
    main(args)
  File "examples/run_streaming_llama.py", line 103, in main
    streaming_inference(
  File "/home/zli/anaconda3/envs/streaming/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "examples/run_streaming_llama.py", line 73, in streaming_inference
    past_key_values = greedy_generate(
  File "/home/zli/anaconda3/envs/streaming/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "examples/run_streaming_llama.py", line 30, in greedy_generate
    outputs = model(
  File "/home/zli/anaconda3/envs/streaming/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/zli/anaconda3/envs/streaming/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/zli/anaconda3/envs/streaming/lib/python3.8/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/zli/anaconda3/envs/streaming/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py", line 838, in forward
    logits = self.lm_head(hidden_states)
  File "/home/zli/anaconda3/envs/streaming/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/zli/anaconda3/envs/streaming/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/zli/anaconda3/envs/streaming/lib/python3.8/site-packages/accelerate/hooks.py", line 160, in new_forward
    args, kwargs = module._hf_hook.pre_forward(module, *args, **kwargs)
  File "/home/zli/anaconda3/envs/streaming/lib/python3.8/site-packages/accelerate/hooks.py", line 286, in pre_forward
    set_module_tensor_to_device(
  File "/home/zli/anaconda3/envs/streaming/lib/python3.8/site-packages/accelerate/utils/modeling.py", line 317, in set_module_tensor_to_device
    new_value = value.to(device)
RuntimeError: handle_0 INTERNAL ASSERT FAILED at "../c10/cuda/driver_api.cpp":15, please report a bug to PyTorch.

The video included in the README does not play in Firefox

Firefox on macOS doesn't seem to like the HEVC-encoded MOV file linked in the README. Also, most browsers available on Windows and Linux, including Google Chrome and Microsoft Edge, do not support this file format either. For broader compatibility, it might be worth considering a better supported format like H.264 MP4.


The k_seq_dim and v_seq_dim in StartRecentKVCache look related to the model type

The k_seq_dim and v_seq_dim in StartRecentKVCache seem to depend on the model type. If I want to add the Baichuan2 series of models, what should k_seq_dim and v_seq_dim be changed to?

Can Qwen-14B be supported?

Streaming-llm is good work for solving the long-text generation problem of LLMs.

I hope streaming-llm can support Qwen-14B in the future, which achieves good performance on Chinese generation.

Thanks for your work.

Question about long input and difference between streaming-llm and dense attention.

Thank you for your nice work. I have read issue #33; thank you for your patient explanation of the difference between StreamingLLM and dense attention. Based on your answer, I have further questions:

  1. As you mentioned in FAQ 3, I guess streaming-llm effectively processes long input by truncation. But if I don't mind the expensive compute and handle the long input in a streaming manner (processing the text from the very beginning to the end with attention sinks, window by window), will it theoretically perform better than truncation? Have you run any experiments on this?

  2. For a model with a large context size (say 16k), we can still run StreamingLLM with a shorter cache size (say 4k). Have you compared dense attention at 16k with StreamingLLM at 4k? Theoretically, streaming-4k should perform worse than dense-16k; what is the gap? This matters if one wants to use StreamingLLM to approximate the performance of a larger window.

Questions about "Run Streaming Llama Chatbot"

First of all, thanks for releasing the excellent work!
I have some questions about running the example you provided. I used the command:

# I have downloaded the Llama-2-7b-hf and put it to /data/model/Llama-2-7b-hf
CUDA_VISIBLE_DEVICES=0 python examples/run_streaming_llama.py  --enable_streaming  --model_name_or_path /data/model/Llama-2-7b-hf

And I get the following results:

Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:41<00:00, 20.89s/it]
Loading data from data/mt_bench.jsonl ...
prompts length:  158
StartRecentKVCache: 4, 2000

USER: Compose an engaging travel blog post about a recent trip to Hawaii, highlighting cultural experiences and must-see attractions.

ASSISTANT: seq_len:  38

### 1.1.1.1... [degenerate output: the "1." pattern repeats for hundreds of tokens]
Inference speed for this question: 23.53038809423589 token/sec

USER: Rewrite your previous response. Start every sentence with the letter A.

ASSISTANT: seq_len:  24
USER: Rewrite your previous response. Start every sentence with the letter B.

ASSISTANT:

[... the same USER prompt repeats for the letters C through K, and the ASSISTANT response is empty every time ...]

It seems that it does not work well! Is anything wrong with my test? Should I change something to get the right results?

When using lmsys/vicuna-13b-v1.3 as the model, the results seem OK.

Thanks!

RuntimeError: "addmm_impl_cpu_" not implemented for 'Half'

(streaming) F:\StreamingLLM\streaming-llm>python examples/run_streaming_llama.py --enable_streaming
Loading model from lmsys/vicuna-13b-v1.3 ...
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. If you see this, DO NOT PANIC! This is expected, and simply means that the legacy (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set legacy=False. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in huggingface/transformers#24565
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████| 3/3 [00:08<00:00, 2.85s/it]
Loading data from data/mt_bench.jsonl ...
StartRecentKVCache: 4, 2000

USER: Compose an engaging travel blog post about a recent trip to Hawaii, highlighting cultural experiences and must-see attractions.

ASSISTANT: Traceback (most recent call last):
File "examples/run_streaming_llama.py", line 122, in
main(args)
File "examples/run_streaming_llama.py", line 103, in main
streaming_inference(
File "C:\Users\matt.conda\envs\streaming\lib\site-packages\torch\utils_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "examples/run_streaming_llama.py", line 73, in streaming_inference
past_key_values = greedy_generate(
File "C:\Users\matt.conda\envs\streaming\lib\site-packages\torch\utils_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "examples/run_streaming_llama.py", line 20, in greedy_generate
outputs = model(
File "C:\Users\matt.conda\envs\streaming\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\Users\matt.conda\envs\streaming\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\matt.conda\envs\streaming\lib\site-packages\transformers\models\llama\modeling_llama.py", line 820, in forward
outputs = self.model(
File "C:\Users\matt.conda\envs\streaming\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\Users\matt.conda\envs\streaming\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\matt.conda\envs\streaming\lib\site-packages\transformers\models\llama\modeling_llama.py", line 708, in forward
layer_outputs = decoder_layer(
File "C:\Users\matt.conda\envs\streaming\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\Users\matt.conda\envs\streaming\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\matt.conda\envs\streaming\lib\site-packages\transformers\models\llama\modeling_llama.py", line 424, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "C:\Users\matt.conda\envs\streaming\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\Users\matt.conda\envs\streaming\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "f:\streamingllm\streaming-llm\streaming_llm\pos_shift\modify_llama.py", line 71, in llama_pos_shift_attention_forward
query_states = self.q_proj(hidden_states)
File "C:\Users\matt.conda\envs\streaming\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\Users\matt.conda\envs\streaming\lib\site-packages\torch\nn\modules\module.py", line 1527, in call_impl
return forward_call(*args, **kwargs)
File "C:\Users\matt.conda\envs\streaming\lib\site-packages\torch\nn\modules\linear.py", line 114, in forward
return F.linear(input, self.weight, self.bias)
RuntimeError: "addmm_impl_cpu
" not implemented for 'Half'


(streaming) F:\StreamingLLM\streaming-llm> nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:41:10_Pacific_Daylight_Time_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0


System Manufacturer: ASUS
System Model: System Product Name
System Type: x64-based PC
Processor(s): 1 Processor(s) Installed.
[01]: AMD64 Family 23 Model 49 Stepping 0 AuthenticAMD ~3701 Mhz
BIOS Version: American Megatrends Inc. 1701, 12/13/2022
Windows Directory: C:\Windows
System Directory: C:\Windows\system32

Implementing Llama 2 7B

I have a question about how to run the code. I followed all the instructions in the repo, but I'm not sure whether the model will be downloaded automatically. For example, I want to test the code with the Llama 2 7B chat model; how do I use this streaming Llama code for that?

Python Module as a drop-in replacement for `transformers` using Attention Sinks

Hello!

This work is looking extremely promising! Great job. I wanted to notify you that I've created a project that is a drop-in replacement for transformers using the attention sinks approach. For example, loading Llama 2 with attention sinks is now as simple as:

from attention_sinks import AutoModel

model = AutoModel.from_pretrained("meta-llama/Llama-2-7b-hf", device_map="auto")

If you're interested, you can check out the repository here: https://github.com/tomaarsen/attention_sinks

I ran some experiments over there using the attention_sinks Python module, and was able to get some extremely promising results, much like your paper:

  1. transformers: Linear VRAM usage as it doesn't do any windowing. Performance fails after ~4096
  2. window_attention: Constant VRAM usage due to the windowing at 1024 tokens. Fails after ~1024 tokens.
  3. attention_sinks: Constant VRAM usage due to windowing at 4 attention sink tokens + 1020 most recent tokens. Never fails despite constant VRAM usage.

[figure: llama_2_7b_ppl_vram_old]

  • Tom Aarsen

How to generate longer token streams?

Have everything running on python3.10 under ubuntu 22.04 with 2x 24 gig gpus.
Tested original and revised versions of 'mt_bench.jsonl' and output is good with a 70b 4bit gptq model.

Trying to increase number of tokens streamed but it appears fixed for each generation.

Edited 'run_streaming_llama.py - line 61

def streaming_inference(model, tokenizer, prompts, kv_cache=None, max_gen_len=10000):

but output length is similar to default 2000

Edited kv_cache.py 'recent_size=512,' to similar values but length of output remains the same.
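
For what it's worth, the two knobs above probably have to move together, and greedy decoding will also stop early whenever the model emits an end-of-sequence token, which caps the output regardless of max_gen_len. Below is a hedged sketch of wiring both together, with names taken from this issue and the demo logs ("StartRecentKVCache: 4, 2000"); the real constructor may need extra arguments (e.g. k_seq_dim/v_seq_dim) and the exact signatures may differ:

from streaming_llm.kv_cache import StartRecentKVCache  # assumed module path

kv_cache = StartRecentKVCache(start_size=4, recent_size=10000)  # memory grows with recent_size
streaming_inference(model, tokenizer, prompts, kv_cache=kv_cache, max_gen_len=10000)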

Would appreciate options and/or edits required to generate 10000+ tokens

Cheers
