horseee / llm-pruner

[NeurIPS 2023] LLM-Pruner: On the Structural Pruning of Large Language Models. Support LLaMA, Llama-2, BLOOM, Vicuna, Baichuan, etc.

Home Page: https://arxiv.org/abs/2305.11627

License: Apache License 2.0

Python 99.29% Shell 0.16% C++ 0.55%
compression language-model llm pruning pruning-algorithms baichuan chatglm llama vicuna llama-2

llm-pruner's Introduction

llm-pruner's People

Contributors

eltociear, horseee, vainf


llm-pruner's Issues

evaluate

Hello. After pruning, I fine-tuned the model on the alpaca_data_zh_51k dataset. How can I evaluate the performance of the fine-tuned model on alpaca_data_zh_51k? Thanks.

Cannot use huggingface to load

I pruned 25% of all the layers, but the resulting shape is not what I wanted.
I expected the shape to be [N, N], but got [N, M] with M = N*0.25.
It is difficult to load.

Latency code

Can you share the latency code used in the experiment? In my test on an A100, I found that the pruned LLaMA-7B is even slower than the original LLaMA. Thank you very much!
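Not from the repository, just a rough sketch of how GPU latency is often measured for a comparison like this (the model and input names here are assumptions, and the model is assumed to already sit on a CUDA device):

import time
import torch

@torch.no_grad()
def measure_latency(model, input_ids, n_warmup=10, n_iters=50):
    # Warm up so CUDA kernels, caches, and the allocator are initialized before timing.
    for _ in range(n_warmup):
        model(input_ids)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_iters):
        model(input_ids)
    torch.cuda.synchronize()  # wait for all queued kernels before stopping the clock
    return (time.perf_counter() - start) / n_iters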

Force even pruning across layers

Is there a way to force the pruning to remove the same number of parameters from every layer?
This would make the resulting model compatible with the HF implementation (loadable via from_pretrained).

a post-training issue

Thanks for your nice work!

When I post-train the pruned model by running python post_training.py --prune_model prune_log/pytorch_model.bin --data_path yahma/alpaca-cleaned --output_dir tune_log --wandb_project llama_tune --lora_r 8 --num_epochs 2 --learning_rate 1e-4 --batch_size 64, I run into the following problem:

wandb.errors.UsageError: api_key not configured (no-tty). call wandb.login(key=[your_api_key])

Could you please tell me how I should deal with that? Thank you!
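Not an official answer, only a minimal workaround sketch: either disable Weights & Biases before the trainer starts, or log in non-interactively (both rely on the standard wandb/transformers environment hooks):

import os

# Option 1: turn off wandb logging entirely (recognized by wandb and by transformers' Trainer).
os.environ["WANDB_MODE"] = "disabled"
os.environ["WANDB_DISABLED"] = "true"

# Option 2: authenticate before training starts.
# import wandb
# wandb.login(key="YOUR_WANDB_API_KEY")  # placeholder key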

Pruning Llama2-7B

I've tried to prune Llama2-7B on a MacBook Pro M1, but the system ended the run by killing the process because of OOM (I have 32GB).

Is there something I can do? Has somebody already pruned this model and published it?

Thank you!

Question on recovery and training data

Excellent project btw and I am sure it will get traction.

I am curious about the recovery stage. Is the team proposing recovery training using the same training data that created the base model?

Checking the pruned but uncompressed model

Hi,

Thanks a lot for this awesome work! I am wondering whether there is a way to inspect the pruned but uncompressed model. Right now, when I save the model, it is already compressed, so I assume the pruned weights are discarded. Is there any way I can locate those pruned weights?

Thanks!

Sparse Mask question

Hi, I have a question about the sparsity of the weights:
after merging LoRA into the sparse weights, will the sparse weights become dense?

My process has some problems

  1. I downloaded the vicuna model from this link: vicuna model
  2. Because of a network problem, I downloaded bookcorpus.tar.bz2 and uncompressed it manually (image), and changed the get_bookcorpus API accordingly (image).
  3. I ran the pruning process and got the following log:

python hf_prune.py --pruning_ratio 0.1 \
    --block_wise \
    --block_mlp_layer_start 4 --block_mlp_layer_end 30 \
    --block_attention_layer_start 4 --block_attention_layer_end 30 \
    --pruner_type Taylor \
    --test_after_train \
    --device cpu --eval_device cuda \
    --save_ckpt_log_name llama_prune \
    --base_model vicuna_model \
    --save_model

Loading checkpoint shards: 100%|██████████| 2/2 [00:12<00:00, 6.30s/it]
2023-06-27 14:47:57 - INFO : Use taylor pruner...
2023-06-27 14:47:57 - INFO : Pruning Attention Layer = [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29]
2023-06-27 14:47:57 - INFO : Pruning MLP Layer = [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29]
2023-06-27 14:47:58 - INFO : Start Pruning
2023-06-27 14:47:59 - WARNING : Found cached dataset text (/root/.cache/huggingface/datasets/text/default-a5b37c6d93890eb9/0.0.0/cb1e9bd71a82ad27976be3b12b407850fe2837d80c22c5e03a28949843a8ace2)
2023-06-27 14:48:00 - INFO : Start Backwarding in iterative steps = 0...
2023-06-27 14:48:08 - INFO : Loss = 3.7414731979370117
2023-06-27 14:48:34 - INFO : After Iter 1/1, #parameters: 6223081472
2023-06-27 14:48:34 - INFO : #Param before: 6738415616, #Param after: 6223081472, Ratio = 92.3523%
Saving model:
prune_log/llama_prune/pytorch_model.bin
2023-06-27 14:51:02 - INFO :
==================Generation Results After Pruning================

2023-06-27 14:51:04 - INFO : I believe the meaning of life is to continue seeking for self-fulfillment while also contributing to the betterment of the world and society. It's not to be selfish or materialistic, but to find balance and happiness through meaningful work, relationships, and experiences.
2023-06-27 14:51:08 - INFO : Simply put, the theory of relativity states that 1. Gravity is not a force in the sense that we usually understand it, but rather the result of the curvature of spacetime, caused by the presence of massive objects. This curvature creates a field that can bend the path of light. 2. The speed of light is always constant in a vacuum, regardless of the motion of the observer or the source. As a result, the amount of time it takes for light to travel a certain distance is dependent on the observer's relative motion. 3. Time and space are intertw
2023-06-27 14:51:13 - INFO : Building a website can be done in 10 simple steps:

  1. Choose a domain name that is unique and easy to remember.
  2. Choose a hosting service that will support your website.
  3. Choose a web host and register your domain.
  4. Set up your website using a web hosting service.
  5. Customize the look and feel of your website.
  6. Add content to your website.
  7. Test your website to ensure that everything is working properly.
  8. Launch your website to the public.
  9. Continuously improve your website and
2023-06-27 14:51:14 - INFO : Tweet: "I hate it when my phone battery dies."
Sentiment: Negative

Tweet: "My day has been ๐Ÿ‘"
Sentiment: Positive

Tweet: "This is the link to the article"
Sentiment: Neutral

Tweet: "This new music video was incredibile"
Sentiment: Positive

Tweet: "I just spent the whole day cleaning my room"
Sentiment: Negative

Note: The sentiment refers
2023-06-27 14:51:18 - INFO : Translate English to French:

sea otter => loutre de mer

peppermint => menthe poivrée

plush girafe => girafe peluche

cheese => fromage

English to French:

honey badger => renard des merveilles

spaghetti squirrel => écureuil à la spaghettini

poutine rainbow → rainbow de poutine

French translation of “the cat sat on the mat”: Le chat est assis sur le tapis.
2023-06-27 14:51:18 - INFO :
==================Finish================

2023-06-27 14:51:18 - WARNING : Found cached dataset wikitext (/root/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126)
2023-06-27 14:51:18 - WARNING : Found cached dataset wikitext (/root/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126)
Token indices sequence length is longer than the specified maximum sequence length for this model (341469 > 2048). Running this sequence through the model will result in indexing errors
100%|██████████| 667/667 [00:58<00:00, 11.48it/s]
{'wikitext2': 19.091033031037714}
2023-06-27 14:52:18 - WARNING : Found cached dataset wikitext (/root/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126)
2023-06-27 14:52:18 - WARNING : Found cached dataset wikitext (/root/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126)
100%|██████████| 667/667 [00:58<00:00, 11.49it/s]
{'wikitext2': 19.091033031037714, 'ptb': 19.091033031037714}
2023-06-27 14:53:18 - INFO : PPL after pruning: {'wikitext2': 19.091033031037714, 'ptb': 19.091033031037714}
2023-06-27 14:53:18 - INFO : Memory Requirement: 12052.52490234375 MiB

  4. I ran inference with the pruned model and got the result:

    command:
    python generate.py --model_type pruneLLM --ckpt pruner_model/pytorch_model_10%.bin

    inference code:

    def main(args):
        if args.model_type == 'pretrain':
            tokenizer = LlamaTokenizer.from_pretrained(args.base_model, cache_dir='llama_hf_model', force_download=True)
            model = LlamaForCausalLM.from_pretrained(
                args.base_model,
                low_cpu_mem_usage=True if torch_version >= 9 else False,
                cache_dir='llama_hf_model',
            )
            description = "Model Type: {}\n Base Model: {}".format(args.model_type, args.base_model)
        elif args.model_type == 'pruneLLM':
            pruned_dict = torch.load(args.ckpt, map_location='cpu')
            tokenizer, model = pruned_dict['tokenizer'], pruned_dict['model']
            description = "Model Type: {}\n Pruned Model: {}".format(args.model_type, args.ckpt)
        elif args.model_type == 'tune_prune_LLM':
            pruned_dict = torch.load(args.ckpt, map_location='cpu')
            tokenizer, model = pruned_dict['tokenizer'], pruned_dict['model']
            model = PeftModel.from_pretrained(
                model,
                args.lora_ckpt,
                torch_dtype=torch.float16,
            )
            description = "Model Type: {}\n Pruned Model: {}\n LORA ckpt: {}".format(args.model_type, args.ckpt, args.lora_ckpt)
        else:
            raise NotImplementedError

        if device == "cuda":
            model.half()
            model = model.cuda()

        # unwind broken decapoda-research config
        model.config.pad_token_id = tokenizer.pad_token_id = 0  # unk
        model.config.bos_token_id = 1
        model.config.eos_token_id = 2

        model.eval()

        print("Human:")
        line = input()
        while line:
            inputs = tokenizer(line, return_tensors="pt")
            input_ids = inputs["input_ids"].to(device)

            with torch.no_grad():
                generation_output = model.generate(
                    input_ids=input_ids,
                    early_stopping=True,
                    num_beams=4,
                    do_sample=True,
                    top_k=40,
                    top_p=0.95,
                    temperature=1,
                    max_length=1024,
                    return_dict_in_generate=True,
                )
            s = generation_output.sequences[0]
            output = tokenizer.decode(s)
            print(output)
            print("\n-------------------------------\n")
            print("Human:")  # show the "Human:" prompt again before each new user input
            line = input()

 result:

Human:
写一首春天的诗句 ("Write a line of poetry about spring")
(The pruned model echoes the prompt and then produces a short, largely incoherent Chinese poem containing broken tokens, followed by rhetorical questions about who wrote the line and what it means.)

Human:
心情不好的时候应该如何调整 ("How should I adjust when I'm in a bad mood?")
(The model merely restates the question, appending "...my way of thinking?".)

Human:
怎样学习机器学习 ("How should I learn machine learning?")
(The model echoes the question and then rambles through several paragraphs of largely incoherent Chinese text, presented as an excerpt from Chapter 1 of a book titled 《深度遗传算子》.)


Latency evaluation

Which file exactly did you run to obtain the latency data mentioned in the paper?
image

Eval Loss NaN on Llama-2

Hi,

By any chance, have you tried actually running it on the Llama-2 model?

I tried using default llama parameters for pruning and post-training, resulting in similar wikitext2 score (~19) but much worse score for ptb (~70).

Also, when running post-training with the parameter set of llama, llama-2 loss explodes after ~0.2 epoch. Tried using smaller lr (1e-5) yet eval loss exploded to nan.

It would be of great help if you could provide some insights on both pruning and post-training parameters.

Thanks.

Issue: Missing Generation of `pytorch_model.bin` File During Model Tuning

Thank you for sharing your interesting project!

Recently, when I ran bash ./script/llama_prune.sh, the pruning step worked perfectly fine. However, during the tuning step, although there was no error message, the generated output only included the following:

  • checkpoints-200
    • model.safetensors
    • optimizer.pt
    • rng_state.pth
    • scheduler.pt
    • trainer_state.json
    • training_args.bin

I noticed that the pytorch_model.bin file was not saved. I haven't modified the code, and I am using PyTorch version 2.1.2+cu121. Could you suggest what the possible reason for this might be?

ConnectionError: Couldn't reach https://raw.githubusercontent.com/wojzaremba/lstm/master/data/ptb.train.txt (ReadTimeout(ReadTimeoutError("HTTPSConnectionPool(host='raw.githubusercontent.com', port=443): Read timed out. (read timeout=100)")))

100%|██████████| 667/667 [00:53<00:00, 12.35it/s]
{'wikitext2': 20.046345644076645}
/home/iotsc01/anaconda3/envs/xinpengq_env/lib/python3.10/site-packages/datasets/load.py:1429: FutureWarning: The repository for ptb_text_only contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/ptb_text_only
You can avoid this message in future by passing the argument trust_remote_code=True.
Passing trust_remote_code=True will be mandatory to load this dataset from the next major release of datasets.
warnings.warn(
HF google storage unreachable. Downloading and preparing it from source
2024-01-14 05:26:19 - WARNING : HF google storage unreachable. Downloading and preparing it from source
Traceback (most recent call last):
File "/home/iotsc01/xinpengq/LLM-Pruner-main/hf_prune.py", line 314, in
main(args)
File "/home/iotsc01/xinpengq/LLM-Pruner-main/hf_prune.py", line 267, in main
ppl = PPLMetric(model, tokenizer, ['wikitext2', 'ptb'], args.max_seq_len, device=args.eval_device)
File "/home/iotsc01/xinpengq/LLM-Pruner-main/LLMPruner/evaluator/ppl.py", line 10, in PPLMetric
_, test_loader = get_loaders(dataset, tokenizer, seq_len=seq_len, batch_size = batch_size)
File "/home/iotsc01/xinpengq/LLM-Pruner-main/LLMPruner/datasets/ppl_dataset.py", line 50, in get_loaders
train_data, test_data = get_ptb(seq_len, tokenizer)
File "/home/iotsc01/xinpengq/LLM-Pruner-main/LLMPruner/datasets/ppl_dataset.py", line 19, in get_ptb
traindata = load_dataset('ptb_text_only', 'penn_treebank', split='train')
File "/home/iotsc01/anaconda3/envs/xinpengq_env/lib/python3.10/site-packages/datasets/load.py", line 2549, in load_dataset
builder_instance.download_and_prepare(
File "/home/iotsc01/anaconda3/envs/xinpengq_env/lib/python3.10/site-packages/datasets/builder.py", line 1005, in download_and_prepare
self._download_and_prepare(
File "/home/iotsc01/anaconda3/envs/xinpengq_env/lib/python3.10/site-packages/datasets/builder.py", line 1767, in _download_and_prepare
super()._download_and_prepare(
File "/home/iotsc01/anaconda3/envs/xinpengq_env/lib/python3.10/site-packages/datasets/builder.py", line 1078, in _download_and_prepare
split_generators = self._split_generators(dl_manager, **split_generators_kwargs)
File "/home/iotsc01/.cache/huggingface/modules/datasets_modules/datasets/ptb_text_only/8d1b97746fb9765d140e569ec5ddd35e20af4d37761f5e1bf357ea0b081f2c1f/ptb_text_only.py", line 131, in _split_generators
data_dir = dl_manager.download_and_extract(my_urls)
File "/home/iotsc01/anaconda3/envs/xinpengq_env/lib/python3.10/site-packages/datasets/download/download_manager.py", line 562, in download_and_extract
return self.extract(self.download(url_or_urls))
File "/home/iotsc01/anaconda3/envs/xinpengq_env/lib/python3.10/site-packages/datasets/download/download_manager.py", line 426, in download
downloaded_path_or_paths = map_nested(
File "/home/iotsc01/anaconda3/envs/xinpengq_env/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 466, in map_nested
mapped = [
File "/home/iotsc01/anaconda3/envs/xinpengq_env/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 467, in
_single_map_nested((function, obj, types, None, True, None))
File "/home/iotsc01/anaconda3/envs/xinpengq_env/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 370, in _single_map_nested
return function(data_struct)
File "/home/iotsc01/anaconda3/envs/xinpengq_env/lib/python3.10/site-packages/datasets/download/download_manager.py", line 451, in _download
out = cached_path(url_or_filename, download_config=download_config)
File "/home/iotsc01/anaconda3/envs/xinpengq_env/lib/python3.10/site-packages/datasets/utils/file_utils.py", line 188, in cached_path
output_path = get_from_cache(
File "/home/iotsc01/anaconda3/envs/xinpengq_env/lib/python3.10/site-packages/datasets/utils/file_utils.py", line 573, in get_from_cache
raise ConnectionError(f"Couldn't reach {url} ({repr(head_error)})")
ConnectionError: Couldn't reach https://raw.githubusercontent.com/wojzaremba/lstm/master/data/ptb.train.txt (ReadTimeout(ReadTimeoutError("HTTPSConnectionPool(host='raw.githubusercontent.com', port=443): Read timed out. (read timeout=100)")))
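Not a maintainer fix, just a low-effort workaround sketch based on the call shown in the traceback: skip the PTB split when raw.githubusercontent.com is unreachable and evaluate perplexity on wikitext2 only.

# In hf_prune.py, around the line shown in the traceback (hypothetical edit):
# drop 'ptb' from the metric call so only the cached wikitext2 split is used.
ppl = PPLMetric(model, tokenizer, ['wikitext2'], args.max_seq_len, device=args.eval_device)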

Gain using more data

Hi! Thanks for sharing the work again!

I'm wondering how your test on using more data is going?
image

I tried using 300K samples for post-training the pruned LLaMA, and the PPL results are basically the same as using only 50K samples. Is PPL not a proper evaluation metric, or do the training parameters need to be tuned more carefully? Thanks for any advice or discussion!

Error when using GPU for pruning

Hi when I try to implement the following:

python hf_prune.py --pruning_ratio 0.25 --block_wise --block_mlp_layer_start 4 --block_mlp_layer_end 30 --block_attention_layer_start 4 --block_attention_layer_end 30 --pruner_type taylor --test_after_train --device cuda --eval_device cuda --save_ckpt_log_name lama_prune

I got an error:
Traceback (most recent call last):
File "hf_prune.py", line 296, in
main(args)
File "hf_prune.py", line 129, in main
loss = model(example_prompts, labels=example_prompts).loss
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/dfs/data/LLM-Pruner/LLMPruner/models/hf_llama/modeling_llama.py", line 689, in forward
outputs = self.model(
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/dfs/data/LLM-Pruner/LLMPruner/models/hf_llama/modeling_llama.py", line 532, in forward
inputs_embeds = self.embed_tokens(input_ids)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/sparse.py", line 158, in forward
return F.embedding(
File "/opt/conda/lib/python3.8/site-packages/torch/nn/functional.py", line 2156, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper__index_select)

Is GPU pruning not supported?

PS: May I ask how much memory is needed overall?
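Not the maintainers' patch, only a sketch of one possible fix suggested by the traceback: the calibration batch stays on CPU while the model sits on cuda, so moving it to the model's device before the forward/backward pass should resolve the mismatch.

# hf_prune.py, around the line shown in the traceback (hypothetical edit):
example_prompts = example_prompts.to(args.device)            # keep data on the same device as the model
loss = model(example_prompts, labels=example_prompts).loss   # unchanged line from the traceback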

I encountered the following error message when I set iterative_steps = 2 during Baichuan-7B pruning

Traceback (most recent call last):
File "/home/jovyan/honor/yangdong/LLM-Pruner-main/examples/baichuan.py", line 342, in
main(args)
File "/home/jovyan/honor/yangdong/LLM-Pruner-main/examples/baichuan.py", line 229, in main
pruner.step()
File "/home/jovyan/honor/yangdong/LLM-Pruner-main/LLMPruner/torch_pruning/pruner/algorithms/metapruner.py", line 186, in step
for group in self.prune_local():
File "/home/jovyan/honor/yangdong/LLM-Pruner-main/LLMPruner/torch_pruning/pruner/algorithms/metapruner.py", line 245, in prune_local
imp = self.estimate_importance(group, ch_groups=ch_groups, consecutive_groups=consecutive_groups)
File "/home/jovyan/honor/yangdong/LLM-Pruner-main/LLMPruner/torch_pruning/pruner/algorithms/metapruner.py", line 190, in estimate_importance
return self.importance(group, ch_groups=ch_groups, consecutive_groups=consecutive_groups)
File "/opt/miniconda3/envs/flash/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/jovyan/honor/yangdong/LLM-Pruner-main/LLMPruner/pruner/hf_baichuan_pruner.py", line 180, in call
local_norm = local_norm[idxs]
IndexError: index 3840 is out of bounds for dimension 0 with size 3840

Use LLM-Pruner for Baichuan model

Hi, I am trying to use LLM-Pruner on the Baichuan-13B model (https://github.com/baichuan-inc/Baichuan-13B). It is also LLaMA-structured, so I thought it should work out of the box, but I got some errors... I am still trying to debug, but slowly... Any help or advice would be very appreciated!

Specifically, I ran "CUDA_VISIBLE_DEVICES=0,1 python hf_prune_baichuan.py --base_model models/baichuan-13b-chat --pruning_ratio 0.25 --device cpu --eval_device cuda --block_wise --block_mlp_layer_start 4 --block_mlp_layer_end 30 --block_attention_layer_start 4 --block_attention_layer_end 30 --save_ckpt_log_name baichuan_13b_chat_0.2 --pruner_type taylor --test_after_train --taylor param_first --save_model",

and I got the following output:
Loading checkpoint shards: 100%|██████████| 3/3 [02:14<00:00, 44.74s/it]
2023-07-17 02:29:09 - INFO : Use taylor pruner...
2023-07-17 02:29:09 - INFO : Pruning Attention Layer = [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29]
2023-07-17 02:29:09 - INFO : Pruning MLP Layer = [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29]
/dfs/data/LLM-Pruner/LLMPruner/torch_pruning/dependency.py:362: UserWarning: Unwrapped parameters detected: ['model.layers.10.input_layernorm.weight', 'model.layers.17.post_attention_layernorm.weight', 'model.layers.34.input_layernorm.weight', 'model.layers.22.input_layernorm.weight', 'model.layers.29.post_attention_layernorm.weight', 'model.layers.0.post_attention_layernorm.weight', 'model.layers.5.input_layernorm.weight', 'model.layers.12.post_attention_layernorm.weight', 'model.layers.31.input_layernorm.weight', 'model.layers.38.post_attention_layernorm.weight', 'model.layers.2.input_layernorm.weight', 'model.layers.9.post_attention_layernorm.weight', 'model.layers.26.input_layernorm.weight', 'model.layers.33.post_attention_layernorm.weight', 'model.layers.14.input_layernorm.weight', 'model.layers.21.post_attention_layernorm.weight', 'model.layers.35.input_layernorm.weight', 'model.layers.4.post_attention_layernorm.weight', 'model.layers.23.input_layernorm.weight', 'model.layers.30.post_attention_layernorm.weight', 'model.layers.1.post_attention_layernorm.weight', 'model.layers.18.input_layernorm.weight', 'model.layers.25.post_attention_layernorm.weight', 'model.layers.6.input_layernorm.weight', 'model.layers.13.post_attention_layernorm.weight', 'model.layers.32.input_layernorm.weight', 'model.layers.39.post_attention_layernorm.weight', 'model.layers.34.post_attention_layernorm.weight', 'model.layers.3.input_layernorm.weight', 'model.layers.10.post_attention_layernorm.weight', 'model.layers.27.input_layernorm.weight', 'model.layers.15.input_layernorm.weight', 'model.layers.22.post_attention_layernorm.weight', 'model.layers.36.input_layernorm.weight', 'model.layers.5.post_attention_layernorm.weight', 'model.layers.24.input_layernorm.weight', 'model.layers.31.post_attention_layernorm.weight', 'model.layers.2.post_attention_layernorm.weight', 'model.layers.19.input_layernorm.weight', 'model.layers.26.post_attention_layernorm.weight', 'model.layers.7.input_layernorm.weight', 'model.layers.14.post_attention_layernorm.weight', 'model.layers.28.input_layernorm.weight', 'model.layers.35.post_attention_layernorm.weight', 'model.layers.16.input_layernorm.weight', 'model.layers.23.post_attention_layernorm.weight', 'model.layers.11.input_layernorm.weight', 'model.layers.18.post_attention_layernorm.weight', 'model.layers.37.input_layernorm.weight', 'model.layers.6.post_attention_layernorm.weight', 'model.layers.25.input_layernorm.weight', 'model.layers.32.post_attention_layernorm.weight', 'model.norm.weight', 'model.layers.3.post_attention_layernorm.weight', 'model.layers.20.input_layernorm.weight', 'model.layers.27.post_attention_layernorm.weight', 'model.layers.8.input_layernorm.weight', 'model.layers.15.post_attention_layernorm.weight', 'model.layers.29.input_layernorm.weight', 'model.layers.36.post_attention_layernorm.weight', 'model.layers.17.input_layernorm.weight', 'model.layers.24.post_attention_layernorm.weight', 'model.layers.38.input_layernorm.weight', 'model.layers.12.input_layernorm.weight', 'model.layers.19.post_attention_layernorm.weight', 'model.layers.0.input_layernorm.weight', 'model.layers.7.post_attention_layernorm.weight', 'model.layers.21.input_layernorm.weight', 'model.layers.28.post_attention_layernorm.weight', 'model.layers.9.input_layernorm.weight', 'model.layers.16.post_attention_layernorm.weight', 'model.layers.33.input_layernorm.weight', 'model.layers.4.input_layernorm.weight', 'model.layers.11.post_attention_layernorm.weight', 'model.layers.30.input_layernorm.weight', 
'model.layers.37.post_attention_layernorm.weight', 'model.layers.13.input_layernorm.weight', 'model.layers.20.post_attention_layernorm.weight', 'model.layers.39.input_layernorm.weight', 'model.layers.1.input_layernorm.weight', 'model.layers.8.post_attention_layernorm.weight'].
Torch-Pruning will prune the last non-singleton dimension of a parameter. If you wish to customize this behavior, please provide an unwrapped_parameters argument.
warnings.warn("Unwrapped parameters detected: {}.\n Torch-Pruning will prune the last non-singleton dimension of a parameter. If you wish to customize this behavior, please provide an unwrapped_parameters argument.".format([_param_to_name[p] for p in unwrapped_detected]))
2023-07-17 02:30:02 - INFO : Start Pruning
2023-07-17 02:30:02 - WARNING : Found cached dataset bookcorpus (/dfs/data/data/bookcorpus/bookcorpus/plain_text/1.0.0/eddee3cae1cc263a431aa98207d4d27fd8a73b0a9742f692af0e6c65afa4d75f)
2023-07-17 02:30:45 - INFO : Start Backwarding in iterative steps = 0...
2023-07-17 02:33:56 - INFO : Loss = 3.644896984100342
Traceback (most recent call last):
File "hf_prune_baichuan.py", line 299, in
main(args)
File "hf_prune_baichuan.py", line 136, in main
pruner.step()
File "/dfs/data/LLM-Pruner/LLMPruner/torch_pruning/pruner/algorithms/metapruner.py", line 179, in step
for group in self.prune_local():
File "/dfs/data/LLM-Pruner/LLMPruner/torch_pruning/pruner/algorithms/metapruner.py", line 238, in prune_local
imp = self.estimate_importance(group, ch_groups=ch_groups, consecutive_groups=consecutive_groups)
File "/dfs/data/LLM-Pruner/LLMPruner/torch_pruning/pruner/algorithms/metapruner.py", line 183, in estimate_importance
return self.importance(group, ch_groups=ch_groups, consecutive_groups=consecutive_groups)
File "/opt/conda/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
return func(*args, **kwargs)
File "/dfs/data/LLM-Pruner/LLMPruner/pruner/hf_baichuan_pruner.py", line 306, in call
local_norm = local_norm[idxs]
IndexError: index 10240 is out of bounds for dimension 0 with size 5120

I modified "hf_prune_llama.py" and "LLMPruner/pruner/hf_llama_pruner.py":
1ใ€replacing the model loading part as:
tokenizer = AutoTokenizer.from_pretrained(args.base_model, use_fast=False, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(args.base_model, trust_remote_code=True)
model.generation_config = GenerationConfig.from_pretrained(args.base_model)
2ใ€replacing all "q_proj, k_proj, v_proj" with "W_pack"

Do you have any advice on quick fixing? Thank you very much!

Examples on the Huggingface Hub

Apologies if this has been asked before, but do you have pruned models that we can test and run locally? Anything on the huggingface hub?

I'd like to test some of your models directly before investing time in your pruning code. Thanks!

cannot import name 'SiLUActivation' from 'transformers.activations'

python test_speedup.py --model_type pretrain
Traceback (most recent call last):
File "/home/azuryl/project/llm-pruner/LLM-Pruner/test_speedup.py", line 10, in
from transformers.activations import SiLUActivation
ImportError: cannot import name 'SiLUActivation' from 'transformers.activations' (/home/azuryl/anaconda3/envs/llamaprune/lib/python3.10/site-packages/transformers/activations.py)
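Newer transformers releases removed SiLUActivation in favor of torch's built-in SiLU; a small compatibility sketch (an assumption about the cause, not an official fix):

try:
    from transformers.activations import SiLUActivation  # present in older transformers versions
except ImportError:
    from torch.nn import SiLU as SiLUActivation  # functionally equivalent stand-in on newer versions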

Question related to the model tuning

Hi,

Great work first!

I am confused with the model tuning part.

According to the code, it seemed that you used the lora method.
This, in my opinion, will destroy the sparsity you created in the original model once the LoRA weights are merged into the model weights.

could you explain this?

Thanks
Shawn

the new pytorch.bin is bigger than the original model issue

When I chose to save the model, I found something strange: the new pytorch.bin is bigger than the original model. I used Baichuan-7B with --pruning_ratio 0.5 for the test and added --save_model to save the model after pruning, but the new pytorch.bin is 17GB while the original model is only 13GB.
Could you please tell me why? Thank you!

Pruning MQA?

How can I prune LLMs that use Multi-Query Attention?

Adding quantization

If I combine multiple strategies such as GPTQ + LLM-Pruner + LoRA, could the compression ratio of the LLM be greatly improved while keeping acceptable performance?

Error occurs when pruning LLaMa2-7b

With cmd like:
CUDA_VISIBLE_DEVICES=0 python hf_prune.py --base_model path_to_cached_hf_llama2-7b --pruning_ratio 0.25 --device cpu --eval_device cuda --block_wise --block_mlp_layer_start 4 --block_mlp_layer_end 30 --block_attention_layer_start 4 --block_attention_layer_end 30 --pruner_type taylor --test_after_train --taylor param_first --save_model
It throws the error: "addmm_impl_cpu_" not implemented for 'Half'
image

torch==2.0.0
transformers==4.31.0
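This PyTorch error means a float16 matrix multiply was executed on the CPU, which has no half-precision addmm kernel. A rough workaround sketch (an assumption, not the repository's official fix) is to keep the weights in float32 while pruning with --device cpu and cast to half only on the GPU:

import torch
from transformers import LlamaForCausalLM

# Hypothetical local path, matching the --base_model argument in the command above.
model = LlamaForCausalLM.from_pretrained(
    "path_to_cached_hf_llama2-7b",
    torch_dtype=torch.float32,   # avoid fp16 ops while the model runs on CPU
    low_cpu_mem_usage=True,
)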

RuntimeError of test_speedup.py

Hi! I ran the following command:
python test_speedup.py --model_type pruneLLM --base_model ../llama-7b-hf --ckpt prune_log/llama_prune/pytorch_model.bin

and got the following error:
Warning: module Embedding is treated as a zero-op.
Warning: module LlamaRotaryEmbedding is treated as a zero-op.
Warning: module LlamaMLP is treated as a zero-op.
Warning: module LlamaDecoderLayer is treated as a zero-op.
Warning: module LlamaModel is treated as a zero-op.
Warning: module LlamaForCausalLM is treated as a zero-op.
Traceback (most recent call last):
File "test_speedup.py", line 91, in
main(args)
File "test_speedup.py", line 69, in main
macs, params = get_model_complexity_info(model, (64,), as_strings=True,
File "/opt/conda/lib/python3.8/site-packages/ptflops/flops_counter.py", line 30, in get_model_complexity_info
flops_count, params_count = get_flops_pytorch(model, input_res,
File "/opt/conda/lib/python3.8/site-packages/ptflops/pytorch_engine.py", line 44, in get_flops_pytorch
_ = flops_model(batch)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1128, in _call_impl
result = forward_call(*input, **kwargs)
File "/dfs/data/LLM-Pruner/LLMPruner/models/hf_llama/modeling_llama.py", line 689, in forward
outputs = self.model(
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/dfs/data/LLM-Pruner/LLMPruner/models/hf_llama/modeling_llama.py", line 532, in forward
inputs_embeds = self.embed_tokens(input_ids)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/sparse.py", line 158, in forward
return F.embedding(
File "/opt/conda/lib/python3.8/site-packages/torch/nn/functional.py", line 2156, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected tensor for argument #1 'indices' to have one of the following scalar types: Long, Int; but got torch.cuda.HalfTensor instead (while checking arguments for embedding)

Could you please help to have a look?
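A likely cause is that ptflops builds a float dummy input, while the embedding layer needs integer token ids. A hedged sketch of supplying the ids yourself (it assumes ptflops' input_constructor hook behaves as documented, and that `model` is the already-loaded pruned model from the script):

import torch
from ptflops import get_model_complexity_info

def make_token_ids(input_res):
    # input_res is the shape tuple passed below, e.g. (64,)
    device = next(model.parameters()).device
    return {"input_ids": torch.ones((1, *input_res), dtype=torch.long, device=device)}

macs, params = get_model_complexity_info(
    model, (64,), as_strings=True,
    input_constructor=make_token_ids,   # ptflops then calls model(**make_token_ids((64,)))
    print_per_layer_stat=False,
)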

Calculating Importance of 'param_mix'

Hello. First of all, thank you for sharing great research.

I have a question about calculating the importance of parameters.

In class TaylorImportance in hf_llama_pruner.py, line 274,

  1. Could you please tell me why the mixed-order importance is calculated as follows (a generic sketch of the underlying expansion is given after this list):
salience = salience - 0.5 * layer.weight * layer.weight.acc_grad * layer.weight

(and not as the sum of the 1st- and 2nd-order terms)?

  2. Is the higher-order term neglected?
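For reference, a generic sketch (not quoted from the paper) of the second-order Taylor expansion that mixed-order importance scores are usually derived from; removing a weight corresponds to setting \(w \to 0\):

\[
\Delta\mathcal{L} \;=\; \mathcal{L}(w{=}0) - \mathcal{L}(w)
\;\approx\; -\,g\,w \;+\; \tfrac{1}{2}\,h\,w^{2},
\qquad g=\frac{\partial\mathcal{L}}{\partial w},\quad
h \approx \text{a diagonal Hessian/Fisher estimate.}
\]

How the sign and the estimate of \(h\) (e.g. if acc_grad accumulates squared gradients) are combined in the code is exactly what the question above asks about.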

OSError: Can't load tokenizer for 'baffo32/decapoda-research-llama-7B-hf'.

bash scripts/llama_prune.sh

[START] - Start Pruning Model
Traceback (most recent call last):
File "/home/iotsc01/xinpengq/LLM-Pruner-main/hf_prune.py", line 314, in
main(args)
File "/home/iotsc01/xinpengq/LLM-Pruner-main/hf_prune.py", line 39, in main
tokenizer = LlamaTokenizer.from_pretrained(args.base_model)
File "/home/iotsc01/anaconda3/envs/xinpengq_env/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2012, in from_pretrained
raise EnvironmentError(
OSError: Can't load tokenizer for 'baffo32/decapoda-research-llama-7B-hf'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'baffo32/decapoda-research-llama-7B-hf' is the correct path to a directory containing all relevant files for a LlamaTokenizer tokenizer.
[FINISH] - Finish Pruning Model
[START] - Start Tuning
Traceback (most recent call last):
File "/home/iotsc01/xinpengq/LLM-Pruner-main/post_training.py", line 262, in
main(args)
File "/home/iotsc01/xinpengq/LLM-Pruner-main/post_training.py", line 33, in main
pruned_dict = torch.load(args.prune_model, map_location='cpu')
File "/home/iotsc01/anaconda3/envs/xinpengq_env/lib/python3.10/site-packages/torch/serialization.py", line 986, in load
with _open_file_like(f, 'rb') as opened_file:
File "/home/iotsc01/anaconda3/envs/xinpengq_env/lib/python3.10/site-packages/torch/serialization.py", line 435, in _open_file_like
return _open_file(name_or_buffer, mode)
File "/home/iotsc01/anaconda3/envs/xinpengq_env/lib/python3.10/site-packages/torch/serialization.py", line 416, in init
super().init(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: 'prune_log/llama_prune/pytorch_model.bin'
[FINISH] - Finish Prune and Post-Training.
[INFO] - The pruned model is at {prune_log/llama_prune/pytorch_model.bin}, and the recovery weight is at {tune_log/llama_0.2}/
You can use the command:
python generate.py --model_type tune_prune_LLM --ckpt prune_log/llama_prune/pytorch_model.bin --lora_ckpt tune_log/llama_0.2
to use the pruned model

Cannot import LlamaConfig

from transformers import LlamaConfig
ImportError: cannot import name 'LlamaConfig' from 'transformers' (/home/aelkordy/.local/lib/python3.8/site-packages/transformers/__init__.py)
(test) aelkordy@g1lmd1:/vault/aelkordy/NLP_projects/pruning/LLM-Pruner$ python hf_prune.py --pruning_ratio 0.25 --block_wise --block_mlp_layer_start 4 --block_mlp_layer_end 30 --block_attention_layer_start 4 --block_attention_layer_end 30 --pruner_type taylor --test_after_train --device cpu --eval_device cuda --save_ckpt_log_name llama_prune
Traceback (most recent call last):
File "hf_prune.py", line 14, in
from LLMPruner.models.hf_llama.modeling_llama import LlamaForCausalLM, LlamaRMSNorm, LlamaAttention, LlamaMLP
File "/vault/aelkordy/NLP_projects/pruning/LLM-Pruner/LLMPruner/models/hf_llama/modeling_llama.py", line 33, in
from transformers import LlamaConfig
ImportError: cannot import name

Reloading the pruned model failed

I used the default scripts in the README to prune vicuna-7b and got the pruned output "pytorch.bin",
but I cannot reload the pytorch.bin using .from_pretrained().

What should I do to run inference?

I tried both "AutoModelForCausalLM" and "LlamaForCausalLM"; neither of them worked.

# from transformers import (AutoModelForCausalLM, AutoTokenizer, )
from transformers import LlamaTokenizer
from LLMPruner.models.hf_llama.modeling_llama import LlamaForCausalLM
Traceback (most recent call last):
  File "/home/usr-fy/anaconda3/envs/FastChat/lib/python3.10/site-packages/transformers/modeling_utils.py", line 442, in load_state_dict
    return torch.load(checkpoint_file, map_location="cpu")
  File "/home/usr-fy/anaconda3/envs/FastChat/lib/python3.10/site-packages/torch/serialization.py", line 809, in load
    return _load(opened_zipfile, map_location, pickle_mod
  File "/home/usr-fy/anaconda3/envs/FastChat/lib/python3.10/site-packages/torch/serialization.py", line 1172, in _load
    result = unpickler.load()
  File "/home/usr-fy/anaconda3/envs/FastChat/lib/python3.10/site-packages/torch/serialization.py", line 1165, in find_class
    return super().find_class(mod_name, name)
ModuleNotFoundError: No module named 'LLMPruner'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/usr-fy/anaconda3/envs/FastChat/lib/python3.10/site-packages/transformers/modeling_utils.py", line 446, in load_state_dict
    if f.read(7) == "version":
  File "/home/usr-fy/anaconda3/envs/FastChat/lib/python3.10/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, f
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 64: invalid start byte

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/usr-fy/PycharmProjects/API_test/local_test.py", line 5, in <module>
    ini.load_model_tokenizer('/home/usr-fy/llama_prune')
  File "/home/usr-fy/PycharmProjects/API_test/InferenceAPI/InitModel.py", line 20, in load_model_tokenizer
    self.model = AutoModelForCausalLM.from_pretrained(
  File "/home/usr-fy/anaconda3/envs/FastChat/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 471, in from_pretrained
    return model_class.from_pretrained(
  File "/home/usr-fy/anaconda3/envs/FastChat/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2560, in from_pretrained
    state_dict = load_state_dict(resolved_archive_file)
  File "/home/usr-fy/anaconda3/envs/FastChat/lib/python3.10/site-packages/transformers/modeling_utils.py", line 458, in load_state_dict
    raise OSError(
OSError: Unable to load weights from pytorch checkpoint file for '/home/usr-fy/llama_prune/pytorch_model.bin' at '/home/usr-fy/llama_prune/pytorch_model.bin'. If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True.
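Not part of the original report, but a minimal sketch of loading the pruned checkpoint the way generate.py does elsewhere in this thread; it assumes a local clone of LLM-Pruner is on sys.path so the pickled LLMPruner classes can be resolved (from_pretrained cannot read this file, since it appears to be a pickled dict rather than a standard HF checkpoint):

import sys
import torch

sys.path.append("/path/to/LLM-Pruner")  # hypothetical clone path; makes the LLMPruner package importable

pruned_dict = torch.load("/home/usr-fy/llama_prune/pytorch_model.bin", map_location="cpu")
tokenizer, model = pruned_dict["tokenizer"], pruned_dict["model"]
model.eval()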

401 Client Error: Unauthorized for url: https://huggingface.co/decapoda-research/llama-7b-hf/resolve/main/tokenizer_config.json

bash scripts/llama_prune.sh
[START] - Start Pruning Model
Traceback (most recent call last):
File "/home/azuryl/anaconda3/envs/llamaprune/lib/python3.10/site-packages/huggingface_hub/utils/_errors.py", line 270, in hf_raise_for_status
response.raise_for_status()
File "/home/azuryl/anaconda3/envs/llamaprune/lib/python3.10/site-packages/requests/models.py", line 1021, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://huggingface.co/decapoda-research/llama-7b-hf/resolve/main/tokenizer_config.json

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/azuryl/anaconda3/envs/llamaprune/lib/python3.10/site-packages/transformers/utils/hub.py", line 389, in cached_file
resolved_file = hf_hub_download(
File "/home/azuryl/anaconda3/envs/llamaprune/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
return fn(*args, **kwargs)
File "/home/azuryl/anaconda3/envs/llamaprune/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1374, in hf_hub_download
raise head_call_error
File "/home/azuryl/anaconda3/envs/llamaprune/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1247, in hf_hub_download
metadata = get_hf_file_metadata(
File "/home/azuryl/anaconda3/envs/llamaprune/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
return fn(*args, **kwargs)
File "/home/azuryl/anaconda3/envs/llamaprune/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1624, in get_hf_file_metadata
r = _request_wrapper(
File "/home/azuryl/anaconda3/envs/llamaprune/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 402, in _request_wrapper
response = _request_wrapper(
File "/home/azuryl/anaconda3/envs/llamaprune/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 426, in _request_wrapper
hf_raise_for_status(response)
File "/home/azuryl/anaconda3/envs/llamaprune/lib/python3.10/site-packages/huggingface_hub/utils/_errors.py", line 320, in hf_raise_for_status
raise RepositoryNotFoundError(message, response) from e
huggingface_hub.utils._errors.RepositoryNotFoundError: 401 Client Error. (Request ID: Root=1-65813e67-47dd0288795e66057f2cb0d0;16159d73-3032-4e61-8a3f-18bc009609a8)

Repository Not Found for url: https://huggingface.co/decapoda-research/llama-7b-hf/resolve/main/tokenizer_config.json.
Please make sure you specified the correct repo_id and repo_type.
If you are trying to access a private or gated repo, make sure you are authenticated.
Invalid username or password.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/azuryl/project/LLM-Pruner/hf_prune.py", line 314, in
main(args)
File "/home/azuryl/project/LLM-Pruner/hf_prune.py", line 39, in main
tokenizer = LlamaTokenizer.from_pretrained(args.base_model)
File "/home/azuryl/anaconda3/envs/llamaprune/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1951, in from_pretrained
resolved_config_file = cached_file(
File "/home/azuryl/anaconda3/envs/llamaprune/lib/python3.10/site-packages/transformers/utils/hub.py", line 410, in cached_file
raise EnvironmentError(
OSError: decapoda-research/llama-7b-hf is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'
If this is a private repository, make sure to pass a token having permission to this repo either by logging in with huggingface-cli login or by passing token=<your_token>
[FINISH] - Finish Pruning Model
[START] - Start Tuning
Traceback (most recent call last):
File "/home/azuryl/project/LLM-Pruner/post_training.py", line 262, in
main(args)
File "/home/azuryl/project/LLM-Pruner/post_training.py", line 33, in main
pruned_dict = torch.load(args.prune_model, map_location='cpu')
File "/home/azuryl/anaconda3/envs/llamaprune/lib/python3.10/site-packages/torch/serialization.py", line 986, in load
with _open_file_like(f, 'rb') as opened_file:
File "/home/azuryl/anaconda3/envs/llamaprune/lib/python3.10/site-packages/torch/serialization.py", line 435, in _open_file_like
return _open_file(name_or_buffer, mode)
File "/home/azuryl/anaconda3/envs/llamaprune/lib/python3.10/site-packages/torch/serialization.py", line 416, in init
super().init(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: 'prune_log/llama_prune/pytorch_model.bin'
[FINISH] - Finish Prune and Post-Training.
[INFO] - The pruned model is at {prune_log/llama_prune/pytorch_model.bin}, and the recovery weight is at {tune_log/llama_0.2}/
You can use the command:
python generate.py --model_type tune_prune_LLM --ckpt prune_log/llama_prune/pytorch_model.bin --lora_ckpt tune_log/llama_0.2
to use the pruned model
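
A side note on the failure chain above: LlamaTokenizer.from_pretrained aborts because the decapoda-research/llama-7b-hf repository is no longer reachable on the Hugging Face Hub, so pruning never writes prune_log/llama_prune/pytorch_model.bin, and post_training.py then fails with the FileNotFoundError. A minimal sketch of the usual workaround, assuming you have access to another LLaMA-7B conversion (the repo id and local path below are examples, not project defaults):

# Point the loaders at an accessible LLaMA checkpoint instead of the removed mirror.
from transformers import LlamaForCausalLM, LlamaTokenizer

base_model = "huggyllama/llama-7b"  # example id; a local folder such as "/data/llama-7b-hf" also works
tokenizer = LlamaTokenizer.from_pretrained(base_model)
model = LlamaForCausalLM.from_pretrained(base_model, low_cpu_mem_usage=True)

Passing the same identifier to hf_prune.py via --base_model (together with --save_model) lets the pruning step finish and produce the checkpoint that post_training.py expects.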

Zero-shot Evaluation

Hi, thanks for your great work!
I wonder how to evaluate the pruned and fine-tuned model with lm-evaluation-harness?
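
A minimal sketch of how the pruned checkpoint and the LoRA recovery weights are typically recombined before evaluation, assuming the checkpoint dictionary stores the model and tokenizer the way generate.py consumes them and that the repo's bundled peft copy exposes PeftModel (both assumptions should be checked against your checkout):

import torch
from LLMPruner.peft import PeftModel  # import path assumed from the bundled peft copy

# Load the structurally pruned model and its tokenizer.
pruned_dict = torch.load("prune_log/llama_prune/pytorch_model.bin", map_location="cpu")
model, tokenizer = pruned_dict["model"], pruned_dict["tokenizer"]  # keys assumed

# Attach the recovery LoRA weights produced by post_training.py.
model = PeftModel.from_pretrained(model, "tune_log/llama_0.2")
model.eval()

The resulting (model, tokenizer) pair can then be wrapped by lm-evaluation-harness's Hugging Face adapter or fed to the repo's own evaluation script.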

Evaluation metric (acc vs. acc_norm) for lm-evaluation-harness tasks

Hi, thank you very much for generously open-sourcing your excellent work.

I've run the evaluation code you kindly shared and obtained the below results. I have a question regarding the metric for each task. Could you please clarify which one between acc or acc_norm [ref] was used for the PIQA, HellaSwag, ARC-e, ARC-c, and OBQA tasks? Thanks for taking the time to check this inquiry.

20%-pruned -> post-trained LLaMA from scripts/llama_prune.sh

Task BoolQ PIQA HellaSwag WinoGrande ARC-e ARC-c OBQA
LLM-Pruner paper 66.79 77.58 68.48 64.96 64.06 37.88 39.00
Reproduced (LLM-Pruner code): acc 65.20 77.15 53.19 63.69 64.77 36.26 28.80
Reproduced (LLM-Pruner): acc_norm n/a 76.93 68.63 n/a 52.27 36.95 40.40
Task BoolQ PIQA HellaSwag WinoGrande ARC-e ARC-c OBQA
Using different ver: acc 66.24 77.58 53.54 66.14 70.54 37.54 31.40
Using different ver: acc_norm n/a 78.13 71.39 n/a 65.95 39.33 41.20

Original LLaMA-7B

Task BoolQ PIQA HellaSwag WinoGrande ARC-e ARC-c OBQA
LLaMA paper 76.5 79.8 76.1 70.1 72.8 47.6 57.2
LLM-Pruner paper 73.18 78.35 72.99 67.01 67.45 41.38 42.40
Reproduced (LLM-Pruner code): acc 73.06 78.35 56.42 67.01 67.34 38.14 28.20
Reproduced (LLM-Pruner): acc_norm n/a 77.37 72.99 n/a 52.48 41.38 42.40
Task BoolQ PIQA HellaSwag WinoGrande ARC-e ARC-c OBQA
Using different ver: acc 75.05 78.67 56.93 69.93 75.29 41.89 34.60
Using different ver: acc_norm n/a 79.16 76.22 n/a 72.85 44.71 44.40

Note

  • Underlined: reported metrics in the LLM-Pruner paper
  • The scores reported in the LLM-Pruner paper are fully reproducible with this repo; the lm-evaluation-harness version changes the scores because of its recent updates (see the sketch after this list).
  • [Table 1 of the LLM-Pruner paper] The evaluation uses different prompts, so the scores are lower than the official results in the LLaMA paper.
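
As a rough cross-check between harness versions, both metrics can be read from the results JSON that lm-evaluation-harness writes; the {"results": {task: {metric: value}}} layout assumed below matches older harness releases and may differ in newer ones.

import json

with open("results.json") as f:  # path to the harness output is an example
    results = json.load(f)["results"]

for task, metrics in sorted(results.items()):
    acc = metrics.get("acc", "n/a")
    acc_norm = metrics.get("acc_norm", "n/a")
    print(f"{task:12s} acc={acc} acc_norm={acc_norm}")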

param_first and param_mix result in the same ppl

I simply run the following two commands:
python hf_prune.py --pruning_ratio 0.62785 --block_wise --block_mlp_layer_start 0 --block_mlp_layer_end 32 --block_attention_layer_start 32 --block_attention_layer_end 32 --pruner_type taylor --base_model /mnt/petrelfs/xxx/llama2-7b --device cpu --eval_device cuda --taylor param_first --save_ckpt_log_name llama_prune --save_model --num_examples 128
python hf_prune.py --pruning_ratio 0.62785 --block_wise --block_mlp_layer_start 0 --block_mlp_layer_end 32 --block_attention_layer_start 32 --block_attention_layer_end 32 --pruner_type taylor --base_model /mnt/petrelfs/xxx/llama2-7b --device cpu --eval_device cuda --taylor param_mix --save_ckpt_log_name llama_prune --save_model --num_examples 128
But they result in the same ppl.
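
One quick way to check whether the two criteria actually selected different structures (and not just ended up at the same perplexity) is to diff the saved checkpoints. The sketch below assumes the two runs were re-saved under distinct --save_ckpt_log_name values so they do not overwrite each other, and that the checkpoint stores the model under a "model" key as elsewhere in this repo:

import torch

first = torch.load("prune_log/llama_prune_param_first/pytorch_model.bin", map_location="cpu")["model"]
mix = torch.load("prune_log/llama_prune_param_mix/pytorch_model.bin", map_location="cpu")["model"]

for (name_a, p_a), (name_b, p_b) in zip(first.named_parameters(), mix.named_parameters()):
    if p_a.shape != p_b.shape:
        print(f"shape differs: {name_a} {tuple(p_a.shape)} vs {tuple(p_b.shape)}")
    elif not torch.equal(p_a, p_b):
        print(f"values differ: {name_a}")

Identical shapes and weights would mean the two importance criteria picked the same channels for this setting, which would explain the identical perplexity.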

Reproducing paper results

I ran LLM-Pruner with the command specified in the README to prune LLaMA-7B:

python hf_prune.py --pruning_ratio 0.25 \
      --block_wise \
      --block_mlp_layer_start 4 --block_mlp_layer_end 30 \
      --block_attention_layer_start 4 --block_attention_layer_end 30 \
      --pruner_type taylor \
      --test_after_train \
      --device cpu  --eval_device cuda \
      --save_ckpt_log_name llama_prune

I get the following results

#Param before: 6738415616, #Param after: 5422977024, Ratio = 80.4785%
PPL after pruning: {'wikitext2': 19.96819234893607, 'ptb': 80.37625124290746}

Perplexities reported in Table 1 of the paper are WikiText2 - 19.09 and PTB - 34.21. Is there any reason for the difference in these perplexities, especially PTB? Thanks
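
Part of such gaps (especially on PTB) can also come from how perplexity is measured, e.g. the evaluation sequence length and chunking, not only from the pruning itself. The generic sliding-window loop below is only for comparison; it is not the repo's PPLMetric, and the dataset name and window size are assumptions to adjust:

import torch
from datasets import load_dataset

def perplexity(model, tokenizer, texts, seq_len=128, device="cuda"):
    # Tokenize the whole split once, then score fixed-length, non-overlapping windows.
    ids = tokenizer("\n\n".join(texts), return_tensors="pt").input_ids.to(device)
    total_nll, total_tokens = 0.0, 0
    for start in range(0, ids.size(1), seq_len):
        chunk = ids[:, start:start + seq_len]
        if chunk.size(1) < 2:
            break
        with torch.no_grad():
            loss = model(chunk, labels=chunk).loss  # mean NLL over chunk.size(1) - 1 targets
        total_nll += loss.item() * (chunk.size(1) - 1)
        total_tokens += chunk.size(1) - 1
    return float(torch.exp(torch.tensor(total_nll / total_tokens)))

# Example usage with the pruned model and tokenizer already on the GPU:
# test_texts = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"]
# print(perplexity(model, tokenizer, test_texts, seq_len=128))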

Recovery training

When I start recovery training on Baichuan-7B, I hit the following error.

Exception has occurred: RuntimeError
Caught RuntimeError in replica 1 on device 1.
Original Traceback (most recent call last):
File "/opt/miniconda3/envs/flash/lib/python3.9/site-packages/torch/nn/parallel/parallel_apply.py", line 64, in _worker
output = module(*input, **kwargs)
File "/opt/miniconda3/envs/flash/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/jovyan/honor/xcg/LLM-Pruner-main/LLMPruner/peft/peft_model.py", line 664, in forward
return self.base_model(
File "/opt/miniconda3/envs/flash/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/jovyan/honor/xcg/LLM-Pruner-main/LLMPruner/models/hf_baichuan/baichuan7B/modeling_baichuan_7B.py", line 610, in forward
outputs = self.model(
File "/opt/miniconda3/envs/flash/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/jovyan/honor/xcg/LLM-Pruner-main/LLMPruner/models/hf_baichuan/baichuan7B/modeling_baichuan_7B.py", line 452, in forward
inputs_embeds = self.embed_tokens(input_ids)
File "/opt/miniconda3/envs/flash/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
result = forward_call(*input, **kwargs)
File "/opt/miniconda3/envs/flash/lib/python3.9/site-packages/torch/nn/modules/sparse.py", line 160, in forward
return F.embedding(
File "/opt/miniconda3/envs/flash/lib/python3.9/site-packages/torch/nn/functional.py", line 2210, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! (when checking argument for argument index in method wrapper__index_select)
File "/home/jovyan/honor/xcg/LLM-Pruner-main/post_training.py", line 199, in main
trainer.train(resume_from_checkpoint=args.resume_from_checkpoint)
File "/home/jovyan/honor/xcg/LLM-Pruner-main/post_training.py", line 244, in
main(args)

How can I fix it?
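
The mismatch usually comes from the Hugging Face Trainer wrapping the model in nn.DataParallel across both visible GPUs while parts of the pruned Baichuan model stay on a single device. A minimal workaround, assuming single-GPU recovery training is acceptable, is to hide the extra GPUs before torch is imported (a general CUDA mechanism, not something specific to post_training.py), e.g. at the very top of the training script:

import os

# Must run before torch / transformers are imported so only one device is visible
# and the Trainer never builds DataParallel replicas on cuda:1.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

If multi-GPU recovery is required, launching the script through torch.distributed or accelerate instead of relying on implicit DataParallel is the other common route.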
