horseee / llm-pruner

[NeurIPS 2023] LLM-Pruner: On the Structural Pruning of Large Language Models. Support LLaMA, Llama-2, BLOOM, Vicuna, Baichuan, etc.

Home Page: https://arxiv.org/abs/2305.11627

License: Apache License 2.0

Python 99.29% Shell 0.16% C++ 0.55%
compression language-model llm pruning pruning-algorithms baichuan chatglm llama vicuna llama-2

llm-pruner's Introduction

llm-pruner's People

Contributors

eltociear, horseee, vainf


llm-pruner's Issues

evaluate

Hello. After pruning, I fine-tuned the model on the alpaca_data_zh_51k dataset. How can I evaluate the performance of the fine-tuned model on alpaca_data_zh_51k? Thanks.

Cannot use huggingface to load

I pruned 25% of all the layers, but the resulting shape is not what I wanted.
I expected the shape to be [N, N], but got [N, M] with M = N*0.25.
It is difficult to load.

Latency code

Can you share the latency code used in the experiment? In my test on an A100, I found that the pruned LLaMA-7B is even slower than the original LLaMA. Thank you very much!
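Not from the repository, just a rough sketch of how GPU latency is often measured for a comparison like this (the model and input names here are assumptions, and the model is assumed to already sit on a CUDA device):

import time
import torch

@torch.no_grad()
def measure_latency(model, input_ids, n_warmup=10, n_iters=50):
    # Warm up so CUDA kernels, caches, and the allocator are initialized before timing.
    for _ in range(n_warmup):
        model(input_ids)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_iters):
        model(input_ids)
    torch.cuda.synchronize()  # wait for all queued kernels before stopping the clock
    return (time.perf_counter() - start) / n_iters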

Force even pruning across layers

Is there a way to force the pruning to remove the same number of parameters from every layer?
This would make the resulting model compatible with the HF implementation (loadable via from_pretrained).

a post-training issue

Thanks for your nice work!

When I post-train the pruned model by running python post_training.py --prune_model prune_log/pytorch_model.bin --data_path yahma/alpaca-cleaned --output_dir tune_log --wandb_project llama_tune --lora_r 8 --num_epochs 2 --learning_rate 1e-4 --batch_size 64, I run into the following problem:

wandb.errors.UsageError: api_key not configured (no-tty). call wandb.login(key=[your_api_key])

Could you please tell me how I should deal with that? Thank you!
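Not an official answer, only a minimal workaround sketch: either disable Weights & Biases before the trainer starts, or log in non-interactively (both rely on the standard wandb/transformers environment hooks):

import os

# Option 1: turn off wandb logging entirely (recognized by wandb and by transformers' Trainer).
os.environ["WANDB_MODE"] = "disabled"
os.environ["WANDB_DISABLED"] = "true"

# Option 2: authenticate before training starts.
# import wandb
# wandb.login(key="YOUR_WANDB_API_KEY")  # placeholder key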

Pruning Llama2-7B

I've tried to prune Llama2-7B on a MacBook Pro M1, but the system ended the run by killing the process because of OOM (I have 32GB).

Is there something I can do? Has somebody already pruned this model and published it?

Thank you!

Question on recovery and training data

Excellent project btw and I am sure it will get traction.

I am curious about the recovery stage. Is the team proposing recovery training using the same training data that created the base model?

Checking the pruned but uncompressed model

Hi,

Thanks a lot for this awesome work! I am wondering whether there is a way to inspect the pruned but uncompressed model. Right now, when I save the model, it is already compressed, so I assume the pruned weights are discarded. Is there any way I can locate those pruned weights?

Thanks!

Sparse Mask question

Hi, I have a question about the sparsity of the weights:
after merging LoRA into the sparse weights, will the sparse weights become dense?

My process has some problems

  1. I downloaded the vicuna model from this link: vicuna model
  2. Because of a network problem, I downloaded bookcorpus.tar.bz2 and uncompressed it manually (image), and changed the get_bookcorpus API accordingly (image).
  3. I ran the pruning process and got the following log:

python hf_prune.py --pruning_ratio 0.1 \
    --block_wise \
    --block_mlp_layer_start 4 --block_mlp_layer_end 30 \
    --block_attention_layer_start 4 --block_attention_layer_end 30 \
    --pruner_type Taylor \
    --test_after_train \
    --device cpu --eval_device cuda \
    --save_ckpt_log_name llama_prune \
    --base_model vicuna_model \
    --save_model

Loading checkpoint shards: 100%|██████████| 2/2 [00:12<00:00, 6.30s/it]
2023-06-27 14:47:57 - INFO : Use taylor pruner...
2023-06-27 14:47:57 - INFO : Pruning Attention Layer = [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29]
2023-06-27 14:47:57 - INFO : Pruning MLP Layer = [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29]
2023-06-27 14:47:58 - INFO : Start Pruning
2023-06-27 14:47:59 - WARNING : Found cached dataset text (/root/.cache/huggingface/datasets/text/default-a5b37c6d93890eb9/0.0.0/cb1e9bd71a82ad27976be3b12b407850fe2837d80c22c5e03a28949843a8ace2)
2023-06-27 14:48:00 - INFO : Start Backwarding in iterative steps = 0...
2023-06-27 14:48:08 - INFO : Loss = 3.7414731979370117
2023-06-27 14:48:34 - INFO : After Iter 1/1, #parameters: 6223081472
2023-06-27 14:48:34 - INFO : #Param before: 6738415616, #Param after: 6223081472, Ratio = 92.3523%
Saving model:
prune_log/llama_prune/pytorch_model.bin
2023-06-27 14:51:02 - INFO :
==================Generation Results After Pruning================

2023-06-27 14:51:04 - INFO : I believe the meaning of life is to continue seeking for self-fulfillment while also contributing to the betterment of the world and society. It's not to be selfish or materialistic, but to find balance and happiness through meaningful work, relationships, and experiences.
2023-06-27 14:51:08 - INFO : Simply put, the theory of relativity states that 1. Gravity is not a force in the sense that we usually understand it, but rather the result of the curvature of spacetime, caused by the presence of massive objects. This curvature creates a field that can bend the path of light. 2. The speed of light is always constant in a vacuum, regardless of the motion of the observer or the source. As a result, the amount of time it takes for light to travel a certain distance is dependent on the observer's relative motion. 3. Time and space are intertw
2023-06-27 14:51:13 - INFO : Building a website can be done in 10 simple steps:

  1. Choose a domain name that is unique and easy to remember.
  2. Choose a hosting service that will support your website.
  3. Choose a web host and register your domain.
  4. Set up your website using a web hosting service.
  5. Customize the look and feel of your website.
  6. Add content to your website.
  7. Test your website to ensure that everything is working properly.
  8. Launch your website to the public.
  9. Continuously improve your website and
2023-06-27 14:51:14 - INFO : Tweet: "I hate it when my phone battery dies."
Sentiment: Negative

Tweet: "My day has been ๐Ÿ‘"
Sentiment: Positive

Tweet: "This is the link to the article"
Sentiment: Neutral

Tweet: "This new music video was incredibile"
Sentiment: Positive

Tweet: "I just spent the whole day cleaning my room"
Sentiment: Negative

Note: The sentiment refers
2023-06-27 14:51:18 - INFO : Translate English to French:

sea otter => loutre de mer

peppermint => menthe poivrée

plush girafe => girafe peluche

cheese => fromage

English to French:

honey badger => renard des merveilles

spaghetti squirrel => écureuil à la spaghettini

poutine rainbow → rainbow de poutine

French translation of “the cat sat on the mat”: Le chat est assis sur le tapis.
2023-06-27 14:51:18 - INFO :
==================Finish================

2023-06-27 14:51:18 - WARNING : Found cached dataset wikitext (/root/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126)
2023-06-27 14:51:18 - WARNING : Found cached dataset wikitext (/root/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126)
Token indices sequence length is longer than the specified maximum sequence length for this model (341469 > 2048). Running this sequence through the model will result in indexing errors
100%|██████████| 667/667 [00:58<00:00, 11.48it/s]
{'wikitext2': 19.091033031037714}
2023-06-27 14:52:18 - WARNING : Found cached dataset wikitext (/root/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126)
2023-06-27 14:52:18 - WARNING : Found cached dataset wikitext (/root/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126)
100%|██████████| 667/667 [00:58<00:00, 11.49it/s]
{'wikitext2': 19.091033031037714, 'ptb': 19.091033031037714}
2023-06-27 14:53:18 - INFO : PPL after pruning: {'wikitext2': 19.091033031037714, 'ptb': 19.091033031037714}
2023-06-27 14:53:18 - INFO : Memory Requirement: 12052.52490234375 MiB

  4. I ran inference with the pruned model and got the result:

    command:
    python generate.py --model_type pruneLLM --ckpt pruner_model/pytorch_model_10%.bin

    inference code:

    def main(args):
        if args.model_type == 'pretrain':
            tokenizer = LlamaTokenizer.from_pretrained(args.base_model, cache_dir='llama_hf_model', force_download=True)
            model = LlamaForCausalLM.from_pretrained(
                args.base_model,
                low_cpu_mem_usage=True if torch_version >= 9 else False,
                cache_dir='llama_hf_model',
            )
            description = "Model Type: {}\n Base Model: {}".format(args.model_type, args.base_model)
        elif args.model_type == 'pruneLLM':
            pruned_dict = torch.load(args.ckpt, map_location='cpu')
            tokenizer, model = pruned_dict['tokenizer'], pruned_dict['model']
            description = "Model Type: {}\n Pruned Model: {}".format(args.model_type, args.ckpt)
        elif args.model_type == 'tune_prune_LLM':
            pruned_dict = torch.load(args.ckpt, map_location='cpu')
            tokenizer, model = pruned_dict['tokenizer'], pruned_dict['model']
            model = PeftModel.from_pretrained(
                model,
                args.lora_ckpt,
                torch_dtype=torch.float16,
            )
            description = "Model Type: {}\n Pruned Model: {}\n LORA ckpt: {}".format(args.model_type, args.ckpt, args.lora_ckpt)
        else:
            raise NotImplementedError

        if device == "cuda":
            model.half()
            model = model.cuda()

        # unwind broken decapoda-research config
        model.config.pad_token_id = tokenizer.pad_token_id = 0  # unk
        model.config.bos_token_id = 1
        model.config.eos_token_id = 2

        model.eval()

        print("Human:")
        line = input()
        while line:
            inputs = tokenizer(line, return_tensors="pt")
            input_ids = inputs["input_ids"].to(device)

            with torch.no_grad():
                generation_output = model.generate(
                    input_ids=input_ids,
                    early_stopping=True,
                    num_beams=4,
                    do_sample=True,
                    top_k=40,
                    top_p=0.95,
                    temperature=1,
                    max_length=1024,
                    return_dict_in_generate=True,
                )
            s = generation_output.sequences[0]
            output = tokenizer.decode(s)
            print(output)
            print("\n-------------------------------\n")
            print("Human:")  # show the "Human:" prompt again before each new user input
            line = input()

 result:

Human:
写一首春天的诗句 ("Write a line of poetry about spring")
(The pruned model echoes the prompt and then produces a short, largely incoherent Chinese poem containing broken tokens, followed by rhetorical questions about who wrote the line and what it means.)

Human:
心情不好的时候应该如何调整 ("How should I adjust when I'm in a bad mood?")
(The model merely restates the question, appending "...my way of thinking?".)

Human:
怎样学习机器学习 ("How should I learn machine learning?")
(The model echoes the question and then rambles through several paragraphs of largely incoherent Chinese text, presented as an excerpt from Chapter 1 of a book titled 《深度遗传算子》.)


Latency evaluation

Which file exactly did you run to obtain the latency data mentioned in the paper?
image

Eval Loss NaN on Llama-2

Hi,

By any chance, have you tried actually running it on the Llama-2 model?

I tried using default llama parameters for pruning and post-training, resulting in similar wikitext2 score (~19) but much worse score for ptb (~70).

Also, when running post-training with the parameter set of llama, llama-2 loss explodes after ~0.2 epoch. Tried using smaller lr (1e-5) yet eval loss exploded to nan.

It would be of great help if you could provide some insights on both pruning and post-training parameters.

Thanks.

Issue: Missing Generation of `pytorch_model.bin` File During Model Tuning

Thank you for sharing your interesting project!

Recently, when I ran bash ./script/llama_prune.sh, the pruning step worked perfectly fine. However, during the tuning step, although there was no error message, the generated output only included the following:

  • checkpoints-200
    • model.safetensors
    • optimizer.pt
    • rng_state.pth
    • scheduler.pt
    • trainer_state.json
    • training_args.bin

I noticed that the pytorch_model.bin file was not saved. I haven't modified the code, and I am using PyTorch version 2.1.2+cu121. Could you suggest what the possible reason for this might be?

ConnectionError: Couldn't reach https://raw.githubusercontent.com/wojzaremba/lstm/master/data/ptb.train.txt (ReadTimeout(ReadTimeoutError("HTTPSConnectionPool(host='raw.githubusercontent.com', port=443): Read timed out. (read timeout=100)")))

100%|██████████| 667/667 [00:53<00:00, 12.35it/s]
{'wikitext2': 20.046345644076645}
/home/iotsc01/anaconda3/envs/xinpengq_env/lib/python3.10/site-packages/datasets/load.py:1429: FutureWarning: The repository for ptb_text_only contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/ptb_text_only
You can avoid this message in future by passing the argument trust_remote_code=True.
Passing trust_remote_code=True will be mandatory to load this dataset from the next major release of datasets.
warnings.warn(
HF google storage unreachable. Downloading and preparing it from source
2024-01-14 05:26:19 - WARNING : HF google storage unreachable. Downloading and preparing it from source
Traceback (most recent call last):
File "/home/iotsc01/xinpengq/LLM-Pruner-main/hf_prune.py", line 314, in
main(args)
File "/home/iotsc01/xinpengq/LLM-Pruner-main/hf_prune.py", line 267, in main
ppl = PPLMetric(model, tokenizer, ['wikitext2', 'ptb'], args.max_seq_len, device=args.eval_device)
File "/home/iotsc01/xinpengq/LLM-Pruner-main/LLMPruner/evaluator/ppl.py", line 10, in PPLMetric
_, test_loader = get_loaders(dataset, tokenizer, seq_len=seq_len, batch_size = batch_size)
File "/home/iotsc01/xinpengq/LLM-Pruner-main/LLMPruner/datasets/ppl_dataset.py", line 50, in get_loaders
train_data, test_data = get_ptb(seq_len, tokenizer)
File "/home/iotsc01/xinpengq/LLM-Pruner-main/LLMPruner/datasets/ppl_dataset.py", line 19, in get_ptb
traindata = load_dataset('ptb_text_only', 'penn_treebank', split='train')
File "/home/iotsc01/anaconda3/envs/xinpengq_env/lib/python3.10/site-packages/datasets/load.py", line 2549, in load_dataset
builder_instance.download_and_prepare(
File "/home/iotsc01/anaconda3/envs/xinpengq_env/lib/python3.10/site-packages/datasets/builder.py", line 1005, in download_and_prepare
self._download_and_prepare(
File "/home/iotsc01/anaconda3/envs/xinpengq_env/lib/python3.10/site-packages/datasets/builder.py", line 1767, in _download_and_prepare
super()._download_and_prepare(
File "/home/iotsc01/anaconda3/envs/xinpengq_env/lib/python3.10/site-packages/datasets/builder.py", line 1078, in _download_and_prepare
split_generators = self._split_generators(dl_manager, **split_generators_kwargs)
File "/home/iotsc01/.cache/huggingface/modules/datasets_modules/datasets/ptb_text_only/8d1b97746fb9765d140e569ec5ddd35e20af4d37761f5e1bf357ea0b081f2c1f/ptb_text_only.py", line 131, in _split_generators
data_dir = dl_manager.download_and_extract(my_urls)
File "/home/iotsc01/anaconda3/envs/xinpengq_env/lib/python3.10/site-packages/datasets/download/download_manager.py", line 562, in download_and_extract
return self.extract(self.download(url_or_urls))
File "/home/iotsc01/anaconda3/envs/xinpengq_env/lib/python3.10/site-packages/datasets/download/download_manager.py", line 426, in download
downloaded_path_or_paths = map_nested(
File "/home/iotsc01/anaconda3/envs/xinpengq_env/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 466, in map_nested
mapped = [
File "/home/iotsc01/anaconda3/envs/xinpengq_env/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 467, in
_single_map_nested((function, obj, types, None, True, None))
File "/home/iotsc01/anaconda3/envs/xinpengq_env/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 370, in _single_map_nested
return function(data_struct)
File "/home/iotsc01/anaconda3/envs/xinpengq_env/lib/python3.10/site-packages/datasets/download/download_manager.py", line 451, in _download
out = cached_path(url_or_filename, download_config=download_config)
File "/home/iotsc01/anaconda3/envs/xinpengq_env/lib/python3.10/site-packages/datasets/utils/file_utils.py", line 188, in cached_path
output_path = get_from_cache(
File "/home/iotsc01/anaconda3/envs/xinpengq_env/lib/python3.10/site-packages/datasets/utils/file_utils.py", line 573, in get_from_cache
raise ConnectionError(f"Couldn't reach {url} ({repr(head_error)})")
ConnectionError: Couldn't reach https://raw.githubusercontent.com/wojzaremba/lstm/master/data/ptb.train.txt (ReadTimeout(ReadTimeoutError("HTTPSConnectionPool(host='raw.githubusercontent.com', port=443): Read timed out. (read timeout=100)")))
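Not a maintainer fix, just a low-effort workaround sketch based on the call shown in the traceback: skip the PTB split when raw.githubusercontent.com is unreachable and evaluate perplexity on wikitext2 only.

# In hf_prune.py, around the line shown in the traceback (hypothetical edit):
# drop 'ptb' from the metric call so only the cached wikitext2 split is used.
ppl = PPLMetric(model, tokenizer, ['wikitext2'], args.max_seq_len, device=args.eval_device)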

Gain using more data

Hi! Thanks for sharing the work again!

I'm wondering how your test on using more data is going?
image

I tried using 300K samples for post-training the pruned LLaMA, and the PPL results are basically the same as using only 50K samples. Is PPL not a proper evaluation metric, or do the training parameters need to be tuned more carefully? Thanks for any advice or discussion!

Error when using GPU for pruning

Hi when I try to implement the following:

python hf_prune.py --pruning_ratio 0.25 --block_wise --block_mlp_layer_start 4 --block_mlp_layer_end 30 --block_attention_layer_start 4 --block_attention_layer_end 30 --pruner_type taylor --test_after_train --device cuda --eval_device cuda --save_ckpt_log_name lama_prune

I got an error:
Traceback (most recent call last):
File "hf_prune.py", line 296, in
main(args)
File "hf_prune.py", line 129, in main
loss = model(example_prompts, labels=example_prompts).loss
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/dfs/data/LLM-Pruner/LLMPruner/models/hf_llama/modeling_llama.py", line 689, in forward
outputs = self.model(
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/dfs/data/LLM-Pruner/LLMPruner/models/hf_llama/modeling_llama.py", line 532, in forward
inputs_embeds = self.embed_tokens(input_ids)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/sparse.py", line 158, in forward
return F.embedding(
File "/opt/conda/lib/python3.8/site-packages/torch/nn/functional.py", line 2156, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper__index_select)

Is GPU pruning not supported?

PS: May I ask how much memory is needed overall?
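Not the maintainers' patch, only a sketch of one possible fix suggested by the traceback: the calibration batch stays on CPU while the model sits on cuda, so moving it to the model's device before the forward/backward pass should resolve the mismatch.

# hf_prune.py, around the line shown in the traceback (hypothetical edit):
example_prompts = example_prompts.to(args.device)            # keep data on the same device as the model
loss = model(example_prompts, labels=example_prompts).loss   # unchanged line from the traceback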

I encountered the following error message when I set iterative_steps = 2 during Baichuan-7B pruning

Traceback (most recent call last):
File "/home/jovyan/honor/yangdong/LLM-Pruner-main/examples/baichuan.py", line 342, in
main(args)
File "/home/jovyan/honor/yangdong/LLM-Pruner-main/examples/baichuan.py", line 229, in main
pruner.step()
File "/home/jovyan/honor/yangdong/LLM-Pruner-main/LLMPruner/torch_pruning/pruner/algorithms/metapruner.py", line 186, in step
for group in self.prune_local():
File "/home/jovyan/honor/yangdong/LLM-Pruner-main/LLMPruner/torch_pruning/pruner/algorithms/metapruner.py", line 245, in prune_local
imp = self.estimate_importance(group, ch_groups=ch_groups, consecutive_groups=consecutive_groups)
File "/home/jovyan/honor/yangdong/LLM-Pruner-main/LLMPruner/torch_pruning/pruner/algorithms/metapruner.py", line 190, in estimate_importance
return self.importance(group, ch_groups=ch_groups, consecutive_groups=consecutive_groups)
File "/opt/miniconda3/envs/flash/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/jovyan/honor/yangdong/LLM-Pruner-main/LLMPruner/pruner/hf_baichuan_pruner.py", line 180, in call
local_norm = local_norm[idxs]
IndexError: index 3840 is out of bounds for dimension 0 with size 3840

Use LLM-Pruner for Baichuan model

Hi, I am trying to use LLM-Pruner on the Baichuan-13B model (https://github.com/baichuan-inc/Baichuan-13B). It is also LLaMA-structured, so I thought it should work out of the box, but I got some errors... I am still trying to debug, but slowly... Any help or advice would be very appreciated!

Specifically, I ran "CUDA_VISIBLE_DEVICES=0,1 python hf_prune_baichuan.py --base_model models/baichuan-13b-chat --pruning_ratio 0.25 --device cpu --eval_device cuda --block_wise --block_mlp_layer_start 4 --block_mlp_layer_end 30 --block_attention_layer_start 4 --block_attention_layer_end 30 --save_ckpt_log_name baichuan_13b_chat_0.2 --pruner_type taylor --test_after_train --taylor param_first --save_model",

and I got the following output:
Loading checkpoint shards: 100%|██████████| 3/3 [02:14<00:00, 44.74s/it]
2023-07-17 02:29:09 - INFO : Use taylor pruner...
2023-07-17 02:29:09 - INFO : Pruning Attention Layer = [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29]
2023-07-17 02:29:09 - INFO : Pruning MLP Layer = [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29]
/dfs/data/LLM-Pruner/LLMPruner/torch_pruning/dependency.py:362: UserWarning: Unwrapped parameters detected: ['model.layers.10.input_layernorm.weight', 'model.layers.17.post_attention_layernorm.weight', 'model.layers.34.input_layernorm.weight', 'model.layers.22.input_layernorm.weight', 'model.layers.29.post_attention_layernorm.weight', 'model.layers.0.post_attention_layernorm.weight', 'model.layers.5.input_layernorm.weight', 'model.layers.12.post_attention_layernorm.weight', 'model.layers.31.input_layernorm.weight', 'model.layers.38.post_attention_layernorm.weight', 'model.layers.2.input_layernorm.weight', 'model.layers.9.post_attention_layernorm.weight', 'model.layers.26.input_layernorm.weight', 'model.layers.33.post_attention_layernorm.weight', 'model.layers.14.input_layernorm.weight', 'model.layers.21.post_attention_layernorm.weight', 'model.layers.35.input_layernorm.weight', 'model.layers.4.post_attention_layernorm.weight', 'model.layers.23.input_layernorm.weight', 'model.layers.30.post_attention_layernorm.weight', 'model.layers.1.post_attention_layernorm.weight', 'model.layers.18.input_layernorm.weight', 'model.layers.25.post_attention_layernorm.weight', 'model.layers.6.input_layernorm.weight', 'model.layers.13.post_attention_layernorm.weight', 'model.layers.32.input_layernorm.weight', 'model.layers.39.post_attention_layernorm.weight', 'model.layers.34.post_attention_layernorm.weight', 'model.layers.3.input_layernorm.weight', 'model.layers.10.post_attention_layernorm.weight', 'model.layers.27.input_layernorm.weight', 'model.layers.15.input_layernorm.weight', 'model.layers.22.post_attention_layernorm.weight', 'model.layers.36.input_layernorm.weight', 'model.layers.5.post_attention_layernorm.weight', 'model.layers.24.input_layernorm.weight', 'model.layers.31.post_attention_layernorm.weight', 'model.layers.2.post_attention_layernorm.weight', 'model.layers.19.input_layernorm.weight', 'model.layers.26.post_attention_layernorm.weight', 'model.layers.7.input_layernorm.weight', 'model.layers.14.post_attention_layernorm.weight', 'model.layers.28.input_layernorm.weight', 'model.layers.35.post_attention_layernorm.weight', 'model.layers.16.input_layernorm.weight', 'model.layers.23.post_attention_layernorm.weight', 'model.layers.11.input_layernorm.weight', 'model.layers.18.post_attention_layernorm.weight', 'model.layers.37.input_layernorm.weight', 'model.layers.6.post_attention_layernorm.weight', 'model.layers.25.input_layernorm.weight', 'model.layers.32.post_attention_layernorm.weight', 'model.norm.weight', 'model.layers.3.post_attention_layernorm.weight', 'model.layers.20.input_layernorm.weight', 'model.layers.27.post_attention_layernorm.weight', 'model.layers.8.input_layernorm.weight', 'model.layers.15.post_attention_layernorm.weight', 'model.layers.29.input_layernorm.weight', 'model.layers.36.post_attention_layernorm.weight', 'model.layers.17.input_layernorm.weight', 'model.layers.24.post_attention_layernorm.weight', 'model.layers.38.input_layernorm.weight', 'model.layers.12.input_layernorm.weight', 'model.layers.19.post_attention_layernorm.weight', 'model.layers.0.input_layernorm.weight', 'model.layers.7.post_attention_layernorm.weight', 'model.layers.21.input_layernorm.weight', 'model.layers.28.post_attention_layernorm.weight', 'model.layers.9.input_layernorm.weight', 'model.layers.16.post_attention_layernorm.weight', 'model.layers.33.input_layernorm.weight', 'model.layers.4.input_layernorm.weight', 'model.layers.11.post_attention_layernorm.weight', 'model.layers.30.input_layernorm.weight', 
'model.layers.37.post_attention_layernorm.weight', 'model.layers.13.input_layernorm.weight', 'model.layers.20.post_attention_layernorm.weight', 'model.layers.39.input_layernorm.weight', 'model.layers.1.input_layernorm.weight', 'model.layers.8.post_attention_layernorm.weight'].
Torch-Pruning will prune the last non-singleton dimension of a parameter. If you wish to customize this behavior, please provide an unwrapped_parameters argument.
warnings.warn("Unwrapped parameters detected: {}.\n Torch-Pruning will prune the last non-singleton dimension of a parameter. If you wish to customize this behavior, please provide an unwrapped_parameters argument.".format([_param_to_name[p] for p in unwrapped_detected]))
2023-07-17 02:30:02 - INFO : Start Pruning
2023-07-17 02:30:02 - WARNING : Found cached dataset bookcorpus (/dfs/data/data/bookcorpus/bookcorpus/plain_text/1.0.0/eddee3cae1cc263a431aa98207d4d27fd8a73b0a9742f692af0e6c65afa4d75f)
2023-07-17 02:30:45 - INFO : Start Backwarding in iterative steps = 0...
2023-07-17 02:33:56 - INFO : Loss = 3.644896984100342
Traceback (most recent call last):
File "hf_prune_baichuan.py", line 299, in
main(args)
File "hf_prune_baichuan.py", line 136, in main
pruner.step()
File "/dfs/data/LLM-Pruner/LLMPruner/torch_pruning/pruner/algorithms/metapruner.py", line 179, in step
for group in self.prune_local():
File "/dfs/data/LLM-Pruner/LLMPruner/torch_pruning/pruner/algorithms/metapruner.py", line 238, in prune_local
imp = self.estimate_importance(group, ch_groups=ch_groups, consecutive_groups=consecutive_groups)
File "/dfs/data/LLM-Pruner/LLMPruner/torch_pruning/pruner/algorithms/metapruner.py", line 183, in estimate_importance
return self.importance(group, ch_groups=ch_groups, consecutive_groups=consecutive_groups)
File "/opt/conda/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
return func(*args, **kwargs)
File "/dfs/data/LLM-Pruner/LLMPruner/pruner/hf_baichuan_pruner.py", line 306, in call
local_norm = local_norm[idxs]
IndexError: index 10240 is out of bounds for dimension 0 with size 5120

I modified "hf_prune_llama.py" and "LLMPruner/pruner/hf_llama_pruner.py":
1ใ€replacing the model loading part as:
tokenizer = AutoTokenizer.from_pretrained(args.base_model, use_fast=False, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(args.base_model, trust_remote_code=True)
model.generation_config = GenerationConfig.from_pretrained(args.base_model)
2ใ€replacing all "q_proj, k_proj, v_proj" with "W_pack"

Do you have any advice on quick fixing? Thank you very much!

Examples on the Huggingface Hub

Apologies if this has been asked before, but do you have pruned models that we can test and run locally? Anything on the huggingface hub?

I'd like to test some of your models directly before investing time in your pruning code. Thanks!

cannot import name 'SiLUActivation' from 'transformers.activations'

python test_speedup.py --model_type pretrain
Traceback (most recent call last):
File "/home/azuryl/project/llm-pruner/LLM-Pruner/test_speedup.py", line 10, in
from transformers.activations import SiLUActivation
ImportError: cannot import name 'SiLUActivation' from 'transformers.activations' (/home/azuryl/anaconda3/envs/llamaprune/lib/python3.10/site-packages/transformers/activations.py)
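Newer transformers releases removed SiLUActivation in favor of torch's built-in SiLU; a small compatibility sketch (an assumption about the cause, not an official fix):

try:
    from transformers.activations import SiLUActivation  # present in older transformers versions
except ImportError:
    from torch.nn import SiLU as SiLUActivation  # functionally equivalent stand-in on newer versions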

Question related to the model tuning

Hi,

Great work first!

I am confused with the model tuning part.

According to the code, it seemed that you used the lora method.
This, in my opinion, will destroy the sparsity you created in the original model once the LoRA weights are merged into the model weights.

could you explain this?

Thanks
Shawn

the new pytorch.bin is bigger than the original model issue

When I chose to save the model, I found something strange: the new pytorch.bin is bigger than the original model. I used Baichuan-7B with --pruning_ratio 0.5 for the test and added --save_model to save the model after pruning, but the new pytorch.bin is 17GB while the original model is only 13GB.
Could you please tell me why? Thank you!

Pruning MQA?

How can I prune LLMs that use Multi-Query Attention?

Adding quantization

If I combine multiple strategies such as GPTQ + LLM-Pruner + LoRA, could the compression ratio of the LLM be greatly improved while keeping acceptable performance?

Error occurs when pruning LLaMa2-7b

With cmd like:
CUDA_VISIBLE_DEVICES=0 python hf_prune.py --base_model path_to_cached_hf_llama2-7b --pruning_ratio 0.25 --device cpu --eval_device cuda --block_wise --block_mlp_layer_start 4 --block_mlp_layer_end 30 --block_attention_layer_start 4 --block_attention_layer_end 30 --pruner_type taylor --test_after_train --taylor param_first --save_model
It throws the error: "addmm_impl_cpu_" not implemented for 'Half'
image

torch==2.0.0
transformers==4.31.0
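This PyTorch error means a float16 matrix multiply was executed on the CPU, which has no half-precision addmm kernel. A rough workaround sketch (an assumption, not the repository's official fix) is to keep the weights in float32 while pruning with --device cpu and cast to half only on the GPU:

import torch
from transformers import LlamaForCausalLM

# Hypothetical local path, matching the --base_model argument in the command above.
model = LlamaForCausalLM.from_pretrained(
    "path_to_cached_hf_llama2-7b",
    torch_dtype=torch.float32,   # avoid fp16 ops while the model runs on CPU
    low_cpu_mem_usage=True,
)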

RuntimeError of test_speedup.py

Hi! I ran the following command:
python test_speedup.py --model_type pruneLLM --base_model ../llama-7b-hf --ckpt prune_log/llama_prune/pytorch_model.bin

and got the following error:
Warning: module Embedding is treated as a zero-op.
Warning: module LlamaRotaryEmbedding is treated as a zero-op.
Warning: module LlamaMLP is treated as a zero-op.
Warning: module LlamaDecoderLayer is treated as a zero-op.
Warning: module LlamaModel is treated as a zero-op.
Warning: module LlamaForCausalLM is treated as a zero-op.
Traceback (most recent call last):
File "test_speedup.py", line 91, in
main(args)
File "test_speedup.py", line 69, in main
macs, params = get_model_complexity_info(model, (64,), as_strings=True,
File "/opt/conda/lib/python3.8/site-packages/ptflops/flops_counter.py", line 30, in get_model_complexity_info
flops_count, params_count = get_flops_pytorch(model, input_res,
File "/opt/conda/lib/python3.8/site-packages/ptflops/pytorch_engine.py", line 44, in get_flops_pytorch
_ = flops_model(batch)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1128, in _call_impl
result = forward_call(*input, **kwargs)
File "/dfs/data/LLM-Pruner/LLMPruner/models/hf_llama/modeling_llama.py", line 689, in forward
outputs = self.model(
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/dfs/data/LLM-Pruner/LLMPruner/models/hf_llama/modeling_llama.py", line 532, in forward
inputs_embeds = self.embed_tokens(input_ids)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/sparse.py", line 158, in forward
return F.embedding(
File "/opt/conda/lib/python3.8/site-packages/torch/nn/functional.py", line 2156, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected tensor for argument #1 'indices' to have one of the following scalar types: Long, Int; but got torch.cuda.HalfTensor instead (while checking arguments for embedding)

Could you please help to have a look?
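A likely cause is that ptflops builds a float dummy input, while the embedding layer needs integer token ids. A hedged sketch of supplying the ids yourself (it assumes ptflops' input_constructor hook behaves as documented, and that `model` is the already-loaded pruned model from the script):

import torch
from ptflops import get_model_complexity_info

def make_token_ids(input_res):
    # input_res is the shape tuple passed below, e.g. (64,)
    device = next(model.parameters()).device
    return {"input_ids": torch.ones((1, *input_res), dtype=torch.long, device=device)}

macs, params = get_model_complexity_info(
    model, (64,), as_strings=True,
    input_constructor=make_token_ids,   # ptflops then calls model(**make_token_ids((64,)))
    print_per_layer_stat=False,
)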

Calculating Importance of 'param_mix'

Hello. First of all, thank you for sharing great research.

I have a question about calculating the importance of parameters.

In class TaylorImportance in hf_llama_pruner.py, line 274,

  1. Could you please tell me why the mixed-order importance is calculated as follows (a generic sketch of the underlying expansion is given after this list):
salience = salience - 0.5 * layer.weight * layer.weight.acc_grad * layer.weight

(and not as the sum of the 1st- and 2nd-order terms)?

  2. Is the higher-order term neglected?
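For reference, a generic sketch (not quoted from the paper) of the second-order Taylor expansion that mixed-order importance scores are usually derived from; removing a weight corresponds to setting \(w \to 0\):

\[
\Delta\mathcal{L} \;=\; \mathcal{L}(w{=}0) - \mathcal{L}(w)
\;\approx\; -\,g\,w \;+\; \tfrac{1}{2}\,h\,w^{2},
\qquad g=\frac{\partial\mathcal{L}}{\partial w},\quad
h \approx \text{a diagonal Hessian/Fisher estimate.}
\]

How the sign and the estimate of \(h\) (e.g. if acc_grad accumulates squared gradients) are combined in the code is exactly what the question above asks about.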

OSError: Can't load tokenizer for 'baffo32/decapoda-research-llama-7B-hf'.

bash scripts/llama_prune.sh

[START] - Start Pruning Model
Traceback (most recent call last):
File "/home/iotsc01/xinpengq/LLM-Pruner-main/hf_prune.py", line 314, in
main(args)
File "/home/iotsc01/xinpengq/LLM-Pruner-main/hf_prune.py", line 39, in main
tokenizer = LlamaTokenizer.from_pretrained(args.base_model)
File "/home/iotsc01/anaconda3/envs/xinpengq_env/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2012, in from_pretrained
raise EnvironmentError(
OSError: Can't load tokenizer for 'baffo32/decapoda-research-llama-7B-hf'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'baffo32/decapoda-research-llama-7B-hf' is the correct path to a directory containing all relevant files for a LlamaTokenizer tokenizer.
[FINISH] - Finish Pruning Model
[START] - Start Tuning
Traceback (most recent call last):
File "/home/iotsc01/xinpengq/LLM-Pruner-main/post_training.py", line 262, in
main(args)
File "/home/iotsc01/xinpengq/LLM-Pruner-main/post_training.py", line 33, in main
pruned_dict = torch.load(args.prune_model, map_location='cpu')
File "/home/iotsc01/anaconda3/envs/xinpengq_env/lib/python3.10/site-packages/torch/serialization.py", line 986, in load
with _open_file_like(f, 'rb') as opened_file:
File "/home/iotsc01/anaconda3/envs/xinpengq_env/lib/python3.10/site-packages/torch/serialization.py", line 435, in _open_file_like
return _open_file(name_or_buffer, mode)
File "/home/iotsc01/anaconda3/envs/xinpengq_env/lib/python3.10/site-packages/torch/serialization.py", line 416, in init
super().init(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: 'prune_log/llama_prune/pytorch_model.bin'
[FINISH] - Finish Prune and Post-Training.
[INFO] - The pruned model is at {prune_log/llama_prune/pytorch_model.bin}, and the recovery weight is at {tune_log/llama_0.2}/
You can use the command:
python generate.py --model_type tune_prune_LLM --ckpt prune_log/llama_prune/pytorch_model.bin --lora_ckpt tune_log/llama_0.2
to use the pruned model

Cannot import LlamaConfig

from transformers import LlamaConfig
ImportError: cannot import name 'LlamaConfig' from 'transformers' (/home/aelkordy/.local/lib/python3.8/site-packages/transformers/__init__.py)
(test) aelkordy@g1lmd1:/vault/aelkordy/NLP_projects/pruning/LLM-Pruner$ python hf_prune.py --pruning_ratio 0.25 --block_wise --block_mlp_layer_start 4 --block_mlp_layer_end 30 --block_attention_layer_start 4 --block_attention_layer_end 30 --pruner_type taylor --test_after_train --device cpu --eval_device cuda --save_ckpt_log_name llama_prune
Traceback (most recent call last):
File "hf_prune.py", line 14, in
from LLMPruner.models.hf_llama.modeling_llama import LlamaForCausalLM, LlamaRMSNorm, LlamaAttention, LlamaMLP
File "/vault/aelkordy/NLP_projects/pruning/LLM-Pruner/LLMPruner/models/hf_llama/modeling_llama.py", line 33, in
from transformers import LlamaConfig
ImportError: cannot import name

Reloading the pruned model failed

I used the default scripts in the README to prune vicuna-7b and got the pruned output "pytorch.bin",
but I cannot reload the pytorch.bin using .from_pretrained().

What should I do to run inference?

I tried both "AutoModelForCausalLM" and "LlamaForCausalLM"; neither of them worked.

# from transformers import (AutoModelForCausalLM, AutoTokenizer, )
from transformers import LlamaTokenizer
from LLMPruner.models.hf_llama.modeling_llama import LlamaForCausalLM
Traceback (most recent call last):
  File "/home/usr-fy/anaconda3/envs/FastChat/lib/python3.10/site-packages/transformers/modeling_utils.py", line 442, in load_state_dict
    return torch.load(checkpoint_file, map_location="cpu")
  File "/home/usr-fy/anaconda3/envs/FastChat/lib/python3.10/site-packages/torch/serialization.py", line 809, in load
    return _load(opened_zipfile, map_location, pickle_mod
  File "/home/usr-fy/anaconda3/envs/FastChat/lib/python3.10/site-packages/torch/serialization.py", line 1172, in _load
    result = unpickler.load()
  File "/home/usr-fy/anaconda3/envs/FastChat/lib/python3.10/site-packages/torch/serialization.py", line 1165, in find_class
    return super().find_class(mod_name, name)
ModuleNotFoundError: No module named 'LLMPruner'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/usr-fy/anaconda3/envs/FastChat/lib/python3.10/site-packages/transformers/modeling_utils.py", line 446, in load_state_dict
    if f.read(7) == "version":
  File "/home/usr-fy/anaconda3/envs/FastChat/lib/python3.10/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, f
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 64: invalid start byte

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/usr-fy/PycharmProjects/API_test/local_test.py", line 5, in <module>
    ini.load_model_tokenizer('/home/usr-fy/llama_prune')
  File "/home/usr-fy/PycharmProjects/API_test/InferenceAPI/InitModel.py", line 20, in load_model_tokenizer
    self.model = AutoModelForCausalLM.from_pretrained(
  File "/home/usr-fy/anaconda3/envs/FastChat/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 471, in from_pretrained
    return model_class.from_pretrained(
  File "/home/usr-fy/anaconda3/envs/FastChat/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2560, in from_pretrained
    state_dict = load_state_dict(resolved_archive_file)
  File "/home/usr-fy/anaconda3/envs/FastChat/lib/python3.10/site-packages/transformers/modeling_utils.py", line 458, in load_state_dict
    raise OSError(
OSError: Unable to load weights from pytorch checkpoint file for '/home/usr-fy/llama_prune/pytorch_model.bin' at '/home/usr-fy/llama_prune/pytorch_model.bin'. If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True.
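Not part of the original report, but a minimal sketch of loading the pruned checkpoint the way generate.py does elsewhere in this thread; it assumes a local clone of LLM-Pruner is on sys.path so the pickled LLMPruner classes can be resolved (from_pretrained cannot read this file, since it appears to be a pickled dict rather than a standard HF checkpoint):

import sys
import torch

sys.path.append("/path/to/LLM-Pruner")  # hypothetical clone path; makes the LLMPruner package importable

pruned_dict = torch.load("/home/usr-fy/llama_prune/pytorch_model.bin", map_location="cpu")
tokenizer, model = pruned_dict["tokenizer"], pruned_dict["model"]
model.eval()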

401 Client Error: Unauthorized for url: https://huggingface.co/decapoda-research/llama-7b-hf/resolve/main/tokenizer_config.json

bash scripts/llama_prune.sh
[START] - Start Pruning Model
Traceback (most recent call last):
File "/home/azuryl/anaconda3/envs/llamaprune/lib/python3.10/site-packages/huggingface_hub/utils/_errors.py", line 270, in hf_raise_for_status
response.raise_for_status()
File "/home/azuryl/anaconda3/envs/llamaprune/lib/python3.10/site-packages/requests/models.py", line 1021, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://huggingface.co/decapoda-research/llama-7b-hf/resolve/main/tokenizer_config.json

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/azuryl/anaconda3/envs/llamaprune/lib/python3.10/site-packages/transformers/utils/hub.py", line 389, in cached_file
resolved_file = hf_hub_download(
File "/home/azuryl/anaconda3/envs/llamaprune/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
return fn(*args, **kwargs)
File "/home/azuryl/anaconda3/envs/llamaprune/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1374, in hf_hub_download
raise head_call_error
File "/home/azuryl/anaconda3/envs/llamaprune/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1247, in hf_hub_download
metadata = get_hf_file_metadata(
File "/home/azuryl/anaconda3/envs/llamaprune/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
return fn(*args, **kwargs)
File "/home/azuryl/anaconda3/envs/llamaprune/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1624, in get_hf_file_metadata
r = _request_wrapper(
File "/home/azuryl/anaconda3/envs/llamaprune/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 402, in _request_wrapper
response = _request_wrapper(
File "/home/azuryl/anaconda3/envs/llamaprune/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 426, in _request_wrapper
hf_raise_for_status(response)
File "/home/azuryl/anaconda3/envs/llamaprune/lib/python3.10/site-packages/huggingface_hub/utils/_errors.py", line 320, in hf_raise_for_status
raise RepositoryNotFoundError(message, response) from e
huggingface_hub.utils._errors.RepositoryNotFoundError: 401 Client Error. (Request ID: Root=1-65813e67-47dd0288795e66057f2cb0d0;16159d73-3032-4e61-8a3f-18bc009609a8)

Repository Not Found for url: https://huggingface.co/decapoda-research/llama-7b-hf/resolve/main/tokenizer_config.json.
Please make sure you specified the correct repo_id and repo_type.
If you are trying to access a private or gated repo, make sure you are authenticated.
Invalid username or password.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/azuryl/project/LLM-Pruner/hf_prune.py", line 314, in
main(args)
File "/home/azuryl/project/LLM-Pruner/hf_prune.py", line 39, in main
tokenizer = LlamaTokenizer.from_pretrained(args.base_model)
File "/home/azuryl/anaconda3/envs/llamaprune/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1951, in from_pretrained
resolved_config_file = cached_file(
File "/home/azuryl/anaconda3/envs/llamaprune/lib/python3.10/site-packages/transformers/utils/hub.py", line 410, in cached_file
raise EnvironmentError(
OSError: decapoda-research/llama-7b-hf is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'
If this is a private repository, make sure to pass a token having permission to this repo either by logging in with huggingface-cli login or by passing token=<your_token>
[FINISH] - Finish Pruning Model
[START] - Start Tuning
Traceback (most recent call last):
File "/home/azuryl/project/LLM-Pruner/post_training.py", line 262, in
main(args)
File "/home/azuryl/project/LLM-Pruner/post_training.py", line 33, in main
pruned_dict = torch.load(args.prune_model, map_location='cpu')
File "/home/azuryl/anaconda3/envs/llamaprune/lib/python3.10/site-packages/torch/serialization.py", line 986, in load
with _open_file_like(f, 'rb') as opened_file:
File "/home/azuryl/anaconda3/envs/llamaprune/lib/python3.10/site-packages/torch/serialization.py", line 435, in _open_file_like
return _open_file(name_or_buffer, mode)
File "/home/azuryl/anaconda3/envs/llamaprune/lib/python3.10/site-packages/torch/serialization.py", line 416, in init
super().init(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: 'prune_log/llama_prune/pytorch_model.bin'
[FINISH] - Finish Prune and Post-Training.
[INFO] - The pruned model is at {prune_log/llama_prune/pytorch_model.bin}, and the recovery weight is at {tune_log/llama_0.2}/
You can use the command:
python generate.py --model_type tune_prune_LLM --ckpt prune_log/llama_prune/pytorch_model.bin --lora_ckpt tune_log/llama_0.2
to use the pruned model
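
A side note on the failure chain above: LlamaTokenizer.from_pretrained aborts because the decapoda-research/llama-7b-hf repository is no longer reachable on the Hugging Face Hub, so pruning never writes prune_log/llama_prune/pytorch_model.bin, and post_training.py then fails with the FileNotFoundError. A minimal sketch of the usual workaround, assuming you have access to another LLaMA-7B conversion (the repo id and local path below are examples, not project defaults):

# Point the loaders at an accessible LLaMA checkpoint instead of the removed mirror.
from transformers import LlamaForCausalLM, LlamaTokenizer

base_model = "huggyllama/llama-7b"  # example id; a local folder such as "/data/llama-7b-hf" also works
tokenizer = LlamaTokenizer.from_pretrained(base_model)
model = LlamaForCausalLM.from_pretrained(base_model, low_cpu_mem_usage=True)

Passing the same identifier to hf_prune.py via --base_model (together with --save_model) lets the pruning step finish and produce the checkpoint that post_training.py expects.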

Zero-shot Evaluation

Hi, thanks for your great work!
I wonder how to evaluate the pruned and fine-tuned model with lm-evaluation-harness?
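
A minimal sketch of how the pruned checkpoint and the LoRA recovery weights are typically recombined before evaluation, assuming the checkpoint dictionary stores the model and tokenizer the way generate.py consumes them and that the repo's bundled peft copy exposes PeftModel (both assumptions should be checked against your checkout):

import torch
from LLMPruner.peft import PeftModel  # import path assumed from the bundled peft copy

# Load the structurally pruned model and its tokenizer.
pruned_dict = torch.load("prune_log/llama_prune/pytorch_model.bin", map_location="cpu")
model, tokenizer = pruned_dict["model"], pruned_dict["tokenizer"]  # keys assumed

# Attach the recovery LoRA weights produced by post_training.py.
model = PeftModel.from_pretrained(model, "tune_log/llama_0.2")
model.eval()

The resulting (model, tokenizer) pair can then be wrapped by lm-evaluation-harness's Hugging Face adapter or fed to the repo's own evaluation script.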

Evaluation metric (acc vs. acc_norm) for lm-evaluation-harness tasks

Hi, thank you very much for generously open-sourcing your excellent work.

I've run the evaluation code you kindly shared and obtained the below results. I have a question regarding the metric for each task. Could you please clarify which one between acc or acc_norm [ref] was used for the PIQA, HellaSwag, ARC-e, ARC-c, and OBQA tasks? Thanks for taking the time to check this inquiry.

20%-pruned -> post-trained LLaMA from scripts/llama_prune.sh

Task BoolQ PIQA HellaSwag WinoGrande ARC-e ARC-c OBQA
LLM-Pruner paper 66.79 77.58 68.48 64.96 64.06 37.88 39.00
Reproduced (LLM-Pruner code): acc 65.20 77.15 53.19 63.69 64.77 36.26 28.80
Reproduced (LLM-Pruner): acc_norm n/a 76.93 68.63 n/a 52.27 36.95 40.40
Task BoolQ PIQA HellaSwag WinoGrande ARC-e ARC-c OBQA
Using different ver: acc 66.24 77.58 53.54 66.14 70.54 37.54 31.40
Using different ver: acc_norm n/a 78.13 71.39 n/a 65.95 39.33 41.20

Original LLaMA-7B

Task BoolQ PIQA HellaSwag WinoGrande ARC-e ARC-c OBQA
LLaMA paper 76.5 79.8 76.1 70.1 72.8 47.6 57.2
LLM-Pruner paper 73.18 78.35 72.99 67.01 67.45 41.38 42.40
Reproduced (LLM-Pruner code): acc 73.06 78.35 56.42 67.01 67.34 38.14 28.20
Reproduced (LLM-Pruner): acc_norm n/a 77.37 72.99 n/a 52.48 41.38 42.40
Task BoolQ PIQA HellaSwag WinoGrande ARC-e ARC-c OBQA
Using different ver: acc 75.05 78.67 56.93 69.93 75.29 41.89 34.60
Using different ver: acc_norm n/a 79.16 76.22 n/a 72.85 44.71 44.40

Note

  • Underlined: reported metrics in the LLM-Pruner paper
  • The scores reported in the LLM-Pruner paper are fully reproducible with this repo; the lm-evaluation-harness version changes the scores because of its recent updates (see the sketch after this list).
  • [Table 1 of the LLM-Pruner paper] The evaluation uses different prompts, so the scores are lower than the official results in the LLaMA paper.
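
As a rough cross-check between harness versions, both metrics can be read from the results JSON that lm-evaluation-harness writes; the {"results": {task: {metric: value}}} layout assumed below matches older harness releases and may differ in newer ones.

import json

with open("results.json") as f:  # path to the harness output is an example
    results = json.load(f)["results"]

for task, metrics in sorted(results.items()):
    acc = metrics.get("acc", "n/a")
    acc_norm = metrics.get("acc_norm", "n/a")
    print(f"{task:12s} acc={acc} acc_norm={acc_norm}")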

param_first and param_mix result in the same ppl

I simply run the following two commands:
python hf_prune.py --pruning_ratio 0.62785 --block_wise --block_mlp_layer_start 0 --block_mlp_layer_end 32 --block_attention_layer_start 32 --block_attention_layer_end 32 --pruner_type taylor --base_model /mnt/petrelfs/xxx/llama2-7b --device cpu --eval_device cuda --taylor param_first --save_ckpt_log_name llama_prune --save_model --num_examples 128
python hf_prune.py --pruning_ratio 0.62785 --block_wise --block_mlp_layer_start 0 --block_mlp_layer_end 32 --block_attention_layer_start 32 --block_attention_layer_end 32 --pruner_type taylor --base_model /mnt/petrelfs/xxx/llama2-7b --device cpu --eval_device cuda --taylor param_mix --save_ckpt_log_name llama_prune --save_model --num_examples 128
But they result in the same ppl.
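
One quick way to check whether the two criteria actually selected different structures (and not just ended up at the same perplexity) is to diff the saved checkpoints. The sketch below assumes the two runs were re-saved under distinct --save_ckpt_log_name values so they do not overwrite each other, and that the checkpoint stores the model under a "model" key as elsewhere in this repo:

import torch

first = torch.load("prune_log/llama_prune_param_first/pytorch_model.bin", map_location="cpu")["model"]
mix = torch.load("prune_log/llama_prune_param_mix/pytorch_model.bin", map_location="cpu")["model"]

for (name_a, p_a), (name_b, p_b) in zip(first.named_parameters(), mix.named_parameters()):
    if p_a.shape != p_b.shape:
        print(f"shape differs: {name_a} {tuple(p_a.shape)} vs {tuple(p_b.shape)}")
    elif not torch.equal(p_a, p_b):
        print(f"values differ: {name_a}")

Identical shapes and weights would mean the two importance criteria picked the same channels for this setting, which would explain the identical perplexity.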

Reproducing paper results

I ran LLM-Pruner with the command specified in the README to prune LLaMA-7B:

python hf_prune.py --pruning_ratio 0.25 \
      --block_wise \
      --block_mlp_layer_start 4 --block_mlp_layer_end 30 \
      --block_attention_layer_start 4 --block_attention_layer_end 30 \
      --pruner_type taylor \
      --test_after_train \
      --device cpu  --eval_device cuda \
      --save_ckpt_log_name llama_prune

I get the following results

#Param before: 6738415616, #Param after: 5422977024, Ratio = 80.4785%
PPL after pruning: {'wikitext2': 19.96819234893607, 'ptb': 80.37625124290746}

Perplexities reported in Table 1 of the paper are WikiText2 - 19.09 and PTB - 34.21. Is there any reason for the difference in these perplexities, especially PTB? Thanks
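
Part of such gaps (especially on PTB) can also come from how perplexity is measured, e.g. the evaluation sequence length and chunking, not only from the pruning itself. The generic sliding-window loop below is only for comparison; it is not the repo's PPLMetric, and the dataset name and window size are assumptions to adjust:

import torch
from datasets import load_dataset

def perplexity(model, tokenizer, texts, seq_len=128, device="cuda"):
    # Tokenize the whole split once, then score fixed-length, non-overlapping windows.
    ids = tokenizer("\n\n".join(texts), return_tensors="pt").input_ids.to(device)
    total_nll, total_tokens = 0.0, 0
    for start in range(0, ids.size(1), seq_len):
        chunk = ids[:, start:start + seq_len]
        if chunk.size(1) < 2:
            break
        with torch.no_grad():
            loss = model(chunk, labels=chunk).loss  # mean NLL over chunk.size(1) - 1 targets
        total_nll += loss.item() * (chunk.size(1) - 1)
        total_tokens += chunk.size(1) - 1
    return float(torch.exp(torch.tensor(total_nll / total_tokens)))

# Example usage with the pruned model and tokenizer already on the GPU:
# test_texts = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"]
# print(perplexity(model, tokenizer, test_texts, seq_len=128))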

Recovery training

When I start recovery training on Baichuan-7B, I hit the following error.

Exception has occurred: RuntimeError
Caught RuntimeError in replica 1 on device 1.
Original Traceback (most recent call last):
File "/opt/miniconda3/envs/flash/lib/python3.9/site-packages/torch/nn/parallel/parallel_apply.py", line 64, in _worker
output = module(*input, **kwargs)
File "/opt/miniconda3/envs/flash/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/jovyan/honor/xcg/LLM-Pruner-main/LLMPruner/peft/peft_model.py", line 664, in forward
return self.base_model(
File "/opt/miniconda3/envs/flash/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/jovyan/honor/xcg/LLM-Pruner-main/LLMPruner/models/hf_baichuan/baichuan7B/modeling_baichuan_7B.py", line 610, in forward
outputs = self.model(
File "/opt/miniconda3/envs/flash/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/jovyan/honor/xcg/LLM-Pruner-main/LLMPruner/models/hf_baichuan/baichuan7B/modeling_baichuan_7B.py", line 452, in forward
inputs_embeds = self.embed_tokens(input_ids)
File "/opt/miniconda3/envs/flash/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
result = forward_call(*input, **kwargs)
File "/opt/miniconda3/envs/flash/lib/python3.9/site-packages/torch/nn/modules/sparse.py", line 160, in forward
return F.embedding(
File "/opt/miniconda3/envs/flash/lib/python3.9/site-packages/torch/nn/functional.py", line 2210, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! (when checking argument for argument index in method wrapper__index_select)
File "/home/jovyan/honor/xcg/LLM-Pruner-main/post_training.py", line 199, in main
trainer.train(resume_from_checkpoint=args.resume_from_checkpoint)
File "/home/jovyan/honor/xcg/LLM-Pruner-main/post_training.py", line 244, in
main(args)

How can I fix it?
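
The mismatch usually comes from the Hugging Face Trainer wrapping the model in nn.DataParallel across both visible GPUs while parts of the pruned Baichuan model stay on a single device. A minimal workaround, assuming single-GPU recovery training is acceptable, is to hide the extra GPUs before torch is imported (a general CUDA mechanism, not something specific to post_training.py), e.g. at the very top of the training script:

import os

# Must run before torch / transformers are imported so only one device is visible
# and the Trainer never builds DataParallel replicas on cuda:1.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

If multi-GPU recovery is required, launching the script through torch.distributed or accelerate instead of relying on implicit DataParallel is the other common route.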
