mit-han-lab / offsite-tuning

Offsite-Tuning: Transfer Learning without Full Model

Home Page: https://arxiv.org/abs/2302.04870

License: MIT License

Shell 29.20% Python 70.80%

deep-learning transfer-learning

offsite-tuning's Issues

The authors should consider changing the term “adapter” to avoid potential confusion with adapter-tuning.

Great work with an elegant but effective idea! Thanks for sharing. However, I have a minor suggestion.

It is well known that in the LLM finetuning paradigm, adapter-tuning [1] is a popular approach: it inserts lightweight modules between transformer layers and updates only those modules on downstream tasks. In this work, the “adapters” the authors refer to are not such modules, but rather a selection of layers from the pretrained model. The authors are clearly aware of this overlap in terminology, as there are even combined experiments on offsite-tuning + adapter-tuning (Table 5).

Given that both approaches fall within the realm of parameter-efficient finetuning, I’d encourage the authors to find an alternative term for their “adapter” to avoid potential confusion and ambiguity.

A couple of preliminary examples I can come up with are “bridging/pluggable/relay/alignment/shared + layers/units/components.” Hope it helps!

[1] Houlsby et al., Parameter-efficient transfer learning for NLP. ICML 2019.
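
To make the terminology clash concrete, here is a toy sketch in plain Python that treats a model as a list of layer names. The adapter-tuning half follows the description above (new small modules inserted between frozen layers); the offsite-tuning half follows the `layers[:l] + layers[r:]` slicing quoted from utils.py in a later issue on this page. The function names, the layer count, and the `l` / `r_pad` values are illustrative assumptions, not taken from the paper or the repository.

```python
def adapter_tuning_trainables(layers):
    """Adapter-tuning style: insert a new small module after every frozen layer.

    Returns (full_stack, trainable_modules). Only the inserted modules train.
    """
    stack, trainable = [], []
    for i, layer in enumerate(layers):
        stack.append(layer)              # pretrained layer, kept frozen
        new_module = f"adapter_{i}"      # hypothetical inserted module
        stack.append(new_module)
        trainable.append(new_module)
    return stack, trainable


def offsite_tuning_trainables(layers, l, r_pad):
    """Offsite-tuning style: the 'adapter' is a subset of the existing layers."""
    r = len(layers) - r_pad
    adapter = layers[:l] + layers[r:]    # trainable; no new modules are added
    middle = layers[l:r]                 # the part that is not in the adapter
    return adapter, middle


layers = [f"layer_{i}" for i in range(8)]
stack, new_modules = adapter_tuning_trainables(layers)
adapter, middle = offsite_tuning_trainables(layers, l=2, r_pad=2)

print(new_modules)  # modules that did not exist in the pretrained model
print(adapter)      # layers that already existed in the pretrained model
```

In the first case the trainable parts are new parameters; in the second they are existing pretrained layers, which is the distinction the issue asks the authors to reflect in the name.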

Usage of Pile dataset to train the emulator

Hi,

I noticed that you trained the NLP emulator on the first 30 chunks of the Pile dataset. How large are those 30 chunks? In other words, how many chunks does the Pile have? The original Pile dataset is over 800 GB, which is too big for many labs...

Besides, did you try to use smaller datasets, such as Wikitext? What is the performance of using these smaller datasets?

Thanks

Evaluation metrics is only perplexity?

Hi!

Thanks for releasing the code. I have one question about the evaluation. The current version of the code seems to evaluate only perplexity. For Table 1 of the paper, shouldn't the metric be accuracy for most of the QA tasks? The current eval_harness.py appears to consider only perplexity.

NotImplementedError

offsite_tuning/run_image_classification.py

def to_teacher(model, args):
    l = args.student_l_pad
    print(type(model.model))
    if isinstance(model, OPTForCausalLM):
        r = len(model.model.decoder.layers) - args.student_r_pad
        model.model.decoder.layers = model.model.decoder.layers[
            :l] + model.teacher + model.model.decoder.layers[r:]
    elif isinstance(model, GPT2LMHeadModel):
        r = len(model.transformer.h) - args.student_r_pad
        model.transformer.h = model.transformer.h[:l] + \
            model.teacher + model.transformer.h[r:]
    elif isinstance(model, BloomForCausalLM):
        r = len(model.transformer.h) - args.student_r_pad
        model.transformer.h = model.transformer.h[:l] + \
            model.teacher + model.transformer.h[r:]
    elif isinstance(model, ViTForImageClassification):
        r = len(model.vit.encoder.layer) - args.student_r_pad
        model.vit.encoder.layer = model.vit.encoder.layer[:l] + \
            model.teacher + model.vit.encoder.layer[r:]
    elif isinstance(model, CLIPViTForImageClassification):
        r = len(model.vit.encoder.layers) - args.student_r_pad
        model.vit.encoder.layers = model.vit.encoder.layers[:l] + \
            model.teacher + model.vit.encoder.layers[r:]
    elif isinstance(model, EVAViTForImageClassification):
        r = len(model.blocks) - args.student_r_pad
        model.blocks = model.blocks[:l] + \
            model.teacher + model.blocks[r:]
    else:
        raise NotImplementedError

<class 'torch.nn.parallel.distributed.DistributedDataParallel'>
Traceback (most recent call last):
  File "offsite_tuning/run_image_classification.py", line 564, in <module>
    main()
  File "offsite_tuning/run_image_classification.py", line 413, in main
    model = to_teacher(model, args)
  File "/root/paddlejob/workspace/env_run/offsite-tuning-main/offsite_tuning/utils.py", line 714, in to_teacher
    raise NotImplementedError
NotImplementedError
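
The printed type suggests that a DistributedDataParallel wrapper is involved. If the object reaching the `isinstance` chain is a wrapper rather than one of the listed model classes, no branch matches and the `else` clause raises. The sketch below illustrates that failure mode and one way around it with stand-in classes; `FakeDDP`, `FakeViT`, and `unwrap` are hypothetical names for illustration only, and this is not a confirmed diagnosis of the report above.

```python
class FakeDDP:
    """Stand-in for a wrapper such as DistributedDataParallel (assumption)."""

    def __init__(self, module):
        self.module = module  # wrappers of this kind expose the inner model here


class FakeViT:
    """Stand-in for one of the model classes that to_teacher checks for."""


def unwrap(model, wrapper_types):
    """Peel off known wrapper types until the underlying model is reached."""
    while isinstance(model, wrapper_types):
        model = model.module
    return model


wrapped = FakeDDP(FakeViT())

# The wrapper itself does not match the model class, so an isinstance chain
# like the one in to_teacher would fall through to NotImplementedError.
print(isinstance(wrapped, FakeViT))

# After unwrapping, the same check succeeds.
print(isinstance(unwrap(wrapped, (FakeDDP,)), FakeViT))
```

In a script built on Hugging Face Accelerate, `accelerator.unwrap_model(model)` plays a similar role; whether calling it before `to_teacher` is the right fix for this repository is an assumption that would need checking against the call order in `main()`.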

RuntimeError: expected scalar type Half but found Float

Hi there,

I am currently trying to reproduce the experiments in the paper using your code. However, when running the quantization.sh script, I am encountering the following errors:

03/06/2023 21:58:50 - INFO - __main__ - Trainable parameter: model.decoder.layers.23.self_attn_layer_norm.weight with shape torch.Size([2048]) and dtype torch.float32
03/06/2023 21:58:50 - INFO - __main__ - Trainable parameter: model.decoder.layers.23.self_attn_layer_norm.bias with shape torch.Size([2048]) and dtype torch.float32
03/06/2023 21:58:50 - INFO - __main__ - Trainable parameter: model.decoder.layers.23.fc1.weight with shape torch.Size([8192, 2048]) and dtype torch.float32
03/06/2023 21:58:50 - INFO - __main__ - Trainable parameter: model.decoder.layers.23.fc1.bias with shape torch.Size([8192]) and dtype torch.float32
03/06/2023 21:58:50 - INFO - __main__ - Trainable parameter: model.decoder.layers.23.fc2.weight with shape torch.Size([2048, 8192]) and dtype torch.float32
03/06/2023 21:58:50 - INFO - __main__ - Trainable parameter: model.decoder.layers.23.fc2.bias with shape torch.Size([2048]) and dtype torch.float32
03/06/2023 21:58:50 - INFO - __main__ - Trainable parameter: model.decoder.layers.23.final_layer_norm.weight with shape torch.Size([2048]) and dtype torch.float32
03/06/2023 21:58:50 - INFO - __main__ - Trainable parameter: model.decoder.layers.23.final_layer_norm.bias with shape torch.Size([2048]) and dtype torch.float32
03/06/2023 21:58:55 - INFO - __main__ - ***** Running training *****
03/06/2023 21:58:55 - INFO - __main__ -   Num examples = 4700
03/06/2023 21:58:55 - INFO - __main__ -   Num Epochs = 10
03/06/2023 21:58:55 - INFO - __main__ -   Instantaneous batch size per device = 4
03/06/2023 21:58:55 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 20
03/06/2023 21:58:55 - INFO - __main__ -   Gradient Accumulation steps = 1
03/06/2023 21:58:55 - INFO - __main__ -   Total optimization steps = 2350
Traceback (most recent call last):
  File "/nfs-share/wz341/offsite-ori/offsite-tuning/offsite_tuning/run_clm.py", line 498, in <module>
    main()
  File "/nfs-share/wz341/offsite-ori/offsite-tuning/offsite_tuning/run_clm.py", line 353, in main
    _, teacher_zero_shot_perplexity = eval_epoch()
  File "/nfs-share/wz341/offsite-ori/offsite-tuning/offsite_tuning/run_clm.py", line 336, in eval_epoch
    outputs = model(**batch)
  File "/home/wz341/anaconda3/envs/offsite/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/wz341/anaconda3/envs/offsite/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1008, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/home/wz341/anaconda3/envs/offsite/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 969, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])
  File "/home/wz341/anaconda3/envs/offsite/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/wz341/anaconda3/envs/offsite/lib/python3.10/site-packages/transformers/models/opt/modeling_opt.py", line 932, in forward
    outputs = self.model.decoder(
  File "/home/wz341/anaconda3/envs/offsite/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/wz341/anaconda3/envs/offsite/lib/python3.10/site-packages/transformers/models/opt/modeling_opt.py", line 697, in forward
    layer_outputs = decoder_layer(
  File "/home/wz341/anaconda3/envs/offsite/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/wz341/anaconda3/envs/offsite/lib/python3.10/site-packages/transformers/models/opt/modeling_opt.py", line 323, in forward
    hidden_states = self.self_attn_layer_norm(hidden_states)
  File "/home/wz341/anaconda3/envs/offsite/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/wz341/anaconda3/envs/offsite/lib/python3.10/site-packages/torch/nn/modules/normalization.py", line 189, in forward
    return F.layer_norm(
  File "/home/wz341/anaconda3/envs/offsite/lib/python3.10/site-packages/torch/nn/functional.py", line 2503, in layer_norm
    return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled)
RuntimeError: expected scalar type Half but found Float
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4087765 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4087766 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4087767 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 3 (pid: 4087768) of binary: /home/wz341/anaconda3/envs/offsite/bin/python

I am unsure what is causing this issue and would appreciate any assistance in resolving it. Can you please advise on what steps I can take to fix this error and successfully run the quantization.sh script?

Thank you for your help!
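
The log above lists the layer-23 parameters as `torch.float32`, while the failing `layer_norm` call reports a Half/Float mismatch, so a first diagnostic step could be to see which parameters sit in which dtype. The helper below is a minimal sketch of that idea: `group_names_by_dtype` is a hypothetical name, and the sample entries are stand-ins (the float16 entry in particular is an assumption, not something shown in the log). It accepts any iterable of `(name, obj)` pairs where `obj` has a `.dtype` attribute, which is the shape of what PyTorch's `model.named_parameters()` yields.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_names_by_dtype(named_tensors):
    """Map each dtype (as a string) to the parameter names that use it."""
    groups = defaultdict(list)
    for name, tensor in named_tensors:
        groups[str(tensor.dtype)].append(name)
    return dict(groups)


# Stand-in data; with a real model one would pass model.named_parameters().
fake_params = [
    ("model.decoder.layers.0.fc1.weight", SimpleNamespace(dtype="torch.float16")),
    ("model.decoder.layers.23.self_attn_layer_norm.weight", SimpleNamespace(dtype="torch.float32")),
    ("model.decoder.layers.23.fc1.weight", SimpleNamespace(dtype="torch.float32")),
]

groups = group_names_by_dtype(fake_params)
for dtype, names in groups.items():
    print(dtype, len(names))
```

If more than one dtype shows up, the boundary between the groups is a reasonable place to look for the layer whose input and weights disagree.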

Need to add a Loop here?

In data.py, process_text2text_datasets

```python
def process_text2text_datasets(raw_datasets, args, tokenizer, accelerator):
    task = task_dict[args.dataset_name]

    column_names = raw_datasets["train"].column_names

    def tokenize_function(examples):
        context = task.get_context(examples)
        target = task.get_target(examples)

        context = tokenizer(context)
        target = tokenizer(target)

        # if context is ending with special token, remove it
        if len(context['input_ids'][0]) > 0 and context['input_ids'][0][-1] in tokenizer.all_special_ids:
            print('1')
            context['input_ids'] = [i[:-1] for i in context['input_ids']]
            context['attention_mask'] = [a[:-1]
                                         for a in context['attention_mask']]

        # if target is starting with special token, remove it
        if len(target['input_ids'][0]) > 0 and target['input_ids'][0][0] in tokenizer.all_special_ids:
            print('2')
            target['input_ids'] = [i[1:] for i in target['input_ids']]
            target['attention_mask'] = [a[1:]
                                        for a in target['attention_mask']]
```

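Do we need to add an outer loop around the special-token removal code, since `examples` is a batch of samples? As written, only the first sample (index 0) is checked before every sample in the batch is trimmed.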
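
One way to picture the per-sample variant the question asks about is the sketch below, which checks and trims each sequence individually instead of deciding from index 0. `strip_special_tokens`, its `side` argument, and the token id `2` are hypothetical names and values chosen for illustration; the function works on plain dicts of lists rather than a real tokenizer output.

```python
def strip_special_tokens(encoded, special_ids, side):
    """Remove at most one special token per sample, checking every sample.

    side="end" trims a trailing special token; side="start" trims a leading one.
    """
    ids_out, mask_out = [], []
    for ids, mask in zip(encoded["input_ids"], encoded["attention_mask"]):
        if side == "end":
            if len(ids) > 0 and ids[-1] in special_ids:
                ids, mask = ids[:-1], mask[:-1]
        else:
            if len(ids) > 0 and ids[0] in special_ids:
                ids, mask = ids[1:], mask[1:]
        ids_out.append(ids)
        mask_out.append(mask)
    return {"input_ids": ids_out, "attention_mask": mask_out}


special_ids = {2}  # hypothetical special-token id
context = {
    "input_ids": [[5, 6, 2], [7, 8]],        # only the first sample ends with it
    "attention_mask": [[1, 1, 1], [1, 1]],
}
print(strip_special_tokens(context, special_ids, "end"))
```

If the tokenizer adds the same special tokens to every sequence, checking only the first sample gives the same result as this loop, which may be why the original code does so; the per-sample check only matters when samples in a batch differ.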

LM loss is nan

I ran scripts/figure4/layerdrop.sh with the offsite-tuning code and hit this error:
RuntimeError: expected scalar type Half but found Float

I then modified offsite_tuning/utils.py, line 659, changing
model.adapter = layers[:l] + layers[r:]
to
model.adapter = layers[:l].half() + layers[r:].half()

That runs, but another problem appears:
Epoch 0 - Step 19 - LR: 1.90e-09 - LM loss: nan - KD loss: 0.0000: 1%
The LM loss is NaN.

Environment: Ubuntu 22.04, NVIDIA T4 GPU (16 GB).
model: facebook/opt-1.3b
dataset: wikitext-2-raw-v1
train_module: adapter

Can you help me find the cause of the problem?
Thanks.
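
As background for the report above, and not as a confirmed diagnosis, the sketch below shows how narrow the float16 range is, using only Python's standard `struct` module. Values above 65504 cannot be represented, and very small values such as 1e-8 (the default `eps` of `torch.optim.Adam`; whether this repository's optimizer uses that default is an assumption) round to zero. Either effect can lead to inf or NaN once weights are cast with `.half()`. `to_float16` is a hypothetical helper name.

```python
import struct


def to_float16(x):
    """Round-trip a Python float through IEEE 754 half precision."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


print(to_float16(65504.0))  # largest finite float16 value survives unchanged
print(to_float16(1e-8))     # too small for float16, rounds to zero

try:
    to_float16(70000.0)      # beyond the float16 range
except (OverflowError, struct.error) as exc:
    print("cannot represent 70000.0 in float16:", exc)
```

Mixed-precision training setups usually keep a float32 copy of the trainable weights for this reason, rather than casting the trainable layers themselves to half precision.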

The choose_layers_by_changes function is not implemented in the utils.py file.

The import of choose_layers_by_changes in run_clm.py appears to be broken, because utils.py does not define that function.

Suggestion to upload large files to cloud drive

Hi there,

I noticed that some of the files in this repository are quite large and it's causing issues with downloading them. I wanted to suggest that you consider uploading these files to a cloud drive like Google Drive or Dropbox so that users can download them more easily.

Thanks!

Downloading emulators/CLIP-ViT-H-14-laion2B-s32B-b79K/16_4_4/all_results.json (220 B)
Error downloading object: emulators/CLIP-ViT-H-14-laion2B-s32B-b79K/16_4_4/all_results.json (7b025d0): Smudge error: Error downloading emulators/CLIP-ViT-H-14-laion2B-s32B-b79K/16_4_4/all_results.json (7b025d0ce3e61c7118f81f02585ccfe75b281fffcde0847a9ef3680461daaaf3): batch response: This repository is over its data quota. Account responsible for LFS bandwidth should purchase more data packs to restore access.

Errors logged to /root/kyzhang/studio/offsite-tuning/.git/lfs/logs/20230329T110502.481473204.log
Use `git lfs logs last` to view the log.
error: external filter 'git-lfs filter-process' failed
fatal: emulators/CLIP-ViT-H-14-laion2B-s32B-b79K/16_4_4/all_results.json: smudge filter lfs failed

How to run distributed evaluation for big models used in this paper?

It seems that all the evaluation for LLMs is done on a single GPU. Can you suggest ways to run distributed evaluation?
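
For the data-parallel side of this question, one common pattern is to give each process a disjoint slice of the evaluation set and then combine per-process sums before computing perplexity. The sketch below shows only that bookkeeping in plain Python; `shard_indices` and `combine_perplexity` are hypothetical names, and nothing here is taken from the repository. It does not address models that are too large for one GPU, which would need the model itself to be split across devices.

```python
import math


def shard_indices(num_examples, rank, world_size):
    """Indices of the examples that process `rank` should evaluate."""
    return list(range(rank, num_examples, world_size))


def combine_perplexity(per_rank_stats):
    """Combine (sum of token negative log-likelihoods, token count) pairs.

    Perplexity is exp(total NLL / total tokens), so the sums must be pooled
    first; averaging per-rank perplexities would give a different number.
    """
    stats = list(per_rank_stats)
    total_nll = sum(nll for nll, _ in stats)
    total_tokens = sum(count for _, count in stats)
    return math.exp(total_nll / total_tokens)


# Four processes splitting ten examples.
for rank in range(4):
    print(rank, shard_indices(10, rank, 4))

# Two processes reporting their partial sums (made-up numbers).
print(combine_perplexity([(3 * math.log(2), 3), (5 * math.log(2), 5)]))
```

In a real multi-GPU run the pooling step would typically be a collective operation such as `torch.distributed.all_reduce` on the two sums.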
