mit-han-lab / offsite-tuning
Offsite-Tuning: Transfer Learning without Full Model
Home Page: https://arxiv.org/abs/2302.04870
License: MIT License
Great work with an elegant but effective idea! Thanks for sharing. However, I have a minor suggestion.
In the LLM finetuning paradigm, adapter-tuning [1] is a popular approach: lightweight modules are inserted between transformer layers, and only those modules are updated on downstream tasks. In this work, the “adapters” the authors refer to are not such modules, but rather a selection of layers from the pretrained model. The authors are clearly aware of this term overlap, as there are even combined experiments on offsite-tuning + adapter-tuning (Table 5).
Given that both approaches fall within the realm of parameter-efficient finetuning, I’d encourage the authors to find an alternative term for their “adapter” to avoid potential confusion and ambiguity.
A couple of preliminary examples I can come up with are “bridging/pluggable/relay/alignment/shared + layers/units/components.” Hope it helps!
[1] Houlsby et al., Parameter-efficient transfer learning for NLP. ICML 2019.
Hi,
I noticed that you trained the NLP emulator on the first 30 chunks of the Pile dataset. How large are these 30 chunks? In other words, how many chunks does the Pile have? The original Pile dataset is over 800 GB, which is too big for many labs...
Also, did you try smaller datasets, such as Wikitext? What is the performance when using these smaller datasets?
Thanks
Hi!
Thanks for releasing the code. I have one question about the evaluation. It seems that the current version of the code only evaluates perplexity. For example, in Table 1 of the paper, shouldn't the metric be accuracy for most QA tasks? The current eval_harness.py appears to consider only perplexity (ppl).
offsite_tuning/run_image_classification.py
def to_teacher(model, args):
    l = args.student_l_pad
    print(type(model.model))
    if isinstance(model, OPTForCausalLM):
        r = len(model.model.decoder.layers) - args.student_r_pad
        model.model.decoder.layers = model.model.decoder.layers[
            :l] + model.teacher + model.model.decoder.layers[r:]
    elif isinstance(model, GPT2LMHeadModel):
        r = len(model.transformer.h) - args.student_r_pad
        model.transformer.h = model.transformer.h[:l] + \
            model.teacher + model.transformer.h[r:]
    elif isinstance(model, BloomForCausalLM):
        r = len(model.transformer.h) - args.student_r_pad
        model.transformer.h = model.transformer.h[:l] + \
            model.teacher + model.transformer.h[r:]
    elif isinstance(model, ViTForImageClassification):
        r = len(model.vit.encoder.layer) - args.student_r_pad
        model.vit.encoder.layer = model.vit.encoder.layer[:l] + \
            model.teacher + model.vit.encoder.layer[r:]
    elif isinstance(model, CLIPViTForImageClassification):
        r = len(model.vit.encoder.layers) - args.student_r_pad
        model.vit.encoder.layers = model.vit.encoder.layers[:l] + \
            model.teacher + model.vit.encoder.layers[r:]
    elif isinstance(model, EVAViTForImageClassification):
        r = len(model.blocks) - args.student_r_pad
        model.blocks = model.blocks[:l] + \
            model.teacher + model.blocks[r:]
    else:
        raise NotImplementedError
<class 'torch.nn.parallel.distributed.DistributedDataParallel'>
Traceback (most recent call last):
File "offsite_tuning/run_image_classification.py", line 564, in <module>
main()
File "offsite_tuning/run_image_classification.py", line 413, in main
model = to_teacher(model, args)
File "/root/paddlejob/workspace/env_run/offsite-tuning-main/offsite_tuning/utils.py", line 714, in to_teacher
raise NotImplementedError
NotImplementedError
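The printed type suggests that a `DistributedDataParallel` wrapper is involved, so none of the `isinstance` checks match and the function falls through to `NotImplementedError`. A minimal sketch of one possible workaround, assuming the wrapper exposes the original model as `.module` (as `DistributedDataParallel` does); `unwrap_model` is a hypothetical helper, not part of the repository:

```python
def unwrap_model(model):
    """Follow `.module` links until the innermost model is reached.

    Hypothetical helper: wrappers such as DistributedDataParallel keep the
    wrapped model in a `.module` attribute.
    """
    while hasattr(model, "module"):
        model = model.module
    return model


# Stand-ins for a real model and its wrapper, for illustration only.
class TinyModel:
    pass


class Wrapper:
    def __init__(self, module):
        self.module = module


inner = TinyModel()
wrapped = Wrapper(Wrapper(inner))
print(type(unwrap_model(wrapped)).__name__)  # TinyModel
```

The type checks in `to_teacher` could then run on the unwrapped object. This assumes the model itself has no unrelated attribute named `module`.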
Hi there,
I am currently trying to reproduce the experiments in the paper using your code. However, when running the quantization.sh script, I am encountering the following errors:
03/06/2023 21:58:50 - INFO - __main__ - Trainable parameter: model.decoder.layers.23.self_attn_layer_norm.weight with shape torch.Size([2048]) and dtype torch.float32
03/06/2023 21:58:50 - INFO - __main__ - Trainable parameter: model.decoder.layers.23.self_attn_layer_norm.bias with shape torch.Size([2048]) and dtype torch.float32
03/06/2023 21:58:50 - INFO - __main__ - Trainable parameter: model.decoder.layers.23.fc1.weight with shape torch.Size([8192, 2048]) and dtype torch.float32
03/06/2023 21:58:50 - INFO - __main__ - Trainable parameter: model.decoder.layers.23.fc1.bias with shape torch.Size([8192]) and dtype torch.float32
03/06/2023 21:58:50 - INFO - __main__ - Trainable parameter: model.decoder.layers.23.fc2.weight with shape torch.Size([2048, 8192]) and dtype torch.float32
03/06/2023 21:58:50 - INFO - __main__ - Trainable parameter: model.decoder.layers.23.fc2.bias with shape torch.Size([2048]) and dtype torch.float32
03/06/2023 21:58:50 - INFO - __main__ - Trainable parameter: model.decoder.layers.23.final_layer_norm.weight with shape torch.Size([2048]) and dtype torch.float32
03/06/2023 21:58:50 - INFO - __main__ - Trainable parameter: model.decoder.layers.23.final_layer_norm.bias with shape torch.Size([2048]) and dtype torch.float32
03/06/2023 21:58:55 - INFO - __main__ - ***** Running training *****
03/06/2023 21:58:55 - INFO - __main__ - Num examples = 4700
03/06/2023 21:58:55 - INFO - __main__ - Num Epochs = 10
03/06/2023 21:58:55 - INFO - __main__ - Instantaneous batch size per device = 4
03/06/2023 21:58:55 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 20
03/06/2023 21:58:55 - INFO - __main__ - Gradient Accumulation steps = 1
03/06/2023 21:58:55 - INFO - __main__ - Total optimization steps = 2350
Traceback (most recent call last):
File "/nfs-share/wz341/offsite-ori/offsite-tuning/offsite_tuning/run_clm.py", line 498, in <module>
main()
File "/nfs-share/wz341/offsite-ori/offsite-tuning/offsite_tuning/run_clm.py", line 353, in main
_, teacher_zero_shot_perplexity = eval_epoch()
File "/nfs-share/wz341/offsite-ori/offsite-tuning/offsite_tuning/run_clm.py", line 336, in eval_epoch
outputs = model(**batch)
File "/home/wz341/anaconda3/envs/offsite/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/wz341/anaconda3/envs/offsite/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1008, in forward
output = self._run_ddp_forward(*inputs, **kwargs)
File "/home/wz341/anaconda3/envs/offsite/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 969, in _run_ddp_forward
return module_to_run(*inputs[0], **kwargs[0])
File "/home/wz341/anaconda3/envs/offsite/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/wz341/anaconda3/envs/offsite/lib/python3.10/site-packages/transformers/models/opt/modeling_opt.py", line 932, in forward
outputs = self.model.decoder(
File "/home/wz341/anaconda3/envs/offsite/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/wz341/anaconda3/envs/offsite/lib/python3.10/site-packages/transformers/models/opt/modeling_opt.py", line 697, in forward
layer_outputs = decoder_layer(
File "/home/wz341/anaconda3/envs/offsite/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/wz341/anaconda3/envs/offsite/lib/python3.10/site-packages/transformers/models/opt/modeling_opt.py", line 323, in forward
hidden_states = self.self_attn_layer_norm(hidden_states)
File "/home/wz341/anaconda3/envs/offsite/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/wz341/anaconda3/envs/offsite/lib/python3.10/site-packages/torch/nn/modules/normalization.py", line 189, in forward
return F.layer_norm(
File "/home/wz341/anaconda3/envs/offsite/lib/python3.10/site-packages/torch/nn/functional.py", line 2503, in layer_norm
return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled)
RuntimeError: expected scalar type Half but found Float
Traceback (most recent call last):
File "/nfs-share/wz341/offsite-ori/offsite-tuning/offsite_tuning/run_clm.py", line 498, in <module>
main()
File "/nfs-share/wz341/offsite-ori/offsite-tuning/offsite_tuning/run_clm.py", line 353, in main
_, teacher_zero_shot_perplexity = eval_epoch()
File "/nfs-share/wz341/offsite-ori/offsite-tuning/offsite_tuning/run_clm.py", line 336, in eval_epoch
outputs = model(**batch)
File "/home/wz341/anaconda3/envs/offsite/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/wz341/anaconda3/envs/offsite/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1008, in forward
output = self._run_ddp_forward(*inputs, **kwargs)
File "/home/wz341/anaconda3/envs/offsite/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 969, in _run_ddp_forward
return module_to_run(*inputs[0], **kwargs[0])
File "/home/wz341/anaconda3/envs/offsite/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/wz341/anaconda3/envs/offsite/lib/python3.10/site-packages/transformers/models/opt/modeling_opt.py", line 932, in forward
outputs = self.model.decoder(
File "/home/wz341/anaconda3/envs/offsite/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/wz341/anaconda3/envs/offsite/lib/python3.10/site-packages/transformers/models/opt/modeling_opt.py", line 697, in forward
layer_outputs = decoder_layer(
File "/home/wz341/anaconda3/envs/offsite/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/wz341/anaconda3/envs/offsite/lib/python3.10/site-packages/transformers/models/opt/modeling_opt.py", line 323, in forward
hidden_states = self.self_attn_layer_norm(hidden_states)
File "/home/wz341/anaconda3/envs/offsite/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/wz341/anaconda3/envs/offsite/lib/python3.10/site-packages/torch/nn/modules/normalization.py", line 189, in forward
return F.layer_norm(
File "/home/wz341/anaconda3/envs/offsite/lib/python3.10/site-packages/torch/nn/functional.py", line 2503, in layer_norm
return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled)
RuntimeError: expected scalar type Half but found Float
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4087765 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4087766 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4087767 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 3 (pid: 4087768) of binary: /home/wz341/anaconda3/envs/offsite/bin/python
I am unsure what is causing this issue and would appreciate any assistance in resolving it. Can you please advise on what steps I can take to fix this error and successfully run the quantization.sh script?
Thank you for your help!
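The error means that half-precision and float32 tensors met inside `layer_norm`; the log above does list the layer-norm parameters of layer 23 as `torch.float32`. A minimal sketch of a diagnostic that groups parameters by dtype, assuming only that each parameter has a `.dtype` attribute; `dtype_report` is a hypothetical helper and the parameters below are stand-ins rather than real tensors:

```python
from collections import defaultdict
from types import SimpleNamespace


def dtype_report(named_params):
    """Group parameter names by the string form of their dtype."""
    groups = defaultdict(list)
    for name, param in named_params:
        groups[str(param.dtype)].append(name)
    return dict(groups)


# Stand-in parameters; with PyTorch one would pass model.named_parameters().
params = [
    ("layers.0.fc1.weight", SimpleNamespace(dtype="torch.float16")),
    ("layers.23.fc1.weight", SimpleNamespace(dtype="torch.float32")),
    ("layers.23.fc1.bias", SimpleNamespace(dtype="torch.float32")),
]

for dtype, names in dtype_report(params).items():
    print(dtype, len(names), names[:2])
```

If the report shows two dtypes, the layers in the minority dtype are the ones to inspect first. Which side should be cast is something the maintainers would need to confirm.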
In data.py, process_text2text_datasets
```python
def process_text2text_datasets(raw_datasets, args, tokenizer, accelerator):
    task = task_dict[args.dataset_name]

    column_names = raw_datasets["train"].column_names

    def tokenize_function(examples):
        context = task.get_context(examples)
        target = task.get_target(examples)

        context = tokenizer(context)
        target = tokenizer(target)

        # if context is ending with special token, remove it
        if len(context['input_ids'][0]) > 0 and context['input_ids'][0][-1] in tokenizer.all_special_ids:
            print('1')
            context['input_ids'] = [i[:-1] for i in context['input_ids']]
            context['attention_mask'] = [a[:-1]
                                         for a in context['attention_mask']]

        # if target is starting with special token, remove it
        if len(target['input_ids'][0]) > 0 and target['input_ids'][0][0] in tokenizer.all_special_ids:
            print('2')
            target['input_ids'] = [i[1:] for i in target['input_ids']]
            target['attention_mask'] = [a[1:]
                                        for a in target['attention_mask']]
```
Do we need an outer loop around the special-token removal code, since "examples" is a batch of samples?
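The check in the snippet inspects only the first sample of the batch (index `[0]`) and then trims every sample. A minimal sketch of a per-sample variant, assuming the tokenizer output is a dict of lists of token ids; `strip_trailing_special` and the token ids below are hypothetical:

```python
def strip_trailing_special(batch, special_ids):
    """Drop the last token of each sequence only if that token is special."""
    input_ids, attention_mask = [], []
    for ids, mask in zip(batch["input_ids"], batch["attention_mask"]):
        if ids and ids[-1] in special_ids:
            ids, mask = ids[:-1], mask[:-1]
        input_ids.append(ids)
        attention_mask.append(mask)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up ids: 2 plays the role of an end-of-sequence token.
batch = {
    "input_ids": [[5, 6, 2], [7, 8], []],
    "attention_mask": [[1, 1, 1], [1, 1], []],
}
print(strip_trailing_special(batch, special_ids={2}))
# {'input_ids': [[5, 6], [7, 8], []], 'attention_mask': [[1, 1], [1, 1], []]}
```

For a given tokenizer the two versions will often agree, since a tokenizer typically appends a special token to every sequence or to none. The per-sample form simply avoids relying on that and also handles empty sequences anywhere in the batch.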
I used the offsite-tuning code to run scripts/figure4/layerdrop.sh and got the following error:
RuntimeError: expected scalar type Half but found Float
I then modified offsite_tuning/utils.py, line 659, changing
model.adapter = layers[:l] + layers[r:]
to
model.adapter = layers[:l].half() + layers[r:].half()
With this change the script runs, but another problem appears:
Epoch 0 - Step 19 - LR: 1.90e-09 - LM loss: nan - KD loss: 0.0000: 1%
The LM loss is NaN.
My setup:
system: Ubuntu 22.04; GPU: NVIDIA T4, 16 GB
model: facebook/opt-1.3b
dataset: wikitext-2-raw-v1
train_module: adapter
Can you help me find the cause of the problem?
Thanks.
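A NaN loss after casting layers with `.half()` is often a numerical-range problem rather than a logic bug, though nothing in the report confirms that this is the cause here. A minimal sketch of float16's limits using only the standard library (the `struct` format `'e'` is IEEE half precision); `to_fp16` is a hypothetical helper:

```python
import math
import struct


def to_fp16(x):
    """Round a Python float to float16 and back; overflow becomes +/-inf."""
    try:
        return struct.unpack("e", struct.pack("e", x))[0]
    except (OverflowError, struct.error):
        return math.copysign(math.inf, x)


print(to_fp16(65504.0))  # largest finite float16 value
print(to_fp16(70000.0))  # beyond the float16 range
print(to_fp16(1.9e-9))   # far below the smallest float16 subnormal
print(to_fp16(70000.0) - to_fp16(70000.0))  # inf - inf
```

Values that overflow to infinity turn into NaN as soon as two infinities are subtracted, and very small updates vanish entirely. Mixed-precision training usually keeps float32 master weights and scales the loss for this reason; whether that applies to this script is for the maintainers to say.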
Hi there,
I noticed that some of the files in this repository are quite large and it's causing issues with downloading them. I wanted to suggest that you consider uploading these files to a cloud drive like Google Drive or Dropbox so that users can download them more easily.
Thanks!
Downloading emulators/CLIP-ViT-H-14-laion2B-s32B-b79K/16_4_4/all_results.json (220 B)
Error downloading object: emulators/CLIP-ViT-H-14-laion2B-s32B-b79K/16_4_4/all_results.json (7b025d0): Smudge error: Error downloading emulators/CLIP-ViT-H-14-laion2B-s32B-b79K/16_4_4/all_results.json (7b025d0ce3e61c7118f81f02585ccfe75b281fffcde0847a9ef3680461daaaf3): batch response: This repository is over its data quota. Account responsible for LFS bandwidth should purchase more data packs to restore access.
Errors logged to /root/kyzhang/studio/offsite-tuning/.git/lfs/logs/20230329T110502.481473204.log
Use `git lfs logs last` to view the log.
error: external filter 'git-lfs filter-process' failed
fatal: emulators/CLIP-ViT-H-14-laion2B-s32B-b79K/16_4_4/all_results.json: smudge filter lfs failed
It seems all the evaluations for LLMs are done on a single GPU. Can you suggest ways to run distributed evaluation?
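One generic way to distribute a perplexity evaluation is to give each process a disjoint slice of the evaluation set and to combine per-process loss sums at the end. A minimal sketch of that bookkeeping in plain Python, not taken from the repository; `shard_indices` and `combine_perplexity` are hypothetical names, and the numbers are made up:

```python
import math


def shard_indices(num_examples, rank, world_size):
    """Indices of the examples that process `rank` should evaluate."""
    return list(range(rank, num_examples, world_size))


def combine_perplexity(partials):
    """Perplexity from per-process (summed negative log-likelihood, token count) pairs."""
    total_nll = sum(nll for nll, _ in partials)
    total_tokens = sum(count for _, count in partials)
    return math.exp(total_nll / total_tokens)


print(shard_indices(10, rank=1, world_size=4))  # [1, 5, 9]
print(combine_perplexity([(20.0, 10), (40.0, 20)]))  # exp(60 / 30) = exp(2)
```

Summing losses and token counts before dividing keeps the result correct when shards contain different numbers of tokens. In a real multi-GPU run the two sums would have to be gathered across processes, for example with `torch.distributed.all_reduce`.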