Comments (34)
I just sent a fix as we discussed; hope it's helpful for you @janelu9, and thanks for the help from @loadams!
Thanks for your help, @inkcherry
@inkcherry, this is greatly appreciated. Thanks!
Hi @janelu9 - could you let us know what baseline DeepSpeed version you are comparing against? And can you confirm this is the same issue as #4881?
This version is faster: https://github.com/microsoft/DeepSpeed/files/13825076/deepspeed-0.12.4.zip
Thanks - so something between 0.12.4 and 0.12.5. I'll git bisect between those and see if I can tell. Could you share your torch version as well, so I'm running the same tests? And do you have a small repro case, or one from DeepSpeed Examples?
My torch version is 2.1.2+cu118. I don't have a published repro; I test with Llama-7B in the pipeline engine with 8 stages, on one node with 8 A100-40G GPUs.
@janelu9 - I'll probably have to test a repro with V100s and a smaller model, and it is possible that it won't repro there in the same way.
If it is easy for you to run your test, could you try
pip install git+https://github.com/microsoft/DeepSpeed.git@a7900bcc3d2f7789cc734aa28a11d2f3b3d8b04f
to help bisect the search space?
Well, this version is the faster one: its speed was 10.3 samples/sec, while the latest one was 9.9 samples/sec. Additionally, I use flash-attn in my model code.
The one installed from the git link was also faster? If so, could you also try this one?
pip install git+https://github.com/microsoft/DeepSpeed.git@b83b1c2e1c4dc4c91c4ad78773dc2232ca9f7070
I suspect the issue may come from transformers, since we updated our transformers requirement in the 0.12.4-to-0.12.5 timeframe. If you're able to test with the above version and let us know whether it shows the performance degradation, that would help. Thanks!
This version is the fastest so far; its average speed is about 10.4 samples/sec.
transformers==4.38.1
Thanks @janelu9 -
If you're able, let's try this one next:
pip install git+https://github.com/microsoft/DeepSpeed.git@449e454f83bb6a14b0de359660d4b206d5c3feed
This only contains a few additional fixes, but I want to make sure they don't cause an impact. The remaining PRs do change some user-args behavior, so if this still does not reproduce the performance degradation, we can look into that.
Thanks!
This version runs at about 10.25 samples/sec, faster than the latest.
Thanks, can you test this as well?
pip install git+https://github.com/microsoft/DeepSpeed.git@65b7727758a4c0ee08597a88ab4f051abcfc2a8a
That should only contain two additional PRs that refactor user-arg parsing, so it should not impact anything, but it would be good to run the test as well. Thanks!
This version doesn't show the slowdown either.
Thanks, that should have all the changes from v0.12.5. Can you test with this one as well to confirm, since it should be after we updated the tag:
pip install git+https://github.com/microsoft/DeepSpeed.git@4d866bd55a6b2b924987603b599c1f8f35911c4b
What's puzzling is that every version with 0.12.4 < deepspeed <= 0.12.5 now runs at the same speed (maybe it is also related to torch), but versions >= 0.12.6 are still slower.
This version is faster.
@loadams Are you tracing the problem using the idea of binary search?
@janelu9 that's correct, but we've now completed the entire search space between 0.12.4 and 0.12.5 without seeing a significant slowdown, correct?
Yes, none of them shows a significant slowdown.
@loadams please give me all the commits between 0.12.5 and 0.12.6, ordered by commit time. I'd like to test them to find where the problem is.
@janelu9 - here is that list of commits: v0.12.5...v0.12.6
It looks to be only 13 commits, so it should be fairly easy to binary-search that space. Please let me know if you need anything else.
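(Editor's note: the manual process the two are walking through above is a binary search over the ordered commit list. It can be sketched in a few lines of Python, assuming a hypothetical is_slow(commit) predicate that installs a given commit and measures throughput:)

```python
def first_slow_commit(commits, is_slow):
    """Binary-search an ordered commit list for the first slow commit.

    Assumes a monotonic regression: every commit before the culprit is
    fast, and the culprit and everything after it are slow.
    """
    lo, hi = 0, len(commits) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_slow(commits[mid]):
            hi = mid        # culprit is at mid or earlier
        else:
            lo = mid + 1    # culprit is strictly after mid
    return commits[lo]

# Toy run: 13 commits, pretend the regression landed at "c5".
commits = [f"c{i}" for i in range(13)]
print(first_slow_commit(commits, lambda c: int(c[1:]) >= 5))  # -> c5
```

With 13 commits this needs only about four timed runs instead of thirteen; git bisect automates the same bookkeeping over the real commit history.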
Well, I have found the key commit that leads to the slowdown:
https://github.com/microsoft/DeepSpeed/tree/d5a7c1e0b494fbd0958bf8274bde0bacb2c16854
The speeds of all versions after this commit are only 9.8~9.9 samples/sec, while earlier versions reach 10.24+ samples/sec, tested on the same GPU node.
@loadams, please check it out!
Thanks. @inkcherry, any idea why this might be slower? Nothing in !4318 appears to cause a slowdown. And to confirm, @janelu9, you don't have graph harvesting enabled?
hi, @loadams @janelu9
I use 8 * A100 80G with this script: https://github.com/microsoft/Megatron-DeepSpeed/blob/main/examples_deepspeed/pretrain_llama2_distributed.sh
I only changed the setting from pp=2 tp=2 to pp=8 tp=1. (Of course, this script is not the best configuration for this hardware; it's just for comparing performance before and after the modification.) It seems that there is no performance issue. (I didn't enable graph harvesting.)
Could you share more parameters and configuration information about your workload?
run 1:
- log with ds before this commit:
steps: 50 loss: 8.4555 iter time (s): 1.464 samples/sec: 21.860
steps: 60 loss: 8.4218 iter time (s): 1.456 samples/sec: 21.975
steps: 70 loss: 8.1501 iter time (s): 1.460 samples/sec: 21.916
- log with ds on this commit:
steps: 50 loss: 8.4555 iter time (s): 1.427 samples/sec: 22.431
steps: 60 loss: 8.4218 iter time (s): 1.433 samples/sec: 22.335
steps: 70 loss: 8.1501 iter time (s): 1.400 samples/sec: 22.862
run 2:
- log with ds before this commit:
steps: 50 loss: 8.4555 iter time (s): 1.448 samples/sec: 22.106
steps: 60 loss: 8.4218 iter time (s): 1.425 samples/sec: 22.458
steps: 70 loss: 8.1501 iter time (s): 1.446 samples/sec: 22.124
- log with ds on this commit:
steps: 50 loss: 8.4555 iter time (s): 1.447 samples/sec: 22.121
steps: 60 loss: 8.4218 iter time (s): 1.443 samples/sec: 22.173
steps: 70 loss: 8.1501 iter time (s): 1.457 samples/sec: 21.965
I'm not sure whether I enabled graph harvesting; how can I confirm it? I just build the model using PipelineModule and train with bfloat16:
topo = ProcessTopology(['data','pipe','model'], [1,8,1])
@janelu9 It's not easy for me to reproduce your model. Could you try this branch? Will it solve your issue? https://github.com/inkcherry/DeepSpeed/tree/for_5175
Sorry, it's incompatible for me; my torch version is 2.1.2+cu118.
@janelu9 Thank you for trying it. That branch is consistent with the key commit version you mentioned; you can see in the commit list that there is only one change, a single commit where I added a patch. Could you try installing DeepSpeed with:
git clone https://github.com/inkcherry/DeepSpeed.git
pip uninstall deepspeed ; cd DeepSpeed ; git checkout for_5175; python setup.py install
can't import torch._six
Do you have any other ideas? Should I send you my training code? Unfortunately, my company doesn't allow that.
- https://github.com/inkcherry/DeepSpeed/tree/for_5175
- https://github.com/microsoft/DeepSpeed/tree/d5a7c1e0b494fbd0958bf8274bde0bacb2c16854

Hi, @janelu9
I'm not sure why one of these two versions runs smoothly while the other hits the "can't import torch._six" issue; this shouldn't happen if you've correctly installed both versions (or maybe I missed some installation step that causes this problem? @loadams).
I don't think this commit made significant changes to the default computation path. If it affects performance, the only possible change is the switch from setting grad to None to calling grad.zero_(). Neither operation should significantly impact time, but the difference could affect memory usage and, consequently, performance.
Could you please try modifying your installed copy? In the file deepspeed/runtime/bf16_optimizer.py, change this function:

def clear_lp_grads(self):
    for group in self.bf16_groups:
        for param in group:
            if param.grad is not None:
                # Using zero_() fixed memory address for graph replay
                param.grad.zero_()

to:

def clear_lp_grads(self):
    for group in self.bf16_groups:
        for param in group:
            param.grad = None
Yeah, that's where the key is. It obviously speeds up, from 9.97 to 10.24 samples/sec, after I changed it.
@loadams @inkcherry
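(Editor's note: the reason such a small change matters is that param.grad = None drops the reference to the gradient buffer, so its memory can be reclaimed between steps, while param.grad.zero_() overwrites the buffer in place and keeps it allocated at a fixed address, which CUDA-graph replay requires. A torch-free sketch of the two strategies, using a hypothetical Param class as a stand-in for a parameter:)

```python
class Param:
    """Hypothetical stand-in for a parameter holding a gradient buffer."""
    def __init__(self, n):
        self.grad = bytearray(n)  # stand-in for a gradient tensor

def clear_by_zero(params):
    # In-place zeroing: the buffer keeps the same address (graph replay
    # needs this), but its memory stays allocated between steps.
    for p in params:
        if p.grad is not None:
            p.grad[:] = bytes(len(p.grad))

def clear_by_none(params):
    # Dropping the reference: the buffer becomes collectable, lowering
    # peak memory, but the next step must allocate a fresh buffer.
    for p in params:
        p.grad = None

params = [Param(1024) for _ in range(4)]
params[0].grad[0] = 7

buf = params[0].grad
clear_by_zero(params)
assert params[0].grad is buf                 # same buffer object survives
assert all(b == 0 for b in params[0].grad)   # contents zeroed

clear_by_none(params)
assert all(p.grad is None for p in params)   # buffers released
```

In PyTorch proper the same trade-off is exposed by optimizer.zero_grad(set_to_none=True) versus set_to_none=False; which variant is faster depends on allocator behavior and memory pressure, consistent with the regression appearing only on some workloads.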
Thanks @inkcherry for the debug work on this. Please tag me here when you have a PR with a fix, and we can work on getting it tested and merged!