Comments (34)
I just sent a fix as we discussed; hope it's helpful for you @janelu9, and thanks for the help from @loadams!
Thanks for your help, @inkcherry
@inkcherry, this is greatly appreciated. Thanks!
Hi @janelu9 - could you let us know what baseline DeepSpeed version you are comparing against? And can you confirm this is the same issue as #4881?
This version is faster: https://github.com/microsoft/DeepSpeed/files/13825076/deepspeed-0.12.4.zip
Thanks - so something between 0.12.4 and 0.12.5. I'll git bisect between those and see if I can tell. Could you share your torch version as well, so I'm running the same tests? And do you have a small repro case, or one from DeepSpeed Examples?
My torch version is 2.1.2+cu118. I don't have a published repro; I test with Llama-7B in the pipeline engine with 8 stages, on one node with 8 A100-40G GPUs.
@janelu9 - I'll probably have to test a repro with V100s and a smaller model, and it is possible that it won't repro there in the same way.
If it is easy for you to run your test, could you try
pip install git+https://github.com/microsoft/DeepSpeed.git@a7900bcc3d2f7789cc734aa28a11d2f3b3d8b04f
to help bisect the search space?
Well, this version is the faster one: its speed was 10.3 samples/sec, while the latest one was 9.9 samples/sec. Additionally, I use flash-attn in my model code.
The one installed from the git link was also faster? If so, could you also try this one?
pip install git+https://github.com/microsoft/DeepSpeed.git@b83b1c2e1c4dc4c91c4ad78773dc2232ca9f7070
I suspect the issue may come from transformers, since we updated our transformers requirement in the 0.12.4-to-0.12.5 timeframe. If you're able to test with the above version and let us know whether it shows the performance degradation, that would help. Thanks!
This version is the fastest so far; its average speed is about 10.4 samples/sec.
transformers==4.38.1
Thanks @janelu9 -
If you're able, let's try this one next:
pip install git+https://github.com/microsoft/DeepSpeed.git@449e454f83bb6a14b0de359660d4b206d5c3feed
This only contains a few additional fixes, but I want to make sure they don't cause an impact. The remaining PRs do change some user-args behavior, so if this still does not reproduce the performance degradation, we can look into that.
Thanks!
This version runs at about 10.25 samples/sec, faster than the latest.
Thanks, can you test this as well?
pip install git+https://github.com/microsoft/DeepSpeed.git@65b7727758a4c0ee08597a88ab4f051abcfc2a8a
That should only contain two additional PRs that refactor user-arg parsing, so it should not impact anything, but it would be good to run the test as well. Thanks!
This version doesn't show the slowdown either.
Thanks, that should have all the changes from v0.12.5. Can you test with this one as well to confirm, since it should be after we updated the tag:
pip install git+https://github.com/microsoft/DeepSpeed.git@4d866bd55a6b2b924987603b599c1f8f35911c4b
What's puzzling is that every version with 0.12.4 < deepspeed <= 0.12.5 now runs at the same speed (maybe it is also related to torch), but versions >= 0.12.6 are still slower.
This version is faster.
@loadams Are you tracing the problem using the idea of binary search?
@janelu9 that's correct, but we've now completed the entire search space between 0.12.4 and 0.12.5 without seeing a significant slowdown, correct?
Yes, none of them shows a significant slowdown.
@loadams please give me all the commits between 0.12.5 and 0.12.6, ordered by commit time. I'd like to test them to find where the problem is.
@janelu9 - here is that list of commits: v0.12.5...v0.12.6
It looks to be only 13 commits, so it should be fairly easy to binary-search that space. Please let me know if you need anything else.
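(Editor's note: the manual process the two are walking through above is a binary search over the ordered commit list. It can be sketched in a few lines of Python, assuming a hypothetical is_slow(commit) predicate that installs a given commit and measures throughput:)

```python
def first_slow_commit(commits, is_slow):
    """Binary-search an ordered commit list for the first slow commit.

    Assumes a monotonic regression: every commit before the culprit is
    fast, and the culprit and everything after it are slow.
    """
    lo, hi = 0, len(commits) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_slow(commits[mid]):
            hi = mid        # culprit is at mid or earlier
        else:
            lo = mid + 1    # culprit is strictly after mid
    return commits[lo]

# Toy run: 13 commits, pretend the regression landed at "c5".
commits = [f"c{i}" for i in range(13)]
print(first_slow_commit(commits, lambda c: int(c[1:]) >= 5))  # -> c5
```

With 13 commits this needs only about four timed runs instead of thirteen; git bisect automates the same bookkeeping over the real commit history.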
Well, I have found the key commit that leads to the slowdown:
https://github.com/microsoft/DeepSpeed/tree/d5a7c1e0b494fbd0958bf8274bde0bacb2c16854
The speeds of all versions after this commit are only 9.8~9.9 samples/sec, while earlier versions reach 10.24+ samples/sec, tested on the same GPU node.
@loadams, please check it out!
Thanks. @inkcherry, any idea why this might be slower? Nothing in !4318 appears to cause a slowdown. And to confirm, @janelu9, you don't have graph harvesting enabled?
hi, @loadams @janelu9
I use 8 * A100 80G with this script: https://github.com/microsoft/Megatron-DeepSpeed/blob/main/examples_deepspeed/pretrain_llama2_distributed.sh
I only changed the setting from pp=2 tp=2 to pp=8 tp=1. (Of course, this script is not the best configuration for this hardware; it's just for comparing performance before and after the modification.) It seems that there is no performance issue. (I didn't enable graph harvesting.)
Could you share more parameters and configuration information about your workload?
run 1:
- log with ds before this commit:
steps: 50 loss: 8.4555 iter time (s): 1.464 samples/sec: 21.860
steps: 60 loss: 8.4218 iter time (s): 1.456 samples/sec: 21.975
steps: 70 loss: 8.1501 iter time (s): 1.460 samples/sec: 21.916
- log with ds on this commit:
steps: 50 loss: 8.4555 iter time (s): 1.427 samples/sec: 22.431
steps: 60 loss: 8.4218 iter time (s): 1.433 samples/sec: 22.335
steps: 70 loss: 8.1501 iter time (s): 1.400 samples/sec: 22.862
run 2:
- log with ds before this commit:
steps: 50 loss: 8.4555 iter time (s): 1.448 samples/sec: 22.106
steps: 60 loss: 8.4218 iter time (s): 1.425 samples/sec: 22.458
steps: 70 loss: 8.1501 iter time (s): 1.446 samples/sec: 22.124
- log with ds on this commit:
steps: 50 loss: 8.4555 iter time (s): 1.447 samples/sec: 22.121
steps: 60 loss: 8.4218 iter time (s): 1.443 samples/sec: 22.173
steps: 70 loss: 8.1501 iter time (s): 1.457 samples/sec: 21.965
I'm not sure whether I enabled graph harvesting; how can I confirm it? I just build the model using PipelineModule and train with bfloat16:
topo = ProcessTopology(['data','pipe','model'], [1,8,1])
@janelu9 It's not easy for me to reproduce your model. Could you try this branch? Will it solve your issue? https://github.com/inkcherry/DeepSpeed/tree/for_5175
Sorry, it's incompatible for me; my torch version is 2.1.2+cu118.
@janelu9 Thank you for trying it. That branch is consistent with the key commit version you mentioned; you can see in the commit list that there is only one change, a single commit where I added a patch. Could you try installing DeepSpeed with:
git clone https://github.com/inkcherry/DeepSpeed.git
pip uninstall deepspeed ; cd DeepSpeed ; git checkout for_5175; python setup.py install
can't import torch._six
Do you have any other ideas? Should I send you my training code? Unfortunately, my company doesn't allow that.
- https://github.com/inkcherry/DeepSpeed/tree/for_5175
- https://github.com/microsoft/DeepSpeed/tree/d5a7c1e0b494fbd0958bf8274bde0bacb2c16854

Hi, @janelu9
I'm not sure why one of these two versions runs smoothly while the other hits the "can't import torch._six" issue; this shouldn't happen if you've correctly installed both versions (or maybe I missed some installation step that causes this problem? @loadams).
I don't think this commit made significant changes to the default computation path. If it affects performance, the only possible change is the switch from setting grad to None to calling grad.zero_(). Neither operation should significantly impact time, but the difference could affect memory usage and, consequently, performance.
Could you please try modifying your installed copy? In the file deepspeed/runtime/bf16_optimizer.py, change this function:

def clear_lp_grads(self):
    for group in self.bf16_groups:
        for param in group:
            if param.grad is not None:
                # Using zero_() fixed memory address for graph replay
                param.grad.zero_()

to:

def clear_lp_grads(self):
    for group in self.bf16_groups:
        for param in group:
            param.grad = None
Yeah, that's where the key is. It obviously speeds up, from 9.97 to 10.24 samples/sec, after I changed it.
@loadams @inkcherry
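(Editor's note: the reason such a small change matters is that param.grad = None drops the reference to the gradient buffer, so its memory can be reclaimed between steps, while param.grad.zero_() overwrites the buffer in place and keeps it allocated at a fixed address, which CUDA-graph replay requires. A torch-free sketch of the two strategies, using a hypothetical Param class as a stand-in for a parameter:)

```python
class Param:
    """Hypothetical stand-in for a parameter holding a gradient buffer."""
    def __init__(self, n):
        self.grad = bytearray(n)  # stand-in for a gradient tensor

def clear_by_zero(params):
    # In-place zeroing: the buffer keeps the same address (graph replay
    # needs this), but its memory stays allocated between steps.
    for p in params:
        if p.grad is not None:
            p.grad[:] = bytes(len(p.grad))

def clear_by_none(params):
    # Dropping the reference: the buffer becomes collectable, lowering
    # peak memory, but the next step must allocate a fresh buffer.
    for p in params:
        p.grad = None

params = [Param(1024) for _ in range(4)]
params[0].grad[0] = 7

buf = params[0].grad
clear_by_zero(params)
assert params[0].grad is buf                 # same buffer object survives
assert all(b == 0 for b in params[0].grad)   # contents zeroed

clear_by_none(params)
assert all(p.grad is None for p in params)   # buffers released
```

In PyTorch proper the same trade-off is exposed by optimizer.zero_grad(set_to_none=True) versus set_to_none=False; which variant is faster depends on allocator behavior and memory pressure, consistent with the regression appearing only on some workloads.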
Thanks @inkcherry for the debug work on this. Please tag me here when you have a PR with a fix, and we can work on getting it tested and merged!