Comments
This is the configuration I am using for Mixer MLP:
"activation": "mish",
"architecture": "mixer_mlp",
"depth": 12,
"expansion_factor": 2,
"expansion_factor_token": 0.5,
"feature_dropout": 0.2,
"latent_dim": 4096,
"normalization": "none",
"position_encoding": "none",
Feature size is token_size = 128, token_count = 16; this is roughly a 200M-parameter network.
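For reference, here is a minimal sketch of one Mixer-MLP block consistent with the fields above (Mish activation, channel expansion factor 2, token expansion factor 0.5, no normalization). It is my reading of the config, not the project's actual layer code, and it omits the feature dropout and whatever role latent_dim plays:

import torch.nn as nn

class MixerBlock(nn.Module):
    # token_count / token_size follow the values quoted above
    def __init__(self, token_count=16, token_size=128,
                 expansion_factor=2, expansion_factor_token=0.5):
        super().__init__()
        # token-mixing MLP: mixes information across the 16 tokens
        self.token_mlp = nn.Sequential(
            nn.Linear(token_count, int(token_count * expansion_factor_token)),
            nn.Mish(),
            nn.Linear(int(token_count * expansion_factor_token), token_count),
        )
        # channel-mixing MLP: mixes information across the 128 features per token
        self.channel_mlp = nn.Sequential(
            nn.Linear(token_size, int(token_size * expansion_factor)),
            nn.Mish(),
            nn.Linear(int(token_size * expansion_factor), token_size),
        )

    def forward(self, x):  # x: (batch, token_count, token_size)
        x = x + self.token_mlp(x.transpose(1, 2)).transpose(1, 2)
        return x + self.channel_mlp(x)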
Much faster, but still taking 114 seconds per iteration. Same GPU model, but a slightly bigger model (300M parameters) in this case, as this is the GPU that just finished an epoch. For reference, Nero does 2 iterations per second.
Oh, thanks for testing. Then there's still a problem with the preconditioner, I guess. Maybe only the JAX implementation can perform well :sad-pepe: (a loop with an if-statement, which is how the Schur-Newton method is implemented, is really slow in PyTorch :( ). I'll do more investigations on that:
- roll back the Schur-Newton method to SVD
- re-implement based on the old version of the Shampoo optimizer
thanks in advance!
Let me know when you want me to test something.
Much better, but still too slow for the depth I am working at. Nero is doing a great job.
Thanks for reporting!
Could you tell me the model (e.g. resnet50) and the parameters of the Shampoo optimizer?
Actually, I didn't test many configurations, but it seems that pre-conditioning (based on the Google impl) is much slower than I expected. I'll figure it out.
I'm working on #101 and tested it on my local machine (GTX 1060 6GB).
- backbone: resmlp_12_distilled_224 (about 15M params)
- batch size: 4 (a bigger batch size causes OOM on my machine :( )
- input size: (3, 224, 224)
- iterations: 100
It took 3.48 s/iter, and I roughly guess the speed is within the expected range, though the compute_power() function, which calculates G^{-1/p} using a coupled Newton iteration, still takes most of the time. I'll check more and release the package with a new version, v2.4.0 (maybe soon).
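For context, here's a minimal sketch of the coupled Newton (Schur-Newton) iteration that compute_power() is built around, following the structure of the Google implementation; the constants and stopping rule below are illustrative, not the library's exact code:

import torch

def inverse_pth_root(mat_g, p, num_iters=100, ridge_eps=1e-6, tol=1e-6):
    # computes mat_g^{-1/p} via a coupled Newton iteration
    n = mat_g.shape[0]
    identity = torch.eye(n, dtype=mat_g.dtype, device=mat_g.device)
    mat_g = mat_g + ridge_eps * identity          # ridge term for stability
    alpha = -1.0 / p
    z = (1 + p) / (2 * torch.linalg.norm(mat_g))  # scale into the convergence region
    x = identity * z ** (1.0 / p)                 # running estimate of mat_g^{-1/p}
    m = mat_g * z                                 # auxiliary iterate, driven towards I
    for _ in range(num_iters):
        t = (1 - alpha) * identity + alpha * m
        x = x @ t
        m = torch.linalg.matrix_power(t, p) @ m
        # this data-dependent exit forces a GPU -> CPU sync on every step,
        # which is a big part of why such a loop is slow in PyTorch
        if torch.max(torch.abs(m - identity)) < tol:
            break
    return x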
Here's the benchmark code.
import torch
from pytorch_optimizer import load_optimizer
from timm import create_model
from tqdm import tqdm

# ResMLP-12 backbone with a single output logit
model = create_model('resmlp_12_distilled_224', pretrained=False, num_classes=1)
model.train()
model.cuda()

optimizer = load_optimizer('shampoo')(model.parameters())

# dummy batch of 4 images and targets
inp = torch.zeros((4, 3, 224, 224), dtype=torch.float32).cuda()
y = torch.ones((4, 1), dtype=torch.float32).cuda()

for _ in tqdm(range(100)):
    optimizer.zero_grad()
    torch.nn.functional.binary_cross_entropy_with_logits(model(inp), y).backward()
    optimizer.step()
I released a new version, v2.4.0, with the fixes! Please check whether there's still a performance issue with your settings!
best regards
I just deployed a new version, v2.4.1, with some improvements (see the Change Log)!
In short,
- In my experiments, the SVD method is fast in a few cases; however, the Newton method is usually faster than SVD. (You can use the SVD method by setting the use_svd option to True.)
- Tuning block_size brings a meaningful speed gain.
- Schur-Newton or SVD takes 99.99% of the time spent in the optimizer, and I venture a guess that it's hard to get much more speed than this unless the inverse matrix is computed in a distributed environment with lots of CPUs or XLA devices, as in the paper.
- The old Shampoo optimizer is back, so you can test both of them: load_optimizer('shampoo') -> old Shampoo optimizer, load_optimizer('scalableshampoo') -> new Shampoo optimizer. Or you can import them directly: from pytorch_optimizer import Shampoo, ScalableShampoo (a short usage sketch follows this list).
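A minimal usage sketch, assuming only the option names mentioned above (use_svd and block_size) and leaving every other argument at its default:

import torch
from pytorch_optimizer import ScalableShampoo, load_optimizer

model = torch.nn.Linear(128, 10)  # stand-in model for illustration

# Schur-Newton preconditioning (the default), with a tuned block size
opt_newton = ScalableShampoo(model.parameters(), block_size=512)

# SVD-based preconditioning instead of Schur-Newton
opt_svd = ScalableShampoo(model.parameters(), use_svd=True, block_size=512)

# the same optimizer through the string registry
opt = load_optimizer('scalableshampoo')(model.parameters(), block_size=512)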
Any feedback & requests are welcome!
Here are the benchmarks.

backbone: resmlp_12_distilled_224, bs: 16 (x2.5 faster)
- AdamP: 3.73 iter/s
- (old) Shampoo: over 25 s/iter
- Scalable Shampoo w/ Schur-Newton (block size = 256): 1.68 s/iter
- Scalable Shampoo w/ Schur-Newton (block size = 512): 1.12 iter/s
- Scalable Shampoo w/ SVD (block size = 256): 1.60 iter/s
- Scalable Shampoo w/ SVD (block size = 512): 2.50 iter/s

backbone: mixer_b16_224, bs: 8 (x0.5 faster)
- AdamP: 3.15 iter/s
- Nero: 2.93 iter/s
- (old) Shampoo: over 2 min/iter
- Scalable Shampoo w/ Schur-Newton (block size = 256): 5.33 s/iter
- Scalable Shampoo w/ Schur-Newton (block size = 512): 2.97 s/iter
- Scalable Shampoo w/ SVD (block size = 256): 11.26 s/iter
- Scalable Shampoo w/ SVD (block size = 512): 21.15 s/iter
@redknightlois I did more work (#128, #129) on the scalable Shampoo optimizer (cleaned up the code, optimized the PyTorch code, changed the default parameters, ...) and just released v2.6.0.
It's probably much faster than before, because I changed the default value of preconditioning_compute_steps from 1 to 1000; preconditioning is the most compute-intensive part, and the authors said recomputing it less often doesn't have a significant effect on convergence.
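A minimal sketch of overriding that default (only the keyword preconditioning_compute_steps comes from the comment above; the rest is illustrative):

import torch
from pytorch_optimizer import ScalableShampoo

model = torch.nn.Linear(128, 10)  # stand-in model for illustration

# recompute the preconditioner every 1000 steps (the new default) instead of
# every step; larger values trade preconditioner freshness for wall-clock speed
optimizer = ScalableShampoo(model.parameters(), preconditioning_compute_steps=1000)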
+) Also, I roughly guess that the current implementation is nearly the optimal version of scalable Shampoo (w/ synchronous preconditioner updates on a single GPU), so how about closing this issue for now? (If there's news, I'll re-open it or create another issue.)
If there are any requests, please feel free to use it & give feedback :)
Thank you!