Comments (3)
Of note, if I bypass this and handle everything manually in setup, my GPU doesn't actually take advantage of the quantization benefits. I am told my quantized model is 2 GB, and the trainer says 6 GB, but when I go to train, my 16 GB GPU overflows immediately. It even happens on my 24 GB GPU.
I'm unsure what's going wrong.
from lightning.
An update:
When I load the model using just quantization, it takes up 2 GB.
Lightning says it takes up 6 GB. I assume Lightning does a sample backward pass and that the excess is stored gradients.
I can use the `trainer.init_module()` context manager to keep Lightning faithful to its own stated size.
However, as soon as my model receives any textual data at all, it goes OOM. My suspicion now is that the optimizer is not properly handling the quantization or respecting the frozen layers. I can think of no other reason that my 2 GB double-quantized 4-bit model would OOM on a single backward pass.
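One way to rule out the optimizer as the culprit is to hand it only the trainable parameters, so frozen layers contribute no optimizer state (no momentum/variance tensors). A minimal sketch with plain PyTorch, where freezing the first layer stands in for frozen quantized base weights:

```python
import torch

model = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.Linear(8, 2))

# Freeze the first layer, as a stand-in for the frozen quantized base model.
for p in model[0].parameters():
    p.requires_grad = False

# Only trainable parameters go to the optimizer; AdamW otherwise allocates
# two extra fp32 state tensors per parameter it is given.
trainable = [p for p in model.parameters() if p.requires_grad]
opt = torch.optim.AdamW(trainable)

print(len(trainable))  # weight and bias of the second layer only
```

If the optimizer were instead built over `model.parameters()`, it would carry state for every frozen tensor as well, which adds up fast on a large base model.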
Solved: the Trainer defaults to mixed precision when the precision flag is not explicitly set to "32-true" or similar.
This produces duplicate tensor overhead.
Additionally, the trainer appears to base its size prediction on the dtype the model was saved in, rather than the dtype the model is currently in. That prediction can be safely ignored.
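A quick back-of-envelope calculation shows why any duplicate tensor overhead dwarfs the quantized storage: a half-precision shadow copy of the weights alone is 4x the size of the 4-bit original. Illustrative numbers only; the parameter count below is assumed, not taken from the thread:

```python
# Hypothetical 3.5B-parameter model, sizes in GiB.
params = 3.5e9
gb = 1024**3

quantized_gb = round(params * 0.5 / gb, 1)  # 4-bit weights: 0.5 bytes/param
fp16_copy_gb = round(params * 2 / gb, 1)    # an fp16 shadow copy: 2 bytes/param

print(quantized_gb)  # ~1.6 GiB
print(fp16_copy_gb)  # ~6.5 GiB, 4x the quantized size
```

So a model that loads small can still blow out memory the moment a higher-precision copy of its weights is created alongside it.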
Related Issues (20)
- Log `TensorBoard` histograms
- PTL 2.2 specifically causes torchscript errors when loaded in any environment not containing PTL 2.2
- When calling trainer.test() train_dataloader is also validated, which makes no sense
- Support `ThunderModule` models
- Calculated loss differs from logged loss in training_step (even if seed_everything, deterministic set to true and shuffle to false)
- Trainer does not wait for neptune logger completion and logger connection stays open unless explicitly closed
- Validation does not produce any output in PyTorch Lightning using my UNetTestModel
- Unable to extend FSDPStrategy to HPU accelerator
- SaveConfigCallback.save_config is conflict with DDP
- Logging Documentation Does not Detail How to Access the Logged Values during the fit loop
- Apply the ignore of the save_hyperparameters function to args as well.
- Cannot run in SLURM Interactive Session
- Resume from mid steps inside an epoch
- `DDPStrategy` fails when using accelerators other than CUDA
- PyTorch Lightning with T5 Model - RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
- Script freezes when Trainer is instantiated
- Sanitize object params before they get logged from argument-free classes
- Support GAN based model training with deepspeed which need to setup fabric twice
- IndexError: Pytorch-lightning CompositionalMetric require tensor.item() if dim=0 whether I did so
- Huge metrics jump between epochs && Step and epoch log not matched, when accumulate_grad_batches > 1