nota-netspresso / netspresso-trainer

A library for training, compressing and deploying computer vision models (including ViT) for edge devices

Home Page: https://nota-netspresso.github.io/netspresso-trainer/

License: Apache License 2.0

Languages: Python 99.36%, Shell 0.52%, Dockerfile 0.12%
Topics: fx, netspresso, pytorch, trainer, computer-vision, onnx, tensorrt

netspresso-trainer's People

Contributors

aychun, cbpark-nota, deepkyu, eric4991, hglee98, illian01


netspresso-trainer's Issues

[BUG] Converted models are not saved

Describe the bug

When I executed the example command in example_train.sh, I couldn't get the fx and onnx files.

netspresso-train  --data config/data/beans.yaml  --augmentation config/augmentation/classification.yaml  --model config/model/resnet/resnet50-classification.yaml  --training config/training/classification.yaml  --logging config/logging.yaml  --environment config/environment.yaml

During training, fx and onnx files are not saved because save_converted_model is set to False.

if with_checkpoint_saving:
    assert with_valid_logging
    self.save_checkpoint(epoch=num_epoch, save_converted_model=False)
    self.save_summary()

Converted models are saved only when training is over.

if self.single_gpu_or_rank_zero:
    self.train_logger.log_end_of_traning(final_metrics={'time_for_last_epoch': time_for_epoch})
    self.save_checkpoint(epoch=num_epoch, save_converted_model=True)
    self.save_summary(end_training=True)

However, there is a save_best_model check before saving converted models. There are two cases in which save_best_model becomes False at the last save step:

  • The validation loss of the last epoch is not the best one.
  • The validation step is skipped in the last epoch.

best_epoch = min(valid_losses, key=valid_losses.get)
save_best_model = best_epoch == epoch

if save_best_model:
    torch.save(model.state_dict(), best_model_path.with_suffix(".pth"))
    logger.info(f"Best model saved at {str(best_model_path.with_suffix('.pth'))}")
    if save_converted_model:
        try:
            save_onnx(model, best_model_path.with_suffix(".onnx"), sample_input=self.sample_input)
            logger.info(f"ONNX model converting and saved at {str(best_model_path.with_suffix('.onnx'))}")
            save_graphmodule(model, (model_path.parent / f"{best_model_path.stem}_fx").with_suffix(".pt"))
            logger.info(f"PyTorch FX model tracing and saved at {str(best_model_path.with_suffix('.pt'))}")
        except Exception as e:
            logger.error(e)
            pass
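One possible direction, shown only as a minimal sketch (it reuses save_onnx, save_graphmodule, logger, and self.sample_input from the snippets above and is not the project's actual implementation), is to export the converted models from the stored best checkpoint at the end of training, regardless of whether the last epoch happened to be the best one:

import torch

def export_best_model(self, model, best_model_path):
    # Hypothetical helper: convert the best checkpoint at the end of training,
    # independently of whether the final epoch was the best-scoring one.
    best_pth = best_model_path.with_suffix(".pth")
    if best_pth.exists():
        # Reload the best weights so the exported artifacts match the best score.
        model.load_state_dict(torch.load(best_pth, map_location="cpu"))
    try:
        save_onnx(model, best_model_path.with_suffix(".onnx"), sample_input=self.sample_input)
        save_graphmodule(model, (best_model_path.parent / f"{best_model_path.stem}_fx").with_suffix(".pt"))
    except Exception as e:
        logger.error(e)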

Have you searched existing issues? 🔎

  • I have searched and found no existing issues

Reproduction

Simply execute instruction below.

netspresso-train  --data config/data/beans.yaml  --augmentation config/augmentation/classification.yaml  --model config/model/resnet/resnet50-classification.yaml  --training config/training/classification.yaml  --logging config/logging.yaml  --environment config/environment.yaml

Screenshot

(screenshot attached in the original issue)

Logs

No response

System Info

No response

[Sprint] PyNetsPresso compatibility: NetsPresso Launcher

Description

  • Apply ResNet50 with NP Launcher via PyNetsPresso
  • If applicable, apply an NP-Compressor-compressed model with NP Launcher via PyNetsPresso

Checklists

  • I would create a corresponding branch for this issue from the designated (mostly master) branch.
  • This issue only contains a preceding agreement between project owners.

[Sprint] Saving the checkpoint when pausing the training

Description

  • Save the current checkpoint when pausing with KeyboardInterrupt
  • Resume training from the saved checkpoint

➕ Contains #132

  • Save the best model when the validation score reaches a new best
  • Control the model saving interval through the config file (see the sketch after this list)
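A minimal sketch of the pause-and-resume idea, assuming hypothetical names (save_interval, checkpoint_path) rather than the trainer's real config keys:

import torch
import torch.nn as nn

model = nn.Linear(8, 2)                      # toy model for illustration
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
checkpoint_path = "paused_checkpoint.pth"    # hypothetical path
save_interval = 5                            # hypothetical interval from the config file
start_epoch, total_epochs = 0, 100

def save_checkpoint(epoch):
    # Store everything needed to resume: epoch, model, and optimizer state.
    torch.save({"epoch": epoch,
                "model": model.state_dict(),
                "optimizer": optimizer.state_dict()}, checkpoint_path)

epoch = start_epoch
try:
    for epoch in range(start_epoch, total_epochs):
        # ... one training epoch would run here ...
        if (epoch + 1) % save_interval == 0:
            save_checkpoint(epoch)
except KeyboardInterrupt:
    save_checkpoint(epoch)                   # keep the paused state for resuming
    raise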

Checklists

  • I would create a corresponding branch for this issue from the designated (mostly master) branch.
  • This issue only contains a preceding agreement between project owners.

[Sprint] Add model: MobileNetV3

Description

  • Add a model backbone of MobileNetV3
    • I recommend torchvision's MobileNetV3, because torchvision largely supports fx tracing for the models in its zoo (a minimal sketch follows after this list).
    • The small variant (mobilenet_v3_small) is enough.

Minimal conditions for the MobileNetV3 integration:

  • Classification training with a MobileNetV3 backbone should succeed.
  • NP Compressor with the trained fx checkpoint should succeed.
  • NP Launcher with the trained onnx checkpoint should succeed.
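A minimal sketch of reusing torchvision's small MobileNetV3 as an fx-traceable backbone (assuming torchvision is installed; the exact integration into the trainer's backbone registry would differ):

import torch
from torch.fx import symbolic_trace
from torchvision.models import mobilenet_v3_small

# Use the torchvision implementation as the feature extractor and verify fx traceability.
backbone = mobilenet_v3_small(weights=None).features
traced = symbolic_trace(backbone)

dummy = torch.randn(1, 3, 224, 224)
features = traced(dummy)
print(features.shape)   # expected: torch.Size([1, 576, 7, 7])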

Checklists

  • I would create a corresponding branch for this issue from the designated (mostly dev) branch.
  • This issue only contains a preceding agreement between project owners.

[Sprint] Gradio interface for experiment report & PyNetsPresso usage

Description

  • Add a Gradio interface for easily using PyNetsPresso with trained models
  • Visualize model training results

Checklists

  • I would create a corresponding branch for this issue from the designated (mostly master) branch.
  • This issue only contains a preceding agreement between project owners.

[Sprint] Add code generator for deploying at Jetson boards and other devices

Description

  • Bring the former launcher package features of NetsPresso into this repository
    • Generate preprocess/postprocess code from the training code

Related Links

Checklists

  • I would create a corresponding branch for this issue from the designated (mostly master) branch.
  • This issue only contains a preceding agreement between project owners.

Loss logging policy update

As-is

  • The logger returns the per-epoch total loss value
    • However, it unexpectedly clears the values from each objective at every step

To-be

  • Prepare to return the loss value at each step
    • including every value for each objective (preferred for TensorBoard)
    • backward for each step (preferred for the normal training procedure)
  • Prepare to return losses for every epoch
    • using an AverageMeter to get averaged results over all samples in the dataset (see the sketch after this list)
    • returning the averaged total loss (preferred for summary logging)
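A minimal AverageMeter sketch of the kind this policy assumes (illustrative only; the trainer may ship its own implementation):

class AverageMeter:
    """Tracks a running average of per-step loss values across an epoch."""

    def __init__(self):
        self.reset()

    def reset(self):
        self.sum = 0.0
        self.count = 0

    def update(self, value, n=1):
        self.sum += value * n   # accumulate batch loss weighted by batch size
        self.count += n

    @property
    def avg(self):
        return self.sum / max(self.count, 1)

# Log the raw per-step value for TensorBoard, and meter.avg once per epoch.
meter = AverageMeter()
meter.update(0.93, n=32)
meter.update(0.81, n=32)
print(meter.avg)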

[Sprint] Docs: add NetsPresso pages

Description

  • Add page for introducing NetsPresso
    • Pure torch.fx compatibility
    • NetsPresso
    • NetsPresso Compressor
    • Best Practice
      • Compress and Retrain model

Checklists

  • I would create a corresponding branch for this issue from the designated (mostly master) branch.
  • This issue only contains a preceding agreement between project owners.

[BUG] Can't find 'save_optimizer_state' key in conf.logging

Describe the bug

An error occurs when initializing BasePipeline at
self.save_optimizer_state = self.conf.logging.save_optimizer_state

I searched for save_optimizer_state across the project, but there is no statement that initializes it.
I pulled the latest project before executing.
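Until the key is added to config/logging.yaml, a defensive default would avoid the crash; a sketch only, assuming omegaconf 2.1+ and that True is an acceptable default:

from omegaconf import OmegaConf

# Fall back to a default when the key is missing from the logging config.
self.save_optimizer_state = OmegaConf.select(self.conf, "logging.save_optimizer_state", default=True)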

Have you searched existing issues? 🔎

  • I have searched and found no existing issues

Reproduction

netspresso-train \
  --data config/data/beans.yaml \
  --augmentation config/augmentation/classification.yaml \
  --model config/model/resnet/resnet50-classification.yaml \
  --training config/training/classification.yaml \
  --logging config/logging.yaml \
  --environment config/environment.yaml

Screenshot

No response

Logs

(np_trainer) junho.shin@3090e:~/netspresso-trainer-dev$ netspresso-train\
>   --data config/data/beans.yaml\
>   --augmentation config/augmentation/classification.yaml\
>   --model config/model/resnet/resnet50-classification.yaml\
>   --training config/training/classification.yaml\
>   --logging config/logging.yaml\
>   --environment config/environment.yaml
2023-09-12_12:53:27 KST | INFO          | trainer:<trainer_common.py>:107 >>> Task: classification | Model: resnet50 | Training with torch.fx model? False
2023-09-12_12:53:27 KST | INFO          | build_dataset:<builder.py>:16 >>> ----------------------------------------
2023-09-12_12:53:27 KST | INFO          | build_dataset:<builder.py>:17 >>> Loading data...
2023-09-12_12:53:30 KST | INFO          | build_dataset:<builder.py>:88 >>> Summary | Dataset: <beans> (with huggingface format)
2023-09-12_12:53:30 KST | INFO          | build_dataset:<builder.py>:89 >>> Summary | Training dataset: 1034 sample(s)
2023-09-12_12:53:30 KST | INFO          | build_dataset:<builder.py>:91 >>> Summary | Validation dataset: 133 sample(s)
2023-09-12_12:53:30 KST | INFO          | build_dataset:<builder.py>:93 >>> Summary | Test dataset: 128 sample(s)
Traceback (most recent call last):
  File "/ssd1/junho.shin/anaconda3/envs/np_trainer/bin/netspresso-train", line 33, in <module>
    sys.exit(load_entry_point('netspresso-trainer', 'console_scripts', 'netspresso-train')())
  File "/ssd1/junho.shin/netspresso-trainer-dev/src/netspresso_trainer/train.py", line 6, in netspresso_train
    trainer(is_graphmodule_training=is_graphmodule_training)
  File "/ssd1/junho.shin/netspresso-trainer-dev/src/netspresso_trainer/trainer_common.py", line 132, in trainer
    trainer = build_pipeline(conf, task, model_name, model,
  File "/ssd1/junho.shin/netspresso-trainer-dev/src/netspresso_trainer/pipelines/builder.py", line 9, in build_pipeline
    trainer = task_pipeline(conf, task, model_name, model, devices,
  File "/ssd1/junho.shin/netspresso-trainer-dev/src/netspresso_trainer/pipelines/classification.py", line 17, in __init__
    super(ClassificationPipeline, self).__init__(conf, task, model_name, model, devices,
  File "/ssd1/junho.shin/netspresso-trainer-dev/src/netspresso_trainer/pipelines/base.py", line 59, in __init__
    self.save_optimizer_state = self.conf.logging.save_optimizer_state
  File "/ssd1/junho.shin/anaconda3/envs/np_trainer/lib/python3.8/site-packages/omegaconf/dictconfig.py", line 355, in __getattr__
    self._format_and_raise(
  File "/ssd1/junho.shin/anaconda3/envs/np_trainer/lib/python3.8/site-packages/omegaconf/base.py", line 231, in _format_and_raise
    format_and_raise(
  File "/ssd1/junho.shin/anaconda3/envs/np_trainer/lib/python3.8/site-packages/omegaconf/_utils.py", line 899, in format_and_raise
    _raise(ex, cause)
  File "/ssd1/junho.shin/anaconda3/envs/np_trainer/lib/python3.8/site-packages/omegaconf/_utils.py", line 797, in _raise
    raise ex.with_traceback(sys.exc_info()[2])  # set env var OC_CAUSE=1 for full trace
  File "/ssd1/junho.shin/anaconda3/envs/np_trainer/lib/python3.8/site-packages/omegaconf/dictconfig.py", line 351, in __getattr__
    return self._get_impl(
  File "/ssd1/junho.shin/anaconda3/envs/np_trainer/lib/python3.8/site-packages/omegaconf/dictconfig.py", line 442, in _get_impl
    node = self._get_child(
  File "/ssd1/junho.shin/anaconda3/envs/np_trainer/lib/python3.8/site-packages/omegaconf/basecontainer.py", line 73, in _get_child
    child = self._get_node(
  File "/ssd1/junho.shin/anaconda3/envs/np_trainer/lib/python3.8/site-packages/omegaconf/dictconfig.py", line 480, in _get_node
    raise ConfigKeyError(f"Missing key {key!s}")
omegaconf.errors.ConfigAttributeError: Missing key save_optimizer_state
    full_key: logging.save_optimizer_state
    object_type=dict

System Info

No response

[Sprint] Add an easy-to-use Colab demo

Description

  • Add Colab demo to showcase the training and PyNetsPresso integration process

Checklists

  • I would create a corresponding branch for this issue from the designated (mostly master) branch.
  • This issue only contains a preceding agreement between project owners.

[Sprint] Docs: add Model pages

Description

  • Add Models page
    • What is MetaFormer?
    • Our MetaFormer implementation
    • Model List
      • Backbone list
      • Head list
      • Full list
    • Model Compatibility Matrix

Checklists

  • I would create a corresponding branch for this issue from the designated (mostly master) branch.
  • This issue only contains a preceding agreement between project owners.

[BUG] Import error occurs when using python=3.10.12

Describe the bug

An import error occurred when I tried to execute the code with Python 3.10.12.

ImportError: cannot import name 'Sequence' from 'collections' 

This is because Sequence has been moved to collections.abc, as noted in the Python docs.

I think this should be handled, since Python 3.10.12 is within the supported version range.
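A minimal compatibility fix in dataloaders/augmentation/custom.py would be to import from collections.abc, which works on every Python version the project supports:

# collections.Sequence was removed in Python 3.10; collections.abc.Sequence
# has been available since Python 3.3.
from collections.abc import Sequence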

Have you searched existing issues? 🔎

  • I have searched and found no existing issues

Reproduction

Execute script on python 3.10.12 environment.

netspresso-train  --data config/data/beans.yaml  --augmentation config/augmentation/classification.yaml  --model config/model/resnet/resnet50-classification.yaml  --training config/training/classification.yaml  --logging config/logging.yaml  --environment config/environment.yaml

Screenshot

No response

Logs

(np_trainer_python3.10.12) junho.shin@3090e:~/netspresso-trainer-dev$ netspresso-train  --data config/data/beans.yaml  --augmentation config/augmentation/classification.yaml  --model config/model/resnet/resnet50-classification.yaml  --training config/training/classification.yaml  --logging config/logging.yaml  --environment config/environment.yaml
Traceback (most recent call last):
  File "/ssd1/junho.shin/anaconda3/envs/np_trainer_python3.10.12/bin/netspresso-train", line 5, in <module>
    from netspresso_trainer.train import netspresso_train
  File "/ssd1/junho.shin/netspresso-trainer-dev/src/netspresso_trainer/__init__.py", line 3, in <module>
    from .trainer_common import parse_args_netspresso, set_arguments, trainer
  File "/ssd1/junho.shin/netspresso-trainer-dev/src/netspresso_trainer/trainer_common.py", line 9, in <module>
    from .dataloaders import build_dataloader, build_dataset
  File "/ssd1/junho.shin/netspresso-trainer-dev/src/netspresso_trainer/dataloaders/__init__.py", line 1, in <module>
    from .builder import build_dataloader, build_dataset
  File "/ssd1/junho.shin/netspresso-trainer-dev/src/netspresso_trainer/dataloaders/builder.py", line 6, in <module>
    from .detection import detection_collate_fn
  File "/ssd1/junho.shin/netspresso-trainer-dev/src/netspresso_trainer/dataloaders/detection/__init__.py", line 3, in <module>
    from .transforms import create_transform_detection
  File "/ssd1/junho.shin/netspresso-trainer-dev/src/netspresso_trainer/dataloaders/detection/transforms.py", line 7, in <module>
    from ..augmentation import custom as TC
  File "/ssd1/junho.shin/netspresso-trainer-dev/src/netspresso_trainer/dataloaders/augmentation/__init__.py", line 1, in <module>
    from .custom import (
  File "/ssd1/junho.shin/netspresso-trainer-dev/src/netspresso_trainer/dataloaders/augmentation/custom.py", line 2, in <module>
    from collections import Sequence
ImportError: cannot import name 'Sequence' from 'collections' (/ssd1/junho.shin/anaconda3/envs/np_trainer_python3.10.12/lib/python3.10/collections/__init__.py)

System Info

python=3.10.12

[Sprint] Model training yaml file simplification

Description

  • Suggest an easy-to-use command and configuration for the example trainings

Checklists

  • I would create a corresponding branch for this issue from the designated (mostly master) branch.
  • This issue only contains a preceding agreement between project owners.

[Feature Request] No onnx model is saved in graph model training

  • I have searched to see if a similar issue already exists.

Is there any feature that you would like to add?
The best model should also be saved in ONNX format when training a graph model.

if self.is_graphmodule_training:
    # Just save graphmodule checkpoint
    torch.save(model, model_path.with_suffix(".pt"))
    logger.debug(f"PyTorch FX model saved at {str(model_path.with_suffix('.pt'))}")
    if save_best_model:
        torch.save(model, best_model_path.with_suffix(".pt"))
        logger.info(f"Best model saved at {str(best_model_path.with_suffix('.pt'))}")
    return

Is there any suggestion (or solution) to solve this issue?
I think this can be implemented by simply adding a save_onnx call.
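For example, something along these lines (a sketch only, reusing save_onnx, save_best_model, and save_converted_model from the snippets in this and the related issue):

if self.is_graphmodule_training:
    # Save the graphmodule checkpoint as before ...
    torch.save(model, model_path.with_suffix(".pt"))
    if save_best_model:
        torch.save(model, best_model_path.with_suffix(".pt"))
        if save_converted_model:
            # ... and additionally export the best graph model to ONNX.
            save_onnx(model, best_model_path.with_suffix(".onnx"), sample_input=self.sample_input)
    return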

Additional context
All other contexts or screenshots are welcome.

[BUG] Run script example_train.sh is not correct

Describe the bug
The run script example_train.sh does not seem to be correct.

To Reproduce
The classification example in example_train.sh is shown below. The netspresso-train command cannot execute train.py.

netspresso-train\
  --data config/data/beans.yaml\
  --augmentation config/augmentation/classification.yaml\
  --model config/model/resnet/resnet50-classification.yaml\
  --training config/training/classification.yaml\
  --logging config/logging.yaml\
  --environment config/environment.yaml

Expected behavior
Like example_train_fx.sh, example_train.sh needs to be changed to execute python train.py. Below is the script of example_train_fx.sh.

python train_fx.py\
  --data config/data/beans.yaml\
  --augmentation config/augmentation/classification.yaml\
  --model config/model/resnet/resnet50-classification.yaml\
  --training config/training/classification.yaml\
  --logging config/logging.yaml\
  --environment config/environment.yaml\
  --fx-model-checkpoint classification_resnet_fx.pt

In addition, the comment block of example_train.sh should be changed.
From

#### HuggingFace datasets training
# To use HuggingFace datasets, you need to additionally install requirements-data.txt
# `pip install -r requirements-data.txt`
#### (END)

To

#### HuggingFace datasets training
# To use HuggingFace datasets, you need to additionally install requirements-optional.txt
# `pip install -r requirements-optional.txt`
#### (END)

because the repository provides requirements-optional.txt instead of requirements-data.txt.


[Sprint] PyNetsPresso compatibility check

Description

Checklists

  • I would create a corresponding branch for this issue from the designated (mostly master) branch.
  • This issue only contains a preceding agreement between project owners.

[Sprint] Add LR schedulers: StepLR, CosineAnnealingWarmRestarts

Description

Checklists

  • I would create a corresponding branch for this issue from the designated (mostly master) branch.
  • This issue only contains a preceding agreement between project owners.

Performance validation with the original repository

https://nota-workspace.slack.com/archives/C047JBNH7U6/p1678946243546009?thread_ts=1678943626.801589&cid=C047JBNH7U6

VOC2012
model | training crop | test (crop, stride) | mIoU-val (%) | pixAcc-val (%) | eval eph (val-best) | training time | batch size | gpu mem | gpu spec
segformer-b0 | 768 | (768, 512) | 65.83750734 | 91.92332466 | 186 / 250 | 13 h | 8 | 13911MiB | 3090f
segformer-b1 | 768 | (768, 512) | 71.9364075 | 93.54266384 | 152 / 250 | 17 h | 8 | 16807MiB | 3090f

ADE20K Exp2
model | training crop | test (crop, stride) | mIoU-val (%) | pixAcc-val (%) | eval eph (val-best) | training time | batch size | gpu mem | gpu spec
segformer-b0 | 512 | (512, 512) | 36.82355351 | 76.54075322 | 187 / 200 | 45 h | 16 | 17089MiB x 1 gpu | a100
segformer-b1 | 512 | (512, 512) | 39.95565802 | 78.47017266 | 178 / 200 | 56 h | 16 | 19659MiB x 1 gpu | a100

[Sprint] Add NetsPresso use case

Description

  • SegFormer training
  • PyNetsPresso compressor
  • SegFormer finetuning with compressed model

Checklists

  • I would create a corresponding branch for this issue from the designated (mostly master) branch.
  • This issue only contains a preceding agreement between project owners.

[Sprint][Important] Transferring the repository

Description

This issue contains a checklist for transferring this repository to Nota-NetsPresso/netspresso-trainer.

Checklists

Before Transfer

  • Check that there are no irrelevant (or outdated) tags or branches
  • Transfer the repository to Nota-NetsPresso organization
    • We decided to transfer this repository to preserve all the issue and PR histories.
    • The original repo at the same link will be removed.

After Transfer - private

  • Check all the absolute links (and relative links) in README.md and Docs
  • Check whether the docs page is rendered correctly
  • Check all issue statuses are kept as original
  • Check whether all the github actions work well

[Feature Request] Init of loss and metric module

  • I have searched to see if a similar issue already exists.

Is there any feature that you would like to add?

  • Loss and metric modules are initialized on every epoch.
  • I think they don't need to be initialized that frequently, because they do not change during the training procedure.
for num_epoch in range(START_EPOCH_ZERO_OR_ONE, self.conf.training.epochs + START_EPOCH_ZERO_OR_ONE):
    self.timer.start_record(name=f'train_epoch_{num_epoch}')
    self.loss = build_losses(self.conf.model, ignore_index=self.ignore_index)
    self.metric = build_metrics(self.task, self.conf.model, ignore_index=self.ignore_index, num_classes=self.num_classes)

Is there any suggestion (or solution) to solve this issue?

  • Loss and metric can be initialized in the set_train function.
  • If there are values that need to be reset to 0 after an epoch, we can add a reset method for that (see the sketch below).
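A minimal sketch of the suggestion (build_losses and build_metrics are taken from the snippet above; the reset method and the pipeline skeleton are assumptions, not the real class):

class BasePipelineSketch:   # illustrative skeleton only
    def set_train(self):
        # Build loss and metric once per training run.
        self.loss = build_losses(self.conf.model, ignore_index=self.ignore_index)
        self.metric = build_metrics(self.task, self.conf.model,
                                    ignore_index=self.ignore_index,
                                    num_classes=self.num_classes)

    def train(self):
        for num_epoch in range(START_EPOCH_ZERO_OR_ONE,
                               self.conf.training.epochs + START_EPOCH_ZERO_OR_ONE):
            self.timer.start_record(name=f'train_epoch_{num_epoch}')
            self.loss.reset()     # hypothetical per-epoch reset instead of re-building
            self.metric.reset()
            # ... train and validate one epoch ...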

[BUG] 'target' of the semantic segmentation loader doesn't match the number of classes

Describe the bug

In SegmentationCustomDataset line 35, the label is read with PIL.
If I extract the unique values of the label, I get an array like the one below.

np.unique(np.array(label))
> array([  0, 147, 151, 220], dtype=uint8)

Even after the label passes through the transform function (line 44), it still contains large values.

out['mask'].unique()
> tensor([0, 151, 220])

This causes an error when the trainer tries to compute the loss (cross entropy), because the model has 21 classes (including background), not over 200 classes.
The dataset has to convert mask pixel intensities to class indices.
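A minimal sketch of such a conversion (the value-to-class mapping below is purely illustrative; the real mapping depends on how the VOC masks are decoded):

import numpy as np
from PIL import Image

# Hypothetical remapping from raw mask pixel intensities to contiguous class indices.
VALUE_TO_CLASS = {0: 0, 147: 1, 151: 2, 220: 3}   # illustrative mapping only

def load_mask(path):
    label = np.array(Image.open(path))
    mask = np.zeros_like(label)
    for value, class_idx in VALUE_TO_CLASS.items():
        mask[label == value] = class_idx
    return mask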

Have you searched existing issues? 🔎

  • I have searched and found no existing issues

Reproduction

I downloaded VOC2012 datasets, and executed instruction below.

python ./train.py --data config/data/voc12.yaml --augmentation config/augmentation/segmentation.yaml --model config/model/segformer/segformer-segmentation.yaml --training config/training/segmentation.yaml --logging config/logging.yaml --environment config/environment.yaml

Screenshot

No response

Logs

(np_trainer) junho.shin@3090e:~/netspresso-trainer-dev$ python ./train.py --data config/data/voc12.yaml --augmentation config/augmentation/segmentation.yaml --model config/model/segformer/segformer-segmentation.yaml --training config/training/segmentation.yaml --logging config/logging.yaml --environment config/environment.yaml
2023-09-12_14:53:39 KST | INFO          | trainer:<trainer_common.py>:107 >>> Task: segmentation | Model: segformer | Training with torch.fx model? False
2023-09-12_14:53:39 KST | INFO          | build_dataset:<builder.py>:16 >>> ----------------------------------------
2023-09-12_14:53:39 KST | INFO          | build_dataset:<builder.py>:17 >>> Loading data...
2023-09-12_14:53:39 KST | INFO          | build_dataset:<builder.py>:88 >>> Summary | Dataset: <voc2012> (with local format)
2023-09-12_14:53:39 KST | INFO          | build_dataset:<builder.py>:89 >>> Summary | Training dataset: 1464 sample(s)
2023-09-12_14:53:39 KST | INFO          | build_dataset:<builder.py>:91 >>> Summary | Validation dataset: 1449 sample(s)
2023-09-12_14:53:41 KST | INFO          | train:<base.py>:150 >>> ----------------------------------------
  0%|          | 0/183 [00:00<?, ?it/s]../aten/src/ATen/native/cuda/NLLLoss2d.cu:104: nll_loss2d_forward_kernel: block: [10,0,0], thread: [416,0,0] Assertion `t >= 0 && t < n_classes` failed.
../aten/src/ATen/native/cuda/NLLLoss2d.cu:104: nll_loss2d_forward_kernel: block: [10,0,0], thread: [417,0,0] Assertion `t >= 0 && t < n_classes` failed.
(... the same `t >= 0 && t < n_classes` assertion is repeated for many more blocks and threads ...)
2023-09-12_14:53:43 KST | ERROR         | train:<base.py>:201 >>> CUDA error: device-side assert triggered                                                                                        
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Traceback (most recent call last):
  File "./train.py", line 5, in <module>
    trainer(is_graphmodule_training=False)
  File "/ssd1/junho.shin/netspresso-trainer-dev/src/netspresso_trainer/trainer_common.py", line 146, in trainer
    raise e
  File "/ssd1/junho.shin/netspresso-trainer-dev/src/netspresso_trainer/trainer_common.py", line 139, in trainer
    trainer.train()
  File "/ssd1/junho.shin/netspresso-trainer-dev/src/netspresso_trainer/pipelines/base.py", line 202, in train
    raise e
  File "/ssd1/junho.shin/netspresso-trainer-dev/src/netspresso_trainer/pipelines/base.py", line 162, in train
    self.train_one_epoch()
  File "/ssd1/junho.shin/netspresso-trainer-dev/src/netspresso_trainer/pipelines/base.py", line 206, in train_one_epoch
    self.train_step(batch)
  File "/ssd1/junho.shin/netspresso-trainer-dev/src/netspresso_trainer/pipelines/segmentation.py", line 39, in train_step
    self.loss_factory.backward()
  File "/ssd1/junho.shin/netspresso-trainer-dev/src/netspresso_trainer/losses/builder.py", line 54, in backward
    self.total_loss_for_backward.mean().backward()
  File "/ssd1/junho.shin/anaconda3/envs/np_trainer/lib/python3.8/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/ssd1/junho.shin/anaconda3/envs/np_trainer/lib/python3.8/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

System Info

No response

[Sprint] Model configuration with yaml file

Description

  • Add scalable model configuration and corresponding model families

Checklists

  • I would create a corresponding branch for this issue from the designated (mostly master) branch.
  • This issue only contains a preceding agreement between project owners.

[Sprint] Prettier Multi-GPU CLI

Description

  • Run multi-GPU options without starting with torch.distributed.run or torchrun.
  • Equalize with netspresso-trainer or netspresso-trainer-fx and add a --num-gpus option to select how many GPU(s) are used (see the sketch after this list).
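A minimal sketch of how such a CLI could hide the launcher (illustrative only; the --num-gpus flag, address/port, and worker body are assumptions, not the project's actual interface):

import argparse
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, num_gpus):
    # Each spawned process joins the process group and trains on its own GPU.
    dist.init_process_group("nccl", init_method="tcp://127.0.0.1:29500",
                            rank=rank, world_size=num_gpus)
    torch.cuda.set_device(rank)
    # ... build dataloaders/model and run the training pipeline here ...
    dist.destroy_process_group()

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--num-gpus", type=int, default=1)
    args = parser.parse_args()
    if args.num_gpus > 1:
        mp.spawn(worker, args=(args.num_gpus,), nprocs=args.num_gpus)
    else:
        worker(0, 1)   # single-GPU path without torchrun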

Checklists

  • I would create a corresponding branch for this issue from the designated (mostly master) branch.
  • This issue only contains a preceding agreement between project owners.

[Feature Request] TASK_MODEL_DICT is better to be on registry.py

  • I have searched to see if a similar issue already exists.

Is there any feature that you would like to add?
In src/netspresso_trainer/models/builder.py,
TASK_MODEL_DICT is defined inside the load_backbone_and_head_model function, but there is already a registry.py that contains most of the model dicts.
I think TASK_MODEL_DICT should be moved to registry.py.

Is there any suggestion (or solution) to solve this issue?
Move TASK_MODEL_DICT to registry.py.

Additional context
All other contexts or screenshots are welcome.

[Sprint] Docs: configuration

Description

  • Add a detailed explanation of the training configuration

Checklists

  • I would create a corresponding branch for this issue from the designated (mostly master) branch.
  • This issue only contains a preceding agreement between project owners.

[Sprint] Docs: update Getting Started pages

Description

  • Multi-GPU training #101
  • Configuration example and description for dataset, model, optimizer, logging

Checklists

  • I would create a corresponding branch for this issue from the designated (mostly master) branch.
  • This issue only contains a preceding agreement between project owners.

[Feature Request] Sync model calibration metric with PyNetsPresso

  • I have searched to see if a similar issue already exists.

Is there any feature that you would like to add?

  • Until now, all insiders have been using the thop library as a FLOPs counter for PyTorch models.
  • There will be some major changes in our service, including that the library for checking FLOPs will be changed to fvcore.
    • fvcore is widely known to have the best coverage of torch operators.

Is there any suggestion (or solution) to solve this issue?

  • Change thop to fvcore
  • Use the fvcore.nn.flop_count API (see the sketch after this list)
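A minimal usage sketch with fvcore (assuming fvcore and torchvision are installed; the ResNet-50 model and input shape are only an example):

import torch
import torchvision
from fvcore.nn import FlopCountAnalysis, flop_count

model = torchvision.models.resnet50()
inputs = (torch.randn(1, 3, 224, 224),)

# flop_count returns a per-operator GFLOPs dict plus a dict of skipped ops.
gflops_per_op, skipped = flop_count(model, inputs)
print(sum(gflops_per_op.values()))

# FlopCountAnalysis gives the total count and per-module breakdowns.
flops = FlopCountAnalysis(model, inputs)
print(flops.total())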

Additional context
N/A

[BUG] Valid loss and metric as zero

Describe the bug
Validation loss and metric are reported as zero.

To Reproduce
Steps to reproduce the behavior:

  1. Run pure train code

Expected behavior

  • Non-zero value for both loss and metric

Screenshots

2023-09-11_02:32:21 UTC | INFO          | train:<base.py>:150 >>> ----------------------------------------
2023-09-11_02:32:32 UTC | INFO          | __call__:<stdout.py>:27 >>> Epoch: 1 / 3                                                                                                                                                                                                                 
2023-09-11_02:32:32 UTC | INFO          | __call__:<stdout.py>:30 >>> learning rate: 0.0000200
2023-09-11_02:32:32 UTC | INFO          | __call__:<stdout.py>:32 >>> elapsed_time: 10.8038218
2023-09-11_02:32:32 UTC | INFO          | __call__:<stdout.py>:33 >>> training loss: 0.9278503
2023-09-11_02:32:32 UTC | INFO          | __call__:<stdout.py>:34 >>> training metric: [('Acc@1', 62.263513513513516), ('Acc@5', 100.0)]
2023-09-11_02:32:32 UTC | INFO          | __call__:<stdout.py>:37 >>> validation loss: 0.0000000
2023-09-11_02:32:32 UTC | INFO          | __call__:<stdout.py>:39 >>> validation metric: [('Acc@1', 0.0), ('Acc@5', 0.0)]
2023-09-11_02:32:33 UTC | INFO          | save_checkpoint:<base.py>:287 >>> Best model saved at outputs/classification_efficientformer/version_2/classification_efficientformer_best.pth
2023-09-11_02:32:33 UTC | INFO          | save_summary:<base.py>:324 >>> Model training summary saved at outputs/classification_efficientformer/version_2/training_summary.ckpt

[Sprint] Fully support for task head: segmentation

Description

  • Add segmentation support for existing backbones:
    • ResNet
    • MobileViT
  • Add the FAN segmentation head and support all backbones for segmentation:

Checklists

  • I would create a corresponding branch for this issue from the designated (mostly master) branch.
  • This issue only contains a preceding agreement between project owners.

[Feature Request] Checkpoint and best model saving

  • I have searched to see if a similar issue already exists.

Is there any feature that you would like to add?

  • The model is saved only when the training is done.
  • I think the best model and intermediate checkpoints should also be saved.

Is there any suggestion (or solution) to solve this issue?

  • Save the best model when the validation score reaches a new best
  • Handle model saving interval through the config file

[Sprint] LR Scheduler simulator with training configuration

Description

  • Add a configuration simulator with Gradio to directly see how the learning rate scheduler behaves (see the sketch below)
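A minimal sketch of such a simulator (assuming gradio and matplotlib are installed; the exposed parameters and the CosineAnnealingWarmRestarts choice are only an example):

import gradio as gr
import matplotlib.pyplot as plt
import torch

def simulate_lr(initial_lr, t_0, epochs):
    # Drive a throwaway optimizer through the scheduler to record the LR curve.
    opt = torch.optim.SGD([torch.zeros(1, requires_grad=True)], lr=initial_lr)
    sched = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(opt, T_0=int(t_0))
    lrs = []
    for _ in range(int(epochs)):
        lrs.append(opt.param_groups[0]["lr"])
        sched.step()
    fig, ax = plt.subplots()
    ax.plot(lrs)
    ax.set_xlabel("epoch")
    ax.set_ylabel("learning rate")
    return fig

demo = gr.Interface(fn=simulate_lr,
                    inputs=[gr.Number(value=0.01), gr.Number(value=10), gr.Number(value=100)],
                    outputs=gr.Plot())
demo.launch()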

Checklists

  • I would create a corresponding branch for this issue from the designated (mostly master) branch.
  • This issue only contains a preceding agreement between project owners.
