nota-netspresso / netspresso-trainer

A library for training, compressing and deploying computer vision models (including ViT) for edge devices

Home Page: https://nota-netspresso.github.io/netspresso-trainer/

License: Apache License 2.0

Languages: Python 99.36%, Shell 0.52%, Dockerfile 0.12%
Topics: fx, netspresso, pytorch, trainer, computer-vision, onnx, tensorrt

netspresso-trainer's People

Contributors

aychun, cbpark-nota, deepkyu, eric4991, hglee98, illian01


netspresso-trainer's Issues

[BUG] Converted models are not saved

Describe the bug

When I executed the example command in example_train.sh, I couldn't get the fx and onnx files.

netspresso-train  --data config/data/beans.yaml  --augmentation config/augmentation/classification.yaml  --model config/model/resnet/resnet50-classification.yaml  --training config/training/classification.yaml  --logging config/logging.yaml  --environment config/environment.yaml

During training, fx and onnx files are not saved because save_converted_model is set to False.

if with_checkpoint_saving:
    assert with_valid_logging
    self.save_checkpoint(epoch=num_epoch, save_converted_model=False)
    self.save_summary()

Converted models are saved only when training is over.

if self.single_gpu_or_rank_zero:
    self.train_logger.log_end_of_traning(final_metrics={'time_for_last_epoch': time_for_epoch})
    self.save_checkpoint(epoch=num_epoch, save_converted_model=True)
    self.save_summary(end_training=True)

However, there is a save_best_model check before saving converted models. There are two cases in which save_best_model becomes False at the last save step:

  • The validation loss of the last epoch is not the best one.
  • The validation step is skipped in the last epoch.

best_epoch = min(valid_losses, key=valid_losses.get)
save_best_model = best_epoch == epoch

if save_best_model:
    torch.save(model.state_dict(), best_model_path.with_suffix(".pth"))
    logger.info(f"Best model saved at {str(best_model_path.with_suffix('.pth'))}")
    if save_converted_model:
        try:
            save_onnx(model, best_model_path.with_suffix(".onnx"), sample_input=self.sample_input)
            logger.info(f"ONNX model converting and saved at {str(best_model_path.with_suffix('.onnx'))}")
            save_graphmodule(model, (model_path.parent / f"{best_model_path.stem}_fx").with_suffix(".pt"))
            logger.info(f"PyTorch FX model tracing and saved at {str(best_model_path.with_suffix('.pt'))}")
        except Exception as e:
            logger.error(e)
            pass
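One possible direction, shown only as a minimal sketch (it reuses save_onnx, save_graphmodule, logger, and self.sample_input from the snippets above and is not the project's actual implementation), is to export the converted models from the stored best checkpoint at the end of training, regardless of whether the last epoch happened to be the best one:

import torch

def export_best_model(self, model, best_model_path):
    # Hypothetical helper: convert the best checkpoint at the end of training,
    # independently of whether the final epoch was the best-scoring one.
    best_pth = best_model_path.with_suffix(".pth")
    if best_pth.exists():
        # Reload the best weights so the exported artifacts match the best score.
        model.load_state_dict(torch.load(best_pth, map_location="cpu"))
    try:
        save_onnx(model, best_model_path.with_suffix(".onnx"), sample_input=self.sample_input)
        save_graphmodule(model, (best_model_path.parent / f"{best_model_path.stem}_fx").with_suffix(".pt"))
    except Exception as e:
        logger.error(e)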

Have you searched existing issues? 🔎

  • I have searched and found no existing issues

Reproduction

Simply execute instruction below.

netspresso-train  --data config/data/beans.yaml  --augmentation config/augmentation/classification.yaml  --model config/model/resnet/resnet50-classification.yaml  --training config/training/classification.yaml  --logging config/logging.yaml  --environment config/environment.yaml

Screenshot

(screenshot attached in the original issue)

Logs

No response

System Info

No response

[Sprint] PyNetsPresso compatibility: NetsPresso Launcher

Description

  • Apply ResNet50 with NP Launcher via PyNetsPresso
  • If applicable, apply an NP-Compressor-compressed model with NP Launcher via PyNetsPresso

Checklists

  • I would create a corresponding branch for this issue from the designated (mostly master) branch.
  • This issue only contains a preceding agreement between project owners.

[Sprint] Saving the checkpoint when pausing the training

Description

  • Save the current checkpoint when pausing with KeyboardInterrupt
  • Resume training from the saved checkpoint

➕ Contains #132

  • Save the best model when the validation score reaches a new best
  • Control the model saving interval through the config file (see the sketch after this list)
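A minimal sketch of the pause-and-resume idea, assuming hypothetical names (save_interval, checkpoint_path) rather than the trainer's real config keys:

import torch
import torch.nn as nn

model = nn.Linear(8, 2)                      # toy model for illustration
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
checkpoint_path = "paused_checkpoint.pth"    # hypothetical path
save_interval = 5                            # hypothetical interval from the config file
start_epoch, total_epochs = 0, 100

def save_checkpoint(epoch):
    # Store everything needed to resume: epoch, model, and optimizer state.
    torch.save({"epoch": epoch,
                "model": model.state_dict(),
                "optimizer": optimizer.state_dict()}, checkpoint_path)

epoch = start_epoch
try:
    for epoch in range(start_epoch, total_epochs):
        # ... one training epoch would run here ...
        if (epoch + 1) % save_interval == 0:
            save_checkpoint(epoch)
except KeyboardInterrupt:
    save_checkpoint(epoch)                   # keep the paused state for resuming
    raise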

Checklists

  • I would create a corresponding branch for this issue from the designated (mostly master) branch.
  • This issue only contains a preceding agreement between project owners.

[Sprint] Add model: MobileNetV3

Description

  • Add a model backbone of MobileNetV3
    • I recommend torchvision's MobileNetV3, because torchvision largely supports fx tracing for the models in its zoo (a minimal sketch follows after this list).
    • The small variant (mobilenet_v3_small) is enough.

Minimal conditions for the MobileNetV3 integration:

  • Classification training with a MobileNetV3 backbone should succeed.
  • NP Compressor with the trained fx checkpoint should succeed.
  • NP Launcher with the trained onnx checkpoint should succeed.
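A minimal sketch of reusing torchvision's small MobileNetV3 as an fx-traceable backbone (assuming torchvision is installed; the exact integration into the trainer's backbone registry would differ):

import torch
from torch.fx import symbolic_trace
from torchvision.models import mobilenet_v3_small

# Use the torchvision implementation as the feature extractor and verify fx traceability.
backbone = mobilenet_v3_small(weights=None).features
traced = symbolic_trace(backbone)

dummy = torch.randn(1, 3, 224, 224)
features = traced(dummy)
print(features.shape)   # expected: torch.Size([1, 576, 7, 7])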

Checklists

  • I would create a corresponding branch for this issue from the designated (mostly dev) branch.
  • This issue only contains a preceding agreement between project owners.

[Sprint] Gradio interface for experiment report & PyNetsPresso usage

Description

  • Add a Gradio interface for easily using PyNetsPresso with trained models
  • Visualize model training results

Checklists

  • I would create a corresponding branch for this issue from the designated (mostly master) branch.
  • This issue only contains a preceding agreement between project owners.

[Sprint] Add code generator for deploying at Jetson boards and other devices

Description

  • Bring the former launcher package features of NetsPresso into this repository
    • Generate preprocess/postprocess code from the training code

Related Links

Checklists

  • I would create a corresponding branch for this issue from the designated (mostly master) branch.
  • This issue only contains a preceding agreement between project owners.

Loss logging policy update

As-is

  • The logger returns the per-epoch total loss value
    • However, it unexpectedly clears the values from each objective at every step

To-be

  • Prepare to return the loss value at each step
    • including every value for each objective (preferred for TensorBoard)
    • backward for each step (preferred for the normal training procedure)
  • Prepare to return losses for every epoch
    • using an AverageMeter to get averaged results over all samples in the dataset (see the sketch after this list)
    • returning the averaged total loss (preferred for summary logging)
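A minimal AverageMeter sketch of the kind this policy assumes (illustrative only; the trainer may ship its own implementation):

class AverageMeter:
    """Tracks a running average of per-step loss values across an epoch."""

    def __init__(self):
        self.reset()

    def reset(self):
        self.sum = 0.0
        self.count = 0

    def update(self, value, n=1):
        self.sum += value * n   # accumulate batch loss weighted by batch size
        self.count += n

    @property
    def avg(self):
        return self.sum / max(self.count, 1)

# Log the raw per-step value for TensorBoard, and meter.avg once per epoch.
meter = AverageMeter()
meter.update(0.93, n=32)
meter.update(0.81, n=32)
print(meter.avg)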

[Sprint] Docs: add NetsPresso pages

Description

  • Add page for introducing NetsPresso
    • Pure torch.fx compatibility
    • NetsPresso
    • NetsPresso Compressor
    • Best Practice
      • Compress and Retrain model

Checklists

  • I would create a corresponding branch for this issue from the designated (mostly master) branch.
  • This issue only contains a preceding agreement between project owners.

[BUG] Can't find 'save_optimizer_state' key in conf.logging

Describe the bug

An error occurs when initializing BasePipeline at
self.save_optimizer_state = self.conf.logging.save_optimizer_state

I searched for save_optimizer_state across the project, but there is no statement that initializes it.
I pulled the latest project before executing.
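Until the key is added to config/logging.yaml, a defensive default would avoid the crash; a sketch only, assuming omegaconf 2.1+ and that True is an acceptable default:

from omegaconf import OmegaConf

# Fall back to a default when the key is missing from the logging config.
self.save_optimizer_state = OmegaConf.select(self.conf, "logging.save_optimizer_state", default=True)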

Have you searched existing issues? 🔎

  • I have searched and found no existing issues

Reproduction

netspresso-train \
  --data config/data/beans.yaml \
  --augmentation config/augmentation/classification.yaml \
  --model config/model/resnet/resnet50-classification.yaml \
  --training config/training/classification.yaml \
  --logging config/logging.yaml \
  --environment config/environment.yaml

Screenshot

No response

Logs

(np_trainer) junho.shin@3090e:~/netspresso-trainer-dev$ netspresso-train\
>   --data config/data/beans.yaml\
>   --augmentation config/augmentation/classification.yaml\
>   --model config/model/resnet/resnet50-classification.yaml\
>   --training config/training/classification.yaml\
>   --logging config/logging.yaml\
>   --environment config/environment.yaml
2023-09-12_12:53:27 KST | INFO          | trainer:<trainer_common.py>:107 >>> Task: classification | Model: resnet50 | Training with torch.fx model? False
2023-09-12_12:53:27 KST | INFO          | build_dataset:<builder.py>:16 >>> ----------------------------------------
2023-09-12_12:53:27 KST | INFO          | build_dataset:<builder.py>:17 >>> Loading data...
2023-09-12_12:53:30 KST | INFO          | build_dataset:<builder.py>:88 >>> Summary | Dataset: <beans> (with huggingface format)
2023-09-12_12:53:30 KST | INFO          | build_dataset:<builder.py>:89 >>> Summary | Training dataset: 1034 sample(s)
2023-09-12_12:53:30 KST | INFO          | build_dataset:<builder.py>:91 >>> Summary | Validation dataset: 133 sample(s)
2023-09-12_12:53:30 KST | INFO          | build_dataset:<builder.py>:93 >>> Summary | Test dataset: 128 sample(s)
Traceback (most recent call last):
  File "/ssd1/junho.shin/anaconda3/envs/np_trainer/bin/netspresso-train", line 33, in <module>
    sys.exit(load_entry_point('netspresso-trainer', 'console_scripts', 'netspresso-train')())
  File "/ssd1/junho.shin/netspresso-trainer-dev/src/netspresso_trainer/train.py", line 6, in netspresso_train
    trainer(is_graphmodule_training=is_graphmodule_training)
  File "/ssd1/junho.shin/netspresso-trainer-dev/src/netspresso_trainer/trainer_common.py", line 132, in trainer
    trainer = build_pipeline(conf, task, model_name, model,
  File "/ssd1/junho.shin/netspresso-trainer-dev/src/netspresso_trainer/pipelines/builder.py", line 9, in build_pipeline
    trainer = task_pipeline(conf, task, model_name, model, devices,
  File "/ssd1/junho.shin/netspresso-trainer-dev/src/netspresso_trainer/pipelines/classification.py", line 17, in __init__
    super(ClassificationPipeline, self).__init__(conf, task, model_name, model, devices,
  File "/ssd1/junho.shin/netspresso-trainer-dev/src/netspresso_trainer/pipelines/base.py", line 59, in __init__
    self.save_optimizer_state = self.conf.logging.save_optimizer_state
  File "/ssd1/junho.shin/anaconda3/envs/np_trainer/lib/python3.8/site-packages/omegaconf/dictconfig.py", line 355, in __getattr__
    self._format_and_raise(
  File "/ssd1/junho.shin/anaconda3/envs/np_trainer/lib/python3.8/site-packages/omegaconf/base.py", line 231, in _format_and_raise
    format_and_raise(
  File "/ssd1/junho.shin/anaconda3/envs/np_trainer/lib/python3.8/site-packages/omegaconf/_utils.py", line 899, in format_and_raise
    _raise(ex, cause)
  File "/ssd1/junho.shin/anaconda3/envs/np_trainer/lib/python3.8/site-packages/omegaconf/_utils.py", line 797, in _raise
    raise ex.with_traceback(sys.exc_info()[2])  # set env var OC_CAUSE=1 for full trace
  File "/ssd1/junho.shin/anaconda3/envs/np_trainer/lib/python3.8/site-packages/omegaconf/dictconfig.py", line 351, in __getattr__
    return self._get_impl(
  File "/ssd1/junho.shin/anaconda3/envs/np_trainer/lib/python3.8/site-packages/omegaconf/dictconfig.py", line 442, in _get_impl
    node = self._get_child(
  File "/ssd1/junho.shin/anaconda3/envs/np_trainer/lib/python3.8/site-packages/omegaconf/basecontainer.py", line 73, in _get_child
    child = self._get_node(
  File "/ssd1/junho.shin/anaconda3/envs/np_trainer/lib/python3.8/site-packages/omegaconf/dictconfig.py", line 480, in _get_node
    raise ConfigKeyError(f"Missing key {key!s}")
omegaconf.errors.ConfigAttributeError: Missing key save_optimizer_state
    full_key: logging.save_optimizer_state
    object_type=dict

System Info

No response

[Sprint] Add an easy-to-use Colab demo

Description

  • Add Colab demo to showcase the training and PyNetsPresso integration process

Checklists

  • I would create a corresponding branch for this issue from the designated (mostly master) branch.
  • This issue only contains a preceding agreement between project owners.

[Sprint] Docs: add Model pages

Description

  • Add Models page
    • What is MetaFormer?
    • Our MetaFormer implementation
    • Model List
      • Backbone list
      • Head list
      • Full list
    • Model Compatibility Matrix

Checklists

  • I would create a corresponding branch for this issue from the designated (mostly master) branch.
  • This issue only contains a preceding agreement between project owners.

[BUG] Import error occurs when using python=3.10.12

Describe the bug

An import error occurred when I tried to execute the code with Python 3.10.12.

ImportError: cannot import name 'Sequence' from 'collections' 

This is because Sequence has been moved to collections.abc, as noted in the Python docs.

I think this should be handled, since Python 3.10.12 is within the supported version range.
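A minimal compatibility fix in dataloaders/augmentation/custom.py would be to import from collections.abc, which works on every Python version the project supports:

# collections.Sequence was removed in Python 3.10; collections.abc.Sequence
# has been available since Python 3.3.
from collections.abc import Sequence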

Have you searched existing issues? 🔎

  • I have searched and found no existing issues

Reproduction

Execute script on python 3.10.12 environment.

netspresso-train  --data config/data/beans.yaml  --augmentation config/augmentation/classification.yaml  --model config/model/resnet/resnet50-classification.yaml  --training config/training/classification.yaml  --logging config/logging.yaml  --environment config/environment.yaml

Screenshot

No response

Logs

(np_trainer_python3.10.12) junho.shin@3090e:~/netspresso-trainer-dev$ netspresso-train  --data config/data/beans.yaml  --augmentation config/augmentation/classification.yaml  --model config/model/resnet/resnet50-classification.yaml  --training config/training/classification.yaml  --logging config/logging.yaml  --environment config/environment.yaml
Traceback (most recent call last):
  File "/ssd1/junho.shin/anaconda3/envs/np_trainer_python3.10.12/bin/netspresso-train", line 5, in <module>
    from netspresso_trainer.train import netspresso_train
  File "/ssd1/junho.shin/netspresso-trainer-dev/src/netspresso_trainer/__init__.py", line 3, in <module>
    from .trainer_common import parse_args_netspresso, set_arguments, trainer
  File "/ssd1/junho.shin/netspresso-trainer-dev/src/netspresso_trainer/trainer_common.py", line 9, in <module>
    from .dataloaders import build_dataloader, build_dataset
  File "/ssd1/junho.shin/netspresso-trainer-dev/src/netspresso_trainer/dataloaders/__init__.py", line 1, in <module>
    from .builder import build_dataloader, build_dataset
  File "/ssd1/junho.shin/netspresso-trainer-dev/src/netspresso_trainer/dataloaders/builder.py", line 6, in <module>
    from .detection import detection_collate_fn
  File "/ssd1/junho.shin/netspresso-trainer-dev/src/netspresso_trainer/dataloaders/detection/__init__.py", line 3, in <module>
    from .transforms import create_transform_detection
  File "/ssd1/junho.shin/netspresso-trainer-dev/src/netspresso_trainer/dataloaders/detection/transforms.py", line 7, in <module>
    from ..augmentation import custom as TC
  File "/ssd1/junho.shin/netspresso-trainer-dev/src/netspresso_trainer/dataloaders/augmentation/__init__.py", line 1, in <module>
    from .custom import (
  File "/ssd1/junho.shin/netspresso-trainer-dev/src/netspresso_trainer/dataloaders/augmentation/custom.py", line 2, in <module>
    from collections import Sequence
ImportError: cannot import name 'Sequence' from 'collections' (/ssd1/junho.shin/anaconda3/envs/np_trainer_python3.10.12/lib/python3.10/collections/__init__.py)

System Info

python=3.10.12

[Sprint] Model training yaml file simplification

Description

  • Suggest an easy-to-use command and configuration for the example trainings

Checklists

  • I would create a corresponding branch for this issue from the designated (mostly master) branch.
  • This issue only contains a preceding agreement between project owners.

[Feature Request] No onnx model is saved in graph model training

  • I have searched to see if a similar issue already exists.

Is there any feature that you would like to add?
The best model should also be saved in ONNX format when training a graph model.

if self.is_graphmodule_training:
    # Just save graphmodule checkpoint
    torch.save(model, model_path.with_suffix(".pt"))
    logger.debug(f"PyTorch FX model saved at {str(model_path.with_suffix('.pt'))}")
    if save_best_model:
        torch.save(model, best_model_path.with_suffix(".pt"))
        logger.info(f"Best model saved at {str(best_model_path.with_suffix('.pt'))}")
    return

Is there any suggestion (or solution) to solve this issue?
I think this can be implemented by simply adding a save_onnx call.
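For example, something along these lines (a sketch only, reusing save_onnx, save_best_model, and save_converted_model from the snippets in this and the related issue):

if self.is_graphmodule_training:
    # Save the graphmodule checkpoint as before ...
    torch.save(model, model_path.with_suffix(".pt"))
    if save_best_model:
        torch.save(model, best_model_path.with_suffix(".pt"))
        if save_converted_model:
            # ... and additionally export the best graph model to ONNX.
            save_onnx(model, best_model_path.with_suffix(".onnx"), sample_input=self.sample_input)
    return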

Additional context
All other contexts or screenshots are welcome.

[BUG] Run script example_train.sh is not correct

Describe the bug
The run script example_train.sh does not seem to be correct.

To Reproduce
The classification example in example_train.sh is shown below. The netspresso-train command cannot execute train.py.

netspresso-train\
  --data config/data/beans.yaml\
  --augmentation config/augmentation/classification.yaml\
  --model config/model/resnet/resnet50-classification.yaml\
  --training config/training/classification.yaml\
  --logging config/logging.yaml\
  --environment config/environment.yaml

Expected behavior
Like example_train_fx.sh, example_train.sh needs to be changed to execute python train.py. Below is the script of example_train_fx.sh.

python train_fx.py\
  --data config/data/beans.yaml\
  --augmentation config/augmentation/classification.yaml\
  --model config/model/resnet/resnet50-classification.yaml\
  --training config/training/classification.yaml\
  --logging config/logging.yaml\
  --environment config/environment.yaml\
  --fx-model-checkpoint classification_resnet_fx.pt

In addition, the comment block of example_train.sh should be changed.
From

#### HuggingFace datasets training
# To use HuggingFace datasets, you need to additionally install requirements-data.txt
# `pip install -r requirements-data.txt`
#### (END)

To

#### HuggingFace datasets training
# To use HuggingFace datasets, you need to additionally install requirements-optional.txt
# `pip install -r requirements-optional.txt`
#### (END)

because the repository provides requirements-optional.txt instead of requirements-data.txt.


[Sprint] PyNetsPresso compatibility check

Description

Checklists

  • I would create a corresponding branch for this issue from the designated (mostly master) branch.
  • This issue only contains a preceding agreement between project owners.

[Sprint] Add LR schedulers: StepLR, CosineAnnealingWarmRestarts

Description

Checklists

  • I would create a corresponding branch for this issue from the designated (mostly master) branch.
  • This issue only contains a preceding agreement between project owners.

Performance validation with the original repository

https://nota-workspace.slack.com/archives/C047JBNH7U6/p1678946243546009?thread_ts=1678943626.801589&cid=C047JBNH7U6

VOC2012
model | training crop | test (crop, stride) | mIoU-val (%) | pixAcc-val (%) | eval eph (val-best) | training time | batch size | gpu mem | gpu spec
segformer-b0 | 768 | (768, 512) | 65.83750734 | 91.92332466 | 186 / 250 | 13 h | 8 | 13911MiB | 3090f
segformer-b1 | 768 | (768, 512) | 71.9364075 | 93.54266384 | 152 / 250 | 17 h | 8 | 16807MiB | 3090f

ADE20K Exp2
model | training crop | test (crop, stride) | mIoU-val (%) | pixAcc-val (%) | eval eph (val-best) | training time | batch size | gpu mem | gpu spec
segformer-b0 | 512 | (512, 512) | 36.82355351 | 76.54075322 | 187 / 200 | 45 h | 16 | 17089MiB x 1 gpu | a100
segformer-b1 | 512 | (512, 512) | 39.95565802 | 78.47017266 | 178 / 200 | 56 h | 16 | 19659MiB x 1 gpu | a100

[Sprint] Add NetsPresso use case

Description

  • SegFormer training
  • PyNetsPresso compressor
  • SegFormer finetuning with compressed model

Checklists

  • I would create a corresponding branch for this issue from the designated (mostly master) branch.
  • This issue only contains a preceding agreement between project owners.

[Sprint][Important] Transferring the repository

Description

This issue contains a checklist for transferring this repository to Nota-NetsPresso/netspresso-trainer.

Checklists

Before Transfer

  • Check that there are no irrelevant (or outdated) tags or branches
  • Transfer the repository to Nota-NetsPresso organization
    • We decided to transfer this repository to preserve all the issue and PR histories.
    • The original repo at the same link will be removed.

After Transfer - private

  • Check all the absolute links (and relative links) in README.md and Docs
  • Check whether the docs page is rendered correctly
  • Check all issue statuses are kept as original
  • Check whether all the github actions work well

[Feature Request] Init of loss and metric module

  • I have searched to see if a similar issue already exists.

Is there any feature that you would like to add?

  • Loss and metric modules are initialized on every epoch.
  • I think they don't need to be initialized that frequently, because they do not change during the training procedure.
for num_epoch in range(START_EPOCH_ZERO_OR_ONE, self.conf.training.epochs + START_EPOCH_ZERO_OR_ONE):
    self.timer.start_record(name=f'train_epoch_{num_epoch}')
    self.loss = build_losses(self.conf.model, ignore_index=self.ignore_index)
    self.metric = build_metrics(self.task, self.conf.model, ignore_index=self.ignore_index, num_classes=self.num_classes)

Is there any suggestion (or solution) to solve this issue?

  • Loss and metric can be initialized in the set_train function.
  • If there are values that need to be reset to 0 after an epoch, we can add a reset method for that (see the sketch below).
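A minimal sketch of the suggestion (build_losses and build_metrics are taken from the snippet above; the reset method and the pipeline skeleton are assumptions, not the real class):

class BasePipelineSketch:   # illustrative skeleton only
    def set_train(self):
        # Build loss and metric once per training run.
        self.loss = build_losses(self.conf.model, ignore_index=self.ignore_index)
        self.metric = build_metrics(self.task, self.conf.model,
                                    ignore_index=self.ignore_index,
                                    num_classes=self.num_classes)

    def train(self):
        for num_epoch in range(START_EPOCH_ZERO_OR_ONE,
                               self.conf.training.epochs + START_EPOCH_ZERO_OR_ONE):
            self.timer.start_record(name=f'train_epoch_{num_epoch}')
            self.loss.reset()     # hypothetical per-epoch reset instead of re-building
            self.metric.reset()
            # ... train and validate one epoch ...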

[BUG] 'target' of the semantic segmentation loader doesn't match the number of classes

Describe the bug

In SegmentationCustomDataset line 35, the label is read with PIL.
If I extract the unique values of the label, I get an array like the one below.

np.unique(np.array(label))
> array([  0, 147, 151, 220], dtype=uint8)

Even after the label passes through the transform function (line 44), it still contains large values.

out['mask'].unique()
> tensor([0, 151, 220])

This causes an error when the trainer tries to compute the loss (cross entropy), because the model has 21 classes (including background), not over 200 classes.
The dataset has to convert mask pixel intensities to class indices.
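A minimal sketch of such a conversion (the value-to-class mapping below is purely illustrative; the real mapping depends on how the VOC masks are decoded):

import numpy as np
from PIL import Image

# Hypothetical remapping from raw mask pixel intensities to contiguous class indices.
VALUE_TO_CLASS = {0: 0, 147: 1, 151: 2, 220: 3}   # illustrative mapping only

def load_mask(path):
    label = np.array(Image.open(path))
    mask = np.zeros_like(label)
    for value, class_idx in VALUE_TO_CLASS.items():
        mask[label == value] = class_idx
    return mask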

Have you searched existing issues? 🔎

  • I have searched and found no existing issues

Reproduction

I downloaded VOC2012 datasets, and executed instruction below.

python ./train.py --data config/data/voc12.yaml --augmentation config/augmentation/segmentation.yaml --model config/model/segformer/segformer-segmentation.yaml --training config/training/segmentation.yaml --logging config/logging.yaml --environment config/environment.yaml

Screenshot

No response

Logs

(np_trainer) junho.shin@3090e:~/netspresso-trainer-dev$ python ./train.py --data config/data/voc12.yaml --augmentation config/augmentation/segmentation.yaml --model config/model/segformer/segformer-segmentation.yaml --training config/training/segmentation.yaml --logging config/logging.yaml --environment config/environment.yaml
2023-09-12_14:53:39 KST | INFO          | trainer:<trainer_common.py>:107 >>> Task: segmentation | Model: segformer | Training with torch.fx model? False
2023-09-12_14:53:39 KST | INFO          | build_dataset:<builder.py>:16 >>> ----------------------------------------
2023-09-12_14:53:39 KST | INFO          | build_dataset:<builder.py>:17 >>> Loading data...
2023-09-12_14:53:39 KST | INFO          | build_dataset:<builder.py>:88 >>> Summary | Dataset: <voc2012> (with local format)
2023-09-12_14:53:39 KST | INFO          | build_dataset:<builder.py>:89 >>> Summary | Training dataset: 1464 sample(s)
2023-09-12_14:53:39 KST | INFO          | build_dataset:<builder.py>:91 >>> Summary | Validation dataset: 1449 sample(s)
2023-09-12_14:53:41 KST | INFO          | train:<base.py>:150 >>> ----------------------------------------
  0%|          | 0/183 [00:00<?, ?it/s]../aten/src/ATen/native/cuda/NLLLoss2d.cu:104: nll_loss2d_forward_kernel: block: [10,0,0], thread: [416,0,0] Assertion `t >= 0 && t < n_classes` failed.
../aten/src/ATen/native/cuda/NLLLoss2d.cu:104: nll_loss2d_forward_kernel: block: [10,0,0], thread: [417,0,0] Assertion `t >= 0 && t < n_classes` failed.
(... the same `t >= 0 && t < n_classes` assertion is repeated for many more blocks and threads ...)
2023-09-12_14:53:43 KST | ERROR         | train:<base.py>:201 >>> CUDA error: device-side assert triggered                                                                                        
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Traceback (most recent call last):
  File "./train.py", line 5, in <module>
    trainer(is_graphmodule_training=False)
  File "/ssd1/junho.shin/netspresso-trainer-dev/src/netspresso_trainer/trainer_common.py", line 146, in trainer
    raise e
  File "/ssd1/junho.shin/netspresso-trainer-dev/src/netspresso_trainer/trainer_common.py", line 139, in trainer
    trainer.train()
  File "/ssd1/junho.shin/netspresso-trainer-dev/src/netspresso_trainer/pipelines/base.py", line 202, in train
    raise e
  File "/ssd1/junho.shin/netspresso-trainer-dev/src/netspresso_trainer/pipelines/base.py", line 162, in train
    self.train_one_epoch()
  File "/ssd1/junho.shin/netspresso-trainer-dev/src/netspresso_trainer/pipelines/base.py", line 206, in train_one_epoch
    self.train_step(batch)
  File "/ssd1/junho.shin/netspresso-trainer-dev/src/netspresso_trainer/pipelines/segmentation.py", line 39, in train_step
    self.loss_factory.backward()
  File "/ssd1/junho.shin/netspresso-trainer-dev/src/netspresso_trainer/losses/builder.py", line 54, in backward
    self.total_loss_for_backward.mean().backward()
  File "/ssd1/junho.shin/anaconda3/envs/np_trainer/lib/python3.8/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/ssd1/junho.shin/anaconda3/envs/np_trainer/lib/python3.8/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

System Info

No response

[Sprint] Model configuration with yaml file

Description

  • Add scalable model configuration and corresponding model families

Checklists

  • I would create a corresponding branch for this issue from the designated (mostly master) branch.
  • This issue only contains a preceding agreement between project owners.

[Sprint] Prettier Multi-GPU CLI

Description

  • Run multi-GPU options without starting with torch.distributed.run or torchrun.
  • Equalize with netspresso-trainer or netspresso-trainer-fx and add a --num-gpus option to select how many GPU(s) are used (see the sketch after this list).
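A minimal sketch of how such a CLI could hide the launcher (illustrative only; the --num-gpus flag, address/port, and worker body are assumptions, not the project's actual interface):

import argparse
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, num_gpus):
    # Each spawned process joins the process group and trains on its own GPU.
    dist.init_process_group("nccl", init_method="tcp://127.0.0.1:29500",
                            rank=rank, world_size=num_gpus)
    torch.cuda.set_device(rank)
    # ... build dataloaders/model and run the training pipeline here ...
    dist.destroy_process_group()

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--num-gpus", type=int, default=1)
    args = parser.parse_args()
    if args.num_gpus > 1:
        mp.spawn(worker, args=(args.num_gpus,), nprocs=args.num_gpus)
    else:
        worker(0, 1)   # single-GPU path without torchrun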

Checklists

  • I would create a corresponding branch for this issue from the designated (mostly master) branch.
  • This issue only contains a preceding agreement between project owners.

[Feature Request] TASK_MODEL_DICT is better to be on registry.py

  • I have searched to see if a similar issue already exists.

Is there any feature that you would like to add?
In src/netspresso_trainer/models/builder.py,
TASK_MODEL_DICT is defined inside the load_backbone_and_head_model function, but there is already a registry.py that contains most of the model dicts.
I think TASK_MODEL_DICT should be moved to registry.py.

Is there any suggestion (or solution) to solve this issue?
Move TASK_MODEL_DICT to registry.py.

Additional context
All other contexts or screenshots are welcome.

[Sprint] Docs: configuration

Description

  • Add a detailed explanation of the training configuration

Checklists

  • I would create a corresponding branch for this issue from the designated (mostly master) branch.
  • This issue only contains a preceding agreement between project owners.

[Sprint] Docs: update Getting Started pages

Description

  • Multi-GPU training #101
  • Configuration example and description for dataset, model, optimizer, logging

Checklists

  • I would create a corresponding branch for this issue from the designated (mostly master) branch.
  • This issue only contains a preceding agreement between project owners.

[Feature Request] Sync model calibration metric with PyNetsPresso

  • I have searched to see if a similar issue already exists.

Is there any feature that you would like to add?

  • Until now, all insiders have been using the thop library as a FLOPs counter for PyTorch models.
  • There will be some major changes in our service, including that the library for checking FLOPs will be changed to fvcore.
    • fvcore is widely known to have the best coverage of torch operators.

Is there any suggestion (or solution) to solve this issue?

  • Change thop to fvcore
  • Use the fvcore.nn.flop_count API (see the sketch after this list)
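A minimal usage sketch with fvcore (assuming fvcore and torchvision are installed; the ResNet-50 model and input shape are only an example):

import torch
import torchvision
from fvcore.nn import FlopCountAnalysis, flop_count

model = torchvision.models.resnet50()
inputs = (torch.randn(1, 3, 224, 224),)

# flop_count returns a per-operator GFLOPs dict plus a dict of skipped ops.
gflops_per_op, skipped = flop_count(model, inputs)
print(sum(gflops_per_op.values()))

# FlopCountAnalysis gives the total count and per-module breakdowns.
flops = FlopCountAnalysis(model, inputs)
print(flops.total())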

Additional context
N/A

[BUG] Valid loss and metric as zero

Describe the bug
Validation loss and metric are reported as zero.

To Reproduce
Steps to reproduce the behavior:

  1. Run pure train code

Expected behavior

  • Non-zero value for both loss and metric

Screenshots

2023-09-11_02:32:21 UTC | INFO          | train:<base.py>:150 >>> ----------------------------------------
2023-09-11_02:32:32 UTC | INFO          | __call__:<stdout.py>:27 >>> Epoch: 1 / 3                                                                                                                                                                                                                 
2023-09-11_02:32:32 UTC | INFO          | __call__:<stdout.py>:30 >>> learning rate: 0.0000200
2023-09-11_02:32:32 UTC | INFO          | __call__:<stdout.py>:32 >>> elapsed_time: 10.8038218
2023-09-11_02:32:32 UTC | INFO          | __call__:<stdout.py>:33 >>> training loss: 0.9278503
2023-09-11_02:32:32 UTC | INFO          | __call__:<stdout.py>:34 >>> training metric: [('Acc@1', 62.263513513513516), ('Acc@5', 100.0)]
2023-09-11_02:32:32 UTC | INFO          | __call__:<stdout.py>:37 >>> validation loss: 0.0000000
2023-09-11_02:32:32 UTC | INFO          | __call__:<stdout.py>:39 >>> validation metric: [('Acc@1', 0.0), ('Acc@5', 0.0)]
2023-09-11_02:32:33 UTC | INFO          | save_checkpoint:<base.py>:287 >>> Best model saved at outputs/classification_efficientformer/version_2/classification_efficientformer_best.pth
2023-09-11_02:32:33 UTC | INFO          | save_summary:<base.py>:324 >>> Model training summary saved at outputs/classification_efficientformer/version_2/training_summary.ckpt

[Sprint] Fully support for task head: segmentation

Description

  • Add segmentation support for existing backbones:
    • ResNet
    • MobileViT
  • Add the FAN segmentation head and support all backbones for segmentation:

Checklists

  • I would create a corresponding branch for this issue from the designated (mostly master) branch.
  • This issue only contains a preceding agreement between project owners.

[Feature Request] Checkpoint and best model saving

  • I have searched to see if a similar issue already exists.

Is there any feature that you would like to add?

  • The model is saved only when the training is done.
  • I think the best model and intermediate checkpoints should also be saved.

Is there any suggestion (or solution) to solve this issue?

  • Save the best model when the validation score reaches a new best
  • Handle model saving interval through the config file

[Sprint] LR Scheduler simulator with training configuration

Description

  • Add a configuration simulator with Gradio to directly see how the learning rate scheduler behaves (see the sketch below)
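A minimal sketch of such a simulator (assuming gradio and matplotlib are installed; the exposed parameters and the CosineAnnealingWarmRestarts choice are only an example):

import gradio as gr
import matplotlib.pyplot as plt
import torch

def simulate_lr(initial_lr, t_0, epochs):
    # Drive a throwaway optimizer through the scheduler to record the LR curve.
    opt = torch.optim.SGD([torch.zeros(1, requires_grad=True)], lr=initial_lr)
    sched = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(opt, T_0=int(t_0))
    lrs = []
    for _ in range(int(epochs)):
        lrs.append(opt.param_groups[0]["lr"])
        sched.step()
    fig, ax = plt.subplots()
    ax.plot(lrs)
    ax.set_xlabel("epoch")
    ax.set_ylabel("learning rate")
    return fig

demo = gr.Interface(fn=simulate_lr,
                    inputs=[gr.Number(value=0.01), gr.Number(value=10), gr.Number(value=100)],
                    outputs=gr.Plot())
demo.launch()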

Checklists

  • I would create a corresponding branch for this issue from the designated (mostly master) branch.
  • This issue only contains a preceding agreement between project owners.
