open-mmlab / mmengine
OpenMMLab Foundational Library for Training Deep Learning Models
Home Page: https://mmengine.readthedocs.io/
License: Apache License 2.0
mmengine/mmengine/model/averaged_model.py
Lines 87 to 104 in ba6b061
self.steps starts from 0. Should we change this condition to (self.step + 1) % self.interval == 0?
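A quick way to see the difference between the two conditions: with 0-based steps, step % interval == 0 fires at step 0 (before any full interval has elapsed), while the proposed (step + 1) % interval == 0 fires once every interval steps. A small sketch (the helper below is illustrative, not mmengine code):

```python
def updates(condition, interval, total_steps):
    """Return the 0-based steps at which an EMA update would fire."""
    return [s for s in range(total_steps) if condition(s, interval)]

# Current condition: fires at step 0, before `interval` steps have happened.
current = updates(lambda s, n: s % n == 0, 2, 8)
# Proposed condition: fires after every `interval`-th step.
proposed = updates(lambda s, n: (s + 1) % n == 0, 2, 8)

print(current)   # [0, 2, 4, 6]
print(proposed)  # [1, 3, 5, 7]
```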
Describe the feature
Add documentation to show how to evaluate multiple datasets with multiple metrics and use one of the metrics of a dataset as the best indicator.
Motivation
Users might need to evaluate different metrics on multiple datasets.
In such a case, only one metric on one dataset needs to be selected to indicate whether the model is the best model and should be saved.
It is unnecessary for MMEngine to officially support this feature, but MMEngine allows users to create a new Loop class to implement it. Therefore, we should update the documentation to show such an example.
Related resources
See a previous PR in MMSeg open-mmlab/mmsegmentation#1461
Additional context
Add any other context or screenshots about the feature request here.
If you would like to implement the feature and create a PR, please leave a comment here and that would be much appreciated.
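The selection logic such a custom Loop would implement can be sketched in plain Python (this is not the mmengine API; the names run_val, evaluators, and indicator are illustrative): evaluate every dataset with its own metrics, flatten the results under 'dataset/metric' keys, and read one designated key as the best-checkpoint indicator.

```python
def run_val(evaluators, indicator):
    """Evaluate every (dataset_name, eval_fn) pair and pick one metric
    as the best-checkpoint indicator.

    `indicator` is a 'dataset/metric' key, e.g. 'cityscapes/mIoU'.
    """
    all_metrics = {}
    for name, eval_fn in evaluators.items():
        for metric, value in eval_fn().items():
            all_metrics[f'{name}/{metric}'] = value
    return all_metrics, all_metrics[indicator]

# Two datasets with different metric sets; cityscapes/mIoU drives best-ckpt saving.
metrics, score = run_val(
    {'cityscapes': lambda: {'mIoU': 0.78, 'mAcc': 0.85},
     'ade20k': lambda: {'mIoU': 0.42}},
    indicator='cityscapes/mIoU')
print(score)  # 0.78
```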
TODO: inner_iter should be removed after refactoring LoggerHook.
Originally posted by @zhouzaida in #140 (comment)
See open-mmlab/mmdetection#7543.
TensorboardLoggerHook uses the iteration count regardless of whether the training is iteration-based or epoch-based.
Unify the copy directory of config and model-index.yml, which are currently copied to package/.mim by mim. Should we consider copying them to package/.mmengine instead?
Describe the feature
MLU has been supported in mmdet since open-mmlab/mmdetection#7578.
The new runner should also support MLUDDP.
Before release, all the figures in docs should be updated by unified drawing tools, following a unified color style with a watermark.
The original optimizer constructor does not log anything, which makes it difficult to debug or check the effectiveness of parameter-wise settings. We should enhance the logging when MMEngine is released.
Originally posted by @ZwwWayne in #25 (comment)
def before_run(self, runner: Runner) -> None:
Originally posted by @zhouzaida in #47 (comment)
Most of the hooks' unit tests use a mocked runner. We need to use a real runner to improve the reliability of these unit tests.
We could add a figure to explain the differences among rank, local_rank, world_size, and local_size; see https://user-images.githubusercontent.com/875518/77676984-4c81e400-6f4c-11ea-87d8-f2ff505a99da.png for reference.
Originally posted by @zhouzaida in #45 (comment)
Describe the feature
We have already supported multiple file clients in BaseDataset, but some arguments still do not support them. In particular, the use of the os.path package may cause incompatibility across different file clients; please check this.
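One concrete pitfall: os.path functions assume local filesystem paths, so applying them to client-prefixed URIs silently corrupts the prefix (POSIX behavior shown; the s3:// paths are made up for illustration):

```python
import os.path as osp

# normpath collapses the double slash of a URI scheme, so the client
# prefix is silently corrupted for non-local backends.
print(osp.normpath('s3://bucket/annotations/train.json'))  # s3:/bucket/annotations/train.json

# join discards the prefix entirely when a later component is absolute.
print(osp.join('s3://bucket', '/data/train.json'))  # /data/train.json
```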
Sometimes an experiment contains multiple process groups. However, communication in many places uses the default group, which is incompatible with some settings.
Describe the feature
When I run fcos_r50_caffe_fpn_gn-head_1x_coco.py, which has the setting default_hooks = dict(optimizer=dict(type='OptimizerHook', grad_clip=dict(max_norm=35, norm_type=2))), the program reports the error below.
File "/mnt/cache/wangjiabao1.vendor/workspace/refactor/mmengine/mmengine/runner/runner.py", line 1304, in call_hook
    getattr(hook, fn_name)(self, **kwargs)
File "/mnt/cache/wangjiabao1.vendor/workspace/refactor/mmengine/mmengine/hooks/optimizer_hook.py", line 98, in after_train_iter
    runner.log_buffer.update({'grad_norm': float(grad_norm)},
                             outputs=self.runner.outputs)
AttributeError: 'Runner' object has no attribute 'log_buffer'
phoenix-srun: error: SH-IDC1-10-140-0-252: tasks 0-7: Exited with exit code 1
phoenix-srun: Terminating job step 1084231.0
I wonder whether log_buffer needs to be replaced with message_hub or logger.
After users finish their configuration, how do they know whether it matches their expectations? Should we also provide a script for this, so that after running it users can easily see which parameters are frozen and how the hyper-parameters differ across parameter groups? If there is no time to develop this now, it can be kept as a future requirement.
Originally posted by @hhaAndroid in #25 (comment)
Should we use def get(self, key, default=None)?
Originally posted by @zhouzaida in #29 (comment)
[ ] Update the code docstrings and tutorials
Originally posted by @Harold-lkk in #143 (comment)
Describe the feature
Motivation
A clear and concise description of the motivation of the feature.
Ex1. It is inconvenient when [....].
Ex2. There is a recent paper [....], which is very helpful for [....].
Related resources
If there is an official code release or third-party implementations, please also provide the information here, which would be very helpful.
Additional context
Add any other context or screenshots about the feature request here.
If you would like to implement the feature and create a PR, please leave a comment here and that would be much appreciated.
TODO: It would be helpful if a visualization tool can be provided.
Originally posted by @zhouzaida in #2 (comment)
No, that would not work. In that case, with lazy_init=True, the contents of meta are the user-passed meta (high priority) and the class attribute BaseDataset.META dict (low priority). Later, calling full_init reads the meta in the annotation file (medium priority), and it is unclear how keys in the medium-priority meta should override keys in the high- and low-priority metas.
Originally posted by @GT9505 in #7 (comment)
Thanks for your error report and we appreciate it a lot.
Checklist
Describe the bug
A clear and concise description of what the bug is.
mmengine/.github/workflows/build.yml
Line 211 in 3adf4ea
Reproduction
A placeholder for the command.
Environment
Run python mmdet/utils/collect_env.py to collect the necessary environment information and paste it here. You may also list related environment variables (such as $PATH, $LD_LIBRARY_PATH, $PYTHONPATH, etc.).
Error traceback
If applicable, paste the error traceback here.
A placeholder for traceback.
Bug fix
If you have already identified the reason, you can provide the information here. If you are willing to create a PR to fix it, please also leave a comment here and that would be much appreciated!
Describe the feature
Motivation
It is quite common that users need to adjust the LR based on their number of GPUs. A brief solution might be:
add an argument like default_batchsize somewhere; when the param_scheduler is initialized, calculate the real batch size and scale the LR by their ratio. This enables different repos to set different default_batchsize values for their own needs.
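The scaling rule described above is the standard linear rule; a minimal sketch (the name default_batchsize comes from the proposal, the helper itself is illustrative):

```python
def auto_scale_lr(base_lr, real_batch_size, default_batchsize):
    """Linear scaling rule: scale the LR by the ratio of the actual total
    batch size to the batch size the config's LR was tuned for."""
    return base_lr * real_batch_size / default_batchsize

# Config tuned for 8 GPUs x 2 imgs/GPU (default_batchsize=16),
# actually run on 4 GPUs x 2 imgs/GPU (real_batch_size=8).
scaled = auto_scale_lr(0.02, real_batch_size=8, default_batchsize=16)
print(scaled)  # 0.01
```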
Related resources
See auto_scale_lr in mmdet.
Additional context
Add any other context or screenshots about the feature request here.
If you would like to implement the feature and create a PR, please leave a comment here and that would be much appreciated.
We may change _prepare_data() in BaseDataset from a private function to a public one (i.e., prepare_data()), since the function may be overridden to add some data into data_info in some cases.
to_dict or print should ignore the internal keys of properties
Motivation
to_dict and print should ignore the internal keys of properties.
The SyncBuffer interface is not implemented; we should clean up the code interface while avoiding BC-breaking changes in the future.
Originally posted by @ZwwWayne in #66 (comment)
I would suggest moving the build_evaluator part into the building function of the registry EVALUATORS, so that ComposedEvaluator can be directly built with EVALUATORS.build().
Originally posted by @ly015 in #46 (comment)
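A minimal illustration of the suggested dispatch (the Registry and ComposedEvaluator below are simplified stubs, not mmengine's actual implementations): when build() receives a list of configs, it builds each item and wraps them automatically.

```python
class ComposedEvaluator:
    """Stub: holds a list of child evaluators."""
    def __init__(self, evaluators):
        self.evaluators = evaluators

class Registry:
    """Stub: a real Registry maps type names to classes."""
    def __init__(self):
        self._module_dict = {}

    def register(self, cls):
        self._module_dict[cls.__name__] = cls
        return cls

    def build(self, cfg):
        # The suggestion above: a list config builds each item and wraps
        # the results in a ComposedEvaluator directly inside build().
        if isinstance(cfg, list):
            return ComposedEvaluator([self.build(c) for c in cfg])
        cfg = dict(cfg)
        cls = self._module_dict[cfg.pop('type')]
        return cls(**cfg)

EVALUATORS = Registry()

@EVALUATORS.register
class Accuracy:
    def __init__(self, topk=1):
        self.topk = topk

evaluator = EVALUATORS.build([dict(type='Accuracy'), dict(type='Accuracy', topk=5)])
```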
Once mmengine.dist is implemented, all distribution-related modules should use the newest mmengine.dist.
https://github.com/open-mmlab/mmengine/blob/main/mmengine/registry/registry.py#L352-L355
Referring to this part, the get function will find modules from children if the scope of key is a child of the current registry. However, self._children[scope].get(real_key) will search for modules from self._children[scope] up to the root.
Is this the desired behavior?
See discussion and adopt changes in open-mmlab/mmcv#1844.
The Evaluator should support standalone use. For example, if the model outputs are already available, how can the Evaluator be called to compute metrics?
Originally posted by @Harold-lkk in #33 (comment)
TODO
Originally posted by @zhouzaida in #59 (comment)
TODO: design a better and flexible way to load checkpoints from openmmlab
Originally posted by @zhouzaida in #93 (comment)
Support FSDP for large model training
Leave a TODO here; this function may be updated according to open-mmlab/mmcv#1682.
Originally posted by @ZwwWayne in #59 (comment)
The range of config content supported by the YAML/JSON/PY formats differs. For example, JSON does not support tuples, so a tuple in a PY config becomes a list after being dumped to JSON.
In addition, some functional interfaces are only supported for Python configs; in such cases the interfaces and documentation should say so.
The API docs or the config tutorial should clearly list the level of support for each config format, as well as the limitations of and differences among the formats.
See PR open-mmlab/mmcv#1796
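The tuple-to-list conversion mentioned above is easy to reproduce with the standard json module alone (the img_scale key is just an example):

```python
import json

cfg = dict(img_scale=(1333, 800))  # a tuple in a Python config
roundtripped = json.loads(json.dumps(cfg))

# JSON has no tuple type, so the tuple silently comes back as a list.
print(type(roundtripped['img_scale']))  # <class 'list'>
```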
Describe the feature
Motivation
Now, val_loop cannot return the loss, which is inconvenient when users would like to record and monitor the validation loss.
Related resources
mmengine/mmengine/dataset/base_dataset.py
Line 517 in 9a61b38
serialized_data_infos_list has already been serialized, so why should we further concatenate it and access it by data_address? If data_bytes is not concatenated, we can access it directly by index.
The naming conventions should eventually link to each repo's documentation or become a standalone naming-convention document.
Originally posted by @Harold-lkk in #9 (comment)
We need to re-organize fileio modules and other related utils when time is allowed.
Originally posted by @ZwwWayne in #17 (comment)
type is usually a class name, so it should be in upper camel case.
Originally posted by @RangiLyu in #33 (comment)
What if there is more than one param group in the optimizer? And how should we handle generation tasks, which have multiple optimizers for the generator and the discriminator?
Originally posted by @RangiLyu in #155 (comment)
From zaida: I feel we need an example showing how to use mmengine to train a CIFAR-10 classification task; the example should briefly introduce each component along the way and point to the corresponding documentation.
Originally posted by @ZwwWayne in #197 (comment)
Sync random seed in distributed sampler across ranks.
See open-mmlab/mmdetection#7432 and open-mmlab/mmdetection#7440
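The idea behind syncing the seed is that rank 0 draws one value and every rank adopts it. A torch-free sketch of that logic (the broadcast_fn stand-in plays the role of torch.distributed.broadcast; all names here are illustrative):

```python
import random

def sync_random_seed(rank, broadcast_fn):
    """Rank 0 draws a seed; every rank adopts rank 0's value."""
    local_seed = random.randint(0, 2**31 - 1)
    return broadcast_fn(local_seed, src_rank=0, rank=rank)

def make_broadcast():
    """Stand-in for a real broadcast: hands rank 0's value to all ranks."""
    box = {}
    def broadcast_fn(value, src_rank, rank):
        if rank == src_rank:
            box['seed'] = value
        return box['seed']
    return broadcast_fn

# Simulate a 4-rank job: every rank ends up with the same sampler seed.
broadcast_fn = make_broadcast()
seeds = [sync_random_seed(rank, broadcast_fn) for rank in range(4)]
print(len(set(seeds)))  # 1
```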
Describe the feature
Motivation
Since storage is limited, more and more users save their checkpoints in ceph and leave no checkpoints in the local working directory. However, when resuming a job, the auto-resume function can only find checkpoints in the local path and cannot automatically load checkpoints saved in ceph.
To solve this issue, a naive description can be as below:
When saving checkpoints during training, no matter where a checkpoint is saved, save last_checkpoint.txt in both the local and ceph working directories, indicating the real path of the latest checkpoint (either local storage or ceph). When auto-resuming during training, read the file and load the checkpoint at the path it contains. Thus, users can safely auto-resume with a command like the one below:
sh ./tools/slurm_train.sh $PARTITION $CONFIG $WORK_DIR --auto-resume
Or users can manually resume the model in a unified way, no matter where the latest checkpoint is saved:
sh ./tools/slurm_train.sh $PARTITION $CONFIG $WORK_DIR --load-from $WORK_DIR/last_checkpoint --resume
The last_checkpoint.txt file serves as a soft link to the latest checkpoint across platforms and works for any kind of storage.
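The marker-file mechanism described above can be sketched in a few lines (the file name last_checkpoint and both helper names are illustrative; a real implementation would write through the file client so the marker also lands in ceph):

```python
import os
import tempfile

def save_last_checkpoint(work_dir, ckpt_path):
    """Record the real path of the latest checkpoint, wherever it lives."""
    with open(os.path.join(work_dir, 'last_checkpoint'), 'w') as f:
        f.write(ckpt_path)

def find_latest_checkpoint(work_dir):
    """Resolve the latest checkpoint for auto-resume, local or remote."""
    marker = os.path.join(work_dir, 'last_checkpoint')
    if not os.path.exists(marker):
        return None
    with open(marker) as f:
        return f.read().strip()

# The checkpoint itself may live in remote storage; only the marker is local.
work_dir = tempfile.mkdtemp()
save_last_checkpoint(work_dir, 's3://bucket/exp1/epoch_12.pth')
print(find_latest_checkpoint(work_dir))  # s3://bucket/exp1/epoch_12.pth
```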
Related resources
If there is an official code release or third-party implementations, please also provide the information here, which would be very helpful.
Additional context
Detectron2 has a similar design.