open-mmlab / mmengine
OpenMMLab Foundational Library for Training Deep Learning Models
Home Page: https://mmengine.readthedocs.io/
License: Apache License 2.0
mmengine/mmengine/model/averaged_model.py
Lines 87 to 104 in ba6b061
self.steps starts from 0. Should we change this condition to (self.step + 1) % self.interval == 0?
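A quick way to see the difference between the two conditions: with 0-based steps, step % interval == 0 fires at step 0 (before any full interval has elapsed), while the proposed (step + 1) % interval == 0 fires once every interval steps. A small sketch (the helper below is illustrative, not mmengine code):

```python
def updates(condition, interval, total_steps):
    """Return the 0-based steps at which an EMA update would fire."""
    return [s for s in range(total_steps) if condition(s, interval)]

# Current condition: fires at step 0, before `interval` steps have happened.
current = updates(lambda s, n: s % n == 0, 2, 8)
# Proposed condition: fires after every `interval`-th step.
proposed = updates(lambda s, n: (s + 1) % n == 0, 2, 8)

print(current)   # [0, 2, 4, 6]
print(proposed)  # [1, 3, 5, 7]
```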
Describe the feature
Add documentation to show how to evaluate multiple datasets with multiple metrics and use one of the metrics of a dataset as the best indicator.
Motivation
Users might need to evaluate different metrics on multiple datasets.
In such a case, only one metric on one dataset needs to be selected to indicate whether the model is the best model and should be saved.
It is unnecessary for MMEngine to officially support this feature, but MMEngine allows users to create a new Loop class to implement it. Therefore, we should update the documentation to show such an example.
Related resources
See a previous PR in MMSeg open-mmlab/mmsegmentation#1461
Additional context
Add any other context or screenshots about the feature request here.
If you would like to implement the feature and create a PR, please leave a comment here and that would be much appreciated.
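The selection logic such a custom Loop would implement can be sketched in plain Python (this is not the mmengine API; the names run_val, evaluators, and indicator are illustrative): evaluate every dataset with its own metrics, flatten the results under 'dataset/metric' keys, and read one designated key as the best-checkpoint indicator.

```python
def run_val(evaluators, indicator):
    """Evaluate every (dataset_name, eval_fn) pair and pick one metric
    as the best-checkpoint indicator.

    `indicator` is a 'dataset/metric' key, e.g. 'cityscapes/mIoU'.
    """
    all_metrics = {}
    for name, eval_fn in evaluators.items():
        for metric, value in eval_fn().items():
            all_metrics[f'{name}/{metric}'] = value
    return all_metrics, all_metrics[indicator]

# Two datasets with different metric sets; cityscapes/mIoU drives best-ckpt saving.
metrics, score = run_val(
    {'cityscapes': lambda: {'mIoU': 0.78, 'mAcc': 0.85},
     'ade20k': lambda: {'mIoU': 0.42}},
    indicator='cityscapes/mIoU')
print(score)  # 0.78
```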
TODO: inner_iter should be removed after refactoring LoggerHook.
Originally posted by @zhouzaida in #140 (comment)
See open-mmlab/mmdetection#7543.
TensorboardLoggerHook uses the iteration count regardless of whether the training is iteration-based or epoch-based.
Unify the copy directory of config and model-index.yml, which are currently copied to package/.mim by mim. Should we consider copying them to package/.mmengine instead?
Describe the feature
MLU has been supported in mmdet since open-mmlab/mmdetection#7578.
The new runner should also support MLUDDP.
Before release, all the figures in docs should be updated by unified drawing tools, following a unified color style with a watermark.
The original optimizer constructor does not log anything, which makes it difficult to debug or check the effectiveness of parameter-wise settings. We should enhance the logging when MMEngine is released.
Originally posted by @ZwwWayne in #25 (comment)
def before_run(self, runner: Runner) -> None:
Originally posted by @zhouzaida in #47 (comment)
Most of the hooks' unit tests use a mocked runner. We need to use a real runner to improve the reliability of these unit tests.
We could add a figure to explain the differences among rank, local_rank, world_size, and local_size; see https://user-images.githubusercontent.com/875518/77676984-4c81e400-6f4c-11ea-87d8-f2ff505a99da.png for reference.
Originally posted by @zhouzaida in #45 (comment)
Describe the feature
We have already supported multiple file clients in BaseDataset, but some arguments still do not support them. In particular, the use of the os.path package may cause incompatibility across different file clients; please check this.
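One concrete pitfall: os.path functions assume local filesystem paths, so applying them to client-prefixed URIs silently corrupts the prefix (POSIX behavior shown; the s3:// paths are made up for illustration):

```python
import os.path as osp

# normpath collapses the double slash of a URI scheme, so the client
# prefix is silently corrupted for non-local backends.
print(osp.normpath('s3://bucket/annotations/train.json'))  # s3:/bucket/annotations/train.json

# join discards the prefix entirely when a later component is absolute.
print(osp.join('s3://bucket', '/data/train.json'))  # /data/train.json
```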
Sometimes an experiment contains multiple process groups. However, communication in many places uses the default group, which is incompatible with some settings.
Describe the feature
When I run fcos_r50_caffe_fpn_gn-head_1x_coco.py, which has the setting default_hooks = dict(optimizer=dict(type='OptimizerHook', grad_clip=dict(max_norm=35, norm_type=2))), the program reports the error below.
File "/mnt/cache/wangjiabao1.vendor/workspace/refactor/mmengine/mmengine/runner/runner.py", line 1304, in call_hook
    getattr(hook, fn_name)(self, **kwargs)
File "/mnt/cache/wangjiabao1.vendor/workspace/refactor/mmengine/mmengine/hooks/optimizer_hook.py", line 98, in after_train_iter
    runner.log_buffer.update({'grad_norm': float(grad_norm)},
                             outputs=self.runner.outputs)
AttributeError: 'Runner' object has no attribute 'log_buffer'
phoenix-srun: error: SH-IDC1-10-140-0-252: tasks 0-7: Exited with exit code 1
phoenix-srun: Terminating job step 1084231.0
I wonder whether log_buffer needs to be replaced with message_hub or logger.
After users finish their configuration, how do they know whether it matches their expectations? Should we also provide a script for this, so that after running it users can easily see which parameters are frozen and how the hyper-parameters differ across parameter groups? If there is no time to develop this now, it can be kept as a future requirement.
Originally posted by @hhaAndroid in #25 (comment)
Should we use def get(self, key, default=None)?
Originally posted by @zhouzaida in #29 (comment)
[ ] Update the code docstrings and tutorials
Originally posted by @Harold-lkk in #143 (comment)
Describe the feature
Motivation
A clear and concise description of the motivation of the feature.
Ex1. It is inconvenient when [....].
Ex2. There is a recent paper [....], which is very helpful for [....].
Related resources
If there is an official code release or third-party implementations, please also provide the information here, which would be very helpful.
Additional context
Add any other context or screenshots about the feature request here.
If you would like to implement the feature and create a PR, please leave a comment here and that would be much appreciated.
TODO: It would be helpful if a visualization tool can be provided.
Originally posted by @zhouzaida in #2 (comment)
No, that would not work. In that case, with lazy_init=True, the contents of meta are the user-passed meta (high priority) and the class attribute BaseDataset.META dict (low priority). Later, calling full_init reads the meta in the annotation file (medium priority), and it is unclear how keys in the medium-priority meta should override keys in the high- and low-priority metas.
Originally posted by @GT9505 in #7 (comment)
Thanks for your error report and we appreciate it a lot.
Checklist
Describe the bug
A clear and concise description of what the bug is.
mmengine/.github/workflows/build.yml
Line 211 in 3adf4ea
Reproduction
A placeholder for the command.
Environment
Run python mmdet/utils/collect_env.py to collect the necessary environment information and paste it here. You may also list related environment variables (such as $PATH, $LD_LIBRARY_PATH, $PYTHONPATH, etc.).
Error traceback
If applicable, paste the error traceback here.
A placeholder for traceback.
Bug fix
If you have already identified the reason, you can provide the information here. If you are willing to create a PR to fix it, please also leave a comment here and that would be much appreciated!
Describe the feature
Motivation
It is quite common that users need to adjust the LR based on their number of GPUs. A brief solution might be:
add an argument like default_batchsize somewhere; when the param_scheduler is initialized, calculate the real batch size and scale the LR by their ratio. This enables different repos to set different default_batchsize values for their own needs.
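The scaling rule described above is the standard linear rule; a minimal sketch (the name default_batchsize comes from the proposal, the helper itself is illustrative):

```python
def auto_scale_lr(base_lr, real_batch_size, default_batchsize):
    """Linear scaling rule: scale the LR by the ratio of the actual total
    batch size to the batch size the config's LR was tuned for."""
    return base_lr * real_batch_size / default_batchsize

# Config tuned for 8 GPUs x 2 imgs/GPU (default_batchsize=16),
# actually run on 4 GPUs x 2 imgs/GPU (real_batch_size=8).
scaled = auto_scale_lr(0.02, real_batch_size=8, default_batchsize=16)
print(scaled)  # 0.01
```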
Related resources
See auto_scale_lr in mmdet.
Additional context
Add any other context or screenshots about the feature request here.
If you would like to implement the feature and create a PR, please leave a comment here and that would be much appreciated.
We may change _prepare_data() in BaseDataset from a private function to a public one (i.e., prepare_data()), since the function may be overridden to add some data into data_info in some cases.
to_dict or print should ignore the internal keys of properties
Motivation
to_dict and print should ignore the internal keys of properties.
The SyncBuffer interface is not implemented; we should clean up the code interface while avoiding BC-breaking changes in the future.
Originally posted by @ZwwWayne in #66 (comment)
I would suggest moving the build_evaluator part into the building function of the registry EVALUATORS, so that ComposedEvaluator can be directly built with EVALUATORS.build().
Originally posted by @ly015 in #46 (comment)
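A minimal illustration of the suggested dispatch (the Registry and ComposedEvaluator below are simplified stubs, not mmengine's actual implementations): when build() receives a list of configs, it builds each item and wraps them automatically.

```python
class ComposedEvaluator:
    """Stub: holds a list of child evaluators."""
    def __init__(self, evaluators):
        self.evaluators = evaluators

class Registry:
    """Stub: a real Registry maps type names to classes."""
    def __init__(self):
        self._module_dict = {}

    def register(self, cls):
        self._module_dict[cls.__name__] = cls
        return cls

    def build(self, cfg):
        # The suggestion above: a list config builds each item and wraps
        # the results in a ComposedEvaluator directly inside build().
        if isinstance(cfg, list):
            return ComposedEvaluator([self.build(c) for c in cfg])
        cfg = dict(cfg)
        cls = self._module_dict[cfg.pop('type')]
        return cls(**cfg)

EVALUATORS = Registry()

@EVALUATORS.register
class Accuracy:
    def __init__(self, topk=1):
        self.topk = topk

evaluator = EVALUATORS.build([dict(type='Accuracy'), dict(type='Accuracy', topk=5)])
```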
Once mmengine.dist is implemented, all distribution-related modules should use the newest mmengine.dist.
https://github.com/open-mmlab/mmengine/blob/main/mmengine/registry/registry.py#L352-L355
Referring to this part, the get function will find modules from children if the scope of key is a child of the current registry. However, self._children[scope].get(real_key) will search for modules from self._children[scope] up to the root.
Is this the desired behavior?
See discussion and adopt changes in open-mmlab/mmcv#1844.
The Evaluator should support standalone use. For example, if the model outputs are already available, how can the Evaluator be called to compute metrics?
Originally posted by @Harold-lkk in #33 (comment)
TODO
Originally posted by @zhouzaida in #59 (comment)
TODO: design a better and flexible way to load checkpoints from openmmlab
Originally posted by @zhouzaida in #93 (comment)
Support FSDP for large model training
Leave a TODO here; this function may be updated according to open-mmlab/mmcv#1682.
Originally posted by @ZwwWayne in #59 (comment)
The range of config content supported by the YAML/JSON/PY formats differs. For example, JSON does not support tuples, so a tuple in a PY config becomes a list after being dumped to JSON.
In addition, some functional interfaces are only supported for Python configs; in such cases the interfaces and documentation should say so.
The API docs or the config tutorial should clearly list the level of support for each config format, as well as the limitations of and differences among the formats.
See PR open-mmlab/mmcv#1796
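The tuple-to-list conversion mentioned above is easy to reproduce with the standard json module alone (the img_scale key is just an example):

```python
import json

cfg = dict(img_scale=(1333, 800))  # a tuple in a Python config
roundtripped = json.loads(json.dumps(cfg))

# JSON has no tuple type, so the tuple silently comes back as a list.
print(type(roundtripped['img_scale']))  # <class 'list'>
```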
Describe the feature
Motivation
Now, val_loop cannot return the loss, which is inconvenient when users would like to record and monitor the validation loss.
Related resources
mmengine/mmengine/dataset/base_dataset.py
Line 517 in 9a61b38
serialized_data_infos_list has already been serialized, so why should we further concatenate it and access it by data_address? If data_bytes is not concatenated, we can access it directly by index.
The naming conventions should eventually link to each repo's documentation or become a standalone naming-convention document.
Originally posted by @Harold-lkk in #9 (comment)
We need to re-organize fileio modules and other related utils when time is allowed.
Originally posted by @ZwwWayne in #17 (comment)
type is usually a class name, so it should be in upper camel case.
Originally posted by @RangiLyu in #33 (comment)
What if there is more than one param group in the optimizer? And how should we handle generation tasks, which have multiple optimizers for the generator and the discriminator?
Originally posted by @RangiLyu in #155 (comment)
From zaida: I feel we need an example showing how to use mmengine to train a CIFAR-10 classification task; the example should briefly introduce each component along the way and point to the corresponding documentation.
Originally posted by @ZwwWayne in #197 (comment)
Sync random seed in distributed sampler across ranks.
See open-mmlab/mmdetection#7432 and open-mmlab/mmdetection#7440
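The idea behind syncing the seed is that rank 0 draws one value and every rank adopts it. A torch-free sketch of that logic (the broadcast_fn stand-in plays the role of torch.distributed.broadcast; all names here are illustrative):

```python
import random

def sync_random_seed(rank, broadcast_fn):
    """Rank 0 draws a seed; every rank adopts rank 0's value."""
    local_seed = random.randint(0, 2**31 - 1)
    return broadcast_fn(local_seed, src_rank=0, rank=rank)

def make_broadcast():
    """Stand-in for a real broadcast: hands rank 0's value to all ranks."""
    box = {}
    def broadcast_fn(value, src_rank, rank):
        if rank == src_rank:
            box['seed'] = value
        return box['seed']
    return broadcast_fn

# Simulate a 4-rank job: every rank ends up with the same sampler seed.
broadcast_fn = make_broadcast()
seeds = [sync_random_seed(rank, broadcast_fn) for rank in range(4)]
print(len(set(seeds)))  # 1
```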
Describe the feature
Motivation
Since storage is limited, more and more users save their checkpoints in ceph and leave no checkpoints in the local working directory. However, when resuming a job, the auto-resume function can only find checkpoints in the local path and cannot automatically load checkpoints saved in ceph.
To solve this issue, a naive description can be as below:
When saving checkpoints during training, no matter where a checkpoint is saved, save last_checkpoint.txt in both the local and ceph working directories, indicating the real path of the latest checkpoint (either local storage or ceph). When auto-resuming during training, read the file and load the checkpoint at the path it contains. Thus, users can safely auto-resume with a command like the one below:
sh ./tools/slurm_train.sh $PARTITION $CONFIG $WORK_DIR --auto-resume
Or users can manually resume the model in a unified way, no matter where the latest checkpoint is saved:
sh ./tools/slurm_train.sh $PARTITION $CONFIG $WORK_DIR --load-from $WORK_DIR/last_checkpoint --resume
The last_checkpoint.txt file serves as a soft link to the latest checkpoint across platforms and works for any kind of storage.
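The marker-file mechanism described above can be sketched in a few lines (the file name last_checkpoint and both helper names are illustrative; a real implementation would write through the file client so the marker also lands in ceph):

```python
import os
import tempfile

def save_last_checkpoint(work_dir, ckpt_path):
    """Record the real path of the latest checkpoint, wherever it lives."""
    with open(os.path.join(work_dir, 'last_checkpoint'), 'w') as f:
        f.write(ckpt_path)

def find_latest_checkpoint(work_dir):
    """Resolve the latest checkpoint for auto-resume, local or remote."""
    marker = os.path.join(work_dir, 'last_checkpoint')
    if not os.path.exists(marker):
        return None
    with open(marker) as f:
        return f.read().strip()

# The checkpoint itself may live in remote storage; only the marker is local.
work_dir = tempfile.mkdtemp()
save_last_checkpoint(work_dir, 's3://bucket/exp1/epoch_12.pth')
print(find_latest_checkpoint(work_dir))  # s3://bucket/exp1/epoch_12.pth
```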
Related resources
If there is an official code release or third-party implementations, please also provide the information here, which would be very helpful.
Additional context
Detectron2 has a similar design.