oneflow-inc / libai

LiBai(李白): A Toolbox for Large-Scale Distributed Parallel Training

Home Page: https://libai.readthedocs.io

License: Apache License 2.0

Python 98.66% Shell 0.26% Makefile 0.01% C++ 1.01% Dockerfile 0.05%
oneflow nlp deep-learning large-scale data-parallelism model-parallelism distributed-training pipeline-parallelism transformer self-supervised-learning

libai's Introduction

LiBai


Introduction

English | 简体中文

LiBai is a large-scale open-source model training toolbox based on OneFlow. The main branch works with OneFlow 0.7.0.

Highlights
  • Support a collection of parallel training components

    LiBai provides multiple parallelisms such as Data Parallelism, Tensor Parallelism, and Pipeline Parallelism. It's also extensible for other new parallelisms.

  • Varied training techniques

    LiBai provides many out-of-the-box training techniques such as Distributed Training, Mixed Precision Training, Activation Checkpointing, Recomputation, Gradient Accumulation, and the Zero Redundancy Optimizer (ZeRO).

  • Support for both CV and NLP tasks

    LiBai has predefined data processing for both CV and NLP datasets such as CIFAR, ImageNet, and the BERT dataset.

  • Easy to use

    LiBai's components are designed to be modular for easier usage as follows:

    • LazyConfig system for more flexible syntax and no predefined structures
    • Friendly trainer and engine
    • Usable as a library to support building research projects on top of it. See projects/ for some projects built on LiBai
  • High Efficiency

Installation

See Installation instructions.

Getting Started

See Quick Run for the basic usage of LiBai.

Documentation

See LiBai's documentation for full API documentation and tutorials.

ChangeLog

Beta 0.3.0 was released on 03/11/2024. The general changes in version 0.3.0 are as follows:

Features:

  • Support mock transformers, see Mock transformers
  • Support lm-evaluation-harness for model evaluation
  • User Experience Optimization

New Supported Models:

  • These models are natively supported by libai:

    Models             2D (tp+pp) Inference    3D Parallel Training
    BLOOM              ✔                       -
    ChatGLM            ✔                       ✔
    Couplets           ✔                       ✔
    DALLE2             ✔                       -
    Llama2             ✔                       ✔
    MAE                ✔                       ✔
    Stable_Diffusion   -                       -

New Mock Models:

  • These models are extended and implemented by libai through mocking transformers:

    Models      Tensor Parallel    Pipeline Parallel
    BLOOM       ✔                  -
    GPT2        ✔                  -
    LLAMA       ✔                  -
    LLAMA2      ✔                  -
    Baichuan    ✔                  -
    OPT         ✔                  -

See changelog for details and release history.

Contributing

We appreciate all contributions to improve LiBai. See CONTRIBUTING for the contributing guideline.

License

This project is released under the Apache 2.0 license.

Citation

If you find this project useful for your research, please consider citing:

@misc{of2021libai,
  author =       {Xingyu Liao and Peng Cheng and Tianhe Ren and Depeng Liang and
                  Kai Dang and Yi Wang and Xiaoyu Xu},
  title =        {LiBai},
  howpublished = {\url{https://github.com/Oneflow-Inc/libai}},
  year =         {2021}
}

Join the WeChat group

LiBai_Wechat_QRcode

libai's People

Contributors

bbuf, cpflame, dangkai4u, digger-yu, hihippie, jackalcooper, khloe-zhang, l-xiafeng, l1aoxingyu, ldpe2g, leaves-zwx, liujuncheng, lixiang007666, loxs123, mard1no, marigoold, ofhwei, qiaolingchen00, rentainhe, shaoshitong, strint, thinksoso, wyg1997, xiezipeng-ml, yipeng1994, yuguo-jack, zhangfantju, zhanggj821, zsw256


libai's Issues

libai lr_scheduler design document

lr_scheduler design and discussion

Schedulers in OneFlow

In OneFlow, the scheduler's computation logic is separate for eager mode and graph mode: in eager mode it is written in the Python layer, while in graph mode it is implemented in C++. To guarantee that the same scheduler produces the same results in both modes, OneFlow defines a _generate_conf_for_graph method on each Scheduler. This method only initializes the scheduler for the graph runtime, so as long as the initialization is the same and the C++ logic matches the Python logic, a scheduler used in graph mode and in eager mode is guaranteed to produce identical results.
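
As a minimal sketch of the pattern described above (not OneFlow's actual implementation), a constant-factor scheduler with both execution paths might look like this; the _generate_conf_for_graph(opt_confs) hook and the get_lr/base_lrs/last_step attributes follow the usage shown in the code later in this issue, and the body of the graph hook is elided because its fields are OneFlow-internal:

import oneflow as flow

class ConstantFactorLR(flow.optim.lr_scheduler._LRScheduler):
    """One scheduler class serving both eager and graph mode."""

    def __init__(self, optimizer, factor=1.0, last_step=-1):
        self.factor = factor
        super().__init__(optimizer, last_step=last_step)

    def get_lr(self):
        # eager mode: the learning rate is recomputed in Python every step
        return [base_lr * self.factor for base_lr in self.base_lrs]

    def _generate_conf_for_graph(self, opt_confs):
        # graph mode: only the initialization is handed to the graph runtime,
        # which reproduces the same schedule in C++; the concrete fields of
        # opt_confs are OneFlow-internal and omitted in this sketch
        ...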

Discussion of the Scheduler design in LiBai

LiBai's earliest design was meant to follow the scheduler design in detectron2, which decouples the scheduler into the following parts:

  • ParamScheduler: mainly responsible for returning a scalar according to the current stage of the run
  • LRMultiplier: inherits from flow.optim.lr_scheduler._LRScheduler. It mainly takes an optimizer and a ParamScheduler; in get_lr it calls the ParamScheduler to get the scalar for the current stage, then multiplies that scalar into the learning rate

A simple example:

  • ConstantParamScheduler: the learning rate is the same at every stage of training, so we only need to initialize this ParamScheduler and override __call__ to return the same scalar at any stage
class ConstantParamScheduler(ParamScheduler):
    """
    Returns a constant value for a param
    """

    def __init__(self, value: float) -> None:
        self._value = value
    
    def __call__(self, where: float) -> float:
        return self._value
  • LRMultiplier: in get_lr it takes the current step and the total number of steps, divides the current step by the total to express training progress, then calls the given ParamScheduler to get the scalar for the current stage and multiplies it into the learning rate
class LRMultiplier(flow.optim.lr_scheduler._LRScheduler):
    """
    A LRScheduler which uses :class:`ParamScheduler` to multiply the
    learning rate of each param in the optimizer.
    Every step, the learning rate of each parameter becomes its initial value
    multiplied by the output of the given :class:`ParamScheduler`.

    The absolute learning rate value of each parameter can be different.
    This scheduler can be used as long as the relative scale among them do
    not change during training.
    
    Examples:
    ::
        LRMultiplier(
            opt,
            WarmupParamScheduler(
                MultiStepParamScheduler(
                    [1, 0.1, 0.01],
                    milestones=[60000, 80000],
                    num_updates=90000,
                ), 0.001, 100 / 90000
            ),
            max_iter=90000
        )
    """

    # NOTES: in the most general case, every LR can use its own scheduler.
    # Supporting this requires interaction with the optimizer when its parameter
    # group is initialized. For example, classyvision implements its own optimizer
    # that allows different schedulers for every parameter group.
    # To avoid this complexity, we use this class to support the most common cases
    # where the relative scale among all LRs stay unchanged during training.  In this
    # case we only need a total of one scheduler that defines the relative LR multiplier.

    def __init__(
        self,
        optimizer: flow.optim.Optimizer,
        multiplier: ParamScheduler,
        max_iter: int,
        last_iter: int = -1,
    ):
        """
        Args:
            optimizer, last_iter: See ``flow.optim.lr_scheduler._LRScheduler``.
                ``last_iter`` is the same as ``last_epoch``.
            multiplier: a libai ParamScheduler that defines the multiplier on
                every LR of the optimizer
            max_iter: the total number of training iterations
        """
        if not isinstance(multiplier, ParamScheduler):
            raise ValueError(
                "_LRMultiplier(multiplier=) must be an instance of libai "
                f"ParamScheduler. Got {multiplier} instead."
            )
        self._multiplier = multiplier
        self._max_iter = max_iter
        super().__init__(optimizer, last_step=last_iter)
    
    def state_dict(self):
        # libai schedulers are stateless. Only keep oneflow scheduler states
        return {"base_lrs": self.base_lrs, "last_step": self.last_step}
    
    def get_lr(self) -> List[float]:
        multiplier = self._multiplier(self.last_step / self._max_iter)
        return [base_lr * multiplier for base_lr in self.base_lrs]

The awkward part of this scheme is that if a scheduler needs to be compatible with both eager and graph execution, a _generate_conf_for_graph method has to be defined on LRMultiplier, and the same definition also has to be added to the underlying ParamScheduler.

For example:

class ConstantParamScheduler(ParamScheduler):
    """
    Returns a constant value for a param
    """

    def __init__(self, value: float) -> None:
        self._value = value
    
    def __call__(self, cur_iter: int, max_iter: int) -> float:
        return self._value
    
    def _generate_conf_for_graph(self, opt_confs):
        ...
class LRMultiplier(flow.optim.lr_scheduler._LRScheduler):
    """
    A LRScheduler which uses :class:`ParamScheduler` to multiply the
    learning rate of each param in the optimizer.
    Every step, the learning rate of each parameter becomes its initial value
    multiplied by the output of the given :class:`ParamScheduler`.

    The absolute learning rate value of each parameter can be different.
    This scheduler can be used as long as the relative scale among them do
    not change during training.
    
    Examples:
    ::
        LRMultiplier(
            opt,
            WarmupParamScheduler(
                MultiStepParamScheduler(
                    [1, 0.1, 0.01],
                    milestones=[60000, 80000],
                    num_updates=90000,
                ), 0.001, 100 / 90000
            ),
            max_iter=90000
        )
    """

    # NOTES: in the most general case, every LR can use its own scheduler.
    # Supporting this requires interaction with the optimizer when its parameter
    # group is initialized. For example, classyvision implements its own optimizer
    # that allows different schedulers for every parameter group.
    # To avoid this complexity, we use this class to support the most common cases
    # where the relative scale among all LRs stay unchanged during training.  In this
    # case we only need a total of one scheduler that defines the relative LR multiplier.

    def __init__(
        self,
        optimizer: flow.optim.Optimizer,
        multiplier: ParamScheduler,
        max_iter: int,
        last_iter: int = -1,
    ):
        """
        Args:
            optimizer, last_iter: See ``flow.optim.lr_scheduler._LRScheduler``.
                ``last_iter`` is the same as ``last_epoch``.
            multiplier: a libai ParamScheduler that defines the multiplier on
                every LR of the optimizer
            max_iter: the total number of training iterations
        """
        if not isinstance(multiplier, ParamScheduler):
            raise ValueError(
                "_LRMultiplier(multiplier=) must be an instance of libai "
                f"ParamScheduler. Got {multiplier} instead."
            )
        self._multiplier = multiplier
        self._max_iter = max_iter
        super().__init__(optimizer, last_step=last_iter)
    
    def state_dict(self):
        # libai schedulers are stateless. Only keep oneflow scheduler states
        return {"base_lrs": self.base_lrs, "last_step": self.last_step}
    
    def get_lr(self) -> List[float]:
        multiplier = self._multiplier(self.last_step / self._max_iter)
        return [base_lr * multiplier for base_lr in self.base_lrs]
    
    # a _generate_conf_for_graph method must also be implemented here
    def _generate_conf_for_graph(self, opt_confs):
        return self._multiplier._generate_conf_for_graph(opt_confs)
  • Doing it this way buries the logic under layer after layer of wrapping, which makes it inconvenient to use

Solution

Personally I feel the detectron2 approach does not really fit LiBai's use case, so my preference is to stick with OneFlow's built-in schedulers but rewrite a copy of them in LiBai: for schedulers that do not yet have _generate_conf_for_graph, add that method but raise NotImplementedError("xxx scheduler only support eager mode now"), so that users get a clear error message.

  • For example: copy a LambdaLR into libai
import types

import oneflow as flow

class LambdaLR(flow.optim.lr_scheduler._LRScheduler):
    def __init__(self, optimizer, lr_lambda, last_step=-1, verbose=False):
        if not isinstance(lr_lambda, (list, tuple)):
            self.lr_lambdas = [lr_lambda] * len(optimizer.param_groups)
        else:
            assert len(lr_lambda) == len(
                optimizer.param_groups
            ), f"Expected {len(optimizer.param_groups)} lr_lambdas, but got {len(lr_lambda)}"
            self.lr_lambdas = list(lr_lambda)
        super().__init__(optimizer, last_step, verbose)

    def state_dict(self):
        """Returns the state of the scheduler as a :class:`dict`.

        It contains an entry for every variable in self.__dict__ which
        is not the optimizer.
        The learning rate lambda functions will only be saved if they are callable objects
        and not if they are functions or lambdas.
        """
        state_dict = {
            key: value
            for (key, value) in self.__dict__.items()
            if key not in ("optimizer", "lr_lambdas")
        }
        state_dict["lr_lambdas"] = [None] * len(self.lr_lambdas)
        for (idx, fn) in enumerate(self.lr_lambdas):
            if not isinstance(fn, types.FunctionType):
                state_dict["lr_lambdas"][idx] = fn.__dict__.copy()
        return state_dict

    def load_state_dict(self, state_dict):
        """Loads the schedulers state.

        Arguments:
            state_dict (dict): scheduler state. Should be an object returned
                from a call to :meth:`state_dict`.
        """
        lr_lambdas = state_dict.pop("lr_lambdas")
        self.__dict__.update(state_dict)
        state_dict["lr_lambdas"] = lr_lambdas
        for (idx, fn) in enumerate(lr_lambdas):
            if fn is not None:
                self.lr_lambdas[idx].__dict__.update(fn)

    def get_lr(self):
        return [
            base_lr * lmbda(self.last_step)
            for (lmbda, base_lr) in zip(self.lr_lambdas, self.base_lrs)
        ]

    def _generate_conf_for_graph(self, opt_confs):
        # raise an informative error message here, or perform some other handling
        raise NotImplementedError("LambdaLR scheduler only support eager mode now")
  • Although at the code level this is almost identical to OneFlow's built-in scheduler, it lets us extend the error messages, unify the scheduler naming conventions, and users can then simply write from libai.optim import CosineLRScheduler, which is much clearer (see the usage sketch below)
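
A hedged usage sketch of that last point, assuming the schedulers are copied into libai.optim as proposed; the class name CosineLRScheduler and the arguments max_iters/warmup_iters reuse the naming that appears elsewhere on this page and are not a finalized API:

import oneflow as flow
from libai.optim import CosineLRScheduler  # proposed re-export, not yet a guaranteed API

model = flow.nn.Linear(768, 768)
optimizer = flow.optim.AdamW(model.parameters(), lr=1e-4)
# one import location for users, with libai-side error messages when a
# scheduler is used in a mode it does not support yet
scheduler = CosineLRScheduler(optimizer, max_iters=1000, warmup_iters=50)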

Summary of issues with the LiBai docs

Contents

This issue mainly records problems with the LiBai documentation, plus solutions (where available). The currently related PRs are as follows:

Open questions to discuss and resolve

  • The docs do not show the __init__ method of classes
  • GraphTrainer has no docstring yet; it needs to be filled in
  • The module name libai.trainer.trainer feels a bit odd

Resolved issues

  • When the docs are built, class methods are sorted alphabetically instead of in the order they appear in the class. Solution: add member-order: bysource, as follows:
libai.trainer.default module
---------------------------------
.. currentmodule:: libai.trainer
.. automodule:: libai.trainer
    :member-order: bysource
    :members: 
        default_setup,
        DefaultTrainer,

Unifying the argument format

Since the components were developed separately and everyone's habits differ, the argument formats ended up inconsistent, so they need to be unified (a sketch of how such a dict could be consumed follows after the lists below).

  • Argument format:

    • model:
    model = dict(
        model_name="BertModel",
        model_cfg=dict(vocab_size=123, hidden_size=768),
    )
    • tokenizer:
    tokenizer = dict(
        tokenizer_name="BertTokenizer",
        tokenizer_cfg=dict(vocab_file="vocab.txt"),
        append_eod=False,
    )
    • optimizer:
    optimizer = dict(
        optimizer_name="Adam",
        optimizer_cfg=dict(lr=1e-4, beta=(0.98, 0.99)),
        # ... (arguments used for construction go in optimizer_cfg; other related arguments go one level up, outside optimizer_cfg)
    )
    • scheduler:
    scheduler = dict(
        scheduler_name="WarmupCosineAnnealingLR",
        scheduler_cfg=dict(max_iters=1000, warmup_iters=50, warmup_method="linear"),
    )
    • dataloader: the dataloader does not support the registration mechanism, so this format is not used for it yet. If it is supported in the future, follow the format above.
    • trainer: currently all training- and validation-related arguments live under train. It is unclear whether they should be split up; for example dist is shared by training and inference, so splitting such fields out could be considered.
    • evaluator, generator, etc.: not fully finished yet, so no concrete format is given here; write them later following the format above.
  • Naming requirements:

    • Since kwargs are supported, use keyword arguments uniformly when calling models. Model parameters should be named following other mainstream libraries; the keys defined by a dataset should adapt to the model's parameter names, not the other way around.
    • When constructing models, modules, or other components, add default values wherever possible to reduce the amount of configuration needed in the config.
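
To make the format above concrete, here is a hedged sketch of how a name + cfg dict could be consumed by a registry-style builder; register_model, MODEL_REGISTRY, and build_model are illustrative names for this sketch, not existing libai APIs:

MODEL_REGISTRY = {}  # name -> class; in practice filled by a registration decorator

def register_model(cls):
    MODEL_REGISTRY[cls.__name__] = cls
    return cls

def build_model(model_dict):
    # "model_name" selects the class; "model_cfg" holds only constructor kwargs,
    # so anything else in the dict stays at the outer level, as required above
    cls = MODEL_REGISTRY[model_dict["model_name"]]
    return cls(**model_dict["model_cfg"])

@register_model
class BertModel:  # stand-in for the real model, only to keep the sketch runnable
    def __init__(self, vocab_size, hidden_size):
        self.vocab_size, self.hidden_size = vocab_size, hidden_size

model = dict(
    model_name="BertModel",
    model_cfg=dict(vocab_size=123, hidden_size=768),
)
bert = build_model(model)  # constructs BertModel(vocab_size=123, hidden_size=768)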

Tutorial: creating new projects in libai

Requirements

If a developer wants to develop a new task (for example reproducing a paper, or adding a finetune task), how can the development be done with the least amount of new code on top of the libai library?

First, we should be clear about the benefits of developing new tasks on top of libai:

  1. For libai: the newly added code stays isolated from libai's framework code, so it does not affect the framework's correctness, avoids conflicts with other modules and unknown bugs, and keeps the libai framework as lean as possible instead of growing ever more bloated
  2. For users: many libai features are inherited inside projects, so there is no need to reinvent the wheel. For example:
    • Users can inherit libai's trainer and lazy config and only need to rewrite their own model and dataloader, keeping new code to a minimum.
    • Users get libai's features for free: libai saves the config file, so previously run experiments can be reproduced quickly; during training libai logs rich information such as remaining training time, current iteration progress, throughput, loss, current learning rate, and so on

Interns on the models team can also build on the libai framework and develop each model-reproduction task as an independent project, which makes code integration and management easier

Creating a new task

We take creating a bert_finetune task as an example; the complete code can be found in projects_demo

Let's look at the overall structure first. For a newly created task, the main work consists of:

  1. Writing config.py: this config belongs to the task alone and does not interfere with other tasks or the libai library
  2. Writing model.py: the model for the new task
  3. Writing dataloader.py: the dataset for the new task
libai
    configs/
        ...
    libai/
        ...
    tests/
        ...
    tools/
        ...
    projects/
        your_task/
            dataset/
            model.py
            config.py
            finetune.py

Entry point

Let's look at finetune.py first. This program is the main entry point, and the user's custom modules are plugged in (overridden) here.
In this program we only need to override the get_batch function and the trainer's build_train_valid_test_loader().

A simple example

import sys
import oneflow as flow

sys.path.append(".")
from your_dataset import build_train_valid_test_data_iterators

from libai.config import LazyConfig, default_argument_parser
from libai.utils import distributed as dist
from libai.utils.checkpoint import Checkpointer

from libai.trainer import DefaultTrainer, default_setup
import logging

# Users need to redefine get_batch themselves; its return values must match the arguments of model.forward()
def get_batch(data_iterator):
    """Build the batch."""

    if data_iterator is not None:
        data = next(data_iterator)
    else:
        data = None

    input_placement = dist.get_layer_placement(0)
    label_placement = dist.get_layer_placement(-1)
    sbp = dist.get_nd_sbp([flow.sbp.split(0), flow.sbp.broadcast])

    def to_consistent(tensor, placement):
        tensor = tensor.to_consistent(placement, sbp)
        return tensor

    # Unpack.
    tokens = to_consistent(data["text"].long(), input_placement)
    types = to_consistent(data["types"].long(), input_placement)
    padding_mask = to_consistent(data["padding_mask"].long(), input_placement)
    label = to_consistent(data["label"].long(), label_placement)

    return tokens, padding_mask, types, label

# Users subclass DefaultTrainer and override the dataloader-building method with their own
class Trainer(DefaultTrainer):
    @classmethod
    def build_train_valid_test_loader(cls, cfg):
        """
        Returns:
            iterable
        It now calls ``build_train_valid_test_data_iterators``.
        Overwrite it if you'd like a different data loader.
        """
        logger = logging.getLogger("libai."+__name__)
        logger.info("Prepare training set")
        return build_train_valid_test_data_iterators(cfg)
    

    def run_step(self):
        return super().run_step(get_batch)


def main(args):
    cfg = LazyConfig.load(args.config_file)
    cfg = LazyConfig.apply_overrides(cfg, args.opts)
    default_setup(cfg, args)

    if args.eval_only:
        model = Trainer.build_model(cfg)
        Checkpointer(model, save_dir=cfg.train.output_dir).resume_or_load(
            cfg.train.load_weight, resume=args.resume
        )
        graph = Trainer.build_graph(cfg, model, is_train=False)
        res = Trainer.test(cfg, graph)
        return res

    trainer = Trainer(cfg)
    return trainer.train()


if __name__ == "__main__":
    args = default_argument_parser().parse_args()
    main(args)

Redefining the model

Next we define model.py. Note that besides defining the model, we also need to redefine the graph (inheriting from GraphBase).
When defining the model, use the modules under libai.layers as much as possible: they come with matching sbp settings, so users can use them directly without worrying about sbp configuration (a hedged sketch follows after the code below).

from oneflow import nn
from libai.models.utils.graph_base import GraphBase


class MyModel(nn.Module):
    def __init__(self, cfg):
        """
        The parameters inside cfg here correspond to my_cfg in config.py
        """
        ...

    def forward(self, tokens, padding_mask, types, label):
        ...
        return loss


class MyGraph(GraphBase):

    def build(self, tokens, padding_mask, types, label):
        """
        The arguments here must stay consistent with the return values of get_batch
        """
        loss = self.model(tokens, padding_mask, types, label)
        loss.backward()
        return loss

    def set_activation_checkpoint(self):
        """
        Enable activation checkpointing if needed; see libai.models.bert_model.py for reference
        """
        pass

    def set_pipeline_stage_id(self):
        """
        Set the stage_id for pipeline parallelism if needed; see libai.models.bert_model.py for reference
        """
        pass
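
As a hedged illustration of "use the modules under libai.layers where possible": libai.layers.Linear does appear in this repo (see libai/layers/linear.py in the Swin traceback further down this page), but its exact constructor arguments are assumed here to mirror nn.Linear, so treat this as a sketch rather than the definitive API.

import oneflow as flow
from oneflow import nn
from libai.layers import Linear  # sbp-aware replacement for nn.Linear (arguments assumed)


class MyClassifierHead(nn.Module):
    """Sketch of a task head built from libai.layers so sbp is handled for us."""

    def __init__(self, cfg):
        super().__init__()
        # assuming Linear mirrors nn.Linear's (in_features, out_features) signature
        self.dense = Linear(cfg["hidden_size"], cfg["hidden_size"])
        self.classifier = Linear(cfg["hidden_size"], cfg["num_classes"])
        self.loss_fn = nn.CrossEntropyLoss()

    def forward(self, pooled_output, label):
        logits = self.classifier(flow.tanh(self.dense(pooled_output)))
        return self.loss_fn(logits, label)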

Redefining the config

libai's config is a bit special: it uses the lazy-config style, can be a .py file, and is saved as .yaml when written out. Xingyu will probably write a dedicated issue later describing the lazy-config features.
Here we first briefly go through a complete config.py, and then how to inherit from a config.py.

First, for a training task, config.py has a few required fields:

  • train: training-related arguments, as a dict
  • model: the model architecture; how to build it is specified directly in the config, and thanks to the lazy-config mechanism the model is only instantiated when the program runs
  • data: dataloader-related arguments, as a dict
  • optim: the optimizer; how to build it is specified directly in the config, same as above
  • lr_scheduler: the learning-rate schedule; how to build it is specified directly in the config, same as above
  • graph: graph-mode settings (OneFlow-specific); how to build it is specified directly in the config, same as above

With that in mind, let's first look at what a complete config.py should look like.

Note: all imported modules must be imported via paths rooted at libai's top-level directory; otherwise the saved yaml file cannot record the correct module paths, reading the yaml will then fail, and the experiment cannot be reproduced

from libai.config import LazyCall as L
from libai.model_path.model import BaseModel, BaseGraph
import oneflow as flow
from libai.optim import get_default_optimizer_params, PolynomialLR

train = dict(
    output_dir="./demo_output/test_config",
    start_iter=0,
    train_iter=10000,
    micro_batch_size=32,
    ...
)

data = dict(
    seq_length=512,
    tokenizer_type="BertCNWWMTokenizer",
    dataloader_type="single",
    num_workers=4,
    ...
)

my_cfg = dict(
    vocab_size=30522,
    hidden_size=768,
    hidden_layers=24,
    ...
)
model = L(BaseModel)(cfg=my_cfg)

# As we all know, the optimizer needs model.parameters() as an argument when it is initialized.
# But the model has not been built yet, so parameters() is unavailable. Thanks to the lazy config,
# only the arguments below are recorded here, and the optimizer is built lazily in the program
# once model.parameters() is available.
optim = L(flow.optim.AdamW)(
    parameters=L(get_default_optimizer_params)(
        # parameters.model is meant to be set to the model object, before instantiating the optimizer.
        clip_grad_max_norm=1.0,
        clip_grad_norm_type=2.0,
        weight_decay_norm=0.0,
        weight_decay_bias=0.0,
    ),
    lr=1e-4,
    weight_decay=0.01,
    betas=(0.9, 0.999),
    do_bias_correction=True,
)

lr_scheduler = L(flow.optim.lr_scheduler.WarmUpLR)(
    lrsch_or_optimizer=L(PolynomialLR)(steps=1000, end_learning_rate=1.0e-5,),
    warmup_factor=0,
    warmup_iters=100,
    warmup_method="linear",
)

graph = dict(
    # options for graph or eager mode
    enabled=True,
    train=L(BaseGraph)(
        fp16=True,
        is_eval=False,
    ),
    eval=L(BaseGraph)(
        fp16=True, 
        is_eval=True,),
)
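
Before moving on to inheritance, here is a small sketch of what this lazy config looks like once loaded; LazyConfig.load is the same call used in finetune.py above, and the printed values correspond to the fields shown in the config above.

from libai.config import LazyConfig

# load the file exactly as finetune.py does; nothing is instantiated yet
cfg = LazyConfig.load("projects/your_task/config.py")

print(cfg.train.output_dir)        # "./demo_output/test_config"
print(cfg.train.micro_batch_size)  # 32
# cfg.model, cfg.optim and cfg.graph are still lazy nodes at this point; the
# trainer instantiates them later (the optimizer only after model.parameters()
# exists, as explained in the comment inside the config above).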

Next, let's talk about how to inherit a config.py. As you would expect, for a finetune task most of the config is the same as in the pretrain task, and we only need to change a handful of parameters.

config.py must contain the six required fields listed above, so even if we do not need to change one of them, we still have to import it, without modifying it.

You can see that we have swapped in our own way of building model and graph, and inside train we modified existing fields and added new ones

from libai.config import LazyCall as L
from libai.projects.model import MyModel, MyGraph
from libai.config_dir.config import data
from libai.config_dir.config import train
from libai.config_dir.config import optim
from libai.config_dir.config import lr_scheduler
from libai.config_dir.config import my_cfg

my_cfg.update(
    dict(
        # existing key
        hidden_size=1024,
        # new key
        num_classes=2,
    )
)
model = L(MyModel)(cfg=my_cfg)

train.update(
    dict(
        # existing key
        output_dir="output/finetune_qqp/",
        micro_batch_size=16,
        global_batch_size=16,
        train_iter=10000,
        # new key
        train_data=["/home/chengpeng/train.tsv",],
        valid_data=["/home/chengpeng/dev.tsv",],
    )
)

graph = dict(
    # options for graph or eager mode
    enabled=True,
    train=L(MyGraph)(
        fp16=True,
        is_eval=False,
    ),
    eval=L(MyGraph)(
        fp16=True, 
        is_eval=True,),
)

Once config.py is built, getting a field in the program is as simple as accessing cfg.train.output_dir.

Redefining the dataloader

This part is not fully ready yet. Currently, redefining the dataloader requires overriding the following (a minimal sketch follows below):
- dataset.py: inherits from oneflow.utils.data.Dataset
- build_train_valid_test_data_iterators(): called by the trainer in finetune.py

See projects_demo for the detailed code.
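
Since this part is still in flux, here is only a hedged, minimal sketch of the two pieces described above. The item keys ("text", "types", "padding_mask", "label") are the ones get_batch expects in the example earlier, oneflow.utils.data is assumed to mirror the PyTorch Dataset/DataLoader API, and parsing the sample lists from cfg.train.train_data / cfg.train.valid_data is left out.

import oneflow as flow
from oneflow.utils.data import DataLoader, Dataset


class MyFinetuneDataset(Dataset):
    """Each item returns the dict fields consumed by get_batch above."""

    def __init__(self, samples):
        # samples: list of (tokens, types, padding_mask, label) tuples,
        # produced by whatever file parsing the task needs (omitted here)
        self.samples = samples

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        tokens, types, padding_mask, label = self.samples[idx]
        return {
            "text": flow.tensor(tokens),
            "types": flow.tensor(types),
            "padding_mask": flow.tensor(padding_mask),
            "label": flow.tensor(label),
        }


def build_train_valid_test_data_iterators(cfg):
    """Called by the Trainer override in finetune.py; one iterator per split."""
    # In a real project these sample lists would be parsed from
    # cfg.train.train_data / cfg.train.valid_data; empty lists keep the sketch short.
    splits = {"train": [], "valid": [], "test": []}
    loaders = {
        name: DataLoader(
            MyFinetuneDataset(samples),
            batch_size=cfg.train.micro_batch_size,
            shuffle=(name == "train"),
        )
        for name, samples in splits.items()
    }
    return iter(loaders["train"]), iter(loaders["valid"]), iter(loaders["test"])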

Starting training

Once the modules above have been overridden, we can start the training task.
bash projects/finetune.sh

CONFIG supports both the .py config file and the generated .yaml file

#!/usr/bin/env bash

CONFIG=projects/your_task/config.py #output/your_task/config.yaml
GPUS=1
NODE=1
NODE_RANK=0
PORT=2345

python3 -m oneflow.distributed.launch \
    --nproc_per_node $GPUS \
    --nnodes $NODE \
    --node_rank $NODE_RANK \
    --master_port $PORT \
    projects/your_task/finetune.py \
    --config-file $CONFIG \
    --num-gpus $GPUS

Swin data parallelism: the 8-GPU linear speedup is lower than PyTorch

Problem description

Following a user issue in the flowvision repo describing the low 8-GPU speedup of libai's Swin, I ran some experiments on the 类脑 cluster. Below is a comparison between libai and the official Swin numbers.

Environment

Cluster: 类脑 vs009
oneflow version: 0.8.0+cu112.git.57869e9e39
libai version: de2c68f2692760e5de87ebb815541a98d1b8ebe7
pytorch version: 1.10.1+cu102

libai

graph, fp16, batch 128

1-GPU throughput: ~70
8-GPU throughput: ~280
Linear speedup: ~4x

graph, fp32, batch 32

1-GPU throughput: ~70
8-GPU throughput: ~290
Linear speedup: ~4x

eager global, fp32, batch 32

1-GPU throughput: ~50
8-GPU throughput: ~200
Linear speedup: ~4x

Plain DDP: https://github.com/Oneflow-Inc/swin-transformer/tree/swin_clean_ldp

eager ddp, fp32, batch 32

1-GPU throughput: ~152.3
8-GPU throughput: ~416.2
Linear speedup: ~2.7x

Official PyTorch Swin: https://github.com/microsoft/Swin-Transformer

amp, batch 128,

1-GPU throughput: ~320
8-GPU throughput: ~2048
Linear speedup: ~6.4x

fp32, batch 96

1-GPU throughput: ~208
8-GPU throughput: ~1536
Linear speedup: ~7.3x

Swin graph data parallelism: error when enabling amp + ZeRO stage 1

Experiment branch: #215

Training itself runs, but when the first epoch ends and the test phase starts, graph building crashes.

Key configuration in swin_cifar100.py

train.train_micro_batch_size = 16
train.num_accumulation_steps = 1
train.test_micro_batch_size = 16

# parallel strategy settings
train.dist.data_parallel_size = 8
train.dist.tensor_parallel_size = 1
train.dist.pipeline_parallel_size = 1
train.dist.pipeline_num_layers = sum(model.depths)
train.output_dir="./output"

# Set fp16 ON
train.amp.enabled = True
train.zero_optimization.enabled = True
train.zero_optimization.stage = 1
graph.enabled = True

Launch command:

bash tools/train.sh tools/train_net.py configs/swin_cifar100.py 8

报错信息

[ERROR](GRAPH:GraphBase_1:GraphBase) building graph got error: <class 'NotImplementedError'> Not support weight with sbp: (oneflow.sbp.split(axis=0),)
(the same error is printed by each of the 8 ranks)
Traceback (most recent call last):
  File "tools/train_net.py", line 61, in <module>
    main(args)
  File "tools/train_net.py", line 56, in main
    return trainer.train()
  File "/DATA/disk1/ldp/libai/libai/engine/default.py", line 459, in train
    super().train(self.start_iter, self.max_iter)
  File "/DATA/disk1/ldp/libai/libai/engine/trainer.py", line 143, in train
    self.after_step()
  File "/DATA/disk1/ldp/libai/libai/engine/trainer.py", line 171, in after_step
    h.after_step()
  File "/DATA/disk1/ldp/libai/libai/engine/hooks.py", line 340, in after_step
    self._do_eval()
  File "/DATA/disk1/ldp/libai/libai/engine/hooks.py", line 312, in _do_eval
    results = self._func()
  File "/DATA/disk1/ldp/libai/libai/engine/default.py", line 404, in test_and_save_results
    self._last_eval_results = self.test(self.cfg, self.test_loader, model)
  File "/DATA/disk1/ldp/libai/libai/engine/default.py", line 739, in test
    evaluator,
  File "/DATA/disk1/ldp/libai/libai/evaluation/evaluator.py", line 193, in inference_on_dataset
    outputs = model(**paded_data)
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/graph.py", line 209, in __call__
    self._compile(*args, **kwargs)
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/graph.py", line 650, in _compile
    eager_outputs = self.__build_graph(*args, **kwargs)
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/graph.py", line 744, in __build_graph
    outputs = self.build(*lazy_args, **lazy_kwargs)
  File "/DATA/disk1/ldp/libai/libai/models/utils/graph_base.py", line 87, in build
    return self.model(**kwargs)
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/block.py", line 209, in __call__
    result = self.__block_forward(*args, **kwargs)
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/block.py", line 238, in __block_forward
    result = self._origin.__class__.forward(self, *args, **kwargs)
  File "/DATA/disk1/ldp/libai/libai/models/swin_transformer.py", line 679, in forward
    x = self.forward_features(images)
  File "/DATA/disk1/ldp/libai/libai/models/swin_transformer.py", line 658, in forward_features
    x = layer(x)
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/block.py", line 209, in __call__
    result = self.__block_forward(*args, **kwargs)
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/block.py", line 238, in __block_forward
    result = self._origin.__class__.forward(self, *args, **kwargs)
  File "/DATA/disk1/ldp/libai/libai/models/swin_transformer.py", line 485, in forward
    x = self.blocks[i](x)
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/block.py", line 209, in __call__
    result = self.__block_forward(*args, **kwargs)
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/block.py", line 238, in __block_forward
    result = self._origin.__class__.forward(self, *args, **kwargs)
  File "/DATA/disk1/ldp/libai/libai/models/swin_transformer.py", line 297, in forward
    attn_windows = self.attn(x_windows, self.attn_mask)  # nW*B, window_size*window_size, C
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/block.py", line 209, in __call__
    result = self.__block_forward(*args, **kwargs)
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/block.py", line 238, in __block_forward
    result = self._origin.__class__.forward(self, *args, **kwargs)
  File "/DATA/disk1/ldp/libai/libai/models/swin_transformer.py", line 121, in forward
    self.qkv(x)
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/block.py", line 209, in __call__
    result = self.__block_forward(*args, **kwargs)
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/block.py", line 238, in __block_forward
    result = self._origin.__class__.forward(self, *args, **kwargs)
  File "/DATA/disk1/ldp/libai/libai/layers/linear.py", line 154, in forward
    raise NotImplementedError(f"Not support weight with sbp: {self.weight.sbp}")
NotImplementedError: Not support weight with sbp: (oneflow.sbp.split(axis=0),)
Traceback (most recent call last):
  File "tools/train_net.py", line 61, in <module>
    main(args)
  File "tools/train_net.py", line 56, in main
    return trainer.train()
  File "/DATA/disk1/ldp/libai/libai/engine/default.py", line 459, in train
    super().train(self.start_iter, self.max_iter)
  File "/DATA/disk1/ldp/libai/libai/engine/trainer.py", line 143, in train
    self.after_step()
  File "/DATA/disk1/ldp/libai/libai/engine/trainer.py", line 171, in after_step
    h.after_step()
  File "/DATA/disk1/ldp/libai/libai/engine/hooks.py", line 340, in after_step
    self._do_eval()
  File "/DATA/disk1/ldp/libai/libai/engine/hooks.py", line 312, in _do_eval
    results = self._func()
  File "/DATA/disk1/ldp/libai/libai/engine/default.py", line 404, in test_and_save_results
    self._last_eval_results = self.test(self.cfg, self.test_loader, model)
  File "/DATA/disk1/ldp/libai/libai/engine/default.py", line 739, in test
    evaluator,
  File "/DATA/disk1/ldp/libai/libai/evaluation/evaluator.py", line 193, in inference_on_dataset
    outputs = model(**paded_data)
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/graph.py", line 209, in __call__
    self._compile(*args, **kwargs)
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/graph.py", line 650, in _compile
    eager_outputs = self.__build_graph(*args, **kwargs)
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/graph.py", line 744, in __build_graph
    outputs = self.build(*lazy_args, **lazy_kwargs)
  File "/DATA/disk1/ldp/libai/libai/models/utils/graph_base.py", line 87, in build
    return self.model(**kwargs)
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/block.py", line 209, in __call__
    result = self.__block_forward(*args, **kwargs)
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/block.py", line 238, in __block_forward
    result = self._origin.__class__.forward(self, *args, **kwargs)
  File "/DATA/disk1/ldp/libai/libai/models/swin_transformer.py", line 679, in forward
    x = self.forward_features(images)
  File "/DATA/disk1/ldp/libai/libai/models/swin_transformer.py", line 658, in forward_features
    x = layer(x)
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/block.py", line 209, in __call__
    result = self.__block_forward(*args, **kwargs)
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/block.py", line 238, in __block_forward
�[4m�[5m�[31mERROR�[0m �[32m[03/29 17:34:50 lb.engine.trainer]: �[0mException during training:
Traceback (most recent call last):
  File "/DATA/disk1/ldp/libai/libai/engine/trainer.py", line 143, in train
    self.after_step()
  File "/DATA/disk1/ldp/libai/libai/engine/trainer.py", line 171, in after_step
    h.after_step()
  File "/DATA/disk1/ldp/libai/libai/engine/hooks.py", line 340, in after_step
    self._do_eval()
  File "/DATA/disk1/ldp/libai/libai/engine/hooks.py", line 312, in _do_eval
    results = self._func()
  File "/DATA/disk1/ldp/libai/libai/engine/default.py", line 404, in test_and_save_results
    self._last_eval_results = self.test(self.cfg, self.test_loader, model)
  File "/DATA/disk1/ldp/libai/libai/engine/default.py", line 739, in test
    evaluator,
  File "/DATA/disk1/ldp/libai/libai/evaluation/evaluator.py", line 193, in inference_on_dataset
    outputs = model(**paded_data)
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/graph.py", line 209, in __call__
    self._compile(*args, **kwargs)
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/graph.py", line 650, in _compile
    eager_outputs = self.__build_graph(*args, **kwargs)
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/graph.py", line 744, in __build_graph
    outputs = self.build(*lazy_args, **lazy_kwargs)
  File "/DATA/disk1/ldp/libai/libai/models/utils/graph_base.py", line 87, in build
    return self.model(**kwargs)
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/block.py", line 209, in __call__
    result = self.__block_forward(*args, **kwargs)
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/block.py", line 238, in __block_forward
    result = self._origin.__class__.forward(self, *args, **kwargs)
  File "/DATA/disk1/ldp/libai/libai/models/swin_transformer.py", line 679, in forward
    x = self.forward_features(images)
  File "/DATA/disk1/ldp/libai/libai/models/swin_transformer.py", line 658, in forward_features
    x = layer(x)
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/block.py", line 209, in __call__
    result = self.__block_forward(*args, **kwargs)
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/block.py", line 238, in __block_forward
    result = self._origin.__class__.forward(self, *args, **kwargs)
  File "/DATA/disk1/ldp/libai/libai/models/swin_transformer.py", line 485, in forward
    x = self.blocks[i](x)
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/block.py", line 209, in __call__
    result = self.__block_forward(*args, **kwargs)
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/block.py", line 238, in __block_forward
    result = self._origin.__class__.forward(self, *args, **kwargs)
  File "/DATA/disk1/ldp/libai/libai/models/swin_transformer.py", line 297, in forward
    attn_windows = self.attn(x_windows, self.attn_mask)  # nW*B, window_size*window_size, C
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/block.py", line 209, in __call__
    result = self.__block_forward(*args, **kwargs)
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/block.py", line 238, in __block_forward
    result = self._origin.__class__.forward(self, *args, **kwargs)
  File "/DATA/disk1/ldp/libai/libai/models/swin_transformer.py", line 121, in forward
    self.qkv(x)
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/block.py", line 209, in __call__
    result = self.__block_forward(*args, **kwargs)
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/block.py", line 238, in __block_forward
    result = self._origin.__class__.forward(self, *args, **kwargs)
  File "/DATA/disk1/ldp/libai/libai/layers/linear.py", line 154, in forward
    raise NotImplementedError(f"Not support weight with sbp: {self.weight.sbp}")
NotImplementedError: Not support weight with sbp: (oneflow.sbp.split(axis=0),)
    result = self._origin.__class__.forward(self, *args, **kwargs)
  File "/DATA/disk1/ldp/libai/libai/models/swin_transformer.py", line 485, in forward
�[32m[03/29 17:34:50 lb.engine.hooks]: �[0mOverall training speed: 197 iterations in 0:00:35 (0.1793 s / it)
    x = self.blocks[i](x)
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/block.py", line 209, in __call__
�[32m[03/29 17:34:50 lb.engine.hooks]: �[0mTotal training time: 0:00:35 (0:00:00 on hooks)
Traceback (most recent call last):
  File "tools/train_net.py", line 61, in <module>
    result = self.__block_forward(*args, **kwargs)
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/block.py", line 238, in __block_forward
    result = self._origin.__class__.forward(self, *args, **kwargs)
    main(args)
  File "/DATA/disk1/ldp/libai/libai/models/swin_transformer.py", line 297, in forward
  File "tools/train_net.py", line 56, in main
    return trainer.train()
  File "/DATA/disk1/ldp/libai/libai/engine/default.py", line 459, in train
Traceback (most recent call last):
    attn_windows = self.attn(x_windows, self.attn_mask)  # nW*B, window_size*window_size, C
  File "tools/train_net.py", line 61, in <module>
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/block.py", line 209, in __call__
    result = self.__block_forward(*args, **kwargs)
    main(args)
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/block.py", line 238, in __block_forward
  File "tools/train_net.py", line 56, in main
    super().train(self.start_iter, self.max_iter)
  File "/DATA/disk1/ldp/libai/libai/engine/trainer.py", line 143, in train
    return trainer.train()
  File "/DATA/disk1/ldp/libai/libai/engine/default.py", line 459, in train
    result = self._origin.__class__.forward(self, *args, **kwargs)
  File "/DATA/disk1/ldp/libai/libai/models/swin_transformer.py", line 121, in forward
    self.after_step()
  File "/DATA/disk1/ldp/libai/libai/engine/trainer.py", line 171, in after_step
    self.qkv(x)
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/block.py", line 209, in __call__
    h.after_step()
  File "/DATA/disk1/ldp/libai/libai/engine/hooks.py", line 340, in after_step
    super().train(self.start_iter, self.max_iter)
  File "/DATA/disk1/ldp/libai/libai/engine/trainer.py", line 143, in train
    result = self.__block_forward(*args, **kwargs)
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/block.py", line 238, in __block_forward
    self._do_eval()
    self.after_step()
  File "/DATA/disk1/ldp/libai/libai/engine/hooks.py", line 312, in _do_eval
  File "/DATA/disk1/ldp/libai/libai/engine/trainer.py", line 171, in after_step
    result = self._origin.__class__.forward(self, *args, **kwargs)
  File "/DATA/disk1/ldp/libai/libai/layers/linear.py", line 154, in forward
    h.after_step()
  File "/DATA/disk1/ldp/libai/libai/engine/hooks.py", line 340, in after_step
    results = self._func()
    raise NotImplementedError(f"Not support weight with sbp: {self.weight.sbp}")
  File "/DATA/disk1/ldp/libai/libai/engine/default.py", line 404, in test_and_save_results
NotImplementedError: Not support weight with sbp: (oneflow.sbp.split(axis=0),)
    self._do_eval()
  File "/DATA/disk1/ldp/libai/libai/engine/hooks.py", line 312, in _do_eval
    self._last_eval_results = self.test(self.cfg, self.test_loader, model)
  File "/DATA/disk1/ldp/libai/libai/engine/default.py", line 739, in test
    results = self._func()
  File "/DATA/disk1/ldp/libai/libai/engine/default.py", line 404, in test_and_save_results
    evaluator,
    self._last_eval_results = self.test(self.cfg, self.test_loader, model)
  File "/DATA/disk1/ldp/libai/libai/evaluation/evaluator.py", line 193, in inference_on_dataset
  File "/DATA/disk1/ldp/libai/libai/engine/default.py", line 739, in test
    outputs = model(**paded_data)
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/graph.py", line 209, in __call__
    evaluator,
  File "/DATA/disk1/ldp/libai/libai/evaluation/evaluator.py", line 193, in inference_on_dataset
    outputs = model(**paded_data)
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/graph.py", line 209, in __call__
    self._compile(*args, **kwargs)
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/graph.py", line 650, in _compile
    self._compile(*args, **kwargs)
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/graph.py", line 650, in _compile
    eager_outputs = self.__build_graph(*args, **kwargs)
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/graph.py", line 744, in __build_graph
    eager_outputs = self.__build_graph(*args, **kwargs)
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/graph.py", line 744, in __build_graph
    outputs = self.build(*lazy_args, **lazy_kwargs)
  File "/DATA/disk1/ldp/libai/libai/models/utils/graph_base.py", line 87, in build
    return self.model(**kwargs)
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/block.py", line 209, in __call__
    result = self.__block_forward(*args, **kwargs)
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/block.py", line 238, in __block_forward
    outputs = self.build(*lazy_args, **lazy_kwargs)
  File "/DATA/disk1/ldp/libai/libai/models/utils/graph_base.py", line 87, in build
    result = self._origin.__class__.forward(self, *args, **kwargs)
  File "/DATA/disk1/ldp/libai/libai/models/swin_transformer.py", line 679, in forward
    return self.model(**kwargs)
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/block.py", line 209, in __call__
Traceback (most recent call last):
  File "tools/train_net.py", line 61, in <module>
    result = self.__block_forward(*args, **kwargs)
    x = self.forward_features(images)
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/block.py", line 238, in __block_forward
  File "/DATA/disk1/ldp/libai/libai/models/swin_transformer.py", line 658, in forward_features
    main(args)
  File "tools/train_net.py", line 56, in main
    result = self._origin.__class__.forward(self, *args, **kwargs)
  File "/DATA/disk1/ldp/libai/libai/models/swin_transformer.py", line 679, in forward
    return trainer.train()
  File "/DATA/disk1/ldp/libai/libai/engine/default.py", line 459, in train
    x = layer(x)
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/block.py", line 209, in __call__
    result = self.__block_forward(*args, **kwargs)
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/block.py", line 238, in __block_forward
    super().train(self.start_iter, self.max_iter)
  File "/DATA/disk1/ldp/libai/libai/engine/trainer.py", line 143, in train
    x = self.forward_features(images)
  File "/DATA/disk1/ldp/libai/libai/models/swin_transformer.py", line 658, in forward_features
    result = self._origin.__class__.forward(self, *args, **kwargs)
    self.after_step()
  File "/DATA/disk1/ldp/libai/libai/models/swin_transformer.py", line 485, in forward
  File "/DATA/disk1/ldp/libai/libai/engine/trainer.py", line 171, in after_step
    h.after_step()
  File "/DATA/disk1/ldp/libai/libai/engine/hooks.py", line 340, in after_step
    x = self.blocks[i](x)
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/block.py", line 209, in __call__
    self._do_eval()
  File "/DATA/disk1/ldp/libai/libai/engine/hooks.py", line 312, in _do_eval
    result = self.__block_forward(*args, **kwargs)
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/block.py", line 238, in __block_forward
    x = layer(x)
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/block.py", line 209, in __call__
    results = self._func()
  File "/DATA/disk1/ldp/libai/libai/engine/default.py", line 404, in test_and_save_results
    result = self._origin.__class__.forward(self, *args, **kwargs)
  File "/DATA/disk1/ldp/libai/libai/models/swin_transformer.py", line 297, in forward
    self._last_eval_results = self.test(self.cfg, self.test_loader, model)
  File "/DATA/disk1/ldp/libai/libai/engine/default.py", line 739, in test
    result = self.__block_forward(*args, **kwargs)
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/block.py", line 238, in __block_forward
    attn_windows = self.attn(x_windows, self.attn_mask)  # nW*B, window_size*window_size, C
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/block.py", line 209, in __call__
    result = self.__block_forward(*args, **kwargs)
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/block.py", line 238, in __block_forward
    evaluator,
    result = self._origin.__class__.forward(self, *args, **kwargs)
  File "/DATA/disk1/ldp/libai/libai/evaluation/evaluator.py", line 193, in inference_on_dataset
  File "/DATA/disk1/ldp/libai/libai/models/swin_transformer.py", line 485, in forward
    result = self._origin.__class__.forward(self, *args, **kwargs)
  File "/DATA/disk1/ldp/libai/libai/models/swin_transformer.py", line 121, in forward
    outputs = model(**paded_data)
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/graph.py", line 209, in __call__
    self.qkv(x)
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/block.py", line 209, in __call__
    result = self.__block_forward(*args, **kwargs)
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/block.py", line 238, in __block_forward
    self._compile(*args, **kwargs)
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/graph.py", line 650, in _compile
    x = self.blocks[i](x)
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/block.py", line 209, in __call__
    result = self._origin.__class__.forward(self, *args, **kwargs)
  File "/DATA/disk1/ldp/libai/libai/layers/linear.py", line 154, in forward
    raise NotImplementedError(f"Not support weight with sbp: {self.weight.sbp}")
NotImplementedError: Not support weight with sbp: (oneflow.sbp.split(axis=0),)
    eager_outputs = self.__build_graph(*args, **kwargs)
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/graph.py", line 744, in __build_graph
    result = self.__block_forward(*args, **kwargs)
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/block.py", line 238, in __block_forward
    outputs = self.build(*lazy_args, **lazy_kwargs)
  File "/DATA/disk1/ldp/libai/libai/models/utils/graph_base.py", line 87, in build
    result = self._origin.__class__.forward(self, *args, **kwargs)
  File "/DATA/disk1/ldp/libai/libai/models/swin_transformer.py", line 297, in forward
    return self.model(**kwargs)
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/block.py", line 209, in __call__
    result = self.__block_forward(*args, **kwargs)
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/block.py", line 238, in __block_forward
    attn_windows = self.attn(x_windows, self.attn_mask)  # nW*B, window_size*window_size, C
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/block.py", line 209, in __call__
    result = self._origin.__class__.forward(self, *args, **kwargs)
  File "/DATA/disk1/ldp/libai/libai/models/swin_transformer.py", line 679, in forward
    result = self.__block_forward(*args, **kwargs)
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/block.py", line 238, in __block_forward
    x = self.forward_features(images)
  File "/DATA/disk1/ldp/libai/libai/models/swin_transformer.py", line 658, in forward_features
    result = self._origin.__class__.forward(self, *args, **kwargs)
  File "/DATA/disk1/ldp/libai/libai/models/swin_transformer.py", line 121, in forward
    x = layer(x)
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/block.py", line 209, in __call__
    self.qkv(x)
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/block.py", line 209, in __call__
    result = self.__block_forward(*args, **kwargs)
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/block.py", line 238, in __block_forward
    result = self._origin.__class__.forward(self, *args, **kwargs)
  File "/DATA/disk1/ldp/libai/libai/models/swin_transformer.py", line 485, in forward
    result = self.__block_forward(*args, **kwargs)
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/block.py", line 238, in __block_forward
    x = self.blocks[i](x)
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/block.py", line 209, in __call__
    result = self._origin.__class__.forward(self, *args, **kwargs)
  File "/DATA/disk1/ldp/libai/libai/layers/linear.py", line 154, in forward
    result = self.__block_forward(*args, **kwargs)
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/block.py", line 238, in __block_forward
    result = self._origin.__class__.forward(self, *args, **kwargs)
  File "/DATA/disk1/ldp/libai/libai/models/swin_transformer.py", line 297, in forward
    raise NotImplementedError(f"Not support weight with sbp: {self.weight.sbp}")
NotImplementedError: Not support weight with sbp: (oneflow.sbp.split(axis=0),)
    attn_windows = self.attn(x_windows, self.attn_mask)  # nW*B, window_size*window_size, C
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/block.py", line 209, in __call__
    result = self.__block_forward(*args, **kwargs)
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/block.py", line 238, in __block_forward
    result = self._origin.__class__.forward(self, *args, **kwargs)
  File "/DATA/disk1/ldp/libai/libai/models/swin_transformer.py", line 121, in forward
    self.qkv(x)
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/block.py", line 209, in __call__
    result = self.__block_forward(*args, **kwargs)
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/block.py", line 238, in __block_forward
    result = self._origin.__class__.forward(self, *args, **kwargs)
  File "/DATA/disk1/ldp/libai/libai/layers/linear.py", line 154, in forward
    raise NotImplementedError(f"Not support weight with sbp: {self.weight.sbp}")
NotImplementedError: Not support weight with sbp: (oneflow.sbp.split(axis=0),)
Traceback (most recent call last):
  File "tools/train_net.py", line 61, in <module>
    main(args)
  File "tools/train_net.py", line 56, in main
    return trainer.train()
  File "/DATA/disk1/ldp/libai/libai/engine/default.py", line 459, in train
    super().train(self.start_iter, self.max_iter)
  File "/DATA/disk1/ldp/libai/libai/engine/trainer.py", line 143, in train
    self.after_step()
  File "/DATA/disk1/ldp/libai/libai/engine/trainer.py", line 171, in after_step
    h.after_step()
  File "/DATA/disk1/ldp/libai/libai/engine/hooks.py", line 340, in after_step
    self._do_eval()
  File "/DATA/disk1/ldp/libai/libai/engine/hooks.py", line 312, in _do_eval
    results = self._func()
  File "/DATA/disk1/ldp/libai/libai/engine/default.py", line 404, in test_and_save_results
    self._last_eval_results = self.test(self.cfg, self.test_loader, model)
  File "/DATA/disk1/ldp/libai/libai/engine/default.py", line 739, in test
    evaluator,
  File "/DATA/disk1/ldp/libai/libai/evaluation/evaluator.py", line 193, in inference_on_dataset
    outputs = model(**paded_data)
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/graph.py", line 209, in __call__
    self._compile(*args, **kwargs)
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/graph.py", line 650, in _compile
    eager_outputs = self.__build_graph(*args, **kwargs)
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/graph.py", line 744, in __build_graph
    outputs = self.build(*lazy_args, **lazy_kwargs)
  File "/DATA/disk1/ldp/libai/libai/models/utils/graph_base.py", line 87, in build
    return self.model(**kwargs)
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/block.py", line 209, in __call__
    result = self.__block_forward(*args, **kwargs)
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/block.py", line 238, in __block_forward
    result = self._origin.__class__.forward(self, *args, **kwargs)
  File "/DATA/disk1/ldp/libai/libai/models/swin_transformer.py", line 679, in forward
    x = self.forward_features(images)
  File "/DATA/disk1/ldp/libai/libai/models/swin_transformer.py", line 658, in forward_features
    x = layer(x)
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/block.py", line 209, in __call__
    result = self.__block_forward(*args, **kwargs)
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/block.py", line 238, in __block_forward
    result = self._origin.__class__.forward(self, *args, **kwargs)
  File "/DATA/disk1/ldp/libai/libai/models/swin_transformer.py", line 485, in forward
    x = self.blocks[i](x)
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/block.py", line 209, in __call__
    result = self.__block_forward(*args, **kwargs)
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/block.py", line 238, in __block_forward
    result = self._origin.__class__.forward(self, *args, **kwargs)
  File "/DATA/disk1/ldp/libai/libai/models/swin_transformer.py", line 297, in forward
    attn_windows = self.attn(x_windows, self.attn_mask)  # nW*B, window_size*window_size, C
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/block.py", line 209, in __call__
    result = self.__block_forward(*args, **kwargs)
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/block.py", line 238, in __block_forward
    result = self._origin.__class__.forward(self, *args, **kwargs)
  File "/DATA/disk1/ldp/libai/libai/models/swin_transformer.py", line 121, in forward
    self.qkv(x)
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/block.py", line 209, in __call__
    result = self.__block_forward(*args, **kwargs)
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/nn/graph/block.py", line 238, in __block_forward
    result = self._origin.__class__.forward(self, *args, **kwargs)
  File "/DATA/disk1/ldp/libai/libai/layers/linear.py", line 154, in forward
    raise NotImplementedError(f"Not support weight with sbp: {self.weight.sbp}")
NotImplementedError: Not support weight with sbp: (oneflow.sbp.split(axis=0),)
Traceback (most recent call last):
  File "/home/ldp/miniconda3/envs/oneflow-dev-gcc7/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/ldp/miniconda3/envs/oneflow-dev-gcc7/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/distributed/launch.py", line 237, in <module>
    main()
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/distributed/launch.py", line 225, in main
    sigkill_handler(signal.SIGTERM, None)
  File "/home/ldp/.local/lib/python3.6/site-packages/oneflow/distributed/launch.py", line 194, in sigkill_handler
    returncode=last_return_code, cmd=cmd

Add a tutorial on developing libai projects to the docstrings/docs

Requirement

If a developer wants to build a new task on top of the libai library (for example, reproducing a paper or adding a finetune task), how can this be done with as little new code as possible?

First, let's be clear about the benefits of building a new task on top of libai:

  1. For users: many of libai's features can be inherited directly in projects, so there is no need to reinvent the wheel. For example:
    • Users can inherit libai's trainer and lazyconfig and only need to rewrite the model and dataloader, minimizing the amount of new code.
    • Users get libai's features for free: libai saves the config file so past experiments can be reproduced quickly, and during training it logs rich information such as the remaining training time, current iteration progress, throughput, loss values, and the current learning rate.
    • Distributed training can be enabled simply by setting the distributed training parameters.

Creating a new task

We take creating a bert_finetune task as an example; the complete code can be found in projects_QQP.

Let's first look at the overall structure. For a new task, the main work consists of:

  • Writing config.py. This config is specific to the task; it contains the definitions of the task-related classes and the parameters the task uses. Some notes on the config file (taking QQP-config.py as an example):

    1. The tokenizer, dataloader, and model are defined with LazyCall (LazyCall does not actually call the defined object; it returns a dict describing the call).
    2. The file imports many things. Objects imported from configs.common are usually model parameters and companion objects that libai has already written, so we can use them directly or modify parts of them before use.
    3. The file also uses build_nlp_train_loader, which directly builds a dataloader for NLP tasks, so we do not have to write things like a sampler or a collate_fn ourselves.
  • Writing model.py, which defines the model for the new task. Models in libai are built much the same way as in plain oneflow. Some notes (taking QQP-model.py as an example):

    1. libai's default training mode uses Graph (static graph) training, so the loss computation must happen inside the model.
    2. The model's forward must return a dict; see trainer.py and graph_base.py in libai for how the loss backward pass works.
    3. Tensors created inside the model need to_global to convert local tensors into tensors under the consistent (global) view (a short sketch follows the directory layout below).
    4. For layers or activations, use the ones provided in libai.layers directly, because their sbp partitioning has already been defined.
  • Writing dataset.py, which defines the dataset for the new task. Related notes (taking QQP_dataset.py as an example):

    1. The dataset object is built basically the same way as when using oneflow normally.
    2. The difference is that we need DistTensorData and Instance: DistTensorData mainly applies to_global to the data returned by the Dataset's __getitem__, and Instance packs the returned data into a metadata-like structure, just like the approach in demo.py.
    3. Every batch produced by the Dataset must have the same shape, because a static graph is built for training.
    4. The keys returned by the Dataset's __getitem__ must match the parameter names of the model's forward function.
libai
    configs/
        ...
    libai/
        ...
    tests/
        ...
    tools/
        ...
    projects/
        your_task/
            configs/
              config.py
            dataset/
              dataset.py
            modeling/
              model.py
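
Regarding note 3 under model.py above (tensors created inside the model must be converted with to_global), here is a minimal sketch. dist.get_layer_placement is assumed to be the placement helper in libai.utils.distributed; adapt as needed:

import oneflow as flow
from libai.utils import distributed as dist

# a sketch: convert a locally created tensor into a global tensor so it can be used
# in graph-mode training; the placement helper is an assumption, not a fixed API
attn_mask = flow.ones(1, 1, 128, 128, dtype=flow.bool)
attn_mask = attn_mask.to_global(
    placement=dist.get_layer_placement(0),  # place on the first pipeline stage
    sbp=flow.sbp.broadcast,                 # replicate across devices
)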

Entry point

tools/train_net.py is the default main entry point provided by libai, so we only need to rewrite parts of it.

When a project is more complex, some metric computations may need to be redefined. An example train_net.py:

import os
import sys

import numpy as np
from scipy.stats import spearmanr

from libai.config import LazyConfig, default_argument_parser, try_get_key
from libai.evaluation import DatasetEvaluator
from libai.trainer import DefaultTrainer, default_setup
from libai.utils import distributed as dist
from libai.utils.checkpoint import Checkpointer

sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), os.path.pardir)))


def spearman_target(cos_sim, labels):
    # compute the Spearman correlation metric
    return spearmanr(cos_sim, labels).correlation


class MyEvaluator(DatasetEvaluator):
    # override the default evaluator
    def __init__(self, cfg):
        self.cfg = cfg
        self._predictions = []

    def reset(self):
        self._predictions = []

    def process(self, inputs, outputs):
        # inputs is the data the model receives for each batch; outputs is what the model returns for each batch (during evaluation)
        cos_sim = outputs["cos_sim"]
        labels = outputs["labels"]
        # append each batch's results to self._predictions
        self._predictions.append({"cos_sim": cos_sim, "labels": labels})

    def evaluate(self):
        if not dist.is_main_process():
            return {}
        else:
            predictions = self._predictions
        sim_array = np.array([])
        label_array = np.array([])
        for prediction in predictions:
            sim_array = np.append(sim_array, dist.tton(prediction["cos_sim"]))
            label_array = np.append(label_array, dist.tton(prediction["labels"]))
        self._results = spearman_target(sim_array, label_array)
        return {"spearman": self._results}


class Trainer(DefaultTrainer):
    # after overriding the evaluator, override this method in the trainer
    @classmethod
    def build_evaluator(cls, cfg):
        return MyEvaluator(cfg)


def main(args):
    cfg = LazyConfig.load(args.config_file)
    cfg = LazyConfig.apply_overrides(cfg, args.opts)
    default_setup(cfg, args)

    if args.fast_dev_run:
        cfg.train.train_epoch = 0
        cfg.train.train_iter = 20
        cfg.train.eval_period = 10
        cfg.train.log_period = 1

    if args.eval_only:
        tokenizer = None
        if try_get_key(cfg, "tokenization.setup", default=False):
            tokenizer = Trainer.build_tokenizer(cfg)
        model = Trainer.build_model(cfg)
        Checkpointer(model, save_dir=cfg.train.output_dir).resume_or_load(
            cfg.train.load_weight, resume=args.resume
        )
        graph = Trainer.build_graph(cfg, model, is_train=False)
        test_loader = Trainer.build_test_loader(cfg, tokenizer)
        res = Trainer.test(cfg, test_loader, graph)  # noqa
        return

    trainer = Trainer(cfg)
    return trainer.train()


if __name__ == "__main__":
    args = default_argument_parser().parse_args()
    main(args)

Redefining the model

Next we define model.py. There is no need to define a Graph model here, because the trainer automatically creates a Graph from the model. An example:

from oneflow import nn

class MyModel(nn.Module):
    def __init__(self, cfg):
        """
        The parameters in cfg here correspond to my_cfg in config.py
        """
        ...

    def forward(self, tokens, padding_mask, types, label):
        ...
        return {"loss": loss}

Redefining the dataset

Defining your own dataset.py is straightforward; here is an example:

import oneflow as flow
from oneflow.utils.data import Dataset

from libai.data.structures import DistTensorData, Instance


class TrainDataset(Dataset):
    def __init__(self, data, tokenizer):
        self.data = data
        self.tokenizer = tokenizer
        self.pad_id = self.tokenizer.pad_token_id
        self.cls_id = self.tokenizer.cls_token_id
        self.sep_id = self.tokenizer.sep_token_id

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        ...
        return Instance(
            input_ids=DistTensorData(flow.tensor(data["input_ids"], dtype=flow.long)),
            attention_mask=DistTensorData(flow.tensor(data["attention_mask"], dtype=flow.long)),
        )

Redefining the config

libai's config is a bit special: it uses the lazyconfig style, can be a .py file, and is saved as .yaml when the run starts.
Here we briefly go over what a complete config.py looks like and how to inherit from it.

First, for a training task, config.py has several required fields:

  • train: training-related arguments, in dict form.
  • model: the model structure. Its construction is specified directly in the config; thanks to lazyconfig, the model is only instantiated when the program runs.
  • optim: the optimizer. Its construction is specified directly in the config, same as above.
  • lr_scheduler: the learning rate schedule. Its construction is specified directly in the config; if it is not defined, the default lr_scheduler is used.
  • graph: graph mode (specific to oneflow). Its construction is specified directly in the config, same as above.

With that in mind, let's look at what a complete config.py should look like.

Note: all imported modules must be referenced by paths rooted at libai's root directory; otherwise the saved yaml file cannot record the correct module paths, reading the yaml will fail, and the experiment cannot be reproduced.

import oneflow as flow
from omegaconf import OmegaConf
from configs.common.data.bert_dataset import tokenization
from configs.common.models.bert import cfg as my_cfg
from configs.common.models.graph import graph
from configs.common.optim import optim
from configs.common.train import train
from libai.config import LazyCall
from libai.data.build import build_nlp_test_loader, build_nlp_train_loader
from libai.tokenizer import BertTokenizer
from libai.optim import get_default_optimizer_params, PolynomialLR
from projects.MyProjects.dataset.dataset import TrainDataset, TestDataset
from projects.MyProjects.modeling import MyModel


tokenization.tokenizer = LazyCall(BertTokenizer)(
    vocab_file=".../vocab.txt",
)

dataloader = OmegaConf.create()
dataloader.train = LazyCall(build_nlp_train_loader)(
    dataset=[
        LazyCall(TrainDataset)(
            path=".../train.txt",
            tokenizer=LazyCall(BertTokenizer)(
                vocab_file=".../vocab.txt"
            ),
        )
    ],
)

dataloader.test = [
    LazyCall(build_nlp_test_loader)(
        dataset=LazyCall(TestDataset)(
            path=".../test.txt",
            tokenizer=LazyCall(BertTokenizer)(
                vocab_file=".../vocab.txt"
            ),
        ),
    ),
]

my_cfg.update(
    dict(
        vocab_size=21128,
        hidden_size=768,
        hidden_layers=12,
        layernorm_eps=1e-12,
        intermediate_size=3072,
    )
)

model = LazyCall(MyModel)(cfg=my_cfg)

# As everyone knows, the optimizer needs model.parameters() at initialization time.
# But the model has not been built yet, so parameters() is not available. Thanks to lazyconfig, only the following optim arguments are recorded here; the optimizer is built lazily later in the program, once model.parameters() is available.
optim = LazyCall(flow.optim.AdamW)(
    parameters=LazyCall(get_default_optimizer_params)(
        # parameters.model is meant to be set to the model object, before instantiating the optimizer.
        clip_grad_max_norm=1.0,
        clip_grad_norm_type=2.0,
        weight_decay_norm=0.0,
        weight_decay_bias=0.0,
    ),
    lr=1e-4,
    weight_decay=0.01,
    betas=(0.9, 0.999),
    do_bias_correction=True,
)

lr_scheduler = LazyCall(flow.optim.lr_scheduler.WarmUpLR)(
    lrsch_or_optimizer=LazyCall(PolynomialLR)(steps=1000, end_learning_rate=1.0e-5,),
    warmup_factor=0,
    warmup_iters=100,
    warmup_method="linear",
)

train.update(
    dict(
        output_dir=".../result",
        train_micro_batch_size=64,
        test_micro_batch_size=64,
        train_epoch=1,
        train_iter=10000,
        eval_period=1000,
        dist=dict(
            data_parallel_size=1,
            tensor_parallel_size=1,
            pipeline_parallel_size=1,
        ),
    )
)

Once config.py is built, to read a field in the program we simply access it like cfg.train.output_dir.
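
For example (a minimal sketch reusing the LazyConfig API shown in the train_net.py example above; the config path is hypothetical):

from libai.config import LazyConfig

# load either the .py config or the saved config.yaml
cfg = LazyConfig.load("projects/your_task/configs/config.py")
print(cfg.train.output_dir)             # ".../result"
print(cfg.optim.lr)                     # 1e-4
cfg.train.train_micro_batch_size = 32   # fields can still be overridden before default_setup(cfg, args)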

Starting training

Once the modules above have been overridden, we can start the training task:
bash projects/my_projects/train.sh projects/my_projects/config.py 1

CONFIG supports both the .py file and the generated .yaml file.

#!/usr/bin/env bash

CONFIG=projects/your_task/config.py #output/your_task/config.yaml
GPUS=1
NODE=1
NODE_RANK=0
PORT=2345

python3 -m oneflow.distributed.launch \
    --nproc_per_node $GPUS \
    --nnodes $NODE \
    --node_rank $NODE_RANK \
    --master_port $PORT \
    projects/your_task/finetune.py \
    --config-file $CONFIG \
    --num-gpus $GPUS

Running the tools/train.sh script fails with: Check failed: num_device > 0 (0 vs. 0) No IB device found

oneflow version: 0.8.0+cu102; libai version: latest commit

I am trying to run GPT-2 with oneflow-libai. Following the tutorial, I only changed the dataset-related configuration, but running bash tools/train.sh tools/train_net.py configs/gpt2_pretrain.py 2 fails. The full log is below:

(oneflow) root@28c67ac89ed8:/home/gehao/OneFlow/libai# bash tools/train.sh tools/train_net.py configs/gpt2_pretrain.py 2
loaded library: /usr/lib/libibverbs.so.1
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
loaded library: loaded library: /usr/lib/libibverbs.so.1/usr/lib/libibverbs.so.1

[08/02 12:39:21 libai]: Rank of current process: 0. World size: 2
[08/02 12:39:21 libai]: Command line arguments: Namespace(config_file='configs/gpt2_pretrain.py', resume=False, eval_only=False, fast_dev_run=False, opts=[])
[08/02 12:39:21 libai]: Contents of args.config_file=configs/gpt2_pretrain.py:
from libai.config import LazyCall
from libai.evaluation import PPLEvaluator
from .common.models.gpt import pretrain_model as model
from .common.train import train
from .common.optim import optim
from .common.data.gpt_dataset import dataloader, tokenization

from .common.models.graph import graph

# vocab_file = "./data_test/gpt_data/gpt2-vocab.json"
# merge_files = "./data_test/gpt_data/gpt2-merges.txt"
# data_prefix = "./data_test/gpt_data/loss_compara_content_sentence"
merge_files = "/home/gehao/dataset/gpt/hf-GPT2Data/merges.txt"
vocab_file = "/home/gehao/dataset/gpt/hf-GPT2Data/vocab.json"
data_prefix = "/home/gehao/dataset/gpt/hf-GPT2Data/hf-gpt2_text_document"

tokenization.tokenizer.vocab_file = vocab_file
tokenization.tokenizer.merges_file = merge_files
dataloader.train.dataset[0].data_prefix = data_prefix
dataloader.train.dataset[0].indexed_dataset.data_prefix = data_prefix

# GPT-2 model config
model.cfg.embedding_dropout_prob = 0.1
model.cfg.attention_dropout_prob = 0.1
model.cfg.num_attention_heads = 16
model.cfg.hidden_size = 384
model.cfg.ffn_hidden_size = 1536
model.cfg.num_layers = 6
model.cfg.max_seq_length = 1024

train.input_placement_device = "cpu"

train.dist.pipeline_num_layers = model.cfg.num_layers

for ds in dataloader.train.dataset:
    ds.max_seq_length = model.cfg.max_seq_length

optim.lr = 1.5e-4

train.train_micro_batch_size = 4
train.amp.enabled = True

train.evaluation.evaluator = LazyCall(PPLEvaluator)()

train.output_dir = "./output/gpt2_output"

[08/02 12:39:21 libai]: Full config saved to ./output/gpt2_output/config.yaml
[08/02 12:39:21 lb.engine.default]: > compiling dataset index builder ...
make: Entering directory '/home/gehao/OneFlow/libai/libai/data/data_utils'
make: Nothing to be done for 'default'.
make: Leaving directory '/home/gehao/OneFlow/libai/libai/data/data_utils'
[08/02 12:39:21 lb.engine.default]: >>> done with dataset index builder. Compilation time: 0.094 seconds
[08/02 12:39:21 lb.engine.default]: >>> done with compiling. Compilation time: 0.096 seconds
[08/02 12:39:21 lb.engine.default]: Prepare training, validating, testing set
[08/02 12:39:21 lb.data.data_utils.indexed_dataset]: building dataset index ...
[08/02 12:39:21 lb.data.data_utils.indexed_dataset]: warming up index mmap file...
[08/02 12:39:21 lb.data.data_utils.indexed_dataset]: reading sizes...
[08/02 12:39:21 lb.data.data_utils.indexed_dataset]: reading pointers...
[08/02 12:39:21 lb.data.data_utils.indexed_dataset]: reading document index...
[08/02 12:39:21 lb.data.data_utils.indexed_dataset]: warming up data mmap file...
[08/02 12:39:28 lb.data.data_utils.indexed_dataset]: creating numpy buffer of mmap...
[08/02 12:39:28 lb.data.data_utils.indexed_dataset]: creating memory view of numpy buffer...
[08/02 12:39:28 lb.data.data_utils.indexed_dataset]: Finished creating indexed dataset in 7.357359 seconds
[08/02 12:39:28 lb.data.data_utils.indexed_dataset]: indexed dataset stats:
[08/02 12:39:28 lb.data.data_utils.indexed_dataset]: number of documents: 8013769
[08/02 12:39:28 lb.data.data_utils.indexed_dataset]: number of sentences: 8013769
[08/02 12:39:28 lb.data.datasets.gpt_dataset]:  > loading doc-idx mapping from /home/gehao/dataset/gpt/hf-GPT2Data/hf-gpt2_text_document_gpt-2_indexmap_80000ns_1024sl_1234s_doc_idx.npy
[08/02 12:39:28 lb.data.datasets.gpt_dataset]:  > loading sample-idx mapping from /home/gehao/dataset/gpt/hf-GPT2Data/hf-gpt2_text_document_gpt-2_indexmap_80000ns_1024sl_1234s_sample_idx.npy
[08/02 12:39:28 lb.data.datasets.gpt_dataset]:  > loading shuffle-idx mapping from /home/gehao/dataset/gpt/hf-GPT2Data/hf-gpt2_text_document_gpt-2_indexmap_80000ns_1024sl_1234s_shuffle_idx.npy
[08/02 12:39:28 lb.data.datasets.gpt_dataset]:     loaded indexed file in 0.017 seconds
[08/02 12:39:28 lb.data.datasets.gpt_dataset]:     total number of samples: 8828142
[08/02 12:39:28 lb.data.datasets.gpt_dataset]:     total number of epochs: 1
[08/02 12:39:29 lb.data.datasets.gpt_dataset]:  > loading doc-idx mapping from /home/gehao/dataset/gpt/hf-GPT2Data/hf-gpt2_text_document_gpt-2_indexmap_19200000ns_1024sl_1234s_doc_idx.npy
[08/02 12:39:29 lb.data.datasets.gpt_dataset]:  > loading sample-idx mapping from /home/gehao/dataset/gpt/hf-GPT2Data/hf-gpt2_text_document_gpt-2_indexmap_19200000ns_1024sl_1234s_sample_idx.npy
[08/02 12:39:29 lb.data.datasets.gpt_dataset]:  > loading shuffle-idx mapping from /home/gehao/dataset/gpt/hf-GPT2Data/hf-gpt2_text_document_gpt-2_indexmap_19200000ns_1024sl_1234s_shuffle_idx.npy
[08/02 12:39:29 lb.data.datasets.gpt_dataset]:     loaded indexed file in 0.002 seconds
[08/02 12:39:29 lb.data.datasets.gpt_dataset]:     total number of samples: 26484426
[08/02 12:39:29 lb.data.datasets.gpt_dataset]:     total number of epochs: 3
[08/02 12:39:29 lb.data.datasets.gpt_dataset]:  > loading doc-idx mapping from /home/gehao/dataset/gpt/hf-GPT2Data/hf-gpt2_text_document_gpt-2_indexmap_6400000ns_1024sl_1234s_doc_idx.npy
[08/02 12:39:29 lb.data.datasets.gpt_dataset]:  > loading sample-idx mapping from /home/gehao/dataset/gpt/hf-GPT2Data/hf-gpt2_text_document_gpt-2_indexmap_6400000ns_1024sl_1234s_sample_idx.npy
[08/02 12:39:29 lb.data.datasets.gpt_dataset]:  > loading shuffle-idx mapping from /home/gehao/dataset/gpt/hf-GPT2Data/hf-gpt2_text_document_gpt-2_indexmap_6400000ns_1024sl_1234s_shuffle_idx.npy
[08/02 12:39:29 lb.data.datasets.gpt_dataset]:     loaded indexed file in 0.002 seconds
[08/02 12:39:29 lb.data.datasets.gpt_dataset]:     total number of samples: 8828142
[08/02 12:39:29 lb.data.datasets.gpt_dataset]:     total number of epochs: 1
F20220802 12:39:32.804811 201821 ibverbs_comm_network.cpp:112] Check failed: num_device > 0 (0 vs. 0) No IB device found
*** Check failure stack trace: ***
    @     0x7fced2e9962a  google::LogMessage::Fail()
    @     0x7fced2e99912  google::LogMessage::SendToLog()
    @     0x7fced2e99197  google::LogMessage::Flush()
    @     0x7fced2e9bd09  google::LogMessageFatal::~LogMessageFatal()
    @     0x7fceca30f910  oneflow::IBVerbsCommNet::IBVerbsCommNet()
    @     0x7fcecc4a95de  oneflow::InitRDMA()
    @     0x7fcf319b35a3  (unknown)
    @     0x7fcf31ba1de9  (unknown)
    @     0x5581bfd408f4  cfunction_call
    @     0x5581bfcfa47f  _PyObject_MakeTpCall
    @     0x5581bfd982e9  _PyEval_EvalFrameDefault
    @     0x5581bfd55be4  _PyFunction_Vectorcall
    @     0x5581bfcbd300  _PyEval_EvalFrameDefault.cold.2983
    @     0x5581bfd54fe3  _PyEval_EvalCode
    @     0x5581bfd55cb4  _PyFunction_Vectorcall
    @     0x5581bfd409aa  _PyObject_FastCallDictTstate
    @     0x5581bfd4a429  slot_tp_init
    @     0x5581bfcfa52f  _PyObject_MakeTpCall
    @     0x5581bfd93d57  _PyEval_EvalFrameDefault
    @     0x5581bfd55be4  _PyFunction_Vectorcall
    @     0x5581bfcbc088  _PyEval_EvalFrameDefault.cold.2983
    @     0x5581bfd54fe3  _PyEval_EvalCode
    @     0x5581bfe01a7c  PyEval_EvalCodeEx
    @     0x5581bfd55dbb  PyEval_EvalCode
    @     0x5581bfe01b2b  run_eval_code_obj
    @     0x5581bfe32155  run_mod
    @     0x5581bfcd31f7  pyrun_file.cold.3078
    @     0x5581bfe3772f  PyRun_SimpleFileExFlags
    @     0x5581bfe37df8  Py_RunMain
    @     0x5581bfe37ff9  Py_BytesMain
    @     0x7fcf3acb6b97  __libc_start_main
    @     0x5581bfdbf6a0  (unknown)
F20220802 12:39:33.046900 201822 ibverbs_comm_network.cpp:112] Check failed: num_device > 0 (0 vs. 0) No IB device found
*** Check failure stack trace: ***
    @     0x7f6e5c4b662a  google::LogMessage::Fail()
    @     0x7f6e5c4b6912  google::LogMessage::SendToLog()
    @     0x7f6e5c4b6197  google::LogMessage::Flush()
    @     0x7f6e5c4b8d09  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f6e5392c910  oneflow::IBVerbsCommNet::IBVerbsCommNet()
    @     0x7f6e55ac65de  oneflow::InitRDMA()
    @     0x7f6ebafd05a3  (unknown)
    @     0x7f6ebb1bede9  (unknown)
    @     0x55652276c8f4  cfunction_call
    @     0x55652272647f  _PyObject_MakeTpCall
    @     0x5565227c42e9  _PyEval_EvalFrameDefault
    @     0x556522781be4  _PyFunction_Vectorcall
    @     0x5565226e9300  _PyEval_EvalFrameDefault.cold.2983
    @     0x556522780fe3  _PyEval_EvalCode
    @     0x556522781cb4  _PyFunction_Vectorcall
    @     0x55652276c9aa  _PyObject_FastCallDictTstate
    @     0x556522776429  slot_tp_init
    @     0x55652272652f  _PyObject_MakeTpCall
    @     0x5565227bfd57  _PyEval_EvalFrameDefault
    @     0x556522781be4  _PyFunction_Vectorcall
    @     0x5565226e8088  _PyEval_EvalFrameDefault.cold.2983
    @     0x556522780fe3  _PyEval_EvalCode
    @     0x55652282da7c  PyEval_EvalCodeEx
    @     0x556522781dbb  PyEval_EvalCode
    @     0x55652282db2b  run_eval_code_obj
    @     0x55652285e155  run_mod
    @     0x5565226ff1f7  pyrun_file.cold.3078
    @     0x55652286372f  PyRun_SimpleFileExFlags
    @     0x556522863df8  Py_RunMain
    @     0x556522863ff9  Py_BytesMain
    @     0x7f6ec42d3b97  __libc_start_main
    @     0x5565227eb6a0  (unknown)
Killing subprocess 201821
Killing subprocess 201822
Traceback (most recent call last):
  File "/home/gehao/anaconda3/envs/mlsys/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/gehao/anaconda3/envs/mlsys/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/gehao/anaconda3/envs/mlsys/lib/python3.9/site-packages/oneflow/distributed/launch.py", line 231, in <module>
    main()
  File "/home/gehao/anaconda3/envs/mlsys/lib/python3.9/site-packages/oneflow/distributed/launch.py", line 219, in main
    sigkill_handler(signal.SIGTERM, None)
  File "/home/gehao/anaconda3/envs/mlsys/lib/python3.9/site-packages/oneflow/distributed/launch.py", line 187, in sigkill_handler
    raise subprocess.CalledProcessError(
subprocess.CalledProcessError: Command '['/home/gehao/anaconda3/envs/mlsys/bin/python3', '-u', 'tools/train_net.py', '--config-file', 'configs/gpt2_pretrain.py']' died with <Signals.SIGABRT: 6>.

A survey of SimCSE

SimCSE

1. Overview

SimCSE is a sentence representation learning method recently proposed by Danqi Chen et al. It supports two training modes, supervised and unsupervised contrastive learning, yielding sup-simcse and unsup-simcse. The unsupervised method feeds an input sentence and predicts itself under a contrastive objective, using only standard dropout as noise; it performs on par with existing supervised approaches. The supervised method uses NLI datasets, taking 'entailment' pairs as positives and 'contradiction' pairs as negatives. Evaluated on STS tasks with a bert-base model, SimCSE reaches Spearman correlations of 74.5% (unsupervised) and 81.6% (supervised), improvements of 7.9 and 4.6 points over the previous SOTA.

Paper: https://arxiv.org/pdf/2104.08821.pdf

GitHub: https://github.com/princeton-nlp/SimCSE

Repository stats (Feb. 2022):

  • star: 1.7k
  • fork: 232
  • watch: 32
  • issues: 136
  • pr: 7

License: MIT License

2. Directory structure

SimCSE-main:

└─SimCSE-main
    ├─data
    ├─demo
    │  └─static
    │      └─files
    ├─figure
    ├─SentEval
    │  ├─data
    │  │  └─downstream
    │  ├─examples
    │  └─senteval
    │      └─tools
    ├─simcse
└─slides 
3. Usage example
from simcse import SimCSE
model = SimCSE("princeton-nlp/sup-simcse-bert-base-uncased")
# get the sentence embedding
embeddings = model.encode("A woman is reading.")
# Compute the cosine similarities between two groups of sentences
sentences_a = ['A woman is reading.', 'A man is playing a guitar.']
sentences_b = ['He plays guitar.', 'A woman is making a photo.']
similarities = model.similarity(sentences_a, sentences_b)
# Build index for a group of sentences and search among them
sentences = ['A woman is reading.', 'A man is playing a guitar.']
model.build_index(sentences)
results = model.search("He plays guitar.")
# [('A man is playing a guitar.', 0.7855920791625977)]
4. Functions in the model part

MLPLayer:
Computes sentence representations from BERT's cls token.
Just a linear layer plus an activation.

Similarity:
Computes the cosine similarity.

Pooler:
Provides several different pooling modes:
'cls': [CLS] representation with BERT/RoBERTa's MLP pooler.
'cls_before_pooler': [CLS] representation without the original MLP pooler.
'avg': average of the last layers' hidden states at each token.
'avg_top2': average of the last two layers.
'avg_first_last': average of the first and the last layers.
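
As an illustration of the 'avg' mode (a minimal sketch rather than the SimCSE code; the tensor names are assumptions), masked mean pooling over the last hidden states looks like this:

import torch

def avg_pool(last_hidden_state, attention_mask):
    # 'avg': mean of the last layer's hidden states over non-padding tokens
    mask = attention_mask.unsqueeze(-1).type_as(last_hidden_state)            # (B, L, 1)
    return (last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)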

cl_init:
Initialization function for contrastive learning.

cl_forward:
Performs the SimCSE contrastive-learning computation using the cls token. Main steps:
1. Use BERT as the encoder to compute the embeddings
2. Use BERT as the encoder to compute the MLM objective
3. Compute the pooled representation according to the pooler type
4. Compute the cosine similarities
5. Compute the loss

sentemb_forward:
According to the pooler type, computes the sentence embedding from BERT's output.

5. Datasets used

Unsupervised SimCSE uses one million sentences from Wikipedia; supervised SimCSE uses the SNLI and MNLI datasets. Both datasets are provided.

6. Evaluating sentence-embedding quality

SimCSE includes a modified version of SentEval.

7. Characteristics of the SimCSE project:

It is built mainly on top of Hugging Face transformers. The model part is fairly easy to reimplement, but the trainer part relies heavily on Hugging Face APIs and would be hard to implement the way they did.

Loading a checkpoint for Swin training significantly increases GPU 0's memory usage

Symptom

A user from ZTE reported that when resuming training from a checkpoint, GPU 0's memory usage increases significantly. I found the problem can also be reproduced with Swin. The experimental data are below.

Experimental data

Configuration: batch 32, graph fp16, data parallel 4

|  | Memory after compiling the train graph and starting training | Memory after the first val graph compilation |
| --- | --- | --- |
| swin tiny, from scratch | 3610 | 3874 |
| swin tiny, load checkpoint | 4102 | 4346 |
| swin small, from scratch | 4936 | 5200 |
| swin small, load checkpoint | 5754 | 5998 |
| swin base, from scratch | 6368 | 6632 |
| swin base, load checkpoint | 7816 | 8060 |

Configuration: batch 32, eager fp32, data parallel 4

|  | Memory at the start of training | Memory after the first val run |
| --- | --- | --- |
| swin tiny, from scratch | 5814 | 5834 |
| swin tiny, load checkpoint | 6278 | 6278 |
| swin small, from scratch | 8308 | 8328 |
| swin small, load checkpoint | 9072 | 9072 |
| swin base, from scratch | 11156 | 11176 |
| swin base, load checkpoint | 12144 | 12144 |

Conclusions

The data show that, in both eager and graph mode, GPU 0 uses considerably more memory when training resumes from a checkpoint than when training from scratch, and the gap grows with model size.

Root cause

By reading the code and running experiments, I found that modifying oneflow's check_point_v2.py as follows reduces the memory usage a lot, although it is still slightly higher than not loading a checkpoint.

diff --git a/python/oneflow/framework/check_point_v2.py b/python/oneflow/framework/check_point_v2.py
index f45a074b86..db1117b366 100644
--- a/python/oneflow/framework/check_point_v2.py
+++ b/python/oneflow/framework/check_point_v2.py
@@ -132,11 +132,11 @@ def _LoadSingleVariable(
             file_backed_blob = FileBackendVariableBlob(path)
             loaded = flow.tensor(
                 file_backed_blob.numpy(), dtype=file_backed_blob.dtype
-            ).to("cuda")
+            )#.to("cuda")
         else:
-            loaded = flow.tensor([]).to("cuda")
+            loaded = flow.tensor([])#.to("cuda")
         loaded = loaded.to_global(
-            flow.placement("cuda", [global_src_rank]), flow.sbp.broadcast
+            flow.placement("cpu", [global_src_rank]), flow.sbp.broadcast
         )
         return loaded

Data after the change

Configuration: batch 32, graph fp16, data parallel 4

|  | Memory after compiling the train graph and starting training | Memory after the first val graph compilation |
| --- | --- | --- |
| swin base, from scratch | 6368 | 6632 |
| swin base, load checkpoint | 6490 | 6734 |

Configuration: batch 32, eager fp32, data parallel 4

|  | Memory at the start of training | Memory after the first val run |
| --- | --- | --- |
| swin base, from scratch | 11156 | 11176 |
| swin base, load checkpoint | 11292 | 11292 |

Learning rate update issue in MAE

Problems encountered while reproducing MAE in LiBai

  • Layer-wise (layer scale) LR decay

When finetuning, MAE uses a layer-wise lr decay strategy: every layer of the network gets a different learning rate. In torch this is done as follows:

  1. https://github.com/facebookresearch/mae/blob/main/util/lr_decay.py defines a function that attaches an lr_scale value to each param_group:
def param_groups_lrd(model, weight_decay=0.05, no_weight_decay_list=[], layer_decay=.75):
    """
    Parameter groups for layer-wise lr decay
    Following BEiT: https://github.com/microsoft/unilm/blob/master/beit/optim_factory.py#L58
    """
    param_group_names = {}
    param_groups = {}

    num_layers = len(model.blocks) + 1

    layer_scales = list(layer_decay ** (num_layers - i) for i in range(num_layers + 1))

    for n, p in model.named_parameters():
        if not p.requires_grad:
            continue

        # no decay: all 1D parameters and model specific ones
        if p.ndim == 1 or n in no_weight_decay_list:
            g_decay = "no_decay"
            this_decay = 0.
        else:
            g_decay = "decay"
            this_decay = weight_decay
            
        layer_id = get_layer_id_for_vit(n, num_layers)
        group_name = "layer_%d_%s" % (layer_id, g_decay)

        if group_name not in param_group_names:
            this_scale = layer_scales[layer_id]

            param_group_names[group_name] = {
                "lr_scale": this_scale,
                "weight_decay": this_decay,
                "params": [],
            }
            param_groups[group_name] = {
                "lr_scale": this_scale,
                "weight_decay": this_decay,
                "params": [],
            }

        param_group_names[group_name]["params"].append(n)
        param_groups[group_name]["params"].append(p)

    # print("parameter groups: \n%s" % json.dumps(param_group_names, indent=2))

    return list(param_groups.values())
  2. After every param_group has its lr_scale, the scheduler multiplies each param_group's lr by its scale at every update step; see: https://github.com/facebookresearch/mae/blob/main/util/lr_sched.py
def adjust_learning_rate(optimizer, epoch, args):
    """Decay the learning rate with half-cycle cosine after warmup"""
    if epoch < args.warmup_epochs:
        lr = args.lr * epoch / args.warmup_epochs 
    else:
        lr = args.min_lr + (args.lr - args.min_lr) * 0.5 * \
            (1. + math.cos(math.pi * (epoch - args.warmup_epochs) / (args.epochs - args.warmup_epochs)))
    for param_group in optimizer.param_groups:
        if "lr_scale" in param_group:
            param_group["lr"] = lr * param_group["lr_scale"]
        else:
            param_group["lr"] = lr
    return lr

This probably cannot be implemented in oneflow outside of eager mode, but MAE finetuning depends on it strongly; without it we cannot align the hyperparameters for the reproduction.
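
For reference, a minimal sketch of what the eager-mode equivalent might look like (plain Python, not LiBai or oneflow API; it assumes an optimizer whose param_groups allow item assignment as in torch, and lr_scales is assumed to be a list aligned with optimizer.param_groups, produced the same way as param_groups_lrd above):

import math

def adjust_learning_rate_with_scales(optimizer, lr_scales, step, base_lr, min_lr, warmup_steps, total_steps):
    # half-cycle cosine schedule with warmup, then rescaled per param group
    if step < warmup_steps:
        lr = base_lr * step / warmup_steps
    else:
        progress = (step - warmup_steps) / (total_steps - warmup_steps)
        lr = min_lr + (base_lr - min_lr) * 0.5 * (1.0 + math.cos(math.pi * progress))
    for group, scale in zip(optimizer.param_groups, lr_scales):
        group["lr"] = lr * scale
    return lr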

LiBai design doc: data loading

Because pretraining datasets are large (possibly larger than memory), libai has to pay extra attention to efficiency in data loading.

Looking through Megatron's history, GLM's data processing comes from Megatron 0.1; from version 1.0 until now Megatron has used its current approach, presumably for efficiency: less padding and fewer repeated preprocessing steps, which shortens training time. Therefore libai should base its data processing on Megatron.

Data processing in Megatron:

  1. The raw text is preprocessed into numpy arrays and saved as binary files; the model then reads these binary files (with bin and idx suffixes).
  2. Preprocessing is a separate step completed before training starts, via the tools/preprocess_data.py script, which writes the bin and idx files. The script supports multiprocessing, and a large corpus split across several files can be processed and saved file by file, then listed for training.
  3. Megatron has three low-level dataset types: IndexedDataset, IndexedCachedDataset, and MMapIndexedDataset, i.e. lazy, lazy with cache, and direct memory mapping. mmap is the fastest; lazy adds IO traffic. But mmap expects all the data to fit in memory, while lazy can keep only part of the data in memory. So for an extremely large corpus lazy is usable while the others may not be; for a moderately large corpus, say a few million samples, mmap is the better choice. The backend is selected at preprocessing time via --dataset-impl.
  4. The indexed_dataset (the three types above) stores numpy arrays; each entry corresponds to one sentence after tokenization and converting tokens to ids.
  5. The upper-level datasets are task specific, e.g. bert_dataset, gpt_dataset, t5_dataset. The low-level datasets are shared and need no changes; users would only ever modify the upper-level datasets.
  6. bert dataset: for the sentences in each document, as many sentences as possible are packed without exceeding max_seq_length, so one bert-dataset sample is usually composed of several indexed_dataset samples. The packed sequence is then split into two parts at a chosen sentence index and the two parts are swapped. This is actually not BERT's NSP task but the SOP task proposed by ALBERT, which seems to work better than NSP. The masking is also a bit more elaborate, supporting whole-word masking and span masking, another improvement over BERT.
  7. gpt dataset: all documents are concatenated into one stream and sliced sequentially into max_seq_length chunks; if a document ends before max_seq_length is reached, reading continues into the next document (a toy sketch of this idea follows after this list).
  8. Items 6 and 7 describe how Megatron processes data for BERT and GPT. The advantage is that padding is minimized, so wasted computation during training drops sharply, and everything operates on numpy arrays rather than raw text, which is efficient. The downside is that the sample indices of the upper-level dataset no longer match those of the low-level indexed dataset, so functions such as build_sample_idx and build_index_mappings remap the ids. These operations hurt usability and simplicity and make user-defined datasets harder to write.
  9. Megatron rewrites the data sampler so that data is fetched according to data parallelism and rank, which also avoids storing duplicated data.
  10. Pretraining may involve several datasets, so Megatron provides a blendable dataset to mix them. Because of the drawback mentioned in item 8, the blendable dataset cannot be as simple as an ordinary concat dataset and has to call build_blending_indices to remap indices. It also supports weighting datasets, enabling down- or up-sampling.
  11. For a large corpus, directly splitting into train/valid/test sets is inconvenient, so there is no split-dataset class. Instead, by changing the doc idx in the indexed dataset, the train, valid, and test sets keep different documents, which realizes the split.
  12. The index remappings above are also saved to files: if they do not exist they are regenerated, otherwise they are read directly. Building the index mappings also involves a cross-machine communication step that I do not understand yet; it may be checking whether the machines are in sync.
  13. Megatron has no num_epoch concept; when building the dataset it obtains or computes num_epoch and num_samples and then constructs the corresponding index ranges and data items.
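
As promised in item 7, a toy sketch of that idea (pure numpy, ignoring the real doc_idx/sample_idx bookkeeping): tokenized documents are concatenated into one stream and sliced into fixed-length samples, so no padding is needed.

import numpy as np

def build_gpt_samples(token_docs, max_seq_length):
    stream = np.concatenate(token_docs)              # all tokenized documents, back to back
    n_samples = len(stream) // max_seq_length        # the ragged tail is simply dropped here
    return stream[: n_samples * max_seq_length].reshape(n_samples, max_seq_length)

docs = [np.arange(5), np.arange(7), np.arange(11)]   # three fake tokenized "documents"
print(build_gpt_samples(docs, 8).shape)              # (2, 8)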

LiBai's approach:

  1. Use the indexed dataset at the bottom level for its efficiency advantage.
  2. At the upper level, keep each dataset as independent as possible, and work out which of the build_sample_idx-style operations fall into which categories, which are shared, and which are task specific.
  3. Use the same sampler approach; after the dataloader is built, use to_consistent to convert to the sbp mechanism.
  4. Mixing and splitting datasets is probably also necessary. Because of the build_sample_idx operations, a split-dataset class seems hard to write; this needs more thought.
  5. The dataset part seems hard to fit into the registry mechanism; if nothing else works, the build_dataset function can check and construct each kind of dataset separately.

Swin graph 3D parallelism fails when gradient accumulation is enabled

Experiment branch: #215

Key configuration in swin_cifar100.py

train.train_micro_batch_size = 8
train.num_accumulation_steps = 2
train.test_micro_batch_size = 16

# parallel strategy settings
train.dist.data_parallel_size = 2
train.dist.tensor_parallel_size = 2
train.dist.pipeline_parallel_size = 2
train.dist.pipeline_num_layers = sum(model.depths)
train.output_dir="./output"

# Set fp16 ON
train.amp.enabled = False
train.activation_checkpoint.enabled = False
graph.enabled = True
bash tools/train.sh tools/train_net.py configs/swin_cifar100.py 8

Error message

F20220329 17:22:55.548020 41348 task_node.cpp:385] Check failed: lbi2data_regst.size() == lbis_.size() (1 vs. 0)

 TaskEdge lbi and regst NOT match. TaskEdge: edge_id = 139009 From: [System-GradientAccumulation-VariableRepeat-model.layers.3.blocks.1.attn.relative_position_bias_table-500] To: [kBoxingZeros\n5:84955136\n178163833374140]
*** Check failure stack trace: ***
    @     0x7febda2b92ea  (unknown)
    @     0x7febda2b95d2  (unknown)
    @     0x7febda2b8e57  (unknown)
    @     0x7febda2bb9c9  (unknown)
    @     0x7febd3755466  oneflow::TaskEdge::CheckRegstLbiValid()
    @     0x7febd37b3e1b  oneflow::Compiler::Compile()
    @     0x7febd327fd17  oneflow::NNGraph::CompileAndInitRuntime()
    @     0x7febd1c43f6f  (unknown)
    @     0x7febd16d3344  (unknown)
    @     0x55f88f4caed4  _PyCFunction_FastCallDict
    @     0x55f88f552c4e  call_function
    @     0x55f88f5752ca  _PyEval_EvalFrameDefault
    @     0x55f88f54bdc4  _PyEval_EvalCodeWithName
    @     0x55f88f54d358  _PyFunction_FastCallDict
    @     0x55f88f4cb29f  _PyObject_FastCallDict
    @     0x55f88f4cfda3  _PyObject_Call_Prepend
    @     0x55f88f4cacde  PyObject_Call
    @     0x55f88f576952  _PyEval_EvalFrameDefault
    @     0x55f88f54bdc4  _PyEval_EvalCodeWithName
    @     0x55f88f54d358  _PyFunction_FastCallDict
    @     0x55f88f4cb29f  _PyObject_FastCallDict
    @     0x55f88f4cfda3  _PyObject_Call_Prepend
    @     0x55f88f4cacde  PyObject_Call
    @     0x55f88f523971  slot_tp_call
    @     0x55f88f4cacde  PyObject_Call

Reproducing SegFormer with LiBai [projects]

SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers

A transformer-based semantic segmentation model.
Paper link

Project goals

  1. Enrich the vision tasks supported by libai. The task types already available or in progress are classification and detection; segmentation can be added.
  2. Compared with SETR, the pioneering work on transformer-based segmentation, this model mainly innovates to improve efficiency, and several of its modules are ViT-based and can directly reuse existing libai layers, so it is well suited as a libai project.

Expected outcomes

  1. Load the pretrained weights provided by huggingface and align the accuracy.
  2. Train with libai and align the accuracy.
  3. Once mature, it can become a libai model directly, and some of its layers can become common layers.

On the inconsistent forward behavior in Attention

Contents

This issue records the inconsistency between LiBai's MultiHeadAttention and the timm implementation, and discusses possible solutions.

Problem description

The main difference between LiBai's MultiHeadAttention and timm's Attention lies in the forward pass. Simplified implementations below:

  • LiBai's Attention implementation
import torch
import torch.nn as nn

class Attention(nn.Module):
    def __init__(self, dim, num_heads=8, qkv_bias=False, attn_drop=0., proj_drop=0.):
        super().__init__()
        self.num_heads = num_heads
        head_dim = dim // num_heads
        self.scale = head_dim ** -0.5

        self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)
        self.attn_drop = nn.Dropout(attn_drop)
        self.proj = nn.Linear(dim, dim)
        self.proj_drop = nn.Dropout(proj_drop)

    def forward(self, x):
        B, N, C = x.shape
        # how LiBai splits query, key, and value
        # =======================
        qkv = self.qkv(x)
        qkv = qkv.view(B, -1, self.num_heads, 3 * C // self.num_heads)
        qkv = qkv.permute(0, 2, 1, 3)
        q, k, v = torch.chunk(qkv, chunks=3, dim=-1)
        # =======================
        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)
        attn = self.attn_drop(attn)

        x = (attn @ v).transpose(1, 2).reshape(B, N, C)
        x = self.proj(x)
        x = self.proj_drop(x)
        return x
  • timm's Attention implementation
import torch
import torch.nn as nn

class Attention(nn.Module):
    def __init__(self, dim, num_heads=8, qkv_bias=False, attn_drop=0., proj_drop=0.):
        super().__init__()
        self.num_heads = num_heads
        head_dim = dim // num_heads
        self.scale = head_dim ** -0.5

        self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)
        self.attn_drop = nn.Dropout(attn_drop)
        self.proj = nn.Linear(dim, dim)
        self.proj_drop = nn.Dropout(proj_drop)

    def forward(self, x):
        B, N, C = x.shape
        # how timm splits query, key, and value
        # =======================
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]   # make torchscript happy (cannot use tensor as tuple)
        # =======================
        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)
        attn = self.attn_drop(attn)

        x = (attn @ v).transpose(1, 2).reshape(B, N, C)
        x = self.proj(x)
        x = self.proj_drop(x)
        return x

The problem is that the query/key/value split is different, so even with identical weights the forward results do not match. Vision transformer implementations generally follow timm's Attention, so when reproducing a ViT model with libai's built-in layers we hit the awkward situation that loading the same weights does not produce the same outputs.

Possible solutions

  1. Add a timm-style Attention implementation and modify TransformerLayer to take an Attention argument that specifies which attention module to use. The benefit: users who want a different attention module (Linear Attention, Performer, or any other attention block instead of vanilla attention) can specify it as a parameter, and thanks to the LazyCall mechanism the pieces can be combined flexibly. The inconvenience: LiBai's Attention fixes certain parameters that may not be unified across every attention module, so the inputs need to be standardized.
  2. Do not modify LiBai's code; instead maintain a separate Attention implementation inside the sub-project, as the MAE and similar projects do. This feels rather awkward; I personally prefer the first option.

A weight-conversion approach that does not modify libai's code

import torch
import torch.nn.functional as F

b = 2
n = 5
num_heads = 4
cn = 6
c = cn*num_heads
x = torch.randn(b*n*c).view(b, n, c)
weight = torch.randn(c*c*3)
bias = torch.rand(c*3)


weight1 = weight.view(c*3, c)
weight1 = weight1.view(3, num_heads, cn, c).permute(1, 0, 2, 3).contiguous().view(c*3, c)
# weight2 = weight.view(num_heads, 3, cn, c).permute(1, 0, 2, 3).contiguous().view(c*3, c)
"""
(head, 3, head_size, hidden_size) -> (3, head, head_size, hidden_size) -> (3 * head * head_size, hidden_size)
"""
weight2 = weight.view(c*3, c)

bias1 = bias
bias1 = bias1.view(3, num_heads, cn).permute(1, 0, 2).contiguous().view(c*3)
# bias2 = bias.view(num_heads, 3, cn).permute(1, 0, 2).contiguous().view(c*3)
bias2 = bias

qkv1 = F.linear(x, weight1, bias=None)
qkv2 = F.linear(x, weight2, bias=None)


# libai version
qkv1 = qkv1.view(b, -1, num_heads, 3*c//num_heads)
qkv1 = qkv1.permute(0, 2, 1, 3)
q1, k1, v1 = torch.chunk(qkv1, chunks=3, dim=-1)


# timm version
qkv2 = qkv2.reshape(b, n, 3, num_heads, c//num_heads)
qkv2 = qkv2.permute(2, 0, 3, 1, 4)
q2, k2, v2 = qkv2[0], qkv2[1], qkv2[2]

print((q1 == q2).all())     # tensor(True)
print((k1 == k2).all())     # tensor(True)
print((v1 == v2).all())     # tensor(True)
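
Based on the check above, a helper that converts a timm-style fused qkv weight/bias into the layout LiBai's slicing expects could look like this (a sketch; the function name is made up):

import torch

def convert_timm_qkv_to_libai(weight, bias, num_heads):
    # timm lays the 3*C output rows out as (3, num_heads, head_dim);
    # LiBai's chunk-on-the-last-dim slicing expects (num_heads, 3, head_dim)
    c3, c = weight.shape
    head_dim = c // num_heads
    w = weight.view(3, num_heads, head_dim, c).permute(1, 0, 2, 3).contiguous().view(c3, c)
    b = bias.view(3, num_heads, head_dim).permute(1, 0, 2).contiguous().view(c3)
    return w, b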

LiBai development checklist (continuously updated)

This collects every feature the libai project needs to support and whether it is finished, so everyone can see the current development progress, know what the framework still lacks, and claim tasks or fill in features. It also collects the implemented features, which can later serve as the basis for the user documentation:

Phase 1: core functionality

Module abstractions

  • linear
  • multihead attention
  • layer norm
  • mlp
  • lm_logits
  • vocab_embedding, embedding, position embedding
  • transformer layer
  • activation
  • cross entropy: since the loss functions in a framework are all common ones, and this one only computes the loss without any logging, I (Dang Kai) think we can just use oneflow's built-in loss functions here; if the built-in losses do not support the commonly used ignore_index and label smoothing features, they should be fixed there rather than pushed onto users.
  • attention with relative position embedding

Models

  • BERT:doing
  • GPT:doing
  • T5:doing
  • VIT:doing
  • realm: a retrieval-augmented language model

Config system

  • yacs-based + command-line arguments
  • LazyCall
  • registry mechanism

trainer

  • build graph
  • build optimizer
  • build lr scheduler
  • train and validate models
  • logging
  • print training metrics
  • save model
  • load model

Data loading

  • bert dataset
  • gpt dataset
  • t5 dataset
  • vit dataset
  • realm dataset
  • dataset splitting
  • dataset merging

Optimizers and schedulers

  • adam(fused)
  • adamw(fused)
  • sgd(fused)
  • LAMB(fused)
  • LARS
  • poly decay
  • cosine decay
  • linear decay
  • inverse root sqrt decay
  • multistep decay

Phase 2: filling in features

General features

  • 2D parallelism, sequence parallelism, and similar experiments under model parallelism
  • generator construction
  • reproducing downstream task results

Features under eager mode

  • amp training
  • ZeRO optimization
  • MoE
  • checkpointing

Features under graph mode

  • amp training
  • ZeRO optimization
  • MoE
  • checkpointing

Phase 3: deployment, quantization, distillation, API interfaces

  • abstract a knowledge-distillation framework
  • implement model quantization
  • allow trained models to interoperate with huggingface or be converted to ONNX for deployment
  • provide an API interface so users can call models through it for zero-shot or few-shot learning

Different data_parallel_size settings lead to different global_batch_size

Problem description

In a libai config file such as bert_large_pretrain.py, I added the following lines to control the parallelism mode:

train.dist.data_parallel_size=2
train.dist.tensor_parallel_size=1

As a user, I think of data_parallel_size as the data-parallel dimension and tensor_parallel_size as the model-parallel dimension; however I set them, from the global (consistent) point of view the model should not change.

In reality, though, when global_batch_size is left unset (train.global_batch_size not set), the actual value is

cfg.train.global_batch_size = (
            train_micro_batch_size
            * dist.get_data_parallel_size()
            * cfg.train.num_accumulation_steps
        )

As a result, setting

train.dist.data_parallel_size=2
train.dist.tensor_parallel_size=1

and setting

train.dist.data_parallel_size=1
train.dist.tensor_parallel_size=2

give different global batch sizes: the former is twice the latter (for example, with train_micro_batch_size=32 and num_accumulation_steps=1, the former gives 64 and the latter 32).

What the user wants

As a user, I expect that changing the parallelism strategy does not affect my model.

So I tried leaving train_micro_batch_size unset and setting
train.global_batch_size = 8
but this errored out, because when train_micro_batch_size is left unset it defaults to 32.

I would rather configure this parameter from a global point of view, so I prefer setting global_batch_size over train_micro_batch_size.
I would like train_micro_batch_size, when left unset, to be derived from global_batch_size (a single division, not hard); a sketch follows below.
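
A minimal sketch of the derivation I am asking for (plain Python, not the LiBai code):

def infer_micro_batch_size(global_batch_size, data_parallel_size, num_accumulation_steps=1):
    # invert global_batch_size = micro * data_parallel_size * num_accumulation_steps
    denom = data_parallel_size * num_accumulation_steps
    assert global_batch_size % denom == 0, "global_batch_size must be divisible by dp_size * num_accumulation_steps"
    return global_batch_size // denom

print(infer_micro_batch_size(8, data_parallel_size=2))  # 4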

Request to the developers

I am not a practitioner who uses such frameworks for algorithm work, so I do not know the conventions for setting batch sizes in this field. But I think users should be made aware of the relationship between global_batch_size, train_micro_batch_size, and data_parallel_size, at least in the documentation.
Users do not need to know much, but they should know that changing data_parallel_size changes the model's batch size.

Dataloader interface design

Option 1:

  • Builder function:
def build_image_train_loader(dataset, batch_size, sampler=None, collate_fn=None, num_workers=4, **kwargs):
    if isinstance(dataset, omegaconf.listconfig.ListConfig):
        dataset = list(dataset)
    elif not isinstance(dataset, list):
        dataset = [dataset]

    if len(dataset) > 1:
        dataset = dataset_mixer(dataset)
    else:
        dataset = dataset[0]

    if sampler is None:
        sampler = CyclicSampler()

    dataloader = DataLoader(
        dataset,
        batch_sampler=sampler,
        num_workers=num_workers,
        collate_fn=trivial_batch_collator if collate_fn is None else collate_fn,
        **kwargs,
    )

    return dataloader, None, None
  • Instantiation:
dataloader.train = LazyCall(build_image_train_loader)(
    dataset=[
        LazyCall(ImageNetDataset)(
            root="/DATA/disk1/ImageNet/extract/", train=True, transform=train_aug_cfg
        ),
    ],
    batch_size=16,
)

Option 2:
data/build.py

def build_train_valid_test_dataset(data_path, weights, splits=[.85, .1, .05]):
    ...
    return train_dataset, valid_dataset, test_dataset

def build_dataset(data_path, weights):
    ...
    return dataset

The trainer file

class Trainer:
    def build_train_dataloader(self, data_path, weights):
        dataset = build_dataset(data_path, weights)
        sampler = CyclicSampler(...)
        return Dataloader(dataset, batch_sampler=sampler, collate_fn=dataset.collate_fn)

    def build_train_valid_test_dataloader(self, data_path, weights, splits):
        train_dataset, valid_dataset, test_dataset = build_train_valid_test_dataset(data_path, weights)
        train_sampler = CyclicSampler(...)
        valid_sampler = SingleRoundSampler(...)
        test_sampler = SingleRoundSampler(...)
        train_dataloader = Dataloader(train_dataset, batch_sampler=sampler, collate_fn=dataset.collate_fn)
        valid_dataloader = Dataloader(valid_dataset, batch_sampler=sampler, collate_fn=dataset.collate_fn)
        test_dataloader = Dataloader(test_dataset, batch_sampler=sampler, collate_fn=dataset.collate_fn)
        return train_dataloader, valid_dataloader, test_dataloader
  • Option 1 only supports lazycall. If so, the registry mechanism written earlier becomes pointless, and users may wonder why lazycall is used to create objects here but not elsewhere. Option 2 can support both lazycall and the registry mechanism in data/build.py.
  • Option 1 exposes more interfaces (xxx_train_loader, xxx_test_loader); Option 2 only needs build_dataset and build_train_valid_test_dataset.
  • How do we distinguish case 1 and case 2 mentioned on Slack? Use four names: train_path, valid_path, test_path, pretrain_path.

WarmupLR scheduler unit-test issue

Issues to discuss and possible solutions

Problem description

When the WarmupLR scheduler is combined with another scheduler, the learning rate does not seem to reach the preset value at the end of warmup. I am not sure whether the problem is in my code or in oneflow's WarmUpLR; please help take a look.

import oneflow as flow
import oneflow.nn as nn

p = nn.Parameter(flow.zeros(0))
opt = flow.optim.SGD([p], lr=5.0)

multi_step = flow.optim.lr_scheduler.MultiStepLR(opt, [10], gamma=0.1)
sched = flow.optim.lr_scheduler.WarmUpLR(multi_step, 
                                         warmup_factor=0.001, 
                                         warmup_iters=5, 
                                         warmup_method="linear")

p.sum().backward()
opt.step()

lrs = [0.005]
for _ in range(20):
    sched.step()
    lrs.append(opt.param_groups[0]["lr"])

print(lrs)
>>> [0.005, 1.004, 2.003, 3.002, 4.001, 4.001, 4.001, 4.001, 4.001, 4.001, 0.40010000000000007, 0.40010000000000007, 0.40010000000000007, 0.40010000000000007, 0.40010000000000007, 0.40010000000000007, 0.40010000000000007, 0.40010000000000007, 0.40010000000000007, 0.40010000000000007, 0.40010000000000007]

In principle the learning rate should be 5.0 at the end of warmup, but here it jumps to 4.001 instead.

Observations

  • The warmup step update is off by one step: comment
  • Setting warmup_iters to 0 raises an error: comment

Possible solutions

Currently all schedulers implemented in libai are wrapped with WarmUpLR, for example:

@SCHEDULER_REGISTRY.register()
def WarmupMultiStepLR(optimizer: flow.optim.Optimizer,
                      warmup_factor: float,
                      warmup_iters: int,
                      milestones: list,
                      gamma: float = 0.1,
                      warmup_method: str = "linear",
                      **kwargs):
    multistep_lr = flow.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=milestones, gamma=gamma
    )
    warmup_multistep_lr = flow.optim.lr_scheduler.WarmUpLR(
        multistep_lr, warmup_factor=warmup_factor, warmup_iters=warmup_iters, warmup_method=warmup_method, **kwargs
    )
    return warmup_multistep_lr
  • The off-by-one issue in the warmup step update needs to be confirmed by reading the code.
  • On whether warmup_iters may be set to 0, there are two options:
    • Setting warmup_iters to 0 means no warmup at all. oneflow treats WarmUpLR as a standalone scheduler that can wrap either an optimizer or another scheduler; when it wraps another scheduler, I think oneflow could handle the warmup_iters=0 case itself, so that the WarmUpLR wrapper is present but the warmup is simply never applied and the original scheduler is kept.
    • Alternatively, I can add a check on my side: if warmup_iters == 0, return the original scheduler directly, as below. I am not sure which is more appropriate.
@SCHEDULER_REGISTRY.register()
def WarmupMultiStepLR(optimizer: flow.optim.Optimizer,
                      warmup_factor: float,
                      warmup_iters: int,
                      milestones: list,
                      gamma: float = 0.1,
                      warmup_method: str = "linear",
                      **kwargs):
    multistep_lr = flow.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=milestones, gamma=gamma
    )
    # if no warmup is requested, return the wrapped scheduler directly
    if warmup_iters == 0:
        return multistep_lr
    warmup_multistep_lr = flow.optim.lr_scheduler.WarmUpLR(
        multistep_lr, warmup_factor=warmup_factor, warmup_iters=warmup_iters, warmup_method=warmup_method, **kwargs
    )
    return warmup_multistep_lr
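
A quick sanity check of the early-return variant above, reusing the setup from the reproduction snippet (a sketch, not an actual unit test):

import oneflow as flow
import oneflow.nn as nn

p = nn.Parameter(flow.zeros(0))
opt = flow.optim.SGD([p], lr=5.0)

# with warmup_iters=0 the function above should hand back the bare MultiStepLR
sched = WarmupMultiStepLR(opt, warmup_factor=0.001, warmup_iters=0, milestones=[10])
print(isinstance(sched, flow.optim.lr_scheduler.MultiStepLR))  # True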

DETR result alignment experiment log

Eager global model parallelism

Hyperparameters aligned with: https://github.com/facebookresearch/detr

Debugging TODO list:

  • Whether the MultiHeadAttention inherited from libai's attention matches torch.nn.MultiheadAttention (aligned)
  • Correctness of loading PyTorch weights
  • Loading the backbone weights
  • Completing the libai-like transformer
  • [Investigating why certain input shapes make loss.backward fail with "F20220602 14:17:25.050042 15603 shape.cpp:187] Check failed: !broadcast_axis_vec.empty() "](#288 (comment))
  • The issue that libai's convert_to_distributed_default_setting cannot apply to_global to register_buffer #288 (comment)
  • A view-related bug to reproduce #288 (comment)
  • A bug to reproduce: RuntimeError: Check failed: in_tensor_desc.is_dynamic() == false #288 (comment)

Module refactoring discussion

A discussion on solving the problem of passing parameters layer by layer in libai. The main idea is to let inner Modules fetch parameters directly instead of receiving them from outside.
Below is a simple demo that can be run directly inside libai.

Create a ModuleBase base class:

from omegaconf import DictConfig
import oneflow as flow

from libai.config import LazyCall, configurable
from libai.models import build_model


cfg = dict(
    in_dim = 1,
    out_dim = 2,
    act = "gelu_tanh"
)
cfg = DictConfig(cfg)

cfg['cfg'] = cfg

class ModuleBase(flow.nn.Module):
    def __init__(self, cfg=None):
        super().__init__()
        self.cfg = cfg

class MLP(ModuleBase):
    def __init__(
        self, 
        in_dim,
        out_dim,
        cfg=None
    ):
        super().__init__(cfg)
        self.a = in_dim
        self.b = out_dim
        self.act = cfg.act

class Transformer(ModuleBase):
    def __init__(
        self,
        in_dim,
        out_dim,
        cfg=None
    ):
        super().__init__(cfg)
        self.mlp = MLP(
            in_dim,
            out_dim,
            cfg=cfg
        )

class BertModel(ModuleBase):
    @configurable
    def __init__(
        self, 
        in_dim,
        out_dim,
        cfg=None
    ):
        super().__init__(cfg)
        self.transformer = Transformer(
            in_dim,
            out_dim,
            cfg=cfg
        )
    
    @classmethod
    def from_config(cls, cfg):
        return {
            "in_dim": cfg.in_dim,
            "out_dim": cfg.out_dim,
            "cfg": cfg.cfg,
        }

bert_model = LazyCall(BertModel)(cfg=cfg)
bert = build_model(bert_model)

print(bert.transformer.mlp.act)    # output: gelu_tanh

The cost of the ModuleBase approach is that every layer and model must inherit from it and carry an extra cfg parameter. For now, though, ModuleBase does not seem very useful (still to be discussed), so below is a demo without ModuleBase; its only cost is the extra cfg parameter.

from omegaconf import DictConfig
import oneflow as flow

from libai.config import LazyCall, configurable
from libai.models import build_model


cfg = dict(
    in_dim = 1,
    out_dim = 2,
    act = "gelu_tanh"
)
cfg = DictConfig(cfg)

cfg['cfg'] = cfg


class MLP(flow.nn.Module):
    def __init__(
        self, 
        in_dim,
        out_dim,
        cfg=None
    ):
        super().__init__()
        self.a = in_dim
        self.b = out_dim
        self.act = cfg.act


class Transformer(flow.nn.Module):
    def __init__(
        self,
        in_dim,
        out_dim,
        cfg=None
    ):
        super().__init__()
        self.mlp = MLP(
            in_dim,
            out_dim,
            cfg=cfg
        )


class BertModel(flow.nn.Module):
    @configurable
    def __init__(
        self, 
        in_dim,
        out_dim,
        cfg=None
    ):
        super().__init__()
        self.transformer = Transformer(
            in_dim,
            out_dim,
            cfg=cfg
        )
    
    @classmethod
    def from_config(cls, cfg):
        return {
            "in_dim": cfg.in_dim,
            "out_dim": cfg.out_dim,
            "cfg": cfg.cfg,
        }


bert_model = LazyCall(BertModel)(cfg=cfg)
bert = build_model(bert_model)

print(bert.transformer.mlp.act)

LiBai trainer design doc

A survey of detectron2, mmdet, ColossalAI, and paddledetection.

paddledetection

paddledetection directly defines a single object, which is rather verbose; a trainer.train() function coordinates the individual modules, and overall it does not feel fundamentally different from ColossalAI.
Adding a new feature means modifying this object, which has a large impact on older versions: trainer.py

detectron2 && mmdet

These two share roughly the same design idea: first define a HookBase

class HookBase:

    def before_train(self):
        """
        Called before the first iteration.
        """
        pass

    def after_train(self):
        """
        Called after the last iteration.
        """
        pass

    def before_step(self):
        """
        Called before each iteration.
        """
        pass

    def after_step(self):
        """
        Called after each iteration.
        """
        pass

    def state_dict(self):
        """
        Hooks are stateless by default, but can be made checkpointable by
        implementing `state_dict` and `load_state_dict`.
        """
        return {}

Then every training-related step inherits from HookBase, and the hooks are packed into a list handed to the trainer, which calls them uniformly inside train().
For example LR_scheduler, optimizer, write_metrics, save_model, and eval_metric can each be written as a separate HookBase subclass, so it is obvious at a glance what each module does at which stage of training, which is less error-prone (a minimal hook sketch follows).
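
To make the pattern concrete, here is a minimal sketch of what one such hook could look like (illustrative only, not code from detectron2 or LiBai; it assumes the trainer exposes optimizer and iter attributes):

class LRLoggerHook(HookBase):
    """Logs the current learning rate before every iteration."""

    def before_step(self):
        # `self.trainer` is attached by register_hooks() below via weakref.proxy
        lr = self.trainer.optimizer.param_groups[0]["lr"]  # assumed attribute for illustration
        print(f"iter {self.trainer.iter}: lr = {lr:.6f}")

# trainer.register_hooks([LRLoggerHook(), ...]) would then run before_step()
# automatically at the right point of every training iteration.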

class TrainerBase:

    def __init__(self) -> None:
        self._hooks: List[HookBase] = []
        self.iter: int = 0
        self.start_iter: int = 0
        self.max_iter: int
        self.storage: EventStorage
        _log_api_usage("trainer." + self.__class__.__name__)

    def register_hooks(self, hooks: List[Optional[HookBase]]) -> None:
        """
        Register hooks to the trainer. The hooks are executed in the order
        they are registered.
        Args:
            hooks (list[Optional[HookBase]]): list of hooks
        """
        hooks = [h for h in hooks if h is not None]
        for h in hooks:
            assert isinstance(h, HookBase)
            # To avoid circular reference, hooks and trainer cannot own each other.
            # This normally does not matter, but will cause memory leak if the
            # involved objects contain __del__:
            # See http://engineering.hearsaysocial.com/2013/06/16/circular-references-in-python/
            h.trainer = weakref.proxy(self)
        self._hooks.extend(hooks)
        
    def train(self, start_iter: int, max_iter: int):
        """
        Args:
            start_iter, max_iter (int): See docs above
        """
        logger = logging.getLogger(__name__)
        logger.info("Starting training from iteration {}".format(start_iter))

        self.iter = self.start_iter = start_iter
        self.max_iter = max_iter

        with EventStorage(start_iter) as self.storage:
            try:
                self.before_train()
                for self.iter in range(start_iter, max_iter):
                    self.before_step()
                    self.run_step()
                    self.after_step()
                # self.iter == max_iter can be used by `after_train` to
                # tell whether the training successfully finished or failed
                # due to exceptions.
                self.iter += 1
            except Exception:
                logger.exception("Exception during training:")
                raise
            finally:
                self.after_train()

    def before_train(self):
        for h in self._hooks:
            h.before_train()

    def after_train(self):
        self.storage.iter = self.iter
        for h in self._hooks:
            h.after_train()

    def before_step(self):
        # Maintain the invariant that storage.iter == trainer.iter
        # for the entire execution of each step
        self.storage.iter = self.iter

        for h in self._hooks:
            h.before_step()

    def after_step(self):
        for h in self._hooks:
            h.after_step()

    def run_step(self):
        raise NotImplementedError

ColossalAI

ColossalAI's trainer has some things in common with detectron2 and mmdet, but the module boundaries are less clear-cut; it feels like something in between PaddleDetection and detectron2/mmdet. During train() you still have to call things like optimizer.zero_grad() yourself inside the function. trainer

    def _train_epoch(self,
                     train_dataloader: DataLoader,
                     epoch: int = None,
                     display_progress: bool = False):
        # set training state
        self._engine.train()
        data_iter = iter(train_dataloader)
        progress = range(self._steps_per_epoch)
        if display_progress:
            if epoch is None:
                progress = tqdm(progress, desc='[Train]')
            else:
                progress = tqdm(progress, desc=f'[Epoch {epoch} train]')

        self._call_hooks('before_train_epoch')
        self._call_timer(action='start', item='train-epoch')
        for i in progress:
            self._call_hooks('before_train_iter')
            self._call_timer(action='start', item='train-step')

            # run 1 training step
            self.engine.zero_grad()
            logits, label, loss = self.schedule.forward_backward_step(
                self.engine, data_iter, forward_only=False, return_loss=True)
            self.engine.step()
            self._call_timer(action='stop', item='train-step', keep_in_history=True)
            self._call_hooks('after_train_iter', output=(logits, label, loss))

            self._cur_step += 1

            # stop when max iter is reached
            if self._exceed_max_step():
                break

        self._call_timer(action='stop', item='train-epoch', keep_in_history=True)
        self._call_hooks('after_train_epoch')
        self._call_timer(action='reset', item='train-step')

Personally I prefer the detectron2 / mmdet design; additions and comments are welcome.

About Load HuggingFace Bert

Some issues found while loading huggingface weights into LiBai's Bert and aligning the outputs; after the fixes below, the outputs match huggingface.

Parameter structure comparison; see the Bert parameter structures of the two libraries at the bottom of this section:

  • The LiBai embedding part matches huggingface; no problem there.
  • Next, the LayerNorm layers: LiBai places LayerNorm at the input of each block while huggingface places it at the output of each block. This is also fine; when loading huggingface weights, just load the LayerNorm of the previous block.
  • For the qkv part, huggingface defines q, k, v separately while LiBai uses a single fused qkv; just load huggingface's q, k, v and concatenate them.
  • Finally, when loading weights, apply permute(1, 0) to the weight of every Linear layer. A small conversion sketch follows.
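
A minimal sketch of these two conversion steps (fusing q/k/v and transposing the Linear weights), assuming the parameter names listed at the end of this section; the exact fused layout still has to match how LiBai chunks query_key_value, so treat this as an illustration rather than the final loader:

import oneflow as flow
import torch

def convert_layer_attention(hf_state, i):
    """Fuse huggingface's separate q/k/v into LiBai's query_key_value and permute(1, 0)."""
    p = f"bert.encoder.layer.{i}.attention.self"
    # each weight is torch.Size([768, 768]) and each bias torch.Size([768])
    q_w, k_w, v_w = (hf_state[f"{p}.{n}.weight"] for n in ("query", "key", "value"))
    q_b, k_b, v_b = (hf_state[f"{p}.{n}.bias"] for n in ("query", "key", "value"))

    qkv_w = torch.cat([q_w, k_w, v_w], dim=0).permute(1, 0).contiguous()  # -> [768, 2304]
    qkv_b = torch.cat([q_b, k_b, v_b], dim=0)                             # -> [2304]
    return {
        f"encoders.{i}.self_attention.query_key_value.weight": flow.tensor(qkv_w.numpy()),
        f"encoders.{i}.self_attention.query_key_value.bias": flow.tensor(qkv_b.numpy()),
    }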

Differences in internal computation between LiBai's Bert and huggingface's Bert make the outputs diverge:

  • Two lines in LiBai's MultiheadAttention prevent this part of the output from matching huggingface; the two computations below yield different q, k, v:
# Original code:
query_key_value = query_key_value.view(bsz, -1, self.num_heads, 3 * self.head_size)
query_key_value = query_key_value.permute(0, 2, 1, 3)
query, key, value = flow.chunk(query_key_value, chunks=3, dim=-1)

# After my modification, the resulting q, k, v match huggingface:
query, key, value = flow.chunk(query_key_value, chunks=3, dim=-1)
query = query.view(query.size(0), query.size(1), self.num_heads, -1).permute(0, 2, 1, 3)
key = key.view(key.size(0), key.size(1), self.num_heads, -1).permute(0, 2, 1, 3)
value = value.view(value.size(0), value.size(1), self.num_heads, -1).permute(0, 2, 1, 3)
  • Next, parts of the internal computation of LiBai's TransformerLayer differ from huggingface, and these differences likewise prevent LiBai's output from matching huggingface:
# This difference makes all later outputs inconsistent; e.g. the MLP layer then receives a different input
# Original code:
# https://github.com/Oneflow-Inc/libai/blob/main/libai/layers/transformer_layer.py#L176
hidden_states = hidden_states + attention_output

# After my modification:
hidden_states = layernorm_output + attention_output

In other words, LiBai's hidden_states is attention_output (the self-attention result) plus the raw input of the TransformerLayer (Bert has 12 TransformerLayers; the first one's input is the Embedding output). In huggingface, hidden_states is attention_output plus the TransformerLayer input after it has passed through a LayerNorm. So in LiBai the input is added into hidden_states without going through LayerNorm, which looks incorrect.
  • The last problem, also in LiBai's TransformerLayer, is another computation difference that makes the outputs inconsistent:
# Original code:
# https://github.com/Oneflow-Inc/libai/blob/main/libai/layers/transformer_layer.py#L200
output = hidden_states + mlp_output

# After modification:
output = layernorm_output + mlp_output

That is, LiBai computes the final output of the TransformerLayer by adding mlp_output to hidden_states, while huggingface adds mlp_output to layernorm_output.
  • After fixing the issues above, set bias_gelu_fusion, bias_dropout_fusion, and apply_query_key_layer_scaling of LiBai's Bert to False. I then wrote a function that loads the huggingface pretrained model, and with the huggingface weights LiBai's Bert produces exactly the same output as huggingface's Bert (given the same sentence as input).

First, the Bert parameter structure in LiBai:

embeddings.vocab_embeddings.weight oneflow.Size([30522, 768])
embeddings.position_embeddings.weight oneflow.Size([512, 768])
embeddings.tokentype_embeddings.weight oneflow.Size([2, 768])

encoders.0.input_layernorm.weight oneflow.Size([768])
encoders.0.input_layernorm.bias oneflow.Size([768])

encoders.0.self_attention.query_key_value.weight oneflow.Size([768, 2304])
encoders.0.self_attention.query_key_value.bias oneflow.Size([2304])
encoders.0.self_attention.dense.weight oneflow.Size([768, 768])
encoders.0.self_attention.dense.bias oneflow.Size([768])

encoders.0.post_attention_layernorm.weight oneflow.Size([768])
encoders.0.post_attention_layernorm.bias oneflow.Size([768])

encoders.0.mlp.dense_h_to_4h.weight oneflow.Size([768, 3072])
encoders.0.mlp.dense_h_to_4h.bias oneflow.Size([3072])


encoders.0.mlp.dense_4h_to_h.weight oneflow.Size([3072, 768])
encoders.0.mlp.dense_4h_to_h.bias oneflow.Size([768])

encoders.1.input_layernorm.weight oneflow.Size([768])
encoders.1.input_layernorm.bias oneflow.Size([768])

encoders.1.self_attention.query_key_value.weight oneflow.Size([768, 2304])
encoders.1.self_attention.query_key_value.bias oneflow.Size([2304])
encoders.1.self_attention.dense.weight oneflow.Size([768, 768])
encoders.1.self_attention.dense.bias oneflow.Size([768])
encoders.1.post_attention_layernorm.weight oneflow.Size([768])
encoders.1.post_attention_layernorm.bias oneflow.Size([768])
encoders.1.mlp.dense_h_to_4h.weight oneflow.Size([768, 3072])
encoders.1.mlp.dense_h_to_4h.bias oneflow.Size([3072])
encoders.1.mlp.dense_4h_to_h.weight oneflow.Size([3072, 768])
encoders.1.mlp.dense_4h_to_h.bias oneflow.Size([768])

final_layernorm.weight oneflow.Size([768])
final_layernorm.bias oneflow.Size([768])
pooler.dense.weight oneflow.Size([768, 768])
pooler.dense.bias oneflow.Size([768])

Then the huggingface parameter structure:

bert.embeddings.word_embeddings.weight torch.Size([30522, 768])
bert.embeddings.position_embeddings.weight torch.Size([512, 768])
bert.embeddings.token_type_embeddings.weight torch.Size([2, 768])
bert.embeddings.LayerNorm.gamma torch.Size([768])
bert.embeddings.LayerNorm.beta torch.Size([768])

bert.encoder.layer.0.attention.self.query.weight torch.Size([768, 768])
bert.encoder.layer.0.attention.self.query.bias torch.Size([768])
bert.encoder.layer.0.attention.self.key.weight torch.Size([768, 768])
bert.encoder.layer.0.attention.self.key.bias torch.Size([768])
bert.encoder.layer.0.attention.self.value.weight torch.Size([768, 768])
bert.encoder.layer.0.attention.self.value.bias torch.Size([768])
bert.encoder.layer.0.attention.output.dense.weight torch.Size([768, 768])
bert.encoder.layer.0.attention.output.dense.bias torch.Size([768])
bert.encoder.layer.0.attention.output.LayerNorm.gamma torch.Size([768])
bert.encoder.layer.0.attention.output.LayerNorm.beta torch.Size([768])

bert.encoder.layer.0.intermediate.dense.weight torch.Size([3072, 768])
bert.encoder.layer.0.intermediate.dense.bias torch.Size([3072])


bert.encoder.layer.0.output.dense.weight torch.Size([768, 3072])
bert.encoder.layer.0.output.dense.bias torch.Size([768])
bert.encoder.layer.0.output.LayerNorm.gamma torch.Size([768])
bert.encoder.layer.0.output.LayerNorm.beta torch.Size([768])

bert.encoder.layer.1.attention.self.query.weight torch.Size([768, 768])
bert.encoder.layer.1.attention.self.query.bias torch.Size([768])
bert.encoder.layer.1.attention.self.key.weight torch.Size([768, 768])
bert.encoder.layer.1.attention.self.key.bias torch.Size([768])
bert.encoder.layer.1.attention.self.value.weight torch.Size([768, 768])
bert.encoder.layer.1.attention.self.value.bias torch.Size([768])
bert.encoder.layer.1.attention.output.dense.weight torch.Size([768, 768])
bert.encoder.layer.1.attention.output.dense.bias torch.Size([768])
bert.encoder.layer.1.attention.output.LayerNorm.gamma torch.Size([768])
bert.encoder.layer.1.attention.output.LayerNorm.beta torch.Size([768])
bert.encoder.layer.1.intermediate.dense.weight torch.Size([3072, 768])
bert.encoder.layer.1.intermediate.dense.bias torch.Size([3072])
bert.encoder.layer.1.output.dense.weight torch.Size([768, 3072])
bert.encoder.layer.1.output.dense.bias torch.Size([768])
bert.encoder.layer.1.output.LayerNorm.gamma torch.Size([768])
bert.encoder.layer.1.output.LayerNorm.beta torch.Size([768])

bert.pooler.dense.weight torch.Size([768, 768])
bert.pooler.dense.bias torch.Size([768])

There is currently a documentation problem: Markdown table syntax renders incorrectly

There is currently a documentation problem: Markdown table syntax renders incorrectly; @lixiang007666 may need to take a look.


This is because the sphinx-rtd-theme theme does not support this Markdown syntax. I checked their docs: the table format they do support is rather complicated and cannot be written in markdown.
So for now the following workaround can be used, and the style is adjustable:

<table border="2" align="center">
    <tr>
        <td align="center">Model</td>
        <td align="center">Pretrain</td>
        <td align="center">Resolution</td>
        <td align="center">Acc@1</td>
        <td align="center">Acc@5</td>
        <td align="center">Download</td>
    </tr>
    <tr>
        <td align="center">ViT-Tiny</td>
        <td align="center">ImageNet-1K</td>
        <td align="center">224x224</td>
        <td align="center">72.7</td>
        <td align="center">91.0</td>
        <td align="center"><a href="https://oneflow-public.oss-cn-beijing.aliyuncs.com/model_zoo/LiBai/ImageNet/vit_tiny_patch16_224/config.yaml">Config</a> | <a href="https://oneflow-public.oss-cn-beijing.aliyuncs.com/model_zoo/LiBai/ImageNet/vit_tiny_patch16_224/model_best.zip">Checkpoint</a></td>
    </tr>
</table>

I'll take a closer look later~

Originally posted by @lixiang007666 in #226 (comment)

Should the libai docs theme be unified with the oneflow repo's?

The libai repo currently uses the default sphinx theme, which does not match the oneflow repo. The oneflow docs / flowvision repos use the furo theme that @jackalcooper picked earlier, which looks nice. I personally suggest libai also use furo; or, at the very least, all public API documentation should share a consistent theme.



@rentainhe @lixiang007666

To unify them, following the oneflow repo, only two places in the sphinx configuration need to change (a sketch of the typical changes follows the links):

https://github.com/Oneflow-Inc/oneflow/blob/master/docs/source/conf.py#L82
https://github.com/Oneflow-Inc/oneflow/blob/master/docs/requirements.txt#L3
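
As a hedged sketch of what those two changes usually look like when switching a sphinx project to furo (not copied from the oneflow repo):

# docs/source/conf.py -- replace the theme name
html_theme = "furo"

# docs/requirements.txt -- add the theme so it is installed when the docs are built
# furo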

[Research] Support ffcv backend

Discussion on how to support the FFCV format

Main references:

Using FFCV

  • The most important point is that an FFCV dataset must be built first. Following the official example, the code below builds a simple FFCV CIFAR10 dataset
import torchvision
from ffcv.writer import DatasetWriter
from ffcv.fields import RGBImageField, IntField

datasets = {
    'train': torchvision.datasets.CIFAR10(root="./", train=True, download=True),
    'test': torchvision.datasets.CIFAR10('./', train=False, download=True)
}
for (name, ds) in datasets.items():
    writer = DatasetWriter(f'./cifar_{name}.beton', {
        'image': RGBImageField(),
        'label': IntField()
    })
    writer.from_indexed_dataset(ds)

The dataset is saved in the .beton format; an FFCV issue explains:

One of the many reasons behind the speed behind FFCV is that it uses an optimized format to store the dataset. You will need to convert your dataset before loading it.

  • Then use FFCV's Loader module
from typing import List

import torch as ch
import torchvision

from ffcv.fields.decoders import IntDecoder, SimpleRGBImageDecoder
from ffcv.loader import Loader, OrderOption
from ffcv.pipeline.operation import Operation
from ffcv.transforms import RandomHorizontalFlip, Cutout, \
    RandomTranslate, Convert, ToDevice, ToTensor, ToTorchImage
from ffcv.transforms.common import Squeeze

CIFAR_MEAN = [125.307, 122.961, 113.8575]
CIFAR_STD = [51.5865, 50.847, 51.255]

BATCH_SIZE = 16

loaders = {}
for name in ['train', 'test']:
    label_pipeline: List[Operation] = [IntDecoder(), ToTensor(), ToDevice('cuda:0'), Squeeze()]
    image_pipeline: List[Operation] = [SimpleRGBImageDecoder()]

    # Add image transforms and normalization
    if name == 'train':
        image_pipeline.extend([
            RandomHorizontalFlip(),
            RandomTranslate(padding=2),
            Cutout(8, tuple(map(int, CIFAR_MEAN))), # Note Cutout is done before normalization.
        ])
    image_pipeline.extend([
        ToTensor(),
        ToDevice('cuda:0', non_blocking=True),
        ToTorchImage(),
        Convert(ch.float16),
        torchvision.transforms.Normalize(CIFAR_MEAN, CIFAR_STD),
    ])

    # Create loaders
    loaders[name] = Loader(f'./cifar_{name}.beton',
                            batch_size=BATCH_SIZE,
                            num_workers=8,
                            order=OrderOption.RANDOM,
                            drop_last=(name == 'train'),
                            pipelines={'image': image_pipeline,
                                       'label': label_pipeline})

A few notes on what the interfaces in this image_pipeline do:

  • RandomHorizontalFlip, RandomTranslate, and Cutout are augmentation ops that only work with the ffcv data format
  • ToTensor() converts to a Torch tensor, but in (H, W, C) layout, so ToTorchImage() is needed to convert the tensor to (C, H, W) before torchvision's Normalize can be applied

Difficulty of supporting FFCV

  • The augmentation ops only work with ffcv's own dataset format, and each op imposes requirements on the interface, so they cannot be pulled out to work with torchvision augmentations
from ffcv.transforms import RandomHorizontalFlip
transfrom_test = torchvision.transforms.Compose(
    [torchvision.transforms.Resize(32), RandomHorizontalFlip(), torchvision.transforms.ToTensor()]
)
dataset = torchvision.datasets.CIFAR10(root="./", train=True, download=True, transform=transfrom_test)

Trying to use RandomHorizontalFlip on its own raises an error; its implementation differs from torch's, so it only works with ffcv's own dataset.

  • It is hard to judge whether compatibility is worth the cost: very few augmentations are implemented, a pytorch tensor only becomes available after ToTorchImage(), and the limited set of augmentations cannot satisfy training needs

Possible ways to support FFCV

  • The only option I can think of is to rewrite a few dataset.py files and, after getting the data, manually convert the pytorch tensor to a oneflow tensor before feeding it into libai for training; ffcv's built-in interfaces cannot be used standalone. A small conversion sketch follows.
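
A minimal sketch of that manual conversion step (going through numpy; the helper name is made up for illustration):

import oneflow as flow
import torch

def torch_to_oneflow(t: torch.Tensor) -> flow.Tensor:
    # detach, move to CPU, and copy through numpy; this adds an extra copy per batch
    return flow.tensor(t.detach().cpu().numpy())

# hypothetical usage with the ffcv Loader built above:
# for images, labels in loaders["train"]:
#     images, labels = torch_to_oneflow(images), torch_to_oneflow(labels)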

Swin data parallelism: GPU memory explodes when going from a single process to 8 processes

Problem description

When running swin data-parallel experiments, for both eager global and graph mode the per-GPU memory usage with 8 processes is much higher than with a single process, while plain DDP behaves normally.

Experiment environment:
类脑 vs009 cluster
oneflow version: 0.8.0+cu112.git.57869e9e39
libai version: de2c68f2692760e5de87ebb815541a98d1b8ebe7

libai eager global fp32 batch 32

1 process, GPU 0 memory: 5734 MB
8 processes, GPU 0 memory: 15190 MB

libai graph fp32 batch 32

1 process, GPU 0 memory: 5098 MB
8 processes, GPU 0 memory: 9064 MB

ddp fp32 batch 32 https://github.com/Oneflow-Inc/swin-transformer/tree/swin_clean_ldp

1 process, GPU 0 memory: 5592 MB
8 processes, GPU 0 memory: 5920 MB

GPT2 pretraining: with the same configuration, libai's GPU memory usage is significantly higher than megatron-lm

oneflow version: 0.8.0+cu102; libai version: latest commit; environment: 8× Tesla P100-PCIE-16GB

I am currently comparing oneflow-libai and megatron-lm on GPT2 pretraining. In my experiments, under identical configurations oneflow-libai uses noticeably more GPU memory than megatron-lm. Is this a problem with my configuration or with the framework itself?

With the same GPT2 network configuration (see the config file below), data_parallel set to 8 and tensor_parallel and pipeline_parallel both set to 1, on a server with 8× 16G Tesla-P100:

  • Megatron-LM can run with micro_batch_size=8 and global_batch_size=64 per iteration, at about 70% GPU memory usage
  • oneflow-libai already reaches 95% GPU memory usage with micro_batch_size=4 and global_batch_size=32; with the same batch size as Megatron-LM it runs out of memory

My changes to the model, training, and parallelism configuration in libai/configs/gpt2_pretrain.py:

from libai.config import LazyCall
from libai.evaluation import PPLEvaluator
from .common.models.gpt import pretrain_model as model
from .common.train import train
from .common.optim import optim
from .common.data.gpt_dataset import dataloader, tokenization

from .common.models.graph import graph

merge_files = "/home/gehao/dataset/gpt/hf-GPT2Data/merges.txt"
vocab_file = "/home/gehao/dataset/gpt/hf-GPT2Data/vocab.json"
data_prefix = "/home/gehao/dataset/gpt/hf-GPT2Data/hf-gpt2_text_document"

tokenization.tokenizer.vocab_file = vocab_file
tokenization.tokenizer.merges_file = merge_files
dataloader.train.dataset[0].data_prefix = data_prefix
dataloader.train.dataset[0].indexed_dataset.data_prefix = data_prefix

# GPT-2 model config
model.cfg.num_layers = 24
model.cfg.vocab_size = 50257
model.cfg.hidden_size = 1024
model.cfg.ffn_hidden_size = 4 * model.cfg.hidden_size 
model.cfg.num_attention_heads = 16
model.cfg.max_seq_length = 1024
model.cfg.embedding_dropout_prob = 0.1
model.cfg.attention_dropout_prob = 0.1

optim.lr = 1.5e-4

train.train_iter=100
train.warmup_ratio=0.01
train.zero_optimization.enabled=True
train.zero_optimization.stage=2
train.checkpointer.period=1000
train.test_micro_batch_size=8
train.evaluation.eval_period=100
train.evaluation.eval_iter=10
train.evaluation.evaluator = LazyCall(PPLEvaluator)()
train.log_period=1
train.amp.enabled = True

# train.input_placement_device = "cpu"
train.input_placement_device = "cuda"
train.rdma_enabled = False

train.dist.data_parallel_size=8
train.dist.tensor_parallel_size=1
train.dist.pipeline_parallel_size=1
train.dist.pipeline_num_layers = model.cfg.num_layers # only takes effect when pipeline_parallel_size > 1

train.train_micro_batch_size=4
train.num_accumulation_steps=1
train.global_batch_size=train.dist.data_parallel_size*train.train_micro_batch_size*train.num_accumulation_steps

for ds in dataloader.train.dataset:
    ds.max_seq_length = model.cfg.max_seq_length

train.output_dir = f"./output/oneflow_libai_perf_gpt2_pretrain"

The corresponding libai/tools/train.sh:

#!/usr/bin/env bash

FILE=$1
CONFIG=$2
GPUS=8
NODE=1
NODE_RANK=0
ADDR=127.0.0.1
PORT=60075

export ONEFLOW_FUSE_OPTIMIZER_UPDATE_CAST=true

python3 -m oneflow.distributed.launch \
--nproc_per_node $GPUS --nnodes $NODE --node_rank $NODE_RANK --master_addr $ADDR --master_port $PORT \
$FILE --config-file $CONFIG ${@:4}

Command to run:

bash tools/train.sh tools/train_net.py configs/gpt2_pretrain.py

VisionTransformer model follow-up

Vision Transformer

Progress on LiBai's VisionTransformer will be tracked under this issue.

  • Work already completed
    • Eager ViT adaptation and loss alignment, related PR #74
    • Notes on the loss alignment details #63
  • Remaining work
    • Whether to redo loss alignment against the timm-aligned Vision Transformer in the latest flowvision (optional): related experiments
    • Test whether the Evaluator is compatible: it is, but one info message inside the Evaluator is slightly off; CP will help fix it~
    • Run the modified Eager ViT end to end on CIFAR100
    • Run the aligned Eager ViT through a full ImageNet training in LiBai and check whether the final result is reasonable
    • Reimplement Vision Transformer with LiBai's built-in Layers and redo the loss alignment experiment
    • Run a full training with the LiBai-implemented Vision Transformer

Experiments

  • A quick training run on CIFAR100: log.txt

ImageNet

  • Hyperparameters not fully aligned
Model Acc@1 (libai) Acc@1 (original)
ViT-Tiny 71.6 72.2

CIFAR100

  • Results under the settings of the fix PR #226
Model Acc@1 (libai) Acc@1 (original)
ViT-Tiny 69.1 65.5

ImageNet

  • Results under the settings of the fix PR #226
Model Acc@1 (libai) Acc@1 (original)
ViT-Tiny 72.7 72.3

Is there code for converting hugging face models yet?

Not pushing for an update; I browsed the issues and PRs and it looks like one is almost ready? If it is coming soon I won't hack one together myself. I want to try 3D parallelism with T5.

A related question: does T5 support pipeline parallelism? The megatron code says there is no pipeline parallelism when a decoder is present.

Reproducing Swin Transformer V2 with LiBai (merging the CCF PRs)

Project introduction

Swin-Transformer V2 [Liu et al. 2021] is Microsoft's follow-up study of the original Swin-Transformer. The original Swin-Transformer, by injecting image priors (via shifted windows), achieved strong performance on image classification, object detection, and semantic segmentation. However, it had several problems: activation values blow up when training large models; downstream tasks use much higher resolutions than pretraining, so the relative position encoding hurts performance; and pretraining needs large amounts of labeled data. The authors address these by moving LayerNorm after the MLP and Attention blocks, generating the relative position bias with a log-spaced meta-network, and using SimMIM to assist training. Finally they design a 3-billion-parameter model that performs extremely well on many datasets. The goal of this project is to reproduce the network with LiBai, extend the model to support all forms of parallel training, and run full training on ImageNet-21k and ImageNet-1k V1/V2 to reach the baselines in the paper.

Model and examples

Model description

The architecture is largely the same as Swin Transformer, apart from three changes:

  1. Move LayerNorm to after the MLP and Attention blocks

  2. Replace the original inner product with cosine similarity when computing attention (a sketch follows below)

  3. Use a meta-network to generate the relative position bias, with a log-spaced parameterization

See the figure below for details:
teaserv6.pdf
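
A minimal sketch of change 2, scaled cosine attention (an illustration of attention = cosine(q, k) / tau, not LiBai's implementation; tau would be a learnable per-head temperature):

import oneflow as flow

def scaled_cosine_attention(q, k, v, tau, mask=None):
    # q, k, v: (batch, num_heads, seq_len, head_dim)
    q = q / (q * q).sum(dim=-1, keepdim=True).sqrt().clamp(min=1e-6)  # L2-normalize q
    k = k / (k * k).sum(dim=-1, keepdim=True).sqrt().clamp(min=1e-6)  # L2-normalize k
    attn = flow.matmul(q, k.transpose(-2, -1)) / tau                  # cosine similarity / temperature
    if mask is not None:
        attn = attn + mask
    attn = attn.softmax(dim=-1)
    return flow.matmul(attn, v)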

Implementation difficulties

  1. Implementing the various forms of parallel training, and aligning accuracy.
  2. Some operator mismatches between OneFlow and PyTorch.

Code that can be borrowed from

  1. The existing ViT and Swin implementations in libai
  2. The official SwinV2 implementation

PRs submitted before the mid-term review

  • SwinV2 reproduction, data-parallel verification, CIFAR-100 accuracy alignment, merged: #321
  • Pipeline parallelism, plus unit tests verifying parallel correctness, merged: #348
  • Added swinv2_loader.py for importing weights between huggingface and libai, with its unit test file, not yet merged: #353

MoE

MoE (Mixture-of-Experts) increases model capacity without increasing compute. The technique is conditional computation: a trainable gating network decides on a sparse combination of experts. Intuitively, a large model is split layer-wise into a set of smaller models, and for each input sample the appropriate small models are selected dynamically for computation.
A SPARSELY-GATED mechanism selects the experts: MoE contains a gating network that decides which experts to activate. A minimal gating sketch follows.
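
A minimal sketch of the top-k sparse gating idea (written with oneflow for consistency with the rest of this page; the names are illustrative, and a real MoE dispatches tokens only to their selected experts instead of computing all of them as done here):

import oneflow as flow
import oneflow.nn as nn

class SimpleMoE(nn.Module):
    def __init__(self, dim, num_experts=4, top_k=2):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)  # trainable gating network
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_experts)])
        self.top_k = top_k

    def forward(self, x):
        # x: (num_tokens, dim)
        scores = self.gate(x).softmax(dim=-1)                         # (num_tokens, num_experts)
        topk_val, topk_idx = scores.topk(self.top_k, dim=-1)
        # keep only the top-k gate values, zero the rest -> sparse combination of experts
        gates = flow.zeros_like(scores).scatter(1, topk_idx, topk_val)
        expert_out = flow.stack([e(x) for e in self.experts], dim=1)  # (num_tokens, num_experts, dim)
        return (gates.unsqueeze(-1) * expert_out).sum(dim=1)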

Survey and design for inference and generation

A survey of how different NLP libraries handle the prediction stage.

FairSeq

The code for generation tasks is mainly at https://github.com/pytorch/fairseq/blob/main/fairseq/sequence_generator.py

class SequenceGenerator(nn.Module):
    def __init__(
        self,
        models,
        tgt_dict,
        beam_size=1,
        ...
    ):
        """Generates translations of a given source sentence."""
        ...
        
    def _generate(
        self,
        sample: Dict[str, Dict[str, Tensor]],
        prefix_tokens: Optional[Tensor] = None,
        constraints: Optional[Tensor] = None,
        bos_token: Optional[int] = None,
    ):
    	...

The code for sequence scoring tasks is mainly at https://github.com/pytorch/fairseq/blob/7e758841da9e05cb21826a60d30a563a9e189d1d/fairseq/sequence_scorer.py#L12

class SequenceScorer(object):
   """Scores the target for a given source sentence."""

   def __init__(
       self,
       tgt_dict,
       softmax_batch=None,
       compute_alignment=False,
       eos=None,
       symbols_to_strip_from_output=None,
   ):
     ...
   
   @torch.no_grad()
   def generate(self, models, sample, **kwargs):
       """Score a batch of translations."""
       net_input = sample["net_input"]
       ...

It is built mainly around generation tasks; relatively few tasks are supported, the two styles are inconsistent, and model-parallel inference is not supported.

AllenNLP

The main code is at https://github.com/allenai/allennlp/blob/426d894ceef591b406cb77a7b094c88c85ad0068/allennlp/models/model.py#L193

Inference is implemented at the model level, with each model bound to one inference method; model and task are not decoupled, and the generation logic is coupled into training.

Megatron-LM

Provides API code at https://github.com/NVIDIA/Megatron-LM/blob/e156d2fea7fc5c98e645f7742eb86b643956d840/megatron/text_generation/api.py#L30

def generate_and_post_process(model,
                              prompts=None,
                              tokens_to_generate=0,
                              return_output_log_probs=False,
                              top_k_sampling=0,
                              top_p_sampling=0.0,
                              temperature=1.0,
                              add_BOS=False,
                              use_eod_token_for_early_termination=True):
    """Run inference and post-process outputs, i.e., detokenize,
    move to cpu and convert to list."""

    # Main inference.
    tokens, lengths, output_log_probs = generate(
        model,
        prompts=prompts,
        tokens_to_generate=tokens_to_generate,
        return_output_log_probs=return_output_log_probs,
        top_k_sampling=top_k_sampling,
        top_p_sampling=top_p_sampling,
        temperature=temperature,
        add_BOS=add_BOS,
        use_eod_token_for_early_termination=use_eod_token_for_early_termination)

    # Only post-process on first stage.
    if mpu.is_pipeline_first_stage():
        tokens, prompts_plus_generations, prompts_plus_generations_segments = \
            detokenize_generations(tokens, lengths, True)
    ...

Relatively few tasks are supported, but it can run inference for models with complex parallelism, e.g. pipeline parallelism; however, the overall implementation and calling flow are complicated and not user-friendly.

HuggingFace

The main code is at https://github.com/huggingface/transformers/blob/eb5bdcdfa51f743887ee1d9c7f230444d7a8b23c/src/transformers/pipelines/base.py#L710

The whole process is abstracted into the following pipeline:

Input -> Tokenization -> Model Inference -> Post-Processing (task dependent) -> Output

The calling style is clear and simple:

from transformers import pipeline

# Allocate a pipeline for sentiment-analysis
classifier = pipeline('sentiment-analysis')
classifier('We are very happy to introduce pipeline to the transformers repository.')
>>> [{'label': 'POSITIVE', 'score': 0.9996980428695679}]

Adding new tasks is convenient: inherit from the Pipeline base class, which decouples the task-specific flow from model inference. A sketch of such an extension follows.
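
A hedged sketch of such an extension (the four method names follow the transformers Pipeline docs; the task logic here is invented purely for illustration):

from transformers import Pipeline

class MyClassificationPipeline(Pipeline):
    def _sanitize_parameters(self, **kwargs):
        # split user kwargs into preprocess / forward / postprocess kwargs
        return {}, {}, {}

    def preprocess(self, inputs):
        # task-specific tokenization
        return self.tokenizer(inputs, return_tensors=self.framework)

    def _forward(self, model_inputs):
        # generic model inference, unchanged across tasks
        return self.model(**model_inputs)

    def postprocess(self, model_outputs):
        # task-specific decoding of the raw outputs
        return model_outputs.logits.argmax(-1).tolist()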

@thinksoso @xiezipeng-ML feel free to add anything I missed or correct any mistakes~

swin 3D parallel training on 8 GPUs collapses (data + tensor + pipeline)

Observed behavior

Experiments show that with the 8-GPU 3D parallel configuration, swin training reaches a certain point after which the loss keeps rising and the accuracy keeps dropping.

Experiment configuration:

train.dist.data_parallel_size = 2
train.dist.tensor_parallel_size = 2
train.dist.pipeline_parallel_size = 2

Experiments done so far

Failing cases

  • data + tensor + naive pipeline (2, 2, 2), eager global, clip grad on, local batch size = 128: reaches 28%, then drops to 1%, training collapses

  • data + tensor + naive pipeline (2, 2, 2), eager global, clip grad off, local batch size = 128: reaches 28%, then drops to 1%, training collapses

  • data + tensor + naive pipeline (2, 2, 2), eager global, clip grad on, local batch size = 32: reaches 18%, then drops to 1%, training collapses

Passing cases

  • pure data parallelism on 8 GPUs, graph + amp + zero stage 1, clip grad on, local batch size = 32: final converged accuracy > 75%

  • pure data parallelism on 8 GPUs, eager global, clip grad on, local batch size = 32: final converged accuracy > 75%

  • data + tensor parallelism (4, 2) on 8 GPUs, graph fp32, clip grad on, local batch size = 64: final converged accuracy > 75%

Other incomplete experiments

  • data + naive pipeline parallelism (2, 4) on 8 GPUs, graph fp32, clip grad on, local batch size = 128: final accuracy ~69%, does not converge to the target accuracy
  • data + tensor parallelism (4, 2), eager global, clip grad on, local batch size = 64: unacceptably slow, roughly 7x slower than pure data parallelism with eager global
  • data + tensor + naive pipeline (2, 2, 2), graph fp32, clip grad on, local batch size = 128: stopped after reaching > 52%, but at the same iteration count the accuracy is much lower than pure data parallelism

Loss alignment experiment notes

This issue records the details to watch when doing loss alignment for CV tasks; it also serves as a record of pitfalls for future loss-alignment experiments.

Experiment configuration

Data loading configuration

Although both sides use the ImageNet dataset, two aspects must be aligned to guarantee identical inputs:

1. Identical data augmentation

One detail to note is the Resize function: given a single value it resizes the short side, and without a CenterCrop afterwards this raises an error; with CenterCrop added it does not, because even if the resized images differ, e.g. (224, 356) and (256, 224), they all get CenterCropped to (224, 224). So writing Resize(224) followed by CenterCrop(224) in the transforms can silently produce inconsistent inputs, and the safest choice is not to rely on transforms.CenterCrop at all; I no longer remember why I added it at the time, so treat this as a recorded pitfall~. Just use the following uniformly:

no_aug_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.CenterCrop((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=IMAGENET_DEFAULT_MEAN, std=IMAGENET_DEFAULT_STD)
])

The corresponding LazyCall version:

no_augmentation_transform = LazyCall(transforms.Compose)(
    transforms=[
        LazyCall(transforms.Resize)(
            size=(224, 224),
            interpolation=InterpolationMode.BILINEAR,
        ),
        LazyCall(transforms.CenterCrop)(
            size=(224, 224),
        ),
        LazyCall(transforms.ToTensor)(),
        LazyCall(transforms.Normalize)(
            mean=IMAGENET_DEFAULT_MEAN,
            std=IMAGENET_DEFAULT_STD,
        )
    ]
)

2. Identical data loading order

Since libai cannot share the same dataloader the way eager mode does, to guarantee identical inputs the sampler called by the pytorch dataloader must match the one in libai, and both must seed their randomness via numpy. Then every sampling step yields the same indices, so even with shuffle set to True the inputs are identical. Below is a reference rewrite of the sampler:

import numpy as np
import oneflow as flow
from torch.utils.data import Sampler


class CyclicSampler(Sampler):
    """This sampler supports cyclic sampling, and it is also compatible with
    non data parallelism and data parallelism.

    Arguments:
        dataset: dataset to be sampled.
        micro_batch_size: batch size per model instance.
        global_batch_size is micro_batch_size times data_parallel_size.
        shuffle: whether to shuffle the dataset.
        consumed_samples: the number of samples that have been trained at the current time,
        used for resuming training.
        data_parallel_rank: local rank for data parallelism.
        data_parallel_size: the size of data parallelism.
        seed: random seed, used for reproducing experiments.
    """

    def __init__(
        self,
        dataset,
        micro_batch_size,
        shuffle=False,
        consumed_samples=0,
        data_parallel_rank=0,
        data_parallel_size=1,
        seed=0,
    ):
        self.dataset = dataset
        self.data_size = len(self.dataset)
        self.shuffle = shuffle

        self.data_parallel_rank = data_parallel_rank
        self.data_parallel_size = data_parallel_size
        self.micro_batch_size = micro_batch_size
        self.actual_batch_size = self.micro_batch_size * self.data_parallel_size
        self.remain_data_size = self.data_size % self.actual_batch_size
        self.active_data_size = self.data_size - self.remain_data_size
        self.consumed_samples = consumed_samples

        self.seed = seed

    def __iter__(self):
        """divide the data into data_parallel_size buckets,
        and shuffle it if `shuffle` is set to `True`.
        Each processor samples from its own buckets and data_loader
        will load the corresponding data.
        """
        epoch = self.consumed_samples // self.data_size
        batch = []
        while True:
            current_epoch_samples = self.consumed_samples % self.data_size

            bucket_size = (
                self.data_size // self.actual_batch_size * self.micro_batch_size
            )
            bucket_offset = current_epoch_samples // self.data_parallel_size
            start_idx = self.data_parallel_rank * bucket_size

            if self.shuffle:
                # this originally used torch.Generator(); only after switching to np.random.seed()
                # is the sampling order identical under the same seed
                np.random.seed(self.seed)
                random_idx = np.random.permutation(bucket_size).tolist()
                indices = [start_idx + x for x in random_idx[bucket_offset:]]
            else:
                seq_idx = flow.arange(bucket_size).tolist()
                indices = [start_idx + x for x in seq_idx[bucket_offset:]]

            epoch += 1

            if (
                hasattr(self.dataset, "supports_prefetch")
                and self.dataset.supports_prefetch
            ):
                self.dataset.prefetch(indices)

            for idx in indices:
                batch.append(idx)
                if len(batch) == self.micro_batch_size:
                    self.consumed_samples += self.actual_batch_size
                    yield batch
                    batch = []

    def __len__(self):
        return self.data_size

    def set_consumed_samples(self, consumed_samples):
        """you can recover the training iteration by setting `consumed_samplers`."""
        self.consumed_samples = consumed_samples

    def set_epoch(self, epoch):
        """used for restoring training status."""
        self.epoch = epoch

Training hyperparameter configuration

  • The model's dropout must be disabled; model.eval() can be used
  • The models must load the same initial weights
  • The optimizer parameters must be identical
  • Do not use any scheduler; use a fixed learning rate. Since libai has no ConstantLR, MultiStepLR can be used with the milestones set to large values
scheduler = LazyCall(WarmupMultiStepLR)(
    max_iters=1000,
    warmup_iters=0,
    warmup_factor = 0.0001,
    alpha = 0.01,
    milestones=[2000]
)
  • Check for grad-norm operations: libai uses the get_default_optimizer_params function, which can set clip_grad_max_norm and similar parameters; all of these must be set to None in order to align the loss with the simplest torch training
optim = LazyCall(flow.optim.AdamW)(
    parameters=LazyCall(get_default_optimizer_params)(
        # parameters.model is meant to be set to the model object,
        # before instantiating the optimizer.
        # set these to None when aligning the loss
        clip_grad_max_norm=None,
        clip_grad_norm_type=None,
        weight_decay_norm=None,
        weight_decay_bias=None,
    ),
    lr=1e-4,
    weight_decay=0.01,
    betas=(0.9, 0.999),
    do_bias_correction=True,
)

Saving the loss and plotting

Following xingyu's BERT loss alignment, the loss of every step can be saved in LiBai by modifying DefaultTrainer and GraphTrainer; the changes are as follows:

  • Modify GraphTrainer, located at /libai/trainer/trainer.py
class GraphTrainer(TrainerBase):
    def __init__(self, graph, data_loader_iter):
        super().__init__()

        graph.model.train()
        self._data_loader_iter = iter(data_loader_iter)
        self.graph = graph
        # add a list for collecting the losses
        self.all_losses = []

    def run_step(self, get_batch: Callable):
        """
        Implement the standard training logic described above.
        """
        assert (
            self.graph.model.training
        ), "[SimpleTrainer] model was changed to eval mode!"
        start = time.perf_counter()

        # If you want to do something with the data, you can wrap the dataloader.
        data = next(self._data_loader_iter)
        data = get_batch(data)
        data_time = time.perf_counter() - start

        # If you want to do something with the losses, you can wrap the model.

        losses = self.graph(*data)
        loss_dict = {"total_loss": losses}

        self.write_metrics(loss_dict, data_time)
        # at the end of every iter, append the loss value to the list defined above
        self.all_losses.append(dist.tton(losses).item())
  • Modify DefaultTrainer, located at /libai/trainer/default.py, in the train function around line 370
    def train(self):
        """
        Run training.
        Returns:
            OrderedDict of results, if evaluation is enabled. Otherwise None.
        """
        super().train(self.start_iter, self.max_iter)
        # write loss
        all_losses = self._trainer.all_losses
        # write the collected losses to a file
        with open("of_vit_loss.txt", "w") as f:
            for loss in all_losses:
                f.write(str(loss) + "\n")

Discussion on how to fix data to_global in get_batch under pipeline parallelism

Related issue

#243 (comment)

Current fix PR

#255

Related discussion:

廖星宇:
What if image and label are modified with some randomness inside getitem

廖星宇:
Something like mixup

廖星宇:
The indices still follow our scheme

梁德澎:
So directly sending what is on rank 0 to rank -1 is the more reasonable option

廖星宇:
Yes, and I'm wondering whether this PR could implement exactly that

廖星宇:
rather than fixing the problem specifically for mixup

廖星宇:
That would also lower the risk of errors later on

梁德澎:
That needs some thought, because for classification tasks

梁德澎:
I know what data there is

梁德澎:
I know explicitly that the label has to be sent

梁德澎:
so I can just hard-code it

梁德澎:
For NLP-style tasks, how would I know what has to be sent and what doesn't?

程鹏:
[screenshot]
程鹏:
We could add a prefer_placement here

程鹏:
If it is None

程鹏:
then just call to_placement(placement_idx) directly

梁德澎:
So this would be handled inside to_global, right

梁德澎:
I think that's feasible

廖星宇:
Let's default to rank 0

廖星宇:
or to the ranks of the first stage

程鹏:
Right

廖星宇:
That would let get_batch stay simpler

程鹏:
But the data is not necessarily only on rank 0

廖星宇:
It's the data-parallel ranks

廖星宇:
because the model-parallel dimension is all broadcast (b)

廖星宇:
so it can be sent over directly

廖星宇:
We can open an issue to discuss this

廖星宇:
If it's hard to implement, we can merge the mixup fix first

廖星宇:
If it's easy, it can be done in the same PR
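
As a hedged sketch of the idea discussed above (a get_batch that by default places each tensor on the ranks of the first pipeline stage, splitting along the data-parallel axis and broadcasting along the model-parallel axis; the helper names follow libai.utils.distributed but the exact signature here is illustrative, not the final API):

import oneflow as flow
from libai.utils import distributed as dist

def get_batch(data, prefer_placement=None):
    # data: dict of local tensors produced by the dataloader
    placement = prefer_placement if prefer_placement is not None else dist.get_layer_placement(0)
    sbp = dist.get_nd_sbp([flow.sbp.split(0), flow.sbp.broadcast])
    return {k: v.to_global(sbp=sbp, placement=placement) for k, v in data.items()}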

Hybrid parallelism fails to converge

Problem description

With swin configured for data + pipeline parallelism, training does not converge.

Experiment branch: #215

Comparison experiments

In the experiments below, the total batch size is fixed at 32.

Experiment 1: single GPU

Change the dist config in swin_cifar100.py to:

train.dist.data_parallel_size=1
train.dist.tensor_parallel_size=1
train.dist.pipeline_parallel_size=1

Top-1 acc after the first epoch: 3.49

Experiment 2: 2 GPUs

Change the dist config in swin_cifar100.py to:

train.dist.data_parallel_size=1
train.dist.tensor_parallel_size=1
train.dist.pipeline_parallel_size=2

Top-1 acc after the first epoch: 3.14

Experiment 3: 4 GPUs

Change the dist config in swin_cifar100.py to:

train.dist.data_parallel_size=1
train.dist.tensor_parallel_size=1
train.dist.pipeline_parallel_size=4

Top-1 acc after the first epoch: 3.81

Experiment 4: 8 GPUs

Change the dist config in swin_cifar100.py to:

train.dist.data_parallel_size=2
train.dist.tensor_parallel_size=1
train.dist.pipeline_parallel_size=4

Top-1 acc after the first epoch: 1.04

Experiment 5: 4 GPUs

Change the dist config in swin_cifar100.py to:

train.dist.data_parallel_size=2
train.dist.tensor_parallel_size=1
train.dist.pipeline_parallel_size=2

Top-1 acc after the first epoch: 0.98

Conclusions

Pure naive pipeline parallelism converges on 1, 2, and 4 GPUs, but as soon as data parallelism is combined with pipeline parallelism, training no longer converges.

On top of experiment 5, an experiment that comments out optimizer.step in eager_trainer also yields an accuracy of about 1.x.

Documentation Guide

Documentation details

from PR: #229 by @khloe-zhang

  • How the developers refer to themselves: the docs use "we" a lot to refer to the developers, which reads as subjective. Fixes:

    • Prefer "LiBai" over "we", e.g. In LiBai, we define a standard set of config namespaces → LiBai defines a standard set of config namespaces
    • Use the thing itself as the subject, e.g. We will introduce it in detail as follows → The details are as follows
  • Inconsistent ways of addressing the user: user, you, and we all appear. Fixes:

    • Use one form of address consistently; in this refine they were all changed to you.

    Note: do not use "we" when instructing the user. Chinese habitually says "we can do it this way", but in English "We can do..." may confuse readers about who is supposed to act. Use "You can do..." directly.

  • Inconsistent label formats: "note" and "step" are formatted differently across documents, with Note vs NOTE and 1. vs Step 1.

    • Standardize on Note and Step 1.

Regression test in libai for OneFlow's zero fix

PR:Oneflow-Inc/oneflow#7557

The approach for this issue has been settled; please try it out with this branch (I have already verified it on my side).

Things to verify:
1. zero runs correctly, i.e. the content of the issue; @CPFLAME
2. libai's hybrid parallelism keeps normal performance (the change touches a basic sbp-inference restriction, and we want to verify it has no negative impact on performance); @L1aoXingyu

Once both are verified, this PR will be merged.

libai config system design document

The config system provides configuration for model definitions and training hyperparameters. A good config system makes model definitions and the training flow clearer, lets users see at a glance how model configurations and training runs differ, and makes reproducing a model much easier.

In my view a good config system has the following four properties:

  • The generated config can be serialized and deserialized. This has two benefits: 1. users can diff configs to see exactly how two training runs differ; 2. loading a config directly reproduces someone else's result, saving a lot of communication time;
  • The config is clearly layered: model-structure parameters, training hyperparameters, and data-loading parameters are cleanly separated;
  • Parameters remain modifiable after configuration, because sometimes the training iters etc. must be set dynamically from the dataset size; so parameters can be changed after the user defines the config, but with some protection mechanism;
  • Users can add config parameters flexibly without invasive changes to libai's internals; libai should be usable as a library rather than something you clone and then modify.

Below is a summary of a survey of the common config systems.

megatron

Defined with python argparse

def parse_args(extra_args_provider=None, defaults={},
               ignore_unknown_args=False):
    """Parse all arguments."""
    parser = argparse.ArgumentParser(description='Megatron-LM Arguments',
                                     allow_abbrev=False)

    # Standard arguments.
    parser = _add_network_size_args(parser)
    parser = _add_regularization_args(parser)
    parser = _add_training_args(parser)
    parser = _add_initialization_args(parser)
    ...

The problem is that the defined arguments cannot be serialized, so it is hard to compare hyperparameters between two runs; code reaches into the inner layers via get_args() to fetch arguments, and modifications have no protection mechanism, so arguments are easy to change by mistake;

huggingface

Defined with python classes

@dataclass
class TrainingArguments:
    """
    TrainingArguments is the subset of the arguments we use in our example scripts **which relate to the training loop
    itself**.
    Using :class:`~transformers.HfArgumentParser` we can turn this class into `argparse
    <https://docs.python.org/3/library/argparse.html#module-argparse>`__ arguments that can be specified on the command
    line.
    Parameters:
        output_dir (:obj:`str`):
            The output directory where the model predictions and checkpoints will be written.
        overwrite_output_dir (:obj:`bool`, `optional`, defaults to :obj:`False`):
            If :obj:`True`, overwrite the content of the output directory. Use this to continue training if
            :obj:`output_dir` points to a checkpoint directory.
...
  """

output_dir: str = field(
        metadata={"help": "The output directory where the model predictions and checkpoints will be written."},
    )
    overwrite_output_dir: bool = field(
        default=False,
        metadata={
            "help": (
                "Overwrite the content of the output directory. "
                "Use this to continue training if output_dir points to a checkpoint directory."
            )
        },
    )
...

huggingface's config system is quite verbose and inconvenient to read; the whole configuration is scattered across different python files. It can apparently save configs, but it is unclear whether a saved config can be loaded to train again, or how easy it is to add new config parameters;

detectron2 & ColossalAI & mmdet

Their config systems resemble the traditional yacs-based config systems, but are all defined with dicts

# d2
model = L(GeneralizedRCNN)(
    backbone=L(FPN)(
        bottom_up=L(ResNet)(
            stem=L(BasicStem)(in_channels=3, out_channels=64, norm="FrozenBN"),
            stages=L(ResNet.make_default_stages)(
                depth=50,
                stride_in_1x1=True,
                norm="FrozenBN",
            ),
            out_features=["res2", "res3", "res4", "res5"],
        ),
        in_features="${.bottom_up.out_features}",
        out_channels=256,
        top_block=L(LastLevelMaxPool)(),
    ),
    proposal_generator=L(RPN)(
        in_features=["p2", "p3", "p4", "p5", "p6"],
    ...

# colossal
model = dict(
    type='VisionTransformerFromConfig',
    tensor_splitting_cfg=dict(
        type='ViTInputSplitter2D',
    ),
    embedding_cfg=dict(
        type='ViTPatchEmbedding2D',
        img_size=IMG_SIZE,
        patch_size=PATCH_SIZE,
        embed_dim=DIM,
    ),
    ...

# mmdet
norm_cfg = dict(type='BN', requires_grad=False)
model = dict(
    type='FasterRCNN',
    backbone=dict(
        type='ResNet',
        depth=50,
        num_stages=3,
        strides=(1, 2, 2),
      ...

Compared with traditional yaml-style config systems, this family is very flexible: python syntax makes it easy to add and modify dicts, simple arithmetic or small functions can live directly in the config, and config parameters can be added non-invasively.
The overall config definition is very clear: the network, dataloader, and training flow can be built entirely from the config; it supports serialization and deserialization, and should let users train with arbitrary combinations of built-in modules purely by writing configs, without extra code. A small serialization sketch follows.
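
A hedged sketch of the serialize / diff workflow this enables, shown with omegaconf directly (the LazyConfig helpers in libai may wrap this differently):

from omegaconf import OmegaConf

cfg = OmegaConf.create({"model": {"hidden_size": 768}, "train": {"lr": 1e-4}})

# dump the fully-resolved config of a run next to its checkpoints
OmegaConf.save(cfg, "run1_config.yaml")

# reload it later to reproduce the run, or diff two dumps to see what changed
cfg2 = OmegaConf.load("run1_config.yaml")
assert cfg2.train.lr == 1e-4
# $ diff run1_config.yaml run2_config.yaml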


libai design document

LiBai

config design document: details
trainer design document: details
lr scheduler design document: details
dataloader design document: details

Megatron https://github.com/NVIDIA/Megatron-LM

The overall structure of megatron is as follows

megatron
	data (data processing code)
    	bert_dataset.py (bert data loading)
		gpt_dataset.py (gpt data loading)
		helper.cpp (utility code for building the index)
		dataset_utils.py (dataset utilities, e.g. generating masks, padding, etc.)
		data_samplers.py (data sampling code)
		...
    fused_kernels (fused-kernel acceleration code)
		layer_norm_cuda_kernel.cu (layernorm cuda kernel)
		scaled_masked_softmax_cuda.cu (masked + softmax kernel)
		...
    model (all model structures and some layer definitions)
		bert_model.py (bert model)
		gpt_model.py (gpt model)
		transformer.py (all transformer layers)
		language_model.py (language-model layers such as embedding and pooler)
		fused_bias_gelu.py (fusion of small scattered layers)
		...
    mpu (model parallel utility)
		layers.py (layers that need model parallelism, e.g. VocabEmbedding, ColumnLinear, etc.)
		cross_entropy.py (distributed cross-entropy)
		...
    optimizer (optimizer-related parts)
		clip_grads.py (gradient clipping)
		optimizer.py (optimizer definitions)
		...
    tokenizer (tokenization code)
		tokenizer.py (tokenizer interface)
		bert_tokenization.py (tokenization for bert tasks)
		gpt2_tokenization.py (tokenization for gpt tasks)
    arguments.py (arguments passed in for training)
	checkpointing.py (checkpoint-saving utilities)
	global_vars.py (global variables)
	learning_rates.py (learning-rate schedules)
	training.py (all the training utility code)

tasks (task-specific code)
	glue (glue evaluation code)
    vision (vision classification evaluation code)
    finetune_utils.py (fine-tuning utilities)

pretrain_bert.py (bert training entry point)
pretrain_gpt.py (gpt training entry point)
...

The model part

The model directory defines the model structures. To define a new model you only need to define extended_attention_mask, position_ids, LMHead, language_model_processing, and Model; take RoFormer as an example:

https://github.com/Oneflow-Inc/LibaiLM/blob/53a243fbe4692c9bbc6b5a28691f3529af656054/model/roformer_model.py#L23-L145

  • Advantage: if the model extracts features with a backbone of stacked transformer encoders and only changes the position encoding and the later loss-related parts, implementing a new model is relatively simple

  • Disadvantage: if the transformer internals need to change, e.g. roformer adding rotary position embeddings inside attention, the transformer code has to be modified directly; the change cannot be inserted in a modular way

The mpu part

This part implements what model parallelism needs, mainly small compute units such as distributed linear layers

  • Advantage: separating ordinary layers from parallel layers makes the code clearer to read, and abstracting out the layers that need heavy hand-tuning also makes them easier for other code to call
  • Disadvantage: highly customized code can only be used in specific situations and is not flexible enough

Training

The training code is written in a functional style with no object orientation; all interfaces call each other as plain functions. To train a new model you only need to define functions such as model_provider, get_batch, loss_func, and forward_step to customize training (a schematic sketch follows).
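
A hedged, self-contained sketch of that functional entry-point pattern (the toy model, data, and the stand-in training loop are invented; in Megatron-LM the real pretrain() wires these functions together and reads sizes from get_args()):

import torch
import torch.nn as nn

class ToyModel(nn.Module):
    # stand-in for the real GPT/BERT model
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(8, 2)

    def forward(self, tokens):
        return self.proj(tokens)

def model_provider():
    return ToyModel()

def get_batch(data_iterator):
    tokens, labels = next(data_iterator)
    return tokens, labels

def loss_func(labels, output):
    loss = nn.functional.cross_entropy(output, labels)
    return loss, {"lm loss": loss.item()}

def forward_step(data_iterator, model):
    tokens, labels = get_batch(data_iterator)
    return loss_func(labels, model(tokens))

# stand-in for what the framework's training loop does with these functions
model = model_provider()
opt = torch.optim.SGD(model.parameters(), lr=0.1)
data_iter = iter([(torch.randn(4, 8), torch.randint(0, 2, (4,))) for _ in range(3)])
for _ in range(3):
    loss, _ = forward_step(data_iter, model)
    opt.zero_grad()
    loss.backward()
    opt.step()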

  • Advantage: the functional style makes customizing the training flow fairly flexible; users do not need to hack the built-in training code, they only plug in the parts they modify
  • Disadvantage: the customization points are fixed; to define the training flow more freely, the internal training logic has to be changed

Overall megatron is a highly customized distributed training codebase. The model types and training styles it supports are fairly fixed. For tasks that need only small changes, using it directly is convenient; if a lot of customization inside the model is needed, megatron as a whole has to be modified, so it is not particularly friendly for research. Also, because of its functional style it avoids passing configuration into functions, so both the model definitions and the training code are full of args = get_args() calls to fetch the global config. That is unpleasant to read, because you have to actually run the code to know the concrete parameter values, and a user who wants to extract one layer for external use must also carry args = get_args(), which makes external use impossible.

The last drawback, in my view, is that megatron's training configuration cannot be displayed intuitively; it is only shown on the console or inspected through shell scripts. A more intuitive approach is to dump the training configuration to a yaml at the start of training, so users can simply diff conf1 conf2 to see the differences between two runs and hence the changes.

xformer https://github.com/facebookresearch/xformers

The overall structure of xformer is as follows

xformers
	components (building blocks needed by models)
    	attention (attention modules)
        	fourier_mix.py
			ortho.py
			lambda_layer.py
			scaled_dot_product.py
        feedforward (feed-forward networks)
        	fused_mlp.py
			mlp.py
        positional_embedding (positional encodings)
        	sine.py
			vocab.py
        activations.py (activation functions)
		residual.py (residual connections)
		reversible.py
    factory
    	block_factory.py (programmable block construction)
		model_factory.py (programmable model construction)
    helpers (helper functions)
    	timm_sparse_attention.py
    models (models)
    	linformer.py
    triton (fused kernels)
    	fused_linear_layer.py
		activations.py
		layer_norm.py
		...

As a freshly open-sourced codebase, xformer provides no training code. Its main selling point is a fine-grained decomposition of the Transformer model so that every module can be combined independently; all the sota models can then be built by arranging or tweaking these modules.

  • Advantage: the network layers are modularized and easy for users to call; operator fusion is done with triton, avoiding large amounts of cuda code
  • Disadvantage: no structure code for most complete models, no training code, and no support for distributed layers

huggingface https://github.com/huggingface/transformers

The overall structure of huggingface is as follows

 transformers
 	data (数据处理函数)
    models (模型部分)
    	albert
        bart
        beit
        bert
        big_bird
        ...
  • Advantage: huggingface provides the most complete set of models, along with training code; models can be built with tf1.0/2.0, pytorch, and jax; models can be converted to onnx; nearly all downstream tasks are supported, with detailed examples and tutorials
  • Disadvantage: only the simplest data parallelism is supported, and the models are unrelated to each other: every model is a full standalone copy of code, so defining a new model requires a lot of duplicated work
