
pymarlin's Introduction

PyMarlin, a lightweight PyTorch library for agile deep learning!


PyMarlin was developed to simplify the end-to-end (E2E) deep learning experimentation lifecycle for data scientists using PyTorch. The library enables an agile way to quickly prototype a new AI scenario on a dev box and seamlessly scale it to multi-node distributed (DDP) GPU training with AzureML or other cloud services.

Key features

  • Provides public and enterprise data pre-processing recipes with out-of-the-box vanilla and parallel processing, requiring no additional code to run on AzureML or other environments.
  • Provides scalable model training with support for single-process, single-VM multi-GPU, and multi-node Distributed Data Parallel training, along with mixed-precision (AMP, Apex) support. ORT- and DeepSpeed-based training are coming soon!
  • Provides out of the box Plugins that can be used for all typical NLP tasks like Sequence Classification, Named Entity Recognition and Seq2Seq text generation.
  • Provides reusable modules for model checkpointing, stats collection, Tensorboard and compliant AML logging which can be customized based on your scenario.
  • Provides a custom argument parser that saves all default values for a scenario's arguments in a YAML config file and merges user-provided arguments at runtime.
  • All core modules are thoroughly linted, unit tested, and even run E2E (multi-node, GPU) in AzureML.
  • PyMarlin is minimal, with an easy-to-understand codebase designed so that others can grasp the entire library and customize it to their needs.

Installation

pip install pymarlin

Read the installation doc for more information.

Start exploring!

Full documentation website

Full website with guides and SDK reference.

Train your first model with pymarlin

Check out the CIFAR image classification example.

GLUE task benchmarking

Explore how to use pymarlin to benchmark your language models on GLUE tasks.

We want your feedback!

Reach out to us with your feedback and suggestions.

pymarlin's People

Contributors

aminsaied, ananthrs1, ashwinsr01, egonzalez125, gshruti95, huseyinatahaninan, jsleep, krishansubudhi, krkusuk, mercerchen-msft, microsoft-github-operations[bot], microsoftopensource, nifarn, rajban2017, shatu


pymarlin's Issues

DeepSpeed Trainer Backend

It's currently on my plate to finish the ORT Trainer Backend that Felix started (currently in PR), but little work has been resourced to figure out how to create a generic DeepSpeed trainer backend. The CNN/DailyMail summarization example has a custom DeepSpeed Trainer/Trainer Backend, and other popular libraries like PyTorch Lightning have a DeepSpeed backend. I believe we should definitely integrate DeepSpeed, especially to adopt model/pipeline parallelism in our model training.

Link the blogposts?

Now that we have blog posts on PyMarlin available publicly, maybe we should add a link to them in the repo?

Enable distributed training over multiple node in compliant detonation chamber

Distributed training currently does not work across multiple nodes in the compliant detonation chamber. This is due to a bug in the set_environment_variables_for_nccl_backend method in pymarlin.utils.distributed, where the master node's address is taken from the environment variable AZ_BATCH_MASTER_NODE rather than AZ_BATCHAI_MPI_MASTER_NODE. While this works for signed builds, the former option is disabled in the detonation chamber. Thus, can we modify the codebase to enable this behavior? Based on this Stack Overflow post, the recommendation seems to be to always use AZ_BATCHAI_MPI_MASTER_NODE over AZ_BATCH_MASTER_NODE.

Given the current implementation for set_environment_variables_for_nccl_backend, this should be a pretty straightforward change of removing the if statement for single_node. I have already verified these changes in compliant detonation chamber.

import os

def set_environment_variables_for_nccl_backend():
    """Sets distributed training environment variables for AzureML OpenMPI runs with the NCCL backend."""

    # NCCL environment. Still works without it.
    os.environ["NCCL_SOCKET_IFNAME"] = "^docker0,lo"
    os.environ["NCCL_IB_DISABLE"] = "0"  # for IB

    single_node = int(os.environ["OMPI_COMM_WORLD_LOCAL_SIZE"]) == int(
        os.environ["OMPI_COMM_WORLD_SIZE"]
    )

    if single_node:
        master_node = os.environ["AZ_BATCHAI_MPI_MASTER_NODE"]
        master_port = "54965"
    else:
        master_node_params = os.environ["AZ_BATCH_MASTER_NODE"].split(":")

        master_node = master_node_params[0]
        master_port = (
            os.environ["MASTER_PORT"] if "MASTER_PORT" in os.environ else "6105"
        )

    # set env variables
    os.environ["MASTER_ADDR"] = master_node
    os.environ["MASTER_PORT"] = master_port
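
Given that, the simplified function this issue proposes might look like the following sketch. It always resolves the master address from AZ_BATCHAI_MPI_MASTER_NODE; the fallback port is illustrative, and this is a proposal rather than the merged fix:

```python
import os

def set_environment_variables_for_nccl_backend():
    """Proposed fix: always use AZ_BATCHAI_MPI_MASTER_NODE (see issue above)."""
    os.environ["NCCL_SOCKET_IFNAME"] = "^docker0,lo"
    os.environ["NCCL_IB_DISABLE"] = "0"  # for IB
    # No single_node branch: AZ_BATCHAI_MPI_MASTER_NODE works in both cases.
    os.environ["MASTER_ADDR"] = os.environ["AZ_BATCHAI_MPI_MASTER_NODE"]
    os.environ["MASTER_PORT"] = os.environ.get("MASTER_PORT", "6105")
```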

DDP Hangs with uneven batches

If you run training with an uneven number of batches across GPUs, a hang will occur. Since PyTorch 1.8.1, the join context manager on DDP exists to help with this situation; I don't believe PyMarlin handles it right now.
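
A minimal sketch of using the join context manager with uneven inputs (assumes PyTorch >= 1.8.1 and an already-initialized process group; the toy model and loop are illustrative):

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

def train_uneven(batches):
    """Train a toy model where ranks may have different batch counts."""
    model = DDP(torch.nn.Linear(4, 1))
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    # join() shadows collectives for ranks that exhaust their data early,
    # so the remaining ranks' backward() calls do not hang.
    with model.join():
        for x, y in batches:
            opt.zero_grad()
            loss = torch.nn.functional.mse_loss(model(x), y)
            loss.backward()
            opt.step()
    return model
```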

Easy self.log() and other QoL improvements such as sane defaults for module_interface

With Python 3.10 pattern matching released, we can easily implement a ModuleInterface.log() method that takes the following combinations of arguments:

ModuleInterface.log(msg:str) -> log message to loggers (stdout)

ModuleInterface.log(key:str, value:float) -> log metric to metric loggers (azureml, tensorboard, stdout) and keep track of it in the pymarlin trainer, potentially for Early Stopping and Best checkpointing

Other sane defaults may include moving tensors to device automatically, base implementations of train/val steps, etc.

DDP validation: All gather for flattened 1D tensors taking long time to complete

Task = POS tagging

    def val_step(self, global_step: int, batch, device="cpu", encoder = None, encoder_kwargs={}):
        """
        Can return multiple outputs. First output need not be loss.
        """
        ...
        print(rels_predicted.shape)
        return label_loss, pointer_loss, rels_predicted, rels_labels

validation ptb_dep 3:: 0%| | 0/7 [00:00<?, ?it/s]torch.Size([1541])
torch.Size([1547])
torch.Size([1500])
torch.Size([1514])
torch.Size([1570])
torch.Size([1506])
torch.Size([1477])
torch.Size([1626])
validation ptb_dep 2:: 29%|█████████████████████████████████████████████████████████▏ | 2/7 [00:00<00:00, 30.67it/s]
gathering
validation ptb_dep 3:: 29%|█████████████████████████████████████████████████████████▏ | 2/7 [00:00<00:00, 29.47it/s]
gathering
validation ptb_dep 1:: 29%|█████████████████████████████████████████████████████████▏ | 2/7 [00:00<00:00, 28.46it/s]
gathering
validation ptb_dep 0:: 29%|█████████████████████████████████████████████████████████▏ | 2/7 [00:00<00:00, 27.57it/s]
gathering
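
One common pattern for gathering ragged 1D tensors like these is to exchange lengths first, pad everything to the max length, and trim after the gather. This is a sketch (the helper name is illustrative, and it may or may not address the timing issue reported here; it assumes an initialized process group):

```python
import torch
import torch.distributed as dist

def all_gather_1d(tensor):
    """Gather variable-length 1D tensors: exchange lengths, pad, unpad."""
    world = dist.get_world_size()
    length = torch.tensor([tensor.numel()])
    lengths = [torch.zeros_like(length) for _ in range(world)]
    dist.all_gather(lengths, length)          # learn every rank's length
    max_len = int(max(int(l) for l in lengths))
    padded = torch.zeros(max_len, dtype=tensor.dtype)
    padded[: tensor.numel()] = tensor         # pad to a common shape
    gathered = [torch.zeros_like(padded) for _ in range(world)]
    dist.all_gather(gathered, padded)
    return [t[: int(l)] for t, l in zip(gathered, lengths)]
```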

Running validation more often than once per epoch.

This is a feature request to be able to run validation/checkpointing more often than once per epoch. For very large datasets, it feels unreasonable to only run validation once per epoch, especially if an epoch takes a couple of hours to complete. Running validation more often would be useful, at least for faster feedback when tuning parameters.

I was able to work around this with a hack in our usage of PyMarlin: we set max_steps_per_epoch to the desired validation frequency. However, this requires modifying the input dataset to track where it currently is in the actual epoch and adjusting the number of epochs supplied to the trainer to account for these "logging epochs". It also causes PyMarlin to report the actual number of training epochs inaccurately.

Overall, the request is either to integrate this hack into PyMarlin's logic for a better user experience, or to implement more frequent validation/checkpointing through a different mechanism. I am more than happy to supply the code for the hack.
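
The requested behavior could be sketched as a step-based schedule inside the training loop (all names are illustrative, not existing PyMarlin API):

```python
def training_loop(batches, train_step, validate, val_interval_steps):
    """Run validation every val_interval_steps instead of once per epoch."""
    global_step = 0
    for batch in batches:
        train_step(batch)
        global_step += 1
        if global_step % val_interval_steps == 0:
            validate(global_step)   # could also checkpoint here
```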

Dockerfile and CI pipeline for Docker image

@ashwinsr01 requested to know how to create custom docker images for pymarlin and I don't believe we've shared any in this repo. I will create a PR with a simple Dockerfile that starts from one of the AzureML base images and installs pymarlin[plugins] and potential backends we want to install (apex, opacus, ORT, DeepSpeed)

PyTorch version

What PyTorch version does the latest PyMarlin release support? Since PyTorch has a fairly frequent release cycle with plenty of breaking changes, it would be good to specify the lower and upper bounds of the PyTorch version the PyMarlin library can support.
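
For example, bounds could be declared in setup.py; the version numbers below are purely illustrative, not PyMarlin's actual support matrix:

```python
# setup.py fragment (illustrative bounds only):
install_requires = [
    "torch>=1.8.1,<1.11",
]
```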

user defined save_best_checkpoint() hook

It should be a feature of our library to keep the "best" checkpoint based on a tracked metric, such as the minimum validation loss.

This could be a semi-big change because:

  • we do not have any metric tracking
  • we do not have any conditional checkpointing implemented, though Eduardo made checkpointing generic enough to support it.
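
Once a metric is tracked, conditional best-checkpointing could look like this minimal sketch (all names are illustrative, not existing PyMarlin API):

```python
class BestCheckpointer:
    """Save a checkpoint only when the tracked metric improves."""
    def __init__(self, mode="min"):
        self.mode = mode
        self.best = float("inf") if mode == "min" else float("-inf")

    def maybe_save(self, value, save_fn):
        improved = value < self.best if self.mode == "min" else value > self.best
        if improved:
            self.best = value
            save_fn()     # e.g. the user-defined save_best_checkpoint() hook
        return improved
```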

CustomArgParser doesn't support the yaml_file_arg_key functionality

I believe the intention of the yaml_file_arg_key argument to CustomArgParser is to let the end user choose the command-line name for providing their config file. However, the value provided to yaml_file_arg_key is currently not used; the config param name is just assumed to be config_path when fetching and parsing the config in Line 83.

self._parse_config(args.config_path)
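
The intended behavior could be sketched with plain argparse as follows (parse_config_path and its arguments are illustrative, not the CustomArgParser internals):

```python
import argparse

def parse_config_path(argv, yaml_file_arg_key="config_path"):
    """Resolve the config file path from a user-chosen argument name."""
    parser = argparse.ArgumentParser()
    parser.add_argument(f"--{yaml_file_arg_key}")
    args, _ = parser.parse_known_args(argv)
    # Look up by the user-supplied key instead of hard-coding config_path.
    return getattr(args, yaml_file_arg_key)
```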

tb_log_hist_steps is misplaced in the Tensorboard writer

tb_log_hist_steps seems to be misplaced in the Tensorboard writer. A good design principle here, in my opinion, is to let the writer simply write out the logs and to control the write frequency through the stats and trainer/trainer_backend classes. This is already done for other types of logging and should be applied consistently to writing out histograms through tensorboard as well.

if step % self.args.tb_log_hist_steps == 0:
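
The suggested separation could be sketched as follows (class and function names are illustrative):

```python
class HistWriter:
    """Writer with no frequency logic: it writes whenever it is called."""
    def __init__(self):
        self.calls = []

    def log_histogram(self, name, values):
        self.calls.append(name)   # stand-in for SummaryWriter.add_histogram

def maybe_log_hist(writer, step, interval, name, values):
    """Frequency is decided by the trainer, not the writer."""
    if step % interval == 0:
        writer.log_histogram(name, values)
```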

refactor to units of steps instead of epochs

Many models train and perform actions (like checkpointing and evaluation) in units of steps instead of epochs. We should probably incorporate this, and would most likely need to convert epochs to steps depending on what the user requests.
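
The epoch-to-step conversion could look like this sketch (names and the gradient-accumulation handling are illustrative):

```python
import math

def epochs_to_steps(num_epochs, dataset_size, batch_size, grad_accum=1):
    """Convert a requested number of epochs into optimizer steps."""
    batches_per_epoch = math.ceil(dataset_size / batch_size)
    steps_per_epoch = math.ceil(batches_per_epoch / grad_accum)
    return num_epochs * steps_per_epoch
```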

Custom Argparser doesn't support multi-level commandline override

The custom argparser throws an exception when a multi-level nested config.yaml file is provided and some of the nested parameters are overridden through the command line.

I believe this is because of the hard-coded assumption of single-level nesting in the custom argparser. Some examples:

self._config[arglist[0].strip('-')][arglist[1].strip('-')] = arg_dict[arg]

yaml_arg_value = self._config[arglist[0].strip('-')][arglist[1]]
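
A nesting-agnostic override could be sketched like this (the function name is illustrative), walking the split key path instead of indexing exactly two levels:

```python
def set_nested(config, keys, value):
    """Apply an override to an arbitrarily nested config dict."""
    node = config
    for key in keys[:-1]:
        node = node.setdefault(key, {})   # descend, creating levels as needed
    node[keys[-1]] = value
```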

Example of multi-level nesting: (screenshot omitted)
