
pymarlin's Introduction

PyMarlin, a lightweight PyTorch library for agile deep learning!


PyMarlin was developed to simplify the end-to-end (E2E) deep learning experimentation lifecycle for data scientists using PyTorch. The library enables an agile way to quickly prototype a new AI scenario on a dev box and seamlessly scale it to multi-node distributed (DDP) GPU training with AzureML or other cloud services.

Key features

  • Provides public and enterprise data pre-processing recipes with out-of-the-box vanilla and parallel processing, requiring no additional code to run on AzureML or other environments.
  • Provides scalable model training with support for single-process, single-VM multi-GPU, and multi-node Distributed Data Parallel training, along with mixed-precision (AMP, Apex) support. ORT- and DeepSpeed-based training are coming soon!
  • Provides out of the box Plugins that can be used for all typical NLP tasks like Sequence Classification, Named Entity Recognition and Seq2Seq text generation.
  • Provides reusable modules for model checkpointing, stats collection, Tensorboard and compliant AML logging which can be customized based on your scenario.
  • Provides a custom argument parser that saves all default values for a scenario's arguments in a YAML config file and merges user-provided arguments at runtime.
  • All core modules are thoroughly linted, unit tested, and even run E2E (multi-node, GPU) in AzureML.
  • PyMarlin is minimal, with an easy-to-understand codebase designed so that others can grasp the entire library and customize it to their needs.

Installation

pip install pymarlin

Read the installation doc for more information.

Start exploring!

Full documentation website

Full website with guides and SDK reference.

Train your first model with pymarlin

Check out the CIFAR image classification example.

GLUE task benchmarking

Explore how to use pymarlin to benchmark your language models on GLUE tasks.

We want your feedback!

Reach out to us with your feedback and suggestions.

pymarlin's People

Contributors

aminsaied, ananthrs1, ashwinsr01, egonzalez125, gshruti95, huseyinatahaninan, jsleep, krishansubudhi, krkusuk, mercerchen-msft, microsoft-github-operations[bot], microsoftopensource, nifarn, rajban2017, shatu


pymarlin's Issues

DeepSpeed Trainer Backend

It's currently on my plate to finish the ORT Trainer Backend that Felix started (currently in PR), but little work has been resourced to figure out how to create a generic DeepSpeed trainer backend. The CNN/DailyMail summarization example has a custom DeepSpeed Trainer/Trainer Backend, and other popular libraries like PyTorch Lightning have a DeepSpeed backend. I believe we should definitely integrate DeepSpeed, especially to adopt model/pipeline parallelism in our model training.

Link the blogposts?

Now that we have blog posts on PyMarlin available publicly, maybe we should add a link to them in the repo?

Enable distributed training over multiple node in compliant detonation chamber

Distributed training currently does not work across multiple nodes in the compliant detonation chamber. This is due to a bug in the set_environment_variables_for_nccl_backend method in pymarlin.utils.distributed, where the master node's address is taken from the environment variable AZ_BATCH_MASTER_NODE rather than AZ_BATCHAI_MPI_MASTER_NODE. While this works for signed builds, the former option is disabled in the detonation chamber. Thus, can we modify the codebase to enable this behavior? Based on this Stack Overflow post, the recommendation seems to be to always use AZ_BATCHAI_MPI_MASTER_NODE over AZ_BATCH_MASTER_NODE.

Given the current implementation for set_environment_variables_for_nccl_backend, this should be a pretty straightforward change of removing the if statement for single_node. I have already verified these changes in compliant detonation chamber.

import os

def set_environment_variables_for_nccl_backend():
    """Sets distributed training environment variables for AzureML OpenMPI runs with the NCCL backend."""

    # NCCL environment. Still works without it.
    os.environ["NCCL_SOCKET_IFNAME"] = "^docker0,lo"
    os.environ["NCCL_IB_DISABLE"] = "0"  # for IB

    single_node = int(os.environ["OMPI_COMM_WORLD_LOCAL_SIZE"]) == int(
        os.environ["OMPI_COMM_WORLD_SIZE"]
    )

    if single_node:
        master_node = os.environ["AZ_BATCHAI_MPI_MASTER_NODE"]
        master_port = "54965"
    else:
        master_node_params = os.environ["AZ_BATCH_MASTER_NODE"].split(":")

        master_node = master_node_params[0]
        master_port = (
            os.environ["MASTER_PORT"] if "MASTER_PORT" in os.environ else "6105"
        )

    # set env variables
    os.environ["MASTER_ADDR"] = master_node
    os.environ["MASTER_PORT"] = master_port
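
Given that, the simplified function this issue proposes might look like the following sketch. It always resolves the master address from AZ_BATCHAI_MPI_MASTER_NODE; the fallback port is illustrative, and this is a proposal rather than the merged fix:

```python
import os

def set_environment_variables_for_nccl_backend():
    """Proposed fix: always use AZ_BATCHAI_MPI_MASTER_NODE (see issue above)."""
    os.environ["NCCL_SOCKET_IFNAME"] = "^docker0,lo"
    os.environ["NCCL_IB_DISABLE"] = "0"  # for IB
    # No single_node branch: AZ_BATCHAI_MPI_MASTER_NODE works in both cases.
    os.environ["MASTER_ADDR"] = os.environ["AZ_BATCHAI_MPI_MASTER_NODE"]
    os.environ["MASTER_PORT"] = os.environ.get("MASTER_PORT", "6105")
```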

DDP Hangs with uneven batches

If you run training with an uneven number of batches across GPUs, a hang will occur. Since PyTorch 1.8.1, the join context manager on DDP exists to help with this situation; I don't believe PyMarlin handles it right now.
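
A minimal sketch of using the join context manager with uneven inputs (assumes PyTorch >= 1.8.1 and an already-initialized process group; the toy model and loop are illustrative):

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

def train_uneven(batches):
    """Train a toy model where ranks may have different batch counts."""
    model = DDP(torch.nn.Linear(4, 1))
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    # join() shadows collectives for ranks that exhaust their data early,
    # so the remaining ranks' backward() calls do not hang.
    with model.join():
        for x, y in batches:
            opt.zero_grad()
            loss = torch.nn.functional.mse_loss(model(x), y)
            loss.backward()
            opt.step()
    return model
```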

Easy self.log() and other QoL improvements such as sane defaults for module_interface

With Python 3.10 pattern matching released, we can easily implement a ModuleInterface.log() method that takes the following combinations of arguments:

ModuleInterface.log(msg:str) -> log message to loggers (stdout)

ModuleInterface.log(key:str, value:float) -> log metric to metric loggers (azureml, tensorboard, stdout) and keep track of it in the pymarlin trainer, potentially for Early Stopping and Best checkpointing

Other sane defaults may include moving tensors to device automatically, base implementations of train/val steps, etc.

DDP validation: All gather for flattened 1D tensors taking long time to complete

Task = POS tagging

    def val_step(self, global_step: int, batch, device="cpu", encoder = None, encoder_kwargs={}):
        """
        Can return multiple outputs. First output need not be loss.
        """
        ...
        print(rels_predicted.shape)
        return label_loss, pointer_loss, rels_predicted, rels_labels

validation ptb_dep 3:: 0%| | 0/7 [00:00<?, ?it/s]torch.Size([1541])
torch.Size([1547])
torch.Size([1500])
torch.Size([1514])
torch.Size([1570])
torch.Size([1506])
torch.Size([1477])
torch.Size([1626])
validation ptb_dep 2:: 29%|█████████████████████████████████████████████████████████▏ | 2/7 [00:00<00:00, 30.67it/s]
gathering
validation ptb_dep 3:: 29%|█████████████████████████████████████████████████████████▏ | 2/7 [00:00<00:00, 29.47it/s]
gathering
validation ptb_dep 1:: 29%|█████████████████████████████████████████████████████████▏ | 2/7 [00:00<00:00, 28.46it/s]
gathering
validation ptb_dep 0:: 29%|█████████████████████████████████████████████████████████▏ | 2/7 [00:00<00:00, 27.57it/s]
gathering
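
One common pattern for gathering ragged 1D tensors like these is to exchange lengths first, pad everything to the max length, and trim after the gather. This is a sketch (the helper name is illustrative, and it may or may not address the timing issue reported here; it assumes an initialized process group):

```python
import torch
import torch.distributed as dist

def all_gather_1d(tensor):
    """Gather variable-length 1D tensors: exchange lengths, pad, unpad."""
    world = dist.get_world_size()
    length = torch.tensor([tensor.numel()])
    lengths = [torch.zeros_like(length) for _ in range(world)]
    dist.all_gather(lengths, length)          # learn every rank's length
    max_len = int(max(int(l) for l in lengths))
    padded = torch.zeros(max_len, dtype=tensor.dtype)
    padded[: tensor.numel()] = tensor         # pad to a common shape
    gathered = [torch.zeros_like(padded) for _ in range(world)]
    dist.all_gather(gathered, padded)
    return [t[: int(l)] for t, l in zip(gathered, lengths)]
```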

Running validation more often than once per epoch.

This is a feature request to be able to run validation/checkpointing more often than once per epoch. For very large datasets, it feels unreasonable to only run validation once per epoch, especially if an epoch takes a couple of hours to complete. Running validation more often would be useful, at least for faster feedback when tuning parameters.

I was able to work around this with a hack in our usage of PyMarlin: we set max_steps_per_epoch to the desired validation frequency. However, this requires modifying the input dataset to track where it currently is in the actual epoch and adjusting the number of epochs supplied to the trainer to account for these "logging epochs". It also causes PyMarlin to report the actual number of training epochs inaccurately.

Overall, the request is either to integrate this hack into PyMarlin's logic for a better user experience, or to implement more frequent validation/checkpointing through a different mechanism. I am more than happy to supply the code for the hack.
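
The requested behavior could be sketched as a step-based schedule inside the training loop (all names are illustrative, not existing PyMarlin API):

```python
def training_loop(batches, train_step, validate, val_interval_steps):
    """Run validation every val_interval_steps instead of once per epoch."""
    global_step = 0
    for batch in batches:
        train_step(batch)
        global_step += 1
        if global_step % val_interval_steps == 0:
            validate(global_step)   # could also checkpoint here
```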

Dockerfile and CI pipeline for Docker image

@ashwinsr01 requested to know how to create custom docker images for pymarlin and I don't believe we've shared any in this repo. I will create a PR with a simple Dockerfile that starts from one of the AzureML base images and installs pymarlin[plugins] and potential backends we want to install (apex, opacus, ORT, DeepSpeed)

PyTorch version

What PyTorch version does the latest PyMarlin release support? Since PyTorch has a fairly frequent release cycle with plenty of breaking changes, it would be good to specify the lower and upper bounds of the PyTorch version the PyMarlin library can support.
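
For example, bounds could be declared in setup.py; the version numbers below are purely illustrative, not PyMarlin's actual support matrix:

```python
# setup.py fragment (illustrative bounds only):
install_requires = [
    "torch>=1.8.1,<1.11",
]
```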

user defined save_best_checkpoint() hook

It should be a feature of our library to keep the "best" checkpoint based on a tracked metric, such as the minimum validation loss.

This could be a semi-big change because:

  • we do not have any metric tracking
  • we do not have any conditional checkpointing implemented, though Eduardo made checkpointing generic enough to support it.
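
Once a metric is tracked, conditional best-checkpointing could look like this minimal sketch (all names are illustrative, not existing PyMarlin API):

```python
class BestCheckpointer:
    """Save a checkpoint only when the tracked metric improves."""
    def __init__(self, mode="min"):
        self.mode = mode
        self.best = float("inf") if mode == "min" else float("-inf")

    def maybe_save(self, value, save_fn):
        improved = value < self.best if self.mode == "min" else value > self.best
        if improved:
            self.best = value
            save_fn()     # e.g. the user-defined save_best_checkpoint() hook
        return improved
```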

CustomArgParser doesn't support the yaml_file_arg_key functionality

I believe the intention of the yaml_file_arg_key argument to CustomArgParser is to let the end user choose the command-line name for providing their config file. However, the value provided to yaml_file_arg_key is currently not used; the config param name is just assumed to be config_path when fetching and parsing the config in Line 83.

self._parse_config(args.config_path)
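
The intended behavior could be sketched with plain argparse as follows (parse_config_path and its arguments are illustrative, not the CustomArgParser internals):

```python
import argparse

def parse_config_path(argv, yaml_file_arg_key="config_path"):
    """Resolve the config file path from a user-chosen argument name."""
    parser = argparse.ArgumentParser()
    parser.add_argument(f"--{yaml_file_arg_key}")
    args, _ = parser.parse_known_args(argv)
    # Look up by the user-supplied key instead of hard-coding config_path.
    return getattr(args, yaml_file_arg_key)
```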

tb_log_hist_steps is misplaced in the Tensorboard writer

tb_log_hist_steps seems to be misplaced in the Tensorboard writer. A good design principle here, in my opinion, is to let the writer simply write out the logs and to control the write frequency through the stats and trainer/trainer_backend classes. This is already done for other types of logging and should be applied consistently to writing out histograms through tensorboard as well.

if step % self.args.tb_log_hist_steps == 0:
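
The suggested separation could be sketched as follows (class and function names are illustrative):

```python
class HistWriter:
    """Writer with no frequency logic: it writes whenever it is called."""
    def __init__(self):
        self.calls = []

    def log_histogram(self, name, values):
        self.calls.append(name)   # stand-in for SummaryWriter.add_histogram

def maybe_log_hist(writer, step, interval, name, values):
    """Frequency is decided by the trainer, not the writer."""
    if step % interval == 0:
        writer.log_histogram(name, values)
```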

refactor to units of steps instead of epochs

Many models train and perform actions (like checkpointing and evaluation) in units of steps instead of epochs. We should probably incorporate this, and would most likely need to convert epochs to steps depending on what the user requests.
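
The epoch-to-step conversion could look like this sketch (names and the gradient-accumulation handling are illustrative):

```python
import math

def epochs_to_steps(num_epochs, dataset_size, batch_size, grad_accum=1):
    """Convert a requested number of epochs into optimizer steps."""
    batches_per_epoch = math.ceil(dataset_size / batch_size)
    steps_per_epoch = math.ceil(batches_per_epoch / grad_accum)
    return num_epochs * steps_per_epoch
```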

Custom Argparser doesn't support multi-level commandline override

The custom argparser throws an exception when a multi-level nested config.yaml file is provided and some of the nested parameters are overridden through the command line.

I believe this is because of the hard-coded assumption of single-level nesting in the custom argparser. Some examples:

self._config[arglist[0].strip('-')][arglist[1].strip('-')] = arg_dict[arg]

yaml_arg_value = self._config[arglist[0].strip('-')][arglist[1]]
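
A nesting-agnostic override could be sketched like this (the function name is illustrative), walking the split key path instead of indexing exactly two levels:

```python
def set_nested(config, keys, value):
    """Apply an override to an arbitrarily nested config dict."""
    node = config
    for key in keys[:-1]:
        node = node.setdefault(key, {})   # descend, creating levels as needed
    node[keys[-1]] = value
```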

Example of multi-level nesting: (screenshot omitted)
