Week 5 home assignment

The assignment for this week consists of four parts: the first three are obligatory, and the fourth is a bonus one. Include all the files with implemented functions/classes and the report for Tasks 2 (and 4) in your submission.

Task 1 (1 point)

Implement the function for deterministic sequential printing of N numbers for N processes, using sequential_print.py as a template. You should be able to test it with torchrun --nproc_per_node N sequential_print.py. Pay attention to the output format!
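
A minimal sketch of one possible approach, assuming the process group is initialized from the environment variables set by torchrun (the exact output format is defined by the sequential_print.py template, not by this snippet):

```python
import torch.distributed as dist

def run_sequential_print():
    # torchrun sets RANK, WORLD_SIZE and the rendezvous variables for env:// init
    dist.init_process_group(backend="gloo")
    rank, world_size = dist.get_rank(), dist.get_world_size()
    for turn in range(world_size):
        if rank == turn:
            # flush so the line reaches stdout before the other ranks are released
            print(f"Process {rank}", flush=True)
        dist.barrier()  # everyone waits until the current rank has printed
    dist.destroy_process_group()

if __name__ == "__main__":
    run_sequential_print()
```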

Task 2 (7 points)

The pipeline you saw in the seminar shows only the basic building blocks of distributed training. Now, let's train something actually interesting!

For this task, let's take the CIFAR-100 dataset and train a model with synchronized Batch Normalization: we aggregate the statistics across all workers during each forward pass.

Importantly, you have to call a communication primitive only once during each forward or backward pass; if you use it more than once, you will only earn up to 4 points for this task. Additionally, you are not allowed to use internal PyTorch functions that compute the backward pass of batch normalization: please implement it manually.
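
For the forward pass, one way to satisfy the single-primitive constraint is to pack the per-worker sum and sum of squares (plus the local batch size) into one tensor and aggregate it with a single All-Reduce. A minimal sketch of that idea, with illustrative names rather than the syncbn.py template's actual API:

```python
import torch
import torch.distributed as dist

def syncbn_forward_stats(x: torch.Tensor, eps: float = 1e-5):
    # x: (batch, features); pack sum, sum of squares and batch size into ONE tensor
    num_features = x.shape[1]
    batch = torch.tensor([float(x.shape[0])], dtype=x.dtype, device=x.device)
    stats = torch.cat([x.sum(dim=0), (x * x).sum(dim=0), batch])
    dist.all_reduce(stats, op=dist.ReduceOp.SUM)  # the only communication in the forward pass
    total = stats[-1]
    mean = stats[:num_features] / total
    var = stats[num_features : 2 * num_features] / total - mean * mean
    x_hat = (x - mean) / torch.sqrt(var + eps)
    # the backward pass must be written by hand (no PyTorch BN internals) and can
    # likewise pack the gradient sums it needs into one tensor for a single All-Reduce
    return x_hat, mean, var
```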

  • Also, implement a version of distributed training which is aware of gradient accumulation: for each batch that doesn't run optimizer.step, you can avoid the All-Reduce step for gradients altogether.
  • Compare the performance (in terms of speed, memory footprint, and final quality) of your distributed training pipeline with the one that uses primitives from PyTorch (i.e., torch.nn.parallel.DistributedDataParallel and torch.nn.SyncBatchNorm); a sketch of this reference setup is given after this list. You need to compare the implementations by training with at least two processes.
  • In addition, test the SyncBN layer itself by comparing its results with standard BatchNorm1d while changing the number of workers (use at least 1 and 4), the size of activations (128, 256, 512, 1024), and the batch size (32, 64). Compare the results of forward/backward passes with FP32 inputs drawn from the standard Gaussian distribution and a loss function that simply sums the outputs over the first N/2 samples (N is the total batch size): a working implementation should pass the comparison with reasonably tight tolerances (rtol and atol of at least 1e-3).
  • Finally, measure the GPU time (with 2+ workers) and the memory footprint of the standard SyncBatchNorm and of your implementation in the same setup as above.
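
For the comparison with the built-in primitives, the reference pipeline can be assembled roughly as follows (a sketch assuming an already-initialized process group; the model, batches, and hyperparameters are placeholders for the ddp_cifar100.py setup). DDP's no_sync() context mirrors the accumulation-aware behaviour from the first bullet:

```python
import contextlib
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

def build_reference_model(model: torch.nn.Module, device: torch.device):
    # replace every BatchNorm layer with the built-in synchronized version
    model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model).to(device)
    device_ids = [device.index] if device.type == "cuda" else None
    return DDP(model, device_ids=device_ids)  # DDP all-reduces gradients during backward

def train_with_accumulation(model, batches, optimizer, criterion, accum_steps):
    optimizer.zero_grad()
    for i, (x, y) in enumerate(batches):
        is_step = (i + 1) % accum_steps == 0
        # skip gradient synchronization on the iterations that do not call optimizer.step
        ctx = contextlib.nullcontext() if is_step else model.no_sync()
        with ctx:
            loss = criterion(model(x), y) / accum_steps
            loss.backward()
        if is_step:
            optimizer.step()
            optimizer.zero_grad()
```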

Provide the results of your experiments in an .ipynb/.pdf report and attach it to your code when submitting the homework. Your report should include a brief description of the experimental setup (if changed) and of the infrastructure (PyTorch version, number of processes, type of GPUs, etc.).

Use syncbn.py and ddp_cifar100.py as templates.

Task 3 (2 points)

Until now, we only aggregated the gradients across different workers during training. But what if we want to run distributed validation on a large dataset as well? In this assignment, you have to implement distributed metric aggregation: shard the dataset across different workers (with scatter), compute the accuracy for each subset on its respective worker, and then average the metric values on the master process.
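
One possible shape of this logic (a sketch assuming equal-sized shards, an already-initialized process group, and a backend that supports scatter; the function names are illustrative):

```python
import torch
import torch.distributed as dist

def scatter_val_indices(dataset_size: int, device) -> torch.Tensor:
    # rank 0 splits the index range into equal shards and scatters one to every worker
    rank, world_size = dist.get_rank(), dist.get_world_size()
    shard_len = dataset_size // world_size
    shard = torch.empty(shard_len, dtype=torch.int64, device=device)
    shards = None
    if rank == 0:
        indices = torch.randperm(dataset_size, device=device)[: shard_len * world_size]
        shards = list(indices.chunk(world_size))
    dist.scatter(shard, scatter_list=shards, src=0)
    return shard  # indices of the validation subset this worker evaluates

def aggregate_accuracy(local_correct: int, local_total: int, device):
    # each worker computes accuracy on its shard; rank 0 averages the per-worker values
    acc = torch.tensor([local_correct / local_total], dtype=torch.float64, device=device)
    dist.reduce(acc, dst=0, op=dist.ReduceOp.SUM)
    if dist.get_rank() == 0:
        return (acc / dist.get_world_size()).item()
    return None
```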

Also, make one more quality-of-life improvement to the pipeline by logging the loss (and accuracy!) only from the rank-0 process, to avoid flooding the standard output of your training command. Submit training code that includes all the enhancements from Tasks 2 and 3.

Task 4 (bonus, 3 points)

Using allreduce.py as a template, implement the Ring All-Reduce algorithm using only point-to-point communication primitives from torch.distributed. Compare it with the provided implementation of Butterfly All-Reduce, as well as with torch.distributed.all_reduce in terms of CPU speed, memory usage and the accuracy of averaging. Specifically, compare custom implementations of All-Reduce with 1–32 workers and compare your implementation of Ring All-Reduce with torch.distributed.all_reduce on 1–16 processes and vectors of 1,000–100,000 elements.
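
A minimal sketch of the ring structure (averaging), assuming the vector length is divisible by the number of processes; the actual interface and the measurement code are defined by allreduce.py:

```python
import torch
import torch.distributed as dist

def ring_all_reduce(tensor: torch.Tensor) -> torch.Tensor:
    # averages `tensor` over all processes using only point-to-point communication
    rank, world = dist.get_rank(), dist.get_world_size()
    chunks = [c.clone() for c in tensor.chunk(world)]
    right, left = (rank + 1) % world, (rank - 1) % world
    recv_buf = torch.empty_like(chunks[0])

    # phase 1: reduce-scatter; after world-1 steps rank r holds the full sum of chunk (r+1) % world
    for step in range(world - 1):
        send_idx, recv_idx = (rank - step) % world, (rank - step - 1) % world
        req = dist.isend(chunks[send_idx], dst=right)  # non-blocking send avoids a ring deadlock
        dist.recv(recv_buf, src=left)
        req.wait()
        chunks[recv_idx] += recv_buf

    # phase 2: all-gather; circulate the reduced chunks around the ring
    for step in range(world - 1):
        send_idx, recv_idx = (rank - step + 1) % world, (rank - step) % world
        req = dist.isend(chunks[send_idx], dst=right)
        dist.recv(recv_buf, src=left)
        req.wait()
        chunks[recv_idx].copy_(recv_buf)

    return torch.cat(chunks) / world  # divide by world size, since the task compares averaging
```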
