DeepBooster: Colab-driven multi-GPU training of large Neural Nets

Authors: Maksim Eremeev (@maks5507, [email protected]), Mikhail Khramenkov (@0x2500)

Installation

python setup.py build
pip install .

Usage:

Below is a sample script that can be run in multiple Colab notebooks to test the system. Make sure each instance uses its own my_queue and ignition_queue.

! git clone -b dev https://github.com/maks5507/amqp-interface.git; cd amqp-interface; python3 setup.py build; pip install .
  
! git clone https://github.com/maks5507/deepbooster.git; cd deepbooster; python3 setup.py build; pip install .
  
import torch
import torch.nn as nn

class Test(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(10, 1000),
            nn.Tanh(),
            nn.Linear(1000, 1000),
            nn.Tanh(),
            nn.Linear(1000, 1000),
            nn.Tanh(),
            nn.Linear(1000, 1),
            nn.Tanh()
        )

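    def forward(self, x):
        # Artificial 10-second delay on every forward pass
        # (presumably to emulate a heavy model while testing the system).
        import time
        time.sleep(10)
        return self.layers(x)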
      
import random
import json


path = '.'
# Write a small chunk of 5 random samples, one JSON object per line.
with open(f'{path}/chunk1.jsonl', 'w') as f:
    for _ in range(5):
        curr = json.dumps({'input': torch.rand((1000, 10)).tolist(), 'label': random.randint(0, 4)})
        f.write(f'{curr}\n')
        

import torch.nn.functional as F
import deepbooster

model = Test()

trainer_params = {
    'model': model,
    'criterion': nn.MSELoss(),
    'optimizer': torch.optim.Adam(model.parameters(), lr=0.01),

    'device': torch.device('cuda'),
    'n_epoch': 1000,

    'ignition_queue': 'start_1',
    'my_queue': 'test1',
    'trainers_queues': ['test1', 'test2'],

    'url_parameters': 'amqp://user:[email protected]:5672',

    'chunk_path': './chunk1.jsonl',
    'transformer': torch.Tensor,
    'apply_transformer_to_label': True
}

trainer = deepbooster.Trainer(**trainer_params)

trainer.start()

The process blocks until a message is posted to the ignition queue of your instance.
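One way to post that message is with the pika client. The sketch below is not part of deepbooster and rests on several assumptions: the ignition queue is bound to amq.topic with a binding key equal to the queue name, the message body is arbitrary (b'start' is a placeholder), and the broker URL is supplied by you.

```python
def ignition_message(ignition_queue, body=b'start'):
    """Build the publish arguments for one instance's ignition queue.

    Assumption: the queue is bound to amq.topic with its own name
    as the binding key, so the routing key is the queue name.
    """
    return {'exchange': 'amq.topic', 'routing_key': ignition_queue, 'body': body}


def ignite(url, ignition_queues):
    """Publish one message to every ignition queue to unblock the trainers."""
    import pika  # imported lazily; install with `pip install pika`

    connection = pika.BlockingConnection(pika.URLParameters(url))
    try:
        channel = connection.channel()
        for queue in ignition_queues:
            channel.basic_publish(**ignition_message(queue))
    finally:
        connection.close()
```

For the sample script above, a call such as ignite(url, ['start_1']) would release the first instance, where url is the same AMQP URL passed as url_parameters.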

Message Queue

RabbitMQ (RMQ) is used to synchronize the workers. Because the computed gradients pass through the queue on every step, make sure the broker's RAM and disk do not overflow. The RMQ bandwidth is sufficient to carry the gradient of a network with more than 1.4B parameters.
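As a rough sizing aid, the per-step payload can be estimated from the parameter count. The sketch below assumes uncompressed float32 gradients (4 bytes per parameter) and ignores serialization overhead; both are assumptions for illustration, not documented deepbooster behavior.

```python
def linear_param_count(in_features, out_features):
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features


def gradient_bytes(n_params, bytes_per_param=4):
    """Size of one full gradient, assuming float32 and no compression."""
    return n_params * bytes_per_param


# Layer shapes of the Test model from the usage example.
test_layers = [(10, 1000), (1000, 1000), (1000, 1000), (1000, 1)]
test_params = sum(linear_param_count(i, o) for i, o in test_layers)

print(test_params)                          # 2014001
print(gradient_bytes(test_params) / 1e6)    # about 8.06 MB per step
print(gradient_bytes(1_400_000_000) / 1e9)  # 5.6 GB per step
```

Under these assumptions, a 1.4B-parameter network puts about 5.6 GB through the queue per worker per step, which is why broker memory and disk need watching.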

Because the queue must be reachable from any Colab instance, we currently use a public server with ports 5672 and 15672 open: 35.222.31.138.

RabbitMQ needs to be set up as follows (convenience scripts to follow):

  • Create an ignition queue and sync queue for each instance you want to launch
  • Bind both queues to the amq.topic exchange
  • Purge both ignition and sync queues before running a new training procedure
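The three steps above can be sketched with the pika client. This is an illustrative sketch rather than the project's own tooling: the function names are hypothetical, each queue is assumed to be bound with its own name as the binding key, and durable=True is an assumption that must match any queue that already exists on the broker.

```python
def setup_plan(instances):
    """List the broker operations for (ignition_queue, sync_queue) pairs."""
    plan = []
    for ignition_queue, sync_queue in instances:
        for queue in (ignition_queue, sync_queue):
            plan += [('declare', queue), ('bind', queue), ('purge', queue)]
    return plan


def apply_plan(url, plan):
    """Execute the planned operations against the broker at `url`."""
    import pika  # imported lazily; install with `pip install pika`

    connection = pika.BlockingConnection(pika.URLParameters(url))
    try:
        channel = connection.channel()
        for op, queue in plan:
            if op == 'declare':
                channel.queue_declare(queue=queue, durable=True)
            elif op == 'bind':
                # Assumed binding key: the queue's own name.
                channel.queue_bind(queue=queue, exchange='amq.topic', routing_key=queue)
            elif op == 'purge':
                channel.queue_purge(queue=queue)
    finally:
        connection.close()
```

For the sample script, setup_plan([('start_1', 'test1')]) covers the first instance; add one pair per additional notebook and pass the result to apply_plan together with your broker URL.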

Codestyle check

Before opening a pull request, check the coding style with the bash script in the codestyle directory. Make sure your folder is listed in codestyle/pycodestyle_files.txt.

Your changes will not be approved if the script reports any violations (this does not apply to third-party code).

Usage:

cd codestyle
sh check_code_style.sh
