ibm / mi-prometheus

Enabling reproducible Machine Learning research

Home Page: http://mi-prometheus.rtfd.io/

License: Apache License 2.0

Language: Python 100.00%

Topics: mi-prometheus, pytorch, machine-learning, model, problem, worker, grid-worker

mi-prometheus's People

Contributors: aasseman, cclauss, cshivade, dependabot[bot], imgbotapp, kant, sesevgen, tkornuta-ibm, tsjayram, vincentalbouy, vmarois, younesbouhadjar

mi-prometheus's Issues

Enhancement of the documentation

There are still a few things to do to further enhance the documentation:

  • Explain the contents of the configuration files: the sections (mandatory vs. optional), the parameters (mandatory vs. optional), etc.

  • Add the expected attributes of each problem class. So far, they all indicate params, which is not very explicit. This should be a team effort 🙂

  • Add pictures where relevant: for instance, in the algorithmic problems, to illustrate the inputs & outputs.

  • Link class mentions so that they redirect to their documentation. For instance:

    :param data_dict: DataDict, as created by the Problem class.
    :type data_dict: :py:class:`miprometheus.utils.DataDict` # <- this should be a hyperlink pointing to the doc page of DataDict
    
  • Link the external classes (PyTorch etc.) to their documentation (inter-doc linking); see the sketch below.
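
For the last two points, a minimal sketch of what could go into the Sphinx conf.py (the extension list and inventory URLs here are assumptions, not the project's actual configuration):

# docs conf.py (sketch) -- cross-reference support for internal and external classes.
extensions = [
    'sphinx.ext.autodoc',       # pulls in the docstrings
    'sphinx.ext.intersphinx',   # resolves references to external documentation
]

# With intersphinx enabled, :py:class:`torch.Tensor` or :py:class:`miprometheus.utils.DataDict`
# in a docstring becomes a hyperlink to the corresponding documentation page.
intersphinx_mapping = {
    'python': ('https://docs.python.org/3', None),
    'torch': ('https://pytorch.org/docs/stable/', None),
}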

Investigate overhead of mi-prometheus

A slow-down has been noticed in some experiments after the 0.2.0 release:

  • A run of MAC on CLEVR for 20 epochs takes more than 3 days, whereas it previously took 26 h. This could be due to the I/O operations for loading the image files on-the-fly in __getitem__ (see the sketch below).
  • Exhausting the MNIST or CIFAR10 datasets seems slower (according to @tkornut), although I was not able to reproduce what he reported.
  • More generally, I feel we could reduce the overall overhead/slowness of mip. A simple example: mip-online-trainer --h takes a good second to print the options, but I am guessing it could be faster.
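
As a starting point for profiling the first item, a sketch of memoizing decoded images so that __getitem__ does not hit the disk on every call (this is a hypothetical dataset class, not the actual CLEVR problem, and RAM usage obviously needs to be checked for a dataset of that size):

import os
from PIL import Image
from torch.utils.data import Dataset

class CachedImageDataset(Dataset):
    """Hypothetical dataset that decodes each image once and keeps it in RAM."""

    def __init__(self, image_dir, filenames):
        self.image_dir = image_dir
        self.filenames = filenames
        self._cache = {}  # index -> decoded PIL.Image, filled lazily

    def __len__(self):
        return len(self.filenames)

    def __getitem__(self, index):
        if index not in self._cache:
            path = os.path.join(self.image_dir, self.filenames[index])
            # First access: read and decode from disk.
            self._cache[index] = Image.open(path).convert('RGB')
        # Subsequent epochs are served from memory.
        return self._cache[index]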

Check the presence of CUDA-compatible devices in the GPU grid workers

Currently, they just crash and output:

Traceback (most recent call last):
  File "/home/tkornuta/anaconda3/bin/mip-gridtester-gpu", line 11, in <module>
    load_entry_point('miprometheus==0.2.0', 'console_scripts', 'mip-gridtester-gpu')()
  File "/home/tkornuta/pytorch-env/mi-prometheus/miprometheus/workers/grid_tester_gpu.py", line 118, in main
    grid_tester_gpu.run_grid_experiment()
  File "/home/tkornuta/pytorch-env/mi-prometheus/miprometheus/workers/grid_tester_gpu.py", line 87, in run_grid_experiment
    with ThreadPool(processes=torch.cuda.device_count()) as pool:
  File "/home/tkornuta/anaconda3/lib/python3.6/multiprocessing/pool.py", line 789, in __init__
    Pool.__init__(self, processes, initializer, initargs)
  File "/home/tkornuta/anaconda3/lib/python3.6/multiprocessing/pool.py", line 167, in __init__
    raise ValueError("Number of processes must be at least 1")
ValueError: Number of processes must be at least 1
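
A minimal sketch of the kind of guard that could be added before the pool is created, so the worker fails with an explicit message instead of the ValueError above (the message text is an assumption):

import sys
import torch
from multiprocessing.pool import ThreadPool

num_devices = torch.cuda.device_count()
if num_devices == 0:
    # Fail early and explicitly instead of letting Pool.__init__ raise.
    sys.exit("No CUDA-compatible devices detected - cannot run the GPU grid worker.")

with ThreadPool(processes=num_devices) as pool:
    pass  # run_grid_experiment() would dispatch the experiments here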

Investigate the content of statistics, rethink the desired behavior of online/offline trainers

When running the grid trainer with "weird" terminal conditions:
# Set the same terminal conditions.
terminal_conditions:
    loss_stop: 1.0e-5
    episode_limit: 100
    epochs_limit: 10

I received even weirder results, i.e.:

  • content of training_statistics.csv:
loss episode epoch acc batch_size
2.3023982 0 0 0.0625 64
2.42056847 1 0 0.09375 64
2.31480336 2 0 0.015625 64
2.30580044 3 0 0.125 64
2.28870821 4 0 0.109375 64
2.29206491 5 0 0.140625 64
2.29954505 6 0 0.109375 64
2.31241441 7 0 0.0625 64
2.2548635 8 0 0.1875 64
2.27464652 9 0 0.078125 64
2.24816489 10 0 0.171875 64
2.22160149 11 0 0.140625 64
2.14928031 12 0 0.171875 64
2.01351118 13 0 0.125 64
1.9892391 14 0 0.203125 64
1.78259158 15 0 0.375 64
1.76220071 16 0 0.40625 64
1.59635079 17 0 0.5625 64
1.24760115 18 0 0.59375 64
1.52464759 19 0 0.578125 64
1.02153337 20 0 0.6875 64
1.13793755 21 0 0.65625 64
1.04477477 22 0 0.703125 64
0.9281581 23 0 0.65625 64
0.8872776 24 0 0.75 64
0.96978921 25 0 0.71875 64
0.54203308 26 0 0.84375 64
0.71089166 27 0 0.828125 64
0.87194252 28 0 0.78125 64
0.74255377 29 0 0.796875 64
0.97673011 30 0 0.71875 64
0.89663988 31 0 0.671875 64
0.52516961 32 0 0.859375 64
0.67374974 33 0 0.6875 64
1.07340407 34 0 0.640625 64
0.65307564 35 0 0.71875 64
0.62181377 36 0 0.796875 64
0.53182065 37 0 0.84375 64
0.71876705 38 0 0.8125 64
0.69108903 39 0 0.8125 64
0.64931148 40 0 0.84375 64
0.64401007 41 0 0.765625 64
0.53108335 42 0 0.78125 64
0.47211882 43 0 0.8125 64
0.42510599 44 0 0.859375 64
0.53872186 45 0 0.78125 64
0.47975114 46 0 0.875 64
0.42682296 47 0 0.84375 64
0.68501329 48 0 0.84375 64
0.51802105 49 0 0.84375 64
0.42391279 50 0 0.890625 64
0.54398292 51 0 0.8125 64
0.43966454 52 0 0.828125 64
0.41222006 53 0 0.859375 64
0.30380589 54 0 0.890625 64
0.28526509 55 0 0.890625 64
0.3890301 56 0 0.890625 64
0.3726145 57 0 0.859375 64
0.37899232 58 0 0.84375 64
0.31090876 59 0 0.90625 64
0.29964575 60 0 0.890625 64
0.29754484 61 0 0.875 64
0.30940181 62 0 0.90625 64
0.28904665 63 0 0.90625 64
0.28892154 64 0 0.9375 64
0.28293437 65 0 0.890625 64
0.28884795 66 0 0.9375 64
0.27016857 67 0 0.890625 64
0.38757008 68 0 0.921875 64
0.24764507 69 0 0.921875 64
0.25606325 70 0 0.890625 64
0.48922622 71 0 0.84375 64
0.2977196 72 0 0.890625 64
0.3917419 73 0 0.921875 64
0.19252293 74 0 0.9375 64
0.39461273 75 0 0.875 64
0.28725618 76 0 0.859375 64
0.24857962 77 0 0.921875 64
0.22327447 78 0 0.9375 64
0.41391894 79 0 0.859375 64
0.19850856 80 0 0.921875 64
0.30375871 81 0 0.890625 64
0.38144702 82 0 0.890625 64
0.29862314 83 0 0.921875 64
0.16170724 84 0 0.953125 64
0.25888351 85 0 0.953125 64
0.17384183 86 0 0.953125 64
0.24882084 87 0 0.953125 64
0.20304871 88 0 0.921875 64
0.354817 89 0 0.9375 64
0.12355755 90 0 0.96875 64
0.20728019 91 0 0.921875 64
0.17258625 92 0 0.921875 64
0.16974132 93 0 0.953125 64
0.37275705 94 0 0.90625 64
0.09402215 95 0 0.96875 64
0.27992848 96 0 0.90625 64
0.13900934 97 0 0.953125 64
0.27177253 98 0 0.921875 64
0.15787081 99 0 0.921875 64
0.40764943 99 1 0.890625 64
0.12967248 99 2 0.9375 64
0.16256529 99 3 0.953125 64
0.08198662 99 4 0.96875 64
0.12792362 99 5 0.96875 64
0.1427121 99 6 0.953125 64
0.19214444 99 7 0.953125 64
0.26682153 99 8 0.890625 64
0.11921781 99 9 0.9375 64
  • content of training_set_agg_statistics.csv:
episode episodes_aggregated loss loss_min loss_max loss_std epoch acc acc_min acc_max acc_std samples_aggregated
99 100 0.77779132 0.09402215 2.42056847 0.70779961 0 0.72874999 0.015625 0.96875 0.27987459 6400
99 1 0.40764943 0.40764943 0.40764943 0 1 0.890625 0.890625 0.890625 0 64
99 1 0.12967248 0.12967248 0.12967248 0 2 0.9375 0.9375 0.9375 0 64
99 1 0.16256529 0.16256529 0.16256529 0 3 0.953125 0.953125 0.953125 0 64
99 1 0.08198662 0.08198662 0.08198662 0 4 0.96875 0.96875 0.96875 0 64
99 1 0.12792362 0.12792362 0.12792362 0 5 0.96875 0.96875 0.96875 0 64
99 1 0.1427121 0.1427121 0.1427121 0 6 0.953125 0.953125 0.953125 0 64
99 1 0.19214444 0.19214444 0.19214444 0 7 0.953125 0.953125 0.953125 0 64
99 1 0.26682153 0.26682153 0.26682153 0 8 0.890625 0.890625 0.890625 0 64
99 1 0.11921781 0.11921781 0.11921781 0 9 0.9375 0.9375 0.9375 0 64
  • content of validation_statistics.csv:
    loss episode epoch acc batch_size
    2.378291607 0 0 0.125 64

  • content of validation_set_agg_statistics.csv:

episode episodes_aggregated loss loss_min loss_max loss_std epoch acc acc_min acc_max acc_std samples_aggregated
99 79 0.18435289 0.00570344 0.51182282 0.1067069 0 0.9398734 0.859375 1 0.03431381 5000
99 79 0.15693755 0.03428814 0.49671429 0.08907631 1 0.94996047 0.84375 1 0.0299593 5000
99 79 0.13773099 0.01599673 0.4856357 0.07974177 2 0.95925635 0.890625 1 0.02568884 5000
99 79 0.1349417 0.02542676 0.38587141 0.07309812 3 0.96004748 0.875 1 0.02536133 5000
99 79 0.13557728 0.02810043 0.51120263 0.07920972 4 0.95866299 0.890625 1 0.02552018 5000
99 79 0.14610203 0.03679127 0.45367861 0.08219316 5 0.9535206 0.875 1 0.02991695 5000
99 79 0.14048618 0.04181131 0.55806714 0.08375487 6 0.95530063 0.890625 1 0.02448055 5000
99 79 0.13374574 0.02522391 0.46379766 0.08368348 7 0.95787185 0.875 1 0.02812438 5000
99 79 0.13794875 0.02302515 0.46463043 0.07694951 8 0.9541139 0.875 1 0.0276742 5000
99 79 0.14701881 0.00304925 0.52258015 0.09639523 9 0.95391613 0.890625 1 0.02705158 5000
99 79 0.14747529 0.02490817 0.49389082 0.08180903 9 0.95391613 0.875 1 0.02829573 5000

Numpy > 1.11.0 causes errors due to numpy.float64 not being automatically converted to Int

Describe the bug
With later versions of numpy, numpy floats no longer get automatically converted to ints when needed. Thus, PyTorch network configurations that expect ints throw errors.

To Reproduce
Steps to reproduce the behavior:
Run the following with a later version of numpy installed:
$ mip-offline-trainer --c mi-prometheus/configs/vision/simplecnn_mnist.yaml

Expected behavior
Network definitions proceed without errors

Desktop (please complete the following information):

  • OS: Ubuntu
  • OS version 18.04
  • Python version 3.5.2
  • PyTorch version 0.4.1
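
A minimal reproduction of the failure mode outside of mi-prometheus, with the obvious defensive cast (the layer and value are hypothetical; whether the uncast call raises depends on the numpy/PyTorch combination):

import numpy as np
import torch.nn as nn

hidden_size = np.float64(10)        # e.g. a value that ends up as a numpy float after config parsing

# nn.Linear(4, hidden_size)         # may raise TypeError with numpy > 1.11.0
layer = nn.Linear(4, int(hidden_size))  # explicit cast keeps the network definition working
print(layer)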

Change --o to --e

I have analyzed all the workers; in my opinion there is no conflict with --e(xperiments_dir).

This is one of the few flags shared across all workers and grid workers, and "output dir" is misleading, as for most of them this is also the input dir...

@vmarois what do you think about that?

This issue relates to #25

Clean up the flags of the different workers

The workers have several flags available, which are currently added via inheritance (e.g. Worker ⬅️ Trainer ⬅️ OfflineTrainer, etc.). This results in some flags that are not particularly useful for a specific worker, or that have a different meaning.

Some examples:

  • experiment_repetitions is a command-line flag for grid-tester-* but a config-file parameter for grid-trainer-*.
  • Not sure that savetag for grid-tester-* is particularly useful.

Refactor and simplify MNIST and CIFAR10 problems

Currently they are too "overcomplicated":

  1. contain hardcoded elements (e.g. resizing to exactly 224 x 224) which had to be removed in order to make those problems
  2. default_values have to be simplified to the image dimensions (w, h, c) + the number of classes (see the sketch below)
  3. naming: num_classes, num_channels
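
A sketch of what the simplified default_values could look like for MNIST, using the proposed names (the values are the standard MNIST dimensions; the exact dictionary keys are an assumption):

# Hypothetical simplified default_values for the MNIST problem.
default_values = {
    'width': 28,         # image width
    'height': 28,        # image height
    'num_channels': 1,   # grayscale
    'num_classes': 10,   # digits 0-9
}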

Store termination cause in model checkpoint

  • set the initial value of "termination_cause" to "[Not Converged]"
  • set termination_cause depending on the termination reason
  • pass termination_cause to model.save as an additional field

Thanks to this, the grid-analyzer will simply inform the user whether the model converged or not! 👍
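
A rough sketch of the intended flow, using a plain torch.save checkpoint dictionary (the actual Model.save API in mi-prometheus may differ):

import torch
import torch.nn as nn

model = nn.Linear(4, 2)                 # stand-in model for illustration
termination_cause = "[Not Converged]"   # initial value

converged = True                        # e.g. loss < loss_stop
if converged:
    termination_cause = "Loss converged"

# Store the cause alongside the weights so the grid-analyzer can report it later.
torch.save({'model_state_dict': model.state_dict(),
            'termination_cause': termination_cause},
           'model_best.pt')
print(torch.load('model_best.pt')['termination_cause'])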

Trainers should indicate if the loss criterion is met but curriculum learning is still going

The current implementation of curriculum learning forces, by default, that it must be finished even if convergence has already been reached (in terms of loss < threshold):

# If the 'must_finish' key is not present in the config, then it will be finished by default.
self.params['training']['curriculum_learning'].add_default_params({'must_finish': True})

While I think this is okay, it would be great if the trainer indicated that. For instance, if the loss threshold for a run of MAES on SerialRecall is set to 1e-2, MAES will most likely converge before the end of curriculum learning. In this case, a warning message would be great, along the lines of:

if not self.curric_done and converged:
    self.logger.warning('The model has converged but curriculum has been set with must_finish=True.')

Refactor VideoTextToClass problems

The goal is that the video stream should return 5D Tensors:
BATCH x SEQ_LEN x CHANNELS x WIDTH x HEIGHT

(copy of internal issue #221)
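
For illustration, the expected layout of such a batch (the shapes are arbitrary example values):

import torch

batch_size, seq_len, channels, width, height = 2, 16, 3, 64, 48

# BATCH x SEQ_LEN x CHANNELS x WIDTH x HEIGHT
video = torch.zeros(batch_size, seq_len, channels, width, height)
print(video.shape)  # torch.Size([2, 16, 3, 64, 48])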

Introduce mutex-based experiment configuration to Grid Workers GPU

Grid Trainers/Testers on GPU have a hardcoded sleep time (currently 3 s). This is motivated by the fact that cuda-gpupick picks a free GPU only by checking the contexts running on a given device.

The problem is that loading the configuration / configuring a given experiment might take longer than 3 seconds. This is the situation we faced when training multiple MAC/SMAC models on CLEVR/CoGenT.

For now, we have increased the sleep time to 60 seconds (closes #29).

Desired solution

  1. introduce a "configuration_in_progress" mutex to both basic and grid workers
  2. when a basic worker starts, it raises the "configuration_in_progress" mutex
  3. after spawning the process, the grid worker hangs on "configuration_in_progress"
  4. after the setup_configuration() method has finished, the given basic worker lowers the "configuration_in_progress" mutex, which frees the grid worker to proceed (and potentially spawn the next worker); see the sketch below
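
A minimal sketch of the idea, with a multiprocessing Event standing in for the proposed mutex and placeholder sleeps standing in for the real worker code:

import time
from multiprocessing import Process, Event

def basic_worker(configuration_done):
    # Stand-in for setup_configuration(), which may take longer than the old 3 s sleep.
    time.sleep(5)
    configuration_done.set()   # "lower the mutex": configuration is finished
    time.sleep(1)              # stand-in for the actual training / testing

if __name__ == '__main__':
    for _ in range(2):
        configuration_done = Event()
        worker = Process(target=basic_worker, args=(configuration_done,))
        worker.start()
        # The grid worker blocks here instead of sleeping for a fixed time,
        # and only then proceeds (and potentially spawns the next worker).
        configuration_done.wait()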

Rethink operation of grid-analyzer

Assuming we run:

  • a grid-trainer that results in a grid of trained models (let's say 6 experiments = 2 models x 3 problems)
  • then a grid-tester with two runs (with different random seeds, of course)

That results in 10 "experiments" that should potentially form the content of a single csv file - 12 rows?

The goal of this issue is to discuss what, how and when things should be copied, taking into account that:

  • we have the best_model.pt checkpoint, which indicates in which episode the model was created
  • we have two types of trainers, which save the model depending on the loss calculated on:
    • online-trainer: the partial validation set, activated every "partial_validation_interval" episodes
    • offline-trainer: the full validation set, calculated at the end of every epoch

Besides, the offline-trainer can optionally store partial validation statistics in a csv file, when one sets partial_validation_interval > 0 in the config's validation section.

As a result, different trainers will produce different statistics; how to deal with that diversity is the goal of this discussion/issue.

'ThalnetModule' object has no attribute 'logger'

Describe the bug
models/thalnet/thalnet_module.py is the source of the error, on line 131. Trying to log an error to the logger causes this.

To Reproduce
Steps to reproduce the behavior:
This happened as I was testing changes for #58 - I suspect that just feeding in incorrectly sized data, as described in the if statement on line 130, should trigger it.

Expected behavior
It should properly log to the logger.

Desktop (please complete the following information):

  • OS: Ubuntu
  • OS version 18.04
  • Python version 3.5
  • PyTorch version 0.4.1
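
A minimal illustration of the missing attribute and the obvious fix (this is a standalone sketch, not the actual ThalnetModule code):

import logging

class ModuleSketch:
    """Sketch: give the module its own logger so error paths can actually use it."""

    def __init__(self):
        self.logger = logging.getLogger(type(self).__name__)

    def check_input(self, input_ok):
        if not input_ok:
            # Without the attribute set in __init__, this line would raise
            # AttributeError: object has no attribute 'logger'.
            self.logger.error("Input does not match the expected dimensions.")

ModuleSketch().check_input(False)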

Clear up torchvision version

I had to use torchvision 0.2.0 to get the doc build to succeed for the first time, but it seems to be causing an error with Resize (which wasn't showing up with 0.2.1).

Investigate why two validation runs on the same model return slightly different statistics

Validation with a batch of size 1 works perfectly.

Validation with a batch of size 10 from time to time returns values that differ.

At first I thought the issue was related to the lack of weighted averaging when we are not dropping the last batch. Sadly, the issue remained even when dropping the last batch / limiting the size of the set to a single batch.

To Reproduce

mip-offline-trainer --c configs/vision/simplecnn_mnist.yaml

Validation problem section:

validation:
    problem:
        name: MNIST
        batch_size: 10
        use_train_data: True
        resize: [32, 32]
    sampler:
        name: SubsetRandomSampler
        indices: [55000, 55010]
    dataloader:
        drop_last: True

================================================================================
[2018-11-08 12:36:07] - INFO - OfflineTrainer >>> Validating over the entire validation set (10 samples in 1 episodes)
[2018-11-08 12:36:07] - INFO - OfflineTrainer >>> episode 000858; episodes_aggregated 000001; loss 0.0016851807; loss_min 0.0016851807; loss_max 0.0016851807; loss_std 0.0000000000; epoch 00; acc 1.0000000000; acc_min 1.0000000000; acc_max 1.0000000000; acc_std 0.0000000000; samples_aggregated 000010 [Full Validation]
[2018-11-08 12:36:07] - INFO - Model >>> Model and statistics exported to checkpoint ./experiments/MNIST/SimpleConvNet/20181108_123547/models/model_best.pt
[2018-11-08 12:36:07] - INFO - OfflineTrainer >>>

[2018-11-08 12:36:07] - INFO - OfflineTrainer >>> Training finished because Epoch limit reached
[2018-11-08 12:36:07] - INFO - OfflineTrainer >>> Validating over the entire validation set (10 samples in 1 episodes)
[2018-11-08 12:36:07] - INFO - OfflineTrainer >>> episode 000858; episodes_aggregated 000001; loss 0.0016851809; loss_min 0.0016851809; loss_max 0.0016851809; loss_std 0.0000000000; epoch 00; acc 1.0000000000; acc_min 1.0000000000; acc_max 1.0000000000; acc_std 0.0000000000; samples_aggregated 000010 [Full Validation]
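
For reference, the weighted-averaging idea mentioned above, sketched on plain numbers (the report notes this alone did not explain the discrepancy, but it is the first hypothesis that was ruled out):

# Hypothetical per-batch accuracies with a 25-sample set and batch_size=10 (last batch not dropped).
batch_accuracies = [0.9, 0.8, 1.0]
batch_sizes = [10, 10, 5]

# Naive mean over batches over-weights the smaller last batch:
naive_mean = sum(batch_accuracies) / len(batch_accuracies)                                    # 0.9
# Sample-weighted mean treats every sample equally:
weighted_mean = sum(a * n for a, n in zip(batch_accuracies, batch_sizes)) / sum(batch_sizes)  # 0.88

print(naive_mean, weighted_mean)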

Change the way --model is handled

As of now, the trainers have the option to load a pretrained model using the --model flag.
We haven't really used this feature so far, and the way it is implemented (as a flag) makes it hard to handle in the grid trainers, which currently do not handle it at all.

So I would like to handle this in the grid trainers, because it is useful to load a pretrained model for each experiment (e.g. to finetune a pre-trained model).
I'm thinking of either:

  • adding a corresponding --models flag to the grid trainers, which the user could then use to indicate the trained models they want to reuse; but this could be messy, as we would have to check whether all models are present, which experiments they are compatible with, etc.
  • moving --model in the trainers from being a flag to a config parameter that the user could specify in the config file. I know that this wouldn't be consistent with the tester, but I find this cleaner, and easier to handle in the grid trainers.

We can discuss that 🙂

Extract and add absolute path to nested config files

Describe the bug
Currently, all workers assume that they are executed from the mi-prometheus main directory.
Along with setup.py, we open up the possibility to execute the mip-* workers from any directory.

To Reproduce

tkornuta@tkornuta-MacBookPro:~/pytorch-env$ mip-onlinetrainer --c mi-prometheus/configs/maes_baselines/maes/maes_serial_recall.yaml
Info: Parsing the mi-prometheus/configs/maes_baselines/maes/maes_serial_recall.yaml configuration file
Info: Parsing the configs/maes_baselines/maes/default_maes.yaml configuration file
Error: Configuration file configs/maes_baselines/maes/default_maes.yaml does not exist

Expected behavior
Workers should search for the other configs relative to the first one.

Solution

Extract the absolute path of the main config, then navigate relative to it. The goal is to leave the paths in the default_config sections as they are, i.e. starting from the configs/ directory.
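
A minimal sketch of the proposed resolution (the helper name is hypothetical; the actual parameter-parsing code will differ):

import os

def resolve_nested_config(main_config_path, nested_config_path):
    """Anchor a nested 'configs/...' path at the same root as the main config file."""
    main_abs = os.path.abspath(main_config_path)
    # Everything up to (and excluding) the 'configs' directory of the main config.
    root = main_abs.rsplit(os.sep + 'configs' + os.sep, 1)[0]
    # The nested path stays exactly as written in the YAML, starting from 'configs/'.
    return os.path.join(root, nested_config_path)

print(resolve_nested_config(
    'mi-prometheus/configs/maes_baselines/maes/maes_serial_recall.yaml',
    'configs/maes_baselines/maes/default_maes.yaml'))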

Standardize params names across methods, problems etc.

There is inconsistent naming across params. For example, directories are sometimes called 'dir' and sometimes 'folder'. We should perhaps decide on a unified set of standard names for source and target folders etc., and then change the __init__ methods and configs to reflect it.
