ibm / mi-prometheus
Enabling reproducible Machine Learning research
Home Page: http://mi-prometheus.rtfd.io/
License: Apache License 2.0
Use "mip-" prefix, change underscores to dashes.
There are still a few things to do to further enhance the documentation:
Explain the contents of the configuration files: the sections (mandatory vs optional), parameters (mandatory vs optional), etc.
Add the expected attributes of each problem class. So far, they all indicate params, which is not very explicit. This should be a team effort.
Add pictures where relevant: for instance, in the algorithmic problems, to illustrate the inputs & outputs
Link the classes mentions so that they can redirect to their documentation. For instance:
:param data_dict: DataDict, as created by the Problem class.
:type data_dict: :py:class:`miprometheus.utils.DataDict` # <- this should be a hyperlink sending to the doc page of DataDict
Link the external classes (PyTorch etc) to their doc (inter-doc linking)
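For the inter-doc linking, Sphinx's intersphinx extension is the usual mechanism; a minimal conf.py sketch (the inventory URLs below are the commonly used ones, but worth verifying against the current docs):

```python
# Sketch of a Sphinx conf.py fragment enabling cross-project reference links,
# so that e.g. :py:class:`torch.Tensor` resolves to the PyTorch documentation.
extensions = ['sphinx.ext.intersphinx']

intersphinx_mapping = {
    'python': ('https://docs.python.org/3', None),
    'torch': ('https://pytorch.org/docs/stable', None),
}
```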
A slow-down in some experiments has been noticed after releasing 0.2.0 and above (e.g. in __getitem__ and .mip). A simple example: mip-online-trainer --h takes a good second to print the options, but I am guessing it could be faster.
Currently, the GPU grid workers just crash and output:
Traceback (most recent call last):
  File "/home/tkornuta/anaconda3/bin/mip-gridtester-gpu", line 11, in <module>
    load_entry_point('miprometheus==0.2.0', 'console_scripts', 'mip-gridtester-gpu')()
  File "/home/tkornuta/pytorch-env/mi-prometheus/miprometheus/workers/grid_tester_gpu.py", line 118, in main
    grid_tester_gpu.run_grid_experiment()
  File "/home/tkornuta/pytorch-env/mi-prometheus/miprometheus/workers/grid_tester_gpu.py", line 87, in run_grid_experiment
    with ThreadPool(processes=torch.cuda.device_count()) as pool:
  File "/home/tkornuta/anaconda3/lib/python3.6/multiprocessing/pool.py", line 789, in __init__
    Pool.__init__(self, processes, initializer, initargs)
  File "/home/tkornuta/anaconda3/lib/python3.6/multiprocessing/pool.py", line 167, in __init__
    raise ValueError("Number of processes must be at least 1")
ValueError: Number of processes must be at least 1
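The traceback suggests torch.cuda.device_count() returned 0 (no visible GPU), which ThreadPool rejects. A minimal defensive sketch (not the actual worker code; num_devices stands in for the device count):

```python
from multiprocessing.pool import ThreadPool

def make_pool(num_devices):
    """Sketch: guard against zero available devices before creating the pool.

    num_devices stands in for torch.cuda.device_count(), which returns 0
    when no GPU is visible and thus makes ThreadPool raise ValueError.
    """
    if num_devices < 1:
        raise RuntimeError("No GPUs detected - cannot run the GPU grid worker.")
    return ThreadPool(processes=num_devices)

# With at least one (simulated) device, the pool is created normally.
pool = make_pool(2)
pool.close()
pool.join()
```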
This will be part of an example presented in the tutorial.
When running the grid trainer with "weird" terminal conditions:
# Set the same terminal conditions.
terminal_conditions:
loss_stop: 1.0e-5
episode_limit: 100
epochs_limit: 10
I received even weirder results, i.e.:
loss | episode | epoch | acc | batch_size |
---|---|---|---|---|
2.3023982 | 0 | 0 | 0.0625 | 64 |
2.42056847 | 1 | 0 | 0.09375 | 64 |
2.31480336 | 2 | 0 | 0.015625 | 64 |
2.30580044 | 3 | 0 | 0.125 | 64 |
2.28870821 | 4 | 0 | 0.109375 | 64 |
2.29206491 | 5 | 0 | 0.140625 | 64 |
2.29954505 | 6 | 0 | 0.109375 | 64 |
2.31241441 | 7 | 0 | 0.0625 | 64 |
2.2548635 | 8 | 0 | 0.1875 | 64 |
2.27464652 | 9 | 0 | 0.078125 | 64 |
2.24816489 | 10 | 0 | 0.171875 | 64 |
2.22160149 | 11 | 0 | 0.140625 | 64 |
2.14928031 | 12 | 0 | 0.171875 | 64 |
2.01351118 | 13 | 0 | 0.125 | 64 |
1.9892391 | 14 | 0 | 0.203125 | 64 |
1.78259158 | 15 | 0 | 0.375 | 64 |
1.76220071 | 16 | 0 | 0.40625 | 64 |
1.59635079 | 17 | 0 | 0.5625 | 64 |
1.24760115 | 18 | 0 | 0.59375 | 64 |
1.52464759 | 19 | 0 | 0.578125 | 64 |
1.02153337 | 20 | 0 | 0.6875 | 64 |
1.13793755 | 21 | 0 | 0.65625 | 64 |
1.04477477 | 22 | 0 | 0.703125 | 64 |
0.9281581 | 23 | 0 | 0.65625 | 64 |
0.8872776 | 24 | 0 | 0.75 | 64 |
0.96978921 | 25 | 0 | 0.71875 | 64 |
0.54203308 | 26 | 0 | 0.84375 | 64 |
0.71089166 | 27 | 0 | 0.828125 | 64 |
0.87194252 | 28 | 0 | 0.78125 | 64 |
0.74255377 | 29 | 0 | 0.796875 | 64 |
0.97673011 | 30 | 0 | 0.71875 | 64 |
0.89663988 | 31 | 0 | 0.671875 | 64 |
0.52516961 | 32 | 0 | 0.859375 | 64 |
0.67374974 | 33 | 0 | 0.6875 | 64 |
1.07340407 | 34 | 0 | 0.640625 | 64 |
0.65307564 | 35 | 0 | 0.71875 | 64 |
0.62181377 | 36 | 0 | 0.796875 | 64 |
0.53182065 | 37 | 0 | 0.84375 | 64 |
0.71876705 | 38 | 0 | 0.8125 | 64 |
0.69108903 | 39 | 0 | 0.8125 | 64 |
0.64931148 | 40 | 0 | 0.84375 | 64 |
0.64401007 | 41 | 0 | 0.765625 | 64 |
0.53108335 | 42 | 0 | 0.78125 | 64 |
0.47211882 | 43 | 0 | 0.8125 | 64 |
0.42510599 | 44 | 0 | 0.859375 | 64 |
0.53872186 | 45 | 0 | 0.78125 | 64 |
0.47975114 | 46 | 0 | 0.875 | 64 |
0.42682296 | 47 | 0 | 0.84375 | 64 |
0.68501329 | 48 | 0 | 0.84375 | 64 |
0.51802105 | 49 | 0 | 0.84375 | 64 |
0.42391279 | 50 | 0 | 0.890625 | 64 |
0.54398292 | 51 | 0 | 0.8125 | 64 |
0.43966454 | 52 | 0 | 0.828125 | 64 |
0.41222006 | 53 | 0 | 0.859375 | 64 |
0.30380589 | 54 | 0 | 0.890625 | 64 |
0.28526509 | 55 | 0 | 0.890625 | 64 |
0.3890301 | 56 | 0 | 0.890625 | 64 |
0.3726145 | 57 | 0 | 0.859375 | 64 |
0.37899232 | 58 | 0 | 0.84375 | 64 |
0.31090876 | 59 | 0 | 0.90625 | 64 |
0.29964575 | 60 | 0 | 0.890625 | 64 |
0.29754484 | 61 | 0 | 0.875 | 64 |
0.30940181 | 62 | 0 | 0.90625 | 64 |
0.28904665 | 63 | 0 | 0.90625 | 64 |
0.28892154 | 64 | 0 | 0.9375 | 64 |
0.28293437 | 65 | 0 | 0.890625 | 64 |
0.28884795 | 66 | 0 | 0.9375 | 64 |
0.27016857 | 67 | 0 | 0.890625 | 64 |
0.38757008 | 68 | 0 | 0.921875 | 64 |
0.24764507 | 69 | 0 | 0.921875 | 64 |
0.25606325 | 70 | 0 | 0.890625 | 64 |
0.48922622 | 71 | 0 | 0.84375 | 64 |
0.2977196 | 72 | 0 | 0.890625 | 64 |
0.3917419 | 73 | 0 | 0.921875 | 64 |
0.19252293 | 74 | 0 | 0.9375 | 64 |
0.39461273 | 75 | 0 | 0.875 | 64 |
0.28725618 | 76 | 0 | 0.859375 | 64 |
0.24857962 | 77 | 0 | 0.921875 | 64 |
0.22327447 | 78 | 0 | 0.9375 | 64 |
0.41391894 | 79 | 0 | 0.859375 | 64 |
0.19850856 | 80 | 0 | 0.921875 | 64 |
0.30375871 | 81 | 0 | 0.890625 | 64 |
0.38144702 | 82 | 0 | 0.890625 | 64 |
0.29862314 | 83 | 0 | 0.921875 | 64 |
0.16170724 | 84 | 0 | 0.953125 | 64 |
0.25888351 | 85 | 0 | 0.953125 | 64 |
0.17384183 | 86 | 0 | 0.953125 | 64 |
0.24882084 | 87 | 0 | 0.953125 | 64 |
0.20304871 | 88 | 0 | 0.921875 | 64 |
0.354817 | 89 | 0 | 0.9375 | 64 |
0.12355755 | 90 | 0 | 0.96875 | 64 |
0.20728019 | 91 | 0 | 0.921875 | 64 |
0.17258625 | 92 | 0 | 0.921875 | 64 |
0.16974132 | 93 | 0 | 0.953125 | 64 |
0.37275705 | 94 | 0 | 0.90625 | 64 |
0.09402215 | 95 | 0 | 0.96875 | 64 |
0.27992848 | 96 | 0 | 0.90625 | 64 |
0.13900934 | 97 | 0 | 0.953125 | 64 |
0.27177253 | 98 | 0 | 0.921875 | 64 |
0.15787081 | 99 | 0 | 0.921875 | 64 |
0.40764943 | 99 | 1 | 0.890625 | 64 |
0.12967248 | 99 | 2 | 0.9375 | 64 |
0.16256529 | 99 | 3 | 0.953125 | 64 |
0.08198662 | 99 | 4 | 0.96875 | 64 |
0.12792362 | 99 | 5 | 0.96875 | 64 |
0.1427121 | 99 | 6 | 0.953125 | 64 |
0.19214444 | 99 | 7 | 0.953125 | 64 |
0.26682153 | 99 | 8 | 0.890625 | 64 |
0.11921781 | 99 | 9 | 0.9375 | 64 |
episode | episodes_aggregated | loss | loss_min | loss_max | loss_std | epoch | acc | acc_min | acc_max | acc_std | samples_aggregated |
---|---|---|---|---|---|---|---|---|---|---|---|
99 | 100 | 0.77779132 | 0.09402215 | 2.42056847 | 0.70779961 | 0 | 0.72874999 | 0.015625 | 0.96875 | 0.27987459 | 6400 |
99 | 1 | 0.40764943 | 0.40764943 | 0.40764943 | 0 | 1 | 0.890625 | 0.890625 | 0.890625 | 0 | 64 |
99 | 1 | 0.12967248 | 0.12967248 | 0.12967248 | 0 | 2 | 0.9375 | 0.9375 | 0.9375 | 0 | 64 |
99 | 1 | 0.16256529 | 0.16256529 | 0.16256529 | 0 | 3 | 0.953125 | 0.953125 | 0.953125 | 0 | 64 |
99 | 1 | 0.08198662 | 0.08198662 | 0.08198662 | 0 | 4 | 0.96875 | 0.96875 | 0.96875 | 0 | 64 |
99 | 1 | 0.12792362 | 0.12792362 | 0.12792362 | 0 | 5 | 0.96875 | 0.96875 | 0.96875 | 0 | 64 |
99 | 1 | 0.1427121 | 0.1427121 | 0.1427121 | 0 | 6 | 0.953125 | 0.953125 | 0.953125 | 0 | 64 |
99 | 1 | 0.19214444 | 0.19214444 | 0.19214444 | 0 | 7 | 0.953125 | 0.953125 | 0.953125 | 0 | 64 |
99 | 1 | 0.26682153 | 0.26682153 | 0.26682153 | 0 | 8 | 0.890625 | 0.890625 | 0.890625 | 0 | 64 |
99 | 1 | 0.11921781 | 0.11921781 | 0.11921781 | 0 | 9 | 0.9375 | 0.9375 | 0.9375 | 0 | 64 |
content of validation_statistics.csv:
loss | episode | epoch | acc | batch_size |
---|---|---|---|---|
2.378291607 | 0 | 0 | 0.125 | 64 |
content of validation_set_agg_statistics.csv:
episode | episodes_aggregated | loss | loss_min | loss_max | loss_std | epoch | acc | acc_min | acc_max | acc_std | samples_aggregated |
---|---|---|---|---|---|---|---|---|---|---|---|
99 | 79 | 0.18435289 | 0.00570344 | 0.51182282 | 0.1067069 | 0 | 0.9398734 | 0.859375 | 1 | 0.03431381 | 5000 |
99 | 79 | 0.15693755 | 0.03428814 | 0.49671429 | 0.08907631 | 1 | 0.94996047 | 0.84375 | 1 | 0.0299593 | 5000 |
99 | 79 | 0.13773099 | 0.01599673 | 0.4856357 | 0.07974177 | 2 | 0.95925635 | 0.890625 | 1 | 0.02568884 | 5000 |
99 | 79 | 0.1349417 | 0.02542676 | 0.38587141 | 0.07309812 | 3 | 0.96004748 | 0.875 | 1 | 0.02536133 | 5000 |
99 | 79 | 0.13557728 | 0.02810043 | 0.51120263 | 0.07920972 | 4 | 0.95866299 | 0.890625 | 1 | 0.02552018 | 5000 |
99 | 79 | 0.14610203 | 0.03679127 | 0.45367861 | 0.08219316 | 5 | 0.9535206 | 0.875 | 1 | 0.02991695 | 5000 |
99 | 79 | 0.14048618 | 0.04181131 | 0.55806714 | 0.08375487 | 6 | 0.95530063 | 0.890625 | 1 | 0.02448055 | 5000 |
99 | 79 | 0.13374574 | 0.02522391 | 0.46379766 | 0.08368348 | 7 | 0.95787185 | 0.875 | 1 | 0.02812438 | 5000 |
99 | 79 | 0.13794875 | 0.02302515 | 0.46463043 | 0.07694951 | 8 | 0.9541139 | 0.875 | 1 | 0.0276742 | 5000 |
99 | 79 | 0.14701881 | 0.00304925 | 0.52258015 | 0.09639523 | 9 | 0.95391613 | 0.890625 | 1 | 0.02705158 | 5000 |
99 | 79 | 0.14747529 | 0.02490817 | 0.49389082 | 0.08180903 | 9 | 0.95391613 | 0.875 | 1 | 0.02829573 | 5000 |
Describe the bug
NumPy floats no longer get automatically converted to integers when needed with later versions of NumPy. Thus, PyTorch network configurations that expect integers throw errors.
To Reproduce
Steps to reproduce the behavior:
Run the following with a later version of numpy installed:
$ mip-offline-trainer --c mi-prometheus/configs/vision/simplecnn_mnist.yaml
Expected behavior
Network definitions proceed without errors
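A minimal illustration of the failure mode and the explicit cast that fixes it (the pooling computation below is a hypothetical example, not taken from simplecnn_mnist.yaml):

```python
import numpy as np

# With newer NumPy versions, float values are no longer silently accepted
# where integers are required, so dimensions computed with NumPy must be
# cast explicitly before being handed to PyTorch layer constructors.
# Hypothetical example: the spatial size of a feature map after 2x2 pooling.
height = np.floor(28 / 2)  # np.float64(14.0) - a float, not an int
height = int(height)       # explicit cast avoids the error
print(height)  # 14
```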
I have analyzed all workers; in my opinion there is no conflict with --e(xperiments_dir).
This is one of the few flags shared across all workers and grid workers, and outputdir is misleading, as for most of them this is also the input dir.
@vmarois what do you think about that?
This issue relates to #25
In order to enable reproducible VIGIL experiments, and make CLEVR a lighter class:
Move most of the complexity of CLEVR.generate_feature_maps_file() to GenerateFeatureMaps.
Describe the bug
Running grid_trainer_cpu on macOS results in:
max_processes = min(len(os.sched_getaffinity(0)), self.max_concurrent_runs)
AttributeError: module 'os' has no attribute 'sched_getaffinity'
It seems that macOS doesn't support this :]
https://stackoverflow.com/questions/42538153/python-3-6-0-os-module-does-not-have-sched-getaffinity-method
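A possible fix is to fall back to a portable CPU count on platforms where os.sched_getaffinity() is unavailable; a minimal sketch:

```python
import multiprocessing
import os

def available_cpu_count():
    """Return the number of CPUs usable by this process.

    os.sched_getaffinity() is Linux-only; fall back to cpu_count()
    on platforms (e.g. macOS) where it is unavailable.
    """
    try:
        return len(os.sched_getaffinity(0))
    except AttributeError:
        return multiprocessing.cpu_count()
```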
For now, the grid trainers take into account the number of available CPUs (for the CPU trainer) or GPUs (for the GPU trainer). This is not consistent with the grid testers, which lack that functionality.
Aggregates issues related to release 0.3.1
TL;DR
when you are a developer, call:
python setup.py develop
instead of:
python setup.py install
https://setuptools.readthedocs.io/en/latest/setuptools.html#development-mode
The workers have several flags available, which are added based on inheritance for now (e.g. Worker ← Trainer ← OfflineTrainer, etc.). This results in some flags not being particularly useful for a specific worker, or having a different meaning.
Some examples:
experiment_repetitions is a flag parameter for grid-tester-* but a config-file one for grid-trainer-*.
savetag for grid-tester-* is particularly useful.
Currently they are too "overcomplicated":
Thanks to this, the grid-analyzer will simply inform the user whether the model converged or not!
For now, it is hard to track whether the subset sampler is working or not.
The current implementation of curriculum learning forces, by default, that it needs to be finished even if convergence has been reached (in terms of loss < threshold):
# If the 'must_finish' key is not present in the config, then it will be finished by default.
self.params['training']['curriculum_learning'].add_default_params({'must_finish': True})
While I think this is okay, it'd be great if the trainer would indicate that. For instance, if the loss threshold for a run of MAES on SerialRecall is set to 1e-2, MAES will most likely converge before the end of curriculum learning. In this case, a warning message would be great, along the lines of:
if not self.curric_done and converged:
self.logger.warning('The model has converged but curriculum has been set with must_finish=True.')
Python Package Index (PyPI)
As we are showing that as an example in our papers/presentations, I want to simply add it to our collections of models. ;)
General idea is to reproduce "tiny helpful apps", like e.g. index_splitter OR... grid workers(!)
The goal is that the video stream should return 5D Tensors:
BATCH x SEQ_LEN x CHANNELS x WIDTH x HEIGHT
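The expected layout, illustrated with NumPy for brevity (the actual stream would return torch tensors; the sizes below are made up):

```python
import numpy as np

# BATCH x SEQ_LEN x CHANNELS x WIDTH x HEIGHT:
# 2 clips, 8 frames each, 3-channel 32x32 images.
video_batch = np.zeros((2, 8, 3, 32, 32))
print(video_batch.shape)  # (2, 8, 3, 32, 32)
```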
(copy of internal issue #221)
Handle KeyboardInterrupt similarly to workers
Grid Trainers/Testers on GPU have hardcoded sleep time (currently 3s). This is motivated by the fact that cuda-gpupick picks a free GPU only by checking the contexts running on a given device.
The problem is that loading the configuration/configuring a given experiment might take longer than 3 seconds. This is the situation we faced when training multiple MAC/SMAC models on CLEVR/CoGenT.
For now, we have increased the sleep time to 60 seconds (closes #29).
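One possible direction is to expose the delay as a flag instead of hardcoding it; a sketch (the flag name --gpu_spawn_delay is hypothetical, not an existing mi-prometheus option):

```python
import argparse

parser = argparse.ArgumentParser()
# Hypothetical flag: let the user override the delay between spawning
# GPU experiments instead of relying on a hardcoded sleep time.
parser.add_argument('--gpu_spawn_delay', type=int, default=60,
                    help='Seconds to wait between launching GPU experiments.')

args = parser.parse_args([])  # empty list: use the default in this sketch
print(args.gpu_spawn_delay)  # 60
```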
Desired solution
Assuming we run:
That results in 10 "experiments" that should potentially form the content of a single csv file - 12 rows?
The goal of this issue is to discuss what, how, and when should be copied, taking into account:
Besides, the offline-trainer can optionally store partial_validation_statistics in a csv file, when one sets partial_validation_interval > 0 in the config's validation section.
As a result, different trainers will produce different statistics; how to deal with that diversity is the goal of this discussion/issue.
This issue aggregates minor fixes and updates.
Describe the bug
models/thalnet/thalnet_module.py is the source of the error, on line 131. Trying to log an error to the logger causes this.
To Reproduce
Steps to reproduce the behavior:
Happened as I was testing changes for #58 - I suspect just feeding in incorrectly sized data, as described in the if statement on line 130, should trigger it.
Expected behavior
Should properly log to the logger.
I had to use torchvision 0.2.0 to get the doc build to succeed for the first time, but it seems to be causing an error with Resize
(which wasn't showing up with 0.2.1).
Validation with a batch of size 1 works perfectly.
Validation with a batch of size 10 from time to time returns values that differ.
First I thought the issue was related to the lack of weighted averaging when we are not dropping the last batch. Sadly, the issue remained even when dropping the last batch/limiting the size of the set to the batch size.
To Reproduce
mip-offline-trainer --c configs/vision/simplecnn_mnist.yaml
Validation problem section:
validation:
problem:
name: MNIST
batch_size: 10
use_train_data: True
resize: [32, 32]
sampler:
name: SubsetRandomSampler
indices: [55000, 55010]
dataloader:
drop_last: True
[2018-11-08 12:36:07] - INFO - OfflineTrainer >>> Training finished because Epoch limit reached
[2018-11-08 12:36:07] - INFO - OfflineTrainer >>> Validating over the entire validation set (10 samples in 1 episodes)
[2018-11-08 12:36:07] - INFO - OfflineTrainer >>> episode 000858; episodes_aggregated 000001; loss 0.0016851809; loss_min 0.0016851809; loss_max 0.0016851809; loss_std 0.0000000000; epoch 00; acc 1.0000000000; acc_min 1.0000000000; acc_max 1.0000000000; acc_std 0.0000000000; samples_aggregated 000010 [Full Validation]
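For reference on the weighted-averaging point above: aggregating per-batch accuracies without weighting by batch size skews the result whenever the last batch is smaller. A minimal sketch (the batch sizes and accuracies are illustrative, not taken from the run above):

```python
# Weighted vs unweighted aggregation of per-batch accuracy.
batch_sizes = [64, 64, 22]          # last batch smaller (drop_last=False)
accuracies  = [0.90, 0.95, 0.50]

unweighted = sum(accuracies) / len(accuracies)
weighted = sum(a * n for a, n in zip(accuracies, batch_sizes)) / sum(batch_sizes)

print(round(unweighted, 4))  # 0.7833
print(round(weighted, 4))    # 0.8627 - the correct set-level accuracy
```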
As of now, the trainers have the option to load a pretrained model using the --model flag.
We haven't really used this feature so far, and the way it is implemented (as a flag) makes it hard to handle in the grid trainers, which results in it not being handled for now.
So I would like to handle this in the grid trainers, because it is useful to load a pretrained model for each experiment (e.g. to finetune a pre-trained model).
I'm thinking of either:
Adding a --models flag to the grid trainers, which the user could use to indicate the trained models he wants to reuse; but this could be messy, as we would have to check if all models are present, which experiments they are compatible with, etc.
Changing --model in the trainers from a flag to a config parameter that the user could specify in the config file. I know that this wouldn't be consistent with the tester, but I find this cleaner and easier to handle in the grid trainers.
We can discuss that.
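The config-parameter option could look roughly like this in a config file (a sketch only; the key name and path are hypothetical, not an existing mi-prometheus option):

```yaml
training:
  # Hypothetical parameter replacing the --model flag:
  pretrained_model: experiments/mnist/model_best.pt
```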
Describe the bug
Currently, all workers assume that they are executed from the mi-prometheus main directory.
Along with setup.py, we open the possibility of executing the mip-* workers from any directory.
To Reproduce
tkornuta@tkornuta-MacBookPro:~/pytorch-env$ mip-onlinetrainer --c mi-prometheus/configs/maes_baselines/maes/maes_serial_recall.yaml
Info: Parsing the mi-prometheus/configs/maes_baselines/maes/maes_serial_recall.yaml configuration file
Info: Parsing the configs/maes_baselines/maes/default_maes.yaml configuration file
Error: Configuration file configs/maes_baselines/maes/default_maes.yaml does not exist
Expected behavior
Workers should search for the other configs relative to the first one.
Solution
Extract the absolute path to the main config, then navigate relative to that one. The goal is to leave the paths in default_config sections as they are, i.e. starting from the configs/ directory.
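The proposed resolution could be sketched as follows (an illustration, not the actual worker code; it assumes both configs live under the same configs/ root):

```python
import os

def resolve_default_config(main_config_path, default_config_rel):
    """Resolve a default-config path (e.g. 'configs/.../default_maes.yaml')
    relative to the location of the main config file."""
    main_abs = os.path.abspath(main_config_path)
    # Walk up from the main config until we find the 'configs/' root.
    root = main_abs
    while os.path.basename(root) != 'configs':
        parent = os.path.dirname(root)
        if parent == root:  # reached the filesystem root
            raise ValueError("main config is not under a 'configs/' directory")
        root = parent
    # Default paths in config files already start with 'configs/',
    # so join them with the parent of the 'configs/' root.
    return os.path.join(os.path.dirname(root), default_config_rel)
```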
Grid workers currently rely on hardcoded paths/names of basic workers, fail to run them from
Testing will include simply running tester on experiment directory
Generally, this should follow the same logic of "doubling flags and parameters", i.e.:
flags overwrite parameters read from the configuration, which overwrite default parameters.
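The precedence above can be sketched as a simple dict merge (a generic illustration, not the mi-prometheus API):

```python
# "Flags overwrite config, config overwrites defaults":
# later dicts in the merge take precedence over earlier ones.
defaults = {'batch_size': 64, 'epochs': 10}
from_config = {'epochs': 5}
from_flags = {'batch_size': 128}

effective = {**defaults, **from_config, **from_flags}
print(effective)  # {'batch_size': 128, 'epochs': 5}
```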
There is inconsistent naming across params. For example, directories are sometimes called 'dir' and sometimes 'folder'. We should perhaps decide on a unified set of standard names for source and target directories etc., and then change the __init__ methods and configs to reflect it.