ibm / mi-prometheus

Enabling reproducible Machine Learning research

Home Page: http://mi-prometheus.rtfd.io/

License: Apache License 2.0

Language: Python 100.00%

Topics: mi-prometheus, pytorch, machine-learning, model, problem, worker, grid-worker

mi-prometheus's People

Contributors: aasseman, cclauss, cshivade, dependabot[bot], imgbotapp, kant, sesevgen, tkornuta-ibm, tsjayram, vincentalbouy, vmarois, younesbouhadjar

mi-prometheus's Issues

Enhancement of the documentation

There are still a few things to do to further enhance the documentation:

  • Explain the contents of the configuration files: the sections (mandatory vs. optional), the parameters (mandatory vs. optional), etc.

  • Add the expected attributes of each problem class. So far, they all indicate params, which is not very explicit. This should be a team effort 🙂

  • Add pictures where relevant: for instance, in the algorithmic problems, to illustrate the inputs & outputs.

  • Link class mentions so that they redirect to their documentation. For instance:

    :param data_dict: DataDict, as created by the Problem class.
    :type data_dict: :py:class:`miprometheus.utils.DataDict` # <- this should be a hyperlink pointing to the doc page of DataDict
    
  • Link the external classes (PyTorch etc.) to their documentation (inter-doc linking); see the sketch below.
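
For the last two points, a minimal sketch of what could go into the Sphinx conf.py (the extension list and inventory URLs here are assumptions, not the project's actual configuration):

# docs conf.py (sketch) -- cross-reference support for internal and external classes.
extensions = [
    'sphinx.ext.autodoc',       # pulls in the docstrings
    'sphinx.ext.intersphinx',   # resolves references to external documentation
]

# With intersphinx enabled, :py:class:`torch.Tensor` or :py:class:`miprometheus.utils.DataDict`
# in a docstring becomes a hyperlink to the corresponding documentation page.
intersphinx_mapping = {
    'python': ('https://docs.python.org/3', None),
    'torch': ('https://pytorch.org/docs/stable/', None),
}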

Investigate overhead of mi-prometheus

A slow-down has been noticed in some experiments after the 0.2.0 release:

  • A run of MAC on CLEVR for 20 epochs takes more than 3 days, whereas it previously took 26 h. This could be due to the I/O operations for loading the image files on-the-fly in __getitem__ (see the sketch below).
  • Exhausting the MNIST or CIFAR10 datasets seems slower (according to @tkornut), although I was not able to reproduce what he reported.
  • More generally, I feel we could reduce the overall overhead/slowness of mip. A simple example: mip-online-trainer --h takes a good second to print the options, but I am guessing it could be faster.
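
As a starting point for profiling the first item, a sketch of memoizing decoded images so that __getitem__ does not hit the disk on every call (this is a hypothetical dataset class, not the actual CLEVR problem, and RAM usage obviously needs to be checked for a dataset of that size):

import os
from PIL import Image
from torch.utils.data import Dataset

class CachedImageDataset(Dataset):
    """Hypothetical dataset that decodes each image once and keeps it in RAM."""

    def __init__(self, image_dir, filenames):
        self.image_dir = image_dir
        self.filenames = filenames
        self._cache = {}  # index -> decoded PIL.Image, filled lazily

    def __len__(self):
        return len(self.filenames)

    def __getitem__(self, index):
        if index not in self._cache:
            path = os.path.join(self.image_dir, self.filenames[index])
            # First access: read and decode from disk.
            self._cache[index] = Image.open(path).convert('RGB')
        # Subsequent epochs are served from memory.
        return self._cache[index]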

Check the presence of CUDA-compatible devices in the GPU grid workers

Currently, they just crash and output:

Traceback (most recent call last):
  File "/home/tkornuta/anaconda3/bin/mip-gridtester-gpu", line 11, in <module>
    load_entry_point('miprometheus==0.2.0', 'console_scripts', 'mip-gridtester-gpu')()
  File "/home/tkornuta/pytorch-env/mi-prometheus/miprometheus/workers/grid_tester_gpu.py", line 118, in main
    grid_tester_gpu.run_grid_experiment()
  File "/home/tkornuta/pytorch-env/mi-prometheus/miprometheus/workers/grid_tester_gpu.py", line 87, in run_grid_experiment
    with ThreadPool(processes=torch.cuda.device_count()) as pool:
  File "/home/tkornuta/anaconda3/lib/python3.6/multiprocessing/pool.py", line 789, in __init__
    Pool.__init__(self, processes, initializer, initargs)
  File "/home/tkornuta/anaconda3/lib/python3.6/multiprocessing/pool.py", line 167, in __init__
    raise ValueError("Number of processes must be at least 1")
ValueError: Number of processes must be at least 1
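
A minimal sketch of the kind of guard that could be added before the pool is created, so the worker fails with an explicit message instead of the ValueError above (the message text is an assumption):

import sys
import torch
from multiprocessing.pool import ThreadPool

num_devices = torch.cuda.device_count()
if num_devices == 0:
    # Fail early and explicitly instead of letting Pool.__init__ raise.
    sys.exit("No CUDA-compatible devices detected - cannot run the GPU grid worker.")

with ThreadPool(processes=num_devices) as pool:
    pass  # run_grid_experiment() would dispatch the experiments here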

Investigate the content of statistics, rethink the desired behavior of online/offline trainers

When running the grid trainer with "weird" terminal conditions:
# Set the same terminal conditions.
terminal_conditions:
    loss_stop: 1.0e-5
    episode_limit: 100
    epochs_limit: 10

I received even weirder results, i.e.:

  • content of training_statistics.csv:
loss episode epoch acc batch_size
2.3023982 0 0 0.0625 64
2.42056847 1 0 0.09375 64
2.31480336 2 0 0.015625 64
2.30580044 3 0 0.125 64
2.28870821 4 0 0.109375 64
2.29206491 5 0 0.140625 64
2.29954505 6 0 0.109375 64
2.31241441 7 0 0.0625 64
2.2548635 8 0 0.1875 64
2.27464652 9 0 0.078125 64
2.24816489 10 0 0.171875 64
2.22160149 11 0 0.140625 64
2.14928031 12 0 0.171875 64
2.01351118 13 0 0.125 64
1.9892391 14 0 0.203125 64
1.78259158 15 0 0.375 64
1.76220071 16 0 0.40625 64
1.59635079 17 0 0.5625 64
1.24760115 18 0 0.59375 64
1.52464759 19 0 0.578125 64
1.02153337 20 0 0.6875 64
1.13793755 21 0 0.65625 64
1.04477477 22 0 0.703125 64
0.9281581 23 0 0.65625 64
0.8872776 24 0 0.75 64
0.96978921 25 0 0.71875 64
0.54203308 26 0 0.84375 64
0.71089166 27 0 0.828125 64
0.87194252 28 0 0.78125 64
0.74255377 29 0 0.796875 64
0.97673011 30 0 0.71875 64
0.89663988 31 0 0.671875 64
0.52516961 32 0 0.859375 64
0.67374974 33 0 0.6875 64
1.07340407 34 0 0.640625 64
0.65307564 35 0 0.71875 64
0.62181377 36 0 0.796875 64
0.53182065 37 0 0.84375 64
0.71876705 38 0 0.8125 64
0.69108903 39 0 0.8125 64
0.64931148 40 0 0.84375 64
0.64401007 41 0 0.765625 64
0.53108335 42 0 0.78125 64
0.47211882 43 0 0.8125 64
0.42510599 44 0 0.859375 64
0.53872186 45 0 0.78125 64
0.47975114 46 0 0.875 64
0.42682296 47 0 0.84375 64
0.68501329 48 0 0.84375 64
0.51802105 49 0 0.84375 64
0.42391279 50 0 0.890625 64
0.54398292 51 0 0.8125 64
0.43966454 52 0 0.828125 64
0.41222006 53 0 0.859375 64
0.30380589 54 0 0.890625 64
0.28526509 55 0 0.890625 64
0.3890301 56 0 0.890625 64
0.3726145 57 0 0.859375 64
0.37899232 58 0 0.84375 64
0.31090876 59 0 0.90625 64
0.29964575 60 0 0.890625 64
0.29754484 61 0 0.875 64
0.30940181 62 0 0.90625 64
0.28904665 63 0 0.90625 64
0.28892154 64 0 0.9375 64
0.28293437 65 0 0.890625 64
0.28884795 66 0 0.9375 64
0.27016857 67 0 0.890625 64
0.38757008 68 0 0.921875 64
0.24764507 69 0 0.921875 64
0.25606325 70 0 0.890625 64
0.48922622 71 0 0.84375 64
0.2977196 72 0 0.890625 64
0.3917419 73 0 0.921875 64
0.19252293 74 0 0.9375 64
0.39461273 75 0 0.875 64
0.28725618 76 0 0.859375 64
0.24857962 77 0 0.921875 64
0.22327447 78 0 0.9375 64
0.41391894 79 0 0.859375 64
0.19850856 80 0 0.921875 64
0.30375871 81 0 0.890625 64
0.38144702 82 0 0.890625 64
0.29862314 83 0 0.921875 64
0.16170724 84 0 0.953125 64
0.25888351 85 0 0.953125 64
0.17384183 86 0 0.953125 64
0.24882084 87 0 0.953125 64
0.20304871 88 0 0.921875 64
0.354817 89 0 0.9375 64
0.12355755 90 0 0.96875 64
0.20728019 91 0 0.921875 64
0.17258625 92 0 0.921875 64
0.16974132 93 0 0.953125 64
0.37275705 94 0 0.90625 64
0.09402215 95 0 0.96875 64
0.27992848 96 0 0.90625 64
0.13900934 97 0 0.953125 64
0.27177253 98 0 0.921875 64
0.15787081 99 0 0.921875 64
0.40764943 99 1 0.890625 64
0.12967248 99 2 0.9375 64
0.16256529 99 3 0.953125 64
0.08198662 99 4 0.96875 64
0.12792362 99 5 0.96875 64
0.1427121 99 6 0.953125 64
0.19214444 99 7 0.953125 64
0.26682153 99 8 0.890625 64
0.11921781 99 9 0.9375 64
  • content of training_set_agg_statistics.csv:
episode episodes_aggregated loss loss_min loss_max loss_std epoch acc acc_min acc_max acc_std samples_aggregated
99 100 0.77779132 0.09402215 2.42056847 0.70779961 0 0.72874999 0.015625 0.96875 0.27987459 6400
99 1 0.40764943 0.40764943 0.40764943 0 1 0.890625 0.890625 0.890625 0 64
99 1 0.12967248 0.12967248 0.12967248 0 2 0.9375 0.9375 0.9375 0 64
99 1 0.16256529 0.16256529 0.16256529 0 3 0.953125 0.953125 0.953125 0 64
99 1 0.08198662 0.08198662 0.08198662 0 4 0.96875 0.96875 0.96875 0 64
99 1 0.12792362 0.12792362 0.12792362 0 5 0.96875 0.96875 0.96875 0 64
99 1 0.1427121 0.1427121 0.1427121 0 6 0.953125 0.953125 0.953125 0 64
99 1 0.19214444 0.19214444 0.19214444 0 7 0.953125 0.953125 0.953125 0 64
99 1 0.26682153 0.26682153 0.26682153 0 8 0.890625 0.890625 0.890625 0 64
99 1 0.11921781 0.11921781 0.11921781 0 9 0.9375 0.9375 0.9375 0 64
  • content of validation_statistics.csv:
    loss episode epoch acc batch_size
    2.378291607 0 0 0.125 64

  • content of validation_set_agg_statistics.csv:

episode episodes_aggregated loss loss_min loss_max loss_std epoch acc acc_min acc_max acc_std samples_aggregated
99 79 0.18435289 0.00570344 0.51182282 0.1067069 0 0.9398734 0.859375 1 0.03431381 5000
99 79 0.15693755 0.03428814 0.49671429 0.08907631 1 0.94996047 0.84375 1 0.0299593 5000
99 79 0.13773099 0.01599673 0.4856357 0.07974177 2 0.95925635 0.890625 1 0.02568884 5000
99 79 0.1349417 0.02542676 0.38587141 0.07309812 3 0.96004748 0.875 1 0.02536133 5000
99 79 0.13557728 0.02810043 0.51120263 0.07920972 4 0.95866299 0.890625 1 0.02552018 5000
99 79 0.14610203 0.03679127 0.45367861 0.08219316 5 0.9535206 0.875 1 0.02991695 5000
99 79 0.14048618 0.04181131 0.55806714 0.08375487 6 0.95530063 0.890625 1 0.02448055 5000
99 79 0.13374574 0.02522391 0.46379766 0.08368348 7 0.95787185 0.875 1 0.02812438 5000
99 79 0.13794875 0.02302515 0.46463043 0.07694951 8 0.9541139 0.875 1 0.0276742 5000
99 79 0.14701881 0.00304925 0.52258015 0.09639523 9 0.95391613 0.890625 1 0.02705158 5000
99 79 0.14747529 0.02490817 0.49389082 0.08180903 9 0.95391613 0.875 1 0.02829573 5000

Numpy > 1.11.0 causes errors due to numpy.float64 not being automatically converted to Int

Describe the bug
With later versions of numpy, numpy floats no longer get automatically converted to ints when needed. Thus, PyTorch network configurations that expect ints throw errors.

To Reproduce
Steps to reproduce the behavior:
Run the following with a later version of numpy installed:
$ mip-offline-trainer --c mi-prometheus/configs/vision/simplecnn_mnist.yaml

Expected behavior
Network definitions proceed without errors

Desktop (please complete the following information):

  • OS: Ubuntu
  • OS version 18.04
  • Python version 3.5.2
  • PyTorch version 0.4.1
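
A minimal reproduction of the failure mode outside of mi-prometheus, with the obvious defensive cast (the layer and value are hypothetical; whether the uncast call raises depends on the numpy/PyTorch combination):

import numpy as np
import torch.nn as nn

hidden_size = np.float64(10)        # e.g. a value that ends up as a numpy float after config parsing

# nn.Linear(4, hidden_size)         # may raise TypeError with numpy > 1.11.0
layer = nn.Linear(4, int(hidden_size))  # explicit cast keeps the network definition working
print(layer)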

Change --o to --e

I have analyzed all the workers; in my opinion there is no conflict with --e(xperiments_dir).

This is one of the few flags shared across all workers and grid workers, and "output dir" is misleading, as for most of them this is also the input dir...

@vmarois what do you think about that?

This issue relates to #25

Clean up the flags of the different workers

The workers have several flags available, which are currently added via inheritance (e.g. Worker ⬅️ Trainer ⬅️ OfflineTrainer, etc.). This results in some flags that are not particularly useful for a specific worker, or that have a different meaning.

Some examples:

  • experiment_repetitions is a command-line flag for grid-tester-* but a config-file parameter for grid-trainer-*.
  • Not sure that savetag for grid-tester-* is particularly useful.

Refactor and simplify MNIST and CIFAR10 problems

Currently they are too "overcomplicated":

  1. contain hardcoded elements (e.g. resizing to exactly 224 x 224) which had to be removed in order to make those problems
  2. default_values have to be simplified to the image dimensions (w, h, c) + the number of classes (see the sketch below)
  3. naming: num_classes, num_channels
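
A sketch of what the simplified default_values could look like for MNIST, using the proposed names (the values are the standard MNIST dimensions; the exact dictionary keys are an assumption):

# Hypothetical simplified default_values for the MNIST problem.
default_values = {
    'width': 28,         # image width
    'height': 28,        # image height
    'num_channels': 1,   # grayscale
    'num_classes': 10,   # digits 0-9
}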

Store termination cause in model checkpoint

  • set the initial value of "termination_cause" to "[Not Converged]"
  • set termination_cause depending on the termination reason
  • pass termination_cause to model.save as an additional field

Thanks to this, the grid-analyzer will simply inform the user whether the model converged or not! 👍
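
A rough sketch of the intended flow, using a plain torch.save checkpoint dictionary (the actual Model.save API in mi-prometheus may differ):

import torch
import torch.nn as nn

model = nn.Linear(4, 2)                 # stand-in model for illustration
termination_cause = "[Not Converged]"   # initial value

converged = True                        # e.g. loss < loss_stop
if converged:
    termination_cause = "Loss converged"

# Store the cause alongside the weights so the grid-analyzer can report it later.
torch.save({'model_state_dict': model.state_dict(),
            'termination_cause': termination_cause},
           'model_best.pt')
print(torch.load('model_best.pt')['termination_cause'])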

Trainers should indicate if the loss criterion is met but curriculum learning is still going

The current implementation of curriculum learning forces, by default, that it must be finished even if convergence has already been reached (in terms of loss < threshold):

# If the 'must_finish' key is not present in the config, then it will be finished by default.
self.params['training']['curriculum_learning'].add_default_params({'must_finish': True})

While I think this is okay, it would be great if the trainer indicated that. For instance, if the loss threshold for a run of MAES on SerialRecall is set to 1e-2, MAES will most likely converge before the end of curriculum learning. In this case, a warning message would be great, along the lines of:

if not self.curric_done and converged:
    self.logger.warning('The model has converged but curriculum has been set with must_finish=True.')

Refactor VideoTextToClass problems

The goal is that the video stream should return 5D Tensors:
BATCH x SEQ_LEN x CHANNELS x WIDTH x HEIGHT

(copy of internal issue #221)
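
For illustration, the expected layout of such a batch (the shapes are arbitrary example values):

import torch

batch_size, seq_len, channels, width, height = 2, 16, 3, 64, 48

# BATCH x SEQ_LEN x CHANNELS x WIDTH x HEIGHT
video = torch.zeros(batch_size, seq_len, channels, width, height)
print(video.shape)  # torch.Size([2, 16, 3, 64, 48])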

Introduce mutex-based experiment configuration to Grid Workers GPU

Grid Trainers/Testers on GPU have a hardcoded sleep time (currently 3 s). This is motivated by the fact that cuda-gpupick picks a free GPU only by checking the contexts running on a given device.

The problem is that loading the configuration / configuring a given experiment might take longer than 3 seconds. This is the situation we faced when training multiple MAC/SMAC models on CLEVR/CoGenT.

For now, we have increased the sleep time to 60 seconds (closes #29).

Desired solution

  1. introduce a "configuration_in_progress" mutex to both basic and grid workers
  2. when a basic worker starts, it raises the "configuration_in_progress" mutex
  3. after spawning the process, the grid worker hangs on "configuration_in_progress"
  4. after the setup_configuration() method has finished, the given basic worker lowers the "configuration_in_progress" mutex, which frees the grid worker to proceed (and potentially spawn the next worker); see the sketch below
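
A minimal sketch of the idea, with a multiprocessing Event standing in for the proposed mutex and placeholder sleeps standing in for the real worker code:

import time
from multiprocessing import Process, Event

def basic_worker(configuration_done):
    # Stand-in for setup_configuration(), which may take longer than the old 3 s sleep.
    time.sleep(5)
    configuration_done.set()   # "lower the mutex": configuration is finished
    time.sleep(1)              # stand-in for the actual training / testing

if __name__ == '__main__':
    for _ in range(2):
        configuration_done = Event()
        worker = Process(target=basic_worker, args=(configuration_done,))
        worker.start()
        # The grid worker blocks here instead of sleeping for a fixed time,
        # and only then proceeds (and potentially spawns the next worker).
        configuration_done.wait()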

Rethink operation of grid-analyzer

Assuming we run:

  • a grid-trainer that results in a grid of trained models (let's say 6 experiments = 2 models x 3 problems)
  • then a grid-tester with two runs (with different random seeds, of course)

That results in 10 "experiments" that should potentially form the content of a single csv file - 12 rows?

The goal of this issue is to discuss what, how and when things should be copied, taking into account that:

  • we have the best_model.pt checkpoint, which indicates in which episode the model was created
  • we have two types of trainers, which save the model depending on the loss calculated on:
    • online-trainer: the partial validation set, activated every "partial_validation_interval" episodes
    • offline-trainer: the full validation set, calculated at the end of every epoch

Besides, the offline-trainer can optionally store partial validation statistics in a csv file, when one sets partial_validation_interval > 0 in the config's validation section.

As a result, different trainers will produce different statistics; how to deal with that diversity is the goal of this discussion/issue.

'ThalnetModule' object has no attribute 'logger'

Describe the bug
models/thalnet/thalnet_module.py is the source of the error, on line 131. Trying to log an error to the logger causes this.

To Reproduce
Steps to reproduce the behavior:
This happened as I was testing changes for #58 - I suspect that just feeding in incorrectly sized data, as described in the if statement on line 130, should trigger it.

Expected behavior
It should properly log to the logger.

Desktop (please complete the following information):

  • OS: Ubuntu
  • OS version 18.04
  • Python version 3.5
  • PyTorch version 0.4.1
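
A minimal illustration of the missing attribute and the obvious fix (this is a standalone sketch, not the actual ThalnetModule code):

import logging

class ModuleSketch:
    """Sketch: give the module its own logger so error paths can actually use it."""

    def __init__(self):
        self.logger = logging.getLogger(type(self).__name__)

    def check_input(self, input_ok):
        if not input_ok:
            # Without the attribute set in __init__, this line would raise
            # AttributeError: object has no attribute 'logger'.
            self.logger.error("Input does not match the expected dimensions.")

ModuleSketch().check_input(False)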

Clear up torchvision version

I had to use torchvision 0.2.0 to get the doc build to succeed for the first time, but it seems to be causing an error with Resize (which wasn't showing up with 0.2.1).

Investigate why two validation runs on the same model return slightly different statistics

Validation with a batch of size 1 works perfectly.

Validation with a batch of size 10 from time to time returns values that differ.

At first I thought the issue was related to the lack of weighted averaging when we are not dropping the last batch. Sadly, the issue remained even when dropping the last batch / limiting the size of the set to a single batch.

To Reproduce

mip-offline-trainer --c configs/vision/simplecnn_mnist.yaml

Validation problem section:

validation:
    problem:
        name: MNIST
        batch_size: 10
        use_train_data: True
        resize: [32, 32]
    sampler:
        name: SubsetRandomSampler
        indices: [55000, 55010]
    dataloader:
        drop_last: True

================================================================================
[2018-11-08 12:36:07] - INFO - OfflineTrainer >>> Validating over the entire validation set (10 samples in 1 episodes)
[2018-11-08 12:36:07] - INFO - OfflineTrainer >>> episode 000858; episodes_aggregated 000001; loss 0.0016851807; loss_min 0.0016851807; loss_max 0.0016851807; loss_std 0.0000000000; epoch 00; acc 1.0000000000; acc_min 1.0000000000; acc_max 1.0000000000; acc_std 0.0000000000; samples_aggregated 000010 [Full Validation]
[2018-11-08 12:36:07] - INFO - Model >>> Model and statistics exported to checkpoint ./experiments/MNIST/SimpleConvNet/20181108_123547/models/model_best.pt
[2018-11-08 12:36:07] - INFO - OfflineTrainer >>>

[2018-11-08 12:36:07] - INFO - OfflineTrainer >>> Training finished because Epoch limit reached
[2018-11-08 12:36:07] - INFO - OfflineTrainer >>> Validating over the entire validation set (10 samples in 1 episodes)
[2018-11-08 12:36:07] - INFO - OfflineTrainer >>> episode 000858; episodes_aggregated 000001; loss 0.0016851809; loss_min 0.0016851809; loss_max 0.0016851809; loss_std 0.0000000000; epoch 00; acc 1.0000000000; acc_min 1.0000000000; acc_max 1.0000000000; acc_std 0.0000000000; samples_aggregated 000010 [Full Validation]
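
For reference, the weighted-averaging idea mentioned above, sketched on plain numbers (the report notes this alone did not explain the discrepancy, but it is the first hypothesis that was ruled out):

# Hypothetical per-batch accuracies with a 25-sample set and batch_size=10 (last batch not dropped).
batch_accuracies = [0.9, 0.8, 1.0]
batch_sizes = [10, 10, 5]

# Naive mean over batches over-weights the smaller last batch:
naive_mean = sum(batch_accuracies) / len(batch_accuracies)                                    # 0.9
# Sample-weighted mean treats every sample equally:
weighted_mean = sum(a * n for a, n in zip(batch_accuracies, batch_sizes)) / sum(batch_sizes)  # 0.88

print(naive_mean, weighted_mean)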

Change the way --model is handled

As of now, the trainers have the option to load a pretrained model using the --model flag.
We haven't really used this feature so far, and the way it is implemented (as a flag) makes it hard to handle in the grid trainers, which currently do not handle it at all.

So I would like to handle this in the grid trainers, because it is useful to load a pretrained model for each experiment (e.g. to finetune a pre-trained model).
I'm thinking of either:

  • adding a corresponding --models flag to the grid trainers, which the user could then use to indicate the trained models they want to reuse; but this could be messy, as we would have to check whether all models are present, which experiments they are compatible with, etc.
  • moving --model in the trainers from being a flag to a config parameter that the user could specify in the config file. I know that this wouldn't be consistent with the tester, but I find this cleaner, and easier to handle in the grid trainers.

We can discuss that 🙂

Extract and add absolute path to nested config files

Describe the bug
Currently, all workers assume that they are executed from the mi-prometheus main directory.
Along with setup.py, we open up the possibility to execute the mip-* workers from any directory.

To Reproduce

tkornuta@tkornuta-MacBookPro:~/pytorch-env$ mip-onlinetrainer --c mi-prometheus/configs/maes_baselines/maes/maes_serial_recall.yaml
Info: Parsing the mi-prometheus/configs/maes_baselines/maes/maes_serial_recall.yaml configuration file
Info: Parsing the configs/maes_baselines/maes/default_maes.yaml configuration file
Error: Configuration file configs/maes_baselines/maes/default_maes.yaml does not exist

Expected behavior
Workers should search for the other configs relative to the first one.

Solution

Extract the absolute path of the main config, then navigate relative to it. The goal is to leave the paths in the default_config sections as they are, i.e. starting from the configs/ directory.
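
A minimal sketch of the proposed resolution (the helper name is hypothetical; the actual parameter-parsing code will differ):

import os

def resolve_nested_config(main_config_path, nested_config_path):
    """Anchor a nested 'configs/...' path at the same root as the main config file."""
    main_abs = os.path.abspath(main_config_path)
    # Everything up to (and excluding) the 'configs' directory of the main config.
    root = main_abs.rsplit(os.sep + 'configs' + os.sep, 1)[0]
    # The nested path stays exactly as written in the YAML, starting from 'configs/'.
    return os.path.join(root, nested_config_path)

print(resolve_nested_config(
    'mi-prometheus/configs/maes_baselines/maes/maes_serial_recall.yaml',
    'configs/maes_baselines/maes/default_maes.yaml'))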

Standardize params names across methods, problems etc.

There is inconsistent naming across params. For example, directories are sometimes called 'dir' and sometimes 'folder'. We should perhaps decide on a unified set of standard names for source and target folders etc., and then change the __init__ methods and configs to reflect it.
