
midas's People

Contributors

oracen, ranjitlall


midas's Issues

Demo crashes kernel

Hi there,

Thanks for making this! I'm having a problem running the demo (and other minimal examples): on generate_samples() and overimpute(), the kernel dies fairly quickly. I'm running macOS High Sierra, Anaconda 3, and TensorFlow 1.12.0. Any idea what's going on?

GPU utilization in AWS

Hi,

Once again, thanks for the effort.
I'm using the previous version of the library on AWS (p2.8xlarge) with a ~250GB dataset. All GPUs appear to be available:
Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:00:17.0, compute capability: 3.7)
2018-10-17 06:30:29.402721: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] Creating TensorFlow device (/device:GPU:1) -> (device: 1, name: Tesla K80, pci bus id: 0000:00:18.0, compute capability: 3.7)
2018-10-17 06:30:29.402734: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] Creating TensorFlow device (/device:GPU:2) -> (device: 2, name: Tesla K80, pci bus id: 0000:00:19.0, compute capability: 3.7)
2018-10-17 06:30:29.402745: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] Creating TensorFlow device (/device:GPU:3) -> (device: 3, name: Tesla K80, pci bus id: 0000:00:1a.0, compute capability: 3.7)
2018-10-17 06:30:29.402757: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] Creating TensorFlow device (/device:GPU:4) -> (device: 4, name: Tesla K80, pci bus id: 0000:00:1b.0, compute capability: 3.7)
2018-10-17 06:30:29.402768: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] Creating TensorFlow device (/device:GPU:5) -> (device: 5, name: Tesla K80, pci bus id: 0000:00:1c.0, compute capability: 3.7)
2018-10-17 06:30:29.402779: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] Creating TensorFlow device (/device:GPU:6) -> (device: 6, name: Tesla K80, pci bus id: 0000:00:1d.0, compute capability: 3.7)
2018-10-17 06:30:29.402801: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] Creating TensorFlow device (/device:GPU:7) -> (device: 7, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7)

However, when checking utilization, only one GPU is actually used:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.81 Driver Version: 384.81 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 On | 00000000:00:17.0 Off | 0 |
| N/A 77C P0 85W / 149W | 10931MiB / 11439MiB | 60% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K80 On | 00000000:00:18.0 Off | 0 |
| N/A 54C P0 69W / 149W | 10877MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla K80 On | 00000000:00:19.0 Off | 0 |
| N/A 78C P0 60W / 149W | 10877MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla K80 On | 00000000:00:1A.0 Off | 0 |
| N/A 57C P0 70W / 149W | 10875MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 4 Tesla K80 On | 00000000:00:1B.0 Off | 0 |
| N/A 74C P0 61W / 149W | 10875MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 5 Tesla K80 On | 00000000:00:1C.0 Off | 0 |
| N/A 56C P0 70W / 149W | 10875MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 6 Tesla K80 On | 00000000:00:1D.0 Off | 0 |
| N/A 77C P0 62W / 149W | 10873MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 7 Tesla K80 On | 00000000:00:1E.0 Off | 0 |
| N/A 59C P0 70W / 149W | 10871MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 2244 C /home/ubuntu/src/anaconda3/bin/python 10912MiB |
| 1 2244 C /home/ubuntu/src/anaconda3/bin/python 10858MiB |
| 2 2244 C /home/ubuntu/src/anaconda3/bin/python 10858MiB |
| 3 2244 C /home/ubuntu/src/anaconda3/bin/python 10856MiB |
| 4 2244 C /home/ubuntu/src/anaconda3/bin/python 10856MiB |
| 5 2244 C /home/ubuntu/src/anaconda3/bin/python 10856MiB |
| 6 2244 C /home/ubuntu/src/anaconda3/bin/python 10854MiB |
| 7 2244 C /home/ubuntu/src/anaconda3/bin/python 10854MiB |
+-----------------------------------------------------------------------------+

Is the library designed to utilize all available GPUs by default?
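For context, TensorFlow 1.x reserves memory on every visible GPU by default but, unless the graph is explicitly distributed, places ops on a single device, which matches the utilization pattern above. A minimal sketch of restricting visibility to one GPU via the standard CUDA environment variable (the device index "0" is an arbitrary example, not something the library prescribes):

```python
import os

# TensorFlow claims memory on all visible GPUs at startup, even though
# ops run on one device unless the graph is explicitly distributed.
# Setting this BEFORE importing tensorflow hides the unused devices;
# the index "0" here is an arbitrary example.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# import tensorflow as tf  # would now see only GPU 0
print(os.environ["CUDA_VISIBLE_DEVICES"])
```

This at least frees the memory on the other seven cards; actual multi-GPU training would require explicit device placement in the model code.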

main library versions:
tensorflow 1.4.0rc0
numpy 1.13.3
pandas 0.20.3 py36h6022372_2
Cuda compilation tools, release 9.0, V9.0.176

Thanks in advance.

Categorical variables and multiple imputation

For categorical variables, I understand we one-hot encode them and take the argmax as the imputation result.

With multiple iterations, numerical values are averaged and the resulting mean is taken as the model's prediction. What is the recommended way to do this for categorical variables? Should the plurality be taken as the final imputation?

Additionally, would it be valid to simply take a single iteration's imputation result as the model's prediction? Are there any bounds on the bias of the model as a function of the number of iterations?
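One common pooling rule for categorical variables is exactly the plurality (mode) vote mentioned above, taken across the m completed datasets. A minimal sketch with pandas, using made-up data rather than actual MIDAS output:

```python
import pandas as pd

# Hypothetical: three completed datasets from multiple imputation,
# each holding the same categorical column "color" on the same index.
completed = [
    pd.DataFrame({"color": ["red", "blue", "red"]}),
    pd.DataFrame({"color": ["red", "red", "green"]}),
    pd.DataFrame({"color": ["blue", "red", "red"]}),
]

# Stack the m draws per row side by side, then take the most frequent
# category (plurality vote) as the pooled categorical imputation.
stacked = pd.concat([df["color"] for df in completed], axis=1)
pooled = stacked.mode(axis=1)[0]
print(pooled.tolist())  # -> ['red', 'red', 'red']
```

Note that `DataFrame.mode` breaks ties by sort order, so with an even m you may want an explicit tie-breaking rule.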

Thanks!

Compatibility with Python 2.7

Greetings

In the midas class code, you use both yield and a non-empty return, for instance in the function def batch_yield_samples:

            yield output_df
    return  self

This seems to be a nice feature of Python 3, yet it raises an error with Python 2.7.13. Is it possible to have a workaround so that MIDAS could be used with Python 2.7.13? Indeed, the README suggests you are aiming for Python 2.7 compatibility.
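A workaround that runs on both 2.7 and 3.x (a sketch with illustrative names, not the actual MIDAS internals) is to move the yielding into a helper generator, so the fluent return self lives in a plain method:

```python
def _yield_samples(items):
    # Plain generator: no 'return <value>' inside, so it is legal
    # in both Python 2 and Python 3.
    for item in items:
        yield item

class Model(object):
    def __init__(self, items):
        self.items = items

    def batch_yield_samples(self):
        # The fluent 'return self' now sits outside the generator.
        self.samples = _yield_samples(self.items)
        return self

m = Model([1, 2, 3]).batch_yield_samples()
print(list(m.samples))  # -> [1, 2, 3]
```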

In addition, the copy method is not defined for lists in Python 2.7. There the workaround is very easy: import copy and use copy.copy(some_list) instead of some_list.copy().

Another problem: the softmax loss computation returns 0 with Python 2.7, likely because an int is divided by an int, which floor-divides to the integer quotient (0 in this case). In Python 3.6, the same expression returns a float via true division.
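The division issue can be fixed portably with a __future__ import, which gives / true-division semantics in Python 2 as well:

```python
from __future__ import division  # no-op on Python 3

# In Python 2 without the import, 1 / 2 == 0 (integer quotient);
# with it, '/' is true division and '//' is floor division.
print(1 / 2)   # -> 0.5
print(1 // 2)  # -> 0
```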

Best.

PS: Here is the error you get with Python 2.7.
File "midas.py", line 853
return self
SyntaxError: 'return' with argument inside generator

https://stackoverflow.com/questions/15809296/python-syntaxerror-return-with-argument-inside-generator?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa

Do you have an input pipeline example

I am going through your documentation but am still a little confused about how I would implement the input pipeline functionality to stream in data. Say you have data stored in a large CSV that has been scaled as you have specified; how would you go about creating the input pipeline function? Do you have an example one can use?

Research paper

Hi, could you please mention the research paper that the program is based on?

Issue in running demo

Hi,
While running the demo in a Jupyter notebook, I ran into the following issue while executing this code block:

imputer.batch_generate_samples(m=5)

print("Original value:", original_value)
imputed_vals = []
for dataset in imputer.output_list:
    imputed_vals.append(pd.DataFrame(scaler.inverse_transform(dataset),
                                     columns=dataset.columns).iloc[50, 0])
print("Imputed values:")
print(imputed_vals)
print("Imputation mean:", np.mean(imputed_vals))
print("Standard deviation of the imputation mean:", np.std(imputed_vals))

Can you please help me fix this? Thanks.


INFO:tensorflow:Restoring parameters from tmp/MIDAS
Model restored.

InvalidIndexError Traceback (most recent call last)
in ()
----> 1 imputer.batch_generate_samples(m= 5)
2
3 print("Original value:", original_value)
4 imputed_vals = []
5 for dataset in imputer.output_list:

~/git/PycharmProjects/untitled/midas.py in batch_generate_samples(self, m, b_size, verbose)
910 columns= self.imputation_target.columns)
911 output_df = self.imputation_target.copy()
--> 912 output_df[np.invert(self.na_matrix.values)] = y_out[np.invert(self.na_matrix.values)]
913 self.output_list.append(output_df)
914 return self

~/git/PycharmProjects/untitled/venv/lib/python3.5/site-packages/pandas/core/frame.py in setitem(self, key, value)
3109
3110 if isinstance(key, DataFrame) or getattr(key, 'ndim', None) == 2:
-> 3111 self._setitem_frame(key, value)
3112 elif isinstance(key, (Series, np.ndarray, list, Index)):
3113 self._setitem_array(key, value)

~/git/PycharmProjects/untitled/venv/lib/python3.5/site-packages/pandas/core/frame.py in _setitem_frame(self, key, value)
3158 self._check_inplace_setting(value)
3159 self._check_setitem_copy()
-> 3160 self._where(-key, value, inplace=True)
3161
3162 def _ensure_valid_index(self, value):

~/git/PycharmProjects/untitled/venv/lib/python3.5/site-packages/pandas/core/generic.py in _where(self, cond, other, inplace, axis, level, errors, try_cast)
7543 not all(other._get_axis(i).equals(ax)
7544 for i, ax in enumerate(self.axes))):
-> 7545 raise InvalidIndexError
7546
7547 # slice me out of the other

InvalidIndexError:

Impute unseen/test data

Hello,
thank you for the great work.

I'm trying to impute missing values on data different from the training set, by initializing a new imputer object like so:

imputer_test = Midas(layer_structure=[128,128,128], vae_layer=False, seed=908)
imputer_test.build_model(X_test, categorical_columns=feature_cols)

Then I call imputer.generate_samples() to impute the missing data in the test set. However, when the test set consists of fewer than 100 samples, there are still NaNs in .output_list.
Is there a theoretical minimum input DataFrame size (sorry, I'm not familiar with how autoencoders work)?

Thanks!

Using MIDAS with Times series or sequence data

Greetings,

I would like to know whether it is possible to impute time series or multilevel (especially two-level) data with MIDAS.
If not, do you plan to extend your algorithm to that setting? The micemd R package does this kind of imputation; see Audigier, V. et al. (2017), arXiv:1702.00971. Some LSTM-based encoders also try to cope with time series.
Best.
