
midas's People

Contributors

oracen, ranjitlall


midas's Issues

Demo crashes kernel

Hi there,

Thanks for making this! I'm having a problem running the demo (and other minimal examples): on generate_samples() and overimpute(), the kernel dies fairly quickly. I'm running macOS High Sierra, Anaconda 3, and TensorFlow 1.12.0. Any idea what's going on?

GPU utilization in AWS

Hi,

Once again, thanks for the effort.
I'm using the previous version of the library on AWS (p2.8xlarge) with a ~250GB dataset. All GPUs appear to be available:
Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:00:17.0, compute capability: 3.7)
2018-10-17 06:30:29.402721: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] Creating TensorFlow device (/device:GPU:1) -> (device: 1, name: Tesla K80, pci bus id: 0000:00:18.0, compute capability: 3.7)
2018-10-17 06:30:29.402734: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] Creating TensorFlow device (/device:GPU:2) -> (device: 2, name: Tesla K80, pci bus id: 0000:00:19.0, compute capability: 3.7)
2018-10-17 06:30:29.402745: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] Creating TensorFlow device (/device:GPU:3) -> (device: 3, name: Tesla K80, pci bus id: 0000:00:1a.0, compute capability: 3.7)
2018-10-17 06:30:29.402757: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] Creating TensorFlow device (/device:GPU:4) -> (device: 4, name: Tesla K80, pci bus id: 0000:00:1b.0, compute capability: 3.7)
2018-10-17 06:30:29.402768: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] Creating TensorFlow device (/device:GPU:5) -> (device: 5, name: Tesla K80, pci bus id: 0000:00:1c.0, compute capability: 3.7)
2018-10-17 06:30:29.402779: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] Creating TensorFlow device (/device:GPU:6) -> (device: 6, name: Tesla K80, pci bus id: 0000:00:1d.0, compute capability: 3.7)
2018-10-17 06:30:29.402801: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] Creating TensorFlow device (/device:GPU:7) -> (device: 7, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7)

However, when checking utilization, only one GPU is actually used:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.81 Driver Version: 384.81 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 On | 00000000:00:17.0 Off | 0 |
| N/A 77C P0 85W / 149W | 10931MiB / 11439MiB | 60% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K80 On | 00000000:00:18.0 Off | 0 |
| N/A 54C P0 69W / 149W | 10877MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla K80 On | 00000000:00:19.0 Off | 0 |
| N/A 78C P0 60W / 149W | 10877MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla K80 On | 00000000:00:1A.0 Off | 0 |
| N/A 57C P0 70W / 149W | 10875MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 4 Tesla K80 On | 00000000:00:1B.0 Off | 0 |
| N/A 74C P0 61W / 149W | 10875MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 5 Tesla K80 On | 00000000:00:1C.0 Off | 0 |
| N/A 56C P0 70W / 149W | 10875MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 6 Tesla K80 On | 00000000:00:1D.0 Off | 0 |
| N/A 77C P0 62W / 149W | 10873MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 7 Tesla K80 On | 00000000:00:1E.0 Off | 0 |
| N/A 59C P0 70W / 149W | 10871MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 2244 C /home/ubuntu/src/anaconda3/bin/python 10912MiB |
| 1 2244 C /home/ubuntu/src/anaconda3/bin/python 10858MiB |
| 2 2244 C /home/ubuntu/src/anaconda3/bin/python 10858MiB |
| 3 2244 C /home/ubuntu/src/anaconda3/bin/python 10856MiB |
| 4 2244 C /home/ubuntu/src/anaconda3/bin/python 10856MiB |
| 5 2244 C /home/ubuntu/src/anaconda3/bin/python 10856MiB |
| 6 2244 C /home/ubuntu/src/anaconda3/bin/python 10854MiB |
| 7 2244 C /home/ubuntu/src/anaconda3/bin/python 10854MiB |
+-----------------------------------------------------------------------------+

Is the library designed to utilize all available GPUs by default?
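For context, TensorFlow 1.x reserves memory on every visible GPU by default but, unless the graph is explicitly distributed, places ops on a single device, which matches the utilization pattern above. A minimal sketch of restricting visibility to one GPU via the standard CUDA environment variable (the device index "0" is an arbitrary example, not something the library prescribes):

```python
import os

# TensorFlow claims memory on all visible GPUs at startup, even though
# ops run on one device unless the graph is explicitly distributed.
# Setting this BEFORE importing tensorflow hides the unused devices;
# the index "0" here is an arbitrary example.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# import tensorflow as tf  # would now see only GPU 0
print(os.environ["CUDA_VISIBLE_DEVICES"])
```

This at least frees the memory on the other seven cards; actual multi-GPU training would require explicit device placement in the model code.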

main library versions:
tensorflow 1.4.0rc0
numpy 1.13.3
pandas 0.20.3 py36h6022372_2
Cuda compilation tools, release 9.0, V9.0.176

Thanks in advance.

Categorical variables and multiple imputation

For categorical variables, I understand we one-hot encode them and take the argmax as the imputation result.

With multiple iterations, numerical values are averaged and the resulting mean is taken as the model's prediction. What is the recommended way to do this for categorical variables? Should the plurality be taken as the final imputation?

Additionally, would it be valid to simply take a single iteration's imputation result as the model's prediction? Are there any bounds on the bias of the model as a function of the number of iterations?
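One common pooling rule for categorical variables is exactly the plurality (mode) vote mentioned above, taken across the m completed datasets. A minimal sketch with pandas, using made-up data rather than actual MIDAS output:

```python
import pandas as pd

# Hypothetical: three completed datasets from multiple imputation,
# each holding the same categorical column "color" on the same index.
completed = [
    pd.DataFrame({"color": ["red", "blue", "red"]}),
    pd.DataFrame({"color": ["red", "red", "green"]}),
    pd.DataFrame({"color": ["blue", "red", "red"]}),
]

# Stack the m draws per row side by side, then take the most frequent
# category (plurality vote) as the pooled categorical imputation.
stacked = pd.concat([df["color"] for df in completed], axis=1)
pooled = stacked.mode(axis=1)[0]
print(pooled.tolist())  # -> ['red', 'red', 'red']
```

Note that `DataFrame.mode` breaks ties by sort order, so with an even m you may want an explicit tie-breaking rule.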

Thanks!

Compatibility with Python 2.7

Greetings

In the midas class code, you use both yield and a non-empty return, for instance in the function def batch_yield_samples:

            yield output_df
    return  self

This seems to be a nice feature of Python 3, yet it raises an error with Python 2.7.13. Is it possible to have a workaround so that MIDAS could be used with Python 2.7.13? Indeed, the README suggests you are aiming for Python 2.7 compatibility.
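A workaround that runs on both 2.7 and 3.x (a sketch with illustrative names, not the actual MIDAS internals) is to move the yielding into a helper generator, so the fluent return self lives in a plain method:

```python
def _yield_samples(items):
    # Plain generator: no 'return <value>' inside, so it is legal
    # in both Python 2 and Python 3.
    for item in items:
        yield item

class Model(object):
    def __init__(self, items):
        self.items = items

    def batch_yield_samples(self):
        # The fluent 'return self' now sits outside the generator.
        self.samples = _yield_samples(self.items)
        return self

m = Model([1, 2, 3]).batch_yield_samples()
print(list(m.samples))  # -> [1, 2, 3]
```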

In addition, the copy method is not defined for lists in Python 2.7. There the workaround is very easy: import copy and use copy.copy(some_list) instead of some_list.copy().

Another problem: the softmax loss computation returns 0 with Python 2.7, likely because an int is divided by an int, which floor-divides to the integer quotient (0 in this case). In Python 3.6, the same expression returns a float via true division.
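The division issue can be fixed portably with a __future__ import, which gives / true-division semantics in Python 2 as well:

```python
from __future__ import division  # no-op on Python 3

# In Python 2 without the import, 1 / 2 == 0 (integer quotient);
# with it, '/' is true division and '//' is floor division.
print(1 / 2)   # -> 0.5
print(1 // 2)  # -> 0
```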

Best.

PS: Here is the error you get with Python 2.7.
File "midas.py", line 853
return self
SyntaxError: 'return' with argument inside generator

https://stackoverflow.com/questions/15809296/python-syntaxerror-return-with-argument-inside-generator?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa

Do you have an input pipeline example

I am going through your documentation but am still a little confused about how I would implement the input pipeline functionality to stream in data. Say you have data stored in a large CSV that has been scaled as you have specified; how would you go about creating the input pipeline function? Do you have an example one can use?

Research paper

Hi, could you please mention the research paper that the program is based on?

Issue in running demo

Hi,
While running the demo in a Jupyter notebook, I ran into the following issue while executing this code block:

imputer.batch_generate_samples(m=5)

print("Original value:", original_value)
imputed_vals = []
for dataset in imputer.output_list:
    imputed_vals.append(pd.DataFrame(scaler.inverse_transform(dataset),
                                     columns=dataset.columns).iloc[50, 0])
print("Imputed values:")
print(imputed_vals)
print("Imputation mean:", np.mean(imputed_vals))
print("Standard deviation of the imputation mean:", np.std(imputed_vals))

Can you please help me fix this? Thanks.


INFO:tensorflow:Restoring parameters from tmp/MIDAS
Model restored.

InvalidIndexError Traceback (most recent call last)
in ()
----> 1 imputer.batch_generate_samples(m= 5)
2
3 print("Original value:", original_value)
4 imputed_vals = []
5 for dataset in imputer.output_list:

~/git/PycharmProjects/untitled/midas.py in batch_generate_samples(self, m, b_size, verbose)
910 columns= self.imputation_target.columns)
911 output_df = self.imputation_target.copy()
--> 912 output_df[np.invert(self.na_matrix.values)] = y_out[np.invert(self.na_matrix.values)]
913 self.output_list.append(output_df)
914 return self

~/git/PycharmProjects/untitled/venv/lib/python3.5/site-packages/pandas/core/frame.py in setitem(self, key, value)
3109
3110 if isinstance(key, DataFrame) or getattr(key, 'ndim', None) == 2:
-> 3111 self._setitem_frame(key, value)
3112 elif isinstance(key, (Series, np.ndarray, list, Index)):
3113 self._setitem_array(key, value)

~/git/PycharmProjects/untitled/venv/lib/python3.5/site-packages/pandas/core/frame.py in _setitem_frame(self, key, value)
3158 self._check_inplace_setting(value)
3159 self._check_setitem_copy()
-> 3160 self._where(-key, value, inplace=True)
3161
3162 def _ensure_valid_index(self, value):

~/git/PycharmProjects/untitled/venv/lib/python3.5/site-packages/pandas/core/generic.py in _where(self, cond, other, inplace, axis, level, errors, try_cast)
7543 not all(other._get_axis(i).equals(ax)
7544 for i, ax in enumerate(self.axes))):
-> 7545 raise InvalidIndexError
7546
7547 # slice me out of the other

InvalidIndexError:

Impute unseen/test data

Hello,
thank you for the great work.

I'm trying to impute missing values on data different from the training set, by initializing a new imputer object like so:

imputer_test = Midas(layer_structure=[128,128,128], vae_layer=False, seed=908)
imputer_test.build_model(X_test, categorical_columns=feature_cols)

Then I call imputer.generate_samples() to impute the missing data in the test set. However, when the test set consists of fewer than 100 samples, there are still NaNs in .output_list.
Is there a theoretical minimum input DataFrame size (sorry, I'm not familiar with how autoencoders work)?

Thanks!

Using MIDAS with Times series or sequence data

Greetings,

I would like to know whether it is possible to impute time series or multilevel (especially two-level) data with MIDAS.
If not, do you plan to extend your algorithm to that setting? The micemd R package does this kind of imputation; see Audigier, V. et al. (2017), arXiv:1702.00971. Some LSTM-based encoders also try to cope with time series.
Best.
