
fancyimpute's Introduction



A variety of matrix completion and imputation algorithms implemented in Python 3.6.

To install:

pip install fancyimpute

If you run into tensorflow problems and use anaconda, you can try to fix them with conda install cudatoolkit.

Important Caveats

(1) This project is in "bare maintenance" mode. That means we are not planning on adding more imputation algorithms or features (but might if we get inspired). Please do report bugs, and we'll try to fix them. Also, we are happy to take pull requests for more algorithms and/or features.

(2) IterativeImputer started its life as a fancyimpute original, but was then merged into scikit-learn and we deleted it from fancyimpute in favor of the better-tested sklearn version. As a convenience, you can still from fancyimpute import IterativeImputer, but under the hood it's just doing from sklearn.impute import IterativeImputer. That means if you update scikit-learn in the future, you may also change the behavior of IterativeImputer.
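A quick sketch of the stub in action (the toy matrix here is my own example):

import numpy as np
from fancyimpute import IterativeImputer  # under the hood: sklearn.impute.IterativeImputer

# a small toy matrix with missing entries
X_incomplete = np.array([[1.0, 2.0, np.nan],
                         [3.0, np.nan, 6.0],
                         [np.nan, 8.0, 9.0],
                         [4.0, 5.0, 7.0]])
X_filled = IterativeImputer().fit_transform(X_incomplete)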

Usage

import numpy as np

from fancyimpute import KNN, NuclearNormMinimization, SoftImpute, BiScaler

# X is the complete data matrix
# X_incomplete has the same values as X except a subset have been replaced with NaN
missing_mask = np.isnan(X_incomplete)  # mask of the held-out entries, used to score below

# Use 3 nearest rows which have a feature to fill in each row's missing features
X_filled_knn = KNN(k=3).fit_transform(X_incomplete)

# matrix completion using convex optimization to find low-rank solution
# that still matches observed values. Slow!
X_filled_nnm = NuclearNormMinimization().fit_transform(X_incomplete)

# Instead of solving the nuclear norm objective directly,
# induce sparsity using singular value thresholding
biscaler = BiScaler()
X_incomplete_normalized = biscaler.fit_transform(X_incomplete)
X_filled_softimpute = biscaler.inverse_transform(
    SoftImpute().fit_transform(X_incomplete_normalized))

# print mean squared error for the imputation methods above
nnm_mse = ((X_filled_nnm[missing_mask] - X[missing_mask]) ** 2).mean()
print("Nuclear norm minimization MSE: %f" % nnm_mse)

softImpute_mse = ((X_filled_softimpute[missing_mask] - X[missing_mask]) ** 2).mean()
print("SoftImpute MSE: %f" % softImpute_mse)

knn_mse = ((X_filled_knn[missing_mask] - X[missing_mask]) ** 2).mean()
print("knnImpute MSE: %f" % knn_mse)

Algorithms

  • SimpleFill: Replaces missing entries with the mean or median of each column.

  • KNN: Nearest neighbor imputation which weights samples using the mean squared difference on features for which two rows both have observed data.

  • SoftImpute: Matrix completion by iterative soft thresholding of SVD decompositions. Inspired by the softImpute package for R, which is based on Spectral Regularization Algorithms for Learning Large Incomplete Matrices by Mazumder et al.

  • IterativeImputer: A strategy for imputing missing values by modeling each feature with missing values as a function of other features in a round-robin fashion. A stub that links to scikit-learn's IterativeImputer.

  • IterativeSVD: Matrix completion by iterative low-rank SVD decomposition. Should be similar to SVDimpute from Missing value estimation methods for DNA microarrays by Troyanskaya et al.

  • MatrixFactorization: Direct factorization of the incomplete matrix into low-rank U and V, with per-row and per-column biases, as well as a global bias. Solved by SGD in pure numpy.

  • NuclearNormMinimization: Simple implementation of Exact Matrix Completion via Convex Optimization by Emmanuel Candes and Benjamin Recht using cvxpy. Too slow for large matrices.

  • BiScaler: Iterative estimation of row/column means and standard deviations to get doubly normalized matrix. Not guaranteed to converge but works well in practice. Taken from Matrix Completion and Low-Rank SVD via Fast Alternating Least Squares.
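A minimal sketch of how a few of the simpler solvers above are invoked (same fit_transform interface as the Usage section; the parameter values are illustrative):

from fancyimpute import SimpleFill, IterativeSVD, MatrixFactorization

X_filled_mean = SimpleFill(fill_method="mean").fit_transform(X_incomplete)
X_filled_svd = IterativeSVD(rank=10).fit_transform(X_incomplete)
X_filled_mf = MatrixFactorization().fit_transform(X_incomplete)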

Citation

If you use fancyimpute in your academic publication, please cite it as follows:

@software{fancyimpute,
  author = {Alex Rubinsteyn and Sergey Feldman},
  title = {fancyimpute: An Imputation Library for Python},
  url = {https://github.com/iskandr/fancyimpute},
  version = {0.7.0},
  date = {2016},
}

fancyimpute's People

Contributors

a-ozbek, brettbj, chelseaz, drkarthi, fritshermans, iskandr, johongo, pylipp, sergeyf, ssandanshi, timodonnell


fancyimpute's Issues

[MICE] ValueError: Must have equal len keys and value when setting with an iterable

from fancyimpute import MICE
X_filled_mice = MICE().complete(X_incomplete)
[MICE] Completing matrix with shape (902, 368)
[MICE] Starting imputation round 1/110, elapsed time 0.009
....
[MICE] Starting imputation round 110/110, elapsed time 1079.970
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-77-b8ec27551960> in <module>()
      3 
      4 
----> 5 X_filled_mice = MICE().complete(X_incomplete)

C:\Users\__\Anaconda3\lib\site-packages\fancyimpute\mice.py in complete(self, X)
    335         # average the imputed values for each feature
    336         average_imputated_values = imputed_arrays.mean(axis=0)
--> 337         X_completed[missing_mask] = average_imputated_values
    338         return X_completed

C:\Users\__\Anaconda3\lib\site-packages\pandas\core\frame.py in __setitem__(self, key, value)
   2324 
   2325         if isinstance(key, (Series, np.ndarray, list, Index)):
-> 2326             self._setitem_array(key, value)
   2327         elif isinstance(key, DataFrame):
   2328             self._setitem_frame(key, value)

C:\Users\__\Anaconda3\lib\site-packages\pandas\core\frame.py in _setitem_array(self, key, value)
   2344             indexer = key.nonzero()[0]
   2345             self._check_setitem_copy()
-> 2346             self.loc._setitem_with_indexer(indexer, value)
   2347         else:
   2348             if isinstance(value, DataFrame):

C:\Users\__\Anaconda3\lib\site-packages\pandas\core\indexing.py in _setitem_with_indexer(self, indexer, value)
    577 
    578                     if len(labels) != len(value):
--> 579                         raise ValueError('Must have equal len keys and value '
    580                                          'when setting with an iterable')
    581 

ValueError: Must have equal len keys and value when setting with an iterable
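A likely workaround (my sketch, not from the original thread): the traceback shows the boolean-mask assignment going through pandas' DataFrame.__setitem__, so converting the input to a float ndarray before calling complete should avoid that code path.

import numpy as np
from fancyimpute import MICE

# assumes X_incomplete's columns are all numeric
X_filled_mice = MICE().complete(np.asarray(X_incomplete, dtype=float))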

Citing fancyimpute

I'm using fancyimpute in a conference submission. Is there a citation format you'd like followed? I didn't see a paper on a first check.

Issue with autoencoder

Hi

I would like to report an issue with the AutoEncoder imputer.

Platform: Python 3.5 via Anaconda. Keras with TensorFlow backend on Mac OS 10.11.6

Error message:

Hidden layer sizes: [2000, 500, 1504]
Traceback (most recent call last):

File "", line 1, in
X_filled_ae = fi.AutoEncoder().complete(M)

File "/Users//anaconda3/lib/python3.5/site-packages/fancyimpute/solver.py", line 207, in complete
imputations = self.multiple_imputations(X)

File "/Users//anaconda3/lib/python3.5/site-packages/fancyimpute/solver.py", line 199, in multiple_imputations
return [self.single_imputation(X) for _ in range(self.n_imputations)]

File "/Users//anaconda3/lib/python3.5/site-packages/fancyimpute/solver.py", line 199, in
return [self.single_imputation(X) for _ in range(self.n_imputations)]

File "/Users//anaconda3/lib/python3.5/site-packages/fancyimpute/solver.py", line 184, in single_imputation
X_result = self.solve(X_filled, missing_mask)

File "/Users//anaconda3/lib/python3.5/site-packages/fancyimpute/auto_encoder.py", line 140, in solve
self.network = self._create_fresh_network(n_features)

File "/Users//anaconda3/lib/python3.5/site-packages/fancyimpute/auto_encoder.py", line 91, in _create_fresh_network
optimizer=self.optimizer)

File "/Users//anaconda3/lib/python3.5/site-packages/fancyimpute/neuralnet_helpers.py", line 94, in make_network
nn.compile(optimizer=optimizer, loss=loss_function)

File "/Users//anaconda3/lib/python3.5/site-packages/keras/models.py", line 553, in compile
**kwargs)

File "/Users//anaconda3/lib/python3.5/site-packages/keras/engine/training.py", line 630, in compile
sample_weight, mask)

File "/Users//anaconda3/lib/python3.5/site-packages/keras/engine/training.py", line 332, in weighted
score_array = fn(y_true, y_pred)

File "/Users//anaconda3/lib/python3.5/site-packages/fancyimpute/neuralnet_helpers.py", line 26, in reconstruction_loss
X_values.name = "$X_values"

AttributeError: can't set attribute

Many thanks

How to use KNN with BiScaler?

According to the KNN code there is a parameter "normalizer", which it says can be an object like BiScaler. But when I do this:
data_clean = KNN(k=5, normalizer=BiScaler).complete(data)

the error below occurred:

File "/usr/local/lib/python2.7/site-packages/fancyimpute/solver.py", line 207, in complete
    imputations = self.multiple_imputations(X)
File "/usr/local/lib/python2.7/site-packages/fancyimpute/solver.py", line 199, in multiple_imputations
    return [self.single_imputation(X) for _ in range(self.n_imputations)]
File "/usr/local/lib/python2.7/site-packages/fancyimpute/solver.py", line 176, in single_imputation
    X = self.normalizer.fit_transform(X)
TypeError: unbound method fit_transform() must be called with BiScaler instance as first argument (got ndarray instance instead)

Categorical features

Greetings!

Is it possible to impute categorical features with fancyimpute?

Thanks in advance!

MICE: ValueError: scale < 0

    df = self.clean_data(df)
  File "/Users/Sanchezj/scipyenv/scipy2/ClinicalData.py", line 72, in clean_data
    df[numerical_features_names])
  File "/Users/Sanchezj/scipyenv/scipy2/ClinicalData.py", line 150, in estimate_by_mice
    res = mice.complete(np.asarray(df.values, dtype=float))
  File "/Users/Sanchezj/scipyenv/scipy2/lib/python3.6/site-packages/fancyimpute/mice.py", line 334, in complete
    imputed_arrays, missing_mask = self.multiple_imputations(X)
  File "/Users/Sanchezj/scipyenv/scipy2/lib/python3.6/site-packages/fancyimpute/mice.py", line 325, in multiple_imputations
    visit_indices=visit_indices)
  File "/Users/Sanchezj/scipyenv/scipy2/lib/python3.6/site-packages/fancyimpute/mice.py", line 235, in perform_imputation_round
    imputed_values = np.random.normal(mus, sigmas)
  File "mtrand.pyx", line 1656, in mtrand.RandomState.normal
ValueError: scale < 0

I use this function to complete the dataset:

def estimate_by_mice(df):
    df_estimated_var = df.copy()
    random.seed(129)
    mice = MICE()  # model=RandomForestClassifier(n_estimators=100))
    res = mice.complete(np.asarray(df.values, dtype=float))
    df_estimated_var.loc[:, df.columns] = res[:][:]
    return df_estimated_var

Then in this function in mice.py:

    def perform_imputation_round(
            self,
            X_filled,
            missing_mask,
            observed_mask,
            visit_indices):
        """
        Does one entire round-robin set of updates.
        """
        n_rows, n_cols = X_filled.shape

        if n_cols > self.n_nearest_columns:
            # make a correlation matrix between all the original columns,
            # excluding the constant ones
            correlation_matrix = np.corrcoef(X_filled, rowvar=0)
            abs_correlation_matrix = np.abs(correlation_matrix)

        n_missing_for_each_column = missing_mask.sum(axis=0)
        ordered_column_indices = np.arange(n_cols)

        for col_idx in visit_indices:
            # which rows are missing for this column
            missing_row_mask_for_this_col = missing_mask[:, col_idx]
            n_missing_for_this_col = n_missing_for_each_column[col_idx]
            if n_missing_for_this_col > 0:  # if we have any missing data at all
                observed_row_mask_for_this_col = observed_mask[:, col_idx]
                column_values = X_filled[:, col_idx]
                column_values_observed = column_values[observed_row_mask_for_this_col]

                if n_cols <= self.n_nearest_columns:
                    other_column_indices = np.concatenate([
                        ordered_column_indices[:col_idx],
                        ordered_column_indices[col_idx + 1:]
                    ])
                else:
                    # probability of column draw is proportional to absolute
                    # pearson correlation
                    p = abs_correlation_matrix[col_idx, :].copy()

                    # adding a small amount of weight to every bin to make sure
                    # every column has some small chance of being chosen
                    p += 0.0000001

                    # make the probability of choosing the current column
                    # zero
                    p[col_idx] = 0

                    p /= p.sum()
                    other_column_indices = np.random.choice(
                        ordered_column_indices,
                        self.n_nearest_columns,
                        replace=False,
                        p=p)
                X_other_cols = X_filled[:, other_column_indices]
                X_other_cols_observed = X_other_cols[observed_row_mask_for_this_col]
                brr = self.model
                brr.fit(
                    X_other_cols_observed,
                    column_values_observed,
                    inverse_covariance=None)

                # Now we choose the row method (PMM) or the column method.
                if self.impute_type == 'pmm':  # this is the PMM procedure
                    # predict values for missing values using random beta draw
                    X_missing = X_filled[
                        np.ix_(missing_row_mask_for_this_col, other_column_indices)]
                    col_preds_missing = brr.predict(X_missing, random_draw=True)
                    # predict values for observed values using best estimated beta
                    X_observed = X_filled[
                        np.ix_(observed_row_mask_for_this_col, other_column_indices)]
                    col_preds_observed = brr.predict(X_observed, random_draw=False)
                    # for each missing value, find its nearest neighbors in the observed values
                    D = np.abs(col_preds_missing[:, np.newaxis] - col_preds_observed)  # distances
                    # take top k neighbors
                    k = np.minimum(self.n_pmm_neighbors, len(col_preds_observed) - 1)
                    k_nearest_indices = np.argpartition(D, k, 1)[:, :k]  # <- bottleneck!
                    # pick one of the nearest neighbors at random! that's right!
                    imputed_indices = np.array([
                        np.random.choice(neighbor_index)
                        for neighbor_index in k_nearest_indices])
                    # set the missing values to be the values of the nearest
                    # neighbor in the output space
                    imputed_values = column_values_observed[imputed_indices]
                elif self.impute_type == 'col':
                    X_other_cols_missing = X_other_cols[missing_row_mask_for_this_col]
                    # predict values for missing values using posterior predictive draws
                    # see the end of this:
                    # https://www.cs.utah.edu/~fletcher/cs6957/lectures/BayesianLinearRegression.pdf
                    mus, sigmas_squared = brr.predict_dist(X_other_cols_missing)
                    # inplace sqrt of sigma_squared
                    sigmas = sigmas_squared
                    np.sqrt(sigmas_squared, out=sigmas)
                    imputed_values = np.random.normal(mus, sigmas)
                imputed_values = self.clip(imputed_values)
                X_filled[missing_row_mask_for_this_col, col_idx] = imputed_values
        return X_filled

In particular, this line:
imputed_values = np.random.normal(mus, sigmas)

gives a ValueError: scale < 0.

mus is a 1x308 ndarray of NaN values, and sigmas is an ndarray with the same shape.

What could be happening? Is this a problem with NumPy?
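A diagnostic sketch (my assumption about the cause, not a confirmed fix): all-NaN mus usually means the regression saw non-finite or degenerate inputs, so it is worth checking the matrix before calling MICE:

import numpy as np

X = np.asarray(df.values, dtype=float)  # df as in estimate_by_mice above
print("all-NaN columns:", np.where(np.all(np.isnan(X), axis=0))[0])
print("inf entries:", np.sum(np.isinf(X)))
print("constant observed columns:", np.where(np.nanstd(X, axis=0) == 0)[0])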

ValueError in MICE

I tried the following:

from fancyimpute import MICE
imputer = MICE()
imputed = imputer.complete(dummied)

The code crashed with the following stack trace:

imputed = imputer.complete(dummied)
[MICE] Completing matrix with shape (244796, 723)
Traceback (most recent call last):
  File "<ipython-input-50-db9415235e63>", line 1, in <module>
    imputed = imputer.complete(dummied)
  File "build/bdist.macosx-10.5-x86_64/egg/fancyimpute/mice.py", line 364, in complete
    imputed_arrays, missing_mask = self.multiple_imputations(X)
  File "build/bdist.macosx-10.5-x86_64/egg/fancyimpute/mice.py", line 314, in multiple_imputations
    self._check_missing_value_mask(missing_mask)
  File "build/bdist.macosx-10.5-x86_64/egg/fancyimpute/solver.py", line 54, in _check_missing_value_mask
    if not missing.any():
  File "/Users/zeeshan.sayyed/anaconda/lib/python2.7/site-packages/pandas/core/generic.py", line 887, in __nonzero__
    .format(self.__class__.__name__))
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

When I applied SoftImpute to the same matrix, it seemed to work.

Thanks

Many dependency packages not found

Many dependency packages, like climate and downhill, were not found. I couldn't install them with anaconda.
Though it is mentioned here for downhill, it gives the error PackageNotFoundError: Package not found: Conda could not find. My conda version is conda 4.3.14, and my Python version is 2.7.13 (Continuum Analytics, Inc.).

SoftImpute Batch Processing option on huge dataset

I am trying to use the SoftImpute library to replace missing entries in my dataset, but the dataset is too large to fit into memory. Does this library provide a batch processing option? If not, how should I use it on a huge dataset to replace missing attributes in records?

Reference(s) for the Autoencoder implementation

Hi,

All (or most) of the algorithms have reference papers associated with them. Can you provide some references for the AutoEncoder implementation?

Thank you for creating such a cool library!

MICE: imputing categorical data

It seems imputing categorical data (strings) is not supported by MICE().

I converted the categorical data to numerical by applying the factorize() method to ordinal data and OneHotEncoding() to nominal data. It then imputes values, but those are not discrete.

Is rounding those values to the nearest discrete unit a valid approach?
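Rounding can work for ordinal codes, but for one-hot blocks a common post-processing sketch (my suggestion, not a fancyimpute feature) is to snap each block to its argmax, so exactly one category stays active:

import numpy as np

def snap_one_hot(X_filled, onehot_cols):
    # replace the soft imputed values in a one-hot block with a hard argmax
    block = X_filled[:, onehot_cols]
    hard = np.zeros_like(block)
    hard[np.arange(block.shape[0]), block.argmax(axis=1)] = 1.0
    X_filled[:, onehot_cols] = hard
    return X_filled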

make dependencies optional

Hi,

thanks for the great module.

It would be great if the dependencies could be optional.

For example, if I just use KNN, I would prefer not to have keras and theano installed,

using a technique like:

pip install fancyimpute[knn]
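For reference, optional dependency groups are declared via extras_require in setup.py, which is what makes pip install fancyimpute[knn] possible. A hypothetical sketch (fancyimpute does not currently ship these extras, and the package lists are illustrative):

from setuptools import setup

setup(
    name="fancyimpute",
    version="0.0.0",                      # placeholder
    install_requires=["numpy", "scipy"],  # core dependencies only
    extras_require={
        "knn": ["knnimpute"],             # enables: pip install fancyimpute[knn]
        "deep": ["keras", "theano"],      # only needed for AutoEncoder
    },
)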

NameError: global name 'warnings' is not defined

Although I imported the warnings package, I keep getting this error while using KNN.
My python version is 2.7.12.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/taleb/Python/virtual_environment/local/lib/python2.7/site-packages/fancyimpute/solver.py", line 197, in complete
    imputations = self.multiple_imputations(X)
  File "/home/taleb/Python/virtual_environment/local/lib/python2.7/site-packages/fancyimpute/solver.py", line 189, in multiple_imputations
    return [self.single_imputation(X) for _ in range(self.n_imputations)]
  File "/home/taleb/Python/virtual_environment/local/lib/python2.7/site-packages/fancyimpute/solver.py", line 174, in single_imputation
    X_result = self.solve(X_filled, missing_mask)
  File "/home/taleb/Python/virtual_environment/local/lib/python2.7/site-packages/fancyimpute/knn.py", line 106, in solve
    if warnings:
NameError: global name 'warnings' is not defined

Move to hammerlab

Given that we'll be writing a paper that leans heavily on the work happening in this repo we should probably migrate it over to the lab's GitHub.

Issue in pip install

I installed fancyimpute from pip, but there seems to be some issue with it when I import it. When I cloned the repo and used setup.py to install it instead, the issue didn't exist.

from fancyimpute import KNN


ImportError Traceback (most recent call last)
in ()
----> 1 from fancyimpute import KNN

/Users/zeeshan.sayyed/anaconda/lib/python2.7/site-packages/fancyimpute/__init__.py in ()
4 from .nuclear_norm_minimization import NuclearNormMinimization
5 from .bayesian_ridge_regression import BayesianRidgeRegression
----> 6 from .mice import MICE
7 from .auto_encoder import AutoEncoder
8 from .matrix_factorization import MatrixFactorization

ImportError: No module named mice

Applying to new data

Is there a way to apply an imputation model to new data without reestimating the model?

In a supervised setting you don't want to retrain your whole model every time you see a new test instance.

Thanks.
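With the sklearn-backed IterativeImputer this is possible, since it separates fit from transform. A sketch (X_train and X_test are placeholders; the older fancyimpute solvers only expose fit_transform/complete and re-estimate every time):

from fancyimpute import IterativeImputer

imputer = IterativeImputer()
imputer.fit(X_train)                       # estimate the imputation model once
X_test_filled = imputer.transform(X_test)  # reuse it on new data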

RuntimeError: Numpy version during compilation is different from local numpy version

Hello,

I've installed fancyimpute on my local system using pip install fancyimpute. When I try to import the package in my jupyter notebook, I encounter the error below.


RuntimeError Traceback (most recent call last)
RuntimeError: module compiled against API version 0xb but this version of numpy is 0xa


ImportError Traceback (most recent call last)
D:\Anaconda3\envs\dlwin36\lib\site-packages\theano\gof\cutils.py in ()
305 try:
--> 306 from cutils_ext.cutils_ext import * # noqa
307 except ImportError:

ImportError: numpy.core.multiarray failed to import

During handling of the above exception, another exception occurred:

RuntimeError Traceback (most recent call last)
RuntimeError: module compiled against API version 0xb but this version of numpy is 0xa

Things that I've already tried:

  • Run conda update --all to update all anaconda packages to their latest version
  • Uninstall numpy and re-install it.

Neither workaround worked.

Any suggestions?

BiScaler doesn't conform to the scikit-learn API

I tried to stick BiScaler into a scikit-learn pipeline as follows:

biscaler = BiScaler()
model = Ridge()
pipeline = Pipeline([('biscaler',biscaler),('model',model)])

But it doesn't seem to work. I get the following error:

Traceback (most recent call last):

  File "<ipython-input-18-ecdf2c799537>", line 5, in <module>
    pipeline.fit(X[train], y[train])

  File "D:\Anaconda2\lib\site-packages\sklearn\pipeline.py", line 164, in fit
    Xt, fit_params = self._pre_transform(X, y, **fit_params)

  File "D:\Anaconda2\lib\site-packages\sklearn\pipeline.py", line 145, in _pre_transform
    Xt = transform.fit_transform(Xt, y, **fit_params_steps[name])

TypeError: fit_transform() takes exactly 2 arguments (3 given)

when trying this code:

n,d = X.shape
kfold = KFold(n=n,n_folds=5,shuffle=True)
errors = []
for k, (train, test) in enumerate(kfold):
    pipeline.fit(X[train], y[train])
    y_est = pipeline.predict(X[test])
    errors.append(np.mean(np.abs(y_est-y[test])))
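A thin adapter sketch (my workaround, not part of fancyimpute): sklearn pipelines call fit_transform(X, y, **fit_params), while BiScaler.fit_transform only accepts X, hence the "takes exactly 2 arguments (3 given)" error.

from sklearn.base import BaseEstimator, TransformerMixin
from fancyimpute import BiScaler

class BiScalerTransformer(BaseEstimator, TransformerMixin):
    """Wrap BiScaler so it tolerates the extra y argument pipelines pass."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # BiScaler only exposes fit_transform, so the row/column statistics
        # are re-estimated on every call
        return BiScaler().fit_transform(X)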

Can you give a more detailed demo with real data?

Hello, can you give a more detailed demo with real data?

The demo in the readme, and the tests, have no real data for KNN imputation.

Can you give a real data processing example, so that others may understand the package more deeply?
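In the meantime, a small self-contained sketch (synthetic rather than real data, but end to end): hide 20% of a low-rank matrix, impute with KNN, and score the result.

import numpy as np
from fancyimpute import KNN

rng = np.random.RandomState(0)
X = rng.randn(200, 10) @ rng.randn(10, 10)  # correlated columns
missing_mask = rng.rand(*X.shape) < 0.2     # hide 20% of the entries
X_incomplete = X.copy()
X_incomplete[missing_mask] = np.nan

X_filled = KNN(k=3).fit_transform(X_incomplete)
print("KNN MSE: %f" % ((X_filled[missing_mask] - X[missing_mask]) ** 2).mean())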

Import error on python 3.4

I am getting the following error triggered by:
import fancyimpute


Using Theano backend.
Traceback (most recent call last):
File "/usr/local/lib/python3.4/site-packages/theano/gof/lazylinker_c.py", line 74, in
raise ImportError()
ImportError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.4/site-packages/theano/gof/lazylinker_c.py", line 91, in
raise ImportError()
ImportError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.4/site-packages/theano/gof/vm.py", line 638, in
from . import lazylinker_c
File "/usr/local/lib/python3.4/site-packages/theano/gof/lazylinker_c.py", line 126, in
preargs=args)
File "/usr/local/lib/python3.4/site-packages/theano/gof/cmodule.py", line 2139, in compile_str
with open(cppfilename, 'w') as cppfile:
PermissionError: [Errno 13] Permission denied: '/home/jacob/.theano/compiledir_Linux-4.4--generic-x86_64-with-debian-stretch-sid-x86_64-3.4.1-64/lazylinker_ext/mod.cpp'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "runImputeData.py", line 15, in
from fancyimpute import KNN
File "", line 2237, in _find_and_load
File "", line 2226, in _find_and_load_unlocked
File "", line 1191, in _load_unlocked
File "", line 1161, in _load_backward_compatible
File "/usr/local/lib/python3.4/site-packages/fancyimpute-0.0.16-py3.4.egg/fancyimpute/init.py", line 7, in
File "", line 2237, in _find_and_load
File "", line 2226, in _find_and_load_unlocked
File "", line 1191, in _load_unlocked
File "", line 1161, in _load_backward_compatible
File "/usr/local/lib/python3.4/site-packages/fancyimpute-0.0.16-py3.4.egg/fancyimpute/auto_encoder.py", line 19, in
File "", line 2237, in _find_and_load
File "", line 2226, in _find_and_load_unlocked
File "", line 1191, in _load_unlocked
File "", line 1161, in _load_backward_compatible
File "/usr/local/lib/python3.4/site-packages/fancyimpute-0.0.16-py3.4.egg/fancyimpute/neuralnet_helpers.py", line 16, in
File "/usr/local/lib/python3.4/site-packages/Keras-1.0.3-py3.4.egg/keras/init.py", line 2, in
from . import backend
File "/usr/local/lib/python3.4/site-packages/Keras-1.0.3-py3.4.egg/keras/backend/init.py", line 51, in
from .theano_backend import *
File "/usr/local/lib/python3.4/site-packages/Keras-1.0.3-py3.4.egg/keras/backend/theano_backend.py", line 1, in
import theano
File "/usr/local/lib/python3.4/site-packages/theano/init.py", line 63, in
from theano.compile import (
File "/usr/local/lib/python3.4/site-packages/theano/compile/init.py", line 9, in
from theano.compile.function_module import *
File "/usr/local/lib/python3.4/site-packages/theano/compile/function_module.py", line 22, in
import theano.compile.mode
File "/usr/local/lib/python3.4/site-packages/theano/compile/mode.py", line 12, in
import theano.gof.vm
File "/usr/local/lib/python3.4/site-packages/theano/gof/vm.py", line 653, in
if x.fullname == 'linker'][0].default.startswith('cvm'), e

AssertionError: [Errno 13] Permission denied: '/home/jacob/.theano/compiledir_Linux-4.4--generic-x86_64-with-debian-stretch-sid-x86_64-3.4.1-64/lazylinker_ext/mod.cpp'

I can bypass it using sudo but then I get this:


Exception: Compilation failed (return status=1): /usr/bin/ld: /usr/local/lib/libpython3.4m.a(abstract.o): relocation R_X86_64_32S against `_Py_NotImplementedStruct' can not be used when making a shared object; recompile with -fPIC. /usr/local/lib/libpython3.4m.a: error adding symbols: Bad value. collect2: error: ld returned 1 exit status.

Any suggestion is welcomed.

TypeError in NuclearNormMinimization

Hello, in attempting to execute the nuclear norm example, I get a TypeError in the _create_objective function:

S = cvxpy.Variable(m, n, name="S")
TypeError: __init__() got multiple values for keyword argument 'name'

Typically this is a 'self' call error, but that doesn't appear to be the case here. Thoughts?
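A sketch of the likely cause: cvxpy 1.0 changed Variable to take a single shape tuple, so with the old two-positional-argument call, n lands in the name parameter and then collides with name="S".

import cvxpy

m, n = 10, 8
# pre-1.0 spelling was cvxpy.Variable(m, n, name="S")
S = cvxpy.Variable((m, n), name="S")  # post-1.0 spelling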

Covariance warning when running approximate MICE

@sergeyf I tried running MICE(approximate_but_fast_mode=True) on some synthetic data (200x200, ~50% missing). I got the following warning:

/Users/iskander/code/fancyimpute/fancyimpute/bayesian_ridge_regression.py:69: RuntimeWarning: covariance is not positive-semidefinite.
  return np.random.multivariate_normal(self.beta_estimate, covar, num_draws)

LinAlgError: Singular matrix in MICE

I am getting a LinAlgError: Singular matrix Error when trying to impute with MICE..

from fancyimpute import MICE
vals = df[cols].copy()
vals = (vals /  vals.max()).values #important - make the max value as 1.0
imps = MICE(n_imputations = 50, impute_type ="col", verbose=1, min_value=0.0, max_value=1.0).complete(vals)

> /usr/local/lib/python2.7/dist-packages/fancyimpute/mice.pyc in complete(self, X)
>     364             print("[MICE] Completing matrix with shape %s" % (X.shape,))
>     365         X_completed = X.copy()
> --> 366         imputed_arrays, missing_mask = self.multiple_imputations(X)
>     367         # average the imputed values for each feature
>     368         average_imputated_values = imputed_arrays.mean(axis=0)
> 
> /usr/local/lib/python2.7/dist-packages/fancyimpute/mice.pyc in multiple_imputations(self, X)
>     352                 missing_mask=missing_mask,
>     353                 observed_mask=observed_mask,
> --> 354                 visit_indices=visit_indices)
>     355             if m >= self.n_burn_in:
>     356                 results_list.append(X_filled[missing_mask])
> 
> /usr/local/lib/python2.7/dist-packages/fancyimpute/mice.pyc in perform_imputation_round(self, X_filled, missing_mask, observed_mask, visit_indices)
>     220                     X_other_cols_observed,
>     221                     column_values_observed,
> --> 222                     inverse_covariance=None)
>     223 
>     224                 # Now we choose the row method (PMM) or the column method.
> 
> /usr/local/lib/python2.7/dist-packages/fancyimpute/bayesian_ridge_regression.pyc in fit(self, X, y, inverse_covariance)
>      66                 # interpreter with a savings of allocated arrays.
>      67                 outer_product[i, i] += lambda_reg
> ---> 68             self.inverse_covariance = inv(outer_product)
>      69         else:
>      70             self.inverse_covariance = inverse_covariance
> 
> /usr/local/lib/python2.7/dist-packages/numpy/linalg/linalg.pyc in inv(a)
>     524     signature = 'D->D' if isComplexType(t) else 'd->d'
>     525     extobj = get_linalg_error_extobj(_raise_linalgerror_singular)
> --> 526     ainv = _umath_linalg.inv(a, signature=signature, extobj=extobj)
>     527     return wrap(ainv.astype(result_t, copy=False))
>     528 
> 
> /usr/local/lib/python2.7/dist-packages/numpy/linalg/linalg.pyc in _raise_linalgerror_singular(err, flag)
>      88 
>      89 def _raise_linalgerror_singular(err, flag):
> ---> 90     raise LinAlgError("Singular matrix")
>      91 
>      92 def _raise_linalgerror_nonposdef(err, flag):
> 
> LinAlgError: Singular matrix
> 

I don't run into a similar problem with imputation methods other than MICE on the same data.

Versions of libs in requirements.txt

Please pin the exact versions of the libraries in requirements.txt.

The latest version of cvxpy gives an error, as it has moved ahead of the version you used.

Approximate mode for MICE is only slightly faster

@sergeyf I tried running MICE on a random 1000x200 matrix with ~50% sparsity for 10 imputation rounds (and 10 burn-in rounds). With approximate mode it took about 33s on my Macbook and the full mode took ~50s. I suspect we should be able to get the approximate mode a lot faster than that!

Running again under a profiler, here are the top culprits:

 44001   10.532    0.000   10.532    0.000 {built-in method dot}
       20    5.201    0.260   21.036    1.052 MICE.py:178(perform_imputation_round)
     4000    1.340    0.000    1.843    0.000 MICE.py:111(_sub_inverse_covariance)
     4000    0.980    0.000    5.503    0.001 bayesian_ridge_regression.py:72(predict_dist)
    16604    0.589    0.000    0.589    0.000 {method 'reduce' of 'numpy.ufunc' objects}
     4000    0.390    0.000    2.791    0.001 MICE.py:137(_update_inverse_covariance)
     4000    0.328    0.000    0.343    0.000 numeric.py:1003(outer)
     4000    0.232    0.000    4.790    0.001 bayesian_ridge_regression.py:23(fit)
    12000    0.201    0.000    0.537    0.000 index_tricks.py:28(ix_)
     4000    0.194    0.000    0.273    0.000 {method 'normal' of 'mtrand.RandomState' objects}
    48009    0.173    0.000    0.173    0.000 {built-in method array}
     8000    0.150    0.000    0.150    0.000 {method 'nonzero' of 'numpy.ndarray' objects}
     4001    0.118    0.000    0.258    0.000 linalg.py:458(inv)

(unsurprisingly spending most of our time doing matrix multiplications)

MatrixFactorization method doesn't stop

Hi,

The MatrixFactorization method doesn't stop after almost 3000 iterations and extremely small improvements.

Here are the latest iterations:
I 2016-06-12 21:25:18 downhill.base:232 Adam 2832 loss=0.017913 error=0.016215 grad(V)=0.000000 grad(U)=0.000006
I 2016-06-12 21:25:21 downhill.base:232 Adam 2833 loss=0.017908 error=0.016215 grad(V)=0.000000 grad(U)=0.000006
I 2016-06-12 21:25:24 downhill.base:232 Adam 2834 loss=0.017904 error=0.016215 grad(V)=0.000000 grad(U)=0.000006
I 2016-06-12 21:25:28 downhill.base:232 Adam 2835 loss=0.017899 error=0.016215 grad(V)=0.000000 grad(U)=0.000006
I 2016-06-12 21:25:32 downhill.base:232 Adam 2836 loss=0.017894 error=0.016214 grad(V)=0.000000 grad(U)=0.000006
I 2016-06-12 21:25:35 downhill.base:232 Adam 2837 loss=0.017890 error=0.016214 grad(V)=0.000000 grad(U)=0.000006
I 2016-06-12 21:25:39 downhill.base:232 Adam 2838 loss=0.017885 error=0.016214 grad(V)=0.000000 grad(U)=0.000006
I 2016-06-12 21:25:42 downhill.base:232 Adam 2839 loss=0.017881 error=0.016214 grad(V)=0.000000 grad(U)=0.000006

All parameters are at their defaults except the following: l1_penalty=0.5, min_value=0.0, max_value=200

Isn't min_improvement=0.005 supposed to stop the process after such small improvements?

How can I stop it earlier?

Cannot install fancyimpute

64-bit Windows 10, Anaconda3, Python 3.

setup.py install gives me the error shown in the attached screenshot.

This is what I get from jupyter (second screenshot).

Pointing me in the right direction will be highly appreciated. Cheers.

RandomForest with MICE?

Is it possible to use a Random Forest model with MICE?
I tried to use RandomForestRegressor from scikit-learn in the MICE model, as below, but got an error.

import fancyimpute as fi
from sklearn.ensemble import RandomForestRegressor
# X is the incomplete data matrix that has some values as NaN that has to be imputed.
rf = RandomForestRegressor(n_estimators = 100, oob_score = True, random_state = 42)
X_filled_MICE = fi.MICE(model=rf).complete(X_incomplete) 

I get the following error

[MICE] Completing matrix with shape (1309, 1865)
[MICE] Starting imputation round 1/110, elapsed time 0.052
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-49-1f0dcbdff7c9> in <module>()
      1 # X is the incomplete data matrix that has some values as NaN that has to be imputed.
      2 rf = RandomForestRegressor(n_estimators = 100, oob_score = True, random_state = 42)
----> 3 X_filled_MICE = fi.MICE(model=rf).complete(X_incomplete)

E:\Anaconda3\lib\site-packages\fancyimpute\mice.py in complete(self, X)
    364             print("[MICE] Completing matrix with shape %s" % (X.shape,))
    365         X_completed = X.copy()
--> 366         imputed_arrays, missing_mask = self.multiple_imputations(X)
    367         # average the imputed values for each feature
    368         average_imputated_values = imputed_arrays.mean(axis=0)

E:\Anaconda3\lib\site-packages\fancyimpute\mice.py in multiple_imputations(self, X)
    352                 missing_mask=missing_mask,
    353                 observed_mask=observed_mask,
--> 354                 visit_indices=visit_indices)
    355             if m >= self.n_burn_in:
    356                 results_list.append(X_filled[missing_mask])

E:\Anaconda3\lib\site-packages\fancyimpute\mice.py in perform_imputation_round(self, X_filled, 
missing_mask, observed_mask, visit_indices)
    220                     X_other_cols_observed,
    221                     column_values_observed,
--> 222                     inverse_covariance=None)
    223 
    224                 # Now we choose the row method (PMM) or the column method.

TypeError: fit() got an unexpected keyword argument 'inverse_covariance'

I was not able to find documentation on this. Please let me know how I can use RandomForest with MICE.
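A sketch of a possible adapter (inferred from the traceback, not a documented feature): MICE calls model.fit(X, y, inverse_covariance=...), so a thin subclass that swallows that keyword lets a scikit-learn regressor slot in for the fit step. The prediction paths (random_draw for pmm, predict_dist for col) would need similar shims depending on impute_type.

from sklearn.ensemble import RandomForestRegressor

class RandomForestForMICE(RandomForestRegressor):
    def fit(self, X, y, inverse_covariance=None):
        # ignore the Bayesian-ridge-specific keyword that MICE passes in
        return super(RandomForestForMICE, self).fit(X, y)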

Installation issue in Windows (Python 2.7)

I'm trying to install fancyimpute in Windows 8.1, Anaconda 2.4.1, Python 2.7.12. But when I try to import it (both on terminal and Spyder environments), I get the error "No module named _multiprocess". I do have multiprocess installed on my system. Any help on this would be great. Please let me know if you need any more information.

Thanks.

Got TypeError when trying to impute a Pandas DataFrame

Hi everyone,

I'm trying to use MICE on a pandas dataframe:

from fancyimpute import MICE
MICE(n_imputations=np.infty).complete(df)

But got TypeError:

TypeErrorTraceback (most recent call last)
<ipython-input-17-c1a653d27988> in <module>()
      1 # Impute AGE_CLI with MICE
      2 from fancyimpute import MICE
----> 3 MICE(n_imputations=np.infty).complete(df)

/usr/local/lib/python3.4/dist-packages/fancyimpute/mice.py in complete(self, X)
    332             print("[MICE] Completing matrix with shape %s" % (X.shape,))
    333         X_completed = X.copy()
--> 334         imputed_arrays, missing_mask = self.multiple_imputations(X)
    335         # average the imputed values for each feature
    336         average_imputated_values = imputed_arrays.mean(axis=0)

/usr/local/lib/python3.4/dist-packages/fancyimpute/mice.py in multiple_imputations(self, X)
    293         X = np.asarray(X)
    294         self._check_input(X)
--> 295         missing_mask = np.isnan(X)
    296         self._check_missing_value_mask(missing_mask)
    297 

TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

After a brief search, I found that:

np.isnan can be applied to NumPy arrays of native dtype but raises TypeError when applied to object arrays

How can I use fancyimpute with a pandas DataFrame?
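The usual fix is a sketch like this (assuming every column is numeric): coerce the DataFrame to a float ndarray, since object-dtype columns are what make np.isnan raise, then rebuild the frame afterwards.

import pandas as pd
from fancyimpute import MICE

# assumes every column of df is numeric
X_filled = MICE().complete(df.values.astype(float))
df_filled = pd.DataFrame(X_filled, index=df.index, columns=df.columns)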

No option to limit the number of threads

Thanks for your work putting together all these methods!

I'm using NuclearNormMinimization on a shared computer, and it seems to default to using all of the threads on the machine. Is there any way to limit this? I imagine it has to do with CVXPY, but I read through their documentation and there's no mention of how to do that.

Maybe some of you all know where I would look to limit that? It would be nice if you could pass an n_threads argument to NuclearNormMinimization so my process plays nicely with others.
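A workaround sketch (an assumption about where the threads come from, not a documented NuclearNormMinimization option): the parallelism usually lives in the BLAS/OpenMP layers under numpy and the solver, and those respect these environment variables if set before the libraries are imported.

import os

# cap BLAS/OpenMP threads before numpy/cvxpy are imported
os.environ["OMP_NUM_THREADS"] = "4"
os.environ["OPENBLAS_NUM_THREADS"] = "4"
os.environ["MKL_NUM_THREADS"] = "4"

from fancyimpute import NuclearNormMinimization  # import after setting the vars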

Import error

I installed using pip, also cloned and used setup.py install but this error is still there:

ImportError                               Traceback (most recent call last)
<ipython-input-2-4d43d6d23aa2> in <module>()
     12 from sklearn.svm import SVC
     13 from xgboost import XGBClassifier
---> 14 from fancyimpute import MICE
     15 
     16 import matplotlib.pyplot as plt

build/bdist.linux-x86_64/egg/fancyimpute/__init__.py in <module>()

build/bdist.linux-x86_64/egg/fancyimpute/nuclear_norm_minimization.py in <module>()

/media/pindaari/Softwares/anaconda2/lib/python2.7/site-packages/cvxpy/__init__.py in <module>()
     16 
     17 __version__ = "0.4.11"
---> 18 from cvxpy.atoms import *
     19 from cvxpy.expressions.variables import (Variable, Semidef, Symmetric, Bool,
     20                                          Int, NonNegative)

/media/pindaari/Softwares/anaconda2/lib/python2.7/site-packages/cvxpy/atoms/__init__.py in <module>()
     15 """
     16 
---> 17 from cvxpy.atoms.affine_prod import affine_prod
     18 from cvxpy.atoms.geo_mean import geo_mean
     19 from cvxpy.atoms.harmonic_mean import harmonic_mean

/media/pindaari/Softwares/anaconda2/lib/python2.7/site-packages/cvxpy/atoms/affine_prod.py in <module>()
     15 """
     16 
---> 17 from cvxpy.atoms.atom import Atom
     18 import cvxpy.utilities as u
     19 import numpy as np

/media/pindaari/Softwares/anaconda2/lib/python2.7/site-packages/cvxpy/atoms/atom.py in <module>()
     16 
     17 
---> 18 from .. import utilities as u
     19 from .. import interface as intf
     20 from ..expressions.constants import Constant, CallbackParam

/media/pindaari/Softwares/anaconda2/lib/python2.7/site-packages/cvxpy/utilities/__init__.py in <module>()
     16 
     17 from .canonical import Canonical
---> 18 from . import grad
     19 from . import shape
     20 from . import sign

ImportError: cannot import name grad

Proposal to replace nuclear norm minimization with a keras version

https://arxiv.org/pdf/1710.09026v1.pdf

See Lemma 1. If we want to optimize something like:

|X - W|_F^2 + lambda * |W|_T,

where F is the Frobenius norm (taken only over known values), and T is the trace/nuclear norm, we can instead optimize:

|X - UV|_F^2 + (lambda / 2) * (|U|_F^2 + |V|_F^2).

I find this extremely weird, but there you have it. If W is sized m by n, then U is sized m by min(n, m), and V is sized min(n, m) by n. Once you're done training, you can compute the truncated SVD of W = UV.

One thing I don't get is how stage 2 actually works, but I don't think we would need it here. We could just do a single-layer neural network, optimize over U and V, and then take the truncated SVD of W = UV and see if it has good reconstruction properties?

@iskandr what do you think?
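For concreteness, a rough numpy sketch of the factorized surrogate (my illustration, not repo code): plain gradient descent on U and V for |P_Omega(X - UV)|_F^2 + (lambda/2) * (|U|_F^2 + |V|_F^2), with the data term restricted to observed entries.

import numpy as np

def factored_completion(X, lam=1.0, lr=1e-3, n_iters=2000):
    mask = ~np.isnan(X)                  # Omega: the observed entries
    X0 = np.where(mask, X, 0.0)
    m, n = X.shape
    k = min(m, n)
    rng = np.random.RandomState(0)
    U = 0.01 * rng.randn(m, k)
    V = 0.01 * rng.randn(k, n)
    for _ in range(n_iters):
        R = (U @ V - X0) * mask          # residual only where X is observed
        U_new = U - lr * (2 * R @ V.T + lam * U)
        V = V - lr * (2 * U.T @ R + lam * V)
        U = U_new
    return U @ V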

Rank and MAE are both increasing for SoftImpute

Starting from Iter 13 both rank and MAE are increasing.

[SoftImpute] Max Singular Value of X_init = 1345.871031
[SoftImpute] Iter 1: observed MAE=0.002085 rank=2000
[SoftImpute] Iter 2: observed MAE=0.002495 rank=2000
[SoftImpute] Iter 3: observed MAE=0.002674 rank=1994
[SoftImpute] Iter 4: observed MAE=0.002806 rank=1977
[SoftImpute] Iter 5: observed MAE=0.002879 rank=1959
[SoftImpute] Iter 6: observed MAE=0.002915 rank=1943
[SoftImpute] Iter 7: observed MAE=0.002932 rank=1924
[SoftImpute] Iter 8: observed MAE=0.002943 rank=1909
[SoftImpute] Iter 9: observed MAE=0.002951 rank=1899
[SoftImpute] Iter 10: observed MAE=0.002960 rank=1891
[SoftImpute] Iter 11: observed MAE=0.002969 rank=1887
[SoftImpute] Iter 12: observed MAE=0.002977 rank=1884
[SoftImpute] Iter 13: observed MAE=0.002984 rank=1884
[SoftImpute] Iter 14: observed MAE=0.002991 rank=1883
[SoftImpute] Iter 15: observed MAE=0.002998 rank=1884
[SoftImpute] Iter 16: observed MAE=0.003003 rank=1887
[SoftImpute] Iter 17: observed MAE=0.003008 rank=1889
[SoftImpute] Iter 18: observed MAE=0.003012 rank=1892
[SoftImpute] Iter 19: observed MAE=0.003015 rank=1896
[SoftImpute] Iter 20: observed MAE=0.003019 rank=1897
[SoftImpute] Iter 21: observed MAE=0.003022 rank=1900
[SoftImpute] Iter 22: observed MAE=0.003024 rank=1903
[SoftImpute] Iter 23: observed MAE=0.003027 rank=1906
[SoftImpute] Iter 24: observed MAE=0.003029 rank=1908
[SoftImpute] Iter 25: observed MAE=0.003031 rank=1911
[SoftImpute] Iter 26: observed MAE=0.003033 rank=1912
[SoftImpute] Iter 27: observed MAE=0.003035 rank=1915
[SoftImpute] Iter 28: observed MAE=0.003037 rank=1916
[SoftImpute] Iter 29: observed MAE=0.003038 rank=1917
[SoftImpute] Iter 30: observed MAE=0.003040 rank=1919
[SoftImpute] Iter 31: observed MAE=0.003042 rank=1922
[SoftImpute] Iter 32: observed MAE=0.003043 rank=1923
[SoftImpute] Iter 33: observed MAE=0.003045 rank=1924
[SoftImpute] Iter 34: observed MAE=0.003046 rank=1925
[SoftImpute] Iter 35: observed MAE=0.003047 rank=1926
[SoftImpute] Iter 36: observed MAE=0.003049 rank=1927
[SoftImpute] Iter 37: observed MAE=0.003050 rank=1928
[SoftImpute] Iter 38: observed MAE=0.003051 rank=1929
[SoftImpute] Iter 39: observed MAE=0.003052 rank=1930
[SoftImpute] Iter 40: observed MAE=0.003054 rank=1932
[SoftImpute] Iter 41: observed MAE=0.003055 rank=1933
[SoftImpute] Iter 42: observed MAE=0.003056 rank=1934
[SoftImpute] Iter 43: observed MAE=0.003057 rank=1934
[SoftImpute] Iter 44: observed MAE=0.003058 rank=1935
[SoftImpute] Iter 45: observed MAE=0.003059 rank=1937
[SoftImpute] Iter 46: observed MAE=0.003060 rank=1937

Used parameters:

fancyimpute.SoftImpute(shrinkage_value=2, init_fill_method="random",min_value = 0.0, max_value=200.0, max_iters=100)
