Hi, I tried to implement MCA on a dataframe with shape of (181115, 977).

MemoryError when data set is large about mca HOT 4 CLOSED

michelleowen commented on June 11, 2024

MemoryError when data set is large

Comments (4)

esafak commented on June 11, 2024

There is no set limit; it depends on your computer's memory. That's an interesting line for it to trip up at. I would have expected it to fail before it got there, if at all. Does your dataframe have categorical variables with high cardinality? Is subsampling an option? I can try it on my computer if you are allowed to share the data.
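The two checks suggested above can be sketched roughly as follows. This is only an illustration on a made-up dataframe: the column names, sizes, and sample size are assumptions, not taken from the reporter's data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical stand-in dataframe with three categorical columns.
df = pd.DataFrame({
    "a": rng.choice(["x", "y"], size=10_000),
    "b": rng.choice(["u", "v", "w"], size=10_000),
    "c": rng.integers(0, 500, size=10_000).astype(str),  # high-cardinality column
})

# Number of distinct categories per column, largest first.
cardinality = df.nunique().sort_values(ascending=False)
print(cardinality)

# One-hot encoding creates one indicator column per category.
n_dummy_columns = int(cardinality.sum())
print("indicator columns after one-hot encoding:", n_dummy_columns)

# Subsampling: run the analysis on a random subset of the rows.
sample = df.sample(n=1_000, random_state=0)
print(sample.shape)
```

High cardinality widens the indicator matrix, while subsampling reduces the number of rows.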

michelleowen commented on June 11, 2024

@esafak Sorry, I cannot share the data. I have already converted my categorical data to binary columns with OneHotEncoder.

GoingMyWay commented on June 11, 2024

@michelleowen Hi, have you solved this problem?

I get the same error. I wrote some demo code, and it reproduces the issue.

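import numpy as np
import pandas as pd
import prince
import tqdm

# Build 20 columns of 1,244,210 random categorical values drawn from {1, 2}.
_temp_data = []
for _ in tqdm.tqdm_notebook(range(20)):
    _temp_data.append(np.random.choice([1, 2], size=1244210))

# Transpose so that rows are observations and columns are variables.
_temp_df = pd.DataFrame(data=np.array(_temp_data).T, columns=range(20))

# With the prince version shown in the traceback below, this raises MemoryError.
mac_result = prince.MCA(_temp_df, n_components=2)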
---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
<ipython-input-85-38ea16b0891a> in <module>()
----> 1 mac_result = prince.MCA(_temp_df, n_components=2)

/home/libertatis/anaconda3/lib/python3.6/site-packages/prince/mca.py in __init__(self, dataframe, n_components, use_benzecri_rates, plotter)
     43             dataframe=pd.get_dummies(dataframe),
     44             n_components=n_components,
---> 45             plotter=plotter
     46         )
     47 

/home/libertatis/anaconda3/lib/python3.6/site-packages/prince/ca.py in __init__(self, dataframe, n_components, plotter)
     26         self._set_plotter(plotter_name=plotter)
     27 
---> 28         self._compute_svd()
     29 
     30     def _compute_svd(self):

/home/libertatis/anaconda3/lib/python3.6/site-packages/prince/ca.py in _compute_svd(self)
     29 
     30     def _compute_svd(self):
---> 31         self.svd = SVD(X=self.standardized_residuals, k=self.n_components)
     32 
     33     def _set_plotter(self, plotter_name):

/home/libertatis/anaconda3/lib/python3.6/site-packages/prince/ca.py in standardized_residuals(self)
    123         """
    124         residuals = (self.P - self.expected_frequencies).values
--> 125         return self.row_masses.dot(residuals).dot(self.column_masses)
    126 
    127     @property

/home/libertatis/anaconda3/lib/python3.6/site-packages/prince/ca.py in row_masses(self)
     99             represents the weight of the matching row; the non-diagonal cells are equal to 0.
    100         """
--> 101         return np.diag(1 / np.sqrt(self.row_sums))
    102 
    103     @property

/home/libertatis/anaconda3/lib/python3.6/site-packages/numpy/lib/twodim_base.py in diag(v, k)
    247     if len(s) == 1:
    248         n = s[0]+abs(k)
--> 249         res = zeros((n, n), v.dtype)
    250         if k >= 0:
    251             i = k

MemoryError:

In line 249 of numpy's twodim_base.py, n = 1244210, which is the number of rows in the data, so res = zeros((n, n), v.dtype) tries to allocate a dense 1244210 x 1244210 matrix. Assuming 8-byte floats, that is about 12.4 TB, far more than the memory of a typical machine. I am really stuck on this issue. PCA runs on the same data without error, but PCA is not suitable for categorical variables.
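As a rough check of the size involved, the arithmetic can be written out. This is a back-of-the-envelope sketch that assumes the diagonal entries are float64; the traceback does not show the dtype.

```python
import numpy as np

n = 1_244_210  # number of rows in the demo dataframe above
itemsize = np.dtype(np.float64).itemsize  # assumption: float64 entries (8 bytes)

dense_bytes = n * n * itemsize   # memory np.diag needs for the full n x n array
vector_bytes = n * itemsize      # memory for the n diagonal entries alone

print(f"dense diagonal matrix: {dense_bytes / 1e12:.1f} TB")
print(f"diagonal as a vector:  {vector_bytes / 1e6:.1f} MB")
```

Only n of the n x n entries are nonzero, so almost all of that allocation would hold zeros.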

MCA and PCA are similar algorithms that both perform dimensionality reduction, so I think there must be a better way to implement MCA that avoids this issue. However, I can't fix it myself because I am not an expert in this area.

esafak commented on June 11, 2024

The problem is the formation of the large diagonal matrices. To alleviate the problem we can either use a sparse representation, or avoid forming the matrices altogether using BLAS/LAPACK. I could not find a diagonal matrix multiplication routine to do the latter, so I went the sparse matrix route.
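For illustration, here is a minimal sketch of the two ideas on a small made-up indicator matrix. It is not the actual change made in the library: the variable names and data are assumptions, and the second option uses NumPy broadcasting as a stand-in for "avoid forming the matrices altogether".

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)

# Hypothetical stand-in for a one-hot indicator matrix: 3 binary variables,
# each expanded into two complementary indicator columns.
codes = rng.integers(0, 2, size=(200, 3))
X = np.hstack([codes, 1 - codes]).astype(float)

P = X / X.sum()                 # correspondence matrix
r = P.sum(axis=1)               # row masses
c = P.sum(axis=0)               # column masses
residuals = P - np.outer(r, c)  # observed minus expected frequencies

# Option 1: sparse diagonal matrices store only the diagonal entries.
D_r = sparse.diags(1.0 / np.sqrt(r))
D_c = sparse.diags(1.0 / np.sqrt(c))
S_sparse = D_r @ residuals @ D_c

# Option 2: no diagonal matrices at all; scale rows and columns by broadcasting.
S_broadcast = residuals / np.sqrt(r)[:, None] / np.sqrt(c)[None, :]

# Dense reference, affordable only because this example is small.
S_dense = np.diag(1.0 / np.sqrt(r)) @ residuals @ np.diag(1.0 / np.sqrt(c))

print(np.allclose(S_sparse, S_dense), np.allclose(S_broadcast, S_dense))
```

Both options need memory proportional to the number of rows for the scaling factors, rather than to its square.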
