esafak / mca
Multiple correspondence analysis
License: BSD 3-Clause "New" or "Revised" License
Hello and many thanks for this module!
I'd like to get the MCA components on new, unseen data (a test set) and was going to use fs_r_sup() to do so. To verify that I would get something reasonable, I ran fs_r_sup() on the training set, expecting the same result as fs_r().
However, the result is in fact a scaled version of fs_r(): each column is multiplied by a factor, and I can't figure out where it comes from or whether I should expect this. I can reproduce this in your burgundies notebook example, where X is the original data matrix:
Input:
mca_ben.fs_r(N=3)
Output
array([[ 0.8617, 0.0786, -0.0213],
[-0.7130, -0.1571, -0.0192],
[-0.9221, 0.0786, -0.0051],
[-0.8617, 0.0786, 0.0213],
[ 0.9221, 0.0786, 0.0051],
[ 0.7130, -0.1571, 0.0192]])
Input:
mca_ben.fs_r_sup(X,N=3)
Output:
array([[ 0.9510, 0.3162, -0.4301],
[-0.7870, -0.6325, -0.3871],
[-1.0177, 0.3162, -0.1026],
[-0.9510, 0.3162, 0.4301],
[ 1.0177, 0.3162, 0.1026],
[ 0.7870, -0.6325, 0.3871]])
Equivalent to:
Input:
mca_ind.fs_r_sup(X,N=3)
Output:
array([[ 0.9510, 0.3162, -0.4301],
[-0.7870, -0.6325, -0.3871],
[-1.0177, 0.3162, -0.1026],
[-0.9510, 0.3162, 0.4301],
[ 1.0177, 0.3162, 0.1026],
[ 0.7870, -0.6325, 0.3871]])
Any help appreciated!
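For anyone comparing the two outputs above: a quick check on the arrays shows that each column of fs_r_sup() is the corresponding column of fs_r() multiplied by a single constant, so the difference is indeed a per-column rescaling (presumably related to the singular values, though that is my assumption). This sketch just verifies the constant-ratio observation using the numbers reported in the issue:

```python
import numpy as np

# Outputs copied verbatim from the issue above.
fs_r = np.array([[ 0.8617,  0.0786, -0.0213],
                 [-0.7130, -0.1571, -0.0192],
                 [-0.9221,  0.0786, -0.0051],
                 [-0.8617,  0.0786,  0.0213],
                 [ 0.9221,  0.0786,  0.0051],
                 [ 0.7130, -0.1571,  0.0192]])
fs_r_sup = np.array([[ 0.9510,  0.3162, -0.4301],
                     [-0.7870, -0.6325, -0.3871],
                     [-1.0177,  0.3162, -0.1026],
                     [-0.9510,  0.3162,  0.4301],
                     [ 1.0177,  0.3162,  0.1026],
                     [ 0.7870, -0.6325,  0.3871]])

# Element-wise ratio: if it is constant down each column, the
# supplementary scores are just a per-column rescaling of fs_r.
ratios = fs_r_sup / fs_r
print(ratios.round(2))  # each column is (nearly) constant
```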
Hi, I just want to let you know that installing mca from pip gets you an outdated version that has no "expl_var" method. I manually downloaded your program from GitHub, moved it inside the site-packages folder, and now I have the up-to-date version working.
Thank you very much for your program; I'm writing a program for my PhD and this is really useful.
Cheers
I am trying MCA with a bag-of-words dataset, using the following code:
ca = mca.MCA(pd.DataFrame(X_train))
new_x_train = ca.fs_r_sup(pd.DataFrame(X_train), 2000)
new_x_test = ca.fs_r_sup(pd.DataFrame(X_test), 2000)
print(new_x_train.shape)
where
X_train.shape = (1400, 6906)
X_test.shape = (700, 6906)
The output I am getting is
new_x_train.shape = (1400, 1299)
but the output should be (1400, 2000).
Is it right to use MCA for dimension reduction, with high dimensional data sets like bag of words?
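One possible explanation (an assumption on my part, not something confirmed in this thread): the number of factors that can be extracted is capped by the rank of the indicator matrix, and a one-hot / bag-of-words matrix typically has rank well below its column count, so asking for 2000 dimensions can still return fewer. A numpy-only sketch of the general principle:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy one-hot matrix: 100 samples, 10 categorical variables,
# each expanded into 5 dummy columns (50 columns total).
labels = rng.integers(0, 5, size=(100, 10))
one_hot = np.concatenate([np.eye(5)[labels[:, j]] for j in range(10)], axis=1)

# Each variable's 5 dummies sum to 1, so every block loses a
# degree of freedom and the matrix cannot have full column rank.
rank = np.linalg.matrix_rank(one_hot)
print(one_hot.shape, rank)  # rank is strictly less than 50
```

The same effect, on a much larger scale, would cap the 6906-column bag-of-words matrix at fewer than 2000 usable dimensions.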
I think it would be nice to have a basic dataset example (e.g. a simple list of lists or a dict) to show mca in action on more generic data. The current data sample in the docs is confusing because it requires the user to look at the data table, imagine what read_table does to it, and then go from there.
I understand you don't want to teach pandas to everyone, but a little more help might increase the usage of the tool!
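Something like the following might serve as that basic example (a sketch only; I am assuming the constructor is mca.MCA(df), as used elsewhere in these issues, and the column names are made up):

```python
import pandas as pd

# Plain Python data: each inner list is one observation.
rows = [["red",   "small", "round"],
        ["blue",  "large", "round"],
        ["red",   "large", "square"],
        ["green", "small", "square"]]
df = pd.DataFrame(rows, columns=["color", "size", "shape"])

# MCA operates on an indicator (one-hot) matrix:
dummies = pd.get_dummies(df)
print(dummies.shape)  # (4, 7): 3 colors + 2 sizes + 2 shapes

# Then, assuming mca is installed:
#   import mca
#   result = mca.MCA(dummies)
```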
I am trying to conduct MCA on my one-hot encoded matrix, but I get a MemoryError when I execute the MCA command. I don't know whether it is because of the size of the data set; the size I am using is only 33% of the entire data set.
The error occurs when the diagonal matrix is computed. I tried the mathematical operations separately and got the same error with np.diag as well.
mca_ben = MCA(mca_dummies)
MemoryError Traceback (most recent call last)
<ipython-input-145-01aa9b49bc65> in <module>()
----> 1 mca_ben = MCA(mca_dummies)
C:\Users\Alekya Kumar\AppData\Roaming\Python\Python36\site-packages\mca.py in __init__(self, DF, cols, ncols, benzecri, TOL, sparse, approximate)
64
65 eps = finfo(float).eps
---> 66 self.D_r = (diags if sparse else diag)(1/(eps + sqrt(self.r)))
67 self.D_c = diag(1/(eps + sqrt(self.c))) # can't use diags here
68 Z_c = Z - outer(self.r, self.c) # standardized residuals matrix
C:\Users\Public\Anaconda3\lib\site-packages\numpy\lib\twodim_base.py in diag(v, k)
253 if len(s) == 1:
254 n = s[0]+abs(k)
--> 255 res = zeros((n, n), v.dtype)
256 if k >= 0:
257 i = k
MemoryError:
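For what it's worth, the underlying problem is that np.diag materializes a full n-by-n array just to scale rows. If you are patching locally, the same product can be computed without ever forming the diagonal matrix (a generic numpy sketch, not the library's actual code):

```python
import numpy as np

rng = np.random.default_rng(1)
Z = rng.normal(size=(1000, 20))
r = np.abs(rng.normal(size=1000)) + 0.1   # stand-in for row masses

# Dense way: allocates a 1000 x 1000 matrix just to scale rows.
dense = np.diag(1.0 / np.sqrt(r)) @ Z

# Memory-friendly way: broadcasting scales each row directly.
cheap = (1.0 / np.sqrt(r))[:, None] * Z

print(np.allclose(dense, cheap))  # True
```

Note also that the traceback above shows a `sparse` parameter in the MCA constructor signature, which may already route this through scipy's sparse diags instead.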
Hi, I tried to run MCA on a dataframe with a shape of (181115, 977).
mca_data = mca.MCA(cat_data_df)
I got the error below:
---------------------------------------------------------------------------
MemoryError Traceback (most recent call last)
<ipython-input-450-a700ce17892e> in <module>()
----> 1 mca_data = mca.MCA(cat_data_df[0:50000])
~/.local/lib/python3.4/site-packages/mca.py in __init__(self, DF, cols, ncols, benzecri, TOL)
59 self._numitems = len(DF)
60 self.cor = benzecri
---> 61 self.D_r = diag(1/sqrt(self.r))
62 Z_c = Z - outer(self.r, self.c) # standardized residuals matrix
63 eps = finfo(float).eps # avoid division-by-zero
~/.local/lib/python3.4/site-packages/numpy/lib/twodim_base.py in diag(v, k)
247 if len(s) == 1:
248 n = s[0]+abs(k)
--> 249 res = zeros((n, n), v.dtype)
250 if k >= 0:
251 i = k
What is the largest data size that this mca can be performed on?
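A rough answer, under the assumption (suggested by the traceback) that a dense n-by-n diagonal matrix is allocated over the rows: memory grows quadratically with row count, so even the 50000-row slice in the traceback implies a huge allocation. A back-of-the-envelope calculation:

```python
# A dense float64 diagonal matrix of size n x n needs n*n*8 bytes.
def diag_bytes(n: int) -> int:
    return n * n * 8

for n in (50_000, 181_115):
    gib = diag_bytes(n) / 2**30
    print(f"n = {n:>7}: ~{gib:,.0f} GiB for np.diag")
```

So the practical row limit for the dense code path is set by available RAM, on the order of tens of thousands of rows; beyond that a sparse diagonal (or broadcasting) is needed.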
Hi,
I received the error in the subject and looked for the reason in the source code.
I realized that my data, after the line below in mca.py, has some of its values changed to "inf":
self.D_c = numpy.diag(1/numpy.sqrt(self.c))
Then I inspected further and noticed that self.c had some 0.00000000 values.
Naturally, dividing 1/0 gives something wrong; numpy didn't throw any error, but it turned the value into "inf", which propagated into self.D_c.
So what I did was replace the values that were 0 with a value close to 0 (0.00000000000001), and it worked.
I'm sharing my code with you:
## my code
# If self.c in the line below has any 0.0, then
# self.D_c = numpy.diag(1/numpy.sqrt(self.c))
# divides by 0, which numpy does not raise on.
# The code below fixes that error by substituting a number
# as close to 0 as found in numpy's logs: 13 decimal digits of 0.
# Requires pandas.
low_number = 0.00000000000001
temp_dict = self.c.to_dict()
temp_dtype = self.c.dtype
for i in temp_dict:
    if temp_dict[i] == 0:
        temp_dict[i] = low_number
self.c = pandas.Series(temp_dict, dtype=temp_dtype)
## end
self.D_c = numpy.diag(1/numpy.sqrt(self.c))
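The same fix can be written without the dict round-trip; a shorter sketch of the idea (assuming self.c is a pandas Series, as in the snippet above; shown here with a standalone Series named c):

```python
import numpy as np
import pandas as pd

c = pd.Series([0.25, 0.0, 0.5, 0.0, 0.25])

# Replace exact zeros with machine epsilon in one vectorized step.
c = c.mask(c == 0, np.finfo(float).eps)

D_c = np.diag(1.0 / np.sqrt(c))
print(np.isfinite(D_c).all())  # True: no more inf entries
```

The earlier traceback (the one showing `eps + sqrt(self.r)`) suggests newer versions of mca.py already add an epsilon for exactly this reason.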
Hello @esafak, I'm new to MCA so perhaps my comments won't be that precise -- or useful?
There is a lot of potential for this library -- I don't know if that really interests you, but anyway:
- Why not adopt sklearn's conventions? Creating fit and predict methods would really relieve the user of a lot of mathematical clutter. The FactoMineR library uses a similar convention to sklearn's and has been very popular in R; I guess you could just transpile their structure.
- When uint8 columns are converted to numpy arrays using X.values, the calculations simply cannot be done. I had to manually convert the original pandas DataFrame to float64 to get things going, which will be a huge memory hog for big datasets.
- I know the implementations take a lot of time and effort to program, so if you need any financial aid, just place a Paypal link and I'll try to help.
Make sure the calculation of factor scores with Benzecri correction is correct, because sources disagree (check the unit test to see what I mean).
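To illustrate the sklearn convention being requested, here is a minimal self-contained sketch of a fit/transform estimator (numpy-only; a toy centering-plus-SVD projection stands in for the actual MCA math, and the class and method names merely follow sklearn's pattern rather than anything in this library):

```python
import numpy as np

class ToySklearnStyleCA:
    """Minimal fit/transform estimator in the sklearn style."""

    def __init__(self, n_components=2):
        self.n_components = n_components

    def fit(self, X):
        X = np.asarray(X, dtype=float)
        self.mean_ = X.mean(axis=0)
        # SVD of the centered matrix; keep the leading components.
        _, _, Vt = np.linalg.svd(X - self.mean_, full_matrices=False)
        self.components_ = Vt[: self.n_components]
        return self  # sklearn convention: fit returns self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        return (X - self.mean_) @ self.components_.T

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 6))
scores = ToySklearnStyleCA(n_components=2).fit(X).transform(X)
print(scores.shape)  # (30, 2)
```

Wrapping the existing MCA class this way would let users write fit(X_train) followed by transform(X_test) instead of calling fs_r and fs_r_sup directly.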
I am confused by the ncols parameter. I am trying to use the mca library with the dorothea dataset, which has 100k features and 800 samples, but I am only able to run mca with 800 or so ncols. Are the rows and columns somehow mixed up, or am I missing something? The dataframe is 800x100000.
The first test in test_abdi_valentin fails in 2.x as follows, for some reason:
x: array([ 7.00400000e-01, 1.23000000e-02, 3.00000000e-04])
y: array([ 0.72796771, 0.04 , 0.01325114])
It's probably something that will sort itself out with a newer version of scipy/numpy.
Hello Emre, I have roughly 125 categorical features (820 after get_dummies). The library fails to return results and the process gets killed (Killed: 9), I guess after timing out. Do you know if that is due to having too many features? Do you have any suggestion on what I can do about it?
It seems that, ironically, mca has some problems with handling a categorical index. I have such an index in my DataFrame, created using pd.qcut(). When I run mca on this data I get an error:
AttributeError: 'Int64Index' object has no attribute '_is_dtype_compat'
I can work around this problem by using pd.to_numeric() on the offending column, and then everything works. A minimal example that does not yield exactly the same error, but still has a problem with the categorical index:
import numpy as np
import pandas as pd
import mca

a = np.random.normal(size=20)
aa = pd.qcut(a, 4, labels=list("1234"))
b = list(range(20))
c = list('abcdefghijklmnopqrst')
df = pd.DataFrame({'a': aa, 'b': b, 'c': c})
# df.a = pd.to_numeric(df.a, downcast='integer')  # <- this fixes the problem
mymca = mca.MCA(df, cols=['a','b'])
Running the above code yields the following error:
TypeError: cannot append a non-category item to a CategoricalIndex
Uncommenting the commented line fixes the problem. Hope this helps :-)
Cheers,
Omri
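Another workaround that may be worth noting (my own suggestion, not from the thread above): convert the categorical column to its integer codes before handing the frame to mca, which removes the categorical dtype entirely:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
a = rng.normal(size=20)
aa = pd.qcut(a, 4, labels=list("1234"))  # Categorical column

df = pd.DataFrame({"a": aa, "b": range(20)})
# Replace the Categorical with its plain integer codes (0..3).
df["a"] = df["a"].cat.codes

print(df["a"].dtype)  # a small integer dtype, no categorical left
```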
Hi folks,
I got the following error when I tried MCA on a 200 x 3 data frame.
File "/Library/Python/2.7/site-packages/mca-1.0.1-py2.7.egg/mca.py", line 98, in fs_r
self.k = 1 + np.flatnonzero(np.cumsum(self.L) >= sum(self.L)*percent)[0]
IndexError: index 0 is out of bounds for axis 0 with size 0
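The line in the traceback indexes into np.flatnonzero(...), which raises exactly this IndexError when the eigenvalue array self.L comes back empty (for instance, when every eigenvalue falls below the tolerance). A standalone sketch of that failure mode, where first_k, L, and percent are stand-ins for the library's internals:

```python
import numpy as np

def first_k(L, percent=0.9):
    """Mimics: 1 + flatnonzero(cumsum(L) >= sum(L)*percent)[0]."""
    hits = np.flatnonzero(np.cumsum(L) >= np.sum(L) * percent)
    if hits.size == 0:  # an empty eigenvalue array leaves no hits
        raise IndexError("no eigenvalues passed the threshold")
    return 1 + hits[0]

print(first_k(np.array([0.6, 0.35, 0.05])))  # 2: two factors reach 90%

try:
    first_k(np.array([]))  # empty eigenvalue list reproduces the crash
except IndexError as e:
    print("reproduced:", e)
```

So the thing to check is whether MCA retained any eigenvalues at all for this 200 x 3 frame, e.g. whether the data was one-hot encoded before being passed in.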