esafak / mca
Multiple correspondence analysis
License: BSD 3-Clause "New" or "Revised" License
Hello and many thanks for this module!
I'd like to get the MCA components on new, unseen data (a test set) and was going to use fs_r_sup() to do so. To verify that I would get something reasonable, I ran fs_r_sup() on the training set, expecting the same result as fs_r().
However, the result is in fact a scaled version of fs_r(): each column is multiplied by a factor, and I can't figure out where it comes from or whether I should expect this. I can reproduce this in your burgundies notebook example, where X is the original data matrix:
Input:
mca_ben.fs_r(N=3)
Output
array([[ 0.8617, 0.0786, -0.0213],
[-0.7130, -0.1571, -0.0192],
[-0.9221, 0.0786, -0.0051],
[-0.8617, 0.0786, 0.0213],
[ 0.9221, 0.0786, 0.0051],
[ 0.7130, -0.1571, 0.0192]])
Input:
mca_ben.fs_r_sup(X,N=3)
Output:
array([[ 0.9510, 0.3162, -0.4301],
[-0.7870, -0.6325, -0.3871],
[-1.0177, 0.3162, -0.1026],
[-0.9510, 0.3162, 0.4301],
[ 1.0177, 0.3162, 0.1026],
[ 0.7870, -0.6325, 0.3871]])
Equivalent to:
Input:
mca_ind.fs_r_sup(X,N=3)
Output:
array([[ 0.9510, 0.3162, -0.4301],
[-0.7870, -0.6325, -0.3871],
[-1.0177, 0.3162, -0.1026],
[-0.9510, 0.3162, 0.4301],
[ 1.0177, 0.3162, 0.1026],
[ 0.7870, -0.6325, 0.3871]])
Any help appreciated!
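For anyone comparing the two outputs above: a quick check on the arrays shows that each column of fs_r_sup() is the corresponding column of fs_r() multiplied by a single constant, so the difference is indeed a per-column rescaling (presumably related to the singular values, though that is my assumption). This sketch just verifies the constant-ratio observation using the numbers reported in the issue:

```python
import numpy as np

# Outputs copied verbatim from the issue above.
fs_r = np.array([[ 0.8617,  0.0786, -0.0213],
                 [-0.7130, -0.1571, -0.0192],
                 [-0.9221,  0.0786, -0.0051],
                 [-0.8617,  0.0786,  0.0213],
                 [ 0.9221,  0.0786,  0.0051],
                 [ 0.7130, -0.1571,  0.0192]])
fs_r_sup = np.array([[ 0.9510,  0.3162, -0.4301],
                     [-0.7870, -0.6325, -0.3871],
                     [-1.0177,  0.3162, -0.1026],
                     [-0.9510,  0.3162,  0.4301],
                     [ 1.0177,  0.3162,  0.1026],
                     [ 0.7870, -0.6325,  0.3871]])

# Element-wise ratio: if it is constant down each column, the
# supplementary scores are just a per-column rescaling of fs_r.
ratios = fs_r_sup / fs_r
print(ratios.round(2))  # each column is (nearly) constant
```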
Hi, I just want to let you know that installing mca from pip gets you an outdated version that has no "expl_var" method. I manually downloaded your program from GitHub, moved it inside the site-packages folder, and now I have the up-to-date version working.
Thank you very much for your program; I'm writing a program for my PhD and this is really useful.
Cheers
I am trying MCA with a bag-of-words dataset, using the following code:
ca = mca.MCA(pd.DataFrame(X_train))
new_x_train = ca.fs_r_sup(pd.DataFrame(X_train), 2000)
new_x_test = ca.fs_r_sup(pd.DataFrame(X_test), 2000)
print(new_x_train.shape)
where
X_train.shape = (1400, 6906)
X_test.shape = (700, 6906)
The output I am getting is
new_x_train.shape = (1400, 1299)
but the output should be (1400, 2000).
Is it right to use MCA for dimension reduction, with high dimensional data sets like bag of words?
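One possible explanation (an assumption on my part, not something confirmed in this thread): the number of factors that can be extracted is capped by the rank of the indicator matrix, and a one-hot / bag-of-words matrix typically has rank well below its column count, so asking for 2000 dimensions can still return fewer. A numpy-only sketch of the general principle:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy one-hot matrix: 100 samples, 10 categorical variables,
# each expanded into 5 dummy columns (50 columns total).
labels = rng.integers(0, 5, size=(100, 10))
one_hot = np.concatenate([np.eye(5)[labels[:, j]] for j in range(10)], axis=1)

# Each variable's 5 dummies sum to 1, so every block loses a
# degree of freedom and the matrix cannot have full column rank.
rank = np.linalg.matrix_rank(one_hot)
print(one_hot.shape, rank)  # rank is strictly less than 50
```

The same effect, on a much larger scale, would cap the 6906-column bag-of-words matrix at fewer than 2000 usable dimensions.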
I think it would be nice to have a basic dataset example (e.g. a simple list of lists or a dict) to show mca in action on more generic data. The current data sample in the docs is confusing because it requires the user to look at the data table, imagine what read_table does to it, and then go from there.
I understand you don't want to teach pandas to everyone, but a little more help might increase the usage of the tool!
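Something like the following might serve as that basic example (a sketch only; I am assuming the constructor is mca.MCA(df), as used elsewhere in these issues, and the column names are made up):

```python
import pandas as pd

# Plain Python data: each inner list is one observation.
rows = [["red",   "small", "round"],
        ["blue",  "large", "round"],
        ["red",   "large", "square"],
        ["green", "small", "square"]]
df = pd.DataFrame(rows, columns=["color", "size", "shape"])

# MCA operates on an indicator (one-hot) matrix:
dummies = pd.get_dummies(df)
print(dummies.shape)  # (4, 7): 3 colors + 2 sizes + 2 shapes

# Then, assuming mca is installed:
#   import mca
#   result = mca.MCA(dummies)
```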
I am trying to conduct MCA on my one-hot encoded matrix, but I get a MemoryError when I execute the MCA command. I don't know whether it is because of the size of the data set; the size I am using is only 33% of the entire data set.
The error occurs when the diagonal matrix is computed. I tried the mathematical operations separately and got the same error with np.diag as well.
mca_ben = MCA(mca_dummies)
MemoryError Traceback (most recent call last)
<ipython-input-145-01aa9b49bc65> in <module>()
----> 1 mca_ben = MCA(mca_dummies)
C:\Users\Alekya Kumar\AppData\Roaming\Python\Python36\site-packages\mca.py in __init__(self, DF, cols, ncols, benzecri, TOL, sparse, approximate)
64
65 eps = finfo(float).eps
---> 66 self.D_r = (diags if sparse else diag)(1/(eps + sqrt(self.r)))
67 self.D_c = diag(1/(eps + sqrt(self.c))) # can't use diags here
68 Z_c = Z - outer(self.r, self.c) # standardized residuals matrix
C:\Users\Public\Anaconda3\lib\site-packages\numpy\lib\twodim_base.py in diag(v, k)
253 if len(s) == 1:
254 n = s[0]+abs(k)
--> 255 res = zeros((n, n), v.dtype)
256 if k >= 0:
257 i = k
MemoryError:
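For what it's worth, the underlying problem is that np.diag materializes a full n-by-n array just to scale rows. If you are patching locally, the same product can be computed without ever forming the diagonal matrix (a generic numpy sketch, not the library's actual code):

```python
import numpy as np

rng = np.random.default_rng(1)
Z = rng.normal(size=(1000, 20))
r = np.abs(rng.normal(size=1000)) + 0.1   # stand-in for row masses

# Dense way: allocates a 1000 x 1000 matrix just to scale rows.
dense = np.diag(1.0 / np.sqrt(r)) @ Z

# Memory-friendly way: broadcasting scales each row directly.
cheap = (1.0 / np.sqrt(r))[:, None] * Z

print(np.allclose(dense, cheap))  # True
```

Note also that the traceback above shows a `sparse` parameter in the MCA constructor signature, which may already route this through scipy's sparse diags instead.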
Hi, I tried to run MCA on a dataframe with a shape of (181115, 977).
mca_data = mca.MCA(cat_data_df)
I got the error below:
---------------------------------------------------------------------------
MemoryError Traceback (most recent call last)
<ipython-input-450-a700ce17892e> in <module>()
----> 1 mca_data = mca.MCA(cat_data_df[0:50000])
~/.local/lib/python3.4/site-packages/mca.py in __init__(self, DF, cols, ncols, benzecri, TOL)
59 self._numitems = len(DF)
60 self.cor = benzecri
---> 61 self.D_r = diag(1/sqrt(self.r))
62 Z_c = Z - outer(self.r, self.c) # standardized residuals matrix
63 eps = finfo(float).eps # avoid division-by-zero
~/.local/lib/python3.4/site-packages/numpy/lib/twodim_base.py in diag(v, k)
247 if len(s) == 1:
248 n = s[0]+abs(k)
--> 249 res = zeros((n, n), v.dtype)
250 if k >= 0:
251 i = k
What is the largest data size that this mca can be performed on?
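A rough answer, under the assumption (suggested by the traceback) that a dense n-by-n diagonal matrix is allocated over the rows: memory grows quadratically with row count, so even the 50000-row slice in the traceback implies a huge allocation. A back-of-the-envelope calculation:

```python
# A dense float64 diagonal matrix of size n x n needs n*n*8 bytes.
def diag_bytes(n: int) -> int:
    return n * n * 8

for n in (50_000, 181_115):
    gib = diag_bytes(n) / 2**30
    print(f"n = {n:>7}: ~{gib:,.0f} GiB for np.diag")
```

So the practical row limit for the dense code path is set by available RAM, on the order of tens of thousands of rows; beyond that a sparse diagonal (or broadcasting) is needed.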
Hi,
I received the error in the subject and looked for the reason in the source code.
I realized that my data, after the line below in mca.py, has some of its values changed to "inf":
self.D_c = numpy.diag(1/numpy.sqrt(self.c))
Then I inspected further and noticed that self.c had some 0.00000000 values.
Naturally, dividing 1/0 gives something wrong; numpy didn't throw any error, but it turned the value into "inf", which propagated into self.D_c.
So what I did was replace the values that were 0 with a value close to 0 (0.00000000000001), and it worked.
I'm sharing my code with you:
## my code
# If self.c in the line below has any 0.0, then
# self.D_c = numpy.diag(1/numpy.sqrt(self.c))
# divides by 0, which numpy does not raise on.
# The code below fixes that error by substituting a number
# as close to 0 as found in numpy's logs: 13 decimal digits of 0.
# Requires pandas.
low_number = 0.00000000000001
temp_dict = self.c.to_dict()
temp_dtype = self.c.dtype
for i in temp_dict:
    if temp_dict[i] == 0:
        temp_dict[i] = low_number
self.c = pandas.Series(temp_dict, dtype=temp_dtype)
## end
self.D_c = numpy.diag(1/numpy.sqrt(self.c))
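The same fix can be written without the dict round-trip; a shorter sketch of the idea (assuming self.c is a pandas Series, as in the snippet above; shown here with a standalone Series named c):

```python
import numpy as np
import pandas as pd

c = pd.Series([0.25, 0.0, 0.5, 0.0, 0.25])

# Replace exact zeros with machine epsilon in one vectorized step.
c = c.mask(c == 0, np.finfo(float).eps)

D_c = np.diag(1.0 / np.sqrt(c))
print(np.isfinite(D_c).all())  # True: no more inf entries
```

The earlier traceback (the one showing `eps + sqrt(self.r)`) suggests newer versions of mca.py already add an epsilon for exactly this reason.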
Hello @esafak, I'm new to MCA so perhaps my comments won't be that precise -- or useful?
There is a lot of potential for this library -- I don't know if that really interests you, but anyway:
- Why not adopt sklearn's conventions? Creating fit and predict methods would really relieve the user of a lot of mathematical clutter. The FactoMineR library uses a similar convention to sklearn's and has been very popular in R; I guess you could just transpile their structure.
- When uint8 columns are converted to numpy arrays using X.values, the calculations simply cannot be done. I had to manually convert the original pandas DataFrame to float64 to get things going, which will be a huge memory hog for big datasets.
- I know the implementations take a lot of time and effort to program, so if you need any financial aid, just place a Paypal link and I'll try to help.
Make sure the calculation of factor scores with Benzecri correction is correct, because sources disagree (check the unit test to see what I mean).
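To illustrate the sklearn convention being requested, here is a minimal self-contained sketch of a fit/transform estimator (numpy-only; a toy centering-plus-SVD projection stands in for the actual MCA math, and the class and method names merely follow sklearn's pattern rather than anything in this library):

```python
import numpy as np

class ToySklearnStyleCA:
    """Minimal fit/transform estimator in the sklearn style."""

    def __init__(self, n_components=2):
        self.n_components = n_components

    def fit(self, X):
        X = np.asarray(X, dtype=float)
        self.mean_ = X.mean(axis=0)
        # SVD of the centered matrix; keep the leading components.
        _, _, Vt = np.linalg.svd(X - self.mean_, full_matrices=False)
        self.components_ = Vt[: self.n_components]
        return self  # sklearn convention: fit returns self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        return (X - self.mean_) @ self.components_.T

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 6))
scores = ToySklearnStyleCA(n_components=2).fit(X).transform(X)
print(scores.shape)  # (30, 2)
```

Wrapping the existing MCA class this way would let users write fit(X_train) followed by transform(X_test) instead of calling fs_r and fs_r_sup directly.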
I am confused by the ncols parameter. I am trying to use the mca library with the dorothea dataset, which has 100k features and 800 samples, but I am only able to run mca with 800 or so ncols. Are the rows and columns somehow mixed up, or am I missing something? The dataframe is 800x100000.
The first test in test_abdi_valentin fails in 2.x as follows, for some reason:
x: array([ 7.00400000e-01, 1.23000000e-02, 3.00000000e-04])
y: array([ 0.72796771, 0.04 , 0.01325114])
It's probably something that will sort itself out with a newer version of scipy/numpy.
Hello Emre, I have roughly 125 categorical features (820 after get_dummies). The library fails to return results and the process gets killed (Killed: 9), I guess after timing out. Do you know if that is due to having too many features? Do you have any suggestion on what I can do about it?
It seems that, ironically, mca has some problems with handling a categorical index. I have such an index in my DataFrame, created using pd.qcut(). When I run mca on this data I get an error:
AttributeError: 'Int64Index' object has no attribute '_is_dtype_compat'
I can work around this problem by using pd.to_numeric() on the offending column, and then everything works. A minimal example that does not yield exactly the same error, but still has a problem with the categorical index:
import numpy as np
import pandas as pd
import mca

a = np.random.normal(size=20)
aa = pd.qcut(a, 4, labels=list("1234"))
b = list(range(20))
c = list('abcdefghijklmnopqrst')
df = pd.DataFrame({'a': aa, 'b': b, 'c': c})
# df.a = pd.to_numeric(df.a, downcast='integer')  # <- this fixes the problem
mymca = mca.MCA(df, cols=['a','b'])
Running the above code yields the following error:
TypeError: cannot append a non-category item to a CategoricalIndex
Uncommenting the commented line fixes the problem. Hope this helps :-)
Cheers,
Omri
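Another workaround that may be worth noting (my own suggestion, not from the thread above): convert the categorical column to its integer codes before handing the frame to mca, which removes the categorical dtype entirely:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
a = rng.normal(size=20)
aa = pd.qcut(a, 4, labels=list("1234"))  # Categorical column

df = pd.DataFrame({"a": aa, "b": range(20)})
# Replace the Categorical with its plain integer codes (0..3).
df["a"] = df["a"].cat.codes

print(df["a"].dtype)  # a small integer dtype, no categorical left
```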
Hi folks,
I got the following error when I tried MCA on a 200 x 3 data frame.
File "/Library/Python/2.7/site-packages/mca-1.0.1-py2.7.egg/mca.py", line 98, in fs_r
self.k = 1 + np.flatnonzero(np.cumsum(self.L) >= sum(self.L)*percent)[0]
IndexError: index 0 is out of bounds for axis 0 with size 0
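The line in the traceback indexes into np.flatnonzero(...), which raises exactly this IndexError when the eigenvalue array self.L comes back empty (for instance, when every eigenvalue falls below the tolerance). A standalone sketch of that failure mode, where first_k, L, and percent are stand-ins for the library's internals:

```python
import numpy as np

def first_k(L, percent=0.9):
    """Mimics: 1 + flatnonzero(cumsum(L) >= sum(L)*percent)[0]."""
    hits = np.flatnonzero(np.cumsum(L) >= np.sum(L) * percent)
    if hits.size == 0:  # an empty eigenvalue array leaves no hits
        raise IndexError("no eigenvalues passed the threshold")
    return 1 + hits[0]

print(first_k(np.array([0.6, 0.35, 0.05])))  # 2: two factors reach 90%

try:
    first_k(np.array([]))  # empty eigenvalue list reproduces the crash
except IndexError as e:
    print("reproduced:", e)
```

So the thing to check is whether MCA retained any eigenvalues at all for this 200 x 3 frame, e.g. whether the data was one-hot encoded before being passed in.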