rajatsen91 / ccit

Classifier Conditional Independence Test: A CI test that uses a binary classifier (XGBoost) for CI testing

Python 100.00%

ccit's Issues

Bootstrap data is not being used

In the function XGBOUT2, you have the following code:

num_samp = len(all_samples)
if bootstrap:
    np.random.seed()
    random.seed()
    I = np.random.choice(num_samp, size=num_samp, replace=True)
    samples = all_samples[I, :]
else:
    samples = all_samples
Xtrain, Ytrain, Xtest, Ytest, CI_data = CI_sampler_conditional_kNN(
    all_samples[:, Xcoords],
    all_samples[:, Ycoords],
    all_samples[:, Zcoords],
    train_samp,
    k,
)

You create the variable samples when bootstrap is True, but when you call the CI_sampler_conditional_kNN function, you use the variable all_samples. In my understanding, you should use the variable samples in this case. Am I right?
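A minimal sketch of the change this issue seems to ask for, assuming the only problem is which array is passed on. The helper name `select_samples` and the toy array are made up for illustration and are not part of CCIT; the point is that the resampled array, not `all_samples`, is what gets sliced afterwards.

```python
import numpy as np


def select_samples(all_samples, bootstrap, rng=None):
    """Return a bootstrap resample of the rows, or the original array.

    Hypothetical helper: it mirrors the if/else in the snippet above.
    """
    rng = np.random.default_rng() if rng is None else rng
    if not bootstrap:
        return all_samples
    num_samp = len(all_samples)
    # Draw row indices with replacement, as in the original snippet.
    idx = rng.choice(num_samp, size=num_samp, replace=True)
    return all_samples[idx, :]


# Toy data standing in for the real sample matrix (assumption).
all_samples = np.arange(20, dtype=float).reshape(10, 2)
samples = select_samples(all_samples, bootstrap=True, rng=np.random.default_rng(0))

# Downstream code would then slice `samples` rather than `all_samples`,
# e.g. samples[:, Xcoords], samples[:, Ycoords], samples[:, Zcoords].
x_part = samples[:, [0]]
```

With `bootstrap=False` the helper returns the input unchanged, so the non-bootstrap path behaves as before.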

BTW, this is an excellent paper!

Add versions to package requirements in setup.py

I tried running CCIT in a clean virtualenv. The installation works fine, but there are numerous sklearn and XGBoost errors, mostly due to deprecations. Even the example in the README doesn't work.

Would it be possible to specify the exact versions of the requirements for CCIT? I believe that would solve most of the problems and make CCIT future-proof.

For example, in setup.py we have:

install_requires=[
          'markdown',
          'xgboost',
          'pandas',
          'numpy',
          'scikit-learn',
          'scipy',
          'matplotlib'
      ],

So the correct/working versions of these 7 packages are needed. In fact, I think pinning only the xgboost and scikit-learn versions should solve it.
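One way to record working versions, sketched under the assumption that someone has an environment in which CCIT runs correctly: read the installed versions with the standard library's `importlib.metadata` and emit `name==version` strings that could be pasted into `install_requires`. The helper `pinned_requirements` is hypothetical, and no specific version numbers are claimed here.

```python
from importlib.metadata import PackageNotFoundError, version

# Package names copied from the install_requires list quoted above.
REQUIRED = [
    "markdown", "xgboost", "pandas", "numpy",
    "scikit-learn", "scipy", "matplotlib",
]


def pinned_requirements(names, lookup=version):
    """Build 'name==version' pins from whatever is installed right now.

    Packages that are not installed are left unpinned. `lookup` is
    injectable only so the function can be exercised without the packages.
    """
    pins = []
    for name in names:
        try:
            pins.append(f"{name}=={lookup(name)}")
        except PackageNotFoundError:
            pins.append(name)
    return pins


if __name__ == "__main__":
    # Run this inside an environment where CCIT is known to work.
    print(pinned_requirements(REQUIRED))
```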

Some question about the distribution of acc1-acc2

Hey there, CCIT contributors,
From line 389 of CCIT.py, I think you assume that acc1-acc2 follows the normal distribution N(0, 2\sigma(acc2)^2), where \sigma(acc2) is the standard deviation of acc2. I think this is right, too. But given this assumption, two other parts of the code look inconsistent:

In line 373, only "s2 = np.std(cleaned, axis = 0, ddof = 1)[4]" gives the sample standard deviation, which is based on the unbiased estimator of the variance \sigma(acc2)^2. "np.std(cleaned, axis = 0)[4]" is the population standard deviation, which is based on a biased estimator of that variance.
In line 391, when bootstrap == False, why is the standard deviation np.sqrt(2) * 1/np.sqrt(ntot) (the np.sqrt(2) factor is applied in the function "pvalue", line 325)? I think it should be np.sqrt(2) * np.sqrt(acc2 * (1-acc2)/ntot), since acc2 approximately follows the distribution N(acc2, acc2*(1-acc2)/ntot) (acc2 is approximately normal because it is a proportion from a binomial distribution that counts the test points where the predicted label equals the true label).
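The two points can be illustrated numerically. This is only a sketch with made-up numbers: `acc2_boot`, `ntot`, and `acc2` are illustrative stand-ins, not values from CCIT, and `binomial_acc_std` is a hypothetical helper for the binomial formula quoted above.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up bootstrap accuracies standing in for the acc2 column of `cleaned`.
acc2_boot = rng.normal(loc=0.5, scale=0.02, size=30)

std_population = np.std(acc2_boot)       # divides by n (NumPy default, ddof=0)
std_sample = np.std(acc2_boot, ddof=1)   # divides by n - 1


def binomial_acc_std(acc, ntot):
    """Standard deviation of an accuracy estimated from ntot Bernoulli trials."""
    return np.sqrt(acc * (1.0 - acc) / ntot)


ntot = 1000   # illustrative test-set size
acc2 = 0.5    # illustrative accuracy

sigma_current = np.sqrt(2) / np.sqrt(ntot)                   # form questioned in the issue
sigma_proposed = np.sqrt(2) * binomial_acc_std(acc2, ntot)   # form proposed in the issue
```

Since acc * (1 - acc) is at most 0.25, the proposed value is never more than half of the current one, so the current form is the more conservative of the two in this comparison.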

BTW, I appreciate your paper Model-powered Conditional Independence Test. It is great!

DeprecationWarning: The truth value of an empty array is ambiguous.

Hey there, CCIT contributors,

I tried the following commands.

data = DataGen.generate_samples_cos(dx=1,dy=1,dz=20,sType='NI')  #non-CI dataset, pvalue should be low

X = data[:,0:1]
Y = data[:,1:2]
Z = data[:,2::]

pvalue = CCIT.CCIT(X,Y,Z)

And received the following warning.

C:\Users\XXX\Anaconda3\lib\site-packages\sklearn\preprocessing\label.py:151: DeprecationWarning: The truth value of an empty array is ambiguous. Returning False, but in future this will result in an error. Use array.size > 0 to check that an array is not empty.
if diff:

The output pvalue is 0.014717305527501225, which looks fine. So is everything alright with my installation? Could you kindly let me know if there is anything I can do about the warning?
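If the warning is only noise, one option is to silence it locally with the standard `warnings` module. This is a sketch, not a fix for the underlying sklearn call: `run_quietly` and `noisy_stand_in` are hypothetical names, and the stand-in function only imitates a call that emits a DeprecationWarning. If the wrapped call starts worker processes, the filter may not carry over to them.

```python
import warnings


def run_quietly(func, *args, **kwargs):
    """Call func with DeprecationWarning suppressed for the duration of the call."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", category=DeprecationWarning)
        return func(*args, **kwargs)


def noisy_stand_in():
    """Stand-in for a library call that emits a DeprecationWarning (assumption)."""
    warnings.warn("The truth value of an empty array is ambiguous.", DeprecationWarning)
    return 0.0147


result = run_quietly(noisy_stand_in)

# With CCIT the call would presumably look like:
# pvalue = run_quietly(CCIT.CCIT, X, Y, Z)
```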

Thank you,
gogotrace

Possible error in the implementation compared to paper?

Hi,

Cool paper and thanks for uploading the code. Really interesting concepts put forward.

I perused the code and noticed a possible mismatch between what the implementation does and what the proposed algorithm does.

Specifically,

CCIT/CCIT/CCIT.py

Lines 97 to 103 in 0b9dce9

    
    nbrs = NearestNeighbors(n_neighbors=k + 1, algorithm="ball_tree", metric="l2").fit(
        Z
    )
    distances, indices = nbrs.kneighbors(Z)
    for i in range(len(train_2)):
        index = indices[i, k]
        Yprime[i, :] = Y[index, :]

This builds a nearest-neighbor search tree over all of the "training data", which is called U = U_1 \cup U_2 in Algorithm 1 of the paper, and then queries the nearest neighbors within that same set U.

Algorithm 1 instead proposes constructing the nearest-neighbor tree on a distinct set, U_2, and then querying it with the elements of U_1, swapping y with y' when z' is close to z. @rajatsen91, is this an issue?
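The variant described in this issue can be sketched as follows, following the issue's reading of Algorithm 1 rather than the paper text itself. A brute-force NumPy distance computation is swapped in for sklearn's `NearestNeighbors` to keep the example self-contained; the function name, the column layout (x, y, z), and the toy arrays are all assumptions made for illustration.

```python
import numpy as np


def swap_y_with_nearest_z(query_set, index_set, y_cols, z_cols):
    """For each row of query_set, copy y from the row of index_set with the closest z.

    The neighbor search runs over index_set only, so a query point can never
    be matched to itself (which is why no k + 1 offset is needed here).
    """
    zq = query_set[:, z_cols]
    zi = index_set[:, z_cols]
    # Pairwise squared l2 distances, shape (n_query, n_index).
    d2 = ((zq[:, None, :] - zi[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)
    out = query_set.copy()
    out[:, y_cols] = index_set[nearest][:, y_cols]
    return out


# Toy data with columns (x, y, z).
U1 = np.array([[1.0, 10.0, 0.0],
               [2.0, 20.0, 1.0]])
U2 = np.array([[3.0, 30.0, 0.1],
               [4.0, 40.0, 0.9],
               [5.0, 50.0, 5.0]])

# Build the neighbor structure on U2 and query it with U1.
U1_prime = swap_y_with_nearest_z(U1, U2, y_cols=[1], z_cols=[2])
```

Each row of `U1_prime` keeps its own x and z but takes y from the U2 row whose z is nearest.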

Discrete Support

Better support for discrete distributions.

Import Error

Import of CCIT in Jupyter (after installing with pip install ccit) throws an error:

ModuleNotFoundError: No module named 'DataGen'
