jundongl / scikit-feature

open-source feature selection repository in python

License: GNU General Public License v2.0

Python 100.00%

scikit-feature's Issues

Please publish on PyPI

Currently I have to go through a hassle to install this dependency for my project. Publishing it on PyPI would make the package much more accessible.

Hi,
While installing scikit-feature, I get the following error:
Could not find a version that satisfies the requirement scikit-feature (from versions: )
No matching distribution found for scikit-feature.

I tried with Python 2.7 and 3.6, but the same error occurred. Could you please help me?

SKLearn fit/transform compatibility?

Is there an "out of the box" way to use this directly as a scikit-learn "transformer"? I.e., do the methods support fit, fit_transform, etc.?
Sorry if this exists and I missed it somewhere! (Without this, the methods can't be used directly with sklearn's pipelines or CV methods.)
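
A thin adapter is one way to get this kind of compatibility. The sketch below is only an illustration under stated assumptions: `ScoreSelector` and `variance_score` are hypothetical names that do not exist in scikit-feature, and the scorer is assumed to return one score per feature, with higher meaning better. The class merely duck-types `fit`, `transform`, and `fit_transform`.

```python
import numpy as np


def variance_score(X, y=None):
    """Stand-in scorer (hypothetical): one score per column, higher is better."""
    return np.var(X, axis=0)


class ScoreSelector:
    """Keep the k best-scoring columns; exposes fit / transform / fit_transform."""

    def __init__(self, score_func, k):
        self.score_func = score_func
        self.k = k

    def fit(self, X, y=None):
        X = np.asarray(X)
        scores = np.asarray(self.score_func(X, y), dtype=float)
        # Indices of the k largest scores, best first.
        self.selected_ = np.argsort(scores)[::-1][: self.k]
        return self

    def transform(self, X):
        return np.asarray(X)[:, self.selected_]

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)


X = np.array([[1.0, 10.0, 0.0],
              [1.0, 20.0, 0.5],
              [1.0, 30.0, 1.0]])
selector = ScoreSelector(variance_score, k=2)
X_reduced = selector.fit_transform(X)
```

For use inside a real scikit-learn pipeline or parameter search, such an adapter would normally also inherit from `sklearn.base.BaseEstimator` and `TransformerMixin` so that parameter cloning works.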

IndexError when using JMI and MRMR

Hi.
I have tried JMI and MRMR but I got the following error:
IndexError: only integers, slices (:), ellipsis (...), numpy.newaxis (None) and integer or boolean arrays are valid indices

The same error is also raised when I run the examples provided with the package, such as test_JMI.py.

TypeError: transpose() takes exactly 1 argument (2 given)

Hello:

I get the error below when running unsupervised feature selection with Laplacian Score.

Traceback (most recent call last):
  File "testfs.py", line 13, in <module>
    score = lap_score.lap_score(frame, W=W)
  File "/usr/local/lib/python2.7/dist-packages/skfeature/function/similarity_based/lap_score.py", line 42, in lap_score
    Xt = np.transpose(X)
  File "/usr/lib/python2.7/dist-packages/numpy/core/fromnumeric.py", line 534, in transpose
    return transpose(axes)
TypeError: transpose() takes exactly 1 argument (2 given)

Thanks
Sayantan Guha

The linear_assignment_ module is deprecated

warnings.warn(
"The linear_assignment_ module is deprecated in 0.21 "
"and will be removed from 0.23. Use "
"scipy.optimize.linear_sum_assignment instead.",
DeprecationWarning)

gini_index implementation

Hello, when I use gini_index to get the importance of the features, the output is always:
gini_index : [0.5 0.5 0.5 0.5 0.5]
Is there a problem with this function?

Dimension mismatch

Hello there! Hope you're doing fine.

I was just trying to use the SPEC and Laplacian Score modules to de-noise a BoW (489 docs, 7895 terms) and got the following errors:

SPEC:
File "", line 2, in
spectral_fs.spec(x)

File "C:\Users\Erick Garciaoliva\Anaconda3\lib\site-packages\skfeature\function\similarity_based\SPEC.py", line 74, in spec
l = LA.norm(F_hat)

File "C:\Users\Erick Garciaoliva\Anaconda3\lib\site-packages\numpy\linalg\linalg.py", line 2450, in norm
sqnorm = dot(x, x)

File "C:\Users\Erick Garciaoliva\Anaconda3\lib\site-packages\scipy\sparse\base.py", line 481, in mul
raise ValueError('dimension mismatch')

ValueError: dimension mismatch

LP-Score:
File "", line 1, in
lapscore = LaplacianScore(x)

File "", line 35, in LaplacianScore
t=np.matmul(np.matmul(Xt,D.toarray()),I)/np.matmul(np.matmul(np.transpose(I),D.toarray()),I)

ValueError: matmul: Input operand 0 does not have enough dimensions (has 0, gufunc core with signature (n?,k),(k,m?)->(n?,m?) requires 1)

Maybe this has to do with the data type of the BoW?

Module not found entropy_estimators when importing statistical_based.CFS

I fixed this by changing

import entropy_estimators as ee

to

import skfeature.utility.entropy_estimators as ee

in mutual_information.py.

'function' object has no attribute 'entropyd'

AttributeError Traceback (most recent call last)
in ()
----> 1 mod(0,1,2,0,1)

in mod(cross, smote, norstd, model, graph)
96 if cross==0:
97 X2=stdselector(X,norstd)
---> 98 X3=fcb(X2,Y)
99 X_train, X_test, Y_train, Y_test=nocross(X3,Y)
100 print(X2.shape)

in fcb(X_train, Y_train)
88 print("10foldcrossvalidation mean SPECIFICITY",np.mean(spe))
89 def fcb(X_train,Y_train):
---> 90 idx =CFS.cfs(X_train,Y_train)
91 features = X[:, idx[0:num_fea]]
92 print(features)

c:\python\lib\site-packages\skfeature\function\statistical_based\CFS.py in cfs(X, y)
70 F.append(i)
71 # calculate the merit of current selected features
---> 72 t = merit_calculation(X[:, F], y)
73 if t > merit:
74 merit = t

c:\python\lib\site-packages\skfeature\function\statistical_based\CFS.py in merit_calculation(X, y)
28 for i in range(n_features):
29 fi = X[:, i]
---> 30 rcf += su_calculation(fi, y)
31 for j in range(n_features):
32 if j > i:

c:\python\lib\site-packages\skfeature\utility\mutual_information.py in su_calculation(f1, f2)
57 # calculate information gain of f1 and f2, t1 = ig(f1,f2)
58 t1 = information_gain(f1, f2)
---> 59 # calculate entropy of f1, t2 = H(f1)
60 t2 = ee.entropyd(f1)
61 # calculate entropy of f2, t3 = H(f2)

c:\python\lib\site-packages\skfeature\utility\mutual_information.py in information_gain(f1, f2)
17
18 ig = ee.entropyd(f1) - conditional_entropy(f1, f2)
---> 19 return ig
20
21

AttributeError: 'function' object has no attribute 'entropyd'

entropy value is negative

I found that for some continuous variables, the entropy_estimators library returns a negative number. Here is the reply I got from the author of this library:

For continuous variables, this package is calculating the differential entropy. Unfortunately, the differential entropy can be negative, making interpretation more difficult than in the discrete case. See chapter 8 of Cover and Thomas, for example, for a discussion of how to interpret negative differential entropies. (Consider, for instance, the differential entropy for a Gaussian which is proportional to log variance. If the variance is small, you get a negative number.)

My question concerns the information-theoretic methods that use this library for entropy calculation: if the entropy result is negative, will the feature selection result still be valid?
Thanks
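
The Gaussian case mentioned in the quoted reply can be checked numerically. This is a minimal sketch using the closed-form differential entropy of a normal distribution, 0.5 * log(2 * pi * e * sigma^2); the function name is made up for this example and nothing here calls entropy_estimators.

```python
import math


def gaussian_differential_entropy(sigma, base=2):
    """Differential entropy of N(mu, sigma^2): 0.5 * log(2 * pi * e * sigma^2)."""
    return 0.5 * math.log(2 * math.pi * math.e * sigma ** 2, base)


wide = gaussian_differential_entropy(1.0)    # positive
narrow = gaussian_differential_entropy(0.1)  # negative

# The sign flips where 2 * pi * e * sigma^2 == 1.
threshold = 1.0 / math.sqrt(2 * math.pi * math.e)
```

A standard deviation below roughly 0.24 is enough to push the value below zero, so rescaling a continuous feature changes the sign of its differential entropy without changing its information content.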

Return JMI Values

Is there a way to get the JMI values that are calculated during feature selection?

No module named 'entropy_estimators'

I was trying to use CFS.py with Python 3. It gave me the following error:

File "/usr/local/lib/python3.5/dist-packages/skfeature/function/statistical_based/CFS.py", line 2, in <module>
    from skfeature.utility.mutual_information import su_calculation
  File "/usr/local/lib/python3.5/dist-packages/skfeature/utility/mutual_information.py", line 1, in <module>
    import entropy_estimators as ee
ImportError: No module named 'entropy_estimators'

Please tell me what I should do to solve this problem. Am I missing something or doing something wrong?
Thanks.

Missing LCSI. What is this, and how can I resolve this error?

D:\prj>python findFeatures.py
Traceback (most recent call last):
  File "findFeatures.py", line 7, in <module>
    from skfeature.function.information_theoretical_based import MIM # infogain
  File "D:\ProgramData\Anaconda3\lib\site-packages\skfeature\function\information_theoretical_based\MIM.py", line 1, in <module>
    import LCSI
ModuleNotFoundError: No module named 'LCSI'

Please add compatibility with sparse matrices

For the library to work, I have to convert the sparse matrix to a dense one, which takes a lot of memory. Because of that, I am sometimes unable to perform the required task due to a memory error. Specifically, I am talking about the statistical methods such as CHI2, giniIndex, etc.

Bug in `lap_score.py`

Hi there!

There seems to be an error in lap_score.py. Please take a look at the following excerpt:

    # if 'W' is not specified, use the default W
    if 'W' not in kwargs.keys():
        W = construct_W(X)
    # construct the affinity matrix W
    W = kwargs['W']

If the user does not pre-compute W, then the last line results in a KeyError. I think it's easy to fix, since there seems to be a missing else:

    if 'W' not in kwargs.keys():
        # if 'W' is not specified, use the default W
        W = construct_W(X)
    else:
        # construct the affinity matrix W
        W = kwargs['W']

Thanks. Regards.

In MCFS, mod of matrix should be taken before calculating maximum value

According to the reference paper, 'Unsupervised Feature Selection for Multi-Cluster Data' by Cai, Deng, equation 4 takes the maximum of the absolute value. I think this needs to be corrected in MCFS.py, line 69, from W.max(1) to np.absolute(W).max(1).
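
The difference between the two expressions is easy to see on a toy matrix. This is a hedged sketch: `W` below is a made-up coefficient matrix (rows as features, columns as clusters), not output from MCFS.py, and the ranking step is only there to show how the ordering changes.

```python
import numpy as np

# Hypothetical coefficient matrix: rows are features, columns are clusters.
W = np.array([[0.2, -0.9],
              [0.4,  0.1],
              [-0.3, -0.7]])

signed_score = W.max(1)            # ignores large negative coefficients
abs_score = np.absolute(W).max(1)  # max over clusters of |W[j, k]|

# Feature indices sorted from highest to lowest score.
ranking_signed = np.argsort(signed_score)[::-1]
ranking_abs = np.argsort(abs_score)[::-1]
```

With the signed maximum, feature 0 (coefficient -0.9) is ranked below feature 1; with the absolute value it is ranked first.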

FCBF.py Error fixed

Hi,
There is a typo on line 39 of the FCBF.py file. I fixed it and was able to get it working on my system by
changing dtypes to dtype and removing the quotes around object.

Current: t1 = np.zeros((n_features, 2), dtypes='object')

Correct: t1 = np.zeros((n_features, 2), dtype=object)

Best,
Sparkle

TypeError: __init__() got an unexpected keyword argument 'n_folds'

When trying score = svm_forward.svm_forward(X_train_values, y_train_reg_values, 50), I got the following error:

TypeError Traceback (most recent call last)
in
8
9 # score = svm_backward.svm_backward(X_train_values, y_train_reg_values,50)
---> 10 score = svm_forward.svm_forward(X_train_values, y_train_reg_values, 50)

~/anaconda3/envs/ish3test/lib/python3.6/site-packages/skfeature/function/wrapper/svm_forward.py in svm_forward(X, y, n_selected_features)
26 n_samples, n_features = X.shape
27 # using 10 fold cross validation
---> 28 cv = KFold(n_samples, n_folds=10, shuffle=True)
29 # choose SVM as the classifier
30 clf = SVC()

TypeError: __init__() got an unexpected keyword argument 'n_folds'

I'm wondering whether it's because I'm using Python3?
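
This looks like a scikit-learn API difference rather than a Python 3 problem. In `sklearn.model_selection`, `KFold` takes `n_splits` instead of `n_folds`, and the data is passed to `split()` rather than to the constructor. A minimal sketch of the newer call, with a toy `X` that is only an assumption for illustration:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(40).reshape(20, 2)  # toy data: 20 samples, 2 features

# Older style, as in the traceback: KFold(n_samples, n_folds=10, shuffle=True)
# Newer style: no n_samples argument; the data goes to split().
cv = KFold(n_splits=10, shuffle=True, random_state=0)
folds = list(cv.split(X))  # list of (train_indices, test_indices) pairs
```

Code that previously iterated over the `KFold` object directly would iterate over `cv.split(X)` instead.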

MemoryError

Hello:

I am running unsupervised feature selection with Laplacian Score, but I get the error message below:

Traceback (most recent call last):
  File "fs.py", line 12, in <module>
    W = construct_W.construct_W(frame, **kwargs_W)
  File "/usr/local/lib/python2.7/dist-packages/skfeature/utility/construct_W.py", line 141, in construct_W
    D = pairwise_distances(X)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/metrics/pairwise.py", line 1207, in pairwise_distances
    return _parallel_pairwise(X, Y, func, n_jobs, **kwds)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/metrics/pairwise.py", line 1054, in _parallel_pairwise
    return func(X, Y, **kwds)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/metrics/pairwise.py", line 231, in euclidean_distances
    distances = safe_sparse_dot(X, Y.T, dense_output=True)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/extmath.py", line 184, in safe_sparse_dot
    return fast_dot(a, b)
MemoryError

I have updated scikit-learn, but the issue persists. Any input would be helpful.

Regards
Sayantan Guha

Issue with UDFS; MCFS

Hi,
when I try to run the code for UDFS and MCFS, I get the following issue:

Kindly help me on this.

Quickly, show some code...anywhere...

It would be nice to know how to use the tool (also to show that it has a scikit interface).

IndexError in FCBF.py

scikit-feature/skfeature/function/information_theoretical_based/FCBF.py

Line 53 in 48cffad

fp = X[:, s_list[idx, 0]]

On line 53 of the FCBF algorithm we get an IndexError. It can be fixed by writing fp = X[:, int(s_list[idx, 0])] instead of
fp = X[:, s_list[idx, 0]]
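
The cast matters because NumPy rejects float values as indices. The sketch below reproduces the situation on toy data; the layout of `s_list` (feature index in column 0, a score in column 1, stored as floats) is an assumption made for this example, not taken from FCBF.py.

```python
import numpy as np

X = np.arange(12.0).reshape(3, 4)

# Assumed layout: (feature index, score) pairs stored in a float array.
s_list = np.array([[2.0, 0.8],
                   [0.0, 0.5]])

try:
    X[:, s_list[0, 0]]  # float index: rejected by recent NumPy versions
    failed = False
except IndexError:
    failed = True

fp = X[:, int(s_list[0, 0])]  # cast to int first, as proposed in the issue
```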

Does K-L refer to the Kozachenko-Leonenko k-nearest neighbour estimator used to estimate the entropy?

K-L is used in
https://github.com/jundongl/scikit-feature/blob/master/skfeature/utility/entropy_estimators.py
def entropy(x, k=3, base=2):
    """
    The classic K-L k-nearest neighbor continuous entropy estimator x should be a list of vectors,
    e.g. x = [[1.3],[3.7],[5.1],[2.4]] if x is a one-dimensional scalar and we have four samples
    """

Is it the Kozachenko-Leonenko k-nearest neighbour estimator used to estimate the entropy?

https://stackoverflow.com/questions/43265770/entropy-python-implementation

midd example pls

Do you have examples of how to use your code,
especially for
def midd(x, y):
    """
    Discrete mutual information estimator given a list of samples which can be any hashable object
    """

    return -entropyd(list(zip(x, y)))+entropyd(x)+entropyd(y)

from
https://github.com/jundongl/scikit-feature/blob/master/skfeature/utility/entropy_estimators.py
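
A small usage sketch for `midd` on two toy label lists. The `midd` body is the one quoted above; `entropyd` here is a simplified plug-in estimator written for this example as a stand-in, not a copy of the library's function, so results from the real module may differ in details such as the log base.

```python
import math
from collections import Counter


def entropyd(samples, base=2):
    """Plug-in entropy of a list of hashable samples (stand-in for this sketch)."""
    n = len(samples)
    counts = Counter(samples)
    return -sum((c / n) * math.log(c / n, base) for c in counts.values())


def midd(x, y):
    """Discrete mutual information: I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return -entropyd(list(zip(x, y))) + entropyd(x) + entropyd(y)


x = [0, 0, 1, 1]
y_same = [0, 0, 1, 1]   # identical to x, so I(X;Y) = H(X) = 1 bit
y_indep = [0, 1, 0, 1]  # every (x, y) pair is equally frequent, so I(X;Y) = 0

mi_same = midd(x, y_same)
mi_indep = midd(x, y_indep)
```

Both arguments are plain sequences of hashable values, one entry per sample, and must have the same length.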

Sample datasets

Can we include sample datasets to have a proper testing suite?

Speed up recommendation for `cfs` function

In the code for cfs(X, y), you repeatedly call the function merit_calculation(X, y), which itself repeatedly calls the function su_calculation, sometimes with exactly the same feature(s) as in previous rounds.

To avoid repeatedly computing su_calculation(fi, y) with the same feature fi, it would be ideal to save the computation results into a list or a dictionary when they are called the first time, and to load those values instead of recomputing them when they are called afterwards. That would ensure the linear complexity of the algorithm and improve its speed.

This could be the code to achieve that:

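import numpy as np
from skfeature.utility.mutual_information import su_calculation

def merit_calculation(X, y, F, memo):
    # memo caches su_calculation results between calls
    rff = 0
    rcf = 0
    for i in F:
        # fi is needed below even when memo[i] is already cached
        fi = X[:, i]
        if i not in memo:
            memo[i] = su_calculation(fi, y)
        rcf += memo[i]
        for j in F:
            if j > i:
                if (i, j) not in memo:
                    fj = X[:, j]
                    memo[(i, j)] = su_calculation(fi, fj)
                rff += memo[(i, j)]
    rff *= 2
    merits = rcf / np.sqrt(len(F) + rff)
    return merits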

And the usage, supplying the indices and the memory dictionary on each call.

...
memo = {}
...
t = merit_calculation(X, y, F, memo)
F = ...
t = merit_calculation(X, y, F, memo)
F = ...
...

Maybe a mistake in CFS

Maybe I'm wrong but I think that

rff *= 2

should be:

rff *= (n_features **2 - n_features)

in the merit calculation function

Thanks for sharing your code!
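
One way to check the factor is to compare the two usual ways of writing the CFS merit. The sketch below assumes the standard merit k * mean(r_cf) / sqrt(k + k * (k - 1) * mean(r_ff)) and assumes that `rff` is accumulated as a plain sum over pairs with j > i, as in the merit_calculation code quoted in the speed-up issue above. The correlation values are made up.

```python
import math

# Hypothetical symmetrical-uncertainty values for k = 3 features.
rcf = [0.6, 0.5, 0.4]                          # feature-class
rff = {(0, 1): 0.2, (0, 2): 0.1, (1, 2): 0.3}  # feature-feature, i < j

k = len(rcf)
rcf_sum = sum(rcf)
rff_sum = sum(rff.values())

# Sum form: sum over unordered pairs, doubled.
merit_sum_form = rcf_sum / math.sqrt(k + 2 * rff_sum)

# Mean form: k * mean(rcf) / sqrt(k + k * (k - 1) * mean(rff)).
mean_rcf = rcf_sum / k
mean_rff = rff_sum / (k * (k - 1) / 2)
merit_mean_form = k * mean_rcf / math.sqrt(k + k * (k - 1) * mean_rff)
```

Under those assumptions the two forms agree, because k * (k - 1) times the mean over k * (k - 1) / 2 pairs equals twice the sum. The factor (n_features ** 2 - n_features) would apply if `rff` held the mean rather than the sum.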

Examples are not compatible with scikit-learn 0.20.4

I have been trying to execute the examples in source code (in particular http://featureselection.asu.edu) and I am struggling with which scikit-learn version to use.

By default (if no particular version is specified), pip downloads scikit-learn version 0.20.4. This version yields the following error:

ImportError: cannot import name cross_validation

I have tried manually installing older versions, but got different errors.

Versions 0.10 and 0.12 yield

ImportError: cannot import name accuracy_score

Version 0.15 yields

ImportError: No module named skfeature.function.similarity_based

Could you tell me which versions of scikit-learn, numpy and scipy should be installed to execute the examples?

Anyway, I am able to import the algorithms from skfeature.function.*, the issue is on running the examples.

Thank you.

System configuration:
Python 2.7.17 (Anaconda)
Numpy 1.16.4
SciPy 1.2.3

Parallel evaluation of MI based measures

I added a PR for a prototype solution (which you closed without comment). If you would rather use another library, e.g. joblib, that should be relatively straightforward and I can take a look at it.

How can I use the entropy score for unsupervised feature selection with python?

I would like to rank my features by entropy score and select the m most relevant features. My dataset is unlabeled. How can I do this with Python?

CFS Return values [0 1 2 3 4 5]

Here, CFS always returns the values [0 1 2 3 4 5], regardless of the dataset or its size.

number of selected features

# perform evaluation on classification task
num_fea = 100 # number of selected features
clf = svm.LinearSVC() # linear SVM
This is the code from your script.
I would like to know how you chose num_fea = 100 (the number of selected features).
Is there any criterion? In some scripts you set it to 10.
If I have 193 features, what value should I give num_fea?

Please help me understand this.

Thanks

setup.py is missing from the repo

module 'skfeature.function.similarity_based.reliefF' has no attribute 'feature_ranking'

This error popped up... is the feature ranking embedded as the output score now?

Get " sparse matrix length is ambiguous; use getnnz() or shape[0]" error!

corpus, categories = get_detail_content_category()

vectorizer = TfidfVectorizer(max_df=1.0, max_features=6000, min_df=1,
                             stop_words=get_cn_stopwords(),
                             encoding='utf-8', decode_error='ignore',
                             analyzer='word', tokenizer=cn_tokenize)

X = vectorizer.fit_transform(corpus)
y = categories

idx = ICAP.icap(X, y, n_selected_features=1000)
selected_X = X[:, idx[0:1000]]

When I run this code, I get the error in the title. I don't know why; any help is appreciated.

Bug in `construct_W.py`

Hi there,

I'm trying to construct the W weight matrix to work with lap_score on the following simple dataset: employes-region.txt. I've tried the following code, which is provided as an example in file test_lap_score.py:

    kwargs_W = {"metric": "euclidean", "neighbor_mode": "knn", "weight_mode": "heat_kernel", "k": 5, 't': 1}
    W = construct_W.construct_W(X, **kwargs_W)

Unfortunately, it fails with the following exception at line 152 of file construct_W.py:

could not broadcast input array from shape (25) into shape (30)

I've gone through the code, and I think that the problem's that the dimensions of G are wrong. This is the piece of code involved in the exception:

            t = kwargs['t']
            # compute pairwise euclidean distances
            D = pairwise_distances(X)
            D **= 2
            # sort the distance matrix D in ascending order
            dump = np.sort(D, axis=1)
            idx = np.argsort(D, axis=1)  #  *** 1
            idx_new = idx[:, 0:k+1]  #  *** 2
            dump_new = dump[:, 0:k+1] #  *** 2
            # compute the pairwise heat kernel distances
            dump_heat_kernel = np.exp(-dump_new/(2*t*t))
            G = np.zeros((n_samples*(k+1), 3)) #  *** 2
            G[:, 0] = np.tile(np.arange(n_samples), (k+1, 1)).reshape(-1) #  *** 2
            G[:, 1] = np.ravel(idx_new, order='F') # *** EXCEPTION HERE!!
            G[:, 2] = np.ravel(dump_heat_kernel, order='F')
            # build the sparse affinity matrix W
            W = csc_matrix((G[:, 2], (G[:, 0], G[:, 1])), shape=(n_samples, n_samples))
            bigger = np.transpose(W) > W
            W = W - W.multiply(bigger) + np.transpose(W).multiply(bigger)

I think that there's a problem at line *** 1. Should it compute idx using dump? I mean:

            idx = np.argsort(dump, axis=1)  #  *** 1

And the other problem is at the lines *** 2. Shouldn't they use k as a multiplier instead of k+1? That is:

            idx_new = idx[:, 0:k]  #  *** 2
            dump_new = dump[:, 0:k] #  *** 2
            # compute the pairwise heat kernel distances
            dump_heat_kernel = np.exp(-dump_new/(2*t*t))
            G = np.zeros((n_samples*(k), 3)) #  *** 2
            G[:, 0] = np.tile(np.arange(n_samples), (k, 1)).reshape(-1) #  *** 2

I've fixed my local installation using this patch and I've run the system on a large collection with 200+ datasets. It works correctly now.

I've seen that there are many other lines in which a similar patch might apply, but I haven't tried other configuration options.

Thanks! Regards
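
The two sizes in the error message, 25 and 30, would be consistent with a dataset of 5 samples and k = 5, where slicing k + 1 columns silently returns only 5. That is an inference from the numbers, not something the issue states. The sketch below reproduces the size mismatch on random stand-in data; the `k_safe` guard is a hypothetical alternative and is not part of construct_W.py.

```python
import numpy as np

n_samples, k = 5, 5
rng = np.random.default_rng(0)
D = rng.random((n_samples, n_samples))  # stand-in for the squared distances

idx = np.argsort(D, axis=1)
idx_new = idx[:, 0:k + 1]               # slicing clips silently to 5 columns

expected_rows = n_samples * (k + 1)                # rows allocated for G
actual_values = np.ravel(idx_new, order='F').size  # values actually available

# Hypothetical guard: keep k + 1 within the number of samples.
k_safe = min(k, n_samples - 1)
idx_safe = idx[:, 0:k_safe + 1]
```

Here `expected_rows` is 30 and `actual_values` is 25, which matches the shapes in the reported exception.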

Limited Features

We have a 72 x 3571 dataset, which means that we have fewer features than samples. We tested our dataset with your spec() feature selection technique, and the output is an array of zeros.
Kindly check this issue.

AttributeError: module 'skfeature.function.similarity_based.reliefF' has no attribute 'feature_ranking'

I have already installed skfeature-chappers (1.0.2), but I got an AttributeError:

AttributeError: module 'skfeature.function.similarity_based.reliefF' has no attribute 'feature_ranking'

Example code for calculating fisher score

Hey! Could you please provide example code for calculating the Fisher score, found at the path

skfeature.function.similarity_based.fisher_score

Could you also explain which class labels have to be provided, given that the function should extract the Fisher features and provide the labels?
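
To illustrate what inputs a Fisher score needs, here is a self-contained sketch of the usual definition: between-class scatter divided by within-class scatter, per feature. It is not the code in skfeature.function.similarity_based.fisher_score, and `fisher_score_sketch` is a hypothetical name. The labels `y` are simply one known class label per row of `X`; the score is supervised and does not produce labels itself.

```python
import numpy as np


def fisher_score_sketch(X, y):
    """Between-class over within-class variance, one score per feature."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    overall_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for label in np.unique(y):
        Xc = X[y == label]  # rows belonging to this class
        between += len(Xc) * (Xc.mean(axis=0) - overall_mean) ** 2
        within += len(Xc) * Xc.var(axis=0)
    # Guard against division by zero for constant features.
    return between / np.maximum(within, 1e-12)


# y holds one class label per row of X.
X = np.array([[1.0, 5.0], [1.2, 1.0], [3.0, 5.2], [3.2, 0.8]])
y = np.array([0, 0, 1, 1])

scores = fisher_score_sketch(X, y)
ranking = np.argsort(scores)[::-1]  # best feature first
```

In this toy data the first feature separates the two classes and gets a high score, while the second has the same mean in both classes and scores close to zero.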

test_svm_backward.py always chooses trivial features [0,1,...,1022,1023]

Hi,

In the test_svm_backward.py test file, the features chosen with the svm_backward selection method are always the same ([0,1,...,1023]).

Please add Python 3 support

A large portion of the Python user base doesn't use Python 2 any more, and therefore can't use a package that doesn't support Python 3. Please add Python 3 support; it shouldn't take too much extra work.

Create a pip package

Dear authors

Congratulations. The package is very good.
My research group is using the scikit-feature inside other projects and we would like to know if is possible to generate a pypi package.

Error occurs when running an example on Python 3

Hi, thanks for your great work.
I ran into an error when I tried to run the test_MRMR.py example.
It seems that some of the code targets Python 2.x and crashes when run on Python 3.x.

Could you please update this part of the code so that it works on Python 3?
Thanks a lot.

reliefF error

Hi, when I run reliefF, I get the following error:

File "C:\Users\Massimo\Anaconda3\lib\site-packages\skfeature\function\similarity_based\reliefF.py", line 101, in reliefF
score += near_miss_term[label]/(k*p_dict[label])

TypeError: ufunc 'add' output (typecode 'O') could not be coerced to provided output parameter (typecode 'd') according to the casting rule ''same_kind''

Can you help me please?

Massimo

Error in Laplacian Score function

The following is the error message:
C:\ProgramData\Anaconda3\envs\py27\lib\site-packages\skfeature\function\similarity_based\lap_score.pyc in lap_score(X, **kwargs)
34 W = construct_W(X)
35 # construct the affinity matrix W
---> 36 W = kwargs['W']
37 # build the diagonal D matrix from affinity matrix W
38 D = np.array(W.sum(axis=1))

To fix it, line 34 needs to be changed to the following:
kwargs['W'] = construct_W(X)

UDFS Errors

UDFS.py L97: The additive parameter \lambda should be independent from gamma (introduced in eq 8 in the paper). It should probably default to something small. It's just used to make the covariance invertible.

Also, construction of S_i seems incorrect to me: UDFS.py L100: indexing on idx_new should be idx_new[:,q]?

construct_W modifies X with cosine metric

Dear,

As the title indicates, when the function construct_W is called using the 'cosine' option, the following operation is done:

for i in range(n_samples):
    X[i, :] = X[i, :]/max(1e-12, X_normalized[i])

But this actually changes the values contained in the input variable X outside the function. This may be problematic when we want to use X elsewhere and nothing indicated that X was modified.

I realized it by running the same code before and after calling construct_W and getting different results.

best,
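
The side effect can be reproduced outside the library. In this sketch `normalize_rows_inplace` mimics the loop quoted above, on the assumption that `X_normalized` holds the row norms, and `normalize_rows_copy` is a hypothetical alternative that works on a private copy. Neither function is taken from construct_W.py.

```python
import numpy as np


def normalize_rows_inplace(X):
    """Mimics the quoted loop: overwrites the caller's array."""
    norms = np.sqrt((X ** 2).sum(axis=1))
    for i in range(X.shape[0]):
        X[i, :] = X[i, :] / max(1e-12, norms[i])
    return X


def normalize_rows_copy(X):
    """Same normalization on a private float copy; the input stays untouched."""
    X = np.array(X, dtype=float, copy=True)
    norms = np.sqrt((X ** 2).sum(axis=1))
    return X / np.maximum(norms, 1e-12)[:, None]


X = np.array([[3.0, 4.0], [0.0, 2.0]])
original = X.copy()

safe = normalize_rows_copy(X)
unchanged_after_copy = np.array_equal(X, original)

normalize_rows_inplace(X)
changed_after_inplace = not np.array_equal(X, original)
```

Until the function is changed, passing `X.copy()` at the call site gives the same protection.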

The typo in description

"Open" rather than "Oepn"