jundongl / scikit-feature
open-source feature selection repository in python
License: GNU General Public License v2.0
Currently I have to go through a hassle to install this dependency for my project. If it were published on PyPI, this package would be more accessible.
Hi,
while installing the scikit-feature, I get the following error:
Could not find a version that satisfies the requirement scikit-feature (from versions: )
No matching distribution found for scikit-feature.
I tried it with Python 2.7 and 3.6, but the same error occurred. Could you please help me?
Is there an "out of the box" way to use this directly as a scikit-learn "transformer"? That is, do the methods support fit, fit_transform, etc.?
Sorry if this exists and I missed it somewhere! (Without this, the methods can't be used directly with sklearn's pipelines or CV methods.)
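One way to get there without changes to the package is a thin adapter class. The sketch below is not part of skfeature: RankingSelector and variance_rank are hypothetical names, and it assumes the wrapped function returns feature indices ordered best first (which some, but not all, skfeature functions do).

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class RankingSelector(BaseEstimator, TransformerMixin):
    """Hypothetical adapter around a function that returns feature indices, best first."""

    def __init__(self, rank_func=None, n_selected_features=10):
        self.rank_func = rank_func
        self.n_selected_features = n_selected_features

    def fit(self, X, y=None):
        # Cast to int so the indices are always valid for NumPy indexing.
        ranked = np.asarray(self.rank_func(X, y), dtype=int)
        self.selected_idx_ = ranked[: self.n_selected_features]
        return self

    def transform(self, X):
        return np.asarray(X)[:, self.selected_idx_]


def variance_rank(X, y=None):
    """Stand-in ranking function (not from skfeature): highest variance first."""
    return np.argsort(np.var(X, axis=0))[::-1]


X = np.array([[0, 1, 10], [0, 2, 20], [0, 3, 30]])
selector = RankingSelector(rank_func=variance_rank, n_selected_features=2)
X_selected = selector.fit_transform(X)
```

Because the class follows the fit/transform convention, it could in principle be placed inside a scikit-learn Pipeline; the ranking function would be swapped for the selection method of interest.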
Hi.
I have tried JMI and MRMR, but I got the following error:
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
The same error is raised when I run the examples provided with the package, such as test_JMI.py.
Hello:
I am getting the error below when running unsupervised feature selection with the Laplacian Score.
Traceback (most recent call last):
File "testfs.py", line 13, in
score = lap_score.lap_score(frame, W=W)
File "/usr/local/lib/python2.7/dist-packages/skfeature/function/similarity_based/lap_score.py", line 42, in lap_score
Xt = np.transpose(X)
File "/usr/lib/python2.7/dist-packages/numpy/core/fromnumeric.py", line 534, in transpose
return transpose(axes)
TypeError: transpose() takes exactly 1 argument (2 given)
Thanks
Sayantan Guha
warnings.warn(
"The linear_assignment_ module is deprecated in 0.21 "
"and will be removed from 0.23. Use "
"scipy.optimize.linear_sum_assignment instead.",
DeprecationWarning)
Hello, when I use gini_index to get the importance of the features, the output is always:
gini_index : [0.5 0.5 0.5 0.5 0.5]
Is there a problem with this function?
Hello there! Hope you're doing fine.
I was just trying to use the SPEC and Laplacian Score modules to de-noise a BoW (489 docs, 7895 terms) and got the following errors:
SPEC:
File "", line 2, in
spectral_fs.spec(x)
File "C:\Users\Erick Garciaoliva\Anaconda3\lib\site-packages\skfeature\function\similarity_based\SPEC.py", line 74, in spec
l = LA.norm(F_hat)
File "C:\Users\Erick Garciaoliva\Anaconda3\lib\site-packages\numpy\linalg\linalg.py", line 2450, in norm
sqnorm = dot(x, x)
File "C:\Users\Erick Garciaoliva\Anaconda3\lib\site-packages\scipy\sparse\base.py", line 481, in __mul__
raise ValueError('dimension mismatch')
ValueError: dimension mismatch
LP-Score:
File "", line 1, in
lapscore = LaplacianScore(x)
File "", line 35, in LaplacianScore
t=np.matmul(np.matmul(Xt,D.toarray()),I)/np.matmul(np.matmul(np.transpose(I),D.toarray()),I)
ValueError: matmul: Input operand 0 does not have enough dimensions (has 0, gufunc core with signature (n?,k),(k,m?)->(n?,m?) requires 1)
Maybe this has to do with the data type of the BoW?
I fixed this by changing:
import entropy_estimators as ee
to
import skfeature.utility.entropy_estimators as ee
in mutual_information.py
AttributeError Traceback (most recent call last)
in ()
----> 1 mod(0,1,2,0,1)
in mod(cross, smote, norstd, model, graph)
96 if cross==0:
97 X2=stdselector(X,norstd)
---> 98 X3=fcb(X2,Y)
99 X_train, X_test, Y_train, Y_test=nocross(X3,Y)
100 print(X2.shape)
in fcb(X_train, Y_train)
88 print("10foldcrossvalidation mean SPECIFICITY",np.mean(spe))
89 def fcb(X_train,Y_train):
---> 90 idx =CFS.cfs(X_train,Y_train)
91 features = X[:, idx[0:num_fea]]
92 print(features)
c:\python\lib\site-packages\skfeature\function\statistical_based\CFS.py in cfs(X, y)
70 F.append(i)
71 # calculate the merit of current selected features
---> 72 t = merit_calculation(X[:, F], y)
73 if t > merit:
74 merit = t
c:\python\lib\site-packages\skfeature\function\statistical_based\CFS.py in merit_calculation(X, y)
28 for i in range(n_features):
29 fi = X[:, i]
---> 30 rcf += su_calculation(fi, y)
31 for j in range(n_features):
32 if j > i:
c:\python\lib\site-packages\skfeature\utility\mutual_information.py in su_calculation(f1, f2)
57 # calculate information gain of f1 and f2, t1 = ig(f1,f2)
58 t1 = information_gain(f1, f2)
---> 59 # calculate entropy of f1, t2 = H(f1)
60 t2 = ee.entropyd(f1)
61 # calculate entropy of f2, t3 = H(f2)
c:\python\lib\site-packages\skfeature\utility\mutual_information.py in information_gain(f1, f2)
17
18 ig = ee.entropyd(f1) - conditional_entropy(f1, f2)
---> 19 return ig
20
21
AttributeError: 'function' object has no attribute 'entropyd'
I found that for some continuous variables, the entropy_estimators library returns a negative number. Here is the reply I got from the author of this library:
For continuous variables, this package is calculating the differential entropy. Unfortunately, the differential entropy can be negative, making interpretation more difficult than in the discrete case. See chapter 8 of Cover and Thomas, for example, for a discussion of how to interpret negative differential entropies. (Consider, for instance, the differential entropy for a Gaussian which is proportional to log variance. If the variance is small, you get a negative number.)
My question concerns the information-theoretic methods that use this library for entropy calculation: if the entropy result is negative, will the feature selection result still be valid?
Thanks
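The Gaussian case mentioned in the reply can be checked numerically. This is a small sketch using the closed form h = 0.5 * log(2 * pi * e * variance); the function name is illustrative and is not part of entropy_estimators.

```python
import math


def gaussian_differential_entropy(variance, base=2):
    """Differential entropy of a 1-D Gaussian with the given variance."""
    return 0.5 * math.log(2 * math.pi * math.e * variance, base)


wide = gaussian_differential_entropy(1.0)     # positive
narrow = gaussian_differential_entropy(0.01)  # negative: small variance
# The sign flips where 2 * pi * e * variance == 1.
threshold = 1.0 / (2 * math.pi * math.e)
```

A negative value here only reflects how concentrated the density is, so it is not by itself a sign of a bug in the estimator.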
Is there a way I can get the JMI values that are calculated for feature selection?
I was trying to use CFS.py with python3. It gave me the following error:
File "/usr/local/lib/python3.5/dist-packages/skfeature/function/statistical_based/CFS.py", line 2, in <module>
from skfeature.utility.mutual_information import su_calculation
File "/usr/local/lib/python3.5/dist-packages/skfeature/utility/mutual_information.py", line 1, in <module>
import entropy_estimators as ee
ImportError: No module named 'entropy_estimators'
Please tell me what should I do to solve this problem. Am I missing something or doing something wrong?
Thanks.
D:\prj>python findFeatures.py
Traceback (most recent call last):
File "findFeatures.py", line 7, in
from skfeature.function.information_theoretical_based import MIM # infogain
File "D:\ProgramData\Anaconda3\lib\site-packages\skfeature\function\information_theoretical_based\MIM.py", line 1, in
import LCSI
ModuleNotFoundError: No module named 'LCSI'
For the library to work, I have to convert the sparse matrix to dense, which takes a lot of memory; because of that, I am sometimes unable to perform the required task due to a memory error. Specifically, I am talking about the statistical methods such as CHI2, giniIndex, etc.
Hi there!
There seems to be an error in lap_score.py. Please take a look at the following excerpt:
# if 'W' is not specified, use the default W
if 'W' not in kwargs.keys():
    W = construct_W(X)
# construct the affinity matrix W
W = kwargs['W']
If the user does not pre-compute W, then the last line results in a KeyError. I think it's easy to fix, since there seems to be a missing else:
if 'W' not in kwargs.keys():
    # if 'W' is not specified, use the default W
    W = construct_W(X)
else:
    # construct the affinity matrix W
    W = kwargs['W']
Thanks. Regards.
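The if/else fix proposed above can be exercised in isolation. In this sketch construct_W is a stub standing in for the real skfeature helper, and resolve_W is a hypothetical name used only for the demonstration.

```python
def construct_W(X):
    """Stub standing in for skfeature's construct_W (assumption for this sketch)."""
    return "default W"


def resolve_W(X, **kwargs):
    # Use the caller's W when given, otherwise build the default one.
    if 'W' not in kwargs:
        W = construct_W(X)
    else:
        W = kwargs['W']
    return W


without_W = resolve_W([[1.0, 2.0]])
with_W = resolve_W([[1.0, 2.0]], W="precomputed W")
```

With the else branch in place, calling without W no longer reaches kwargs['W'], so the KeyError described above cannot occur.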
According to the reference paper, 'Unsupervised Feature Selection for Multi-Cluster Data' by Deng Cai et al., equation 4 takes the maximum of the absolute value. I think line 69 of MCFS.py needs to be corrected accordingly, from W.max(1) to np.absolute(W).max(1).
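The difference between the two expressions shows up as soon as a row contains a large negative coefficient. The matrix below is made-up illustration data, not output from MCFS.

```python
import numpy as np

# Rows are features, columns are per-cluster coefficients (illustrative values).
W = np.array([[0.2, -0.9],
              [0.5, 0.1]])

signed_score = W.max(1)            # ignores the magnitude of -0.9
abs_score = np.absolute(W).max(1)  # largest magnitude per feature
```

With the signed maximum the first feature scores 0.2 and ranks below the second; with the absolute maximum it scores 0.9 and ranks above it.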
Hi,
There is a typo on line 39 of the FCBF.py file. I fixed it and was able to get it working on my system by changing dtypes to dtype and removing the quotes around object.
Current: t1 = np.zeros((n_features, 2), dtypes='object')
Correct: t1 = np.zeros((n_features, 2), dtype=object)
Best,
Sparkle
TypeError Traceback (most recent call last)
in
8
9 # score = svm_backward.svm_backward(X_train_values, y_train_reg_values,50)
---> 10 score = svm_forward.svm_forward(X_train_values, y_train_reg_values, 50)
~/anaconda3/envs/ish3test/lib/python3.6/site-packages/skfeature/function/wrapper/svm_forward.py in svm_forward(X, y, n_selected_features)
26 n_samples, n_features = X.shape
27 # using 10 fold cross validation
---> 28 cv = KFold(n_samples, n_folds=10, shuffle=True)
29 # choose SVM as the classifier
30 clf = SVC()
TypeError: __init__() got an unexpected keyword argument 'n_folds'
I'm wondering whether it's because I'm using Python3?
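The message points at the scikit-learn version rather than at Python 3: since scikit-learn 0.18, KFold lives in sklearn.model_selection, takes n_splits instead of n_folds, and receives the data through split(). A minimal sketch with made-up data:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(40, dtype=float).reshape(20, 2)  # 20 samples, 2 features

# New-style API: the number of samples is no longer a constructor argument.
cv = KFold(n_splits=10, shuffle=True, random_state=0)
fold_sizes = [len(test_idx) for _, test_idx in cv.split(X)]
```

Code written against the old interface, such as the svm_forward line in the traceback, would need the same kind of change before it runs on a recent scikit-learn.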
Hello:
I am running unsupervised feature selection with the Laplacian Score, but I am getting the error message below.
Traceback (most recent call last):
File "fs.py", line 12, in
W = construct_W.construct_W(frame, **kwargs_W)
File "/usr/local/lib/python2.7/dist-packages/skfeature/utility/construct_W.py", line 141, in construct_W
D = pairwise_distances(X)
File "/usr/local/lib/python2.7/dist-packages/sklearn/metrics/pairwise.py", line 1207, in pairwise_distances
return _parallel_pairwise(X, Y, func, n_jobs, **kwds)
File "/usr/local/lib/python2.7/dist-packages/sklearn/metrics/pairwise.py", line 1054, in _parallel_pairwise
return func(X, Y, **kwds)
File "/usr/local/lib/python2.7/dist-packages/sklearn/metrics/pairwise.py", line 231, in euclidean_distances
distances = safe_sparse_dot(X, Y.T, dense_output=True)
File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/extmath.py", line 184, in safe_sparse_dot
return fast_dot(a, b)
MemoryError
I have updated scikit-learn, but the issue persists. Any input would be helpful.
Regards
Sayantan Guha
It would be nice to know how to use the tool (also to show that it has a scikit interface).
Line 53 of the FCBF algorithm raises an IndexError. It can be fixed by writing fp = X[:, int(s_list[idx, 0])] instead of
fp = X[:, s_list[idx, 0]]
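The reason the cast helps is that NumPy rejects float values as indices, even when they are whole numbers. A small reproduction with made-up arrays (s_list here only mimics the shape of the one in FCBF.py):

```python
import numpy as np

X = np.arange(12).reshape(3, 4)
# Feature index and score stored together in a float array.
s_list = np.array([[2.0, 0.9],
                   [0.0, 0.5]])

try:
    X[:, s_list[0, 0]]  # float index
    raised = False
except IndexError:
    raised = True

fp = X[:, int(s_list[0, 0])]  # integer index selects column 2
```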
K-L is used in
https://github.com/jundongl/scikit-feature/blob/master/skfeature/utility/entropy_estimators.py
def entropy(x, k=3, base=2):
    """
    The classic K-L k-nearest neighbor continuous entropy estimator x should be a list of vectors,
    e.g. x = [[1.3],[3.7],[5.1],[2.4]] if x is a one-dimensional scalar and we have four samples
    """
Is this the Kozachenko-Leonenko k-nearest neighbour estimator, used to estimate the entropy?
https://stackoverflow.com/questions/43265770/entropy-python-implementation
Do you have examples of how to use your code, especially for
def midd(x, y):
    """
    Discrete mutual information estimator given a list of samples which can be any hashable object
    """
    return -entropyd(list(zip(x, y)))+entropyd(x)+entropyd(y)
from
https://github.com/jundongl/scikit-feature/blob/master/skfeature/utility/entropy_estimators.py
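As a rough usage illustration, the identity in midd can be tried on toy data. The entropyd below is a re-implementation written for this sketch from the standard plug-in estimate, so it may differ in detail from the library's version; only the midd expression is taken from the snippet above.

```python
from collections import Counter
from math import log


def entropyd(samples, base=2):
    """Plug-in entropy of a list of hashable samples (sketch, not library code)."""
    n = len(samples)
    counts = Counter(samples)
    return -sum((c / n) * log(c / n, base) for c in counts.values())


def midd(x, y):
    # I(X;Y) = H(X) + H(Y) - H(X,Y)
    return -entropyd(list(zip(x, y))) + entropyd(x) + entropyd(y)


x = [0, 0, 1, 1]
mi_identical = midd(x, [0, 0, 1, 1])    # y is a copy of x
mi_independent = midd(x, [0, 1, 0, 1])  # y carries no information about x
```

For these inputs the mutual information equals the entropy of x in the first case (1 bit) and is zero in the second.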
Can we include sample datasets to have a proper testing suite?
In the code for cfs(X, y), you repeatedly call the function merit_calculation(X, y), which itself repeatedly calls the function su_calculation, sometimes with exactly the same feature(s) as in previous rounds.
To avoid repeatedly computing su_calculation(fi, y) with the same feature fi, it would be ideal to save the results in a list or a dictionary the first time they are computed, and to load those values instead of recomputing them afterwards. That would ensure the linear complexity of the algorithm and improve its speed.
This could be the code to achieve that:
def merit_calculation(X, y, F, memo):
    # memo caches su_calculation results: key i for (fi, y), key (i, j) for (fi, fj)
    rff = 0
    rcf = 0
    for i in F:
        fi = X[:, i]  # needed below even when memo[i] is already cached
        if i not in memo:
            memo[i] = su_calculation(fi, y)
        rcf += memo[i]
        for j in F:
            if j > i:
                if (i, j) not in memo:
                    fj = X[:, j]
                    memo[(i, j)] = su_calculation(fi, fj)
                rff += memo[(i, j)]
    rff *= 2
    merits = rcf / np.sqrt(len(F) + rff)
    return merits
And the usage, supplying the indices and the memory dictionary on each call.
...
memo = {}
...
t = merit_calculation(X, y, F, memo)
F = ...
t = merit_calculation(X, y, F, memo)
F = ...
...
Hi
Maybe I'm wrong but I think that
rff *= 2
should be:
rff *= (n_features **2 - n_features)
in the merit calculation function
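For reference, the two forms can be compared against the usual statement of the CFS merit (Hall's definition), assuming that is what the code implements and that rcf and rff hold sums rather than averages, with rff summed over pairs j > i as in the loop. Here k stands for n_features.

```latex
\mathrm{Merit}_S
  = \frac{k\,\bar{r}_{cf}}{\sqrt{k + k(k-1)\,\bar{r}_{ff}}},
\qquad
\bar{r}_{cf} = \frac{1}{k}\sum_{i} SU(f_i, y),
\qquad
\bar{r}_{ff} = \frac{2}{k(k-1)}\sum_{i<j} SU(f_i, f_j)

k\,\bar{r}_{cf} = \sum_{i} SU(f_i, y) = \texttt{rcf},
\qquad
k(k-1)\,\bar{r}_{ff} = 2\sum_{i<j} SU(f_i, f_j) = 2\,\texttt{rff}

\Rightarrow\quad
\mathrm{Merit}_S = \frac{\texttt{rcf}}{\sqrt{k + 2\,\texttt{rff}}}
```

Under those assumptions the factor k(k-1) cancels against the averaging, which would leave the factor 2. The factor (n_features ** 2 - n_features) would be the right one only if rff held the average pairwise correlation instead of the sum.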
Thanks for sharing your code!
I have been trying to execute the examples in the source code (in particular http://featureselection.asu.edu) and I am struggling with which scikit-learn version to use.
By default (if no particular version is specified), pip downloads scikit-learn version 0.20.4. This version yields the following error:
ImportError: cannot import name cross_validation
I have tried manually installing older versions, but got different errors.
Versions 0.10 and 0.12 yield
ImportError: cannot import name accuracy_score
Version 0.15 yields
ImportError: No module named skfeature.function.similarity_based
Could you tell me which versions of scikit-learn, numpy and scipy should be installed to execute the examples?
Anyway, I am able to import the algorithms from skfeature.function.*; the issue is with running the examples.
Thank you.
System configuration:
Python 2.7.17 (Anaconda)
Numpy 1.16.4
SciPy 1.2.3
I added a PR for a prototype solution (which you closed without comment). If you would rather use another library, e.g. joblib, it should be relatively straightforward, and I could take a look at that.
I would like to rank my features by entropy score and select the m most relevant features. My dataset is unlabeled. How can I do this in Python?
Here, CFS always returns the values [0 1 2 3 4 5], no matter which dataset or dataset size is used.
# perform evaluation on classification task
num_fea = 100 # number of selected features
clf = svm.LinearSVC() # linear SVM
Here is the code from your script.
I want to know how you chose num_fea = 100 (the number of selected features).
Is there any criterion? In some scripts you set it to 10.
If I have 193 features, what value should I give to num_fea?
Please help me understand this.
Thanks
This error popped up... is the feature ranking embedded as the output score now?
corpus, categories = get_detail_content_category()
vectorizer = TfidfVectorizer(max_df=1.0, max_features=6000, min_df=1,
                             stop_words=get_cn_stopwords(),
                             encoding='utf-8', decode_error='ignore',
                             analyzer='word', tokenizer=cn_tokenize)
X = vectorizer.fit_transform(corpus)
y = categories
idx = ICAP.icap(X, y, n_selected_features=1000)
selected_X = X[:, idx[0:1000]]
After I run this code, I get the error in the title. I don't know why; any help is appreciated.
Hi there,
I'm trying to construct the W weight matrix to work with lap_score on the following simple dataset: employes-region.txt. I've tried the following code, which is provided as an example in file test_lap_score.py:
kwargs_W = {"metric": "euclidean", "neighbor_mode": "knn", "weight_mode": "heat_kernel", "k": 5, 't': 1}
W = construct_W.construct_W(X, **kwargs_W)
Unfortunately, it fails with the following exception at line 152 of file construct_W.py:
could not broadcast input array from shape (25) into shape (30)
I've gone through the code, and I think that the problem is that the dimensions of G are wrong. This is the piece of code involved in the exception:
t = kwargs['t']
# compute pairwise euclidean distances
D = pairwise_distances(X)
D **= 2
# sort the distance matrix D in ascending order
dump = np.sort(D, axis=1)
idx = np.argsort(D, axis=1) # *** 1
idx_new = idx[:, 0:k+1] # *** 2
dump_new = dump[:, 0:k+1] # *** 2
# compute the pairwise heat kernel distances
dump_heat_kernel = np.exp(-dump_new/(2*t*t))
G = np.zeros((n_samples*(k+1), 3)) # *** 2
G[:, 0] = np.tile(np.arange(n_samples), (k+1, 1)).reshape(-1) # *** 2
G[:, 1] = np.ravel(idx_new, order='F') # *** EXCEPTION HERE!!
G[:, 2] = np.ravel(dump_heat_kernel, order='F')
# build the sparse affinity matrix W
W = csc_matrix((G[:, 2], (G[:, 0], G[:, 1])), shape=(n_samples, n_samples))
bigger = np.transpose(W) > W
W = W - W.multiply(bigger) + np.transpose(W).multiply(bigger)
I think that there's a problem at line *** 1. Should it compute idx using dump? I mean:
idx = np.argsort(dump, axis=1) # *** 1
And the other problem is at the lines *** 2. Shouldn't they use k as a multiplier instead of k+1? That is:
idx_new = idx[:, 0:k] # *** 2
dump_new = dump[:, 0:k] # *** 2
# compute the pairwise heat kernel distances
dump_heat_kernel = np.exp(-dump_new/(2*t*t))
G = np.zeros((n_samples*(k), 3)) # *** 2
G[:, 0] = np.tile(np.arange(n_samples), (k, 1)).reshape(-1) # *** 2
I've fixed my local installation using this patch and I've run the system on a large collection with 200+ datasets. It works correctly now.
I've seen that there are many other lines in which a similar patch might apply, but I haven't tried other configuration options.
Thanks! Regards
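The shapes in the exception, (25) and (30), are what the excerpt would produce with, for example, 5 samples and k = 5: slicing k+1 = 6 columns from a 5-column matrix silently yields only 5. That the dataset has 5 rows is an assumption; the sketch below reproduces the mismatch on made-up data and shows one possible guard (clamping k), which is an alternative to the patch described above and not the library's behaviour.

```python
import numpy as np

n_samples, k = 5, 5  # assumed sizes, chosen to match the reported shapes

# Toy pairwise distances: |i - j| between sample indices.
D = np.abs(np.subtract.outer(np.arange(n_samples), np.arange(n_samples))).astype(float)
idx = np.argsort(D, axis=1)

idx_new = idx[:, 0:k + 1]               # only n_samples columns are available
G = np.zeros((n_samples * (k + 1), 3))  # 30 rows expected
flat = np.ravel(idx_new, order='F')     # but only 25 values exist

try:
    G[:, 1] = flat
    broadcast_failed = False
except ValueError:
    broadcast_failed = True

# Possible guard: never ask for more neighbours than there are other samples.
k_safe = min(k, n_samples - 1)
idx_safe = idx[:, 0:k_safe + 1]
G_safe = np.zeros((n_samples * (k_safe + 1), 3))
G_safe[:, 1] = np.ravel(idx_safe, order='F')
```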
We have a dataset of 72 x 3571, which means that we have fewer features than samples. We tested our dataset with your spec() feature selection technique, and the output is an array of zeros.
Kindly check this issue.
I have already installed skfeature-chappers (1.0.2), but I got an AttributeError.
AttributeError: module 'skfeature.function.similarity_based.reliefF' has no attribute 'feature_ranking'
Hey! Could you please provide example code for calculating the Fisher score, found at the path
skfeature.function.similarity_based.fisher_score
Could you please also explain which class labels have to be provided, given that the function should extract the Fisher features and provide the labels?
Hi,
In the test_svm_backward.py test file, the features chosen with the svm_backward selection method are always the same ([0,1,...,1023]).
A large portion of the Python user base doesn't use Python 2 any more, and therefore can't use a package that doesn't support Python 3. Please add Python 3 support; it shouldn't take too much extra work.
Dear authors
Congratulations. The package is very good.
My research group is using scikit-feature inside other projects, and we would like to know if it is possible to generate a PyPI package.
Hi, thanks for your great work.
I ran into an error when I tried to run the test_MRMR.py example.
It seems that some of the code targets Python 2.x, and it crashes when run on Python 3.x.
Could you please update this part of the code to make it work with Python 3?
Thanks a lot.
Hi, when I run reliefF, I get the following error:
File "C:\Users\Massimo\Anaconda3\lib\site-packages\skfeature\function\similarity_based\reliefF.py", line 101, in reliefF
score += near_miss_term[label]/(k*p_dict[label])
TypeError: ufunc 'add' output (typecode 'O') could not be coerced to provided output parameter (typecode 'd') according to the casting rule ''same_kind''
Can you help me please?
Massimo
The following is the error message:
C:\ProgramData\Anaconda3\envs\py27\lib\site-packages\skfeature\function\similarity_based\lap_score.pyc in lap_score(X, **kwargs)
34 W = construct_W(X)
35 # construct the affinity matrix W
---> 36 W = kwargs['W']
37 # build the diagonal D matrix from affinity matrix W
38 D = np.array(W.sum(axis=1))
To fix it, line 34 needs to be changed as follows:
kwargs['W'] = construct_W(X)
UDFS.py L97: The additive parameter \lambda should be independent from gamma (introduced in eq 8 in the paper). It should probably default to something small. It's just used to make the covariance invertible.
Also, construction of S_i seems incorrect to me: UDFS.py L100: indexing on idx_new should be idx_new[:,q]?
Dear,
As the title indicates, when the function construct_W is called using the 'cosine' option, the following operation is done:
for i in range(n_samples):
X[i, :] = X[i, :]/max(1e-12, X_normalized[i])
But this actually changes the values contained in the input variable X outside the function. This may be problematic when we want to use X elsewhere and nothing indicated that X was modified.
I noticed this by running the same code before and after calling construct_W and getting different results.
best,
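The side effect described above, and one way to avoid it, can be shown outside the library. Both helper functions are hypothetical and only mirror the normalization loop quoted in the issue; the copy-first variant is a possible fix under that assumption, not the package's current code.

```python
import numpy as np


def normalize_rows_in_place(X):
    """Mirrors the quoted loop: overwrites the caller's array."""
    norms = np.sqrt((X * X).sum(axis=1))
    for i in range(X.shape[0]):
        X[i, :] = X[i, :] / max(1e-12, norms[i])
    return X


def normalize_rows_copy(X):
    """Same computation on a private copy, leaving the input untouched."""
    X = np.array(X, dtype=float, copy=True)
    return normalize_rows_in_place(X)


original = np.array([[3.0, 4.0], [6.0, 8.0]])

safe_input = original.copy()
normalized = normalize_rows_copy(safe_input)
unchanged = np.array_equal(safe_input, original)

mutated_input = original.copy()
normalize_rows_in_place(mutated_input)
changed = not np.array_equal(mutated_input, original)
```

Until the function copies its input, passing X.copy() at the call site gives the same protection.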
"Open" rather than "Oepn"