albahnsen / costsensitiveclassification


CostSensitiveClassification Library in Python

License: BSD 3-Clause "New" or "Revised" License

Python 100.00%


costsensitiveclassification's Issues

I want to draw a cost-sensitive decision tree, but I get an error.

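from sklearn import tree
import graphviz
from costcla.models import CostSensitiveDecisionTreeClassifier

f = CostSensitiveDecisionTreeClassifier()
f.fit(X_train, y_train, cost_mat_train)
y_pred_test_csdt = f.predict(X_test)
# The export_graphviz call is where the AttributeError is raised
dot_data = tree.export_graphviz(f, out_file=None,
                                filled=True,
                                rounded=True)
graph = graphviz.Source(dot_data)
graph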

The error:

AttributeError                            Traceback (most recent call last)
<ipython-input-53-1889f70abfe8> in <module>
      2 dot_data = tree.export_graphviz(f, out_file=None,
      3                                 filled=True,
----> 4                                 rounded=True)
      5 graph = graphviz.Source(dot_data)
      6 type(f)

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\tree\export.py in export_graphviz(decision_tree, out_file, max_depth, feature_names, class_names, label, filled, leaves_parallel, impurity, node_ids, proportion, rotate, rounded, special_characters, precision)
    773             rounded=rounded, special_characters=special_characters,
    774             precision=precision)
--> 775         exporter.export(decision_tree)
    776 
    777         if return_string:

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\tree\export.py in export(self, decision_tree)
    407         else:
    408             self.recurse(decision_tree.tree_, 0,
--> 409                          criterion=decision_tree.criterion)
    410 
    411         self.tail()

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\tree\export.py in recurse(self, tree, node_id, criterion, parent, depth)
    451             raise ValueError("Invalid node_id %s" % _tree.TREE_LEAF)
    452 
--> 453         left_child = tree.children_left[node_id]
    454         right_child = tree.children_right[node_id]
    455 

AttributeError: '_tree_class' object has no attribute 'children_left'

Is this a version problem?

Imbalanced prediction on obesity datasets using cost-sensitive algorithms?

Hi @albahnsen,
can you please help me with prediction on imbalanced obesity datasets?

Please reply as soon as possible.

ModuleNotFoundError: No module named 'sklearn.externals.joblib'

This library no longer works with current scikit-learn.
Please update the code to use `import joblib` instead of `from sklearn.externals import joblib`.
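One possible way to keep such code working across scikit-learn versions is to try the standalone joblib package first and fall back to the old bundled location. The sketch below uses a hypothetical helper, import_first, which is not part of costcla or scikit-learn; it simply returns the first candidate module that imports successfully.

```python
import importlib


def import_first(candidates):
    """Return the first importable module among `candidates` (hypothetical helper)."""
    errors = []
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError as exc:  # ModuleNotFoundError is a subclass of ImportError
            errors.append(f"{name}: {exc}")
    raise ImportError("none of the candidates could be imported: " + "; ".join(errors))


# Intended use (assumes at least one of the two locations is installed):
# joblib = import_first(["joblib", "sklearn.externals.joblib"])
```

The same pattern would apply to any other module that moved between releases, provided the old and new locations expose the same names.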

Update example notebook to work with latest version of sklearn

Update the tutorial notebook (https://nbviewer.jupyter.org/github/albahnsen/CostSensitiveClassification/blob/master/doc/tutorials/tutorial_edcs_credit_scoring.ipynb) to work with the latest version of sklearn.

costcla.models is not working with the latest version of sklearn

Problem with imports from pyea

I get an error from `from costcla.models import *`. It is caused by `from pyea import GeneticAlgorithmOptimizer`.
The message is: No module named 'sklearn.externals.joblib'. Is it possible to fix this?

Getting AttributeError: 'bool' object has no attribute 'astype' for CostSensitiveLogisticRegression()

I have code like:

costClassifier = CostSensitiveLogisticRegression()
costClassifier.fit(train_data_features, train_property_labels, open_cost_mat_train)
y_open_pred_test_cslr = costClassifier.predict(test_data_features)

Here train_data_features is a bag-of-words matrix for 15,000 sentences, train_property_labels holds the categorical label of each sentence, and open_cost_mat_train is the cost matrix. Respectively:

   [[0 0 0 ..., 0 0 0]
    [0 0 0 ..., 0 0 0]
    [0 0 0 ..., 0 0 0]
     ..., 
    [0 0 0 ..., 0 0 0]
    [0 0 0 ..., 0 0 0]
    [0 0 0 ..., 0 0 0]] 

 [u'/location/statistical_region/net_migration', u'/location/statistical_region/net_migration', u'/location/statistical_region/net_migration', .....]

 [[ 0.36303512  0.          0.          0.        ]
  [ 0.24472353  0.          0.          0.        ]
  [ 0.18386408  0.          0.          0.        ]
  ..., 
  [ 0.00650667  0.          0.          0.        ]
  [ 0.06445714  0.          0.          0.        ]
  [ 0.05        0.          0.          0.        ]]

My stack trace, however, is:

/Users/dhruv/anaconda/lib/python2.7/site-packages/costcla/metrics/costs.py:76: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  y_true = (y_true == 1).astype(np.float)
Traceback (most recent call last):
  File "/Users/dhruv/Documents/university/ClaimDetection/src/main/costSensitiveClassifier.py", line 272, in <module>
    openCostClassifier.fit(train_data_features, train_property_labels, open_cost_mat_train)
  File "/Users/dhruv/anaconda/lib/python2.7/site-packages/costcla/models/regression.py", line 237, in fit
    res.fit()
  File "/Users/dhruv/anaconda/lib/python2.7/site-packages/pyea/models/ga.py", line 165, in fit
    self.cost_ = self._fitness_function()
  File "/Users/dhruv/anaconda/lib/python2.7/site-packages/pyea/models/ga.py", line 151, in _fitness_function
    for i in range(n_jobs))
  File "/Users/dhruv/anaconda/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 659, in __call__
    self.dispatch(function, args, kwargs)
  File "/Users/dhruv/anaconda/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 406, in dispatch
    job = ImmediateApply(func, args, kwargs)
  File "/Users/dhruv/anaconda/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 140, in __init__
    self.results = func(*args, **kwargs)
  File "/Users/dhruv/anaconda/lib/python2.7/site-packages/pyea/models/ga.py", line 35, in _fitness_function_parallel
    return fitness_function(pop, *fargs).tolist()
  File "/Users/dhruv/anaconda/lib/python2.7/site-packages/costcla/models/regression.py", line 96, in _logistic_cost_loss
    out[i] = _logistic_cost_loss_i(w[i], X, y, cost_mat, alpha)
  File "/Users/dhruv/anaconda/lib/python2.7/site-packages/costcla/models/regression.py", line 53, in _logistic_cost_loss_i
    out = cost_loss(y, y_prob, cost_mat) / n_samples
  File "/Users/dhruv/anaconda/lib/python2.7/site-packages/costcla/metrics/costs.py", line 76, in cost_loss
    y_true = (y_true == 1).astype(np.float)
AttributeError: 'bool' object has no attribute 'astype'
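The FutureWarning in the trace suggests that y_true == 1 was evaluated on string labels, so the comparison collapsed to a single bool instead of an array. One way to avoid this, assuming the task can be framed as binary (one property against the rest), is to encode the labels as 0/1 before calling fit. The helper name binarize_labels and the second label in the demo are made up for illustration.

```python
def binarize_labels(labels, positive_label):
    """Map each label to 1 if it equals `positive_label`, else 0."""
    return [1 if label == positive_label else 0 for label in labels]


labels = [
    "/location/statistical_region/net_migration",
    "/location/statistical_region/population",  # made-up second label for the demo
    "/location/statistical_region/net_migration",
]
y = binarize_labels(labels, "/location/statistical_region/net_migration")
print(y)  # [1, 0, 1]
```

The resulting list would still need to be converted to a NumPy array before being passed to the classifier.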

Push to PyPi

Looks like you have updated the codebase to be compatible with the latest scikit-learn release. Would you please push this (i.e. codebase with recent commits) to PyPi?

Calculation of cost saving

Hello, thank you very much for the project. I am trying to apply a cost-sensitive learning approach to a marketing problem, where a False Negative (a record predicted as a non-customer who actually is a customer) is much worse than a False Positive. As you can see, the cost matrix is not example-dependent but class-dependent. I build the example-dependent cost matrix as follows (shown here for the first 4 records):

	FP	FN
1	2	60
2	2	60
3	2	60
4	2	60

When I call the savings_score() function I get a negative score. What does the 1 - cost / cost_base expression mean?

I am referring to the following piece of code:

    cost = cost_loss(y_true, y_pred, cost_mat)
    return 1.0 - cost / cost_base

The cost value is not normalized, so I really don't understand the point of subtracting it from 1.
Thank you very much!
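A small worked example may help to read the score. It assumes that cost_base is the cost of the cheaper of the two trivial classifiers (predict all 0 or predict all 1) and that each cost matrix row is ordered [C_FP, C_FN, C_TP, C_TN]; both are assumptions to check against the library source. Under those assumptions, cost / cost_base is the normalization, so 1 means no cost at all, 0 means no better than the trivial baseline, and a negative value means worse than the baseline. The functions below are simplified stand-ins, not the library's own code.

```python
def cost_loss(y_true, y_pred, cost_mat):
    """Total cost; each cost_mat row is assumed to be [C_FP, C_FN, C_TP, C_TN]."""
    total = 0.0
    for yt, yp, (c_fp, c_fn, c_tp, c_tn) in zip(y_true, y_pred, cost_mat):
        if yt == 1:
            total += c_tp if yp == 1 else c_fn
        else:
            total += c_fp if yp == 1 else c_tn
    return total


def savings_score(y_true, y_pred, cost_mat):
    """1 - cost / cost_base, with cost_base assumed to be the cheaper trivial classifier."""
    n = len(y_true)
    cost_base = min(cost_loss(y_true, [0] * n, cost_mat),
                    cost_loss(y_true, [1] * n, cost_mat))
    return 1.0 - cost_loss(y_true, y_pred, cost_mat) / cost_base


cost_mat = [[2, 60, 0, 0]] * 4   # FP=2, FN=60, TP=TN=0 for every record
y_true = [1, 0, 0, 0]

perfect = savings_score(y_true, [1, 0, 0, 0], cost_mat)  # cost 0
trivial = savings_score(y_true, [1, 1, 1, 1], cost_mat)  # cost 6, equal to cost_base
worse = savings_score(y_true, [0, 1, 1, 1], cost_mat)    # cost 66
print(perfect, trivial, worse)  # 1.0 0.0 -10.0
```

In this toy setup the baseline is "predict everyone as a customer" (cost 6), so any model that misses the single real customer ends up with a negative score.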

Does not work on Python 3.5

I cannot import costcla on Python 3.5 (Anaconda). The error is "No module named metrics", raised by these imports:
from metrics import *
from dataset import *
from models import *
After installing Python 2.7, it works.

Models cannot be saved and restored

Attempting to save a trained CostSensitiveRandomForestClassifier model using Python's pickle:

pickle.dump(model,open("m1","wb")) ## works without an error
m=pickle.load(open("m1","rb")) ## fails with error "AttributeError: 'module' object has no attribute '_tree_class'"

Attempting to save the same model using joblib.dump from sklearn.externals.joblib:

joblib.dump(model,"m2") ## fails with error "PicklingError: Can't pickle <class costcla.models.cost_tree._tree_class at 0x7f45d8760db8>: it's not found as costcla.models.cost_tree._tree_class"

costcla incompatible with latest version of Scikit-Learn

It seems that scikit-learn versions 0.21 and above have deprecated and removed sklearn.externals.six and sklearn.externals.joblib, according to the release notes. Are there any plans to accommodate this change, or to flag it in the setup.py script (as of now the sklearn requirement is explicitly listed as 'scikit-learn>=0.15.0b2')? I can also take a stab at creating a PR for this issue, but I am unaware of the etiquette and guidelines. Furthermore, I notice there is a lack of automated testing, so I am unsure whether I would break something :D

CS-RF parallel error.

When running CS-RF with n_jobs=2 in a Jupyter notebook, I get the following error. Python version: 2.7. With n_jobs=1, no error occurs.

/usr/lib/python2.7/site-packages/costcla/models/bagging.pyc in fit(self, X, y, cost_mat, sample_weight)
    273                 seeds[starts[i]:starts[i + 1]],
    274                 verbose=self.verbose)
--> 275             for i in range(n_jobs))
    276 
    277         # Reduce

/usr/lib64/python2.7/site-packages/sklearn/externals/joblib/parallel.pyc in __call__(self, iterable)
    787                 # consumption.
    788                 self._iterating = False
--> 789             self.retrieve()
    790             # Make sure that we get a last message telling us we are done
    791             elapsed_time = time.time() - self._start_time

/usr/lib64/python2.7/site-packages/sklearn/externals/joblib/parallel.pyc in retrieve(self)
    697             try:
    698                 if getattr(self._backend, 'supports_timeout', False):
--> 699                     self._output.extend(job.get(timeout=self.timeout))
    700                 else:
    701                     self._output.extend(job.get())

/usr/lib64/python2.7/multiprocessing/pool.pyc in get(self, timeout)
    552             return self._value
    553         else:
--> 554             raise self._value
    555 
    556     def _set(self, i, obj):

MaybeEncodingError: Error sending result: '[([CostSensitiveDecisionTreeClassifier(criterion='direct_cost',
                  criterion_weight=False, max_depth=None,
                  max_features='auto', min_gain=0.001, min_samples_leaf=1,
                  min_samples_split=2, num_pct=100, pruned=True), CostSensitiveDecisionTreeClassifier(criterion='direct_cost',
                  criterion_weight=False, max_depth=None,
                  max_features='auto', min_gain=0.001, min_samples_leaf=1,
                  min_samples_split=2, num_pct=100, pruned=True), CostSensitiveDecisionTreeClassifier(criterion='direct_cost',
                  criterion_weight=False, max_depth=None,
                  max_features='auto', min_gain=0.001, min_samples_leaf=1,
                  min_samples_split=2, num_pct=100, pruned=True), CostSensitiveDecisionTreeClassifier(criterion='direct_cost',
                  criterion_weight=False, max_depth=None,
                  max_features='auto', min_gain=0.001, min_samples_leaf=1,
                  min_samples_split=2, num_pct=100, pruned=True), CostSensitiveDecisionTreeClassifier(criterion='direct_cost',
                  criterion_weight=False, max_depth=None,
                  max_features='auto', min_gain=0.001, min_samples_leaf=1,
                  min_samples_split=2, num_pct=100, pruned=True), CostSensitiveDecisionTreeClassifier(criterion='direct_cost',
                  criterion_weight=False, max_depth=None,
                  max_features='auto', min_gain=0.001, min_samples_leaf=1,
                  min_samples_split=2, num_pct=100, pruned=True), CostSensitiveDecisionTreeClassifier(criterion='direct_cost',
                  criterion_weight=False, max_depth=None,
                  max_features='auto', min_gain=0.001, min_samples_leaf=1,
                  min_samples_split=2, num_pct=100, pruned=True), CostSensitiveDecisionTreeClassifier(criterion='direct_cost',
                  criterion_weight=False, max_depth=None,
                  max_features='auto', min_gain=0.001, min_samples_leaf=1,
                  min_samples_split=2, num_pct=100, pruned=True)], [array([False,  True, False, ...,  True, False,  True], dtype=bool), array([False,  True,  True, ...,  True, False, False], dtype=bool), array([ True,  True,  True, ...,  True, False,  True], dtype=bool), array([ True,  True,  True, ...,  True,  True,  True], dtype=bool), array([False, False,  True, ...,  True,  True,  True], dtype=bool), array([ True,  True,  True, ...,  True,  True,  True], dtype=bool), array([ True,  True,  True, ...,  True, False,  True], dtype=bool), array([ True,  True, False, ...,  True,  True, False], dtype=bool)], [array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
       34, 35, 36, 37, 38, 39]), array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
       34, 35, 36, 37, 38, 39]), array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
       34, 35, 36, 37, 38, 39]), array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
       34, 35, 36, 37, 38, 39]), array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
       34, 35, 36, 37, 38, 39]), array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
       34, 35, 36, 37, 38, 39]), array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
       34, 35, 36, 37, 38, 39]), array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
       34, 35, 36, 37, 38, 39])])]'. Reason: 'PicklingError("Can't pickle <class costcla.models.cost_tree._tree_class at 0x7261ae0>: it's not found as costcla.models.cost_tree._tree_class",)'

pyea imports

ModuleNotFoundError

Hi,
Importing the library fails with errors including:

No module named 'sklearn.externals.joblib'
ImportError: cannot import name 'six' from 'sklearn.externals'
ModuleNotFoundError: No module named 'sklearn.externals.six.moves'
Are you familiar with these errors, and do you know how to handle them?
Thanks in advance!
Danit.

Support for Pandas DataFrame

It seems this library currently doesn't support a Pandas DataFrame as input data, because the following error is raised:

KeyError: '[58976 73597   747 ..., 32194 57641 57730] not in index'

So to use this library, one must call the DataFrame's .as_matrix() to convert the DataFrame into a numpy ndarray.
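Note that .as_matrix() was removed in pandas 1.0; on current pandas the equivalent conversion is .to_numpy() (or the older .values attribute). A minimal sketch with a made-up DataFrame, assuming the library only needs plain arrays:

```python
import pandas as pd

# Toy data for illustration only; column names are made up.
df = pd.DataFrame({"f1": [0.1, 0.2, 0.3], "f2": [1.0, 2.0, 3.0], "target": [0, 1, 0]})

X = df[["f1", "f2"]].to_numpy()   # 2-D feature array
y = df["target"].to_numpy()       # 1-D label array
print(X.shape, y.shape)  # (3, 2) (3,)
```

Converting before the train/test split avoids label-based row lookups, which is the likely source of the "not in index" KeyError.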

getting error: 'bool' object has no attribute 'astype'

AttributeError Traceback (most recent call last)
in
1 while True:
2 img,ret = capture.read()
----> 3 img=img.astype('uint8')
4 gray=cv2.cvtColor(img,cv2.COLOR_BGR2GRAY)
5 faces = face_cascade.detectMultiscale(gray,1.3,5)

AttributeError: 'bool' object has no attribute 'astype'

How to calculate my own cost matrix?

In your tutorial, I saw:

X_train, X_test, y_train, y_test, cost_mat_train, cost_mat_test = \
    train_test_split(data.data, data.target, data.cost_mat)

But that data is a built-in dataset. What about my own data, which doesn't have a .cost_mat attribute?
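For your own data, the cost matrix can be built by hand as an array with one row per sample. The sketch below assumes the column order [C_FP, C_FN, C_TP, C_TN] mentioned elsewhere on this page; the helper name constant_cost_matrix and the cost values are illustrative, not part of costcla.

```python
import numpy as np


def constant_cost_matrix(n_samples, c_fp, c_fn, c_tp=0.0, c_tn=0.0):
    """Build an (n_samples, 4) array with columns [C_FP, C_FN, C_TP, C_TN]."""
    row = np.array([c_fp, c_fn, c_tp, c_tn], dtype=float)
    return np.tile(row, (n_samples, 1))


cost_mat = constant_cost_matrix(5, c_fp=1.0, c_fn=10.0)

# Example-dependent costs: overwrite one column with a per-sample vector
# (made-up amounts, e.g. the loss incurred if that sample is missed).
amounts = np.array([10.0, 25.0, 5.0, 40.0, 15.0])
cost_mat[:, 1] = amounts
print(cost_mat.shape)  # (5, 4)
```

The resulting array can then be passed to train_test_split together with X and y, in place of data.cost_mat.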

Shape of the cost matrix

Is it necessary to create a 4 column cost matrix for each training and test set per cost_mat[C_FP,C_FN,C_TP,C_TN]? I only have one cost for each training instance.

What do you recommend here? Can my cost just be the cost of classifying the instance as positive (C_FP), and would you advise leaving the rest at 0?

Also, does the value of the cost matter? Or is it more the distribution of the cost vector across C_FP, C_FN etc?

Why do the BMR examples use y_test to fit?

Part of the Bayes Minimum Risk example is:

from sklearn.ensemble import RandomForestClassifier
from costcla.models import BayesMinimumRiskClassifier

f = RandomForestClassifier(random_state=0).fit(X_train, y_train)
y_prob_test = f.predict_proba(X_test)
y_pred_test_rf = f.predict(X_test)
f_bmr = BayesMinimumRiskClassifier()
f_bmr.fit(y_test, y_prob_test)
y_pred_test_bmr = f_bmr.predict(y_prob_test, cost_mat_test)

I can't understand why you use the test labels (y_test) to fit the BMR model. It doesn't make sense to me, because y_test is what we want to predict.

If I have misunderstood something, please let me know.

Imbalanced Dataset vs Cost Function

Hi,

@albahnsen Do you have any thoughts on the relationship between the cost matrix and re-balancing the data? I notice that you do not rebalance in your final logistic regression model in your wonderful paper.

If I have a highly imbalanced dataset where:
<1% are Positive
99% are Negative

But the theoretical cost is:
30 if all are labelled positive
and 1 if all are labelled negative

What should I be adjusting to stop it predicting all Positive? The imbalance? The cost? The iterations?

Thanks!

Edit: I've done some cross-validation to check different C and max_iter values, but it seems the best savings score I can get is 0 (with the worst being -12).

Cannot use predict_proba with cost matrix?

Hi,

Thank you very much for this library, it is exactly what I was looking for.

I noticed a difference in the required arguments of two methods of CostSensitiveRandomForestClassifier: predict accepts a cost matrix as input, but predict_proba does not. I thought they were essentially the same, with predict returning a crisp label when the predicted probability (or voting result) exceeds the default threshold of 0.5. Can you explain what I'm missing here?

Thanks.
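One possible reason for the difference: a probability does not depend on costs, while a cost-sensitive label does. The sketch below is a generic Bayes minimum risk decision rule, not necessarily how CostSensitiveRandomForestClassifier combines its trees; the function name bmr_predict and the cost ordering [C_FP, C_FN, C_TP, C_TN] are assumptions made for the example.

```python
def bmr_predict(p_positive, cost_row):
    """Return 1 if predicting positive has lower (or equal) expected cost, else 0.

    cost_row is assumed to be [C_FP, C_FN, C_TP, C_TN]; p_positive is P(y=1 | x).
    """
    c_fp, c_fn, c_tp, c_tn = cost_row
    risk_pos = p_positive * c_tp + (1.0 - p_positive) * c_fp  # expected cost of predicting 1
    risk_neg = p_positive * c_fn + (1.0 - p_positive) * c_tn  # expected cost of predicting 0
    return 1 if risk_pos <= risk_neg else 0


p = 0.2
label_threshold = int(p >= 0.5)            # plain 0.5 cutoff
label_bmr = bmr_predict(p, [2, 60, 0, 0])  # costly false negatives
print(label_threshold, label_bmr)  # 0 1
```

With the same probability of 0.2, the 0.5 cutoff gives label 0, but the expected cost of predicting negative (0.2 * 60 = 12) exceeds that of predicting positive (0.8 * 2 = 1.6), so the cost-aware rule gives label 1.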

`load_bankmarketing` no longer works

Tried on a fresh installation. I think it's because sklearn changed their interface for a couple things; should just be a quick swap. I'll clone the package and attempt to fix it.

To Reproduce:
python3 -m virtualenv venv
source venv/bin/activate
pip3 install costcla

from costcla.datasets import load_bankmarketing
load_bankmarketing()

The traceback is

Traceback (most recent call last):
  File "test.py", line 1, in <module>
    from costcla.datasets import load_bankmarketing
  File "/home/alex/test/venv/lib/python3.6/site-packages/costcla/__init__.py", line 31, in <module>
    from .models import *
  File "/home/alex/test/venv/lib/python3.6/site-packages/costcla/models/__init__.py", line 11, in <module>
    from .cost_ensemble import CostSensitivePastingClassifier
  File "/home/alex/test/venv/lib/python3.6/site-packages/costcla/models/cost_ensemble.py", line 8, in <module>
    from sklearn.cross_validation import train_test_split
ModuleNotFoundError: No module named 'sklearn.cross_validation'

Visualize features being used by CostSensitiveDecisionTreeClassifier

Hi,

First of all, thank you for developing this great theory and library.

I'm running some tests with CostSensitiveDecisionTreeClassifier, but after fitting the model I can neither export the tree nor see which feature and threshold are used at each node.

How can I do this?
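Since the export_graphviz traceback at the top of this page shows that the fitted tree_ object lacks the children_left attribute sklearn expects, a text dump may be a workable substitute. The sketch below assumes the fitted tree can be obtained as a nested dict; the key names "split" (feature index, threshold), "sl" (left child), "sr" (right child) and "y_pred" are hypothetical and must be checked against the actual object before use.

```python
def describe_tree(node, feature_names=None, depth=0):
    """Return text lines describing a nested-dict tree (assumed, hypothetical layout)."""
    indent = "  " * depth
    if "sl" not in node:  # leaf: no children stored
        return [f"{indent}predict {node['y_pred']}"]
    feature, threshold = node["split"]
    name = feature_names[feature] if feature_names else f"X[{feature}]"
    lines = [f"{indent}if {name} <= {threshold}:"]
    lines += describe_tree(node["sl"], feature_names, depth + 1)
    lines.append(f"{indent}else:")
    lines += describe_tree(node["sr"], feature_names, depth + 1)
    return lines


# Hand-written toy tree, not produced by costcla.
toy_tree = {
    "split": (0, 2.5),
    "sl": {"y_pred": 0},
    "sr": {"split": (1, 0.7), "sl": {"y_pred": 1}, "sr": {"y_pred": 0}},
}
lines = describe_tree(toy_tree, ["age", "income"])
print("\n".join(lines))
```

Each internal node prints its feature and threshold, and each leaf prints its predicted label, which is usually enough to see which features the tree relies on.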

Cost-sensitive decision tree predicts the same value

I am running into an issue with the cost-sensitive decision tree class. It works properly on the example datasets shipped with the package, but when I use my own dataset, the algorithm returns exactly the same value for every observation, as in the image below. Do you know what could cause this behavior? Note that my code is identical for the packaged datasets and for mine; the only difference is the outcome.