costsensitiveclassification's People

Contributors

albahnsen, anilkumarpanda, jordanhagan, ralic

costsensitiveclassification's Issues

I want to draw a cost-sensitive decision tree, but I get an error.

f = CostSensitiveDecisionTreeClassifier()
f.fit(X_train, y_train, cost_mat_train)
y_pred_test_csdt = f.predict(X_test)
dot_data = tree.export_graphviz(f, out_file=None,
                                filled=True,
                                rounded=True)
graph = graphviz.Source(dot_data)
graph

The error:

AttributeError                            Traceback (most recent call last)
<ipython-input-53-1889f70abfe8> in <module>
      2 dot_data = tree.export_graphviz(f, out_file=None,
      3                                 filled=True,
----> 4                                 rounded=True)
      5 graph = graphviz.Source(dot_data)
      6 type(f)

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\tree\export.py in export_graphviz(decision_tree, out_file, max_depth, feature_names, class_names, label, filled, leaves_parallel, impurity, node_ids, proportion, rotate, rounded, special_characters, precision)
    773             rounded=rounded, special_characters=special_characters,
    774             precision=precision)
--> 775         exporter.export(decision_tree)
    776 
    777         if return_string:

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\tree\export.py in export(self, decision_tree)
    407         else:
    408             self.recurse(decision_tree.tree_, 0,
--> 409                          criterion=decision_tree.criterion)
    410 
    411         self.tail()

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\tree\export.py in recurse(self, tree, node_id, criterion, parent, depth)
    451             raise ValueError("Invalid node_id %s" % _tree.TREE_LEAF)
    452 
--> 453         left_child = tree.children_left[node_id]
    454         right_child = tree.children_right[node_id]
    455 

AttributeError: '_tree_class' object has no attribute 'children_left'

Is this a version problem?

Problem with imports from pyea

I get an error when running "from costcla.models import *". It is caused by the import "from pyea import GeneticAlgorithmOptimizer", which fails with: No module named 'sklearn.externals.joblib'. Is it possible to fix this?

Getting AttributeError: 'bool' object has no attribute 'astype' for CostSensitiveLogisticRegression()

I have code like:

costClassifier = CostSensitiveLogisticRegression()
costClassifier.fit(train_data_features, train_property_labels, open_cost_mat_train)
y_open_pred_test_cslr = costClassifier.predict(test_data_features)

Here train_data_features is a bag-of-words matrix for 15,000 sentences, train_property_labels holds the categorical labels of those sentences, and open_cost_mat_train is the cost matrix. They look like this, respectively:

   [[0 0 0 ..., 0 0 0]
    [0 0 0 ..., 0 0 0]
    [0 0 0 ..., 0 0 0]
     ..., 
    [0 0 0 ..., 0 0 0]
    [0 0 0 ..., 0 0 0]
    [0 0 0 ..., 0 0 0]] 

 [u'/location/statistical_region/net_migration', u'/location/statistical_region/net_migration', u'/location/statistical_region/net_migration', .....]

 [[ 0.36303512  0.          0.          0.        ]
  [ 0.24472353  0.          0.          0.        ]
  [ 0.18386408  0.          0.          0.        ]
  ..., 
  [ 0.00650667  0.          0.          0.        ]
  [ 0.06445714  0.          0.          0.        ]
  [ 0.05        0.          0.          0.        ]] 

My stack trace however is:

/Users/dhruv/anaconda/lib/python2.7/site-packages/costcla/metrics/costs.py:76: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  y_true = (y_true == 1).astype(np.float)
Traceback (most recent call last):
  File "/Users/dhruv/Documents/university/ClaimDetection/src/main/costSensitiveClassifier.py", line 272, in <module>
    openCostClassifier.fit(train_data_features, train_property_labels, open_cost_mat_train)
  File "/Users/dhruv/anaconda/lib/python2.7/site-packages/costcla/models/regression.py", line 237, in fit
    res.fit()
  File "/Users/dhruv/anaconda/lib/python2.7/site-packages/pyea/models/ga.py", line 165, in fit
    self.cost_ = self._fitness_function()
  File "/Users/dhruv/anaconda/lib/python2.7/site-packages/pyea/models/ga.py", line 151, in _fitness_function
    for i in range(n_jobs))
  File "/Users/dhruv/anaconda/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 659, in __call__
    self.dispatch(function, args, kwargs)
  File "/Users/dhruv/anaconda/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 406, in dispatch
    job = ImmediateApply(func, args, kwargs)
  File "/Users/dhruv/anaconda/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 140, in __init__
    self.results = func(*args, **kwargs)
  File "/Users/dhruv/anaconda/lib/python2.7/site-packages/pyea/models/ga.py", line 35, in _fitness_function_parallel
    return fitness_function(pop, *fargs).tolist()
  File "/Users/dhruv/anaconda/lib/python2.7/site-packages/costcla/models/regression.py", line 96, in _logistic_cost_loss
    out[i] = _logistic_cost_loss_i(w[i], X, y, cost_mat, alpha)
  File "/Users/dhruv/anaconda/lib/python2.7/site-packages/costcla/models/regression.py", line 53, in _logistic_cost_loss_i
    out = cost_loss(y, y_prob, cost_mat) / n_samples
  File "/Users/dhruv/anaconda/lib/python2.7/site-packages/costcla/metrics/costs.py", line 76, in cost_loss
    y_true = (y_true == 1).astype(np.float)
AttributeError: 'bool' object has no attribute 'astype'
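
The FutureWarning at the top of the trace suggests that y_true == 1 was evaluated against string labels, so the comparison returned a single bool instead of an array. A minimal sketch of one way around this, assuming the classifier expects binary 0/1 targets (an assumption inferred from the y_true == 1 comparison, not confirmed by the maintainers): map one class of interest to 1 and everything else to 0. The helper name binarize_labels and the second label below are hypothetical.

```python
def binarize_labels(labels, positive_label):
    """Map the chosen class to 1 and every other class to 0 (one-vs-rest)."""
    return [1 if label == positive_label else 0 for label in labels]


# Labels in the style of the ones shown in the issue.
labels = [
    "/location/statistical_region/net_migration",
    "/location/statistical_region/population",  # made-up second class
    "/location/statistical_region/net_migration",
]
y_binary = binarize_labels(labels, "/location/statistical_region/net_migration")
print(y_binary)  # [1, 0, 1]
```

With many categories, this would have to be repeated once per class of interest, each time with its own cost matrix.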

Push to PyPI

It looks like you have updated the codebase to be compatible with the latest scikit-learn release. Would you please push this (i.e. the codebase with the recent commits) to PyPI?

Calculation of cost saving

Hello, thank you very much for the project. I am trying to apply a cost-sensitive learning approach to a marketing problem, where a False Negative (a record predicted as a non-customer when it actually is a customer) is much worse than a False Positive. So the cost matrix is class-dependent rather than example-dependent. I build the cost matrix as follows (shown for the first 4 records):

     FP   FN   TP   TN
1     2   60    0    0
2     2   60    0    0
3     2   60    0    0
4     2   60    0    0

When I call savings_score() I get a negative score. What is the meaning of subtracting the cost ratio from 1?

I am referring to the following piece of code:

    cost = cost_loss(y_true, y_pred, cost_mat)
    return 1.0 - cost / cost_base

The cost value itself is not normalized, so I really don't understand the meaning of the "1 -" part.
Thank you very much!
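
A minimal sketch of how such a score can turn negative. It assumes that cost_base is the cheaper of two naive strategies (predict everything negative, or everything positive) and that each cost matrix row is ordered [C_FP, C_FN, C_TP, C_TN]; both are assumptions about the library, not something stated in the snippet above. Under those assumptions the division by cost_base is the normalization, so the score is the fraction of the naive cost that the model saves, and it drops below 0 whenever the model costs more than the best naive strategy. The labels and predictions are made up.

```python
def cost_loss(y_true, y_pred, cost_mat):
    """Total cost of the predictions; each row is [C_FP, C_FN, C_TP, C_TN]."""
    total = 0.0
    for yt, yp, (c_fp, c_fn, c_tp, c_tn) in zip(y_true, y_pred, cost_mat):
        if yt == 1:
            total += c_tp if yp == 1 else c_fn
        else:
            total += c_fp if yp == 1 else c_tn
    return total


def savings_score(y_true, y_pred, cost_mat):
    """1 - cost / cost_base, with cost_base the cheaper naive strategy (assumed)."""
    n = len(y_true)
    cost_base = min(cost_loss(y_true, [0] * n, cost_mat),
                    cost_loss(y_true, [1] * n, cost_mat))
    return 1.0 - cost_loss(y_true, y_pred, cost_mat) / cost_base


y_true = [1] + [0] * 9                    # one customer, nine non-customers
cost_mat = [[2, 60, 0, 0]] * 10           # FP = 2, FN = 60, as in the issue

misses_customer = [0, 1] + [0] * 8        # one FN (60) plus one FP (2) = 62
finds_customer = [1, 1] + [0] * 8         # one TP (0) plus one FP (2) = 2

# Naive baselines: all negative costs 60, all positive costs 18, so cost_base = 18.
print(round(savings_score(y_true, misses_customer, cost_mat), 2))  # -2.44
print(round(savings_score(y_true, finds_customer, cost_mat), 2))   # 0.89
```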

Cannot import on Python 3.5

I cannot import costcla on Python 3.5 (Anaconda). The error is "No module named metrics". The relevant imports are:
from metrics import *
from dataset import *
from models import *
After installing Python 2.7 instead, it works.

Models cannot be saved and restored

  1. Attempt to save a trained CostSensitiveRandomForestClassifier model using Python pickle:
  • pickle.dump(model,open("m1","wb")) ## works without an error
  • m=pickle.load(open("m1","rb")) ## fails with error "AttributeError: 'module' object has no attribute '_tree_class'"
  2. Attempt to save the same model using joblib.dump from sklearn.externals.joblib:
  • joblib.dump(model,"m2") ## fails with error "PicklingError: Can't pickle <class costcla.models.cost_tree._tree_class at 0x7f45d8760db8>: it's not found as costcla.models.cost_tree._tree_class"

costcla incompatible with latest version of Scikit-Learn

It seems that scikit-learn versions 0.21 and above have deprecated and removed sklearn.externals.six and sklearn.externals.joblib, according to the release notes. Are there any plans to accommodate this change, or to flag it in the setup.py script (as of now the sklearn requirement is explicitly listed as 'scikit-learn>=0.15.0b2')? I can also take a stab at creating a PR for this issue, but I am unaware of the etiquette and guidelines. Furthermore, I notice there is a lack of automated testing, so I am unsure whether I would break something :D .
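
A minimal sketch of one way a package could tolerate both old and new scikit-learn releases: try the standalone module first and fall back to the old vendored location. The helper import_first is hypothetical (it is not part of costcla or scikit-learn), and the demonstration uses a standard-library module so that it runs without either package installed.

```python
import importlib


def import_first(*module_names):
    """Return the first module in module_names that can be imported."""
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:  # ModuleNotFoundError is a subclass of ImportError
            continue
    raise ImportError("none of these modules could be imported: %s"
                      % ", ".join(module_names))


# Intended use inside the package (needs joblib/six or an old scikit-learn):
#     joblib = import_first("joblib", "sklearn.externals.joblib")
#     six = import_first("six", "sklearn.externals.six")

# Stdlib-only demonstration of the fallback behaviour:
json_module = import_first("module_that_does_not_exist_xyz", "json")
print(json_module.__name__)  # json
```

The same pattern could cover the sklearn.cross_validation to sklearn.model_selection rename that shows up in the load_bankmarketing traceback further down.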

CS-RF parallel error.

When I run CS-RF with n_jobs=2 in a Jupyter notebook, I get the following error (Python version: 2.7). With n_jobs=1, no error occurs.

/usr/lib/python2.7/site-packages/costcla/models/bagging.pyc in fit(self, X, y, cost_mat, sample_weight)
    273                 seeds[starts[i]:starts[i + 1]],
    274                 verbose=self.verbose)
--> 275             for i in range(n_jobs))
    276 
    277         # Reduce

/usr/lib64/python2.7/site-packages/sklearn/externals/joblib/parallel.pyc in __call__(self, iterable)
    787                 # consumption.
    788                 self._iterating = False
--> 789             self.retrieve()
    790             # Make sure that we get a last message telling us we are done
    791             elapsed_time = time.time() - self._start_time

/usr/lib64/python2.7/site-packages/sklearn/externals/joblib/parallel.pyc in retrieve(self)
    697             try:
    698                 if getattr(self._backend, 'supports_timeout', False):
--> 699                     self._output.extend(job.get(timeout=self.timeout))
    700                 else:
    701                     self._output.extend(job.get())

/usr/lib64/python2.7/multiprocessing/pool.pyc in get(self, timeout)
    552             return self._value
    553         else:
--> 554             raise self._value
    555 
    556     def _set(self, i, obj):

MaybeEncodingError: Error sending result: '[([CostSensitiveDecisionTreeClassifier(criterion='direct_cost',
                  criterion_weight=False, max_depth=None,
                  max_features='auto', min_gain=0.001, min_samples_leaf=1,
                  min_samples_split=2, num_pct=100, pruned=True), CostSensitiveDecisionTreeClassifier(criterion='direct_cost',
                  criterion_weight=False, max_depth=None,
                  max_features='auto', min_gain=0.001, min_samples_leaf=1,
                  min_samples_split=2, num_pct=100, pruned=True), CostSensitiveDecisionTreeClassifier(criterion='direct_cost',
                  criterion_weight=False, max_depth=None,
                  max_features='auto', min_gain=0.001, min_samples_leaf=1,
                  min_samples_split=2, num_pct=100, pruned=True), CostSensitiveDecisionTreeClassifier(criterion='direct_cost',
                  criterion_weight=False, max_depth=None,
                  max_features='auto', min_gain=0.001, min_samples_leaf=1,
                  min_samples_split=2, num_pct=100, pruned=True), CostSensitiveDecisionTreeClassifier(criterion='direct_cost',
                  criterion_weight=False, max_depth=None,
                  max_features='auto', min_gain=0.001, min_samples_leaf=1,
                  min_samples_split=2, num_pct=100, pruned=True), CostSensitiveDecisionTreeClassifier(criterion='direct_cost',
                  criterion_weight=False, max_depth=None,
                  max_features='auto', min_gain=0.001, min_samples_leaf=1,
                  min_samples_split=2, num_pct=100, pruned=True), CostSensitiveDecisionTreeClassifier(criterion='direct_cost',
                  criterion_weight=False, max_depth=None,
                  max_features='auto', min_gain=0.001, min_samples_leaf=1,
                  min_samples_split=2, num_pct=100, pruned=True), CostSensitiveDecisionTreeClassifier(criterion='direct_cost',
                  criterion_weight=False, max_depth=None,
                  max_features='auto', min_gain=0.001, min_samples_leaf=1,
                  min_samples_split=2, num_pct=100, pruned=True)], [array([False,  True, False, ...,  True, False,  True], dtype=bool), array([False,  True,  True, ...,  True, False, False], dtype=bool), array([ True,  True,  True, ...,  True, False,  True], dtype=bool), array([ True,  True,  True, ...,  True,  True,  True], dtype=bool), array([False, False,  True, ...,  True,  True,  True], dtype=bool), array([ True,  True,  True, ...,  True,  True,  True], dtype=bool), array([ True,  True,  True, ...,  True, False,  True], dtype=bool), array([ True,  True, False, ...,  True,  True, False], dtype=bool)], [array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
       34, 35, 36, 37, 38, 39]), array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
       34, 35, 36, 37, 38, 39]), array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
       34, 35, 36, 37, 38, 39]), array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
       34, 35, 36, 37, 38, 39]), array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
       34, 35, 36, 37, 38, 39]), array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
       34, 35, 36, 37, 38, 39]), array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
       34, 35, 36, 37, 38, 39]), array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
       34, 35, 36, 37, 38, 39])])]'. Reason: 'PicklingError("Can't pickle <class costcla.models.cost_tree._tree_class at 0x7261ae0>: it's not found as costcla.models.cost_tree._tree_class",)'

ModuleNotFoundError

Hi,
Importing the library fails with errors including:

  • No module named 'sklearn.externals.joblib'
  • ImportError: cannot import name 'six' from 'sklearn.externals'
  • ModuleNotFoundError: No module named 'sklearn.externals.six.moves'

Are you familiar with these errors, and do you know how to handle them?
Thanks in advance!
Danit.

Support for Pandas DataFrame

It seems this library currently does not accept a Pandas DataFrame as input data, because the following error is raised:

KeyError: '[58976 73597   747 ..., 32194 57641 57730] not in index'

So to use this library, one must call the DataFrame's .as_matrix() to convert the DataFrame into a NumPy ndarray.

getting error: 'bool' object has no attribute 'astype'

AttributeError Traceback (most recent call last)
in
1 while True:
2 img,ret = capture.read()
----> 3 img=img.astype('uint8')
4 gray=cv2.cvtColor(img,cv2.COLOR_BGR2GRAY)
5 faces = face_cascade.detectMultiscale(gray,1.3,5)

AttributeError: 'bool' object has no attribute 'astype'

How to calculate my own cost matrix?

In your tutorial, I saw:

X_train, X_test, y_train, y_test, cost_mat_train, cost_mat_test = \
train_test_split(data.data, data.target, data.cost_mat)

But data is a built-in dataset. What about my own data, which doesn't have a .cost_mat attribute?
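
A minimal sketch of building a cost matrix for your own data, assuming the column order [C_FP, C_FN, C_TP, C_TN] quoted in the issues on this page and one row per sample. Both helper names are hypothetical, and the amount-based variant is only one possible costing scheme. The rows are plain lists here; they would presumably need converting with numpy.asarray before being passed to the library.

```python
def constant_cost_matrix(n_samples, c_fp, c_fn, c_tp=0.0, c_tn=0.0):
    """Class-dependent costs: every sample gets the same four costs."""
    return [[c_fp, c_fn, c_tp, c_tn] for _ in range(n_samples)]


def amount_based_cost_matrix(amounts, c_admin):
    """Example-dependent costs (hypothetical scheme).

    Missing a positive costs that sample's amount; flagging a sample
    (rightly or wrongly) costs a fixed handling fee.
    """
    return [[c_admin, amount, c_admin, 0.0] for amount in amounts]


# One row per sample, so the matrix can be split alongside X and y.
print(constant_cost_matrix(2, c_fp=2, c_fn=60))
# [[2, 60, 0.0, 0.0], [2, 60, 0.0, 0.0]]
print(amount_based_cost_matrix([100.0, 5.0], c_admin=1.0))
# [[1.0, 100.0, 1.0, 0.0], [1.0, 5.0, 1.0, 0.0]]
```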

Shape of the cost matrix

Is it necessary to create a 4 column cost matrix for each training and test set per cost_mat[C_FP,C_FN,C_TP,C_TN]? I only have one cost for each training instance.

What do you recommend in this case? Can my cost just be the cost of classifying the instance as positive (C_FP), and then would you advise to leave the rest at 0?

Also, does the value of the cost matter? Or is it more the distribution of the cost vector across C_FP, C_FN etc?
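
A minimal sketch of what these two choices imply, assuming the total cost is simply the sum of the matching entry of each row [C_FP, C_FN, C_TP, C_TN]. All numbers are made up. If only C_FP is filled in, predicting everything negative already costs nothing, so there is nothing to trade off. Multiplying every cost by the same factor leaves the relative cost of two sets of predictions unchanged.

```python
# (y_true, y_pred) -> column index in a row laid out as [C_FP, C_FN, C_TP, C_TN]
COLUMN = {(0, 1): 0, (1, 0): 1, (1, 1): 2, (0, 0): 3}


def total_cost(y_true, y_pred, cost_mat):
    """Sum the cost entry that matches each (true label, prediction) pair."""
    return sum(row[COLUMN[(yt, yp)]]
               for yt, yp, row in zip(y_true, y_pred, cost_mat))


y_true = [1, 0, 1, 0]
per_instance = [5, 3, 8, 1]               # one made-up cost per instance
all_neg, all_pos = [0] * 4, [1] * 4

# Case 1: only C_FP is non-zero, so "always negative" is free.
fp_only = [[c, 0, 0, 0] for c in per_instance]
print(total_cost(y_true, all_neg, fp_only))  # 0

# Case 2: add a false-negative cost, then rescale every entry by 10.
with_fn = [[c, 4 * c, 0, 0] for c in per_instance]
scaled = [[10 * value for value in row] for row in with_fn]
ratio = total_cost(y_true, all_neg, with_fn) / total_cost(y_true, all_pos, with_fn)
ratio_scaled = total_cost(y_true, all_neg, scaled) / total_cost(y_true, all_pos, scaled)
print(ratio, ratio_scaled)  # 13.0 13.0
```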

Why do the BMR examples use y_test to fit?

Part of the Bayes Minimum Risk example is:

f = RandomForestClassifier(random_state=0).fit(X_train, y_train)
y_prob_test = f.predict_proba(X_test)
y_pred_test_rf = f.predict(X_test)
f_bmr = BayesMinimumRiskClassifier()
f_bmr.fit(y_test, y_prob_test)
y_pred_test_bmr = f_bmr.predict(y_prob_test, cost_mat_test)

I can't understand why the test labels (y_test) are used to fit the BMR model. It doesn't make sense to me, because y_test is what we want to predict.

If I have misunderstood something, please let me know.
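
A minimal sketch of the textbook Bayes minimum risk decision rule, to show that the rule itself needs only probabilities and costs, not labels. It is not taken from the library's code. If that is also how the library decides, the labels passed to fit could only serve a step such as probability calibration (an assumption, not confirmed in this issue), and a held-out calibration split would avoid touching y_test. Here y_prob is the probability of the positive class and rows are assumed to be [C_FP, C_FN, C_TP, C_TN].

```python
def bmr_predict(y_prob, cost_mat):
    """Predict 1 when the expected cost of predicting 1 is the lower one."""
    predictions = []
    for p, (c_fp, c_fn, c_tp, c_tn) in zip(y_prob, cost_mat):
        risk_positive = p * c_tp + (1 - p) * c_fp   # expected cost of predicting 1
        risk_negative = p * c_fn + (1 - p) * c_tn   # expected cost of predicting 0
        predictions.append(1 if risk_positive < risk_negative else 0)
    return predictions


y_prob = [0.02, 0.05, 0.6]                # made-up positive-class probabilities
cost_mat = [[2, 60, 0, 0]] * 3            # FP = 2, FN = 60
print(bmr_predict(y_prob, cost_mat))      # [0, 1, 1]
```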

Imbalanced Dataset vs Cost Function

Hi,

@albahnsen Do you have any thoughts on the relationship between the cost matrix and re-balancing the data? I notice that you do not rebalance in your final logistic regression model in your wonderful paper.

If I have a highly imbalanced dataset where:
<1% are Positive
99% are Negative

But the theoretical cost is:
30 if all are labelled positive
and 1 if all are labelled negative

What should I be adjusting to stop it predicting all Positive? The imbalance? The cost? The iterations?

Thanks!

Edit: I've done some cross-validation over different values of C and max_iter, but the best savings score I can get is 0 (the worst being -12).

Cannot use predict_proba with cost matrix?

Hi,

Thank you very much for this library, it is exactly what I was looking for.

I noticed a difference in the required arguments of two methods of CostSensitiveRandomForestClassifier: predict and predict_proba. The former accepts a cost matrix as input but the latter does not. I thought they were essentially the same, with predict returning a crisp label whenever the predicted probability (or voting result) exceeds the default threshold of 0.5. Can you explain if I'm missing something here?

Thanks.
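
A minimal sketch of one reason a predict step could need costs while predict_proba does not: under a cost-minimizing rule, the cut-off applied to a probability depends on the costs and is generally not 0.5. Whether CostSensitiveRandomForestClassifier works this way internally is an assumption. The function name cost_threshold is hypothetical, and the formula is the standard one obtained by equating the expected cost of predicting positive with that of predicting negative.

```python
def cost_threshold(c_fp, c_fn, c_tp=0.0, c_tn=0.0):
    """Probability above which predicting positive has the lower expected cost.

    Derived from p*C_TP + (1-p)*C_FP < p*C_FN + (1-p)*C_TN, which assumes
    each kind of error costs more than the corresponding correct decision.
    """
    gain_neg = c_fp - c_tn   # extra cost of a false alarm over a true negative
    gain_pos = c_fn - c_tp   # extra cost of a miss over a true positive
    if gain_neg <= 0 or gain_pos <= 0:
        raise ValueError("errors must cost more than correct decisions")
    return gain_neg / (gain_neg + gain_pos)


print(cost_threshold(1, 1))               # 0.5
print(round(cost_threshold(2, 60), 4))    # 0.0323
```

With example-dependent costs the threshold differs per row, which is why a cost matrix would have to be supplied at prediction time.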

`load_bankmarketing` no longer works

Tried on a fresh installation. I think it's because sklearn changed their interface for a couple things; should just be a quick swap. I'll clone the package and attempt to fix it.

To Reproduce:
python3 -m virtualenv venv
source venv/bin/activate
pip3 install costcla

from costcla.datasets import load_bankmarketing
load_bankmarketing()

The traceback is

Traceback (most recent call last):
  File "test.py", line 1, in <module>
    from costcla.datasets import load_bankmarketing
  File "/home/alex/test/venv/lib/python3.6/site-packages/costcla/__init__.py", line 31, in <module>
    from .models import *
  File "/home/alex/test/venv/lib/python3.6/site-packages/costcla/models/__init__.py", line 11, in <module>
    from .cost_ensemble import CostSensitivePastingClassifier
  File "/home/alex/test/venv/lib/python3.6/site-packages/costcla/models/cost_ensemble.py", line 8, in <module>
    from sklearn.cross_validation import train_test_split
ModuleNotFoundError: No module named 'sklearn.cross_validation'

Visualize features being used by CostSensitiveDecisionTreeClassifier

Hi,

First of all, thank you for developing this great theory and library.

I'm running some tests with CostSensitiveDecisionTreeClassifier, but after fitting the model I can neither export the tree nor see which features are used in each node and the comparison applied to them.

How can I do this?
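
A minimal sketch of dumping a tree as indented text when scikit-learn's exporters cannot be used. It assumes the fitted tree is available as a nested dict in which internal nodes carry a (feature index, threshold) pair under "split" and children under "sl" and "sr", and leaves carry "y_pred". That layout is an assumption for illustration only; inspect the tree_ attribute of the installed version before relying on it. The toy tree and the feature names are made up.

```python
def tree_to_text(node, feature_names=None, depth=0):
    """Render a nested-dict decision tree as a list of indented text lines."""
    indent = "    " * depth
    split = node.get("split", -1)
    if not isinstance(split, (list, tuple)):   # assumed leaf marker (e.g. -1)
        return ["%spredict %s" % (indent, node["y_pred"])]
    feature, threshold = split
    name = feature_names[feature] if feature_names else "X[%d]" % feature
    lines = ["%sif %s <= %s:" % (indent, name, threshold)]
    lines += tree_to_text(node["sl"], feature_names, depth + 1)
    lines.append("%selse:" % indent)
    lines += tree_to_text(node["sr"], feature_names, depth + 1)
    return lines


# Hand-built toy tree in the assumed layout.
toy_tree = {
    "split": (0, 2.5),
    "sl": {"split": -1, "y_pred": 0},
    "sr": {
        "split": (1, 10.0),
        "sl": {"split": -1, "y_pred": 1},
        "sr": {"split": -1, "y_pred": 0},
    },
}
print("\n".join(tree_to_text(toy_tree, ["age", "amount"])))
```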

Cost-sensitive decision tree predicts the same value

I am running into an issue with the cost-sensitive decision tree class. It works properly on the go-to datasets that ship with the package, but when I use my own dataset and try to predict, the algorithm returns exactly the same value for every observation, as in the image below. Do you know what could cause this behavior? Note that my code is exactly the same for the go-to datasets and for mine; the only difference is the outcomes.

[Image: Screenshot 2021-09-05 at 16 43 05]

multiclass classification example?

So thankful this project exists! Are there any multiclass examples available? What would the cost matrix object look like in this scenario?
