albahnsen / costsensitiveclassification
CostSensitiveClassification Library in Python
License: BSD 3-Clause "New" or "Revised" License
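from sklearn import tree
import graphviz
from costcla.models import CostSensitiveDecisionTreeClassifier

# X_train, y_train, cost_mat_train and X_test are defined earlier
f = CostSensitiveDecisionTreeClassifier()
f.fit(X_train, y_train, cost_mat_train)
y_pred_test_csdt = f.predict(X_test)
# This call raises the AttributeError shown below
dot_data = tree.export_graphviz(f, out_file=None,
                                filled=True,
                                rounded=True)
graph = graphviz.Source(dot_data)
graph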
The error:
AttributeError Traceback (most recent call last)
<ipython-input-53-1889f70abfe8> in <module>
2 dot_data = tree.export_graphviz(f, out_file=None,
3 filled=True,
----> 4 rounded=True)
5 graph = graphviz.Source(dot_data)
6 type(f)
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\tree\export.py in export_graphviz(decision_tree, out_file, max_depth, feature_names, class_names, label, filled, leaves_parallel, impurity, node_ids, proportion, rotate, rounded, special_characters, precision)
773 rounded=rounded, special_characters=special_characters,
774 precision=precision)
--> 775 exporter.export(decision_tree)
776
777 if return_string:
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\tree\export.py in export(self, decision_tree)
407 else:
408 self.recurse(decision_tree.tree_, 0,
--> 409 criterion=decision_tree.criterion)
410
411 self.tail()
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\tree\export.py in recurse(self, tree, node_id, criterion, parent, depth)
451 raise ValueError("Invalid node_id %s" % _tree.TREE_LEAF)
452
--> 453 left_child = tree.children_left[node_id]
454 right_child = tree.children_right[node_id]
455
AttributeError: '_tree_class' object has no attribute 'children_left'
Is this a version problem?
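A likely cause that does not depend on versions: export_graphviz expects a scikit-learn tree whose tree_ exposes arrays such as children_left, while the traceback shows costcla's own _tree_class, which has no such attribute. Below is a minimal sketch of a plain-text export instead. It assumes the fitted tree is reachable as a nested dict (for example f.tree_.tree), that split nodes store 'split' as (feature index, threshold) with 'sl' and 'sr' subtrees, that leaves have no 'sl' key, and that samples with a value <= threshold go left. All of these key names and the comparison direction are assumptions to check against the installed costcla source; the demo tree is hand-made.

```python
def tree_to_text(node, feature_names=None, indent=0):
    """Render a nested-dict decision tree as a list of indented text lines."""
    pad = "    " * indent
    if "sl" not in node:  # assumed leaf: no child subtrees stored
        return ["%sleaf: predict %s" % (pad, node.get("y_pred"))]
    feature, threshold = node["split"]
    name = feature_names[feature] if feature_names else "X[%d]" % feature
    lines = ["%sif %s <= %s:" % (pad, name, threshold)]
    lines += tree_to_text(node["sl"], feature_names, indent + 1)
    lines.append("%selse:" % pad)
    lines += tree_to_text(node["sr"], feature_names, indent + 1)
    return lines


# Hand-made example tree (hypothetical values, not real costcla output).
demo_tree = {
    "split": [0, 2.5],
    "sl": {"split": -1, "y_pred": 0},
    "sr": {
        "split": [1, 10.0],
        "sl": {"split": -1, "y_pred": 1},
        "sr": {"split": -1, "y_pred": 0},
    },
}

print("\n".join(tree_to_text(demo_tree, ["age", "amount"])))
```

If the assumptions hold, calling tree_to_text(f.tree_.tree, feature_names) on a fitted classifier would list the feature and threshold used at each node.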
Hi @albahnsen,
can you please help me with the data imbalance in obesity datasets?
Please reply as soon as possible!
This library is not working anymore.
Please update the code to use import joblib
instead of from sklearn.externals import joblib
Update the tutorial notebook (https://nbviewer.jupyter.org/github/albahnsen/CostSensitiveClassification/blob/master/doc/tutorials/tutorial_edcs_credit_scoring.ipynb) to work with the latest version of sklearn.
I get an error when running from costcla.models import *. It is caused by from pyea import GeneticAlgorithmOptimizer.
I get the info: No module named 'sklearn.externals.joblib'. Is it possible to fix this?
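One common pattern for this kind of breakage is to try the standalone joblib package first and fall back to the old vendored location. The sketch below shows the idea with a small helper; import_first_available is a hypothetical name, not part of costcla or scikit-learn. Since the failing import sits inside costcla and pyea themselves, the pattern would have to be applied in those packages' source (or a fork) rather than in user code.

```python
import importlib


def import_first_available(module_names):
    """Return the first module in module_names that can be imported."""
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:  # also covers ModuleNotFoundError on Python 3
            continue
    raise ImportError("none of these modules could be imported: %r"
                      % (list(module_names),))


# Intended use inside the library (left commented out so that this
# sketch runs even when neither package is installed):
# joblib = import_first_available(["joblib", "sklearn.externals.joblib"])

# Demonstration with the standard library only.
json_module = import_first_available(["no_such_module_xyz", "json"])
print(json_module.__name__)
```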
I have code like:
costClassifier = CostSensitiveLogisticRegression()
costClassifier.fit(train_data_features, train_property_labels, open_cost_mat_train)
y_open_pred_test_cslr = costClassifier.predict(test_data_features)
Here train_data_features is a bag of words for 15,000 sentences, train_property_labels holds the categorical labels for the sentences, and open_cost_mat_train is a cost matrix. They look like this, respectively:
[[0 0 0 ..., 0 0 0]
[0 0 0 ..., 0 0 0]
[0 0 0 ..., 0 0 0]
...,
[0 0 0 ..., 0 0 0]
[0 0 0 ..., 0 0 0]
[0 0 0 ..., 0 0 0]]
[u'/location/statistical_region/net_migration', u'/location/statistical_region/net_migration', u'/location/statistical_region/net_migration', .....]
[[ 0.36303512 0. 0. 0. ]
[ 0.24472353 0. 0. 0. ]
[ 0.18386408 0. 0. 0. ]
...,
[ 0.00650667 0. 0. 0. ]
[ 0.06445714 0. 0. 0. ]
[ 0.05 0. 0. 0. ]]
My stack trace however is:
/Users/dhruv/anaconda/lib/python2.7/site-packages/costcla/metrics/costs.py:76: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
y_true = (y_true == 1).astype(np.float)
Traceback (most recent call last):
File "/Users/dhruv/Documents/university/ClaimDetection/src/main/costSensitiveClassifier.py", line 272, in <module>
openCostClassifier.fit(train_data_features, train_property_labels, open_cost_mat_train)
File "/Users/dhruv/anaconda/lib/python2.7/site-packages/costcla/models/regression.py", line 237, in fit
res.fit()
File "/Users/dhruv/anaconda/lib/python2.7/site-packages/pyea/models/ga.py", line 165, in fit
self.cost_ = self._fitness_function()
File "/Users/dhruv/anaconda/lib/python2.7/site-packages/pyea/models/ga.py", line 151, in _fitness_function
for i in range(n_jobs))
File "/Users/dhruv/anaconda/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 659, in __call__
self.dispatch(function, args, kwargs)
File "/Users/dhruv/anaconda/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 406, in dispatch
job = ImmediateApply(func, args, kwargs)
File "/Users/dhruv/anaconda/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 140, in __init__
self.results = func(*args, **kwargs)
File "/Users/dhruv/anaconda/lib/python2.7/site-packages/pyea/models/ga.py", line 35, in _fitness_function_parallel
return fitness_function(pop, *fargs).tolist()
File "/Users/dhruv/anaconda/lib/python2.7/site-packages/costcla/models/regression.py", line 96, in _logistic_cost_loss
out[i] = _logistic_cost_loss_i(w[i], X, y, cost_mat, alpha)
File "/Users/dhruv/anaconda/lib/python2.7/site-packages/costcla/models/regression.py", line 53, in _logistic_cost_loss_i
out = cost_loss(y, y_prob, cost_mat) / n_samples
File "/Users/dhruv/anaconda/lib/python2.7/site-packages/costcla/metrics/costs.py", line 76, in cost_loss
y_true = (y_true == 1).astype(np.float)
AttributeError: 'bool' object has no attribute 'astype'
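The FutureWarning hints at the cause: the labels are strings, so the comparison y_true == 1 in cost_loss collapses to a single bool instead of an array. That line presupposes labels coded as 0/1. A minimal sketch of a one-vs-rest encoding follows; the helper name and the second label are hypothetical, and which property counts as the positive class is a choice for the user.

```python
def binarize_labels(labels, positive_label):
    """Map categorical labels to 1 (positive_label) or 0 (anything else)."""
    return [1 if label == positive_label else 0 for label in labels]


labels = [
    u"/location/statistical_region/net_migration",
    u"/some/other/property",  # hypothetical second label
    u"/location/statistical_region/net_migration",
]
y_binary = binarize_labels(labels, u"/location/statistical_region/net_migration")
print(y_binary)
```

The resulting list would still need converting to a NumPy integer array (for example with np.asarray) before being passed to fit.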
Looks like you have updated the codebase to be compatible with the latest scikit-learn release. Would you please push this (i.e. codebase with recent commits) to PyPi?
Hello, thank you very much for the project. I am trying to apply a cost-sensitive learning approach to a marketing problem, where a False Negative (the record is predicted as a non-customer when they actually are a customer) is much worse than the opposite. As you can see, the cost matrix is not example-dependent but class-dependent. I create the example-dependent cost matrix as follows (shown for the first 4 records):
 | FP | FN | TP | TN |
---|---|---|---|---|
1 | 2 | 60 | 0 | 0 |
2 | 2 | 60 | 0 | 0 |
3 | 2 | 60 | 0 | 0 |
4 | 2 | 60 | 0 | 0 |
When I call the savings_score() function I get a negative score. What is the meaning of the "1 - cost" in the numerator?
I am referring to the following piece of code:
cost = cost_loss(y_true, y_pred, cost_mat)
return 1.0 - cost / cost_base
The cost value is not normalized, so I really don't understand the meaning of subtracting it from 1.
Thank you very much!
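By operator precedence, the quoted line computes 1 - (cost / cost_base), so the cost is normalized by cost_base before being subtracted from 1. The sketch below re-implements this in plain Python, assuming that cost_base is the cost of the cheaper of the two constant predictions (all 0 or all 1) and that cost matrix columns are ordered [FP, FN, TP, TN]. Both assumptions should be checked against costcla/metrics/costs.py, and the labels and predictions are made up for illustration.

```python
def total_cost(y_true, y_pred, cost_mat):
    """Sum example-dependent costs; each cost_mat row is [FP, FN, TP, TN]."""
    total = 0.0
    for t, p, (c_fp, c_fn, c_tp, c_tn) in zip(y_true, y_pred, cost_mat):
        if t == 1:
            total += c_tp if p == 1 else c_fn
        else:
            total += c_fp if p == 1 else c_tn
    return total


def savings(y_true, y_pred, cost_mat):
    """1 - cost / cost_base, with cost_base the cheaper constant prediction."""
    n = len(y_true)
    cost_base = min(total_cost(y_true, [0] * n, cost_mat),
                    total_cost(y_true, [1] * n, cost_mat))
    return 1.0 - total_cost(y_true, y_pred, cost_mat) / cost_base


y_true = [1, 0, 0, 0]            # hypothetical labels
cost_mat = [[2, 60, 0, 0]] * 4   # FP=2, FN=60, as in the table above

print(savings(y_true, [1, 1, 0, 0], cost_mat))  # one false positive
print(savings(y_true, [0, 0, 0, 0], cost_mat))  # misses the only positive
```

Here predicting all ones costs 6 and predicting all zeros costs 60, so cost_base is 6. The first prediction costs 2 and scores about 0.67. The second costs 60 and scores -9.0. Under these assumptions a negative score means the model is more expensive than the cheaper constant prediction.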
I cannot import costcla on Python 3.5 (Anaconda). The error is "No module named metrics", raised by these imports:
from metrics import *
from dataset import *
from models import *
After installing Python 2.7 instead, it works.
It seems that scikit-learn versions 0.21 and above have deprecated, and later releases removed, sklearn.externals.six and sklearn.externals.joblib, according to the release notes. Are there any plans to accommodate this change, or to reflect it in the setup.py script (as of now the sklearn requirement is listed as 'scikit-learn>=0.15.0b2')? I can also take a stab at creating a PR for this issue, but I am unaware of the etiquette and guidelines. Furthermore, I notice there is a lack of automated testing, so I am unsure whether I would break something :D
When I run CS-RF with n_jobs=2 in a Jupyter notebook, I get the following error (Python version: 2.7). With n_jobs=1, no error occurs.
/usr/lib/python2.7/site-packages/costcla/models/bagging.pyc in fit(self, X, y, cost_mat, sample_weight)
273 seeds[starts[i]:starts[i + 1]],
274 verbose=self.verbose)
--> 275 for i in range(n_jobs))
276
277 # Reduce
/usr/lib64/python2.7/site-packages/sklearn/externals/joblib/parallel.pyc in __call__(self, iterable)
787 # consumption.
788 self._iterating = False
--> 789 self.retrieve()
790 # Make sure that we get a last message telling us we are done
791 elapsed_time = time.time() - self._start_time
/usr/lib64/python2.7/site-packages/sklearn/externals/joblib/parallel.pyc in retrieve(self)
697 try:
698 if getattr(self._backend, 'supports_timeout', False):
--> 699 self._output.extend(job.get(timeout=self.timeout))
700 else:
701 self._output.extend(job.get())
/usr/lib64/python2.7/multiprocessing/pool.pyc in get(self, timeout)
552 return self._value
553 else:
--> 554 raise self._value
555
556 def _set(self, i, obj):
MaybeEncodingError: Error sending result: '[([CostSensitiveDecisionTreeClassifier(criterion='direct_cost',
criterion_weight=False, max_depth=None,
max_features='auto', min_gain=0.001, min_samples_leaf=1,
min_samples_split=2, num_pct=100, pruned=True), CostSensitiveDecisionTreeClassifier(criterion='direct_cost',
criterion_weight=False, max_depth=None,
max_features='auto', min_gain=0.001, min_samples_leaf=1,
min_samples_split=2, num_pct=100, pruned=True), CostSensitiveDecisionTreeClassifier(criterion='direct_cost',
criterion_weight=False, max_depth=None,
max_features='auto', min_gain=0.001, min_samples_leaf=1,
min_samples_split=2, num_pct=100, pruned=True), CostSensitiveDecisionTreeClassifier(criterion='direct_cost',
criterion_weight=False, max_depth=None,
max_features='auto', min_gain=0.001, min_samples_leaf=1,
min_samples_split=2, num_pct=100, pruned=True), CostSensitiveDecisionTreeClassifier(criterion='direct_cost',
criterion_weight=False, max_depth=None,
max_features='auto', min_gain=0.001, min_samples_leaf=1,
min_samples_split=2, num_pct=100, pruned=True), CostSensitiveDecisionTreeClassifier(criterion='direct_cost',
criterion_weight=False, max_depth=None,
max_features='auto', min_gain=0.001, min_samples_leaf=1,
min_samples_split=2, num_pct=100, pruned=True), CostSensitiveDecisionTreeClassifier(criterion='direct_cost',
criterion_weight=False, max_depth=None,
max_features='auto', min_gain=0.001, min_samples_leaf=1,
min_samples_split=2, num_pct=100, pruned=True), CostSensitiveDecisionTreeClassifier(criterion='direct_cost',
criterion_weight=False, max_depth=None,
max_features='auto', min_gain=0.001, min_samples_leaf=1,
min_samples_split=2, num_pct=100, pruned=True)], [array([False, True, False, ..., True, False, True], dtype=bool), array([False, True, True, ..., True, False, False], dtype=bool), array([ True, True, True, ..., True, False, True], dtype=bool), array([ True, True, True, ..., True, True, True], dtype=bool), array([False, False, True, ..., True, True, True], dtype=bool), array([ True, True, True, ..., True, True, True], dtype=bool), array([ True, True, True, ..., True, False, True], dtype=bool), array([ True, True, False, ..., True, True, False], dtype=bool)], [array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
34, 35, 36, 37, 38, 39]), array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
34, 35, 36, 37, 38, 39]), array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
34, 35, 36, 37, 38, 39]), array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
34, 35, 36, 37, 38, 39]), array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
34, 35, 36, 37, 38, 39]), array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
34, 35, 36, 37, 38, 39]), array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
34, 35, 36, 37, 38, 39]), array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
34, 35, 36, 37, 38, 39])])]'. Reason: 'PicklingError("Can't pickle <class costcla.models.cost_tree._tree_class at 0x7261ae0>: it's not found as costcla.models.cost_tree._tree_class",)'
Hi,
Importing the library fails with errors including:
It seems this library currently doesn't support pandas DataFrames as input data, because the following error is raised:
KeyError: '[58976 73597 747 ..., 32194 57641 57730] not in index'
So to use this library, one must call the DataFrame's .as_matrix() to convert it into a NumPy ndarray (in recent pandas versions, where .as_matrix() has been removed, use .to_numpy() or .values instead).
AttributeError Traceback (most recent call last)
in
1 while True:
2 img,ret = capture.read()
----> 3 img=img.astype('uint8')
4 gray=cv2.cvtColor(img,cv2.COLOR_BGR2GRAY)
5 faces = face_cascade.detectMultiscale(gray,1.3,5)
AttributeError: 'bool' object has no attribute 'astype'
In your tutorial, I saw:
X_train, X_test, y_train, y_test, cost_mat_train, cost_mat_test = \
train_test_split(data.data, data.target, data.cost_mat)
But data is a built-in dataset. What about my own data, which doesn't have a .cost_mat attribute?
Is it necessary to create a 4-column cost matrix for each training and test set, as in cost_mat[C_FP,C_FN,C_TP,C_TN]? I only have one cost for each training instance.
What do you recommend here? Can my cost just be the cost of classifying the instance as positive (C_FP), and would you advise leaving the rest at 0?
Also, does the value of the cost matter? Or is it more about the distribution of the cost vector across C_FP, C_FN, etc.?
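A minimal sketch of building such a matrix for custom data, assuming the column order [C_FP, C_FN, C_TP, C_TN] named in the question; build_cost_matrix is a hypothetical helper and the numbers are invented. One caveat that follows from the arithmetic: if C_FN is 0 for every instance, predicting everything as negative costs nothing, so a cost-minimizing model has no incentive to ever predict positive. Some nonzero false-negative cost is usually needed.

```python
def build_cost_matrix(fp_costs, fn_costs, tp_costs=None, tn_costs=None):
    """Return one [C_FP, C_FN, C_TP, C_TN] row per instance.

    Costs of correct decisions default to 0 when not supplied.
    """
    n = len(fp_costs)
    tp_costs = [0.0] * n if tp_costs is None else tp_costs
    tn_costs = [0.0] * n if tn_costs is None else tn_costs
    if not (len(fn_costs) == len(tp_costs) == len(tn_costs) == n):
        raise ValueError("all cost columns must have the same length")
    return [[fp, fn, tp, tn]
            for fp, fn, tp, tn in zip(fp_costs, fn_costs, tp_costs, tn_costs)]


# Hypothetical data: a per-instance false-positive cost and a
# constant false-negative cost of 5.0.
per_instance_cost = [1.0, 2.5, 0.7]
cost_mat = build_cost_matrix(per_instance_cost, [5.0] * 3)
for row in cost_mat:
    print(row)
```

After conversion with np.asarray, the matrix can be passed to train_test_split together with the features and labels, as in the tutorial line quoted above, so that the rows stay aligned with their instances.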
Part of the Bayes Minimum Risk example is:
f = RandomForestClassifier(random_state=0).fit(X_train, y_train)
y_prob_test = f.predict_proba(X_test)
y_pred_test_rf = f.predict(X_test)
f_bmr = BayesMinimumRiskClassifier()
f_bmr.fit(y_test, y_prob_test)
y_pred_test_bmr = f_bmr.predict(y_prob_test, cost_mat_test)
I just can't understand why you use the test data (y_test) to fit the BMR model. It doesn't make sense, because y_test is what we want to predict.
If there is something wrong with my understanding, please let me know.
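For context, the Bayes minimum risk decision itself needs only predicted probabilities and costs, not labels: it predicts positive whenever the expected cost of doing so is no higher than the expected cost of predicting negative. The sketch below is a plain-Python version of that textbook rule, not the library's code. It assumes y_prob holds the positive-class probability and that cost rows are ordered [FP, FN, TP, TN]. If fit in costcla is used only for something like probability calibration (an assumption to verify in the source), then the concern about using y_test there would be about leakage in that calibration step.

```python
def bmr_predict(y_prob, cost_mat):
    """Bayes minimum risk: choose the label with the lower expected cost."""
    y_pred = []
    for p, (c_fp, c_fn, c_tp, c_tn) in zip(y_prob, cost_mat):
        risk_positive = p * c_tp + (1.0 - p) * c_fp  # expected cost of predicting 1
        risk_negative = p * c_fn + (1.0 - p) * c_tn  # expected cost of predicting 0
        y_pred.append(1 if risk_positive <= risk_negative else 0)
    return y_pred


# Hypothetical positive-class probabilities and costs (FP=2, FN=60).
y_prob = [0.05, 0.02, 0.60]
cost_mat = [[2, 60, 0, 0]] * 3
print(bmr_predict(y_prob, cost_mat))  # prints [1, 0, 1]
```

With a false negative thirty times as costly as a false positive, even a 5% probability is enough to predict positive, while 2% is not.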
Hi,
@albahnsen Do you have any thoughts on the relationship between the cost matrix and re-balancing the data? I notice that you do not rebalance in your final logistic regression model in your wonderful paper.
If I have a highly imbalanced dataset where:
<1% are Positive
99% are Negative
But the theoretical cost is:
30 if all are labelled positive
and 1 if all are labelled negative
What should I be adjusting to stop it predicting all Positive? The imbalance? The cost? The iterations?
Thanks!
Edit: I've done some cross-validation to check different values of C and max_iter, but it seems the best savings score I can get is 0 (with the worst being -12).
Hi,
Thank you very much for this library, it is exactly what I was looking for.
I noticed a difference in the required arguments of two methods of the CostSensitiveRandomForestClassifier model, predict and predict_proba. The former accepts a cost matrix as input but the latter does not. I thought they were essentially the same, with predict giving a crisp label if the predicted probability (or voting result) is greater than the default threshold of 0.5. Can you explain what I'm missing here?
Thanks.
Tried on a fresh installation. I think it's because sklearn changed their interface for a couple things; should just be a quick swap. I'll clone the package and attempt to fix it.
To Reproduce:
python3 -m virtualenv venv
source venv/bin/activate
pip3 install costcla
from costcla.datasets import load_bankmarketing
load_bankmarketing()
The traceback is
Traceback (most recent call last):
File "test.py", line 1, in <module>
from costcla.datasets import load_bankmarketing
File "/home/alex/test/venv/lib/python3.6/site-packages/costcla/__init__.py", line 31, in <module>
from .models import *
File "/home/alex/test/venv/lib/python3.6/site-packages/costcla/models/__init__.py", line 11, in <module>
from .cost_ensemble import CostSensitivePastingClassifier
File "/home/alex/test/venv/lib/python3.6/site-packages/costcla/models/cost_ensemble.py", line 8, in <module>
from sklearn.cross_validation import train_test_split
ModuleNotFoundError: No module named 'sklearn.cross_validation'
Hi,
First of all, thank you for developing this great theory and library.
I'm doing some tests using CostSensitiveDecisionTreeClassifier, but after I fit the model, I can neither export the tree nor visualize which features are used in each node and their respective comparisons.
How can I do this?
I am running into an issue with the cost-sensitive decision tree class. It works properly on the go-to datasets that are available in the package. But once I try it with my own datasets and predict, the algorithm returns the exact same value for every single observation, as in the image below. Do you know what could be the reason for this behavior? Note that my code is exactly the same for the go-to datasets and for mine; the only difference is the outcomes.
So thankful this project exists! Are there any multiclass examples available? What would the cost matrix object look like in this scenario?
The link to the tutorial in the README file is broken. Opening this issue to get it updated.
Could CostSensitiveClassification be added to conda-forge?