shichenxie / scorecardpy

Scorecard development in Python (评分卡, "scorecard")

Home Page: http://shichen.name/scorecard

License: MIT License

Python 100.00%

python scorecard credit-scoring release woebinning woe binning

scorecardpy's Introduction

scorecardpy

This package is a Python version of the R package scorecard. Its goal is to make the development of traditional credit risk scorecard models easier and more efficient by providing functions for some common tasks:

data partition (split_df)
variable selection (iv, var_filter)
weight of evidence (woe) binning (woebin, woebin_plot, woebin_adj, woebin_ply)
scorecard scaling (scorecard, scorecard_ply)
performance evaluation (perf_eva, perf_psi)

Installation

Install the release version of scorecardpy from PyPI with:

pip install scorecardpy

Install the latest version of scorecardpy from GitHub with:

pip install git+https://github.com/shichenxie/scorecardpy.git

Example

This is a basic example which shows you how to develop a common credit risk scorecard:

# Traditional Credit Scoring Using Logistic Regression
import scorecardpy as sc

# data prepare ------
# load germancredit data
dat = sc.germancredit()

# filter variable via missing rate, iv, identical value rate
dt_s = sc.var_filter(dat, y="creditability")

# breaking dt into train and test
train, test = sc.split_df(dt_s, 'creditability').values()

# woe binning ------
bins = sc.woebin(dt_s, y="creditability")
# sc.woebin_plot(bins)

# binning adjustment
# # adjust breaks interactively
# breaks_adj = sc.woebin_adj(dt_s, "creditability", bins) 
# # or specify breaks manually
breaks_adj = {
    'age.in.years': [26, 35, 40],
    'other.debtors.or.guarantors': ["none", "co-applicant%,%guarantor"]
}
bins_adj = sc.woebin(dt_s, y="creditability", breaks_list=breaks_adj)

# converting train and test into woe values
train_woe = sc.woebin_ply(train, bins_adj)
test_woe = sc.woebin_ply(test, bins_adj)

y_train = train_woe.loc[:,'creditability']
X_train = train_woe.loc[:,train_woe.columns != 'creditability']
y_test = test_woe.loc[:,'creditability']
X_test = test_woe.loc[:,test_woe.columns != 'creditability']

# logistic regression ------
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(penalty='l1', C=0.9, solver='saga', n_jobs=-1)
lr.fit(X_train, y_train)
# lr.coef_
# lr.intercept_

# predicted probability
train_pred = lr.predict_proba(X_train)[:,1]
test_pred = lr.predict_proba(X_test)[:,1]

# performance ks & roc ------
train_perf = sc.perf_eva(y_train, train_pred, title = "train")
test_perf = sc.perf_eva(y_test, test_pred, title = "test")

# score ------
card = sc.scorecard(bins_adj, lr, X_train.columns)
# credit score
train_score = sc.scorecard_ply(train, card, print_step=0)
test_score = sc.scorecard_ply(test, card, print_step=0)

# psi
sc.perf_psi(
  score = {'train':train_score, 'test':test_score},
  label = {'train':y_train, 'test':y_test}
)

scorecardpy's Issues

Huge variance in IV from

I have noticed a very large difference in IV when doing this:

dt_info_value = sc.iv(df, y = "FLAG")
dt_info_value
Variable                                | Info Value
HIGHEST_WITH_VALUE_6M | 0.299023

but when I do the binning

bins_var = sc.woebin(df,y="FLAG", x="HIGHEST_WITH_VALUE_6M")
sc.woebin_plot(bins_var)

What could be the reason? Second, how do I rotate the x-axis values so that they are readable?

perf.py #548

#ax1.legend((p1[0], p2[0]), list(distr_prob.dates.levels [1]), loc='upper left').
I have a question:
Removing the first argument '(p1[0], p2[0])' makes no change to the legend in the graph.
What role does the handles argument '(p1[0], p2[0])' play in the ax.legend() function?

woebin_adj returns string

Hi! The woebin_adj function returns a string, but it should return a dict.

TypeError: unhashable type: 'list'

When I run the author's example, the step "train_woe = sc.woebin_ply(train, bins_adj)" raises this error.

BINNING OF DATA

Is it possible to include a function that bins the data, i.e. the whole data frame, especially the numerical variables?
This would allow creating a scorecard that uses the data as is, with categorical and binned numerical variables, and WOE variables for comparison purposes.

What kind of tree does it use?

Hello,

I am exploring the library in order to create a complex scorecard for my clients. I have always used the 'smbinning' package in R because I really trust CIT when creating the tree, and I was wondering what kind of tree scorecardpy uses for the discretization process.

Thanks in advance

Dummy variables (0,1)

I have a data frame with dummy variables (among others).
However when I run:
bins = sc.woebin(mydf, y="MY_GOODBAD") the following error is received:
AttributeError: module '__main__' has no attribute '__spec__'

When I run the same command without dummy variables it is ok.
In R it was solved by 'factoring' dummy variables.
How to solve it in Python (Python 3.6)?

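A small issue

The default parameters of the functions scorecard and ab are inconsistent, which causes the constant term of the score computed by scorecard to change.
Thank you very much. Your code is very well written and I am learning from it.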

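Question about how WOE is calculated

When computing WOE, the source code replaces a frequency of 0 with 0.9 (or with 0.99 elsewhere) so that IV can be calculated. What is the justification for this? Why not simply merge bins until the frequency is at least 1? Could this distort the true IV significantly? Is there a proof? I have not seen IV calculated this way anywhere else.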

List removed variables

When running
dt_s = sc.var_filter(dat, y="creditability", return_rm_reason=True)

how do I see the removed variables and the reasons they were removed?

Exception in woebin_adj function

Thank you for the Python version! My issue is about the woebin_adj function:

def menu(i, xs_len, x_i):
       ...
        adj_brk = input("Selection: ")
        adj_brk = int(adj_brk)
        if adj_brk not in [0,1,2,3]:
            warnings.warn('Enter an item from the menu, or 0 to exit.')
            adj_brk = input("Selection: ")
            adj_brk = int(adj_brk)
        return adj_brk

If you make a mistake in the selection, the statement adj_brk = int(adj_brk) raises a ValueError and throws you out of interactive mode.

I think the code should be changed to something like:

    def menu(i, xs_len, x_i):
        print('>>> Adjust breaks for ({}/{}) {}?'.format(i, xs_len, x_i))
        print('1: next \n2: yes \n3: back')
        adj_brk = input("Selection: ")
        while(True):
            if adj_brk not in ['0', '1', '2', '3']:
                warnings.warn('Enter an item from the menu, or 0 to exit.')
                adj_brk = input("Selection: ")
                continue
            else:
                break
        return int(adj_brk)

In addition, 'warnings' does not work for me in interactive mode. Maybe we should use logging instead of the 'warnings' module.
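
The validation loop proposed above can be made easy to test by passing the input function in as a parameter. The following is a minimal sketch, not the package's code; the name read_menu_choice and its parameters are hypothetical.

```python
def read_menu_choice(valid=("0", "1", "2", "3"), read=input, show=print):
    """Keep prompting until the user enters one of the valid menu items.

    `read` and `show` are injectable so the loop can be exercised without
    a terminal; by default they are the built-in input() and print().
    """
    choice = read("Selection: ").strip()
    while choice not in valid:
        # Compare strings first, so non-numeric input never reaches int()
        show("Enter an item from the menu, or 0 to exit.")
        choice = read("Selection: ").strip()
    return int(choice)


# Simulated session: two invalid entries followed by a valid one
answers = iter(["x", "7", "2"])
messages = []
selected = read_menu_choice(read=lambda prompt: next(answers), show=messages.append)
```

Because the reply is only converted with int() after it has matched one of the allowed strings, a mistyped entry leads to a new prompt instead of an exception.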

gains table missing

I can't find the gains table function using
sc.gains_table

ValueError: fill value must be in categories pandas Version: 0.24.2

RemoteTraceback Traceback (most recent call last)
RemoteTraceback:
"""
Traceback (most recent call last):
File "D:\Anaconda3\lib\multiprocessing\pool.py", line 121, in worker
result = (True, func(*args, **kwds))
File "D:\Anaconda3\lib\multiprocessing\pool.py", line 47, in starmapstar
return list(itertools.starmap(args[0], args[1]))
File "D:\Anaconda3\lib\site-packages\scorecardpy\woebin.py", line 950, in woepoints_ply1
dtx = dtx.fillna('missing').assign(rowid = dtx.index)
File "D:\Anaconda3\lib\site-packages\pandas\core\frame.py", line 4034, in fillna
downcast=downcast, **kwargs)
File "D:\Anaconda3\lib\site-packages\pandas\core\generic.py", line 6123, in fillna
downcast=downcast)
File "D:\Anaconda3\lib\site-packages\pandas\core\internals\managers.py", line 525, in fillna
return self.apply('fillna', **kwargs)
File "D:\Anaconda3\lib\site-packages\pandas\core\internals\managers.py", line 395, in apply
applied = getattr(b, f)(**kwargs)
File "D:\Anaconda3\lib\site-packages\pandas\core\internals\blocks.py", line 1834, in fillna
values = values.fillna(value=value, limit=limit)
File "D:\Anaconda3\lib\site-packages\pandas\util\_decorators.py", line 188, in wrapper
return func(*args, **kwargs)
File "D:\Anaconda3\lib\site-packages\pandas\core\arrays\categorical.py", line 1784, in fillna
raise ValueError("fill value must be in categories")
ValueError: fill value must be in categories
"""

The above exception was the direct cause of the following exception:

ValueError Traceback (most recent call last)
in
1 # converting train and test into woe values
----> 2 train_woe = sc.woebin_ply(train, bins_adj)
3 test_woe = sc.woebin_ply(test, bins_adj)
4
5 y_train = train_woe.loc[:,'creditability']

D:\Anaconda3\lib\site-packages\scorecardpy\woebin.py in woebin_ply(dt, bins, no_cores, print_step)
1059 )
1060 # bins in dictionary
-> 1061 dat_suffix = pool.starmap(woepoints_ply1, args)
1062 dat = pd.concat([dat]+dat_suffix, axis=1)
1063 pool.close()

D:\Anaconda3\lib\multiprocessing\pool.py in starmap(self, func, iterable, chunksize)
274 func and (a, b) becomes func(a, b).
275 '''
--> 276 return self._map_async(func, iterable, starmapstar, chunksize).get()
277
278 def starmap_async(self, func, iterable, chunksize=None, callback=None,

D:\Anaconda3\lib\multiprocessing\pool.py in get(self, timeout)
655 return self._value
656 else:
--> 657 raise self._value
658
659 def _set(self, i, obj):

ValueError: fill value must be in categories

It does not run for me under Python 3.7.

Scorecard

Hi @ShichenXie,

If WOE is calculated as ln(B/G) and we expect a higher score to indicate more defaults, I am not sure it is correct to calculate the score as -b*x['woe']*coef_df[i] + basepoints/len_x.

I think it should be "b*x['woe']*coef_df[i] + basepoints/len_x". Please correct me if I am wrong.

scorecard

len_x = len(coef_df)
basepoints = a - b*model.intercept_[0]
card = {}
if basepoints_eq0:
    card['basepoints'] = pd.DataFrame({'variable':"basepoints", 'bin':np.nan, 'points':0}, index=np.arange(1))
    for i in coef_df.index:
        card[i] = bins.loc[bins['variable']==i,['variable', 'bin', 'woe']]\
          .assign(points = lambda x: round(-b*x['woe']*coef_df[i] + basepoints/len_x))\
          [["variable", "bin", "points"]]
else:
    card['basepoints'] = pd.DataFrame({'variable':"basepoints", 'bin':np.nan, 'points':round(basepoints)}, index=np.arange(1))
    for i in coef_df.index:
        card[i] = bins.loc[bins['variable']==i,['variable', 'bin', 'woe']]\
          .assign(points = lambda x: round(-b*x['woe']*coef_df[i]))\
          [["variable", "bin", "points"]]
return card
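
The sign question can be checked numerically. The sketch below assumes the usual scorecard scaling, in which a score of points0 corresponds to bad-to-good odds of odds0 and the score drops by pdo points each time those odds double. The helper names and default values are illustrative assumptions, not taken from the package source. Under that convention, with WOE defined as ln(bad/good), a larger WOE means more risk, so the minus sign makes a higher score mean lower risk.

```python
import math


def scaling_constants(points0=600, odds0=1 / 19, pdo=50):
    """Return (a, b) such that score = a - b * ln(odds of bad)."""
    b = pdo / math.log(2)
    a = points0 + b * math.log(odds0)
    return a, b


def score_from_log_odds(log_odds, a, b):
    # log_odds = intercept + sum(coef_i * woe_i) from the logistic regression,
    # with the bad class coded as 1
    return a - b * log_odds


a, b = scaling_constants()
score_at_base = score_from_log_odds(math.log(1 / 19), a, b)
score_at_double = score_from_log_odds(math.log(2 / 19), a, b)
```

If the opposite convention is wanted (a higher score meaning more risk), the sign of b would have to flip consistently, in the base points as well as in the per-bin points.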

'chimerge' method problems in woebin.py

Hi Dr Xie,

First of all, thank you for your response. I use the 'chimerge' method in woebin.py. It produces the bins without error, but when the bins are applied to a dataset, the last break point is not 'inf', so some values become 'nan' after the transformation. That means the breaks do not cover all values.

The results are as follows:

https://github.com/monicamn/test

I would greatly appreciate your reply.

Can we get for multilabel data?

Is it possible to handle multilabel data?
Thanks

TypeError: concat() got an unexpected keyword argument 'sort'

I was following the tutorial and I got this error when calculating train_perf:

TypeError                                 Traceback (most recent call last)
<ipython-input-70-707eda962142> in <module>()
      3 
      4 # performance ks & roc ------
----> 5 train_perf = sc.perf_eva(y_train, train_pred, title = "train")
      6 test_perf = sc.perf_eva(y_test, test_pred, title = "test")

/usr/local/lib/python3.6/dist-packages/scorecardpy/perf.py in perf_eva(label, pred, title, groupnum, plot_type, show_plot, positive, seed)
    280     # dfkslift ------
    281     if any([i in plot_type for i in ['ks', 'lift']]):
--> 282         dfkslift = eva_dfkslift(df, groupnum)
    283         if 'ks' in plot_type: df_ks = dfkslift
    284         if 'lift' in plot_type: df_lift = dfkslift

/usr/local/lib/python3.6/dist-packages/scorecardpy/perf.py in eva_dfkslift(df, groupnum)
     31       pd.DataFrame({'group':0, 'good':0, 'bad':0, 'good_distri':0, 'bad_distri':0, 'badrate':0, 'cumbadrate':np.nan, 'cumgood':0, 'cumbad':0, 'ks':0, 'lift':np.nan}, index=np.arange(1)),
     32       df_kslift
---> 33     ], ignore_index=True, sort=False)
     34     # return
     35     return df_kslift

TypeError: concat() got an unexpected keyword argument 'sort'

Please add a parameter

I want to use woebin_ply on a single row, but there is a check on unique values:
"columns have only one unique values, which are removed from input dataset"
Please add a parameter to the woebin_ply function to disable this removal.

Thanks for your work!

Is it possible to set the scores manually after creating binning for features?

Error when running # converting train and test into woe values

When I run your example code on the same dataset, I get an error on these lines:

#converting train and test into woe values

train_woe = sc.woebin_ply(train, bins_adj)
test_woe = sc.woebin_ply(test, bins_adj)

ValueError: fill value must be in categories

Is there any step that I missed?
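
One possible cause, judging from the traceback in the earlier report of the same error, is a column of pandas categorical dtype: filling with 'missing' fails when 'missing' is not already one of the categories. The sketch below shows two workarounds on a toy Series that could be applied to such columns before calling woebin_ply. Whether this matches your data is an assumption.

```python
import pandas as pd

# Toy categorical column with one missing value
s = pd.Series(pd.Categorical(["a", None, "b"]))

# Workaround 1: register the fill value as a category first
filled = s.cat.add_categories(["missing"]).fillna("missing")

# Workaround 2: convert to plain Python objects (dtype object), then fill
filled_obj = s.astype(object).fillna("missing")
```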

var_filter.py#87

I guess the variable 'len_diff' in the warning call should be 'len_diff_var_kp'. I ran into this after several unsuccessful attempts to pass a set of to-keep variables that contains more names than the original dataframe actually has.

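Question about variable selection thresholds

When using your package, I found a variable with a 92% missing rate whose IV still met the 0.02 threshold in the initial filter but no longer met it after WOE binning. Manual binning then fails with errors such as "'str' object cannot be interpreted as an integer". Should the IV filter be applied again after WOE binning?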

Bro, why is your WOE reversed?

Bro, WOE = ln(good distribution / bad distribution). In your case, bad (the event) is 1, but the ratio becomes bad distribution / good distribution.

Please correct me if I use the function in the wrong way.

That said, this package is currently the most convenient one for computing WOE and adjusting bins manually. Good job, bro.
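
Both conventions appear in practice, and they differ only in sign. The sketch below uses made-up counts to show that swapping the ratio flips the sign of every WOE value while leaving IV unchanged. The function name and the counts are illustrative.

```python
import math


def woe_iv(goods, bads, bad_over_good=True):
    """Return (woe per bin, total IV) for per-bin good and bad counts."""
    total_good, total_bad = sum(goods), sum(bads)
    woes, iv = [], 0.0
    for g, b in zip(goods, bads):
        dist_good, dist_bad = g / total_good, b / total_bad
        if bad_over_good:
            woe = math.log(dist_bad / dist_good)
            iv += (dist_bad - dist_good) * woe
        else:
            woe = math.log(dist_good / dist_bad)
            iv += (dist_good - dist_bad) * woe
        woes.append(woe)
    return woes, iv


goods = [400, 250, 50]  # hypothetical good counts per bin
bads = [40, 100, 160]   # hypothetical bad counts per bin
woe_bg, iv_bg = woe_iv(goods, bads, bad_over_good=True)
woe_gb, iv_gb = woe_iv(goods, bads, bad_over_good=False)
```

What matters downstream is that the scorecard scaling uses the same convention as the WOE transformation, so that the sign of the points matches the intended direction of the score.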

problem about woeply with special value

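This Python package has helped me a lot, thank you very much! However, I also found some problems. When using equal-frequency binning (I first compute the equal-frequency cut points and pass them in through the breaks_list parameter) and then replacing values with WOE, I found the following two problems:
1. For variables with special values, those values are not replaced with their own WOE after the transformation; they still get the WOE of the interval they fall into. From a brief look at the source, the problem seems to be in woepoints_ply1(). At the final replacement step, each value is first cut using the breaks from woebin and then merged with the bin column from woebin. Before the merge, missing values (NaN) are replaced with 'missing', but other special values are not mapped to their corresponding bin, so the merge assigns them to the interval that contains them rather than to their own separate group.
2. After equal-frequency binning of a numeric feature, the last break may not be inf, even though I forced it to inf. As a result, values that belong to the last group get NaN as their WOE, because the right end of the interval differs and the merge finds no match.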

Manually adjust the cut-off

I am trying to manually adjust the cut-offs for my variable age as below:

break_adj = age[0.99, 1.99, 2]
bins_adj = sc.woebin(df, y="TARGET", breaks_list=break_adj, print_info=FALSE)

but I get the error below:
NameError: name 'age' is not defined
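
The NameError arises because age[...] is evaluated as indexing a Python variable called age, which does not exist. Going by the README example, breaks_list is a dict that maps each variable name (a string) to its list of break points. Note also that Python spells the boolean False, not FALSE. A minimal sketch, with the woebin call left as a comment because it needs the reporter's data:

```python
# Map the column name (as a string) to its break points
breaks_adj = {
    "age": [0.99, 1.99, 2],
}

# Assuming df and the TARGET column from the report above:
# bins_adj = sc.woebin(df, y="TARGET", breaks_list=breaks_adj)
```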

perf_eva calculation

The perf_eva function describes pred as "Predicted probability or score".
But when I pass the predicted probability and the score, the results are not equal.

Using the predicted probability: KS = 0.5286

train_pred = lr.predict_proba(x_train)[:,1]
train_perf = sc.perf_eva(y_train, train_pred, title = "train")

Using the score: KS = 0.3449

train_pred_c= lr.predict(x_train)
sc.perf_eva(y_train,train_pred_c,title='ttt')

I hope you can help. Thanks a lot.
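
A likely explanation is that lr.predict returns hard class labels (0 or 1) rather than scores, so the second call ranks observations using only two distinct values. The sketch below computes KS on toy data, once from probabilities and once from labels thresholded at 0.5. The helper ks_statistic is a plain-Python illustration, not the perf_eva implementation.

```python
def ks_statistic(labels, preds):
    """Maximum gap between cumulative bad and good rates over all thresholds."""
    n_bad = sum(labels)
    n_good = len(labels) - n_bad
    best = 0.0
    for t in sorted(set(preds)):
        # Share of bads and goods with prediction >= threshold t
        bad_rate = sum(1 for y, p in zip(labels, preds) if y == 1 and p >= t) / n_bad
        good_rate = sum(1 for y, p in zip(labels, preds) if y == 0 and p >= t) / n_good
        best = max(best, abs(bad_rate - good_rate))
    return best


y = [0, 0, 0, 0, 1, 0, 1, 1]
proba = [0.05, 0.10, 0.20, 0.30, 0.35, 0.60, 0.70, 0.90]
hard = [1 if p >= 0.5 else 0 for p in proba]  # what predict() would return

ks_proba = ks_statistic(y, proba)
ks_hard = ks_statistic(y, hard)
```

On this toy data the probabilities give a KS of 0.8, while the hard labels give about 0.47, because thresholding at 0.5 discards the ranking information within each class.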

Woebin function fails on MS SQL server

Trying to run on a MS SQL server instance using the german credit dataset. Error as below:

multiprocessing.pool.RemoteTraceback:
"""

Msg 39019, Level 16, State 2, Line 137
An external script error occurred:
Traceback (most recent call last):
File "C:\Program Files\Microsoft SQL Server\MSSQL14.MSSQLSERVER\PYTHON_SERVICES\lib\multiprocessing\pool.py", line 119, in worker
result = (True, func(*args, **kwds))
File "C:\Program Files\Microsoft SQL Server\MSSQL14.MSSQLSERVER\PYTHON_SERVICES\lib\multiprocessing\pool.py", line 47, in starmapstar
return list(itertools.starmap(args[0], args[1]))
File "C:\Program Files\Microsoft SQL Server\MSSQL14.MSSQLSERVER\PYTHON_SERVICES\lib\site-packages\scorecardpy\woebin.py", line 702, in woebin2
stop_limit=stop_limit, max_num_bin=max_num_bin, breaks=breaks, spl_val=spl_val)
File "C:\Program Files\Microsoft SQL Server\MSSQL14.MSSQLSERVER\PYTHON_SERVICES\lib\site-packages\scorecardpy\woebin.py", line 473, in woebin2_tree
bin_list = woebin2_init_bin(dtm, min_perc_fine_bin=min_perc_fine_bin, breaks=breaks, spl_val=spl_val)

Msg 39019, Level 16, State 2, Line 137
An external script error occurred:
File "C:\Program Files\Microsoft SQL Server\MSSQL14.MSSQLSERVER\PYTHON_SERVICES\lib\site-packages\scorecardpy\woebin.py", line 292, in woebin2_init_bin
brk = list(filter(lambda x: x>np.nanmin(xvalue) and x<np.nanmax(xvalue), brk))
File "C:\Program Files\Microsoft SQL Server\MSSQL14.MSSQLSERVER\PYTHON_SERVICES\lib\site-packages\scorecardpy\woebin.py", line 292, in <lambda>
brk = list(filter(lambda x: x>np.nanmin(xvalue) and x<np.nanmax(xvalue), brk))
File "C:\Program Files\Microsoft SQL Server\MSSQL14.MSSQLSERVER\PYTHON_SERVICES\lib\site-packages\pandas\core\generic.py", line 1478, in __nonzero__
.format(self.__class__.__name__))
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "", line 5, in

Msg 39019, Level 16, State 2, Line 137
An external script error occurred:
File "C:\PROGRA~1\MICROS~3\MSSQL1~1.MSS\MSSQL\EXTENS~1\MSSQLSERVER01\85A94319-2EE5-4271-A733-EFE1547B54C9\sqlindb.py", line 215, in transform
bins = sc.woebin(train, y="risk")
File "C:\Program Files\Microsoft SQL Server\MSSQL14.MSSQLSERVER\PYTHON_SERVICES\lib\site-packages\scorecardpy\woebin.py", line 893, in woebin
bins = dict(zip(xs, pool.starmap(woebin2, args)))
File "C:\Program Files\Microsoft SQL Server\MSSQL14.MSSQLSERVER\PYTHON_SERVICES\lib\multiprocessing\pool.py", line 268, in starmap
return self._map_async(func, iterable, starmapstar, chunksize).get()
File "C:\Program Files\Microsoft SQL Server\MSSQL14.MSSQLSERVER\PYTHON_SERVICES\lib\multiprocessing\pool.py", line 608, in get
raise self._value
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Line of code that fails:
bins = sc.woebin(app_train_imp, y="risk")
app_train_imp is a pandas dataframe having the column named 'risk'

Python version being used on the server - 3.5.2
scorecardPy version - 0.1.7
Pandas version - 0.24.2
MS SQL Server version - 2017

Add class_weight='balanced' in lr, sc.perf_psi raise error

AttributeError: 'DataFrame' object has no attribute 'distr'

about the woebin

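Hello, sorry to bother you. The source code is long and I have not read it carefully. Which function should I call for best-KS binning? Also, is there API documentation? Some of the code does not seem to be fully documented.

Thanks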

Is multiprocessing module working in woebin function?

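Hello. There is one thing I do not quite understand: does the multiprocessing built into woebin have any real effect? CPU usage stays fairly low, whereas wrapping the call in my own layer of multiprocessing gives a noticeable improvement. Thanks.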

is non continuous binning possible?

I am working with another version of the dataset where, for example, "Purpose" was changed from qualitative to categorical, where:

car (new) -> 0 
car (used) -> 1
furniture/equipment -> 2
radio/television -> 3
domestic appliances -> 4
repairs -> 5
education -> 6
retraining -> 8
business -> 9
others -> 10

I would like to bin them, for example, business and education, so 6 and 9, but if I specify

breaks_list = {
      'Purpose': ['6%,%9']
    }

the model will bin from 6 to 9, not 6 & 9. I tried saying "6 or 9" and it didn't work either.
Is it possible to do this?

Thank you for working so hard on this library!

scorecard_ply can not convert some variables.

Hi Shichen

I have the following code:
my_card <- scorecard(bins, model, points0 = 600, odds0 = 1/38, pdo = 20)

points_train <- scorecard_ply(train.data, my_card, only_total_score = FALSE, print_step = 0) %>% as.data.frame()

The above code runs normally, but when I check the new transformed dataframe (points_train),
I see that some of the variables have NA values (such as GENDER and MARITAL STATUS). Checking the my_card list, I see no problem for these variables, but once I convert them to get scores for the training data (points_train), these variables have NA values. Do you know why this could be?
I have been trying to solve it for a few days but still can't figure it out. Many thanks!

Save temporary result of woebin_adj

Hi! Currently the woebin_adj function does not save the break results to save_breaks_list on each iteration, so when an exception occurs, all previous steps are lost. Please write the breaks to save_breaks_list on each iteration.

perf.py The calculation of KS

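When I use sc.perf_eva(y_train, train_pred, plot_type = ["ks"]) to compute KS, the definition of the KS curve does not appear to be correct.
A correct KS curve should have the threshold on the x-axis and TPR and FPR on the y-axis, so the calculation in the source code looks wrong. Below is my code, adapted from code found online with my own changes. I hope Mr. Xie can take a look: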

import numpy as np
import pandas as pd

def model_score_pro_ks(df_fact_tag, df_expected_score_or_pro, buckets, type_input='score'):
    # Initial list of equally spaced breakpoints in [0, 1]
    breakpoints = np.arange(0, buckets + 1) / (buckets)
    # Split the predicted scores into `buckets` equal-width bins and return
    # the bin boundaries as an array.
    # bin_list is the list of initial equally spaced breakpoints;
    # the min and max arguments are the minimum and maximum of df_expected_score_or_pro.
    def transform_scale(bin_list, df_expected_score_or_pro_min, df_expected_score_or_pro_max):
        # Rescale the initial breakpoints to the actual score range
        bin_list /= np.max(bin_list) / (df_expected_score_or_pro_max - df_expected_score_or_pro_min)
        # Shift by the minimum so that the breakpoints span the scores
        bin_list += df_expected_score_or_pro_min
        return bin_list
    # Equal-width breakpoints over the scores
    breakpoints = transform_scale(breakpoints, np.min(df_expected_score_or_pro), np.max(df_expected_score_or_pro))

    # Compute KS for a given threshold
    def calculate_expected_tag(fact_tag, expected_score_pro, bins_point):
        # Put actual and predicted values in one data frame for labelling
        ksdf = pd.DataFrame({'fact_tag': fact_tag, 'expected_score_pro': expected_score_pro})
        # A low probability means no default, and a high score also means no default.
        # For KS it does not matter which side is labelled 0, because the final KS
        # is an absolute value; it does affect the TPR and FPR curves.
        if type_input == 'score':
            ksdf.loc[:,'expected_tag'] = ksdf.apply(lambda x:1 if x['expected_score_pro'] <= bins_point else 0,axis=1)
        elif type_input == 'pro':
            ksdf.loc[:,'expected_tag'] = ksdf.apply(lambda x:0 if x['expected_score_pro'] <= bins_point else 1,axis=1)
        else:
            raise Exception("Incorrect inputs; type_input should be either 'score' or 'pro'.")

        # Compute TPR and FPR
        # fact_tag is the actual label: 1 = lost contact, 0 = not lost contact
        # expected_tag is the predicted label: 1 = lost contact (low score), 0 = not lost contact (high score)
        TP = sum([1 if a==b==1 else 0 for a,b in zip(ksdf['fact_tag'],ksdf['expected_tag'])])  # positives predicted as positive
        FN = sum([1 if a==1 and b==0 else 0 for a,b in zip(ksdf['fact_tag'],ksdf['expected_tag'])])  # positives predicted as negative
        TPR = TP/(TP+FN)
        TN = sum([1 if a==b==0 else 0 for a,b in zip(ksdf['fact_tag'],ksdf['expected_tag'])])  # negatives predicted as negative
        FP = sum([1 if a==0 and b==1 else 0 for a,b in zip(ksdf['fact_tag'],ksdf['expected_tag'])])  # negatives predicted as positive
        FPR = FP/(TN+FP)

        KS = TPR - FPR

        return pd.DataFrame({'bins_point':[bins_point],'TPR':[TPR], 'FPR':[FPR],'KS':[abs(KS)]})

    df1 = pd.DataFrame()

    for i in breakpoints:
        # Keep appending rows until KS has been computed for every breakpoint
        df1 = pd.concat([df1,calculate_expected_tag(df_fact_tag, df_expected_score_or_pro, i)])
    return df1

Something Wrong In the woebin.py.

sc.woebin_ply for a Single Test Record?

Is it possible to apply the sc.woebin_ply function to a SINGLE test record using the train bins? It works in R, where the record is converted based on the training bins, but in Python the data frame becomes empty with a message about unique values.

train_bins = sc.woebin(training_data, y="Target", breaks_list=breaks_adj)
test_data = sc.woebin_ply(test_data, train_bins)

Thanks

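When a feature has more than 50 distinct values, the feature is removed. I have a question about this.

Mr. Xie, sorry to bother you!

In your source code, in the function rmcol_datetime_unique1(dat, check_char_num = False), if a variable is non-numeric and has more than 50 distinct values, then choosing not to bin it (option 2) terminates the program.

A few weeks ago, when I used pd.read_sql(), all the numbers were read in as object dtype.
I did not notice this at first. When the code reached this point, I had to choose 1. I wanted to choose 2, thinking that it would be fine to leave these numeric columns unbinned, but because the dtype was wrong, choosing 2 simply stopped the run.

Would it be better to add another check? If the column cannot be converted to a numeric dtype, terminate the program. If it can be converted, tell the user that the column has been converted to numeric and ask whether to continue binning.

That is just my idea. What do you think, Mr. Xie?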

Decimal point problem

When a variable takes values between 0 and 1, very long decimals such as 0.30000000000000004 appear after binning, as shown below:
variable | bin | points
credit_rate | [-inf,0.1) | -7.0
credit_rate | [0.30000000000000004,0.8) | 9.0
credit_rate | [0.30000000000000004,0.8) | 92.0
credit_rate | [0.8,inf) | -40.0

Thank you.
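
The long decimals come from binary floating-point arithmetic rather than from the binning itself: values such as 0.1 and 0.2 have no exact binary representation, so their sum is not exactly 0.3. A minimal illustration follows, with rounding as one possible way to tidy break points before they are turned into bin labels (the number of digits is an arbitrary choice):

```python
raw_break = 0.1 + 0.2             # not exactly 0.3 in binary floating point
label_raw = str(raw_break)        # '0.30000000000000004'

clean_break = round(raw_break, 8)
label_clean = str(clean_break)    # '0.3'
```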

Error when using multiprocessing

multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/miniconda3/envs/mlflow-323f7dfca0660eb56720b79b6e1754559df603e4/lib/python3.6/multiprocessing/pool.py", line 119, in worker
result = (True, func(*args, **kwds))
File "/miniconda3/envs/mlflow-323f7dfca0660eb56720b79b6e1754559df603e4/lib/python3.6/multiprocessing/pool.py", line 47, in starmapstar
return list(itertools.starmap(args[0], args[1]))
File "/miniconda3/envs/mlflow-323f7dfca0660eb56720b79b6e1754559df603e4/lib/python3.6/site-packages/scorecardpy-0.1.7-py3.6.egg/scorecardpy/woebin.py", line 702, in woebin2
stop_limit=stop_limit, max_num_bin=max_num_bin, breaks=breaks, spl_val=spl_val)
File "/miniconda3/envs/mlflow-323f7dfca0660eb56720b79b6e1754559df603e4/lib/python3.6/site-packages/scorecardpy-0.1.7-py3.6.egg/scorecardpy/woebin.py", line 495, in woebin2_tree
if binning_tree is None: binning_tree = initial_binning
UnboundLocalError: local variable 'binning_tree' referenced before assignment
"""

special_values bug in sc.woebin

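Hello! I ran into a problem while learning the sc package.
In the sc.woebin example, missing values are added to the dataset so that the special_values below can be binned.
special_values = {
'credit.amount': [2600, 9960, "6850%missing"]
}
However, when the 'credit.amount' column contains no missing values, trying to obtain the "6850%missing" bin raises an error:
ValueError: cannot convert float NaN to integer
(I added the special_values above to the sc example and hit the problem, so I looked up the woebin function. The woebin example itself works, so I compared the two.)
Such a binning would rarely occur in practice, but it is still a bug.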

sc.iv gives incorrect results

Why is the result of sc.iv inconsistent with the total_iv from binning? This looks like a bug.
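
A possible explanation, assuming sc.iv treats every distinct value of a variable as its own group while woebin merges values into a few bins, is that the two numbers are computed on different groupings. Merging groups cannot increase IV, so the unbinned figure is typically the larger one. The sketch below uses made-up counts and a hypothetical helper to show the effect.

```python
import math


def information_value(goods, bads):
    """IV for groups given per-group good and bad counts (no zero counts)."""
    total_good, total_bad = sum(goods), sum(bads)
    iv = 0.0
    for g, b in zip(goods, bads):
        dist_good, dist_bad = g / total_good, b / total_bad
        iv += (dist_bad - dist_good) * math.log(dist_bad / dist_good)
    return iv


# Hypothetical counts for four distinct values of one variable
goods = [300, 100, 200, 100]
bads = [20, 30, 20, 30]
iv_per_value = information_value(goods, bads)

# Same data after merging values 1+2 and 3+4 into two bins
iv_binned = information_value([400, 300], [50, 50])
```

Comparing the total_iv column of the woebin output with an IV computed on the same bins would confirm whether the grouping is the only source of the difference.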

special values in woebin

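Hello author, I am a Python beginner. I have recently been trying out this package. My train data has no missing values; they have already been replaced by the special values -9999 and -8888. When special_values contains only one of these values, everything runs fine, but as soon as I list two, the error below is raised.

-9999 is a special value that every variable has, while -8888 appears only in some variables. I have not been able to find a solution myself; could you please take a look?

Binning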

bins=sc.woebin(train,'od',
init_count_distr=0.02,
count_distr_limit=0.05,
stop_limit=0.1,bin_num_limit=5,
special_values=[-9999,-8888])

AttributeError Traceback (most recent call last)
in
4 count_distr_limit=0.05,
5 stop_limit=0.1,bin_num_limit=5,
----> 6 special_values=[-9999,-8888])

D:\anaconda\lib\site-packages\scorecardpy\woebin.py in woebin(dt, y, x, var_skip, breaks_list, special_values, stop_limit, count_distr_limit, bin_num_limit, positive, no_cores, print_step, method, ignore_const_cols, ignore_datetime_cols, check_cate_num, replace_blank, save_breaks_list, **kwargs)
964 stop_limit=stop_limit,
965 bin_num_limit=bin_num_limit,
--> 966 method=method
967 )
968 # try catch:

D:\anaconda\lib\site-packages\scorecardpy\woebin.py in woebin2(dtm, breaks, spl_val, init_count_distr, count_distr_limit, stop_limit, bin_num_limit, method)
722 bin_list = woebin2_tree(
723 dtm, init_count_distr=init_count_distr, count_distr_limit=count_distr_limit,
--> 724 stop_limit=stop_limit, bin_num_limit=bin_num_limit, breaks=breaks, spl_val=spl_val)
725 elif method == "chimerge":
726 # 2.chimerge optimal binning

D:\anaconda\lib\site-packages\scorecardpy\woebin.py in woebin2_tree(dtm, init_count_distr, count_distr_limit, stop_limit, bin_num_limit, breaks, spl_val)
485 initial_binning = bin_list['initial_binning']
486 binning_sv = bin_list['binning_sv']
--> 487 if len(initial_binning.index)==1:
488 return {'binning_sv':binning_sv, 'binning':initial_binning}
489 # initialize parameters

AttributeError: 'NoneType' object has no attribute 'index'

var_filter.py

The variable "len_diff" on line 87 of this file is undefined, which raises an error in certain cases.
warnings.warn("Incorrect inputs; there are {} var_kp variables are not exist in input data, which are removed from var_kp. \n {}".format(len_diff, list(set(var_kp)-set(var_kp2))) )

woebin_ply: AttributeError: 'DataFrame' object has no attribute 'explode'

Hi,
I'm trying to build a scorecard and while transforming the input data to WOE values, I came across the above mentioned error. When I checked the source code, the input for woebin_ply is a dataframe. Could you please help me out on this?

TypeError: unhashable type: 'list'

[woebin.py]
line 1020: dtx_suffix = pd.merge(dtx, binx, how='left', on=x_i).sort_values('rowid')\
.set_index(dtx.index)[['_'.join([x_i,woe_points])]]

raises an error on column "other.installment.plans":
TypeError: unhashable type: 'list'
++++++++++++++++++++++++++++++++
other.installment.plans
0 [bank, stores]
1 [none]

selecting the best 20 variables

Is it possible to use only the top 20 variables in the scorecard?

What is the most statistically sound way to do it?

AUC calculation

Hello
Why do you treat scores like this (in the perf_eva() function)?

if np.mean(pred) < 0 or np.mean(pred) > 1:
        warnings.warn('Since the average of pred is not in [0,1], it is treated as predicted score but not probability.')
        pred = -pred

I mean specifically the line pred = -pred.

For positive scores, the computation gives incorrect results.
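
Negating the prediction reverses the ranking, which turns an AUC of x into 1 - x. That is appropriate when the input is a credit score where a higher value means lower risk, but it would give the wrong result for an input where a higher value means more risk. The sketch below, with a hypothetical rank-based auc helper and toy data, shows the effect of the sign flip.

```python
def auc(labels, preds):
    """Probability that a random positive is ranked above a random negative."""
    pos = [p for y, p in zip(labels, preds) if y == 1]
    neg = [p for y, p in zip(labels, preds) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(pos) * len(neg))


y = [1, 1, 0, 1, 0, 0]                     # 1 = bad
score = [520, 560, 580, 600, 640, 700]     # higher score = lower risk

auc_raw = auc(y, score)                    # scores used as-is
auc_negated = auc(y, [-s for s in score])  # after pred = -pred
```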

information value calculation in case of zero number of y class

Hi, the iv_xy function replaces 0 with 0.9 in order to calculate the information value. Would it be more appropriate to replace it with a value closer to 0, say 0.01?
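
How much the replacement value matters can be checked directly. In the sketch below, a bin with zero bads has its count replaced by a small constant before WOE and IV are computed. The counts are made up and the helper illustrates the idea only; it is not the iv_xy source.

```python
import math


def iv_with_zero_replacement(goods, bads, replacement):
    """IV where zero counts are replaced by `replacement` before taking logs."""
    goods = [g if g > 0 else replacement for g in goods]
    bads = [b if b > 0 else replacement for b in bads]
    total_good, total_bad = sum(goods), sum(bads)
    iv = 0.0
    for g, b in zip(goods, bads):
        dist_good, dist_bad = g / total_good, b / total_bad
        iv += (dist_bad - dist_good) * math.log(dist_bad / dist_good)
    return iv


goods = [500, 300, 200]  # hypothetical counts; the last bin has no bads
bads = [60, 40, 0]

iv_09 = iv_with_zero_replacement(goods, bads, 0.9)
iv_001 = iv_with_zero_replacement(goods, bads, 0.01)
```

A smaller replacement gives the empty bin a more extreme WOE and therefore a larger IV, so the reported IV depends noticeably on this constant whenever a bin has a zero count.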

Wildcard categorical bin?

We have a dataset of about 3mm records. We're building a model using a 700k training sample and a 300k test sample.

We're building the WoE bins based on the 700k training set, and it turns out that for a few of the categorical variables (e.g., 3-digit zipcode), there are values in the test set that aren't in the training set.

Two thoughts/questions:

Is there a way to define "wildcard" bins to catch these values?
Is there a logical way to infer what the WoE value should be? The count of bad's in the training sample is necessarily zero... so I'm kind of inclined to include it in the "lowest" WoE bin
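
One way to get catch-all behaviour outside the package is to look up WOE values through a mapping with an explicit default for unseen categories. The sketch below is plain Python with hypothetical names and values. Choosing the default (0.0 for neutral evidence, or the WOE of an existing bin) is a modelling decision, not something the package prescribes.

```python
# WOE per category learned on the training set (hypothetical values)
train_woe = {"100": -0.42, "112": 0.15, "606": 0.73}


def woe_lookup(value, woe_map, default=0.0):
    """Return the trained WOE, or `default` for categories never seen in training."""
    return woe_map.get(value, default)


test_values = ["112", "999", "606"]  # "999" does not occur in training
neutral = [woe_lookup(v, train_woe) for v in test_values]

# Conservative alternative: treat unseen values like the riskiest known bin,
# assuming WOE = ln(bad/good) so that the largest WOE is the riskiest
conservative = [woe_lookup(v, train_woe, default=max(train_woe.values()))
                for v in test_values]
```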

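sc.woebin() output bins include an extra binning for the index

scorecardpy version: v0.1.9.1
python: 3.7
Problem description:
When binning with woebin(), if the index of the input DataFrame is not sequential (reset_index() was not called),
the result also contains a binning of the index.
Code: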
dat = sc.germancredit()
dt_s = sc.var_filter(dat, y="creditability")
dt_s['index_x'] = [1*i for i in range(len(dt_s)-100)] + [2*i for i in range(100)]
dt_s = dt_s.set_index('index_x')
bins = sc.woebin(dt_s[['creditability','housing']], y="creditability")