
ppdai_risk_evaluation's Introduction

PPDai's "Magic Mirror" risk control system evaluates a user's current credit status from an average of 400 data dimensions and assigns each borrower a credit score. On top of that, combined with information about newly issued loan listings, it predicts each listing's overdue rate over the next 6 months, giving investors a key basis for decisions and promoting healthy, efficient internet finance. For the first time, PPDai has opened up rich, real historical data and invites you to compete against the "Magic Mirror" system: using machine learning, can you design a default-prediction algorithm with better predictive accuracy and computational performance?

My result: on the stage-1 dataset (without using the stage-2 dataset) I reached an AUC (the official evaluation metric) of 0.794587, close to the competition winner's score. Since the competition has ended and submissions are no longer possible, this result is not strictly comparable, but it should be in roughly the same range.

1. Approach

1.1 Data Cleaning

  • Drop columns with a large proportion of missing data, e.g. more than 20% NaN.
  • Drop rows with a large proportion of missing data, keeping the number of dropped rows under 1% of the total.
  • Fill the remaining missing values: inspect value_counts to decide whether a variable is continuous or discrete, then fill NaN with the mean or the most frequent value respectively. Judging by inspection, rather than by whether the dtype is object, matches the data better. A sketch of these steps follows the list.
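A minimal sketch of these cleaning steps, assuming train_master is the main training DataFrame; the 20%/1% thresholds follow the text, and the nunique-based discrete/continuous split stands in for the manual value_counts inspection:

import pandas as pd

# Drop columns with more than 20% NaN.
nan_ratio = train_master.isnull().mean()
train_master = train_master.drop(columns=nan_ratio[nan_ratio > 0.2].index)

# Drop the most incomplete rows, capped at 1% of all rows.
row_nan = train_master.isnull().sum(axis=1)
worst = row_nan.sort_values(ascending=False).head(int(len(train_master) * 0.01))
train_master = train_master.drop(index=worst[worst > 0].index)

# Fill the rest: mode for discrete-looking columns, mean for continuous ones.
for col in train_master.columns[train_master.isnull().any()]:
    if train_master[col].dtype == object or train_master[col].nunique() <= 10:
        train_master[col] = train_master[col].fillna(train_master[col].mode()[0])
    else:
        train_master[col] = train_master[col].fillna(train_master[col].mean())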

1.2 Feature Classification

  • For every feature whose most frequent value exceeds a threshold (50%), convert the column to binary. For example, [0,1,2,0,0,0,4,0,3] becomes [0,1,1,0,0,0,1,0,1].
  • Split the remaining features into numerical and categorical according to dtype.
  • Numerical features with no more than 10 unique values are also treated as categorical. A sketch follows this list.
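A sketch of the classification, continuing from the cleaned train_master above (the list names are illustrative):

binary_features, numerical_features, categorical_features = [], [], []

for col in train_master.columns.drop('target'):
    freq = train_master[col].value_counts(normalize=True)
    if freq.iloc[0] > 0.5:
        # Binarize: the dominant value maps to 0, everything else to 1.
        train_master[col] = (train_master[col] != freq.index[0]).astype(int)
        binary_features.append(col)
    elif train_master[col].dtype == object or train_master[col].nunique() <= 10:
        categorical_features.append(col)
    else:
        numerical_features.append(col)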

1.3 Outlier Removal

  • For every numerical feature, plot its distribution under each target value with a stripplot (with jitter); it is similar to a boxplot, but makes large-value outliers easier to spot.
import pandas as pd
import seaborn as sns

# Long format: one row per (target, feature, value); one stripplot facet per feature.
melt = pd.melt(train_master, id_vars=['target'], value_vars=numerical_features)
g = sns.FacetGrid(data=melt, col="variable", col_wrap=4, sharex=False, sharey=False)
g.map(sns.stripplot, 'target', 'value', jitter=True, palette="muted")
  • Plot the density of every numerical feature; the plots show that taking the logarithm brings them all closer to a normal distribution.
import numpy as np

# log1p = log(1 + x): safe at zero, compresses the long right tails.
for f in numerical_features_log:
    train_master[f + '_log'] = np.log1p(train_master[f])
  • After the log transform, a few extremely small outliers can be removed as well, for example:
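A hypothetical cutoff; in practice the threshold would be picked by eye from each density plot:

# Drop rows sitting far below the bulk of each log-transformed feature.
for f in numerical_features_log:
    cutoff = train_master[f + '_log'].mean() - 4 * train_master[f + '_log'].std()
    train_master = train_master[train_master[f + '_log'] > cutoff]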

1.4 Feature Engineering

The other two datasets

train_loginfo: group by Idx and extract the record count, the number of distinct LogInfo1 values, the number of active dates, and the date span.

train_userinfo: group by Idx and extract the record count, the number of distinct UserupdateInfo1 values, the number of distinct UserupdateInfo1/UserupdateInfo2 pairs, and the date span, plus the count of each UserupdateInfo1/UserupdateInfo2 category. A sketch of the loginfo part follows.
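A hedged sketch of the train_loginfo aggregation, assuming LogInfo3 is the log-date column; the train_userinfo features follow the same pattern, plus per-category counts (e.g. via pd.crosstab):

import pandas as pd

train_loginfo['LogInfo3'] = pd.to_datetime(train_loginfo['LogInfo3'])
log_feats = train_loginfo.groupby('Idx').agg(
    log_cnt=('LogInfo1', 'size'),              # number of log records
    loginfo1_nunique=('LogInfo1', 'nunique'),  # distinct LogInfo1 values
    active_days=('LogInfo3', 'nunique'),       # distinct active dates
    date_span=('LogInfo3', lambda s: (s.max() - s.min()).days),
)
train_master = train_master.merge(log_feats, left_on='Idx', right_index=True, how='left')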

Parsing dates

Using the arrow library, parse each date into year, month, day, week, day of week, and early/middle/late month, then one-hot encode these before feeding them into the model. A sketch follows.
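A sketch of the expansion; 'ListingInfo' as the raw date column and the thirds-of-month split are assumptions, and arrow.get parses ISO-like strings (pass a format string otherwise):

import arrow
import pandas as pd

def expand_date(s):
    d = arrow.get(s)
    return pd.Series({
        'year': d.year, 'month': d.month, 'day': d.day,
        'week': d.isocalendar()[1], 'weekday': d.weekday(),
        'month_part': min((d.day - 1) // 10, 2),  # 0 early, 1 mid, 2 late month
    })

date_feats = train_master['ListingInfo'].apply(expand_date)
train_master = train_master.join(date_feats)  # one-hot encoded later via get_dummies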

New features

  • at_home: guessing that UserInfo_2 and UserInfo_8 may represent the user's current residence and registered residence (hukou), comparing them indicates whether the user still lives in their hometown. A one-line sketch follows.
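A one-line sketch of this guess (real data may need whitespace or alias cleanup before comparing):

train_master['at_home'] = (train_master['UserInfo_2'] == train_master['UserInfo_8']).astype(int)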

1.5 Preparation Before Training

Specify the one-hot encoding features

Don't let get_dummies infer the columns to encode automatically: pandas would pick only object-dtype columns, but some non-object features are categorical in meaning and need one-hot encoding too.

# finally_dummy_columns: the hand-picked list of categorical columns.
train_master_ = pd.get_dummies(train_master_, columns=finally_dummy_columns)

Normalization

from sklearn.preprocessing import StandardScaler

X_train = StandardScaler().fit_transform(X_train)

1.6 Training and Evaluation

Cross Validation

Use StratifiedKFold so that each fold preserves the target distribution, with shuffle=True for randomness.

from sklearn.model_selection import StratifiedKFold

cv = StratifiedKFold(n_splits=3, shuffle=True)

AUC evaluation

from sklearn.model_selection import cross_val_score

auc = cross_val_score(estimator, X_train, y_train, scoring='roc_auc', cv=cv).mean()

Models

  • XGBClassifier
  • RidgeClassifier
  • LogisticRegression
  • AdaBoostClassifier
  • VotingClassifier combining the four above into an ensemble, sketched below.
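A sketch of the ensemble wiring; the hyperparameters are illustrative, not the author's tuned values. Note that RidgeClassifier has no predict_proba, so it cannot join a soft-voting ensemble without a wrapper, which is why it is omitted here:

from sklearn.ensemble import AdaBoostClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

estimators = [
    ('xgb', XGBClassifier()),
    ('lr', LogisticRegression(max_iter=1000)),
    ('ada', AdaBoostClassifier()),
]
voting = VotingClassifier(estimators=estimators, voting='soft')
auc = cross_val_score(voting, X_train, y_train, scoring='roc_auc', cv=cv).mean()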

ppdai_risk_evaluation's People

Contributors

wikke

ppdai_risk_evaluation's Issues

A bug in the code makes the cross-validated AUC look abnormally good

Following the author's code all the way to the end, the AUC printed at the CV step is indeed high, reaching 0.79. However, when I split off my own validation set and trained a model with the same parameters, the results were disappointing.

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, roc_auc_score

# estimators: the four base models listed in the README above.
voting = VotingClassifier(estimators=estimators, voting='soft')
X_train_new, X_val, y_train_new, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=0)
voting.fit(X_train_new, y_train_new)
y_train_predict = voting.predict(X_train_new)
y_val_predict = voting.predict(X_val)

print(classification_report(y_train_new, y_train_predict))
print(roc_auc_score(y_train_new, y_train_predict))
print(classification_report(y_val, y_val_predict))
print(roc_auc_score(y_val, y_val_predict))

The output:

              precision    recall  f1-score   support

           0       0.94      1.00      0.97     21212
           1       1.00      0.00      0.00      1247

   micro avg       0.94      0.94      0.94     22459
   macro avg       0.97      0.50      0.49     22459
weighted avg       0.95      0.94      0.92     22459

0.5008019246190858
              precision    recall  f1-score   support

           0       0.94      1.00      0.97      5306
           1       0.00      0.00      0.00       309

   micro avg       0.94      0.94      0.94      5615
   macro avg       0.47      0.50      0.49      5615
weighted avg       0.89      0.94      0.92      5615

0.49962306822465136

The model's AUC is only about 0.5, and recall is essentially 0. Because this is an imbalanced dataset with few defaulters, the model is likely predicting 0 for every sample: accuracy is high, but such a model is meaningless.

So where did things go wrong, and why was the cross-validated AUC so high? Looking through the code, I found a bug, here:

cv = StratifiedKFold(n_splits=3, shuffle=True)

def estimate(estimator, name='estimator'):
    auc = cross_val_score(estimator, X_train, y_train, scoring='roc_auc', cv=cv).mean()
    accuracy = cross_val_score(estimator, X_train, y_train, scoring='accuracy', cv=cv).mean()
    recall = cross_val_score(estimator, X_train, y_train, scoring='recall', cv=cv).mean()

    print("{}: auc:{:f}, recall:{:f}, accuracy:{:f}".format(name, auc, recall, accuracy))

The author passes a StratifiedKFold instance as the cv parameter of cross_val_score. Reading the source shows that when cv is an integer, cross_val_score also defaults to StratifiedKFold(cv) for splitting, just without shuffle=True. Besides, running three separate cross-validations to compute the three metrics is also unreasonable. So I tried rewriting the code as follows:

from sklearn.model_selection import cross_validate

def estimate(estimator, name='estimator'):
    scoring = {'roc_auc': 'roc_auc',
               'accuracy': 'accuracy',
               'recall': 'recall'}

    scoring_result_dict = cross_validate(estimator, X_train, y_train, scoring=scoring, cv=3, return_estimator=True)
    auc = scoring_result_dict['test_roc_auc'].mean()
    accuracy = scoring_result_dict['test_accuracy'].mean()
    recall = scoring_result_dict['test_recall'].mean()
    print(scoring_result_dict)
    print("{}: auc:{:f}, recall:{:f}, accuracy:{:f}".format(name, auc, recall, accuracy))

With this version the AUC comes out at only about 0.5, consistent with the result above. I also tried passing cv = StratifiedKFold(n_splits=3, shuffle=True), and the AUC was again only about 0.5. My guess is that shuffle=True is what inflated the original AUC, but I haven't found the exact cause yet.

The author put a lot of work into data cleaning and feature engineering, which is quite inspiring, but the final model-tuning part feels a bit rough.
