Giter Site home page Giter Site logo

sshimii / feature-selection Goto Github PK

View Code? Open in Web Editor NEW

This project forked from duxuhao/feature-selection

0.0 1.0 0.0 10.26 MB

Features selector based on the self selected-algorithm, loss function and validation method

Home Page: https://pypi.org/project/MLFeatureSelection/

License: MIT License

Python 69.42% Jupyter Notebook 30.58%

feature-selection's Introduction

MLFeatureSelection

License: MIT PyPI version

General features selection based on certain machine learning algorithm and evaluation methods

Divesity, Flexible and Easy to use

More features selection method will be included in the future!

Quick Installation

pip3 install MLFeatureSelection

Modulus in version 0.0.9.5.1

  • Modulus for selecting features based on greedy algorithm (from MLFeatureSelection import sequence_selection)

  • Modulus for removing features based on features importance (from MLFeatureSelection import importance_selection)

  • Modulus for removing features based on correlation coefficient (from MLFeatureSelection import coherence_selection)

  • Modulus for reading the features combination from log file (from MLFeatureSelection.tools import readlog)

This features selection method achieved

  • 1st in Rong360

-- https://github.com/duxuhao/rong360-season2

  • 6th in JData-2018

-- https://github.com/duxuhao/JData-2018

  • 12nd in IJCAI-2018 1st round

-- https://github.com/duxuhao/IJCAI-2018-2

Modulus Usage

Example

  • sequence_selection
from MLFeatureSelection import sequence_selection
from sklearn.linear_model import LogisticRegression

sf = sequence_selection.Select(Sequence = True, Random = True, Cross = False) 
sf.ImportDF(df,label = 'Label') #import dataframe and label
sf.ImportLossFunction(lossfunction, direction = 'ascend') #import loss function handle and optimize direction, 'ascend' for AUC, ACC, 'descend' for logloss etc.
sf.InitialNonTrainableFeatures(notusable) #those features that is not trainable in the dataframe, user_id, string, etc
sf.InitialFeatures(initialfeatures) #initial initialfeatures as list
sf.GenerateCol() #generate features for selection
sf.SetFeatureEachRound(50, False) # set number of feature each round, and set how the features are selected from all features (True: sample selection, False: select chunk by chunk)
sf.clf = LogisticRegression() #set the selected algorithm, can be any algorithm
sf.SetLogFile('record.log') #log file
sf.run(validate) #run with validation function, validate is the function handle of the validation function, return best features combination
  • importance_selection
from MLFeatureSelection import importance_selection
import xgboost as xgb

sf = importance_selection.Select() 
sf.ImportDF(df,label = 'Label') #import dataframe and label
sf.ImportLossFunction(lossfunction, direction = 'ascend') #import loss function and optimize direction
sf.InitialFeatures() #initial features, input
sf.SelectRemoveMode(batch = 2)
sf.clf = xgb.XGBClassifier() 
sf.SetLogFile('record.log') #log file
sf.run(validate) #run with validation function, return best features combination
  • coherence_selection
from MLFeatureSelection import coherence_selection
import xgboost as xgb

sf = coherence_selection.Select() 
sf.ImportDF(df,label = 'Label') #import dataframe and label
sf.ImportLossFunction(lossfunction, direction = 'ascend') #import loss function and optimize direction
sf.InitialFeatures() #initial features, input
sf.SelectRemoveMode(batch = 2)
sf.clf = xgb.XGBClassifier() 
sf.SetLogFile('record.log') #log file
sf.run(validate) #run with validation function, return best features combination
  • tools.readlog: read previous selected features from log
from MLFeatureSelection.tools import readlog

logfile = 'record.log'
logscore = 0.5 #any score in the logfile
features_combination = readlog(logfile, logscore)
  • tools.filldf: complete dataset when there is cross-term features
from MLFeatureSelection.tools import readlog, filldf

def add(x,y):
    return x + y

def substract(x,y):
    return x - y

def times(x,y):
    return x * y

def divide(x,y):
    return x/y

def sq(x,y):
    return x ** 2


CrossMethod = {'+':add,
               '-':substract,
               '*':times,
               '/':divide,
               } # set your own cross method

df = pd.read_csv('XXX')
logfile = 'record.log'
logscore = 0.5 #any score in the logfile
features_combination = readlog(logfile, logscore)
df = filldf(df, features_combination, CrossMethod)
  • format of validate and lossfunction

define your own:

validate: validation method in function , ie k-fold, last time section valdate, random sampling validation, etc

lossfunction: model performance evaluation method, ie logloss, auc, accuracy, etc

def validate(X, y, features, clf, lossfunction):
    """define your own validation function with 5 parameters
    input as X, y, features, clf, lossfunction
    clf is set by SetClassifier()
    lossfunction is import earlier
    features will be generate automatically
    function return score and trained classfier
    """
    clf.fit(X[features],y)
    y_pred = clf.predict(X[features])
    score = lossfuntion(y_pred,y)
    return score, clf
    
def lossfunction(y_pred, y_test):
    """define your own loss function with y_pred and y_test
    return score
    """
    return np.mean(y_pred == y_test)

multiple processing

Multiple processing can be set in validate function when you are doing N-fold.

DEMO

More examples are added in example folder include:

  • Demo contain all modulus can be found here (demo)

  • Simple Titanic with 5-fold validation and evaluated by accuracy (demo)

  • Demo for S1, S2 score improvement in JData 2018 predict purchase time competition (demo)

  • Demo for IJCAI 2018 CTR prediction (demo)

Function Parameters

Parameters

Algorithm details

Details

feature-selection's People

Contributors

duxuhao avatar ewen2015 avatar jepsonwong avatar jiaqiangbandongg avatar

Watchers

James Cloos avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.