Giter Site home page Giter Site logo

lizhen0909 / spark-infotheoretic-feature-selection Goto Github PK

View Code? Open in Web Editor NEW

This project forked from sramirez/spark-infotheoretic-feature-selection

0.0 2.0 0.0 926 KB

This package contains a generic implementation of greedy Information Theoretic Feature Selection (FS) methods. The implementation is based on the common theoretic framework presented by Gavin Brown. Implementations of mRMR, InfoGain, JMI and other commonly used FS filters are provided.

Home Page: http://sci2s.ugr.es/BigData

License: Apache License 2.0

Scala 100.00%

spark-infotheoretic-feature-selection's Introduction

An Information Theoretic Feature Selection Framework

The present framework implements Feature Selection (FS) on Spark for its application on Big Data problems. This package contains a generic implementation of greedy Information Theoretic Feature Selection methods. The implementation is based on the common theoretic framework presented in [1]. Implementations of mRMR, InfoGain, JMI and other commonly used FS filters are provided. In addition, the framework can be extended with other criteria provided by the user as long as the process complies with the framework proposed in [1].

Spark package: http://spark-packages.org/package/sramirez/spark-infotheoretic-feature-selection

Please cite as: S. Ramírez-Gallego; H. Mouriño-Talín; D. Martínez-Rego; V. Bolón-Canedo; J. M. Benítez; A. Alonso-Betanzos; F. Herrera, "An Information Theory-Based Feature Selection Framework for Big Data Under Apache Spark," in IEEE Transactions on Systems, Man, and Cybernetics: Systems, in press, pp.1-13, doi: 10.1109/TSMC.2017.2670926 URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7970198&isnumber=6376248

Main features:

  • Version for new ml library.
  • Support for sparse data and high-dimensional datasets (millions of features).
  • Improved performance (less than 1 minute per iteration for datasets like ECBDL14 and kddb with 400 cores).

This work has associated two submitted contributions to international journals which will be attached to this request as soon as they are accepted. This software has been proved with two large real-world datasets such as:

Example (ml):

import org.apache.spark.ml.feature._
val selector = new InfoThSelector()
	.setSelectCriterion("mrmr")
      	.setNPartitions(100)
      	.setNumTopFeatures(10)
      	.setFeaturesCol("features")
      	.setLabelCol("class")
      	.setOutputCol("selectedFeatures")

val result = selector.fit(df).transform(df)

Example (MLLIB):

import org.apache.spark.mllib.feature._
val criterion = new InfoThCriterionFactory("mrmr")
val nToSelect = 100
val nPartitions = 100

println("*** FS criterion: " + criterion.getCriterion.toString)
println("*** Number of features to select: " + nToSelect)
println("*** Number of partitions: " + nPartitions)

val featureSelector = new InfoThSelector(criterion, nToSelect, nPartitions).fit(data)

val reduced = data.map(i => LabeledPoint(i.label, featureSelector.transform(i.features)))
reduced.first()

Design doc: https://docs.google.com/document/d/1HOaPL_HJzTbL2tVdzbTjhr5wxVvPe9e-23S7rc2VcsY/edit?usp=sharing

Prerequisites:

LabeledPoint data must be discretized as integer values in double representation with a maximum of 256 distinct values. By doing so, data can be transformed to byte type directly, making the selection process much more efficient.

Contributors

References

[1] Brown, G., Pocock, A., Zhao, M. J., & Luján, M. (2012). "Conditional likelihood maximisation: a unifying framework for information theoretic feature selection." The Journal of Machine Learning Research, 13(1), 27-66.

spark-infotheoretic-feature-selection's People

Contributors

hbghhy avatar jabenitez avatar lizhen0909 avatar sramirez avatar

Watchers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.