04pallav / secom_class_imbalance

This project forked from meena-mani/secom_class_imbalance

Approaches for the class imbalance problem (in semiconductor manufacturing process line data)

License: MIT License

SECOM_class_imbalance

Description

The SECOM dataset in the UCI Machine Learning Repository contains semiconductor manufacturing data: 1567 records, 590 anonymized features, and 104 fails. The process yield is a simple pass/fail response (encoded -1/1).

The dataset has the following characteristics:

  1. two-class problem
  2. an imbalance with a roughly 14:1 skew of passes to fails
  3. large number of features -- 590
  4. missing data
  5. features/columns which do not have sufficient information
  6. 4% of the columns/features have more than 50% of their records missing
  7. some columns have constant values
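
The characteristics above can be checked with a few lines of pandas. The following is a minimal sketch on a small synthetic frame rather than the real SECOM files; the helper name `summarize_dataset`, the column names, and the `missing_threshold` parameter are illustrative assumptions, not code from the notebooks.

```python
import numpy as np
import pandas as pd


def summarize_dataset(X, y, missing_threshold=0.5):
    """Report class skew, mostly-missing columns and constant columns.

    X: DataFrame of features; y: Series of labels (-1 = pass, 1 = fail).
    """
    n_pass = int((y == -1).sum())
    n_fail = int((y == 1).sum())
    # Fraction of missing records in each column.
    missing_frac = X.isnull().mean()
    sparse_cols = missing_frac[missing_frac > missing_threshold].index.tolist()
    # A column is constant if it has at most one distinct non-missing value.
    constant_cols = [c for c in X.columns if X[c].nunique(dropna=True) <= 1]
    return {
        "pass_to_fail": n_pass / float(n_fail) if n_fail else float("inf"),
        "sparse_columns": sparse_cols,
        "constant_columns": constant_cols,
    }


# Synthetic stand-in for the SECOM data (hypothetical values).
X = pd.DataFrame({
    "f0": [1.0, 2.0, 3.0, 4.0, 5.0],
    "f1": [np.nan, np.nan, np.nan, 0.5, 0.7],  # 60% missing
    "f2": [7.0, 7.0, 7.0, 7.0, 7.0],           # constant
})
y = pd.Series([-1, -1, -1, -1, 1])

summary = summarize_dataset(X, y)
print(summary)
```

On the real data, columns flagged as constant or mostly missing would be natural candidates to drop before any modelling.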

Objective

The SECOM dataset presents two problems: (i) working with skewed data and (ii) feature selection. The main focus of this analysis is the class imbalance issue and the ability to predict fails successfully. Strategies used in fraud detection, anomaly detection, and rare disease diagnosis will be useful here. A secondary objective is feature reduction. (In some of the literature on the SECOM dataset, this was the primary goal [1].) A streamlined feature set can not only improve prediction accuracy and understanding of the data but also save manufacturing resources.
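
To see why raw accuracy is a poor yardstick for this problem, consider a trivial classifier that labels every record a pass. The sketch below works through the arithmetic using the counts given in the description (1567 records, 104 fails); the variable names are illustrative.

```python
n_records = 1567
n_fails = 104
n_passes = n_records - n_fails

# A classifier that always predicts "pass" gets every pass right
# and every fail wrong.
accuracy = n_passes / float(n_records)
fail_recall = 0 / float(n_fails)

print("pass:fail ratio = %.1f:1" % (n_passes / float(n_fails)))
print("accuracy = %.3f" % accuracy)
print("fail recall = %.3f" % fail_recall)
```

This baseline reaches about 93% accuracy while catching none of the fails, which is why metrics focused on the minority class (such as recall on fails) are more informative here.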

Software

  • Python 2.7
  • scikit-learn packages for algorithms
  • pandas for data wrangling
  • Matplotlib and Seaborn for plotting and visualization

Methods

We will look at some of the approaches that deal with class imbalance. These are either cost-sensitive learning approaches or sampling-based approaches. We will also work with feature selection methods. We will begin with the following:

  1. Random Forest variable importance (feature selection)
  2. One-class SVM
  3. SVM with SMOTE (oversampling minority class/undersampling majority class)
  4. SVM, Undersampling and Data Cleaning for Imbalanced Data
  5. Random Forest (weighting the classes)
  6. Boosting methods
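
As a rough illustration of the two families named above, the sketch below trains a class-weighted Random Forest (cost-sensitive) and a Random Forest on a randomly undersampled training set (sampling-based), then ranks features by variable importance. It runs on synthetic data from scikit-learn's `make_classification`, not on SECOM, with labels 0/1 rather than -1/1. Plain random undersampling stands in for the SMOTE and data-cleaning variants listed, and all parameter values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: roughly 93% majority (0), 7% minority (1).
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           weights=[0.93, 0.07], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Cost-sensitive: weight classes inversely to their frequency.
weighted_rf = RandomForestClassifier(
    n_estimators=200, class_weight="balanced", random_state=0)
weighted_rf.fit(X_train, y_train)

# Sampling-based: keep every minority record and an equal-sized
# random subset of the majority class.
rng = np.random.RandomState(0)
minority_idx = np.where(y_train == 1)[0]
majority_idx = np.where(y_train == 0)[0]
kept_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)
balanced_idx = np.concatenate([minority_idx, kept_majority])
undersampled_rf = RandomForestClassifier(n_estimators=200, random_state=0)
undersampled_rf.fit(X_train[balanced_idx], y_train[balanced_idx])

for name, model in [("class-weighted", weighted_rf),
                    ("undersampled", undersampled_rf)]:
    recall = recall_score(y_test, model.predict(X_test))
    print("%s minority recall: %.2f" % (name, recall))

# Variable importance: indices of the five highest-ranked features.
top_features = np.argsort(weighted_rf.feature_importances_)[::-1][:5]
print("top features: %s" % top_features)
```

Recall on the minority class is reported instead of accuracy because the goal stated in the objective is to predict fails; the numbers from this synthetic run say nothing about how the methods perform on SECOM itself.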

This is a work in progress, so new exercises will be added and old ones refined on an ongoing basis.

Further Reading

[1] M. McCann et al., "Causality Challenge: Benchmarking relevant signal components for effective monitoring and process control," NIPS Causality: Objectives and Assessment, 2010.
[2] H. He and E. A. Garcia, "Learning from Imbalanced Data," IEEE Trans. Knowledge and Data Engineering, vol. 21, issue 9, pp. 1263-1284, 2009.
