naive_bayes_pca_lda's Introduction

Naive Bayes and PCA/LDA

Python implementation of a Naive Bayes classifier and two dimensionality reduction techniques, PCA and LDA, run on different datasets.

Naive Bayes Classifier

Run: naive_bayes.py

This automatically runs on the census dataset:

  • census-income.data
  • census-income.names
  • census-income.test

The full versions of these files can be downloaded from https://archive.ics.uci.edu/ml/datasets/Census-Income+%28KDD%29 and should be placed in this directory.

  • naive_bayes.py - This implements the classifier and runs it on the census-income dataset mentioned above.

  • Log probabilities were used to avoid numerical underflow when multiplying many very small numbers. A small constant (logconst) was added to each probability before taking the log, which also keeps the log defined when a probability is zero. These log probabilities were then summed.

  • During training, a sample with a missing entry can be handled in two ways: ignore the whole sample, or keep it and use only the features that are present. I chose to keep the sample and treat ? (the missing-value marker) as just another attribute value. During testing, only the features without missing values are used; features with a missing value are ignored.
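The choices described in the bullets above can be sketched roughly as follows. This is a minimal illustration for categorical features only, not the code in naive_bayes.py: the function names, the value of LOGCONST, and the toy rows and labels are all assumptions made for the example.

```python
import math
from collections import defaultdict

LOGCONST = 1e-9   # assumed value; the README only says "a small logconst"
MISSING = "?"     # missing-value marker

def train(rows, labels):
    """Count class frequencies and per-feature value frequencies.

    "?" is counted like any other value, so no training sample is dropped.
    """
    class_counts = defaultdict(int)
    value_counts = defaultdict(int)   # key: (label, feature_index, value)
    for row, label in zip(rows, labels):
        class_counts[label] += 1
        for j, value in enumerate(row):
            value_counts[(label, j, value)] += 1
    return class_counts, value_counts

def log_posteriors(row, class_counts, value_counts):
    """Return an unnormalised log posterior score for every class."""
    total = sum(class_counts.values())
    scores = {}
    for label, n_label in class_counts.items():
        score = math.log(n_label / total + LOGCONST)
        for j, value in enumerate(row):
            if value == MISSING:
                continue              # at test time, skip missing features
            p = value_counts.get((label, j, value), 0) / n_label
            score += math.log(p + LOGCONST)   # sum logs instead of multiplying
        scores[label] = score
    return scores

def predict(row, class_counts, value_counts):
    scores = log_posteriors(row, class_counts, value_counts)
    return max(scores, key=scores.get)

# Toy data (made up for illustration, not taken from the census files).
rows = [
    ["Private", "Bachelors"],
    ["Private", "?"],
    ["Self-emp", "HS-grad"],
    ["Self-emp", "HS-grad"],
]
labels = ["above", "above", "below", "below"]

class_counts, value_counts = train(rows, labels)
prediction = predict(["Private", "?"], class_counts, value_counts)
```

Because the scores are sums of logs, a feature value never seen with a class contributes a large negative term (log of LOGCONST) rather than forcing the whole product to zero.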

PCA & LDA

This work was done on the Dorothea dataset: https://archive.ics.uci.edu/ml/datasets/Dorothea. Training was done on the train file and testing on the valid file.

Run: pca_lda_NB.py

This implements the PCA technique.

The original dataset has a very large number of features (100000), so there are two ways to go about the task: use the so-called ‘kernel trick’ to calculate the eigenvectors of the A^T A matrix (where A is the samples-by-features data matrix) from those of the much smaller A A^T matrix, or select a random subsample of the 100000 features and run PCA on that feature set. I've gone with the former approach. More about it here:

  • http://stats.stackexchange.com/questions/7111/how-to-perform-pca-for-data-of-very-high-dimensionality
  • http://stats.stackexchange.com/questions/134282/relationship-between-svd-and-pca-how-to-use-svd-to-perform-pca

Here we have separate train and test data. For training, all the 600 samples were used. For testing, 200 of the 350 samples were used.
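The ‘kernel trick’ route can be sketched as below. This is an illustration under stated assumptions, not the code in pca_lda_NB.py: the function names are hypothetical, the data is random and dense, and the matrix sizes are toy values. The idea is that if u is an eigenvector of A A^T with eigenvalue λ > 0, then A^T u / sqrt(λ) is a unit eigenvector of A^T A with the same eigenvalue.

```python
import numpy as np

def fit_pca_gram(A_train, k):
    """Top-k principal directions from the small (n x n) matrix A A^T."""
    mean = A_train.mean(axis=0)
    A = A_train - mean                       # centre each feature
    gram = A @ A.T                           # n x n instead of d x d
    eigvals, eigvecs = np.linalg.eigh(gram)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]    # indices of the k largest
    top_vals, top_vecs = eigvals[order], eigvecs[:, order]
    if top_vals[-1] <= 1e-10 * top_vals[0]:
        raise ValueError("k exceeds the number of non-zero eigenvalues")
    # Map eigenvectors of A A^T back to unit eigenvectors of A^T A.
    components = (A.T @ top_vecs) / np.sqrt(top_vals)   # d x k
    return mean, components

def project(X, mean, components):
    """Project samples into the k-dimensional PCA space."""
    return (X - mean) @ components

# Toy sizes (assumed): 20 training samples, 500 features, k = 5.
rng = np.random.default_rng(0)
A_train = rng.random((20, 500))
A_test = rng.random((8, 500))

mean, components = fit_pca_gram(A_train, k=5)
Z_train = project(A_train, mean, components)
Z_test = project(A_test, mean, components)
```

The test data is centred with the training mean and projected with the training components, so the train and valid files end up in the same reduced space. Since A A^T is n x n, k can never exceed the number of training samples, and after centring at most n - 1 eigenvalues are non-zero.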

  • TOTAL_FEATURES: Total number of features in the dataset

  • KSPACE1, KSPACE2, and KSPACE3: These are the three values of K for the three K-dimensional PCA space reductions. As said above, I used the kernel trick because the number of features is too high. Since the A A^T matrix has at most 800 eigenvalues in this case, the highest K that could be used here (KSPACE3) is 800.

  • LDA_NUM_FEATURES_SAMPLED: Number of features sampled from the dataset for LDA reduction
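The README does not say how the features are sampled or which LDA variant is used. As a rough sketch only, the example below assumes uniform random sampling without replacement and two-class Fisher LDA with a small ridge term; the function names, the constant values, and the random data are all made up for illustration.

```python
import numpy as np

TOTAL_FEATURES = 1000            # toy value, not the real dataset size
LDA_NUM_FEATURES_SAMPLED = 50    # toy value

def sample_feature_indices(total, num_sampled, rng):
    """Pick a random subset of feature columns (assumed sampling scheme)."""
    return np.sort(rng.choice(total, size=num_sampled, replace=False))

def fisher_lda_direction(X, y, ridge=1e-6):
    """Two-class Fisher LDA: w is proportional to Sw^{-1} (m1 - m0)."""
    X0, X1 = X[y == 0], X[y == 1]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    # Within-class scatter, summed over both classes.
    scatter = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)
    scatter += ridge * np.eye(X.shape[1])   # keeps the solve well-conditioned
    w = np.linalg.solve(scatter, m1 - m0)
    return w / np.linalg.norm(w)

# Toy two-class data: class 1 is shifted by 0.5 in every feature.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 100)
X = rng.normal(size=(200, TOTAL_FEATURES)) + y[:, None] * 0.5

cols = sample_feature_indices(TOTAL_FEATURES, LDA_NUM_FEATURES_SAMPLED, rng)
w = fisher_lda_direction(X[:, cols], y)
z = X[:, cols] @ w               # one-dimensional LDA projection
```

Sampling features first keeps the within-class scatter matrix at LDA_NUM_FEATURES_SAMPLED x LDA_NUM_FEATURES_SAMPLED, which is what makes the linear solve tractable. With two classes, LDA reduces the data to a single dimension.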

naive_bayes_pca_lda's People

Contributors

susobhang70
