Statistical Learning For Data Mining

Naïve Bayes classifier for wine quality dataset

1) Built a naïve Bayes classifier to predict wine quality from 12 categorical variables, with a target variable ranging from 0 to 10.

2) Evaluated the model using the confusion matrix and the area under the ROC curve.

3) The model reached 55% accuracy on the test data, which suggests that naïve Bayes is not a good fit for this data.

Technology Stack - scikit-learn, matplotlib, numpy
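
As a rough illustration of the workflow above, here is a minimal sketch that fits a naïve Bayes model on categorical features and reports a confusion matrix, accuracy and one-vs-rest ROC AUC. It does not use the wine quality data: the synthetic data, the choice of `CategoricalNB`, and the sizes (`n_levels`, `n_classes`) are all assumptions made for the example, not details taken from the project.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import CategoricalNB

# Synthetic stand-in data (assumption): 12 categorical features, 3 classes.
rng = np.random.default_rng(0)
n_samples, n_features, n_levels, n_classes = 600, 12, 4, 3
y = rng.integers(0, n_classes, size=n_samples)

# Each feature copies the label about 30% of the time and is random otherwise,
# so the features carry some signal about the class.
noise = rng.integers(0, n_levels, size=(n_samples, n_features))
keep_label = rng.random((n_samples, n_features)) < 0.3
X = np.where(keep_label, y[:, None], noise)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# min_categories guards against a category level that is absent from the training split.
model = CategoricalNB(min_categories=n_levels).fit(X_train, y_train)
y_pred = model.predict(X_test)
proba = model.predict_proba(X_test)

cm = confusion_matrix(y_test, y_pred, labels=list(range(n_classes)))
acc = accuracy_score(y_test, y_pred)
auc = roc_auc_score(y_test, proba, multi_class="ovr")

print(cm)
print(f"accuracy={acc:.3f}  one-vs-rest AUC={auc:.3f}")
```

Rows of the confusion matrix are true classes and columns are predicted classes, so the diagonal holds the correctly classified samples.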

Classification of Diabetes dataset using Decision Tree, SVM and Regression Tree models

1) Built decision tree classifier, SVM and regression tree models, and used the GridSearch method to find optimal parameter values.

2) Selected a suitable model for classification considering the tradeoff between bias and variance.

3) Evaluated the models using the confusion matrix, precision, recall and area under the ROC curve.

4) Testing accuracy: Decision Tree classifier 71.8%, SVM 70.56%, Regression Tree model 35.5%.

Technology Stack - scikit-learn, numpy, pandas, matplotlib, C5.0 (CART)
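
The grid search step described above could look roughly like the sketch below, which tunes a decision tree and an SVM with scikit-learn's `GridSearchCV` and then reports test accuracy, precision and recall. The data comes from `make_classification`, and the parameter grids are illustrative assumptions; the grids actually used in the project are not stated.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (assumption, not the Diabetes dataset).
X, y = make_classification(n_samples=400, n_features=8, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Hypothetical grids. For the tree, max_depth trades bias (shallow) against variance (deep).
candidates = {
    "decision_tree": (
        DecisionTreeClassifier(random_state=0),
        {"max_depth": [2, 4, 8], "min_samples_leaf": [1, 5, 20]},
    ),
    "svm": (
        make_pipeline(StandardScaler(), SVC()),
        {"svc__C": [0.1, 1, 10], "svc__kernel": ["linear", "rbf"]},
    ),
}

results = {}
for name, (estimator, grid) in candidates.items():
    # 5-fold cross-validation on the training split only; the test split stays untouched.
    search = GridSearchCV(estimator, grid, cv=5, scoring="accuracy")
    search.fit(X_train, y_train)
    y_pred = search.predict(X_test)
    results[name] = {
        "best_params": search.best_params_,
        "cv_accuracy": search.best_score_,
        "test_accuracy": search.score(X_test, y_test),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
    }

for name, summary in results.items():
    print(name, summary)
```

A large gap between `cv_accuracy` and `test_accuracy` is one sign that the selected model sits too far on the variance side of the tradeoff.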

K-means and hierarchical clustering of Diabetes dataset

1) Ran K-means clustering with different numbers of clusters.

2) Used the elbow method to determine the optimal number of clusters, which came out to be 10.

3) Ran hierarchical clustering with single, complete, average and Ward linkage.

4) Used the cophenetic correlation coefficient to compare the linkages; average linkage was best, with a coefficient of 0.8653.
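
The two comparisons above can be sketched as follows: K-means inertia is recorded for a range of cluster counts (the quantity plotted in an elbow chart), and the cophenetic correlation coefficient is computed for each linkage method with SciPy. The blob data and the range of `k` are assumptions for illustration only, so the numbers it prints will not match the project's results.

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, linkage
from scipy.spatial.distance import pdist
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data (assumption): 300 points around 4 centres.
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Elbow method: within-cluster sum of squares (inertia) for each k.
inertias = {}
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_

# Cophenetic correlation: how faithfully each dendrogram preserves
# the original pairwise Euclidean distances (closer to 1 is better).
pairwise = pdist(X)
cophenetic = {}
for method in ["single", "complete", "average", "ward"]:
    Z = linkage(X, method=method)
    coefficient, _ = cophenet(Z, pairwise)
    cophenetic[method] = float(coefficient)

best_linkage = max(cophenetic, key=cophenetic.get)

for k, value in inertias.items():
    print(f"k={k}: inertia={value:.1f}")
print(cophenetic, "best:", best_linkage)
```

The elbow is read off by eye as the `k` after which inertia stops dropping sharply; plotting `inertias` with matplotlib makes that easier to see than the printed values.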

Final Project (Application in prediction of minority class in unbalanced data)

1) Performed descriptive and exploratory analysis on the given data.

2) Preprocessed the data using techniques such as one-hot encoding.

3) Applied an upsampling technique (SMOTE) since the data was unbalanced with a minority class.

4) Tested various algorithms such as logistic regression, XGBoost and SVM, using the GridSearch method to optimize their parameters.

5) Evaluated the models using the balanced error rate, which is the average of the error rates on each class.

6) The random forest model gave the best results, with an accuracy of 80%.

Technology Stack - SMOTE, pandas, scikit-learn, numpy, XGBoost, logistic regression, matplotlib
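
To make the balanced error rate concrete, here is a small sketch that computes it as the mean of the per-class error rates and compares it with plain accuracy on a toy unbalanced label vector. The function name `balanced_error_rate` and the toy labels are hypothetical, chosen only for this example.

```python
import numpy as np


def balanced_error_rate(y_true, y_pred):
    """Mean of the per-class error rates (hypothetical helper for illustration)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    per_class_error = [
        np.mean(y_pred[y_true == label] != label) for label in np.unique(y_true)
    ]
    return float(np.mean(per_class_error))


# Toy unbalanced labels (assumption): 8 majority samples, 2 minority samples.
y_true = np.array([0] * 8 + [1] * 2)

# A classifier that always predicts the majority class.
y_majority = np.zeros(10, dtype=int)

accuracy = float(np.mean(y_majority == y_true))
ber = balanced_error_rate(y_true, y_majority)

print(f"accuracy={accuracy:.2f}  balanced error rate={ber:.2f}")
# prints: accuracy=0.80  balanced error rate=0.50
```

The always-majority classifier gets every minority sample wrong (error rate 1.0 on class 1, 0.0 on class 0), so its balanced error rate is 0.5 even though its accuracy looks high. This is why a per-class metric is more informative than accuracy on unbalanced data.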

jaybhanushali3195 / statistical-learning-for-data-mining-projects