mustafahakkoz / classification_clustering_freq_pattern_mining

Three notebooks covering Classification, Clustering Analysis and Frequent Pattern Mining, prepared for the Data Mining lectures at Marmara University.

Jupyter Notebook 100.00%
k-means agnes dbscan apriori fp-growth eclat classification clustering-analysis association-rules frequent-pattern-mining

Classification_Clustering_Freq_Pattern_Mining

2019-2020 Fall CSE4063 - Data Mining

Three projects covering Classification, Clustering Analysis and Frequent Pattern Mining, prepared for the Data Mining lectures at Marmara University. The notebooks were written on the Kaggle platform, so the online versions are recommended for better visuals.


Online Notebooks:

  1. Phishing Websites | Kaggle

  2. Absenteeism at Work - Clustering | Kaggle

  3. Frequent Pattern Miningv2 | Kaggle


Repo Content and Implementation Steps:

1.phishing-websites.ipynb

  • Six classifiers (CART, C4.5, Naive Bayes, Support Vector Machine, a Neural Network with 1 hidden layer and a Neural Network with 2 hidden layers) are trained on the Phishing Websites dataset. Hyperparameters are tuned with 5-fold cross-validation after the necessary preprocessing steps.
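The tuning loop described above can be sketched in plain Python. This is only an illustration of the idea, not the notebook's code: it swaps in a tiny k-nearest-neighbours classifier (not one of the six models listed), uses synthetic two-class data instead of the Phishing Websites dataset, and every name (`knn_predict`, `cross_val_accuracy`, `grid_search`) is hypothetical.

```python
import random
from collections import Counter
from math import dist


def knn_predict(train_X, train_y, x, k):
    """Predict the majority label among the k nearest training points."""
    nearest = sorted(range(len(train_X)), key=lambda i: dist(train_X[i], x))[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]


def cross_val_accuracy(X, y, k, n_folds=5, seed=0):
    """Mean accuracy over n_folds folds (shuffled, not stratified)."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)  # fixed seed: same folds for every candidate
    folds = [indices[i::n_folds] for i in range(n_folds)]
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        train_X = [X[i] for i in train]
        train_y = [y[i] for i in train]
        hits = sum(knn_predict(train_X, train_y, X[i], k) == y[i] for i in held_out)
        scores.append(hits / len(held_out))
    return sum(scores) / n_folds


def grid_search(X, y, candidates, n_folds=5):
    """Score every candidate hyperparameter value and return the best one."""
    results = {k: cross_val_accuracy(X, y, k, n_folds) for k in candidates}
    best = max(results, key=results.get)
    return best, results


# Synthetic stand-in data (assumption): two Gaussian blobs, 40 points per class.
rng = random.Random(42)
X = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(40)]
X += [(rng.gauss(4, 1), rng.gauss(4, 1)) for _ in range(40)]
y = [0] * 40 + [1] * 40

best_k, scores = grid_search(X, y, candidates=[1, 3, 5, 7])
print("best k:", best_k)
print("mean CV accuracy per candidate:", scores)
```

Reusing the same folds for every candidate keeps the comparison fair: differences in the mean score then come from the hyperparameter, not from a different split.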

2.absenteeism-at-work-clustering.ipynb

  • Clustering analysis is carried out on the Absenteeism at Work dataset. First, EDA, outlier detection (IQR), normalization (Min-Max Scaler) and feature selection with Random Forest (and Permutation Importance) are completed.

  • K-means clusters are visualized with 3D t-SNE plots after searching for possible elbow points (based on the inertia attribute). After that, a PCA + K-means pipeline is tested. PCA discards most of the valuable information in the data, so the resulting graphs look incomplete. Running k-means on the original 7-dimensional data and then plotting with t-SNE gives better results.

  • The AgglomerativeClustering class has no inertia attribute (the sum of squared distances of samples to their closest cluster center), so the silhouette coefficient (best: 1, worst: -1) is used to select the number of clusters for AGNES. Again, 3D t-SNE clusters and a dendrogram are plotted.

  • A DBSCAN model is also implemented; the best values for the eps and min_samples parameters are found with a grid search scored by the silhouette coefficient. Again, the best model is visualized in 3D t-SNE plot.ly graphs.

  • Finally, in the evaluation step, the best model of each of the 3 algorithms is compared using 9 metrics:

    • Estimated number of clusters
    • Estimated number of noise points
    • Homogeneity
    • Completeness
    • V-measure
    • Adjusted Rand Index
    • Adjusted Mutual Information
    • Fowlkes-Mallows score
    • Silhouette Coefficient

    Explanations and comments on the results can be found in the notebooks.
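Two of the quantities used above for model selection, inertia and the silhouette coefficient, are easy to compute by hand. The sketch below is a plain-Python illustration on six made-up 2D points, not the notebook's code; the function names `inertia` and `silhouette` and both labelings are assumptions chosen for the example.

```python
from math import dist


def group_by_label(points, labels):
    """Collect points into a {label: [points]} mapping."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    return clusters


def inertia(points, labels):
    """Sum of squared distances from each point to its own cluster centroid."""
    total = 0.0
    for members in group_by_label(points, labels).values():
        centroid = [sum(coord) / len(members) for coord in zip(*members)]
        total += sum(dist(p, centroid) ** 2 for p in members)
    return total


def silhouette(points, labels):
    """Mean silhouette coefficient over all points (best: 1, worst: -1)."""
    clusters = group_by_label(points, labels)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(dist(point, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            sum(dist(point, q) for q in other) / len(other)
            for other_label, other in clusters.items()
            if other_label != label
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy data (assumption): two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good_labels = [0, 0, 0, 1, 1, 1]  # matches the visible groups
bad_labels = [0, 1, 0, 1, 0, 1]   # mixes the groups

for name, labels in [("good", good_labels), ("bad", bad_labels)]:
    print(name, "inertia:", inertia(points, labels),
          "silhouette:", silhouette(points, labels))
```

Inertia needs cluster centroids, which is why it suits k-means. The silhouette coefficient only needs pairwise distances and labels, so it can score any labeling, including those produced by AGNES or DBSCAN.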

3.frequent-pattern-miningv2.ipynb

  • Association rules for a given dataset are extracted using the Apriori, FP-Growth and ECLAT algorithms of the mlxtend library after preprocessing with TransactionEncoder. The models are compared by memory usage and runtime.
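To make the frequent-itemset and rule-extraction steps concrete, here is a minimal plain-Python sketch of Apriori followed by confidence-based rule generation. It is not the notebook's mlxtend code: the transactions are invented, the helper names are hypothetical, and only Apriori is shown. FP-Growth and ECLAT find the same frequent itemsets with different search strategies.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support} for every frequent itemset."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def support(itemset):
        return sum(itemset <= basket for basket in baskets) / n

    current = {frozenset([item]) for basket in baskets for item in basket}
    frequent = {}
    size = 1
    while current:
        level = {c: support(c) for c in current}
        level = {c: s for c, s in level.items() if s >= min_support}
        frequent.update(level)
        size += 1
        # Join step: merge frequent itemsets into candidates one item larger.
        candidates = {a | b for a in level for b in level if len(a | b) == size}
        # Prune step: keep a candidate only if all its subsets are frequent.
        current = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, size - 1))
        }
    return frequent


def association_rules(frequent, min_confidence):
    """Return (antecedent, consequent, support, confidence) tuples."""
    rules = []
    for itemset, supp in frequent.items():
        for r in range(1, len(itemset)):
            for combo in combinations(sorted(itemset), r):
                antecedent = frozenset(combo)
                confidence = supp / frequent[antecedent]
                if confidence >= min_confidence:
                    rules.append((antecedent, itemset - antecedent, supp, confidence))
    return rules


# Invented market-basket data (assumption), five transactions.
transactions = [
    ["bread", "milk"],
    ["bread", "diapers", "beer", "eggs"],
    ["milk", "diapers", "beer", "cola"],
    ["bread", "milk", "diapers", "beer"],
    ["bread", "milk", "diapers", "cola"],
]

frequent = apriori(transactions, min_support=0.6)
rules = association_rules(frequent, min_confidence=0.8)

for antecedent, consequent, supp, conf in rules:
    print(sorted(antecedent), "->", sorted(consequent),
          "support:", supp, "confidence:", conf)
```

With these thresholds the only rule that survives is beer -> diapers: every basket containing beer also contains diapers, while the reverse direction reaches a confidence of only 0.75.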
