avannaldas / emailsclassification

Classification of emails received on a mass distribution group

Jupyter Notebook 100.00%
sklearn scikit-learn text-classification email-classifier tfidf countvectorizer

Text Classification using Scikit-Learn (sklearn)

This project classifies emails received on a mass distribution group, using each email's subject line and hand-labelled categories (supervised learning). The solution covers preprocessing (stopword removal and lemmatization with nltk) and feature extraction with a count vectorizer followed by a TF-IDF transformer. It is a vanilla implementation, meant as a starting point that can be extended to other text classification problems.

Things that can be tweaked to improve accuracy...

  • Add more parameter configurations to GridSearchCV.
  • Increase the number of folds used by GridSearchCV (the default was 3 when this was written; scikit-learn 0.22 and later default to 5).
  • Increase the size of the dataset (the current dataset has only about 500 emails).
  • The classes in the dataset are skewed. Either balance the dataset by oversampling, or adjust the per-class weights if the classifier supports it.
  • Try different classifiers, or use model stacking with a meta-classifier.
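
Two of the tweaks above (class weights and more folds) can be sketched as follows. This is only an illustration under stated assumptions: the subjects and the category names `build` and `social` are made up, and `class_weight="balanced"` is one possible way to reweight skewed classes, not necessarily what the notebook does.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical, deliberately skewed toy data: 10 "build" emails, 5 "social".
subjects = (
    ["Build failed on branch %d" % i for i in range(10)]
    + ["Lunch menu for day %d" % i for i in range(5)]
)
categories = ["build"] * 10 + ["social"] * 5

# class_weight="balanced" weights each class inversely to its frequency,
# so the minority class is not drowned out during training.
model = make_pipeline(
    CountVectorizer(),
    TfidfTransformer(),
    SGDClassifier(class_weight="balanced", random_state=0),
)

# Stratified folds keep the class proportions roughly equal in every split.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, subjects, categories, cv=folds)
print(scores.mean())
```

The same `folds` object can be passed as `cv=folds` to GridSearchCV to raise the number of folds used during parameter search.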

Quick Info...

  • Dataset: The dataset is a CSV with the columns 'Subject' and 'Categroy' (target variable) for about 500 emails. I'm not sharing the dataset because it comes from real emails in my inbox. Replace it with your own dataset that has these two columns.

  • Features: The feature matrix is built with sklearn.feature_extraction.text.CountVectorizer, which produces a matrix of token counts, and sklearn.feature_extraction.text.TfidfTransformer, which reweights and normalizes that count matrix.

  • Classifier: sklearn.linear_model.SGDClassifier

  • Pipeline and GridSearchCV: sklearn.pipeline.Pipeline and sklearn.model_selection.GridSearchCV are among the best things in sklearn. A Pipeline lets you run a series of steps on the data without creating each object separately or handling parameters, return values and the hand-off of data between steps yourself. GridSearchCV helps with parameter tuning and cross-validates every configuration (3 folds by default when this was written; 5 folds since scikit-learn 0.22). Together, Pipeline and GridSearchCV cut a lot of code complexity and make a solution more readable.
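
The pieces listed above fit together roughly as in the sketch below. It is a minimal illustration, not the notebook's actual code: the subjects, the `ops` and `social` labels, the step names and the parameter grid are all assumptions chosen to keep the example small, and the nltk preprocessing step is left out.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical stand-in data; the real project loads subjects and
# categories from a CSV that is not shared.
subjects = [
    "Server outage tonight", "Database outage resolved",
    "Network outage update", "Server maintenance window",
    "Team lunch on Friday", "Lunch and learn session",
    "Friday social event", "Holiday party invitation",
]
categories = ["ops"] * 4 + ["social"] * 4

# Each step feeds its output to the next: counts -> tf-idf -> classifier.
pipeline = Pipeline([
    ("counts", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", SGDClassifier(random_state=0)),
])

# Parameters are addressed as "<step name>__<parameter name>".
param_grid = {
    "counts__ngram_range": [(1, 1), (1, 2)],
    "tfidf__use_idf": [True, False],
    "clf__alpha": [1e-3, 1e-4],
}

# cv=2 only because the toy dataset is tiny; use more folds on real data.
search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(subjects, categories)

print(search.best_params_)
print(search.predict(["Outage on the mail server"]))
```

After fitting, `search` behaves like the best pipeline found, so raw subject strings can be passed straight to `predict`.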
