This repository contains the code for a multi-label email classification system. The system is designed to classify emails based on multiple dependent variables (labels) such as Type 2, Type 3, and Type 4. It provides a modular and extensible architecture that allows for easy modification and addition of preprocessing steps, machine learning models, and evaluation metrics.
- Separation of concerns (SoC) architecture with components for preprocessing, embeddings, modeling, etc.
- Supports two design decisions for multi-label classification:
  - Chained Multi-outputs Approach: Trains a single model instance on chained labels (e.g., Type 2, Type 2+3, Type 2+3+4)
  - Hierarchical Modeling Approach: Trains multiple model instances on filtered data based on the classes of preceding labels
- Encapsulates input data using a `Data` class for consistent access across models
- Implements multiple machine learning models with a consistent interface for training, prediction, and evaluation
- Provides a main controller (`main.py`) for orchestrating the preprocessing, embedding, modeling, and evaluation steps
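
The two multi-label approaches listed above can be illustrated with a small sketch. It is not the repository's implementation: the helper names `chain_labels` and `split_by_parent`, the `+` separator, the dictionary-based rows, and the sample label values are all assumptions made for illustration.

```python
from collections import defaultdict

# Label names come from the description above; how they appear as column
# names in the real CSV files is an assumption.
LABEL_COLUMNS = ["Type 2", "Type 3", "Type 4"]


def chain_labels(row, columns=LABEL_COLUMNS, sep="+"):
    """Chained approach: build cumulative targets (Type 2, Type 2+3, Type 2+3+4)."""
    chained = []
    parts = []
    for column in columns:
        parts.append(row[column])
        chained.append(sep.join(parts))
    return chained


def split_by_parent(rows, parent_column):
    """Hierarchical approach: group rows by the class of a preceding label,
    so that a separate model instance can be trained on each subset."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[parent_column]].append(row)
    return dict(groups)


# Invented sample rows, used only to make the sketch runnable.
emails = [
    {"text": "Cannot log in", "Type 2": "Problem", "Type 3": "Account", "Type 4": "Login"},
    {"text": "Refund please", "Type 2": "Request", "Type 3": "Payment", "Type 4": "Refund"},
    {"text": "Charged twice", "Type 2": "Problem", "Type 3": "Payment", "Type 4": "Duplicate"},
]

chained_targets = [chain_labels(email) for email in emails]
subsets = split_by_parent(emails, "Type 2")
```

In the chained case a single classifier would be fitted once per cumulative target; in the hierarchical case one classifier would be fitted per subset returned by `split_by_parent`, and the same split could be repeated on the next label within each subset.
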
- `main.py`: Main controller script for running the email classification system
- `preprocess.py`: Contains functions for data preprocessing, including de-duplication, noise removal, and translation
- `embeddings.py`: Implements functions for generating embeddings from text data (e.g., TF-IDF)
- `modelling/`: Directory containing modules related to modeling
  - `modelling.py`: Defines functions for model training, prediction, and evaluation
  - `data_model.py`: Implements the `Data` class for encapsulating input data
- `model/`: Directory containing implementations of various machine learning models
  - `base.py`: Defines the abstract base class for all models
  - `randomforest.py`: Implements the Random Forest model
  - `sgd.py`: Implements the Stochastic Gradient Descent (SGD) model
  - `adaboost.py`: Implements the AdaBoost model
  - `voting.py`: Implements the Voting Classifier model
  - `hist_gb.py`: Implements the Histogram-based Gradient Boosting model
  - `random_trees_ensembling.py`: Implements the Random Trees Embedding model
- `data/`: Directory containing the input data files
  - `AppGallery.csv`: Input data file for the AppGallery domain
  - `Purchasing.csv`: Input data file for the Purchasing domain
- `Config.py`: Configuration file for storing constants and settings
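
To show how a shared data container and an abstract base class can give every model the same interface, here is a minimal sketch using only the standard library. The field names on `Data`, the class names `BaseModel` and `MajorityClassModel`, and the `train`/`predict`/`evaluate` method names are assumptions; the actual definitions in `data_model.py` and `base.py` may differ.

```python
from abc import ABC, abstractmethod
from collections import Counter
from dataclasses import dataclass


@dataclass
class Data:
    """Hypothetical container for the split shared by all models."""
    X_train: list
    y_train: list
    X_test: list
    y_test: list


class BaseModel(ABC):
    """Hypothetical common interface: every model trains, predicts, evaluates."""

    @abstractmethod
    def train(self, data: Data) -> None:
        ...

    @abstractmethod
    def predict(self, X: list) -> list:
        ...

    def evaluate(self, data: Data) -> float:
        """Return plain accuracy on the test split."""
        if not data.y_test:
            raise ValueError("test split is empty")
        predictions = self.predict(data.X_test)
        correct = sum(p == y for p, y in zip(predictions, data.y_test))
        return correct / len(data.y_test)


class MajorityClassModel(BaseModel):
    """Toy stand-in for a real classifier: always predicts the most common label."""

    def __init__(self):
        self.label = None

    def train(self, data: Data) -> None:
        self.label = Counter(data.y_train).most_common(1)[0][0]

    def predict(self, X: list) -> list:
        return [self.label for _ in X]


# Invented feature vectors and labels, used only to exercise the interface.
data = Data(
    X_train=[[0.1], [0.2], [0.9]],
    y_train=["Problem", "Problem", "Request"],
    X_test=[[0.3], [0.8]],
    y_test=["Problem", "Request"],
)
model = MajorityClassModel()
model.train(data)
accuracy = model.evaluate(data)
```

Because the controller only calls the methods declared on the base class, a new model can be added by writing one more subclass without touching the orchestration code.
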
- Install the required dependencies (TODO: Create `requirements.txt` file).
- Place the input data files (`AppGallery.csv` and `Purchasing.csv`) in the `data/` directory.
- Modify the `Config.py` file to adjust any configuration settings if needed.
- Run the `main.py` script to execute the email classification system.
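
Before running the system, it can help to confirm that the input files are in place. The helper below is a hypothetical pre-flight check, not part of the repository; it assumes the working directory is the repository root and uses the file names from the steps above.

```python
from pathlib import Path

# File names from the setup steps; the helper itself is hypothetical.
EXPECTED_FILES = ("AppGallery.csv", "Purchasing.csv")


def missing_inputs(data_dir="data", expected=EXPECTED_FILES):
    """Return the expected input files that are absent from data_dir."""
    directory = Path(data_dir)
    return [name for name in expected if not (directory / name).is_file()]


if __name__ == "__main__":
    missing = missing_inputs()
    if missing:
        print("Missing input files:", ", ".join(missing))
    else:
        print("All input files found.")
```
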
Contributions to this project are welcome. If you find any issues or have suggestions for improvements, please open an issue or submit a pull request.
This project is licensed under the MIT License.