Giter Site home page Giter Site logo

nuhashhaque / bangla-textclassification-analysis-eda-and-models Goto Github PK

View Code? Open in Web Editor NEW
3.0 1.0 2.0 2.6 MB

Jupyter Notebook 100.00%
nlp-machine-learning bangla-nlp deep-neural-networks lstm cnn hybrid wordembeddings text-classification bengali-natural-language-processing tfidf

bangla-textclassification-analysis-eda-and-models's Introduction

Bangla-TextClassification-EDA-and-Models

Bangla-TextClassification-EDA-and-Models: Project Overview

  • In this project i have built some models to classify texts catogorized (economy,sports,international,state,technology,entertainment,education') using machine learning and deeplearning models.
  • For dataset i have used a public dataset available called banglamct7-bangla-multiclass-text-dataset-7-tags which is avaiable in this link.
  • Word embeding is used for deep learning models and TFIDF is used for machine learning models for feature represtations for extracting the semantic meaning of the words.
  • Machine Learning models has been built by using Logistic Regression and Multinomial Naieve Bayes.
  • Deep learning models has been built by using Deep Neural Network,Convolutional Neural Network,BiDirectional LSTM and CNN-BiLSTM Mybrid model.
  • Finally, the models performance is evaluated using various evaluation measures such as confusion matrix, accuracy , precision, recall and f1-score with classification report.

Resources Used

  • Developement Envioronment : Kaggle
  • Python Version : 3.7
  • Framework and Packages : Tensorflow, Scikit-Learn, Pandas, Numpy, Matplotlib, Seaborn

Project Outline

  • Data Collection and Cleaning
  • Data Summary
  • Data Preparation
  • Model
  • Model Evaluation

Data Collection and Cleaning

The dataset contains clean data ready to be used for feature extraction and classification.But i have applied Stopwords removal and Stemming on the clean data.

Data Summary

Data summary is done in a single notebook available in this link.In EDA notebook, i have shown number of documents, words and unique words have in each category class, histogram analysis to text length and Ngram analysis upto trigram in each category.

Data Preparation

To prepare data before model building, i have used TFIDF for machine learning and Word Embedding for deep learning models.The parameters are optimized and tuned with respect to the EDA results.

Model

I have used Logistic Regression,Multinomial Naive Bayes,Deep Neural Networks,CNN,LSTM,CNN-BiLSTM hybrid model. All the models parameters are tuned and optimized.

Model Evaluation

Model Name Accuracy Precision Recall F1-Score
Logistic Regression 0.9156 0.9157 0.9156 0.9156
Multinomial Naive Bayes 0.8858 0.8863 0.8858 0.8859
Deep Neural Network 0.9318 0.9320 0.9318 0.9318
CNN 0.9061 0.9067 0.9061 0.9060
Bi-LSTM 0.9276 0.9277 0.9275 0.9275
CNN-BiLSTM Hybrid 0.9061 0.9067 0.9061 0.9060

In this project, i have found 93% accuracy with Deep Neural Network model which is the best score. Report of classification and the Confusion Matrix shows that the models miss classified the category of economy, state and education . It is because the text in this categories are very similar and contain similar words. In conclusion, I have achieved good accuracy with different models. This accuray can be further improved by doing hyperparameter tunning and by employing more shophisticated network architecture like Transformers and Pretrained Embeddings.

bangla-textclassification-analysis-eda-and-models's People

Contributors

nuhashhaque avatar

Stargazers

 avatar  avatar  avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.