Final Project for Deep Learning course at NEU

License: Apache License 2.0



Product category classification

This repository contains the source code for the graduation project "Multi-stages Deep Learning based E-commerce Product Categorization with advanced text processing techniques".

Developed by To Duc Anh, a student of DSEB61, MFE, a member of NEU's DSLab, and a Data Management Associate at Techcombank Vietnam.

The project offers a solution to the product category classification problem. The dataset was crawled from Shopee.

The project features a deep learning model built with PyTorch. The text embeddings were computed with Vietnamese sBERT.

The training process can be found in the notebook directory.

The project is now LIVE and accessible via this link.

About the dataset:

The dataset contains roughly 1,000,000 products across five classes: four named categories plus a catch-all "Others" class. The categories are:

  • Electronics
  • Cosmetics
  • Fashion
  • Mom & Baby
  • Others
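
As a minimal sketch of how a classifier's raw scores could be mapped back to these categories: the label order below simply follows the list above and is an assumption, as are the names `CATEGORIES`, `softmax`, and `predict_label`. The project's actual label encoding may differ.

```python
import math

# Assumed label order: follows the list above, not confirmed by the project.
CATEGORIES = ["Electronics", "Cosmetics", "Fashion", "Mom & Baby", "Others"]


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(logits):
    """Return the (category, probability) pair with the highest score."""
    if len(logits) != len(CATEGORIES):
        raise ValueError("expected one score per category")
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return CATEGORIES[best], probs[best]


# Hypothetical scores for a single product title.
label, prob = predict_label([0.1, 2.3, -1.0, 0.4, 0.0])
print(label, round(prob, 3))
```

Keeping the label list in one place means the training notebook and the web app cannot silently disagree about which index means which category.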

About the model and steps:

See the introduction page here for details on how the model is built and how each step works.

How to run the project on your local machine:

This project was developed and deployed with Python 3.10, so you should have Python 3.10 installed to run it locally. The web app is built with Streamlit, a Python library for building web apps.
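
If you want to confirm the interpreter version before installing anything, a small check like the following could help. This is only a sketch: the helper name `is_supported` is hypothetical and not part of the project, and other Python versions may well work even though only 3.10 is stated here.

```python
import sys

REQUIRED = (3, 10)  # the version the project states it was developed on


def is_supported(version_info=None, required=REQUIRED):
    """Return True when the interpreter's major.minor matches `required`."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) == tuple(required)


if not is_supported():
    print(
        f"Warning: running Python {sys.version_info[0]}.{sys.version_info[1]}, "
        f"but the project was developed on {REQUIRED[0]}.{REQUIRED[1]}."
    )
```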

Getting Started on your local machine:

  1. Clone the project to your local machine:
git clone https://github.com/Hyprnx/Text-Classification
  2. Set up and activate a virtual environment. Set up:
python -m venv <envname>

Activate:

  • On Mac/Linux:
    source <envname>/bin/activate
  • On Windows:
    <envname>\Scripts\activate
  3. Install the dependencies:
pip install -r requirement.txt
  4. Run the project:
  streamlit run streamlit_app.py
  5. Open the link provided by Streamlit in your browser.

By default, the app is served on port 8501 (e.g., http://localhost:8501). To change the port, refer to the Streamlit documentation here.

The project should look like this: image

  6. Enjoy the project!
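
The run step above uses Streamlit's default port. As a hedged sketch of launching the app on a different port from Python, the helper below builds the command using Streamlit's `--server.port` option. The function names `build_streamlit_command` and `launch` are hypothetical, and the script name is taken from the run step above.

```python
import subprocess
import sys


def build_streamlit_command(script="streamlit_app.py", port=8501):
    """Build the argument list for launching the app on a chosen port."""
    if not 1 <= port <= 65535:
        raise ValueError(f"invalid port: {port}")
    # `python -m streamlit` uses the interpreter of the active environment.
    return [sys.executable, "-m", "streamlit", "run", script,
            "--server.port", str(port)]


def launch(port=8501):
    """Start the app; blocks until the Streamlit server exits."""
    return subprocess.run(build_streamlit_command(port=port), check=True)


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch has no side effects.
    print(" ".join(build_streamlit_command(port=8502)))
```

Calling `launch(8502)` from inside the activated virtual environment would then serve the app at http://localhost:8502.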

Others

The project also includes a trained model, along with its ONNX version, that can be used for inference. It is located here: model.

Experimental:

ONNX and GPU acceleration:

Inference could be accelerated with ONNX Runtime. In the provided model, embedding is slow because the embedding module is taken as-is from the transformer model without any optimization. There are two reasons for this. First, the sentence embedding model we use has a problem with its vocabulary size, so the ONNX conversion cannot be completed; retraining the model with a smaller vocabulary might make the conversion work later on. Second, the embedding process runs entirely on the CPU, which is inefficient for highly parallel computation. A GPU should speed it up considerably: embedding ~1M sentences takes around 10 minutes on an NVIDIA P100 GPU, kindly provided by Google on Kaggle.
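
A minimal sketch of running the ONNX classifier with ONNX Runtime, using a GPU when one is available, might look like this. It assumes the exported model takes precomputed sentence embeddings as its single input, which the source does not confirm. The helper names are hypothetical, and the model path is left for you to supply.

```python
def pick_providers(available):
    """Prefer the CUDA provider when present, always keeping CPU as fallback."""
    preferred = ["CUDAExecutionProvider", "CPUExecutionProvider"]
    chosen = [p for p in preferred if p in available]
    return chosen or ["CPUExecutionProvider"]


def load_session(model_path):
    """Create an ONNX Runtime session for the classifier at `model_path`."""
    import onnxruntime as ort  # imported lazily; requires onnxruntime installed

    providers = pick_providers(ort.get_available_providers())
    return ort.InferenceSession(model_path, providers=providers)


def classify(session, embeddings):
    """Run the classifier on a float32 array of sentence embeddings.

    Assumption: the model has exactly one input and its first output
    holds the class scores.
    """
    input_name = session.get_inputs()[0].name  # avoids hard-coding the name
    return session.run(None, {input_name: embeddings})[0]
```

Note that this only accelerates the classifier. The embedding step described above would still run through the unconverted embedding module.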

DataFrame Library:

The project currently uses Pandas (pre-2.0 releases) and Modin, a parallel drop-in replacement for Pandas, to speed up data processing. These libraries are still not the fastest available, however. We could try Polars, a DataFrame library written in Rust that offers fast, memory-efficient data processing. Because Polars is still in early development and lacks some features we need, we cannot use it for this project yet, but we may try it in the future.

Deployment:

We use the free tier of Streamlit cloud, which is limited to 1 GB of resources. This limitation is the reason you don't see the phoBERT classifier on the web app: the model is too big to be deployed on Streamlit cloud.

We could rent a cloud-based service from AWS, GCP, or Azure to deploy the model, but as students we see no financial benefit in doing so for demonstration purposes.
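
To get a rough feel for why a large transformer can exceed such a limit, the back-of-the-envelope estimate below multiplies a parameter count by the bytes per parameter. The parameter count and overhead are hypothetical placeholders, not measurements of this project's models, and the function names are illustrative only.

```python
def estimate_weights_mb(num_parameters, bytes_per_parameter=4):
    """Estimate the size of a model's weights in megabytes (1 MB = 10**6 bytes).

    The default of 4 bytes per parameter corresponds to float32 weights.
    """
    return num_parameters * bytes_per_parameter / 1_000_000


def fits_in_budget(num_parameters, budget_mb=1000, overhead_mb=0):
    """Check whether the weights plus a fixed overhead fit within `budget_mb`."""
    return estimate_weights_mb(num_parameters) + overhead_mb <= budget_mb


# Hypothetical numbers for illustration only, not measurements from this project.
params = 135_000_000
print(estimate_weights_mb(params))
print(fits_in_budget(params, overhead_mb=600))
```

The weights alone are only part of the footprint: the Python runtime, the libraries, and the other loaded models share the same budget, which is what the `overhead_mb` argument stands in for.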
