Giter Site home page Giter Site logo

ferri's Introduction

product calssification challenge

This problemis simple but yet complex. Lack of labelled data add to complexity. So, I rely on pretrained models. The approach here is ensemble method. Predict on the basis of 1) product description and 2) use image to classify. Then take the average.

  • Run Jupyter Notebook product_classification.ipynb

  • Output file after classification is classified_product_data.json

Why are you designing the solution in this way?

More of NLP algorithms are moving towards deep learning. For this problem, however, I unable to leverage deep learning. Deep learning needs more data for training. Also, I did not find any pretrained models for eCommerce ( fashion or beauty ). There is no public data set available. So, I have choosen text classification word embeddings approach. Good pretrained word embedings are available in open source community.

What are the aspects that you considered when designing?

I considered not to spent much time, finish within 3 hours. I attempted to label data but did not pursue as this will take time. Basically, tried to solve the problem with many constraints like no labelled data, data set is very small etc. I leveraged word embedings and cosine similarity to find the highest match for product to a cetegory. Data clean-up ( preprocessing ) is important for vector space model to have better score. So, I considered data preprocess is importatnt. Liguistics or grammar is very important here as well. I changed "Intimate" category to "Intimate Collection" as is used in retail. This is required only when we match, not required when we output i.e in while display. We can still display "Intimate" to uses. Few categories, I manually changed to its singular form ( e.g. "Dresses" to "Dress) to get better accuracy and not clubbed into "Others" category.

In part-2, I have designed code to use deep learning to classify images, but this requires to label the data. I did few but itakes lot of manual effort and time, so I abandoned but code is still there in Jupyter Notebook. Using a pre-trained model removes the need for you to spend time obtaining, cleaning, and processing (intensively) such large datasets. This requires few thousand of laballed data to train for this specific project using transfer learning.

Second problem I found is images has multiple objects in it. So, simple image classification may not work. In this cases, object detection will give better result. Object detection in OpneCV also requires few labelled images, which is not given in this home-work.

What are the cases your solution covers, how are they covered and why are they important?

It covers all the cases, I believe, my only concern is accuracy. It can be improved, by training a word2vec model with eCommerce training data. Better alogrithm to find a good threshhold value which determines "Others" category.

What are the cases your solution does not cover and what are the ways you can extend your current solution for them?

The disadvantage of pre-trained word embeddings is that the words contained within may not capture the peculiarities of language in your specific application domain. For example, Wikipedia may not have great word exposure to particular aspects of legal doctrine or religious text, so if your application is specific to a domain like this, your results may not be optimal due to the generality of the downloaded model’s word embeddings.

Initially, I was thinking to use a hybrid model one from text i.e. product description and two from images. Given little bit more time, I would have tried "attention" and transformers.

ferri's People

Contributors

ppujari avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.