
Product-Categorization

Project structure

.
├── data
│   ├── processed
│   │   └── v1
│   ├── raw
│   │   └── v1
│   ├── scripts
│   │   ├── extract.py
│   │   ├── transform.py
│   │   └── load.py
│   └── main.py
├── deploy
│   ├── scripts
│   └── tests
├── develop
│   ├── artifacts
│   ├── eda
│   ├── notebooks
│   ├── scripts
│   │   ├── eda.py
│   │   └── load.py
│   └── main.py
├── label
├── train
├── visualise
├── .flake8
├── .gitattributes
├── .gitignore
├── .pre-commit-config.yaml
├── Makefile
├── poetry.toml
├── pyproject.toml
└── README.md

Task Approach

Collecting data

  • Initially, I attempted to gather data directly from the application by identifying the endpoints that the application communicates with on the backend. However, this approach proved to be labor-intensive, so I decided to skip it.
  • Instead, I obtained data relevant to the task from Kaggle. You can access the dataset here. It comprises approximately 140 CSV files, each containing valuable attributes such as main_category and image of the product, which will be instrumental for our task.
  • Finally, ingest the data into the GCP instance that was created, under the /data/raw directory

Data Preprocessing and Preparation

  • Extract only the desired attributes from all 140 CSV files, namely main_category, sub_category, and image_url, and save the output of this phase into /data/processed/v1 (a sketch of this step follows the signature below)

        def process_all_csv(
            source: str, file_names: List[str], cols: str, destination_base: str
        ):
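    A minimal sketch of what the body of this function could look like, assuming pandas; the implementation below is illustrative, not the repo's actual code, and takes the column names as a list:

        import os
        from typing import List

        import pandas as pd

        def process_all_csv(
            source: str, file_names: List[str], cols: List[str], destination_base: str
        ) -> None:
            # Illustrative sketch: keep only the desired columns of every CSV
            # and write the reduced files under the destination directory.
            os.makedirs(destination_base, exist_ok=True)
            for name in file_names:
                df = pd.read_csv(os.path.join(source, name), usecols=cols)
                df.to_csv(os.path.join(destination_base, name), index=False)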
  • Combine the outputs of the previous step into a single file with about 400k entries

  • Remove duplicated URLs after merging

  • As an intermediate step for handling and processing the large dataset, I decided to implement a function that divides any given list into chunks so they can be processed asynchronously in the following steps (a usage sketch follows the function)

    def chunk_list(lst, chunk_size):
        for i in range(0, len(lst), chunk_size):
            yield lst[i : i + chunk_size]
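    For example, the chunks could be fed one at a time into whatever coroutine does the per-chunk work (process_chunk below is a hypothetical placeholder, not a function from the repo):

        import asyncio

        async def process_chunk(chunk):
            # hypothetical placeholder for the real async work (URL checks, downloads, ...)
            ...

        urls = [f"https://example.com/img_{i}.jpg" for i in range(10_000)]
        for chunk in chunk_list(urls, chunk_size=1000):
            asyncio.run(process_chunk(chunk))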
  • Reduce labels: there were more than 14 categories, which I reduced to 7

  • Implement an async function to validate the URL of each image

    @classmethod
    async def is_valid_url(cls, url: str, session: aiohttp.ClientSession) -> bool:
        try:
            async with session.get(url) as response:
                response.raise_for_status()
                return True
        except ClientResponseError:
            return False

    This step reduced the number of entries from about 400k to 200k; a sketch of how the validator can be driven over one chunk of URLs follows.
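    The sketch below assumes is_valid_url is a classmethod of the ImageCollector class that appears in the next step; the wrapper name and filtering logic are illustrative, not the repo's exact code:

        import asyncio
        from typing import List

        import aiohttp

        async def validate_chunk(urls: List[str]) -> List[str]:
            # Check all URLs of one chunk concurrently and keep only the reachable ones.
            async with aiohttp.ClientSession() as session:
                flags = await asyncio.gather(
                    *(ImageCollector.is_valid_url(url, session) for url in urls)
                )
            return [url for url, ok in zip(urls, flags) if ok]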

  • Then fetch the image of each entry and persist it on the cloud instance that was created earlier:

    1. Divide the large list into chunks
    2. Process each chunk asynchronously to get the images from the remote server:
      @classmethod
      async def get_image_from_links(cls, urls: List[str]) -> List[bytes]:
          async with aiohttp.ClientSession() as session:
              tasks = [ImageCollector.fetch_image(url, session) for url in urls]
              return await asyncio.gather(*tasks)
      
      @classmethod
      async def fetch_image(cls, url: str, session: aiohttp.ClientSession) -> bytes:
          try:
              async with session.get(url) as response:
                  response.raise_for_status()
                  return await response.read()
          except ClientResponseError:
              return None
    3. Persist the collected images to disk:
      def persist_images(images: List[bytes], dirs: List[str]) -> bool:
          for image, dir in zip(images, dirs):
              img = parse_image_content(image)
              result = save_cvimage(img, dir)
              if not result:
                  print(f"Failed To Persist {os.path.basename(dir)}")
      • parse the binary data that has been collected into 3D numpy arrays using OpenCV (a sketch of these helpers follows)
      • save the parsed images to disk
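    parse_image_content and save_cvimage are not shown above; an OpenCV-based sketch of what they might look like (an assumption, not the repo's exact code):

        import cv2
        import numpy as np

        def parse_image_content(content: bytes):
            # Decode the raw bytes into a 3-channel BGR numpy array; None if decoding fails.
            if content is None:
                return None
            buffer = np.frombuffer(content, dtype=np.uint8)
            return cv2.imdecode(buffer, cv2.IMREAD_COLOR)

        def save_cvimage(img, path: str) -> bool:
            # cv2.imwrite returns False when the image is invalid or the path is not writable.
            if img is None:
                return False
            return cv2.imwrite(path, img)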
  • Push the data to Kaggle

        # creates a JSON metadata file for configuring the dataset
        kaggle datasets init -p /path/to/data
        # create the dataset after setting the metadata
        # the -u flag makes the dataset public
        kaggle datasets create --dir-mode tar -p /path/to/data -u
  • Final output

    Here you can access the public dataset on Kaggle resulting from the previous steps

Modelling Phase

I have decided to use TensorFlow rather than PyTorch due to time constraints and because our use case will include:

  1. Transfer Learning: Utilize pre-trained models available in TensorFlow Hub or models trained on large datasets like ImageNet. Fine-tuning these models for our specific task can significantly reduce training time and data requirements.

  2. TensorFlow Extended (TFX): for building end-to-end machine learning pipelines for production, TensorFlow Extended (TFX) provides a suite of tools for building, deploying, and maintaining production-ready ML pipelines.

  3. TensorBoard: TensorFlow comes with TensorBoard, a powerful visualization tool that helps you monitor and debug your models. Utilize TensorBoard for visualizing metrics, model graphs, embeddings, and more to gain insights into your model's behavior.

  4. Model Serving and Deployment: since deployment is a significant concern, TensorFlow provides tools like TensorFlow Serving, TensorFlow Lite, and TensorFlow.js for deploying models in various environments, including cloud, mobile, and web.

TensorFlow Dataset

Build a wrapper class to ingest the images from the saved directory into the model:

  1. Process the given path of any image and return the image and its label as tensor objects (a sketch of the ImageProcessor helpers used here appears after this list)

        @classmethod
        def process_path(cls, file_path: pathlib.Path):
            label = ImageProcessor.get_label(file_path)
            img = tf.io.read_file(file_path)
            img = ImageProcessor.decode_img(img)
            return img, label
  2. ImageDataSet is the wrapper class that utilizes the previous function to prepare the dataset and provides an API to split it into train, validation, and test splits

    class ImageDataSet:
        def __init__(
            self, path: str, train_size: float, test_size: float, val_size: float
        ) -> None:
            ...

        def get_train_val_test(self, batch_size, width, height) -> Tuple[tf.data.Dataset]:
            ...

    Inside get_train_val_test I utilized other functions from ImageProcessor, such as

    @classmethod
    def prepare_for_training(cls, ds: tf.data.Dataset, shuffle_buffer_size=1000):
    
        ds = ds.shuffle(buffer_size=shuffle_buffer_size)
        ds = ds.batch(ImageProcessor.BATCH_SIZE)
        ds = ds.prefetch(buffer_size=ImageProcessor.AUTOTUNE)
        return ds

    This function takes the dataset object, shuffles it, divides it into batches, and prefetches it for training efficiency.
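ImageProcessor.get_label and decode_img are referenced in process_path but not shown; a sketch following the standard tf.data image-loading pattern (the constants and class names below are placeholders, not the repo's values):

    import os

    import tensorflow as tf

    class ImageProcessor:
        IMG_HEIGHT = 224          # placeholder target size
        IMG_WIDTH = 224
        BATCH_SIZE = 32
        AUTOTUNE = tf.data.AUTOTUNE
        CLASS_NAMES = tf.constant(["category_a", "category_b"])  # placeholder labels

        @classmethod
        def get_label(cls, file_path):
            # The parent directory name encodes the class; compare it against
            # CLASS_NAMES to produce a one-hot boolean vector.
            parts = tf.strings.split(file_path, os.path.sep)
            return parts[-2] == cls.CLASS_NAMES

        @classmethod
        def decode_img(cls, img):
            # Decode the JPEG bytes, scale to [0, 1], and resize to a fixed shape.
            img = tf.io.decode_jpeg(img, channels=3)
            img = tf.image.convert_image_dtype(img, tf.float32)
            return tf.image.resize(img, [cls.IMG_HEIGHT, cls.IMG_WIDTH])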

Modelling

I have leveraged the power of transfer learning and used the ResNet architecture with ImageNet weights; I also tried another architecture, VGG19, with the same weights. I built 4 different architectures using these as starting points and tried to adapt them to our data (a minimal sketch of this setup appears after the insights below). Here are the initial insights:

  • Model_v1 is performing well according to its training metrics
  • Performance of Model_v2 during the training process
  • The chosen data has some labelling issues, so label correction is needed to improve on this baseline
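
A minimal sketch of the kind of transfer-learning setup described above, assuming tf.keras.applications ResNet50 with ImageNet weights and a small dense head (the actual heads and hyperparameters of Model_v1/Model_v2 may differ):

    import tensorflow as tf

    NUM_CLASSES = 7  # the reduced label set mentioned earlier

    base = tf.keras.applications.ResNet50(
        include_top=False, weights="imagenet", input_shape=(224, 224, 3)
    )
    base.trainable = False  # freeze the pre-trained backbone initially

    model = tf.keras.Sequential([
        base,
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
    ])
    model.compile(
        optimizer="adam",
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )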

Tracking all these experiments without using TensorBoard and Weights & Biases would be trouble indeed :(
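
For completeness, a minimal sketch of attaching TensorBoard to a Keras training run; train_ds and val_ds stand for the splits produced by ImageDataSet.get_train_val_test, and Weights & Biases provides an analogous Keras callback that could be appended to the list:

    import tensorflow as tf

    # Log metrics and the model graph to ./logs so runs can be compared in TensorBoard.
    tensorboard_cb = tf.keras.callbacks.TensorBoard(log_dir="logs", histogram_freq=1)

    model.fit(train_ds, validation_data=val_ds, epochs=10, callbacks=[tensorboard_cb])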
