ativitbenz / image-similarity-vector-search

This project forked from mukundha/image-similarity-vector-search

Demonstrate Astra's Vector search with Image similarity search using Amazon Berkeley Objects (ABO) Dataset

License: Apache License 2.0

image-similarity-vector-search's Introduction

Image Similarity with Vector Search

Demonstrate DataStax Astra's Vector search with image similarity search using the Amazon Berkeley Objects (ABO) dataset.

Demo

Click here for a live demo. Try it on your mobile.


Follow along to set up this demo yourself and learn how to do image similarity search with Vector search.

This repository includes 3 sections:

  1. Data processing
     • Generate Vector embeddings for Images
     • Load Vector embeddings into Astra
  2. API - Similarity search
     • Exposes an API to perform Vector search and retrieve similar images
  3. UI App
     • Allows users to capture images and search for similar images

1. Data processing

Refer to Citation for how to download and get access to this dataset.

The dataset includes 147,702 products and 398,212 unique catalog images.

The code is in the data-processing folder.

Initialize DB

Review the Astra Getting started guide, if needed.

Create the products table with a Vector<> column that holds the embedding of each product's main image (main_image_id), then create an SAI index on that Vector column.

CREATE TABLE amazon_products (
    brand TEXT,
    bullet_point TEXT,
    color TEXT,
    item_id TEXT,
    item_name TEXT,
    item_weight TEXT,
    model_name TEXT,
    model_number TEXT,
    product_type TEXT,
    main_image_id TEXT,
    other_image_id TEXT,
    item_keywords TEXT,
    country TEXT,
    marketplace TEXT,
    domain_name TEXT,
    node TEXT,
    material TEXT,
    style TEXT,
    item_dimensions TEXT,
    fabric_type TEXT,
    product_description TEXT,
    color_code TEXT,
    finish_type TEXT,
    item_shape TEXT,
    pattern TEXT,
    spin_id TEXT,
    model_year TEXT,
    "3dmodel_id" TEXT,
    item_name_language TEXT,
    brand_language TEXT,
    image_embedding Vector<FLOAT,2048>,
    PRIMARY KEY (item_id,domain_name)
);

CREATE CUSTOM INDEX IF NOT EXISTS image_embedding_index ON demo.amazon_products (image_embedding) USING 'org.apache.cassandra.index.sai.StorageAttachedIndex';
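
The SAI index on image_embedding is what lets Astra answer approximate nearest neighbor (ANN) queries. As a rough illustration of what such a search approximates, the sketch below ranks a few toy vectors by cosine similarity using a brute-force scan. Cosine is assumed here as the similarity measure, and the item names, 3-dimensional vectors, and the cosine_similarity and top_k helpers are made up for illustration; none of them come from this repository.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def top_k(query, embeddings, k=5):
    """Exact nearest neighbors: score every stored vector, keep the best k."""
    scored = [(cosine_similarity(query, vec), item_id)
              for item_id, vec in embeddings.items()]
    scored.sort(reverse=True)
    return [item_id for _, item_id in scored[:k]]


# Toy 3-dimensional "embeddings" (the real ones have 2048 dimensions).
toy = {
    "chair": [1.0, 0.0, 0.0],
    "stool": [0.9, 0.1, 0.0],
    "lamp": [0.0, 1.0, 0.0],
}
print(top_k([1.0, 0.05, 0.0], toy, k=2))
```

A full scan like this costs one comparison per stored image for every query; an ANN index gives up a little accuracy to avoid that.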

Generate Vector embedding and Load data

Download abo-listings.tar and abo-images-small.tar to your workstation and untar them.

tar -xvf abo-listings.tar
tar -xvf abo-images-small.tar
gunzip abo-listings/listings/metadata/listings_*.gz

import glob

import pandas as pd

# Each decompressed listings file holds one JSON object per line.
json_files = "abo-listings/listings/metadata/listings_*.json"

dfs = []

for file in glob.glob(json_files):
    df = pd.read_json(file, lines=True)
    dfs.append(df)

# Merge all listings into a single CSV for the loading step.
merged_df = pd.concat(dfs)
output_file = "items.csv"
merged_df.to_csv(output_file, index=False)

Setup your environment

Refer here, if you need help getting Astra credentials

export ITEMS_PATH="<path to items.csv generated in the previous step>"
export IMAGE_METADATA_FILE="<path>/abo-images-small/images/metadata/images.csv"
export IMAGES_FOLDER="<path>/abo-images-small/images/small"

export ASTRA_USER='ASTRA_CLIENTID'
export ASTRA_PASSWORD='ASTRA_CLIENTSECRET'
export SECURE_CONNECT_BUNDLE='SCB ZIP FILE PATH'
export KEYSPACE='KEYSPACE'
export TABLE_NAME='TABLE_NAME'
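
To turn a product's main_image_id into a file under IMAGES_FOLDER, the loader needs a lookup built from the image metadata file. The sketch below shows one way to do that, assuming images.csv has image_id and path columns; the sample row and the load_image_paths and resolve_image helpers are hypothetical and are not taken from process.py.

```python
import csv
import io
import os


def load_image_paths(metadata_file):
    """Build an image_id -> relative path mapping from an open CSV file."""
    reader = csv.DictReader(metadata_file)
    return {row["image_id"]: row["path"] for row in reader}


def resolve_image(image_id, image_paths, images_folder):
    """Return the full path of an image, or None if the id is unknown."""
    relative_path = image_paths.get(image_id)
    if relative_path is None:
        return None
    return os.path.join(images_folder, relative_path)


# Made-up sample standing in for the real metadata file. In practice:
#   with open(os.environ["IMAGE_METADATA_FILE"], newline="") as f:
#       image_paths = load_image_paths(f)
sample = io.StringIO("image_id,height,width,path\nabc123,256,256,ab/abc123.jpg\n")
image_paths = load_image_paths(sample)
print(resolve_image("abc123", image_paths, "/data/abo-images-small/images/small"))
```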

This code uses the Inception V3 model to compute a feature vector for each image.

Vector size: 2048


python3 process.py

[Optional] This step can take a long time, depending on your machine and network speed, because it needs to load more than 100k images. You might want to parallelize it, or split items.csv into smaller chunks and run the script from multiple machines. Optimizing this is left as an exercise for the reader.
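
One way to split items.csv into smaller chunks is sketched below. The split_csv helper and the chunk size are illustrative assumptions, not part of this repository; the demo runs on a small in-memory CSV rather than the real file.

```python
import csv
import io


def split_csv(reader, chunk_size):
    """Yield (header, rows) pairs with at most chunk_size rows each."""
    header = next(reader, None)
    if header is None:
        return  # empty input: nothing to split
    chunk = []
    for row in reader:
        chunk.append(row)
        if len(chunk) == chunk_size:
            yield header, chunk
            chunk = []
    if chunk:
        yield header, chunk  # final, possibly shorter, chunk


# Made-up five-row sample standing in for items.csv.
sample = io.StringIO("item_id,item_name\n1,a\n2,b\n3,c\n4,d\n5,e\n")
chunks = list(split_csv(csv.reader(sample), chunk_size=2))
for index, (header, rows) in enumerate(chunks):
    print(index, header, len(rows))
```

Each chunk can then be written to its own file with csv.writer, with ITEMS_PATH pointing at a different file on each machine.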

It took me about an hour to load all image embeddings, with 3 nodes.


Similarity Search API

API Spec

Request

POST /upload
Content-type: application/json

{"photoData":"..."}
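
The value of photoData is elided above. The sketch below shows how such a payload could be built and decoded, assuming the image travels as a base64 data URL (the form browser captures usually produce). That format, and the encode_photo and decode_photo helpers, are assumptions rather than the documented contract of this API.

```python
import base64
import json


def encode_photo(image_bytes, mime="image/jpeg"):
    """Build a JSON request body carrying the image as a base64 data URL."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return json.dumps({"photoData": f"data:{mime};base64,{b64}"})


def decode_photo(body):
    """Recover the raw image bytes from a JSON request body."""
    photo_data = json.loads(body)["photoData"]
    # Drop a "data:<mime>;base64," prefix if present; plain base64 passes through.
    _, _, b64 = photo_data.rpartition(",")
    return base64.b64decode(b64)


raw = b"\xff\xd8\xff placeholder bytes, not a real JPEG"
body = encode_photo(raw)
print(decode_photo(body) == raw)
```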

Response: Similar Images

[
    {
        "image_id": "https://storage.googleapis.com/demo-product-images/35/35952a54.jpg",
        "item_id": "B07RC7R3TC",
        "item_name": "xxx"
    },
    {
        "image_id": "https://storage.googleapis.com/demo-product-images/5a/5a735214.jpg",
        "item_id": "B081HN8L6R",
        "item_name": "XXX"
    },
    {
        "image_id": "https://storage.googleapis.com/demo-product-images/24/2435b28d.jpg",
        "item_id": "B07T7KPCM1",
        "item_name": "Amazon Brand - Solimo Designer Birds 3D Printed Hard Back Case Mobile Cover for LG K7"
    },
    {
        "image_id": "https://storage.googleapis.com/demo-product-images/2e/2e271e7c.jpg",
        "item_id": "B081HMTVLV",
        "item_name": "XXX"
    },
    {
        "image_id": "https://storage.googleapis.com/demo-product-images/cf/cf9f5a3a.jpg",
        "item_id": "B07Z497KW2",
        "item_name": "XXX"
    }
]

Query

f"SELECT * from {table_name} order by image_embedding ANN OF {input_image_vector} LIMIT 5"
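
Interpolating a Python list of floats into the f-string yields a bracketed literal such as [0.1, 0.2], which is the form a CQL vector literal takes. The sketch below builds the same query after checking the table name and the vector length. The build_ann_query helper and its checks are illustrative assumptions and are not taken from api-server/server.py.

```python
import re

EMBEDDING_DIM = 2048  # must match Vector<FLOAT,2048> in the table definition

# Accepts "table" or "keyspace.table" made of plain identifiers.
_TABLE_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)?")


def build_ann_query(table_name, vector, limit=5, dim=EMBEDDING_DIM):
    """Return the ANN query string for one input image vector."""
    if not _TABLE_NAME.fullmatch(table_name):
        raise ValueError(f"unexpected table name: {table_name!r}")
    if len(vector) != dim:
        raise ValueError(f"expected {dim} dimensions, got {len(vector)}")
    # float() also converts NumPy scalars into plain Python floats.
    literal = "[" + ", ".join(repr(float(x)) for x in vector) + "]"
    return (f"SELECT * FROM {table_name} "
            f"ORDER BY image_embedding ANN OF {literal} LIMIT {int(limit)}")


# Three dimensions keep the demo readable; the real vectors have 2048.
print(build_ann_query("amazon_products", [0.1, 0.2, 0.3], dim=3))
```

Checking the table name matters because it is spliced directly into the statement text.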

Code: api-server/server.py


UI App

In App.js

Replace Line 24 with your API Server URL

const response = await axios.post('<api-server>/upload', { photoData });

npm start

Attribution

Description | Credit
Credit for the data, including all images and 3D models | Amazon.com
Credit for building the dataset, archives and benchmark sets | Jasmine Collins, Shubham Goel, Kenan Deng, Achleshwar Luthra, Leon Xu, Erhan Gundogdu, Xi Zhang, Tomas F. Yago Vicente, Thomas Dideriksen, Himanshu Arora, Matthieu Guillaumin, Jitendra Malik (UC Berkeley, Amazon, BITS Pilani)

Citation

ABO: Dataset and Benchmarks for Real-World 3D Object Understanding

Dataset Homepage

@article{collins2022abo,
  title={ABO: Dataset and Benchmarks for Real-World 3D Object Understanding},
  author={Collins, Jasmine and Goel, Shubham and Deng, Kenan and Luthra, Achleshwar and
          Xu, Leon and Gundogdu, Erhan and Zhang, Xi and Yago Vicente, Tomas F and
          Dideriksen, Thomas and Arora, Himanshu and Guillaumin, Matthieu and
          Malik, Jitendra},
  journal={CVPR},
  year={2022}
}

imagenet/inception_v3/feature_vector

Feature vectors of images with Inception V3 trained on ImageNet (ILSVRC-2012-CLS).

Publisher: Google
Architecture: Inception V3
Dataset: ImageNet (ILSVRC-2012-CLS)
License: Apache-2.0
