rom1504 / clip-retrieval

Easily compute clip embeddings and build a clip retrieval system with them

Home Page: https://rom1504.github.io/clip-retrieval/

License: MIT License

Python 4.95% JavaScript 0.91% Dockerfile 0.01% Jupyter Notebook 94.03% Shell 0.03% HTML 0.02% Makefile 0.05%
semantic-search deep-learning multimodal ai clip knn

clip-retrieval's Introduction

👋

Hi, I'm Romain Beaumont aka rom1504. I build and deploy ML infra to solve important problems.

Recent work:

  • LLMs at YouTube
  • KNNs at Criteo
  • Laion5B and OpenClip in open source ML
  • Mineflayer and PrismarineJS in open source javascript

🗒 Blog posts

Blog posts with laion

Selection of papers

Google scholar

🔗 Links

Website GitHub Medium Twitter Goodreads

clip-retrieval's People

Contributors

afiaka87, afrendeiro, bigballon, cat-state, d0mih, dependabot[bot], dmvaldman, fabiozappo, flimflamm, guillaume-lagarde, heyalexchoi, korjusk, krasserm, lluisgomez, luke-han, mehdidc, mgoin, nousr, padge91, pvl, resloved, rom1504, seonghaeom, sofianel5, tom-pollak, yonatanbitton


clip-retrieval's Issues

advanced algebra mode

Building queries out of several embeddings works well.
It could be interesting to add an algebra mode in the front.
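
A minimal sketch of what that algebra could look like, assuming the clip embeddings are already L2-normalized numpy arrays (the names here are only illustrative):

import numpy as np

def combine(embeddings, weights):
    # weighted sum of normalized clip embeddings, renormalized to use as a knn query
    combined = sum(w * e for w, e in zip(weights, embeddings))
    norm = np.linalg.norm(combined)
    return combined / norm if norm > 0 else combined

# e.g. "beach" + "sunset" - "people":
# query = combine([emb_beach, emb_sunset, emb_people], [1.0, 1.0, -1.0])
# distances, ids = index.search(query.reshape(1, -1).astype('float32'), 40)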

Images are not rendering in the clip-retrieval front UI

The images are being fetched and KNN has been performed on the data. It can be seen on the console that the similar images are fetched as an array object. However, the images are not being displayed on the front end.
Is there any solution to this issue?

explore solutions to decrease memory usage of the metadata

Today the index uses a small amount of memory compared to the metadata.
This will not scale.

Options:

  • loading in memory only what's strictly necessary (urls + captions). Better but might not be enough
  • explore on-disk solutions
    • memory mapped file (arrow ? numpy ?)
    • sqlite ?

I think sqlite might be the best solution, but its speed needs to be checked.
Other solutions can also be considered, such as an on-disk key/value store, since the only need is to map an incremental index to a value.
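
A minimal sketch of the sqlite option, assuming the metadata is just (url, caption) keyed by the incremental index; the table and column names are illustrative:

import sqlite3

con = sqlite3.connect("metadata.db")
con.execute("CREATE TABLE IF NOT EXISTS metadata (id INTEGER PRIMARY KEY, url TEXT, caption TEXT)")
# rows would be an iterable of (id, url, caption) tuples produced at indexing time
# con.executemany("INSERT INTO metadata VALUES (?, ?, ?)", rows); con.commit()

def fetch_metadata(ids):
    # fetch only the rows needed for one knn result, keeping memory usage flat
    ids = [int(i) for i in ids]
    placeholders = ",".join("?" * len(ids))
    cur = con.execute(f"SELECT id, url, caption FROM metadata WHERE id IN ({placeholders})", ids)
    return {i: {"url": u, "caption": c} for i, u, c in cur}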

Implement a pytorch dataloader that filters and downloads at run time

This is an online version of #31.
Combine the whole pipeline not as one big batch job, but as a data loader that:

  • query/filter in a knn index + metadata structure
  • download
  • resize
  • give to training

It makes sense in particular when the model's training speed is low; dalle is such a model, for example.
For clip it could make less sense.

If it works, it could be a lot more convenient than downloading TBs of webdataset (a rough sketch of such a loader follows below):

  1. download a 16GB knn index and 50GB of metadata
  2. write your best keywords and how much of each you'd like (with clip thresholds)
  3. start the training on up to 400M samples
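
A rough sketch of what such a loader could look like, assuming a faiss index over image embeddings and a metadata mapping from id to url/caption; all names here are illustrative, not the actual implementation:

import io

import numpy as np
import requests
import torch
from PIL import Image
from torch.utils.data import IterableDataset

class OnlineFilteredDataset(IterableDataset):
    def __init__(self, index, metadata, query_embedding, k=10000, image_size=256):
        self.index = index              # faiss index over image embeddings
        self.metadata = metadata        # id -> {"url": ..., "caption": ...}
        self.query = np.asarray(query_embedding, dtype="float32").reshape(1, -1)
        self.k = k
        self.image_size = image_size

    def __iter__(self):
        # query/filter in the knn index, then download, resize and feed to training
        _, ids = self.index.search(self.query, self.k)
        for i in ids[0]:
            if i < 0:
                continue
            meta = self.metadata[int(i)]
            try:
                r = requests.get(meta["url"], timeout=10)
                img = Image.open(io.BytesIO(r.content)).convert("RGB")
            except Exception:
                continue  # skip unreachable or broken urls
            img = img.resize((self.image_size, self.image_size))
            array = np.asarray(img, dtype=np.float32) / 255.0
            yield torch.from_numpy(array).permute(2, 0, 1), meta["caption"]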

consider implementing some more advanced features in back+front

ideas:

  • average of clip image+text
  • return arbitrary item from dataset (possible thanks to hdf5 collection)
  • safety filtering by using list of bad clip embeddings
  • do a dot product with a list of interesting clip embeddings, display these keywords as common attributes of items, and propose that the user add them to their query (a sketch of this follows below)
  • image+text index and search #40
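
For the dot-product idea, a minimal sketch assuming a precomputed bank of normalized keyword embeddings (names are illustrative):

import numpy as np

def tag_with_keywords(item_embedding, keyword_embeddings, keywords, threshold=0.25):
    # dot the item embedding against the keyword bank and keep the keywords
    # scoring above the threshold; all embeddings are assumed L2-normalized
    scores = keyword_embeddings @ item_embedding
    return [(kw, float(s)) for kw, s in zip(keywords, scores) if s > threshold]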

expand clip filter

  • also copy texts
  • add options to check image/text matching, e.g. --matching_threshold 0.2 (see the sketch below)

Also allow filtering by an image? By a set of images/texts?
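
A minimal sketch of what the --matching_threshold check could do, assuming the clip image and text embeddings are available for each pair (illustrative only):

import numpy as np

def matches(image_embedding, text_embedding, matching_threshold=0.2):
    # keep an image/text pair only if its clip cosine similarity clears the threshold
    img = image_embedding / np.linalg.norm(image_embedding)
    txt = text_embedding / np.linalg.norm(text_embedding)
    return float(img @ txt) >= matching_threshold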

Property post filtering in clip back

As mentioned in #37, querying a knn index with k=10^6 is pretty fast (50ms).
This property combined with a bitarray would work well. A bitarray takes only 1 bit per element, so that's 50MB for 400M elements. https://pypi.org/project/bitarray/ allows doing fast slices:

import time

from bitarray import bitarray

def timeit(f):
    # time a single call of f
    a = time.time()
    f()
    print(time.time() - a)

# one bit per element: 400M elements -> ~50MB of RAM
x = bitarray(400 * 10**6)

>>> timeit(lambda: [x[i] for i in range(10**6)])
0.05558180809020996

50ms for a slice of 10**6

This allows supporting efficient post filtering with binary properties such as:

  • is nsfw
  • has a large resolution
  • is a clipart
  • is a real image
  • ...

Each property takes 50MB, so this doesn't scale to properties with larger cardinality; it only works for set-based (binary) properties.

For a property with a cardinality of 2**32 (32 bits per element), 1.6GB of RAM would be needed, which is more costly.
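
A minimal sketch of combining a large-k knn search with such a bitarray, assuming one precomputed bitarray per binary property (names are illustrative):

import numpy as np

def filtered_search(index, query, is_nsfw, k=10**6, wanted=1000):
    # is_nsfw: a bitarray with one bit per indexed item (~50MB for 400M items);
    # search with a large k, then drop results whose nsfw bit is set
    query = np.asarray(query, dtype="float32").reshape(1, -1)
    distances, ids = index.search(query, k)
    kept = [(float(d), int(i)) for d, i in zip(distances[0], ids[0])
            if i >= 0 and not is_nsfw[int(i)]]
    return kept[:wanted]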

As part of this issue, implementing the bitarray solution should make things a bit more interesting in the front.

(and it should fix the current implementation of safe mode which is limited in number of results due to the slowness of slicing an hdf5 collection)

It may be interesting to first check the speed of hdf5 slicing for 2 cases:

  • with no compression (allows disabling the chunking of hdf5)
  • with fixed-size columns (i.e. ints), which may be faster than a string column

expose the querying features better

partially done in a7fb73f
missing:

  • documentation
  • exposing other features (metadata fetching in particular)
  • making metadata fetching faster
  • building examples/features to export large result sets faster (for example, if the result set is very large, go over the whole collection and export back to parquet in batches; see the sketch below)
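
A rough sketch of the batched export idea, assuming a fetch_metadata helper that returns a list of dicts for a batch of ids (pandas/pyarrow used here just for the parquet writing):

import pandas as pd

def export_results(ids, fetch_metadata, output_prefix, batch_size=100000):
    # write the result set to parquet one batch at a time,
    # so the whole result set never has to fit in memory
    for start in range(0, len(ids), batch_size):
        batch = fetch_metadata(ids[start:start + batch_size])
        pd.DataFrame(batch).to_parquet(f"{output_prefix}_{start // batch_size:05d}.parquet")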

How to setup grafana metrics

I tried to add the endpoint as a Prometheus data source. When setting access to Browser, I get this:
Access to fetch at 'http://splunk.vra.ro/metrics/api/v1/query?query=1%2B1&time=1634007938.867' from origin 'http://splunk.vra.ro:3000' has been blocked by CORS policy: No 'Access-Control-Allow-Origin' header is present on the requested resource. If an opaque response serves your needs, set the request's mode to 'no-cors' to fetch the resource with CORS disabled.

Setting access to Server does not do much; the data source seems inactive in Grafana.

Custom keys for wds image, text

Presently, the expectation is that your webdataset consists of tars containing filenames with the suffix "jpg" for an image and "txt" for your captions.

The "jpg" enforcement here prevents one from using other formats. For instance, I have prepared the DALLE blogpost image-text pairs using the script at https://github.com/robvanvolt/DALLE-models. They are almost all pngs with a few bitmaps I think.

As of when I ran that code, robvanvolt had specified the key names to be .img and .cap.

[screenshot of the key names used in that script]

It adds a bit of complexity for the user, and I think WebDataset should have very uniform expectations about the type of data you're working with. Is there any standard they have for good defaults on various dataset modalities?

Otherwise, you could have arguments for specifying the keys, similar to rob's implementation in DALLE-pytorch:

Beginner/Verbose:

clip-retrieval inference --input_dataset "https://www.dropbox.com/s/s3j0w2cj4obnlge/ds_000000.tar" \
    --output_folder="./dalle_blog_embeds" \
    --input_format="webdataset" \
    --image_key img --text_key cap

Bit more advanced:
Enable webdataset contingent upon either --webdataset/-wds being True or -wds=image_key,caption_key, and remove the --input_format option altogether.

clip-retrieval inference --input_dataset "https://www.dropbox.com/s/s3j0w2cj4obnlge/ds_000000.tar" \
    --output_folder="./dalle_blog_embeds" \
    --webdataset  # alone, just enables webdataset with the default `txt,jpg` keys

or

clip-retrieval inference --input_dataset "https://www.dropbox.com/s/s3j0w2cj4obnlge/ds_000000.tar" \
    --output_folder="./dalle_blog_embeds" \
    --webdataset img,cap  # specify comma-delimited image_key, then caption_key

They're both relatively easy to implement. I personally prefer the comma-delimited variety, but this codebase has the opportunity for very mainstream appeal, and it's worth considering that not everyone wants to deal with webdataset implementation details. Avoiding the need for such options entirely seems preferable.
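
On the webdataset side, a custom-key option could be as small as passing the keys through to to_tuple; a minimal sketch assuming the webdataset library (keys illustrative):

import webdataset as wds

image_key, caption_key = "img", "cap"  # would come from the proposed cli options

dataset = (
    wds.WebDataset("https://www.dropbox.com/s/s3j0w2cj4obnlge/ds_000000.tar")
    .decode("pil")                       # decodes png/jpg/bmp alike into PIL images
    .to_tuple(image_key, caption_key)    # select sample fields by their configured keys
)

for image, caption in dataset:
    pass  # feed into the clip inference batching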

optimized batch metadata fetching

hdf5 is fast, but maybe not optimal at random reads.
I tried leveldb and similar on-disk k/v stores; leveldb seems to be 3x faster than hdf5 at this while also compressing better, so it might be a good first step.

The ideal solution is to extract the inverted lists from the ivf index and use them to build an on-disk structure with the correct locality.
That should make it possible to avoid random reads and speed up metadata retrieval a lot.

see https://github.com/facebookresearch/faiss/wiki/Inverted-list-objects-and-scanners and https://github.com/facebookresearch/faiss/wiki/Indexes-that-do-not-fit-in-RAM on the topic
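
A rough sketch of extracting the inverted lists to get an id ordering with good locality, based on the faiss python API (file name illustrative):

import faiss
import numpy as np

index = faiss.read_index("image.index")
ivf = faiss.extract_index_ivf(index)
invlists = ivf.invlists

# ids sharing an inverted list are likely to be returned together by a search,
# so writing the metadata to disk in this order gives good read locality
ordered_ids = []
for list_no in range(invlists.nlist):
    size = invlists.list_size(list_no)
    if size == 0:
        continue
    ordered_ids.append(faiss.rev_swig_ptr(invlists.get_ids(list_no), size).copy())
ordered_ids = np.concatenate(ordered_ids)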

infinite scroll feature in front

Could be implemented by doing a new knn with a larger k and fetching the metadata only for the new slice

import time

import faiss
import numpy as np

def timeit(f):
    # time a single call of f
    a = time.time()
    f()
    print(time.time() - a)

# memory-map the on-disk index instead of loading the full 16GB into RAM
index = faiss.read_index("the_big_index_16GB/image.index", faiss.IO_FLAG_MMAP | faiss.IO_FLAG_READ_ONLY)

>>> timeit(lambda: index.search(np.ones((1, 512), 'float32'), 1))
0.013533353805541992
>>> timeit(lambda: index.search(np.ones((1, 512), 'float32'), 10))
0.012979269027709961
>>> timeit(lambda: index.search(np.ones((1, 512), 'float32'), 100))
0.013437271118164062
>>> timeit(lambda: index.search(np.ones((1, 512), 'float32'), 1000))
0.015572071075439453
>>> timeit(lambda: index.search(np.ones((1, 512), 'float32'), 10000))
0.01698613166809082
>>> timeit(lambda: index.search(np.ones((1, 512), 'float32'), 100000))
0.033286094665527344
>>> timeit(lambda: index.search(np.ones((1, 512), 'float32'), 1000000))
0.08424758911132812
>>> timeit(lambda: index.search(np.ones((1, 512), 'float32'), 10000000))
0.5176742076873779
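
Given those numbers, a minimal sketch of the scroll logic in the back: redo the search with a growing k and only fetch metadata for the slice that was not shown yet (the metadata_for helper is hypothetical):

def scroll(index, query, metadata_for, page, page_size=40):
    # page 0 returns results [0, 40), page 1 returns [40, 80), and so on;
    # the knn is redone each time but stays cheap even for a large k (see timings above)
    k = (page + 1) * page_size
    _, ids = index.search(query, k)
    new_ids = [int(i) for i in ids[0][page * page_size:k] if i >= 0]
    return metadata_for(new_ids)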

end2end test

url list -> screenshot of the front with checks along the way
can be done easily after #31 is done

`clip-retrieval filter` webdataset support

Having trouble debugging this, but after looking at the code briefly, I don't think there is any webdataset retrieval?

The root problem is that I am not getting any images saved from the query, even though it finds some in the output.

train a safety model and try integrating it at several places

Inputs can be the caption, the url, or the image; it is easier to use with only the url and the caption.
Using the clip embeddings as input is also an option and may be less costly (a sketch of this follows below).
The model could then be used:

  • quantized in the front directly
  • as a post filter in the back (called live or using a precomputed tag)
  • as a strict filter before indexation
  • as a weak filter by using the nsfw value as a bias component

Training data can be generated by:

  • using heuristics to find positive and negative examples in laion400m
  • labeling some data
  • finding existing nsfw dataset
  • using an nsfw API to generate labels

Evaluation data should be human annotated.
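
A minimal sketch of the clip-embedding option, assuming labeled embeddings are available (sklearn is used here purely as an illustration):

from sklearn.linear_model import LogisticRegression

def train_safety_model(embeddings, labels):
    # embeddings: (n, 512) float32 clip embeddings, labels: 1 = nsfw, 0 = safe
    model = LogisticRegression(max_iter=1000)
    model.fit(embeddings, labels)
    return model

# at query time the probability can act as a strict filter or as a bias component:
# nsfw_scores = model.predict_proba(result_embeddings)[:, 1]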

Images do not render on the clip front

Uncaught (in promise) TypeError: Cannot read properties of undefined (reading 'length')
    at HTMLElement.renderImage (clip-front.js:385)
    at clip-front.js:426
    at Array.map (<anonymous>)
    at HTMLElement.render (clip-front.js:426)
    at HTMLElement.update (lit-element.js:220)
    at HTMLElement.performUpdate (updating-element.js:555)
    at HTMLElement._enqueueUpdate (updating-element.js:508)

Build an end to end command

Url list
-> filtering (dedup)
-> downloading
-> clip inference
-> indexing
-> back + front (subprocess or host with back too)

clip-retrieval end2end <url list> <config.json>

It would start a prefect UI showing what's going on, with wandb links for each subtask.
Then, after a little while, it would start the back and front and open the demo in the browser.

Build it with prefect, use a good config framework (fromconfig ?)

It would be ideal to make it incremental and schedulable too.
Making it distributed could also be interesting, but is not necessary.
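
A rough sketch of how the prefect wiring could look (prefect 1.x style; the task names and bodies are placeholders, not the actual implementation):

from prefect import Flow, task

@task
def filter_urls(url_list_path):  # dedup / filtering
    ...

@task
def download(urls):  # fetch and resize the images
    ...

@task
def clip_inference(dataset_path):  # compute clip embeddings
    ...

@task
def build_index(embeddings_path):  # faiss indexing
    ...

with Flow("clip-retrieval-end2end") as flow:
    urls = filter_urls("urls.txt")
    dataset = download(urls)
    embeddings = clip_inference(dataset)
    index_path = build_index(embeddings)

# flow.run()  # or register it to follow progress in the prefect UI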
