rom1504 / clip-retrieval

Easily compute clip embeddings and build a clip retrieval system with them

Home Page: https://rom1504.github.io/clip-retrieval/

License: MIT License

Python 4.95% JavaScript 0.91% Dockerfile 0.01% Jupyter Notebook 94.03% Shell 0.03% HTML 0.02% Makefile 0.05%

semantic-search deep-learning multimodal ai clip knn

clip-retrieval's Introduction

👋

Hi, I'm Romain Beaumont aka rom1504. I build and deploy ML infra to solve important problems.

Recent work:

LLMs at YouTube
KNNs at Criteo
Laion5B and OpenClip in open source ML
Mineflayer and PrismarineJS in open source javascript

🗒 Blog posts

Semantic search at billions scale
Semantic search with embeddings: index anything - Building scalable semantic retrieval from image, text, graph, and interaction data
Image embeddings - Image similarity and building embeddings with modern computer vision
Learning computer vision - A short introduction to computer vision

Blog posts with laion

Selection of papers

Google scholar

clip-retrieval's People

johndpope afiaka87 thearmagan pvl rvencu talhaz aleksiknuutila borisdayma techthiyanes dzungdk christophschuhmann philipuss1 kapitsa2811 randomwalker300 haoyusoong theocoombes whitefu eleazhong dashstander boytjj cansakirt pocoq xpatronum dmvaldman cat-state vanga nashid peternara derek-k hwijune mcullan henrywoo charleoy jangocheng tonyzhanghm sizappaaigwat tcl9876 machingwen jdagdelen ggoggam marcus-arcadius d0mih un1tz3r0 imagr-ltd linhanxiao dmarx resloved epoz pkiage skallumadi kornesh codeaudit zetimente imclab guillaumegenthial b1sounours nousr koke2c95 grexzen pondatomo marbaws krasserm deyh2020 barnehemia dineshkumares nielsrolf kandy22 sterlingjosh lizhiustc zhilizju moerehman sonicviz jhailos chaoso ablattmann pysync varadgunjal drewwalkup laion-ai anminhhung researchoor hittrakkz afrendeiro hadryan nathanleclaire sermonlizhi nopeanuts edangomez maxwelljones14 gaominghao0201 natyren andre-beautrait beautrait andreasmhahn amankishore geocine 0x1355 delanduer ukaserge akamil-etsy

clip-retrieval's Issues

Generating prompts from an image (VQGAN)

I'm investigating how to generate prompts from an image
There were some instructions on using encode_text + encode_image with faiss here:
mlfoundations/open_clip#1

I was digging through github and came across this codebase.
Does this repo do that? It doesn't seem so. Could it?

Basically want to throw conceptual captions + be able to introspect an image for captions.
https://github.com/google-research-datasets/conceptual-12m

consider packaging the front in python cli

advanced algebra mode

building queries out of several embeddings works well
could be interesting to add an algebra mode in the front
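
A minimal sketch of what such an algebra mode could compute, assuming the query parts are CLIP embeddings held as numpy arrays. `combine_embeddings` and the toy vectors are hypothetical illustrations, not part of clip-retrieval:

```python
import numpy as np

def combine_embeddings(embeddings, weights):
    """Return the weighted sum of the embeddings, re-normalized to unit length."""
    stacked = np.asarray(embeddings, dtype="float32")
    w = np.asarray(weights, dtype="float32").reshape(-1, 1)
    combined = (stacked * w).sum(axis=0)
    norm = np.linalg.norm(combined)
    if norm == 0:
        raise ValueError("weights cancel out: combined embedding is zero")
    return combined / norm

# Toy 4-dimensional vectors standing in for real CLIP embeddings.
cat = np.array([1.0, 0.0, 0.0, 0.0], dtype="float32")
blue = np.array([0.0, 1.0, 0.0, 0.0], dtype="float32")
red = np.array([0.0, 0.0, 1.0, 0.0], dtype="float32")

# "cat + blue - red" expressed as weights on the three embeddings.
query = combine_embeddings([cat, blue, red], [1.0, 1.0, -1.0])
```

The resulting `query` can be sent to the knn index like any single embedding; negative weights push results away from a concept.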

implement drag and drop of image in front

Run it on some known datasets

https://github.com/robvanvolt/DALLE-datasets
imagenet1k
imagenet21k
coco

store output somewhere convenient (kaggle, huggingface?)

add an example on how to expose the back from colab in the notebook

similar to https://github.com/rom1504/dalle-service/blob/master/dalle_back.ipynb
that seems to be something users want to try (see #42)

implement block listing on text and image embeddings with a provided list of keywords

use that + reconstruct + raw text filtering + larger queries

Expose the knn index abstraction in a cleaner way in python without flask

Would make it more natural to use for applications other than the back

Images are not rendering in the clip-retrieval front UI

The images are being fetched and KNN has been performed on the data. It can be seen on the console that the similar images are fetched as an array object. However, the images are not being displayed on the front end.
Is there any solution to this issue?

consider using 00001 format and/or double check sorting of files in autofaiss vs reading of parquet here

explore solutions to decrease memory usage of the metadata

today the index is using a small amount of memory compared to the metadata.
This will not scale

Options:

loading in memory only what's strictly necessary (urls + captions). Better but might not be enough
explore on-disk solutions
- memory mapped file (arrow ? numpy ?)
- sqlite ?

I think sqlite might be the best solution, but speed needs to be checked.
Other solutions can also be considered, like some on disk kv store as the only need is to go from incremental index to value
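
A minimal sketch of the sqlite option, assuming the only need is to map the incremental index position to its url and caption. The table layout and the function names are hypothetical, and the speed still has to be measured on real data:

```python
import sqlite3

def build_metadata_db(path, rows):
    """Store (url, caption) rows keyed by their position in the knn index."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS metadata "
        "(id INTEGER PRIMARY KEY, url TEXT, caption TEXT)"
    )
    conn.executemany(
        "INSERT INTO metadata (id, url, caption) VALUES (?, ?, ?)",
        ((i, url, caption) for i, (url, caption) in enumerate(rows)),
    )
    conn.commit()
    return conn

def fetch_metadata(conn, ids):
    """Return (url, caption) for each id, in the order of the knn results."""
    ids = [int(i) for i in ids]  # plain ints, in case ids come from numpy
    if not ids:
        return []
    # Note: sqlite limits the number of "?" parameters per statement,
    # so very large id lists would need to be fetched in chunks.
    placeholders = ",".join("?" * len(ids))
    cursor = conn.execute(
        f"SELECT id, url, caption FROM metadata WHERE id IN ({placeholders})",
        ids,
    )
    by_id = {row[0]: (row[1], row[2]) for row in cursor}
    return [by_id[i] for i in ids]

# In-memory database for illustration; a real setup would use a file on disk.
conn = build_metadata_db(":memory:", [("u0", "c0"), ("u1", "c1"), ("u2", "c2")])
rows = fetch_metadata(conn, [2, 0])
```

Only the rows for the requested ids are read, so memory use stays independent of the dataset size.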

Simple improvements to front

image search
exploration by clicking on image/text icon next to each result item

Implement a pytorch dataloader that filters and downloads at run time

This is an online version of #31.
Run the whole pipeline not as a big batch job, but instead as a data loader that does the following:

query/filter in a knn index + metadata structure
download
resize
give to training

It makes sense in particular when the model training speed is low. For example dalle is such a model.
For clip it could make less sense

it could be a lot more convenient than downloading TB of webdataset if it works:

download a 16GB knn index and 50GB of metadata
write your best keywords and how much of each you'd like (with clip thresholds)
start the training on up to 400M samples
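
The query, download, resize, train steps above can be sketched as a plain Python generator. This is only an illustration under stated assumptions: every callable passed in (`query_ids`, `lookup_url`, `download`, `resize`) is a hypothetical stand-in, and the fake download below does no network access:

```python
from typing import Callable, Iterable, Iterator, Tuple

def streaming_samples(
    query_ids: Callable[[], Iterable[int]],
    lookup_url: Callable[[int], str],
    download: Callable[[str], bytes],
    resize: Callable[[bytes], bytes],
) -> Iterator[Tuple[int, bytes]]:
    """Yield (id, image bytes) lazily: query/filter, download, resize."""
    for idx in query_ids():
        url = lookup_url(idx)
        try:
            raw = download(url)
        except OSError:
            continue  # skip images that cannot be fetched
        yield idx, resize(raw)

# Stand-ins for the knn index + metadata, the downloader and the resizer.
urls = {
    0: "http://example.com/a.jpg",
    1: "http://example.com/broken.jpg",
    2: "http://example.com/c.jpg",
}

def fake_download(url):
    if "broken" in url:
        raise OSError("unreachable")
    return url.encode("utf-8")

samples = list(
    streaming_samples(
        query_ids=lambda: [0, 1, 2],
        lookup_url=urls.__getitem__,
        download=fake_download,
        resize=lambda raw: raw[:10],
    )
)
```

In pytorch, a generator like this could be wrapped in a `torch.utils.data.IterableDataset` so that samples are produced while training runs.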

consider implementing some more advanced features in back+front

ideas:

average of clip image+text
return arbitrary item from dataset (possible thanks to hdf5 collection)
safety filtering by using list of bad clip embeddings
do a dot product with a list of interesting clip embeddings, display the matching keywords as common attributes of the items, and offer to add them to the user's query
image+text index and search #40

Consider exposing the parquet to hdf5 transformer somewhere

might be good to avoid running that when starting the back the first time
maybe at the end of clip index ?

OpenAI DALL-E Blog Proof of Concept

This project works very well! I realize this example is perhaps a bit contrived but it's actually quite useful to be able to do a fuzzy search on these and you can even find generations that just look very similar to the caption you enter and get working samples sometimes. Cool stuff.

https://gist.github.com/afiaka87/f662486fc45199fa4394f3456c8246d7#file-dalle_blog_semantic_search-ipynb

run it on 100M samples to confirm scalability

Add feature to build an image+text index in clip index and to use it in clip back

it can then be used to do either:

image+text query by concatenating the query
image query by making a query with image twice (it would do image-image and image-text search at the same time)
text query by making a query with text twice

it might be better than one modality only search
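
A small numerical check of why the "query twice" trick works, assuming the index stores the concatenation of each item's image and text embedding and ranks by inner product. The random vectors below only stand in for real CLIP embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(x):
    """Scale vectors to unit length along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

dim = 8
image_emb = normalize(rng.normal(size=(5, dim)))
text_emb = normalize(rng.normal(size=(5, dim)))

# One indexed vector per item: [image embedding ; text embedding].
index_vectors = np.concatenate([image_emb, text_emb], axis=1)

# Image query: the same image embedding repeated twice.
query_image = normalize(rng.normal(size=(dim,)))
query = np.concatenate([query_image, query_image])

# The inner product splits into image-image plus image-text similarity.
scores = index_vectors @ query
expected = image_emb @ query_image + text_emb @ query_image
```

Here `scores` and `expected` agree up to floating point error, so one search ranks items by the sum of both similarities.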

support webdataset

to make it easy to use dataset coming out of https://github.com/rom1504/img2dataset
handle ordering/keys properly

Package the front in the python package as a clip-retrieval front

keep or not the npm packaging, but it would be quite convenient if it was also bundled in the python package
it would make it easy to make clip-retrieval end2end do stuff then boot a frontend with no additional user manipulation
related #31

expand clip filter

also copy texts
add options to check image/text matching --matching_threshold 0.2

also allow filtering by an image ? a set of image/text ?

Property post filtering in clip back

As mentioned in #37, querying a knn index with k=10^6 is pretty fast (50ms).
This property combined with a bitarray would work well. A bit array takes only 1 bit per element, so that's 50MB for 400M elements. https://pypi.org/project/bitarray/ allows doing fast slices:

import time
def timeit(f):
  a = time.time()
  f()
  print(time.time() - a)

from bitarray import bitarray
x = bitarray(400*10**6)

>>> timeit(lambda:[x[i] for i in range(10**6)])
0.05558180809020996

50ms for a slice of 10**6

This allows supporting efficient post filtering with binary properties such as:

is nsfw
has a large resolution
is a clipart
is a real image
...
Each property takes 50MB, so this approach suits binary (set membership) properties but does not scale to properties with a larger cardinality.

For properties with a cardinality of 2**32 (32 bits per element), 1.6GB of RAM would be needed, which is more costly.

As part of this issue, implementing the bitarray solution should make things a bit more interesting in the front.

(and it should fix the current implementation of safe mode which is limited in number of results due to the slowness of slicing an hdf5 collection)

It may be interesting to first check the speed of hdf5 slicing for 2 cases:

with no compression (which allows disabling the chunking of hdf5)
with fixed size columns (i.e. ints), which may be faster than a string column
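
A minimal sketch of the post filtering idea. To stay dependency free it uses a plain `bytearray` as the bit set instead of the bitarray package mentioned above; `BitFlags` and `post_filter` are hypothetical names:

```python
class BitFlags:
    """One bit per dataset item, packed into a bytearray."""

    def __init__(self, size):
        self.size = size
        self._bits = bytearray((size + 7) // 8)

    def set(self, i, value=True):
        byte, bit = divmod(i, 8)
        if value:
            self._bits[byte] |= 1 << bit
        else:
            self._bits[byte] &= ~(1 << bit) & 0xFF

    def get(self, i):
        byte, bit = divmod(i, 8)
        return bool((self._bits[byte] >> bit) & 1)

def post_filter(result_ids, flags, keep_if=False, limit=None):
    """Keep knn results whose flag equals keep_if, preserving the ranking."""
    kept = [i for i in result_ids if flags.get(i) == keep_if]
    return kept if limit is None else kept[:limit]

# Mark two items as nsfw, then filter a ranked list of knn result ids.
is_nsfw = BitFlags(1000)
is_nsfw.set(3)
is_nsfw.set(7)

knn_ids = [7, 1, 3, 9, 4]
safe_ids = post_filter(knn_ids, is_nsfw, keep_if=False, limit=3)
```

The idea is to query the index with a large k, then apply one or more such filters before fetching metadata for the surviving ids.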

expose the querying features better

partially done in a7fb73f
missing:

documentation
exposing other features (metadata fetching in particular)
making metadata fetching faster
building examples / features to export the large result set faster (for example if the result set is very large go over the whole collection and export back to parquet in batches)

How to setup grafana metrics

I tried to add the endpoint as a source in a Prometheus data source. When setting access to Browser I get this:
Access to fetch at 'http://splunk.vra.ro/metrics/api/v1/query?query=1%2B1&time=1634007938.867' from origin 'http://splunk.vra.ro:3000' has been blocked by CORS policy: No 'Access-Control-Allow-Origin' header is present on the requested resource. If an opaque response serves your needs, set the request's mode to 'no-cors' to fetch the resource with CORS disabled.

Setting to Server does not do much. The data source seems inactive in Grafana

handle exception in files reader

see #13

Finish

Custom keys for wds image, text

Presently, the expectation is that your webdataset consists of tars containing filenames with the suffix "jpg" for an image and "txt" for your captions.

The "jpg" enforcement here prevents one from using other formats. For instance, I have prepared the DALLE blogpost image-text pairs using the script at https://github.com/robvanvolt/DALLE-models. They are almost all pngs with a few bitmaps I think.

When I ran that code, robvanvolt had specified the key names to be .img and .cap.

It adds a bit of complexity for the user - and I think WebDataset should have very uniform expectations on the type of data you're working with. Is there any sort of standard they have for good defaults on various dataset modalities?

Otherwise, you could have arguments for specifying the keys, similar to rob's implementation in DALLE-pytorch:

Beginner/Verbose:

clip-retrieval inference --input_dataset "https://www.dropbox.com/s/s3j0w2cj4obnlge/ds_000000.tar" \
    --output_folder="./dalle_blog_embeds" \
    --input_format="webdataset" \
    --image_key img --text_key cap

Bit more advanced:
Enable webdataset contingent upon either --webdataset/-wds being set or -wds=image_key,caption_key, and remove the --input_format option altogether.

clip-retrieval inference --input_dataset "https://www.dropbox.com/s/s3j0w2cj4obnlge/ds_000000.tar" \
    --output_folder="./dalle_blog_embeds" \
    --webdataset # alone just enables with default `txt,jpg`

# specify comma-delimited image_key, then caption_key
clip-retrieval inference --input_dataset "https://www.dropbox.com/s/s3j0w2cj4obnlge/ds_000000.tar" \
    --output_folder="./dalle_blog_embeds" \
    --webdataset img,cap

They're both relatively easy to implement. I personally prefer the comma-delimited variety but I think this codebase has opportunity for very mainstream appeal and it's perhaps worthwhile to consider that not everyone wants to deal with webdataset implementation details. Avoiding the need for such options completely seems preferable.
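
A minimal sketch of how the comma-delimited variant could be parsed with argparse. This illustrates the proposal only and is not the actual clip-retrieval CLI; `parse_webdataset_keys` is a hypothetical helper, and the default is written as `jpg,txt` to follow the image-then-caption order:

```python
import argparse

def parse_webdataset_keys(argv):
    """Return (image_key, caption_key), or None when webdataset is disabled."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--webdataset",
        "-wds",
        nargs="?",        # the value is optional
        const="jpg,txt",  # used when the flag is given without a value
        default=None,     # used when the flag is absent
        help="enable webdataset input, optionally as image_key,caption_key",
    )
    args = parser.parse_args(argv)
    if args.webdataset is None:
        return None
    parts = args.webdataset.split(",")
    if len(parts) != 2 or not all(parts):
        parser.error("--webdataset expects image_key,caption_key")
    return parts[0], parts[1]

print(parse_webdataset_keys([]))
print(parse_webdataset_keys(["--webdataset"]))
print(parse_webdataset_keys(["--webdataset", "img,cap"]))
```

A single optional-value flag covers all three cases (off, defaults, custom keys) without a separate --input_format option.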

add a proper config file to clip back to handle the configuration per index

the current setup mixes global and per index options
let's store everything in a config file with explicit information
also have separate metadata sections to be able to configure what kind of metadata db is used, with what ordering, what mapping file ...

consider adding an option in the back to fetch images in the back

bypass problem of images that cannot be fetched from a browser

optimized batch metadata fetching

hdf5 is fast but maybe not optimal at doing random read
I tried leveldb and similar on disk k/v stores, leveldb seems to be 3x faster than hdf5 at this while increasing the compression, so it might be a good first step

the ideal solution is to extract the inverted list from the ivf index and use that to build an on disk structure with the correct locality
that should make it possible to avoid random reads and speed up metadata retrieval a lot

see https://github.com/facebookresearch/faiss/wiki/Inverted-list-objects-and-scanners and https://github.com/facebookresearch/faiss/wiki/Indexes-that-do-not-fit-in-RAM on the topic

infinite scroll feature in front

Could be implemented by doing a new knn with a larger k and fetching the metadata only for the new slice

import time

import faiss
import numpy as np

index = faiss.read_index("the_big_index_16GB/image.index", faiss.IO_FLAG_MMAP|faiss.IO_FLAG_READ_ONLY)
def timeit(f):
  a = time.time()
  f()
  print(time.time() - a)
>>> timeit(lambda:index.search(np.ones((1, 512), 'float32'), 1))
0.013533353805541992
>>> timeit(lambda:index.search(np.ones((1, 512), 'float32'), 10))
0.012979269027709961
>>> timeit(lambda:index.search(np.ones((1, 512), 'float32'), 100))
0.013437271118164062
>>> timeit(lambda:index.search(np.ones((1, 512), 'float32'), 1000))
0.015572071075439453
>>> timeit(lambda:index.search(np.ones((1, 512), 'float32'), 10000))
0.01698613166809082
>>> timeit(lambda:index.search(np.ones((1, 512), 'float32'), 100000))
0.033286094665527344
>>> timeit(lambda:index.search(np.ones((1, 512), 'float32'), 1000000))
0.08424758911132812
>>> timeit(lambda:index.search(np.ones((1, 512), 'float32'), 10000000))
0.5176742076873779
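
Given those timings, the paging logic could be sketched as follows. `search` and `fetch_metadata` are hypothetical stand-ins for the knn index and the metadata store, and the fake data is only there to make the example runnable:

```python
def next_page(search, fetch_metadata, query, already_shown, page_size):
    """Redo the knn with a larger k, fetch metadata only for the new slice."""
    k = already_shown + page_size
    ids = search(query, k)          # full ranked list of size k
    new_ids = ids[already_shown:k]  # results not displayed yet
    return [fetch_metadata(i) for i in new_ids]

# Stand-ins: with faiss, search could wrap index.search(query, k), which
# returns (distances, ids), and keep the first row of ids.
ranked_ids = list(range(100, 200))

def fake_search(query, k):
    return ranked_ids[:k]

def fake_fetch_metadata(i):
    return {"id": i}

page_1 = next_page(fake_search, fake_fetch_metadata, "blue cat", 0, 10)
page_2 = next_page(fake_search, fake_fetch_metadata, "blue cat", 10, 10)
```

Since the search cost grows slowly with k, the expensive part (metadata fetching) stays proportional to the page size.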

add white border before doing inference in clip back to be iso with indices

end2end test

url list -> screenshot of the front with checks along the way
can be done easily after #31 is done

Consider adding a feature to check how many relevant results are available

Do it by doing a query with a big k then counting how many results are above a given threshold

Check if it's cheap to do.

Would answer the question "how many blue cats are in this dataset"
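
A minimal sketch of the counting step, assuming the similarities come from a single query with a big k. `count_relevant` is a hypothetical helper, and the example values are made up:

```python
def count_relevant(similarities, threshold):
    """Count results at or above the threshold.

    Also report whether every returned result passed, in which case the
    true count may be larger than k and the query should use a bigger k.
    """
    sims = list(similarities)
    count = sum(1 for s in sims if s >= threshold)
    saturated = len(sims) > 0 and count == len(sims)
    return count, saturated

# Similarities as they could come back for "blue cat" with a big k.
count, saturated = count_relevant([0.9, 0.5, 0.31, 0.2, 0.1], threshold=0.3)
```

When `saturated` is true the count is only a lower bound, since all k results were above the threshold.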

add option to have either a select or a custom input for the backend choices

useful to have several default endpoints
I guess you'd be interested in implementing that @rvencu, so we can add your endpoint by default and still keep mine

I'm thinking of a select box that gets replaced by a text input if a checkbox next to it is checked

`clip-retrieval filter` webdataset support

I'm having trouble debugging this, but after looking at the code briefly, I don't think there is any webdataset retrieval?

The root problem is that I am not getting any images saved from the query, even though it finds some in the output.

Make front panel collapsible in front

How to create indices.json from index file ?

echo '{"example_index": "output_folder"}' > indices_paths.json
This command is not working.
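
One possible cause is shell quoting: some shells, such as the Windows command prompt, do not treat single quotes as quoting characters, which leaves invalid JSON in the file. As a shell-independent alternative, the same file can be written from Python. The index name and folder below are the ones from the command above, and `write_indices_paths` is a hypothetical helper:

```python
import json

def write_indices_paths(path, indices):
    """Write the index name -> index folder mapping as JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(indices, f, indent=2)

write_indices_paths("indices_paths.json", {"example_index": "output_folder"})
```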

make clip batch save in batch

so it works for more data
also use autofaiss to make the indices work for lot of data

train a safety model and try integrating it at several places

inputs can be the caption, the url or the image. Easier to use if only the url and the caption.
Using the clip embeddings as input is also an option and may be less costly to use.
The model could then be used either:

quantized in the front directly
as a post filter in the back (called live or using a precomputed tag)
as a strict filter before indexation
as a weak filter by using the nsfw value as a bias component

Training data can be generated by:

using heuristics to find positive and negative examples in laion400m
labeling some data
finding existing nsfw dataset
using nsfw API to generate labels
Evaluation data should be human annotated

add choice between json/csv/parquet to download subsets

front+back

and maybe point people to img2dataset to download pictures out of it

Images are not rendering on clip front

Uncaught (in promise) TypeError: Cannot read properties of undefined (reading 'length')
    at HTMLElement.renderImage (clip-front.js:385)
    at clip-front.js:426
    at Array.map (<anonymous>)
    at HTMLElement.render (clip-front.js:426)
    at HTMLElement.update (lit-element.js:220)
    at HTMLElement.performUpdate (updating-element.js:555)
    at HTMLElement._enqueueUpdate (updating-element.js:508)

use autofaiss

implement backend fallback in the front

if the one specified doesn't work, fallback to the default

Build an end to end command

Url list
-> filtering (dedup)
-> downloading
-> clip inference
-> indexing
-> back + front (subprocess or host with back too)

clip-retrieval end2end <url list> <config.json>

It would start a prefect UI with what's going on and wandb links for each subtask
Then after a small while, it will start the back and front and open the demo in the browser

Build it with prefect, use a good config framework (fromconfig ?)

Would be ideal to make it incremental and schedulable too.
Making it distributed potentially could also be interesting but not necessary.