Presently, the expectation is that your webdataset consists of tars whose files use the suffix "jpg" for images and "txt" for captions.
The "jpg" enforcement here prevents one from using other formats. For instance, I have prepared the DALLE blogpost image-text pairs using the script at https://github.com/robvanvolt/DALLE-models. They are almost all pngs with a few bitmaps I think.
As of when I ran that code, robvanvolt had specified the key names to be `.img` and `.cap`.
It adds a bit of complexity for the user - and I think WebDataset should have very uniform expectations on the type of data you're working with. Is there any sort of standard they have for good defaults on various dataset modalities?
Otherwise, you could have arguments for specifying the keys, similar to rob's implementation in DALLE-pytorch:
Beginner/Verbose:
clip-retrieval inference --input_dataset "https://www.dropbox.com/s/s3j0w2cj4obnlge/ds_000000.tar" \
--output_folder="./dalle_blog_embeds" \
--input_format="webdataset" \
--image_key img --text_key cap
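To make the idea concrete, here is a rough sketch of what configurable keys could look like on the reading side. It uses only the standard-library `tarfile` module rather than the webdataset package, and `iter_pairs`, `image_key`, and `text_key` are hypothetical names for illustration, not anything that exists in clip-retrieval today. It assumes the usual WebDataset layout where files belonging to one sample share a basename and sit next to each other in the tar.

```python
import io
import tarfile


def iter_pairs(tar_file, image_key="jpg", text_key="txt"):
    """Yield (sample_name, image_bytes, caption) from a WebDataset-style tar.

    Hypothetical helper: the extension after the first dot of the basename
    is treated as the key, and consecutive files with the same stem are
    grouped into one sample. Samples missing either key are skipped.
    """
    current, sample = None, {}
    with tarfile.open(fileobj=tar_file, mode="r:*") as tar:
        for member in tar:
            if not member.isfile():
                continue
            directory, _, base = member.name.rpartition("/")
            stem, _, key = base.partition(".")
            name = f"{directory}/{stem}" if directory else stem
            if name != current:
                # A new sample starts: emit the previous one if complete.
                if image_key in sample and text_key in sample:
                    yield current, sample[image_key], sample[text_key].decode("utf-8")
                current, sample = name, {}
            sample[key] = tar.extractfile(member).read()
    # Emit the final sample, if complete.
    if image_key in sample and text_key in sample:
        yield current, sample[image_key], sample[text_key].decode("utf-8")
```

With a tar prepared by the DALLE-models script, this would be called as `iter_pairs(open("ds_000000.tar", "rb"), image_key="img", text_key="cap")`, while the defaults keep today's jpg/txt behaviour.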
Bit more advanced:
Enable webdataset when either `--webdataset`/`-wds` is passed on its own, or when `-wds=image_key,caption_key` is given, and remove the `--input_format` option altogether.
clip-retrieval inference --input_dataset "https://www.dropbox.com/s/s3j0w2cj4obnlge/ds_000000.tar" \
--output_folder="./dalle_blog_embeds" \
--webdataset # alone just enables with default `jpg,txt`
or
clip-retrieval inference --input_dataset "https://www.dropbox.com/s/s3j0w2cj4obnlge/ds_000000.tar" \
--output_folder="./dalle_blog_embeds" \
--webdataset img,cap
# specify comma-delimited image_key, then caption_key
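For the comma-delimited variant, the parsing could look roughly like the sketch below. `argparse` is used purely for illustration and may not match how clip-retrieval's CLI is actually built; `parse_wds_keys`, `build_parser`, and the `--webdataset`/`-wds` flag itself are the proposal from this comment, not existing options.

```python
import argparse

# Assumed default order: image key first, then caption key.
DEFAULT_KEYS = ("jpg", "txt")


def parse_wds_keys(value):
    """Turn 'img,cap' (or '.img,.cap') into ('img', 'cap')."""
    parts = [p.strip().lstrip(".") for p in value.split(",")]
    if len(parts) != 2 or not all(parts):
        raise argparse.ArgumentTypeError(
            "expected 'image_key,caption_key', e.g. 'img,cap'"
        )
    return tuple(parts)


def build_parser():
    parser = argparse.ArgumentParser(prog="clip-retrieval-sketch")
    parser.add_argument(
        "--webdataset", "-wds",
        nargs="?",               # the flag may appear with or without a value
        type=parse_wds_keys,
        const=DEFAULT_KEYS,      # used when the flag is passed alone
        default=None,            # flag absent: webdataset mode stays off
        metavar="IMAGE_KEY,CAPTION_KEY",
    )
    return parser
```

Here `None` means the input is not a webdataset, a bare `--webdataset` falls back to the defaults, and `--webdataset img,cap` overrides both keys at once.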
They're both relatively easy to implement. I personally prefer the comma-delimited variety but I think this codebase has opportunity for very mainstream appeal and it's perhaps worthwhile to consider that not everyone wants to deal with webdataset implementation details. Avoiding the need for such options completely seems preferable.