VCLIP extends OpenAI's CLIP for variational inference and was fine-tuned on a subset of Conceptual Captions. This repo contains a simple implementation and a link to the pretrained weights; the implementation builds on HuggingFace's FlaxCLIPModel.
Pretrained weights (Google Cloud Storage).
For each text prompt, VCLIP predicts a Gaussian distribution over image embeddings rather than a single point. The similarity between a (text, image) pair is the Gaussian probability density of the image embedding under that distribution, rather than the cosine similarity used by CLIP.
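The contrast between the two similarity functions can be sketched as follows. This is a minimal NumPy illustration, not the repo's actual code: it assumes the text encoder outputs a mean and a per-dimension log-variance (i.e. a diagonal covariance, which is an assumption here), while the image encoder outputs a single point embedding.

```python
import numpy as np

def cosine_similarity(text_emb, img_emb):
    # Standard CLIP similarity: cosine between point embeddings.
    return float(np.dot(text_emb, img_emb)
                 / (np.linalg.norm(text_emb) * np.linalg.norm(img_emb)))

def gaussian_log_density(img_emb, mu, log_var):
    # VCLIP-style similarity (sketch): log-density of the image embedding
    # under the prompt's Gaussian N(mu, diag(exp(log_var))).
    var = np.exp(log_var)
    return float(-0.5 * np.sum(
        log_var + np.log(2 * np.pi) + (img_emb - mu) ** 2 / var))
```

In this sketch, ranking images for a prompt means ranking them by `gaussian_log_density` under that prompt's predicted distribution; the log-variance term lets the model express per-prompt uncertainty, which a single cosine score cannot.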