An unofficial implementation of Splice ("Splicing ViT Features for Semantic Appearance Transfer", CVPR 2022 Oral).
This code was developed with Python 3.8, PyTorch 1.13.0, torchvision 0.14.0, and CUDA 11.6.
In addition to DINO, we also investigate a weakly supervised ViT (CLIP) through feature inversion.
Please download the pre-trained DINO-ViT and (optionally) CLIP-ViT weights from here and place them in `./checkpoints`.
There are two methods for feature inversion:
- solely optimizing the image pixels
- optimizing the weights of a CNN that maps a fixed random noise tensor z to an output image.
Note that solely optimizing the image pixels is insufficient for converging to a meaningful result, as mentioned in the paper.
Run example:

```bash
python inversion.py --target=./data/0001.png --use_cnn=True --inv_type=cls --layer=11
```
Run `python inversion.py --help` for more details.
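For reference, the CNN-based inversion described above boils down to fitting a small generator so that the ViT features of its output match those of the target image. The snippet below is a minimal sketch only: `extract_cls` (a frozen ViT feature hook), `PixelGenerator`, and all hyperparameters are illustrative assumptions, not this repository's actual API.

```python
# Minimal sketch of CNN-based feature inversion.
# `extract_cls` is assumed to return the [CLS] token of a frozen ViT at a chosen
# layer for a (1, 3, H, W) image; it is a hypothetical helper, not the repo's API.
import torch
import torch.nn as nn

class PixelGenerator(nn.Module):
    """Tiny CNN that maps a fixed noise code z to an RGB image of the same size."""
    def __init__(self, z_channels=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(z_channels, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, 3, padding=1), nn.Sigmoid(),  # pixels in [0, 1]
        )

    def forward(self, z):
        return self.net(z)

def invert(target_img, extract_cls, steps=1000, lr=1e-3, device="cuda"):
    # Fixed random noise z; only the generator's weights are optimized.
    z = torch.randn(1, 64, target_img.shape[-2], target_img.shape[-1], device=device)
    gen = PixelGenerator().to(device)
    opt = torch.optim.Adam(gen.parameters(), lr=lr)
    with torch.no_grad():
        target_feat = extract_cls(target_img.to(device))   # features to be inverted
    for _ in range(steps):
        out = gen(z)
        loss = torch.nn.functional.mse_loss(extract_cls(out), target_feat)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return gen(z).detach()
```

Optimizing the generator's weights rather than raw pixels acts as an implicit image prior, which is presumably why the pixel-only variant struggles to converge to a meaningful result.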
PCA visualization of the ViT facets' self-similarity. The leading components mostly capture semantic scene/object parts while discarding appearance information.
Run example:

```bash
python pca_visualization.py --image_path=./data/0001.png --facets=k,t --layers=2,5,11
```
Run `python pca_visualization.py --help` for more details.
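Conceptually, this visualization is PCA over the rows of a patch-token self-similarity matrix. The sketch below assumes a hypothetical helper `extract_keys` that returns the tokens of one facet/layer as a `(num_patches, dim)` tensor with `num_patches = h * w`; the helper name and shapes are assumptions, not the script's actual interface.

```python
# Minimal sketch of the self-similarity PCA visualization.
import torch

def self_similarity_pca(keys, h, w, n_components=3):
    """PCA over the rows of the patch-token self-similarity matrix."""
    keys = torch.nn.functional.normalize(keys, dim=-1)  # cosine similarity via normalized dot products
    sim = keys @ keys.T                                  # (N, N) self-similarity, N = h * w patches
    sim = sim - sim.mean(dim=0, keepdim=True)            # center before PCA
    _, _, v = torch.pca_lowrank(sim, q=n_components)     # leading principal directions
    components = sim @ v                                  # project each patch's similarity row
    return components.reshape(h, w, n_components)         # spatial map of the leading components
```

Rendering each of the leading components as a grayscale map (or stacking the first three as RGB) gives the qualitative behavior described above: parts of objects group together regardless of their appearance.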
By default, we use the pretrained ViT (dino-base-8) as the feature extractor. Given a source image and a target image, run the following command for training:
```bash
python splice.py --target=[TARGET_IMAGE_PATH] --source=[SOURCE_IMAGE_PATH]
```
After training, run inference with the saved generator checkpoint:

```bash
python inference.py --ckpt=[GENERATOR_CHECKPOINT_FILEPATH] --source=[INPUT_IMAGE_FILEPATH] --output=[OUTPUT_DIRECTORY] --name=[OUTPUT_IMAGE_NAME]
```
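For orientation, the default backbone (DINO ViT-B/8) can also be fetched directly from torch.hub as sketched below. This is only an illustration of loading the extractor and reading its [CLS] feature; it is not this repository's loading code, which expects the downloaded weights under `./checkpoints`.

```python
# Minimal sketch: load DINO ViT-B/8 from torch.hub and extract its [CLS] feature.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
dino = torch.hub.load("facebookresearch/dino:main", "dino_vitb8")  # pretrained DINO ViT-B/8
dino.eval().to(device)

with torch.no_grad():
    img = torch.rand(1, 3, 224, 224, device=device)  # stand-in for a preprocessed, normalized image
    cls_token = dino(img)                            # (1, 768) [CLS] feature of the last layer
```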