Towards Visual Feature Translation

This is the project page of our paper:

"Towards Visual Feature Translation." Hu, J., Ji, R., Liu, H., Zhang, S., Deng, C., & Tian, Q. In CVPR 2019. [paper]

If you have any problems, please feel free to contact us. ([email protected])

The framework of our paper.

1. Feature Extraction

This section describes the process of collecting popular content-based image retrieval features, which are used to prepare the meta-data of our paper.

The extracted features are evaluated in this section, and the code with details can be found in: ./Extraction/

1.1 Evaluation

1.1.1 Datasets

Datasets for evaluation:

  • Holidays [1]
  • Oxford5k [2]
  • Paris6k [3]

Dataset for PCA whitening and creating codebooks:

  • Google-Landmarks [4]

1.1.2 Measurement

We use the mean Average Precision (mAP) provided by the official sites of the above datasets for evaluation.
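As a concrete illustration, below is a minimal sketch of how mAP can be computed from ranked retrieval results, assuming binary relevance labels. The official protocols additionally handle details such as "junk" images, so this is not a drop-in replacement for their evaluation code.

```python
import numpy as np

def average_precision(ranked_relevance):
    """AP for one query; ranked_relevance holds 0/1 labels over the ranked gallery."""
    rel = np.asarray(ranked_relevance, dtype=float)
    if rel.sum() == 0:
        return 0.0
    # Precision at each rank, kept only where a relevant item appears.
    precision_at_hit = np.cumsum(rel) / (np.arange(len(rel)) + 1)
    return float((precision_at_hit * rel).sum() / rel.sum())

def mean_average_precision(per_query_relevance):
    return float(np.mean([average_precision(r) for r in per_query_relevance]))

# Toy usage: two queries with their ranked 0/1 relevance lists.
print(mean_average_precision([[1, 0, 1, 0], [0, 1, 1, 0]]))  # ~0.708
```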

1.2 Features

Please note that we do not use the objects' bounding boxes when extracting image features.

The local features (e.g., SIFT and DELF) are aggregated using codebooks learned on 4,000 randomly picked images from the Google-Landmarks dataset.

The features of these picked images are also used to train the PCA whitening that is applied to the features of all other images.
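The following is a minimal sketch of this whitening step using scikit-learn; the feature dimensions and variable names are illustrative stand-ins, and the exact pipeline lives in ./Extraction/.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in: rows are features of the 4,000 picked training images.
train_feats = np.random.randn(4000, 512).astype(np.float32)

pca = PCA(whiten=True)  # fit the whitening on the picked images only
pca.fit(train_feats)

def whiten(feats):
    """Project with the learned whitening, then re-L2-normalize (common in retrieval)."""
    out = pca.transform(feats)
    return out / np.linalg.norm(out, axis=1, keepdims=True)

gallery_feats = whiten(np.random.randn(10, 512).astype(np.float32))
```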

The features are listed below:

  • SIFT-FV and SIFT-VLAD: The Scale-Invariant Feature Transform (SIFT) [5] features are extracted and then aggregated by Fisher Vector (FV) [6] and Vector of Locally Aggregated Descriptors (VLAD) [7]; a VLAD aggregation sketch is given after this list.

  • DELF-FV and DELF-VLAD: The DEep Local Features (DELF) [8] are extracted and then also aggregated by FV and VLAD.

  • V-CroW and R-CroW: The abbreviation V denotes the VGG [9] backbone network, and R denotes the ResNet50 [10] backbone network. Cross-dimensional Weighting (CroW) [11] is then used to aggregate the deep features generated by the backbone networks.

  • V-SPoC and R-SPoC: The Sum-Pooled Convolutional features (SPoC) [12] are used to aggregate the deep features generated by the backbone networks.

  • V-MAC, V-rMAC and R-MAC, R-rMAC: The Maximum Activations of Convolutions (MAC) [13] and the regional Maximum Activations of Convolutions (rMAC) [14] are used to aggregate the deep features generated by the backbone networks.

  • V-GeM, V-rGeM and R-GeM, R-rGeM: The Generalized-Mean pooling (GeM) [15] is used to aggregate the deep features generated by the backbone networks.
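For concreteness, here is a minimal sketch of VLAD aggregation [7] with a k-means codebook. The descriptor dimension and codebook size are illustrative, and the actual implementation in ./Extraction/ may differ.

```python
import numpy as np
from sklearn.cluster import KMeans

def train_codebook(descriptors, k=64, seed=0):
    """Learn a k-means codebook from local descriptors pooled over training images."""
    return KMeans(n_clusters=k, random_state=seed, n_init=10).fit(descriptors)

def vlad(descriptors, codebook):
    """Aggregate one image's local descriptors into a VLAD vector."""
    centers = codebook.cluster_centers_        # (k, d)
    assign = codebook.predict(descriptors)     # nearest center per descriptor
    v = np.zeros_like(centers, dtype=np.float64)
    for i, c in enumerate(assign):
        v[c] += descriptors[i] - centers[c]    # accumulate residuals per center
    v = np.sign(v) * np.sqrt(np.abs(v))        # signed square-root normalization
    v = v.reshape(-1)
    return v / (np.linalg.norm(v) + 1e-12)     # global L2 normalization

# Toy usage with random 128-D "SIFT-like" descriptors.
rng = np.random.default_rng(0)
codebook = train_codebook(rng.normal(size=(5000, 128)), k=16)
image_vlad = vlad(rng.normal(size=(300, 128)), codebook)
```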

1.3 Results

The mAP (%) of the collected features is as follows:

Feature    Holidays  Oxford5k  Paris6k
SIFT-FV       61.77     36.25    36.91
SIFT-VLAD     63.92     40.49    41.49
DELF-FV       83.42     73.38    83.06
DELF-VLAD     84.61     75.31    82.54
V-CroW        83.17     68.38    79.79
V-GeM         84.57     82.71    86.85
V-MAC         74.18     60.97    72.65
V-rGeM        85.06     82.30    87.33
V-rMAC        83.50     70.84    83.54
V-SPoC        83.38     66.43    78.47
R-CroW        86.38     61.73    75.46
R-GeM         89.08     84.47    91.87
R-MAC         88.53     60.82    77.74
R-rGeM        89.32     84.60    91.90
R-rMAC        89.08     68.46    83.00
R-SPoC        86.57     62.36    76.75

2. Feature Translation

We translate different types of features and test them in this section.

The code with details can be found in: ./Translation/

2.1 Evaluation

2.1.1 Datasets

Datasets for evaluating the translation results:

  • Holidays [1]
  • Oxford5k [2]
  • Paris6k [3]

Dataset for training the Hybrid Auto-Encoder (HAE):

  • Google-Landmarks [4]

2.1.2 Measurement

The mean Average Precision (mAP) is used to evaluate the retrieval performance. We translate the source features of the gallery images to the target space, and the target features of the query images are used for searching, as sketched after the list below:

  • Gallery: Source -> Target
  • Query: Target
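A minimal sketch of this protocol with cosine similarity is shown below; the `translate` callable stands in for a trained translator and is hypothetical.

```python
import numpy as np

def l2n(x):
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-12)

def rank_gallery(query_target_feats, gallery_source_feats, translate):
    """Gallery: Source -> Target via the trained translator; Query: native Target."""
    gallery = l2n(translate(gallery_source_feats))  # translated gallery features
    queries = l2n(query_target_feats)
    sims = queries @ gallery.T                      # cosine similarity
    return np.argsort(-sims, axis=1)                # ranked gallery indices per query

# Toy usage with an identity "translator" on random features.
q, g = np.random.randn(3, 256), np.random.randn(100, 256)
ranks = rank_gallery(q, g, translate=lambda x: x)
```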

2.2 Hybrid Auto-Encoder

The Hybrid Auto-Encoder (HAE) is trained with both Translation (Source -> Target) and Reconstruction (Target -> Target), yielding a Translation Error and a Reconstruction Error that together optimize the network.
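Below is a minimal PyTorch sketch of this hybrid objective. The layer sizes, the use of separate input projections with a shared decoder, and the unweighted sum of the two losses are assumptions for illustration, not the paper's exact architecture; see ./Translation/ for that.

```python
import torch
import torch.nn as nn

class HAE(nn.Module):
    """Sketch of a Hybrid Auto-Encoder: both paths share one decoder into the
    target space, so the Translation and Reconstruction errors are comparable."""
    def __init__(self, src_dim, tgt_dim, hidden=512):
        super().__init__()
        # Assumption: separate input projections, since source/target dims may differ.
        self.enc_src = nn.Sequential(nn.Linear(src_dim, hidden), nn.ReLU())
        self.enc_tgt = nn.Sequential(nn.Linear(tgt_dim, hidden), nn.ReLU())
        self.dec = nn.Linear(hidden, tgt_dim)  # shared decoder into the target space

    def forward(self, src, tgt):
        trans = self.dec(self.enc_src(src))  # Translation: Source -> Target
        recon = self.dec(self.enc_tgt(tgt))  # Reconstruction: Target -> Target
        return trans, recon

model = HAE(src_dim=512, tgt_dim=2048)
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
mse = nn.MSELoss()

src, tgt = torch.randn(32, 512), torch.randn(32, 2048)  # toy feature batches
trans, recon = model(src, tgt)
loss = mse(trans, tgt) + mse(recon, tgt)  # Translation Error + Reconstruction Error
opt.zero_grad()
loss.backward()
opt.step()
```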

2.3 Results

2.3.1 Translation Results

The mAP (%) difference between the target and the translated features on three public datasets: Holidays (green), Oxford5k (blue) and Paris6k (brown).

The mAP difference.

2.3.2 Retrieval Examples

The retrieval results for query images of the Eiffel Tower (top) and the Arc de Triomphe (bottom) with the target features and the translated features. The images are resized for better viewing, and interesting results are marked with red bounding boxes.

Some retrieval results.

3. Relation Mining

We mine the relations among different types of features in this section, and the code with details can be found in: ./Relation/

3.1 Affinity Measurement

If the Translation Error is close to the Reconstruction Error, the translation between the source and target features is similar to the reconstruction of the target features, which indicates that the source and target features have high affinity.

Therefore, we regard the difference between the Translation Error and the Reconstruction Error as an affinity measurement.

By normalizing these differences, we finally obtain an Undirected Affinity Measurement.
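A minimal sketch of this measurement is given below, assuming trans_err[i, j] holds the converged error of translating feature type i into type j and recon_err[j] the reconstruction error of type j. The exact normalization and symmetrization follow ./Relation/, so this is illustrative only.

```python
import numpy as np

def undirected_affinity_distance(trans_err, recon_err):
    """trans_err[i, j]: error of translating feature type i into type j.
    recon_err[j]: error of reconstructing feature type j from itself.
    A small gap means translating i -> j is almost as easy as reconstructing j,
    i.e., high affinity; the returned matrix is therefore a dissimilarity."""
    gap = np.abs(trans_err - recon_err[None, :])  # directed error gap
    gap = gap / (gap.max() + 1e-12)               # normalize to [0, 1]
    return (gap + gap.T) / 2.0                    # symmetrize -> undirected

# Toy usage with random errors for 4 feature types.
rng = np.random.default_rng(0)
dist = undirected_affinity_distance(rng.random((4, 4)), rng.random(4))
```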

3.2 Visualization Result

The Undirected Affinity can be visualized by applying a Minimum Spanning Tree (MST) algorithm.

The edge lengths are the average of the results on the Holidays, Oxford5k and Paris6k datasets. The images are the retrieval results for a query image of the Pantheon, using the corresponding features along the main trunk of the MST. Close feature pairs such as R-SPoC and R-CroW have similar ranking lists.

The MST.
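As a sketch, the tree can be extracted with SciPy from a symmetric dissimilarity matrix such as the one produced above; the 4x4 random matrix here is a stand-in for the real 16x16 measurement.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

# Hypothetical symmetric dissimilarity between feature types, e.g. the output
# of undirected_affinity_distance above averaged over the three datasets.
names = ["SIFT-FV", "SIFT-VLAD", "DELF-FV", "DELF-VLAD"]  # ... 16 types in total
rng = np.random.default_rng(0)
d = rng.random((4, 4))
dist = (d + d.T) / 2.0
np.fill_diagonal(dist, 0.0)

mst = minimum_spanning_tree(dist).tocoo()  # sparse matrix holding the MST edges
for i, j, w in zip(mst.row, mst.col, mst.data):
    print(f"{names[i]} -- {names[j]}: {w:.3f}")
```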

4. Reference

[1] "Hamming embedding and weak geometric consistency for large scale image search." Jégou, H., Douze, M., & Schmid, C. In ECCV 2008.
[2] "Object retrieval with large vocabularies and fast spatial matching." Philbin, J., Chum, O., Isard, M., Sivic, J. & Zisserman, A. In CVPR 2007.
[3] "Lost in Quantization: Improving Particular Object Retrieval in Large Scale Image Databases." Philbin, J., Chum, O., Isard, M., Sivic, J. & Zisserman, A. In CVPR 2008.
[4] "Large-scale image retrieval with attentive deep local features." Noh, H., Araujo, A., Sim, J., Weyand, T., & Han, B. In ICCV 2017.
[5] "Distinctive image features from scale-invariant keypoints." Lowe, D. G. IJCV 2004.
[6] "Large-scale image retrieval with compressed fisher vectors." Perronnin, F., Liu, Y., Sánchez, J., & Poirier, H. In CVPR 2010.
[7] "Aggregating local descriptors into a compact image representation." Jégou, H., Douze, M., Schmid, C., & Pérez, P. In CVPR 2010.
[8] "Large-scale image retrieval with attentive deep local features." Noh, H., Araujo, A., Sim, J., Weyand, T., & Han, B. In ICCV 2017.
[9] "Very deep convolutional networks for large-scale image recognition." Simonyan, K., & Zisserman, A. arXiv:1409.1556.
[10] "Deep residual learning for image recognition." He, K., Zhang, X., Ren, S., & Sun, J. In CVPR 2016.
[11] "Cross-dimensional weighting for aggregated deep convolutional features." Kalantidis, Y., Mellina, C., & Osindero, S. In ECCV 2016.
[12] "Aggregating local deep features for image retrieval." Babenko, A., & Lempitsky, V. In ICCV 2015.
[13] "Visual instance retrieval with deep convolutional networks." Razavian, A. S., Sullivan, J., Carlsson, S., & Maki, A. MTA 2016.
[14] "Particular object retrieval with integral max-pooling of CNN activations." Tolias, G., Sicre, R., & Jégou, H. In ICLR 2016.
[15] "Fine-tuning CNN image retrieval with no human annotation." Radenović, F., Tolias, G., & Chum, O. PAMI 2018.

5. Citation

If our paper helps your research, please cite it in your publications:

@InProceedings{Hu_2019_CVPR,
  author    = {Hu, Jie and Ji, Rongrong and Liu, Hong and Zhang, Shengchuan and Deng, Cheng and Tian, Qi},
  title     = {Towards Visual Feature Translation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2019}
}
