Towards Visual Feature Translation

This is the project page of our paper:

"Towards Visual Feature Translation." Hu, J., Ji, R., Liu, H., Zhang, S., Deng, C., & Tian, Q. In CVPR 2019. [paper]

If you have any problems, please feel free to contact us. ([email protected])

The framework of our paper.

1. Feature Extraction

This section describes the process of collecting popular content-based image retrieval features, which are used to prepare the meta-data of our paper.

The extracted features are evaluated in this section, and the code with details can be found in: ./Extraction/

1.1 Evaluation

1.1.1 Datasets

Datasets for evaluation:

  • Holidays [1]
  • Oxford5k [2]
  • Paris6k [3]

Dataset for PCA whitening and creating codebooks:

  • Google-Landmarks [4]

1.1.2 Measurement

We use the mean Average Precision (mAP) provided by the official sites of the above datasets for evaluation.
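As a concrete illustration, below is a minimal sketch of how mAP can be computed from ranked retrieval results, assuming binary relevance labels. The official protocols additionally handle details such as "junk" images, so this is not a drop-in replacement for their evaluation code.

```python
import numpy as np

def average_precision(ranked_relevance):
    """AP for one query; ranked_relevance holds 0/1 labels over the ranked gallery."""
    rel = np.asarray(ranked_relevance, dtype=float)
    if rel.sum() == 0:
        return 0.0
    # Precision at each rank, kept only where a relevant item appears.
    precision_at_hit = np.cumsum(rel) / (np.arange(len(rel)) + 1)
    return float((precision_at_hit * rel).sum() / rel.sum())

def mean_average_precision(per_query_relevance):
    return float(np.mean([average_precision(r) for r in per_query_relevance]))

# Toy usage: two queries with their ranked 0/1 relevance lists.
print(mean_average_precision([[1, 0, 1, 0], [0, 1, 1, 0]]))  # ~0.708
```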

1.2 Features

Please note that we do not use the objects' bounding boxes when extracting image features.

The local features (e.g., SIFT and DELF) are aggregated using codebooks learned on 4,000 randomly picked images from the Google-Landmarks dataset.

The features of these picked images are also used to train the PCA whitening that is applied to the features of all other images.
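The following is a minimal sketch of this whitening step using scikit-learn; the feature dimensions and variable names are illustrative stand-ins, and the exact pipeline lives in ./Extraction/.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in: rows are features of the 4,000 picked training images.
train_feats = np.random.randn(4000, 512).astype(np.float32)

pca = PCA(whiten=True)  # fit the whitening on the picked images only
pca.fit(train_feats)

def whiten(feats):
    """Project with the learned whitening, then re-L2-normalize (common in retrieval)."""
    out = pca.transform(feats)
    return out / np.linalg.norm(out, axis=1, keepdims=True)

gallery_feats = whiten(np.random.randn(10, 512).astype(np.float32))
```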

The features are listed below:

  • SIFT-FV and SIFT-VLAD: The Scale-Invariant Feature Transform (SIFT) [5] features are extracted and then aggregated by Fisher Vector (FV) [6] and Vector of Locally Aggregated Descriptors (VLAD) [7]; a VLAD aggregation sketch is given after this list.

  • DELF-FV and DELF-VLAD: The DEep Local Features (DELF) [8] are extracted and then also aggregated by FV and VLAD.

  • V-CroW and R-CroW: The abbreviation V denotes the VGG [9] backbone network, and R denotes the ResNet50 [10] backbone network. Cross-dimensional Weighting (CroW) [11] is then used to aggregate the deep features generated by the backbone networks.

  • V-SPoC and R-SPoC: The Sum-Pooled Convolutional features (SPoC) [12] are used to aggregate the deep features generated by the backbone networks.

  • V-MAC, V-rMAC and R-MAC, R-rMAC: The Maximum Activations of Convolutions (MAC) [13] and the regional Maximum Activations of Convolutions (rMAC) [14] are used to aggregate the deep features generated by the backbone networks.

  • V-GeM, V-rGeM and R-GeM, R-rGeM: The Generalized-Mean pooling (GeM) [15] is used to aggregate the deep features generated by the backbone networks.
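For concreteness, here is a minimal sketch of VLAD aggregation [7] with a k-means codebook. The descriptor dimension and codebook size are illustrative, and the actual implementation in ./Extraction/ may differ.

```python
import numpy as np
from sklearn.cluster import KMeans

def train_codebook(descriptors, k=64, seed=0):
    """Learn a k-means codebook from local descriptors pooled over training images."""
    return KMeans(n_clusters=k, random_state=seed, n_init=10).fit(descriptors)

def vlad(descriptors, codebook):
    """Aggregate one image's local descriptors into a VLAD vector."""
    centers = codebook.cluster_centers_        # (k, d)
    assign = codebook.predict(descriptors)     # nearest center per descriptor
    v = np.zeros_like(centers, dtype=np.float64)
    for i, c in enumerate(assign):
        v[c] += descriptors[i] - centers[c]    # accumulate residuals per center
    v = np.sign(v) * np.sqrt(np.abs(v))        # signed square-root normalization
    v = v.reshape(-1)
    return v / (np.linalg.norm(v) + 1e-12)     # global L2 normalization

# Toy usage with random 128-D "SIFT-like" descriptors.
rng = np.random.default_rng(0)
codebook = train_codebook(rng.normal(size=(5000, 128)), k=16)
image_vlad = vlad(rng.normal(size=(300, 128)), codebook)
```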

1.3 Results

The mAP (%) of the collected features is as follows:

Feature    Holidays  Oxford5k  Paris6k
SIFT-FV       61.77     36.25    36.91
SIFT-VLAD     63.92     40.49    41.49
DELF-FV       83.42     73.38    83.06
DELF-VLAD     84.61     75.31    82.54
V-CroW        83.17     68.38    79.79
V-GeM         84.57     82.71    86.85
V-MAC         74.18     60.97    72.65
V-rGeM        85.06     82.30    87.33
V-rMAC        83.50     70.84    83.54
V-SPoC        83.38     66.43    78.47
R-CroW        86.38     61.73    75.46
R-GeM         89.08     84.47    91.87
R-MAC         88.53     60.82    77.74
R-rGeM        89.32     84.60    91.90
R-rMAC        89.08     68.46    83.00
R-SPoC        86.57     62.36    76.75

2. Feature Translation

We translate different types of features and test them in this section.

The code with details can be found in: ./Translation/

2.1 Evaluation

2.1.1 Datasets

Datasets for evaluating the translation results:

  • Holidays [1]
  • Oxford5k [2]
  • Paris6k [3]

Dataset for training the Hybrid Auto-Encoder (HAE):

  • Google-Landmarks [4]

2.1.2 Measurement

The mean Average Precision (mAP) is used to evaluate the retrieval performance. We translate the source features of the gallery images to the target space, and the target features of the query images are used for searching, as sketched after the list below:

  • Gallery: Source -> Target
  • Query: Target
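A minimal sketch of this protocol with cosine similarity is shown below; the `translate` callable stands in for a trained translator and is hypothetical.

```python
import numpy as np

def l2n(x):
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-12)

def rank_gallery(query_target_feats, gallery_source_feats, translate):
    """Gallery: Source -> Target via the trained translator; Query: native Target."""
    gallery = l2n(translate(gallery_source_feats))  # translated gallery features
    queries = l2n(query_target_feats)
    sims = queries @ gallery.T                      # cosine similarity
    return np.argsort(-sims, axis=1)                # ranked gallery indices per query

# Toy usage with an identity "translator" on random features.
q, g = np.random.randn(3, 256), np.random.randn(100, 256)
ranks = rank_gallery(q, g, translate=lambda x: x)
```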

2.2 Hybrid Auto-Encoder

The Hybrid Auto-Encoder (HAE) is trained with both Translation (Source -> Target) and Reconstruction (Target -> Target), yielding a Translation Error and a Reconstruction Error that together optimize the network.
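Below is a minimal PyTorch sketch of this hybrid objective. The layer sizes, the use of separate input projections with a shared decoder, and the unweighted sum of the two losses are assumptions for illustration, not the paper's exact architecture; see ./Translation/ for that.

```python
import torch
import torch.nn as nn

class HAE(nn.Module):
    """Sketch of a Hybrid Auto-Encoder: both paths share one decoder into the
    target space, so the Translation and Reconstruction errors are comparable."""
    def __init__(self, src_dim, tgt_dim, hidden=512):
        super().__init__()
        # Assumption: separate input projections, since source/target dims may differ.
        self.enc_src = nn.Sequential(nn.Linear(src_dim, hidden), nn.ReLU())
        self.enc_tgt = nn.Sequential(nn.Linear(tgt_dim, hidden), nn.ReLU())
        self.dec = nn.Linear(hidden, tgt_dim)  # shared decoder into the target space

    def forward(self, src, tgt):
        trans = self.dec(self.enc_src(src))  # Translation: Source -> Target
        recon = self.dec(self.enc_tgt(tgt))  # Reconstruction: Target -> Target
        return trans, recon

model = HAE(src_dim=512, tgt_dim=2048)
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
mse = nn.MSELoss()

src, tgt = torch.randn(32, 512), torch.randn(32, 2048)  # toy feature batches
trans, recon = model(src, tgt)
loss = mse(trans, tgt) + mse(recon, tgt)  # Translation Error + Reconstruction Error
opt.zero_grad()
loss.backward()
opt.step()
```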

2.3 Results

2.3.1 Translation Results

The mAP (%) difference between the target and the translated features on three public datasets: Holidays (green), Oxford5k (blue) and Paris6k (brown).

The mAP difference.

2.3.2 Retrieval Examples

The retrieval results for query images of the Eiffel Tower (top) and the Arc de Triomphe (bottom) with the target features and the translated features. The images are resized for better viewing, and interesting results are marked with red bounding boxes.

Some retrieval results.

3. Relation Mining

We mine the relations among different types of features in this section, and the code with details can be found in: ./Relation/

3.1 Affinity Measurement

If the Translation Error is close to the Reconstruction Error, the translation between the source and target features is similar to the reconstruction of the target features, which indicates that the source and target features have high affinity.

Therefore, we regard the difference between the Translation Error and the Reconstruction Error as an affinity measurement.

By normalizing these differences, we finally obtain an Undirected Affinity Measurement.
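A minimal sketch of this measurement is given below, assuming trans_err[i, j] holds the converged error of translating feature type i into type j and recon_err[j] the reconstruction error of type j. The exact normalization and symmetrization follow ./Relation/, so this is illustrative only.

```python
import numpy as np

def undirected_affinity_distance(trans_err, recon_err):
    """trans_err[i, j]: error of translating feature type i into type j.
    recon_err[j]: error of reconstructing feature type j from itself.
    A small gap means translating i -> j is almost as easy as reconstructing j,
    i.e., high affinity; the returned matrix is therefore a dissimilarity."""
    gap = np.abs(trans_err - recon_err[None, :])  # directed error gap
    gap = gap / (gap.max() + 1e-12)               # normalize to [0, 1]
    return (gap + gap.T) / 2.0                    # symmetrize -> undirected

# Toy usage with random errors for 4 feature types.
rng = np.random.default_rng(0)
dist = undirected_affinity_distance(rng.random((4, 4)), rng.random(4))
```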

3.2 Visualization Result

The Undirected Affinity can be visualized by applying a Minimum Spanning Tree (MST) algorithm.

The edge lengths are the average of the results on the Holidays, Oxford5k and Paris6k datasets. The images are the retrieval results for a query image of the Pantheon, using the corresponding features along the main trunk of the MST. Close feature pairs such as R-SPoC and R-CroW have similar ranking lists.

The MST.
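As a sketch, the tree can be extracted with SciPy from a symmetric dissimilarity matrix such as the one produced above; the 4x4 random matrix here is a stand-in for the real 16x16 measurement.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

# Hypothetical symmetric dissimilarity between feature types, e.g. the output
# of undirected_affinity_distance above averaged over the three datasets.
names = ["SIFT-FV", "SIFT-VLAD", "DELF-FV", "DELF-VLAD"]  # ... 16 types in total
rng = np.random.default_rng(0)
d = rng.random((4, 4))
dist = (d + d.T) / 2.0
np.fill_diagonal(dist, 0.0)

mst = minimum_spanning_tree(dist).tocoo()  # sparse matrix holding the MST edges
for i, j, w in zip(mst.row, mst.col, mst.data):
    print(f"{names[i]} -- {names[j]}: {w:.3f}")
```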

4. Reference

[1] "Hamming embedding and weak geometric consistency for large scale image search." Jégou, H., Douze, M., & Schmid, C. In ECCV 2008.
[2] "Object retrieval with large vocabularies and fast spatial matching." Philbin, J., Chum, O., Isard, M., Sivic, J. & Zisserman, A. In CVPR 2007.
[3] "Lost in Quantization: Improving Particular Object Retrieval in Large Scale Image Databases." Philbin, J., Chum, O., Isard, M., Sivic, J. & Zisserman, A. In CVPR 2008.
[4] "Large-scale image retrieval with attentive deep local features." Noh, H., Araujo, A., Sim, J., Weyand, T., & Han, B. In ICCV 2017.
[5] "Distinctive image features from scale-invariant keypoints." Lowe, D. G. IJCV 2004.
[6] "Large-scale image retrieval with compressed fisher vectors." Perronnin, F., Liu, Y., Sánchez, J., & Poirier, H. In CVPR 2010.
[7] "Aggregating local descriptors into a compact image representation." Jégou, H., Douze, M., Schmid, C., & Pérez, P. In CVPR 2010.
[8] "Large-scale image retrieval with attentive deep local features." Noh, H., Araujo, A., Sim, J., Weyand, T., & Han, B. In ICCV 2017.
[9] "Very deep convolutional networks for large-scale image recognition." Simonyan, K., & Zisserman, A. arXiv:1409.1556.
[10] "Deep residual learning for image recognition." He, K., Zhang, X., Ren, S., & Sun, J. In CVPR 2016.
[11] "Cross-dimensional weighting for aggregated deep convolutional features." Kalantidis, Y., Mellina, C., & Osindero, S. In ECCV 2016.
[12] "Aggregating local deep features for image retrieval." Babenko, A., & Lempitsky, V. In ICCV 2015.
[13] "Visual instance retrieval with deep convolutional networks." Razavian, A. S., Sullivan, J., Carlsson, S., & Maki, A. MTA 2016.
[14] "Particular object retrieval with integral max-pooling of CNN activations." Tolias, G., Sicre, R., & Jégou, H. In ICLR 2016.
[15] "Fine-tuning CNN image retrieval with no human annotation." Radenović, F., Tolias, G., & Chum, O. PAMI 2018.

5. Citation

If our paper helps your research, please cite it in your publications:

@InProceedings{Hu_2019_CVPR,
  author    = {Hu, Jie and Ji, Rongrong and Liu, Hong and Zhang, Shengchuan and Deng, Cheng and Tian, Qi},
  title     = {Towards Visual Feature Translation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2019}
}
