clovaai / generative-evaluation-prdc

Code base for the precision, recall, density, and coverage metrics for generative models. ICML 2020.

License: MIT License

Languages: Python 100.00%

Topics: deep-learning, generative-adversarial-network, evaluation-metrics, precision, recall, machine-learning, generative-model, fidelity, diversity, evaluation

generative-evaluation-prdc's Introduction


Reliable Fidelity and Diversity Metrics for Generative Models (ICML 2020)

Paper: Reliable Fidelity and Diversity Metrics for Generative Models

Muhammad Ferjad Naeem (1,3,*), Seong Joon Oh (2,*), Yunjey Choi (1), Youngjung Uh (1), Jaejun Yoo (1,4)

Work done at Clova AI Research

(*) Equal contribution. (1) Clova AI Research, NAVER Corp. (2) Clova AI Research, LINE Plus Corp. (3) Technische Universität München. (4) EPFL.

Devising indicative evaluation metrics for the image generation task remains an open problem. The most widely used metric for measuring the similarity between real and generated images has been the Fréchet Inception Distance (FID) score. Because it does not differentiate the fidelity and diversity aspects of the generated images, recent papers have introduced variants of precision and recall metrics to diagnose those properties separately. In this paper, we show that even the latest version of the precision and recall (Kynkäänniemi et al., 2019) metrics are not reliable yet. For example, they fail to detect the match between two identical distributions, they are not robust against outliers, and the evaluation hyperparameters are selected arbitrarily. We propose density and coverage metrics that solve the above issues. We analytically and experimentally show that density and coverage provide more interpretable and reliable signals for practitioners than the existing metrics.


Updates

  • 1 June 2020: Paper accepted at ICML 2020.

1. Background

Precision and recall metrics

Precision and recall are defined below:

$$\text{precision} := \frac{1}{M}\sum_{j=1}^{M} 1\left[\,Y_j \in \text{manifold}(X_1,\dots,X_N)\,\right]
\qquad
\text{recall} := \frac{1}{N}\sum_{i=1}^{N} 1\left[\,X_i \in \text{manifold}(Y_1,\dots,Y_M)\,\right]$$

where $X_1,\dots,X_N$ are the real samples, $Y_1,\dots,Y_M$ are the fake samples, and the manifold is defined as

$$\text{manifold}(X_1,\dots,X_N) := \bigcup_{i=1}^{N} B\left(X_i,\ \text{NND}_k(X_i)\right).$$

$B(x, r)$ is the ball around the point $x$ with radius $r$.

$\text{NND}_k(X_i)$ is the distance from $X_i$ to its $k$-th nearest neighbour among $\{X_i\}$, excluding itself.
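
For concreteness, here is a minimal NumPy/SciPy sketch of these two definitions. It is written directly from the formulas above rather than taken from this repository's compute_prdc, and the helper names are illustrative.

import numpy as np
from scipy.spatial.distance import cdist  # assumes SciPy is available


def kth_nn_distances(features, k):
    """Distance from each point to its k-th nearest neighbour (excluding itself)."""
    d = cdist(features, features)      # pairwise Euclidean distances
    return np.sort(d, axis=1)[:, k]    # column 0 is the self-distance 0


def precision_recall(real, fake, k):
    radii_real = kth_nn_distances(real, k)   # ball radii around real samples
    radii_fake = kth_nn_distances(fake, k)   # ball radii around fake samples
    d_rf = cdist(real, fake)                 # rows: real, columns: fake

    # precision: fraction of fake samples inside at least one real-sample ball
    precision = (d_rf < radii_real[:, None]).any(axis=0).mean()
    # recall: fraction of real samples inside at least one fake-sample ball
    recall = (d_rf < radii_fake[None, :]).any(axis=1).mean()
    return precision, recall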

Density and coverage metrics

Density and coverage are defined below:

$$\text{density} := \frac{1}{kM}\sum_{j=1}^{M}\sum_{i=1}^{N} 1\left[\,Y_j \in B\left(X_i,\ \text{NND}_k(X_i)\right)\right]
\qquad
\text{coverage} := \frac{1}{N}\sum_{i=1}^{N} 1\left[\,\exists\, j\ \text{s.t.}\ Y_j \in B\left(X_i,\ \text{NND}_k(X_i)\right)\right]$$
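
A matching sketch for density and coverage, under the same assumptions (plain NumPy/SciPy written from the definitions above, with illustrative names; the packaged compute_prdc is the reference implementation):

import numpy as np
from scipy.spatial.distance import cdist  # assumes SciPy is available


def density_coverage(real, fake, k):
    # Radius of the ball around each real sample: distance to its k-th nearest
    # real neighbour (column 0 of the sorted distances is the self-distance 0).
    radii_real = np.sort(cdist(real, real), axis=1)[:, k]
    d_rf = cdist(real, fake)                # rows: real, columns: fake
    inside = d_rf < radii_real[:, None]     # (N, M) ball-membership matrix

    # density: average number of real-sample balls covering each fake sample, divided by k
    density = inside.sum(axis=0).mean() / k
    # coverage: fraction of real-sample balls that contain at least one fake sample
    coverage = inside.any(axis=1).mean()
    return density, coverage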

Why are DC better than PR?

Precision versus density.

Precision versus Density. Because of the real outlier sample, the manifold is overestimated. Generating many fake samples around the real outlier is enough to increase the precision measure. The problem of overestimating precision (100%) is resolved using the density estimate (60%).

Recall versus coverage.

Recall versus Coverage. The real and fake samples are identical across left and right. Since models often generate many unrealistic yet diverse samples, the fake manifold is often an overestimation of the true fake distribution. In the figure above, while the fake samples are generally far from the modes in real samples, the recall measure is rewarded by the fact that real samples are contained in the overestimated fake manifold.
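
To see the outlier effect numerically, one can run a small variant of this experiment with the packaged compute_prdc. The setup below is illustrative, not the exact configuration from the paper: a single real outlier is planted far from the real mode and all fake samples are placed around it.

import numpy as np
from prdc import compute_prdc

rng = np.random.default_rng(0)
real = rng.normal(size=(1000, 64))
real[0] += 50.0                                      # one real outlier far from the mode
fake = real[0] + 0.1 * rng.normal(size=(1000, 64))   # fake samples clustered around the outlier

print(compute_prdc(real_features=real, fake_features=fake, nearest_k=5))
# Expected behaviour (not exact numbers): precision saturates near 1 because the
# outlier's huge ball covers every fake sample, while density stays well below 1;
# recall and coverage are near 0 since the real mode itself is never covered.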

2. Usage

Installation

pip3 install prdc

Example

Test 10000 real and fake samples drawn from the standard normal distribution N(0, I) in 1000-dimensional Euclidean space, with the nearest-neighbour parameter k=5. We compute precision, recall, density, and coverage estimates below.

import numpy as np
from prdc import compute_prdc


num_real_samples = num_fake_samples = 10000
feature_dim = 1000
nearest_k = 5
real_features = np.random.normal(loc=0.0, scale=1.0,
                                 size=[num_real_samples, feature_dim])

fake_features = np.random.normal(loc=0.0, scale=1.0,
                                 size=[num_fake_samples, feature_dim])

metrics = compute_prdc(real_features=real_features,
                       fake_features=fake_features,
                       nearest_k=nearest_k)

print(metrics)

The above test code will produce estimates similar to the following (values may fluctuate due to randomness).

{'precision': 0.4772,
 'recall': 0.4705,
 'density': 1.0555,
 'coverage': 0.9735}

3. Miscellaneous

References

Kynkäänniemi et al., 2019. Improved Precision and Recall Metric for Assessing Generative Models. NeurIPS 2019.

License

Copyright (c) 2020-present NAVER Corp.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.

Cite this work

@inproceedings{ferjad2020icml,
  title = {Reliable Fidelity and Diversity Metrics for Generative Models},
  author = {Naeem, Muhammad Ferjad and Oh, Seong Joon and Uh, Youngjung and Choi, Yunjey and Yoo, Jaejun},
  year = {2020},
  booktitle = {International Conference on Machine Learning},
}

generative-evaluation-prdc's People

Contributors

clovaaiadmin, coallaoh


generative-evaluation-prdc's Issues

Dummy example gives a non-intuitive result

Hello and thank you for this great paper and implementation.

I've run your method with a dummy example:

    import torch

    fake_features = torch.ones((1024, 4096))
    real_features = torch.ones((1024, 4096))

and would expect 1.0 for both density and coverage, but actually got 0.0.

There are two changes to the density metric that might help:

  1. Use less-than-or-equal instead of strict less-than:
     (distance_real_fake < np.expand_dims(real_nearest_neighbour_distances, axis=1))
     =>
     (distance_real_fake <= np.expand_dims(real_nearest_neighbour_distances, axis=1))
  2. Clamp with self.nearest_k to enforce a result in [0, 1]:
     (distance_real_fake <= real_nearest_neighbour_distances.unsqueeze(1)).sum(dim=0)
     =>
     (distance_real_fake <= real_nearest_neighbour_distances.unsqueeze(1)).sum(dim=0).clamp(0, self.nearest_k)

Does this make sense, or am I missing something?

Density much larger than 1

Hi! I understand that the density metric is not upper-bounded by 1 and that the expected density for two identical distributions is 1. However, when I evaluate the density for StyleGAN2 trained on FFHQ, the density is much larger than 1. For the pre-trained StyleGAN2-F, the density is around 1.12. For a fine-tuned StyleGAN2 that obtains higher precision, the density goes up to around 1.5. Is this an ill behavior of the density metric?

Thanks in advance!

using exact similarity search

Thank you for providing the code and implementation for density and coverage. It is awesome to have the code ready for use in practice. I cite the paper whenever possible.

I took the liberty to research possible improvements. I found that using an exact similarity search as offered by faiss can speed up the calculation of density and coverage by a great deal.

Here are my results for num_real_samples = num_fake_samples = 1024, feature_dim = 12, nearest_k = 5:

--------------------------------------------------------------------------------------- benchmark: 4 tests ---------------------------------------------------------------------------------------
Name (time in ms)                 Min                 Max                Mean             StdDev              Median                IQR            Outliers      OPS            Rounds  Iterations
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_bench_my_coverage        10.7225 (1.0)       53.4196 (1.09)      13.9665 (1.0)       8.7931 (1.0)       11.1186 (1.0)       1.4884 (1.0)           1;4  71.5998 (1.0)          24           1
test_bench_my_density         11.9193 (1.11)      49.0908 (1.0)       16.7892 (1.20)      9.0503 (1.03)      12.8918 (1.16)      3.9669 (2.67)          2;3  59.5619 (0.83)         18           1
test_bench_prdc_coverage     316.7985 (29.55)    400.5574 (8.16)     354.7417 (25.40)    31.7475 (3.61)     355.2325 (31.95)    43.6299 (29.31)         2;0   2.8190 (0.04)          5           1
test_bench_prdc_density      365.5958 (34.10)    400.6876 (8.16)     382.3611 (27.38)    12.6120 (1.43)     380.4541 (34.22)    12.9641 (8.71)          2;0   2.6153 (0.04)          5           1
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

I tagged any algorithm using faiss with test_bench_my. Using a similarity-tree approach, this line

 real_nearest_neighbour_distances = compute_nearest_neighbour_distances(
        real_features, nearest_k)

in the original code is accelerated considerably due to the efficient lookup of samples in the tree structure.

As such a change would drag in a dependency on faiss, I am reluctant to send a PR to this repo. Let me know what you think!
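
For reference, a minimal sketch of the kind of exact-search replacement described in this issue, assuming faiss is installed; it only replaces the k-NN radius computation and is not code from this repository.

import faiss
import numpy as np


def kth_nn_distances_faiss(features, k):
    # Exact k-th nearest-neighbour distances via a flat (brute-force) faiss index.
    x = np.ascontiguousarray(features, dtype=np.float32)
    index = faiss.IndexFlatL2(x.shape[1])   # exact L2 search, no approximation
    index.add(x)
    # Query k + 1 neighbours because each point retrieves itself at distance 0.
    squared, _ = index.search(x, k + 1)
    return np.sqrt(squared[:, k])           # faiss returns squared L2 distances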

About a specific random embedding extraction method

When extracting random embeddings, does using a randomly initialized VGG16 mean using an untrained model?

Or does it mean training VGG16 only up to a certain threshold, as in deep image prior? If so, how did you set that threshold?

May I know the specific code you used or the code you referenced?

How do I use my own image dataset to run your code?

Hello, your research is very significant to my work.
What I want to ask is: how do I run this code to test my images? I have already prepared the original images and the generated images, but I am not quite sure how to run your program. Can you tell me how to run the code with my own image dataset?

Is it something like this?
run prdc ./generate-images ./real-images
or compute_prdc ./generate-images ./real-images

Feature extraction to obtain vectors

Dear Sir,
it seems that there are no tools in your work for extracting feature vectors from images. Do you have any advice on how I can obtain these feature vectors from my real and fake image datasets?
Thank you so much!
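
The repository itself does not ship a feature extractor: compute_prdc expects precomputed feature arrays, not image folders. As a purely illustrative answer to the two questions above, the hedged sketch below uses ImageNet-pretrained VGG16 features from torchvision; the folder paths and transform choices are placeholders, and this is not the authors' prescribed pipeline.

import numpy as np
import torch
import torchvision
from torch.utils.data import DataLoader

from prdc import compute_prdc

# Hypothetical layout: each root directory contains at least one subfolder of images,
# e.g. ./real-images/all/*.png (ImageFolder requires class subdirectories).
transform = torchvision.transforms.Compose([
    torchvision.transforms.Resize((224, 224)),
    torchvision.transforms.ToTensor(),
    torchvision.transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                     std=[0.229, 0.224, 0.225]),
])


def extract_features(image_dir, batch_size=32):
    # 4096-d pooled VGG16 features; one common choice, not the only valid one.
    # Newer torchvision versions use the weights= argument instead of pretrained=True.
    model = torchvision.models.vgg16(pretrained=True)
    model.classifier = model.classifier[:-1]   # drop the final class-logit layer
    model.eval()
    loader = DataLoader(torchvision.datasets.ImageFolder(image_dir, transform),
                        batch_size=batch_size)
    feats = []
    with torch.no_grad():
        for images, _ in loader:
            feats.append(model(images).cpu().numpy())
    return np.concatenate(feats)


real_features = extract_features("./real-images")
fake_features = extract_features("./generate-images")
print(compute_prdc(real_features=real_features,
                   fake_features=fake_features,
                   nearest_k=5))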
