clovaai / generative-evaluation-prdc

Code base for the precision, recall, density, and coverage metrics for generative models. ICML 2020.

License: MIT License

Languages: Python 100.00%

Topics: deep-learning, generative-adversarial-network, evaluation-metrics, precision, recall, machine-learning, generative-model, fidelity, diversity, evaluation

generative-evaluation-prdc's Introduction


Reliable Fidelity and Diversity Metrics for Generative Models (ICML 2020)

Paper: Reliable Fidelity and Diversity Metrics for Generative Models

Muhammad Ferjad Naeem (1,3,*), Seong Joon Oh (2,*), Yunjey Choi (1), Youngjung Uh (1), Jaejun Yoo (1,4)

Work done at Clova AI Research

(*) Equal contribution. (1) Clova AI Research, NAVER Corp. (2) Clova AI Research, LINE Plus Corp. (3) Technische Universität München. (4) EPFL.

Devising indicative evaluation metrics for the image generation task remains an open problem. The most widely used metric for measuring the similarity between real and generated images has been the Fréchet Inception Distance (FID) score. Because it does not differentiate the fidelity and diversity aspects of the generated images, recent papers have introduced variants of precision and recall metrics to diagnose those properties separately. In this paper, we show that even the latest version of the precision and recall (Kynkäänniemi et al., 2019) metrics are not reliable yet. For example, they fail to detect the match between two identical distributions, they are not robust against outliers, and the evaluation hyperparameters are selected arbitrarily. We propose density and coverage metrics that solve the above issues. We analytically and experimentally show that density and coverage provide more interpretable and reliable signals for practitioners than the existing metrics.


Updates

  • 1 June 2020: Paper accepted at ICML 2020.

1. Background

Precision and recall metrics

Precision and recall are defined below:

$$\text{precision} := \frac{1}{M}\sum_{j=1}^{M} 1\left[\,Y_j \in \text{manifold}(X_1,\dots,X_N)\,\right]
\qquad
\text{recall} := \frac{1}{N}\sum_{i=1}^{N} 1\left[\,X_i \in \text{manifold}(Y_1,\dots,Y_M)\,\right]$$

where $X_1,\dots,X_N$ are the real samples, $Y_1,\dots,Y_M$ are the fake samples, and the manifold is defined as

$$\text{manifold}(X_1,\dots,X_N) := \bigcup_{i=1}^{N} B\left(X_i,\ \text{NND}_k(X_i)\right).$$

$B(x, r)$ is the ball around the point $x$ with radius $r$.

$\text{NND}_k(X_i)$ is the distance from $X_i$ to its $k$-th nearest neighbour among $\{X_i\}$, excluding itself.
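
For concreteness, here is a minimal NumPy/SciPy sketch of these two definitions. It is written directly from the formulas above rather than taken from this repository's compute_prdc, and the helper names are illustrative.

import numpy as np
from scipy.spatial.distance import cdist  # assumes SciPy is available


def kth_nn_distances(features, k):
    """Distance from each point to its k-th nearest neighbour (excluding itself)."""
    d = cdist(features, features)      # pairwise Euclidean distances
    return np.sort(d, axis=1)[:, k]    # column 0 is the self-distance 0


def precision_recall(real, fake, k):
    radii_real = kth_nn_distances(real, k)   # ball radii around real samples
    radii_fake = kth_nn_distances(fake, k)   # ball radii around fake samples
    d_rf = cdist(real, fake)                 # rows: real, columns: fake

    # precision: fraction of fake samples inside at least one real-sample ball
    precision = (d_rf < radii_real[:, None]).any(axis=0).mean()
    # recall: fraction of real samples inside at least one fake-sample ball
    recall = (d_rf < radii_fake[None, :]).any(axis=1).mean()
    return precision, recall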

Density and coverage metrics

Density and coverage are defined below:

$$\text{density} := \frac{1}{kM}\sum_{j=1}^{M}\sum_{i=1}^{N} 1\left[\,Y_j \in B\left(X_i,\ \text{NND}_k(X_i)\right)\right]
\qquad
\text{coverage} := \frac{1}{N}\sum_{i=1}^{N} 1\left[\,\exists\, j\ \text{s.t.}\ Y_j \in B\left(X_i,\ \text{NND}_k(X_i)\right)\right]$$
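
A matching sketch for density and coverage, under the same assumptions (plain NumPy/SciPy written from the definitions above, with illustrative names; the packaged compute_prdc is the reference implementation):

import numpy as np
from scipy.spatial.distance import cdist  # assumes SciPy is available


def density_coverage(real, fake, k):
    # Radius of the ball around each real sample: distance to its k-th nearest
    # real neighbour (column 0 of the sorted distances is the self-distance 0).
    radii_real = np.sort(cdist(real, real), axis=1)[:, k]
    d_rf = cdist(real, fake)                # rows: real, columns: fake
    inside = d_rf < radii_real[:, None]     # (N, M) ball-membership matrix

    # density: average number of real-sample balls covering each fake sample, divided by k
    density = inside.sum(axis=0).mean() / k
    # coverage: fraction of real-sample balls that contain at least one fake sample
    coverage = inside.any(axis=1).mean()
    return density, coverage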

Why are DC better than PR?

Precision versus density.

Precision versus Density. Because of the real outlier sample, the manifold is overestimated. Generating many fake samples around the real outlier is enough to increase the precision measure. The problem of overestimating precision (100%) is resolved using the density estimate (60%).

Recall versus coverage.

Recall versus Coverage. The real and fake samples are identical across left and right. Since models often generate many unrealistic yet diverse samples, the fake manifold is often an overestimation of the true fake distribution. In the figure above, while the fake samples are generally far from the modes in real samples, the recall measure is rewarded by the fact that real samples are contained in the overestimated fake manifold.
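
To see the outlier effect numerically, one can run a small variant of this experiment with the packaged compute_prdc. The setup below is illustrative, not the exact configuration from the paper: a single real outlier is planted far from the real mode and all fake samples are placed around it.

import numpy as np
from prdc import compute_prdc

rng = np.random.default_rng(0)
real = rng.normal(size=(1000, 64))
real[0] += 50.0                                      # one real outlier far from the mode
fake = real[0] + 0.1 * rng.normal(size=(1000, 64))   # fake samples clustered around the outlier

print(compute_prdc(real_features=real, fake_features=fake, nearest_k=5))
# Expected behaviour (not exact numbers): precision saturates near 1 because the
# outlier's huge ball covers every fake sample, while density stays well below 1;
# recall and coverage are near 0 since the real mode itself is never covered.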

2. Usage

Installation

pip3 install prdc

Example

Test 10000 real and fake samples drawn from the standard normal distribution N(0, I) in 1000-dimensional Euclidean space, with the nearest-neighbour parameter k=5. We compute precision, recall, density, and coverage estimates below.

import numpy as np
from prdc import compute_prdc


num_real_samples = num_fake_samples = 10000
feature_dim = 1000
nearest_k = 5
real_features = np.random.normal(loc=0.0, scale=1.0,
                                 size=[num_real_samples, feature_dim])

fake_features = np.random.normal(loc=0.0, scale=1.0,
                                 size=[num_fake_samples, feature_dim])

metrics = compute_prdc(real_features=real_features,
                       fake_features=fake_features,
                       nearest_k=nearest_k)

print(metrics)

The above test code will produce estimates similar to the following (values may fluctuate due to randomness).

{'precision': 0.4772,
 'recall': 0.4705,
 'density': 1.0555,
 'coverage': 0.9735}

3. Miscellaneous

References

Kynkäänniemi et al., 2019. Improved Precision and Recall Metric for Assessing Generative Models. NeurIPS 2019.

License

Copyright (c) 2020-present NAVER Corp.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.

Cite this work

@inproceedings{ferjad2020icml,
  title = {Reliable Fidelity and Diversity Metrics for Generative Models},
  author = {Naeem, Muhammad Ferjad and Oh, Seong Joon and Uh, Youngjung and Choi, Yunjey and Yoo, Jaejun},
  year = {2020},
  booktitle = {International Conference on Machine Learning},
}

generative-evaluation-prdc's People

Contributors

clovaaiadmin, coallaoh


generative-evaluation-prdc's Issues

Dummy example gives a non-intuitive result

Hello and thank you for this great paper and implementation.

I've run your method with a dummy example:

    import torch

    fake_features = torch.ones((1024, 4096))
    real_features = torch.ones((1024, 4096))

and would expect 1.0 for both density and coverage, but actually got 0.0.

There are two changes to the density metric that might help:

  1. Use less-than-or-equal instead of strict less-than:
     (distance_real_fake < np.expand_dims(real_nearest_neighbour_distances, axis=1))
     =>
     (distance_real_fake <= np.expand_dims(real_nearest_neighbour_distances, axis=1))
  2. Clamp with self.nearest_k to enforce a result in [0, 1]:
     (distance_real_fake <= real_nearest_neighbour_distances.unsqueeze(1)).sum(dim=0)
     =>
     (distance_real_fake <= real_nearest_neighbour_distances.unsqueeze(1)).sum(dim=0).clamp(0, self.nearest_k)

Does this make sense, or am I missing something?

Density much larger than 1

Hi! I understand that the density metric is not upper-bounded by 1 and that the expected density for two identical distributions is 1. However, when I evaluate the density for StyleGAN2 trained on FFHQ, the density is much larger than 1. For the pre-trained StyleGAN2-F, the density is around 1.12. For a fine-tuned StyleGAN2 that obtains higher precision, the density goes up to around 1.5. Is this an ill behavior of the density metric?

Thanks in advance!

using exact similarity search

Thank you for providing the code and implementation for density and coverage. It is awesome to have the code ready for use in practice. I cite the paper whenever possible.

I took the liberty to research possible improvements. I found that using an exact similarity search as offered by faiss can speed up the calculation of density and coverage by a great deal.

Here are my results for num_real_samples = num_fake_samples = 1024, feature_dim = 12, nearest_k = 5:

--------------------------------------------------------------------------------------- benchmark: 4 tests ---------------------------------------------------------------------------------------
Name (time in ms)                 Min                 Max                Mean             StdDev              Median                IQR            Outliers      OPS            Rounds  Iterations
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_bench_my_coverage        10.7225 (1.0)       53.4196 (1.09)      13.9665 (1.0)       8.7931 (1.0)       11.1186 (1.0)       1.4884 (1.0)           1;4  71.5998 (1.0)          24           1
test_bench_my_density         11.9193 (1.11)      49.0908 (1.0)       16.7892 (1.20)      9.0503 (1.03)      12.8918 (1.16)      3.9669 (2.67)          2;3  59.5619 (0.83)         18           1
test_bench_prdc_coverage     316.7985 (29.55)    400.5574 (8.16)     354.7417 (25.40)    31.7475 (3.61)     355.2325 (31.95)    43.6299 (29.31)         2;0   2.8190 (0.04)          5           1
test_bench_prdc_density      365.5958 (34.10)    400.6876 (8.16)     382.3611 (27.38)    12.6120 (1.43)     380.4541 (34.22)    12.9641 (8.71)          2;0   2.6153 (0.04)          5           1
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

I tagged any algorithm using faiss with test_bench_my. Using a similarity-tree approach, this line

 real_nearest_neighbour_distances = compute_nearest_neighbour_distances(
        real_features, nearest_k)

in the original code is accelerated considerably due to the efficient lookup of samples in the tree structure.

As such a change would drag in a dependency on faiss, I am reluctant to send a PR to this repo. Let me know what you think!
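
For reference, a minimal sketch of the kind of exact-search replacement described in this issue, assuming faiss is installed; it only replaces the k-NN radius computation and is not code from this repository.

import faiss
import numpy as np


def kth_nn_distances_faiss(features, k):
    # Exact k-th nearest-neighbour distances via a flat (brute-force) faiss index.
    x = np.ascontiguousarray(features, dtype=np.float32)
    index = faiss.IndexFlatL2(x.shape[1])   # exact L2 search, no approximation
    index.add(x)
    # Query k + 1 neighbours because each point retrieves itself at distance 0.
    squared, _ = index.search(x, k + 1)
    return np.sqrt(squared[:, k])           # faiss returns squared L2 distances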

About a specific random embedding extraction method

When extracting random embeddings, does using a randomly initialized VGG16 mean using an untrained model?

Or does it mean training VGG16 only up to a certain threshold, as in deep image prior? If so, how did you set that threshold?

May I know the specific code you used or the code you referenced?

How do I use my own image dataset to run your code?

Hello, your research is very significant to my work.
What I want to ask is: how do I run this code to test my images? I have already prepared the original images and the generated images, but I am not quite sure how to run your program. Can you tell me how to run the code with my own image dataset?

Is it something like this?
run prdc ./generate-images ./real-images
or compute_prdc ./generate-images ./real-images

Feature extraction to obtain vectors

Dear Sir,
it seems that there are no tools in your work for extracting feature vectors from images. Do you have any advice on how I can obtain these feature vectors from my real and fake image datasets?
Thank you so much!
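
The repository itself does not ship a feature extractor: compute_prdc expects precomputed feature arrays, not image folders. As a purely illustrative answer to the two questions above, the hedged sketch below uses ImageNet-pretrained VGG16 features from torchvision; the folder paths and transform choices are placeholders, and this is not the authors' prescribed pipeline.

import numpy as np
import torch
import torchvision
from torch.utils.data import DataLoader

from prdc import compute_prdc

# Hypothetical layout: each root directory contains at least one subfolder of images,
# e.g. ./real-images/all/*.png (ImageFolder requires class subdirectories).
transform = torchvision.transforms.Compose([
    torchvision.transforms.Resize((224, 224)),
    torchvision.transforms.ToTensor(),
    torchvision.transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                     std=[0.229, 0.224, 0.225]),
])


def extract_features(image_dir, batch_size=32):
    # 4096-d pooled VGG16 features; one common choice, not the only valid one.
    # Newer torchvision versions use the weights= argument instead of pretrained=True.
    model = torchvision.models.vgg16(pretrained=True)
    model.classifier = model.classifier[:-1]   # drop the final class-logit layer
    model.eval()
    loader = DataLoader(torchvision.datasets.ImageFolder(image_dir, transform),
                        batch_size=batch_size)
    feats = []
    with torch.no_grad():
        for images, _ in loader:
            feats.append(model(images).cpu().numpy())
    return np.concatenate(feats)


real_features = extract_features("./real-images")
fake_features = extract_features("./generate-images")
print(compute_prdc(real_features=real_features,
                   fake_features=fake_features,
                   nearest_k=5))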
