
When the number of real samples is smaller than 10K, does the metric still produce a reliable score? (generative-evaluation-prdc, closed)

clovaai commented on May 24, 2024
When the number of real samples is smaller than 10K, does the metric still produce a reliable score?


Comments (5)

coallaoh commented on May 24, 2024

Thank you for your interest in our work 👍
We did a bit of analysis around this for the real == fake case, where the D&C metrics should give values close to 1.0.

[Screenshots: Density and Coverage plotted against the number of samples for the real == fake case]

Coverage is very stable with respect to the number of samples. The problem is Density's sensitivity to the number of samples. However, even for Density, I would say 1k samples already give a stable result (low variance around the mean of 1.0).

So my answer to your question is a qualified yes. Please go ahead with 1k samples, but keep in mind that the metrics are computed from samples and are therefore not completely free from sampling variance.
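Below is a minimal sketch of that sanity check, using the compute_prdc helper from this repository. The random Gaussian features, their dimensionality, and nearest_k=5 are illustrative assumptions standing in for real embedding vectors (e.g. Inception features):

```python
import numpy as np
from prdc import compute_prdc  # helper shipped with this repository

rng = np.random.default_rng(0)

# Stand-in for real/fake embeddings: two independent draws from the same
# distribution, so D&C should be close to 1.0 at every sample size.
for num_samples in (500, 1_000, 2_000, 5_000):
    real = rng.standard_normal((num_samples, 64)).astype(np.float32)
    fake = rng.standard_normal((num_samples, 64)).astype(np.float32)
    metrics = compute_prdc(real_features=real, fake_features=fake, nearest_k=5)
    print(num_samples, round(metrics["density"], 3), round(metrics["coverage"], 3))
```

Coverage should stay near 1.0 across all sample sizes, while Density fluctuates more at 500 samples and settles down from roughly 1k samples onward.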


HelenMao commented on May 24, 2024

Thanks for your quick response; it helps me a lot!


HelenMao commented on May 24, 2024

Hi, I am using this metric and P&R to compare several generative models, and I find:

  1. Although FID_A < FID_B, the Coverage and Density of B are better than those of A.
  2. The trends of P&R and D&C are not consistent. For example, the Recall of one model is smaller while its Coverage is larger. (This is not the outlier case reported in the paper.)
  3. When I choose different K, the ranking of Coverage and Density among these models changes. For example, with K=3, C_A is much better than the other methods, but with K=5, C_A is the worst (see the sketch after this list).
    Moreover, in the conditional setting, P&R and D&C do not seem very consistent with FID: there are many cases where FID is good but D&C are worse.
    Do you have any experience with that?
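A hypothetical sketch of the K-sensitivity check in point 3, assuming real_features and per-model fake feature arrays are already extracted; the helper name rank_models_by_coverage and the K values are illustrative:

```python
from prdc import compute_prdc  # helper shipped with generative-evaluation-prdc

def rank_models_by_coverage(real_features, fake_features_by_model, k_values=(3, 5, 7, 10)):
    """For every nearest_k, return the model names sorted by Coverage (best first)."""
    rankings = {}
    for k in k_values:
        scores = {
            name: compute_prdc(real_features, fake, nearest_k=k)["coverage"]
            for name, fake in fake_features_by_model.items()
        }
        rankings[k] = sorted(scores, key=scores.get, reverse=True)
    return rankings
```

If the ordering differs between, say, K=3 and K=5, the comparison is not robust to the choice of K, which is exactly the instability described above.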


coallaoh commented on May 24, 2024

Yes, we have also experienced certain inconsistencies in the model rankings across metrics, and I personally understand your pain. Unfortunately, there is no quick solution to the problems you are having; I should say they are deeply rooted in the difficulty of evaluating generative models.

Maybe you have already thought about this, but let's think a bit about how to judge whether an evaluation metric is doing the right job. It's not so easy ;) I believe there are two ways.

  1. The metric is by definition what we want. For example, the accuracy metric is defined as the proportion of the correct predictions among all predictions made. This metric, by definition, is the exact representation of what humans generally want from a model. However, it is not always easy to build such a fully "interpretable" evaluation metric - e.g. building an evaluation metric for generative models. How do you algorithmically quantify the fidelity and diversity? There is no easy way. Thus, we and previous researchers have come up with proxies like CNN embeddings and KDE-like density estimators. But among many proxy metrics, how do we judge if metric A is better than metric B? This question leads to the second way of "evaluating an evaluation metric".

  2. Build a few test cases where you know how the metrics should behave, and see whether the metrics pass these tests. This is the method we adopted in our paper, with a handful of test cases where FID and P&R fail while D&C thrive. We cannot say that we have covered all meaningful test cases for evaluating generative models, but we did our best to cover them, seeking advice from researchers who have been working with generative models for years. And yet, it is very likely that D&C still fail in certain cases; we hope future researchers will find them and propose improved metrics. (A toy example of such a test case is sketched after this list.)
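As a toy illustration of point 2, here is a synthetic test case where the correct verdict is known in advance: the fake distribution drops one of two real modes, so a diversity-sensitive metric should report Coverage well below 1.0. This is an assumed sketch for intuition, not one of the test cases from the paper:

```python
import numpy as np
from prdc import compute_prdc

rng = np.random.default_rng(0)
n, dim = 2_000, 16

# Real data: an even mixture of two well-separated Gaussian modes.
mode_a = rng.standard_normal((n // 2, dim)) + 10.0
mode_b = rng.standard_normal((n // 2, dim)) - 10.0
real = np.concatenate([mode_a, mode_b]).astype(np.float32)

# Fake data: samples from only one of the modes (mode dropping).
fake = (rng.standard_normal((n, dim)) + 10.0).astype(np.float32)

metrics = compute_prdc(real_features=real, fake_features=fake, nearest_k=5)
print(metrics)  # Coverage should land near 0.5; Precision and Density stay high.
```

A metric that still reports near-perfect scores on a case like this is missing the dropped mode, which is the kind of failure such test cases are designed to expose.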

In this context, we can only say that D&C remedy key shortcomings of FID and P&R, rather than claiming that D&C are the definitive evaluation metrics to be used.

For your problem cases 1 and 2, we can't say whether D&C are failing, because they may be rectifying wrong evaluation results given by FID or P&R. They are not "test cases" in the sense of point 2 above, where the desired metric values or rankings are known. They are intriguing inconsistencies, but they are inconclusive as to which metric is doing the right job.

For problem case 3, the ranking's dependence on K is definitely a shortcoming of D&C. It is unfortunate, but partly expected, because there is no guarantee that D&C are perfect metrics.

Sorry that my answers do not really solve any of your issues. But I can tell you that we have the same kind of issues, and they are deeply rooted in the inherent difficulty of evaluating generative models.


HelenMao commented on May 24, 2024

Thanks for your detailed reply :)
Yes, you are right. It is really frustrating when I try to use all the metrics to evaluate the models and get inconsistent results, since I cannot draw any conclusions from them.
But just as you say, evaluating generative models is indeed difficult when there is no ground truth for evaluating the metric itself.
Thanks again for the discussion 👍

