
When the number of real samples is smaller than 10K, does the metric still produce a reliable score? (generative-evaluation-prdc, closed)

clovaai commented on May 24, 2024
When the number of real samples is smaller than 10K, does the metric still produce a reliable score?


Comments (5)

coallaoh commented on May 24, 2024

Thank you for your interest in our work 👍
We did a bit of analysis around this for the real == fake case, where the D&C metrics should give values close to 1.0.

[Screenshots: Density and Coverage plotted against the number of samples for the real == fake case]

Coverage is very stable with respect to the number of samples. The problem is Density's sensitivity to the number of samples. However, even for Density, I would say 1k samples already give a stable result (low variance around the mean of 1.0).

So my answer to your question is a qualified yes. Please go ahead with 1k samples, but keep in mind that the metrics are computed from samples and are therefore not completely free from sampling variance.
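Below is a minimal sketch of that sanity check, using the compute_prdc helper from this repository. The random Gaussian features, their dimensionality, and nearest_k=5 are illustrative assumptions standing in for real embedding vectors (e.g. Inception features):

```python
import numpy as np
from prdc import compute_prdc  # helper shipped with this repository

rng = np.random.default_rng(0)

# Stand-in for real/fake embeddings: two independent draws from the same
# distribution, so D&C should be close to 1.0 at every sample size.
for num_samples in (500, 1_000, 2_000, 5_000):
    real = rng.standard_normal((num_samples, 64)).astype(np.float32)
    fake = rng.standard_normal((num_samples, 64)).astype(np.float32)
    metrics = compute_prdc(real_features=real, fake_features=fake, nearest_k=5)
    print(num_samples, round(metrics["density"], 3), round(metrics["coverage"], 3))
```

Coverage should stay near 1.0 across all sample sizes, while Density fluctuates more at 500 samples and settles down from roughly 1k samples onward.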


HelenMao commented on May 24, 2024

Thanks for your quick response; it helps me a lot!


HelenMao commented on May 24, 2024

Hi, I am using this metric and P&R to compare several generative models, and I find:

  1. Although FID_A < FID_B, the Coverage and Density of B are better than those of A.
  2. The trends of P&R and D&C are not consistent. For example, the Recall of one model is smaller while its Coverage is larger. (This is not the outlier case reported in the paper.)
  3. When I choose different K, the ranking of Coverage and Density among these models changes. For example, with K=3, C_A is much better than the other methods, but with K=5, C_A is the worst (see the sketch after this list).
    Moreover, in the conditional setting, P&R and D&C do not seem very consistent with FID: there are many cases where FID is good but D&C are worse.
    Do you have any experience with that?
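A hypothetical sketch of the K-sensitivity check in point 3, assuming real_features and per-model fake feature arrays are already extracted; the helper name rank_models_by_coverage and the K values are illustrative:

```python
from prdc import compute_prdc  # helper shipped with generative-evaluation-prdc

def rank_models_by_coverage(real_features, fake_features_by_model, k_values=(3, 5, 7, 10)):
    """For every nearest_k, return the model names sorted by Coverage (best first)."""
    rankings = {}
    for k in k_values:
        scores = {
            name: compute_prdc(real_features, fake, nearest_k=k)["coverage"]
            for name, fake in fake_features_by_model.items()
        }
        rankings[k] = sorted(scores, key=scores.get, reverse=True)
    return rankings
```

If the ordering differs between, say, K=3 and K=5, the comparison is not robust to the choice of K, which is exactly the instability described above.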


coallaoh commented on May 24, 2024

Yes, we have also experienced certain inconsistencies in the model rankings across metrics, and I personally understand your pain. Unfortunately, there is no quick solution to the problems you are having; I should say they are deeply rooted in the difficulty of evaluating generative models.

Maybe you have already thought about this, but let's think a bit about how to judge whether an evaluation metric is doing the right job. It's not so easy ;) I believe there are two ways.

  1. The metric is by definition what we want. For example, the accuracy metric is defined as the proportion of the correct predictions among all predictions made. This metric, by definition, is the exact representation of what humans generally want from a model. However, it is not always easy to build such a fully "interpretable" evaluation metric - e.g. building an evaluation metric for generative models. How do you algorithmically quantify the fidelity and diversity? There is no easy way. Thus, we and previous researchers have come up with proxies like CNN embeddings and KDE-like density estimators. But among many proxy metrics, how do we judge if metric A is better than metric B? This question leads to the second way of "evaluating an evaluation metric".

  2. Build a few test cases where you know how the metrics should behave, and see whether the metrics pass these tests. This is the method we adopted in our paper, with a handful of test cases where FID and P&R fail while D&C thrive. We cannot say that we have covered all meaningful test cases for evaluating generative models, but we did our best to cover them, seeking advice from researchers who have been working with generative models for years. And yet, it is very likely that D&C still fail in certain cases; we hope future researchers will find them and propose improved metrics. (A toy example of such a test case is sketched after this list.)
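As a toy illustration of point 2, here is a synthetic test case where the correct verdict is known in advance: the fake distribution drops one of two real modes, so a diversity-sensitive metric should report Coverage well below 1.0. This is an assumed sketch for intuition, not one of the test cases from the paper:

```python
import numpy as np
from prdc import compute_prdc

rng = np.random.default_rng(0)
n, dim = 2_000, 16

# Real data: an even mixture of two well-separated Gaussian modes.
mode_a = rng.standard_normal((n // 2, dim)) + 10.0
mode_b = rng.standard_normal((n // 2, dim)) - 10.0
real = np.concatenate([mode_a, mode_b]).astype(np.float32)

# Fake data: samples from only one of the modes (mode dropping).
fake = (rng.standard_normal((n, dim)) + 10.0).astype(np.float32)

metrics = compute_prdc(real_features=real, fake_features=fake, nearest_k=5)
print(metrics)  # Coverage should land near 0.5; Precision and Density stay high.
```

A metric that still reports near-perfect scores on a case like this is missing the dropped mode, which is the kind of failure such test cases are designed to expose.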

In this context, we can only say that D&C remedy key shortcomings of FID and P&R, rather than claiming that D&C are the definitive evaluation metrics to be used.

For your problem cases 1 and 2, we can't say whether D&C are failing, because they may be rectifying wrong evaluation results given by FID or P&R. They are not "test cases" in the sense of point 2 above, where the desired metric values or rankings are known. They are intriguing inconsistencies, but they are inconclusive as to which metric is doing the right job.

For problem case 3, the ranking's dependence on K is definitely a shortcoming of D&C. It is unfortunate, but partly expected, because there is no guarantee that D&C are perfect metrics.

Sorry that my answers do not really solve any of your issues. But I can tell you that we have the same kind of issues, and they are deeply rooted in the inherent difficulty of evaluating generative models.


HelenMao commented on May 24, 2024

Thanks for your detailed reply :)
Yes, you are right. It is really frustrating when I try to use all the metrics to evaluate the models and get inconsistent results, since I cannot draw any conclusions from them.
But just as you say, evaluating generative models is indeed difficult when there is no ground truth for evaluating the metric itself.
Thanks again for the discussion 👍

