
Comments (6)

ydennisy commented on May 22, 2024

@cgnorthcutt +1 on this old question.


cgnorthcutt commented on May 22, 2024

Hi @ydennisy and @cindyyj ,

For model selection, you could try to 'estimate' what the F1-score or accuracy would be on a clean test set, but the problem with this approach is the following: if your estimation method were good enough to accurately estimate the error in the test set, why not just fix the error in the test set directly? So the most straightforward solution is to do exactly that: clean your test set, too (not just the train set). Here is a paper where we (myself in collaboration with ChipBrain, MIT, and Amazon) use cleanlab to clean and correct 10 of the most popular machine learning datasets: https://arxiv.org/abs/2103.14749

So, step-by-step:

  1. You'll need a test set for model selection. Choose one with the same distribution as your train set.
  2. Obtain psx (the predicted probability matrix for every example in your test set, for every class). You can learn more about the definition of psx here.
  • To compute psx, train a model on the train set, then compute the predicted probabilities for each example in the test set. (Optionally, for slightly better results, after training on the train set you can fine-tune the model on the test set itself, training for just a small number of epochs with cross-validation, to obtain out-of-sample predicted probabilities on the test set.)
  3. You already have s (the noisy labels of the test set).
  4. Find the label errors of the test set using: https://github.com/cgnorthcutt/cleanlab/blob/master/cleanlab/pruning.py#L245 (here is the documentation)
  5. Clean up your test set by removing the errors. (You can correct some errors too; see the second-to-last point below.) A code sketch of these steps follows below.
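
For concreteness, here is a minimal sketch of those five steps, assuming the cleanlab 1.x API linked above and a scikit-learn classifier as a stand-in for your own model; the variable names (X_train, y_train_noisy, X_test, y_test_noisy) are placeholders:

import numpy as np
from sklearn.linear_model import LogisticRegression
from cleanlab.pruning import get_noise_indices

# Step 2: train on the (noisy) train set, then compute predicted probabilities
# for every test example and every class (this is psx for the test set).
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train_noisy)
psx_test = model.predict_proba(X_test)

# Steps 3-4: s is just the noisy test labels; find the likely label errors.
error_mask = get_noise_indices(s=y_test_noisy, psx=psx_test)

# Step 5: drop the flagged examples to obtain a cleaner test set for model selection.
X_test_clean = X_test[~error_mask]
y_test_clean = y_test_noisy[~error_mask]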

Some more details and options to further clean up your test set:

  • Consider cleaning your train set using a similar procedure to the one above (i.e., using cleanlab), then do a final model training using co-teaching (example here) on the cleaned train set, and use this improved model to compute psx on the test set -- you may find more/different errors this way.
  • Consider iteratively removing errors from the training set (by only removing half of the examples found using cleanlab) and then repeating. This way the model iteratively improves in performance and can find more errors (this is an empirical trick that can work in some settings).
  • Once you have a model trained on a fairly clean train set, you can identify errors in the test set with higher accuracy. Now validate the quality of the remaining test set by checking approx. 10 examples in each class, by hand. (On ImageNet, that's 10,000 examples, which takes a long day, but it's doable.) There are some tricks to avoid this step. One thing you can do instead is check the label errors (which should be a much smaller set than your original dataset, because the amount of noise should be less than the amount of clean data). If the label errors seem reasonably accurate, it's possible that your test set still contains some errors that were missed, but at least you know the test set is cleaner (and you can still check a few examples).
  • Note that cleanlab ranks the errors for you in the cleanlab.pruning.get_noise_indices method. So you can choose to 'correct' the labels for the top X%, then re-train and re-find the errors, using the corrected train set (see the sketch after this list). Just be careful how you set X. By iteratively correcting easily-correctable examples (those with a very high probability of being another class and a very low probability of belonging to their given-label class) and throwing out the most egregious errors (low probability of belonging to their given-label class, as well as to any other class), you can run cleanlab many times to iteratively build up a set of corrected labels, benign labels (ones that seem like they were already labeled correctly), and errors (ones you need to throw out). cleanlab doesn't do this by default (yet) because if you set X improperly, you can relabel things incorrectly and your labels can get worse! So be careful if you take this approach! Again, the easiest way is the 5 steps at the top of this comment (just find psx and use cleanlab to remove all the estimated errors) -- you might throw out some correctly labeled data, but you'll also throw out the errors found by cleanlab.
  • A final option: if you don't mind having a tiny test set, you can just hand-pick a few examples from each class.
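
As a rough illustration of the ranked-correction idea in the second-to-last point, here is a short sketch, again assuming the cleanlab 1.x API; X (the fraction to auto-correct) and the variable names are placeholders, and as noted above you should set X conservatively:

import numpy as np
from cleanlab.pruning import get_noise_indices

# Ask cleanlab for a ranking of the likely label errors instead of a boolean mask.
ranked_error_idx = get_noise_indices(
    s=y_test_noisy,
    psx=psx_test,
    sorted_index_method='normalized_margin',
)

# 'Correct' only the top X fraction of the ranked errors by relabeling them with
# the argmax predicted class; remove or hand-check the rest.
X = 0.10  # placeholder; setting this too high can make the labels worse
top = ranked_error_idx[: int(X * len(ranked_error_idx))]
y_test_corrected = np.array(y_test_noisy).copy()
y_test_corrected[top] = psx_test[top].argmax(axis=1)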

By the way, in response to your comment that "for real world scenario, the label noise is unknown": note that unlike other packages, cleanlab estimates the label noise for you directly (provably accurately under reasonable conditions), for the train set or the test set -- whichever set you run cleanlab on. cleanlab works using confident learning (see paper) -- confident learning (and therefore cleanlab) estimates the noise in the dataset without knowing the true labels. Here is some code explaining more about this.

from cleanlab.latent_estimation import (
    estimate_joint,
    estimate_py_and_noise_matrices_from_probabilities,
)

# s is the array of noisy labels and psx is the out-of-sample predicted
# probability matrix, as defined in the steps above.

# cleanlab can directly estimate the noise transition matrices and the prior over true labels.
(
    prior_distribution_over_true_class_labels_py,
    noise_transition_matrix,
    inverse_noise_transition_matrix,
    confident_joint,
) = estimate_py_and_noise_matrices_from_probabilities(
    s,
    psx,
    thresholds=None,
    converge_latent_estimates=True,
    py_method='cnt',
    calibrate=True,
)

# cleanlab can also find the joint distribution of true labels and noisy labels.
joint = estimate_joint(
    s=s,
    psx=psx,
)
# From the joint, you can compute anything, including the noise transition matrices and true distribution priors.
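
For intuition, here is a short numpy sketch of how the noise transition matrix and the prior over true labels fall out of the joint; it assumes the convention that joint[i][j] = P(noisy label = i, true label = j):

import numpy as np

py = joint.sum(axis=0)       # prior over true labels, p(y*)
noise_matrix = joint / py    # P(noisy label | true label): each column j divided by p(y* = j)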


ydennisy commented on May 22, 2024

@cgnorthcutt thanks for this very detailed response!

I am sorry, but I still have a question: in your steps, we train our model on the train set (with errors) and then use that model to remove errors from the test set - is that correct?

Once we have this cleaned test set - we can use it to select models, etc.

From a basic perspective, is this the same as removing all samples where a model that can predict probabilities is unsure about the class label?

Is it an issue that the train set was still noisy?

What is the difference between the approach you suggested and just showing the full dataset to the model and using the noise_mask to clean it all in one go, before the train/test split? I understand this feels incorrect, since we have shown the full set to a model - but I am just trying to get some intuition as to why using only the train data to clean the test set is better, as this also feels a little like cheating :)


cgnorthcutt commented on May 22, 2024

It looks like your questions are about how confident learning works, why it works, and how it applies to various cases. I highly recommend reading the paper here: https://arxiv.org/abs/1911.00068

Or for a more accessible version, the blog here: https://l7.curtisnorthcutt.com/confident-learning

Please let me know if anything in those publications doesn't make sense.


ydennisy commented on May 22, 2024

Hi @cgnorthcutt

I have read both the paper and blog post already :) The process makes sense.

I think my above comment had too many questions in it. My main worry is that if we use some process to learn on the train set and then use that same model to prune examples from the test set, we are basically throwing away the hard examples on which the model would have failed. So it feels a little like cheating.

Apologies if this question still comes across as not having understood the underlying CL process - I will read the paper again.

Thanks in advance.


jwmueller commented on May 22, 2024

Closing this issue due to lack of activity. Feel free to reopen if you still have questions!

And yes, when cleaning labels in test data (which is highly recommended for proper model selection/evaluation), you need to be careful not to introduce bias into the dataset. We've recently created a tool called Cleanlab Studio to help you quickly verify the cleanlab-corrected labels to easily ensure they are trustworthy: https://cleanlab.ai/studio/

