Comments (6)
@cgnorthcutt +1 on this old question.
For model selection, you can try to 'estimate' what the F1-score or accuracy would be with a clean test set, but this approach has a catch: if your estimation method is good enough to accurately estimate the error in the test set, why not just fix the errors in the test set directly? So the most straightforward solution is to do exactly that: clean your test set, too (not just the train set). Here is a paper where we (myself in collaboration with ChipBrain, MIT, and Amazon) use cleanlab to find and correct label errors in 10 of the most popular machine learning datasets: https://arxiv.org/abs/2103.14749
So, step-by-step:
- You'll need a test set for model selection. Choose one with the same distribution as your train set.
- Obtain `psx` (the predicted probability matrix: for every example in your test set, the predicted probability of every class). You can learn more about the definition of `psx` here.
- To compute `psx`, train a model on the train set, then compute the predicted probabilities for each example in the test set. (Optionally, for slightly better results, after training on the train set you can fine-tune the model on the test set itself, training for just a few epochs with cross-validation, to obtain out-of-sample predicted probabilities on the test set.)
- You already have `s` (the noisy labels of the test set).
- Find the label errors of the test set using: https://github.com/cgnorthcutt/cleanlab/blob/master/cleanlab/pruning.py#L245 (here is the documentation)
- Clean up your test set by removing the errors. (You can correct some errors too; see the second-to-last point below.)
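The steps above can be sketched end-to-end with scikit-learn alone. Note the per-class thresholding used here to flag errors is a simplified stand-in for cleanlab's actual confident-learning algorithm, and all the data is synthetic; it only illustrates the train-on-train, audit-the-test-set workflow.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy data: X_train/s_train play the role of the (noisy) train set,
# X_test/s_test the (noisy) test set whose labels we want to audit.
X_train = rng.normal(size=(200, 5))
s_train = (X_train[:, 0] > 0).astype(int)
X_test = rng.normal(size=(100, 5))
s_test = (X_test[:, 0] > 0).astype(int)
s_test[:5] = 1 - s_test[:5]  # inject a few label errors

# Train a model on the train set, then compute psx on the test set.
model = LogisticRegression().fit(X_train, s_train)
psx = model.predict_proba(X_test)  # shape (n_test, n_classes)

# Simplified error detection (NOT cleanlab's algorithm): flag examples
# whose given label's predicted probability falls below that class's
# average self-confidence.
thresholds = np.array([psx[s_test == k, k].mean() for k in range(2)])
flagged = psx[np.arange(len(s_test)), s_test] < thresholds[s_test]

# Clean test set: drop flagged examples before doing model selection.
X_test_clean = X_test[~flagged]
s_test_clean = s_test[~flagged]
print(flagged.sum(), "examples flagged as potential label errors")
```

In the real workflow you would replace the thresholding step with `cleanlab.pruning.get_noise_indices(s_test, psx)` from the link above.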
Some more details and options to further clean up your test set:
- Consider cleaning your train set with a similar procedure to the one above (i.e., using cleanlab), then do a final model training using co-teaching (example here) on the cleaned train set, and use this improved model to compute `psx` on the test set -- you may find more/different errors this way.
- Consider iteratively removing errors from the train set (e.g., removing only half of the examples cleanlab finds) and then repeating. This way the model improves iteratively and can find more errors (an empirical trick that works in some settings).
- Once you have a model trained on a fairly clean train set, you can identify errors in the test set with higher accuracy. Now validate the quality of the remaining test set by hand-checking approximately 10 examples in each class. (On ImageNet, that's 10,000 examples, which takes a long day, but it's doable.) There are some tricks to avoid this step. One alternative is to check only the flagged label errors (which should be a much smaller set than your original dataset, because the amount of noise should be less than the amount of clean data). If the flagged errors look reasonably accurate, it's possible that your test set still contains some errors that were missed, but at least you know it is cleaner (and you can still spot-check a few examples).
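The hand-verification step above (roughly 10 examples per class) amounts to a stratified sample. A minimal helper for drawing it, with illustrative names rather than any cleanlab API:

```python
import numpy as np

def sample_for_review(labels, n_per_class=10, seed=0):
    """Return indices of up to n_per_class examples per class for manual review."""
    rng = np.random.default_rng(seed)
    picks = []
    for k in np.unique(labels):
        idx = np.flatnonzero(labels == k)
        # Classes with fewer than n_per_class examples are taken whole.
        picks.append(rng.choice(idx, size=min(n_per_class, len(idx)), replace=False))
    return np.concatenate(picks)

# Toy label vector: 50 examples of class 0, 50 of class 1, 3 of class 2.
labels = np.array([0] * 50 + [1] * 50 + [2] * 3)
review_idx = sample_for_review(labels)
print(len(review_idx))  # 10 + 10 + 3 = 23
```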
- Note that cleanlab ranks the errors for you in the `cleanlab.pruning.get_noise_indices` method. So you can choose to 'correct' the labels for the top X%, then re-train and re-find the errors using the corrected train set. Just be careful how you set X. By iteratively correcting easily-correctable examples (those with a very high probability of belonging to another class and a very low probability of belonging to their given-label class) and throwing out the most egregious errors (low probability of belonging to their given-label class, as well as any other class), you can run cleanlab many times to iteratively build up a set of corrected labels, benign labels (ones that appear to have been labeled correctly all along), and errors (ones you need to throw out). cleanlab doesn't do this by default (yet) because if you set X improperly, you can relabel things incorrectly and your labels can get worse! So be careful if you take this approach. Again, the easiest way is the 5 steps at the top of this comment (just find `psx` and use cleanlab to remove all the estimated errors) -- you might throw out some correctly labeled data, but you'll also throw out the errors found by cleanlab.
- A final option: if you don't mind having a tiny test set, you can just hand-pick a few examples from each class.
By the way, in response to your comment that "for real world scenario, the label noise is unknown": note that unlike other packages, cleanlab estimates the label noise for you directly (provably accurately under reasonable conditions), for the train set or the test set -- whichever sets you run cleanlab on. cleanlab works using confident learning (see the paper) -- confident learning (and therefore cleanlab) estimates the noise in the dataset without knowing the true labels. Here is some code explaining more about this:
```python
from cleanlab.latent_estimation import (
    estimate_joint,
    estimate_py_and_noise_matrices_from_probabilities,
)

# cleanlab can directly estimate the noise transition matrices
# and the prior over the true labels.
(
    prior_distribution_over_true_class_labels_py,
    noise_transition_matrix,
    inverse_noise_transition_matrix,
    confident_joint,
) = estimate_py_and_noise_matrices_from_probabilities(
    s,    # noisy labels
    psx,  # predicted probabilities
    thresholds=None,
    converge_latent_estimates=True,
    py_method='cnt',
    calibrate=True,
)

# cleanlab can also estimate the joint distribution of noisy and true labels.
joint = estimate_joint(
    s=s,
    psx=psx,
)
# From the joint, you can compute anything, including the noise
# transition matrices and the priors over the true distribution.
```
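To make the last comment concrete: given a joint matrix `Q[s, y*]` (rows indexed by noisy label `s`, columns by true label `y*`), the prior and both transition matrices fall out with plain numpy. The toy joint below is made up for illustration:

```python
import numpy as np

# Hypothetical joint distribution Q(s, y*) for 2 classes; entries sum to 1.
joint = np.array([
    [0.35, 0.05],  # Q(s=0, y*=0), Q(s=0, y*=1)
    [0.10, 0.50],  # Q(s=1, y*=0), Q(s=1, y*=1)
])

py = joint.sum(axis=0)            # prior over true labels: p(y*)
ps = joint.sum(axis=1)            # marginal over noisy labels: p(s)
noise_matrix = joint / py         # P(s | y*) = Q(s, y*) / p(y*), columns sum to 1
inv_noise_matrix = joint.T / ps   # P(y* | s) = Q(s, y*) / p(s), columns sum to 1
```

This is the same marginalization/conditioning that cleanlab performs internally once the joint has been estimated.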
@cgnorthcutt thanks for this very detailed response!
I am sorry but I still have a question, in your steps, we are training our model on the train set (with errors) and using that to then remove errors from the test set - is that correct?
Once we have this cleaned test set - we can use it to select models, etc.
From a basic perspective, is this the same as removing all samples on which a model that can predict probabilities is unsure of the class label?
Is there an issue that the train set was still noisy?
What is the difference between the approach you suggested and just showing the full dataset to the model and using the `noise_mask` to clean it all in one go, before the test/train split? I understand this feels incorrect, since we have shown the full set to a model -- but I am just trying to get some intuition as to why using only the train data to clean the test set is better, as this also feels a little like cheating :)
It looks like your questions are about how confident learning works, why it works, and how it applies to various cases. I highly recommend reading the paper here: https://arxiv.org/abs/1911.00068
Or for a more accessible version, the blog here: https://l7.curtisnorthcutt.com/confident-learning
Please let me know if anything in those publications doesn't make sense.
Hi @cgnorthcutt
I have read both the paper and blog post already :) The process makes sense.
I think my above comment packed in too many questions. My main worry is this: if we train a model on the train set and use that same model to prune the test set, we are basically throwing away the hard examples on which the model would have failed. So it feels a little like cheating.
Apologies if this question still comes across as not having understood the underlying CL process - I will read the paper again.
Thanks in advance.
Closing this issue due to lack of activity. Feel free to reopen if you still have questions!
And yes, when cleaning labels in test data (which is highly recommended for proper model selection/evaluation), you need to be careful not to introduce bias into the dataset. We've recently created a tool called Cleanlab Studio to help you quickly verify the cleanlab-corrected labels to easily ensure they are trustworthy: https://cleanlab.ai/studio/