cleanlab / cleanlab

The standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.

Home Page: https://cleanlab.ai

License: GNU Affero General Public License v3.0

Topics: weak-supervision, data-cleaning, data-quality, data-science, noisy-labels, data-centric-ai, out-of-distribution-detection, outlier-detection, active-learning, data-labeling

cleanlab's Introduction

cleanlab helps you clean data and labels by automatically detecting issues in an ML dataset. To facilitate machine learning with messy, real-world data, this data-centric AI package uses your existing models to estimate dataset problems that can be fixed to train even better models.

# cleanlab works with **any classifier**. Yup, you can use PyTorch/TensorFlow/OpenAI/XGBoost/etc.
import cleanlab  # sklearn.YourFavoriteClassifier below is a placeholder for your own sklearn-compatible model
cl = cleanlab.classification.CleanLearning(sklearn.YourFavoriteClassifier())

# cleanlab finds data and label issues in **any dataset**... in ONE line of code!
label_issues = cl.find_label_issues(data, labels)

# cleanlab trains a robust version of your model that works more reliably with noisy data.
cl.fit(data, labels)

# cleanlab estimates the predictions you would have gotten if you had trained with *no* label issues.
cl.predict(test_data)

# A universal data-centric AI tool, cleanlab quantifies class-level issues and overall data quality, for any dataset.
cleanlab.dataset.health_summary(labels, confident_joint=cl.confident_joint)

Get started with: tutorials, documentation, examples, and blogs.


Examples of various issues in a Cat/Dog dataset, automatically detected by cleanlab via this code:

lab = cleanlab.Datalab(data=dataset, label_name="column_name_for_labels")
# Fit any ML model, get its feature_embeddings & pred_probs for your data
lab.find_issues(features=feature_embeddings, pred_probs=pred_probs)
lab.report()
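Note: the pred_probs passed to find_issues should be out-of-sample, i.e. each example scored by a model that never trained on it. One common way to obtain them is scikit-learn cross-validation; a minimal sketch, with a toy model and toy data standing in for yours:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Toy stand-ins for your own features and labels.
X, labels = make_classification(n_samples=200, n_classes=3,
                                n_informative=4, random_state=0)

# Out-of-sample predicted probabilities: each example is scored by a
# model that never saw it during training.
pred_probs = cross_val_predict(LogisticRegression(max_iter=1000),
                               X, labels, cv=5, method="predict_proba")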

So fresh, so cleanlab

cleanlab cleans your data's labels via state-of-the-art confident learning algorithms, published in this paper and blog. See some of the datasets cleaned with cleanlab at labelerrors.com. This data-centric AI tool helps you find data and label issues, so you can train reliable ML models.

cleanlab is:

  1. backed by theory -- with provable guarantees of exact label noise estimation, even with imperfect models.
  2. fast -- code is parallelized and scalable.
  3. easy to use -- one line of code to find mislabeled data, bad annotators, outliers, or train noise-robust models.
  4. general -- works with any dataset (text, image, tabular, audio,...) + any model (PyTorch, OpenAI, XGBoost,...)

Examples of incorrect given labels in various image datasets, found and corrected using cleanlab. While these examples are from image datasets, cleanlab also works for text, audio, and tabular data.

Run cleanlab

cleanlab supports Linux, macOS, and Windows and runs on Python 3.8+.

Practicing data-centric AI can look like this:

  1. Train initial ML model on original dataset.
  2. Utilize this model to diagnose data issues (via cleanlab methods) and improve the dataset.
  3. Train the same model on the improved dataset.
  4. Try various modeling techniques to further improve performance.

Most folks jump from Step 1 → 4, but you may achieve big gains without any change to your modeling code by using cleanlab! Continuously boost performance by iterating Steps 2 → 4 (and try to evaluate with cleaned data).
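To make Steps 2-3 concrete, here is a minimal sketch (toy model and data standing in for your own; not the only way to run this loop):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from cleanlab.classification import CleanLearning

# Toy stand-ins for your own dataset.
data, labels = make_classification(n_samples=300, n_classes=3,
                                   n_informative=4, random_state=0)

cl = CleanLearning(LogisticRegression(max_iter=1000))

# Step 2: diagnose likely label issues via cross-validated predictions.
issues = cl.find_label_issues(data, labels)
print(issues.head())  # DataFrame with an is_label_issue column per example

# Step 3: refit the same model; CleanLearning prunes the flagged examples.
cl.fit(data, labels, label_issues=issues)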

Use cleanlab with any model for most ML tasks

All features of cleanlab work with any dataset and any model. Yes, any model: PyTorch, TensorFlow, Keras, JAX, HuggingFace, OpenAI, XGBoost, scikit-learn, etc. If you use an sklearn-compatible classifier, all cleanlab methods work out of the box.

It’s also easy to use your favorite non-sklearn-compatible model (click to learn more)

cleanlab can find label issues from any model's predicted class probabilities if you can produce them yourself.

Some cleanlab functionality may require your model to be sklearn-compatible. There's nothing you need to do if your model already has .fit(), .predict(), and .predict_proba() methods. Otherwise, just wrap your custom model in a Python class that inherits from sklearn.base.BaseEstimator:

from sklearn.base import BaseEstimator

class YourFavoriteModel(BaseEstimator):  # Inherits sklearn base classifier
    def __init__(self):
        pass  # ensure this re-initializes parameters for neural net models
    def fit(self, X, y, sample_weight=None):
        pass  # train the model on features X and labels y
    def predict(self, X):
        pass  # return predicted class labels
    def predict_proba(self, X):
        pass  # return predicted class probabilities, shape (n_examples, n_classes)
    def score(self, X, y, sample_weight=None):
        pass  # return a scalar accuracy-like metric

This inheritance lets you apply a wide range of sklearn functionality, like hyperparameter optimization, to your custom model. Now you can use your model with every method in cleanlab. Here's one example:

from cleanlab.classification import CleanLearning
cl = CleanLearning(clf=YourFavoriteModel())  # has all the same methods as YourFavoriteModel
cl.fit(train_data, train_labels_with_errors)
cl.predict(test_data)

Want to see a working example? Here’s a compliant PyTorch MNIST CNN class

More details are provided in documentation of cleanlab.classification.CleanLearning.

Note that some libraries exist to give you sklearn-compatibility for free. For PyTorch, check out the skorch Python library, which will wrap your PyTorch model into an sklearn-compatible model (example). For TensorFlow/Keras, check out our Keras wrapper. Many libraries also already offer a special scikit-learn API, for example XGBoost or LightGBM.
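For instance, a minimal skorch sketch (the TinyNet architecture and hyperparameters below are illustrative placeholders, not something cleanlab prescribes):

import torch.nn as nn
from skorch import NeuralNetClassifier
from cleanlab.classification import CleanLearning

class TinyNet(nn.Module):  # placeholder PyTorch architecture
    def __init__(self, n_features=10, n_classes=2):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(n_features, 32), nn.ReLU(), nn.Linear(32, n_classes))

    def forward(self, X):
        return self.layers(X)  # logits; skorch applies softmax for predict_proba

# skorch adds .fit() / .predict() / .predict_proba(), so the wrapped
# model plugs straight into CleanLearning. (skorch expects float32 features
# and int64 labels.)
net = NeuralNetClassifier(TinyNet, criterion=nn.CrossEntropyLoss,
                          max_epochs=10, lr=0.01)
cl = CleanLearning(clf=net)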


cleanlab is useful across a wide variety of Machine Learning tasks. Specific tasks this data-centric AI solution offers dedicated functionality for include:

  1. Binary and multi-class classification
  2. Multi-label classification (e.g. image/document tagging)
  3. Token classification (e.g. entity recognition in text)
  4. Regression (predicting numerical column in a dataset)
  5. Image segmentation (images with per-pixel annotations)
  6. Object detection (images with bounding box annotations)
  7. Classification with data labeled by multiple annotators
  8. Active learning with multiple annotators (suggest which data to label or re-label to improve model most)
  9. Outlier detection (identify atypical data that appears out of distribution)

For other ML tasks, cleanlab can still help you improve your dataset if appropriately applied. Many practical applications are demonstrated in our Example Notebooks.

Citation and related publications

cleanlab is based on peer-reviewed research. Here are relevant papers to cite if you use this package:

Confident Learning (JAIR '21) (click to show bibtex)
@article{northcutt2021confidentlearning,
    title={Confident Learning: Estimating Uncertainty in Dataset Labels},
    author={Curtis G. Northcutt and Lu Jiang and Isaac L. Chuang},
    journal={Journal of Artificial Intelligence Research (JAIR)},
    volume={70},
    pages={1373--1411},
    year={2021}
}
Rank Pruning (UAI '17) (click to show bibtex)
@inproceedings{northcutt2017rankpruning,
    author={Northcutt, Curtis G. and Wu, Tailin and Chuang, Isaac L.},
    title={Learning with Confident Examples: Rank Pruning for Robust Classification with Noisy Labels},
    booktitle = {Proceedings of the Thirty-Third Conference on Uncertainty in Artificial Intelligence},
    series = {UAI'17},
    year = {2017},
    location = {Sydney, Australia},
    numpages = {10},
    url = {http://auai.org/uai2017/proceedings/papers/35.pdf},
    publisher = {AUAI Press},
}
Label Quality Scoring (ICML '22) (click to show bibtex)
@inproceedings{kuan2022labelquality,
    title={Model-agnostic label quality scoring to detect real-world label errors},
    author={Kuan, Johnson and Mueller, Jonas},
    booktitle={ICML DataPerf Workshop},
    year={2022}
}
Out-of-Distribution Detection (ICML '22) (click to show bibtex)
@inproceedings{kuan2022ood,
    title={Back to the Basics: Revisiting Out-of-Distribution Detection Baselines},
    author={Kuan, Johnson and Mueller, Jonas},
    booktitle={ICML Workshop on Principles of Distribution Shift},
    year={2022}
}
Token Classification Label Errors (NeurIPS '22) (click to show bibtex)
@inproceedings{wang2022tokenerrors,
    title={Detecting label errors in token classification data},
    author={Wang, Wei-Chen and Mueller, Jonas},
    booktitle={NeurIPS Workshop on Interactive Learning for Natural Language Processing (InterNLP)},
    year={2022}
}
CROWDLAB for Data with Multiple Annotators (NeurIPS '22) (click to show bibtex)
@inproceedings{goh2022crowdlab,
    title={CROWDLAB: Supervised learning to infer consensus labels and quality scores for data with multiple annotators},
    author={Goh, Hui Wen and Tkachenko, Ulyana and Mueller, Jonas},
    booktitle={NeurIPS Human in the Loop Learning Workshop},
    year={2022}
}
ActiveLab: Active learning with data re-labeling (ICLR '23) (click to show bibtex)
@inproceedings{goh2023activelab,
    title={ActiveLab: Active Learning with Re-Labeling by Multiple Annotators},
    author={Goh, Hui Wen and Mueller, Jonas},
    booktitle={ICLR Workshop on Trustworthy ML},
    year={2023}
}
Incorrect Annotations in Multi-Label Classification (ICLR '23) (click to show bibtex)
@inproceedings{thyagarajan2023multilabel,
    title={Identifying Incorrect Annotations in Multi-Label Classification Data},
    author={Thyagarajan, Aditya and Snorrason, Elías and Northcutt, Curtis and Mueller, Jonas},
    booktitle={ICLR Workshop on Trustworthy ML},
    year={2023}
}
Detecting Dataset Drift and Non-IID Sampling (ICML '23) (click to show bibtex)
@inproceedings{cummings2023drift,
    title={Detecting Dataset Drift and Non-IID Sampling via k-Nearest Neighbors},
    author={Cummings, Jesse and Snorrason, Elías and Mueller, Jonas},
    booktitle={ICML Workshop on Data-centric Machine Learning Research},
    year={2023}
}
Detecting Errors in Numerical Data (ICML '23) (click to show bibtex)
@inproceedings{zhou2023errors,
    title={Detecting Errors in Numerical Data via any Regression Model},
    author={Zhou, Hang and Mueller, Jonas and Kumar, Mayank and Wang, Jane-Ling and Lei, Jing},
    booktitle={ICML Workshop on Data-centric Machine Learning Research},
    year={2023}
}
ObjectLab: Mislabeled Images in Object Detection Data (ICML '23) (click to show bibtex)
@inproceedings{tkachenko2023objectlab,
    title={ObjectLab: Automated Diagnosis of Mislabeled Images in Object Detection Data},
    author={Tkachenko, Ulyana and Thyagarajan, Aditya and Mueller, Jonas},
    booktitle={ICML Workshop on Data-centric Machine Learning Research},
    year={2023}
}
Label Errors in Segmentation Data (ICML '23) (click to show bibtex)
@inproceedings{lad2023segmentation,
    title={Estimating label quality and errors in semantic segmentation data via any model},
    author={Lad, Vedang and Mueller, Jonas},
    booktitle={ICML Workshop on Data-centric Machine Learning Research},
    year={2023}
}

To understand/cite other cleanlab functionality not described above, check out our additional publications.

Other resources

Easy mode: No-code Data Improvement

While this open-source package finds data issues, its utility depends on you having a good existing ML model plus an interface to efficiently fix these issues in your dataset. Providing all of these pieces, Cleanlab Studio is a Data Curation platform to find and fix problems in any {image, text, tabular} dataset. Cleanlab Studio automatically runs optimized algorithms from this package on top of AutoML & Foundation models fit to your data, and presents detected issues (plus AI-suggested fixes) in an intelligent data correction interface.

Try it for free! Adopting Cleanlab Studio enables users of this package to:

  • work 100x faster (1 min to analyze your raw data with zero code or ML work; optionally use Python API)
  • produce better-quality data (10x more types of issues auto detected & corrected via built-in AI)
  • accomplish more (auto-label data, deploy ML instantly, audit LLM inputs/outputs, moderate content, ...)

The modern AI pipeline automated with Cleanlab Studio

Join our community

License

Copyright (c) 2017 Cleanlab Inc.

cleanlab is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

cleanlab is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

See GNU Affero General Public LICENSE for details. You can email us to discuss licensing: [email protected]

Commercial licensing

Commercial licensing is available for teams and enterprises that want to use cleanlab in production workflows, but are unable to open-source their code as is required by the current license. Please email us: [email protected]

cleanlab's People

Contributors

01prathams, 0xrushi, abhijitpal1247, aditya1503, anishathalye, calebchiam, cgnorthcutt, clu0, cmauck10, coding-famer, derweh, desboisgit, elisno, ericwang1997, gilad-rubin, gogetron, huiwengoh, johnsonkuan, jwmueller, krmayankb, mglowacki100, mturk24, r-peleg, ryansingman, sanjanag, steven-yiran, tataganesh, ulya-tkch, weijinglok, yulv-git


cleanlab's Issues

May I set the sample_weight parameter manually?

Hi, it's a pretty nice lib. When I try to train a model with cleanlab, I find that it is not allowed to set sample_weight manually. It is intelligent to set the sample weight according to the ratio of positive and negative samples. However, I need to tune the parameter for better performance.
Thank you for your contribution to this lib.

Assigned pulearning doesn't change the results much

From what I understand, if I assign pulearning = 1 (in a binary classification problem), it should imply that the class of ones has no noise. Still, after training, I get the following output from the confident_joint, est_py, est_nm and est_inv respectively:

[[ 3216. 1179.]
[16989. 14594.]]

[0.79313136 0.20686864]

[[0.15916852 0.07474799]
[0.84083148 0.92525201]]

[[0.73174061 0.53791597]
[0.26825939 0.46208403]]

Is there any other way to make sure no data points from class 1 are considered noisy?

Thanks.

what is the actual role of inv_noise_matrix?

I see that the inv_noise_matrix and noise_matrix are computed throughout the code, but I can't see them used anywhere in the final pruning function, which is the main purpose. I see that the indices to remove are calculated using the confident joint (cj), and the cj is itself calculated from psx. Even inv_noise_matrix is passed to the get_noise_indices function, but I don't see it actually used in the function. I understand that this is also explained in the paper. Could you please explain why it is calculated in the code even though it is not used in the actual pruning? Maybe I'm missing something?

What format of s and psx should be inputted into get_noise_indices() in a multi-label scenario

I have 20 samples with multi-label and 5 classes, such as:
[[2, 3, 4], [1, 3, 4, 5], [1, 3, 4], [1, 2, 3, 4, 5], [2, 3, 5], [1, 2, 4], [1, 3, 4, 5], [1, 3], [1, 5], [1, 3, 4, 5], [2, 3, 4], [3, 4], [4], [1, 3, 4], [2, 3, 4, 5], [1, 4], [3, 4], [3, 5], [2, 3, 5], [2, 5]]
I inputted this label list and a probabilities matrix as psx (shape=(20,5)) into get_noise_indices().
However, the error is:
File "C:\Users\Anaconda2\envs\tf18\lib\site-packages\cleanlab\pruning.py", line 342, in get_noise_indices
multi_label=multi_label,
File "C:\Users\Anaconda2\envs\tf18\lib\site-packages\cleanlab\latent_estimation.py", line 303, in compute_confident_joint
calibrate=calibrate,
File "C:\Users\Anaconda2\envs\tf18\lib\site-packages\cleanlab\latent_estimation.py", line 216, in _compute_confident_joint_multi_label
multi_label=True,
File "C:\Users\Anaconda2\envs\tf18\lib\site-packages\cleanlab\latent_estimation.py", line 121, in calibrate_confident_joint
confident_joint.T / confident_joint.sum(axis=1) * s_counts
ValueError: operands could not be broadcast together with shapes (5,5) (6,)

Is there anything wrong with my inputs?
What format of s and psx is correct in this multi-label scenario?
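A hedged sketch for the 1.x API in this traceback: the labels likely need to be zero-indexed (0..K-1, matching the psx columns) and passed as a list of lists with multi_label=True. Tiny toy arrays stand in for the real data, just to show the call shape:

import numpy as np
from cleanlab.pruning import get_noise_indices

# Labels zero-indexed: e.g. the question's [2, 3, 4] becomes [1, 2, 3].
labels = [[1, 2, 3], [0, 2, 3, 4], [0, 2, 3]]

# Stand-in for your (n_examples, n_classes) probability matrix.
psx = np.random.rand(3, 5)
psx = psx / psx.sum(axis=1, keepdims=True)

ordered_label_errors = get_noise_indices(
    s=labels,
    psx=psx,
    multi_label=True,  # treat each inner list as a set of labels
    sorted_index_method='normalized_margin',
)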

fasttext.py line 247, TypeError

Anaconda3\lib\site-packages\cleanlab\models\fasttext.py", line 247, in
psx = [[p for _, p in sorted(list(zip(*l)), key=lambda x: x[0])] for l in list(zip(*pred))]
TypeError: zip argument #2 must support iteration

Multilabel Scenario

How would get_noise_indices work in a multilabel scenario?
psx would remain the same, an n x m probability array. How should s be represented in this case?

Getting NaN values with LearningWithNoisyLabels

Hi,

I am trying to apply your "cleanlab, in a nutshell 🌰" tutorial with my own dataset, but at the learning step I got the following error:

# Wrap around any classifier (scikit-learn, PyTorch, TensorFlow, FastText, etc.)
lnl = LearningWithNoisyLabels(clf=LogisticRegression()) 
lnl.fit(X=X_train_data, s=train_noisy_labels) 

$ ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

Regarding my training data:

  • X_train_data is a (47,100) numpy array from TFIDF extractor
  • train_noisy_labels is a (47,) numpy array from LabelEncoder (multi-class setting)

I have also tried the XGBoost classifier (initially I intended to do it with XGBoost), and in that case the training seems to finish correctly, but when I get the predictions with predict_proba, the predictions array only contains NaN values.

Thanks in advance

Conceptually does it make sense to run gridsearchCV on models that are wrapped with your api?

Hi,

First of all, really like the paper and this implementation.

  1. If one wanted to tune the hyper-parameters of the classifier, is it reasonable to use, say, GridSearchCV on top of this wrapper? E.g.

GridSearchCV(clf=NoisyLabelsEstimator(clf=Logistic()), parameter_grid={})

Does this make sense, given that there could potentially be fewer N to determine noise?

  2. If you find an optimal hyper-parameter setting, is it guaranteed to remain optimal after the de-noising?

Thanks,

Will

In LearningWithNoisyLabels(), does it train only on confident examples (by dropping label errors found by cleanlab)?

Dear Curtis,

I am a bit confused: in LearningWithNoisyLabels(), does it train only on confident examples (by dropping the label errors found by cleanlab)?

I found that in your state-of-the-art CIFAR-10 application, cleanlab is used to find the label errors in CIFAR-10, and then the errors are removed and the model is trained on the cleaned data via Co-Teaching...

Hope to learn your approach better. Thank you very much!

Labels flagged as errors by cleanlab are actually mostly correct

I used the method in tutorial:

ordered_label_errors = get_noise_indices(
    s=numpy_array_of_noisy_labels,
    psx=numpy_array_of_predicted_probabilities,
    sorted_index_method='normalized_margin',  # Orders label errors
)

Then the outputs that are supposed to be label errors are actually correctly labeled. What actions could I take to figure out the reason?

Suggestion: Memory consumption / time complexity in README

Hi, first of all thanks for open-sourcing!

I tried running this on a production dataset with ~200k examples classified into ~5k categories. I'd like to identify labeling errors with the simple example given in the README: ordered_label_errors = get_noise_indices(...). Although I am on a machine with ~50 GB of RAM, I run out of memory.

It would be great if you could write something in the README about the memory and time complexity of the method, so that users can get a rough understanding of what it can and cannot be used on.

Also, I would appreciate some info for my example. Is the number of categories the problem here?

get_noise_indices parameter requirements?

Hi,

I'm stuck using pruning.get_noise_indices to find label errors in my dataset. I have a set of image data, 392 images, and I calculated its psx using my model, with the shape (372, 5026). (I didn't do cross-validation, but I think that's not a problem for now?) And y is just a vector of shape (372,) containing the labels of the images (all 0).

Then if I use psx and y as input, I get the following error:
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/pool.py", line 121, in worker
result = (True, func(*args, **kwds))
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/pool.py", line 44, in mapstar
return list(map(*args))
File "/Users/zijingwu/Library/Python/3.7/lib/python/site-packages/cleanlab/pruning.py", line 170, in _prune_by_count
if s_counts[k] <= MIN_NUM_PER_CLASS: # No prune if not MIN_NUM_PER_CLASS
IndexError: index 948 is out of bounds for axis 0 with size 1
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/Users/zijingwu/PycharmProjects/inception/cleaner.py", line 302, in
run_detection()
File "/Users/zijingwu/PycharmProjects/inception/cleaner.py", line 292, in run_detection
ordered_label_errors = cleanlab.pruning.get_noise_indices(y_test, psx, prune_method=prune_method)
File "/Users/zijingwu/Library/Python/3.7/lib/python/site-packages/cleanlab/pruning.py", line 419, in get_noise_indices
noise_masks_per_class = p.map(_prune_by_count, range(K))
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/pool.py", line 268, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/pool.py", line 657, in get
raise self._value
IndexError: index 948 is out of bounds for axis 0 with size 1

I had a similar shape-mismatch error with another dataset before. However, if I reshape psx to have the same number of columns as the number of unique values in y, the method works, despite giving incorrect results. So I'm wondering: what shape should psx be? From my understanding of the cleanlab paper, I think (372, 5026) should be the correct shape of psx.

Thanks in advance for any help! :)

What is the correct format of s?

I have 14312 training examples and the number of classes is 14. We all know that the format of psx is n x m, where n is the number of examples and m is the number of classes. Thus, the shape of my psx numpy array is (14312, 14).
I don't know what the format of s should be. I guess it should be (14312,), and my s numpy array looks like:
['6' '2' '10' ... '5' '0' '4']
But when I tried to run the following code:
ordered_label_errors = get_noise_indices(
s=noisy_labels,
psx=predicted_probabilities,
prune_method='prune_by_class')
I got the follow error:
Traceback (most recent call last):
File "run.py", line 18, in
prune_method='prune_by_class')
File "/nethome/shao42/.local/lib/python3.5/site-packages/cleanlab/pruning.py", line 342, in get_noise_indices
multi_label=multi_label,
File "/nethome/shao42/.local/lib/python3.5/site-packages/cleanlab/latent_estimation.py", line 357, in compute_confident_joint
confident_joint = calibrate_confident_joint(confident_joint, s)
File "/nethome/shao42/.local/lib/python3.5/site-packages/cleanlab/latent_estimation.py", line 121, in calibrate_confident_joint
confident_joint.T / confident_joint.sum(axis=1) * s_counts
ValueError: operands could not be broadcast together with shapes (0,0) (14,)

So I don't know what is wrong. Is it due to the format of s?
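A hedged guess at the cause: s here holds strings ('6', '2', '10', ...), while cleanlab expects integer class labels in 0..K-1, so casting may resolve the broadcast error:

import numpy as np

s = np.array(['6', '2', '10', '5', '0', '4'])  # string labels, shape (n_examples,)
s = s.astype(int)            # -> array([ 6,  2, 10,  5,  0,  4])
assert s.dtype.kind == 'i'   # integer labels matching psx's 14 columns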

Error when using LearningWithNoisyLabels()

Hey @cgnorthcutt ,

I am currently trying to improve my model's accuracy in classification. I know for sure that my labels have 20% noise in them. I am using Sklearn's Random Forest Classifier and my data is ~600K samples with 500 Features.

Environment Details :

  • Mac
  • Python 3.5.6
  • Sklearn 0.21.3

Code:

ln1 = LearningWithNoisyLabels(clf=RandomForestClassifier(class_weight='balanced'), seed=2)
ln1.fit(X=x_train.values, s=y_train['prediction'].values)

Error Faced :

TypeError                                 Traceback (most recent call last)
<ipython-input-7-2636b16750dc> in <module>
      3 
      4 ln1=  LearningWithNoisyLabels(clf=RandomForestClassifier(class_weight='balanced'),seed = 2)
----> 5 ln1.fit(X = x_train.values,s=y_train['prediction'].values)

/anaconda3/envs/cleanlabel/lib/python3.5/site-packages/cleanlab/classification.py in fit(self, X, s, psx, thresholds, noise_matrix, inverse_noise_matrix)
    296             confident_joint = self.confident_joint,
    297             prune_method = self.prune_method,
--> 298             converge_latent_estimates = self.converge_latent_estimates,
    299         ) 
    300 

TypeError: get_noise_indices() got an unexpected keyword argument 'converge_latent_estimates'

Exploring the code, it seems the function get_noise_indices() has no such argument.

GridSearchCV seems to work, but how does it work internally?

Let's say my classifier is LearningWithNoisyLabels(GridSearchCV(estimator = RandomForestClassifier(), param_grid = ..., cv = ...), cv_n_fold = ...)

What will happen here? Will I get the best parameters from GridSearchCV's cross-validation, and then re-train this model on the cleaned data using LearningWithNoisyLabels's cross-validation? Could this potentially produce bad results?

Question about the paper found on arXiv

Dear:
I'm sorry, but I am confused. When I read the paper "Confident Learning: Estimating Uncertainty in Dataset Labels", in Section 3.1, why is the confusion matrix C defined given y_k and the predictions argmax(expr)? Is that just like a normalized pivot table?

Can the class number be computed from psx instead of s?

Thanks for your contributions!
In the function cleanlab.latent_estimation.compute_confident_joint, if K is None, I wonder if I can replace this line with K = np.array(psx).shape[1] to avoid bugs when one or more classes in the dataset have zero samples.
There may be other lines using this kind of code as well. Is the toolkit compatible with the situation above? Thanks!
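For illustration, the proposed change amounts to inferring K from the width of psx rather than from the observed labels:

import numpy as np

# Infer the number of classes from psx's columns, so a class with zero
# samples in s still counts toward K.
psx = np.array([[0.7, 0.2, 0.1],
                [0.1, 0.8, 0.1]])
K = np.array(psx).shape[1]  # K = 3 even if s only contains classes 0 and 1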

What is the algorithm behind this repo?

Hi, I am very interested in this package. I have a small clean dataset, and a large noisy dataset. I think this package can help my model better learn on the whole dataset.
What is the algorithm behind all these? Could you post a link or a paper? Many thanks!

Can we detect an image that does not have a true class?

Suppose I'm classifying cats and dogs. But in the training data, there are sometimes images of a bird that are incorrectly labeled as cats or dogs. And this bird image cannot be labeled as dogs or cats. It's both wrong. The only course of action is to delete this image, you cannot fix the label.

Would cleanlab be able to detect bird images in a dataset of cats and dogs?
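Possibly relevant: later cleanlab versions include an out-of-distribution module that scores how atypical each example is relative to the rest of the data. A minimal sketch, assuming you have feature embeddings from your model (random values stand in here):

import numpy as np
from cleanlab.outlier import OutOfDistribution

feature_embeddings = np.random.rand(100, 16)  # stand-in for real embeddings

ood = OutOfDistribution()
scores = ood.fit_score(features=feature_embeddings)  # lower = more atypical
bird_candidates = np.argsort(scores)[:10]  # the 10 most out-of-distribution examples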

Validation dataset

Hello, if confident learning is applied only to the training dataset, should the validation set be kept the same for comparison?

Example Notebooks throws error

The notebooks 'iris_simple_example.ipynb' and 'classifier_comparison.ipynb' in the examples folder both throw an error when run.

The error is:

NameError: name 'psx' is not defined

It seems like the out-of-sample predicted probabilities are not being computed through CV?

I am using python version 3.7.3 on Windows 10 and sklearn version 0.21.2

Sparse matrix support

Hi!
I was wondering why the "utils.assert_inputs_are_valid" does not accept sparse matrices (such as scipy's csc/csr matrix) as input.
I understand that some models do not work well with sparse matrices, which could create errors while training, but I think it's reasonable to leave that for the user to handle.
What are your thoughts?

one mistake in algorithm in the paper

It is an excellent work.
In Algorithm 1 of Section C ("The confident joint and joint algorithms") of the paper, the code in part 2:
_for i = 1 to m do_
may be a mistake; it should perhaps be:
_for i = 1 to n do_

Batch based noisy labels estimation

Thanks for sharing such a good tool!

I have quite a large dataset (10M samples) with 100K classes, so the probability matrix for the whole dataset will take around 4 TB and won't fit into the RAM of my machine. Does it make sense to split the dataset into batches and run the algorithm batch by batch?
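Possibly relevant: newer cleanlab releases include a batched routine that streams labels and pred_probs in chunks rather than holding the full matrix in RAM. A sketch under that assumption (the .npy file names are hypothetical):

import numpy as np
from cleanlab.experimental.label_issues_batched import find_label_issues_batched

# Memory-map the arrays so batches are read from disk on demand.
labels = np.load("labels.npy", mmap_mode="r")          # hypothetical file
pred_probs = np.load("pred_probs.npy", mmap_mode="r")  # hypothetical file

issue_indices = find_label_issues_batched(
    labels=labels, pred_probs=pred_probs, batch_size=100_000
)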

cross-validated train

@cgnorthcutt
Hello:
Thank you for this approach^^, but I have a question for you.
Your approach uses 4-fold cross-validation, where every training fold has the same num_classes, and finally concatenates the 4 softmax.npy files. But for a given training set, is it OK to use all the data for training and get a single softmax.npy, or will that cause problems? And what kinds of data does your method work well (or not so well) on?

_compute_confident_joint_multi_label() fails for multilabel case with varying label counts per data point

Problem:

_compute_confident_joint_multi_label() produces a wrong count of unique labels K in line 192 when the labels consist of lists with varying label counts per data point. The problem lies in the function np.unique(), which only works correctly here if the lists of labels all have the same length.

Example:

s = [[1, 2], [3]]
np.unique(s)

>>> array([list([1, 2]), list([3])], dtype=object)

You can see that np.unique() counts unique lists instead of the entries in the lists. The desired output would be:

>>> array([1, 2, 3])

The problem lies in the fact that numpy fails at flattening the list of lists if the lists have varying lengths.

Proposed Solution:

Explicitly flatten s before passing it to np.unique():

s = [[1, 2], [2, 3]]
s_flat = [i for l in s for i in l]
np.unique(s_flat)

>>> array([1, 2, 3])

Multi-label Classification, getting Y class error.

Good day,
I am trying to use cleanlab to train a multi-label classifier, where my target classes are encoded using scikit-learn's MultiLabelBinarizer (link here).
I have a total of five classes and 20,000 images for training.
I built a scikit-learn BaseEstimator and wrapped a Keras ResNet50 model inside it.
Now, when using LearningWithNoisyLabels from cleanlab.classification, I am getting the following error:

  File "c:/Users/ASUS/Desktop/cleanlab mangoes/clean_training.py", line 16, in <module>
    lnl.fit(X, y)
  File "C:\ProgramData\Anaconda3\envs\tf_gpu\lib\site-packages\cleanlab\classification.py", line 267, in fit
    assert_inputs_are_valid(X, s, psx)
  File "C:\ProgramData\Anaconda3\envs\tf_gpu\lib\site-packages\cleanlab\util.py", line 41, in assert_inputs_are_valid
    ensure_2d=False,
  File "C:\ProgramData\Anaconda3\envs\tf_gpu\lib\site-packages\sklearn\utils\validation.py", line 72, in inner_f
    return f(**kwargs)
  File "C:\ProgramData\Anaconda3\envs\tf_gpu\lib\site-packages\sklearn\utils\validation.py", line 807, in check_X_y
    y = column_or_1d(y, warn=True)
  File "C:\ProgramData\Anaconda3\envs\tf_gpu\lib\site-packages\sklearn\utils\validation.py", line 72, in inner_f
    return f(**kwargs)
  File "C:\ProgramData\Anaconda3\envs\tf_gpu\lib\site-packages\sklearn\utils\validation.py", line 847, in column_or_1d
    "got an array of shape {} instead.".format(shape))
ValueError: y should be a 1d array, got an array of shape (25767, 5) instead.

My code used for training is as follows:

from sk_resnet import ResnetTrainer
from cleanlab.classification import LogReg
from cleanlab.classification import LearningWithNoisyLabels
import pandas as pd 

df = pd.read_csv('train_labels.csv')
values = df.values
X = values[:,0]
y = values[:,1:]


lnl = LearningWithNoisyLabels(clf=ResnetTrainer(batch_size=8, epochs=10))
lnl.fit(X, y)

I hope authors of cleanlab can provide a simple example of how to use cleanlab for multi-label training.

Raised error: ValueError: operands could not be broadcast together with shapes (20000,9140) (401,)

Hello, I use cleanlab to clean my dataset.
My dataset contains about 450,000 samples across 9,140 classes. When I use just the first 20,000 samples for cleaning, I get an error:

(20000,)
(20000, 9140)
/opt/meituan/develop/lixianyang/miniconda3/lib/python3.7/site-packages/numpy/core/fromnumeric.py:3335: RuntimeWarning: Mean of empty slice.
out=out, **kwargs)
/opt/meituan/develop/lixianyang/miniconda3/lib/python3.7/site-packages/numpy/core/_methods.py:161: RuntimeWarning: invalid value encountered in true_divide
ret = ret.dtype.type(ret / rcount)
Traceback (most recent call last):
File "clean_base_on_clean_lab.py", line 24, in
sorted_index_method='normalized_margin', # Orders label errors
File "/opt/meituan/develop/lixianyang/miniconda3/lib/python3.7/site-packages/cleanlab/pruning.py", line 342, in get_noise_indices
multi_label=multi_label,
File "/opt/meituan/develop/lixianyang/miniconda3/lib/python3.7/site-packages/cleanlab/latent_estimation.py", line 337, in compute_confident_joint
psx_bool = (psx >= thresholds - 1e-6)
ValueError: operands could not be broadcast together with shapes (20000,9140) (401,)

My input s is a numpy.ndarray with shape (20000,),
and psx is a numpy.ndarray with shape (20000, 9140).
I don't know why the error says the shapes (20000,9140) and (401,) can't be broadcast together.
Where does (401,) come from?

ValueError: operands could not be broadcast together with shapes

Dataset - UCI skin segmentation https://www.openml.org/d/1502
uci_skin_segmentaion.csv.zip

Error -

File "/Users/naveen/anaconda3/lib/python3.6/site-packages/cleanlab/latent_estimation.py", line 687, in estimate_py_noise_matrices_and_cv_pred_proba
    seed = seed,
  File "/Users/naveen/anaconda3/lib/python3.6/site-packages/cleanlab/latent_estimation.py", line 593, in estimate_confident_joint_and_cv_pred_proba
    return_list_of_converging_cj_matrices = return_list_of_converging_cj_matrices,
  File "/Users/naveen/anaconda3/lib/python3.6/site-packages/cleanlab/latent_estimation.py", line 319, in estimate_confident_joint_from_probabilities
    confident_joint = calibrate_confident_joint(confident_joint, s, psx)
  File "/Users/naveen/anaconda3/lib/python3.6/site-packages/cleanlab/latent_estimation.py", line 111, in calibrate_confident_joint
    calibrated_cj = (confident_joint.T / confident_joint.sum(axis=1) * s_counts).T
ValueError: operands could not be broadcast together with shapes (2,2) (3,)

number of mislabeled samples

Hello,
thank you for your contributions! I have a problem:
how can I get more possibly mislabeled samples with high probability?
Can I just enlarge frac_noise to a larger number? The number of indices does not seem to be linear in frac_noise.

Using keras model for label noise prediction

I have built the custom class as mentioned in the readme:

class CustomKerasModel(Sequential):

    def __init__(self,name=None, max_features=20000, batch_size=128, epochs=10, validation_split=0.1):
        super(CustomKerasModel, self).__init__(name=name)
        self.add(Embedding(max_features, 128))
        self.add(Bidirectional(LSTM(32, return_sequences = True)))
        self.add(GlobalMaxPool1D())
        self.add(Dense(20, activation="relu"))
        self.add(Dropout(0.05))
        self.add(Dense(num_classes, activation="softmax"))
        self.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
        self.batch_size = batch_size
        self.epochs = epochs
        self.validation_split = validation_split

    def fit(self, X, y):
        y_one_hot = to_categorical(y)
        return self.model.fit(X, y_one_hot, batch_size=self.batch_size, epochs=self.epochs, validation_split=self.validation_split)
    
    def score(self, X, y):
        return self.model.evaluate(X, y, batch_size=128)[1]

and then to predict the label I am using

from cleanlab.latent_estimation import estimate_py_noise_matrices_and_cv_pred_proba
est_py, est_nm, est_inv, confident_joint, psx = estimate_py_noise_matrices_and_cv_pred_proba(
    X=X,
    s=y,
    clf = CustomKerasModel(max_features=max_features,batch_size=batch_size, epochs=epochs, validation_split=0.1)
)

But it's not working; I am getting the following error:

~/anaconda3/lib/python3.6/copy.py in _deepcopy_dict(x, memo, deepcopy)
    238     memo[id(x)] = y
    239     for key, value in x.items():
--> 240         y[deepcopy(key, memo)] = deepcopy(value, memo)
    241     return y
    242 d[dict] = _deepcopy_dict

~/anaconda3/lib/python3.6/copy.py in deepcopy(x, memo, _nil)
    167                     reductor = getattr(x, "__reduce_ex__", None)
    168                     if reductor:
--> 169                         rv = reductor(4)
    170                     else:
    171                         reductor = getattr(x, "__reduce__", None)

TypeError: can't pickle _thread.RLock objects

Could anyone please provide a solution to this problem, or the right way to use a Keras model for label noise prediction?
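One hedged workaround (not verified against this exact setup): the deepcopy/pickle failure comes from sklearn-style cloning of a raw Keras model, so building the network inside a function and wrapping it with scikeras's KerasClassifier, which is designed to be clonable, may help. Layer sizes mirror the snippet above:

from scikeras.wrappers import KerasClassifier
from tensorflow import keras

def build_model(max_features=20000, num_classes=2):
    model = keras.Sequential([
        keras.layers.Embedding(max_features, 128),
        keras.layers.Bidirectional(keras.layers.LSTM(32, return_sequences=True)),
        keras.layers.GlobalMaxPool1D(),
        keras.layers.Dense(20, activation="relu"),
        keras.layers.Dropout(0.05),
        keras.layers.Dense(num_classes, activation="softmax"),
    ])
    # sparse_categorical_crossentropy accepts integer labels directly,
    # so the to_categorical step is no longer needed.
    model.compile(loss="sparse_categorical_crossentropy",
                  optimizer="adam", metrics=["accuracy"])
    return model

clf = KerasClassifier(model=build_model, epochs=10, batch_size=128)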

Regarding a tensorflow CNN example

Hi sir,
I have implemented a TensorFlow-version CNN example using your rank pruning algorithm. Could you please check the e-mail I sent (to [email protected]; e-mail subject: Regarding the confident learning library) and reply to my questions? I would be very grateful.
Hopefully my code is a positive contribution to this repository.

Model (classifier) Selection if true label is unknown

I really like that cleanlab can be used with any classifier and any dataset distribution. My question is: in a real-world scenario, the label noise is unknown, so one cannot really compute accuracy scores (as in your classifier comparison example). How would one choose which classifier to use? Due to the unknown error rate, it is hard to compare model performance using the traditional binary classification metrics: F-1, AUC, etc.
Do you have any advice on that? Thank you very much!

How can cleanlab deal with too many labels?

Hi, I'm using cleanlab to prune noisy labels in my dataset. However, I find that cleanlab is ineffective here. There are 3,900 classes in my dataset with almost 30 million examples. The psx matrix is sparse. What can I do in this situation?

Available for Semantic Segmentation?

Hi, this is obviously a great Python package, and I'm doing research on pathological image segmentation with noisy labels. Can it be used for my research?

image segmentation with tensorflow?

Hi,

Thanks for this excellent work and cleanlab.

When I try this library in my image semantic segmentation task with TensorFlow, there is a bug:

model = LearningWithNoisyLabels(clf=get_unet())

File "/usr/local/lib/python3.6/dist-packages/cleanlab/classification.py", line 178, in init
'The classifier (clf) must define a .predict_proba() method.')
ValueError: The classifier (clf) must define a .predict_proba() method.

Is there any advice?

thanks.

'psx' is not defined in get_noise_indices() - issue for WINDOWS python users

This is my code:

if __name__ == '__main__':
    .
    .
    .
    est_py, est_nm, est_inv, confident_joint, my_psx=estimate_py_noise_matrices_and_cv_pred_proba(
    X=X_train,
    s=train_labels_with_errors,
    clf = GaussianNB()
    )
    label_errors = get_noise_indices(train_labels_with_errors,my_psx,verbose=1)

I'm still getting this error, even if psx is declared as global in pruning.py

Traceback (most recent call last):
  File "C:\Users\Jacopo\AppData\Local\Programs\Python\Python37\lib\multiprocessing\pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "C:\Users\Jacopo\AppData\Local\Programs\Python\Python37\lib\multiprocessing\pool.py", line 44, in mapstar
    return list(map(*args))
  File "C:\Users\Jacopo\AppData\Local\Programs\Python\Python37\lib\site-packages\cleanlab\pruning.py", line 109, in _prune_by_count
    noise_mask = np.zeros(len(psx), dtype=bool)
NameError: name 'psx' is not defined


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "c:\Users\Jacopo\.vscode\extensions\ms-python.python-2019.11.49689\pythonFiles\ptvsd_launcher.py", line 43, in <module>
    main(ptvsdArgs)
  File "c:\Users\Jacopo\.vscode\extensions\ms-python.python-2019.11.49689\pythonFiles\lib\python\old_ptvsd\ptvsd\__main__.py", line 432, in main
    run()
  File "c:\Users\Jacopo\.vscode\extensions\ms-python.python-2019.11.49689\pythonFiles\lib\python\old_ptvsd\ptvsd\__main__.py", line 316, in run_file
    runpy.run_path(target, run_name='__main__')
  File "C:\Users\Jacopo\AppData\Local\Programs\Python\Python37\lib\runpy.py", line 263, in run_path
    pkg_name=pkg_name, script_name=fname)
  File "C:\Users\Jacopo\AppData\Local\Programs\Python\Python37\lib\runpy.py", line 96, in _run_module_code
    mod_name, mod_spec, pkg_name, script_name)
  File "C:\Users\Jacopo\AppData\Local\Programs\Python\Python37\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "c:\Users\Jacopo\Google Drive\TesiLulli\MachineLearning_Python\3classi\3 classi no BNP\CleanLab\dirty.py", line 28, in <module>
    label_errors = get_noise_indices(train_labels_with_errors,my_psx,verbose=1)
  File "C:\Users\Jacopo\AppData\Local\Programs\Python\Python37\lib\site-packages\cleanlab\pruning.py", line 336, in get_noise_indices
    noise_masks_per_class = p.map(_prune_by_count, range(K))
  File "C:\Users\Jacopo\AppData\Local\Programs\Python\Python37\lib\multiprocessing\pool.py", line 268, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "C:\Users\Jacopo\AppData\Local\Programs\Python\Python37\lib\multiprocessing\pool.py", line 657, in get
    raise self._value
NameError: name 'psx' is not defined
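A hedged workaround, if the installed 1.x release exposes it: disable multiprocessing, since Windows' spawn start method does not inherit the module-level psx global that the worker processes expect:

# Single-process pruning avoids the spawn-related NameError on Windows.
label_errors = get_noise_indices(
    train_labels_with_errors,
    my_psx,
    n_jobs=1,   # assumes this cleanlab version exposes an n_jobs parameter
    verbose=1,
)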

Program running time

Hello, I want to select samples with wrong labels, and I run the program with:

ordered_label_errors = get_noise_indices(
    s=numpy_array_of_noisy_labels,
    psx=numpy_array_of_predicted_probabilities,
    sorted_index_method='normalized_margin',  # Orders label errors
)

The shape of psx is (390000, 70). Can you tell me how long the program will run? I have already waited 2 hours; maybe I made some mistakes?

Return "margin" for false-classifications

It would be great if get_noise_indices() would not only return the ordered indices but also the ordering criterion (e.g. the margin). Maybe return it as a tuple, or add another function that returns both, in order not to break the API.

I am looking to convey a "confidence" that a label is a wrong classification.
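For reference, later cleanlab versions expose exactly this kind of per-example score. A minimal sketch with toy arrays:

import numpy as np
from cleanlab.rank import get_label_quality_scores

labels = np.array([0, 1, 1, 0])
pred_probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.7, 0.3], [0.6, 0.4]])

# Lower score = more likely mislabeled; np.argsort(scores) reproduces
# the ordering that get_noise_indices returns, with the score itself
# serving as the requested "confidence".
scores = get_label_quality_scores(labels, pred_probs, method="normalized_margin")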
