aws / amazon-sagemaker-clarify

Fairness Aware Machine Learning. Bias detection and mitigation for datasets and models.

License: Apache License 2.0

Python 98.95% Shell 1.05%
machine-learning fairness fairness-ml fairness-ai

amazon-sagemaker-clarify's Introduction

Python 3.8+

smclarify

Amazon SageMaker Clarify

Bias detection and mitigation for datasets and models.

Installation

To install the package from PyPI, simply run:

pip install smclarify

Examples of running the bias metrics can be found in the notebooks in the examples folder.

Terminology

Facet

A facet is a column or feature against which bias is measured. A facet can have value(s) that designate a sample as "sensitive".

Label

The label is a column or feature that is the target for training a machine learning model. The label can have value(s) that designate a sample as having a "positive" outcome.

Bias measure

A bias measure is a function that returns a bias metric.

Bias metric

A bias metric is a numerical value indicating the level of bias detected as determined by a particular bias measure.

Bias report

A collection of bias metrics for a given dataset or a combination of a dataset and model.
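
Tying these terms together, here is a minimal sketch of producing a pre-training bias report with smclarify. The calls mirror the report API used in the issues below; the data, column names, and the FacetColumn constructor shown here are illustrative assumptions, not authoritative usage.

import pandas as pd
from smclarify.bias import report

# Hypothetical toy dataset: "Gender" is the facet, "Approved" the label.
df = pd.DataFrame({
    "Gender": ["F", "M", "F", "M"],
    "Approved": ["yes", "yes", "no", "yes"],
})

facet_column = report.FacetColumn("Gender")  # assumed constructor
label_column = report.LabelColumn("Approved", df["Approved"], ["yes"])  # "yes" = positive outcome

# A collection of pre-training bias metrics for the dataset.
result = report.bias_report(
    df, facet_column, label_column, stage_type=report.StageType.PRE_TRAINING
)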

Development

It's recommended that you set up a virtualenv. The commands below assume the fish shell:

virtualenv -p(which python3) venv
source venv/bin/activate.fish
pip install -e .[test]
cd src/
../devtool all

To run the unit tests, run pytest --pspec. If you are using PyCharm and cannot see the green run button next to the tests, open Preferences -> Tools -> Python Integrated Tools and set the default test runner to pytest.

For internal contributors: after creating the virtualenv with the steps above, run ../devtool integ_tests to run the integration tests.

amazon-sagemaker-clarify's People

Contributors

amazon-auto, dosatos, eytsai, jmikko, keerthanvasist, larroy, milah, oyangz, pinaraws, prkrishnan1, satish615, xgchena, xiaoyi-cheng, xinyu7030


amazon-sagemaker-clarify's Issues

DPL - proportions qa and qd definitions interchanged

The formula for the difference in proportions of labels is as follows:

    **DPL = (qa - qd)**

Where:

qa = na(1)/na is the proportion of facet a with an observed label value of 1. For example, the proportion of a middle-aged demographic who get approved for loans. Here na(1) represents the number of members of facet a who get a positive outcome and na is the number of members of facet a.

qd = nd(1)/nd is the proportion of facet d with an observed label value of 1. For example, the proportion of people outside the middle-aged demographic who get approved for loans. Here nd(1) represents the number of members of facet d who get a positive outcome and nd is the number of members of facet d.
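
As a worked example with hypothetical numbers: if 70 of 100 members of facet a are approved (qa = 0.7) and 40 of 100 members of facet d are approved (qd = 0.4), then DPL = 0.7 - 0.4 = 0.3.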

In the code below, na and na_pos need to refer to the sensitive facet index, and the ~ operator should be used for nd and nd_pos instead.

Source code:

import pandas as pd

# `require` is a validation helper defined alongside the smclarify bias metrics.

def DPL(feature: pd.Series, sensitive_facet_index: pd.Series, positive_label_index: pd.Series) -> float:
    require(sensitive_facet_index.dtype == bool, "sensitive_facet_index must be of type bool")
    require(positive_label_index.dtype == bool, "label_index must be of type bool")
    na = len(feature[~sensitive_facet_index])
    nd = len(feature[sensitive_facet_index])
    na_pos = len(feature[~sensitive_facet_index & positive_label_index])
    nd_pos = len(feature[sensitive_facet_index & positive_label_index])
    if na == 0:
        raise ValueError("Negative facet set is empty.")
    if nd == 0:
        raise ValueError("Facet set is empty.")
    qa = na_pos / na
    qd = nd_pos / nd
    dpl = qa - qd
    return dpl
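
For reference, a toy invocation of the function above (the data is made up, and the import path is an assumption based on the tracebacks quoted later on this page):

import pandas as pd
from smclarify.bias.metrics.pretraining import DPL  # assumed module path

feature = pd.Series(["d", "d", "a", "a"])
sensitive_facet_index = pd.Series([True, True, False, False])  # marks the sensitive facet
positive_label_index = pd.Series([True, False, True, True])

# With the code as written, qa is computed over ~sensitive_facet_index:
# qa = 2/2 = 1.0, qd = 1/2 = 0.5, so this prints 0.5.
print(DPL(feature, sensitive_facet_index, positive_label_index))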

How can I use it for a forecasting or regression model?

Hi Team,
Please let me know how I can use it for a forecasting model.

label_column = report.LabelColumn(name='{}'.format(output_name), data=output_col, positive_label_values=['{}'.format(output_name)])

In the above line, what value can I pass in positive_label_values?
For a classification problem I can pass something like:

label_column = report.LabelColumn(output_name, output_col, [1])
or
label_column = report.LabelColumn(output_name, output_col, ['yes'])

but in my case temperature is the target column.
Please let me know how to configure the label_column and also the line below.

bias_report = report.bias_report(train_df, facet_column, label_column, stage_type=report.StageType.PRE_TRAINING, group_variable=output_col)

Can't recreate smclarify example on AWS SageMaker. Bias report gives: CDDL metrics failed

When I try to run Bias_metrics_usage.ipynb from the examples in https://github.com/aws/amazon-sagemaker-clarify.git in a SageMaker Studio notebook, I get the following error from the following line. (There was originally a NumPy error that I fixed by upgrading Pandas; other than that, no other errors.)

report.bias_report(df, facet_column, label_column, stage_type=report.StageType.PRE_TRAINING)

CDDL metrics failed
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/smclarify/bias/report.py", line 292, in _categorical_metric_call_wrapper
    group_variable=group_variable,
  File "/opt/conda/lib/python3.7/site-packages/smclarify/bias/metrics/__init__.py", line 26, in call_metric
    return metric(**{key: kwargs[key] for key in inspect.signature(metric).parameters.keys()})
  File "/opt/conda/lib/python3.7/site-packages/smclarify/bias/metrics/pretraining.py", line 195, in CDDL
    return common.CDD(feature, sensitive_facet_index, positive_label_index, group_variable)
  File "/opt/conda/lib/python3.7/site-packages/smclarify/bias/metrics/common.py", line 80, in CDD
    raise ValueError("Group variable is empty or not provided")
ValueError: Group variable is empty or not provided

(The same "CDDL metrics failed" traceback is printed four times.)

Is this a problem with the example? Could it be caused by the Pandas upgrade? Any help or advice would be much appreciated.
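
For what it's worth, the traceback shows CDD(L) refusing to run without a group_variable. A minimal sketch of supplying one, following the bias_report signature quoted in the previous issue (the grouping column name is hypothetical):

report.bias_report(
    df, facet_column, label_column,
    stage_type=report.StageType.PRE_TRAINING,
    group_variable=df["age_group"],  # hypothetical column used for the CDD(L) subgroup analysis
)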

Clarify Job Failed due to Predicted Label Column series datatype is not the same as Label Column series

Hi,

I am using SageMakerClarifyProcessor to generate analysis reports for an XGBoost binary classification model. However, the job failed because the Predicted Label Column series datatype is not the same as the Label Column series.

The log output is:

1663079368252 | [ 2022-09-13T14:29:28.252Z ] 2022-09-13 14:29:23,993 Errors occurred when analyzing your data. Please check CloudWatch logs for more details.

1663079368252 | [ 2022-09-13T14:29:28.252Z ] 2022-09-13 14:29:23,993 exit_message: Customer Error: Predicted Label Column series datatype is not the same as Label Column series

1663079368252 | [ 2022-09-13T14:29:28.252Z ] 2022-09-13 14:29:23,992 Column None with data uniqueness fraction 0.00029424746211563924 is classifed as a CATEGORICAL column

1663079368252 | [ 2022-09-13T14:29:28.252Z ] 2022-09-13 14:29:23,951 Column STATUS with data uniqueness fraction 0.00029424746211563924 is classifed as a CONTINUOUS column

The data is something like:

0,0,3698911.49,89.12865,1,10,0,937296,1,0,0,1,1,1,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0
0,0,1953000,89.30876,1,7.5,0,1116000,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0

The first column is the predicted label, and the second column is the ground-truth label.

Any suggestions on how I can get around this issue, given that we do not have control over how SageMaker processes the datatypes?
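
For illustration only, a sketch of the kind of dtype mismatch the error message describes, using plain pandas (the series values are hypothetical; this is not Clarify's actual validation code):

import pandas as pd

labels = pd.Series([0, 0])         # ground-truth labels inferred as int64
predicted = pd.Series(["0", "0"])  # predicted labels read back as strings (object dtype)

print(labels.dtype == predicted.dtype)  # False: the mismatch Clarify rejects

# A possible workaround, if you control the inputs: align the dtypes up front.
predicted = predicted.astype(labels.dtype)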

Multi-categorical confusion matrix calculation for labels not present in predicted_labels

Feedback from Bilal from a PR review: #136 (comment)

How are we supposed to handle cases where a predicted label (in this case 2) is not present in the observed labels (in this case [1])? Some options (options 2 and 3 are sketched below):

1. Limit the confusion matrix (CM) to labels that are present in both the observed labels (label_series) and the predicted labels (predicted_label_series). This is what sklearn does.
2. The CM contains labels from the union of observed and predicted labels.
3. The CM contains labels from the observed labels only. If a predicted label is not found in the observed labels, we raise an error saying something like "Unknown label 2".

I think we should pick option 3, since it assumes that the observed labels provide a complete list of all possible labels. Option 1 could be problematic because it would drop some valid observed labels if they are not found in the predicted labels.

If we opt for 3, we should raise an error in this line.

We need to figure out whether we want to handle this on the analyzer side or in the library.
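
A minimal sketch of options 2 and 3, using sklearn's confusion_matrix and its labels parameter (the label values are hypothetical):

import numpy as np
from sklearn.metrics import confusion_matrix

observed = [1, 1]
predicted = [1, 2]  # label 2 never appears in the observed labels

# Option 2: build the CM over the union of observed and predicted labels.
union = np.union1d(observed, predicted)
cm_union = confusion_matrix(observed, predicted, labels=union)

# Option 3: observed labels only; raise on unknown predicted labels.
unknown = set(predicted) - set(observed)
if unknown:
    raise ValueError(f"Unknown label {unknown.pop()}")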

Clarify job fails in Spark mode

Thanks for this project. For my project, I need to configure some elements of the Clarify processing, which would require the respective Dockerfiles to be available for modification. More concretely, I am facing timeouts in the endpoint calls due to a very high max batch size/max payload and a slow model, but only when Apache Spark integration is used, i.e. instance_count > 1. In that case, the max payload is for some reason much higher than when Spark integration is disabled, leading to longer response times per batch. Adding more instances, or choosing a bigger or more powerful instance for the endpoint, does not solve the problem.

Can you open-source the Dockerfiles? This would be very beneficial.

In addition, sagemaker.clarify.SageMakerClarifyProcessor() should accept an optional image_uri argument so I can supply my custom image, but I could also solve that myself by forking the SageMaker SDK and creating a PR.

DRR Returns -Infinity for German Dataset

This is the output from the analysis.json:
{"name": "DRR", "description": "Difference in Rejection Rates (DRR)", "value": "-Infinity"}

DRR should only be in the range [-1, 1].

Ran in a Processing Job against the German dataset in us-east-2.
