aws / amazon-sagemaker-clarify

Fairness Aware Machine Learning. Bias detection and mitigation for datasets and models.

License: Apache License 2.0

Python 98.95% Shell 1.05%
machine-learning fairness fairness-ml fairness-ai

amazon-sagemaker-clarify's Introduction

Python 3.8+

smclarify

Amazon SageMaker Clarify

Bias detection and mitigation for datasets and models.

Installation

To install the package from PyPI, simply run:

pip install smclarify

Examples of running the bias metrics can be found in the notebooks in the examples folder.

Terminology

Facet

A facet is a column or feature against which bias is measured. A facet can have value(s) that designate a sample as "sensitive".

Label

The label is a column or feature that is the target for training a machine learning model. The label can have value(s) that designate a sample as having a "positive" outcome.

Bias measure

A bias measure is a function that returns a bias metric.

Bias metric

A bias metric is a numerical value indicating the level of bias detected as determined by a particular bias measure.

Bias report

A collection of bias metrics for a given dataset or a combination of a dataset and model.
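
Tying these terms together, here is a minimal sketch of producing a pre-training bias report with smclarify. The calls mirror the report API used in the issues below; the data, column names, and the FacetColumn constructor shown here are illustrative assumptions, not authoritative usage.

import pandas as pd
from smclarify.bias import report

# Hypothetical toy dataset: "Gender" is the facet, "Approved" the label.
df = pd.DataFrame({
    "Gender": ["F", "M", "F", "M"],
    "Approved": ["yes", "yes", "no", "yes"],
})

facet_column = report.FacetColumn("Gender")  # assumed constructor
label_column = report.LabelColumn("Approved", df["Approved"], ["yes"])  # "yes" = positive outcome

# A collection of pre-training bias metrics for the dataset.
result = report.bias_report(
    df, facet_column, label_column, stage_type=report.StageType.PRE_TRAINING
)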

Development

It's recommended that you set up a virtualenv. The commands below assume the fish shell:

virtualenv -p(which python3) venv
source venv/bin/activate.fish
pip install -e .[test]
cd src/
../devtool all

To run the unit tests, run pytest --pspec. If you are using PyCharm and cannot see the green run button next to the tests, open Preferences -> Tools -> Python Integrated Tools and set the default test runner to pytest.

For internal contributors: after creating the virtualenv with the steps above, run ../devtool integ_tests to run the integration tests.

amazon-sagemaker-clarify's People

Contributors

amazon-auto, dosatos, eytsai, jmikko, keerthanvasist, larroy, milah, oyangz, pinaraws, prkrishnan1, satish615, xgchena, xiaoyi-cheng, xinyu7030


amazon-sagemaker-clarify's Issues

DPL - proportions qa and qd definitions interchanged

The formula for the difference in proportions of labels is as follows:

    **DPL = (qa - qd)**

Where:

qa = na(1)/na is the proportion of facet a with an observed label value of 1. For example, the proportion of a middle-aged demographic who get approved for loans. Here na(1) represents the number of members of facet a who get a positive outcome and na is the number of members of facet a.

qd = nd(1)/nd is the proportion of facet d with an observed label value of 1. For example, the proportion of people outside the middle-aged demographic who get approved for loans. Here nd(1) represents the number of members of facet d who get a positive outcome and nd is the number of members of facet d.
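
As a worked example with hypothetical numbers: if 70 of 100 members of facet a are approved (qa = 0.7) and 40 of 100 members of facet d are approved (qd = 0.4), then DPL = 0.7 - 0.4 = 0.3.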

In the code below, na and na_pos need to refer to the sensitive facet index, and the ~ operator should be used for nd and nd_pos instead.

Source code:

import pandas as pd

# `require` is a validation helper defined alongside the smclarify bias metrics.

def DPL(feature: pd.Series, sensitive_facet_index: pd.Series, positive_label_index: pd.Series) -> float:
    require(sensitive_facet_index.dtype == bool, "sensitive_facet_index must be of type bool")
    require(positive_label_index.dtype == bool, "label_index must be of type bool")
    na = len(feature[~sensitive_facet_index])
    nd = len(feature[sensitive_facet_index])
    na_pos = len(feature[~sensitive_facet_index & positive_label_index])
    nd_pos = len(feature[sensitive_facet_index & positive_label_index])
    if na == 0:
        raise ValueError("Negative facet set is empty.")
    if nd == 0:
        raise ValueError("Facet set is empty.")
    qa = na_pos / na
    qd = nd_pos / nd
    dpl = qa - qd
    return dpl
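
For reference, a toy invocation of the function above (the data is made up, and the import path is an assumption based on the tracebacks quoted later on this page):

import pandas as pd
from smclarify.bias.metrics.pretraining import DPL  # assumed module path

feature = pd.Series(["d", "d", "a", "a"])
sensitive_facet_index = pd.Series([True, True, False, False])  # marks the sensitive facet
positive_label_index = pd.Series([True, False, True, True])

# With the code as written, qa is computed over ~sensitive_facet_index:
# qa = 2/2 = 1.0, qd = 1/2 = 0.5, so this prints 0.5.
print(DPL(feature, sensitive_facet_index, positive_label_index))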

How can I use it for a forecasting or regression model?

Hi Team,
Please let me know how I can use it for a forecasting model.

label_column = report.LabelColumn(name='{}'.format(output_name), data=output_col, positive_label_values=['{}'.format(output_name)])

In the above line, what value can I pass in positive_label_values?
For a classification problem I can pass something like:

label_column = report.LabelColumn(output_name, output_col, [1])
or
label_column = report.LabelColumn(output_name, output_col, ['yes'])

but in my case temperature is the target column.
Please let me know how to configure the label_column and also the line below.

bias_report = report.bias_report(train_df, facet_column, label_column, stage_type=report.StageType.PRE_TRAINING, group_variable=output_col)

Can't recreate smclarify example on AWS SageMaker. Bias report gives: CDDL metrics failed

When I try to run Bias_metrics_usage.ipynb from the examples in https://github.com/aws/amazon-sagemaker-clarify.git in a SageMaker Studio notebook, I get the following error from the following line. (There was originally a NumPy error that I fixed by upgrading Pandas; other than that, no other errors.)

report.bias_report(df, facet_column, label_column, stage_type=report.StageType.PRE_TRAINING)

CDDL metrics failed
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/smclarify/bias/report.py", line 292, in _categorical_metric_call_wrapper
    group_variable=group_variable,
  File "/opt/conda/lib/python3.7/site-packages/smclarify/bias/metrics/__init__.py", line 26, in call_metric
    return metric(**{key: kwargs[key] for key in inspect.signature(metric).parameters.keys()})
  File "/opt/conda/lib/python3.7/site-packages/smclarify/bias/metrics/pretraining.py", line 195, in CDDL
    return common.CDD(feature, sensitive_facet_index, positive_label_index, group_variable)
  File "/opt/conda/lib/python3.7/site-packages/smclarify/bias/metrics/common.py", line 80, in CDD
    raise ValueError("Group variable is empty or not provided")
ValueError: Group variable is empty or not provided

(The same "CDDL metrics failed" traceback is printed four times.)

Is this a problem with the example? Could it be caused by the Pandas upgrade? Any help or advice would be much appreciated.
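
For what it's worth, the traceback shows CDD(L) refusing to run without a group_variable. A minimal sketch of supplying one, following the bias_report signature quoted in the previous issue (the grouping column name is hypothetical):

report.bias_report(
    df, facet_column, label_column,
    stage_type=report.StageType.PRE_TRAINING,
    group_variable=df["age_group"],  # hypothetical column used for the CDD(L) subgroup analysis
)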

Clarify Job Failed due to Predicted Label Column series datatype is not the same as Label Column series

Hi,

I am using SageMakerClarifyProcessor to generate analysis reports for an XGBoost binary classification model. However, the job failed because the Predicted Label Column series datatype is not the same as the Label Column series.

The log output is:

1663079368252 | [ 2022-09-13T14:29:28.252Z ] 2022-09-13 14:29:23,993 Errors occurred when analyzing your data. Please check CloudWatch logs for more details.

1663079368252 | [ 2022-09-13T14:29:28.252Z ] 2022-09-13 14:29:23,993 exit_message: Customer Error: Predicted Label Column series datatype is not the same as Label Column series

1663079368252 | [ 2022-09-13T14:29:28.252Z ] 2022-09-13 14:29:23,992 Column None with data uniqueness fraction 0.00029424746211563924 is classifed as a CATEGORICAL column

1663079368252 | [ 2022-09-13T14:29:28.252Z ] 2022-09-13 14:29:23,951 Column STATUS with data uniqueness fraction 0.00029424746211563924 is classifed as a CONTINUOUS column

The data is something like:

0,0,3698911.49,89.12865,1,10,0,937296,1,0,0,1,1,1,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0
0,0,1953000,89.30876,1,7.5,0,1116000,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0

The first column is the predicted label, and the second column is the ground-truth label.

Any suggestions on how I can get around this issue, given that we do not have control over how SageMaker processes the datatypes?
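
For illustration only, a sketch of the kind of dtype mismatch the error message describes, using plain pandas (the series values are hypothetical; this is not Clarify's actual validation code):

import pandas as pd

labels = pd.Series([0, 0])         # ground-truth labels inferred as int64
predicted = pd.Series(["0", "0"])  # predicted labels read back as strings (object dtype)

print(labels.dtype == predicted.dtype)  # False: the mismatch Clarify rejects

# A possible workaround, if you control the inputs: align the dtypes up front.
predicted = predicted.astype(labels.dtype)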

Multi-categorical confusion matrix calculation for labels not present in predicted_labels

Feedback from Bilal from a PR review: #136 (comment)

How are we supposed to handle cases where a predicted label (in this case 2) is not present in the observed labels (in this case [1])? Some options (options 2 and 3 are sketched below):

1. Limit the confusion matrix (CM) to labels that are present in both the observed labels (label_series) and the predicted labels (predicted_label_series). This is what sklearn does.
2. The CM contains labels from the union of observed and predicted labels.
3. The CM contains labels from the observed labels only. If a predicted label is not found in the observed labels, we raise an error saying something like "Unknown label 2".

I think we should pick option 3, since it assumes that the observed labels provide a complete list of all possible labels. Option 1 could be problematic because it would drop some valid observed labels if they are not found in the predicted labels.

If we opt for 3, we should raise an error in this line.

We need to figure out whether we want to handle this on the analyzer side or in the library.
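
A minimal sketch of options 2 and 3, using sklearn's confusion_matrix and its labels parameter (the label values are hypothetical):

import numpy as np
from sklearn.metrics import confusion_matrix

observed = [1, 1]
predicted = [1, 2]  # label 2 never appears in the observed labels

# Option 2: build the CM over the union of observed and predicted labels.
union = np.union1d(observed, predicted)
cm_union = confusion_matrix(observed, predicted, labels=union)

# Option 3: observed labels only; raise on unknown predicted labels.
unknown = set(predicted) - set(observed)
if unknown:
    raise ValueError(f"Unknown label {unknown.pop()}")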

Clarify job fails in Spark mode

Thanks for this project. For my project, I need to configure some elements of the Clarify processing, which would require the respective Dockerfiles to be available for modification. More concretely, I am facing timeouts in the endpoint calls due to a very high max batch size/max payload and a slow model, but only when Apache Spark integration is used, i.e. instance_count > 1. In that case, the max payload is for some reason much higher than when Spark integration is disabled, leading to longer response times per batch. Adding more instances, or choosing a bigger or more powerful instance for the endpoint, does not solve the problem.

Can you open-source the Dockerfiles? This would be very beneficial.

In addition, sagemaker.clarify.SageMakerClarifyProcessor() should accept an optional image_uri argument so I can supply my custom image, but I could also solve that myself by forking the SageMaker SDK and creating a PR.

DRR Returns -Infinity for German Dataset

This is the output from the analysis.json:
{"name": "DRR", "description": "Difference in Rejection Rates (DRR)", "value": "-Infinity"}

DRR should only be in the range [-1, 1].

Ran in a Processing Job against the German dataset in us-east-2.
