microsoft / responsible-ai-toolbox-mitigations

Python library for implementing Responsible AI mitigations.

Home Page: https://responsible-ai-toolbox-mitigations.readthedocs.io/en/latest/

License: MIT License

Python 7.65% Jupyter Notebook 92.35%

data-analysis data-science machine-learning python responsible-ai responsible-ml

responsible-ai-toolbox-mitigations's Issues

Spelling/grammar/formatting issues

I think this is missing a blank line before the bulleted list. The bulleted list is not displaying properly in the docs.

responsible-ai-toolbox-mitigations/raimitigations/databalanceanalysis/aggregate_measures.py

Line 36 in 83c26a3

* Atkinson Index - https://en.wikipedia.org/wiki/Atkinson_index
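As a sketch of the likely fix: reStructuredText (which Sphinx uses to render docstrings) only recognises a bullet list when it is separated from the preceding paragraph by a blank line. The function name and the second bullet below are placeholders, not code from the repository.

```python
import inspect


def describe_measures():
    """Describe the aggregate measures (hypothetical example).

    The following measures are covered:

    * Atkinson Index - https://en.wikipedia.org/wiki/Atkinson_index
    * A second, placeholder entry
    """
    return ["Atkinson Index"]


# Check that the line just before the first bullet is blank,
# which is what the reST parser needs to start a list.
doc_lines = inspect.getdoc(describe_measures).splitlines()
first_bullet = next(i for i, line in enumerate(doc_lines) if line.startswith("* "))
blank_before_list = doc_lines[first_bullet - 1] == ""
print(blank_before_list)
```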

Configure pre-commit

This is not an issue, but it's something we should discuss, since there are some benefits to using pre-commit hooks in a repo. I believe it makes sense, since it will force all future pushes (internal or from the community) to be consistent in some respects. For example, I like to use the autopep8 pre-commit hook, since it enforces PEP 8 style and, in some cases, even rewrites the code automatically when you commit.

Missing docstrings

https://sturdy-barnacle-3b9f911d.pages.github.io/databalanceanalysis/databalanceanalysis.html#databalanceanalysis.aggregate_measures.AggregateBalanceMeasure

The FeatureBalanceMeasure, DistributionBalanceMeasure, and AggregateBalanceMeasure classes should have docstrings, at the very least to explain the parameters required by `__init__`.
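For illustration, a class docstring that documents its constructor parameters might look like the sketch below. The class name, parameter names, and column names are hypothetical stand-ins, not the library's actual signatures.

```python
class ExampleBalanceMeasure:
    """Compute balance measures for a set of sensitive columns.

    This class is a hypothetical stand-in, not the library's own code.

    :param sensitive_cols: names of the columns to analyse
    :param label_col: name of the label column, or ``None`` if unused
    """

    def __init__(self, sensitive_cols, label_col=None):
        # Store a copy so later changes to the caller's list have no effect.
        self.sensitive_cols = list(sensitive_cols)
        self.label_col = label_col


measure = ExampleBalanceMeasure(["gender", "race"], label_col="income")
print(ExampleBalanceMeasure.__doc__)
```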

Case3.ipynb: Invalid column name `sick-euthyroid`

When running the CTGAN section of case3.ipynb in 5 - Synthetic Data, I receive the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
c:\Users\morrissharp\Repos\responsible-ai-toolbox-mitigations\notebooks\dataprocessing\case_study\case3.ipynb Cell 45 in <cell line: 11>()
      8 synth.fit()
     10 conditions = {label_col:1}	# create more of the undersampled class
---> 11 syn_train_x, syn_train_y = synth.transform(X=train_x_sel, y=train_y, n_samples=200, conditions=conditions)
     13 syn_train_y.value_counts()

File c:\users\morrissharp\repos\responsible-ai-toolbox-mitigations\raimitigations\dataprocessing\sampler\synthesizer.py:570, in Synthesizer.transform(self, df, X, y, n_samples, conditions, strategy)
    568 if n_samples is not None:
    569     print(df.columns, conditions)
--> 570     samples = self.model.sample(n_samples, conditions=conditions)
    571 else:
    572     samples = self._generate_samples_strategy(df, strategy)

File c:\Users\morrissharp\Miniconda3\envs\rai\lib\site-packages\sdv\tabular\base.py:451, in BaseTabularModel.sample(self, num_rows, max_retries, max_rows_multiplier, conditions, float_rtol, graceful_reject_sampling)
    449 for column in conditions.columns:
    450     if column not in self._metadata.get_fields():
--> 451         raise ValueError(f'Invalid column name `{column}`')
    453 try:
    454     transformed_conditions = self._metadata.transform(conditions, on_missing_column='drop')

ValueError: Invalid column name `sick-euthyroid`

I am not sure what is going on. sick-euthyroid appears to be the name of the pandas Series that is passed in (train_y).
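One possible explanation, not confirmed by the traceback, is that the model was fitted on the feature columns only, so the label name used as a condition key is unknown to it. The sketch below uses made-up data and a hypothetical helper, `check_conditions`, to show how a condition key can be compared against the columns of the fitted data.

```python
import pandas as pd


def check_conditions(conditions: dict, fitted_df: pd.DataFrame) -> list:
    """Return the condition keys that are not columns of the fitted data."""
    return [col for col in conditions if col not in fitted_df.columns]


# Made-up data; only the label name is taken from the error message.
train_x = pd.DataFrame({"age": [41, 23, 46], "sex": ["F", "M", "F"]})
train_y = pd.Series([0, 1, 0], name="sick-euthyroid")

conditions = {train_y.name: 1}

# Fitted on features only: the label name is not a known column.
missing_without_label = check_conditions(conditions, train_x)

# Fitted on features plus label: the condition key is a known column.
full_df = pd.concat([train_x, train_y], axis=1)
missing_with_label = check_conditions(conditions, full_df)

print(missing_without_label, missing_with_label)
```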

Update/fix/reorganize the documentation files for the databalanceanalysis package

I noted the following problems:

- some of the classes have no documentation for their constructors
- some bullet points aren't formatted properly
- the entire documentation of this package is contained in a single page. Maybe it would make sense to spread it out into multiple pages and create some sort of hierarchy among them

I believe it would be a good idea to just go over the docs for this package and make sure that everything is in order.

Code coverage in codecov website

Once this repo becomes public, the code coverage could be added to a codecov account. Currently the coverage report is just printed out in the workflow.

feat_sel_sequential.ipynb example throwing key error in section 2 (no column names)

I am receiving the following error when running SeqFeatSelection on the example with no column names.

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
c:\Users\morrissharp\Repos\responsible-ai-toolbox-mitigations\notebooks\dataprocessing\module_tests\feat_sel_sequential.ipynb Cell 31 in <cell line: 2>()
      1 feat_sel = SeqFeatSelection(n_jobs=1)
----> 2 feat_sel.fit(df=dataset, label_col=11)
      3 feat_sel.get_selected_features()

File c:\users\morrissharp\repos\responsible-ai-toolbox-mitigations\raimitigations\dataprocessing\feat_selection\selector.py:174, in FeatureSelection.fit(self, X, y, df, label_col)
    172 if self.in_place:
    173     self.df_org = self.df
--> 174 self._fit()
    175 self.set_selected_features()
    176 self.fitted = True

File c:\users\morrissharp\repos\responsible-ai-toolbox-mitigations\raimitigations\dataprocessing\feat_selection\sequential_select.py:381, in SeqFeatSelection._fit(self)
    379 self._check_n_feat()
    380 self._check_fixed_columns()
--> 381 self._run_feat_selection()
    382 self._save_json()

File c:\users\morrissharp\repos\responsible-ai-toolbox-mitigations\raimitigations\dataprocessing\feat_selection\sequential_select.py:345, in SeqFeatSelection._run_feat_selection(self)
    333     verbose = 2
    334 self.selector = SFS(
    335     self.estimator,
    336     k_features=self.n_feat,
...
--> 568 k_idx = self.subsets_[best_subset]['feature_idx']
    570 if self.k_features == 'parsimonious':
    571     for k in self.subsets_:

KeyError: None

Case3_stat.ipynb: ValueError. mismatched shapes

In the Case3_stat.ipynb case study notebook, in the cell `Artificial Instances - CTGAN`, I receive the following error:

ValueError                                Traceback (most recent call last)
c:\Users\morrissharp\Repos\responsible-ai-toolbox-mitigations\notebooks\dataprocessing\case_study\case3_stat.ipynb Cell 20 in <cell line: 8>()
      5 result_tr = test_corr_transf(df, label_col, N_EXEC, dp.DataStandardScaler, MODEL_NAME, num_col)
      6 result_df = add_results_df(result_df, result_tr, "Std.")
----> 8 restult_fs = test_ctgan_first(df, label_col, N_EXEC, MODEL_NAME, rcorr=True, scaler_ref=dp.DataStandardScaler, feat_sel_type=None, art_str=0.2, savefile="3_1.pkl")
      9 result_df = add_results_df(result_df, restult_fs, "CTGAN 0.2 Std.")
     11 restult_fs = test_ctgan_first(df, label_col, N_EXEC, MODEL_NAME, rcorr=True, scaler_ref=dp.DataStandardScaler, feat_sel_type=None, art_str=0.6, savefile="3_2.pkl")

c:\Users\morrissharp\Repos\responsible-ai-toolbox-mitigations\notebooks\dataprocessing\case_study\case3_stat.ipynb Cell 20 in test_ctgan_first(df, label_col, n_exec, model_name, rcorr, scaler_ref, num_col, feat_sel_type, art_str, savefile)
    260 if art_str is not None:
    261 	train_x, train_y = artificial_ctgan(train_x, train_y, art_str, savefile)
--> 262 train_x, test_x = encode_case3_train_test(train_x, test_x)
    263 train_x, test_x = impute_case3_train_test(train_x, test_x)
    264 if feat_sel_type is not None:

c:\Users\morrissharp\Repos\responsible-ai-toolbox-mitigations\notebooks\dataprocessing\case_study\case3_stat.ipynb Cell 20 in encode_case3_train_test(train_x, test_x)
     33 enc_ohe = dp.EncoderOHE(drop=False, unknown_err=False, verbose=False)
     34 enc_ohe.fit(train_x)
---> 35 train_x_enc = enc_ohe.transform(train_x)
     36 test_x_enc = enc_ohe.transform(test_x)
     37 return train_x_enc, test_x_enc

File c:\users\morrissharp\repos\responsible-ai-toolbox-mitigations\raimitigations\dataprocessing\encoder\encoder.py:108, in DataEncoding.transform(self, df)
    106 self._check_if_fitted()
...
    391 passed = values.shape
    392 implied = (len(index), len(columns))
--> 393 raise ValueError(f"Shape of passed values is {passed}, indices imply {implied}")

ValueError: Shape of passed values is (2582, 30), indices imply (2582, 29)

I am not sure whether this is related to issue #31, but there may be a problem with the label_col not being included in the df when necessary.
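For context, pandas raises this kind of error when a 2-D array is wrapped in a DataFrame whose column list has a different length. The minimal reproduction below uses made-up shapes and column names and says nothing about which column is extra in the notebook.

```python
import numpy as np
import pandas as pd

# 30 columns of values, but only 29 column names.
values = np.zeros((4, 30))
column_names = [f"col_{i}" for i in range(29)]

try:
    pd.DataFrame(values, columns=column_names)
    error_message = None
except ValueError as err:
    # pandas reports both the array shape and the shape implied by the labels.
    error_message = str(err)

print(error_message)
```

Comparing the encoder's output width with the number of column names it generates, just before the DataFrame is built, would show where the extra column comes from.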

seaborn package required for case2_stat but not in requirements.txt

Seaborn is used in one of the plotting functions in case2_stat.ipynb but is not in requirements.txt. Not sure if this is really needed or if it is fine to switch it to matplotlib.

No documentation for the raimitigations.utils.get_metrics() function

The raimitigations.utils.get_metrics() function currently lives in the metric_utils.py file, which is not covered by the documentation. I suggest adding this file to the docs.

Rebalance class status progress message

The Rebalance class status message contains a string related to imputation instead:

No columns specified for imputation. These columns have been automatically identified:
[]
Running oversampling...

ValueError: case1_stat.ipynb error in CTGAN Section

I am receiving the following error when running the CTGAN section of case1_stat.ipynb:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
c:\Users\morrissharp\Repos\responsible-ai-toolbox-mitigations\notebooks\dataprocessing\case_study\case1_stat.ipynb Cell 14 in <cell line: 5>()
      2 result_base = test_base(df, label_col, N_EXEC, MODEL_NAME)
      3 result_df = add_results_df(None, result_base, "Baseline")
----> 5 restult_fs = test_ctgan_first(df, label_col, N_EXEC, MODEL_NAME, rcorr=False, feat_sel_type=None, art_str=0.6, savefile="1_1.pkl")
      6 result_df = add_results_df(result_df, restult_fs, "CTGAN 0.6")
      8 restult_fs = test_ctgan_first(df, label_col, N_EXEC, MODEL_NAME, rcorr=False, feat_sel_type=None, art_str=0.9, savefile="1_2.pkl")

c:\Users\morrissharp\Repos\responsible-ai-toolbox-mitigations\notebooks\dataprocessing\case_study\case1_stat.ipynb Cell 14 in test_ctgan_first(df, label_col, n_exec, model_name, rcorr, feat_sel_type, art_str, savefile)
    245 if art_str is not None:
    246 	train_x, train_y = artificial_ctgan(train_x, train_y, art_str, savefile)
--> 247 train_x, test_x = encode_case1_train_test(train_x, test_x)
    248 train_x, test_x = impute_case1_train_test(train_x, test_x)
    249 if feat_sel_type is not None:

c:\Users\morrissharp\Repos\responsible-ai-toolbox-mitigations\notebooks\dataprocessing\case_study\case1_stat.ipynb Cell 14 in encode_case1_train_test(train_x, test_x)
     55 def encode_case1_train_test(train_x, test_x):
     56 	enc_ord, enc_ohe = get_encoders(df)
---> 57 	enc_ord.fit(train_x)
     58 	train_x_enc = enc_ord.transform(train_x)
     59 	test_x_enc = enc_ord.transform(test_x)

File c:\users\morrissharp\repos\responsible-ai-toolbox-mitigations\raimitigations\dataprocessing\encoder\encoder.py:84, in DataEncoding.fit(self, df, y)
     82 self._set_column_to_encode()
...
    120         + "the order of the existing values of the column col_encode[i]. If a value is not given, "
    121         + "it will be assigned a None value."
    122     )

ValueError: ERROR: the value '24-26' provided to the the list of values for the key 'inv-nodes' in the 'categories' parameter does not match any of the unique values found in the column 'inv-nodes' of the dataset provided.

Not sure exactly what the cause is for this yet.

Additionally, while investigating this, I noticed a couple of other issues as well:

Ordinal encoding is performed on age, tumor size, and inv-nodes. Because these values are strings, they are sorted lexicographically rather than numerically, so the resulting order is wrong:

age_order, ['20-29' '30-39' '40-49' '50-59' '60-69' '70-79']
tumor_size_order, ['0-4' '10-14' '15-19' '20-24' '25-29' '30-34' '35-39' '40-44' '45-49'
 '5-9' '50-54']
inv_nodes_order, ['0-2' '12-14' '15-17' '24-26' '3-5' '6-8' '9-11']
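A minimal sketch of one way to get the numeric order: sort the range strings by the integer before the dash. The helper name `lower_bound` is an illustration, not part of the notebook.

```python
def lower_bound(interval: str) -> int:
    """Return the integer before the dash, e.g. 12 for '12-14'."""
    return int(interval.split("-")[0])


inv_nodes = ["0-2", "12-14", "15-17", "24-26", "3-5", "6-8", "9-11"]

# Plain string sorting compares character by character, so '12-14' < '3-5'.
lexicographic = sorted(inv_nodes)

# Sorting on the numeric lower bound gives the intended ordinal order.
numeric = sorted(inv_nodes, key=lower_bound)

print(lexicographic)
print(numeric)
```

The resulting list could then be passed as the explicit category order for the ordinal encoder instead of relying on the default sort.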

get_encoders is called on df, but encode_case1_train_test() does not take in df as a parameter

	enc_ord, enc_ohe = get_encoders(df)

Correlated Features examples

I noticed a couple of issues here.

There are two different examples: feat_sel_corr_tutorial.ipynb and feat_sel_corr.ipynb. feat_sel_corr.ipynb does not contain any explanatory comments. Maybe this one is extra and not needed?
Both of these notebooks write the same json files to ./corr_json_examples/. The files that are part of the git repo are the ones belonging to feat_sel_corr.ipynb.

If the 2nd notebook is not needed, then the json files should be replaced with the ones belonging to the 1st example.

No installation instructions

I have noticed that there are no installation instructions in either the main README.md or the documentation.

TODO: fill out support.md

Support.md needs to be filled out as per these instructions.

responsible-ai-toolbox-mitigations/SUPPORT.md

Lines 1 to 7 in 0d69bb6

    
           # TODO: The maintainer of this repo has not yet edited this file 
        
           **REPO OWNER**: Do you want Customer Service & Support (CSS) support for this product/project? 
        
           - **No CSS support:** Fill out this template with information about how to file issues and get help. 
        
           - **Yes CSS support:** Fill out an intake form at [aka.ms/spot](https://aka.ms/spot). CSS will work with/help you to determine next steps. More details also available at [aka.ms/onboardsupport](https://aka.ms/onboardsupport). 
        
           - **Not sure?** Fill out a SPOT intake as though the answer were "Yes". CSS will help you decide.

Balancing on multiple columns workaround

Currently there is a workaround in the end-to-end Jupyter notebook so that we can balance on two columns at the same time using any of the rebalancing techniques. While making design changes to the Rebalance API, we should allow the user to specify multiple columns that they are interested in balancing on, rather than creating a single combined column and balancing on that single-column cohort.
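The workaround described above can be sketched as follows: the two columns are concatenated into one cohort column, which is then used as the single rebalancing target. The data and the column names (`gender`, `race`, `cohort`) are made up for illustration and are not taken from the notebook.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "gender": ["F", "M", "F", "M", "F"],
        "race": ["A", "A", "B", "B", "A"],
        "label": [1, 0, 0, 1, 1],
    }
)

# Combine the two columns into a single cohort identifier.
df["cohort"] = df["gender"].astype(str) + "_" + df["race"].astype(str)

# Count the rows per combined cohort; a rebalancer would target this column.
cohort_counts = df["cohort"].value_counts().to_dict()
print(cohort_counts)
```

An API that accepts a list of columns could perform this combination internally and drop the helper column afterwards.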

Seed cannot be set

responsible-ai-toolbox-mitigations/notebooks/dataprocessing/case_study/case2.ipynb

Line 611 in 0d69bb6

    
           "As we can see, this transformation had some impact in the results (depends on the seed used) when we use KNN. Let's check how this data transformation impacts the XGBoost model:"

The case2.ipynb notebook references the ability to set a seed, but this is not available for split_data(), train_model_plot_results(), or train_model_fetch_results(). Additionally, I have noticed that there is no way to pass any parameters to the model itself for instantiation/fitting (e.g. setting the number of neighbors for KNN).

I am not sure whether you expect these functions to be used outside of the example notebooks. If so, you should consider allowing the user to set a random seed, as well as pass in model parameters, possibly using something like *args and **kwargs.
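A rough sketch of what such signatures could look like, using only the standard library. The parameter names (`random_state`, `model_kwargs`) and the `build_model` helper are assumptions for illustration, and the function bodies do not reflect the library's actual implementation.

```python
import random
from typing import Any, Callable, List, Optional, Sequence, Tuple


def split_data(
    rows: Sequence[Any], test_size: float = 0.2, random_state: Optional[int] = None
) -> Tuple[List[Any], List[Any]]:
    """Shuffle rows reproducibly and split them into train and test lists."""
    rng = random.Random(random_state)  # a fixed seed gives the same shuffle each run
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]


def build_model(model_factory: Callable[..., Any], **model_kwargs: Any) -> Any:
    """Instantiate a model, forwarding keyword arguments to its constructor."""
    return model_factory(**model_kwargs)


first_split = split_data(range(10), test_size=0.3, random_state=42)
second_split = split_data(range(10), test_size=0.3, random_state=42)

# `dict` stands in for a model class here; a real call might pass a
# classifier class together with, for example, n_neighbors=3.
model = build_model(dict, n_neighbors=3)
print(first_split == second_split, model)
```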

Missing example dataset

The HR promotions dataset listed in Simple Example is not in the dataset directory.

responsible-ai-toolbox-mitigations/notebooks/dataprocessing/module_tests/model_test.ipynb

Lines 306 to 307 in 83c26a3

    
           "data_dir = '../../../datasets/hr_promotion'\n", 
        
           "dataset =  pd.read_csv(data_dir + '/train.csv')\n",

> from .utils import fetch_cohort_results, plot_value_counts_cohort
> import seaborn as sns
No module named 'seaborn'

The error doesn't occur if I install the library using pip install raimitigations[all].

To fix, add seaborn to the base set of dependencies, not only to the [all] dependency group.

