microsoft / responsible-ai-toolbox-mitigations

Python library for implementing Responsible AI mitigations.

Home Page: https://responsible-ai-toolbox-mitigations.readthedocs.io/en/latest/

License: MIT License

Python 7.65% Jupyter Notebook 92.35%
data-analysis data-science machine-learning python responsible-ai responsible-ml

responsible-ai-toolbox-mitigations's Issues

Configure pre-commit

This is not an issue as such, but it is something we should discuss, since using pre-commit hooks in a repo has some benefits. I believe it makes sense, since it will force all future contributions (internal or from the community) to be consistent in some respects. For example, I like to use the autopep8 pre-commit hook, since it enforces the PEP 8 standard and, in some cases, even fixes the code automatically when you commit.

Case3.ipynb: Invalid column name `sick-euthyroid`

When running the CTGAN section of case3.ipynb in 5 - Synthetic Data, I receive the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
c:\Users\morrissharp\Repos\responsible-ai-toolbox-mitigations\notebooks\dataprocessing\case_study\case3.ipynb Cell 45 in <cell line: 11>()
      8 synth.fit()
     10 conditions = {label_col:1}	# create more of the undersampled class
---> 11 syn_train_x, syn_train_y = synth.transform(X=train_x_sel, y=train_y, n_samples=200, conditions=conditions)
     13 syn_train_y.value_counts()

File c:\users\morrissharp\repos\responsible-ai-toolbox-mitigations\raimitigations\dataprocessing\sampler\synthesizer.py:570, in Synthesizer.transform(self, df, X, y, n_samples, conditions, strategy)
    568 if n_samples is not None:
    569     print(df.columns, conditions)
--> 570     samples = self.model.sample(n_samples, conditions=conditions)
    571 else:
    572     samples = self._generate_samples_strategy(df, strategy)

File c:\Users\morrissharp\Miniconda3\envs\rai\lib\site-packages\sdv\tabular\base.py:451, in BaseTabularModel.sample(self, num_rows, max_retries, max_rows_multiplier, conditions, float_rtol, graceful_reject_sampling)
    449 for column in conditions.columns:
    450     if column not in self._metadata.get_fields():
--> 451         raise ValueError(f'Invalid column name `{column}`')
    453 try:
    454     transformed_conditions = self._metadata.transform(conditions, on_missing_column='drop')

ValueError: Invalid column name `sick-euthyroid`

I am not sure what is going on. sick-euthyroid appears to be the name of the pandas Series that is passed in as y (train_y).
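The check quoted in the traceback (sdv/tabular/base.py, lines 449-451) rejects any key of `conditions` that is not among the model's metadata fields. The stand-alone sketch below mimics that check in plain Python. The helper name `validate_conditions` and the feature names are hypothetical, and it is only an assumption, not a confirmed diagnosis, that the fitted metadata lacks the label column.

```python
def validate_conditions(conditions, metadata_fields):
    """Mimic the column check from the traceback: every key must be a known field."""
    for column in conditions:
        if column not in metadata_fields:
            raise ValueError(f"Invalid column name `{column}`")


# Hypothetical field list: feature columns only, label column absent (assumption).
feature_fields = ["age", "sex", "TSH"]
label_col = "sick-euthyroid"

message = None
try:
    validate_conditions({label_col: 1}, feature_fields)
except ValueError as err:
    message = str(err)

print(message)

# Once the label column is among the fields, the same conditions pass the check.
validate_conditions({label_col: 1}, feature_fields + [label_col])
```

Comparing the keys of `conditions` with the columns of the frame that was passed to `fit()` would confirm or rule out this explanation.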

Update/fix/reorganize the documentation files for the databalanceanalysis package

I noted the following problems:

  • some of the classes don't have documentation for their constructor
  • some bullet points aren't formatted properly
  • the entire documentation of this package is contained in a single page. Maybe it would make sense to spread it out into multiple pages and create some sort of hierarchy among them

I believe it would be a good idea to just go over the docs for this package and make sure that everything is in order.

Code coverage in codecov website

Once this repo becomes public, the code coverage could be published to a codecov account. Currently, the coverage report is only printed out in the workflow.

feat_sel_sequential.ipynb example throwing key error in section 2 (no column names)

I am receiving the following error when running SeqFeatSelection on the example with no column names.

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
c:\Users\morrissharp\Repos\responsible-ai-toolbox-mitigations\notebooks\dataprocessing\module_tests\feat_sel_sequential.ipynb Cell 31 in <cell line: 2>()
      1 feat_sel = SeqFeatSelection(n_jobs=1)
----> 2 feat_sel.fit(df=dataset, label_col=11)
      3 feat_sel.get_selected_features()

File c:\users\morrissharp\repos\responsible-ai-toolbox-mitigations\raimitigations\dataprocessing\feat_selection\selector.py:174, in FeatureSelection.fit(self, X, y, df, label_col)
    172 if self.in_place:
    173     self.df_org = self.df
--> 174 self._fit()
    175 self.set_selected_features()
    176 self.fitted = True

File c:\users\morrissharp\repos\responsible-ai-toolbox-mitigations\raimitigations\dataprocessing\feat_selection\sequential_select.py:381, in SeqFeatSelection._fit(self)
    379 self._check_n_feat()
    380 self._check_fixed_columns()
--> 381 self._run_feat_selection()
    382 self._save_json()

File c:\users\morrissharp\repos\responsible-ai-toolbox-mitigations\raimitigations\dataprocessing\feat_selection\sequential_select.py:345, in SeqFeatSelection._run_feat_selection(self)
    333     verbose = 2
    334 self.selector = SFS(
    335     self.estimator,
    336     k_features=self.n_feat,
...
--> 568 k_idx = self.subsets_[best_subset]['feature_idx']
    570 if self.k_features == 'parsimonious':
    571     for k in self.subsets_:

KeyError: None

Case3_stat.ipynb: ValueError. mismatched shapes

In the Case3_stat.ipynb case study notebook, in the cell `Artificial Instances - CTGAN`, I receive the following error:

ValueError                                Traceback (most recent call last)
c:\Users\morrissharp\Repos\responsible-ai-toolbox-mitigations\notebooks\dataprocessing\case_study\case3_stat.ipynb Cell 20 in <cell line: 8>()
      5 result_tr = test_corr_transf(df, label_col, N_EXEC, dp.DataStandardScaler, MODEL_NAME, num_col)
      6 result_df = add_results_df(result_df, result_tr, "Std.")
----> 8 restult_fs = test_ctgan_first(df, label_col, N_EXEC, MODEL_NAME, rcorr=True, scaler_ref=dp.DataStandardScaler, feat_sel_type=None, art_str=0.2, savefile="3_1.pkl")
      9 result_df = add_results_df(result_df, restult_fs, "CTGAN 0.2 Std.")
     11 restult_fs = test_ctgan_first(df, label_col, N_EXEC, MODEL_NAME, rcorr=True, scaler_ref=dp.DataStandardScaler, feat_sel_type=None, art_str=0.6, savefile="3_2.pkl")

c:\Users\morrissharp\Repos\responsible-ai-toolbox-mitigations\notebooks\dataprocessing\case_study\case3_stat.ipynb Cell 20 in test_ctgan_first(df, label_col, n_exec, model_name, rcorr, scaler_ref, num_col, feat_sel_type, art_str, savefile)
    260 if art_str is not None:
    261 	train_x, train_y = artificial_ctgan(train_x, train_y, art_str, savefile)
--> 262 train_x, test_x = encode_case3_train_test(train_x, test_x)
    263 train_x, test_x = impute_case3_train_test(train_x, test_x)
    264 if feat_sel_type is not None:

c:\Users\morrissharp\Repos\responsible-ai-toolbox-mitigations\notebooks\dataprocessing\case_study\case3_stat.ipynb Cell 20 in encode_case3_train_test(train_x, test_x)
     33 enc_ohe = dp.EncoderOHE(drop=False, unknown_err=False, verbose=False)
     34 enc_ohe.fit(train_x)
---> 35 train_x_enc = enc_ohe.transform(train_x)
     36 test_x_enc = enc_ohe.transform(test_x)
     37 return train_x_enc, test_x_enc

File c:\users\morrissharp\repos\responsible-ai-toolbox-mitigations\raimitigations\dataprocessing\encoder\encoder.py:108, in DataEncoding.transform(self, df)
    106 self._check_if_fitted()
...
    391 passed = values.shape
    392 implied = (len(index), len(columns))
--> 393 raise ValueError(f"Shape of passed values is {passed}, indices imply {implied}")

ValueError: Shape of passed values is (2582, 30), indices imply (2582, 29)

I am not sure whether this is related to issue #31, but there may be a problem with label_col not being included in the df when necessary.

Rebalance class status progress message

The Rebalance class status message contains a string related to imputation instead:

No columns specified for imputation. These columns have been automatically identified:
[]
Running oversampling...

ValueError: case1_stat.ipynb error in CTGAN Section

I am receiving the following error when running the CTGAN section of case1_stat.ipynb:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
c:\Users\morrissharp\Repos\responsible-ai-toolbox-mitigations\notebooks\dataprocessing\case_study\case1_stat.ipynb Cell 14 in <cell line: 5>()
      2 result_base = test_base(df, label_col, N_EXEC, MODEL_NAME)
      3 result_df = add_results_df(None, result_base, "Baseline")
----> 5 restult_fs = test_ctgan_first(df, label_col, N_EXEC, MODEL_NAME, rcorr=False, feat_sel_type=None, art_str=0.6, savefile="1_1.pkl")
      6 result_df = add_results_df(result_df, restult_fs, "CTGAN 0.6")
      8 restult_fs = test_ctgan_first(df, label_col, N_EXEC, MODEL_NAME, rcorr=False, feat_sel_type=None, art_str=0.9, savefile="1_2.pkl")

c:\Users\morrissharp\Repos\responsible-ai-toolbox-mitigations\notebooks\dataprocessing\case_study\case1_stat.ipynb Cell 14 in test_ctgan_first(df, label_col, n_exec, model_name, rcorr, feat_sel_type, art_str, savefile)
    245 if art_str is not None:
    246 	train_x, train_y = artificial_ctgan(train_x, train_y, art_str, savefile)
--> 247 train_x, test_x = encode_case1_train_test(train_x, test_x)
    248 train_x, test_x = impute_case1_train_test(train_x, test_x)
    249 if feat_sel_type is not None:

c:\Users\morrissharp\Repos\responsible-ai-toolbox-mitigations\notebooks\dataprocessing\case_study\case1_stat.ipynb Cell 14 in encode_case1_train_test(train_x, test_x)
     55 def encode_case1_train_test(train_x, test_x):
     56 	enc_ord, enc_ohe = get_encoders(df)
---> 57 	enc_ord.fit(train_x)
     58 	train_x_enc = enc_ord.transform(train_x)
     59 	test_x_enc = enc_ord.transform(test_x)

File c:\users\morrissharp\repos\responsible-ai-toolbox-mitigations\raimitigations\dataprocessing\encoder\encoder.py:84, in DataEncoding.fit(self, df, y)
     82 self._set_column_to_encode()
...
    120         + "the order of the existing values of the column col_encode[i]. If a value is not given, "
    121         + "it will be assigned a None value."
    122     )

ValueError: ERROR: the value '24-26' provided to the the list of values for the key 'inv-nodes' in the 'categories' parameter does not match any of the unique values found in the column 'inv-nodes' of the dataset provided.

Not sure exactly what the cause is for this yet.

Additionally, while investigating this, I noticed a couple of other issues as well:

  • Ordinal encoding is performed on age, tumor size, and inv-nodes. Because the values are strings, they are sorted lexicographically rather than numerically:
age_order, ['20-29' '30-39' '40-49' '50-59' '60-69' '70-79']
tumor_size_order, ['0-4' '10-14' '15-19' '20-24' '25-29' '30-34' '35-39' '40-44' '45-49'
 '5-9' '50-54']
inv_nodes_order, ['0-2' '12-14' '15-17' '24-26' '3-5' '6-8' '9-11']
  • get_encoders is called on df, but encode_case1_train_test() does not take in df as a parameter
	enc_ord, enc_ohe = get_encoders(df)
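For the ordering problem in the first bullet, a minimal sketch of one possible fix is to sort the range labels by their numeric lower bound instead of as strings. The helper name `range_start` is hypothetical; the values are the inv-nodes labels listed above.

```python
def range_start(label):
    """Return the numeric lower bound of a range label such as '10-14'."""
    return int(label.split("-")[0])


inv_nodes = ["0-2", "12-14", "15-17", "24-26", "3-5", "6-8", "9-11"]

lexicographic = sorted(inv_nodes)             # plain string comparison
numeric = sorted(inv_nodes, key=range_start)  # comparison by lower bound

print(lexicographic)  # ['0-2', '12-14', '15-17', '24-26', '3-5', '6-8', '9-11']
print(numeric)        # ['0-2', '3-5', '6-8', '9-11', '12-14', '15-17', '24-26']
```

A list ordered this way could then be supplied explicitly as the ordering for the column, for instance through the `categories` parameter that the error message mentions.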

Correlated Features examples

I noticed a couple of issues here.

  1. There are two different examples feat_sel_corr_tutorial.ipynb and feat_sel_corr.ipynb. feat_sel_corr.ipynb does not contain any explanatory comments in the notebook. Maybe this one is extra and not needed?
  2. Both of these notebooks write the same json files to ./corr_json_examples/. The files that are part of the git repo are the ones belonging to feat_sel_corr.ipynb.

If the 2nd notebook is not needed, then the json files should be replaced with the ones belonging to the 1st example.

No installation instructions

I have noticed that there are no installation instructions in either the main README.md or the documentation.

TODO: fill out support.md

Support.md needs to be filled out as per these instructions.

# TODO: The maintainer of this repo has not yet edited this file
**REPO OWNER**: Do you want Customer Service & Support (CSS) support for this product/project?
- **No CSS support:** Fill out this template with information about how to file issues and get help.
- **Yes CSS support:** Fill out an intake form at [aka.ms/spot](https://aka.ms/spot). CSS will work with/help you to determine next steps. More details also available at [aka.ms/onboardsupport](https://aka.ms/onboardsupport).
- **Not sure?** Fill out a SPOT intake as though the answer were "Yes". CSS will help you decide.

Balancing on multiple columns work around

Currently there is a workaround in the end-to-end Jupyter notebook so that we can balance on two columns at the same time using any of the rebalancing techniques. While making API design changes to the Rebalance API, we should allow the user to specify multiple columns that they are interested in balancing on, rather than creating a single combined column and balancing on that single-column cohort.
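The combined-column idea can be illustrated in plain Python, independently of the Rebalance API. In this sketch the helper `add_combined_column`, the column names, and the data are all hypothetical; it only shows how two columns collapse into one cohort key whose counts a rebalancing step could then even out.

```python
from collections import Counter


def add_combined_column(rows, columns, new_col="cohort"):
    """Return copies of the rows with the chosen columns joined into one key."""
    return [
        {**row, new_col: "_".join(str(row[c]) for c in columns)}
        for row in rows
    ]


# Hypothetical data: "gender" and "race" stand in for the two columns of interest.
rows = [
    {"gender": "F", "race": "A"},
    {"gender": "F", "race": "A"},
    {"gender": "F", "race": "B"},
    {"gender": "M", "race": "A"},
]

combined = add_combined_column(rows, ["gender", "race"])
cohort_counts = Counter(row["cohort"] for row in combined)
print(cohort_counts)
```

Accepting a list of columns directly in the API would let the library build such a key internally instead of leaving it to the notebook.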

Seed cannot be set

"As we can see, this transformation had some impact in the results (depends on the seed used) when we use KNN. Let's check how this data transformation impacts the XGBoost model:"

The case2.ipynb notebook refers to the ability to set a seed, but this is not available for split_data(), train_model_plot_results(), or train_model_fetch_results(). Additionally, I have noticed that there is no way to pass any parameters to the model itself for instantiation/fitting (e.g. setting the number of neighbors for KNN).

I am not sure whether you expect these functions to be used outside of the example notebooks. If so, you should consider allowing the user to set a random seed, as well as pass in model parameters, possibly using something like `*args` and `**kwargs`.
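A rough sketch of what such an interface might look like, using only the standard library. The functions `split_rows` and `build_model` and the class `DummyKNN` are hypothetical stand-ins, not the library's actual signatures.

```python
import random


def split_rows(rows, test_size=0.2, seed=None):
    """Shuffle a copy of rows with a seeded generator and split into train/test."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


def build_model(model_cls, **model_kwargs):
    """Instantiate any model class, forwarding keyword arguments unchanged."""
    return model_cls(**model_kwargs)


class DummyKNN:
    """Stand-in for a real estimator so the sketch runs without dependencies."""

    def __init__(self, n_neighbors=5):
        self.n_neighbors = n_neighbors


# The same seed yields the same split on every call.
train_a, test_a = split_rows(range(10), seed=42)
train_b, test_b = split_rows(range(10), seed=42)

# Model parameters are passed through as keyword arguments.
model = build_model(DummyKNN, n_neighbors=3)
```

Forwarding `**model_kwargs` keeps the helper independent of any particular estimator, while an explicit `seed` argument makes the notebook results reproducible.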

Update the version for the SDV library

Currently, this repo is using SDV v0.13.1, but as of today (August 2022), SDV v0.16.0 is already available. Also, v0.13.1 is causing some security issues with numpy. Therefore, this repo should update the SDV version used.

No module named 'seaborn' when importing the "cohort" module

After installing the package using pip install raimitigations, when I import any class from the cohort module, I get the following error:

> from .utils import fetch_cohort_results, plot_value_counts_cohort
> import seaborn as sns
No module named 'seaborn'

The error doesn't occur if I install the library using pip install raimitigations[all].

To fix, add seaborn to the base set of dependencies, not only to the [all] dependency group.
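Until the dependency list is updated, a quick way to check whether an environment is affected is to ask Python whether the module can be found at all. This is a small diagnostic sketch; the helper name `missing_modules` is hypothetical, and `seaborn` is the only module name taken from the error above.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


# An empty list means the import should succeed; otherwise the missing names are listed.
print(missing_modules(["seaborn"]))
```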
