stefan-grafberger / mlinspect Goto Github PK

View Code? Open in Web Editor NEW

68.0 68.0 15.0 22.92 MB

Inspect ML Pipelines in Python in the form of a DAG

License: Apache License 2.0

Python 50.74% Jupyter Notebook 49.26%

mlinspect's People

Contributors

Stargazers

Watchers

Forkers

scnakandala adrianlut elephaint lc117 andreas-grafberger sunyow jinyangli01 abempah amsterdata naveen-marthala nmiranda nadav-levi pistefania benyucong andrewrade

mlinspect's Issues

Assignment to multiple columns not supported

reviews_with_products_and_ratings[['product_title', 'review_headline', 'review_body']] = \
     reviews_with_products_and_ratings[['product_title', 'review_headline', 'review_body']].fillna(value='')

Support operations like model.score

Description

We can add a model.score node to the DAG that has the fitted estimator resulting from estimator.fit as a parent without passing data along that edge and has test data and test label nodes as additional parents, similar to the current model.fit handling.

~~Will also have to capture both outputs from the train_test_spit functions.~~(done)

We can then build interesting new inspections like a SliceFinder one.

Add new citation info

Description

Add the VLDBJ paper
Add a citation button
Add a doi badge?

Initial Project Setup

Create a starting point with an example pipeline that gets loaded. There should also be pylint and a unit test.

Performance optimisations

Idea for future work

Not all inspections might need to look at all operators.
E.g., the LineageInspection only needs to modify the annotations for Concat and Join operations. We would need a very simple API for this as to not confuse users who want to build their own inspections.

Joins with differently named key columns not supported yet

products_of_interest = products.merge(categories_of_interest, left_on='category_id', right_on='id')

produces KeyError: 'on'

Include pipeline code snippets directly in DAG nodes

E.g., we could do this by:

Extract current DAg
Build map/set with all mentioned code_references
Go through the original, unmodified AST using a Python NodeVisitor and use the unparse library to get the code snippet parts

Alternative:

Use code_references + plain text code

Control flow?

Example: Users write their own one-hot-encoder with for-loops over the data

Simple loop unrolling is not an option here.

If there is enough time, we can consider adding e.g., annotations that users can use to allow us to deal with this somewhat. Another option is discovering loops and recognizing them as a single operator, there is some work about that in workflow papers.

Probably not something we will do as part of this thesis. However, we should at least tell the user if there is a loop in the code we can not deal with.

Proof of Concept for WIR Extraction

Prototype of Instrumentation and Analyzers with a Pandas Backend

Backend cleanup and performance optimisations

Performance experiments

Write example pipelines with two different ASTs that get transformed to the same DAG

Implement additional Inspections and Checks

We should also create documentation with a list of all checks and inspections.

Proper handling of non-native sklearn transformers

Description

Sklearn enables users to write their own transformers. An example: in our healthcare example, we create a custom Word2Vec transformer, as sklearn doesn't provide one. For this one, we just added processing code similar to the handling of native transformers. For users of mlinspect, this shouldn't be necessary. What makes this difficult: transformers can receive data frames with multiple columns as input. E.g., a OneHotEncoder then one-hot encodes all columns separately. In the DAG mlinspect extracts, we duplicate this transformer logically to have a separate transformer for each column. This requires us to maintain a mapping from the column to a partial output vector. A transformer like the StandardScaler has a 1:1 mapping, while the OneHotEncoder maps each column to a different length vector. It's the same for our custom Word2Vec transformer. To resolve this, the transformer instrumentation function requires this mapping. For native transformers of sklearn, we can hard-code code snippets with the mapping/code-snippets that detect the mapping for, e.g., a OneHotEncoder. For non-native transformers, we cannot do this. However, we can handle instances where non-native transformers receive, e.g., data frames with multiple columns as input, by duplicating this transformer and passing it data frames with just the content of single columns repeatedly. While this isn't as efficient, this should completely resolve this issue. And the speed-up lost this way shouldn't make a noticeable difference.

Warn users properly if there are functions we do not recognize

We can, e.g., add assertions with appropriate error messages that ensure all original outputs are wrapped in our wrapper classes. We need to ensure that all functions using DataFrames in a pipeline are handled by mlinspect. We can discover these functions by just checking the output type/check if they modify DataFrames.

GitHub Action for Pypi publishing

Description

Now that we support control flow, mlinspect is far enough along to start publishing it as a PyPI package. We should set up a GitHub action that automatically does this publishing for tagged commits. Maybe we also want to build and publish a Docker image automatically that has graphviz installed and opens a Jupyter Notebook when started.

ArgumentCapturing

Description

Expose function arguments to inspections so that they can, e.g., log hyperparameters of transformers and estimators
Implement an inspection that extracts hyperparameters from transformers and estimators

Support for GridSearchCV

Description

We want to support the sklearn grid search by only instrumenting the final variant with the best HPs

Example

param_grid = {
    'loss': ['log'],
    'penalty': ['l2', 'l1', 'elasticnet'],
    'alpha': [0.0001, 0.0005, 0.0001, 0.0005, 0.001, 0.002, 0.005, 0.01, 0.05, 0.1]
}
model = GridSearchCV(SGDClassifier(max_iter=1000), param_grid, cv=10)

Statsmodels example

Description

We were asked whether mlinspect can support libraries like statsmodels, too. We want to show how to do this for a few examples.

Implement checks

Start with some simple ones like "selection after join" or the fair-dag check

Failing Build

Description

Because of some outdated dependencies, the build seems to fail: https://github.com/stefan-grafberger/mlinspect/actions/runs/7875320485

Selection inside of label_binarize not supported

This does not work:

train_labels = label_binarize(train_data.helpful_votes > 0, classes=[True, False])

But this works:

train_data['is_helpful'] = train_data['helpful_votes'] > 0
train_labels = label_binarize(train_data['is_helpful'], classes=[True, False])

DAG Extraction for scikit-learn pipelines

Util functions to visualise DAG: pretty-print, generate image etc.

https://networkx.github.io/documentation/stable/reference/drawing.html

GitHub Actions

Description

Set up GitHub Actions CI, now that I run out of Travis credits

Python 3.9

Description

While mlinspect doesn't have many external users, we want to avoid having to support multiple Python versions. Especially the necessary AST instrumentation can change slightly between different Python versions as the AST changes. That is why we only want to support the newest version and want to switch to 3.9 early. This will be a breaking change for 3.8 usages.

"right" kw arg not supported for pandas.merge

reviews_with_products_and_ratings = reviews_with_ratings.merge(products_of_interest, on='product_id')

works, but the following does not

reviews_with_products_and_ratings = reviews_with_ratings.merge(on='product_id', right=products_of_interest)

Support more complex example pipelines

For this, we mainly need to add support for additional library functions in the different backend classes.

Support for sklearn FunctionTransformer

Description

Add support for code snippets like:

def safe_log(x):
    return np.log(x, out=np.zeros_like(x), where=(x!=0))
FunctionTransformer(lambda x: safe_log(x)).fit_transform(data)

Support for model.predict and .predict_proba

Description

We now have initial prediction capturing for model.score, but we also want to support predict and predict_proba

Proof of Concept for WIR to DAG Transformation

Add Backend for Scikit-Learn

We could do this by adding a debugging step to pipelines: https://stackoverflow.com/questions/34802465/sklearn-is-there-any-way-to-debug-pipelines

We could add this debugging step by using the existing after_call_value instrumentation and replacing the return value with a sub-pipeline that does the specified step surrounded by the debugging step

Demo polish

Description

On the demo branch, we made some changes to the example pipeline that we should merge back into the main branch. Most importantly, making the lineage inspection output easier to understand. We have to decide at a later date how we want to proceed with the web app, one possible course of action would be to start publishing mlinspect to pypi automatically using GitHub actions and create a new repository for the web app with a dependency on the pypi mlinspect package.

RowLineage: Restrict output materialisations to specific operators

Description

For some use cases, we need to materialize all output rows, but restrict it to specific operators to avoid the space overhead getting too big
We also need to add an option to log all output data, currently you can only do that if you know an upper bound for the output size beforehand

HashingVectorizer not supported

It would be great to have support for https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html

Column Completeness

Description

Implement a first data quality inspection to compute the completeness of a column

Removal Probability Check

Description

Provide a check based on calculating the removal probabilities for different demographic groups in the data. This check detects cases where filter-like operations that affect only a small subset of data disparately impact specific demographic groups.

Proof of Concept for AST Instrumentation

Extract an AST from the example pipeline, inject a function call after pandas functions

Capture Predictions for Test Set

Description

Capture the predictions made for the test set as part of model.score and model.predict

Best-effort column tracking

Description

There are rare cases where it's hard or impossible to trace the column names throughout pipelines, especially when certain sklearn feature selection transformers are used. That's why we use array as a column name in certain cases when we can't guarantee to always track column names successfully. But we should still try to do column-level tracking on a best effort basis.

For transformers like the OneHotEncoder, that can consume pandas.DataFrames with multiple columns and output a single numpy.ndarray, we need to pass the NumPy arrays to the inspections in a way, so they know which parts of the array correspond to which logical columns.

There are some performance considerations when splitting the NumPy arrays into multiple columns from an inspection perspective (we either need to extend the schema information or add some logic to the InspectionInputRow classes and use functions that return partial NumPy views of the original array). The second solution is likely preferable, but we would need to measure its performance overhead.

Having the correct column names is definitely useful. There are rare cases where it's almost impossible to do this tracking properly through different transformers, e.g., in this example:

from sklearn.feature_selection import  VarianceThreshold
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer

X = pd.DataFrame.from_dict({
    'A': [0.87, -1.34,  0.31,  1.92],
    'B': [-1.34, -0.48, -2.55, 0.65],
    'C': [-1.34, -0.48, -2.55, 0.65],
    'D': [0, 0, 0, 0],
})
f_select = VarianceThreshold(threshold=(.8 * (1 - .8)))
standard_scaler = StandardScaler()
pipeline = Pipeline([("f_select", f_select), ("scaler", standard_scaler)])
X = ColumnTransformer([
    ("obscure_example", pipeline, ['B', 'C', 'D'])]

).fit_transform(X)
print(X)

If we have feature selection transformers or ones for dimensionality reduction, then it becomes very difficult to track the column names. So I don't think we can always guarantee to provide the correct column names through different transformers.

For transformers like the OneHot Encoder, it's possible to track which values get transformed to which one-hot vector, see, e.g., this part from before the rework. But there might be transformers or other operations where we lose this column-level tracking, e.g., if we want to support apply/map operations using user-defined functions. If the user-defined function returns a numpy array, then we need some fallback like the current array.

Intersectional Group Membership Inspection

Description

Implement an IntersectionHistogramForColumns inspection similar to HistogramForColumns

Support for more Backends and API Functions

Description

mlinspect already supports a selection of API functions from pandas and scikit-learn. Extending mlinspect to support more and more API functions and libraries will be an ongoing effort. (However, mlinspect won't just crash when it encounters functions it doesn't recognize yet.)

You can find more information, a few pointers, and a list of currently supported API functions here. Contributions to extend the support of mlinspect to more backends and API functions are always welcome and good first issues.

Please create a new issue for the specific backend/API function you want to add or want to see added; this issue here is here to stay.

Wrong annotation order after reordering operations

In joins and selections (and train test split), the iterator creation methods ensure that the rows provided to the inspections are in the order of the operation output and that the input rows are also ordered correctly. This is implemented via joining input and output on the mlinspect_index. However, the created row order is currently not correct in all cases.

Bug description

In the case of selections that do not keep the row order (for example the sklearn train test split operation that creates unordered random samples) or in the case of joins that order the output based on the join column (option sort=True), the row order provided to the inspection is not the row order of the actual operation output. Therefore, the annotations created by the inspection are not ordered correctly.

Fix description

The error originates from using the pandas.merge function, which by default preserves the order of the left join keys (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge.html). To keep the join result ordered like the output, the output has to be the left DataFrame.

I replaced the pandas.merge calls with DataFrame.merge, always keeping the DataFrame containing the output on the left side. This ensures that the output order is preserved. Because switching left and right in a join changes the column order, I also rewrote the code calculating and creating the DataFrame slices for iterator creation.

Alternative fix

Replacing pandas.merge with DataFrame.merge is not necessary. Swapping the inputs so that the DataFrame containing the output columns is always left would suffice.

New Entry Point for PipelineInspector: onString

Directly pass the code to the inspector. This could be useful to implement stuff like a Jupyter plugin on top.

Support predefined split

It would be great to have support for sklearn's predefined split, so that we can split data based on user-defined predicates.

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.PredefinedSplit.html#sklearn.model_selection.PredefinedSplit

While it would be possible to write this with pandas code alone, it would be very hard to verify that the splits are disjunct. The PredefinedSplit enforces this.

Cleanup after Control Flow Rework

Description

When doing the control flow rework, I forgot to remove now unused code
We can also use this opportunity to remove other files that are not necessary anymore

Column Count Distinct Inspection

Description

Implement a second data quality inspection to compute the number of distinct values in a column.

Problematic tensorflow/keras dependency

The current depencies

tensorflow==2.5.0
keras==2.4.3

are unstable unfortunately: keras-team/keras#14632

Extend LineageInspection

The LineageInspection can be used to to materialize specific intermediate results. ToDo already present in the corresponding file with some implementation ideas:

# TODO: Add an option to pass a list of lineage ids to this inspection. Then it materializes all related tuples.
    #  To do this efficiently, we do not want to do expensive membership tests. We can collect all base LineageIds
    #  in a set and then it is enough to check for set memberships in InspectionInputDataSource inspection inputs.
    #  This set membership can be used as a 'materialize' flag we use as annotation. Then we simply need to check this
    #  flag to check whether to materialize rows.

We should also offer ways to materialize large parts of the result rows and their lineage of specific operators.

stefan-grafberger / mlinspect Goto Github PK

mlinspect's People

Contributors

Stargazers

Watchers

Forkers

mlinspect's Issues

Description

Description

Idea for future work

Description

Description

Description

Description

Example

Description

Description

Description

Description

Description

Description

Description

Description

Description

Description

Description

Description

Description

Description

Description

Description

Recommend Projects

Recommend Topics

Recommend Org