
mlinspect's Issues

Assignment to multiple columns not supported

reviews_with_products_and_ratings[['product_title', 'review_headline', 'review_body']] = \
     reviews_with_products_and_ratings[['product_title', 'review_headline', 'review_body']].fillna(value='')
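
A possible workaround until this is supported (assuming single-column assignment is already handled, which the issue title suggests): rewrite the statement as one assignment per column.

reviews_with_products_and_ratings['product_title'] = \
    reviews_with_products_and_ratings['product_title'].fillna(value='')
reviews_with_products_and_ratings['review_headline'] = \
    reviews_with_products_and_ratings['review_headline'].fillna(value='')
reviews_with_products_and_ratings['review_body'] = \
    reviews_with_products_and_ratings['review_body'].fillna(value='')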

Support operations like model.score

Description

We can add a model.score node to the DAG that has the fitted estimator resulting from estimator.fit as a parent (without passing data along that edge) and that has the test data and test label nodes as additional parents, similar to the current model.fit handling.

We will also have to capture both outputs of the train_test_split function. (Done.)

We can then build interesting new inspections like a SliceFinder one.
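
A minimal, self-contained example of the pattern to capture (dataset and variable names are illustrative):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
train_data, test_data, train_labels, test_labels = train_test_split(X, y, random_state=0)

# model.fit already yields an Estimator DAG node; model.score would add a
# Score node with the fitted estimator, test data, and test labels as parents
model = DecisionTreeClassifier().fit(train_data, train_labels)
print(model.score(test_data, test_labels))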

Initial Project Setup

Create a starting point with an example pipeline that gets loaded. The setup should also include pylint and a first unit test.

Performance optimisations

Idea for future work

Not all inspections need to look at all operators.
E.g., the LineageInspection only needs to modify the annotations for Concat and Join operations. We would need a very simple API for this, so as not to confuse users who want to build their own inspections.
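
A hypothetical sketch of such an API (the property name and operator labels are illustrative, not mlinspect's actual interface):

class LineageInspection:
    """Sketch: an inspection declares the only operators it needs to see."""

    @property
    def relevant_operator_types(self):
        # the framework could skip annotation updates for all other operators
        return {"CONCATENATION", "JOIN"}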

Include pipeline code snippets directly in DAG nodes

E.g., we could do this by:

  • Extract the current DAG
  • Build a map/set with all mentioned code_references
  • Go through the original, unmodified AST using a Python NodeVisitor and use the unparse library to get the code snippet parts (see the sketch below)
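
A minimal sketch of that third step, assuming code_references are (lineno, col_offset) pairs; the astunparse package provides the unparse functionality mentioned above:

import ast
import astunparse

class SnippetExtractor(ast.NodeVisitor):
    """Collect source snippets for AST nodes mentioned in code_references."""

    def __init__(self, code_references):
        self.code_references = code_references  # assumed: set of (lineno, col_offset)
        self.snippets = {}

    def generic_visit(self, node):
        location = (getattr(node, "lineno", None), getattr(node, "col_offset", None))
        if location in self.code_references:
            self.snippets[location] = astunparse.unparse(node).strip()
        super().generic_visit(node)

extractor = SnippetExtractor({(1, 7)})
extractor.visit(ast.parse("data = pd.read_csv('train.csv')"))
print(extractor.snippets)  # {(1, 7): "pd.read_csv('train.csv')"}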

Alternative:

  • Use code_references + plain text code

Control flow?

Example: Users write their own one-hot-encoder with for-loops over the data
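
A minimal sketch of the kind of user code this refers to (hypothetical, for illustration):

import pandas as pd

def one_hot_encode(df, column):
    # hand-rolled one-hot encoding with explicit control flow
    for category in sorted(df[column].unique()):
        df[f"{column}_{category}"] = (df[column] == category).astype(int)
    return df.drop(columns=[column])

print(one_hot_encode(pd.DataFrame({"color": ["red", "blue", "red"]}), "color"))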

Simple loop unrolling is not an option here.

If there is enough time, we can consider adding, e.g., annotations that users can use to allow us to deal with this to some extent. Another option is discovering loops and recognizing them as a single operator; there is some related work on this in the workflow literature.

Probably not something we will do as part of this thesis. However, we should at least tell the user if there is a loop in the code we cannot deal with.

Proper handling of non-native sklearn transformers

Description

Sklearn enables users to write their own transformers. An example: in our healthcare example, we create a custom Word2Vec transformer, as sklearn doesn't provide one. For this one, we just added processing code similar to the handling of native transformers. For users of mlinspect, this shouldn't be necessary.

What makes this difficult: transformers can receive data frames with multiple columns as input. E.g., a OneHotEncoder then one-hot encodes all columns separately. In the DAG mlinspect extracts, we duplicate this transformer logically to have a separate transformer for each column. This requires us to maintain a mapping from each column to a partial output vector. A transformer like the StandardScaler has a 1:1 mapping, while the OneHotEncoder maps each column to a vector of a different length. The same holds for our custom Word2Vec transformer. To resolve this, the transformer instrumentation function requires this mapping. For native sklearn transformers, we can hard-code the mapping or code snippets that detect the mapping for, e.g., a OneHotEncoder. For non-native transformers, we cannot do this.

However, we can handle instances where non-native transformers receive, e.g., data frames with multiple columns as input by duplicating the transformer and repeatedly passing it data frames containing just the content of a single column. While this isn't as efficient, it should completely resolve this issue, and the speed-up lost this way shouldn't make a noticeable difference.
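
A minimal sketch of that fallback, assuming the transformer follows the usual sklearn fit_transform contract:

from sklearn.base import clone

def fit_transform_per_column(transformer, df):
    """Apply a fresh copy of `transformer` to each column separately,
    so each logical column maps to its own partial output."""
    return {column: clone(transformer).fit_transform(df[[column]])
            for column in df.columns}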

Warn users properly if there are functions we do not recognize

We can, e.g., add assertions with appropriate error messages that ensure all original outputs are wrapped in our wrapper classes. We need to ensure that all functions using DataFrames in a pipeline are handled by mlinspect. We can discover these functions by checking the output type or by checking whether they modify DataFrames.
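
A hypothetical sketch of such an assertion (MlinspectDataFrame is a stand-in name for the actual wrapper class):

import pandas as pd

class MlinspectDataFrame(pd.DataFrame):
    """Stand-in for mlinspect's actual DataFrame wrapper class."""

def check_output_is_wrapped(function_name, result):
    # fail loudly instead of silently losing instrumentation
    assert isinstance(result, MlinspectDataFrame), (
        f"mlinspect does not recognize '{function_name}' yet; "
        f"its output ({type(result).__name__}) is not instrumented")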

GitHub Action for PyPI publishing

Description

Now that we support control flow, mlinspect is far enough along to start publishing it as a PyPI package. We should set up a GitHub Action that automatically does this publishing for tagged commits. Maybe we also want to automatically build and publish a Docker image that has graphviz installed and opens a Jupyter Notebook when started.

ArgumentCapturing

Description

  • Expose function arguments to inspections so that they can, e.g., log hyperparameters of transformers and estimators
  • Implement an inspection that extracts hyperparameters from transformers and estimators
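
Such an inspection could lean on sklearn's own get_params() mechanism; a minimal illustration:

from sklearn.linear_model import SGDClassifier

estimator = SGDClassifier(max_iter=1000)
# get_params() returns every hyperparameter, including defaults,
# e.g. {'alpha': 0.0001, ..., 'max_iter': 1000, 'penalty': 'l2', ...}
print(estimator.get_params())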

Support for GridSearchCV

Description

  • We want to support sklearn's grid search by instrumenting only the final variant with the best hyperparameters

Example

from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    'loss': ['log'],
    'penalty': ['l2', 'l1', 'elasticnet'],
    'alpha': [0.0001, 0.0005, 0.001, 0.002, 0.005, 0.01, 0.05, 0.1]
}
model = GridSearchCV(SGDClassifier(max_iter=1000), param_grid, cv=10)

Statsmodels example

Description

  • We were asked whether mlinspect can support libraries like statsmodels, too. We want to show how to do this for a few examples.
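
For instance, a minimal statsmodels pipeline of the kind such an example could cover (illustrative, on synthetic data):

import numpy as np
import statsmodels.api as sm

# ordinary least squares on synthetic data
rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(100, 2)))
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(size=100)
print(sm.OLS(y, X).fit().summary())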

Implement checks

Start with some simple ones like "selection after join" or the fair-dag check

Selection inside of label_binarize not supported

This does not work:

train_labels = label_binarize(train_data.helpful_votes > 0, classes=[True, False])

But this works:

train_data['is_helpful'] = train_data['helpful_votes'] > 0
train_labels = label_binarize(train_data['is_helpful'], classes=[True, False])

GitHub Actions

Description

Set up GitHub Actions CI, now that I have run out of Travis credits.

Python 3.9

Description

While mlinspect doesn't have many external users yet, we want to avoid having to support multiple Python versions. Especially the necessary AST instrumentation can change slightly between Python versions as the AST changes. That is why we only want to support the newest version and want to switch to 3.9 early. This will be a breaking change for Python 3.8 users.

"right" kw arg not supported for pandas.merge

reviews_with_products_and_ratings = reviews_with_ratings.merge(products_of_interest, on='product_id')

works, but the following does not

reviews_with_products_and_ratings = reviews_with_ratings.merge(on='product_id', right=products_of_interest)

Support for sklearn FunctionTransformer

Description

Add support for code snippets like:

import numpy as np
from sklearn.preprocessing import FunctionTransformer

def safe_log(x):
    return np.log(x, out=np.zeros_like(x), where=(x != 0))

FunctionTransformer(lambda x: safe_log(x)).fit_transform(data)

Demo polish

Description

On the demo branch, we made some changes to the example pipeline that we should merge back into the main branch, most importantly making the lineage inspection output easier to understand. We have to decide at a later date how we want to proceed with the web app. One possible course of action would be to start publishing mlinspect to PyPI automatically using GitHub Actions and to create a new repository for the web app with a dependency on the mlinspect PyPI package.

RowLineage: Restrict output materialisations to specific operators

Description

  • For some use cases, we need to materialize all output rows but restrict this to specific operators to avoid the space overhead becoming too large
  • We also need to add an option to log all output data; currently, you can only do that if you know an upper bound for the output size beforehand
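
For reference, a sketch of how RowLineage is used today, assuming mlinspect's PipelineInspector API; the issue asks for an operator restriction and an all-rows option on top of this:

from mlinspect import PipelineInspector
from mlinspect.inspections import RowLineage

# materializes the first five output rows of every operator
inspector_result = PipelineInspector \
    .on_pipeline_from_py_file("pipeline.py") \
    .add_required_inspection(RowLineage(5)) \
    .execute()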

Column Completeness

Description

  • Implement a first data quality inspection to compute the completeness of a column
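
A minimal illustration of the metric, assuming completeness = fraction of non-null values in the column:

import pandas as pd

df = pd.DataFrame({"age": [24, None, 31, None]})
completeness = df["age"].notna().mean()  # 2 of 4 values present -> 0.5
print(completeness)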

Removal Probability Check

Description

Provide a check based on calculating the removal probabilities for different demographic groups in the data. This check detects cases where filter-like operations that affect only a small subset of data disparately impact specific demographic groups.
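
A minimal sketch of the underlying quantity, assuming removal probability = fraction of a group's rows dropped by a filter:

import pandas as pd

before = pd.DataFrame({"group": ["A", "A", "B", "B", "B"],
                       "x": [1, 4, 3, 4, 5]})
after = before[before["x"] > 2]  # some filter-like operation

counts_before = before["group"].value_counts()
counts_after = after["group"].value_counts().reindex(counts_before.index, fill_value=0)
print(1 - counts_after / counts_before)  # A: 0.5, B: 0.0 -> disparate impact on A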

Best-effort column tracking

Description

There are rare cases where it's hard or impossible to trace column names throughout a pipeline, especially when certain sklearn feature selection transformers are used. That's why we use 'array' as a column name in cases where we can't guarantee to track column names successfully. But we should still do column-level tracking on a best-effort basis.

For transformers like the OneHotEncoder, which can consume pandas.DataFrames with multiple columns and output a single numpy.ndarray, we need to pass the NumPy arrays to the inspections in a way that lets them know which parts of the array correspond to which logical columns.
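
For example, the per-column mapping can be recovered from the fitted encoder (note: the parameter is called sparse_output in newer sklearn releases):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"color": ["red", "blue", "red"], "size": ["S", "M", "L"]})
encoder = OneHotEncoder(sparse=False)
encoded = encoder.fit_transform(df)  # shape (3, 5): 2 colors + 3 sizes
# categories_ gives the width of each column's slice in the output array
print([len(categories) for categories in encoder.categories_])  # [2, 3]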

There are some performance considerations when splitting the NumPy arrays into multiple columns from an inspection perspective (we either need to extend the schema information or add some logic to the InspectionInputRow classes and use functions that return partial NumPy views of the original array). The second solution is likely preferable, but we would need to measure its performance overhead.
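
Partial views are cheap because slicing a 2-D array by a column range returns a view, not a copy; a quick check:

import numpy as np

matrix = np.arange(12.0).reshape(4, 3)
one_hot_part = matrix[:, 0:2]  # the slice belonging to one logical column
assert one_hot_part.base is matrix  # a view of the original array, no copy made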

Having the correct column names is definitely useful. There are rare cases where it's almost impossible to do this tracking properly through different transformers, e.g., in this example:

from sklearn.feature_selection import VarianceThreshold
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer

X = pd.DataFrame.from_dict({
    'A': [0.87, -1.34, 0.31, 1.92],
    'B': [-1.34, -0.48, -2.55, 0.65],
    'C': [-1.34, -0.48, -2.55, 0.65],
    'D': [0, 0, 0, 0],
})
f_select = VarianceThreshold(threshold=(.8 * (1 - .8)))
standard_scaler = StandardScaler()
pipeline = Pipeline([("f_select", f_select), ("scaler", standard_scaler)])
X = ColumnTransformer([
    ("obscure_example", pipeline, ['B', 'C', 'D'])
]).fit_transform(X)
print(X)

If we have feature selection transformers or ones for dimensionality reduction, then it becomes very difficult to track the column names. So I don't think we can always guarantee to provide the correct column names through different transformers.

For transformers like the OneHotEncoder, it's possible to track which values get transformed to which one-hot vector; see, e.g., this part from before the rework. But there might be transformers or other operations where we lose this column-level tracking, e.g., if we want to support apply/map operations using user-defined functions. If a user-defined function returns a NumPy array, then we need some fallback like the current 'array' placeholder.

Support for more Backends and API Functions

Description

mlinspect already supports a selection of API functions from pandas and scikit-learn. Extending mlinspect to support more and more API functions and libraries will be an ongoing effort. (However, mlinspect won't just crash when it encounters functions it doesn't recognize yet.)

You can find more information, a few pointers, and a list of currently supported API functions here. Contributions to extend the support of mlinspect to more backends and API functions are always welcome and good first issues.

Please create a new issue for the specific backend/API function you want to add or want to see added; this issue is here to stay.

Wrong annotation order after reordering operations

In joins and selections (and train_test_split), the iterator creation methods ensure that the rows provided to the inspections are in the order of the operation output and that the input rows are ordered correspondingly. This is implemented by joining input and output on the mlinspect_index. However, the created row order is currently not correct in all cases.

Bug description

In the case of selections that do not keep the row order (for example, the sklearn train_test_split operation, which creates unordered random samples), or in the case of joins that order the output based on the join column (option sort=True), the row order provided to the inspection is not the row order of the actual operation output. Therefore, the annotations created by the inspection are not ordered correctly.

Fix description

The error originates from using the pandas.merge function, which by default preserves the order of the left join keys (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge.html). To keep the join result ordered like the output, the output has to be the left DataFrame.

I replaced the pandas.merge calls with DataFrame.merge, always keeping the DataFrame containing the output on the left side. This ensures that the output order is preserved. Because switching left and right in a join changes the column order, I also rewrote the code calculating and creating the DataFrame slices for iterator creation.

Alternative fix

Replacing pandas.merge with DataFrame.merge is not necessary. Swapping the inputs so that the DataFrame containing the output columns is always left would suffice.
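
A small demonstration of the order guarantee the fix relies on: in a default (inner, one-to-one) merge, pandas preserves the row order of the left frame.

import pandas as pd

output = pd.DataFrame({"mlinspect_index": [2, 0, 1], "value": ["c", "a", "b"]})
annotations = pd.DataFrame({"mlinspect_index": [0, 1, 2], "annotation": ["a0", "a1", "a2"]})

# with the operation output on the left, the merge result keeps its row order
aligned = output.merge(annotations, on="mlinspect_index")
print(aligned["mlinspect_index"].tolist())  # [2, 0, 1]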

Cleanup after Control Flow Rework

Description

  • When doing the control flow rework, I forgot to remove now-unused code
  • We can also use this opportunity to remove other files that are no longer necessary

Extend LineageInspection

The LineageInspection can be used to materialize specific intermediate results. A ToDo with some implementation ideas is already present in the corresponding file:

# TODO: Add an option to pass a list of lineage ids to this inspection. Then it materializes all related tuples.
#  To do this efficiently, we do not want to do expensive membership tests. We can collect all base LineageIds
#  in a set and then it is enough to check for set memberships in InspectionInputDataSource inspection inputs.
#  This set membership can be used as a 'materialize' flag we use as annotation. Then we simply need to check this
#  flag to check whether to materialize rows.

We should also offer ways to materialize large parts of the result rows of specific operators, together with their lineage.
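
A hypothetical sketch of the set-membership idea from the ToDo (LineageId is modeled as a namedtuple here; mlinspect's actual class may differ):

from collections import namedtuple

# stand-in for mlinspect's LineageId (operator id + row id within that operator)
LineageId = namedtuple("LineageId", ["operator_id", "row_id"])

lineage_ids_to_materialize = {LineageId(0, 3), LineageId(0, 7)}

def materialize_flag(row_lineage):
    # cheap set-membership test instead of expensive per-row comparisons;
    # the result can be stored as the row's annotation
    return any(lineage_id in lineage_ids_to_materialize for lineage_id in row_lineage)

print(materialize_flag({LineageId(0, 3), LineageId(1, 3)}))  # True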
