
skdag's Introduction

scikit-learn-contrib

scikit-learn-contrib is a GitHub organization for gathering high-quality, scikit-learn-compatible projects. It also provides a template for establishing new scikit-learn-compatible projects.

Vision

With the explosion in the number of machine learning papers, it becomes increasingly difficult for users and researchers to implement and compare algorithms. Even when authors release their software, it takes time to learn how to use it and how to apply it to one's own purposes. The goal of scikit-learn-contrib is to provide easy-to-install, easy-to-use, high-quality machine learning software. With scikit-learn-contrib, users can install a project with pip install sklearn-contrib-project-name and immediately try it on their data with the usual fit, predict and transform methods. In addition, projects are compatible with scikit-learn tools such as grid search, pipelines, etc.
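As a sketch of that workflow, here a built-in scikit-learn estimator stands in for a contrib project; the pipeline and parameter names are illustrative, not taken from any particular contrib package:

```python
# Sketch of the contrib workflow: a sklearn-contrib estimator is meant to
# drop into standard scikit-learn tooling unchanged. A built-in
# LogisticRegression stands in for a contrib project here.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=100, random_state=0)

# A contrib estimator would slot in exactly where LogisticRegression sits.
pipe = make_pipeline(StandardScaler(), LogisticRegression(random_state=0))
grid = GridSearchCV(pipe, {"logisticregression__C": [0.1, 1.0]}, cv=3)
grid.fit(X, y)
preds = grid.predict(X)
```

Because the contrib estimator exposes the same fit/predict interface, grid search and pipelines work on it without any adapter code.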

Projects

If you would like to include your own project in scikit-learn-contrib, take a look at the workflow.

A simple but efficient density-based clustering algorithm that can find clusters of arbitrary sizes, shapes and densities in two dimensions. Higher dimensions are first reduced to 2-D using t-SNE. The algorithm relies on a single parameter, K, the number of nearest neighbors.

Read The Docs, Read the Paper

Maintained by: Mohamed Abbas

Large-scale linear classification, regression and ranking.

Maintained by Mathieu Blondel and Fabian Pedregosa.

Fast and modular Generalized Linear Models with support for models missing in scikit-learn.

Maintained by Mathurin Massias, Pierre-Antoine Bannier, Quentin Klopfenstein and Quentin Bertrand.

A Python implementation of Jerome Friedman's Multivariate Adaptive Regression Splines.

Maintained by Jason Rudy and Mehdi.

Python module to perform under-sampling and over-sampling with various techniques.

Maintained by Guillaume Lemaitre, Fernando Nogueira, Dayvid Oliveira and Christos Aridas.

Factorization machines and polynomial networks for classification and regression in Python.

Maintained by Vlad Niculae.

Confidence intervals for scikit-learn forest algorithms.

Maintained by Ariel Rokem, Kivan Polimis and Bryna Hazelton.

A high performance implementation of HDBSCAN clustering.

Maintained by Leland McInnes, jc-healy, c-north and Steve Astels.

A library of sklearn-compatible categorical variable encoders.

Maintained by Will McGinnis and Paul Westenthanner

Python implementations of the Boruta all-relevant feature selection method.

Maintained by Daniel Homola

Pandas integration with sklearn.

Maintained by Israel Saeta Pérez

Machine learning with logical rules in Python.

Maintained by Florian Gardin, Ronan Gautier, Nicolas Goix and Jean-Matthieu Schertzer.

A Python implementation of the stability selection feature selection algorithm.

Maintained by Thomas Huijskens

Metric learning algorithms in Python.

Maintained by CJ Carey, Yuan Tang, William de Vazelhes, Aurélien Bellet and Nathalie Vauquier.


skdag's Issues

Merge DAGBuilder and DAG

Bring all the functionality of the DAGBuilder class into DAG, to avoid the need to instantiate a separate factory class and call the make_dag() factory method.

Build graph from expressive operators

Rather than a factory method, allow users to construct a graph by applying operators to estimators, for example:

dag1 = (
    NamedStep(est1, "step1")
    | NamedStep(est2, "step2")
    | (
        NamedStep(est3, "step3")
        & NamedStep(est4, "step4")
    )
    | NamedStep(est5, "step5")
)

This would create a linear pipeline from est1 -> est5, but with the second step feeding both steps 3 and 4, and step 5 receiving a concatenation of the two outputs from steps 3 and 4.

Complex DAG construction can then be broken down into multiple statements too:

dag2 = (
    dag1.get_step("step2")
    | NamedStep(est6, "step6")
    | NamedStep(est7, "step7")
)

...would effectively create a new DAG that is the original one with an extra branch added.
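A minimal, stdlib-only sketch of how such | and & overloading could wire up parent links. Everything here is hypothetical and illustrates only the operator mechanics, not skdag's internals:

```python
# Hypothetical sketch of the proposed operator API: "|" chains steps,
# "&" groups parallel steps. NamedStep is not skdag's actual class.
class Branch:
    """A group of steps that all receive the same upstream input."""
    def __init__(self, steps):
        self.steps = steps

    def __and__(self, other):
        return Branch(self.steps + [other])

    def __or__(self, nxt):
        # Every step in the branch feeds the next step (concatenation).
        nxt.parents.extend(self.steps)
        return nxt


class NamedStep:
    def __init__(self, est, name):
        self.est, self.name, self.parents = est, name, []

    def __and__(self, other):
        return Branch([self, other])

    def __or__(self, nxt):
        if isinstance(nxt, Branch):
            for step in nxt.steps:  # this step feeds every branch member
                step.parents.append(self)
            return nxt
        nxt.parents.append(self)
        return nxt


dag = (
    NamedStep("est1", "step1")
    | NamedStep("est2", "step2")
    | (NamedStep("est3", "step3") & NamedStep("est4", "step4"))
    | NamedStep("est5", "step5")
)
# step5 now has parents step3 and step4, which both hang off step2.
```

A real implementation would also need to merge the parent links into a proper graph object, but the expression shape matches the proposal above.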

How to get the internally-computed node outputs to be part of the final output?

I am trying to understand how to get skdag to return all the computed columns when the predict method is called.

Here is an example from the documentation:

from skdag import DAGBuilder
from sklearn.compose import make_column_selector
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

dag = (
    DAGBuilder(infer_dataframe=True)
    .add_step(
        "impute",
        SimpleImputer()
        )
    .add_step(
        "vitals",
        "passthrough",
        deps={"impute": ["age", "sex", "bmi", "bp"]}
        )
    .add_step(
        "blood",
        PCA(n_components=2, random_state=0),
        deps={"impute": make_column_selector("s[0-9]+")}
        )
    .add_step(
        "lr",
        LogisticRegression(random_state=0),
        deps=["blood", "vitals"]
        )
    .make_dag()
)
dag.show()


from sklearn import datasets
X, y = datasets.load_diabetes(return_X_y=True, as_frame=True)
dag.fit_predict(X, y)

I tried just sticking an identity function on the end to collect the results, but it didn't work. I do not understand how things get passed along internally.

from skdag import DAGBuilder
from sklearn.compose import make_column_selector
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import FunctionTransformer

dag = (
    DAGBuilder(infer_dataframe=True)
    .add_step(
        "impute",
        SimpleImputer()
        )
    .add_step(
        "vitals",
        "passthrough",
        deps={"impute": ["age", "sex", "bmi", "bp"]}
        )
    .add_step(
        "blood",
        PCA(n_components=2, random_state=0),
        deps={"impute": make_column_selector("s[0-9]+")}
        )
    .add_step(
        "lr",
        LogisticRegression(random_state=0),
        deps=["blood", "vitals"]
        )
    .add_step(
        "out",
        FunctionTransformer(lambda x: x),
        deps=["inpute", "blood", "vitals", "lr"]
        )
    .make_dag()
)
dag.show()


from sklearn import datasets
X, y = datasets.load_diabetes(return_X_y=True, as_frame=True)
dag.fit_predict(X, y)

Here is the traceback I got. It suggested some kind of "inconsistency" in what I had coded.

Traceback (most recent call last):
  File "/usr/lib/python3.10/idlelib/run.py", line 578, in runcode
    exec(code, self.locals)
  File "/home/galen/Dropbox/bin/try_skdag.py", line 9, in <module>
    DAGBuilder(infer_dataframe=True)
  File "/home/galen/.local/lib/python3.10/site-packages/skdag/dag/_builder.py", line 120, in add_step
    self._validate_deps(deps)
  File "/home/galen/.local/lib/python3.10/site-packages/skdag/dag/_builder.py", line 158, in _validate_deps
    raise ValueError(f"unresolvable dependencies: {', '.join(sorted(missing))}")
ValueError: unresolvable dependencies: inpute
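The error is more specific than it looks: the final step references a dependency named "inpute", while the step registered earlier is named "impute". A likely fix, as a sketch, for the dependency list of the "out" step:

```python
# Likely fix (hedged): the dependency name must exactly match the step
# name registered earlier with add_step("impute", ...).
deps = ["impute", "blood", "vitals", "lr"]  # was ["inpute", "blood", ...]
```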

GridSearch and skdag

Hi,

First of all, I think your library is a great add on to sklearn, especially since it addresses limitations of Pipeline.

Having said that, I tried to use skdag with scikit-learn's GridSearchCV but ran into a problem. I used one of the examples from the library docs (https://skdag.readthedocs.io/en/latest/quick_start.html) to do a grid search for optimal hyperparameter values. To your code I only added the following:
from sklearn.model_selection import GridSearchCV
params = {'blood__n_components': [1, 2, 3, 4]}
grid = GridSearchCV(estimator=dag2, param_grid=params, scoring='accuracy')
grid.fit(X_train, y_train)

However, when I try to fit the model, I get the following error:
ValueError: Found input variables with inconsistent numbers of samples: [61, 2]

I would really appreciate it if you could tell me what is going on here.
Regards,
Tonci

convert input into dataframe

Hi, thanks for the amazing library. I am the maintainer of the sklearn-pandas library, and it was always on my to-do list to convert it into a proper DAG. I was trying skdag and got blocked by one problem. One of my intermediate transformers expects its input to be a dataframe. I am wondering if there is any way to force the inputs to be converted into a dataframe.

Question: Why multiple instances of nx.DiGraph in DAG?

I'm really excited by skdag. I was working on a similar project when I realized that it solved all the problems I had or wanted to solve.

I am currently trying to build something on top of this which has access to the underlying DAG structure. I've encountered an ambiguity I am hoping for technical assistance with.

Suppose I begin with this example from the docs loaded in memory:

from skdag import DAGBuilder
from sklearn.compose import make_column_selector
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression


dag = (
    DAGBuilder(infer_dataframe=True)
    .add_step("impute", SimpleImputer())
    .add_step("vitals", "passthrough", deps={"impute": ["age", "sex", "bmi", "bp"]})
    .add_step("blood", PCA(n_components=2, random_state=0), deps={"impute": make_column_selector("s[0-9]+")})
    .add_step("lr", LogisticRegression(random_state=0), deps=["blood", "vitals"])
    .make_dag()
)
dag.show()

I noticed that there are both dag.graph and dag.graph_ stored in memory at different addresses. They seem highly similar when inspecting the nodes and edges. Is one a reference to the other? Or is one a shallow copy of the other? Or is one a deep copy of the other? Or are they fundamentally different?
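One way to settle this kind of question empirically is to test identity and sharing directly. A stdlib-only sketch, with plain nested lists standing in for the two nx.DiGraph attributes:

```python
# Distinguishing a reference, a shallow copy, and a deep copy empirically.
# Nested lists stand in for the two graph attributes in question.
import copy

original = [[1, 2], [3, 4]]
same_ref = original
shallow = copy.copy(original)
deep = copy.deepcopy(original)

print(same_ref is original)        # True: the very same object
print(shallow is original)         # False: a distinct container...
print(shallow[0] is original[0])   # True: ...sharing inner objects
print(deep[0] is original[0])      # False: fully independent
```

Applying the same `is` checks to dag.graph, dag.graph_ and their node data would reveal which of the three relationships holds.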

Support cross_val_predict for stacked estimators

First -- this library looks great, pretty much exactly what I was looking for!

It appears, though, that each dependent estimator is trained directly on the entire passed dataset. This can lead to overfitting. Instead, it would be nice if cross_val_predict were supported in some fashion, so that when we have estimators A -> B, B is trained on the output of cross_val_predict rather than on raw predict output.

Given the activity in this repo, I'm assuming it's basically inactive. But if not, I'd be happy to contribute such functionality.
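The requested behaviour in miniature, as a stdlib-only sketch: B would be trained on A's out-of-fold predictions. Here "A" is a trivial mean predictor and the folds are contiguous slices, mimicking the shape of what cross_val_predict produces:

```python
# Each sample's prediction comes from a model fitted WITHOUT that
# sample's fold, so downstream estimator B never sees leaked targets.
def out_of_fold_predictions(y, n_folds=3):
    n = len(y)
    preds = [None] * n
    start = 0
    for fold in range(n_folds):
        size = n // n_folds + (1 if fold < n % n_folds else 0)
        test = set(range(start, start + size))
        train = [y[i] for i in range(n) if i not in test]
        mean = sum(train) / len(train)  # "fit" A on the other folds
        for i in test:
            preds[i] = mean             # "predict" the held-out fold
        start += size
    return preds

preds = out_of_fold_predictions([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], n_folds=3)
```

In a real stacking setup, the mean predictor would be estimator A and its out-of-fold predictions would become B's training inputs.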
