
scikit-mine : pattern mining in Python

Home Page: https://scikit-mine.github.io/scikit-mine/

License: BSD 3-Clause "New" or "Revised" License

Topics: scikit, datamining, minimum-description-length, pattern-mining, scikit-learn

scikit-mine's Introduction


Scikit-mine : pattern mining in Python

Resources

Quickstart

scikit-mine is a Python module for pattern mining built on top of pandas/NumPy/SciPy, distributed under the 3-Clause BSD license.

It is currently maintained by a team of volunteers.

See examples in the tutorials; the notebooks are available here. To execute the tutorials, you will need to install Jupyter Notebook or JupyterLab (https://jupyter.org/install).

Dependencies

scikit-mine requires Python >= 3.8 and the following dependencies:

  • scipy>=1.2.1
  • pandas>=1.0.0
  • pyroaring>=0.3.4
  • joblib>=0.11.1
  • sortedcontainers>=2.1.0
  • dataclasses>=0.6
  • networkx
  • wget>=3.2
  • scikit-learn
  • graphviz
  • matplotlib
  • pydot

Development

This project benefited from funding from the Inria center in Rennes, Brittany, France, as well as from the CNRS PNRIA programme.

We welcome new contributors of all experience levels.

Internal Contributors

scikit-mine's People

Contributors

alexandre-termier, ariaque, cyril-data, fbariatti, hermann74, lgalarra, remiadon, thomasbtnfr, trellixvulnteam, vpodpecan


scikit-mine's Issues

Emerging patterns -- inclusion or not ?

Describe the workflow you want to enable

Some internal discussions have drawn attention to emerging patterns.
In the long term we have to provide not only models but also methodologies to perform classification, so emerging patterns might be a solution

Describe your proposed solution

a new module called emerging, providing a set of tools to deal with emerging patterns, jumping emerging patterns, and so on.

Additional context

This paper is a good place to start

Note that Section 6.3 (Pairwise classification) looks very much like sklearn's OneVsOneClassifier methodology

HeatMap visualization of SLIM learning

Describe the workflow you want to enable

It would be nice if we could visualize the SLIM data representation step by step during learning

Some ideas to do this:

  • create an async watcher, making deep copies of an object's attributes at predefined steps
  • create a notebook showing data import + SLIM instantiation + watcher instantiation
  • make the watcher send data to a matplotlib FuncAnimation or a streamz DataFrame
  • plot the resulting animation

This way we could get a sexy representation of what MDL pattern mining algorithms are doing
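
A minimal sketch of the watcher idea, assuming the SLIM object exposes a codetable-like attribute; the class and attribute names here are hypothetical, not the actual skmine API:

import copy

class Watcher:
    """Record deep copies of a model attribute at predefined steps (sketch)."""

    def __init__(self, model, attr="codetable_"):   # "codetable_" is an assumed attribute name
        self.model = model
        self.attr = attr
        self.snapshots = []

    def snap(self):
        # deep copy so later updates to the model do not alter past snapshots
        self.snapshots.append(copy.deepcopy(getattr(self.model, self.attr, None)))

Each snapshot could then become one frame of a matplotlib.animation.FuncAnimation.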

Bitmap implementation

Context

For many mining tasks, e.g. tracking the transaction ids for every single item, we need efficient representations of integer sets. For this we will use Roaring Bitmaps

Candidates

In Python we have access to a few implementations.

To me the winner is roaringbitmap, because:

  • it is written in pure Cython, a language we want to investigate and use widely in this project, so this makes for a "unified" code base between our code and this external library
  • it is well implemented, well documented and well tested
  • it includes a few more methods than other libraries, one example being the intersection_len method

The only major issue I see is that it is not conda-installable yet, but we may work on that later.
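
As a quick illustration of the intersection_len point above (assuming the roaringbitmap package is installed):

>>> from roaringbitmap import RoaringBitmap
>>> a = RoaringBitmap([1, 2, 3])
>>> b = RoaringBitmap([2, 3, 5])
>>> a.intersection_len(b)  # same value as len(a & b), without materializing the intersection
2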

init documentation

Describe the issue linked to the documentation

For now we only have numpy-style docstrings

Suggest a potential alternative/fix

Set up doc generation via Sphinx
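
A minimal conf.py sketch for this; the extensions are standard Sphinx ones that handle numpy-style docstrings, while the theme choice is illustrative:

# docs/conf.py (sketch)
project = "scikit-mine"
extensions = [
    "sphinx.ext.autodoc",    # pull API docs from the code
    "sphinx.ext.napoleon",   # parse numpy-style docstrings
    "sphinx.ext.viewcode",   # link rendered docs back to the source
]
html_theme = "sphinx_rtd_theme"  # illustrative choice, any theme works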

LCM.fit_transform does not follow sklearn philosophy

Describe the workflow you want to enable

For now skmine.itemsets.LCM implements a fit_transform method, returning a DataFrame
with two columns, namely itemsets and supports.
To be fully compatible with sklearn this is not sufficient.

Describe your proposed solution

move what the current fit_transform function does into a new method ("discover"?), so we can still get a pretty DataFrame with itemsets and supports,
and change the fit_transform method so that (a sketch follows the list):

  • it returns a binary pandas.DataFrame, with itemsets as columns
  • the width of this DataFrame is constant and pre-defined (pass n_features as an argument?)
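
A hypothetical before/after of the split, reusing the toy dataset from the fit_transform issue further down; the discover name and the column labels follow this proposal, not necessarily the released API:

>>> from skmine.itemsets import LCM
>>> D = [[1, 2, 3, 4, 5, 6], [2, 3, 5], [2, 5]]
>>> lcm = LCM(min_supp=2)
>>> lcm.discover(D)           # pretty view, for pattern miners
     itemset  support
0     (2, 5)        3
1  (2, 3, 5)        2
>>> lcm.fit_transform(D)      # tabular view, for the sklearn world
   (2, 5)  (2, 3, 5)
0       1          1
1       1          1
2       1          0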

Additional context

Ideally we should be able to use our pattern mining algorithms to extract features from transactional datasets.
See this repo

benchmark methodology, first try with LCM

Workflow

Telling which modification in the code is responsible for which improvement in an implementation is not easy, and claiming that some code runs faster simply by looking at it is a really bad habit.
We need a dedicated methodology for this

Proposed solution

pandas developers already use a benchmarking tool called asv for their benchmarks. As we heavily rely on pandas, this seems a good way to go.

Here is a proposed process to apply when adding a new mining method (a minimal asv benchmark file is sketched after the list):

  • implement a V0 that respects a previously discussed functional definition
  • generate datasets with different properties, and add asv benchmarks for the new method
  • modify the code and evaluate the benefits from the benchmarking interface, then iterate
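
A minimal asv benchmark file along those lines; the file name, the dataset and the benchmarked method (the current fit_transform) are illustrative:

# benchmarks/bench_lcm.py (sketch)
from skmine.itemsets import LCM

class TimeLCM:
    def setup(self):
        # small synthetic transactional dataset; a real suite would use
        # generated datasets with controlled density and number of items
        self.D = [[1, 2, 3], [1, 2], [2, 3], [1, 3]] * 500

    def time_fit_transform(self):
        LCM(min_supp=100).fit_transform(self.D)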

This also lends itself to improving existing solutions, as we can make a new PR explicitly mentioning an improvement on a specific indicator, for a specific model 💯

Additional context

Note that benchmarks are not necessarily dedicated to run times. In SLIM, for example, we might want to improve the encoded data size, which states what level of compression we achieved.

mock external libraries

Describe the workflow you want to enable

For now external libraries are not mocked in unit tests

roaringbitmap, and maybe even joblib, should be

Describe your proposed solution

create an .external module importing those libraries
create a mock class and monkeypatch test functions

For roaringbitmap it may look like

import pytest
import roaringbitmap


class MockBitmap(set):
    """Pure-Python stand-in for roaringbitmap.RoaringBitmap."""

    def intersection_len(self, other):
        return len(self & other)


@pytest.fixture
def mock_bitmap(monkeypatch):
    # swap the C extension class for the mock during a test
    monkeypatch.setattr(roaringbitmap, 'RoaringBitmap', MockBitmap)
    return monkeypatch

[UPI] add fetch_instacart_* methods

Workflow

Instacart is a popular dataset in pattern mining; it has been used by many researchers and remains interesting for benchmarking purposes

Proposed solution

add fetch_instacart methods to download and return a dataset made of Instacart orders (a hypothetical usage sketch follows)
add some documentation showing LCM running on this dataset
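
A hypothetical usage sketch (fetch_instacart is the name proposed above; its module path, signature and return type are assumptions):

>>> from skmine.datasets import fetch_instacart   # hypothetical module path
>>> from skmine.itemsets import LCM
>>> D = fetch_instacart()                         # assumed to return a pd.Series of orders (lists of product ids)
>>> patterns = LCM(min_supp=1000).fit_transform(D)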

Additional context

Clément Gautrais has already provided an implementation for a paper named Widening for MDL-based Retail Signature Discovery, presented at the 2020 Intelligent Data Analysis (IDA) Symposium

The code lies here

SLIM V1

Workflow

Implement the SLIM algorithm

Describe your proposed solution

A dedicated class

add inclusion criteria for new algorithms

Describe the issue linked to the documentation

It is not obvious for now which algorithms should be included in scikit-mine.
My first attempt at defining this would be to say we hope for MDL-based algorithms to be included, but that is certainly not sufficient.

Suggest a potential alternative/fix

We should add a dedicated paragraph in the doc, like the one in scikit-learn doc

We should also reference this new paragraph in our GitHub issue templates, especially new_feature.md

Faster candidate generation for SLIM

Describe the workflow you want to enable

I tried SLIM on the Instacart dataset (see skmine.datasets.fetch_instacart).
It started the candidate generation process (first step) and never ended...

This is "normal" in the sense that Instacart contains almost 50K products (symbols), so the search space is fairly large.

Describe your proposed solution

First thoughts:

  • profile everything FIRST
  • try to "vectorize" calls to RoaringBitmap.intersection_len : this is where the experiment got stuck, which makes sense.
>>> import numpy as np
>>> from numpy import vectorize
>>> from roaringbitmap import RoaringBitmap
>>> a = np.array([RoaringBitmap([2, 3]), RoaringBitmap([3, 2, 5])])
>>> c = RoaringBitmap([5])
>>> f = vectorize(RoaringBitmap.intersection_len, otypes=[np.uint32], excluded={0})
>>> f(c, a)
array([0, 1], dtype=uint32)

Note that this solution may not be faster, as vectorize actually performs a Python-level for loop...

Additional context

One question we could ask, at least in the context of exploratory data science, is:
do we really want to summarize transactions drawn from 50K symbols?
Even with the best compression ratio, the result is unlikely to be usable by any human...

better description of transactions

Describe the issue linked to the documentation

As discussed with @PeggyCellier and @alexandre-termier,
the FIMI dataset descriptions lack extra information on the nature of the data and its potential uses.
In addition, some information about the density is missing.
Finally, describe_itemsets does not seem to be the best name for the method in question;
describe_transactional_dataset seems better, while my suggestion would be describe_transactions.
@PeggyCellier what would you prefer? I think describe_transactions is both short and explicit

Suggest a potential alternative/fix

  • add information on usage in FIMI dataset docstrings
  • add a density field in fetch_* method docstrings
  • adapt description methods to yield information about density (see the sketch after this list)
  • choose a new name for describe_itemsets and modify the code accordingly
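
A minimal sketch of the density computation, assuming transactions are held in a pandas Series of item lists; describe_transactions is the name suggested above:

>>> import pandas as pd
>>> def describe_transactions(D: pd.Series) -> dict:
...     """Basic stats for a transactional dataset, including density (sketch)."""
...     lengths = D.map(len)
...     n_items = len(set(item for transaction in D for item in transaction))
...     return {
...         "n_transactions": len(D),
...         "n_items": n_items,
...         "avg_transaction_size": lengths.mean(),
...         "density": lengths.mean() / n_items,  # average share of items present per transaction
...     }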

first performance improvements for SLIM

Describe your proposed solution

Two simple things to do:

  • right now SLIM makes sure not to drop itemsets with zero usage when switching a codetable for a new, more concise one. This results in a final codetable with a lot of empty usage lists. Deciding where to drop those zeros, and where to keep them, will lead to better run times and user experience
  • as mentioned in the literature, when considering a new candidate, reduce the database to the transactions containing all items of this candidate. This will make the computation of covers faster (a tid-list sketch follows)
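
A minimal sketch of the second point, assuming transaction ids (tids) are stored per codetable itemset as sets or RoaringBitmaps; the function name is illustrative:

def candidate_scope(tids_x, tids_y):
    """Transactions that contain every item of the candidate X ∪ Y (sketch).

    Covers only need to be recomputed on this subset: the candidate
    cannot cover a transaction that misses one of its items.
    """
    return tids_x & tids_y  # set / RoaringBitmap intersection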

Ordered itemsets

Hi,

I have tested LCM's implementation successfully so far. That said, I wonder if it would be useful to have the items within itemsets naturally ordered. For example, given the following input

D1 = [[1, 2, 3], [1, 5, 2], [1, 9, 12, 2], [14, 15, 16], [14, 15, 16, 17]]

LCM outputs the itemset (16, 14, 15) as frequent, whereas I would have expected (14, 15, 16). I believe that having a consistent order for the output itemsets may make other tasks, such as testing, -- or discriminative pattern mining as I am doing right now -- easier.

Describe your proposed solution

I have not had a look at the code that emits itemsets, but it seems that this could be a post-processing step after closure calculation and support thresholding. On the other hand, a natural order may not be defined for itemsets of arbitrary data types. In such cases, the notion of order could be passed as an argument (e.g., a function) to the LCM object.

Thanks for your attention,
Luis
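
A minimal post-processing sketch of the suggestion above, assuming patterns come back as a DataFrame with an itemset column; the optional key mirrors the idea of passing the order as an argument:

def order_itemsets(patterns, key=None):
    """Return a copy of the patterns with items sorted inside each itemset (sketch)."""
    patterns = patterns.copy()
    patterns["itemset"] = patterns["itemset"].map(lambda it: tuple(sorted(it, key=key)))
    return patterns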

SLIM : adapt early stopping to density

SLIMMER introduces early stopping in SLIM.

A number of bits is defined, and when compression does not improve by more than this threshold, we stop learning.

In practice run times may vary a lot between datasets. A simple strategy is to set this number of bits based on the number of transactions in the data. Here is what we currently do:

tol = len(D) / 10

A better strategy might be to set it based on the density. This is quite straightforward since we have all the information in the standard codetable. It would look like:

usages = self._standard_codetable.map(len)
density = usages.mean()
tol = density * len(D)

Artificial data generators

Describe the workflow you want to enable

In sklearn.datasets one can find data generators, e.g. make_classification

Describe your proposed solution

Implement a first generator in skmine.datasets for mere itemsets (a hypothetical sketch follows).
It should return a pd.Series respecting the properties passed as arguments
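
A hypothetical signature for such a generator; argument names and the Poisson choice for transaction sizes are illustrative, and the random_state argument also relates to the reproducibility issue further down:

import numpy as np
import pandas as pd

def make_transactions(n_transactions=100, n_items=20, avg_transaction_size=5, random_state=None):
    """Generate a pd.Series of random transactions with known properties (sketch)."""
    rng = np.random.RandomState(random_state)  # one seeded generator, reused for every draw
    sizes = rng.poisson(avg_transaction_size, size=n_transactions).clip(1, n_items)
    return pd.Series([list(rng.choice(n_items, size=s, replace=False)) for s in sizes])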

Additional context

A describe_itemsets function is already implemented in datasets.utils, and can be used to test whether the description of a generated dataset matches the input arguments

Skyline Queries

Describe the workflow you want to enable

Skyline queries have been well studied and enable very useful workflows.
For example, if I want to find the optimal offers among a set of homes I would like to reserve for vacation, skyline queries allow me to keep (sort of cache) the set of homes that are on the skyline.

Describe your proposed solution

A SkylineClassifier class.
At .fit time, find the points on the skyline and keep them as internal state.
This is possible because only a few points verify the property of being on the skyline; a bit like MDL-based methods, we get a concise representation.

A decision_function method would compute distances from the skyline. If I pass new data to it, I should get a notion of "is this product better than the products on the skyline?"
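
A minimal Pareto-front sketch of the fit-time step, assuming "better" means smaller on every dimension (e.g. price and distance to the beach); this is illustrative, not the proposed SkylineClassifier API:

import numpy as np

def skyline_indices(X):
    """Indices of points not dominated by any other point, smaller being better (sketch)."""
    X = np.asarray(X, dtype=float)
    keep = []
    for i, x in enumerate(X):
        # x is dominated if some point is <= x everywhere and < x somewhere
        dominated = np.any(np.all(X <= x, axis=1) & np.any(X < x, axis=1))
        if not dominated:
            keep.append(i)
    return keep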

Additional context

Max Halford wrote a really nice post on it

MDLRuleList

Describe the workflow you want to enable

Interpretable multiclass classification by MDL-based rule lists,
by van Leeuwen et al.

Implementation of this paper is available here

It is claimed to provide efficient, parameter-free multiclass classification

Describe your proposed solution

The implementation relies on pyFIM; replacing it with our LCM implementation (or SLIM?) should do the trick. If that is not possible, I have an implementation of Apriori somewhere

@PeggyCellier @alexandre-termier any comment ?

V0 - LCM + FIMI datasets

Implement the first building blocks, along with unit tests.
This includes:

  • first implementation of LCM
  • first fetching methods for fimi datasets
  • a basic BaseMiner interface
  • numpy style doc
  • unit tests
  • 100% coverage

online discovery in LCM

Describe the workflow you want to enable

LCM mines a lot of patterns. In some situations users may want to use LCM to mine a fairly large
number of patterns and post-filter them according to a criterion that is not based on the MDL principle

In addition, online discovery of patterns is a nice-to-have.

Our implementation could be made compatible with this kind of workflow.
We have to make sure we don't break any compatibility rules with scikit-learn; we only extend it

Describe your proposed solution

implement "discover_yield()" method in LCM. This methods returns a generator containing pandas.DataFrame, i.e chunks of discovered patterns

PROS: very flexible
CONS: in this setting one would not benefit from parallelism; a DataFrame is yielded every time we discover the itemsets stemming from an original item (a root node)
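
A hypothetical sketch of the method; the chunking per root item follows the description above, while the internal helpers are assumptions about code not shown here:

import pandas as pd

class LCM:
    ...

    def discover_yield(self, D):
        """Yield chunks of discovered patterns, one DataFrame per root item (sketch)."""
        for root in self._root_items(D):            # hypothetical helper
            patterns = self._explore_root(root, D)  # hypothetical helper
            yield pd.DataFrame(patterns, columns=["itemset", "support"])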

Additional context

@lgalarra mentioned this in a private thread

Generic MDL Interface

Describe the workflow you want to enable

A strong ambition of scikit-mine is to formally define MDL optimization and to make this methodology easier to understand and use in other methods.

To achieve this we should both work on fitting formal definitions AND ensure flawless interaction with existing tools in the ecosystem

Describe your proposed solution

Define the main methods in the MDLOptimizer interface (a possible shape is sketched below).
A dedicated page in the documentation (named "Why use the Minimum Description Length principle") may come later
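
A possible shape for the interface; the method names are assumptions meant to start the discussion, not the final API:

from abc import ABC, abstractmethod

class MDLOptimizer(ABC):
    """Anything that iteratively refines a model to minimize a two-part MDL score (sketch)."""

    @abstractmethod
    def generate_candidates(self):
        """Propose changes to the current model."""

    @abstractmethod
    def evaluate(self, candidate):
        """Return the change in total encoded length if the candidate is accepted."""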

Describe alternatives you've considered, if relevant

If we don't, we let users define their own methodology, but we miss the opportunity to formalize an amazing principle that could benefit the entire community

Additional context

A good place to start is sklearn-expertsys, see this file especially.
Reimplementing this file with our methodology and discussing the pros/cons would be a good exercise

init preprocessing/ module with an ItemsetEncoder

Describe the workflow you want to enable

In pattern mining it is well established that items are represented as non-negative integers (ids); FIMI datasets contain only integer entries.
Therefore we should implement an encoder to go from any raw data to an integer-based representation of items, getting closer to the kind of input one finds in the literature

Describe your proposed solution

Instantiate a preprocessing/ module with a simple class, namely ItemsetEncoder, providing a simple 2-way mapping (something like a bi-dict) to encode any hashable items into integers, and back (a sketch follows)
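
A minimal sketch of the 2-way mapping; the class name follows this proposal, everything else is illustrative:

class ItemsetEncoder:
    """Map hashable items to contiguous non-negative integers, and back (sketch)."""

    def __init__(self):
        self.item_to_id = {}
        self.id_to_item = {}

    def fit_transform(self, D):
        encoded = []
        for transaction in D:
            row = []
            for item in transaction:
                if item not in self.item_to_id:
                    idx = len(self.item_to_id)
                    self.item_to_id[item] = idx
                    self.id_to_item[idx] = item
                row.append(self.item_to_id[item])
            encoded.append(row)
        return encoded

    def inverse_transform(self, D):
        return [[self.id_to_item[i] for i in transaction] for transaction in D]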

Describe alternatives you've considered, if relevant

It is not straightforward, though, that this brings a performance improvement for the mining method applied downstream. From my first benchmark it holds if the mining method makes a lot of inter-item comparisons (e.g. our LCM implementation), but it may not hold true for every method

Additional context

In terms of data engineering: should we make this integer-based representation mandatory and forbid any other input to our methods? This would come at the cost of harder data ingestion, and may not guarantee better algorithms...

LCM paper quote

Describe the issue linked to the documentation

Cite LCM version 2 instead of HLCM.
Update the documentation accordingly

wrong min_supp when relative value passed

Describe the bug
The current LCM does not compute min_supp properly when a value between 0 and 1 is passed at instantiation time

Solution

replace *= with = at the end of the fit method
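
An illustrative version of the fix; the attribute names are assumptions, the point is to assign the absolute support once instead of compounding it:

# inside LCM.fit (sketch)
if self.min_supp < 1:                        # a relative support was passed
    self._min_supp = self.min_supp * len(D)  # "=" rather than "*=", so repeated fits don't compound
else:
    self._min_supp = self.min_supp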

LCMMax

For emerging patterns we need a class that mines maximal itemsets

Uno et al. give a description in LCM2

The simplest way might be to adapt our existing LCM implementation
With little change to the code we can output maximal patterns: we only need to output itemsets when they correspond to a leaf in the search space

consistent skmine.datasets.make_transactions

Describe the workflow you want to enable

make_transactions generates a dataset with known properties, containing random items that respect these properties
The ideal situation is being able to generate the EXACT same dataset for the same properties, given a fixed random seed

from skmine.datasets import make_transactions
from skmine.datasets.utils import describe_itemsets
D1 = make_transactions(100, 20, 20)
D2 = make_transactions(100, 20, 20)
assert describe_itemsets(D1) == describe_itemsets(D2)
# raises an error

Describe your proposed solution

add a seed argument to make_transactions (defaulting to None), and use it every time before generating transactions

Describe alternatives you've considered, if relevant

First make sure this is really required. Standard datasets may be enough if we want to test on consistent data

Additional context

Careful with list comprehensions: np.random.seed() would have to be called before every random draw, meaning either we generate the entire dataset at once with a fixed transaction size, or we find a workaround to make arbitrary-size transactions look realistic

LCM fit_transform as a form of One Hot Encoding

I recently added a "fit_discover" method to LCM, which renders patterns in a pretty format
This format is made for pattern miners

On the other hand, people from the machine learning community like to deal with tabular formats
This is currently addressed via fit_transform
The current implementation just calls fit_discover and then one-hot-encodes (or dummifies) the resulting patterns.

But this has several drawbacks

  • different parameters will lead to different dimensions in the resulting matrix: as each pattern is a dimension, if you mine more patterns, you end up with more dimensions
  • this results in large memory consumption, because the matrix is big

On chess we have 3196 single items, but with a min_supp of 2000 we get more than 60K patterns.
With the current implementation this leads to a DataFrame with more than 60K columns...

Proposed solution
People with an ML background already know about sklearn.preprocessing.OneHotEncoder.
It may be a good idea to make LCM.fit_transform a sort of more evolved OneHotEncoder

How do we do this?
Instead of setting every itemset as a column, we use single items. This makes dimensions consistent between different runs. For every pattern, we fill the corresponding cells with the length of the pattern.

Example

>>> from skmine.itemsets import LCM
>>> D = [[1, 2, 3, 4, 5, 6], [2, 3, 5], [2, 5]]
>>> lcm = LCM(min_supp=2)
>>> lcm.fit_transform(D, sort=True)   # current solution
   (2, 5)  (2, 3, 5)
0       1          1
1       1          1
2       1          0
>>> lcm.fit_transform(D, sort=True)   # proposed solution
   2  3  5
0  2  2  2
1  2  2  2
2  3  0  3

I am not sure about encoding pattern lengths in the result
But setting single items as columns is definitely a functional requirement

@PeggyCellier @alexandre-termier what do you think ?

Binary matrix as input

Workflow

For many people in the machine learning community, representing transactional datasets the way we do in skmine is unusual.
Moreover, a common workflow we will encounter at some point consists in using the output of our Transformers as input to sklearn.

In pattern mining, researchers are already familiar with matrix representations of transactional databases. Quoting Vreeken et al. from SLIM, Section 2.1:

Note that any binary or categorical dataset can be trivially converted into a transaction database

Proposed solution

To bridge this gap we should show an example working with a transactional dataset.
We should be able to:

  • transform a standard dataset (eg. chess) into a binary matrix
  • inverse transforming should not be very useful, considering the use cases

The most straightforward solution seems to be using sklearn.preprocessing.MultiLabelBinarizer, in this way:

>>> from sklearn.preprocessing import MultiLabelBinarizer
>>> from skmine.datasets.fimi import fetch_chess
>>> D = fetch_chess()
>>> mlb = MultiLabelBinarizer()
>>> X = mlb.fit_transform(D)
>>> X
array([[1, 0, 1, ..., 0, 0, 1],
       [1, 0, 0, ..., 0, 0, 1],
       [1, 0, 0, ..., 0, 0, 1],
       ...,
       [0, 0, 1, ..., 0, 0, 1],
       [0, 0, 1, ..., 0, 1, 1],
       [0, 0, 1, ..., 0, 1, 1]])

Describe alternatives you've considered, if relevant

If scikit-learn does not fit our purpose we can still implement our own transformer.
But the preferred solution would be to make a PR to scikit-learn

Additional context

Note that the MultiLabelBinarizer is only suitable for mere itemsets; it does not work out of the box on e.g. sequential itemsets.
Even so, it can be tweaked for this purpose, e.g.:

>>> import pandas as pd
>>> s = pd.Series([[2, 3, 2], [0, 2]])  # 2 appears at two different positions in the first transaction
>>> mlb = MultiLabelBinarizer()
>>> mlb.fit_transform(s.map(enumerate))  # note that inverse_transform will fail to reconstruct the original input
array([[0, 1, 0, 1, 1],
       [1, 0, 1, 0, 0]])
>>> mlb.classes_
array([(0, 0), (0, 2), (1, 2), (1, 3), (2, 2)], dtype=object)

labeled data generator for multiclass classification problems

Describe the workflow you want to enable

Just like sklearn's make_classification method, we should be able to generate LABELED synthetic data to train on

Describe your proposed solution

Vreeken et al. provide a script in DiffNorm's implementation, here, in scripts/gen.py

Adapt this script and test a multiclass methodology on the generated data
