
scikit-mine : pattern mining in Python

Home Page: https://scikit-mine.github.io/scikit-mine/

License: BSD 3-Clause "New" or "Revised" License

Topics: scikit, datamining, minimum-description-length, pattern-mining, scikit-learn

scikit-mine's Introduction


Scikit-mine : pattern mining in Python

Resources

Quickstart

scikit-mine is a Python module for pattern mining built on top of pandas/NumPy/SciPy, distributed under the 3-Clause BSD license.

It is currently maintained by a team of volunteers.

See examples in the tutorials; the notebooks are available here. To execute the tutorials, you will need to install Jupyter Notebook or JupyterLab (https://jupyter.org/install).

Dependencies

scikit-mine requires Python >= 3.8 and the following dependencies:

  • scipy>=1.2.1
  • pandas>=1.0.0
  • pyroaring>=0.3.4
  • joblib>=0.11.1
  • sortedcontainers>=2.1.0
  • dataclasses>=0.6
  • networkx
  • wget>=3.2
  • scikit-learn
  • graphviz
  • matplotlib
  • pydot

Development

This project benefited from funding from the Inria center in Rennes, Brittany, France, as well as from the CNRS PNRIA programme.

We welcome new contributors of all experience levels.

Internal Contributors

scikit-mine's People

Contributors

alexandre-termier, ariaque, cyril-data, fbariatti, hermann74, lgalarra, remiadon, thomasbtnfr, trellixvulnteam, vpodpecan


scikit-mine's Issues

Emerging patterns -- inclusion or not ?

Describe the workflow you want to enable

Some internal discussions have drawn attention to emerging patterns.
In the long term we have to provide not only models but also methodologies to perform classification, so emerging patterns might be a solution

Describe your proposed solution

a new module called emerging, providing a set of tools to deal with emerging patterns, jumping emerging patterns, and so on.

Additional context

This paper is a good place to start

Note that Section 6.3 (Pairwise classification) looks very much like sklearn's OneVsOneClassifier methodology

HeatMap visualization of SLIM learning

Describe the workflow you want to enable

It would be nice if we could visualize the SLIM data representation step by step during learning

Some ideas to do this:

  • create an async watcher, making deep copies of an object's attributes at predefined steps
  • create a notebook showing data import + SLIM instantiation + watcher instantiation
  • make the watcher send data to a matplotlib FuncAnimation or a streamz DataFrame
  • plot the resulting animation

This way we could get a sexy representation of what MDL pattern mining algorithms are doing
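
A minimal sketch of the watcher idea, assuming the SLIM object exposes a codetable-like attribute; the class and attribute names here are hypothetical, not the actual skmine API:

import copy

class Watcher:
    """Record deep copies of a model attribute at predefined steps (sketch)."""

    def __init__(self, model, attr="codetable_"):   # "codetable_" is an assumed attribute name
        self.model = model
        self.attr = attr
        self.snapshots = []

    def snap(self):
        # deep copy so later updates to the model do not alter past snapshots
        self.snapshots.append(copy.deepcopy(getattr(self.model, self.attr, None)))

Each snapshot could then become one frame of a matplotlib.animation.FuncAnimation.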

Bitmap implementation

Context

For many mining tasks, e.g. tracking the transaction ids for every single item, we need efficient representations of integer sets. For this we will use Roaring Bitmaps

Candidates

In Python we have access to a few implementations.

To me the winner is roaringbitmap, because:

  • it is written in pure Cython, a language we want to investigate and use widely in this project, so this makes for a "unified" code base between our code and this external library
  • it is well implemented, well documented and well tested
  • it includes a few more methods than other libraries, one example being the intersection_len method

The only major issue I see is that it is not conda-installable yet, but we may work on that later.
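
As a quick illustration of the intersection_len point above (assuming the roaringbitmap package is installed):

>>> from roaringbitmap import RoaringBitmap
>>> a = RoaringBitmap([1, 2, 3])
>>> b = RoaringBitmap([2, 3, 5])
>>> a.intersection_len(b)  # same value as len(a & b), without materializing the intersection
2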

init documentation

Describe the issue linked to the documentation

For now we only have numpy-style docstrings

Suggest a potential alternative/fix

Set up doc generation via Sphinx
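
A minimal conf.py sketch for this; the extensions are standard Sphinx ones that handle numpy-style docstrings, while the theme choice is illustrative:

# docs/conf.py (sketch)
project = "scikit-mine"
extensions = [
    "sphinx.ext.autodoc",    # pull API docs from the code
    "sphinx.ext.napoleon",   # parse numpy-style docstrings
    "sphinx.ext.viewcode",   # link rendered docs back to the source
]
html_theme = "sphinx_rtd_theme"  # illustrative choice, any theme works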

LCM.fit_transform does not follow sklearn philosophy

Describe the workflow you want to enable

For now skmine.itemsets.LCM implements a fit_transform method, returning a DataFrame
with two columns, namely itemsets and supports.
To be fully compatible with sklearn this is not sufficient.

Describe your proposed solution

move what the current fit_transform function does into a new method ("discover"?), so we can still get a pretty DataFrame with itemsets and supports,
and change the fit_transform method so that (a sketch follows the list):

  • it returns a binary pandas.DataFrame, with itemsets as columns
  • the width of this DataFrame is constant and pre-defined (pass n_features as an argument?)
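
A hypothetical before/after of the split, reusing the toy dataset from the fit_transform issue further down; the discover name and the column labels follow this proposal, not necessarily the released API:

>>> from skmine.itemsets import LCM
>>> D = [[1, 2, 3, 4, 5, 6], [2, 3, 5], [2, 5]]
>>> lcm = LCM(min_supp=2)
>>> lcm.discover(D)           # pretty view, for pattern miners
     itemset  support
0     (2, 5)        3
1  (2, 3, 5)        2
>>> lcm.fit_transform(D)      # tabular view, for the sklearn world
   (2, 5)  (2, 3, 5)
0       1          1
1       1          1
2       1          0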

Additional context

Ideally we should be able to use our pattern mining algorithms to extract features from transactional datasets.
See this repo

benchmark methodology, first try with LCM

Workflow

Telling which modification in the code is responsible for which improvement in an implementation is not easy, and claiming that some code runs faster simply by looking at it is a really bad habit.
We need a dedicated methodology for this

Proposed solution

pandas developers already use a benchmarking tool called asv for their benchmarks. As we heavily rely on pandas, this seems a good way to go.

Here is a proposed process to apply when adding a new mining method (a minimal asv benchmark file is sketched after the list):

  • implement a V0 that respects a previously discussed functional definition
  • generate datasets with different properties, and add asv benchmarks for the new method
  • modify the code and evaluate the benefits from the benchmarking interface, then iterate
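
A minimal asv benchmark file along those lines; the file name, the dataset and the benchmarked method (the current fit_transform) are illustrative:

# benchmarks/bench_lcm.py (sketch)
from skmine.itemsets import LCM

class TimeLCM:
    def setup(self):
        # small synthetic transactional dataset; a real suite would use
        # generated datasets with controlled density and number of items
        self.D = [[1, 2, 3], [1, 2], [2, 3], [1, 3]] * 500

    def time_fit_transform(self):
        LCM(min_supp=100).fit_transform(self.D)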

This also lends itself to improving existing solutions, as we can make a new PR explicitly mentioning an improvement on a specific indicator, for a specific model 💯

Additional context

Note that benchmarks are not necessarily dedicated to run times. In SLIM, for example, we might want to improve the encoded data size, which states what level of compression we achieved.

mock external libraries

Describe the workflow you want to enable

For now external libraries are not mocked in unit tests

roaringbitmap, and maybe even joblib, should be

Describe your proposed solution

create an .external module importing those libraries
create a mock class and monkeypatch test functions

For roaringbitmap it may look like

import pytest
import roaringbitmap


class MockBitmap(set):
    """Pure-Python stand-in for roaringbitmap.RoaringBitmap."""

    def intersection_len(self, other):
        return len(self & other)


@pytest.fixture
def mock_bitmap(monkeypatch):
    # swap the C extension class for the mock during a test
    monkeypatch.setattr(roaringbitmap, 'RoaringBitmap', MockBitmap)
    return monkeypatch

[UPI] add fetch_instacart_* methods

Workflow

Instacart is a popular dataset in pattern mining; it has been used by many researchers and remains interesting for benchmarking purposes

Proposed solution

add fetch_instacart methods to download and return a dataset made of Instacart orders (a hypothetical usage sketch follows)
add some documentation showing LCM running on this dataset
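
A hypothetical usage sketch (fetch_instacart is the name proposed above; its module path, signature and return type are assumptions):

>>> from skmine.datasets import fetch_instacart   # hypothetical module path
>>> from skmine.itemsets import LCM
>>> D = fetch_instacart()                         # assumed to return a pd.Series of orders (lists of product ids)
>>> patterns = LCM(min_supp=1000).fit_transform(D)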

Additional context

Clément Gautrais has already provided an implementation for a paper named Widening for MDL-based Retail Signature Discovery, presented at the 2020 Intelligent Data Analysis (IDA) Symposium

The code lies here

SLIM V1

Workflow

Implement the SLIM algorithm

Describe your proposed solution

A dedicated class

add inclusion criteria for new algorithms

Describe the issue linked to the documentation

It is not obvious for now which algorithms should be included in scikit-mine.
My first attempt at defining this would be to say we hope for MDL-based algorithms to be included, but that is certainly not sufficient.

Suggest a potential alternative/fix

We should add a dedicated paragraph in the doc, like the one in scikit-learn doc

We should also reference this new paragraph in our GitHub issue templates, especially new_feature.md

Faster candidate generation for SLIM

Describe the workflow you want to enable

I tried SLIM on the Instacart dataset (see skmine.datasets.fetch_instacart).
It started the candidate generation process (first step) and never ended...

This is "normal" in the sense that Instacart contains almost 50K products (symbols), so the search space is fairly large.

Describe your proposed solution

First thoughts:

  • profile everything FIRST
  • try to "vectorize" calls to RoaringBitmap.intersection_len : this is where the experiment got stuck, which makes sense.
>>> import numpy as np
>>> from numpy import vectorize
>>> from roaringbitmap import RoaringBitmap
>>> a = np.array([RoaringBitmap([2, 3]), RoaringBitmap([3, 2, 5])])
>>> c = RoaringBitmap([5])
>>> f = vectorize(RoaringBitmap.intersection_len, otypes=[np.uint32], excluded={0})
>>> f(c, a)
array([0, 1], dtype=uint32)

Note that this solution may not be faster, as vectorize actually performs a Python-level for loop...

Additional context

One question we could ask, at least in the context of exploratory data science, is:
do we really want to summarize transactions drawn from 50K symbols?
Even with the best compression ratio, the result is unlikely to be usable by any human...

better description of transactions

Describe the issue linked to the documentation

As discussed with @PeggyCellier and @alexandre-termier,
the FIMI dataset descriptions lack extra information on the nature of the data and its potential uses.
In addition, some information about the density is missing.
Finally, describe_itemsets does not seem to be the best name for the method in question;
describe_transactional_dataset seems better, while my suggestion would be describe_transactions.
@PeggyCellier what would you prefer? I think describe_transactions is both short and explicit

Suggest a potential alternative/fix

  • add information on usage in FIMI dataset docstrings
  • add a density field in fetch_* method docstrings
  • adapt description methods to yield information about density (see the sketch after this list)
  • choose a new name for describe_itemsets and modify the code accordingly
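
A minimal sketch of the density computation, assuming transactions are held in a pandas Series of item lists; describe_transactions is the name suggested above:

>>> import pandas as pd
>>> def describe_transactions(D: pd.Series) -> dict:
...     """Basic stats for a transactional dataset, including density (sketch)."""
...     lengths = D.map(len)
...     n_items = len(set(item for transaction in D for item in transaction))
...     return {
...         "n_transactions": len(D),
...         "n_items": n_items,
...         "avg_transaction_size": lengths.mean(),
...         "density": lengths.mean() / n_items,  # average share of items present per transaction
...     }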

first performance improvements for SLIM

Describe your proposed solution

Two simple things to do:

  • right now SLIM makes sure not to drop itemsets with zero usage when switching a codetable for a new, more concise one. This results in a final codetable with a lot of empty usage lists. Deciding where to drop those zeros, and where to keep them, will lead to better run times and user experience
  • as mentioned in the literature, when considering a new candidate, reduce the database to the transactions containing all items of this candidate. This will make the computation of covers faster (a tid-list sketch follows)
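
A minimal sketch of the second point, assuming transaction ids (tids) are stored per codetable itemset as sets or RoaringBitmaps; the function name is illustrative:

def candidate_scope(tids_x, tids_y):
    """Transactions that contain every item of the candidate X ∪ Y (sketch).

    Covers only need to be recomputed on this subset: the candidate
    cannot cover a transaction that misses one of its items.
    """
    return tids_x & tids_y  # set / RoaringBitmap intersection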

Ordered itemsets

Hi,

I have tested LCM's implementation successfully so far. That said, I wonder if it would be useful to have the items within itemsets naturally ordered. For example, given the following input

D1 = [[1, 2, 3], [1, 5, 2], [1, 9, 12, 2], [14, 15, 16], [14, 15, 16, 17]]

LCM outputs the itemset (16, 14, 15) as frequent, whereas I would have expected (14, 15, 16). I believe that having a consistent order for the output itemsets may make other tasks, such as testing, -- or discriminative pattern mining as I am doing right now -- easier.

Describe your proposed solution

I have not had a look at the code that emits itemsets, but it seems that this could be a post-processing step after closure calculation and support thresholding. On the other hand, a natural order may not be defined for itemsets of arbitrary data types. In such cases, the notion of order could be passed as an argument (e.g., a function) to the LCM object.

Thanks for your attention,
Luis
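
A minimal post-processing sketch of the suggestion above, assuming patterns come back as a DataFrame with an itemset column; the optional key mirrors the idea of passing the order as an argument:

def order_itemsets(patterns, key=None):
    """Return a copy of the patterns with items sorted inside each itemset (sketch)."""
    patterns = patterns.copy()
    patterns["itemset"] = patterns["itemset"].map(lambda it: tuple(sorted(it, key=key)))
    return patterns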

SLIM : adapt early stopping to density

SLIMMER introduces early stopping in SLIM.

A number of bits is defined, and when compression does not improve by more than this threshold, we stop learning.

In practice run times may vary a lot between datasets. A simple strategy is to set this number of bits based on the number of transactions in the data. Here is what we currently do:

tol = len(D) / 10

A better strategy might be to set it based on the density. This is quite straightforward since we have all the information in the standard codetable. It would look like:

usages = self._standard_codetable.map(len)
density = usages.mean()
tol = density * len(D)

Artificial data generators

Describe the workflow you want to enable

In sklearn.datasets one can find data generators, e.g. make_classification

Describe your proposed solution

Implement a first generator in skmine.datasets for mere itemsets (a hypothetical sketch follows).
It should return a pd.Series respecting the properties passed as arguments
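
A hypothetical signature for such a generator; argument names and the Poisson choice for transaction sizes are illustrative, and the random_state argument also relates to the reproducibility issue further down:

import numpy as np
import pandas as pd

def make_transactions(n_transactions=100, n_items=20, avg_transaction_size=5, random_state=None):
    """Generate a pd.Series of random transactions with known properties (sketch)."""
    rng = np.random.RandomState(random_state)  # one seeded generator, reused for every draw
    sizes = rng.poisson(avg_transaction_size, size=n_transactions).clip(1, n_items)
    return pd.Series([list(rng.choice(n_items, size=s, replace=False)) for s in sizes])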

Additional context

A describe_itemsets function is already implemented in datasets.utils, and can be used to test whether the description of a generated dataset matches the input arguments

Skyline Queries

Describe the workflow you want to enable

Skyline queries have been well studied and enable very useful workflows.
For example, if I want to find the optimal offers among a set of homes I would like to reserve for vacation, skyline queries allow me to keep (sort of cache) the set of homes that are on the skyline.

Describe your proposed solution

A SkylineClassifier class.
At .fit time, find the points on the skyline and keep them as internal state.
This is possible because only a few points verify the property of being on the skyline; a bit like MDL-based methods, we get a concise representation.

A decision_function method would compute distances from the skyline. If I pass new data to it, I should get a notion of "is this product better than the products on the skyline?"
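
A minimal Pareto-front sketch of the fit-time step, assuming "better" means smaller on every dimension (e.g. price and distance to the beach); this is illustrative, not the proposed SkylineClassifier API:

import numpy as np

def skyline_indices(X):
    """Indices of points not dominated by any other point, smaller being better (sketch)."""
    X = np.asarray(X, dtype=float)
    keep = []
    for i, x in enumerate(X):
        # x is dominated if some point is <= x everywhere and < x somewhere
        dominated = np.any(np.all(X <= x, axis=1) & np.any(X < x, axis=1))
        if not dominated:
            keep.append(i)
    return keep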

Additional context

Max Halford wrote a really nice post on it

MDLRuleList

Describe the workflow you want to enable

Interpretable multiclass classification by MDL-based rule lists,
by van Leeuwen et al.

Implementation of this paper is available here

It is claimed to provide efficient, parameter-free multiclass classification

Describe your proposed solution

The implementation relies on pyFIM; replacing it with our LCM implementation (or SLIM?) should do the trick. If that is not possible, I have an implementation of Apriori somewhere

@PeggyCellier @alexandre-termier any comment ?

V0 - LCM + FIMI datasets

Implement the first building blocks, along with unit tests.
This includes:

  • first implementation of LCM
  • first fetching methods for fimi datasets
  • a basic BaseMiner interface
  • numpy style doc
  • unit tests
  • 100% coverage

online discovery in LCM

Describe the workflow you want to enable

LCM mines a lot of patterns. In some situations users may want to use LCM to mine a fairly large
number of patterns and post-filter them according to a criterion that is not based on the MDL principle

In addition, online discovery of patterns is a nice-to-have.

Our implementation could be made compatible with this kind of workflow.
We have to make sure we don't break any compatibility rules with scikit-learn; we only extend it

Describe your proposed solution

implement "discover_yield()" method in LCM. This methods returns a generator containing pandas.DataFrame, i.e chunks of discovered patterns

PROS: very flexible
CONS: in this setting one would not benefit from parallelism; a DataFrame is yielded every time we discover the itemsets stemming from an original item (a root node)
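
A hypothetical sketch of the method; the chunking per root item follows the description above, while the internal helpers are assumptions about code not shown here:

import pandas as pd

class LCM:
    ...

    def discover_yield(self, D):
        """Yield chunks of discovered patterns, one DataFrame per root item (sketch)."""
        for root in self._root_items(D):            # hypothetical helper
            patterns = self._explore_root(root, D)  # hypothetical helper
            yield pd.DataFrame(patterns, columns=["itemset", "support"])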

Additional context

@lgalarra mentioned this in a private thread

Generic MDL Interface

Describe the workflow you want to enable

A strong ambition of scikit-mine is to formally define MDL optimization and to make this methodology easier to understand and use in other methods.

To achieve this we should both work on fitting formal definitions AND ensure flawless interaction with existing tools in the ecosystem

Describe your proposed solution

Define the main methods in the MDLOptimizer interface (a possible shape is sketched below).
A dedicated page in the documentation (named "Why use the Minimum Description Length principle") may come later
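
A possible shape for the interface; the method names are assumptions meant to start the discussion, not the final API:

from abc import ABC, abstractmethod

class MDLOptimizer(ABC):
    """Anything that iteratively refines a model to minimize a two-part MDL score (sketch)."""

    @abstractmethod
    def generate_candidates(self):
        """Propose changes to the current model."""

    @abstractmethod
    def evaluate(self, candidate):
        """Return the change in total encoded length if the candidate is accepted."""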

Describe alternatives you've considered, if relevant

If we don't, we let users define their own methodology, but we miss the opportunity to formalize an amazing principle that could benefit the entire community

Additional context

A good place to start is sklearn-expertsys, see this file especially.
Reimplementing this file with our methodology and discussing the pros/cons would be a good exercise

init preprocessing/ module with an ItemsetEncoder

Describe the workflow you want to enable

In pattern mining it is well established that items are represented as non-negative integers (ids); FIMI datasets contain only integer entries.
Therefore we should implement an encoder to go from any raw data to an integer-based representation of items, getting closer to the kind of input one finds in the literature

Describe your proposed solution

Instantiate a preprocessing/ module with a simple class, namely ItemsetEncoder, providing a simple 2-way mapping (something like a bi-dict) to encode any hashable items into integers, and back (a sketch follows)
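
A minimal sketch of the 2-way mapping; the class name follows this proposal, everything else is illustrative:

class ItemsetEncoder:
    """Map hashable items to contiguous non-negative integers, and back (sketch)."""

    def __init__(self):
        self.item_to_id = {}
        self.id_to_item = {}

    def fit_transform(self, D):
        encoded = []
        for transaction in D:
            row = []
            for item in transaction:
                if item not in self.item_to_id:
                    idx = len(self.item_to_id)
                    self.item_to_id[item] = idx
                    self.id_to_item[idx] = item
                row.append(self.item_to_id[item])
            encoded.append(row)
        return encoded

    def inverse_transform(self, D):
        return [[self.id_to_item[i] for i in transaction] for transaction in D]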

Describe alternatives you've considered, if relevant

It is not straightforward, though, that this brings a performance improvement for the mining method applied downstream. From my first benchmark it holds if the mining method makes a lot of inter-item comparisons (e.g. our LCM implementation), but it may not hold true for every method

Additional context

In terms of data engineering: should we make this integer-based representation mandatory and forbid any other input to our methods? This would come at the cost of harder data ingestion, and may not guarantee better algorithms...

LCM paper quote

Describe the issue linked to the documentation

Cite LCM version 2 instead of HLCM.
Update the documentation accordingly

wrong min_supp when relative value passed

Describe the bug
The current LCM does not compute min_supp properly when a value between 0 and 1 is passed at instantiation time

Solution

replace *= with = at the end of the fit method
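
An illustrative version of the fix; the attribute names are assumptions, the point is to assign the absolute support once instead of compounding it:

# inside LCM.fit (sketch)
if self.min_supp < 1:                        # a relative support was passed
    self._min_supp = self.min_supp * len(D)  # "=" rather than "*=", so repeated fits don't compound
else:
    self._min_supp = self.min_supp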

LCMMax

For emerging patterns we need a class that mines maximal itemsets

Uno et al. give a description in LCM2

The simplest way might be to adapt our existing LCM implementation
With little change to the code we can output maximal patterns: we only need to output itemsets when they correspond to a leaf in the search space

consistent skmine.datasets.make_transactions

Describe the workflow you want to enable

make_transactions generates a dataset with known properties, containing random items that respect these properties
The ideal situation is being able to generate the EXACT same dataset for the same properties, given a fixed random seed

from skmine.datasets import make_transactions
from skmine.datasets.utils import describe_itemsets
D1 = make_transactions(100, 20, 20)
D2 = make_transactions(100, 20, 20)
assert describe_itemsets(D1) == describe_itemsets(D2)
# raises an error

Describe your proposed solution

add a seed argument to make_transactions (defaulting to None), and use it every time before generating transactions

Describe alternatives you've considered, if relevant

First make sure this is really required. Standard datasets may be enough if we want to test on consistent data

Additional context

Careful with list comprehensions: np.random.seed() would have to be called before every random draw, meaning either we generate the entire dataset at once with a fixed transaction size, or we find a workaround to make arbitrary-size transactions look realistic

LCM fit_transform as a form of One Hot Encoding

I recently added a "fit_discover" method to LCM, which renders patterns in a pretty format
This format is made for pattern miners

On the other hand, people from the machine learning community like to deal with tabular formats
This is currently addressed via fit_transform
The current implementation just calls fit_discover and then one-hot-encodes (or dummifies) the resulting patterns.

But this has several drawbacks

  • different parameters will lead to different dimensions in the resulting matrix: as each pattern is a dimension, if you mine more patterns, you end up with more dimensions
  • this results in large memory consumption, because the matrix is big

On chess we have 3196 single items, but with a min_supp of 2000 we get more than 60K patterns.
With the current implementation this leads to a DataFrame with more than 60K columns...

Proposed solution
People with an ML background already know about sklearn.preprocessing.OneHotEncoder.
It may be a good idea to make LCM.fit_transform a sort of more evolved OneHotEncoder

How do we do this?
Instead of setting every itemset as a column, we use single items. This makes dimensions consistent between different runs. For every pattern, we fill the corresponding cells with the length of the pattern.

Example

>>> from skmine.itemsets import LCM
>>> D = [[1, 2, 3, 4, 5, 6], [2, 3, 5], [2, 5]]
>>> lcm = LCM(min_supp=2)
>>> lcm.fit_transform(D, sort=True)   # current solution
   (2, 5)  (2, 3, 5)
0       1          1
1       1          1
2       1          0
>>> lcm.fit_transform(D, sort=True)   # proposed solution
   2  3  5
0  2  2  2
1  2  2  2
2  3  0  3

I am not sure about encoding pattern lengths in the result
But setting single items as columns is definitely a functional requirement

@PeggyCellier @alexandre-termier what do you think ?

Binary matrix as input

Workflow

For many people in the machine learning community, representing transactional datasets the way we do in skmine is unusual.
Moreover, a common workflow we will encounter at some point consists in using the output of our Transformers as input to sklearn.

In pattern mining, researchers are already familiar with matrix representations of transactional databases. Quoting Vreeken et al. from SLIM, Section 2.1:

Note that any binary or categorical dataset can be trivially converted into a transaction database

Proposed solution

To bridge this gap we should show an example working with a transactional dataset.
We should be able to:

  • transform a standard dataset (eg. chess) into a binary matrix
  • inverse transforming should not be very useful, considering the use cases

The most straightforward solution seems to be using sklearn.preprocessing.MultiLabelBinarizer, in this way:

>>> from sklearn.preprocessing import MultiLabelBinarizer
>>> from skmine.datasets.fimi import fetch_chess
>>> D = fetch_chess()
>>> mlb = MultiLabelBinarizer()
>>> X = mlb.fit_transform(D)
>>> X
array([[1, 0, 1, ..., 0, 0, 1],
       [1, 0, 0, ..., 0, 0, 1],
       [1, 0, 0, ..., 0, 0, 1],
       ...,
       [0, 0, 1, ..., 0, 0, 1],
       [0, 0, 1, ..., 0, 1, 1],
       [0, 0, 1, ..., 0, 1, 1]])

Describe alternatives you've considered, if relevant

If scikit-learn does not fit our purpose we can still implement our own transformer.
But the preferred solution would be to make a PR to scikit-learn

Additional context

Note that the MultiLabelBinarizer is only suitable for mere itemsets; it does not work out of the box on e.g. sequential itemsets.
Even so, it can be tweaked for this purpose, e.g.:

>>> import pandas as pd
>>> s = pd.Series([[2, 3, 2], [0, 2]])  # 2 appears at two different positions in the first transaction
>>> mlb = MultiLabelBinarizer()
>>> mlb.fit_transform(s.map(enumerate))  # note that inverse_transform will fail to reconstruct the original input
array([[0, 1, 0, 1, 1],
       [1, 0, 1, 0, 0]])
>>> mlb.classes_
array([(0, 0), (0, 2), (1, 2), (1, 3), (2, 2)], dtype=object)

labeled data generator for multiclass classification problems

Describe the workflow you want to enable

Just like sklearn's make_classification method, we should be able to generate LABELED synthetic data to train on

Describe your proposed solution

Vreeken et al. provide a script in DiffNorm's implementation, here, in scripts/gen.py

Adapt this script and test a multiclass methodology on the generated data
