Comments (2)
UPDATE
Considering only basic transactional datasets, MultiLabelBinarizer gives access to a .classes_ attribute.
The problem is that it is an attribute of the object and does not travel with the data: sklearn's MultiLabelBinarizer produces an np.array or a scipy.sparse matrix, and neither carries the labels.
In the context of pattern mining we absolutely need the labels: these are our beloved symbols.
>>> import pandas as pd
>>> from skmine.preprocessing import MultiLabelBinarizer
>>> mb = MultiLabelBinarizer(sparse_output=True)
>>> D = pd.Series([  # SLIM takes a pd.Series as input
...     ['bananas', 'milk'],
...     ['milk', 'bananas', 'cookies'],
...     ['cookies', 'butter', 'tea'],
...     ['tea'],
...     ['milk', 'bananas', 'tea'],
... ])
>>> tab_D = mb.fit_transform(D) # scipy sparse matrix
>>> tab_D[:2].todense()
matrix([[1, 0, 0, 1, 0],
        [1, 0, 1, 1, 0]])
>>> mb.classes_
array(['bananas', 'butter', 'cookies', 'milk', 'tea'], dtype=object)
Note that pandas can build a DataFrame from a scipy sparse matrix, a feature introduced in version 0.25:
>>> tab_D = pd.DataFrame.sparse.from_spmatrix(tab_D, columns=mb.classes_)
>>> tab_D
   bananas  butter  cookies  milk  tea
0        1       0        0     1    0
1        1       0        1     1    0
2        0       1        1     0    1
3        0       0        0     0    1
4        1       0        0     1    1
>>> tab_D.dtypes
bananas    Sparse[int64, 0]
butter     Sparse[int64, 0]
cookies    Sparse[int64, 0]
milk       Sparse[int64, 0]
tea        Sparse[int64, 0]
dtype: object
My solution:
- reimplement sklearn's MultiLabelBinarizer so that its transform function returns a pandas.DataFrame with explicit columns (our symbols)
- propose this in sklearn as a feature request. If accepted, we will use the sklearn implementation, so that we don't implement things twice
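The first point of the solution could look roughly like the sketch below. It is only an illustration under stated assumptions, not the code that ended up in scikit-mine: the class name `LabeledMultiLabelBinarizer` and the helper `_to_frame` are hypothetical, and the sketch simply subclasses sklearn's `MultiLabelBinarizer` and wraps its output in a `pandas.DataFrame` whose columns are `classes_`.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer


class LabeledMultiLabelBinarizer(MultiLabelBinarizer):
    """Hypothetical subclass (not part of sklearn or scikit-mine).

    Encodes transactions exactly like MultiLabelBinarizer, but returns a
    DataFrame whose columns are the symbols, so labels travel with the data.
    """

    def _to_frame(self, encoded, index=None):
        # If the parent already routed through our transform, keep the frame.
        if isinstance(encoded, pd.DataFrame):
            return encoded
        if self.sparse_output:
            # Keep the sparse storage: one Sparse column per symbol.
            return pd.DataFrame.sparse.from_spmatrix(
                encoded, index=index, columns=self.classes_
            )
        return pd.DataFrame(encoded, index=index, columns=self.classes_)

    def fit_transform(self, y):
        encoded = super().fit_transform(y)
        # Reuse the index of the input when it is a pandas object.
        return self._to_frame(encoded, index=getattr(y, "index", None))

    def transform(self, y):
        encoded = super().transform(y)
        return self._to_frame(encoded, index=getattr(y, "index", None))


# Usage on the transactions from the example above.
D = pd.Series([
    ['bananas', 'milk'],
    ['milk', 'bananas', 'cookies'],
    ['cookies', 'butter', 'tea'],
    ['tea'],
    ['milk', 'bananas', 'tea'],
])
tab_D = LabeledMultiLabelBinarizer(sparse_output=True).fit_transform(D)
print(list(tab_D.columns))
```

Subclassing keeps the encoding logic in sklearn and only changes the container, which would make it easy to drop the subclass if the feature request were accepted upstream.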
see e4b467b