scikit-tda / kepler-mapper
Kepler Mapper: A flexible Python implementation of the Mapper algorithm.
Home Page: https://kepler-mapper.scikit-tda.org
License: MIT License
Can you please explain to me what is meant by verbose?
For example:
mapper = km.KeplerMapper(verbose=2)
How do I use it? I am confused; please guide me.
Thanks
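For what it's worth, verbose appears to control how much progress logging the mapper prints: 0 is silent, 1 prints the major steps, and 2 adds per-cube detail. This is inferred from the log output quoted elsewhere on this page, not from the kmapper source, so treat this toy sketch of the convention as an assumption:

```python
def progress_messages(verbose):
    """Collect the messages a run would print at a given verbosity level."""
    msgs = []
    if verbose >= 1:
        msgs.append("..Projecting data using: sum")        # high-level steps
    if verbose >= 2:
        msgs.append("There are 19 points in cube_0 / 10")  # per-cube detail
    return msgs

assert progress_messages(0) == []      # verbose=0: silent
assert len(progress_messages(2)) == 2  # verbose=2: everything
```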
Hey Devs,
I've often thought it would be useful to have a write/read utility for mapper networks produced by KeplerMapper. Is this something being developed? If not, how would you suggest one could be written? I suppose a basic interface with NetworkX could work, but maybe you reckon there is a better way?
Let me know your thoughts
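Since the other issues on this page suggest the mapper output is a plain dict (graph["nodes"] mapping node ids to member row indices, plus links and meta), a minimal write/read utility could be sketched with stdlib json alone. The graph structure below is an assumption for illustration, not the exact kmapper schema:

```python
import json

# Hypothetical mapper output: plain dict of nodes/links/meta
graph = {"nodes": {"cube0_cluster0": [0, 1, 2]},
         "links": {"cube0_cluster0": ["cube1_cluster0"]},
         "meta": {"n_samples": 3}}

def write_graph(graph, path):
    with open(path, "w") as f:
        json.dump(graph, f)

def read_graph(path):
    with open(path) as f:
        return json.load(f)

write_graph(graph, "mapper_graph.json")
restored = read_graph("mapper_graph.json")
assert restored == graph  # lossless round trip
```

A NetworkX adapter could then be layered on top of the same dict if graph algorithms are needed.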
By integrating CI, Travis will automatically run the test suite before any pull request.
I've added the .travis.yml settings file and tested everything so it works on my fork. From what I can tell, only the owner of the repo can integrate Travis into the repo.
@MLWave, when you have a second, could you turn this on? It took me <3 minutes on my fork, just follow the first few steps here.
Once this is done, we can
Great software! I'm trying to run the code in a Jupyter notebook, but it doesn't work, i.e. I get the .html file saved but no display in the notebook.
from IPython.core.display import display, HTML
...
display(mapper.visualize(graph, path_html="make_circles_keplermapper_output.html",
title="make_circles(n_samples=5000, noise=0.03, factor=0.3)"))
I also tried HTML(mapper.visualize...) with no result.
Any suggestion?
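One possible workaround, assuming the symptom means visualize() returns None in this version: read the saved file back and display its contents, rather than the return value. The write below is a stand-in for the mapper.visualize(...) call:

```python
# Stand-in for mapper.visualize(graph, path_html=path, ...)
path = "make_circles_keplermapper_output.html"
with open(path, "w") as f:
    f.write("<html><body>graph</body></html>")

# Read the saved page back in
with open(path) as f:
    page = f.read()

# In a notebook cell you would then do:
# from IPython.core.display import display, HTML
# display(HTML(page))
assert "<html>" in page
```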
Hello Kepler-Mapper devs and users,
This is not an open issue, but a proposal for a possible future pull request.
Over the last weekend I worked on a new version of kmapper in my local repo that generates an interactive Plotly plot in the Jupyter Notebook.
Here are a few experiments: http://nbviewer.jupyter.org/gist/empet/3ad5d13ad662eb18ae682f2a49035420
Please share your opinion on such a possibility of plotting the topological graph associated with a dataset.
On my master sauln/kepler-mapper@274e3ea I have added a setup.py file so kepler-mapper can be installed with python setup.py install. The new import format would be
from kmapper import KeplerMapper
or
import kmapper as km
What are your thoughts on this? Once we set the format, I can run through the docs and examples with an update.
I am totally new to topological data analysis and trying to learn. When I copy the code into a Python notebook and replace the data with my own, the code runs OK, but there is no plot output. Do I need to install other software? The Python I use is Anaconda.
Hi there,
When using HDBSCAN (from the hdbscan library, as suggested in Issue #68), the code terminates with:
ValueError: k must be less than or equal to the number of training points
This is with the default behaviour, i.e. using mapper.map(projected_X=projected_data, inverse_X=data, clusterer=hdbscan.HDBSCAN()).
The same data and projection work fine with DBSCAN(), so I assume it's an issue with the interaction with hdbscan. Maybe you know something about this?
The process: https://github.com/Kaggle/docker-python
Some examples already:
https://inclass.kaggle.com/triskelion/mapping-with-sum-row/notebook
https://www.kaggle.com/triskelion/testing-python-3/notebook
https://www.kaggle.com/triskelion/isomap-all-the-digits-2
If we get the new containers to build with kmapper, we can use KeplerMapper on all Kaggle's datasets and use their notebooks for replication/easy forks.
Dear all,
Many thanks for Kepler Mapper. It is a very interesting project.
I have looked at the examples in the notebooks folder. I have found a problem with the notebook KeplerMapper Newsgroup20 Pipeline.ipynb. I get the following error when I try to run the line projected_X = mapper.fit_transform(X, projection=[TfidfVectorizer(analyzer="char", ngram_range=(1,6), max_df=0.83, min_df=0.05), TruncatedSVD(n_components=100, random_state=1729), Isomap(n_components=2, n_jobs=-1)], scaler=[None, None, MinMaxScaler()]):
IndexError Traceback (most recent call last)
in ()
10 Isomap(n_components=2,
11 n_jobs=-1)],
---> 12 scaler=[None, None, MinMaxScaler()])
13
14 print("SHAPE",projected_X.shape)
~/kepler-mapper/kmapper/kmapper.py in fit_transform(self, X, projection, scaler, distance_matrix)
219 if self.verbose > 0:
220 print("\n..Projecting data using: %s" % (str(projection)))
--> 221 X = X[:, np.array(projection)]
222
223 # Scaling
IndexError: arrays used as indices must be of integer (or boolean) type
I have tried different ways to solve the problem, but without any success. Could you suggest some possible solutions?
Many thanks for considering my request.
Best wishes
All information about the constructed graph should be stored in graph["meta_graph"]. The visualize method should use attributes from this dictionary rather than attributes found on self.
This would allow us to separate the construction of the visualization from the construction of the graph object alone. Additionally, the graph could be saved with all of its relevant information and graphed at a later time.
What is the best way to assimilate information from both fit_transform and map? Maybe fit_transform sets what it can, and then map adds its information as well?
Hi, I am extremely sorry if I am asking some simple questions, as I am a mathematician by profession and not a programmer.
I would like to know
I would highly appreciate it if you respond.
Hi, thanks for providing the great software for TDA!
I am looking for a way to download the TDA plot with a white background, to use and cite it in a research paper. Is there a way to extract it as a clean, high-resolution plot, other than the current option of downloading the HTML with a black background?
Thanks!
Thank you for your hard work, you are appreciated.
It would be great if you considered releasing a GUI for kepler-mapper, allowing the user to:
I would be glad to donate to such a project to bring it to life.
Waiting for your reply
Now that @michiexile abstracted a nerve class, we can include support for other nerves. Most obvious are a min-intersection nerve and an n-simplices nerve. Though I believe we will also need a custom nerve for multi-mapper and we could experiment with a multi-nerve.
It would be nice to be able to set a minimum number of points that two clusters must share for them to be considered connected.
edges = nerve({"a": [1,2,3], "b": [2,3,4], "c": [4,5,6]}, min_intersection=2)
-> ["a", "b"] in edges and ["b", "c"] not in edges
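The min-intersection behavior sketched above could be implemented in a few lines. This is a sketch of the proposed semantics, not kmapper's actual nerve code:

```python
from itertools import combinations

def min_intersection_nerve(nodes, min_intersection=1):
    """Connect two clusters only if they share >= min_intersection members."""
    edges = []
    for (a, members_a), (b, members_b) in combinations(nodes.items(), 2):
        if len(set(members_a) & set(members_b)) >= min_intersection:
            edges.append([a, b])
    return edges

edges = min_intersection_nerve({"a": [1, 2, 3], "b": [2, 3, 4], "c": [4, 5, 6]},
                               min_intersection=2)
# "a" and "b" share {2, 3}; "b" and "c" share only {4}
assert ["a", "b"] in edges and ["b", "c"] not in edges
```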
It would be nice to have a general n-simplex nerve that constructs simplices of order n or less.
Before building this, is there an established format for simplices? Are there any libraries that we could use?
Most promising simplicial complex libraries found in the wild:
I'd prefer not to reinvent the wheel but I think a strong python simplicial complex library could be useful to the community.
There's a bug in dict_to_json that prevents linking from the json data back to node data.
The JSON for a given node is defined as "name": "node_id" when it should actually be "name": node_id. I've made a quick fix.
Hello All,
Really like this project; better than all the other Mapper tools in Python right now, IMO.
However, I am consistently having a problem where, whenever I include inverse_X in the mapping, the mapping returns 0 nodes and 0 edges. I have tried this with a number of different data sets without success. I have done this on data sets with 300 dimensions and 200,000 rows, with 200 and 100 dimensions, and on all your sample sets, with the same result. It's possible I am just ignorant, but I believe that all of these data sets should work with inverse_X, given it is just projecting along the original data. No errors are produced for this problem. If inverse_X is not included, the mapper works as expected.
On another note, I am going to start working on a right-click selection tool for displaying the contents of nodes, as I need this for my research. No idea how successful I'll be. I'll let you know.
The distribution plot on the right pane shows the distribution for all data points. It would be nice to show the distribution of points in each node.
One option for this could be to show a pie chart at each node, or during hover, change the histogram in the right hand pane.
I'd like to make some tweaks to the output web page, but as it stands that's slightly awkward to do; the HTML is contained within the Python as a string and so is the JS. It's particularly a problem as I'm doing these changes for someone with less programming experience who might have a hard time resolving any conflicts that could arise trying to keep a fork up-to-date.
Would splitting them out into separate files that are then read in by the Python and used to produce the finished HTML be a problem? Or potentially just setting them up as Jinja2 templates and rendering them that way? It would make the HTML/Javascript a bit neater by replacing all the %s markers with actual variable names. I could try either, but would prefer not to spend time on it if it's incompatible with the project design somehow!
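To illustrate the idea with stdlib string.Template (Jinja2 would look similar): $-placeholders replace the positional %s markers, and the template text could live in its own .html file read in by the Python. The template and variable names here are hypothetical, not the actual kmapper page:

```python
from string import Template

# Hypothetical page template; in practice this string would be loaded
# from a separate .html file shipped with the package.
page = Template("<html><head><title>$title</title></head>"
                "<body><script>var graph = $graph_json;</script></body></html>")

html = page.substitute(title="My Mapper Graph", graph_json="{}")
assert "My Mapper Graph" in html
```

Named placeholders make it much harder to mismatch the substitution order than a long chain of %s markers.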
Hi there
I can see that there is a TODO to implement a cover-defining API. I was wondering what is the best way of creating a user-defined cover at the moment (if it is possible at all). From what I can tell, we are currently restricted to an (n_bins, overlap_perc) method. Is it possible to define a cover explicitly (for one or more dimensions in the lens) using cutoff values or similar (like setting the maximum and minimum values of the covering space in each dimension)? I ask because, in its current implementation, I think the [non-]presence of an outlier can skew the covering space quite drastically.
Let me know what my options are for the covering space. I would also be interested to know the status of the above TODO. More information as to how the cover class currently works might also be useful if I was going to write my own.
Thanks!
Edit: I've modified the code so that you can pass kmapper.map a CoverBounds variable.
If CoverBounds is None, the behavior is unchanged.
However, CoverBounds can also be an (ndim_lens, 2) array, with min, max for every dimension of your lens. If the default behavior is fine for a particular dimension, pass it np.float('inf'), np.float('inf').
For example, if I have a lens in R2 and want to set the minimum and maximum of the second dimension to be 0 and 1, I can pass:
mapper.map(CoverBounds=np.array([[np.float('inf'), np.float('inf')], [0, 1]]))
and that should have the desired behavior.
Edit 2: I might change it so that rather than inf detection, it works off None detection in the CoverBounds array.
I think a system designed like this should produce exactly the same cover, independent of input data limits.
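The None-detection variant from Edit 2 could be sketched like this (resolve_bounds is a hypothetical helper, not the actual patch): per dimension, a None entry falls back to the data limits, while a number overrides them.

```python
import numpy as np

def resolve_bounds(lens, cover_bounds=None):
    """Per-dimension (min, max) of the cover; None entries use data limits."""
    lo, hi = lens.min(axis=0), lens.max(axis=0)
    if cover_bounds is not None:
        for d, (bmin, bmax) in enumerate(cover_bounds):
            if bmin is not None:
                lo[d] = bmin
            if bmax is not None:
                hi[d] = bmax
    return lo, hi

lens = np.array([[0.2, -3.0], [0.8, 7.0]])
# Dimension 0: default behavior; dimension 1: clamp the cover to [0, 1]
lo, hi = resolve_bounds(lens, cover_bounds=[(None, None), (0, 1)])
assert lo.tolist() == [0.2, 0.0] and hi.tolist() == [0.8, 1.0]
```

With explicit bounds like this, the same cover is produced regardless of outliers in the input data.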
Devs - let me hear your thoughts on this - I can clean up and submit a pull request.
There's a slightly odd interaction between the minimum cluster size and cells with few entries. In kmapper.py:372, cells are only checked for clustering if there are >= min_cluster_samples samples within them. But min_cluster_samples is set to n_clusters.
So if you set n_clusters to 3, then any cell with 3 samples in it will produce 3 separate 1-sample clusters in the output. Any cell with 2 samples will produce 0 clusters (and thus likely a different unique sample count in the output). This probably has little to no impact on the graph and is unlikely to show up except in small trial datasets, but it is a bit confusing that the parameter is reused for a different (if related) purpose.
In the breast cancer example you use the following code lines to create a custom 1-D lens with Isolation Forest:
model = ensemble.IsolationForest(random_state=1729)
model.fit(X)
lens1 = model.decision_function(X).reshape((X.shape[0], 1))
My question is: what is this lens1 doing? I mean, what kind of projection do we have? Are we projecting our data onto the predictions of the Isolation Forest? I am not sure; please help.
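To make the quoted lines concrete: decision_function returns one anomaly score per sample (higher scores are more "normal" under the isolation-forest model), and the reshape turns the scores into a column vector so each row of the data is projected to its own anomaly score. A self-contained sketch with stand-in data:

```python
import numpy as np
from sklearn import ensemble

X = np.random.RandomState(0).rand(100, 4)  # stand-in dataset

model = ensemble.IsolationForest(random_state=1729)
model.fit(X)

# One score per sample, reshaped into a (n_samples, 1) column: a 1-D lens.
lens1 = model.decision_function(X).reshape((X.shape[0], 1))
assert lens1.shape == (100, 1)
```

So the lens is not a class prediction: it orders points from most anomalous to most typical, and Mapper then covers that axis with overlapping intervals.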
The switch to using links to the JS/CSS rather than folding it into the file directly has unfortunately introduced some portability & sustainability issues.
It makes the saved HTML files dependent on the install directory of the module, and any environment used whilst running it, making redistributing them difficult. In addition, the HTML version is static whilst the JS/CSS is linked, potentially leading to old HTML files becoming unusable if the JS/CSS they depend on changes to require tags or elements not present in the old HTML.
I have the following queries:
1. How do I use Kepler Mapper with categorical data?
2. How do I use the distance function described in kmapper.py? I mean, when I ran Mapper in R, I applied it to a distance matrix. What is the idea of distance in kmapper?
3. How do I color the nodes of the graph according to my own rule? For example, if I want to color the nodes in my Mapper output graph by the number of points in each node, on a color scale?
I would highly appreciate it if you could help me out with these queries.
Thanks
I really enjoyed reading this notebook: https://github.com/MLWave/kepler-mapper/tree/master/notebooks/self-guessing#references
I'm curious though: what do the predictions from a strong self-guesser for the "blue sea star" image at the start of the notebook look like?
Hi Devs
In the confidence_graphs notebook, image tooltips are used in the visualisation. This is something we feel could be very useful in our work here.
However, when we create our image tooltips in the same way and follow the procedure outlined in the notebook, we find that rather than getting a nice array of images under MEMBERS in the output html file, all of our images overlay each other (essentially, the html tag doesn't seem to be processed correctly).
I assume this is down to a change in the visualization since the notebook was written. I'm not sure whether this is a browser issue (using Chrome). Are you aware of this, and if so, is it possible to create image tooltips?
Lee
Hi devs,
I am sure you will solve this issue. I am sending you the relevant part of the code here:
graph = mapper.map(projected_data, X, nr_cubes=14, overlap_perc=0.8,
                   clusterer=sklearn.cluster.DBSCAN(eps=15, min_samples=4))
model = ensemble.IsolationForest(random_state=1729)
model.fit(X)
usecolor = model.decision_function(X).reshape((X.shape[0], 1))
node = graph['nodes'].values()
cluster = list(node)
# My own function: I am trying to color each node by the average anomaly score of the data points in that node
z = []
for i in range(0, len(cluster)):
    z.append(np.round(np.mean(usecolor[cluster[i]]) * 10, decimals=4, out=None))
print(z)
s = np.asarray(z)
s
I am getting the following output:
Mapping on data shaped (150, 4) using lens shaped (150, 2)
Creating 196 hypercubes.
Created 57 edges and 24 nodes in 0:00:00.054112.
[0.6021, 0.4893, 0.2385, -0.2715, 0.7183, 0.7395, 0.6809, 0.4972, 0.333, 0.6087, 0.6839, 0.5749, 0.3008, 0.4609, 0.3176, 0.2205, 0.1462, -0.05, 0.4609, 0.4583, 0.0631, 0.559, 0.5166, 0.1199]
array([ 0.6021, 0.4893, 0.2385, -0.2715, 0.7183, 0.7395, 0.6809,
0.4972, 0.333 , 0.6087, 0.6839, 0.5749, 0.3008, 0.4609,
0.3176, 0.2205, 0.1462, -0.05 , 0.4609, 0.4583, 0.0631,
0.559 , 0.5166, 0.1199])
Now I go to the next part of the code:
# Visualize it
mapper.visualize(simplicial_complex, path_html="/home/dhananjay/kepler-mapper/keplermapper-iris.html", custom_meta={"Data:": "Me"}, custom_tooltips=Y,color_function=z)
But I am getting an error:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-14-0692ad6431ec> in <module>()
6 0.4972, 0.333 , 0.6087, 0.6839, 0.5749, 0.3008, 0.4609,
7 0.3176, 0.2205, 0.1462, -0.05 , 0.4609, 0.4583, 0.0631,
----> 8 0.559 , 0.5166, 0.1199]))
~/kepler-mapper/kmapper/kmapper.py in visualize(self, graph, color_function, custom_tooltips, custom_meta, path_html, title, save_file, X, X_names, lens, lens_names, show_tooltips)
507 mapper_data = format_mapper_data(graph, color_function, X,
508 X_names, lens,
--> 509 lens_names, custom_tooltips, env)
510
511 histogram = graph_data_distribution(graph, color_function)
~/kepler-mapper/kmapper/visuals.py in format_mapper_data(graph, color_function, X, X_names, lens, lens_names, custom_tooltips, env)
58 for i, (node_id, member_ids) in enumerate(graph["nodes"].items()):
59 node_id_to_num[node_id] = i
---> 60 c = _color_function(member_ids, color_function)
61 t = _type_node()
62 s = _size_node(member_ids)
~/kepler-mapper/kmapper/visuals.py in _color_function(member_ids, color_function)
219
220 def _color_function(member_ids, color_function):
--> 221 return _color_idx(np.mean(color_function[member_ids]))
222 # return int(np.mean(color_function[member_ids]) * 30)
223
IndexError: index 50 is out of bounds for axis 1 with size 24
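The traceback suggests visualize() indexes color_function with the row indices of the original data, so it needs one value per sample (150 for iris), not one per node (the 24-entry z above). A workaround sketch, with a hypothetical node-to-member mapping standing in for graph['nodes']:

```python
import numpy as np

n_samples = 150
anomaly = np.random.RandomState(0).rand(n_samples)  # stands in for usecolor
nodes = {"n0": [0, 1, 2], "n1": [2, 3, 4]}          # stands in for graph['nodes']

# Spread each node's mean anomaly score back onto its member rows.
color = np.zeros(n_samples)
for member_ids in nodes.values():
    color[member_ids] = anomaly[member_ids].mean()

# Pass `color` (length n_samples) as color_function instead of z (length n_nodes).
assert color.shape == (n_samples,)
```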
Currently, the map method assumes the clustering class has a labels_ attribute. This is not part of the API and so is not true for all sklearn clustering methods.
It would be preferable to use the fit_predict method instead to extract the cluster labels.
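The two approaches side by side, on a toy dataset with two obvious clusters:

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[0.0], [0.1], [10.0], [10.1]])

# Relying on the attribute works only for estimators that happen to set labels_:
labels_attr = DBSCAN(eps=0.5, min_samples=2).fit(X).labels_

# fit_predict is part of the scikit-learn clusterer API, so it is the safer call:
labels_pred = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)

assert (labels_attr == labels_pred).all()
```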
AttributeError Traceback (most recent call last)
in ()
9
10 # Fit to and transform the data
---> 11 projected_data = mapper.fit_transform(data, projection=[0,1]) # X-Y axis
12
13 # Create dictionary called 'complex' with nodes, edges and meta-information
C:\Users\admin\Anaconda3\lib\km.py in fit_transform(self, X, projection, scaler)
62 pass
63 print("\n..Projecting data using: \n\t%s\n"%str(projection))
---> 64 X = reducer.fit_transform(X)
65
66 # Detect if projection is a string (for standard functions)
AttributeError: 'list' object has no attribute 'fit_transform'
Hi all,
I am switching from Ayasdi to open-source Mapper and was looking to see whether Kepler-Mapper has a metric selection function, just like the "projection"/lens selection function in this class. As far as I know, the original Python Mapper implementation has a list of metric options, as follows:
https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.pdist.html
Can anyone help on this please?
-thanks.
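One practical route: any scipy pdist metric can be precomputed into a square distance matrix. The tracebacks quoted elsewhere on this page show fit_transform taking a distance_matrix argument, which looks like the intended entry point, though that is an assumption; check your installed version's signature.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.random.RandomState(0).rand(50, 4)

# Pick any metric pdist supports ("cosine", "cityblock", "correlation", ...)
D = squareform(pdist(X, metric="cosine"))
assert D.shape == (50, 50) and np.allclose(D, D.T)
```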
Currently:
Implement/Research:
Re-visit:
Hi, first of all, this is a great open source implementation of Mapper; I have been searching for one. I am trying to build a complex using some weather data.
My shape:
X.shape
(7477, 2)
Circle example shape:
data.shape
(5000, 2)
Yet I'm getting ValueError: all the input array dimensions except for the concatenation axis must match exactly.
Here's a sample of my data:
X[0]
array([ 1.02, 28. ])
I'm sure I'm doing something silly; could you guide me?
Hi, I am customizing my color function using an external variable from the same dataset (code below).
import numpy as np
import pandas as pd
import kmapper as km
from mst_clustering import MSTClustering
lens = df['var_lens']
graph = mapper.map(lens, df[vars_used_for_distance], nr_cubes=10,
overlap_perc=0.95,
clusterer=MSTClustering(cutoff=2))
# Visualization
mapper.visualize(graph, path_html="psycho-scores-MST.html",
color_function=df['var_color'].values,
custom_tooltips=df['var_color'].values)
I noticed kepler-mapper uses a discrete range (0-10) of colors, however I could not understand from the visuals.py code which statistic within the cluster it uses.
Using
for cluster in graph['nodes']:
    print(np.nanmean(df['var_color'][graph['nodes'][cluster]].values))
I get the means in each cluster, but I can't pass the values to the node colors in color_function.
I have a sufficient mathematical background, but my python skills are not very good. Congratulations for the library. It is really good and has wonderful visualizations.
Hi there,
thanks for this amazing, fancy version of Mapper! After working through a couple of datasets using km, I have a few suggestions for the next update that will hopefully be helpful to others as well:
in the 3D output, when we move the mouse over nodes, we only see the classification label (e.g., if the outcome is binary, we only see 0/1). What is not shown, but would be extremely helpful for later validation of the results with traditional statistical approaches, is the (number of) row IDs within each node. If there were a way to see how many rows (assuming your data is one ID per row) are in each cluster, it would be really informative.
After adding that feature, it might be worth adding another function that lets us select a specific cluster (assuming we have several clusters in the output) and download the row IDs in it. In this way, we can take the clusters generated by km and load them into logistic regression or other traditional approaches to find out what drives the separation of such clusters.
Thanks again and please let me know if you need extra clarification on this.
-Yuzu
We'd like all idioms used in KeplerMapper to fit seamlessly with the scikit-learn API.
What needs to be changed? Below are some things that have been brought up before, and questions to ask about the current API.
- Add a fit method and have fit_transform only run fit and return the result.
- How does the map method fit into this design? Should map be fit instead?
Please suggest other changes to the API!
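For illustration, the proposed shape could look like this sketch, where fit does the work and fit_transform comes for free from TransformerMixin. MapperEstimator and its placeholder projection are hypothetical, not the actual kmapper code:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class MapperEstimator(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        # Placeholder "projection": sum each row down to one lens value
        self.lens_ = X.sum(axis=1, keepdims=True)
        return self

    def transform(self, X):
        return self.lens_

# fit_transform is inherited from TransformerMixin: fit(X).transform(X)
lens = MapperEstimator().fit_transform(np.ones((5, 3)))
assert lens.shape == (5, 1)
```

Inheriting from BaseEstimator also gives get_params/set_params, so the estimator would slot into sklearn pipelines and grid searches.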
Hi, I installed KeplerMapper from source, and running the newsgroup notebook I still get the same error as in the recently closed issue. Is there something wrong? Thank you for the wonderful work and dedication.
mapper = km.KeplerMapper(verbose=2)
projected_X = mapper.fit_transform(X,
projection=[TfidfVectorizer(analyzer="char",
ngram_range=(1,6),
max_df=0.83,
min_df=0.05),
TruncatedSVD(n_components=100,
random_state=1729),
Isomap(n_components=2,
n_jobs=-1)],
scaler=[None, None, MinMaxScaler()])
print("SHAPE",projected_X.shape)
..Projecting data using: [TfidfVectorizer(analyzer='char', binary=False, decode_error='strict',
dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
lowercase=True, max_df=0.83, max_features=None, min_df=0.05,
ngram_range=(1, 6), norm='l2', preprocessor=None, smooth_idf=True,
stop_words=None, strip_accents=None, sublinear_tf=False,
token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
vocabulary=None), TruncatedSVD(algorithm='randomized', n_components=100, n_iter=5,
random_state=1729, tol=0.0), Isomap(eigen_solver='auto', max_iter=None, n_components=2, n_jobs=-1,
n_neighbors=5, neighbors_algorithm='auto', path_method='auto', tol=0)]
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-4-3a323e834380> in <module>()
10 Isomap(n_components=2,
11 n_jobs=-1)],
---> 12 scaler=[None, None, MinMaxScaler()])
13
14 print("SHAPE",projected_X.shape)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\kmapper\kmapper.py in fit_transform(self, X, projection, scaler, distance_matrix)
219 if self.verbose > 0:
220 print("\n..Projecting data using: %s" % (str(projection)))
--> 221 X = X[:, np.array(projection)]
222
223 # Scaling
IndexError: arrays used as indices must be of integer (or boolean) type
I'd like to talk about future directions of kepler-mapper and some work I'd like to do. Before I get too far ahead of myself, I want to make sure you (@MLWave) agree with the directions so a permanent fork won't be necessary.
Immediate steps are
Do you have a vision or direction for kepler-mapper? There is considerable new research on the method and I think kepler-mapper would be a great platform to introduce some of these ideas.
Currently, map takes arguments for nr_cubes and overlap_perc as a single integer or float. I would like the option to specify a different number of cubes and overlap for each dimension.
I imagine a test would look something like this:
lens = np.random.rand(10,3)
cover = Cover(lens, nr_cubes=[10,20,30])
assert len(cover.cubes) == 10*20*30
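The cube count in that test falls out of a Cartesian product over per-dimension grids. A sketch with a hypothetical helper (cube_centers is not the actual Cover class):

```python
import numpy as np
from itertools import product

def cube_centers(lens, nr_cubes):
    """One grid of centers per lens dimension; cubes are their product."""
    axes = [np.linspace(lens[:, d].min(), lens[:, d].max(), n)
            for d, n in enumerate(nr_cubes)]
    return list(product(*axes))

lens = np.random.RandomState(0).rand(10, 3)
centers = cube_centers(lens, [10, 20, 30])
assert len(centers) == 10 * 20 * 30
```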
After the API changes of v00008 some of the examples have not been updated.
These include
It would be really helpful for interactive analysis if the tooltip labels could be strings instead of just integer labels. Otherwise one has to reference their original label mapping to understand which integers correspond to what labels/categories. If this functionality is already implemented please demonstrate. Thank you.
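In the meantime, readable tooltips can be built from integer labels before calling visualize(). custom_tooltips takes one entry per sample (judging from the examples above), so an array of strings should work; the label names here are illustrative:

```python
import numpy as np

y = np.array([0, 1, 2, 1])  # integer class labels, one per sample
names = {0: "setosa", 1: "versicolor", 2: "virginica"}

# Map each integer label to its human-readable name
tooltip_s = np.array([names[label] for label in y])

# mapper.visualize(graph, custom_tooltips=tooltip_s, ...)
assert tooltip_s[3] == "versicolor"
```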
For example, the version of km that digits.py uses assumes that the KeplerMapper class has a reducer attribute. In the km.py given in the repository, this parameter is in the fit_transform method and is called projection. Also, digits.py uses the parameter name cluster_algorithm instead of clusterer (as it is in km.py).
I have seen that this project has gone through considerable development and improvement in the last months. About one year ago, I spent about 2 months studying TDA and its applications to (mostly scientific) data analysis. When it was time to review MAPPER, I started playing with KeplerMapper and found it extremely convenient for MAPPER-based data exploration (it was basically the only good open implementation out there).
However, I was missing some interactivity (changing the variable used for node colouring, recomputing the simplicial complex with other clustering/coverer parameters, etc.) for the exploration of the output simplicial complex. With the aim of understanding MAPPER better, I started writing my own implementation (which I called cartographer), re-using scikit-learn as much as possible, inspired by what was done in KeplerMapper.
Given that this project is now undergoing active development and is definitely more mature, with more user adoption than mine, I think it would be interesting to see whether some of the simplified API design choices and implementation changes could be ported to KeplerMapper to improve usability and performance, in case you are interested.
So that we can discuss what could be re-used within KeplerMapper, I list here the main design and implementation changes made during the rewrite (it is pretty simple and can be seen here):
Opted for separating the visualisation of the simplicial complex from its actual computation (the scikit-learn cluster-like model). My aim was to be able to adapt the visualisation details at a later stage and also have the possibility to either serve a standalone html or see the visualisation within a Jupyter Notebook/Lab.
The Mapper class inherits from ClusterMixin, and the three Mapper components can be configured in the constructor call: filterer (the transforming function reducing from the high dimensionality of the data, or from a lower dimensionality, to compute the nerve), coverer (the transformer function that defines the overlapping spaces from which the nerve is computed) and clusterer (the clustering algorithm that is actually used in the algorithm).
The coverer, see the HyperRectangleCoverer, divides the input space into overlapping regions. One trick I discovered that speeds up execution considerably is to reduce set-intersection checks to overlapping regions, by having the coverer return an overlap matrix (which could also be sparse) and checking for intersection only on those subsets.
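A toy version of that trick: only test cluster pairs whose covering regions overlap, according to a (possibly sparse) overlap matrix. The matrix and cluster memberships below are made up for illustration:

```python
# overlap[i][j] is True when the covering regions of clusters i and j overlap
overlap = [[True, True, False],
           [True, True, True],
           [False, True, True]]
clusters = {0: {1, 2}, 1: {2, 3}, 2: {9}}

edges = []
for i in clusters:
    for j in clusters:
        # Skip the expensive set intersection whenever regions cannot overlap
        if i < j and overlap[i][j] and clusters[i] & clusters[j]:
            edges.append((i, j))

assert edges == [(0, 1)]
```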
Standalone documentation with Jupyter notebooks (the output D3.js graph can be explored within a Jupyter output cell), executed with nbsphinx when the docs are built by the CI.
In addition to the features I implemented in cartographer, I also spent a few weeks thinking about how to implement other improvements (e.g. bidirectional Jupyter widget visualization, how to deal with hyper-parameters, multi-scale MAPPER approaches). I would be glad to discuss those as well and contribute to their integration.
I want to use that picture in an article; however, printing directly from the browser always has problems.
I also tried the approach in issue #73, but it failed.
I am wondering if there is a way to produce a neat picture for an article.
Thanks a lot.
Hi, I just opened PR #87. The context is that I had a precomputed distance matrix that I wanted to cluster with. I used t-SNE to scale it down to two dimensions for the filter function, but I didn't see a way to use metric='precomputed' with DBSCAN for the clustering with inverse_X, because the hypercube slicing makes things un-square. So this PR is what I did to tell Mapper to give a square matrix to the clusterer.
When using a classifier:
ValueError: shape mismatch: value array of shape (40000,2) could not be broadcast to indexing result of shape (40000,1)
All changes for v1.1 are currently incorporated in the dev branch.
Included in the release is
What needs to happen before we can deploy the next release :
Hi kepler-mapper team,
Thank you very much for your great TDA tool!
Could you please update the examples in the kepler-mapper repo according to the latest version of kmapper? I ran digits.py and got this error message:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-2-1ae2740795d8> in <module>()
51 path_html="keplermapper_digits_custom_tooltips.html",
52 graph_gravity=0.25,
---> 53 custom_tooltips=tooltip_s)
54 # Tooltips with the target y-labels for every cluster member
55 mapper.visualize(graph,
TypeError: visualize() got an unexpected keyword argument 'graph_gravity'
I checked the visualize() method definition, and indeed it has no keyword 'graph_gravity'.
I commented it out and it worked, but I noticed that although in your code visualize() is called with the default value, None, for the color_function, the node colors in my html file are different from those of the nodes in your html file posted in the examples folder (my nodes mostly have the central colors of the jet colormap, and only a few are red).
It's obvious that your plot was generated with a special color_function that is not set in digits.py.
I'd like to know a method of node coloring in the case of a filter function with values in
All references on TDA related to the Mapper algorithm explain only how to color the nodes in the case of a filter function with real values.
For the breast-cancer example you provided color_function="average_signal_cluster", but the color_function value should be a numpy.array of floats, and this string caused the error:
./visuals.py in init_color_function(graph, color_function)
13 color_function = np.arange(n_samples).reshape(-1, 1)
14 else:
---> 15 color_function = color_function.reshape(-1, 1)
16 # MinMax Scaling to be friendly to non-scaled input.
17 scaler = preprocessing.MinMaxScaler()
AttributeError: 'str' object has no attribute 'reshape'
Thanks!!! :)
The README is getting very long, and it would be nice to add more examples, tutorials, and guides. We could migrate most of the docs to another venue and leave the README as a short guide with setup and contribution instructions. To do this, I think we could use mkdocs to build the pages for us and publish the docs with ReadTheDocs.
I am running the following code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pandas import DataFrame
import kmapper as km
import sklearn
d = {'x': np.cos(np.arange(1, 100)), 'y': np.sin(np.arange(1, 100))}
df = DataFrame(data=d)
mapper = km.KeplerMapper(verbose=2)
lens = mapper.fit_transform(df)
complex = mapper.map(lens, df,
                     clusterer=sklearn.cluster.DBSCAN(eps=0.1, min_samples=5),
                     nr_cubes=10, overlap_perc=0.5)
mapper.visualize(complex, path_html="/home/dhananjay/kepler-mapper/keplermapper-fig8-xaxis.html",
                 title="fig8-xaxis")
I am getting the following output and error:
..Composing projection pipeline length 1:
Projections: sum
Distance matrices: False
Scalers: MinMaxScaler(copy=True, feature_range=(0, 1))
..Projecting on data shaped (99, 2)
..Projecting data using: sum
..Scaling with: MinMaxScaler(copy=True, feature_range=(0, 1))
Mapping on data shaped (99, 2) using lens shaped (99, 1)
Minimal points in hypercube before clustering: 1
Creating 10 hypercubes.
There are 19 points in cube_0 / 10
Found 0 clusters in cube_0
There are 10 points in cube_1 / 10
Found 0 clusters in cube_1
There are 9 points in cube_2 / 10
Found 0 clusters in cube_2
There are 11 points in cube_3 / 10
Found 0 clusters in cube_3
There are 28 points in cube_4 / 10
Found 0 clusters in cube_4
There are 17 points in cube_5 / 10
Found 0 clusters in cube_5
There are 9 points in cube_6 / 10
Found 0 clusters in cube_6
There are 9 points in cube_7 / 10
Found 0 clusters in cube_7
There are 12 points in cube_8 / 10
Found 0 clusters in cube_8
There are 14 points in cube_9 / 10
Found 0 clusters in cube_9
Created 0 edges and 0 nodes in 0:00:00.020664.
kmapper/kmapper.py:133: FutureWarning: reshape is deprecated and will raise in a subsequent release. Please use .values.reshape(...) instead
X = np.sum(X, axis=1).reshape((X.shape[0], 1))
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-42-d130d74910ae> in <module>()
12 # Visualize it
13 mapper.visualize(complex, path_html="/home/dhananjay/kepler-mapper/keplermapper-fig8-xaxis.html",
---> 14 title="fig8-xaxis")
15
/home/dhananjay/kepler-mapper/kmapper/kmapper.pyc in visualize(self, graph, color_function, custom_tooltips, custom_meta, path_html, title, save_file, inverse_X, inverse_X_names, projected_X, projected_X_names)
438 """
439
--> 440 color_function = init_color_function(graph, color_function)
441 json_graph = dict_to_json(
442 graph, color_function, inverse_X, inverse_X_names, projected_X, projected_X_names, custom_tooltips)
/home/dhananjay/kepler-mapper/kmapper/visuals.pyc in init_color_function(graph, color_function)
9 # If no color_function provided we color by row order in data set
10 # Reshaping to 2-D array is required for sklearn 0.19
---> 11 n_samples = np.max([i for s in graph["nodes"].values() for i in s]) + 1
12 if color_function is None:
13 color_function = np.arange(n_samples).reshape(-1, 1)
/usr/local/lib/python2.7/dist-packages/numpy/core/fromnumeric.pyc in amax(a, axis, out, keepdims)
2270
2271 return _methods._amax(a, axis=axis,
-> 2272 out=out, **kwargs)
2273
2274
/usr/local/lib/python2.7/dist-packages/numpy/core/_methods.pyc in _amax(a, axis, out, keepdims)
24 # small reductions
25 def _amax(a, axis=None, out=None, keepdims=False):
---> 26 return umr_maximum(a, axis, None, out, keepdims)
27
28 def _amin(a, axis=None, out=None, keepdims=False):
**ValueError: zero-size array to reduction operation maximum which has no identity**
Can you please resolve this?