Giter Site home page Giter Site logo

ognis1205 / spark-tda Goto Github PK

View Code? Open in Web Editor NEW
47.0 5.0 5.0 32.2 MB

SparkTDA is a package for Apache Spark providing Topological Data Analysis Functionalities.

License: Apache License 2.0

Scala 98.10% Python 1.90%
topological-data-analysis tda spark apache-spark machine-learning ml mllib

spark-tda's Introduction

SparkTDA

Build Status codecov.io Join the chat at https://gitter.im/ognis1205/spark-tda

The scalable topological data analysis package for Apache Spark. This project aims to implement the following features:

If you would like to know how to use and/or learn more the implementation details of the above mentioned features, please follow the links.

Status

WIP and EXPERIMENTAL. This package is still a proof-of-concept of scalable topological data analysis support for Apache Spark, hence you cannot expect that this package is ready for production use.

Examples

Mapper

2-skeltons of Reeb Diagram of MNIST (40 intervals on the 1st primcipal component with 50% overlap) 2-skeltons of Reeb Diagram of MNIST (20 intervals on the 1st primcipal component with 50% overlap)
60k images clustered in 784 dimensions without any projection loss 60k images clustered in 784 dimensions witout any projection loss

Requirements

This library requires Spark 2.0+

Building and Running Unit Tests

To compile this project, run sbt package from the project home directory. This will also run the Scala unit tests. To run the unit tests, run sbt test from the project home directory. This project uses the sbt-spark-package plugin, which provides the 'spPublish' and 'spPublishLocal' task. We recommend users to use this library with Apache Spark including the dependencies by supplying a comma-delimited list of Maven coordinates with --packages and download the package from the locally repository or official Spark Packages repository.

The package can be published locally with:

$ sbt spPublishLocal

The package can be published to Spark Packages with (requires authentication and authorization):

$ sbt spPublish

Using with Spark Shell

This package can be added to Spark using the --packages command line option. For example, to include it when starting the spark shell:

$ spark-shell --packages ognis1205:spark-tda:0.0.1-SNAPSHOT-spark2.2-s_2.11

Future Works

Mapper

  • Write Wiki
  • Implement Python APIs
  • Publish to Spark Packages
  • Benchmark
  • Consider using GraphFrames instead of plain GraphX
  • Implement some useful filter functions, e.g., Gaussian Density, Graph Laplacian, etc as transformers

Related Softwares & Projects

  1. Python Mapper
  2. TDAMapper (R)
  3. Spark Mapper (Spark)
  4. KeplerMapper (Python with GUI)

References

Mapper

  1. G. Singh, F. Memoli, G. Carlsson (2007). Topological Methods for the Analysis of High Dimensional Data Sets and 3D Object Recognition, Point Based Graphics 2007, Prague, September 2007.
  2. J. Curry (2013). Sheaves, Cosheaves and Applications, arXiv 2013
  3. T. K. Dey, F. Memoli, Y. Wang (2015), Mutiscale Mapper: A Framework for Topological Summarization of Data and Maps, arXiv 2015
  4. E. Munch, B. Wang (2015). Convergence between Categorical Representations of Reeb Space and Mapper, arXiv 2015
  5. E. Munch, B. Wang (2015). Reeb Space Approximation with Guarantees, The 25th Fall Workshop on Computational Geometry 2015.
  6. H. E. Kim (2015). Evaluating Ayasdi's Topological Data Analysis for Big Data, Master Thesis, Goethe University Frankfurt 2015.

KNN/ANN/SNN

  1. L. Ting, et al (2004). An investigation of practical approximate nearest neighbor algorithms, Advances in neural information processing systems. 2004.
  2. L. Ting, C. Rosenberg, H. Rowley (2007). Clustering billions of images with large scale nearest neighbor search. Applications of Computer Vision, 2007. WACV'07. IEEE Workshop on. IEEE, 2007.
  3. D. Ravichandran, P. Pantel, E. Hovy (2005). Randomized algorithms and NLP: using locality sensitive hash function for high speed noun clustering, ACL '05 Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics pp 622-629
  4. M. Steinbach, L. Ertoez, V. Kumar (2004). The Challenges of Clustering High Dimensional Data, New Directions in Statistical Physics, pp 273-309
  5. L. Ertoez, M. Steinbach, Vipin Kumar (2003). Finding clusters of different sizes, shapes, and densities in noisy, high dimensional data, Proceedings of the Third SIAM International Conference on Data Mining, 2003.
  6. M. E. Houle, H. P. Kriegel, P. Kroeger, E. S. A. Zimek (2010). Can Shared-Neighbor Distances Defeat the Curse of Dimensionality?, Proceedings of the 22nd International Conference on Scientific and Statistical Database Management, 2010.

LSH

  1. M. S. Charikar (2002). Similarity Estimation Techniques from Rounding Algorithms, 34th STOC, 2002.

spark-tda's People

Contributors

ognis1205 avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar

spark-tda's Issues

Use Docker

Is your feature request related to a problem? Please describe.

Nope!

Describe the solution you'd like

Use Docker for making handy and easy-to-use package.

Additional context

N/A

Estimate buffer width for each cover intervals.

Is your feature request related to a problem? Please describe.

This is not related to any problems. For now, buffer width for VP tree search is estimated for
entire data. But in practical datasets, preferable buffer width is computed for each intervals of
cover. VP tree construction is designed to handle data skew, this implies that the buffer width
would be change for different intervals.

Describe the solution you'd like

In ReebDiagram estimator, buffer width estimation will be done for each intervals.

Additional context

N/A

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.