Giter Site home page Giter Site logo

spark-knn-graphs's Introduction

spark-knn-graphs

Maven Central Build Status Javadocs

Spark algorithms for building and processing k-nn graphs.

Currently implemented k-nn graph building algorithms:

  • Brute force
  • NN-Descent (which supports any similarity)
  • LSH SuperBit (for cosine similarity)
  • NNCTPH (for text datasets)
  • Fast online graph building

Implemented k-nn graph processing algorithms:

  • Distributed exhaustive nearest neighbor search
  • Distributed graph based nearest neighbor search

All algorithms support custom classes as value. See an example with custom class as value.

Installation and requirements

spark-knn-graphs requires Spark 1.4.0 or above. It is currently tested with Spark versions 1.4.1, 1.5.2, 1.6.0 and 1.6.2.

Installation using Maven:

<dependency>
    <groupId>info.debatty</groupId>
    <artifactId>spark-knn-graphs</artifactId>
    <version>RELEASE</version>
</dependency>

Or check Spark Packages

Examples

Here are only a few short examples. Check the examples folder for more examples and complete code.

NN-Descent

public class NNDescentExample {

    public static void main(String[] args) {

        // Configure spark instance
        SparkConf conf = new SparkConf();
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Create some nodes
        // the value of the nodes will simply be an integer:
        List<Node<Integer>> data = new ArrayList<Node<Integer>>();
        for (int i = 0; i < 1000; i++) {
            data.add(new Node(String.valueOf(i), i));
        }
        JavaRDD<Node<Integer>> nodes = sc.parallelize(data);
        
        // Instanciate and configure NNDescent for Integer node values
        NNDescent nndes = new NNDescent<Integer>();
        nndes.setK(10);
        nndes.setMaxIterations(10);
        nndes.setSimilarity(new SimilarityInterface<Integer>() {

                    // Define the similarity that will be used
                    // in this case: 1 / (1 + delta)
                    public double similarity(Integer value1, Integer value2) {

                        // The value of nodes is an integer...
                        return 1.0 / (1.0 + Math.abs((Integer) value1 - (Integer) value2));
                    }
        });
        
        // Compute the graph...
        JavaPairRDD<Node, NeighborList> graph = nndes.computeGraph(nodes);
        
        // BTW: until now graph is only an execution plan and nothing has been
        // executed by the spark cluster...
        
        // This will actually compute the graph...
        double total_similarity = graph.aggregate(
                0.0,
                new  Function2<Double,Tuple2<Node,NeighborList>,Double>() {

                    public Double call(Double val, Tuple2<Node, NeighborList> tuple) throws Exception {
                        for (Neighbor n : tuple._2()) {
                            val += n.similarity;
                        }
                        
                        return val;
                    }
                },
                new Function2<Double, Double, Double>() {

                    public Double call(Double val0, Double val1) throws Exception {
                        return val0 + val1;
                    }
                    
                });
        
        System.out.println("Total sim: " + total_similarity);
        System.out.println(graph.first());
    }
}

LSH SuperBit

public class LSHSuperBitExample {

    public static void main(String[] args) {
        
        // Configure spark instance
        SparkConf conf = new SparkConf();
        JavaSparkContext sc = new JavaSparkContext(conf);
        
        // Create some nodes consisting of double[]
        int d = 100; // dimensions
        int n = 1000; // items
        Random r = new Random();
        List<Node<double[]>> data = new ArrayList<Node<double[]>>();
        for (int i = 0; i < n; i++) {
            double[] vector = new double[d];
            for (int j = 0; j < d; j++) {
                vector[j] = r.nextDouble() * 100;
            }
            
            data.add(new Node(String.valueOf(i), vector));
        }
        JavaRDD<Node<double[]>> nodes = sc.parallelize(data);
        
        // Configure LSHSuperBit graph builder
        LSHSuperBitDoubleArray gbuilder = new LSHSuperBitDoubleArray();
        gbuilder.setK(10);
        gbuilder.setStages(2);
        gbuilder.setBuckets(10);
        // LSH hashing requires the dimensionality
        gbuilder.setDim(d);
        
        // Build the graph...
        JavaPairRDD<Node<double[]>, NeighborList> graph = gbuilder.computeGraph(nodes);
        System.out.println(graph.first());
    }
}

spark-knn-graphs's People

Contributors

tdebatty avatar

Watchers

James Cloos avatar Li Guo avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.