neo4j-contrib / neo4j-graph-algorithms Goto Github PK
View Code? Open in Web Editor NEWEfficient Graph Algorithms for Neo4j
Home Page: https://github.com/neo4j/graph-data-science/
License: GNU General Public License v3.0
Efficient Graph Algorithms for Neo4j
Home Page: https://github.com/neo4j/graph-data-science/
License: GNU General Public License v3.0
Follows #7
For smaller graphs, the LightGraph could use int[] instead of IntArray for the adjacency list and int[] instead of long[] for the offsets, which does reduce memory consumption and one indirection.
We should explain how exactly the LightGraph and HeavyGraph encode the graph and how their internals work.
Currently we have conditional checks on any get the see if we have a mapping at all or if we just have to return the default value.
We should turn WeightMapping into an interface with at most two implementations.
One is the current implementation and the second one always returns a default value and never stores.
The Graph currently provides multiple ways to access the graph data. We should decide for one with the consumer based API being the favorite.
run JMH benchmarks with csv output or something similar machine-parsable. Dump results over time to someplace.
Some algorithms operate on all nodes (e.g. PageRank) and instead of returning a large list of results, we should write the result back to the graph. Writes can be be partitioned which makes them embarrassingly parallel.
Graph: #4 (comment)
IdMap: #4 (comment)
Similar to #16, the provided values should be validated before building the Graph.
The WeightMapping assumes a default weight of 0.0
. The number should be configurable per algorithm. Since the defaults aren't stored, this would allow algorithms to reduce memory usage by not storing default weights if they require a different default from 0.0
Test that providing invalid/unknown labels/types/properties behave as they are expected to.
Instead of allocating a List<Node> from Cypher while calling the procedure, we can accept a Cypher statement and run it ourselves, using the far more efficient PLongIterators.
The GraphView currently relies on node Ids smaller then int.max. We should add some kind of lazy IdMapping if it is considerable to support graphs (or subsets of a graph) with Ids which exceed the int size.
The test procedures at https://github.com/neo4j-contrib/neo4j-graph-algorithms/blob/607ec2c03ae57fd0edca75f25da3b6ad4177361d/algo/src/main/java/org/neo4j/graphalgo/impl/TestProcedure.java have unsafe configurability.
This general purpose procedure allows loading the graph once and then allows multiple, differently configured algorithms on top of it, e.g. also page-rank with different configs or page-rank and clustering
We can also consider one algorithm feeding into the next. E.g. the first-page-rank is not (just) persisted into the graph but immediately (with the in-memory computed values) taken into account for the next algorithm (centrality or clustering)
We should try to use Neo4j's primitive collections where possible and document and explain, when we use a different collection.
Where we settle on a third-party collection, we can think about changing to algorithm to be able to use a Neo4j collections instead, or PR a change to the Neo4j collections.
Not every algorithm needs every dimension of the graph. For example, PageRank only requires incoming relationships on for outgoing ones, it just needs the degree.
We should find an API that allows us to express those requirements, so that we don't have to load and store all outgoing relationships, just their degree.
org.neo4j.graphalgo.core.heavyweight.HeavyGraphTest is currently ignored due to failing tests (weighted iterator / weighted forEach). Fix weight mapping and unignore test.
So far we assumed edge weights to be double values in [0, 1) and we have a basic mapper logic for turning arbritary property objects into doubles. We also have to duplicate the relationship-iterator ifaces for the weighted versions. This make the handling of weights very inflexible. With the api2 approach we could consider to switch over to a weight-datasource instead of haven them bound to the iterator.
Since we consider one property per relation and one relation between a pair of nodes we could use the mapped-nodeIds as key-pair.
I'd suggest something like this
wheightOf(sourceNodeId:int, targetNodeId:int):double
This would reduce the amount of different ifaces and implementations. We could also have different impl. for the wheight-source with their own characteristics.
We have a large amount of objects in indirections in AdjacencyMatrix. We should reduce the overhead by finding a better adjacency encoding.
GraphView loads the ReadOperations in the constructor, which bind the graph the the same thread.
We should make it such that it can be used from multiple threads, esp. if we start to implement parallel graph algorithms.
Something that's useful / useable, like label-prop or union-find, I leave the choice up to you.
Reasons:
One relevant feature would be to consider the "weight" property in a relationship for the "strength" of the connection to a cluster.
As a simple solution to start with, could be to filter relationships to consider "weight" as a filter, e.g. only consider relationship-input that exceeds a certain weight at all.
Just by the looks of it, the HeavyGraph shouldn't be that much faster than the LightGraph.
Let's see if we can figure out why the difference is how it is and whether we can make LightGraph faster.
Loading and writing graph into a file serialization might be useful for others besides our tests
The GraphView doesn't care about the restrictions given in its constructor. To implement more UnitTests we need a fully working GraphView.
The current batchSize is nodeCount / nrOfThreads
. It would be better to use a fixed batch size like 10k oder 100k.
The set*
methods are uncommon for the fluent/builder style interface of the GraphLoader, better would be to use with*
The HeavyGraph ignores weights when loading happens in parallel but Weights are still required.
I suggest to add Floyd Warshall algorithm to compute the shortest path from many sources to many destinations and to call it directly using APOC Procedures.
Thank you
Currently we have to initialize the RelationContainer.Builder with the degree. Add a logic which grows the relationship array on demand. Also check if growing is an option for the parent (data) array too.
The current WeightMap is not threadsafe for write access. Evaluate int->double / int->int backed mapping logic. To implement this we first need some kind of mapping between the long-relationship ids and their inner representation
Since we might decide to get rid of the Iterator-methods (in #29) we should add the possibility to terminate the iteration within a forEach(..) method before all values have been emitted. This could be implemented by changing the Consumer into Functions which return a Boolean that either stops or continues the current iteration.
In our current approach the Id-mapping returns intergers starting at 0. Yet there is often the case where nodeId-arrays have to be initialized with some kind of start value. An 1-based mapping could save us some initialization loops.
The HeavyGraphParallelLoadingTest occasionally fails with AIOOBEs thrown by the GraphFactory. Tests should always be deterministic.
The Manager should decide which graph or config to use if no further configuration is specified in the cypher statement.
https://github.com/mknblch/graphtest/commit/33fba7f591fb6e8bf5a1f172aea6d7e304b55717 has removed a optimization of IntArray, that uses a single page.
It should be revived and added, but for a larger array size than just a single page.
We need another Setter in the GraphLoader for the propertyDefaultWeight which just overwrites the actual defaultWeight but leaves the relation type unchanged.
Some arrays are allocated and then pre-filled with a default value. Array.fill
does a naïve linear iteration over all array indices and sets each element, which can become quite inefficient for large arrays. As this happens mostly for arrays that store something pre-node, we have a linear dependency on the number of nodes that we can strive to eliminate.
arraycopy
ing from pre-filled arrayscreate card for each algorithm track progress in
also note findings, links, ideas from research / reading in each card
and also limitations of the current implementation / alternatives
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.