neo4j-graph-algorithms issues (neo4j-contrib)

int[] specialized form of LightGraph

Follows #7

For smaller graphs, the LightGraph could use int[] instead of IntArray for the adjacency list and int[] instead of long[] for the offsets, which reduces memory consumption and removes one indirection.
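
As a rough illustration of the idea, here is a minimal sketch of an adjacency list stored in two plain int[] arrays (offsets plus targets). The class name IntCsrGraph and its methods are hypothetical and do not reflect the actual LightGraph internals; the layout is only valid while the relationship count fits into an int.

```java
import java.util.Arrays;

// Hypothetical sketch, not the project's LightGraph.
final class IntCsrGraph {
    private final int[] offsets;   // targets of node n live in adjacency[offsets[n] .. offsets[n + 1])
    private final int[] adjacency; // target node ids, grouped by source node

    // edges[i] = {sourceNodeId, targetNodeId}, node ids in [0, nodeCount)
    IntCsrGraph(int nodeCount, int[][] edges) {
        offsets = new int[nodeCount + 1];
        // Count the out-degree of every source node, shifted by one slot.
        for (int[] edge : edges) {
            offsets[edge[0] + 1]++;
        }
        // A prefix sum turns the degrees into start offsets.
        for (int n = 0; n < nodeCount; n++) {
            offsets[n + 1] += offsets[n];
        }
        adjacency = new int[edges.length];
        // cursor[n] is the next free slot inside the slice of node n.
        int[] cursor = Arrays.copyOf(offsets, nodeCount);
        for (int[] edge : edges) {
            adjacency[cursor[edge[0]]++] = edge[1];
        }
    }

    int degree(int node) {
        return offsets[node + 1] - offsets[node];
    }

    int target(int node, int index) {
        return adjacency[offsets[node] + index];
    }
}
```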

Document graph encoding

We should explain how exactly the LightGraph and HeavyGraph encode the graph and how their internals work.

Turn WeightMapping into interface with two cases

Currently we have conditional checks on every get to see whether we have a mapping at all or whether we just have to return the default value.
We should turn WeightMapping into an interface with at most two implementations.
One is the current implementation and the second one always returns a default value and never stores.
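
A minimal sketch of what the split could look like. The method signatures and the two class names (MapWeightMapping, DefaultWeightMapping) are assumptions made for illustration; the real WeightMapping may key weights differently.

```java
import java.util.HashMap;
import java.util.Map;

// Assumed shape of the interface; the signatures are illustrative only.
interface WeightMapping {
    double get(long relationshipId);

    void set(long relationshipId, double weight);
}

// Case 1: stores weights and falls back to the default for unknown ids.
final class MapWeightMapping implements WeightMapping {
    private final Map<Long, Double> weights = new HashMap<>();
    private final double defaultWeight;

    MapWeightMapping(double defaultWeight) {
        this.defaultWeight = defaultWeight;
    }

    @Override
    public double get(long relationshipId) {
        return weights.getOrDefault(relationshipId, defaultWeight);
    }

    @Override
    public void set(long relationshipId, double weight) {
        weights.put(relationshipId, weight);
    }
}

// Case 2: never stores anything and always answers with the default.
final class DefaultWeightMapping implements WeightMapping {
    private final double defaultWeight;

    DefaultWeightMapping(double defaultWeight) {
        this.defaultWeight = defaultWeight;
    }

    @Override
    public double get(long relationshipId) {
        return defaultWeight;
    }

    @Override
    public void set(long relationshipId, double weight) {
        // Intentionally a no-op: this variant holds no state.
    }
}
```

With this split, callers no longer branch on "is there a mapping"; the choice is made once when the graph is loaded.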

Decide for one approach of Iterator vs Consumer vs Lookup in Graph API

The Graph currently provides multiple ways to access the graph data. We should decide on one, with the consumer-based API being the favorite.

#4 (comment)

JMH recording

Run JMH benchmarks with CSV output or a similar machine-parsable format. Dump the results somewhere so they can be tracked over time.

Write results of algorithms to the graph

Some algorithms operate on all nodes (e.g. PageRank) and instead of returning a large list of results, we should write the result back to the graph. Writes can be partitioned, which makes them embarrassingly parallel.
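
The partitioning could be sketched roughly as follows. NodeResultWriter and PartitionedWriter are hypothetical names; in a real procedure the writer would set a node property inside a transaction per batch, which is left out here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical callback that persists one result value for one node.
interface NodeResultWriter {
    void write(int nodeId, double value);
}

final class PartitionedWriter {
    // Splits the node range into batches and writes each batch on a pool thread.
    static void writeAll(double[] results, int batchSize, int threads, NodeResultWriter writer) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> tasks = new ArrayList<>();
            int start = 0;
            while (start < results.length) {
                final int from = start;
                final int to = results.length - start > batchSize ? start + batchSize : results.length;
                // Batches cover disjoint node ranges, so they need no coordination.
                tasks.add(pool.submit(() -> {
                    for (int node = from; node < to; node++) {
                        writer.write(node, results[node]);
                    }
                }));
                start = to;
            }
            // Wait for all batches and surface the first failure, if any.
            for (Future<?> task : tasks) {
                task.get();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while writing results", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("writing a batch failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```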

JavaDocument public Graph API

Graph: #4 (comment)
IdMap: #4 (comment)

Validation of GraphLoader values

Similar to #16, the provided values should be validated before building the Graph.

#4 (comment)

Investigate using Hilbert Curves for better cache locality

Custom default weight

The WeightMapping assumes a default weight of 0.0. The number should be configurable per algorithm. Since the defaults aren't stored, this would allow algorithms to reduce memory usage by not storing default weights if they require a different default from 0.0

Add negative tests for all GraphLoaders

Test that providing invalid/unknown labels/types/properties behaves as expected.

Load input for Graphs via Cypher

Instead of allocating a List<Node> from Cypher while calling the procedure, we can accept a Cypher statement and run it ourselves, using the far more efficient PLongIterators.

GraphView ID handling relies on int values

The GraphView currently relies on node ids smaller than int.max. We should add some kind of lazy IdMapping if we want to support graphs (or subsets of a graph) whose ids exceed the int range.

Replace LongToIntFunction with a more domain specific interface in GraphLoaders

Configuration of procedure API

The test procedures at https://github.com/neo4j-contrib/neo4j-graph-algorithms/blob/607ec2c03ae57fd0edca75f25da3b6ad4177361d/algo/src/main/java/org/neo4j/graphalgo/impl/TestProcedure.java have unsafe configurability.

parallelism is done by a copied procedure of a different name but should instead be a config parameter
there is no validation of parameters
there are no default parameters
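
One way to address the three points above is a small config object that reads an optional map, applies defaults, and validates. This is only a sketch: the class name ProcedureConfig, the key "parallelism", and the default of 4 are assumptions, not existing project code.

```java
import java.util.Map;

// Hypothetical config holder for a procedure call.
final class ProcedureConfig {
    // Assumed default; the project may settle on a different value.
    static final int DEFAULT_PARALLELISM = 4;

    final int parallelism;

    private ProcedureConfig(int parallelism) {
        this.parallelism = parallelism;
    }

    // Builds a config from user-supplied values, filling in defaults and
    // rejecting values that make no sense.
    static ProcedureConfig of(Map<String, Object> raw) {
        Object value = raw.getOrDefault("parallelism", DEFAULT_PARALLELISM);
        if (!(value instanceof Number)) {
            throw new IllegalArgumentException("parallelism must be a number, got: " + value);
        }
        int parallelism = ((Number) value).intValue();
        if (parallelism < 1) {
            throw new IllegalArgumentException("parallelism must be >= 1, got: " + parallelism);
        }
        return new ProcedureConfig(parallelism);
    }
}
```

A single procedure could then accept the map as its last argument instead of existing in a sequential and a parallel copy.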

General-purpose, power procedure

This general-purpose procedure loads the graph once and then allows running multiple, differently configured algorithms on top of it, e.g. PageRank with different configs, or PageRank and clustering.

procedure pipelines

We can also consider one algorithm feeding into the next. E.g. the first-page-rank is not (just) persisted into the graph but immediately (with the in-memory computed values) taken into account for the next algorithm (centrality or clustering)

Revise primitive collections

We should try to use Neo4j's primitive collections where possible and document and explain, when we use a different collection.
Where we settle on a third-party collection, we can think about changing the algorithm to be able to use a Neo4j collection instead, or PR a change to the Neo4j collections.

Remove relation ids from Graph API

Allow Graphs to skip loading of certain relationships

Not every algorithm needs every dimension of the graph. For example, PageRank only requires incoming relationships; for outgoing ones, it just needs the degree.
We should find an API that allows us to express those requirements, so that we don't have to load and store all outgoing relationships, just their degree.

Fix weight mapping in heavy graph

org.neo4j.graphalgo.core.heavyweight.HeavyGraphTest is currently ignored due to failing tests (weighted iterator / weighted forEach). Fix weight mapping and unignore test.

Considerations on weights

So far we have assumed edge weights to be double values in [0, 1), and we have basic mapper logic for turning arbitrary property objects into doubles. We also have to duplicate the relationship-iterator interfaces for the weighted versions. This makes the handling of weights very inflexible. With the api2 approach we could consider switching to a weight data source instead of having the weights bound to the iterator.

Since we consider one property per relation and one relation between a pair of nodes we could use the mapped-nodeIds as key-pair.

I'd suggest something like this

weightOf(sourceNodeId:int, targetNodeId:int):double

This would reduce the number of different interfaces and implementations. We could also have different implementations of the weight source, each with its own characteristics.
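
A minimal Java sketch of such a weight source, keyed by the pair of mapped node ids. The names WeightSource and PairKeyWeightSource are hypothetical, the lookup method is spelled weightOf here, and the boxed HashMap is only a placeholder for whatever primitive collection would be used in practice.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical interface for the suggested lookup.
interface WeightSource {
    double weightOf(int sourceNodeId, int targetNodeId);
}

// One possible implementation: pack both int ids into a single long key.
final class PairKeyWeightSource implements WeightSource {
    private final Map<Long, Double> weights = new HashMap<>();
    private final double defaultWeight;

    PairKeyWeightSource(double defaultWeight) {
        this.defaultWeight = defaultWeight;
    }

    void put(int sourceNodeId, int targetNodeId, double weight) {
        weights.put(key(sourceNodeId, targetNodeId), weight);
    }

    @Override
    public double weightOf(int sourceNodeId, int targetNodeId) {
        return weights.getOrDefault(key(sourceNodeId, targetNodeId), defaultWeight);
    }

    // Source id in the upper 32 bits, target id in the lower 32 bits.
    static long key(int sourceNodeId, int targetNodeId) {
        return ((long) sourceNodeId << 32) | (targetNodeId & 0xFFFFFFFFL);
    }
}
```

This relies on the assumption stated above: at most one relationship, with one property, between a pair of nodes.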

JMH Benchmarks for Shortest Path

RelationContainer.Builder doesn't care about the given label-id

Simplify Matrix representation

We have a large number of objects and indirections in AdjacencyMatrix. We should reduce the overhead by finding a better adjacency encoding.

Allow concurrent access in GraphView

GraphView loads the ReadOperations in the constructor, which binds the graph to the same thread.
We should make it such that it can be used from multiple threads, esp. if we start to implement parallel graph algorithms.

First clustering graph algorithm

Something that's useful / useable, like label-prop or union-find, I leave the choice up to you.

Reasons:

we have a practical use-case that would benefit from it
we want to exercise the graph-API also from different kinds of algorithms

One relevant feature would be to consider the "weight" property in a relationship for the "strength" of the connection to a cluster.

A simple solution to start with could be to use "weight" as a filter, e.g. only consider relationships whose weight exceeds a certain threshold.

Investigate performance differences between Light and Heavy Graph

Just by the looks of it, the HeavyGraph shouldn't be that much faster than the LightGraph.
Let's see if we can figure out why the difference is how it is and whether we can make LightGraph faster.

ThreadPool handling

optionally use Neo4j Thread Pool like in APOC or via dependency resolver of GD-API

Move Fileloaders back to core module

Loading and writing graph into a file serialization might be useful for others besides our tests

Move every loader to the core module
Replace reflective access with package-private access
document accessors and constructors, why they are package private

Don't rely on availableProcessors as a default/fallback

The default should be something unrelated to the number of host processors. With #5 and #16 implemented, the impact of the default parallelism should be minimal.

Restrict GraphView to use only supplied label/relation/property/defaultValue

The GraphView doesn't care about the restrictions given in its constructor. To implement more UnitTests we need a fully working GraphView.

implement restriction for label-, relation and property types.

Smaller and fixed batch sizes for parallel imports

The current batchSize is nodeCount / nrOfThreads. It would be better to use a fixed batch size like 10k or 100k.

Better work stealing possibilities, if one batch contains mostly deleted nodes
More predictable resource usage, temporary arrays could be reused for multiple batches
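
A small sketch of how fixed-size batches could be handed out to worker threads. BatchQueue is a hypothetical helper, not part of the current loaders; each thread calls claim() until it returns null, so a thread that finishes a cheap batch simply picks up the next one.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper that hands out fixed-size node ranges.
final class BatchQueue {
    private final AtomicInteger next = new AtomicInteger();
    private final int nodeCount;
    private final int batchSize;
    private final int batchCount;

    BatchQueue(int nodeCount, int batchSize) {
        this.nodeCount = nodeCount;
        this.batchSize = batchSize;
        // Round up so the last, possibly smaller, batch is included.
        this.batchCount = (int) (((long) nodeCount + batchSize - 1) / batchSize);
    }

    int batchCount() {
        return batchCount;
    }

    // Returns {startInclusive, endExclusive} of the next unclaimed batch,
    // or null when all batches have been handed out. Safe to call concurrently.
    int[] claim() {
        int index = next.getAndIncrement();
        if (index >= batchCount) {
            return null;
        }
        long start = (long) index * batchSize;
        long end = Math.min(start + batchSize, nodeCount);
        return new int[] {(int) start, (int) end};
    }
}
```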

Parallel loading for LightGraph

document internal workings of UndirectedTree

Add SCC and Dijkstra algos from graph test repo

Better speaking names in GraphLoader interface

The set* methods are uncommon for the fluent/builder style interface of the GraphLoader, better would be to use with*

Implement parallel WeightMapping

The HeavyGraph ignores weights when loading happens in parallel but Weights are still required.

Floyd Warshall algorithm

I suggest adding the Floyd-Warshall algorithm to compute the shortest paths from many sources to many destinations and to call it directly using APOC procedures.

Thank you
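
For reference, a minimal dense-matrix sketch of the textbook Floyd-Warshall algorithm. It is independent of the project's Graph API; the input convention (0.0 on the diagonal, Double.POSITIVE_INFINITY for missing edges, no negative cycles) is an assumption of this example.

```java
// Textbook all-pairs shortest paths on a dense weight matrix.
final class FloydWarshall {
    // weights[i][j] is the edge weight from i to j, 0.0 on the diagonal and
    // Double.POSITIVE_INFINITY where no edge exists. The input is not modified.
    static double[][] shortestPaths(double[][] weights) {
        int n = weights.length;
        double[][] dist = new double[n][];
        for (int i = 0; i < n; i++) {
            dist[i] = weights[i].clone();
        }
        // Allow paths through intermediate node k, for each k in turn.
        for (int k = 0; k < n; k++) {
            for (int i = 0; i < n; i++) {
                for (int j = 0; j < n; j++) {
                    double viaK = dist[i][k] + dist[k][j];
                    if (viaK < dist[i][j]) {
                        dist[i][j] = viaK;
                    }
                }
            }
        }
        return dist;
    }
}
```

The matrix needs O(n^2) memory and the loops O(n^3) time, so this form only suits small graphs or subgraphs.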

Autogrowing array in RelationContainer

Currently we have to initialize the RelationContainer.Builder with the degree. Add logic that grows the relationship array on demand. Also check whether growing is an option for the parent (data) array too.

Thread safe WeightMap

The current WeightMap is not threadsafe for write access. Evaluate int->double / int->int backed mapping logic. To implement this we first need some kind of mapping between the long-relationship ids and their inner representation

Change Consumer into Functions to allow premature termination

Since we might decide to get rid of the Iterator-methods (in #29) we should add the possibility to terminate the iteration within a forEach(..) method before all values have been emitted. This could be implemented by changing the Consumer into Functions which return a Boolean that either stops or continues the current iteration.
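
A minimal sketch of the idea, using java.util.function.IntPredicate (primitive boolean rather than a boxed Boolean) as the callback type. The class EarlyExitIteration and the plain int[] of targets are stand-ins for illustration, not the Graph API.

```java
import java.util.function.IntPredicate;

// Hypothetical stand-in for a forEach method on the graph.
final class EarlyExitIteration {
    // Calls the visitor for each target until it returns false.
    // Returns the number of targets that were visited.
    static int forEachTarget(int[] targets, IntPredicate visitor) {
        int visited = 0;
        for (int target : targets) {
            visited++;
            if (!visitor.test(target)) {
                break; // the visitor asked to stop
            }
        }
        return visited;
    }
}
```

For example, `EarlyExitIteration.forEachTarget(targets, t -> t != wanted)` stops as soon as the wanted target is seen.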

Consider IdMapping starting at 1 (0 exclusive)

In our current approach the id mapping returns integers starting at 0. Yet nodeId arrays often have to be initialized with some kind of start value. A 1-based mapping could save us some initialization loops.
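
To make the trade-off concrete, here is a small sketch of a 1-based mapping in which 0 means "not mapped". OneBasedIdMap is a hypothetical class, and the boxed HashMap merely stands in for a primitive collection.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical mapping from original long ids to dense ints starting at 1.
final class OneBasedIdMap {
    private final Map<Long, Integer> toMapped = new HashMap<>();

    // Returns the mapped id for the original id, assigning the next free one if needed.
    int add(long originalId) {
        Integer existing = toMapped.get(originalId);
        if (existing != null) {
            return existing;
        }
        int mapped = toMapped.size() + 1; // the first id is 1, never 0
        toMapped.put(originalId, mapped);
        return mapped;
    }

    // Returns the mapped id, or 0 if the original id is unknown.
    int get(long originalId) {
        return toMapped.getOrDefault(originalId, 0);
    }

    int size() {
        return toMapped.size();
    }
}
```

An array such as `new int[map.size() + 1]` then starts out with every entry meaning "no node" without any fill loop, at the cost of one unused slot at index 0.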

Write benchmarks for larger graphs

couple of million nodes and edges (wikipedia/dbpedia size)
Singleshot execution
fewer iterations and warmups

Remove nondeterminism in HeavyGraphParallelLoadingTest

The HeavyGraphParallelLoadingTest occasionally fails with AIOOBEs thrown by the GraphFactory. Tests should always be deterministic.

Evaluate a manager concept for graph loading

The Manager should decide which graph or config to use if no further configuration is specified in the cypher statement.

Include all benchmarks from graphtest repository

Add single array optimization for IntArray

https://github.com/mknblch/graphtest/commit/33fba7f591fb6e8bf5a1f172aea6d7e304b55717 has removed an optimization of IntArray that uses a single page.
It should be revived and added, but for a larger array size than just a single page.

Add Setter in GraphLoader for propertyDefaultWeight

We need another Setter in the GraphLoader for the propertyDefaultWeight which just overwrites the actual defaultWeight but leaves the relation type unchanged.

Faster Array.fill for large arrays

Some arrays are allocated and then pre-filled with a default value. Array.fill does a naïve linear iteration over all array indices and sets each element, which can become quite inefficient for large arrays. As this happens mostly for arrays that store something per node, we have a linear dependency on the number of nodes that we can strive to eliminate.

fill in batches by arraycopying from pre-filled arrays
check if we can maybe rewrite some algorithms to make use of the system default value, instead of having to use a custom default value
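
One variant of the first idea, sketched below, uses the already-filled prefix of the array itself as the copy source and doubles it with System.arraycopy. FastFill is a hypothetical helper; whether it actually beats a plain fill on a given JVM would need to be confirmed with a benchmark.

```java
// Hypothetical helper: fills an array by repeatedly copying its filled prefix.
final class FastFill {
    static void fill(double[] array, double value) {
        int length = array.length;
        if (length == 0) {
            return;
        }
        array[0] = value;
        int filled = 1;
        while (filled < length) {
            // Copy as much of the filled prefix as still fits.
            int count = Math.min(filled, length - filled);
            System.arraycopy(array, 0, array, filled, count);
            filled += count;
        }
    }
}
```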

turn list of graph algorithms into issues with progress tracking

create a card for each algorithm to track progress in

also note findings, links, ideas from research / reading in each card
and also limitations of the current implementation / alternatives

Implement the selected algorithms and expose them as a Java API and as Java stored procedures
- PageRank
- Label Propagation
- Louvain
- Betweenness Centrality
- Closeness Centrality
- Degree Centrality
- Single Shortest Path
- Strongly, Weakly connected components
- Parallel BFS / DFS
Evaluation
- Performance Tests (different dataset sizes, regressions, level of concurrency)
- Performance comparison with other implementations (Spark/Flink/Graphlytic) on same dataset and same level of concurrency
- Review existing implementations for their functional correctness and performance:
  - A*
  - Dijkstra
  - Shortest Path
