
pometry / raphtory


Scalable graph analytics database powered by a multithreaded, vectorized temporal engine, written in Rust

Home Page: https://raphtory.com

License: GNU General Public License v3.0

Python 17.51% Rust 82.33% HTML 0.01% JavaScript 0.08% Dockerfile 0.03% Shell 0.03%
analytics database embedded-database graph graph-database neo4j olap python rust temporal time-series

raphtory's People

Contributors

alnaimi-, aw4309, brandon-haugen, d4rkisek, dependabot[bot], dullaz, fabianmurariu, felixcdr, github-actions[bot], haaroon, hallofstairs, haoxins, hellekev, jamesalford, jatindersangha, lejohnyjohn, ljeub, ljeub-pometry, louisch, miratepuffin, narnolddd, nrs1729, peijie-zhong, pometry-team, rachchan, ricopinazo, rutujasurve94, shivam-880

raphtory's Issues

Fine grained clearing of temporary analytical state

Currently, after a flattening algorithm completes, the algorithm state of vertices is wiped by discarding the whole map and creating a new one. This is because removing individual objects from the map is drastically slow and requires tracking every property name that was added. This, however, means that flattenings running in parallel may remove the state of another algorithm. There therefore needs to be a higher-level controller that flushes this data only when no algorithms are currently running (i.e. returning results has completed on all flattenings).
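As a rough illustration of the controller idea, the sketch below counts running flattenings and only swaps in a fresh state map once the count drops to zero. The class and method names (`AnalysisStateController`, `start`, `finish`) are hypothetical and are not part of Raphtory.

```python
import threading


class AnalysisStateController:
    """Hypothetical controller: flushes vertex state only when no flattening is running."""

    def __init__(self):
        self._lock = threading.Lock()
        self._running = 0
        self.vertex_state = {}  # vertex id -> {property name -> value}
        self.flush_count = 0

    def start(self):
        # Called when a flattening begins its analysis.
        with self._lock:
            self._running += 1

    def finish(self):
        # Called once a flattening has returned its results.
        with self._lock:
            self._running -= 1
            if self._running == 0:
                # Cheap wipe: replace the whole map rather than removing keys.
                self.vertex_state = {}
                self.flush_count += 1


controller = AnalysisStateController()
controller.start()   # flattening A
controller.start()   # flattening B
controller.vertex_state[1] = {"rank": 0.5}
controller.finish()  # A is done, B is still running, so state is kept
state_kept = 1 in controller.vertex_state
controller.finish()  # B is done, nothing is running, so state is flushed
```

This keeps the fast whole-map wipe while avoiding the case where one flattening clears state another is still using.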

Fix complex entity filtering

Currently, if a user filters vertices based on entity type or property, some vertices may be removed, but their neighbours will be unaware of this. Any messages sent to a removed vertex will then raise an error. This also matters for operations such as edge counting, which would otherwise include edges to filtered vertices.

Dynamic scaling of cluster

We can currently monitor the utilisation of resources within a cluster, and could deploy more containers automatically via Docker Swarm. This is not currently implemented, but Routers can already be brought into a cluster as they are stateless. New Spouts may join to push/pull new data. Live Analysis Managers may join, but the Analyser must already be present (see #6). Partition Managers are currently static as entities cannot migrate (#31).

Add iterative functionality to Live Analysis Managers

Currently, Live Analysis Managers can only probe the graph, requesting that an Analyser be run by the Partition Managers. These receive the Vertex and Edge map, but cannot save data to the Partition Manager.

We therefore need to add an area within each entity to store processing information (such as PageRank values).

Speed up vertex mailbox

The current implementation of the vertex mailbox/multi-queue is clearly using the wrong data structures, as it is by far the slowest part of analysis (even in a local deployment where we are passing by reference and the network stack can be ignored). This must initially be improved by a more appropriate choice of structure, but may require a full redesign.

Finalise conversion of Routers to pull based

Initial testing of Routers in more of a pull-based model from the Spouts largely reduces their crashing. We need to remove the current buffering of data within the Spout, as this is now causing issues.

Global Analytical State

For many algorithms, such as PageRank and Hub/Authority, there needs to be a small global state maintained. This should be done by allowing the Analysis Manager to aggregate the values reported by the partitions and send the result back with the next analyse superstep request.
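A minimal sketch of this aggregate-and-report loop, assuming each partition returns a partial value per superstep and the manager attaches the aggregate to the next request. All names here (`SuperstepRequest`, `Partition`, `AnalysisManager`, the `"total"` key) are hypothetical, and the normalisation step merely stands in for a real algorithm.

```python
from dataclasses import dataclass


@dataclass
class SuperstepRequest:
    step: int
    global_state: dict  # aggregate computed from the previous superstep


class Partition:
    """Hypothetical partition holding a few vertex values."""

    def __init__(self, values):
        self.values = values

    def analyse(self, request):
        # Use the global total from the previous step, if one exists yet.
        total = request.global_state.get("total")
        if total:
            self.values = [v / total for v in self.values]
        # Report this partition's partial contribution.
        return {"total": sum(self.values)}


class AnalysisManager:
    def __init__(self, partitions):
        self.partitions = partitions
        self.global_state = {}

    def run(self, steps):
        for step in range(steps):
            request = SuperstepRequest(step, dict(self.global_state))
            partials = [p.analyse(request) for p in self.partitions]
            # Aggregate the partial results into the small global state.
            self.global_state = {"total": sum(r["total"] for r in partials)}
        return self.global_state


manager = AnalysisManager([Partition([1.0, 3.0]), Partition([4.0])])
final_state = manager.run(steps=2)
```

In the first superstep no global state exists, so partitions only report; from the second superstep onwards every partition sees the same aggregate.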

Vertex messaging state not flushing

Similar to (and possibly related to) issue #82, the vertex messaging state seems to be taking up an excessive amount of memory and in many instances does not get flushed out of the Partition Manager for some time (if at all).

Singularity compatibility

Currently there are some basic scripts to convert the Docker image into a Singularity image, but these need to be expanded to run fully on the QM HPC cluster.

Broken Pipe error on background thread when CPU maxing out

When any component is running at maximum CPU, it seems that some background process is timing out on its TCP connection and printing a broken pipe stack trace over and over. This seems to be related to Prometheus/Kamon scraping, but needs to be investigated further.

GC overhead management

Long-running instances of Raphtory seem to be suffering from GC issues, with the collector unable to reclaim any heap. This would initially appear to be caused by the graph state; however, profiling shows that the graph itself is a fraction of the memory. We are currently using the Shenandoah GC inside the Docker image so that collection runs concurrently, but this needs to be investigated fully.

Update documentation following dev branch merge

The current citation (dev) branch has made some large changes to the API. Therefore, before it is merged into master, the documentation on the Raphtory site needs to be updated to include these changes.

Remove LOCAL argument

Currently, if running locally, you must set a flag. This is because the analysis task will otherwise send the Analyser object by reference, causing all the readers to refer to the same instance. The analysis task should therefore be updated to serialise the object first, removing the need for the argument.
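The effect of serialising first can be sketched as follows: each reader rebuilds its own instance from the serialised payload, so no mutable state is shared even in a local deployment. The `DegreeAnalyser` class, its JSON encoding, and the `dispatch` helper are assumptions made for illustration only.

```python
import json


class DegreeAnalyser:
    """Hypothetical analyser carrying mutable per-reader state."""

    def __init__(self, seen=None):
        self.seen = list(seen or [])

    def to_bytes(self):
        # Serialise the analyser's state to a byte payload.
        return json.dumps({"seen": self.seen}).encode("utf-8")

    @classmethod
    def from_bytes(cls, payload):
        # Rebuild an independent instance from the payload.
        return cls(**json.loads(payload.decode("utf-8")))


def dispatch(analyser, reader_count):
    # Serialise once, then give every reader its own deserialised copy.
    payload = analyser.to_bytes()
    return [type(analyser).from_bytes(payload) for _ in range(reader_count)]


original = DegreeAnalyser()
readers = dispatch(original, reader_count=3)
readers[0].seen.append("vertex-1")  # only the first reader's copy changes
```

Because local and distributed runs then follow the same path, a separate local-mode flag is no longer needed in this sketch.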

Clean up generation scripts

The current multimachinesetup is a bit of a mess. It needs to be cleaned up by incorporating seednode.sh into machine0 and printing less unnecessary output to the terminal.

PreviousStates periodic compression

We can store another (null) map (let's say toCleanMap) for each entity; periodically a thread will:

  1. Assign to toCleanMap the reference of the previousHistory
  2. Make a new instance of TreeMap and assign it to previousHistory
  3. The worker thread will clean up the toCleanMap
  4. toCleanMap and previousHistory have to be concatenated and reassigned to previousHistory at the end of the process. toCleanMap will be collected by the GC.
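The four steps above can be sketched roughly as below. The issue does not define what "cleaning" means, so this sketch assumes it drops consecutive entries that repeat the same value; a plain dict stands in for the TreeMap, and the `Entity` class and its methods are hypothetical.

```python
import threading


class Entity:
    """Hypothetical entity whose history maps timestamp -> alive flag."""

    def __init__(self):
        self._lock = threading.Lock()
        self.previous_history = {}
        self.to_clean_map = None

    def record(self, timestamp, alive):
        with self._lock:
            self.previous_history[timestamp] = alive

    def compress(self):
        # Steps 1 and 2: hand the current map to the cleaner, start a fresh one.
        with self._lock:
            self.to_clean_map = self.previous_history
            self.previous_history = {}
        # Step 3: clean outside the lock so new writes are not blocked.
        # Assumed rule: keep an entry only when the value changes.
        cleaned = {}
        last = None
        for ts in sorted(self.to_clean_map):
            value = self.to_clean_map[ts]
            if value != last:
                cleaned[ts] = value
                last = value
        # Step 4: concatenate with writes that arrived meanwhile and reassign.
        with self._lock:
            cleaned.update(self.previous_history)
            self.previous_history = cleaned
            self.to_clean_map = None  # the old map can now be garbage collected


entity = Entity()
for ts, alive in [(1, True), (2, True), (3, False), (4, False), (5, True)]:
    entity.record(ts, alive)
entity.compress()
```

Writes that arrive during step 3 land in the fresh map and are merged back in step 4, so they are never lost to the cleaner.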

Add CI/CD pipeline

Once unit tests have been added via issue #88, this should be expanded to test new PRs, as well as to test the dockerised version of Raphtory to ensure no issues occur in a distributed environment. To be assigned to Matt.

Back-pressure on kafka spout

The current Kafka spout, on a pull from the broker, seems to send everything it has on the topic, which for some topics can be hundreds of GB. This seems to be an issue with the Scala library, but it needs to be fixed to make the spout viable.

Add Snapshotting features

Add a rolling snapshotter which removes the oldest entity history when a Partition Manager's memory is close to full. This data should then be stored in an offline format which can be read back in if the Analysis Managers require it.
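A minimal sketch of the rolling behaviour, assuming memory pressure is approximated by an entry count and that an in-memory dict stands in for the offline format. The `RollingSnapshotter` class and its `enforce` and `restore` methods are hypothetical.

```python
class RollingSnapshotter:
    """Hypothetical snapshotter: evicts the oldest history past a size limit."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self.archive = {}  # stand-in for the offline store

    def enforce(self, history):
        # Move the oldest timestamps out until the history fits the limit.
        evicted = 0
        while len(history) > self.max_entries:
            oldest = min(history)
            self.archive[oldest] = history.pop(oldest)
            evicted += 1
        return evicted

    def restore(self, start, end):
        # Read archived entries back for a requested time range.
        return {ts: v for ts, v in sorted(self.archive.items()) if start <= ts <= end}


history = {1: "created", 2: "updated", 3: "updated", 4: "deleted"}
snapshotter = RollingSnapshotter(max_entries=2)
evicted = snapshotter.enforce(history)
restored = snapshotter.restore(start=2, end=4)
```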

Fix Akka Mailbox Crashing

Currently, when an actor's mailbox is full, the container will crash. There are two proposed solutions for this. Firstly, we can drop messages when this occurs, incorporating a middle man (Kafka etc.) which will allow us to recover lost messages. The second option is to have communication between Graph Routers and Partition Managers to control the speed at which messages are sent. Possible third option ???
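As a language-neutral illustration of the second option, the sketch below uses a bounded mailbox that rejects messages when full instead of crashing, so the sender can slow down and retry. This is not Akka code; `BoundedMailbox`, `offer`, and `take` are hypothetical names built on Python's standard `queue` module.

```python
import queue


class BoundedMailbox:
    """Hypothetical mailbox that signals back-pressure instead of overflowing."""

    def __init__(self, capacity):
        self._queue = queue.Queue(maxsize=capacity)
        self.rejected = 0

    def offer(self, message):
        # Return False when full so the sender knows to back off and retry.
        try:
            self._queue.put_nowait(message)
            return True
        except queue.Full:
            self.rejected += 1
            return False

    def take(self):
        # Consumer side: remove the oldest message.
        return self._queue.get_nowait()


mailbox = BoundedMailbox(capacity=2)
results = [mailbox.offer(m) for m in ("a", "b", "c")]  # third send is refused
first = mailbox.take()                                  # consumer frees a slot
retry = mailbox.offer("c")                              # retry now succeeds
```

The rejected count could equally feed the first option, where refused messages are recovered later from a durable log.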

Slow down of analysis over a range of flattenings

Currently, if a range is set running with many flattenings of the graph, the first will run in milliseconds, but the run time soon increases. This makes some sense, as the graph is larger in later flattenings. However, picking a later flattening and running a single analysis job on it often shows a 10x reduction in run time. This suggests some state is not being removed, causing a slowdown over time.

Add range queries to Entity Retrieval Proxy

Currently, entities which are fully archived may be retrieved from Redis. However, we additionally need to be able to pull a given range of history and merge it with an entity which is already in memory. We also need to track the first appearance of an entity to know how far back its history goes.
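A rough sketch of the merge, assuming history is keyed by timestamp and that the in-memory value wins when both sides hold the same timestamp (the issue does not say which should take precedence). A plain dict stands in for the Redis archive, and `merge_history` is a hypothetical helper.

```python
def merge_history(in_memory, archived, start, end):
    """Merge archived entries within [start, end] into the in-memory history."""
    merged = dict(in_memory)
    for ts, value in archived.items():
        # Assumption: the in-memory entry is authoritative on conflicts.
        if start <= ts <= end and ts not in merged:
            merged[ts] = value
    # Return the history ordered by timestamp.
    return dict(sorted(merged.items()))


in_memory = {50: "b", 60: "c"}
archived = {10: "created", 30: "a", 50: "stale"}

merged = merge_history(in_memory, archived, start=20, end=55)
# First appearance of the entity across both stores.
first_seen = min(min(archived), min(in_memory))
```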

Allow Analysis Managers to send new Analysers to running clusters

Currently, to run an Analyser, it must already be part of the image with which the running cluster was established. This means that if you create a new function and request the Partition Managers to execute it, a ClassNotFoundException will be thrown. The whole cluster must then be taken down and rerun with the new image.

To fix this, the LAM should be able to send new Analysers to the Partition Managers, where they can be compiled and placed into the correct directory. The function may then run as required.
