intellabs / pmgd Goto Github PK

Persistent Memory Graph Database

License: MIT License

Makefile 0.51% Batchfile 1.88% C++ 87.17% Java 5.89% C 0.15% Shell 3.09% Yacc 1.08% Lex 0.23%

pmgd's Introduction

Persistent Memory Graph Database (PMGD)

⚠️ DISCONTINUATION OF PROJECT - This project will no longer be maintained by Intel. Intel has ceased development and contributions including, but not limited to, maintenance, bug fixes, new releases, or updates, to this project. Intel no longer accepts patches to this project. If you have an ongoing need to use this project, are interested in independently developing it, or would like to maintain patches for the open source software community, please create your own fork of this project.

Recent developments in persistent memory technologies like 3D XPoint promise storage elements providing nearly the speed of DRAM and the durability of block-oriented storage. To provide an efficient storage solution addressing the increasing popularity of connected data and applications that benefit from graph like processing, we have designed and implemented an in-persistent-memory graph database, PMGD, optimized to run on a platform equipped with a vast amount of persistent memory.

Features

Atomicity, consistency, isolation, and durability (ACID): As expected, each transaction either completes entirely or not at all, the database remains consistent, and a completed transaction’s effects survive any failure. Concurrency is supported since version 2.0.0.
Recovery or restart from PM: Should the system fail, PMGD recovers to a previous consistent state upon restart, without requiring its content to be loaded into memory. Thus, it avoids any access to disk or unmarshalling of data from a disk-friendly format to an application usable format, unlike other existing disk-based graph databases.
Java bindings: PMGD is a C++ implementation with client bindings available for Java. This allows us to observe some performance differences due to our use of C++.

In the current release, we have focused our efforts on understanding challenges presented by a PM-based design before considering a case for scaling the database out. Since the current expectations for persistent memory offer the prospect of individual platforms with memory capacity in terabytes, we can still evaluate graph sizes that cover a large spectrum of deployments without scaling out. Hence, PMGD currently supports a single node operation. It is implemented as a library that is linked into an application and as a server that can be accessed by multiple clients. We plan to work on a distributed solution soon.

System Overview

Graphs stored in PMGD consist of nodes (or vertices) optionally connected with edges (or relationships). Graphs may be directed or undirected. PMGD always stores directed edges but its interface is such that direction may be ignored. All nodes need not be connected; a directed graph may be weakly connected (i.e., a path may not exist between all pairs of nodes).

PMGD supports a property graph model with the following features:

Each node and each edge has an associated tag that can be used for classification. A tag is a short string that groups items into classes. For example, in a metadata graph of emails, a tag may be "Person", "Message", "To", or "Keyword". In applications that don’t require tags, they may be omitted.
Each node and each edge may have associated properties (dis- tinct from tags) stored as key-value pairs. Keys are short strings. Values must be one of six predefined types: booleans, integers, floats, strings, date-time, or blobs (i.e., arbitrary strings of bits). With the exception of blobs, all other predefined types are orderable. Examples of properties include <"Firstname", "Leonhard">, <"Lastname", "Euler">, and so on.

These features provide a powerful data model for storing any form of connected data.

We implement add, read, modify, and remove primitives for all entities—nodes, edges, and properties. PMGD supports queries that look up nodes or edges based on (a) a specific tag and/or property, (b) a specific property value, or (c) a property value within a specified range of values. Range lookups are supported for all types of properties except blobs.

PMGD implements indexing to support all of these types of lookups. Users can choose which properties to create indices for, based on the expected query patterns, with an understanding that indices occupy additional memory. PMGD also allows a query to provide a predicate function that can examine the properties or relationships of a node or edge and determine whether it matches the query. Once relevant nodes have been found, PMGD supports graph-oriented queries such as a) get neighbors of a node at n-hops, where n >= 1; b) get all nodes within a neighborhood of up to n-hops from a node; and c) get common neighbors of a set of nodes. Each of these queries includes the ability to specify the direction and tag of edges to follow.

Library sources

The public interface headers are in the include/ folder while the library C++ sources are in src/. The Java bindings are implemented in the java/ folder. Some higher level functionalities like neighbor functions are present in the util/ folder.

Tools

We provide some simple tools like:

mkgraph: to make an empty graph and provide parameters like memory region sizes, indexes etc. Run mkgraph -h for help.
loadgraph: that can load from certain supported file formats into an empty graph created with mkgraph. Run loadgraph -h for help.
dumpgraph: to print the contents of the graph to screen. dumpgraph -h for help.

Tests and sample code

The test folder has unit tests for a lot of the modules and the tests can be run using the run_all.sh script. clean_all.sh cleans up all the graphs created by run_all.sh. We plan to move our testing to GTEST in future release.

pmgd's People

Contributors

Stargazers

Watchers

Forkers

liuliuluk locussam boy1583 idsol gino1610 kmistri tymcbbs dualword

pmgd's Issues

allocabortest fails on some systems

Since the allocator unit tests are so specific to the logic, it seems some variation in DRAM sort order can cause this test to fail on some systems. Needs further investigation on how to make the test better.

Capabilities & Limitations Estimation?

Hello,

Hope that you are doing well.

Although I know that you are currently working on the new version which will be much more advanced, I was wondering if I might be able to get some "rough" estimates from you on the PMGD limitations.

I also did come across a paper "Addressing the Dark Side of Vision Research: Storage" (Vishakha Gupta-Cledat, Luis Remis, Christina R. Strong) which talked briefly about PMGD but did not really go into its capabilities and limitations towards answering my question.

In particular, I am interested in:

approximate max nodes/vertices & edges
approximate number of transactions per second
any information on the stability
I think that PMGD is also only single node.

I am considering to use PMGD in a cutting edge project but am trying to get so base estimations on its current capabilities and limitations.

Best Regards :)

run_all doesn't catch BindingsTest.java failure

The test/run_all.sh script doesn't show failure for BindingsTest even if it fails. It should change behavior similar to DateTest.java execution.

GCC version issues

The code DO NO compile at GCC version higher than 5.

The issue is in iterator.h, the array _chunk0 is defined as a flexible array, and GCC 6 (and higher) does not allow that. We may need to make _chunk0 into a vector instead.

suggestions for adding event triggers and new types?

Hello,

I am definitely using PMGD in my project development and have an idea for a few advanced features that I want to include but wanted to get some suggestions from you, the developers.

I want to be able to add:

New data types --- What I have in mind is a data type that will allow you to associate a file, maybe a property of a node, but that PMGD would maybe either encrypt the file, or filename, and then store the actual file in some directory on the local machine. Basically, PMGD would hold the filename in the property but it would be linked to the actual file in the directory. This gives the ability to "pin" files to nodes but not actually store the file in the database itself.
Event Triggers --- I am researching to figure out a way to add graph node, property, and edge triggers to PMGD such that these triggers can be assigned to a node as maybe a "trigger" property such that they can be set up to trigger on "Read", "Write", "Execute" (like PRE traversing into, or POST traversing out from) , "Timer", others. These triggers could be applied to the node, a property, or and edge maybe just as some initial ideas.

I am building a modular type application which will be a type of graph database with a number of different new qualities like these and some others as well, but will also be cross-platform for Linux, Windows, and MacOS.

Just a small note on the project. I will probably be using a Golang main driver code to call the PMGD library methods as needed, but was trying to determine if I should investigate adding in the Event library in C++ to augment PMGD or if maybe have PMGD call Golang event handlers. This is more to the questions as to what might be the best approach to interface with PMGD.

That's the goal, at least.

Any thoughts or suggestions would be greatly appreciated.
Thanks and have a great day

Simplest makefile for filtertest

Hello,

I am trying to use the test/filtertest.cc as a template but am having trouble determining what is the either the simplest makefile or gcc command line to build just that test file and link it against the pmgd.lib and pmgd-util.lib

Can you please show me a simple command like compile for it?

I should be able to integrate it from there.
Thanks

Any node and edge core size size estimations?

Hello,

Things are progressing well with my project using PMGD and I am extremely happy with what I have seen in this project although a little more feedback from the developers would be helpful at times.

One question that I am wondering is about the size estimations.

I would like to try and find out how much space is used (approximately) for each node and each edge that is added. I know that there is additional overhead when adding properties to a node or edge as well, but I am now having some ideas on how to speed up my searches when the whole database grows very large.

Right now, it seems that PMGD is using AVL-Trees in the core structure for node management which is good since they balance well, but I think that this idea could be greatly extended to the realm of properties on both edges and nodes. Just a thought though.

Have a good weekend :)

Specifying a wrong date in TimeType property doesn't warn or throw error

Instead of providing the date: Sat Oct 1 17:59:24 PDT 1936
if I give: Sat 1 17:59:24 PDT 1936
There is no error and the property value looks like this: 1899-12-31T00:00:00+00:00

Complex DRAM-PM consistency in concurrency code

Since there can be exceptions or failure in the final free code that happens once a transaction has called commit, there is code now in the concurrency branch to fix the DRAM states and make them consistent with the PM state as it will be after rollbacks. It is not clear it is worth introducing so much complexity for caching code. So here is a capture of solutions suggested by Philip during code review and II seems like it could work well:

As I understand it, the problem isn’t with maintaining the PM list of chunks, but with maintaining the DRAM list. (If that’s wrong, you can ignore the rest of this message.)

I. My first suggestion is that you simply let the DRAM list get out of sync, in the rare case* where there is an exception during commit. So, what are the issues that can arise when it is out of sync:
a) A chunk in the DRAM list has no free slots when you go to allocate from it. This currently asserts, I believe, because currently that is never supposed to happen. It should be no problem to instead just remove the chunk from the DRAM list and go to the next chunk.
b) There are no chunks in the DRAM list even though there are some chunks with free space in the PM list. To work around this, simply rebuild the DRAM chunk list from the PM list when it is empty. If there are no free chunks in the PM list either, then of course it will proceed to allocate another chunk.
c) last_chunk_scanned or an entry in the DRAM list points to a chunk that is no longer in the PM list. This is a problem that must be avoided. I’d have to think some more about how that might be done. Perhaps this is a showstopper for this approach.

II. Alternatively, you could simply blow away all the DRAM lists anytime an exception occurs in commit, and have them automatically rebuilt the next time a chunk is needed. (I believe the code to rebuild the lists as needed is there already, since it is the usual process whenever an existing graph is opened, isn’t it?)

Philip

It should be rare outside of your stress test.

Strange compile error

Hello,

I was previously able to compile PMGD on my Ubuntu 16.04 system.

Recently, I upgraded to Ubuntu 18.04 and tried to do a fresh compile of PMGD but am now getting some strange errors:

~/VARS/pmgd/test$ make all
fatal: not a git repository (or any parent up to mount point /)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
fatal: not a git repository (or any parent up to mount point /)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
[CC] src/graph.o
[CC] src/GraphConfig.o
[CC] src/node.o
[CC] src/edge.o
[CC] src/StringTable.o
[CC] src/PropertyList.o
[CC] src/property.o
[CC] src/stringid.o
In file included from src/property.cc:31:0:
./include/iterator.h:259:25: error: flexible array member ‘PMGD::PropertyList::_chunk0’ in an otherwise empty ‘class PMGD::PropertyList’
uint8_t _chunk0[];
^
In file included from ./include/edge.h:34:0,
from src/edge.cc:31:
./include/iterator.h:259:25: error: flexible array member ‘PMGD::PropertyList::_chunk0’ in an otherwise empty ‘class PMGD::PropertyList’
uint8_t _chunk0[];
^
src/Makeconf:35: recipe for target 'src/property.o' failed
make[1]: * [src/property.o] Error 1
./include/iterator.h:259:25: error: flexible array member ‘PMGD::PropertyList::_chunk0’ in an otherwise empty ‘class PMGD::Edge’
[CC] src/TransactionManager.o
In file included from ./include/graph.h:34:0,
from ./include/transaction.h:33,
from src/TransactionImpl.h:38,
from src/StringTable.cc:33:
./include/iterator.h:259:25: error: flexible array member ‘PMGD::PropertyList::_chunk0’ in an otherwise empty ‘class PMGD::PropertyList’
uint8_t _chunk0[];
^
In file included from ./include/graph.h:34:0,
from src/GraphConfig.cc:33:
./include/iterator.h:259:25: error: flexible array member ‘PMGD::PropertyList::_chunk0’ in an otherwise empty ‘class PMGD::PropertyList’
uint8_t _chunk0[];
^
In file included from ./include/graph.h:34:0,
from src/graph.cc:32:
./include/iterator.h:259:25: error: flexible array member ‘PMGD::PropertyList::_chunk0’ in an otherwise empty ‘class PMGD::PropertyList’
uint8_t _chunk0[];
^
In file included from src/node.cc:33:0:
./include/iterator.h:259:25: error: flexible array member ‘PMGD::PropertyList::_chunk0’ in an otherwise empty ‘class PMGD::PropertyList’
uint8_t _chunk0[];
^
./include/iterator.h:259:25: error: flexible array member ‘PMGD::PropertyList::_chunk0’ in an otherwise empty ‘class PMGD::Node’
./include/iterator.h:259:25: error: flexible array member ‘PMGD::PropertyList::_chunk0’ in an otherwise empty ‘class PMGD::Edge’
In file included from ./include/graph.h:34:0,
from src/PropertyList.cc:36:
./include/iterator.h:259:25: error: flexible array member ‘PMGD::PropertyList::_chunk0’ in an otherwise empty ‘class PMGD::PropertyList’
uint8_t _chunk0[];
^
In file included from ./include/graph.h:34:0,
from ./include/transaction.h:33,
from src/TransactionImpl.h:38,
from src/stringid.cc:34:
./include/iterator.h:259:25: error: flexible array member ‘PMGD::PropertyList::_chunk0’ in an otherwise empty ‘class PMGD::PropertyList’
uint8_t _chunk0[];
^
src/Makeconf:35: recipe for target 'src/StringTable.o' failed
make[1]: * [src/StringTable.o] Error 1
[CC] src/transaction.o
./include/iterator.h:259:25: error: flexible array member ‘PMGD::PropertyList::_chunk0’ in an otherwise empty ‘class PMGD::Node’
./include/iterator.h:259:25: error: flexible array member ‘PMGD::PropertyList::_chunk0’ in an otherwise empty ‘class PMGD::Node’
./include/iterator.h:259:25: error: flexible array member ‘PMGD::PropertyList::_chunk0’ in an otherwise empty ‘class PMGD::Edge’
In file included from ./include/graph.h:34:0,
from ./include/transaction.h:33,
from src/TransactionImpl.h:38,
from src/TransactionManager.cc:33:
./include/iterator.h:259:25: error: flexible array member ‘PMGD::PropertyList::_chunk0’ in an otherwise empty ‘class PMGD::PropertyList’
uint8_t _chunk0[];
^
./include/iterator.h:259:25: error: flexible array member ‘PMGD::PropertyList::_chunk0’ in an otherwise empty ‘class PMGD::Node’
./include/iterator.h:259:25: error: flexible array member ‘PMGD::PropertyList::_chunk0’ in an otherwise empty ‘class PMGD::Edge’
./include/iterator.h:259:25: error: flexible array member ‘PMGD::PropertyList::_chunk0’ in an otherwise empty ‘class PMGD::Node’
./include/iterator.h:259:25: error: flexible array member ‘PMGD::PropertyList::_chunk0’ in an otherwise empty ‘class PMGD::Edge’
./include/iterator.h:259:25: error: flexible array member ‘PMGD::PropertyList::_chunk0’ in an otherwise empty ‘class PMGD::Node’
./include/iterator.h:259:25: error: flexible array member ‘PMGD::PropertyList::_chunk0’ in an otherwise empty ‘class PMGD::Edge’
src/Makeconf:35: recipe for target 'src/edge.o' failed
make[1]: * [src/edge.o] Error 1
[CC] src/Index.o
src/Makeconf:35: recipe for target 'src/GraphConfig.o' failed
make[1]: * [src/GraphConfig.o] Error 1

There are, of course, many more errors showing up now.

Did I miss something here?
Cheer

indexstringtest fails on some systems

that do not have the locales used in the test. Could check for the available locales or ask the user for them

requirements to compile on Windows?

Hello,

I am looking at PMGD more and more these days as it could offer a great core graph database library to build a project but basically have 2 options that I can see, in order to make the project viable since the main system drivers will be in Golang.

Compile PMGD libraries (Linux, windows, etc.) and them link then with the Golang driver code.
Rewrite PMGD in Golang.

The second option seems like the most challenging and something that might take a long time to complete.

I would like to know if anyone has had luck compiling the PMGD library under Windows?

Any suggestions or ideas would be grateful.

Add msync for systems with no pcommit

We can have this as a separate branch with this change.

virtual size vs actual size?

Hello,

When the test application create the database, the report sizes of 1 TB each for a number of the databases in the directory.

Of course, this is not the actual size of the each database, but how can I read the actual size?

I actually want to compress the directory but most compressors are trying to utilize the "reported" size which give a directory of over 3 TB when in fact the whole directory is about 4M.

Can you please shed some light on this problem?

Thanks in advance

Configurable RWLock parameters

With the MSYNC options introduced for persistence, transactions can take longer to flush DRAM on writes. That can cause locks to be held longer. But that isn't the case when the code is compiled for PM and runs natively on it. So preferred way is to allow graph to be opened with the correct RWLock options configurable in the Config structure and/or automatically adapt timings depending on what the code is compiled for OR both.

String table being modified in ReadOnly TX

If a query looks up a string id that is not present in our string table, the default behavior is to add that string to the string table but it should not happen if the transaction is read only.

Multithreaded tests fail on systems with fewer than 'n' threads

Since the number of threads some tests like mtalloctest etc, run with for number of allocators, they fail on systems with fewer than expected number of cores.

Attaching Lists to Nodes?

Hello,

Lately, I've been working with the "latest-concurrency-with-msync" version and it runs very nicely, so far.

In the db.get_nodes(...) I know that it returns an iterator from all nodes or nodes meeting a requirement which what I need, but I also need to dynamically db.add_node() and if it is one of the nodes addressable in the get_nodes() iterator then I need it to append that node to the list.

Basically, I am building a dynamic list of nodes that I can insert, remove or append to the iterator and also would like a quick way to get a "count' of nodes in the iterator. STL has a count and sort implementation, but I do not know if those go over to the PMGD implementation.

Also, I looked into the test/listtest.cc example and like how you can work with lists.

With that in mind, I am wondering if you can attach, one or more lists to a node and then work with those lists for things like adding elements, deleting elements, inserting elements, and appending elements?

This would be very powerful and I would like to be able to do this if at all possible.

Any comments or suggestions would be greatly appreciated.
Cheers and have a great day :)

Consider separate Config structure in Graph API for DRAM settings

Now that we have some variables that can be configured each time Graph is opened, should we separate out PM and DRAM config?

Thread Safe?

Hello,

I am trying to locate a peer paper or other documentation on pmgd but have not found anything and would also like to find out if it is thread safe?

Thanks, :)

Any possible re-write with PMDK library coming?

Hello,

I have really liked the PMGD code for a long time and have worked with it a bit as I think that it holds a very good potential for some work that I am doing now.

In particular, I think that it could for a great foundation for a nice graph database application for users. In as much, I am going to work on such a project but think that for having cross-platform (Linux, Windows, MacOSX) capabilities the core code should probably be written in Golang but still call PMGD libraries.

To achieve this, I recently came across the Intel PMEM work:

https://github.com/pmem

and in particular the PMDK (Persistent Memory Development Kit) at:

https://github.com/pmem/pmdk

for which I have been able to compile on Windows 11 with Visual Studio 2022 as well as to run some of their existing examples and tests, all of which seem to run very well.

It made me wonder if perhaps PMGD could benefit from the PMDK work and I was wondering if there was, or is, a planning for a rewrite of the PMGD code and examples using the PMDK libraries.

The reason is that if there was, then that would allow me to easily compile PMGD on Windows as well as to start investigating some type of shim layer, perhaps using SWIG or similar, to allow for interfacing with Golang as well.

Mostly just wondering since I think that PMGD had come out prior to the existing and seemingly currently active work by the PMDK effort.

Any thoughts, suggestions, or ideas, would be greatly appreciated.
Cheers and have a great day

Commit 9d1f553 in FixedAllocator broke the code for free within same TX

The commit removed turning on FREE_BIT in the FixedAllocator which in turn causes incorrect results when doing get_nodes(tag=0) (and probably any code that relies on FREE_BIT when traversing this allocator) within the same transaction. The change goes into effect when the transaction commits.

time_to_string misses usec return values

Even if there is a usec component to Time, the time_to_string() function ignores it.

Occassional tree height disbalance in AVL remove

When running the mtavltest.cc in concurrency branch, 1 in 1000 times the test can have one error where the remove code causes disbalance. Needs some investigation on which level needs to be locked.

Use intrinsics like in libpmem

We have assembly instructions in code right now. It would make sense to use gcc intrinsics

test/nodeedgetest.cc question

Hello All,

Hope that your day is going well.

I have been reviewing the "test/nodeedgetest.cc" to get an idea on how to navigate and inspect property information from one Node-A to Node-B connected via 2 edges in different directions since I want to use my NodeIterator to start on Node-A and via and "Outgoing" edge get the Node-B.Property("hash").c_str() for comparison to another hash string that I send to a function.

The idea is to have a very fast way of comparing connected node neighbors "hash" string property against one that I have to see if it is already there.

In any case, I was digging through your "nodeedgetest.cc" code and saw that you seem to have both new edges as "Outgoing" for "tag0" and "tag1"

    Graph db("nodeedgegraph", create ? Graph::Create : Graph::ReadOnly);

    Transaction tx(db, Transaction::ReadWrite);

    Node *prev = 0;
    for (int i = 1; i < argc; i++) {
        Node &n = db.add_node(0);
        if (prev != NULL) {
            db.add_edge(*prev, n, "tag0");
            db.add_edge(*prev, n, "tag1");
        }
        prev = &n;
    }

Shouldn't that be reversed for "tag1" to be:

            db.add_edge(*prev, n, "tag0");
            db.add_edge(n, *prev, "tag1");

So that one is Outgoing and one is Incoming?

I think that this "add_edge()" method is of the form

add_edge( from, to, tagID);

but I could be wrong.
Cheers :)

Get_property return value

The get_property method in Node, Edge, and EdgeRef should return PropertyRef instead of Property. Probably the result parameter of check_property should also be a PropertyRef. This change will prevent making a redundant copy of the property value in a case such as this:

EdgeIterator ei = node.get_edges(tag);
Property p = ei->get_property(prop_id);
String val = p.string_value();

It was not implemented this way originally because PropertyRef was invented to support property iterators, after get_property had already been specified and implemented.

This should be a trivial change to PropertyList::get_property. The definition of PropertyRef should be moved from iterator.h to property.h.

get_neighborhood(....) suggestion

Hello,

I really like the way that the nodes and edges methods work

db.get_nodes(...)
db.get_edges(...)

and am becoming fond of this type of Iterator:

NodeIterator pi = db.get_nodes("Person", PropertyPredicate{ "Name", PropertyPredicate::Eq, "John" });

In trying to work with nodes, edges, and neighborhoods I would like to suggest the addition of perhaps something like this as well:

NodeIterator pi = db.get_neighbors( *mi, "Person", PropertyPredicate{ "Name", PropertyPredicate::Eq, "John" });

and

NodeIterator pi = db.get_neighbors(*mi, PropertyPredicate{ "Name", PropertyPredicate::Eq, "John" });

where get_neighbors() would look at the

NeighborhoodIterator get_neighborhood
(const PMGD::Node &node, int max_hops, StringID tag, const PropertyPredicate &, bool reverse = false)

There could be a number of methods for this implementation, but I would find it particularly useful to have one like I mentioned above so that get_neighborhood() could collect neighbors with a particular "StringID tag", and "const PropertyPredicate &" such that I could have finer-grained control over which neighbors would be returned.

What is the possibility of getting something like this added to the latest "development" version?
Cheers and have a great day

intellabs / pmgd Goto Github PK

pmgd's Introduction

Persistent Memory Graph Database (PMGD)

Features

System Overview

Library sources

Tools

Tests and sample code

pmgd's People

Contributors

Stargazers

Watchers

Forkers

pmgd's Issues

Recommend Projects

Recommend Topics

Recommend Org