neo4j / graph-data-science-client

A Python client for the Neo4j Graph Data Science (GDS) library

Home Page: https://neo4j.com/product/graph-data-science/

License: Apache License 2.0

Python 99.23% Shell 0.36% Jinja 0.41%
python neo4j machine-learning algorithms graph data-science graph-algorithms graph-database python3 graph-data-science

graph-data-science-client's Introduction

Neo4j Graph Data Science Client


graphdatascience is a Python client for operating and working with the Neo4j Graph Data Science (GDS) library. It enables users to write pure Python code to project graphs, run algorithms, as well as define and use machine learning pipelines in GDS.

The API is designed to mimic the GDS Cypher procedure API in Python code. It abstracts the necessary operations of the Neo4j Python driver to offer a simpler surface. Additionally, the client-specific graph, model, and pipeline objects offer convenient functions that heavily reduce the need to use Cypher to access and operate these GDS resources.

graphdatascience is only guaranteed to work with GDS versions 2.0+.

Please leave any feedback as issues on the source repository. Happy coding!

Installation

To install the latest deployed version of graphdatascience, simply run:

pip install graphdatascience

Getting started

To use the GDS Python Client, we need to instantiate a GraphDataScience object. Then, we can project graphs, create pipelines, train models, and run algorithms.

from graphdatascience import GraphDataScience

# Configure the driver with AuraDS-recommended settings
gds = GraphDataScience("neo4j+s://my-aura-ds.databases.neo4j.io:7687", auth=("neo4j", "my-password"), aura_ds=True)

# Import the Cora common dataset to GDS
G = gds.graph.load_cora()
assert G.node_count() == 2708

# Run PageRank in mutate mode on G
pagerank_result = gds.pageRank.mutate(G, tolerance=0.5, mutateProperty="pagerank")
assert pagerank_result["nodePropertiesWritten"] == G.node_count()

# Create a Node Classification pipeline
pipeline = gds.nc_pipe("myPipe")
assert pipeline.type() == "Node classification training pipeline"

# Add a Degree Centrality feature to the pipeline
pipeline.addNodeProperty("degree", mutateProperty="rank")
pipeline.selectFeatures("rank")
features = pipeline.feature_properties()
assert len(features) == 1
assert features[0]["feature"] == "rank"

# Add a training method
pipeline.addLogisticRegression(penalty=(0.1, 2))

# Train a model on G
model, train_result = pipeline.train(G, modelName="myModel", targetProperty="myClass", metrics=["ACCURACY"])
assert model.metrics()["ACCURACY"]["test"] > 0
assert train_result["trainMillis"] >= 0

# Compute predictions in stream mode
predictions = model.predict_stream(G)
assert len(predictions) == G.node_count()

The example here assumes an AuraDS instance. For additional examples and extensive documentation of all capabilities, please refer to the GDS Python Client Manual.

Full end-to-end examples in ready-to-run Jupyter notebooks can be found in the examples source directory.

Documentation

The primary source for learning everything about the GDS Python Client is the manual, hosted at https://neo4j.com/docs/graph-data-science-client/current/. The manual is versioned to cover all GDS Python Client versions, so make sure to use the correct version to get the correct information.

Known limitations

Operations known to not yet work with graphdatascience:

License

graphdatascience is licensed under the Apache Software License version 2.0. All content is copyright © Neo4j Sweden AB.

Acknowledgements

This work has been inspired by the great work done in the following libraries:

graph-data-science-client's People

Contributors

adamnsch, breakanalysis, brs96, darthmax, dependabot[bot], florentind, ioannispanagiotas, jexp, jjaderberg, kedarghule, knutwalker, lassewesth, lidiazuin, mats-sx, nvitucci, orazve, recrwplay, s1ck, soerenreichardt, vnickolov, yuvalr1neo, zach-blumenfeld


graph-data-science-client's Issues

streamNodeProperty() doesn't work with gds.run_cypher() as I guess

graphdatascience 1.3

I tried a query like this:

query = f'''
   call gds.graph.streamNodeProperty(
      'xxx',
      'xxxx',
      ['xxxxx']
   )
yield nodeId as id, propertyValue as degree
return id, degree limit 100
'''
result = gds.run_cypher(query)

=> KeyError: 'graph_name'

I figured out to make it work like this:

query = f'''
   ...
'''
params = {
   'graph_name': 'xxx',
   'properties': 'xxxx',
   'entities': ['xxxxx'],
   'config': ''
}
result = gds.run_cypher(query, params)

=> No error, but it returned all rows (not limited to 100), with columns named nodeId and propertyValue (not renamed to id and degree).

Other Cypher queries work with gds.run_cypher(query) as expected.
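
One way to avoid string interpolation entirely is to write the query with Cypher `$` parameters and pass the values as the second argument of gds.run_cypher. The sketch below is only an illustration: the parameter names (node_property, node_labels) and the helper missing_params are made up here, the values are the placeholders from the report, and the commented-out call needs a live database with GDS installed.

```python
import re

# Plain (non-f) string: Cypher parameters are written as $name.
query = """
CALL gds.graph.streamNodeProperty($graph_name, $node_property, $node_labels)
YIELD nodeId AS id, propertyValue AS degree
RETURN id, degree LIMIT 100
"""

# Placeholder values; substitute your own graph, property and labels.
params = {
    "graph_name": "xxx",
    "node_property": "xxxx",
    "node_labels": ["xxxxx"],
}


def missing_params(query, params):
    """Return the $placeholders in `query` that have no entry in `params`."""
    placeholders = set(re.findall(r"\$(\w+)", query))
    return placeholders - set(params)


# With a connected client this would be run as:
# result = gds.run_cypher(query, params)
print(missing_params(query, params))
```

Checking the placeholders before sending the query makes a missing key show up as a readable set of names rather than as an error from the server.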

Security Vulnerability

Describe the bug
The latest graph data science client (1.8) depends on pyarrow >= 4.0, < 15.0 which includes the vulnerable versions 0.14.0 to 14.0.0 (severity: critical) as described here:

Proposed mitigation is to make sure to use pyarrow version 14.0.1 or greater.

It would be great if you could update the requirements, as we currently cannot install the package due to policy violations caused by this vulnerability.
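
As a stopgap, a version string can be checked against the range quoted above before installing. This is a minimal sketch that assumes the vulnerable range is exactly 0.14.0 to 14.0.0 inclusive, as stated in this report; parse_version and is_vulnerable are hypothetical helpers, and pre-release suffixes are ignored.

```python
import re

VULNERABLE_FROM = (0, 14, 0)
VULNERABLE_TO = (14, 0, 0)  # inclusive, per the range quoted in the report


def parse_version(text):
    """Turn '14.0.1' into (14, 0, 1); anything after the leading digits is ignored."""
    numbers = []
    for piece in text.split(".")[:3]:
        match = re.match(r"\d+", piece)
        numbers.append(int(match.group()) if match else 0)
    while len(numbers) < 3:
        numbers.append(0)
    return tuple(numbers)


def is_vulnerable(version_text):
    """True if the version falls inside the quoted vulnerable range."""
    return VULNERABLE_FROM <= parse_version(version_text) <= VULNERABLE_TO


# The installed version could be read with importlib.metadata.version("pyarrow").
for candidate in ("4.0.0", "14.0.0", "14.0.1"):
    print(candidate, is_vulnerable(candidate))
```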

Algorithms Not Referenced in Python Client

I am running Neo4j 4.4.5 Community on a Docker instance. I installed the library first via pip install graphdatascience and then via pip install git+https://github.com/neo4j/graph-data-science-client on a conda Python 3.10.2 env. The library installs, but for some reason I am unable to see a reference to the algorithms. (The documentation shows that you can reference gds.fastRP or 'gds.shortestPath.dijkstra.stream', for example, but none of these show up.) Oddly enough, I can project a graph via the Python client and then run astar and yens in the Neo4j browser with no issue. I have tried both GDS 2.1.4 and 2.1.8 jar plugins. Please advise.
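
One possible explanation, offered as an assumption rather than a confirmed diagnosis: if a client resolves algorithm names dynamically when they are accessed, they will not appear in autocompletion or dir() even though calling them works. The toy class below is a stand-in written for this illustration, not the library's code; it only shows how chained attribute access can build a procedure name.

```python
class CallBuilder:
    """Toy stand-in: builds a procedure name from chained attribute access."""

    def __init__(self, namespace="gds"):
        self.namespace = namespace

    def __getattr__(self, attr):
        # Only called when normal lookup fails, i.e. for names that were
        # never defined on the class, such as "fastRP" or "shortestPath".
        if attr.startswith("_"):
            raise AttributeError(attr)
        return CallBuilder(f"{self.namespace}.{attr}")


gds = CallBuilder()
endpoint = gds.shortestPath.dijkstra.stream
print(endpoint.namespace)
print("fastRP" in dir(gds))
```

In this sketch the attribute fastRP resolves when accessed but is absent from dir(gds), which is the kind of behaviour an editor's autocompletion would reflect.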


Add more jupyter notebooks

Is your feature request related to a problem? Please describe.

In our examples we have several jupyter notebooks to show how to use GDS.
However, there are several areas left uncovered as we mostly focused on the ML parts of our library.
Adding a new example would help you, as well as other users, learn the best practices of the GDS client.

Potential areas to cover (see https://neo4j.com/docs/graph-data-science/current/algorithms/ for all of our algorithms):

  • Path algorithms such as Steiner trees
  • Similarity algorithms
  • Community detection algorithms
  • Centrality algorithms
  • ...

You can also come up with your own idea of course.

Please drop a comment if you would like to work on an example :)

ModuleNotFoundError: No module named 'gdsclient.algo'

I've tried to incorporate the library in a Jupyter notebook and get the following error:

Steps to reproduce:

git clone https://github.com/neo4j/gdsclient.git
cd gdsclient
pip install .

Then, when I try to set up the library in the notebook:

from gdsclient import Neo4jQueryRunner, GraphDataScience
from neo4j import GraphDatabase

driver = GraphDatabase.driver(host,auth=(user, password))
gds = GraphDataScience(Neo4jQueryRunner(driver))

I get the following error:

ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-5-155fb41d8c7a> in <module>
----> 1 from gdsclient import Neo4jQueryRunner, GraphDataScience
      2 
      3 gds = GraphDataScience(Neo4jQueryRunner(driver))

~/anaconda3/lib/python3.8/site-packages/gdsclient/__init__.py in <module>
----> 1 from .graph_data_science import GraphDataScience
      2 from .query_runner import Neo4jQueryRunner, QueryRunner
      3 
      4 __all__ = [GraphDataScience, QueryRunner, Neo4jQueryRunner]

~/anaconda3/lib/python3.8/site-packages/gdsclient/graph_data_science.py in <module>
----> 1 from .call_builder import CallBuilder
      2 
      3 
      4 class GraphDataScience:
      5     def __init__(self, query_runner):

~/anaconda3/lib/python3.8/site-packages/gdsclient/call_builder.py in <module>
----> 1 from .algo.algo_endpoints import AlgoEndpoints
      2 from .graph.graph_endpoints import GraphEndpoints
      3 
      4 
      5 class CallBuilder(AlgoEndpoints, GraphEndpoints):

ModuleNotFoundError: No module named 'gdsclient.algo'

and my python environment is:

Python 3.8.8
conda 4.10.1

I've tried without conda environment and the error persists

I tried the example demo but failed..

Hi,
I ran the example at the link below.
https://github.com/neo4j/graph-data-science-client/blob/main/examples/fastrp-and-knn.ipynb

I got an error when it executed the mutate call:
...
result = gds.fastRP.mutate(
    G,
    mutateProperty='embedding',
    embeddingDimension=4,
    relationshipWeightProperty='amount',
    iterationWeights=[0.8, 1, 1, 1]
)
< the return and error >
Required memory for native loading: 515 KiB
Graph named 'purchases' projected
Required memory for running FastRP: 3936 Bytes
Traceback (most recent call last):
  File "/mnt/nvme/pycharm-code/neo4j-gds/gds_test.py", line 79, in <module>
    result = gds.fastRP.mutate(
  File "/mnt/nvme/pycharm-code/neo4j-gds/venv/lib/python3.8/site-packages/graphdatascience/algo/algo_proc_runner.py", line 38, in __call__
    return self._run_procedure(G, config).squeeze()  # type: ignore
  File "/mnt/nvme/pycharm-code/neo4j-gds/venv/lib/python3.8/site-packages/graphdatascience/algo/algo_proc_runner.py", line 22, in _run_procedure
    return self._query_runner.run_query_with_logging(query, params)

Test

This is just a test

uri keyword parameter missing

The uri keyword parameter is missing for GraphDataScience and the Neo4j Python client. This creates an inconsistency, so you cannot use **credentials unpacking in Python to spread all credential parameters at once.

As an example, with this beauty:

credentials = {
    'uri': 'bolt://data.test.com',
    'auth': ('user', 'password'),
    'database': 'test'
}

We could simply run:

GraphDataScience(**credentials)

Instead, we have to do this mess because the uri keyword is missing and not consistent with the keyword pattern.

GraphDataScience(credentials['uri'], auth=credentials['auth'], database=credentials['database'])
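
Until such a keyword exists, a small helper can keep the call site tidy. This is a sketch only: split_credentials is a hypothetical function, and the commented-out constructor call assumes a reachable server and that the first argument is accepted positionally, as in the line above.

```python
def split_credentials(credentials):
    """Separate 'uri' from the remaining settings without mutating the input."""
    rest = dict(credentials)
    uri = rest.pop("uri")
    return uri, rest


credentials = {
    "uri": "bolt://data.test.com",
    "auth": ("user", "password"),
    "database": "test",
}

uri, kwargs = split_credentials(credentials)
# With the client installed and a server available:
# gds = GraphDataScience(uri, **kwargs)
print(uri, sorted(kwargs))
```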

Neo.ClientError.Procedure.ProcedureCallFailed when constructing a GDS graph object from DataFrame

Neo4j Version: 5.10
Operating System: Ubuntu 20.04
API: Docker

Hello! When I use Graph Data Science to construct a graph with 1 node and 0 relationships, Neo4j throws a strange exception. I believe that this behavior may be unexpected and related to a potential bug since we should allow running some algorithms in the graph without relationships. Could you further confirm and investigate it? It would be highly appreciated!

Steps to reproduce

Run the following python code:

import pandas

nodes = pandas.DataFrame(
    {
        "nodeId" : [0],
        "labels" : ["person"],
    }
)

relationships = pandas.DataFrame(
    {
        "sourceNodeId" : [],
        "targetNodeId" : [],
        "relationshipType" : []
    }
)

gds.graph.construct(
    graph_name="my_graph",
    nodes=nodes,
    relationships=relationships
)

Expected behavior

Successfully constructing the graph object.

Actual behavior

Neo.ClientError.Procedure.ProcedureCallFailed

message: Failed to invoke function gds.graph.project: Caused by: java.lang.IllegalArgumentException: The node has to be either a NODE or an INTEGER, but got Double.

Collaborate on a GDS + DGL pipeline?

Is your feature request related to a problem? Please describe.

Dear GDS team,

I came across your talk and found this nice library. Nice work! My name is Minjie. I'm the tech lead of the Deep Graph Library (DGL) project (homepage, github). DGL is one of the most-used libraries for deep learning models on graphs, such as Graph Neural Networks. It provides the graph as the core programming abstraction, efficient GPU implementations, and scalable multi-GPU or distributed solutions for training on massive, industry-scale graphs. Would you be interested in adding an example to showcase the integration of GDS and DGL? This would be extremely valuable for the entire community. We are also generally interested in more collaboration with the Neo4j community to see how we can bring Graph ML to the vast world of graph data customers. I look forward to your thoughts!

Describe the solution you would like

I'm open to any suggestions. Current thought is to have an example similar to this.

Describe alternatives you have considered

None.

Additional context

gds.degree.stream not respecting orientation?

Describe the bug
I'm trying to reproduce a similar result that I would get in GDS.

To Reproduce
Cypher code:

call gds.graph.project('xdc-test-search', ['Event','Search'], { HAS_SEARCH:{orientation:'REVERSE'}})
call gds.degree.stream('xdc-test-search')
YIELD nodeId, score
return gds.util.asNode(nodeId).search_name_1 as SearchTerm, score As NumberOfSearches
Order by NumberOfSearches Descending, SearchTerm Limit 10;

The above returns the Search nodes.

Python code:

node_projection = ["Event","Search"]
relationship_projection = {"HAS_SEARCH": {"orientation": "REVERSE"}}
G, _ = gds.graph.project("xdc-test-search", node_projection, relationship_projection)
degree_stream = gds.degree.stream(G)

degree.stream above is returning the node ids of the Event nodes (other side of direction), even if I were to use G = gds.graph.get("xdc-test-search") and use the projection created with cypher.

I'm sure it's something I'm doing on my end :)

graphdatascience library version: 1.3
GDS plugin version: 2.1.7
Python version: 3.9.12
Neo4j version: 4.4.10
Operating system: macOS 12.6

The image below is what I would expect from the python approach.
image
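
To make the orientation question concrete, here is a plain-Python sketch of how a REVERSE projection changes out-degree. It assumes that the degree score counts outgoing relationships in the projected graph and that every projected node gets a row, including those with a score of 0; the node names and helpers are made up for illustration.

```python
from collections import Counter

# Hypothetical stored relationships, each (Event)-[:HAS_SEARCH]->(Search).
stored = [("event1", "searchA"), ("event2", "searchA"), ("event3", "searchB")]


def project(relationships, orientation):
    """Return the relationship list as seen by the projected graph."""
    if orientation == "NATURAL":
        return list(relationships)
    if orientation == "REVERSE":
        return [(target, source) for source, target in relationships]
    raise ValueError(f"unsupported orientation: {orientation}")


def out_degree(relationships):
    """Out-degree per node; nodes with no outgoing relationships score 0."""
    nodes = {node for pair in relationships for node in pair}
    counts = Counter(source for source, _ in relationships)
    return {node: counts.get(node, 0) for node in nodes}


natural = out_degree(project(stored, "NATURAL"))
reverse = out_degree(project(stored, "REVERSE"))
print(sorted(reverse.items()))
```

Under these assumptions the Event nodes still appear in the REVERSE result, only with a score of 0, so sorting by score in descending order brings the Search nodes to the top.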

Set target database within initial constructor call

Is your feature request related to a problem? Please describe.
I would like to set the database I plan to use upon initializing my gds object GraphDataScience(URI, auth=creds, database='my-db') rather than having to call gds.set_database("my-db"). This would purely be a simple convenience to save a line of code.

Describe the solution you would like
Allow the setting of the database within the constructor call

nodeLabels are not returned appropriately

When a gds graph is constructed using gds.graph.construct, the nodeLabels are not returned appropriately.

nodes_df = pd.DataFrame([[1, 'Paper',11],[2,'Paper',22]], columns=['nodeId','labels','topic'])  
edges_df = pd.DataFrame([[1,2,'Cites']], columns = ['sourceNodeId','targetNodeId','relationshipType'])  
G = gds.graph.construct("test", nodes_df, edges_df)  
print(gds.graph.nodeProperties.stream(G, ["topic"]))  
# results 
#	nodeId	nodeProperty	propertyValue	nodeLabels
#0	1	topic	11	[]
#1	2	topic	22	[]

Return query results as pandas dataframe

A lot of data scientists like to use pandas for various data aggregations and manipulations, so it would make sense to add another method that returns a pandas DataFrame.

This should be quite easy, as the current return value is a list of dicts, so all you need is

pd.DataFrame.from_dict()
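
For a list of dicts, the plain DataFrame constructor is enough. A minimal sketch, assuming pandas is installed; the rows below are made-up stand-ins for a query result, not actual client output.

```python
import pandas as pd

# Made-up rows standing in for a query result returned as a list of dicts.
rows = [
    {"nodeId": 0, "score": 0.15},
    {"nodeId": 1, "score": 0.42},
]

# Each dict becomes one row; the keys become the column names.
df = pd.DataFrame(rows)
print(df)
```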

GDS project.cypher progressbar only takes node loading into account?

graphdatascience 1.6

I'm running a fairly large query:


    print("Projecting graph %s" % graphname)
    #this next line is fairly slow - "unlicensed GDS" can only support a readConcurrency of 4 or less. We have to have the
    #relationship validation turned off because some f2's are not included in the repo! But if we do a join of those two
    #sets, it will take literally forever. 
    #I've looked into it though - and if you are a GDS licensed user you can set readConcurrency higher,
    #but that's basically the only way to improve this (sorry)
    G, res = gds.graph.project.cypher(graphname, "MATCH (repo:Repo {db_key: $db_key})<-[:INCLUDED]-(f:File) RETURN id(f) as id",
                                      """MATCH (repo:Repo {db_key: $db_key})<-[:INCLUDED]-(f:File)-[r:CONNECTED_TO]->(f2:File)
                                         RETURN id(f) as source, id(f2) as target, type(r) as type, r.weight as weight""",
                                      parameters = params, validateRelationships=False)
    print("Projected graph %s" % graphname)

What I see obviously is this:
image

If I was able to attach an animation you'd see the bar go to 100% and then sit there for as long as it takes the second part of the query to run (which can be a very long time).

Perhaps I'm wrong about this but my feeling is the progressbar is tied to the node loading and not the relationship loading!

Thanks!

GraphSAGE doesn't train with differing node properties

Describe the bug
When training GraphSAGE on a projection with differing node properties, an error is returned even though GDS supports this functionality according to the documentation.

To Reproduce
G, res = gds.graph.project(
    "graph",
    # NODES
    [
        "Label1",
        "Label2",
        {"Label3": {"properties": ['overall', 'appearance', 'taste', 'aroma', 'palate']}},
    ],
    # RELS
    [
        {"REL1": {"orientation": 'UNDIRECTED'}},
        {"REL2": {"orientation": 'NATURAL'}},
    ],
)

model, res = gds.beta.graphSage.train(
    G,
    modelName="myModel",
    featureProperties=['overall', 'appearance', 'taste', 'aroma', 'palate'],
)

res = model.predict_stream(G)

ClientError: {code: Neo.ClientError.Procedure.ProcedureCallFailed} {message: Failed to invoke procedure gds.beta.graphSage.train: Caused by: java.lang.IllegalArgumentException: The following node properties are not present for each label in the graph: [overall, appearance, taste, aroma, palate]. Properties that exist for each label are []}

GDS Doc Snippet for this functionality:

I am able to train the model by calling the following functions directly, but not with the GDS python client.

CALL gds.graph.project(
  'persons_with_instruments',
  {
    Person: {
      properties: ['age', 'heightAndWeight']
    },
    Instrument: {
      properties: ['cost']
    }
  }, {
    KNOWS: {
      orientation: 'UNDIRECTED'
    },
    LIKES: {
      orientation: 'UNDIRECTED'
    }
  })

CALL gds.beta.graphSage.train(
  'persons_with_instruments',
  {
    modelName: 'multiLabelModel',
    featureProperties: ['age', 'heightAndWeight', 'cost'],
    projectedFeatureDimension: 4
  }
)

graphdatascience library version: 1.5
GDS plugin version: 2.0.0-alpha
Python version: 3.8.8
Neo4j version: 4.4.10

Steps to reproduce the behavior:

  • See above code

Expected behavior
I should be able to access all GDS functionality via the client
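
The error message reads as if the feature properties were compared against the properties shared by every label. The sketch below reproduces that comparison in plain Python for the projection in this report. It is a guess at the shape of the check based only on the message text, not the GDS implementation, and both helper names are hypothetical.

```python
# Properties per label, as in the projection from this report.
node_projection = {
    "Label1": [],
    "Label2": [],
    "Label3": ["overall", "appearance", "taste", "aroma", "palate"],
}
feature_properties = ["overall", "appearance", "taste", "aroma", "palate"]


def properties_on_every_label(projection):
    """Properties that are projected for all labels."""
    property_sets = [set(props) for props in projection.values()]
    return set.intersection(*property_sets) if property_sets else set()


def missing_on_some_label(projection, features):
    """Feature properties that at least one label does not have."""
    shared = properties_on_every_label(projection)
    return [name for name in features if name not in shared]


print(properties_on_every_label(node_projection))
print(missing_on_some_label(node_projection, feature_properties))
```

Label1 and Label2 carry no properties here, so nothing is shared across all labels, which matches the empty list at the end of the error message.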

Add convenience methods over train_result from `pipeline.train()`

Is your feature request related to a problem? Please describe.

Training a pipeline returns a train_result, which consists of nested dictionaries.
These are a bit hard to work with.

To make our lives a bit easier, we could add convenience methods around them, similar to how we lately added convenience methods to Model objects #119.

Ideally, this change does not break the API.
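
As a rough idea of what such a convenience could look like without touching the existing return value, nested dictionaries can be flattened into dotted keys. The helper flatten is hypothetical, and the train_result shape below is made up for illustration; the real keys may differ.

```python
def flatten(nested, prefix=""):
    """Flatten nested dicts into a single dict with dotted key paths."""
    flat = {}
    for key, value in nested.items():
        path = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


# Made-up shape for illustration; real train_result keys may differ.
train_result = {
    "trainMillis": 120,
    "modelInfo": {"metrics": {"ACCURACY": {"test": 0.8}}},
}

flat = flatten(train_result)
print(flat["modelInfo.metrics.ACCURACY.test"])
```

Because the helper only reads the dictionary, it could sit alongside the current return value and would not break the API.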

Additional context

Please drop a comment if you would like to work on this :)

issue when running the example kge-predict-transe-pyg-train.ipynb

Describe the bug
In the Jupyter notebook "kge-predict-transe-pyg-train.ipynb", the function create_data_from_graph(relationship_type) doesn't produce the expected result. Instead, an error occurs with the message "Index contains duplicate entries, cannot reshape" when calling train_tensor_data = create_data_from_graph(["TRAIN"]). Plus, "TRAIN" must be between square brackets "[]".

To Reproduce

graphdatascience library version: X.Y
GDS plugin version: X.Y.Z
Python version: X.Y.Z
Neo4j version: X.Y.Z
Operating system: (for example Windows 95/Ubuntu 16.04)

Steps to reproduce the behavior:

  • Graph/dataset to reproduce the bug. Alternatively describe as much as possible the data, e.g., graph size and other schema/structure details.
  • Python code
  • Other queries not run through the python client

Expected behavior

Additional context
