Graph Analytics Serverless for Self-Managed Neo4j DB
This Jupyter notebook is hosted here in the Neo4j Graph Data Science Client Github repository.
The notebook shows how to use the graphdatascience
Python library to
create, manage, and use a GDS Session.
We consider a graph of people and fruits, which we’re using as a simple example to show how to connect your self-managed Neo4j database to a GDS Session, run algorithms, and eventually write back your analytical results to your Neo4j database. We will cover all management operations: creation, listing, and deletion.
If you are using AuraDB, follow this example.
1. Prerequisites
This notebook requires having a Neo4j instance available and that the Graph Analytics Serverless feature is enabled for your Neo4j Aura project.
You also need to have the graphdatascience
Python library installed,
version 1.15
or later.
%pip install "graphdatascience>=1.15"
2. Aura API credentials
The entry point for managing GDS Sessions is the GdsSessions
object,
which requires creating
Aura API credentials.
import os
from graphdatascience.session import AuraAPICredentials, GdsSessions
client_id = os.environ["AURA_API_CLIENT_ID"]
client_secret = os.environ["AURA_API_CLIENT_SECRET"]
# If your account is a member of several projects, you must also specify the project ID to use
project_id = os.environ.get("AURA_API_PROJECT_ID", None)
sessions = GdsSessions(api_credentials=AuraAPICredentials(client_id, client_secret, project_id=project_id))
3. Creating a new session
A new session is created by calling sessions.get_or_create()
with the following parameters:
-
A session name, which lets you reconnect to an existing session by calling
get_or_create
again. -
The
DbmsConnectionInfo
containing the address, user name and password for an accessible self-manged Neo4j DBMS instance -
The session memory.
-
The cloud location.
-
A time-to-live (TTL), which ensures that the session is automatically deleted after being unused for the set time, to avoid incurring costs.
See the API reference documentation or the manual for more details on the parameters.
from graphdatascience.session import AlgorithmCategory, CloudLocation, SessionMemory
# Explicitly define the size of the session
memory = SessionMemory.m_8GB
# Estimate the memory needed for the GDS session
memory = sessions.estimate(
node_count=20,
relationship_count=50,
algorithm_categories=[AlgorithmCategory.CENTRALITY, AlgorithmCategory.NODE_EMBEDDING],
)
print(f"Estimated memory: {memory}")
# Specify your cloud location
cloud_location = CloudLocation("gcp", "europe-west1")
# You can find available cloud locations by calling
cloud_locations = sessions.available_cloud_locations()
print(f"Available locations: {cloud_locations}")
import os
from datetime import timedelta
from graphdatascience.session import DbmsConnectionInfo
# Identify the Neo4j DBMS
db_connection = DbmsConnectionInfo(
uri=os.environ["NEO4J_URI"], username=os.environ["NEO4J_USER"], password=os.environ["NEO4J_PASSWORD"]
)
# Create a GDS session!
gds = sessions.get_or_create(
# give it a representative name
session_name="people-and-fruits-sm",
memory=memory,
db_connection=db_connection,
ttl=timedelta(minutes=30),
cloud_location=cloud_location,
)
4. Listing sessions
You can use sessions.list()
to see the details for each created
session.
from pandas import DataFrame
gds_sessions = sessions.list()
# for better visualization
DataFrame(gds_sessions)
5. Adding a dataset
We assume that the configured Neo4j database instance is empty. You will add the dataset using standard Cypher.
In a more realistic scenario, this step is already done, and you would just connect to the existing database.
data_query = """
CREATE
(dan:Person {name: 'Dan', age: 18, experience: 63, hipster: 0}),
(annie:Person {name: 'Annie', age: 12, experience: 5, hipster: 0}),
(matt:Person {name: 'Matt', age: 22, experience: 42, hipster: 0}),
(jeff:Person {name: 'Jeff', age: 51, experience: 12, hipster: 0}),
(brie:Person {name: 'Brie', age: 31, experience: 6, hipster: 0}),
(elsa:Person {name: 'Elsa', age: 65, experience: 23, hipster: 1}),
(john:Person {name: 'John', age: 4, experience: 100, hipster: 0}),
(apple:Fruit {name: 'Apple', tropical: 0, sourness: 0.3, sweetness: 0.6}),
(banana:Fruit {name: 'Banana', tropical: 1, sourness: 0.1, sweetness: 0.9}),
(mango:Fruit {name: 'Mango', tropical: 1, sourness: 0.3, sweetness: 1.0}),
(plum:Fruit {name: 'Plum', tropical: 0, sourness: 0.5, sweetness: 0.8})
CREATE
(dan)-[:LIKES]->(apple),
(annie)-[:LIKES]->(banana),
(matt)-[:LIKES]->(mango),
(jeff)-[:LIKES]->(mango),
(brie)-[:LIKES]->(banana),
(elsa)-[:LIKES]->(plum),
(john)-[:LIKES]->(plum),
(dan)-[:KNOWS]->(annie),
(dan)-[:KNOWS]->(matt),
(annie)-[:KNOWS]->(matt),
(annie)-[:KNOWS]->(jeff),
(annie)-[:KNOWS]->(brie),
(matt)-[:KNOWS]->(brie),
(brie)-[:KNOWS]->(elsa),
(brie)-[:KNOWS]->(jeff),
(john)-[:KNOWS]->(jeff);
"""
# making sure the database is actually empty
assert gds.run_cypher("MATCH (n) RETURN count(n)").squeeze() == 0, "Database is not empty!"
# let's now write our graph!
gds.run_cypher(data_query)
gds.run_cypher("MATCH (n) RETURN count(n) AS nodeCount")
6. Projecting Graphs
Now that you have imported a graph to the database, you can project it
into our GDS Session. You do that by using the gds.graph.project()
endpoint.
The remote projection query that you are using selects all Person
nodes and their LIKES
relationships, and all Fruit
nodes and their
LIKES
relationships. Additionally, node properties are projected for
illustrative purposes. You can use these node properties as input to
algorithms, although it is note done in this notebook.
G, result = gds.graph.project(
"people-and-fruits",
"""
CALL {
MATCH (p1:Person)
OPTIONAL MATCH (p1)-[r:KNOWS]->(p2:Person)
RETURN
p1 AS source, r AS rel, p2 AS target,
p1 {.age, .experience, .hipster } AS sourceNodeProperties,
p2 {.age, .experience, .hipster } AS targetNodeProperties
UNION
MATCH (f:Fruit)
OPTIONAL MATCH (f)<-[r:LIKES]-(p:Person)
RETURN
p AS source, r AS rel, f AS target,
p {.age, .experience, .hipster } AS sourceNodeProperties,
f { .tropical, .sourness, .sweetness } AS targetNodeProperties
}
RETURN gds.graph.project.remote(source, target, {
sourceNodeProperties: sourceNodeProperties,
targetNodeProperties: targetNodeProperties,
sourceNodeLabels: labels(source),
targetNodeLabels: labels(target),
relationshipType: type(rel)
})
""",
)
str(G)
7. Running Algorithms
You can run algorithms on the constructed graph using the standard GDS Python Client API. See the other tutorials for more examples.
print("Running PageRank ...")
pr_result = gds.pageRank.mutate(G, mutateProperty="pagerank")
print(f"Compute millis: {pr_result['computeMillis']}")
print(f"Node properties written: {pr_result['nodePropertiesWritten']}")
print(f"Centrality distribution: {pr_result['centralityDistribution']}")
print("Running FastRP ...")
frp_result = gds.fastRP.mutate(
G,
mutateProperty="fastRP",
embeddingDimension=8,
featureProperties=["pagerank"],
propertyRatio=0.2,
nodeSelfInfluence=0.2,
)
print(f"Compute millis: {frp_result['computeMillis']}")
# stream back the results
gds.graph.nodeProperties.stream(G, ["pagerank", "fastRP"], separate_property_columns=True, db_node_properties=["name"])
8. Writing back to Neo4j
The GDS Session’s in-memory graph was projected from data in our specified Neo4j database. Write back operations will thus persist the data back to the same Neo4j database. Let’s write back the results of the PageRank and FastRP algorithms to the Neo4j database.
# if this fails once with some error like "unable to retrieve routing table"
# then run it again. this is a transient error with a stale server cache.
gds.graph.nodeProperties.write(G, ["pagerank", "fastRP"])
Of course, you can just use .write
modes as well. Let’s run Louvain in
write mode to show:
gds.louvain.write(G, writeProperty="louvain")
You can now use the gds.run_cypher()
method to query the updated
graph. Note that the run_cypher()
method will run the query on the
Neo4j database.
gds.run_cypher(
"""
MATCH (p:Person)
RETURN p.name, p.pagerank AS rank, p.louvain
ORDER BY rank DESC
"""
)
9. Deleting the session
Now that you have finished the analysis, you can delete the session. The results that you produced were written back to our Neo4j database, and will not be lost. If you computed additional things that you did not write back, those will be lost.
Deleting the session will release all resources associated with it, and stop incurring costs.
# or gds.delete()
sessions.delete(session_name="people-and-fruits-sm")
# let's also make sure the deleted session is truly gone:
sessions.list()
# Lastly, let's clean up the database
gds.run_cypher("MATCH (n:Person|Fruit) DETACH DELETE n")