Graph Analytics Serverless for non-Neo4j data sources
This Jupyter notebook is hosted here in the Neo4j Graph Data Science Client Github repository.
The notebook shows how to use the graphdatascience
Python library to
create, manage, and use a GDS Session.
We consider a graph of people and fruits, which we’re using as a simple
example to show how to load data from Pandas DataFrame
to a GDS
Session, run algorithms, and inspect the results. We will cover all
management operations: creation, listing, and deletion.
If you are using AuraDB, follow this example. If you are using a self-managed Neo4j instance, follow this example.
1. Prerequisites
This notebook requires having the Graph Analytics Serverless feature enabled for your Neo4j Aura project.
You also need to have the graphdatascience
Python library installed,
version 1.15
or later.
%pip install "graphdatascience>=1.15"
2. Aura API credentials
The entry point for managing GDS Sessions is the GdsSessions
object,
which requires creating
Aura API credentials.
import os
from graphdatascience.session import AuraAPICredentials, GdsSessions
client_id = os.environ["AURA_API_CLIENT_ID"]
client_secret = os.environ["AURA_API_CLIENT_SECRET"]
# If your account is a member of several projects, you must also specify the project ID to use
project_id = os.environ.get("AURA_API_PROJECT_ID", None)
sessions = GdsSessions(api_credentials=AuraAPICredentials(client_id, client_secret, project_id=project_id))
3. Creating a new session
A new session is created by calling sessions.get_or_create()
with the following parameters:
-
A session name, which lets you reconnect to an existing session by calling
get_or_create
again. -
The session memory.
-
The cloud location.
-
A time-to-live (TTL), which ensures that the session is automatically deleted after being unused for the set time, to avoid incurring costs.
See the API reference documentation or the manual for more details on the parameters.
from graphdatascience.session import AlgorithmCategory, CloudLocation, SessionMemory
# Explicitly define the size of the session
memory = SessionMemory.m_4GB
# Estimate the memory needed for the GDS session
memory = sessions.estimate(
node_count=20,
relationship_count=50,
algorithm_categories=[AlgorithmCategory.CENTRALITY, AlgorithmCategory.NODE_EMBEDDING],
)
print(f"Estimated memory: {memory}")
# Specify your cloud location
cloud_location = CloudLocation("gcp", "europe-west1")
# You can find available cloud locations by calling
cloud_locations = sessions.available_cloud_locations()
print(f"Available locations: {cloud_locations}")
from datetime import timedelta
# Create a GDS session!
gds = sessions.get_or_create(
# we give it a representative name
session_name="people-and-fruits-standalone",
memory=memory,
ttl=timedelta(minutes=30),
cloud_location=cloud_location,
)
4. Listing sessions
You can use sessions.list()
to see the details for each created
session.
from pandas import DataFrame
gds_sessions = sessions.list()
# for better visualization
DataFrame(gds_sessions)
5. Adding a dataset
We assume that the configured Neo4j database instance is empty. We will add our dataset using standard Cypher.
In a more realistic scenario, this step is already done, and we would just connect to the existing database.
import pandas as pd
people_df = pd.DataFrame(
[
{"nodeId": 0, "name": "Dan", "age": 18, "experience": 63, "hipster": 0},
{"nodeId": 1, "name": "Annie", "age": 12, "experience": 5, "hipster": 0},
{"nodeId": 2, "name": "Matt", "age": 22, "experience": 42, "hipster": 0},
{"nodeId": 3, "name": "Jeff", "age": 51, "experience": 12, "hipster": 0},
{"nodeId": 4, "name": "Brie", "age": 31, "experience": 6, "hipster": 0},
{"nodeId": 5, "name": "Elsa", "age": 65, "experience": 23, "hipster": 0},
{"nodeId": 6, "name": "Bobby", "age": 38, "experience": 4, "hipster": 1},
{"nodeId": 7, "name": "John", "age": 4, "experience": 100, "hipster": 0},
]
)
people_df["labels"] = "Person"
fruits_df = pd.DataFrame(
[
{"nodeId": 8, "name": "Apple", "tropical": 0, "sourness": 0.3, "sweetness": 0.6},
{"nodeId": 9, "name": "Banana", "tropical": 1, "sourness": 0.1, "sweetness": 0.9},
{"nodeId": 10, "name": "Mango", "tropical": 1, "sourness": 0.3, "sweetness": 1.0},
{"nodeId": 11, "name": "Plum", "tropical": 0, "sourness": 0.5, "sweetness": 0.8},
]
)
fruits_df["labels"] = "Fruit"
like_relationships = [(0, 8), (1, 9), (2, 10), (3, 10), (4, 9), (5, 11), (7, 11)]
likes_df = pd.DataFrame([{"sourceNodeId": src, "targetNodeId": trg} for (src, trg) in like_relationships])
likes_df["relationshipType"] = "LIKES"
knows_relationship = [(0, 1), (0, 2), (1, 2), (1, 3), (1, 4), (2, 5), (7, 3)]
knows_df = pd.DataFrame([{"sourceNodeId": src, "targetNodeId": trg} for (src, trg) in knows_relationship])
knows_df["relationshipType"] = "KNOWS"
6. Construct Graph from DataFrames
Now that we have imported a graph to our database, we create graphs
directly from pandas DataFrame
objects. We do that by using the
gds.graph.construct()
method.
# Dropping `name` column as GDS does not support string properties
nodes = [people_df.drop(columns="name"), fruits_df.drop(columns="name")]
relationships = [likes_df, knows_df]
G = gds.graph.construct("people-fruits", nodes, relationships)
str(G)
7. Running Algorithms
You can run algorithms on the constructed graph using the standard GDS Python Client API. See the other tutorials for more examples.
print("Running PageRank ...")
pr_result = gds.pageRank.mutate(G, mutateProperty="pagerank")
print(f"Compute millis: {pr_result['computeMillis']}")
print(f"Node properties written: {pr_result['nodePropertiesWritten']}")
print(f"Centrality distribution: {pr_result['centralityDistribution']}")
print("Running FastRP ...")
frp_result = gds.fastRP.mutate(
G,
mutateProperty="fastRP",
embeddingDimension=8,
featureProperties=["pagerank"],
propertyRatio=0.2,
nodeSelfInfluence=0.2,
)
print(f"Compute millis: {frp_result['computeMillis']}")
# stream back the results
result = gds.graph.nodeProperties.stream(G, ["pagerank", "fastRP"], separate_property_columns=True)
result
To resolve each nodeId
to name, we can merge it back with the source
data frames.
names = pd.concat([people_df, fruits_df])[["nodeId", "name"]]
result.merge(names, how="left")
8. Deleting the session
After the analysis is done, you can delete the session. As this example is not connected to a Neo4j DB, you need to make sure the algorithm results are persisted on your own.
Deleting the session will release all resources associated with it, and stop incurring costs.
# or gds.delete()
sessions.delete(session_name="people-and-fruits-standalone")
# let's also make sure the deleted session is truly gone:
sessions.list()