Graph Analytics Serverless for non-Neo4j data sources
1. Prerequisites
This notebook requires Graph Analytics Serverless to be enabled for your Neo4j Aura project.
You also need the graphdatascience Python library installed, version 1.15 or later.
%pip install "graphdatascience>=1.15"
2. Aura API credentials
The entry point for managing GDS sessions is the GdsSessions object, which requires Aura API credentials.
import os
from graphdatascience.session import AuraAPICredentials, GdsSessions
client_id = os.environ["AURA_API_CLIENT_ID"]
client_secret = os.environ["AURA_API_CLIENT_SECRET"]
# If your account is a member of several projects, you must also specify the project ID to use
project_id = os.environ.get("AURA_API_PROJECT_ID", None)
sessions = GdsSessions(api_credentials=AuraAPICredentials(client_id, client_secret, project_id=project_id))
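Reading the variables with os.environ[...] raises a bare KeyError as soon as one of them is missing. As a minimal sketch, the hypothetical helper read_aura_credentials below (it is not part of graphdatascience) reports all missing names in a single error. It accepts any mapping, so it can be tried without real credentials.

```python
from typing import Mapping, Optional, Tuple

REQUIRED_VARS = ("AURA_API_CLIENT_ID", "AURA_API_CLIENT_SECRET")
OPTIONAL_PROJECT_VAR = "AURA_API_PROJECT_ID"


def read_aura_credentials(env: Mapping[str, str]) -> Tuple[str, str, Optional[str]]:
    """Return (client_id, client_secret, project_id) from a mapping such as os.environ."""
    # Collect every required name that is absent or empty before failing
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return env[REQUIRED_VARS[0]], env[REQUIRED_VARS[1]], env.get(OPTIONAL_PROJECT_VAR)


# Placeholder values, not real credentials
demo_env = {"AURA_API_CLIENT_ID": "my-id", "AURA_API_CLIENT_SECRET": "my-secret"}
demo_id, demo_secret, demo_project = read_aura_credentials(demo_env)
print(demo_id, demo_project)
```

In the notebook itself, the same helper could be called with os.environ and its result passed on to AuraAPICredentials.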
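3. Creating a new session
A new session is created by calling sessions.get_or_create() with the following parameters:
- A session name, which lets you reconnect to an existing session by calling get_or_create again.
- The session memory.
- The cloud location.
- A time-to-live (TTL), which ensures that the session is automatically deleted after being unused for the set time, to avoid incurring costs.
For more details on the parameters, see the API reference documentation or the manual.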
from graphdatascience.session import AlgorithmCategory, CloudLocation, SessionMemory
# Option 1: explicitly define the size of the session
memory = SessionMemory.m_4GB
# Option 2: estimate the memory needed for the GDS session (this overrides the value above)
memory = sessions.estimate(
    node_count=20,
    relationship_count=50,
    algorithm_categories=[AlgorithmCategory.CENTRALITY, AlgorithmCategory.NODE_EMBEDDING],
)
print(f"Estimated memory: {memory}")
# Specify your cloud location
cloud_location = CloudLocation("gcp", "europe-west1")
# You can find available cloud locations by calling
cloud_locations = sessions.available_cloud_locations()
print(f"Available locations: {cloud_locations}")
from datetime import timedelta
# Create a GDS session!
gds = sessions.get_or_create(
    # we give it a representative name
    session_name="people-and-fruits-standalone",
    memory=memory,
    ttl=timedelta(minutes=30),
    cloud_location=cloud_location,
)
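The ttl argument is an ordinary datetime.timedelta. The sketch below only illustrates the idle-timeout idea described in the parameter list. The actual expiry is handled on the Aura side, and idle_expiry is a hypothetical name used for illustration.

```python
from datetime import datetime, timedelta, timezone


def idle_expiry(last_used: datetime, ttl: timedelta) -> datetime:
    """Earliest moment at which a session idle since `last_used` could be removed."""
    return last_used + ttl


demo_ttl = timedelta(minutes=30)
demo_last_used = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
demo_expires_at = idle_expiry(demo_last_used, demo_ttl)
print(demo_expires_at.isoformat())  # 2025-01-01T12:30:00+00:00
```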
4. Listing sessions
You can use sessions.list() to see the details of each created session.
from pandas import DataFrame
gds_sessions = sessions.list()
# for better visualization
DataFrame(gds_sessions)
5. Adding a dataset
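Because this example does not use a Neo4j database as its data source, we define the dataset directly as pandas DataFrames: one for people, one for fruits, and one for each relationship type.
In a more realistic scenario, the data would already exist in an external source and would only need to be loaded into DataFrames.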
import pandas as pd
people_df = pd.DataFrame(
    [
        {"nodeId": 0, "name": "Dan", "age": 18, "experience": 63, "hipster": 0},
        {"nodeId": 1, "name": "Annie", "age": 12, "experience": 5, "hipster": 0},
        {"nodeId": 2, "name": "Matt", "age": 22, "experience": 42, "hipster": 0},
        {"nodeId": 3, "name": "Jeff", "age": 51, "experience": 12, "hipster": 0},
        {"nodeId": 4, "name": "Brie", "age": 31, "experience": 6, "hipster": 0},
        {"nodeId": 5, "name": "Elsa", "age": 65, "experience": 23, "hipster": 0},
        {"nodeId": 6, "name": "Bobby", "age": 38, "experience": 4, "hipster": 1},
        {"nodeId": 7, "name": "John", "age": 4, "experience": 100, "hipster": 0},
    ]
)
people_df["labels"] = "Person"
fruits_df = pd.DataFrame(
    [
        {"nodeId": 8, "name": "Apple", "tropical": 0, "sourness": 0.3, "sweetness": 0.6},
        {"nodeId": 9, "name": "Banana", "tropical": 1, "sourness": 0.1, "sweetness": 0.9},
        {"nodeId": 10, "name": "Mango", "tropical": 1, "sourness": 0.3, "sweetness": 1.0},
        {"nodeId": 11, "name": "Plum", "tropical": 0, "sourness": 0.5, "sweetness": 0.8},
    ]
)
fruits_df["labels"] = "Fruit"
like_relationships = [(0, 8), (1, 9), (2, 10), (3, 10), (4, 9), (5, 11), (7, 11)]
likes_df = pd.DataFrame([{"sourceNodeId": src, "targetNodeId": trg} for (src, trg) in like_relationships])
likes_df["relationshipType"] = "LIKES"
knows_relationship = [(0, 1), (0, 2), (1, 2), (1, 3), (1, 4), (2, 5), (7, 3)]
knows_df = pd.DataFrame([{"sourceNodeId": src, "targetNodeId": trg} for (src, trg) in knows_relationship])
knows_df["relationshipType"] = "KNOWS"
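Before handing the data to GDS, it can help to confirm that every relationship endpoint refers to a node that actually exists. The following is a minimal sketch in plain Python; dangling_endpoints is a hypothetical helper, and the IDs are copied from the DataFrames above.

```python
from typing import Iterable, List, Set, Tuple


def dangling_endpoints(node_ids: Set[int], pairs: Iterable[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Return every (source, target) pair that mentions an unknown node ID."""
    return [(src, trg) for src, trg in pairs if src not in node_ids or trg not in node_ids]


# People use the IDs 0-7 and fruits use the IDs 8-11
all_node_ids = set(range(12))
likes = [(0, 8), (1, 9), (2, 10), (3, 10), (4, 9), (5, 11), (7, 11)]
knows = [(0, 1), (0, 2), (1, 2), (1, 3), (1, 4), (2, 5), (7, 3)]

print(dangling_endpoints(all_node_ids, likes + knows))  # []
print(dangling_endpoints(all_node_ids, [(0, 12)]))  # [(0, 12)]
```

With the DataFrames at hand, the set of node IDs could instead be built from the nodeId columns of people_df and fruits_df.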
6. Constructing the graph from DataFrames
Now that the dataset is available as pandas DataFrame objects, we can create a graph directly from them by calling the gds.graph.construct() method.
# Dropping `name` column as GDS does not support string properties
nodes = [people_df.drop(columns="name"), fruits_df.drop(columns="name")]
relationships = [likes_df, knows_df]
G = gds.graph.construct("people-fruits", nodes, relationships)
str(G)
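gds.graph.construct() identifies nodes and relationships by column name: node DataFrames need a nodeId column, relationship DataFrames need sourceNodeId and targetNodeId, and the labels and relationshipType columns are optional. The remaining columns are loaded as properties. As a rough sketch, the hypothetical helper missing_columns below checks a list of column names against those requirements before the call is made.

```python
from typing import Iterable, List

REQUIRED_COLUMNS = {
    "nodes": ["nodeId"],
    "relationships": ["sourceNodeId", "targetNodeId"],
}


def missing_columns(kind: str, columns: Iterable[str]) -> List[str]:
    """List the required columns of the given kind that are absent from `columns`."""
    present = set(columns)
    return [name for name in REQUIRED_COLUMNS[kind] if name not in present]


print(missing_columns("nodes", ["nodeId", "age", "experience", "hipster", "labels"]))  # []
print(missing_columns("relationships", ["sourceNodeId", "relationshipType"]))  # ['targetNodeId']
```

For a DataFrame df, the column names can be passed as df.columns.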
7. Running algorithms
You can run algorithms on the constructed graph using the standard GDS Python client API. See the other tutorials for more examples.
print("Running PageRank ...")
pr_result = gds.pageRank.mutate(G, mutateProperty="pagerank")
print(f"Compute millis: {pr_result['computeMillis']}")
print(f"Node properties written: {pr_result['nodePropertiesWritten']}")
print(f"Centrality distribution: {pr_result['centralityDistribution']}")
print("Running FastRP ...")
frp_result = gds.fastRP.mutate(
    G,
    mutateProperty="fastRP",
    embeddingDimension=8,
    featureProperties=["pagerank"],
    propertyRatio=0.2,
    nodeSelfInfluence=0.2,
)
print(f"Compute millis: {frp_result['computeMillis']}")
# stream back the results
result = gds.graph.nodeProperties.stream(G, ["pagerank", "fastRP"], separate_property_columns=True)
result
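To give a feel for what the pagerank property represents, here is a textbook power-iteration PageRank in plain Python, run on the same edge list. This is an illustrative sketch, not the GDS implementation: the damping factor, the iteration count, the uniform redistribution of rank from nodes without outgoing edges, and the normalization to a total of 1 are all assumptions made here, so the absolute values are not expected to match those returned by gds.pageRank.

```python
from typing import Dict, List, Tuple


def pagerank(
    node_ids: List[int],
    edges: List[Tuple[int, int]],
    damping: float = 0.85,
    iterations: int = 20,
) -> Dict[int, float]:
    """Textbook power-iteration PageRank on a directed edge list."""
    n = len(node_ids)
    out_neighbors: Dict[int, List[int]] = {node: [] for node in node_ids}
    for src, trg in edges:
        out_neighbors[src].append(trg)

    rank = {node: 1.0 / n for node in node_ids}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread evenly over all nodes
        dangling = sum(rank[node] for node in node_ids if not out_neighbors[node])
        new_rank = {node: (1.0 - damping) / n + damping * dangling / n for node in node_ids}
        # Every other node passes its rank on in equal shares along its outgoing edges
        for src, targets in out_neighbors.items():
            for trg in targets:
                new_rank[trg] += damping * rank[src] / len(targets)
        rank = new_rank
    return rank


likes = [(0, 8), (1, 9), (2, 10), (3, 10), (4, 9), (5, 11), (7, 11)]
knows = [(0, 1), (0, 2), (1, 2), (1, 3), (1, 4), (2, 5), (7, 3)]
scores = pagerank(list(range(12)), likes + knows)
for node, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(node, round(score, 4))
```

Nodes without incoming relationships, such as node 6, end up with the lowest score in this formulation, because they receive only the baseline share.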
To resolve each nodeId to a name, we can merge the result with the source DataFrames.
names = pd.concat([people_df, fruits_df])[["nodeId", "name"]]
result.merge(names, how="left")