Product recommendations with kNN based on FastRP embeddings

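This Jupyter notebook is hosted in the Neo4j Graph Data Science Client GitHub repository.

The notebook shows how to use the graphdatascience Python library to operate Neo4j GDS. It is an adapted version of the FastRP and kNN end-to-end example from the GDS Manual.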

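We consider a graph of products and customers, and we want to find new product recommendations for each customer. We use the K-Nearest Neighbors algorithm (kNN) to identify similar customers and base the product recommendations on that. To let kNN take advantage of the topological information in the graph, we first create node embeddings with FastRP. These embeddings are the input to the kNN algorithm.

We then use a Cypher query to generate recommendations for each pair of similar customers, where products purchased by one customer are recommended to the other.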

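1. Prerequisites

Running this notebook requires a Neo4j server with a recent version (2.0+) of GDS installed. We recommend using Neo4j Desktop with GDS, or AuraDS.

The graphdatascience Python library must also be installed. See the examples in the "Setup" section below and in the client installation instructions.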

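2. Setup

We start by installing and importing our dependencies, and then set up the connection between our GDS client and the database.

Alternatively, you can use Aura Graph Analytics Serverless and skip the whole "Setup" section below.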

# Install necessary dependencies
%pip install graphdatascience

import os

from graphdatascience import GraphDataScience

# Get Neo4j DB URI and credentials from environment if applicable
NEO4J_URI = os.environ.get("NEO4J_URI", "bolt://localhost:7687")
NEO4J_AUTH = None
if os.environ.get("NEO4J_USER") and os.environ.get("NEO4J_PASSWORD"):
    NEO4J_AUTH = (
        os.environ.get("NEO4J_USER"),
        os.environ.get("NEO4J_PASSWORD"),
    )

gds = GraphDataScience(NEO4J_URI, auth=NEO4J_AUTH)

from graphdatascience import ServerVersion

assert gds.server_version() >= ServerVersion(1, 8, 0)

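3. Example graph creation

We now create a graph of products and customers in the database. The amount relationship property represents the average weekly amount of money a customer spends on a given product.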

# The `run_cypher` method can be used to run arbitrary Cypher queries on the database.
_ = gds.run_cypher(
    """
        CREATE
         (dan:Person {name: 'Dan'}),
         (annie:Person {name: 'Annie'}),
         (matt:Person {name: 'Matt'}),
         (jeff:Person {name: 'Jeff'}),
         (brie:Person {name: 'Brie'}),
         (elsa:Person {name: 'Elsa'}),

         (cookies:Product {name: 'Cookies'}),
         (tomatoes:Product {name: 'Tomatoes'}),
         (cucumber:Product {name: 'Cucumber'}),
         (celery:Product {name: 'Celery'}),
         (kale:Product {name: 'Kale'}),
         (milk:Product {name: 'Milk'}),
         (chocolate:Product {name: 'Chocolate'}),

         (dan)-[:BUYS {amount: 1.2}]->(cookies),
         (dan)-[:BUYS {amount: 3.2}]->(milk),
         (dan)-[:BUYS {amount: 2.2}]->(chocolate),

         (annie)-[:BUYS {amount: 1.2}]->(cucumber),
         (annie)-[:BUYS {amount: 3.2}]->(milk),
         (annie)-[:BUYS {amount: 3.2}]->(tomatoes),

         (matt)-[:BUYS {amount: 3}]->(tomatoes),
         (matt)-[:BUYS {amount: 2}]->(kale),
         (matt)-[:BUYS {amount: 1}]->(cucumber),

         (jeff)-[:BUYS {amount: 3}]->(cookies),
         (jeff)-[:BUYS {amount: 2}]->(milk),

         (brie)-[:BUYS {amount: 1}]->(tomatoes),
         (brie)-[:BUYS {amount: 2}]->(milk),
         (brie)-[:BUYS {amount: 2}]->(kale),
         (brie)-[:BUYS {amount: 3}]->(cucumber),
         (brie)-[:BUYS {amount: 0.3}]->(celery),

         (elsa)-[:BUYS {amount: 3}]->(chocolate),
         (elsa)-[:BUYS {amount: 3}]->(milk)
    """
)

4. Projecting into GDS

To analyze the data in our database, we project it into memory, where GDS can operate on it.

# We define how we want to project our database into GDS
node_projection = ["Person", "Product"]
relationship_projection = {"BUYS": {"orientation": "UNDIRECTED", "properties": "amount"}}

# Before actually going through with the projection, let's check how much memory is required
result = gds.graph.project.estimate(node_projection, relationship_projection)

print(f"Required memory for native loading: {result['requiredMemory']}")

# For this small graph memory requirement is low. Let us go through with the projection
G, result = gds.graph.project("purchases", node_projection, relationship_projection)

print(f"The projection took {result['projectMillis']} ms")

# We can use convenience methods on `G` to check if the projection looks correct
print(f"Graph '{G.name()}' node count: {G.node_count()}")
print(f"Graph '{G.name()}' node labels: {G.node_labels()}")

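5. Creating FastRP node embeddings

Next we use the FastRP algorithm to generate node embeddings that capture topological information from the graph. We set embeddingDimension to 4, which is sufficient because our example graph is very small. The iterationWeights were chosen empirically to yield sensible results. For more information on these parameters, see the syntax section of the FastRP documentation.

Since we want to use the embeddings as input when we run kNN later, we use FastRP's mutate mode.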

# We can also estimate memory of running algorithms like FastRP, so let's do that first
result = gds.fastRP.mutate.estimate(
    G,
    mutateProperty="embedding",
    randomSeed=42,
    embeddingDimension=4,
    relationshipWeightProperty="amount",
    iterationWeights=[0.8, 1, 1, 1],
)

print(f"Required memory for running FastRP: {result['requiredMemory']}")

# Now let's run FastRP and mutate our projected graph 'purchases' with the results
result = gds.fastRP.mutate(
    G,
    mutateProperty="embedding",
    randomSeed=42,
    embeddingDimension=4,
    relationshipWeightProperty="amount",
    iterationWeights=[0.8, 1, 1, 1],
)

# Let's make sure we got an embedding for each node
print(f"Number of embedding vectors produced: {result['nodePropertiesWritten']}")
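To give a feel for how iterationWeights and relationshipWeightProperty enter the embedding, here is a heavily simplified, pure-Python sketch of the FastRP idea: start from sparse random vectors, repeatedly replace each node's vector with the normalized, weight-scaled sum of its neighbors' vectors, and add up the per-iteration results using the iteration weights. The function name fastrp_sketch, the {-1, 0, 1} initialization, and the toy edge list (a subset of the example graph) are assumptions for illustration only. The real GDS implementation differs in its details (for example degree normalization and the exact random distribution), so these numbers will not match gds.fastRP.mutate.

```python
import math
import random


def fastrp_sketch(adjacency, dim, iteration_weights, seed=42):
    """Simplified FastRP-style embedding (illustrative, not the GDS implementation).

    adjacency maps each node to a list of (neighbor, relationship_weight) pairs.
    """
    rng = random.Random(seed)
    nodes = sorted(adjacency)
    # Sparse random start vectors: most entries are 0, a few are +1 or -1.
    current = {
        n: [rng.choice((-1.0, 0.0, 0.0, 0.0, 0.0, 1.0)) for _ in range(dim)]
        for n in nodes
    }
    result = {n: [0.0] * dim for n in nodes}
    for iteration_weight in iteration_weights:
        following = {}
        for n in nodes:
            # Weighted sum of the neighbors' vectors from the previous iteration.
            vec = [0.0] * dim
            for neighbor, rel_weight in adjacency[n]:
                for d in range(dim):
                    vec[d] += rel_weight * current[neighbor][d]
            # L2-normalize so every iteration contributes on a comparable scale.
            norm = math.sqrt(sum(x * x for x in vec))
            following[n] = [x / norm for x in vec] if norm > 0 else vec
        # Accumulate this iteration's vectors, scaled by its iteration weight.
        for n in nodes:
            for d in range(dim):
                result[n][d] += iteration_weight * following[n][d]
        current = following
    return result


# Toy undirected graph: Annie's and Matt's BUYS relationships from the example.
edges = [
    ("Annie", "Cucumber", 1.2), ("Annie", "Milk", 3.2), ("Annie", "Tomatoes", 3.2),
    ("Matt", "Tomatoes", 3.0), ("Matt", "Kale", 2.0), ("Matt", "Cucumber", 1.0),
]
adjacency = {}
for person, product, amount in edges:
    # UNDIRECTED orientation: each relationship is visible from both endpoints.
    adjacency.setdefault(person, []).append((product, amount))
    adjacency.setdefault(product, []).append((person, amount))

embeddings = fastrp_sketch(adjacency, dim=4, iteration_weights=[0.8, 1, 1, 1])
print(embeddings["Annie"])
```

With a fixed seed the sketch is deterministic, which mirrors the role of randomSeed in the call above.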

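6. Similarities with kNN

Now we can run kNN to identify similar nodes, using the node embeddings we generated with FastRP as nodeProperties. Since we are working with a small graph, we can set sampleRate to 1 and deltaThreshold to 0 without worrying about long computation times. The concurrency parameter is set to 1 (along with a fixed randomSeed) to get a deterministic result. For more information on these parameters, see the syntax section of the kNN documentation.

Note that we use the algorithm's write mode to write the properties and relationships back to the database, so that we can analyze them later with Cypher.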

# Run kNN and write back to db (we skip memory estimation this time...)
result = gds.knn.write(
    G,
    topK=2,
    nodeProperties=["embedding"],
    randomSeed=42,
    concurrency=1,
    sampleRate=1.0,
    deltaThreshold=0.0,
    writeRelationshipType="SIMILAR",
    writeProperty="score",
)

print(f"Relationships produced: {result['relationshipsWritten']}")
print(f"Nodes compared: {result['nodesCompared']}")
print(f"Mean similarity: {result['similarityDistribution']['mean']}")

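As we can see, the mean similarity between nodes is quite high. This is because we have a small example with no long paths between nodes, which leads to many similar FastRP node embeddings.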
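As a conceptual reference for what topK means, the following sketch computes exact nearest neighbors by brute force over a handful of made-up 2-dimensional vectors. It assumes cosine similarity as the metric. The vectors, the node names "A" to "D", and the helper names are hypothetical. GDS kNN itself is an approximate algorithm and its reported scores may be scaled differently, so treat this as an illustration of the idea rather than a reproduction of gds.knn.write.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def brute_force_knn(embeddings, top_k):
    """For every node, return its top_k most similar other nodes with their scores."""
    neighbors = {}
    for node, vec in embeddings.items():
        scored = [
            (other, cosine_similarity(vec, other_vec))
            for other, other_vec in embeddings.items()
            if other != node
        ]
        # Highest similarity first; ties are broken by node name for determinism.
        scored.sort(key=lambda pair: (-pair[1], pair[0]))
        neighbors[node] = scored[:top_k]
    return neighbors


# Hypothetical embeddings, chosen only so the ordering is easy to verify by eye.
toy_embeddings = {
    "A": [1.0, 0.0],
    "B": [0.9, 0.1],
    "C": [0.0, 1.0],
    "D": [-1.0, 0.0],
}

neighbors = brute_force_knn(toy_embeddings, top_k=2)
for node, nearest in neighbors.items():
    print(node, [(name, round(score, 3)) for name, score in nearest])
```

Each node ends up with exactly top_k outgoing neighbor entries, which corresponds to the SIMILAR relationships that the write mode stores in the database.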

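7. Exploring the results

Let us now inspect the results of our kNN call using Cypher. We can use the SIMILAR relationship type to filter for the relationships we are interested in. Since we only care about similarities between people for our product recommendation engine, we make sure to match only nodes with the Person label.

For documentation on how to use Cypher, see the Cypher manual.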

gds.run_cypher(
    """
        MATCH (p1:Person)-[r:SIMILAR]->(p2:Person)
        RETURN p1.name AS person1, p2.name AS person2, r.score AS similarity
        ORDER BY similarity DESCENDING, person1, person2
    """
)

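Our kNN results indicate, among other things, that the Person nodes named "Annie" and "Matt" are very similar. Looking at the BUYS relationships of these two nodes, we can see that this conclusion makes sense. Both buy three products, two of which are the same for both people (the Product nodes named "Cucumber" and "Tomatoes"), in similar amounts. We can therefore have confidence in our approach.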
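The overlap between Annie's and Matt's purchases can be checked directly from the amounts used when the example graph was created. This small sketch copies those values into a plain dictionary. The variable names are ours, and the absolute difference is just one simple way to compare amounts.

```python
# BUYS amounts for Annie and Matt, copied from the example graph creation query.
purchases = {
    "Annie": {"Cucumber": 1.2, "Milk": 3.2, "Tomatoes": 3.2},
    "Matt": {"Tomatoes": 3.0, "Kale": 2.0, "Cucumber": 1.0},
}

# Products that both people buy.
shared = sorted(set(purchases["Annie"]) & set(purchases["Matt"]))

# Absolute difference in weekly amount for each shared product.
amount_gaps = {
    product: round(abs(purchases["Annie"][product] - purchases["Matt"][product]), 2)
    for product in shared
}

print(shared)
print(amount_gaps)
```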

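8. Making recommendations

Using the information we derived, namely that the Person nodes named "Annie" and "Matt" are similar, we can make product recommendations for each of them. Since they are similar, we can assume that products purchased by only one of them may also be of interest to the other. Following this principle, we can derive product recommendations for the Person named "Annie" with a simple Cypher query that returns the products Matt buys and Annie does not.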

gds.run_cypher(
    """
        MATCH (:Person {name: "Annie"})-[:BUYS]->(p1:Product)
        WITH collect(p1) as products
        MATCH (:Person {name: "Matt"})-[:BUYS]->(p2:Product)
        WHERE not p2 in products
        RETURN p2.name as recommendation
    """
)

Indeed, "Kale" is the only product that the Person named "Matt" buys and that the Person named "Annie" does not buy yet.
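The same set-difference logic can be written in plain Python, which makes it easy to see which direction a recommendation goes. The sketch below copies the purchases from the example graph into a dictionary of sets. The recommend helper is a hypothetical name, not part of the graphdatascience library.

```python
# Products bought by each person, copied from the example graph creation query.
purchases = {
    "Dan": {"Cookies", "Milk", "Chocolate"},
    "Annie": {"Cucumber", "Milk", "Tomatoes"},
    "Matt": {"Tomatoes", "Kale", "Cucumber"},
    "Jeff": {"Cookies", "Milk"},
    "Brie": {"Tomatoes", "Milk", "Kale", "Cucumber", "Celery"},
    "Elsa": {"Chocolate", "Milk"},
}


def recommend(target, similar_person):
    """Products that similar_person buys and target does not buy yet."""
    return sorted(purchases[similar_person] - purchases[target])


# Same direction as the Cypher query: products Matt buys that Annie lacks.
print(recommend("Annie", "Matt"))
# The opposite direction: products Annie buys that Matt lacks.
print(recommend("Matt", "Annie"))
```

Swapping the two names in the Cypher query would likewise yield the recommendations for Matt.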

9. Cleaning up

Before finishing, we can clean up the GDS in-memory state and the example data in the database.

# Remove our projection from the GDS graph catalog
G.drop()

# Remove all the example data from the database
_ = gds.run_cypher("MATCH (n) DETACH DELETE n")

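10. Conclusion

Using two GDS algorithms and some basic Cypher, we were able to derive sensible product recommendations for a customer in our small example.

To make sure that kNN yields similarities to other customers for every customer in the graph, we could experiment with increasing the topK parameter.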