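Product recommendations with kNN based on FastRP embeddings

This Jupyter notebook is hosted in the Neo4j Graph Data Science Client GitHub repository.

The notebook demonstrates how to use the graphdatascience Python library to operate Neo4j GDS. It presents an adapted version of the FastRP and kNN end-to-end example from the GDS Manual.

We consider a graph of products and customers, and we want to find new products to recommend to each customer. We use the K-Nearest Neighbors algorithm (kNN) to identify similar customers and base our product recommendations on that. To let kNN leverage the topological information of the graph, we first create node embeddings using FastRP. These embeddings are the input to the kNN algorithm.

We then use a Cypher query to generate recommendations for each pair of similar customers, where products bought by one customer are recommended to the other.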
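
The recommendation rule described above can be sketched in plain Python before any database is involved. This is only an illustration: the `recommend` helper is hypothetical, and the two purchase sets mirror the customers named Annie and Matt in the example graph created later in this notebook.

```python
# Products bought by two customers who are assumed to be similar.
purchases = {
    "Annie": {"Cucumber", "Milk", "Tomatoes"},
    "Matt": {"Tomatoes", "Kale", "Cucumber"},
}


def recommend(target, similar, purchases):
    """Return products bought by `similar` that `target` has not bought."""
    return sorted(purchases[similar] - purchases[target])


print(recommend("Annie", "Matt", purchases))  # ['Kale']
print(recommend("Matt", "Annie", purchases))  # ['Milk']
```

The rule is directional: the products recommended to one customer are those that only the other customer buys.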

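1. Prerequisites

Running this notebook requires a Neo4j server with a recent version (2.0+) of GDS installed. We recommend using Neo4j Desktop with GDS installed, or AuraDS.

The graphdatascience Python library also needs to be installed. See the example in the Setup section below and the client installation instructions.

2. Setup

We start by installing and importing our dependencies, and setting up our GDS client connection to the database.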

# Install necessary dependencies
%pip install graphdatascience

import os
from graphdatascience import GraphDataScience

# Get Neo4j DB URI and credentials from environment if applicable
NEO4J_URI = os.environ.get("NEO4J_URI", "bolt://localhost:7687")
NEO4J_AUTH = None
if os.environ.get("NEO4J_USER") and os.environ.get("NEO4J_PASSWORD"):
    NEO4J_AUTH = (
        os.environ.get("NEO4J_USER"),
        os.environ.get("NEO4J_PASSWORD"),
    )

gds = GraphDataScience(NEO4J_URI, auth=NEO4J_AUTH)

from graphdatascience.server_version.server_version import ServerVersion

assert gds.server_version() >= ServerVersion(1, 8, 0)

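3. Example graph creation

We now create a graph of products and customers in the database. The amount relationship property represents the average weekly amount of money a customer spends on a given product.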

# The `run_cypher` method can be used to run arbitrary Cypher queries on the database.
_ = gds.run_cypher(
    """
        CREATE
         (dan:Person {name: 'Dan'}),
         (annie:Person {name: 'Annie'}),
         (matt:Person {name: 'Matt'}),
         (jeff:Person {name: 'Jeff'}),
         (brie:Person {name: 'Brie'}),
         (elsa:Person {name: 'Elsa'}),

         (cookies:Product {name: 'Cookies'}),
         (tomatoes:Product {name: 'Tomatoes'}),
         (cucumber:Product {name: 'Cucumber'}),
         (celery:Product {name: 'Celery'}),
         (kale:Product {name: 'Kale'}),
         (milk:Product {name: 'Milk'}),
         (chocolate:Product {name: 'Chocolate'}),

         (dan)-[:BUYS {amount: 1.2}]->(cookies),
         (dan)-[:BUYS {amount: 3.2}]->(milk),
         (dan)-[:BUYS {amount: 2.2}]->(chocolate),

         (annie)-[:BUYS {amount: 1.2}]->(cucumber),
         (annie)-[:BUYS {amount: 3.2}]->(milk),
         (annie)-[:BUYS {amount: 3.2}]->(tomatoes),

         (matt)-[:BUYS {amount: 3}]->(tomatoes),
         (matt)-[:BUYS {amount: 2}]->(kale),
         (matt)-[:BUYS {amount: 1}]->(cucumber),

         (jeff)-[:BUYS {amount: 3}]->(cookies),
         (jeff)-[:BUYS {amount: 2}]->(milk),

         (brie)-[:BUYS {amount: 1}]->(tomatoes),
         (brie)-[:BUYS {amount: 2}]->(milk),
         (brie)-[:BUYS {amount: 2}]->(kale),
         (brie)-[:BUYS {amount: 3}]->(cucumber),
         (brie)-[:BUYS {amount: 0.3}]->(celery),

         (elsa)-[:BUYS {amount: 3}]->(chocolate),
         (elsa)-[:BUYS {amount: 3}]->(milk)
    """
)

4. Projecting into GDS

To analyze the data in our database, we project it into memory, where GDS can operate on it.

# We define how we want to project our database into GDS
node_projection = ["Person", "Product"]
relationship_projection = {"BUYS": {"orientation": "UNDIRECTED", "properties": "amount"}}

# Before actually going through with the projection, let's check how much memory is required
result = gds.graph.project.estimate(node_projection, relationship_projection)

print(f"Required memory for native loading: {result['requiredMemory']}")

# For this small graph memory requirement is low. Let us go through with the projection
G, result = gds.graph.project("purchases", node_projection, relationship_projection)

print(f"The projection took {result['projectMillis']} ms")

# We can use convenience methods on `G` to check if the projection looks correct
print(f"Graph '{G.name()}' node count: {G.node_count()}")
print(f"Graph '{G.name()}' node labels: {G.node_labels()}")

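5. Creating FastRP node embeddings

Next we use the FastRP algorithm to generate node embeddings that capture topological information from the graph. We set embeddingDimension to 4, which is sufficient because our example graph is very small. The iterationWeights were chosen empirically to yield sensible results. For more information on these parameters, see the syntax section of the FastRP documentation.

Since we want to use the embeddings as input when we run kNN later, we use FastRP's mutate mode.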

# We can also estimate memory of running algorithms like FastRP, so let's do that first
result = gds.fastRP.mutate.estimate(
    G,
    mutateProperty="embedding",
    randomSeed=42,
    embeddingDimension=4,
    relationshipWeightProperty="amount",
    iterationWeights=[0.8, 1, 1, 1],
)

print(f"Required memory for running FastRP: {result['requiredMemory']}")

# Now let's run FastRP and mutate our projected graph 'purchases' with the results
result = gds.fastRP.mutate(
    G,
    mutateProperty="embedding",
    randomSeed=42,
    embeddingDimension=4,
    relationshipWeightProperty="amount",
    iterationWeights=[0.8, 1, 1, 1],
)

# Let's make sure we got an embedding for each node
print(f"Number of embedding vectors produced: {result['nodePropertiesWritten']}")
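
To build some intuition for what the call above computes, here is a minimal pure-Python sketch of a FastRP-style embedding. It is a simplification under stated assumptions, not the GDS implementation: it assumes a very sparse random projection with sparsity 3, averages neighbor vectors by relationship weight, and omits the degree-based normalization that GDS applies. The `fastrp_sketch` and `undirected` helpers are hypothetical names, and the small graph reuses four BUYS relationships from the example data.

```python
import math
import random


def fastrp_sketch(adjacency, dim, iteration_weights, seed=42):
    """Simplified FastRP-style embedding.

    `adjacency` maps each node to a dict of {neighbor: relationship weight}.
    """
    rng = random.Random(seed)
    scale = math.sqrt(3.0)
    # Very sparse random projection (assumed sparsity 3): each entry is
    # +sqrt(3) or -sqrt(3) with probability 1/6 each, and 0 otherwise.
    choices = [-scale, 0.0, 0.0, 0.0, 0.0, scale]
    current = {
        node: [rng.choice(choices) for _ in range(dim)]
        for node in sorted(adjacency)
    }
    embedding = {node: [0.0] * dim for node in adjacency}

    for weight in iteration_weights:
        # Propagate: each node takes the weighted average of its neighbors.
        propagated = {}
        for node, neighbors in adjacency.items():
            total = sum(neighbors.values())
            vector = [0.0] * dim
            if total > 0:
                for neighbor, rel_weight in neighbors.items():
                    for i in range(dim):
                        vector[i] += rel_weight / total * current[neighbor][i]
            propagated[node] = vector
        current = propagated

        # Add the L2-normalized intermediate embedding, scaled by its weight.
        for node, vector in current.items():
            norm = math.sqrt(sum(x * x for x in vector))
            if norm > 0:
                for i in range(dim):
                    embedding[node][i] += weight * vector[i] / norm
    return embedding


def undirected(edges):
    """Build a symmetric adjacency dict from (node, node, weight) triples."""
    adjacency = {}
    for a, b, w in edges:
        adjacency.setdefault(a, {})[b] = w
        adjacency.setdefault(b, {})[a] = w
    return adjacency


graph = undirected([
    ("Dan", "Cookies", 1.2),
    ("Dan", "Milk", 3.2),
    ("Jeff", "Cookies", 3.0),
    ("Jeff", "Milk", 2.0),
])
embeddings = fastrp_sketch(graph, dim=4, iteration_weights=[0.8, 1, 1, 1])
for node, vector in embeddings.items():
    print(node, [round(x, 3) for x in vector])
```

Each entry of iteration_weights scales the contribution of one propagation step, so later entries weight information from nodes further away in the graph.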

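6. Similarities with kNN

We can now run kNN to identify similar nodes, using the node embeddings generated with FastRP as nodeProperties. Since we are working with a small graph, we can set sampleRate to 1 and deltaThreshold to 0 without worrying about long computation times. To get a deterministic result, the concurrency parameter is set to 1 (along with a fixed randomSeed). For more information on these parameters, see the syntax section of the kNN documentation.

Note that we use the algorithm's write mode to write the properties and relationships back to our database, so that we can analyze them later using Cypher.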

# Run kNN and write back to db (we skip memory estimation this time...)
result = gds.knn.write(
    G,
    topK=2,
    nodeProperties=["embedding"],
    randomSeed=42,
    concurrency=1,
    sampleRate=1.0,
    deltaThreshold=0.0,
    writeRelationshipType="SIMILAR",
    writeProperty="score",
)

print(f"Relationships produced: {result['relationshipsWritten']}")
print(f"Nodes compared: {result['nodesCompared']}")
print(f"Mean similarity: {result['similarityDistribution']['mean']}")

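As we can see, the mean similarity between nodes is quite high. This is because our example is small and has no long paths between nodes, which leads to many similar FastRP node embeddings.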
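
To make the similarity step concrete, the sketch below performs an exhaustive top-K search with cosine similarity in plain Python. It is an illustration under assumptions, not the GDS implementation: GDS kNN is an approximate algorithm and may scale its scores differently. The vectors and the `cosine` and `top_k_neighbors` helpers are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_neighbors(vectors, k):
    """For every node, return its k most similar other nodes with scores."""
    result = {}
    for node, vector in vectors.items():
        scored = [
            (cosine(vector, other), name)
            for name, other in vectors.items()
            if name != node
        ]
        # Highest similarity first; ties are broken by name for determinism.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        result[node] = [(name, score) for score, name in scored[:k]]
    return result


# Hypothetical two-dimensional embeddings, for illustration only.
vectors = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
for node, neighbors in top_k_neighbors(vectors, k=1).items():
    print(node, "->", neighbors[0][0])
```

Every node receives exactly k outgoing neighbors, so the result is not symmetric: here the nearest neighbor of "c" is "b", while the nearest neighbor of "b" is "a".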

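7. Exploring the results

Let us now inspect the results of our kNN call using Cypher. We can use the SIMILAR relationship type to filter out the relationships we are interested in. Since our product recommendation engine only cares about similarities between people, we make sure to match only nodes with the Person label.

For documentation on how to use Cypher, see the Cypher manual.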

gds.run_cypher(
    """
        MATCH (p1:Person)-[r:SIMILAR]->(p2:Person)
        RETURN p1.name AS person1, p2.name AS person2, r.score AS similarity
        ORDER BY similarity DESCENDING, person1, person2
    """
)

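Our kNN results indicate, among other things, that the Person nodes named "Annie" and "Matt" are very similar. Looking at the BUYS relationships of these two nodes, we can see that this conclusion makes sense. They both buy three products, two of which are the same for both people (the Product nodes named "Cucumber" and "Tomatoes"), and in similar amounts. This gives us reasonable confidence in our approach.

8. Making recommendations

Using the inferred similarity between the Person nodes named "Annie" and "Matt", we can generate product recommendations for each of them. Since they are similar, we can assume that a product purchased by only one of them may also interest the other, who has not yet bought it. Following this principle, a simple Cypher query derives product recommendations for the Person named "Annie" from the products that the Person named "Matt" buys.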

gds.run_cypher(
    """
        MATCH (:Person {name: "Annie"})-[:BUYS]->(p1:Product)
        WITH collect(p1) as products
        MATCH (:Person {name: "Matt"})-[:BUYS]->(p2:Product)
        WHERE not p2 in products
        RETURN p2.name as recommendation
    """
)

Indeed, "Kale" is the only product that the Person named "Matt" buys and the Person named "Annie" does not.
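
The same idea can be applied to every pair of similar customers at once, as outlined in the introduction. The query below is a sketch rather than part of the original example: the `RECOMMENDATION_QUERY` constant and the `recommend_for_all` helper are hypothetical names, and the query assumes that the SIMILAR relationships and their score property written by kNN are still present in the database.

```python
RECOMMENDATION_QUERY = """
    MATCH (p1:Person)-[s:SIMILAR]->(p2:Person)
    MATCH (p2)-[:BUYS]->(product:Product)
    WHERE NOT (p1)-[:BUYS]->(product)
    RETURN p1.name AS person,
           product.name AS recommendation,
           max(s.score) AS similarity
    ORDER BY person, similarity DESC, recommendation
"""


def recommend_for_all(gds):
    """Run the query through any client object that exposes `run_cypher`."""
    return gds.run_cypher(RECOMMENDATION_QUERY)
```

Calling `recommend_for_all(gds)` with the client created in the Setup section would list, for each person, the products bought by their similar people that they do not buy themselves. The Person label filter matters because kNN ran over all nodes, so SIMILAR relationships may also involve Product nodes.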

9. Cleaning up

Before finishing, we can clean up the example data from both the GDS in-memory state and the database.

# Remove our projection from the GDS graph catalog
G.drop()

# Remove all the example data from the database
_ = gds.run_cypher("MATCH (n) DETACH DELETE n")

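10. Conclusion

Using two GDS algorithms and some basic Cypher, we were able to derive sensible product recommendations for customers in our small example.

With topK set to 2, each customer is linked only to their two most similar nodes. To make sure kNN yields similarities to other customers for every customer in the graph, we could experiment with increasing the topK parameter.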