PyG integration: Sample and export

This Jupyter notebook is hosted in the Neo4j Graph Data Science Client GitHub repository.

For a video presentation of this notebook, see the talk GNNs at Scale With Graph Data Science Sampling and Python Client Integration, given at the NODES 2022 conference.

The notebook exemplifies how to use the graphdatascience and PyTorch Geometric (PyG) Python libraries to:

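Import the CORA dataset directly into GDS
Sample a part of CORA using the GDS random walk with restarts algorithm
Export the CORA sample to the client side
Define and train a Graph Convolutional Neural Network (GCN) on the CORA sample
Evaluate the GCN on a test set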

1. Prerequisites

Running this notebook requires a Neo4j server with a recent GDS version (2.2+) installed. We recommend using Neo4j Desktop with GDS, or AuraDS.

The following Python libraries are also required:

graphdatascience (see the docs for installation instructions)
PyG (see the PyG docs for installation instructions)

2. Setup

We start by importing our dependencies and setting up our GDS client connection to the database.

# Install necessary dependencies
%pip install graphdatascience torch torch_scatter torch_sparse torch_geometric

import os
import pandas as pd
from graphdatascience import GraphDataScience
import torch
from torch_geometric.data import Data
import torch.nn.functional as F
from torch_geometric.nn import GCNConv
from torch_geometric.transforms import RandomNodeSplit
import random
import numpy as np

# Set seeds for consistent results
random.seed(42)
np.random.seed(42)
torch.manual_seed(42)
torch.cuda.manual_seed_all(42)

# Get Neo4j DB URI, credentials and name from environment if applicable
NEO4J_URI = os.environ.get("NEO4J_URI", "bolt://localhost:7687")
NEO4J_AUTH = None
NEO4J_DB = os.environ.get("NEO4J_DB", "neo4j")
if os.environ.get("NEO4J_USER") and os.environ.get("NEO4J_PASSWORD"):
    NEO4J_AUTH = (
        os.environ.get("NEO4J_USER"),
        os.environ.get("NEO4J_PASSWORD"),
    )
gds = GraphDataScience(NEO4J_URI, auth=NEO4J_AUTH, database=NEO4J_DB)

from graphdatascience.server_version.server_version import ServerVersion

assert gds.server_version() >= ServerVersion(2, 2, 0)

3. Sampling CORA

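Next we use the built-in CORA loader to get the data into GDS. We then sample it to get a smaller graph to train on. In a real-world scenario, we would probably project data from a Neo4j database into GDS instead.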

G = gds.graph.load_cora()

# Let's make sure we constructed the correct graph
print(f"Metadata for our loaded Cora graph `G`: {G}")
print(f"Node labels present in `G`: {G.node_labels()}")

Looks correct! Now let's go ahead and sample the graph.

# We use the random walk with restarts sampling algorithm with default values
G_sample, _ = gds.alpha.graph.sample.rwr("cora_sample", G, randomSeed=42, concurrency=1)

# We should have somewhere around 0.15 * 2708 ~ 406 nodes in our sample
print(f"Number of nodes in our sample: {G_sample.node_count()}")

# And let's see how many relationships we got
print(f"Number of relationships in our sample: {G_sample.relationship_count()}")

4. Exporting the sampled CORA

We can now export the topology and the node properties of our sampled graph, which we need to train our model.

# Get the relationship data from our sample
sample_topology_df = gds.beta.graph.relationships.stream(G_sample)

# Let's see what we got:
display(sample_topology_df)

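We get the right number of rows, one for each expected relationship. However, the node IDs are rather large, whereas PyG expects consecutive IDs starting from zero. We will now start massaging the data structure containing the relationships until PyG can use it.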

# By using `by_rel_type` we get the topology in a format that can be used as input to several GNN frameworks:
# {"rel_type": [[source_nodes], [target_nodes]]}

sample_topology = sample_topology_df.by_rel_type()

# We should only have the "CITES" keys since there's only one relationship type
print(f"Relationship type keys: {sample_topology.keys()}")
print(f"Number of node lists under 'CITES': {len(sample_topology['CITES'])}")

# How many source nodes do we have?
print(len(sample_topology["CITES"][0]))

Great, it looks like we have the format we need to create a PyG edge_index.
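To make that format concrete, here is a minimal pure-Python sketch of how rows of (source, target, relationship type) can be grouped into the {"rel_type": [[source_nodes], [target_nodes]]} layout. The rows and the helper name group_by_rel_type are made up for illustration. This is not how by_rel_type is implemented in the client, only a picture of the shape it returns.

```python
from collections import defaultdict

# Hypothetical relationship rows: (sourceNodeId, targetNodeId, relationshipType)
toy_rows = [
    (31336, 31349, "CITES"),
    (31336, 686532, "CITES"),
    (31349, 31336, "CITES"),
]


def group_by_rel_type(rows):
    # One [sources, targets] pair of parallel lists per relationship type
    grouped = defaultdict(lambda: [[], []])
    for source, target, rel_type in rows:
        grouped[rel_type][0].append(source)
        grouped[rel_type][1].append(target)
    return dict(grouped)


toy_topology = group_by_rel_type(toy_rows)

# Two lists under "CITES": all source nodes, then all target nodes
print(toy_topology)
```

The two inner lists are parallel: position i in the first list and position i in the second list together describe one relationship.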

# We also need to export the node properties corresponding to our node labels and features, represented by the
# "subject" and "features" node properties respectively
sample_node_properties = gds.graph.nodeProperties.stream(
    G_sample,
    ["subject", "features"],
    separate_property_columns=True,
)

# Let's make sure we got the data we expected
display(sample_node_properties)

5. Constructing GCN input

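Now that we have all the information we need on the client side, we can construct the PyG Data object that we will use as training input. We remap the node IDs so that they are consecutive and start from zero. We use the ordering of node IDs in sample_node_properties as our remapping, so that the index is aligned with the node properties.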
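As a small illustration of this remapping, the sketch below uses made-up node IDs in place of real GDS IDs. The position of each ID in the hypothetical toy_node_ids list plays the role of the row number in sample_node_properties.

```python
# Hypothetical GDS node ids, in the order the node property rows were returned
toy_node_ids = [31336, 686532, 31349]

# Hypothetical topology in [[source_nodes], [target_nodes]] form, using the original ids
toy_edges = [[31336, 31336, 31349], [31349, 686532, 31336]]

# Map each original id to its row position: 31336 -> 0, 686532 -> 1, 31349 -> 2
toy_old_idx_to_new = {node_id: row for row, node_id in enumerate(toy_node_ids)}

# Rewrite both the source list and the target list with the new ids
toy_normalized = [[toy_old_idx_to_new[node_id] for node_id in nodes] for nodes in toy_edges]

print(toy_normalized)  # [[0, 0, 2], [2, 1, 0]]
```

After remapping, node i in the topology is the node whose properties sit in row i, which is what lets the features, the labels, and edge_index share one index.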

# In order for the node ids used in the `topology` to be consecutive and starting from zero,
# we will need to remap them. This way they will also align with the row numbering of the
# `sample_node_properties` data frame
def normalize_topology_index(new_idx_to_old, topology):
    # Create a reverse mapping based on new idx -> old idx
    old_idx_to_new = dict((v, k) for k, v in new_idx_to_old.items())
    return [[old_idx_to_new[node_id] for node_id in nodes] for nodes in topology]


# We use the ordering of node ids in `sample_node_properties` as our remapping
# The result is: [[mapped_source_nodes], [mapped_target_nodes]]
normalized_topology = normalize_topology_index(dict(sample_node_properties["nodeId"]), sample_topology["CITES"])

# Convert the remapped topology into the `edge_index` tensor that PyG expects
edge_index = torch.tensor(normalized_topology, dtype=torch.long)

# We specify the node property "features" as the zero-layer node embeddings
x = torch.tensor(sample_node_properties["features"], dtype=torch.float)

# We specify the node property "subject" as class labels
y = torch.tensor(sample_node_properties["subject"], dtype=torch.long)

data = Data(x=x, y=y, edge_index=edge_index)

print(data)

# Do a random split of the data so that ~10% goes into a test set and the rest used for training
transform = RandomNodeSplit(num_test=40, num_val=0)
data = transform(data)

# We can see that our `data` object has been extended with some masks defining the split
print(data)
print(data.train_mask.sum().item())

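As a side note, if we had wanted to do some hyperparameter tuning, it would have been useful to keep some of the data for a validation set as well.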
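To picture what such a three-way split looks like, here is a plain-Python sketch that produces train, validation and test masks. It is an illustration only and not PyG's implementation; the helper random_node_split and the sizes are assumptions. In the notebook itself, the corresponding change would be to pass a non-zero num_val to RandomNodeSplit.

```python
import random


def random_node_split(num_nodes, num_val, num_test, seed=42):
    # Shuffle node indices, then carve off the test and validation nodes
    rng = random.Random(seed)
    indices = list(range(num_nodes))
    rng.shuffle(indices)
    test_idx = set(indices[:num_test])
    val_idx = set(indices[num_test:num_test + num_val])

    # Boolean masks, one entry per node; everything not held out is used for training
    test_mask = [i in test_idx for i in range(num_nodes)]
    val_mask = [i in val_idx for i in range(num_nodes)]
    train_mask = [not (t or v) for t, v in zip(test_mask, val_mask)]
    return train_mask, val_mask, test_mask


# Illustrative sizes only: roughly the sample size, with 40 nodes held out for each set
toy_train_mask, toy_val_mask, toy_test_mask = random_node_split(num_nodes=406, num_val=40, num_test=40)
print(sum(toy_train_mask), sum(toy_val_mask), sum(toy_test_mask))  # 326 40 40
```

Hyperparameters would then be chosen by comparing runs on the validation nodes, keeping the test nodes untouched until the final evaluation.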

6. Training and evaluating a GCN

Let's now define and train the GCN using PyG, with our sampled CORA as input. We adapt the CORA GCN example from the PyG documentation.

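In this example we evaluate the model on a test set of the sampled CORA. Note, however, that since GCN is an inductive algorithm, we could also evaluate it on the full CORA dataset, or even on another (similar) graph.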

num_classes = y.unique().shape[0]


# Define the GCN architecture
class GCN(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = GCNConv(data.num_node_features, 16)
        self.conv2 = GCNConv(16, num_classes)

    def forward(self, data):
        x, edge_index = data.x, data.edge_index

        x = self.conv1(x, edge_index)
        x = F.relu(x)
        x = F.dropout(x, training=self.training)
        x = self.conv2(x, edge_index)

        # We use log_softmax and nll_loss instead of softmax output and cross entropy loss
        # for reasons of performance and numerical stability.
        # The two formulations are mathematically equivalent
        return F.log_softmax(x, dim=1)

# Prepare training by setting up for the chosen device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Let's see what device was chosen
print(device)

# In standard PyTorch fashion we instantiate our model, and transfer it to the memory of the chosen device
model = GCN().to(device)

# Let's inspect our model architecture
print(model)

# Pass our input data to the chosen device too
data = data.to(device)

# Since hyperparameter tuning is out of scope for this small example, we initialize an
# Adam optimizer with some fixed learning rate and weight decay
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)

From inspecting the model we can see that the output size is 7, which looks correct since Cora does indeed have 7 different paper subjects.
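The comment in the forward method says that log_softmax followed by nll_loss is equivalent to taking the cross entropy of a softmax output. The sketch below checks this for a single node using only the standard library. The logits and the target class are made-up values, and the code is a hand-written illustration of the math, not PyTorch's implementation.

```python
import math

# Hypothetical raw scores (logits) for one node over three classes, and its true class
toy_logits = [2.0, 0.5, -1.0]
toy_target = 0

# Route 1: softmax probabilities, then cross entropy = -log(p[target])
exps = [math.exp(z) for z in toy_logits]
probs = [e / sum(exps) for e in exps]
cross_entropy = -math.log(probs[toy_target])

# Route 2: log_softmax via the log-sum-exp trick, then NLL = -log_softmax[target]
max_logit = max(toy_logits)
log_sum_exp = max_logit + math.log(sum(math.exp(z - max_logit) for z in toy_logits))
log_softmax = [z - log_sum_exp for z in toy_logits]
nll = -log_softmax[toy_target]

print(abs(cross_entropy - nll) < 1e-9)  # True
```

The second route never exponentiates a value larger than zero, which is one reason the log-space formulation tends to behave better numerically for large logits.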

# Train the GCN using the CORA sample represented by `data` using the standard PyTorch training loop
model.train()
for epoch in range(200):
    optimizer.zero_grad()
    out = model(data)
    loss = F.nll_loss(out[data.train_mask], data.y[data.train_mask])
    loss.backward()
    optimizer.step()

# Evaluate the trained GCN model on our test set
model.eval()
pred = model(data).argmax(dim=1)
correct = (pred[data.test_mask] == data.y[data.test_mask]).sum()
acc = int(correct) / int(data.test_mask.sum())

print(f"Accuracy: {acc:.4f}")

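The accuracy looks good. A next step would be to run the GCN model that we trained on our subsample on the full Cora graph. This part is left as an exercise.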

7. Cleanup

We remove the CORA graphs from the GDS graph catalogue.

_ = G_sample.drop()
_ = G.drop()