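PyG integration: Sampling and exporting

This Jupyter notebook is hosted here in the Neo4j Graph Data Science Client GitHub repository.

For a video presentation of this notebook, see the talk GNNs at Scale With Graph Data Science Sampling and Python Client Integration, given at the NODES 2022 conference.

This notebook shows how to use the graphdatascience and PyTorch Geometric (PyG) Python libraries to:

Import the CORA dataset directly into GDS
Sample a part of CORA using the GDS random walk with restarts algorithm
Export the CORA sample client side
Define and train a Graph Convolutional Neural Network (GCN) on the CORA sample
Evaluate the GCN on a test set

1. Prerequisites

Running this notebook requires a Neo4j server with a recent GDS version (2.5+) installed. We recommend using Neo4j Desktop with GDS, or AuraDS.

The following Python libraries are also required:

graphdatascience (see the docs for installation instructions)
PyG (see the PyG docs for installation instructions)

2. Setup

We start by importing our dependencies and setting up the GDS client connection to the database.

Alternatively, you can use Aura Graph Analytics Serverless and skip the entire setup section below.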

# Install necessary dependencies
%pip install graphdatascience torch torch_scatter torch_sparse torch_geometric

import os
import random

import numpy as np
import torch
import torch.nn.functional as F
from torch_geometric.data import Data
from torch_geometric.nn import GCNConv
from torch_geometric.transforms import RandomNodeSplit

from graphdatascience import GraphDataScience

# Set seeds for consistent results
random.seed(42)
np.random.seed(42)
torch.manual_seed(42)
torch.cuda.manual_seed_all(42)

# Get Neo4j DB URI, credentials and name from environment if applicable
NEO4J_URI = os.environ.get("NEO4J_URI", "bolt://localhost:7687")
NEO4J_AUTH = None
NEO4J_DB = os.environ.get("NEO4J_DB", "neo4j")
if os.environ.get("NEO4J_USER") and os.environ.get("NEO4J_PASSWORD"):
    NEO4J_AUTH = (
        os.environ.get("NEO4J_USER"),
        os.environ.get("NEO4J_PASSWORD"),
    )
gds = GraphDataScience(NEO4J_URI, auth=NEO4J_AUTH, database=NEO4J_DB)

from graphdatascience import ServerVersion

assert gds.server_version() >= ServerVersion(2, 5, 0)

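3. Sampling CORA

Next we use the built-in CORA loader to get the data into GDS. We will then sample it to get a smaller graph to train on. In a real-world scenario, we would probably project the data from a Neo4j database into GDS instead.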

G = gds.graph.load_cora()

# Let's make sure we constructed the correct graph
print(f"Metadata for our loaded Cora graph `G`: {G}")
print(f"Node labels present in `G`: {G.node_labels()}")

Looks correct! Now let's go ahead and sample the graph.

# We use the random walk with restarts sampling algorithm with default values
G_sample, _ = gds.graph.sample.rwr("cora_sample", G, randomSeed=42, concurrency=1)

# We should have somewhere around 0.15 * 2708 ~ 406 nodes in our sample
print(f"Number of nodes in our sample: {G_sample.node_count()}")

# And let's see how many relationships we got
print(f"Number of relationships in our sample: {G_sample.relationship_count()}")
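
For some intuition about what random walk with restarts sampling does, here is a minimal pure-Python sketch on a toy graph. It is not the GDS implementation: the function name, the parameter names, the toy ring graph and the `max_steps` guard are all assumptions made for illustration. The idea is to walk from a start node, jump back to it with some restart probability, and collect visited nodes until the requested fraction of the graph has been reached.

```python
import random


def rwr_sample(adjacency, start_node, sampling_ratio=0.15, restart_probability=0.1,
               seed=42, max_steps=10_000):
    """Collect nodes visited by a random walk with restarts (illustrative sketch)."""
    rng = random.Random(seed)
    target_size = max(1, round(sampling_ratio * len(adjacency)))
    visited = {start_node}
    current = start_node
    for _ in range(max_steps):
        if len(visited) >= target_size:
            break
        neighbors = adjacency[current]
        # Jump back to the start node on a restart, or when the walk is stuck
        if not neighbors or rng.random() < restart_probability:
            current = start_node
        else:
            current = rng.choice(neighbors)
        visited.add(current)
    return visited


# Toy undirected ring graph with 20 nodes (hypothetical data, not CORA)
ring = {i: [(i - 1) % 20, (i + 1) % 20] for i in range(20)}
sample = rwr_sample(ring, start_node=0)
print(f"Sampled {len(sample)} of {len(ring)} nodes: {sorted(sample)}")
```

Because the walk keeps returning to its start node, the sampled nodes stay close to it, which is why this kind of sampling tends to preserve local neighborhood structure.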

4. Exporting the sampled CORA

We can now export the topology and node properties of our sampled graph that we need to train our model.

# Get the relationship data from our sample
sample_topology_df = gds.graph.relationships.stream(G_sample)

# Let's see what we got:
display(sample_topology_df)

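We get the right number of rows, one for each expected relationship. However, the node IDs are rather large, whereas PyG expects consecutive IDs starting from zero. We will now start massaging the data structure containing the relationships until PyG can use it.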

# By using `by_rel_type` we get the topology in a format that can be used as input to several GNN frameworks:
# {"rel_type": [[source_nodes], [target_nodes]]}

sample_topology = sample_topology_df.by_rel_type()

# We should only have the "CITES" keys since there's only one relationship type
print(f"Relationship type keys: {sample_topology.keys()}")
print(f"Number of node lists under CITES (sources and targets): {len(sample_topology['CITES'])}")

# How many source nodes do we have?
print(len(sample_topology["CITES"][0]))

Great, it looks like we have the format we need to create a PyG edge_index later.
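
To make the target format concrete, the following sketch rebuilds the same kind of structure by hand from a few relationship rows. The helper `group_by_rel_type` and the row values are hypothetical and only mimic the shape described above; the client's own `by_rel_type` should be used in practice.

```python
from collections import defaultdict


def group_by_rel_type(rows):
    """Turn (source, target, rel_type) rows into {rel_type: [[sources], [targets]]}."""
    grouped = defaultdict(lambda: [[], []])
    for source, target, rel_type in rows:
        grouped[rel_type][0].append(source)
        grouped[rel_type][1].append(target)
    return dict(grouped)


# Hypothetical rows with large, non-consecutive node ids
rows = [(31336, 31349, "CITES"), (31349, 686532, "CITES"), (686532, 31336, "CITES")]
topology = group_by_rel_type(rows)
print(topology)
# {'CITES': [[31336, 31349, 686532], [31349, 686532, 31336]]}
```

Each relationship type maps to exactly two lists of equal length, which is why `len(sample_topology["CITES"])` is 2 while the length of either inner list is the number of relationships.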

# We also need to export the node properties corresponding to our node labels and features, represented by the
# "subject" and "features" node properties respectively
sample_node_properties = gds.graph.nodeProperties.stream(
    G_sample,
    ["subject", "features"],
    separate_property_columns=True,
)

# Let's make sure we got the data we expected
display(sample_node_properties)

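5. Constructing GCN input

Now that we have all the information we need client side, we can construct the PyG Data object that we will use as training input. We remap the node IDs so that they are consecutive and start from zero. We use the ordering of node IDs in sample_node_properties as our remapping, so that the index aligns with the node properties.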
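
As a small worked example of the remapping idea, the sketch below applies it to three made-up node IDs before we do the same on the real sample. The function name `remap_node_ids` and all values are illustrative assumptions.

```python
def remap_node_ids(ordered_node_ids, topology):
    """Map arbitrary node ids to their position in `ordered_node_ids`."""
    old_to_new = {old_id: new_id for new_id, old_id in enumerate(ordered_node_ids)}
    return [[old_to_new[node_id] for node_id in nodes] for nodes in topology]


# Hypothetical node ids, in the row order of a node property table
ordered_node_ids = [31336, 31349, 686532]
# Hypothetical topology: [[source_nodes], [target_nodes]]
topology = [[31336, 31349, 686532], [31349, 686532, 31336]]

remapped = remap_node_ids(ordered_node_ids, topology)
print(remapped)
# [[0, 1, 2], [1, 2, 0]]
```

After remapping, node `i` in the edge index refers to row `i` of the feature and label tensors, which is the alignment PyG relies on.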

# In order for the node ids used in the `topology` to be consecutive and starting from zero,
# we will need to remap them. This way they will also align with the row numbering of the
# `sample_node_properties` data frame
def normalize_topology_index(new_idx_to_old, topology):
    # Create a reverse mapping based on new idx -> old idx
    old_idx_to_new = dict((v, k) for k, v in new_idx_to_old.items())
    return [[old_idx_to_new[node_id] for node_id in nodes] for nodes in topology]


# We use the ordering of node ids in `sample_node_properties` as our remapping
# The result is: [[mapped_source_nodes], [mapped_target_nodes]]
normalized_topology = normalize_topology_index(
    dict(sample_node_properties["nodeId"]), sample_topology["CITES"]
)

# Build the PyG edge index from the remapped topology
edge_index = torch.tensor(normalized_topology, dtype=torch.long)

# We specify the node property "features" as the zero-layer node embeddings
x = torch.tensor(sample_node_properties["features"], dtype=torch.float)

# We specify the node property "subject" as class labels
y = torch.tensor(sample_node_properties["subject"], dtype=torch.long)

data = Data(x=x, y=y, edge_index=edge_index)

print(data)

# Do a random split of the data so that ~10% goes into a test set and the rest used for training
transform = RandomNodeSplit(num_test=40, num_val=0)
data = transform(data)

# We can see that our `data` object has been extended with some masks defining the split
print(data)
print(data.train_mask.sum().item())

As a side note, if we were to do some hyperparameter tuning, it would also be useful to keep some of the data for a validation set.
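
A three-way split could look roughly like the sketch below, written in plain Python so the mechanics are visible. The function, the node count of 406 and the split sizes are assumptions for illustration; in PyG itself, passing a non-zero `num_val` to `RandomNodeSplit` is the natural way to get a validation mask.

```python
import random


def random_node_split(num_nodes, num_val, num_test, seed=42):
    """Return disjoint boolean train/val/test masks over `num_nodes` nodes."""
    rng = random.Random(seed)
    shuffled = list(range(num_nodes))
    rng.shuffle(shuffled)
    val_nodes = set(shuffled[:num_val])
    test_nodes = set(shuffled[num_val:num_val + num_test])
    val_mask = [i in val_nodes for i in range(num_nodes)]
    test_mask = [i in test_nodes for i in range(num_nodes)]
    # Every node that is neither validation nor test is used for training
    train_mask = [not (v or t) for v, t in zip(val_mask, test_mask)]
    return train_mask, val_mask, test_mask


train_mask, val_mask, test_mask = random_node_split(num_nodes=406, num_val=40, num_test=40)
print(sum(train_mask), sum(val_mask), sum(test_mask))
# 326 40 40
```

Hyperparameters would then be selected on the validation nodes, keeping the test nodes untouched until the final evaluation.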

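6. Training and evaluating a GCN

Let's now define and train the GCN using PyG, with our sampled CORA as input. We adapt the CORA GCN example from the PyG documentation.

In this example we evaluate the model on a test set of the sampled CORA. Note, however, that since GCN is an inductive algorithm, we could also evaluate it on the full CORA dataset, or even on another (similar) graph.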
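
Before defining the model, it may help to see what a single graph convolution computes. The sketch below is a simplified, dense, pure-Python version of the standard GCN propagation rule, D^-1/2 (A + I) D^-1/2 X W, assuming an undirected edge list and no bias term. It is meant as intuition only and is not how `GCNConv` is implemented; the function name and the toy inputs are made up.

```python
import math


def gcn_layer(features, edges, weight):
    """Compute D^-1/2 (A + I) D^-1/2 X W for an undirected edge list (no bias)."""
    n, n_in, n_out = len(features), len(weight), len(weight[0])
    # Adjacency matrix with self-loops added
    adj = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for u, v in edges:
        adj[u][v] = adj[v][u] = 1.0
    deg = [sum(row) for row in adj]
    out = [[0.0] * n_out for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Symmetrically normalized weight of the edge between i and j
            coeff = adj[i][j] / math.sqrt(deg[i] * deg[j])
            for f in range(n_in):
                for o in range(n_out):
                    out[i][o] += coeff * features[j][f] * weight[f][o]
    return out


# Two connected nodes with one feature each, and an identity weight matrix
features = [[1.0], [3.0]]
edges = [(0, 1)]
weight = [[1.0]]
print(gcn_layer(features, edges, weight))
# [[2.0], [2.0]]
```

Each node ends up with a normalized average of its own and its neighbors' features, which is then linearly transformed. Stacking two such layers, as the model below does, lets information flow across two hops.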

num_classes = y.unique().shape[0]


# Define the GCN architecture
class GCN(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = GCNConv(data.num_node_features, 16)
        self.conv2 = GCNConv(16, num_classes)

    def forward(self, data):
        x, edge_index = data.x, data.edge_index

        x = self.conv1(x, edge_index)
        x = F.relu(x)
        x = F.dropout(x, training=self.training)
        x = self.conv2(x, edge_index)

        # We use log_softmax and nll_loss instead of softmax output and cross entropy loss
        # for reasons of performance and numerical stability.
        # The two formulations are mathematically equivalent
        return F.log_softmax(x, dim=1)
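
The equivalence mentioned in the comment above can be checked with a few lines of plain Python. This is a sketch with hand-written helper functions, not the PyTorch implementations, and the logits are made up.

```python
import math


def log_softmax(logits):
    # Subtract the max before exponentiating for numerical stability
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_sum_exp for v in logits]


def nll_loss(log_probs, target):
    return -log_probs[target]


def naive_cross_entropy(logits, target):
    # Softmax followed by log; overflows for large logits
    exps = [math.exp(v) for v in logits]
    return -math.log(exps[target] / sum(exps))


logits, target = [2.0, 0.5, -1.0], 0
loss_a = nll_loss(log_softmax(logits), target)
loss_b = naive_cross_entropy(logits, target)
print(math.isclose(loss_a, loss_b))
# True

# The log-space version also copes with logits where math.exp(1000.0) would overflow
stable_loss = nll_loss(log_softmax([1000.0, 0.0]), 0)
```

The two losses agree on ordinary inputs, while only the log-space formulation stays finite for extreme logits.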

# Prepare training by setting up for the chosen device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Let's see what device was chosen
print(device)

# In standard PyTorch fashion we instantiate our model, and transfer it to the memory of the chosen device
model = GCN().to(device)

# Let's inspect our model architecture
print(model)

# Pass our input data to the chosen device too
data = data.to(device)

# Since hyperparameter tuning is out of scope for this small example, we initialize an
# Adam optimizer with some fixed learning rate and weight decay
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)

Inspecting the model, we can see that the output size is 7, which looks correct since Cora does indeed have 7 different paper subjects.

# Train the GCN using the CORA sample represented by `data` using the standard PyTorch training loop
model.train()
for epoch in range(200):
    optimizer.zero_grad()
    out = model(data)
    loss = F.nll_loss(out[data.train_mask], data.y[data.train_mask])
    loss.backward()
    optimizer.step()

# Evaluate the trained GCN model on our test set
model.eval()
pred = model(data).argmax(dim=1)
correct = (pred[data.test_mask] == data.y[data.test_mask]).sum()
acc = int(correct) / int(data.test_mask.sum())

print(f"Accuracy: {acc:.4f}")

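The accuracy looks good. The next step would be to run the GCN model that we trained on the subsample on the entire Cora graph. This part is left as an exercise.

7. Cleanup

We remove the CORA graphs from the GDS graph catalog.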

_ = G_sample.drop()
_ = G.drop()