PyG integration: Sample and export


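This Jupyter notebook is hosted in the Neo4j Graph Data Science Client GitHub repository.

For a video presentation of this notebook, see the talk "GNNs at Scale With Graph Data Science Sampling and Python Client Integration", given at the NODES 2022 conference.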

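This notebook shows how to use the graphdatascience and PyTorch Geometric (PyG) Python libraries to:

  • Import the CORA dataset directly into GDS

  • Sample a part of CORA using the GDS random walk with restarts algorithm

  • Export the CORA sample client side

  • Define and train a Graph Convolutional Neural Network (GCN) on the CORA sample

  • Evaluate the GCN on a test set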

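1. Prerequisites

Running this notebook requires a Neo4j server with a recent GDS version (2.2+) installed. We recommend using Neo4j Desktop with GDS, or AuraDS.

The following Python libraries are also required:

  • graphdatascience (see the documentation for installation instructions)

  • PyG (see the PyG documentation for installation instructions)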

2. Setup

We start by importing our dependencies and setting up the GDS client connection to the database.

# Install necessary dependencies
%pip install graphdatascience torch torch_scatter torch_sparse torch_geometric
import os
import pandas as pd
from graphdatascience import GraphDataScience
import torch
from torch_geometric.data import Data
import torch.nn.functional as F
from torch_geometric.nn import GCNConv
from torch_geometric.transforms import RandomNodeSplit
import random
import numpy as np
# Set seeds for consistent results
random.seed(42)
np.random.seed(42)
torch.manual_seed(42)
torch.cuda.manual_seed_all(42)
# Get Neo4j DB URI, credentials and name from environment if applicable
NEO4J_URI = os.environ.get("NEO4J_URI", "bolt://localhost:7687")
NEO4J_AUTH = None
NEO4J_DB = os.environ.get("NEO4J_DB", "neo4j")
if os.environ.get("NEO4J_USER") and os.environ.get("NEO4J_PASSWORD"):
    NEO4J_AUTH = (
        os.environ.get("NEO4J_USER"),
        os.environ.get("NEO4J_PASSWORD"),
    )
gds = GraphDataScience(NEO4J_URI, auth=NEO4J_AUTH, database=NEO4J_DB)
from graphdatascience.server_version.server_version import ServerVersion

assert gds.server_version() >= ServerVersion(2, 2, 0)

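3. Sampling CORA

Next, we use the built-in CORA loader to import the data into GDS. We then sample it to get a smaller graph to train on. In a real-world scenario, we would probably project the data from a Neo4j database into GDS instead.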

G = gds.graph.load_cora()
# Let's make sure we constructed the correct graph
print(f"Metadata for our loaded Cora graph `G`: {G}")
print(f"Node labels present in `G`: {G.node_labels()}")

That looks right! Now let's go ahead and sample the graph.

# We use the random walk with restarts sampling algorithm with default values
G_sample, _ = gds.alpha.graph.sample.rwr("cora_sample", G, randomSeed=42, concurrency=1)
# We should have somewhere around 0.15 * 2708 ~ 406 nodes in our sample
print(f"Number of nodes in our sample: {G_sample.node_count()}")

# And let's see how many relationships we got
print(f"Number of relationships in our sample: {G_sample.relationship_count()}")
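
To give some intuition for what the sampling step does, here is a minimal pure-Python sketch of a random walk with restarts on a toy graph. It is not the GDS implementation: the function name, the toy ring graph, the restart probability of 0.1 and the step cap are all assumptions made for illustration. Only the 0.15 sampling ratio comes from the comment above.

```python
import random


def rwr_sample(adjacency, start_node, sampling_ratio=0.15,
               restart_probability=0.1, seed=42, max_steps=10_000):
    """Toy random walk with restarts: collect visited nodes up to a target size."""
    rng = random.Random(seed)
    target_size = max(1, round(sampling_ratio * len(adjacency)))
    visited = {start_node}
    current = start_node
    # The step cap (an assumption) guards against graphs where the
    # start node cannot reach enough other nodes
    for _ in range(max_steps):
        if len(visited) >= target_size:
            break
        neighbors = adjacency[current]
        # Jump back to the start node with some probability, or at a dead end
        if not neighbors or rng.random() < restart_probability:
            current = start_node
        else:
            current = rng.choice(neighbors)
        visited.add(current)
    return visited


# Hypothetical toy graph: a ring of 20 nodes stored as adjacency lists
toy_graph = {i: [(i - 1) % 20, (i + 1) % 20] for i in range(20)}
sample = rwr_sample(toy_graph, start_node=0)
print(f"Sampled {len(sample)} of {len(toy_graph)} nodes: {sorted(sample)}")
```

Because the walk keeps returning to its start node, the sampled nodes tend to stay in one neighborhood, which is useful when the sample should preserve local structure for GNN training.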

4. Exporting the sampled CORA

We can now export the topology and node properties of our sampled graph, which we need to train our model.

# Get the relationship data from our sample
sample_topology_df = gds.beta.graph.relationships.stream(G_sample)
# Let's see what we got:
display(sample_topology_df)

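We get the right number of rows, one for each expected relationship. However, the node IDs are rather large, whereas PyG expects consecutive IDs starting from zero. We will now reshape the relationship data step by step until PyG can use it.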

# By using `by_rel_type` we get the topology in a format that can be used as input to several GNN frameworks:
# {"rel_type": [[source_nodes], [target_nodes]]}

sample_topology = sample_topology_df.by_rel_type()
# We should only have the "CITES" keys since there's only one relationship type
print(f"Relationship type keys: {sample_topology.keys()}")
print(f"Number of lists under 'CITES' (sources and targets): {len(sample_topology['CITES'])}")

# How many source nodes do we have?
print(len(sample_topology["CITES"][0]))

Great, it looks like we have the format we need to create a PyG edge_index later.
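
As an illustration of the `{"rel_type": [[source_nodes], [target_nodes]]}` layout described in the code comment above, the sketch below builds the same shape from plain tuples. The helper name and the sample rows are hypothetical and are not taken from the CORA sample; this is not how `by_rel_type` is implemented in the client.

```python
from collections import defaultdict


def group_by_rel_type(rows):
    """Group (source, target, rel_type) rows into {rel_type: [[sources], [targets]]}."""
    topology = defaultdict(lambda: [[], []])
    for source, target, rel_type in rows:
        topology[rel_type][0].append(source)
        topology[rel_type][1].append(target)
    return dict(topology)


# Hypothetical relationship rows with large, non-consecutive node ids
rows = [
    (31336, 31349, "CITES"),
    (31336, 686532, "CITES"),
    (686532, 31349, "CITES"),
]
toy_topology = group_by_rel_type(rows)
print(toy_topology["CITES"])
```

The outer list always has exactly two entries per relationship type, one for sources and one for targets, which matches the two-row layout of a PyG `edge_index`.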

# We also need to export the node properties corresponding to our node labels and features, represented by the
# "subject" and "features" node properties respectively
sample_node_properties = gds.graph.nodeProperties.stream(
    G_sample,
    ["subject", "features"],
    separate_property_columns=True,
)
# Let's make sure we got the data we expected
display(sample_node_properties)

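5. Constructing GCN input

Now that we have all the information we need client side, we can construct the PyG Data object that we will use as training input. We remap the node IDs so that they are consecutive and start from zero. We use the ordering of node IDs in sample_node_properties as our remapping so that the indices align with the node properties.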
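
To make the remapping concrete, here is a small self-contained sketch on toy data. The function name and the node ids are hypothetical; the idea is only that each original id is replaced by its row position in the node property table.

```python
def remap_to_consecutive(node_ids, topology):
    """Replace each original node id with its position in `node_ids`."""
    old_to_new = {old: new for new, old in enumerate(node_ids)}
    return [[old_to_new[node_id] for node_id in nodes] for nodes in topology]


# Hypothetical node ids, in the order their property rows would appear
toy_node_ids = [31336, 31349, 686532]
# Hypothetical topology in [[sources], [targets]] form
toy_edges = [[31336, 31336, 686532], [31349, 686532, 31349]]

remapped = remap_to_consecutive(toy_node_ids, toy_edges)
print(remapped)  # [[0, 0, 2], [1, 2, 1]]
```

After remapping, index `i` in the edge list refers to row `i` of the feature and label tensors, which is the alignment PyG relies on.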

# In order for the node ids used in the `topology` to be consecutive and starting from zero,
# we will need to remap them. This way they will also align with the row numbering of the
# `sample_node_properties` data frame
def normalize_topology_index(new_idx_to_old, topology):
    # Create a reverse mapping based on new idx -> old idx
    old_idx_to_new = dict((v, k) for k, v in new_idx_to_old.items())
    return [[old_idx_to_new[node_id] for node_id in nodes] for nodes in topology]


# We use the ordering of node ids in `sample_node_properties` as our remapping
# The result is: [[mapped_source_nodes], [mapped_target_nodes]]
normalized_topology = normalize_topology_index(dict(sample_node_properties["nodeId"]), sample_topology["CITES"])
# Convert the remapped topology into the `edge_index` tensor that PyG expects
edge_index = torch.tensor(normalized_topology, dtype=torch.long)

# We specify the node property "features" as the zero-layer node embeddings
x = torch.tensor(sample_node_properties["features"], dtype=torch.float)

# We specify the node property "subject" as class labels
y = torch.tensor(sample_node_properties["subject"], dtype=torch.long)

data = Data(x=x, y=y, edge_index=edge_index)

print(data)
# Do a random split of the data so that ~10% goes into a test set and the rest is used for training
transform = RandomNodeSplit(num_test=40, num_val=0)
data = transform(data)

# We can see that our `data` object has been extended with some masks defining the split
print(data)
print(data.train_mask.sum().item())

As a side note, holding out some data for a validation set would also be useful if we wanted to do hyperparameter tuning.
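
A three-way split could look like the sketch below, which builds boolean train, validation and test masks in plain Python. It only illustrates what such masks contain and is not PyG's `RandomNodeSplit` implementation; the function name and the node count of 400 are assumptions. With PyG itself, passing a non-zero `num_val` to `RandomNodeSplit` should produce a `val_mask` in the same spirit.

```python
import random


def random_node_split(num_nodes, num_val, num_test, seed=42):
    """Return disjoint boolean train/val/test masks over `num_nodes` nodes."""
    rng = random.Random(seed)
    order = list(range(num_nodes))
    rng.shuffle(order)
    val_nodes = set(order[:num_val])
    test_nodes = set(order[num_val:num_val + num_test])
    val_mask = [i in val_nodes for i in range(num_nodes)]
    test_mask = [i in test_nodes for i in range(num_nodes)]
    # Everything that is neither validation nor test is used for training
    train_mask = [not (v or t) for v, t in zip(val_mask, test_mask)]
    return train_mask, val_mask, test_mask


# Hypothetical sizes, roughly matching a ~400 node sample
train_mask, val_mask, test_mask = random_node_split(num_nodes=400, num_val=40, num_test=40)
print(sum(train_mask), sum(val_mask), sum(test_mask))  # 320 40 40
```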

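6. Training and evaluating a GCN

Let's now define and train the GCN using PyG and our sampled CORA as input. We adapt the CORA GCN example from the PyG documentation.

In this example we evaluate the model on a test set of the sampled CORA. Note, however, that since GCN is an inductive algorithm, we could also evaluate it on the full CORA dataset, or even on another (similar) graph.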
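
One way to see why the model can be applied to other graphs is that a GCN layer only aggregates features from each node's neighborhood, using weights that depend on local degrees rather than on node identity. The sketch below shows a single propagation step with symmetric normalization and self-loops, leaving out the learned weight matrix. It is a simplified illustration, not PyG's `GCNConv`: the function name and toy data are hypothetical, and edges are treated as undirected here by assumption.

```python
import math


def gcn_propagate(edges, features, num_nodes):
    """One propagation step D^-1/2 (A + I) D^-1/2 X, without learned weights."""
    # Add a self-loop to every node and treat edges as undirected
    neighbors = {i: {i} for i in range(num_nodes)}
    for source, target in edges:
        neighbors[source].add(target)
        neighbors[target].add(source)
    degree = {i: len(neighbors[i]) for i in range(num_nodes)}

    output = []
    for i in range(num_nodes):
        row = [0.0] * len(features[0])
        for j in neighbors[i]:
            # Symmetric normalization: 1 / sqrt(deg(i) * deg(j))
            weight = 1.0 / math.sqrt(degree[i] * degree[j])
            for k, value in enumerate(features[j]):
                row[k] += weight * value
        output.append(row)
    return output


# Hypothetical path graph 0 - 1 - 2 with a single feature per node
toy_edges = [(0, 1), (1, 2)]
toy_features = [[1.0], [0.0], [0.0]]
smoothed = gcn_propagate(toy_edges, toy_features, num_nodes=3)
print(smoothed)
```

Node 2 receives nothing after one step because it is two hops away from node 0; stacking two layers, as the model below does, lets information travel two hops.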

num_classes = y.unique().shape[0]


# Define the GCN architecture
class GCN(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = GCNConv(data.num_node_features, 16)
        self.conv2 = GCNConv(16, num_classes)

    def forward(self, data):
        x, edge_index = data.x, data.edge_index

        x = self.conv1(x, edge_index)
        x = F.relu(x)
        x = F.dropout(x, training=self.training)
        x = self.conv2(x, edge_index)

        # We use log_softmax and nll_loss instead of softmax output and cross entropy loss
        # for reasons of performance and numerical stability.
        # The two formulations are mathematically equivalent
        return F.log_softmax(x, dim=1)
# Prepare training by setting up for the chosen device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Let's see what device was chosen
print(device)
# In standard PyTorch fashion we instantiate our model, and transfer it to the memory of the chosen device
model = GCN().to(device)

# Let's inspect our model architecture
print(model)

# Pass our input data to the chosen device too
data = data.to(device)

# Since hyperparameter tuning is out of scope for this small example, we initialize an
# Adam optimizer with some fixed learning rate and weight decay
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)

Inspecting the model, we can see that the output size is 7, which looks right since Cora does indeed have 7 different paper subjects.
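
The comment in the model's `forward` method states that `log_softmax` followed by `nll_loss` is mathematically equivalent to softmax followed by cross entropy. The sketch below checks this for a single example using only the standard library. The helper functions are simplified stand-ins written for this illustration, not the PyTorch implementations, and the logits are made up.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax using the max-shift trick."""
    shift = max(logits)
    log_sum_exp = shift + math.log(sum(math.exp(v - shift) for v in logits))
    return [v - log_sum_exp for v in logits]


def nll_loss(log_probs, target):
    """Negative log-likelihood of the target class."""
    return -log_probs[target]


def cross_entropy(logits, target):
    """Naive softmax followed by cross entropy, for comparison only."""
    exps = [math.exp(v) for v in logits]
    return -math.log(exps[target] / sum(exps))


# Hypothetical logits for a three-class problem
logits = [2.0, -1.0, 0.5]
target = 0
loss_a = nll_loss(log_softmax(logits), target)
loss_b = cross_entropy(logits, target)
print(math.isclose(loss_a, loss_b))
```

The naive version exponentiates the raw logits and can overflow for large values, whereas the shifted log-softmax cannot, which is the numerical stability argument made in the comment.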

# Train the GCN using the CORA sample represented by `data` using the standard PyTorch training loop
model.train()
for epoch in range(200):
    optimizer.zero_grad()
    out = model(data)
    loss = F.nll_loss(out[data.train_mask], data.y[data.train_mask])
    loss.backward()
    optimizer.step()
# Evaluate the trained GCN model on our test set
model.eval()
pred = model(data).argmax(dim=1)
correct = (pred[data.test_mask] == data.y[data.test_mask]).sum()
acc = int(correct) / int(data.test_mask.sum())

print(f"Accuracy: {acc:.4f}")

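The accuracy looks good. The next step would be to run the GCN model that we trained on the subsample on the full Cora graph. This part is left as an exercise.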

7. Cleanup

We remove the CORA graphs from the GDS graph catalog.

_ = G_sample.drop()
_ = G.drop()