PyG integration: Sample and export
This Jupyter notebook is hosted in the Neo4j Graph Data Science Client GitHub repository.
For a video presentation of this notebook, see the talk GNNs at Scale With Graph Data Science Sampling and Python Client Integration, given at the NODES 2022 conference.
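This notebook shows how to use the graphdatascience and PyTorch Geometric (PyG) Python libraries to:
- Import the CORA dataset directly into GDS
- Sample a part of CORA using the GDS random walk with restarts algorithm
- Export the CORA sample client side
- Define and train a Graph Convolutional Neural Network (GCN) on the CORA sample
- Evaluate the GCN on a test set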
2. Setup
We start by importing our dependencies and setting up our GDS client connection to the database.
# Install necessary dependencies
%pip install graphdatascience torch torch_scatter torch_sparse torch_geometric
import os
import pandas as pd
from graphdatascience import GraphDataScience
import torch
from torch_geometric.data import Data
import torch.nn.functional as F
from torch_geometric.nn import GCNConv
from torch_geometric.transforms import RandomNodeSplit
import random
import numpy as np
# Set seeds for consistent results
random.seed(42)
np.random.seed(42)
torch.manual_seed(42)
torch.cuda.manual_seed_all(42)
# Get Neo4j DB URI, credentials and name from environment if applicable
NEO4J_URI = os.environ.get("NEO4J_URI", "bolt://localhost:7687")
NEO4J_AUTH = None
NEO4J_DB = os.environ.get("NEO4J_DB", "neo4j")
if os.environ.get("NEO4J_USER") and os.environ.get("NEO4J_PASSWORD"):
    NEO4J_AUTH = (
        os.environ.get("NEO4J_USER"),
        os.environ.get("NEO4J_PASSWORD"),
    )
gds = GraphDataScience(NEO4J_URI, auth=NEO4J_AUTH, database=NEO4J_DB)
from graphdatascience.server_version.server_version import ServerVersion
assert gds.server_version() >= ServerVersion(2, 2, 0)
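The connection settings above fall back to defaults whenever an environment variable is missing. As a minimal sketch, the same fallback logic can be written as a small function that is easy to test in isolation. Note that `resolve_neo4j_settings` is a hypothetical helper introduced here for illustration and is not part of the `graphdatascience` client, and the default URI is an assumed local Bolt address.

```python
import os


def resolve_neo4j_settings(env=None):
    """Return (uri, auth, database) from a mapping of environment variables.

    Hypothetical helper for illustration; not part of the graphdatascience client.
    """
    env = os.environ if env is None else env
    # Assumed default: a Neo4j instance listening locally on the Bolt port
    uri = env.get("NEO4J_URI", "bolt://localhost:7687")
    database = env.get("NEO4J_DB", "neo4j")
    user, password = env.get("NEO4J_USER"), env.get("NEO4J_PASSWORD")
    # Only use credentials when both parts are present and non-empty
    auth = (user, password) if user and password else None
    return uri, auth, database


print(resolve_neo4j_settings({"NEO4J_USER": "neo4j", "NEO4J_PASSWORD": "secret"}))
```

Passing a plain dictionary instead of `os.environ` keeps the sketch independent of the machine it runs on.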
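3. Sampling CORA
Next we use the built-in CORA loader to get the data into GDS. We will then sample it to get a smaller graph to train on. In a real-world scenario, we would probably project data from a Neo4j database into GDS instead.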
G = gds.graph.load_cora()
# Let's make sure we constructed the correct graph
print(f"Metadata for our loaded Cora graph `G`: {G}")
print(f"Node labels present in `G`: {G.node_labels()}")
Looks correct! Now let's go ahead and sample the graph.
# We use the random walk with restarts sampling algorithm with default values
G_sample, _ = gds.alpha.graph.sample.rwr("cora_sample", G, randomSeed=42, concurrency=1)
# We should have somewhere around 0.15 * 2708 ~ 406 nodes in our sample
print(f"Number of nodes in our sample: {G_sample.node_count()}")
# And let's see how many relationships we got
print(f"Number of relationships in our sample: {G_sample.relationship_count()}")
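To build some intuition for what random walk with restarts sampling does, here is a toy pure-Python version. It is only a sketch of the general idea and not the GDS implementation: the function name `rwr_sample`, the restart probability of 0.1, the single start node, and the rule for escaping an exhausted component are all assumptions made for this example. The sampling ratio of 0.15 mirrors the figure used in the comment above.

```python
import random


def rwr_sample(adjacency, sampling_ratio=0.15, restart_probability=0.1, seed=42):
    """Toy random walk with restarts sampler; returns the set of visited node ids.

    Illustrative only: the GDS procedure has its own implementation and options.
    """
    rng = random.Random(seed)
    nodes = sorted(adjacency)
    target_size = min(len(nodes), max(1, round(sampling_ratio * len(nodes))))
    start = rng.choice(nodes)
    current, visited = start, {start}
    steps_since_new = 0
    while len(visited) < target_size:
        neighbors = adjacency[current]
        if not neighbors or rng.random() < restart_probability:
            current = start  # restart the walk at the start node
        else:
            current = rng.choice(neighbors)
        if current in visited:
            steps_since_new += 1
        else:
            visited.add(current)
            steps_since_new = 0
        # Assumed escape rule: if no new node was found for a long time,
        # continue from a fresh, not yet visited start node
        if steps_since_new > 10 * len(nodes):
            start = rng.choice([n for n in nodes if n not in visited])
            current = start
            visited.add(start)
            steps_since_new = 0
    return visited


# A ring of 20 nodes, each linked to its two neighbors
ring = {i: [(i - 1) % 20, (i + 1) % 20] for i in range(20)}
sample = rwr_sample(ring)
print(len(sample), sorted(sample))
```

Because the walk keeps returning to its start node, the sampled nodes tend to form a connected neighborhood rather than a scattering of unrelated nodes, which is what makes the sample useful for training a GNN.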
4. Exporting the sampled CORA
We can now export the topology and the node properties of our sampled graph that we need to train our model.
# Get the relationship data from our sample
sample_topology_df = gds.beta.graph.relationships.stream(G_sample)
# Let's see what we got:
display(sample_topology_df)
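We get the right number of rows, one for each expected relationship. However, the node IDs are rather large, whereas PyG expects consecutive IDs starting from zero. We will now start massaging the data structure containing the relationships until PyG can use it.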
# By using `by_rel_type` we get the topology in a format that can be used as input to several GNN frameworks:
# {"rel_type": [[source_nodes], [target_nodes]]}
sample_topology = sample_topology_df.by_rel_type()
# We should only have the "CITES" keys since there's only one relationship type
print(f"Relationship type keys: {sample_topology.keys()}")
print(f"Number of lists under 'CITES': {len(sample_topology['CITES'])}")
# How many source nodes do we have?
print(len(sample_topology["CITES"][0]))
Great, it looks like we have the format we need to create a PyG edge_index.
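The `{"rel_type": [[source_nodes], [target_nodes]]}` layout described in the comments above can be illustrated with plain Python. The function `group_by_rel_type` and the node IDs below are made up for this sketch; it shows the shape of the result and is not how the client implements `by_rel_type`.

```python
from collections import defaultdict


def group_by_rel_type(rows):
    """Group (source, target, rel_type) rows into {rel_type: [[sources], [targets]]}."""
    grouped = defaultdict(lambda: [[], []])
    for source, target, rel_type in rows:
        grouped[rel_type][0].append(source)
        grouped[rel_type][1].append(target)
    return dict(grouped)


# Made-up relationship rows, one tuple per relationship
rows = [(10, 20, "CITES"), (20, 30, "CITES"), (30, 10, "CITES")]
topology = group_by_rel_type(rows)
print(topology)  # {'CITES': [[10, 20, 30], [20, 30, 10]]}
```

Each relationship type maps to exactly two parallel lists, so the length of `topology["CITES"]` is always 2, while the length of each inner list equals the number of relationships.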
# We also need to export the node properties corresponding to our node labels and features, represented by the
# "subject" and "features" node properties respectively
sample_node_properties = gds.graph.nodeProperties.stream(
    G_sample,
    ["subject", "features"],
    separate_property_columns=True,
)
# Let's make sure we got the data we expected
display(sample_node_properties)
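5. Constructing GCN input
Now that we have all the information we need client side, we can construct the PyG Data object that we will use as training input. We will remap the node IDs to be consecutive and to start from zero. We use the ordering of node IDs in sample_node_properties as our remapping so that the indices align with the node properties.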
# In order for the node ids used in the `topology` to be consecutive and starting from zero,
# we will need to remap them. This way they will also align with the row numbering of the
# `sample_node_properties` data frame
def normalize_topology_index(new_idx_to_old, topology):
    # Create a reverse mapping based on new idx -> old idx
    old_idx_to_new = dict((v, k) for k, v in new_idx_to_old.items())
    return [[old_idx_to_new[node_id] for node_id in nodes] for nodes in topology]
# We use the ordering of node ids in `sample_node_properties` as our remapping
# The result is: [[mapped_source_nodes], [mapped_target_nodes]]
normalized_topology = normalize_topology_index(dict(sample_node_properties["nodeId"]), sample_topology["CITES"])
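To see the remapping in isolation, here is a small standalone sketch with made-up node IDs. The function `remap_to_consecutive` is a hypothetical variant that takes the ordered list of IDs directly instead of a data frame column.

```python
def remap_to_consecutive(ordered_node_ids, topology):
    """Replace each node id by its position in `ordered_node_ids`."""
    old_to_new = {old_id: new_id for new_id, old_id in enumerate(ordered_node_ids)}
    return [[old_to_new[node_id] for node_id in nodes] for nodes in topology]


# Made-up, non-consecutive node ids in the order of the node property rows
ordered_node_ids = [1001, 2042, 3977]
# [[source_nodes], [target_nodes]] using the original ids
topology = [[1001, 2042, 3977], [2042, 3977, 1001]]
print(remap_to_consecutive(ordered_node_ids, topology))  # [[0, 1, 2], [1, 2, 0]]
```

After remapping, node `i` in the topology refers to row `i` of the node properties, which is the alignment the feature and label tensors rely on.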
# Build the PyG edge index tensor from the remapped topology
edge_index = torch.tensor(normalized_topology, dtype=torch.long)
# We specify the node property "features" as the zero-layer node embeddings
x = torch.tensor(sample_node_properties["features"], dtype=torch.float)
# We specify the node property "subject" as class labels
y = torch.tensor(sample_node_properties["subject"], dtype=torch.long)
data = Data(x=x, y=y, edge_index=edge_index)
print(data)
# Do a random split of the data so that ~10% goes into a test set and the rest used for training
transform = RandomNodeSplit(num_test=40, num_val=0)
data = transform(data)
# We can see that our `data` object has been extended with some masks defining the split
print(data)
print(data.train_mask.sum().item())
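As a side note, if we had wanted to do some hyperparameter tuning, it would have been useful to keep some data for a validation set too.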
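The idea behind such a split, including a validation set, can be sketched in plain Python as three disjoint boolean masks. This only mimics the concept of a random node split and is not the implementation of `RandomNodeSplit`. The function name and the round figure of 400 nodes are assumptions for the example, not the actual sample size.

```python
import random


def random_node_split(num_nodes, num_val, num_test, seed=42):
    """Return disjoint boolean (train, val, test) masks over `num_nodes` nodes."""
    rng = random.Random(seed)
    order = list(range(num_nodes))
    rng.shuffle(order)
    # The first shuffled ids go to validation, the next ones to test
    val_ids = set(order[:num_val])
    test_ids = set(order[num_val:num_val + num_test])
    val_mask = [i in val_ids for i in range(num_nodes)]
    test_mask = [i in test_ids for i in range(num_nodes)]
    # Everything that is neither validation nor test is used for training
    train_mask = [not (v or t) for v, t in zip(val_mask, test_mask)]
    return train_mask, val_mask, test_mask


train_mask, val_mask, test_mask = random_node_split(num_nodes=400, num_val=40, num_test=40)
print(sum(train_mask), sum(val_mask), sum(test_mask))  # 320 40 40
```

The validation mask would then be used to compare hyperparameter settings, while the test mask stays untouched until the final evaluation.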
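6. Training and evaluating a GCN
Let's now define and train the GCN using PyG and our sampled CORA as input. We adapt the CORA GCN example from the PyG documentation.
In this example we evaluate the model on a test set of the sampled CORA. Note, however, that since GCN is an inductive algorithm, we could also evaluate it on the full CORA dataset, or even on another (similar) graph.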
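One way to see why a GCN is inductive is to write out a single propagation step by hand. The sketch below computes D^-1/2 (A + I) D^-1/2 X W in plain Python for a tiny made-up graph. It leaves out bias and activation and treats edges as undirected, which are simplifying assumptions of this example rather than a description of how `GCNConv` is implemented.

```python
import math


def gcn_layer(edges, features, weights):
    """One GCN propagation step, D^-1/2 (A + I) D^-1/2 X W, without bias or activation."""
    num_nodes = len(features)
    # Treat edges as undirected and add a self-loop for every node
    neighbors = [{i} for i in range(num_nodes)]
    for source, target in edges:
        neighbors[source].add(target)
        neighbors[target].add(source)
    degree = [len(n) for n in neighbors]
    out_dim = len(weights[0])
    output = []
    for i in range(num_nodes):
        # Aggregate neighbor features with symmetric degree normalization
        aggregated = [0.0] * len(features[0])
        for j in neighbors[i]:
            norm = 1.0 / math.sqrt(degree[i] * degree[j])
            for k, value in enumerate(features[j]):
                aggregated[k] += norm * value
        # Apply the weight matrix, which is shared by all nodes
        output.append(
            [sum(aggregated[k] * weights[k][c] for k in range(len(weights))) for c in range(out_dim)]
        )
    return output


# Made-up path graph 0 - 1 - 2 with two features per node and identity weights
edges = [(0, 1), (1, 2)]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = [[1.0, 0.0], [0.0, 1.0]]
hidden = gcn_layer(edges, features, weights)
print(hidden)
```

The shape of `weights` depends only on the feature dimensions, never on the number of nodes, so the same trained weights can be applied to a larger or different graph with the same kind of features.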
num_classes = y.unique().shape[0]
# Define the GCN architecture
class GCN(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = GCNConv(data.num_node_features, 16)
        self.conv2 = GCNConv(16, num_classes)

    def forward(self, data):
        x, edge_index = data.x, data.edge_index
        x = self.conv1(x, edge_index)
        x = F.relu(x)
        x = F.dropout(x, training=self.training)
        x = self.conv2(x, edge_index)
        # We use log_softmax and nll_loss instead of softmax output and cross entropy loss
        # for reasons of performance and numerical stability.
        # They are mathematically equivalent
        return F.log_softmax(x, dim=1)
# Prepare training by setting up for the chosen device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Let's see what device was chosen
print(device)
# In standard PyTorch fashion we instantiate our model, and transfer it to the memory of the chosen device
model = GCN().to(device)
# Let's inspect our model architecture
print(model)
# Pass our input data to the chosen device too
data = data.to(device)
# Since hyperparameter tuning is out of scope for this small example, we initialize an
# Adam optimizer with some fixed learning rate and weight decay
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)
By inspecting the model, we can see that the output size is 7, which looks correct since Cora does indeed have 7 different paper subjects.
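The comment in the model definition says that log_softmax combined with nll_loss is mathematically equivalent to softmax with cross entropy. A quick numeric check in plain Python makes this concrete. The functions below are simplified single-example versions written for this sketch, not the PyTorch implementations, and the logits are made up.

```python
import math


def log_softmax(logits):
    # Subtract the maximum before exponentiating for numerical stability
    highest = max(logits)
    shifted = [value - highest for value in logits]
    log_sum = math.log(sum(math.exp(value) for value in shifted))
    return [value - log_sum for value in shifted]


def nll_loss(log_probs, target):
    # Negative log-likelihood of the target class
    return -log_probs[target]


def cross_entropy(logits, target):
    # Naive softmax followed by the negative log of the target probability
    exps = [math.exp(value) for value in logits]
    return -math.log(exps[target] / sum(exps))


logits, target = [2.0, -1.0, 0.5], 0
print(nll_loss(log_softmax(logits), target), cross_entropy(logits, target))
```

Both values agree up to floating point error. The stability benefit shows up with large logits: `log_softmax([1000.0, 0.0])` works, whereas the naive `cross_entropy` would overflow in `math.exp(1000.0)`.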
# Train the GCN using the CORA sample represented by `data` using the standard PyTorch training loop
model.train()
for epoch in range(200):
    optimizer.zero_grad()
    out = model(data)
    loss = F.nll_loss(out[data.train_mask], data.y[data.train_mask])
    loss.backward()
    optimizer.step()
# Evaluate the trained GCN model on our test set
model.eval()
pred = model(data).argmax(dim=1)
correct = (pred[data.test_mask] == data.y[data.test_mask]).sum()
acc = int(correct) / int(data.test_mask.sum())
print(f"Accuracy: {acc:.4f}")
The accuracy looks good. The next step would be to run the GCN model, trained on our subsample, on the full Cora graph. This part is left as an exercise.
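For reference, the masked accuracy computed in the evaluation cell can be expressed without tensors. The sketch below uses made-up predictions and labels, and `masked_accuracy` is a hypothetical helper name. The same calculation would apply when evaluating on a different graph, as long as the mask selects the nodes of interest.

```python
def masked_accuracy(predictions, labels, mask):
    """Fraction of correct predictions among the nodes selected by `mask`."""
    selected = [(p, y) for p, y, m in zip(predictions, labels, mask) if m]
    if not selected:
        raise ValueError("mask selects no nodes")
    return sum(p == y for p, y in selected) / len(selected)


# Made-up predicted and true class ids for five nodes
predictions = [0, 2, 1, 1, 3]
labels = [0, 2, 2, 1, 0]
test_mask = [False, True, True, False, True]
print(f"Accuracy: {masked_accuracy(predictions, labels, test_mask):.4f}")  # Accuracy: 0.3333
```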