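Machine learning pipelines

The Python client has special support for link prediction pipelines and node property prediction pipelines. GDS pipelines are represented as pipeline objects in the GDS Python client.

Operating pipelines through the client is based entirely on these pipeline objects. This is a more convenient and Pythonic API than the Cypher procedure API. Once created, pipeline objects can be passed as arguments to various methods in the Python client, such as the pipeline catalog operations. Additionally, pipeline objects have convenience methods for inspecting the pipeline they represent without explicitly involving the pipeline catalog.

In the examples below we assume that we have an instantiated GraphDataScience object called gds. Read more about this in Getting started.

1. Node classification

This section outlines how to use the Python client to build, configure and train a node classification pipeline, as well as how to use the model that training produces for predictions.

1.1. Pipeline

To create a new node classification pipeline, make the following call: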

pipe = gds.nc_pipe("my-pipe")

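where pipe is a pipeline object.

To then go on to build, configure and train the pipeline, we call methods directly on the node classification pipeline object. Below is a description of the methods on such objects: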

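Table 1. Node classification pipeline methods
Name	Arguments	Return type	Description
`addNodeProperty`	`procedure_name: str, config: **kwargs`	`Series`	Add an algorithm that produces a node property to the pipeline, with optional algorithm-specific configuration.
`selectFeatures`	`node_properties: Union[str, list[str]]`	`Series`	Select node properties to be used as features.
`configureSplit`	`config: **kwargs`	`Series`	Configure the train-test dataset split.
`addLogisticRegression`	`parameter_space: dict[str, any]`	`Series`	Add a logistic regression model configuration to train as a candidate in the model selection phase. ^[1]
`addRandomForest`	`parameter_space: dict[str, any]`	`Series`	Add a random forest model configuration to train as a candidate in the model selection phase. ^[1]
`addMLP`	`parameter_space: dict[str, any]`	`Series`	Add an MLP model configuration to train as a candidate in the model selection phase. ^[1]
`configureAutoTuning`	`config: **kwargs`	`Series`	Configure the auto-tuning.
`train`	`G: Graph, config: **kwargs`	`NCModel, Series`	Train the pipeline on the given input graph using the given keyword arguments.
`train_estimate`	`G: Graph, config: **kwargs`	`Series`	Estimate training the pipeline on the given input graph using the given keyword arguments.
`feature_properties`	`-`	`Series`	Returns a list of the selected feature properties for the pipeline.
`exists`	`-`	`bool`	`True` if the pipeline exists in the GDS pipeline catalog, `False` otherwise.
`name`	`-`	`str`	The name of the pipeline as it appears in the pipeline catalog.
`type`	`-`	`str`	The type of pipeline.
`creation_time`	`-`	`neo4j.time.Datetime`	Time when the pipeline was created.
`node_property_steps`	`-`	`DataFrame`	Returns the node property steps of the pipeline.
`split_config`	`-`	`Series`	Returns the configuration set up for feature-train-test splitting of the dataset.
`parameter_space`	`-`	`Series`	Returns the model parameter space set up for model selection when training.
`auto_tuning_config`	`-`	`Series`	Returns the configuration set up for auto-tuning.
`drop`	`failIfMissing: Optional[bool]`	`Series`	Remove the pipeline from the GDS pipeline catalog.
1. Ranges can also be given as length-two `Tuple`s, i.e. `(x, y)` is the same as `{range: [x, y]}`.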
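
The equivalence stated in the footnote can be illustrated in plain Python. The helper below, `normalize_parameter_space`, is a hypothetical name introduced only for this sketch; it is not part of the GDS client and merely mirrors the rule that a length-two tuple stands for a range.

```python
from typing import Any, Dict


def normalize_parameter_space(**parameter_space: Any) -> Dict[str, Any]:
    """Rewrite length-two tuples as explicit range maps (illustration only)."""
    normalized: Dict[str, Any] = {}
    for key, value in parameter_space.items():
        if isinstance(value, tuple) and len(value) == 2:
            # (x, y) is shorthand for {"range": [x, y]}
            normalized[key] = {"range": list(value)}
        else:
            # Fixed values are passed through unchanged
            normalized[key] = value
    return normalized


print(normalize_parameter_space(tolerance=(0.01, 0.1), penalty=1.0))
# {'tolerance': {'range': [0.01, 0.1]}, 'penalty': 1.0}
```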

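There are two main differences when comparing the methods above with the Cypher API procedures they map to:

As the Python methods are called on the pipeline object, one does not need to provide a name when calling them.
Configuration parameters in the Cypher calls are represented by named keyword arguments in the Python method calls.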
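
These two differences can be made concrete with a small sketch of how a method call on a pipeline object could be translated into a Cypher procedure call. `to_cypher_call` and its parameter names are hypothetical and written only for illustration; the client's actual internals may differ.

```python
from typing import Any, Dict, Tuple


def to_cypher_call(
    procedure: str, pipeline_name: str, **config: Any
) -> Tuple[str, Dict[str, Any]]:
    """Build a parameterized Cypher call from Python keyword arguments."""
    query = f"CALL {procedure}($pipeline_name, $config)"
    # The name is taken from the pipeline object; keyword arguments become the config map
    params = {"pipeline_name": pipeline_name, "config": dict(config)}
    return query, params


query, params = to_cypher_call(
    "gds.beta.pipeline.nodeClassification.configureSplit",
    "my-pipe",
    testFraction=0.2,
    validationFolds=5,
)
print(query)
print(params)
```

On a pipeline object the caller would only write `pipe.configureSplit(testFraction=0.2, validationFolds=5)`; the pipeline name never appears in the call.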

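Another difference is that the train Python call takes a graph object instead of a graph name, and returns an NCModel model object that we can run predictions with, as well as a pandas Series containing metadata from the training.

Please consult the node classification Cypher documentation for information about what kind of input the methods expect.

1.1.1. Example

Below is an example of how to configure and train a very basic node classification pipeline. Note that we do not configure the split explicitly, but rather use the default.

To exemplify this, we introduce a small person graph: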

gds.run_cypher(
  """
  CREATE
    (a:Person {name: "Bob", fraudster: 0}),
    (b:Person {name: "Alice", fraudster: 0}),
    (c:Person {name: "Eve", fraudster: 1}),
    (d:Person {name: "Chad", fraudster: 1}),
    (e:Person {name: "Dan", fraudster: 0}),
    (f:UnknownPerson {name: "Judy"}),

    (a)-[:KNOWS]->(b),
    (a)-[:KNOWS]->(c),
    (a)-[:KNOWS]->(d),
    (b)-[:KNOWS]->(d),
    (c)-[:KNOWS]->(d),
    (c)-[:KNOWS]->(e),
    (d)-[:KNOWS]->(e),
    (d)-[:KNOWS]->(f),
    (e)-[:KNOWS]->(f)
  """
)
G, project_result = gds.graph.project("person_graph", {"Person": {"properties": ["fraudster"]}}, "KNOWS")

assert G.node_labels() == ["Person"]

pipe, _ = gds.beta.pipeline.nodeClassification.create("my-pipe")

# Add Degree centrality as a property step producing "rank" node properties
pipe.addNodeProperty("degree", mutateProperty="rank")

# Select our "rank" property as a feature for the model training
pipe.selectFeatures("rank")

# Verify that the features to be used in model training are what we expect
feature_properties = pipe.feature_properties()
assert len(feature_properties) == 1
assert feature_properties[0]["feature"] == "rank"

# Configure the model training to do cross-validation over logistic regression
pipe.addLogisticRegression(tolerance=(0.01, 0.1))
pipe.addLogisticRegression(penalty=1.0)

# Train the pipeline targeting node property "fraudster" as label and "ACCURACY" as only metric
fraud_model, train_result = pipe.train(
    G,
    modelName="fraud-model",
    targetProperty="fraudster",
    metrics=["ACCURACY"],
    randomSeed=111
)
assert train_result["trainMillis"] >= 0

A model referred to as "fraud-model" in the GDS model catalog is produced. In the next section we will go over how to use that model to make predictions.
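
For orientation, the `ACCURACY` metric requested in the training call is the fraction of examples whose predicted class matches the true class. The sketch below computes it in plain Python; the `predicted` values are invented for illustration and are not output of the pipeline.

```python
from typing import Sequence


def accuracy(actual: Sequence[int], predicted: Sequence[int]) -> float:
    """Fraction of positions where the predicted class equals the actual class."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)


actual = [0, 0, 1, 1, 0]     # "fraudster" values of Bob, Alice, Eve, Chad, Dan
predicted = [0, 0, 1, 0, 0]  # hypothetical predictions, one of them wrong
print(accuracy(actual, predicted))  # 0.8
```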

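1.2. Model

As we saw in the previous section, node classification models are created when training a node classification pipeline. In addition to inheriting the methods common to all model objects, node classification models have the following methods: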

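Table 2. Node classification model methods
Name	Arguments	Return type	Description
`classes`	`-`	`List[int]`	List of classes used to train the classification model.
`feature_properties`	`-`	`List[str]`	The node properties used as input model features.
`node_property_steps`	`-`	`List[NodePropertyStep]`	List of algorithms the pipeline uses to compute node properties before training.
`metrics`	`-`	`Series`	Values of the metrics specified when training.
`best_parameters`	`-`	`Series`	The parameters of the training method that performed best across all validation sets.
`predict_mutate`	`G: Graph, config: **kwargs`	`Series`	Predict classes for nodes of the input graph and mutate the graph with the predictions.
`predict_mutate_estimate`	`G: Graph, config: **kwargs`	`Series`	Estimate predicting classes for nodes of the input graph and mutating the graph with the predictions.
`predict_stream`	`G: Graph, config: **kwargs`	`DataFrame`	Predict classes for nodes of the input graph and stream the results.
`predict_stream_estimate`	`G: Graph, config: **kwargs`	`Series`	Estimate predicting classes for nodes of the input graph and streaming the results.
`predict_write`	`G: Graph, config: **kwargs`	`Series`	Predict classes for nodes of the input graph and write the results back to the database.
`predict_write_estimate`	`G: Graph, config: **kwargs`	`Series`	Estimate predicting classes for nodes of the input graph and writing the results back to the database.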

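One can note that the predict methods are indeed very similar to their Cypher counterparts. The three main differences are:

They take a graph object instead of a graph name.
They have Python keyword arguments representing the keys of the configuration map.
One does not have to provide a "modelName" since the model object used itself has this information.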
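
As a sketch of these differences, the helper below wraps a `predict_mutate` call. It assumes `model` is a node classification model object such as the one returned by `train`, and `G` a graph object. `mutateProperty` and `predictedProbabilityProperty` are configuration keys of the corresponding Cypher procedure; the helper name and the property names chosen here are placeholders.

```python
from typing import Any


def mutate_with_predictions(
    model: Any, G: Any, property_name: str = "predicted_fraudster"
) -> Any:
    """Run predict_mutate on a graph object, storing classes and probabilities."""
    # G is passed as an object (no graph name) and no "modelName" is given:
    # the model object already knows which model it represents.
    return model.predict_mutate(
        G,
        mutateProperty=property_name,
        predictedProbabilityProperty=f"{property_name}_probabilities",
    )
```

With the objects from the training example this would be invoked as `mutate_with_predictions(fraud_model, G)`, which requires a live GDS session.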

1.2.1. Example (continued)

We now continue the example above using the node classification model fraud_model that we trained there.

# Make sure we indeed obtained an accuracy score
metrics = fraud_model.metrics()
assert "ACCURACY" in metrics

H, project_result = gds.graph.project("full_person_graph", ["Person", "UnknownPerson"], "KNOWS")

# Predict on `H` and stream the results with a specific concurrency of 2
predictions = fraud_model.predict_stream(H, concurrency=2)
assert len(predictions) == H.node_count()

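2. Link prediction

This section outlines how to use the Python client to build, configure and train a link prediction pipeline, as well as how to use the model that training produces for predictions.

2.1. Pipeline

To create a new link prediction pipeline, make the following call: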

pipe = gds.lp_pipe("my-pipe")

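where pipe is a pipeline object.

To then go on to build, configure and train the pipeline, we call methods directly on the link prediction pipeline object. Below is a description of the methods on such objects: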

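Table 3. Link prediction pipeline methods
Name	Arguments	Return type	Description
`addNodeProperty`	`procedure_name: str, config: **kwargs`	`Series`	Add an algorithm that produces a node property to the pipeline, with optional algorithm-specific configuration.
`addFeature`	`feature_type: str, config: **kwargs`	`Series`	Add a link feature for model training based on node properties and a feature combiner.
`configureSplit`	`config: **kwargs`	`Series`	Configure the feature-train-test dataset split.
`addLogisticRegression`	`parameter_space: dict[str, any]`	`Series`	Add a logistic regression model configuration to train as a candidate in the model selection phase. ^[2]
`addRandomForest`	`parameter_space: dict[str, any]`	`Series`	Add a random forest model configuration to train as a candidate in the model selection phase. ^[2]
`addMLP`	`parameter_space: dict[str, any]`	`Series`	Add an MLP model configuration to train as a candidate in the model selection phase. ^[2]
`configureAutoTuning`	`config: **kwargs`	`Series`	Configure the auto-tuning.
`train`	`G: Graph, config: **kwargs`	`LPModel, Series`	Train the model on the given input graph using the given keyword arguments.
`train_estimate`	`G: Graph, config: **kwargs`	`Series`	Estimate training the pipeline on the given input graph using the given keyword arguments.
`feature_steps`	`-`	`DataFrame`	Returns a list of the selected feature steps for the pipeline.
`exists`	`-`	`bool`	`True` if the pipeline exists in the GDS pipeline catalog, `False` otherwise.
`name`	`-`	`str`	The name of the pipeline as it appears in the pipeline catalog.
`type`	`-`	`str`	The type of pipeline.
`creation_time`	`-`	`neo4j.time.Datetime`	Time when the pipeline was created.
`node_property_steps`	`-`	`DataFrame`	Returns the node property steps of the pipeline.
`split_config`	`-`	`Series`	Returns the configuration set up for feature-train-test splitting of the dataset.
`parameter_space`	`-`	`Series`	Returns the model parameter space set up for model selection when training.
`auto_tuning_config`	`-`	`Series`	Returns the configuration set up for auto-tuning.
`drop`	`failIfMissing: Optional[bool]`	`Series`	Remove the pipeline from the GDS pipeline catalog.
2. Ranges can also be given as length-two `Tuple`s, i.e. `(x, y)` is the same as `{range: [x, y]}`.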

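There are two main differences when comparing the methods above with the Cypher API procedures they map to:

As the Python methods are called on the pipeline object, one does not need to provide a name when calling them.
Configuration parameters in the Cypher calls are represented by named keyword arguments in the Python method calls.

Another difference is that the train Python call takes a graph object instead of a graph name, and returns an LPModel model object that we can run predictions with, as well as a pandas Series containing metadata from the training.

Please consult the link prediction Cypher documentation for information about what kind of input the methods expect.

2.1.1. Example

Below is an example of how to configure and train a very basic link prediction pipeline. Note that we do not configure the training parameters explicitly, but rather use the defaults.

To exemplify this, we introduce a small person graph: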

gds.run_cypher(
  """
  CREATE
    (a:Person {name: "Bob"}),
    (b:Person {name: "Alice"}),
    (c:Person {name: "Eve"}),
    (d:Person {name: "Chad"}),
    (e:Person {name: "Dan"}),
    (f:Person {name: "Judy"}),

    (a)-[:KNOWS]->(b),
    (a)-[:KNOWS]->(c),
    (a)-[:KNOWS]->(d),
    (b)-[:KNOWS]->(d),
    (c)-[:KNOWS]->(d),
    (c)-[:KNOWS]->(e),
    (d)-[:KNOWS]->(e),
    (d)-[:KNOWS]->(f),
    (e)-[:KNOWS]->(f)
  """
)
G, project_result = gds.graph.project("person_graph", "Person", {"KNOWS": {"orientation":"UNDIRECTED"}})

assert G.relationship_types() == ["KNOWS"]

pipe, _ = gds.beta.pipeline.linkPrediction.create("lp-pipe")

# Add FastRP as a property step producing "embedding" node properties
pipe.addNodeProperty("fastRP", embeddingDimension=128, mutateProperty="embedding", randomSeed=1337)

# Combine our "embedding" node properties with Hadamard to create link features for training
pipe.addFeature("hadamard", nodeProperties=["embedding"])

# Verify that the features to be used in model training are what we expect
steps = pipe.feature_steps()
assert len(steps) == 1
assert steps["name"][0] == "HADAMARD"

# Specify the fractions we want for our dataset split
pipe.configureSplit(trainFraction=0.2, testFraction=0.2, validationFolds=2)

# Add a random forest model with tuning over `maxDepth`
pipe.addRandomForest(maxDepth=(2, 20))

# Train the pipeline and produce a model named "friend-recommender"
friend_recommender, train_result = pipe.train(
    G,
    modelName="friend-recommender",
    targetRelationshipType="KNOWS",
    randomSeed=42
)
assert train_result["trainMillis"] >= 0

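A model referred to as "friend-recommender" in the GDS model catalog is produced. In the next section we will go over how to use that model to make predictions.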
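
The `hadamard` feature step used in the example combines the property vectors of the two endpoint nodes by element-wise multiplication. A minimal sketch in plain Python, using short made-up vectors in place of real 128-dimensional FastRP embeddings:

```python
from typing import List, Sequence


def hadamard(source: Sequence[float], target: Sequence[float]) -> List[float]:
    """Element-wise product of two equally long node property vectors."""
    if len(source) != len(target):
        raise ValueError("vectors must have the same length")
    return [s * t for s, t in zip(source, target)]


# Invented 3-dimensional embeddings for two nodes (illustration only)
embedding_a = [0.5, -1.0, 2.0]
embedding_b = [4.0, 0.5, 0.0]
print(hadamard(embedding_a, embedding_b))  # [2.0, -0.5, 0.0]
```

The resulting vector has the same dimension as the inputs and serves as the feature vector for the candidate link between the two nodes.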

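2.2. Model

As we saw in the previous section, link prediction models are created when training a link prediction pipeline. In addition to inheriting the methods common to all model objects, link prediction models have the following methods: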

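Table 4. Link prediction model methods
Name	Arguments	Return type	Description
`link_features`	`-`	`List[LinkFeature]`	The input link features used to train the model.
`node_property_steps`	`-`	`List[NodePropertyStep]`	List of algorithms the pipeline uses to compute node properties before training.
`metrics`	`-`	`Series`	Values of the metrics specified when training.
`best_parameters`	`-`	`Series`	The parameters of the training method that performed best across all validation sets.
`predict_mutate`	`G: Graph, config: **kwargs`	`Series`	Predict links between non-neighboring nodes of the input graph and mutate the graph with the predictions.
`predict_mutate_estimate`	`G: Graph, config: **kwargs`	`Series`	Estimate predicting links between non-neighboring nodes of the input graph and mutating the graph with the predictions.
`predict_stream`	`G: Graph, config: **kwargs`	`DataFrame`	Predict links between non-neighboring nodes of the input graph and stream the results.
`predict_stream_estimate`	`G: Graph, config: **kwargs`	`Series`	Estimate predicting links between non-neighboring nodes of the input graph and streaming the results.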

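One can note that the predict methods are indeed very similar to their Cypher counterparts. The three main differences are:

They take a graph object instead of a graph name.
They have Python keyword arguments representing the keys of the configuration map.
One does not have to provide a "modelName" since the model object used itself has this information.

2.2.1. Example (continued)

We now continue the example above using the link prediction model friend_recommender that we trained there.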

# Make sure we indeed obtained an AUCPR score
metrics = friend_recommender.metrics()
assert "AUCPR" in metrics

# Predict on `G` and mutate it with the relationship predictions
mutate_result = friend_recommender.predict_mutate(G, topN=5, mutateRelationshipType="PRED_REL")
assert mutate_result["relationshipsWritten"] == 5 * 2  # Undirected relationships
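
The factor of two in the last assertion reflects that each predicted undirected link is counted once per direction. A small sketch of that bookkeeping in plain Python; the node pairs are invented and `as_directed` is a hypothetical helper, not a client function:

```python
from typing import List, Tuple


def as_directed(pairs: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Expand each undirected pair into its two directed relationships."""
    directed: List[Tuple[int, int]] = []
    for source, target in pairs:
        directed.append((source, target))
        directed.append((target, source))
    return directed


# Five hypothetical predicted links, standing in for a topN=5 result
top_n_pairs = [(0, 4), (0, 5), (1, 2), (1, 4), (2, 5)]
assert len(as_directed(top_n_pairs)) == 5 * 2
```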

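3. Node regression

This section outlines how to use the Python client to build, configure and train a node regression pipeline, as well as how to use the model that training produces for predictions.

3.1. Pipeline

To create a new node regression pipeline, make the following call: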

pipe = gds.nr_pipe("my-pipe")

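where pipe is a pipeline object.

To then go on to build, configure and train the pipeline, we call methods directly on the node regression pipeline object. Below is a description of the methods on such objects: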

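Table 5. Node regression pipeline methods
Name	Arguments	Return type	Description
`addNodeProperty`	`procedure_name: str, config: **kwargs`	`Series`	Add an algorithm that produces a node property to the pipeline, with optional algorithm-specific configuration.
`selectFeatures`	`node_properties: Union[str, list[str]]`	`Series`	Select node properties to be used as features.
`configureSplit`	`config: **kwargs`	`Series`	Configure the train-test dataset split.
`addLinearRegression`	`parameter_space: dict[str, any]`	`Series`	Add a linear regression model configuration to train as a candidate in the model selection phase. ^[3]
`addRandomForest`	`parameter_space: dict[str, any]`	`Series`	Add a random forest model configuration to train as a candidate in the model selection phase. ^[3]
`configureAutoTuning`	`config: **kwargs`	`Series`	Configure the auto-tuning.
`train`	`G: Graph, config: **kwargs`	`NRModel, Series`	Train the pipeline on the given input graph using the given keyword arguments.
`feature_properties`	`-`	`Series`	Returns a list of the selected feature properties for the pipeline.
`exists`	`-`	`bool`	`True` if the pipeline exists in the GDS pipeline catalog, `False` otherwise.
`name`	`-`	`str`	The name of the pipeline as it appears in the pipeline catalog.
`type`	`-`	`str`	The type of pipeline.
`creation_time`	`-`	`neo4j.time.Datetime`	Time when the pipeline was created.
`node_property_steps`	`-`	`DataFrame`	Returns the node property steps of the pipeline.
`split_config`	`-`	`Series`	Returns the configuration set up for feature-train-test splitting of the dataset.
`parameter_space`	`-`	`Series`	Returns the model parameter space set up for model selection when training.
`auto_tuning_config`	`-`	`Series`	Returns the configuration set up for auto-tuning.
`drop`	`failIfMissing: Optional[bool]`	`Series`	Remove the pipeline from the GDS pipeline catalog.
3. Ranges can also be given as length-two `Tuple`s, i.e. `(x, y)` is the same as `{range: [x, y]}`.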

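There are two main differences when comparing the methods above with the Cypher API procedures they map to:

As the Python methods are called on the pipeline object, one does not need to provide a name when calling them.
Configuration parameters in the Cypher calls are represented by named keyword arguments in the Python method calls.

Another difference is that the train Python call takes a graph object instead of a graph name, and returns an NRModel model object that we can run predictions with, as well as a pandas Series containing metadata from the training.

Please consult the node regression Cypher documentation for information about what kind of input the methods expect.

3.1.1. Example

Below is an example of how to configure and train a very basic node regression pipeline. Note that we do not configure the split explicitly, but rather use the default.

To exemplify this, we introduce a small person graph: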

gds.run_cypher(
  """
  CREATE
    (a:Person {name: "Bob", age: 22}),
    (b:Person {name: "Alice", age: 5}),
    (c:Person {name: "Eve", age: 53}),
    (d:Person {name: "Chad", age: 44}),
    (e:Person {name: "Dan", age: 60}),
    (f:UnknownPerson {name: "Judy"}),

    (a)-[:KNOWS]->(b),
    (a)-[:KNOWS]->(c),
    (a)-[:KNOWS]->(d),
    (b)-[:KNOWS]->(d),
    (c)-[:KNOWS]->(d),
    (c)-[:KNOWS]->(e),
    (d)-[:KNOWS]->(e),
    (d)-[:KNOWS]->(f),
    (e)-[:KNOWS]->(f)
  """
)
G, project_result = gds.graph.project("person_graph", {"Person": {"properties": ["age"]}}, "KNOWS")

assert G.relationship_types() == ["KNOWS"]

pipe, _ = gds.alpha.pipeline.nodeRegression.create("nr-pipe")

# Add Degree centrality as a property step producing "rank" node properties
pipe.addNodeProperty("degree", mutateProperty="rank")

# Select our "rank" property as a feature for the model training
pipe.selectFeatures("rank")

# Verify that the features to be used in model training are what we expect
feature_properties = pipe.feature_properties()
assert len(feature_properties) == 1
assert feature_properties[0]["feature"] == "rank"

# Configure the model training to do cross-validation over linear regression
pipe.addLinearRegression(tolerance=(0.01, 0.1))
pipe.addLinearRegression(penalty=1.0)

# Train the pipeline targeting node property "age" as label and "MEAN_SQUARED_ERROR" as only metric
age_predictor, train_result = pipe.train(
    G,
    modelName="age-predictor",
    targetProperty="age",
    metrics=["MEAN_SQUARED_ERROR"],
    randomSeed=42
)
assert train_result["trainMillis"] >= 0

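A model referred to as "age-predictor" in the GDS model catalog is produced. In the next section we will go over how to use that model to make predictions.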
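
For orientation, the `MEAN_SQUARED_ERROR` metric requested in the training call is the average of the squared differences between predicted and actual values. The sketch below computes it in plain Python; the predicted ages are invented for illustration and are not output of the pipeline.

```python
from typing import Sequence


def mean_squared_error(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Average of squared differences between actual and predicted values."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


actual_ages = [22, 5, 53, 44, 60]      # ages of Bob, Alice, Eve, Chad, Dan
predicted_ages = [20, 10, 50, 40, 60]  # hypothetical predictions
print(mean_squared_error(actual_ages, predicted_ages))  # 10.8
```

Lower values indicate predictions closer to the target property.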

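3.2. Model

As we saw in the previous section, node regression models are created when training a node regression pipeline. In addition to inheriting the methods common to all model objects, node regression models have the following methods: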

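Table 6. Node regression model methods
Name	Arguments	Return type	Description
`feature_properties`	`-`	`List[str]`	Returns the node properties used as input model features.
`node_property_steps`	`-`	`List[NodePropertyStep]`	List of algorithms the pipeline uses to compute node properties before training.
`metrics`	`-`	`Series`	Values of the metrics specified when training.
`best_parameters`	`-`	`Series`	The parameters of the training method that performed best across all validation sets.
`predict_mutate`	`G: Graph, config: **kwargs`	`Series`	Predict property values for nodes of the input graph and mutate the graph with the predictions.
`predict_stream`	`G: Graph, config: **kwargs`	`DataFrame`	Predict property values for nodes of the input graph and stream the results.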

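One can note that the predict methods are indeed very similar to their Cypher counterparts. The three main differences are:

They take a graph object instead of a graph name.
They have Python keyword arguments representing the keys of the configuration map.
One does not have to provide a "modelName" since the model object used itself has this information.

3.2.1. Example (continued)

We now continue the example above using the node regression model age_predictor that we trained there. Suppose that we have a new graph H that we want to run predictions on.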

# Make sure we indeed obtained a MEAN_SQUARED_ERROR score
metrics = age_predictor.metrics()
assert "MEAN_SQUARED_ERROR" in metrics

H, project_result = gds.graph.project("full_person_graph", ["Person", "UnknownPerson"], "KNOWS")

# Predict on `H` and stream the results with a specific concurrency of 2
predictions = age_predictor.predict_stream(H, concurrency=2)
assert len(predictions) == H.node_count()

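4. Pipeline catalog

The primary way to use pipeline objects is for training models. Additionally, pipeline objects can be used as input to GDS pipeline catalog operations. For instance, supposing we have a pipeline object pipe, we could: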

exists_result = gds.beta.pipeline.exists(pipe.name())

if exists_result["exists"]:
    gds.beta.pipeline.drop(pipe)  # same as pipe.drop()
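
The same check can be written using only the pipeline object's own `exists` and `drop` methods listed in the method tables. This is a minimal sketch; `drop_if_exists` is a hypothetical helper name, and `pipe` is assumed to be a pipeline object created as in the earlier examples.

```python
from typing import Any


def drop_if_exists(pipe: Any) -> bool:
    """Drop the pipeline if it is in the catalog; return True if it was dropped."""
    if pipe.exists():
        # Equivalent to passing the pipeline object to the catalog drop operation
        pipe.drop()
        return True
    return False
```

Returning a boolean lets the caller tell whether anything was actually removed.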

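A pipeline object that has already been created and is present in the pipeline catalog can be retrieved by calling the get method with its name. For example, we can list the catalog and use the name of the first pipeline found to get a pipeline object representing that pipeline, which will be the node classification pipeline created in our example above.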

list_result = gds.beta.pipeline.list()
first_pipeline_name = list_result["pipelineName"][0]
pipe = gds.pipeline.get(first_pipeline_name)
assert pipe.name() == "my-pipe"