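Machine learning pipelines

The Python client has special support for link prediction pipelines and node property prediction pipelines. A GDS pipeline is represented as a pipeline object in the GDS Python client.

Operating pipelines through the client is based entirely on these pipeline objects. This is a more convenient and more Pythonic API than the Cypher procedure API. Once created, a pipeline object can be passed as an argument to various methods in the Python client, such as the pipeline catalog operations. Pipeline objects also have convenience methods for inspecting the pipeline they represent without explicitly involving the pipeline catalog.

The examples below assume that we have an instantiated GraphDataScience object called gds. See Getting started for more on this.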

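1. Node classification

This section outlines how to use the Python client to build, configure and train a node classification pipeline, and how to use the model that training produces for predictions.

1.1. Pipeline

To create a new node classification pipeline, make the following call: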

pipe = gds.nc_pipe("my-pipe")

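where pipe is a pipeline object.

To then build, configure and train the pipeline, we call methods directly on the node classification pipeline object. The methods of such an object are described below: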

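Table 1. Node classification pipeline methods
Name	Arguments	Return type	Description
`addNodeProperty`	`procedure_name: str, config: **kwargs`	`Series`	Add an algorithm that produces a node property to the pipeline, with optional algorithm-specific configuration.
`selectFeatures`	`node_properties: Union[str, list[str]]`	`Series`	Select node properties to be used as features.
`configureSplit`	`config: **kwargs`	`Series`	Configure the train-test dataset split.
`addLogisticRegression`	`parameter_space: dict[str, any]`	`Series`	Add a logistic regression model configuration to train as a candidate in the model selection phase. ^[1]
`addRandomForest`	`parameter_space: dict[str, any]`	`Series`	Add a random forest model configuration to train as a candidate in the model selection phase. ^[1]
`addMLP`	`parameter_space: dict[str, any]`	`Series`	Add a multilayer perceptron (MLP) model configuration to train as a candidate in the model selection phase. ^[1]
`configureAutoTuning`	`config: **kwargs`	`Series`	Configure the auto-tuning.
`train`	`G: Graph, config: **kwargs`	`NCModel, Series`	Train the pipeline on the given input graph using the given keyword arguments.
`train_estimate`	`G: Graph, config: **kwargs`	`Series`	Estimate the resources needed to train the pipeline on the given input graph using the given keyword arguments.
`feature_properties`	`-`	`Series`	Returns a list of the selected feature properties for the pipeline.
`exists`	`-`	`bool`	`True` if the pipeline exists in the GDS Pipeline Catalog, `False` otherwise.
`name`	`-`	`str`	The name of the pipeline as it appears in the pipeline catalog.
`type`	`-`	`str`	The type of pipeline.
`creation_time`	`-`	`neo4j.time.Datetime`	Time when the pipeline was created.
`node_property_steps`	`-`	`DataFrame`	Returns the node property steps of the pipeline.
`split_config`	`-`	`Series`	Returns the configuration set up for feature-train-test splitting of the dataset.
`parameter_space`	`-`	`Series`	Returns the model parameter space set up for model selection when training.
`auto_tuning_config`	`-`	`Series`	Returns the configuration set up for auto-tuning.
`drop`	`failIfMissing: Optional[bool]`	`Series`	Remove the pipeline from the GDS Pipeline Catalog.
1. Ranges can also be given as length-two `Tuple`s. For example, `(x, y)` is the same as `{range: [x, y]}`.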
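
The tuple shorthand in the footnote can be pictured with a small stand-alone sketch. The helper `expand_ranges` below is hypothetical and not part of the GDS Python client; it only illustrates that a length-two tuple stands for the same search range as an explicit range map.

```python
def expand_ranges(**parameter_space):
    """Hypothetical helper: rewrite length-two tuples as explicit range maps."""
    expanded = {}
    for key, value in parameter_space.items():
        if isinstance(value, tuple) and len(value) == 2:
            # (x, y) is shorthand for {"range": [x, y]}
            expanded[key] = {"range": list(value)}
        else:
            # Fixed values and explicit range maps pass through unchanged
            expanded[key] = value
    return expanded


shorthand = expand_ranges(tolerance=(0.01, 0.1), penalty=1.0)
explicit = expand_ranges(tolerance={"range": [0.01, 0.1]}, penalty=1.0)
assert shorthand == explicit
```

Under this reading, `tolerance=(0.01, 0.1)` asks model selection to search over a range, whereas `penalty=1.0` pins a single fixed value.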

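There are two main differences compared with the Cypher API procedures that the methods above map to:

Since the Python methods are called on the pipeline object, no name needs to be provided in the call.
Configuration parameters in the Cypher calls are represented by named keyword arguments in the Python method calls.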
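
The two points above can be pictured by writing out the Cypher call that a Python method call stands for. The helper `render_cypher_call` below is hypothetical and exists only for this illustration; the procedure name is assumed from the `gds.beta.pipeline.nodeClassification` namespace used on this page, and the sketch handles only string and numeric configuration values.

```python
def render_cypher_call(procedure, pipeline_name, procedure_name, **config):
    """Hypothetical helper: spell out a Cypher call with an explicit pipeline name."""

    def literal(value):
        # Quote strings; render numbers with their Python repr
        return f"'{value}'" if isinstance(value, str) else repr(value)

    # Keyword arguments become the keys of the Cypher configuration map
    config_map = ", ".join(f"{key}: {literal(value)}" for key, value in config.items())
    return f"CALL {procedure}('{pipeline_name}', '{procedure_name}', {{{config_map}}})"


cypher = render_cypher_call(
    "gds.beta.pipeline.nodeClassification.addNodeProperty",  # assumed procedure name
    "my-pipe",
    "degree",
    mutateProperty="rank",
)
print(cypher)
```

With a pipeline object, the same step is written as `pipe.addNodeProperty("degree", mutateProperty="rank")`: the pipeline name is implied by `pipe`, and the configuration map is expressed as keyword arguments.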

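Another difference is that the train Python call takes a graph object instead of a graph name, and returns an NCModel model object that we can run predictions with, as well as a pandas Series with metadata from the training.

Please consult the node classification Cypher documentation for information about the input the methods expect.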

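1.1.1. Example

Below is a small example of how to configure and train a very basic node classification pipeline. Note that we don't configure the split explicitly, but use the defaults.

To illustrate this, we introduce a small person graph: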

gds.run_cypher(
  """
  CREATE
    (a:Person {name: "Bob", fraudster: 0}),
    (b:Person {name: "Alice", fraudster: 0}),
    (c:Person {name: "Eve", fraudster: 1}),
    (d:Person {name: "Chad", fraudster: 1}),
    (e:Person {name: "Dan", fraudster: 0}),
    (f:UnknownPerson {name: "Judy"}),

    (a)-[:KNOWS]->(b),
    (a)-[:KNOWS]->(c),
    (a)-[:KNOWS]->(d),
    (b)-[:KNOWS]->(d),
    (c)-[:KNOWS]->(d),
    (c)-[:KNOWS]->(e),
    (d)-[:KNOWS]->(e),
    (d)-[:KNOWS]->(f),
    (e)-[:KNOWS]->(f)
  """
)
G, project_result = gds.graph.project("person_graph", {"Person": {"properties": ["fraudster"]}}, "KNOWS")

assert G.node_labels() == ["Person"]

pipe, _ = gds.beta.pipeline.nodeClassification.create("my-pipe")

# Add Degree centrality as a property step producing "rank" node properties
pipe.addNodeProperty("degree", mutateProperty="rank")

# Select our "rank" property as a feature for the model training
pipe.selectFeatures("rank")

# Verify that the features to be used in model training are what we expect
feature_properties = pipe.feature_properties()
assert len(feature_properties) == 1
assert feature_properties[0]["feature"] == "rank"

# Configure the model training to do cross-validation over logistic regression
pipe.addLogisticRegression(tolerance=(0.01, 0.1))
pipe.addLogisticRegression(penalty=1.0)

# Train the pipeline targeting node property "fraudster" as label and "ACCURACY" as only metric
fraud_model, train_result = pipe.train(
    G,
    modelName="fraud-model",
    targetProperty="fraudster",
    metrics=["ACCURACY"],
    randomSeed=111
)
assert train_result["trainMillis"] >= 0

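A model referred to as "fraud-model" in the GDS Model Catalog is produced. In the next section we will go over how to use that model to make predictions.

1.2. Model

As we saw in the previous section, node classification models are created when training a node classification pipeline. In addition to inheriting the methods common to all model objects, node classification models have the following methods: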

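Table 2. Node classification model methods
Name	Arguments	Return type	Description
`classes`	`-`	`List[int]`	List of classes used to train the classification model.
`feature_properties`	`-`	`List[str]`	Node properties used as input model features.
`node_property_steps`	`-`	`List[NodePropertyStep]`	List of algorithms the pipeline uses to compute node properties before training.
`metrics`	`-`	`Series`	Values of the metrics specified when training.
`best_parameters`	`-`	`Series`	Parameters of the training method that performed best on the validation set.
`predict_mutate`	`G: Graph, config: **kwargs`	`Series`	Predict classes for nodes of the input graph and mutate the graph with the predictions.
`predict_mutate_estimate`	`G: Graph, config: **kwargs`	`Series`	Estimate the resources needed to predict classes for nodes of the input graph and mutate the graph with the predictions.
`predict_stream`	`G: Graph, config: **kwargs`	`DataFrame`	Predict classes for nodes of the input graph and stream the results.
`predict_stream_estimate`	`G: Graph, config: **kwargs`	`Series`	Estimate the resources needed to predict classes for nodes of the input graph and stream the results.
`predict_write`	`G: Graph, config: **kwargs`	`Series`	Predict classes for nodes of the input graph and write the results back to the database.
`predict_write_estimate`	`G: Graph, config: **kwargs`	`Series`	Estimate the resources needed to predict classes for nodes of the input graph and write the results back to the database.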

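As we can see, the predict methods are very similar to their Cypher counterparts. The three main differences are:

They take a graph object instead of a graph name.
They have Python keyword arguments representing the keys of the configuration map.
One does not have to provide a "modelName", since the model object used carries this information itself.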
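
These three differences can be sketched with stand-in objects. `GraphStub` and `ModelStub` below are hypothetical and are not the client's classes; they only show where the graph name and the "modelName" entry come from when a predict method is called on a model object.

```python
from dataclasses import dataclass


@dataclass
class GraphStub:
    """Stand-in for a graph object; it only carries the projected graph's name."""
    graph_name: str


@dataclass
class ModelStub:
    """Stand-in for a model object; it only carries the model's catalog name."""
    model_name: str

    def predict_stream_arguments(self, G, **config):
        # The graph name comes from the graph object and "modelName" from the
        # model object itself, so the caller passes neither as a string.
        return G.graph_name, {"modelName": self.model_name, **config}


model = ModelStub("fraud-model")
H = GraphStub("full_person_graph")
graph_name, config_map = model.predict_stream_arguments(H, concurrency=2)
assert graph_name == "full_person_graph"
assert config_map == {"modelName": "fraud-model", "concurrency": 2}
```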

1.2.1. Example (continued)

We now continue the example above using the node classification model fraud_model that we trained there.

# Make sure we indeed obtained an accuracy score
metrics = fraud_model.metrics()
assert "ACCURACY" in metrics

H, project_result = gds.graph.project("full_person_graph", ["Person", "UnknownPerson"], "KNOWS")

# Predict on `H` and stream the results with a specific concurrency of 2
predictions = fraud_model.predict_stream(H, concurrency=2)
assert len(predictions) == H.node_count()

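2. Link prediction

This section outlines how to use the Python client to build, configure and train a link prediction pipeline, and how to use the model that training produces for predictions.

2.1. Pipeline

To create a new link prediction pipeline, make the following call: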

pipe = gds.lp_pipe("my-pipe")

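where pipe is a pipeline object.

To then build, configure and train the pipeline, we call methods directly on the link prediction pipeline object. The methods of such an object are described below: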

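Table 3. Link prediction pipeline methods
Name	Arguments	Return type	Description
`addNodeProperty`	`procedure_name: str, config: **kwargs`	`Series`	Add an algorithm that produces a node property to the pipeline, with optional algorithm-specific configuration.
`addFeature`	`feature_type: str, config: **kwargs`	`Series`	Add a link feature for model training based on node properties and a feature combiner.
`configureSplit`	`config: **kwargs`	`Series`	Configure the feature-train-test dataset split.
`addLogisticRegression`	`parameter_space: dict[str, any]`	`Series`	Add a logistic regression model configuration to train as a candidate in the model selection phase. ^[2]
`addRandomForest`	`parameter_space: dict[str, any]`	`Series`	Add a random forest model configuration to train as a candidate in the model selection phase. ^[2]
`addMLP`	`parameter_space: dict[str, any]`	`Series`	Add a multilayer perceptron (MLP) model configuration to train as a candidate in the model selection phase. ^[2]
`configureAutoTuning`	`config: **kwargs`	`Series`	Configure the auto-tuning.
`train`	`G: Graph, config: **kwargs`	`LPModel, Series`	Train the pipeline on the given input graph using the given keyword arguments.
`train_estimate`	`G: Graph, config: **kwargs`	`Series`	Estimate the resources needed to train the pipeline on the given input graph using the given keyword arguments.
`feature_steps`	`-`	`DataFrame`	Returns a list of the selected feature steps for the pipeline.
`exists`	`-`	`bool`	`True` if the pipeline exists in the GDS Pipeline Catalog, `False` otherwise.
`name`	`-`	`str`	The name of the pipeline as it appears in the pipeline catalog.
`type`	`-`	`str`	The type of pipeline.
`creation_time`	`-`	`neo4j.time.Datetime`	Time when the pipeline was created.
`node_property_steps`	`-`	`DataFrame`	Returns the node property steps of the pipeline.
`split_config`	`-`	`Series`	Returns the configuration set up for feature-train-test splitting of the dataset.
`parameter_space`	`-`	`Series`	Returns the model parameter space set up for model selection when training.
`auto_tuning_config`	`-`	`Series`	Returns the configuration set up for auto-tuning.
`drop`	`failIfMissing: Optional[bool]`	`Series`	Remove the pipeline from the GDS Pipeline Catalog.
2. Ranges can also be given as length-two `Tuple`s. For example, `(x, y)` is the same as `{range: [x, y]}`.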

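There are two main differences compared with the Cypher API procedures that the methods above map to:

Since the Python methods are called on the pipeline object, no name needs to be provided in the call.
Configuration parameters in the Cypher calls are represented by named keyword arguments in the Python method calls.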

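Another difference is that the train Python call takes a graph object instead of a graph name, and returns an LPModel model object that we can run predictions with, as well as a pandas Series with metadata from the training.

Please consult the link prediction Cypher documentation for information about the input the methods expect.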

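2.1.1. Example

Below is a small example of how to configure and train a very basic link prediction pipeline. Note that we don't configure the training parameters explicitly, but use the defaults.

To illustrate this, we introduce a small person graph: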

gds.run_cypher(
  """
  CREATE
    (a:Person {name: "Bob"}),
    (b:Person {name: "Alice"}),
    (c:Person {name: "Eve"}),
    (d:Person {name: "Chad"}),
    (e:Person {name: "Dan"}),
    (f:Person {name: "Judy"}),

    (a)-[:KNOWS]->(b),
    (a)-[:KNOWS]->(c),
    (a)-[:KNOWS]->(d),
    (b)-[:KNOWS]->(d),
    (c)-[:KNOWS]->(d),
    (c)-[:KNOWS]->(e),
    (d)-[:KNOWS]->(e),
    (d)-[:KNOWS]->(f),
    (e)-[:KNOWS]->(f)
  """
)
G, project_result = gds.graph.project("person_graph", "Person", {"KNOWS": {"orientation":"UNDIRECTED"}})

assert G.relationship_types() == ["KNOWS"]

pipe, _ = gds.beta.pipeline.linkPrediction.create("lp-pipe")

# Add FastRP as a property step producing "embedding" node properties
pipe.addNodeProperty("fastRP", embeddingDimension=128, mutateProperty="embedding", randomSeed=1337)

# Combine our "embedding" node properties with Hadamard to create link features for training
pipe.addFeature("hadamard", nodeProperties=["embedding"])

# Verify that the features to be used in model training are what we expect
steps = pipe.feature_steps()
assert len(steps) == 1
assert steps["name"][0] == "HADAMARD"

# Specify the fractions we want for our dataset split
pipe.configureSplit(trainFraction=0.2, testFraction=0.2, validationFolds=2)

# Add a random forest model with tuning over `maxDepth`
pipe.addRandomForest(maxDepth=(2, 20))

# Train the pipeline and produce a model named "friend-recommender"
friend_recommender, train_result = pipe.train(
    G,
    modelName="friend-recommender",
    targetRelationshipType="KNOWS",
    randomSeed=42
)
assert train_result["trainMillis"] >= 0

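A model referred to as "friend-recommender" in the GDS Model Catalog is produced. In the next section we will go over how to use that model to make predictions.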
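
The "hadamard" feature step used in this example combines the `embedding` properties of a node pair by element-wise multiplication. A minimal stand-alone sketch of that combiner follows; the helper name and the vectors are made up for illustration, and this is not the GDS implementation.

```python
def hadamard(source_embedding, target_embedding):
    """Element-wise product of two equally long node property vectors."""
    if len(source_embedding) != len(target_embedding):
        raise ValueError("embeddings must have the same length")
    return [s * t for s, t in zip(source_embedding, target_embedding)]


# Made-up 4-dimensional embeddings for two nodes (the example uses 128 dimensions)
bob = [0.5, -1.0, 2.0, 0.0]
alice = [2.0, 3.0, 0.5, 4.0]
link_feature = hadamard(bob, alice)
assert link_feature == [1.0, -3.0, 1.0, 0.0]
```

Because multiplication is commutative, the resulting link feature is the same regardless of which node is treated as the source, which suits the undirected projection used here.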

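2.2. Model

As we saw in the previous section, link prediction models are created when training a link prediction pipeline. In addition to inheriting the methods common to all model objects, link prediction models have the following methods: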

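Table 4. Link prediction model methods
Name	Arguments	Return type	Description
`link_features`	`-`	`List[LinkFeature]`	The input link features used to train the model.
`node_property_steps`	`-`	`List[NodePropertyStep]`	List of algorithms the pipeline uses to compute node properties before training.
`metrics`	`-`	`Series`	Values of the metrics specified when training.
`best_parameters`	`-`	`Series`	Parameters of the training method that performed best on the validation set.
`predict_mutate`	`G: Graph, config: **kwargs`	`Series`	Predict links between non-neighboring nodes of the input graph and mutate the graph with the predictions.
`predict_mutate_estimate`	`G: Graph, config: **kwargs`	`Series`	Estimate the resources needed to predict links between non-neighboring nodes of the input graph and mutate the graph with the predictions.
`predict_stream`	`G: Graph, config: **kwargs`	`DataFrame`	Predict links between non-neighboring nodes of the input graph and stream the results.
`predict_stream_estimate`	`G: Graph, config: **kwargs`	`Series`	Estimate the resources needed to predict links between non-neighboring nodes of the input graph and stream the results.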

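As we can see, the predict methods are very similar to their Cypher counterparts. The three main differences are:

They take a graph object instead of a graph name.
They have Python keyword arguments representing the keys of the configuration map.
One does not have to provide a "modelName", since the model object used carries this information itself.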

2.2.1. Example (continued)

We now continue the example above using the link prediction model friend_recommender that we trained there.

# Make sure we indeed obtained an AUCPR score
metrics = friend_recommender.metrics()
assert "AUCPR" in metrics

# Predict on `G` and mutate it with the relationship predictions
mutate_result = friend_recommender.predict_mutate(G, topN=5, mutateRelationshipType="PRED_REL")
assert mutate_result["relationshipsWritten"] == 5 * 2  # Undirected relationships

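3. Node regression

This section outlines how to use the Python client to build, configure and train a node regression pipeline, and how to use the model that training produces for predictions.

3.1. Pipeline

To create a new node regression pipeline, make the following call: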

pipe = gds.nr_pipe("my-pipe")

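where pipe is a pipeline object.

To then build, configure and train the pipeline, we call methods directly on the node regression pipeline object. The methods of such an object are described below: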

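Table 5. Node regression pipeline methods
Name	Arguments	Return type	Description
`addNodeProperty`	`procedure_name: str, config: **kwargs`	`Series`	Add an algorithm that produces a node property to the pipeline, with optional algorithm-specific configuration.
`selectFeatures`	`node_properties: Union[str, list[str]]`	`Series`	Select node properties to be used as features.
`configureSplit`	`config: **kwargs`	`Series`	Configure the train-test dataset split.
`addLinearRegression`	`parameter_space: dict[str, any]`	`Series`	Add a linear regression model configuration to train as a candidate in the model selection phase. ^[3]
`addRandomForest`	`parameter_space: dict[str, any]`	`Series`	Add a random forest model configuration to train as a candidate in the model selection phase. ^[3]
`configureAutoTuning`	`config: **kwargs`	`Series`	Configure the auto-tuning.
`train`	`G: Graph, config: **kwargs`	`NRModel, Series`	Train the pipeline on the given input graph using the given keyword arguments.
`feature_properties`	`-`	`Series`	Returns a list of the selected feature properties for the pipeline.
`exists`	`-`	`bool`	`True` if the pipeline exists in the GDS Pipeline Catalog, `False` otherwise.
`name`	`-`	`str`	The name of the pipeline as it appears in the pipeline catalog.
`type`	`-`	`str`	The type of pipeline.
`creation_time`	`-`	`neo4j.time.Datetime`	Time when the pipeline was created.
`node_property_steps`	`-`	`DataFrame`	Returns the node property steps of the pipeline.
`split_config`	`-`	`Series`	Returns the configuration set up for feature-train-test splitting of the dataset.
`parameter_space`	`-`	`Series`	Returns the model parameter space set up for model selection when training.
`auto_tuning_config`	`-`	`Series`	Returns the configuration set up for auto-tuning.
`drop`	`failIfMissing: Optional[bool]`	`Series`	Remove the pipeline from the GDS Pipeline Catalog.
3. Ranges can also be given as length-two `Tuple`s. For example, `(x, y)` is the same as `{range: [x, y]}`.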

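There are two main differences compared with the Cypher API procedures that the methods above map to:

Since the Python methods are called on the pipeline object, no name needs to be provided in the call.
Configuration parameters in the Cypher calls are represented by named keyword arguments in the Python method calls.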

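Another difference is that the train Python call takes a graph object instead of a graph name, and returns an NRModel model object that we can run predictions with, as well as a pandas Series with metadata from the training.

Please consult the node regression Cypher documentation for information about the input the methods expect.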

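3.1.1. Example

Below is a small example of how to configure and train a very basic node regression pipeline. Note that we don't configure the split explicitly, but use the defaults.

To illustrate this, we introduce a small person graph: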

gds.run_cypher(
  """
  CREATE
    (a:Person {name: "Bob", age: 22}),
    (b:Person {name: "Alice", age: 5}),
    (c:Person {name: "Eve", age: 53}),
    (d:Person {name: "Chad", age: 44}),
    (e:Person {name: "Dan", age: 60}),
    (f:UnknownPerson {name: "Judy"}),

    (a)-[:KNOWS]->(b),
    (a)-[:KNOWS]->(c),
    (a)-[:KNOWS]->(d),
    (b)-[:KNOWS]->(d),
    (c)-[:KNOWS]->(d),
    (c)-[:KNOWS]->(e),
    (d)-[:KNOWS]->(e),
    (d)-[:KNOWS]->(f),
    (e)-[:KNOWS]->(f)
  """
)
G, project_result = gds.graph.project("person_graph", {"Person": {"properties": ["age"]}}, "KNOWS")

assert G.relationship_types() == ["KNOWS"]

pipe, _ = gds.alpha.pipeline.nodeRegression.create("nr-pipe")

# Add Degree centrality as a property step producing "rank" node properties
pipe.addNodeProperty("degree", mutateProperty="rank")

# Select our "rank" property as a feature for the model training
pipe.selectFeatures("rank")

# Verify that the features to be used in model training are what we expect
feature_properties = pipe.feature_properties()
assert len(feature_properties) == 1
assert feature_properties[0]["feature"] == "rank"

# Configure the model training to do cross-validation over linear regression
pipe.addLinearRegression(tolerance=(0.01, 0.1))
pipe.addLinearRegression(penalty=1.0)

# Train the pipeline targeting node property "age" as label and "MEAN_SQUARED_ERROR" as only metric
age_predictor, train_result = pipe.train(
    G,
    modelName="age-predictor",
    targetProperty="age",
    metrics=["MEAN_SQUARED_ERROR"],
    randomSeed=42
)
assert train_result["trainMillis"] >= 0

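A model referred to as "age-predictor" in the GDS Model Catalog is produced. In the next section we will go over how to use that model to make predictions.

3.2. Model

As we saw in the previous section, node regression models are created when training a node regression pipeline. In addition to inheriting the methods common to all model objects, node regression models have the following methods: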

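Table 6. Node regression model methods
Name	Arguments	Return type	Description
`feature_properties`	`-`	`List[str]`	Returns the node properties used as input model features.
`node_property_steps`	`-`	`List[NodePropertyStep]`	List of algorithms the pipeline uses to compute node properties before training.
`metrics`	`-`	`Series`	Values of the metrics specified when training.
`best_parameters`	`-`	`Series`	Parameters of the training method that performed best on the validation set.
`predict_mutate`	`G: Graph, config: **kwargs`	`Series`	Predict property values for nodes of the input graph and mutate the graph with the predictions.
`predict_stream`	`G: Graph, config: **kwargs`	`DataFrame`	Predict property values for nodes of the input graph and stream the results.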

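As we can see, the predict methods are very similar to their Cypher counterparts. The three main differences are:

They take a graph object instead of a graph name.
They have Python keyword arguments representing the keys of the configuration map.
One does not have to provide a "modelName", since the model object used carries this information itself.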

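3.2.1. Example (continued)

We now continue the example above using the node regression model age_predictor that we trained there. Suppose that we have a new graph H that we want to run predictions on.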

# Make sure we indeed obtained a MEAN_SQUARED_ERROR score
metrics = age_predictor.metrics()
assert "MEAN_SQUARED_ERROR" in metrics

H, project_result = gds.graph.project("full_person_graph", ["Person", "UnknownPerson"], "KNOWS")

# Predict on `H` and stream the results with a specific concurrency of 2
predictions = age_predictor.predict_stream(H, concurrency=2)
assert len(predictions) == H.node_count()
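
For reference, the MEAN_SQUARED_ERROR metric requested during training is the average squared difference between actual and predicted values. A minimal stand-alone sketch follows; the helper name and the predicted ages are made up for illustration, while the actual ages are those of the labelled Person nodes above.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("inputs must have the same length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Ages of the labelled Person nodes from the example graph
actual_ages = [22, 5, 53, 44, 60]
# Made-up predictions, for illustration only
predicted_ages = [20, 10, 50, 40, 65]
mse = mean_squared_error(actual_ages, predicted_ages)
assert mse == 15.8  # (4 + 25 + 9 + 16 + 25) / 5
```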

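4. The pipeline catalog

The primary way to use pipeline objects is for training models. Additionally, pipeline objects can be used as input to GDS Pipeline Catalog operations. For instance, supposing we have a pipeline object pipe, we could: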

exists_result = gds.pipeline.exists(pipe.name())

if exists_result["exists"]:
    gds.pipeline.drop(pipe)  # same as pipe.drop()

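A pipeline object that has already been created and is present in the pipeline catalog can be retrieved by calling the get method with its name. For example, we can list the catalog and use the first pipeline name found to get a pipeline object representing that pipeline, which will be the NodeClassification pipeline we created in the example above.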

list_result = gds.pipeline.list()
first_pipeline_name = list_result["pipelineName"][0]
pipe = gds.pipeline.get(first_pipeline_name)
assert pipe.name() == "my-pipe"