Microsoft Azure Cognitive Services

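The Microsoft Azure Cognitive Services API uses machine learning to find insights and relationships in text. The procedures in this chapter are wrappers around calls to this API. They extract entities and key phrases, and provide sentiment analysis, from text stored as node properties.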

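Each procedure has two modes:

Stream - returns a map constructed from the JSON returned by the API

Graph - creates a graph, or a virtual graph, based on the values returned by the API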

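Procedure Overview

The procedures are described below:

Qualified Name	Description	Type	Release
apoc.nlp.azure.entities.graph	Creates a (virtual) entity graph for the provided text	Procedure	Apoc Extended
apoc.nlp.azure.entities.stream	Provides an entity analysis for the provided text	Procedure	Apoc Extended
apoc.nlp.azure.keyPhrases.graph	Creates a (virtual) key phrase graph for the provided text	Procedure	Apoc Extended
apoc.nlp.azure.keyPhrases.stream	Provides a key phrase analysis for the provided text	Procedure	Apoc Extended
apoc.nlp.azure.sentiment.graph	Creates a (virtual) sentiment graph for the provided text	Procedure	Apoc Extended
apoc.nlp.azure.sentiment.stream	Provides a sentiment analysis for the provided text	Procedure	Apoc Extended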

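The Microsoft Azure Cognitive Services API currently supports text input in more than 10 languages. For better results, make sure that your text is in one of the languages that Cognitive Services supports.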

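Entity Extraction

The entity extraction procedures (apoc.nlp.azure.entities.*) are wrappers around the entities endpoint of the Azure Text Analytics API. This API method returns a list of known entities and general named entities ("Person", "Location", "Organization", and so on) in a given document.

The procedures are described below:

Signature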
apoc.nlp.azure.entities.graph(source :: ANY?, config = {} :: MAP?) :: (graph :: MAP?)
apoc.nlp.azure.entities.stream(source :: ANY?, config = {} :: MAP?) :: (node :: NODE?, value :: MAP?, error :: MAP?)

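The procedures support the following config parameters:

Table 1. Config parameters
Name	Type	Default	Description
key	String	null	Microsoft.CognitiveServicesTextAnalytics API key
url	String	null	Microsoft.CognitiveServicesTextAnalytics endpoint
nodeProperty	String	text	The property on the provided node(s) that contains the unstructured text to be analyzed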

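In addition, apoc.nlp.azure.entities.graph supports the following config parameters:

Table 2. Config parameters
Name	Type	Default	Description
scoreCutoff	Double	0.0	Lower limit for the score of an entity to be present in the graph. The value must be between 0 and 1. The score indicates the level of confidence that the API has in the accuracy of the detection.
write	Boolean	false	Persist the graph of entities
writeRelationshipType	String	ENTITY	Relationship type from the source node to an entity node
writeRelationshipProperty	String	score	Relationship property from the source node to an entity node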

Stream mode

CALL apoc.nlp.azure.entities.stream(source:Node or List<Node>, {
  key: String,
  url: String,
  nodeProperty: String
})
YIELD value

Graph mode

CALL apoc.nlp.azure.entities.graph(source:Node or List<Node>, {
  key: String,
  url: String,
  nodeProperty: String,
  scoreCutoff: Double,
  writeRelationshipType: String,
  writeRelationshipProperty: String,
  write: Boolean
})
YIELD graph

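Key Phrases

The key phrase procedures (apoc.nlp.azure.keyPhrases.*) are wrappers around the key phrases endpoint of the Azure Text Analytics API. Key phrases are the main talking points in the input text.

The procedures are described below:

Signature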
apoc.nlp.azure.keyPhrases.graph(source :: ANY?, config = {} :: MAP?) :: (graph :: MAP?)
apoc.nlp.azure.keyPhrases.stream(source :: ANY?, config = {} :: MAP?) :: (node :: NODE?, value :: MAP?, error :: MAP?)

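The procedures support the following config parameters:

Table 3. Config parameters
Name	Type	Default	Description
key	String	null	Microsoft.CognitiveServicesTextAnalytics API key
url	String	null	Microsoft.CognitiveServicesTextAnalytics endpoint
nodeProperty	String	text	The property on the provided node(s) that contains the unstructured text to be analyzed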

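In addition, apoc.nlp.azure.keyPhrases.graph supports the following config parameters:

Table 4. Config parameters
Name	Type	Default	Description
write	Boolean	false	Persist the graph of key phrases
writeRelationshipType	String	KEY_PHRASE	Relationship type from the source node to a key phrase node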

Stream mode

CALL apoc.nlp.azure.keyPhrases.stream(source:Node or List<Node>, {
  key: String,
  url: String,
  nodeProperty: String
})
YIELD value

Graph mode

CALL apoc.nlp.azure.keyPhrases.graph(source:Node or List<Node>, {
  key: String,
  url: String,
  nodeProperty: String,
  writeRelationshipType: String,
  write: Boolean
})
YIELD graph

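Sentiment Analysis

The sentiment procedures (apoc.nlp.azure.sentiment.*) are wrappers around the sentiment endpoint of the Azure Text Analytics API. The API returns a numeric score between 0 and 1. Scores close to 1 indicate positive sentiment, while scores close to 0 indicate negative sentiment. A score of 0.5 indicates a lack of sentiment (for example, a factual statement).

The procedures are described below:

Signature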
apoc.nlp.azure.sentiment.graph(source :: ANY?, config = {} :: MAP?) :: (graph :: MAP?)
apoc.nlp.azure.sentiment.stream(source :: ANY?, config = {} :: MAP?) :: (node :: NODE?, value :: MAP?, error :: MAP?)

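The procedures support the following config parameters:

Table 5. Config parameters
Name	Type	Default	Description
key	String	null	Microsoft.CognitiveServicesTextAnalytics API key
url	String	null	Microsoft.CognitiveServicesTextAnalytics endpoint
nodeProperty	String	text	The property on the provided node(s) that contains the unstructured text to be analyzed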

In addition, apoc.nlp.azure.sentiment.graph supports the following config parameters:

Table 6. Config parameters
Name	Type	Default	Description
write	Boolean	false	Persist the sentiment graph

Stream mode

CALL apoc.nlp.azure.sentiment.stream(source:Node or List<Node>, {
  key: String,
  url: String,
  nodeProperty: String
})
YIELD value

Graph mode

CALL apoc.nlp.azure.sentiment.graph(source:Node or List<Node>, {
  key: String,
  url: String,
  nodeProperty: String,
  write: Boolean
})
YIELD graph

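Install Dependencies

The NLP procedures depend on Kotlin and client libraries that are not included in the APOC Extended library.

These dependencies are included in apoc-nlp-dependencies-5.26.1-all.jar, which can be downloaded from the releases page. Once the file is downloaded, place it in the plugins directory and restart the Neo4j server.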

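Setting up API Key and URL

We can generate an API key and URL by following the instructions in the Quickstart: Use the Text Analytics client library article. Once that is done, you should see a page listing your credentials, similar to the screenshot below:

Figure 1. Azure Text Analytics credentials

In this case our API URL is https://neo4j-nlp-text-analytics.cognitiveservices.azure.com/, and we can use either of the hidden keys.

Let's populate and execute the following commands to create parameters that contain these details.

The following defines the apiKey and apiUrl parameters: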

:param apiKey => ("<api-key-here>");
:param apiUrl => ("<api-url-here>");

Alternatively, we can add these credentials to apoc.conf and retrieve them using the static value storage functions. See Static Value Storage.

apoc.conf

apoc.static.azure.apiKey=<api-key-here>
apoc.static.azure.apiUrl=<api-url-here>

The following retrieves the Azure credentials from apoc.conf:

RETURN apoc.static.getAll("azure") AS azure;

Table 7. Results
azure
{apiKey: "<api-key-here>", apiUrl: "<api-url-here>"}
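
The stored values can also be passed straight into a procedure call instead of the $apiKey and $apiUrl parameters. The following is a minimal sketch, assuming the apoc.conf entries above and nodes labeled Article that keep their text in a body property (as in the sample graph of the Examples section); the column aliases are illustrative:

```cypher
// Read the Azure credentials stored under the "azure" prefix in apoc.conf.
WITH apoc.static.getAll("azure") AS azure
// Assumption: Article nodes keep the text to analyze in the body property.
MATCH (a:Article)
CALL apoc.nlp.azure.sentiment.stream(a, {
  key: azure.apiKey,
  url: azure.apiUrl,
  nodeProperty: "body"
})
YIELD node, value
RETURN node.uri AS article, value.score AS score;
```

Keeping the credentials in apoc.conf avoids having to redefine the parameters in every session.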

Examples

The examples in this section are based on the following sample graph:

CREATE (:Article {
  uri: "https://neo4j.ac.cn/blog/pokegraph-gotta-graph-em-all/",
  body: "These days I’m rarely more than a few feet away from my Nintendo Switch and I play board games, card games and role playing games with friends at least once or twice a week. I’ve even organised lunch-time Mario Kart 8 tournaments between the Neo4j European offices!"
});

CREATE (:Article {
  uri: "https://en.wikipedia.org/wiki/Nintendo_Switch",
  body: "The Nintendo Switch is a video game console developed by Nintendo, released worldwide in most regions on March 3, 2017. It is a hybrid console that can be used as a home console and portable device. The Nintendo Switch was unveiled on October 20, 2016. Nintendo offers a Joy-Con Wheel, a small steering wheel-like unit that a Joy-Con can slot into, allowing it to be used for racing games such as Mario Kart 8."
});

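Entity Extraction

Let's start by extracting the entities from an Article node. The text that we want to analyze is stored in the body property of the node, so we need to specify that via the nodeProperty config parameter.

The following streams the entities for the Pokemon article: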

MATCH (a:Article {uri: "https://neo4j.ac.cn/blog/pokegraph-gotta-graph-em-all/"})
CALL apoc.nlp.azure.entities.stream(a, {
  key: $apiKey,
  url: $apiUrl,
  nodeProperty: "body"
})
YIELD value
UNWIND value.entities AS entity
RETURN entity;

Table 8. Results
entity
{name: "Nintendo Switch", wikipediaId: "Nintendo Switch", type: "Other", matches: [{length: 15, text: "Nintendo Switch", wikipediaScore: 0.8339868065025469, offset: 56}], bingId: "b3d617ef-81fc-4188-9a2b-a5cf1f8534b5", wikipediaLanguage: "en", wikipediaUrl: "https://en.wikipedia.org/wiki/Nintendo_Switch"}
{name: "Nintendo Switch", type: "Organization", matches: [{length: 15, entityTypeScore: 0.94, text: "Nintendo Switch", offset: 56}]}
{name: "Oberon Media", wikipediaId: "Oberon Media", type: "Organization", matches: [{length: 6, text: "I play", wikipediaScore: 0.032446316016667254, offset: 76}], bingId: "166f6e0f-33b7-8707-bb8b-5a932c498333", wikipediaLanguage: "en", wikipediaUrl: "https://en.wikipedia.org/wiki/Oberon_Media"}
{name: "a week", subType: "Duration", type: "DateTime", matches: [{length: 6, entityTypeScore: 0.8, text: "a week", offset: 166}]}
{name: "Mario Kart 8", wikipediaId: "Mario Kart 8", type: "Other", matches: [{length: 12, text: "Mario Kart 8", wikipediaScore: 0.7802000593632747, offset: 205}], bingId: "ce6f55ec-d3d7-032a-0bf8-15ad3e8df3f4", wikipediaLanguage: "en", wikipediaUrl: "https://en.wikipedia.org/wiki/Mario_Kart_8"}
{name: "Mario Kart", type: "Organization", matches: [{length: 10, entityTypeScore: 0.72, text: "Mario Kart", offset: 205}]}
{name: "8", subType: "Number", type: "Quantity", matches: [{length: 1, entityTypeScore: 0.8, text: "8", offset: 216}]}
{name: "Neo4j", wikipediaId: "Neo4j", type: "Other", matches: [{length: 5, text: "Neo4j", wikipediaScore: 0.8150388253887939, offset: 242}], bingId: "bc2f436b-8edd-6ba6-b2d3-69901348d653", wikipediaLanguage: "en", wikipediaUrl: "https://en.wikipedia.org/wiki/Neo4j"}
{name: "Europe", wikipediaId: "Europe", type: "Location", matches: [{length: 8, text: "European", wikipediaScore: 0.00591759926701263, offset: 248}], bingId: "501457aa-5b70-cfba-cfd8-be882b4bac1e", wikipediaLanguage: "en", wikipediaUrl: "https://en.wikipedia.org/wiki/Europe"}

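We get back 9 different entities, although some of them refer to the same thing with different type values. We could then apply a Cypher statement that creates one node per entity and an ENTITY relationship from the Article node to each of those nodes.

The following streams the entities for the Pokemon article and then creates a node for each entity: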

MATCH (a:Article {uri: "https://neo4j.ac.cn/blog/pokegraph-gotta-graph-em-all/"})
CALL apoc.nlp.azure.entities.stream(a, {
  key: $apiKey,
  url: $apiUrl,
  nodeProperty: "body"
})
YIELD value
UNWIND value.entities AS entity
WITH a, entity.name AS entity, collect(entity.type) AS types
MERGE (e:Entity {name: entity})
SET e.type = types
MERGE (a)-[:ENTITY]->(e);
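
Each entry in matches carries its own score (entityTypeScore or wikipediaScore in the results above), so the stream output can also be filtered before any nodes are created. The following is a sketch rather than a definitive recipe: the 0.7 threshold and the column aliases are illustrative, and using whichever of the two scores is present is an assumption based on the sample output:

```cypher
MATCH (a:Article {uri: "https://neo4j.ac.cn/blog/pokegraph-gotta-graph-em-all/"})
CALL apoc.nlp.azure.entities.stream(a, {
  key: $apiKey,
  url: $apiUrl,
  nodeProperty: "body"
})
YIELD value
UNWIND value.entities AS entity
UNWIND entity.matches AS m
// Assumption: a match has either entityTypeScore or wikipediaScore, as in the sample output.
WITH entity, coalesce(m.entityTypeScore, m.wikipediaScore) AS score
// Keep only matches at or above the illustrative threshold.
WHERE score >= 0.7
RETURN entity.name AS name, entity.type AS type, score
ORDER BY score DESC;
```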

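Alternatively, we can use graph mode to create the entity graph automatically. As well as having the Entity label, each entity node will have another label based on the value of its type property. By default, a virtual graph is returned.

The following returns a virtual graph of entities for the Pokemon and Nintendo Switch articles: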

MATCH (a:Article)
WITH collect(a) AS articles
CALL apoc.nlp.azure.entities.graph(articles, {
  key: $apiKey,
  url: $apiUrl,
  nodeProperty: "body",
  writeRelationshipType: "ENTITY"
})
YIELD graph AS g
RETURN g

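You can see a Neo4j Browser visualization of the virtual graph in Figure 2.

Figure 2. Pokemon and Nintendo Switch entities graph

In this visualization we can also see the score for each entity node. This score represents the API's level of confidence in the accuracy of the detection. We can specify a minimum cut-off value for the score using the scoreCutoff property.

The following returns a virtual graph of entities with a score >= 0.7 for the Pokemon and Nintendo Switch articles: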

MATCH (a:Article)
WITH collect(a) AS articles
CALL apoc.nlp.azure.entities.graph(articles, {
  key: $apiKey,
  url: $apiUrl,
  nodeProperty: "body",
  scoreCutoff: 0.7,
  writeRelationshipType: "ENTITY"
})
YIELD graph AS g
RETURN g

You can see a Neo4j Browser visualization of the virtual graph in Figure 3.

Figure 3. Pokemon and Nintendo Switch entities graph with confidence >= 0.7

If you are happy with this graph and would like to persist it in Neo4j, you can do so by specifying the write: true configuration.

The following creates a HAS_ENTITY relationship from the article to each entity:

MATCH (a:Article)
WITH collect(a) AS articles
CALL apoc.nlp.azure.entities.graph(articles, {
  key: $apiKey,
  url: $apiUrl,
  nodeProperty: "body",
  scoreCutoff: 0.7,
  writeRelationshipType: "HAS_ENTITY",
  writeRelationshipProperty: "azureEntityScore",
  write: true
})
YIELD graph AS g
RETURN g;

We can then write a query to return the entities that have been created.

The following returns articles and their entities:

MATCH (article:Article)
RETURN article.uri AS article,
       [(article)-[r:HAS_ENTITY]->(e:Entity) | {text: e.text, score: r.azureEntityScore}] AS entities;

Table 9. Results
article	entities
"https://neo4j.ac.cn/blog/pokegraph-gotta-graph-em-all/"	[{score: 0.72, text: "Mario Kart"}, {score: 0.7802000593632747, text: "Mario Kart 8"}, {score: 0.8, text: "8"}, {score: 0.8, text: "a week"}, {score: 0.94, text: "Nintendo Switch"}, {score: 0.8150388253887939, text: "Neo4j"}]
"https://en.wikipedia.org/wiki/Nintendo_Switch"	[{score: 0.9023679924293266, text: "Joy-Con"}, {score: 0.98, text: "Nintendo"}, {score: 0.8, text: "March 3, 2017"}, {score: 0.9355623498560008, text: "Nintendo Switch"}, {score: 0.92, text: "Mario Kart"}, {score: 0.8, text: "8"}, {score: 0.8863202650046607, text: "Mario Kart 8"}, {score: 0.8, text: "October 20, 2016"}]
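
Once the entities are persisted, they can be used to relate articles to each other. The following is a minimal sketch, assuming the HAS_ENTITY relationships created above; it groups by the text property rather than by node, so it does not depend on whether the procedure reuses one node for the same entity across articles:

```cypher
// List entity texts that are linked to more than one article.
MATCH (a:Article)-[:HAS_ENTITY]->(e:Entity)
WITH e.text AS entity, collect(DISTINCT a.uri) AS articles
WHERE size(articles) > 1
RETURN entity, articles
ORDER BY entity;
```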

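Key Phrases

Let's now extract the key phrases from the Article node. The text that we want to analyze is stored in the body property of the node, so we need to specify that via the nodeProperty config parameter.

The following streams the key phrases for the Pokemon article: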

MATCH (a:Article {uri: "https://neo4j.ac.cn/blog/pokegraph-gotta-graph-em-all/"})
CALL apoc.nlp.azure.keyPhrases.stream(a, {
  key: $apiKey,
  url: $apiUrl,
  nodeProperty: "body"
})
YIELD value
UNWIND value.keyPhrases AS keyPhrase
RETURN keyPhrase;

Table 10. Results
keyPhrase
"board games"
"card games"
"tournaments"
"role"
"organised lunch-time Mario Kart"
"Neo4j European offices"
"Nintendo Switch"
"friends"
"feet"
"days"

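Alternatively, we can use graph mode to create a key phrase graph automatically. One node with the KeyPhrase label is created for each key phrase extracted.

By default a virtual graph is returned, but the graph can be persisted by specifying the write: true configuration.

The following creates and returns the key phrase graph for the Pokemon article: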

MATCH (a:Article {uri: "https://neo4j.ac.cn/blog/pokegraph-gotta-graph-em-all/"})
CALL apoc.nlp.azure.keyPhrases.graph(a, {
  key: $apiKey,
  url: $apiUrl,
  nodeProperty: "body",
  writeRelationshipType: "KEY_PHRASE",
  write: true
})
YIELD graph AS g
RETURN g;

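You can see a Neo4j Browser visualization of the graph in Figure 4.

Figure 4. Pokemon key phrases graph

We can then write a query to return the key phrases that have been created.

The following returns articles and their key phrases: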

MATCH (a:Article {uri: "https://neo4j.ac.cn/blog/pokegraph-gotta-graph-em-all/"})
RETURN a.uri AS article,
       [(a)-[:KEY_PHRASE]->(k:KeyPhrase) | k.text] AS keyPhrases;

Table 11. Results
article	keyPhrases
"https://neo4j.ac.cn/blog/pokegraph-gotta-graph-em-all/"	["card games", "board games", "friends", "feet", "Nintendo Switch", "days", "organised lunch-time Mario Kart", "tournaments", "Neo4j European offices", "role"]
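
The persisted key phrases can also serve as a simple lookup from a term to the articles that mention it. The following is a sketch, assuming the KEY_PHRASE relationships created above; the search term "games" and the column aliases are only illustrative:

```cypher
// Find articles that have at least one key phrase containing the search term.
MATCH (a:Article)-[:KEY_PHRASE]->(k:KeyPhrase)
WHERE toLower(k.text) CONTAINS "games"
RETURN a.uri AS article, collect(k.text) AS matchingKeyPhrases;
```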

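Sentiment Analysis

Let's now extract the sentiment of the Article node. The text that we want to analyze is stored in the body property of the node, so we need to specify that via the nodeProperty config parameter.

The following streams the sentiment for the Pokemon article: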

MATCH (a:Article {uri: "https://neo4j.ac.cn/blog/pokegraph-gotta-graph-em-all/"})
CALL apoc.nlp.azure.sentiment.stream(a, {
  key: $apiKey,
  url: $apiUrl,
  nodeProperty: "body"
})
YIELD value
RETURN value;

Table 12. Results
value
{score: 0.5, id: "0"}

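Alternatively, we can use graph mode to store the sentiment and its score automatically.

By default a virtual graph is returned, but the graph can be persisted by specifying the write: true configuration. The sentiment score is stored in the sentimentScore property.

The following returns a graph with the sentiment for the Pokemon article: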

MATCH (a:Article {uri: "https://neo4j.ac.cn/blog/pokegraph-gotta-graph-em-all/"})
CALL apoc.nlp.azure.sentiment.graph(a, {
  key: $apiKey,
  url: $apiUrl,
  nodeProperty: "body",
  write: true
})
YIELD graph AS g
UNWIND g.nodes AS node
RETURN node {.uri, .sentimentScore} AS node;

Table 13. Results
node
{uri: "https://neo4j.ac.cn/blog/pokegraph-gotta-graph-em-all/", sentimentScore: 0.5}
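
Because the score runs from 0 (negative) through 0.5 (neutral) to 1 (positive), the persisted sentimentScore can be turned into a coarse label. The following is a sketch only: the 0.4 and 0.6 boundaries are arbitrary values chosen for illustration, not thresholds defined by the API:

```cypher
// Label each analyzed article using illustrative score boundaries.
MATCH (a:Article)
WHERE a.sentimentScore IS NOT NULL
RETURN a.uri AS article,
       a.sentimentScore AS score,
       CASE
         WHEN a.sentimentScore >= 0.6 THEN "positive"
         WHEN a.sentimentScore <= 0.4 THEN "negative"
         ELSE "neutral"
       END AS sentiment
ORDER BY score DESC;
```

With these boundaries, the score of 0.5 shown above falls into the "neutral" bucket.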