Google Cloud Platform (GCP)
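The Natural Language API from Google Cloud Platform lets users derive insights from unstructured text using Google machine learning. The procedures in this chapter act as wrappers around calls to this API to extract entities, categories, or sentiment from text stored as node properties.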
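Each procedure has two modes:
- Stream: returns a map constructed from the JSON returned by the API
- Graph: creates a graph or virtual graph based on the values returned by the API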
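The procedures described in this chapter make API calls and subsequent updates to the database on the calling thread. To make parallel requests to the API, and to avoid out-of-memory errors caused by holding too much transaction state in memory while running procedures that write to the database, see Batching requests.
The GCP Natural Language API currently supports text input in more than 10 languages. For better results, make sure your text is in one of the languages supported by the Natural Language API. Text in an unsupported language may produce an "HTTP response code: 400" error.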
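Procedure Overview
The procedures are described below:
Qualified Name | Type | Version |
---|---|---|
apoc.nlp.gcp.entities.stream | Procedure | |
apoc.nlp.gcp.entities.graph | Procedure | |
apoc.nlp.gcp.classify.stream | Procedure | |
apoc.nlp.gcp.classify.graph | Procedure | |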
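Entity extraction
The entity extraction procedures (apoc.nlp.gcp.entities.*) are wrappers around the documents.analyzeEntities method of the Google Natural Language API. This API method finds named entities (currently proper names and common nouns) in the text, along with entity types, salience, mentions for each entity, and other properties.
The procedures are described below:
Signature |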
---|
apoc.nlp.gcp.entities.stream(source :: ANY?, config = {} :: MAP?) :: (node :: NODE?, value :: MAP?, error :: MAP?) |
apoc.nlp.gcp.entities.graph(source :: ANY?, config = {} :: MAP?) :: (graph :: MAP?) |
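These procedures support the following config parameters:
Name | Type | Default | Description |
---|---|---|---|
key | String | null | API key for the Google Natural Language API |
nodeProperty | String | text | The property on the provided node that contains the unstructured text to be analyzed |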
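In addition, apoc.nlp.gcp.entities.graph supports the following config parameters:
Name | Type | Default | Description |
---|---|---|---|
scoreCutoff | Double | 0.0 | Lower limit for the salience score of an entity to be present in the graph. The value must be between 0 and 1. Salience is an indicator of the importance or centrality of that entity to the entire document text. Scores closer to 0 are less salient, while scores closer to 1.0 are highly salient. |
write | Boolean | false | Persist the graph of entities |
writeRelationshipType | String | ENTITY | Relationship type for relationships from the source node to entity nodes |
writeRelationshipProperty | String | score | Property on relationships from the source node to entity nodes |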
CALL apoc.nlp.gcp.entities.stream(source:Node or List<Node>, {
key: String,
nodeProperty: String
})
YIELD value
CALL apoc.nlp.gcp.entities.graph(source:Node or List<Node>, {
key: String,
nodeProperty: String,
scoreCutoff: Double,
writeRelationshipType: String,
writeRelationshipProperty: String,
write: Boolean
})
YIELD graph
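According to the signature above, the stream procedure returns a node and an error column in addition to value. The following is a minimal sketch that returns all three, so that failed requests can be spotted next to successful ones. It assumes an $apiKey parameter and Article nodes that store their text in a body property, as in the examples later in this chapter. The contents of the error map are not described here, so the column is returned as-is.

```cypher
// Sketch: inspect every column returned by the stream mode.
// Assumes $apiKey is set and Article nodes store their text in `body`.
MATCH (a:Article)
WITH collect(a) AS articles
CALL apoc.nlp.gcp.entities.stream(articles, {
  key: $apiKey,
  nodeProperty: "body"
})
YIELD node, value, error
RETURN node.uri AS article,
       size(value.entities) AS entityCount,  // null when no entities were returned
       error;
```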
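Classification
The classification procedures (apoc.nlp.gcp.classify.*) are wrappers around the documents.classifyText method of the Google Natural Language API. This API method classifies a document into categories.
The procedures are described below:
Signature |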
---|
apoc.nlp.gcp.classify.stream(source :: ANY?, config = {} :: MAP?) :: (node :: NODE?, value :: MAP?, error :: MAP?) |
apoc.nlp.gcp.classify.graph(source :: ANY?, config = {} :: MAP?) :: (graph :: MAP?) |
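These procedures support the following config parameters:
Name | Type | Default | Description |
---|---|---|---|
key | String | null | API key for the Google Natural Language API |
nodeProperty | String | text | The property on the provided node that contains the unstructured text to be analyzed |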
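In addition, apoc.nlp.gcp.classify.graph supports the following config parameters:
Name | Type | Default | Description |
---|---|---|---|
scoreCutoff | Double | 0.0 | Lower limit for the confidence score of a category to be present in the graph. The value must be between 0 and 1. Confidence is a number representing how certain the classifier is that this category represents the given text. |
write | Boolean | false | Persist the graph of categories |
writeRelationshipType | String | CATEGORY | Relationship type for relationships from the source node to category nodes |
writeRelationshipProperty | String | score | Property on relationships from the source node to category nodes |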
CALL apoc.nlp.gcp.classify.stream(source:Node or List<Node>, {
key: String,
nodeProperty: String
})
YIELD value
CALL apoc.nlp.gcp.classify.graph(source:Node or List<Node>, {
key: String,
nodeProperty: String,
scoreCutoff: Double,
writeRelationshipType: String,
writeRelationshipProperty: String,
write: Boolean
})
YIELD graph
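Install dependencies
The NLP procedures depend on Kotlin and client libraries that are not included in the APOC Extended library.
These dependencies are included in apoc-nlp-dependencies-5.21.0-all.jar, which can be downloaded from the releases page. Once the file is downloaded, place it in the plugins directory and restart the Neo4j server.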
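Setting up the API key
We can generate an API key that has access to the Cloud Natural Language API by going to console.cloud.google.com/apis/credentials. Once we have created a key, we can populate and execute the following command to create a parameter that contains these details.
The following defines the apiKey parameter
:param apiKey => ("<api-key-here>")
Alternatively, we can add these credentials to apoc.conf and load them using the static value storage functions.
apoc.static.gcp.apiKey=<api-key-here>
The following retrieves the GCP credentials from apoc.conf
RETURN apoc.static.getAll("gcp") AS gcp;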
gcp |
---|
{apiKey: "<api-key-here>"} |
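Building on the query above, the map returned by apoc.static.getAll can be passed straight into a procedure call instead of a query parameter. This is a minimal sketch, assuming that apoc.static.gcp.apiKey has been set in apoc.conf as shown and that Article nodes with a body property exist (they are created in the examples later in this chapter).

```cypher
// Sketch: read the API key from apoc.conf rather than from $apiKey.
WITH apoc.static.getAll("gcp") AS gcp
MATCH (a:Article)
WITH gcp, collect(a) AS articles
CALL apoc.nlp.gcp.entities.stream(articles, {
  key: gcp.apiKey,          // key name matches the map shown in the result above
  nodeProperty: "body"
})
YIELD node, value
RETURN node.uri AS article, size(value.entities) AS entityCount;
```

Keeping the key in apoc.conf means it does not have to be set again with :param in each new session.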
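Batching requests
Requests to the GCP API and the processing of results can be batched using Periodic Iterate. This approach is useful if we want to make parallel requests to the GCP API and reduce the amount of transaction state held in memory while running procedures that write to the database.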
CALL apoc.periodic.iterate("
MATCH (n)
WITH collect(n) as total
CALL apoc.coll.partition(total, 25)
YIELD value as nodes
RETURN nodes", "
CALL apoc.nlp.gcp.entities.graph(nodes, {
key: $apiKey,
nodeProperty: 'body',
writeRelationshipType: 'GCP_ENTITY',
write:true
})
YIELD graph
RETURN distinct 'done'", {
batchSize: 1,
params: { apiKey: $apiKey }
}
);
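To see what the first statement of the batching query hands to the second, apoc.coll.partition can be run on its own. The sketch below uses a list of 60 integers as a stand-in for the collected nodes (an assumption made purely for illustration). With a partition size of 25, it should return three rows with batch sizes of 25, 25, and 10.

```cypher
// Sketch: how apoc.coll.partition splits a collection into batches.
// range(1, 60) stands in for the list of collected nodes.
WITH range(1, 60) AS total
CALL apoc.coll.partition(total, 25)
YIELD value AS batch
RETURN size(batch) AS batchSize;
```

Because the batching query sets batchSize: 1, each of these lists is processed in its own transaction, so a single call to the GCP procedure handles at most 25 nodes.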
Examples
The examples in this section are based on the following sample graph:
CREATE (:Article {
uri: "https://neo4j.ac.cn/blog/pokegraph-gotta-graph-em-all/",
body: "These days I’m rarely more than a few feet away from my Nintendo Switch and I play board games, card games and role playing games with friends at least once or twice a week. I’ve even organised lunch-time Mario Kart 8 tournaments between the Neo4j European offices!"
});
CREATE (:Article {
uri: "https://en.wikipedia.org/wiki/Nintendo_Switch",
body: "The Nintendo Switch is a video game console developed by Nintendo, released worldwide in most regions on March 3, 2017. It is a hybrid console that can be used as a home console and portable device. The Nintendo Switch was unveiled on October 20, 2016. Nintendo offers a Joy-Con Wheel, a small steering wheel-like unit that a Joy-Con can slot into, allowing it to be used for racing games such as Mario Kart 8."
});
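Entity extraction
Let's start by extracting the entities from the Article node. The text that we want to analyze is stored in the body property of the node, so we need to specify that via the nodeProperty configuration parameter.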
MATCH (a:Article {uri: "https://neo4j.ac.cn/blog/pokegraph-gotta-graph-em-all/"})
CALL apoc.nlp.gcp.entities.stream(a, {
key: $apiKey,
nodeProperty: "body"
})
YIELD value
UNWIND value.entities AS entity
RETURN entity;
entity |
---|
{name: "card games", salience: 0.17967656, metadata: {}, type: "CONSUMER_GOOD", mentions: [{type: "COMMON", text: {content: "card games", beginOffset: -1}}]} |
{name: "role playing games", salience: 0.16441391, metadata: {}, type: "OTHER", mentions: [{type: "COMMON", text: {content: "role playing games", beginOffset: -1}}]} |
{name: "Switch", salience: 0.143287, metadata: {}, type: "OTHER", mentions: [{type: "COMMON", text: {content: "Switch", beginOffset: -1}}]} |
{name: "friends", salience: 0.13336793, metadata: {}, type: "PERSON", mentions: [{type: "COMMON", text: {content: "friends", beginOffset: -1}}]} |
{name: "Nintendo", salience: 0.12601112, metadata: {mid: "/g/1ymzszlpz"}, type: "ORGANIZATION", mentions: [{type: "PROPER", text: {content: "Nintendo", beginOffset: -1}}]} |
{name: "board games", salience: 0.08861496, metadata: {}, type: "CONSUMER_GOOD", mentions: [{type: "COMMON", text: {content: "board games", beginOffset: -1}}]} |
{name: "tournaments", salience: 0.0603245, metadata: {}, type: "EVENT", mentions: [{type: "COMMON", text: {content: "tournaments", beginOffset: -1}}]} |
{name: "offices", salience: 0.034420907, metadata: {}, type: "LOCATION", mentions: [{type: "COMMON", text: {content: "offices", beginOffset: -1}}]} |
{name: "Mario Kart 8", salience: 0.029095741, metadata: {wikipedia_url: "https://en.wikipedia.org/wiki/Mario_Kart_8", mid: "/m/0119mf7q"}, type: "PERSON", mentions: [{type: "PROPER", text: {content: "Mario Kart 8", beginOffset: -1}}]} |
{name: "European", salience: 0.020393685, metadata: {mid: "/m/02j9z", wikipedia_url: "https://en.wikipedia.org/wiki/Europe"}, type: "LOCATION", mentions: [{type: "PROPER", text: {content: "European", beginOffset: -1}}]} |
{name: "Neo4j", salience: 0.020393685, metadata: {mid: "/m/0b76t3s", wikipedia_url: "https://en.wikipedia.org/wiki/Neo4j"}, type: "ORGANIZATION", mentions: [{type: "PROPER", text: {content: "Neo4j", beginOffset: -1}}]} |
{name: "8", salience: 0, metadata: {value: "8"}, type: "NUMBER", mentions: [{type: "TYPE_UNKNOWN", text: {content: "8", beginOffset: -1}}]} |
We get back 12 different entities. We can then apply a Cypher statement that creates one node per entity and an ENTITY relationship from the Article node to each of those nodes.
MATCH (a:Article {uri: "https://neo4j.ac.cn/blog/pokegraph-gotta-graph-em-all/"})
CALL apoc.nlp.gcp.entities.stream(a, {
key: $apiKey,
nodeProperty: "body"
})
YIELD value
UNWIND value.entities AS entity
MERGE (e:Entity {name: entity.name})
SET e.type = entity.type
MERGE (a)-[:ENTITY]->(e)
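Alternatively, we can use the graph mode to automatically create the entity graph. As well as having the Entity label, each entity node will have another label based on the value of the type property. By default, a virtual graph is returned.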
MATCH (a:Article {uri: "https://neo4j.ac.cn/blog/pokegraph-gotta-graph-em-all/"})
CALL apoc.nlp.gcp.entities.graph(a, {
key: $apiKey,
nodeProperty: "body",
writeRelationshipType: "ENTITY"
})
YIELD graph AS g
RETURN g;
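We can see a Neo4j Browser visualization of the virtual graph in Pokemon entities graph.
We can compute the entities for multiple nodes by passing a list of nodes to the procedure.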
MATCH (a:Article)
WITH collect(a) AS articles
CALL apoc.nlp.gcp.entities.graph(articles, {
key: $apiKey,
nodeProperty: "body",
writeRelationshipType: "ENTITY"
})
YIELD graph AS g
RETURN g;
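We can see a Neo4j Browser visualization of the virtual graph in Pokemon and Nintendo Switch entities graph.
On this visualization we can also see the score for each entity node. This score represents the importance of that entity in the entire document. We can specify a minimum cutoff value for the score using the scoreCutoff configuration parameter.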
MATCH (a:Article)
WITH collect(a) AS articles
CALL apoc.nlp.gcp.entities.graph(articles, {
key: $apiKey,
nodeProperty: "body",
writeRelationshipType: "ENTITY",
scoreCutoff: 0.01
})
YIELD graph AS g
RETURN g;
We can see a Neo4j Browser visualization of the virtual graph in Pokemon and Nintendo Switch entities graph with importance >= 0.01.
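The effect of scoreCutoff can be approximated in plain Cypher. The sketch below takes three of the salience values returned earlier for the first article and keeps those at or above 0.01, matching the >= 0.01 condition named for the visualization. It illustrates the filter only and is not the procedure's actual implementation.

```cypher
// Sketch: filter entities by salience, using values from the stream output above.
WITH [
  {name: "card games", salience: 0.17967656},
  {name: "Neo4j", salience: 0.020393685},
  {name: "8", salience: 0.0}
] AS entities
RETURN [e IN entities WHERE e.salience >= 0.01 | e.name] AS kept;
// kept: ["card games", "Neo4j"]
```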
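If we are happy with this graph and would like to persist it in Neo4j, we can do this by specifying the write: true configuration.
The following creates HAS_ENTITY relationships
MATCH (a:Article)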
WITH collect(a) AS articles
CALL apoc.nlp.gcp.entities.graph(articles, {
key: $apiKey,
nodeProperty: "body",
scoreCutoff: 0.01,
writeRelationshipType: "HAS_ENTITY",
writeRelationshipProperty: "gcpEntityScore",
write: true
})
YIELD graph AS g
RETURN g;
We can then write a query to return the entities that have been created.
MATCH (article:Article)
RETURN article.uri AS article,
[(article)-[r:HAS_ENTITY]->(e) | {entity: e.text, score: r.gcpEntityScore}] AS entities;
article | entities |
---|---|
"https://neo4j.ac.cn/blog/pokegraph-gotta-graph-em-all/" | [{score: 0.020393685, entity: "Neo4j"}, {score: 0.034420907, entity: "offices"}, {score: 0.0603245, entity: "tournaments"}, {score: 0.020393685, entity: "European"}, {score: 0.029095741, entity: "Mario Kart 8"}, {score: 0.12601112, entity: "Nintendo"}, {score: 0.13336793, entity: "friends"}, {score: 0.08861496, entity: "board games"}, {score: 0.143287, entity: "Switch"}, {score: 0.16441391, entity: "role playing games"}, {score: 0.17967656, entity: "card games"}] |
"https://en.wikipedia.org/wiki/Nintendo_Switch" | [{score: 0.76108575, entity: "Nintendo Switch"}, {score: 0.07424594, entity: "Nintendo"}, {score: 0.015900765, entity: "home console"}, {score: 0.012772448, entity: "device"}, {score: 0.038113687, entity: "regions"}, {score: 0.07299799, entity: "Joy-Con Wheel"}] |
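Once the graph is persisted, ordinary Cypher can query it. As a sketch, the following looks for entities linked to more than one article. It assumes the HAS_ENTITY relationship type and the text property used in the query above, and that the procedure reuses a single entity node per distinct text, which is not confirmed here.

```cypher
// Sketch: entities that are attached to more than one article.
MATCH (e)<-[:HAS_ENTITY]-(a:Article)
WITH e, collect(a.uri) AS articles
WHERE size(articles) > 1
RETURN e.text AS entity, articles;
```

In the results above, "Nintendo" is listed for both articles, so it is the kind of entity this query is meant to surface.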
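Classification
Now let's extract categories from the Article node. The text that we want to analyze is stored in the body property of the node, so we need to specify that via the nodeProperty configuration parameter.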
MATCH (a:Article {uri: "https://neo4j.ac.cn/blog/pokegraph-gotta-graph-em-all/"})
CALL apoc.nlp.gcp.classify.stream(a, {
key: $apiKey,
nodeProperty: "body"
})
YIELD value
UNWIND value.categories AS category
RETURN category;
category |
---|
{name: "/Games", confidence: 0.91} |
We get back only one category. We can then apply a Cypher statement that creates one node per category and a CATEGORY relationship from the Article node to each of those nodes.
MATCH (a:Article {uri: "https://neo4j.ac.cn/blog/pokegraph-gotta-graph-em-all/"})
CALL apoc.nlp.gcp.classify.stream(a, {
key: $apiKey,
nodeProperty: "body"
})
YIELD value
UNWIND value.categories AS category
MERGE (c:Category {name: category.name})
MERGE (a)-[:CATEGORY]->(c)
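Alternatively, we can use the graph mode to automatically create the category graph. As well as having the Category label, each category node will have another label based on the value of the type property. By default, a virtual graph is returned.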
MATCH (a:Article {uri: "https://neo4j.ac.cn/blog/pokegraph-gotta-graph-em-all/"})
CALL apoc.nlp.gcp.classify.graph(a, {
key: $apiKey,
nodeProperty: "body",
writeRelationshipType: "CATEGORY"
})
YIELD graph AS g
RETURN g;
We can see a Neo4j Browser visualization of the virtual graph in Pokemon categories graph.
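The configuration table for apoc.nlp.gcp.classify.graph also lists scoreCutoff, which the classification examples in this section do not use. The following is a sketch only, assuming the same $apiKey parameter and sample articles; the 0.95 threshold is an arbitrary value chosen for illustration.

```cypher
// Sketch: keep only high-confidence categories in the virtual graph.
MATCH (a:Article)
WITH collect(a) AS articles
CALL apoc.nlp.gcp.classify.graph(articles, {
  key: $apiKey,
  nodeProperty: "body",
  writeRelationshipType: "CATEGORY",
  scoreCutoff: 0.95  // lower limit for the confidence score (illustrative value)
})
YIELD graph AS g
RETURN g;
```

With the confidence values reported in this section (0.91 for /Games on the first article, 0.99 for both Nintendo Switch categories), only the latter two would be expected to remain.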
The following creates HAS_CATEGORY relationships
MATCH (a:Article)
WITH collect(a) AS articles
CALL apoc.nlp.gcp.classify.graph(articles, {
key: $apiKey,
nodeProperty: "body",
writeRelationshipType: "HAS_CATEGORY",
writeRelationshipProperty: "gcpCategoryScore",
write: true
})
YIELD graph AS g
RETURN g;
We can then write a query to return the categories that have been created.
MATCH (article:Article)
RETURN article.uri AS article,
[(article)-[r:HAS_CATEGORY]->(c) | {category: c.text, score: r.gcpCategoryScore}] AS categories;
article | categories |
---|---|
"https://neo4j.ac.cn/blog/pokegraph-gotta-graph-em-all/" | [{category: "/Games", score: 0.91}] |
"https://en.wikipedia.org/wiki/Nintendo_Switch" | [{category: "/Computers & Electronics/Consumer Electronics/Game Systems & Consoles", score: 0.99}, {category: "/Games/Computer & Video Games", score: 0.99}] |