Tutorial: Build a Knowledge Graph using NLP and Ontologies
Introduction
In this tutorial, we're going to build a software knowledge graph based on:
- Articles taken from dev.to, a developer blogging platform, and the entities extracted from those articles using NLP techniques.
- A software ontology extracted from Wikidata, a free and open knowledge base that acts as central storage for the structured data of Wikipedia.
Once we've done that, we'll learn how to query the knowledge graph to find the interesting insights that combining NLP and ontologies enables.
The queries and data used in this guide can be found in the neo4j-examples/nlp-knowledge-graph GitHub repository.
Video
Jesús Barrasa and Mark Needham presented a talk based on this tutorial at the Neo4j Connections: Knowledge Graphs event on 25 August 2020. The video of the talk is shown below:
Tools
In this tutorial we're going to use a couple of plugin libraries, which need to be installed if you want to follow along with the examples.
neosemantics (n10s)
neosemantics is a plugin that enables the use of RDF and its associated vocabularies, such as OWL, RDFS, and SKOS, in Neo4j. We're going to use this tool to import ontologies into Neo4j.
neosemantics only supports the Neo4j 4.0.x and 3.5.x series. It does not yet support the Neo4j 4.1.x series.
We can install neosemantics by following the instructions in the project's installation guide.
Both of the tools used in this tutorial, neosemantics and the APOC standard library (whose scraping and NLP procedures we use later on), can be installed in a Docker environment: the project repository contains a docker-compose.yml file that shows how to do so.
What is a Knowledge Graph?
There are many different definitions of knowledge graphs. In this tutorial, a knowledge graph is a graph that contains the following:
- Facts: instance data. This includes graph data imported from any data source, and it can be structured (e.g. JSON/XML) or semi-structured (e.g. HTML).
- Explicit knowledge: an explicit description of how instance data relates. This comes from ontologies, taxonomies, or any kind of metadata definition.
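To make the distinction concrete, here is a minimal Cypher sketch using the labels and relationship types that we'll build up over the rest of this tutorial (the uri values are placeholders):

// Facts: instance data, e.g. an article and an entity extracted from it
CREATE (a:Article {uri: "https://dev.to/example-article"})
CREATE (w:WikipediaPage {uri: "https://en.wikipedia.org/wiki/Neo4j"})
CREATE (a)-[:HAS_ENTITY]->(w)

// Explicit knowledge: a taxonomy describing how the entities relate
CREATE (c:Category {name: "graph database"})
CREATE (p:Category {name: "NoSQL database management system"})
CREATE (c)-[:SUB_CAT_OF]->(p)
CREATE (w)-[:ABOUT]->(c)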
Importing Wikidata Ontologies
Wikidata is a free and open knowledge base that can be read and edited by both humans and machines. It acts as central storage for the structured data of its Wikimedia sister projects, including Wikipedia, Wikivoyage, Wiktionary, Wikisource, and others.
Wikidata SPARQL API
Wikidata provides a SPARQL API that lets users query the data directly. The screenshot below shows an example SPARQL query along with the results of running it:

This query starts from the entity Q2429814 (software system) and then transitively finds that entity's children as far as it can. Running the query returns a stream of triples (subject, predicate, object).
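The URL-encoded queries that we pass to n10s later in this tutorial are hard to read, so here is the URL-decoded form of the software systems query (the other two imports differ only in the root entity):

prefix neo: <neo4j://voc#>
CONSTRUCT {
  ?item a neo:Category ; neo:subCatOf ?parentItem .
  ?item neo:name ?label .
  ?parentItem a neo:Category ; neo:name ?parentLabel .
  ?article a neo:WikipediaPage ; neo:about ?item .
}
WHERE {
  ?item (wdt:P31|wdt:P279)* wd:Q2429814 .
  ?item wdt:P31|wdt:P279 ?parentItem .
  ?item rdfs:label ?label .
  filter(lang(?label) = "en")
  ?parentItem rdfs:label ?parentLabel .
  filter(lang(?parentLabel) = "en")
  OPTIONAL {
    ?article schema:about ?item ;
             schema:inLanguage "en" ;
             schema:isPartOf <https://en.wikipedia.org/> .
  }
}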
Now we're going to learn how to import Wikidata into Neo4j using neosemantics.
// n10s requires a uniqueness constraint on the uri property of Resource nodes
CREATE CONSTRAINT n10s_unique_uri ON (r:Resource) ASSERT r.uri IS UNIQUE;

// initialise the library, mapping vocabulary terms onto graph elements
CALL n10s.graphconfig.init({handleVocabUris: "MAP"});

// register a prefix for our custom vocabulary and map its terms to relationship types
CALL n10s.nsprefixes.add('neo','neo4j://voc#');
CALL n10s.mapping.add("neo4j://voc#subCatOf","SUB_CAT_OF");
CALL n10s.mapping.add("neo4j://voc#about","ABOUT");
Now we're going to import the Wikidata taxonomies. We can get an importable URL directly from the Wikidata SPARQL API by clicking on the Code button:

We then pass that URL to the n10s.rdf.import.fetch procedure, which imports the stream of triples into Neo4j.
The examples below contain queries that import the taxonomies starting from software systems, programming languages, and data formats.
WITH "https://query.wikidata.org/sparql?query=prefix%20neo%3A%20%3Cneo4j%3A%2F%2Fvoc%23%3E%20%0A%23Cats%0A%23SELECT%20%3Fitem%20%3Flabel%20%0ACONSTRUCT%20%7B%0A%3Fitem%20a%20neo%3ACategory%20%3B%20neo%3AsubCatOf%20%3FparentItem%20.%20%20%0A%20%20%3Fitem%20neo%3Aname%20%3Flabel%20.%0A%20%20%3FparentItem%20a%20neo%3ACategory%3B%20neo%3Aname%20%3FparentLabel%20.%0A%20%20%3Farticle%20a%20neo%3AWikipediaPage%3B%20neo%3Aabout%20%3Fitem%20%3B%0A%20%20%20%20%20%20%20%20%20%20%20%0A%7D%0AWHERE%20%0A%7B%0A%20%20%3Fitem%20(wdt%3AP31%7Cwdt%3AP279)*%20wd%3AQ2429814%20.%0A%20%20%3Fitem%20wdt%3AP31%7Cwdt%3AP279%20%3FparentItem%20.%0A%20%20%3Fitem%20rdfs%3Alabel%20%3Flabel%20.%0A%20%20filter(lang(%3Flabel)%20%3D%20%22en%22)%0A%20%20%3FparentItem%20rdfs%3Alabel%20%3FparentLabel%20.%0A%20%20filter(lang(%3FparentLabel)%20%3D%20%22en%22)%0A%20%20%0A%20%20OPTIONAL%20%7B%0A%20%20%20%20%20%20%3Farticle%20schema%3Aabout%20%3Fitem%20%3B%0A%20%20%20%20%20%20%20%20%20%20%20%20schema%3AinLanguage%20%22en%22%20%3B%0A%20%20%20%20%20%20%20%20%20%20%20%20schema%3AisPartOf%20%3Chttps%3A%2F%2Fen.wikipedia.org%2F%3E%20.%0A%20%20%20%20%7D%0A%20%20%0A%7D" AS softwareSystemsUri
CALL n10s.rdf.import.fetch(softwareSystemsUri, 'Turtle' , { headerParams: { Accept: "application/x-turtle" } })
YIELD terminationStatus, triplesLoaded, triplesParsed, namespaces, callParams
RETURN terminationStatus, triplesLoaded, triplesParsed, namespaces, callParams;
terminationStatus | triplesLoaded | triplesParsed | namespaces | callParams |
---|---|---|---|---|
"OK" | 1630 | 1630 | NULL | {headerParams: {Accept: "application/x-turtle"}} |
WITH "https://query.wikidata.org/sparql?query=prefix%20neo%3A%20%3Cneo4j%3A%2F%2Fvoc%23%3E%20%0A%23Cats%0A%23SELECT%20%3Fitem%20%3Flabel%20%0ACONSTRUCT%20%7B%0A%3Fitem%20a%20neo%3ACategory%20%3B%20neo%3AsubCatOf%20%3FparentItem%20.%20%20%0A%20%20%3Fitem%20neo%3Aname%20%3Flabel%20.%0A%20%20%3FparentItem%20a%20neo%3ACategory%3B%20neo%3Aname%20%3FparentLabel%20.%0A%20%20%3Farticle%20a%20neo%3AWikipediaPage%3B%20neo%3Aabout%20%3Fitem%20%3B%0A%20%20%20%20%20%20%20%20%20%20%20%0A%7D%0AWHERE%20%0A%7B%0A%20%20%3Fitem%20(wdt%3AP31%7Cwdt%3AP279)*%20wd%3AQ9143%20.%0A%20%20%3Fitem%20wdt%3AP31%7Cwdt%3AP279%20%3FparentItem%20.%0A%20%20%3Fitem%20rdfs%3Alabel%20%3Flabel%20.%0A%20%20filter(lang(%3Flabel)%20%3D%20%22en%22)%0A%20%20%3FparentItem%20rdfs%3Alabel%20%3FparentLabel%20.%0A%20%20filter(lang(%3FparentLabel)%20%3D%20%22en%22)%0A%20%20%0A%20%20OPTIONAL%20%7B%0A%20%20%20%20%20%20%3Farticle%20schema%3Aabout%20%3Fitem%20%3B%0A%20%20%20%20%20%20%20%20%20%20%20%20schema%3AinLanguage%20%22en%22%20%3B%0A%20%20%20%20%20%20%20%20%20%20%20%20schema%3AisPartOf%20%3Chttps%3A%2F%2Fen.wikipedia.org%2F%3E%20.%0A%20%20%20%20%7D%0A%20%20%0A%7D" AS programmingLanguagesUri
CALL n10s.rdf.import.fetch(programmingLanguagesUri, 'Turtle' , { headerParams: { Accept: "application/x-turtle" } })
YIELD terminationStatus, triplesLoaded, triplesParsed, namespaces, callParams
RETURN terminationStatus, triplesLoaded, triplesParsed, namespaces, callParams;
terminationStatus | triplesLoaded | triplesParsed | namespaces | callParams |
---|---|---|---|---|
"OK" | 9376 | 9376 | NULL | {headerParams: {Accept: "application/x-turtle"}} |
WITH "https://query.wikidata.org/sparql?query=prefix%20neo%3A%20%3Cneo4j%3A%2F%2Fvoc%23%3E%20%0A%23Cats%0A%23SELECT%20%3Fitem%20%3Flabel%20%0ACONSTRUCT%20%7B%0A%3Fitem%20a%20neo%3ACategory%20%3B%20neo%3AsubCatOf%20%3FparentItem%20.%20%20%0A%20%20%3Fitem%20neo%3Aname%20%3Flabel%20.%0A%20%20%3FparentItem%20a%20neo%3ACategory%3B%20neo%3Aname%20%3FparentLabel%20.%0A%20%20%3Farticle%20a%20neo%3AWikipediaPage%3B%20neo%3Aabout%20%3Fitem%20%3B%0A%20%20%20%20%20%20%20%20%20%20%20%0A%7D%0AWHERE%20%0A%7B%0A%20%20%3Fitem%20(wdt%3AP31%7Cwdt%3AP279)*%20wd%3AQ24451526%20.%0A%20%20%3Fitem%20wdt%3AP31%7Cwdt%3AP279%20%3FparentItem%20.%0A%20%20%3Fitem%20rdfs%3Alabel%20%3Flabel%20.%0A%20%20filter(lang(%3Flabel)%20%3D%20%22en%22)%0A%20%20%3FparentItem%20rdfs%3Alabel%20%3FparentLabel%20.%0A%20%20filter(lang(%3FparentLabel)%20%3D%20%22en%22)%0A%20%20%0A%20%20OPTIONAL%20%7B%0A%20%20%20%20%20%20%3Farticle%20schema%3Aabout%20%3Fitem%20%3B%0A%20%20%20%20%20%20%20%20%20%20%20%20schema%3AinLanguage%20%22en%22%20%3B%0A%20%20%20%20%20%20%20%20%20%20%20%20schema%3AisPartOf%20%3Chttps%3A%2F%2Fen.wikipedia.org%2F%3E%20.%0A%20%20%20%20%7D%0A%20%20%0A%7D" AS dataFormatsUri
CALL n10s.rdf.import.fetch(dataFormatsUri, 'Turtle' , { headerParams: { Accept: "application/x-turtle" } })
YIELD terminationStatus, triplesLoaded, triplesParsed, namespaces, callParams
RETURN terminationStatus, triplesLoaded, triplesParsed, namespaces, callParams;
terminationStatus | triplesLoaded | triplesParsed | namespaces | callParams |
---|---|---|---|---|
"OK" | 514 | 514 | NULL | {headerParams: {Accept: "application/x-turtle"}} |
Exploring the taxonomies
Let's see what we've imported. We can get an overview of the contents of our database by running the following query:
CALL apoc.meta.stats()
YIELD labels, relTypes, relTypesCount
RETURN labels, relTypes, relTypesCount;
labels | relTypes | relTypesCount |
---|---|---|
{Category: 2308, _NsPrefDef: 1, _MapNs: 1, Resource: 3868, _MapDef: 2, WikipediaPage: 1560, _GraphConfig: 1} | {…} | {SUB_CAT_OF: 7272, _IN: 2, ABOUT: 3120} |
Any label or relationship type prefixed with _ can be ignored, as these represent metadata created by the n10s library.
We can see that we've imported over 2,000 Category nodes and more than 1,500 WikipediaPage nodes. Every node created by n10s also gets a Resource label, which is why we have almost 4,000 nodes with that label.
We also have more than 7,000 SUB_CAT_OF relationships connecting the Category nodes, and over 3,000 ABOUT relationships connecting WikipediaPage nodes to Category nodes.
Now let's look at some of the data we've actually imported. By running the following query, we can view the sub categories of the version control system category:
MATCH path = (c:Category {name: "version control system"})<-[:SUB_CAT_OF*]-(child)
RETURN path
LIMIT 25;
So far, so good!
Importing dev.to Articles
dev.to is a developer blogging platform that contains articles on a variety of topics, including NoSQL databases, JavaScript frameworks, the latest AWS APIs, chatbots, and more. A screenshot of the home page is shown below:
We're going to import some articles from dev.to into Neo4j. articles.csv contains a list of 30 articles of interest. We can query this file using Cypher's LOAD CSV clause:
LOAD CSV WITH HEADERS FROM 'https://github.com/neo4j-examples/nlp-knowledge-graph/raw/master/import/articles.csv' AS row
RETURN row
LIMIT 10;
row |
---|
{uri: "https://dev.to/lirantal/securing-a-nodejs—rethinkdb—tls-setup-on-docker-containers"} |
{uri: "https://dev.to/setevoy/neo4j-running-in-kubernetes-e4p"} |
{uri: "https://dev.to/divyanshutomar/introduction-to-redis-3m2a"} |
{uri: "https://dev.to/zaiste/15-git-commands-you-may-not-know-4a8j"} |
{uri: "https://dev.to/alexjitbit/removing-files-from-mercurial-history-1b15"} |
{uri: "https://dev.to/michelemauro/atlassian-sunsetting-mercurial-support-in-bitbucket-2ga9"} |
{uri: "https://dev.to/shirou/back-up-prometheus-records-to-s3-via-kinesis-firehose-54l4"} |
{uri: "https://dev.to/ionic/farewell-phonegap-reflections-on-my-hybrid-app-development-journey-10dh"} |
{uri: "https://dev.to/rootsami/rancher-kubernetes-on-openstack-using-terraform-1ild"} |
{uri: "https://dev.to/jignesh_simform/comparing-mongodb—mysql-bfa"} |
We're going to use APOC's apoc.load.html procedure to scrape the interesting information from each of these URIs. Let's first see how to use the procedure on a single article, as shown in the following query:
(1)
MERGE (a:Article {uri: "https://dev.to/lirantal/securing-a-nodejs--rethinkdb--tls-setup-on-docker-containers"})
WITH a
(2)
CALL apoc.load.html(a.uri, {
body: 'body div.spec__body p',
title: 'h1',
time: 'time'
})
YIELD value
UNWIND value.body AS item
(3)
WITH a,
apoc.text.join(collect(item.text), '') AS body,
value.title[0].text AS title,
value.time[0].attributes.datetime AS date
(4)
SET a.body = body , a.title = title, a.datetime = datetime(date)
RETURN a;
1. Create a node with an Article label and uri property if one doesn't already exist
2. Scrape data from the URI using the provided CSS selectors
3. Post-process the values scraped from the URI
4. Update the node with body, title, and datetime properties
a |
---|
(:Article {processed: TRUE, datetime: 2017-08-21T18:41:06Z, title: "Securing a Node.js + RethinkDB + TLS setup on Docker containers", body: "We use RethinkDB at work across different projects. It isn’t used for any sort of big-data applications, but rather as a NoSQL database, which spices things up with real-time updates, and relational tables support.RethinkDB features an officially supported Node.js driver, as well as a community-maintained driver as well called rethinkdbdash which is promises-based, and provides connection pooling. There is also a database migration tool called rethinkdb-migrate that aids in managing database changes such as schema changes, database seeding, tear up and tear down capabilities.We’re going to use the official RethinkDB docker image from the docker hub and make use of docker-compose.yml to spin it up (later on you can add additional services to this setup).A fair example for docker-compose.yml:The compose file mounts a local tls directory as a mapped volume inside the container. The tls/ directory will contain our cert files, and the compose file is reflecting this.To setup a secure connection we need to facilitate it using certificates so an initial technical step:Important notes:Update the compose file to include a command configuration that starts the RethinkDB process with all the required SSL configurationImportant notes:You’ll notice there isn’t any cluster related configuration but you can add them as well if you need to so they can join the SSL connection: — cluster-tls — cluster-tls-key /tls/key.pem — cluster-tls-cert /tls/cert.pem — cluster-tls-ca /tls/ca.pemThe RethinkDB drivers support an ssl optional object which either sets the certificate using the ca property, or sets the rejectUnauthorized property to accept or reject self-signed certificates when connecting. A snippet for the ssl configuration to pass to the driver:Now that the connection is secured, it only makes sense to connect using a user/password which are not the default.To set it up, update the compose file to also include the — initial-password argument so you can set the default admin user’s password. For example:Of course you need to append this argument to the rest of the command line options in the above compose file.Now, update the Node.js driver settings to use a user and password to connect:Congratulations! You’re now eligible to “Ready for Production stickers.Don’t worry, I already mailed them to your address.", uri: "https://dev.to/lirantal/securing-a-nodejs--rethinkdb--tls-setup-on-docker-containers"}) |
Now we're going to import the rest of the articles.
We'll use the apoc.periodic.iterate procedure so that we can parallelise the process. This procedure takes a data-driven statement and an operation statement:
- The data-driven statement contains a stream of items to process, which in our case is a stream of URIs.
- The operation statement defines what to do with each of those items, i.e. call apoc.load.html and create a node with the Article label.
The final parameter is for providing configuration. We're going to tell the procedure to process the items in batches of 5 that can be run in parallel.
We can see the call to this procedure in the following example:
CALL apoc.periodic.iterate(
"LOAD CSV WITH HEADERS FROM 'https://github.com/neo4j-examples/nlp-knowledge-graph/raw/master/import/articles.csv' AS row
RETURN row",
"MERGE (a:Article {uri: row.uri})
WITH a
CALL apoc.load.html(a.uri, {
body: 'body div.spec__body p',
title: 'h1',
time: 'time'
})
YIELD value
UNWIND value.body AS item
WITH a,
apoc.text.join(collect(item.text), '') AS body,
value.title[0].text AS title,
value.time[0].attributes.datetime AS date
SET a.body = body , a.title = title, a.datetime = datetime(date)",
{batchSize: 5, parallel: true}
)
YIELD batches, total, timeTaken, committedOperations
RETURN batches, total, timeTaken, committedOperations;
batches | total | timeTaken | committedOperations |
---|---|---|---|
7 | 32 | 15 | 32 |
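As a quick sanity check (not part of the original walkthrough), we can confirm that the scraping step populated the properties on every article; count(a.title) only counts nodes where the property is present:

MATCH (a:Article)
RETURN count(a) AS articles,
       count(a.title) AS withTitle,
       count(a.body) AS withBody,
       count(a.datetime) AS withDate;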
We now have two disconnected subgraphs, as shown in the diagram below:
On the left is the Wikidata taxonomy graph, which represents the explicit knowledge in our knowledge graph. On the right is the article graph, which represents the facts in our knowledge graph. We want to join these two graphs together, which we'll do using NLP techniques.
Article Entity Extraction
In April 2020, the APOC standard library added procedures that wrap the NLP APIs of each of the big cloud providers: AWS, GCP, and Azure. These procedures extract text from a node property and then send that text to APIs that extract entities, key phrases, categories, or sentiment.
We're going to use the GCP entity extraction procedures on our articles. The GCP NLP API returns Wikipedia pages for entities where such pages exist.
Before we do that, we need to create an API key that has access to the Natural Language API. Assuming we've already created a GCP account, we can generate a key by following the instructions at console.cloud.google.com/apis/credentials. Once we've created the key, we'll create a parameter that contains it:
:params key => ("<insert-key-here>")
We're going to use the apoc.nlp.gcp.entities.stream procedure, which returns a stream of entities found in the text content of a node property. Before running this procedure against all of the articles, let's run it against one of them to see what data comes back:
MATCH (a:Article {uri: "https://dev.to/lirantal/securing-a-nodejs--rethinkdb--tls-setup-on-docker-containers"})
CALL apoc.nlp.gcp.entities.stream(a, {
nodeProperty: 'body',
key: $key
})
YIELD node, value
SET node.processed = true
WITH node, value
UNWIND value.entities AS entity
RETURN entity
LIMIT 5;
entity |
---|
{name: "RethinkDB", salience: 0.47283632, metadata: {mid: "/m/0134hdhv", wikipedia_url: "https://en.wikipedia.org/wiki/RethinkDB"}, type: "ORGANIZATION", mentions: [{type: "PROPER", text: {content: "RethinkDB", beginOffset: -1}}, {type: "PROPER", text: {content: "RethinkDB", beginOffset: -1}}, {type: "PROPER", text: {content: "RethinkDB", beginOffset: -1}}, {type: "PROPER", text: {content: "RethinkDB", beginOffset: -1}}, {type: "PROPER", text: {content: "pemThe RethinkDB", beginOffset: -1}}]} |
{name: "connection", salience: 0.04166339, metadata: {}, type: "OTHER", mentions: [{type: "COMMON", text: {content: "connection", beginOffset: -1}}, {type: "COMMON", text: {content: "connection", beginOffset: -1}}]} |
{name: "work", salience: 0.028608896, metadata: {}, type: "OTHER", mentions: [{type: "COMMON", text: {content: "work", beginOffset: -1}}]} |
{name: "projects", salience: 0.028608896, metadata: {}, type: "OTHER", mentions: [{type: "COMMON", text: {content: "projects", beginOffset: -1}}]} |
{name: "database", salience: 0.01957906, metadata: {}, type: "OTHER", mentions: [{type: "COMMON", text: {content: "database", beginOffset: -1}}]} |
Each row contains a name property that describes the entity, while salience is an indicator of the importance or centrality of that entity to the entire document text.
Some entities also contain a Wikipedia URL, which is found via the metadata.wikipedia_url key. The first entity, RethinkDB, is the only entity in this list that has such a URL. We're going to filter the rows returned to include only those that have a Wikipedia URL, and we'll then connect the Article nodes to the WikipediaPage nodes that have those URLs.
Let's see how to do this for one article:
MATCH (a:Article {uri: "https://dev.to/lirantal/securing-a-nodejs--rethinkdb--tls-setup-on-docker-containers"})
CALL apoc.nlp.gcp.entities.stream(a, {
nodeProperty: 'body',
key: $key
})
(1)
YIELD node, value
SET node.processed = true
WITH node, value
UNWIND value.entities AS entity
(2)
WITH entity, node
WHERE not(entity.metadata.wikipedia_url is null)
(3)
MERGE (page:Resource {uri: entity.metadata.wikipedia_url})
SET page:WikipediaPage
(4)
MERGE (node)-[:HAS_ENTITY]->(page)
1. node is the article, and value contains the extracted entities
2. Only include entities that have a Wikipedia URL
3. Find a node that matches the Wikipedia URL. Create one if it doesn't already exist.
4. Create a HAS_ENTITY relationship between the Article node and the WikipediaPage node
We can see how running this query connects the article and taxonomy subgraphs by looking at the following Neo4j Browser visualization:
Now we can run the entity extraction technique over the rest of the articles, again with help from the apoc.periodic.iterate procedure:
CALL apoc.periodic.iterate(
"MATCH (a:Article)
WHERE not(exists(a.processed))
RETURN a",
"CALL apoc.nlp.gcp.entities.stream([item in $_batch | item.a], {
nodeProperty: 'body',
key: $key
})
YIELD node, value
SET node.processed = true
WITH node, value
UNWIND value.entities AS entity
WITH entity, node
WHERE not(entity.metadata.wikipedia_url is null)
MERGE (page:Resource {uri: entity.metadata.wikipedia_url})
SET page:WikipediaPage
MERGE (node)-[:HAS_ENTITY]->(page)",
{batchMode: "BATCH_SINGLE", batchSize: 10, params: {key: $key}})
YIELD batches, total, timeTaken, committedOperations
RETURN batches, total, timeTaken, committedOperations;
batches | total | timeTaken | committedOperations |
---|---|---|---|
4 | 31 | 29 | 31 |
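If those numbers don't match the number of articles, a quick check (again, not part of the original walkthrough) is to look for articles that are still missing the processed flag:

MATCH (a:Article)
WHERE not(exists(a.processed))
RETURN a.uri AS unprocessed;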
Querying the Knowledge Graph
Now it's time to query the knowledge graph.
Semantic Search
The first query we're going to run is a semantic search. The n10s.inference.nodesInCategory procedure lets us search from a top-level category, finding all of its transitive sub categories, and then returns the nodes attached to any of those categories.
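A bare-bones sketch of the procedure on its own might look like this, reusing the version control system category that we explored earlier:

MATCH (c:Category {name: "version control system"})
CALL n10s.inference.nodesInCategory(c, {
  inCatRel: "ABOUT",
  subCatRel: "SUB_CAT_OF"
})
YIELD node
RETURN node.uri AS uri
LIMIT 10;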
In our graph, the nodes connected to category nodes are WikipediaPage nodes. We therefore need to add an extra MATCH clause to the query to find the connected articles via the HAS_ENTITY relationship type. We can see how to do this in the following query:
MATCH (c:Category {name: "NoSQL database management system"})
CALL n10s.inference.nodesInCategory(c, {
inCatRel: "ABOUT",
subCatRel: "SUB_CAT_OF"
})
YIELD node
MATCH (node)<-[:HAS_ENTITY]-(article)
RETURN article.uri AS uri, article.title AS title, article.datetime AS date,
collect(n10s.rdf.getIRILocalName(node.uri)) as explicitTopics
ORDER BY date DESC
LIMIT 5;
uri | title | date | explicitTopics |
---|---|---|---|
"https://dev.to/arthurolga/newsql-an-implementation-with-google-spanner-2a86" | "NewSQL: An Implementation with Google Spanner" | 2020-08-10T16:01:25Z | ["NoSQL"] |
"https://dev.to/goaty92/designing-tinyurl-it's-more-complicated-than-you-think-2a48" | "Designing TinyURL: it’s more complicated than you think" | 2020-08-10T10:21:05Z | ["Apache_ZooKeeper"] |
"https://dev.to/nipeshkc7/dynamodb-the-basics-360g" | "DynamoDB: the basics" | 2020-06-02T04:09:36Z | ["NoSQL", "Amazon_DynamoDB"] |
"https://dev.to/subhransu/realtime-chat-app-using-kafka-springboot-reactjs-and-websockets-lc" | "Realtime Chat app using Kafka, SpringBoot, ReactJS, and WebSockets" | 2020-04-25T23:17:22Z | ["Apache_ZooKeeper"] |
"https://dev.to/codaelux/running-dynamodb-offline-4k1b" | "How to run DynamoDB Offline" | 2020-03-23T21:48:31Z | ["NoSQL", "Amazon_DynamoDB"] |
Although we searched for NoSQL, we can see from the results that several of the articles aren't directly linked to that category. For example, we have a couple of articles about Apache ZooKeeper. By writing the following query, we can see how that category is connected to NoSQL:
MATCH path = (c:WikipediaPage)-[:ABOUT]->(category)-[:SUB_CAT_OF*]->(:Category {name: "NoSQL database management system"})
WHERE c.uri CONTAINS "Apache_ZooKeeper"
RETURN path;

So Apache ZooKeeper is actually a few levels away from the NoSQL category.
Similar Articles
Another thing we can do with our knowledge graph is find similar articles based on the entities that articles have in common. The simplest version of this query finds other articles that share common entities, as shown in the following query:
MATCH (a:Article {uri: "https://dev.to/qainsights/performance-testing-neo4j-database-using-bolt-protocol-in-apache-jmeter-1oa9"}),
path = (a)-[:HAS_ENTITY]->(wiki)-[:ABOUT]->(cat),
otherPath = (wiki)<-[:HAS_ENTITY]-(other)
RETURN path, otherPath;
The Neo4j performance testing article is about Neo4j, and there are two other Neo4j articles that we could recommend to a reader who liked this article.
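A natural refinement, which isn't in the original queries, is to rank the candidate articles by how many entities they share with the source article:

MATCH (a:Article {uri: "https://dev.to/qainsights/performance-testing-neo4j-database-using-bolt-protocol-in-apache-jmeter-1oa9"})
MATCH (a)-[:HAS_ENTITY]->(wiki)<-[:HAS_ENTITY]-(other:Article)
RETURN other.title AS title, count(wiki) AS sharedEntities
ORDER BY sharedEntities DESC
LIMIT 5;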
We can also use the category taxonomy in our query. By writing the following query, we can find articles that share a common parent category:
MATCH (a:Article {uri: "https://dev.to/qainsights/performance-testing-neo4j-database-using-bolt-protocol-in-apache-jmeter-1oa9"}),
entityPath = (a)-[:HAS_ENTITY]->(wiki)-[:ABOUT]->(cat),
path = (cat)-[:SUB_CAT_OF]->(parent)<-[:SUB_CAT_OF]-(otherCat),
otherEntityPath = (otherCat)<-[:ABOUT]-(otherWiki)<-[:HAS_ENTITY]-(other)
RETURN other.title, other.uri,
[(other)-[:HAS_ENTITY]->()-[:ABOUT]->(entity) | entity.name] AS otherCategories,
collect([node in nodes(path) | node.name]) AS pathToOther;
other.title | other.uri | otherCategories | pathToOther |
---|---|---|---|
"Couchbase GeoSearch with ASP.NET Core" | "https://dev.to/ahmetkucukoglu/couchbase-geosearch-with-asp-net-core-i04" | ["ASP.NET", "Couchbase Server"] | [["Neo4j", "proprietary software", "Couchbase Server"], ["Neo4j", "free software", "ASP.NET"], ["Neo4j", "free software", "Couchbase Server"]] |
"The Ultimate Postgres vs MySQL Blog Post" | "https://dev.to/dmfay/the-ultimate-postgres-vs-mysql-blog-post-1l5f" | ["YAML", "Python", "JavaScript", "NoSQL database management system", "Structured Query Language", "JSON", "Extensible Markup Language", "comma-separated values", "PostgreSQL", "MySQL", "Microsoft SQL Server", "MongoDB", "MariaDB"] | [["Neo4j", "proprietary software", "Microsoft SQL Server"], ["Neo4j", "free software", "PostgreSQL"]] |
"5 Best courses to learn Apache Kafka in 2020" | "https://dev.to/javinpaul/5-best-courses-to-learn-apache-kafka-in-2020-584h" | ["Java", "Scratch", "Scala", "Apache ZooKeeper"] | [["Neo4j", "free software", "Scratch"], ["Neo4j", "free software", "Apache ZooKeeper"]] |
"Building a Modern Web Application with Neo4j and NestJS" | "https://dev.to/adamcowley/building-a-modern-web-application-with-neo4j-and-nestjs-38ih" | ["TypeScript", "JavaScript", "Neo4j"] | [["Neo4j", "free software", "TypeScript"]] |
"Securing a Node.js + RethinkDB + TLS setup on Docker containers" | "https://dev.to/lirantal/securing-a-nodejs--rethinkdb--tls-setup-on-docker-containers" | ["NoSQL database management system", "RethinkDB"] | [["Neo4j", "free software", "RethinkDB"]] |
Notice that in this query we also return the path from the initial article to the other articles. So for the "Couchbase GeoSearch with ASP.NET Core" article, there's a path from the initial article to the Neo4j category, then up to the proprietary software category, which is also a parent of the Couchbase Server category, the category to which the "Couchbase GeoSearch with ASP.NET Core" article is connected.
This shows another nice feature of knowledge graphs: as well as making a recommendation, it's easy to explain why it was made.
Adding a Custom Ontology
We might decide that proprietary software isn't a very good measure of the similarity between two technology products. It's unlikely that we'd be looking for similar articles based on this type of similarity.
One common way that software products are connected, however, is via technology stacks. We could therefore create our own ontology containing some of those stacks.
nsmntx.org/2020/08/swStacks contains an ontology of the GRANDstack, MEAN Stack, and LAMP Stack. Before importing this ontology, let's set up some mappings in n10s:
// register prefixes for the OWL and RDFS namespaces used by the ontology
CALL n10s.nsprefixes.add('owl','http://www.w3.org/2002/07/owl#');
CALL n10s.nsprefixes.add('rdfs','http://www.w3.org/2000/01/rdf-schema#');

// reuse the SUB_CAT_OF relationship type and name property from the Wikidata import
CALL n10s.mapping.add("http://www.w3.org/2000/01/rdf-schema#subClassOf","SUB_CAT_OF");
CALL n10s.mapping.add("http://www.w3.org/2000/01/rdf-schema#label","name");

// import OWL classes as Category nodes
CALL n10s.mapping.add("http://www.w3.org/2002/07/owl#Class","Category");
Now we can preview what would be imported from the ontology by running the following query:
CALL n10s.rdf.preview.fetch("http://www.nsmntx.org/2020/08/swStacks","Turtle");
That looks good, so let's import the ontology by running the following query:
CALL n10s.rdf.import.fetch("http://www.nsmntx.org/2020/08/swStacks","Turtle")
YIELD terminationStatus, triplesLoaded, triplesParsed, namespaces, callParams
RETURN terminationStatus, triplesLoaded, triplesParsed, namespaces, callParams;
terminationStatus | triplesLoaded | triplesParsed | namespaces | callParams |
---|---|---|---|---|
"OK" | 58 | 58 | NULL | {} |
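Before re-running the similarity query, we can take a quick look at what the ontology added under the GRAND Stack category (a sketch that assumes the category names used by the imported ontology, as seen in the pathToOther values below):

MATCH (stack:Category {name: "GRAND Stack"})<-[:SUB_CAT_OF]-(member)
RETURN member.name AS member;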
And now we can re-run the similarity query, which returns the following results:
other.title | other.uri | otherCategories | pathToOther |
---|---|---|---|
"A Beginner's Guide to GraphQL" | "https://dev.to/leonardomso/a-beginners-guide-to-graphql-3kjj" | ["GraphQL", "JavaScript"] | [["Neo4j", "GRAND Stack", "GraphQL"]] |
"Learn how you can build a Serverless GraphQL API on top of a Microservice architecture, part I" | "https://dev.to/azure/learn-how-you-can-build-a-serverless-graphql-api-on-top-of-a-microservice-architecture-233g" | ["Node.js", "GraphQL"] | [["Neo4j", "GRAND Stack", "GraphQL"]] |
"The Ultimate Postgres vs MySQL Blog Post" | "https://dev.to/dmfay/the-ultimate-postgres-vs-mysql-blog-post-1l5f" | ["Structured Query Language", "Extensible Markup Language", "PostgreSQL", "MariaDB", "JSON", "MySQL", "Microsoft SQL Server", "MongoDB", "comma-separated values", "JavaScript", "YAML", "Python", "NoSQL database management system"] | [["Neo4j", "proprietary software", "Microsoft SQL Server"], ["Neo4j", "free software", "PostgreSQL"]] |
"Couchbase GeoSearch with ASP.NET Core" | "https://dev.to/ahmetkucukoglu/couchbase-geosearch-with-asp-net-core-i04" | ["ASP.NET", "Couchbase Server"] | [["Neo4j", "proprietary software", "Couchbase Server"], ["Neo4j", "free software", "ASP.NET"], ["Neo4j", "free software", "Couchbase Server"]] |
"Building a Modern Web Application with Neo4j and NestJS" | "https://dev.to/adamcowley/building-a-modern-web-application-with-neo4j-and-nestjs-38ih" | ["JavaScript", "TypeScript", "Neo4j"] | [["Neo4j", "free software", "TypeScript"]] |
This time we also get a couple of articles about GraphQL at the top. GraphQL is a tool in the GRANDstack, of which Neo4j is also a part.