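Graph Databases, RDF, and Linked Data
RDF vs LPG: Data Model
Every statement in an RDF dataset represents an edge in the graph, whereas in an LPG (labeled property graph) nodes have internal structure, so we can decide what is a property and what is a relationship.
Below is a small set of RDF statements. You can try inserting them into your favorite triple store (why not the rdf4j server? [https://rdf4j.org/documentation/tools/server-workbench/])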
INSERT DATA {
  <https://g.co/kg/m/0567wt> <https://schema.org/name> "Sketches of Spain" .
  <https://g.co/kg/m/0567wt> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <https://schema.org/MusicAlbum> .
  <https://g.co/kg/m/0567wt> <https://schema.org/description> "Album by Miles Davis" .
  <https://g.co/kg/m/0567wt> <https://schema.org/genre> "Jazz" .
  <https://g.co/kg/m/0567wt> <https://schema.googleapis.com/detailedDescription> _:genid1 .
  _:genid1 <https://schema.org/license> "https://en.wikipedia.org/wiki/Wikipedia:Creative_Commons_Attribution-ShareAlike_3.0_License" .
  _:genid1 <https://schema.org/url> "https://en.wikipedia.org/wiki/Sketches_of_Spain" .
  _:genid1 <https://schema.org/articleBody> "...between November 1959 and March 1960 at the Columbia 30th Street Studio in NY City" .
  <https://g.co/kg/m/0567wt> <https://schema.org/award> <https://g.co/kg/m/018xpp> .
  <https://g.co/kg/m/018xpp> <https://schema.org/name> "Grammy Hall of Fame" .
  <https://g.co/kg/m/0567wt> <https://schema.org/byArtist> <https://g.co/kg/m/053yx> .
  <https://g.co/kg/m/053yx> <https://schema.org/name> "Miles Davis" .
  <https://g.co/kg/m/0567wt> <https://schema.org/producer> <https://g.co/kg/m/01v1m8b> .
  <https://g.co/kg/m/01v1m8b> <https://schema.org/name> "Teo Macero" .
  <https://g.co/kg/m/0567wt> <https://schema.org/producer> <https://g.co/kg/m/02wvrn5> .
  <https://g.co/kg/m/02wvrn5> <https://schema.org/name> "Irving Townsend" .
}
The same information, this time expressed as a property graph in Cypher
CREATE (sos:MusicAlbum { name: "Sketches of Spain",
description: "Album by Miles Davis",
genre: "Jazz"})
CREATE (dd:DetailedDescription { license: "https://en.wikipedia.org/wiki/Wikipedia:Creative_Commons_Attribution-ShareAlike_3.0_License",
articleBody: "...between November 1959 and March 1960 at the Columbia 30th Street Studio in NY City"})
CREATE (sos)-[:goog_detailedDescription]->(dd)
CREATE (sos)-[:award]-> (:Award { name: "Grammy Hall of Fame" })
CREATE (sos)-[:byArtist]->(:Person { name: "Miles Davis" })
CREATE (sos)-[:producer]->(:Person { name: "Teo Macero" })
CREATE (sos)-[:producer]->(:Person { name: "Irving Townsend" })
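The modeling decision described above (literal values become node properties, links to other resources become relationships) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Lit` marker, the `triples_to_lpg` helper, and the shortened identifiers such as `album` and `miles` are hypothetical and not part of any RDF or Neo4j API.

```python
from collections import namedtuple

# Hypothetical marker for literal objects; anything else is treated as a resource.
Lit = namedtuple("Lit", "value")

def triples_to_lpg(triples):
    """Split RDF-style triples into LPG-style nodes (with properties) and relationships."""
    nodes = {}          # resource id -> dict of properties
    relationships = []  # (start, type, end) tuples
    for s, p, o in triples:
        nodes.setdefault(s, {})
        if isinstance(o, Lit):
            # Literal object: stored inside the node as a property.
            # In this sketch a repeated predicate overwrites the earlier value.
            nodes[s][p] = o.value
        else:
            # Resource object: stays an edge between two nodes.
            nodes.setdefault(o, {})
            relationships.append((s, p, o))
    return nodes, relationships

triples = [
    ("album", "name", Lit("Sketches of Spain")),
    ("album", "genre", Lit("Jazz")),
    ("album", "byArtist", "miles"),
    ("miles", "name", Lit("Miles Davis")),
]
nodes, relationships = triples_to_lpg(triples)
```

Four triples collapse into two nodes and a single relationship, which mirrors the difference in size between the SPARQL and Cypher versions of the album data.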
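RDF vs LPG: SPARQL and Cypher Queries
RDF vs LPG: SPARQL and Cypher Updates
Updating an RDF graph with SPARQL
We have already seen how to insert triples into an RDF store with INSERT DATA, but what about updates? Let's try converting the names of all producers to uppercase.
Note that in this particular case we do not identify producers by type; we identify them by the fact that they are linked to an album through the "producer" relationship.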
PREFIX sc: <https://schema.org/>
DELETE { ?prod sc:name ?name }
INSERT { ?prod sc:name ?newValue }
WHERE {
  ?prod sc:name ?name .
  ?musalb sc:producer ?prod .
  BIND (UCASE(?name) AS ?newValue)
}
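To make the DELETE / INSERT / WHERE semantics concrete, here is a minimal Python sketch that applies the same logic to an in-memory set of triples. The function name `uppercase_producer_names` and the shortened identifiers (`album`, `teo`, `miles`) are assumptions made for illustration; a real triple store evaluates the update itself.

```python
def uppercase_producer_names(triples):
    """Return a new triple set in which every producer's name is upper-cased."""
    # WHERE: a producer is anything that is the object of a "producer" edge.
    producers = {o for (_, p, o) in triples if p == "producer"}
    # DELETE template: the current name triples of those producers.
    to_delete = {(s, p, o) for (s, p, o) in triples if p == "name" and s in producers}
    # INSERT template: the same triples with the name upper-cased (UCASE).
    to_insert = {(s, p, o.upper()) for (s, p, o) in to_delete}
    return (triples - to_delete) | to_insert

graph = {
    ("album", "name", "Sketches of Spain"),
    ("album", "producer", "teo"),
    ("teo", "name", "Teo Macero"),
    ("album", "byArtist", "miles"),
    ("miles", "name", "Miles Davis"),
}
updated = uppercase_producer_names(graph)
```

Only the producer's name changes; the artist and the album keep their original names because they are never the object of a "producer" edge.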
RDF vs LPG: Model Difference #1
Multiple relationships of the same type between two nodes in a property graph
CREATE (d {name: "Dan"})-[:LIKES]->(a {name: "Ann"})
CREATE (d)-[:LIKES]->(a)
CREATE (d)-[:LIKES]->(a)
When we query it...
MATCH (d {name: "Dan"})-[l:LIKES]->(a {name: "Ann"})
RETURN COUNT(l)
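The count is 3: we get three separate relationships of type 'LIKES'.
This is because every relationship in a property graph has its own unique identity.
Multiple relationships of the same type between two nodes in RDF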
PREFIX sc: <https://schema.org/>
INSERT DATA {
  <https://dan> sc:name "Dan" .
  <https://ann> sc:name "Ann" .
  <https://dan> sc:likes <https://ann> .
  <https://dan> sc:likes <https://ann> .
  <https://dan> sc:likes <https://ann> .
}
But when we query it...
PREFIX sc: <https://schema.org/>
SELECT (COUNT(?x) AS ?count)
WHERE {
  <https://dan> sc:likes ?x .
  FILTER (?x = <https://ann>)
}
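The count is 1. An RDF graph is a set of triples, so relationships of the same type between the same two nodes are exactly the same statement (triple), and repeating it adds nothing. If we need several distinct instances, we have to use reification.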
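The difference comes down to set semantics versus identity. The following Python sketch contrasts the two and shows a reification-style workaround; the integer `id` values, the `stmt0`..`stmt2` statement nodes, and the `subject` / `predicate` / `object` property names are illustrative assumptions, not the vocabulary of any particular store.

```python
# RDF: a graph is a set of triples, so adding the same statement three times stores it once.
rdf_graph = set()
for _ in range(3):
    rdf_graph.add(("dan", "likes", "ann"))

# LPG: each relationship has its own identity (here a hypothetical integer id),
# so three CREATEs produce three distinct relationships.
lpg_relationships = []
for rel_id in range(3):
    lpg_relationships.append({"id": rel_id, "start": "dan", "type": "LIKES", "end": "ann"})

rdf_count = len(rdf_graph)
lpg_count = len(lpg_relationships)

# Reification-style workaround: each "likes" statement becomes its own node,
# which gives the three statements distinct identities inside the triple set.
reified = set()
for n in range(3):
    stmt = f"stmt{n}"
    reified |= {
        (stmt, "subject", "dan"),
        (stmt, "predicate", "likes"),
        (stmt, "object", "ann"),
    }
reified_count = len({s for (s, p, o) in reified if p == "predicate" and o == "likes"})
```

The price of the workaround is size: one logical statement now takes three triples, and queries have to go through the statement node.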
RDF vs LPG: Model Difference #2
In a property graph...
Properties on relationships come naturally
CREATE ( {name: "NYC"})-[:CONNECTION { distanceKm : 4100, costUSD: 300}]->( {name: "SFO"})
And we can easily query them...
MATCH ( {name: "NYC"})-[c:CONNECTION]->( {name: "SFO"})
RETURN c.costUSD, c.distanceKm
In RDF...
A similar approach will not work.
PREFIX sc: <https://schema.org/>
INSERT DATA {
  <https://nyc> sc:name "NYC" .
  <https://sfo> sc:name "SFO" .
  <https://nyc> sc:connection <https://sfo> .
  sc:connection sc:distanceKm 4100
}
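We might think that adding a triple with the distance would solve the problem... but in fact this attaches the distance property to the relationship type, not to this particular instance.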
PREFIX sc: <https://schema.org/>
SELECT ?distanceKm
WHERE {
  ?nyc sc:name "NYC" .
  ?sfo sc:name "SFO" .
  ?nyc ?p ?sfo .
  FILTER (?p = sc:connection)
  ?p sc:distanceKm ?distanceKm
}
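So when we query it, everything looks fine as long as there is only one instance... but as soon as we add more instances of the same relationship type, things go wrong.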
PREFIX sc: <https://schema.org/>
INSERT DATA {
  <https://nyc> sc:name "NYC" .
  <https://lhr> sc:name "LHR" .
  <https://nyc> sc:connection <https://lhr> .
  sc:connection sc:distanceKm 5600
}
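A small Python sketch shows what goes wrong once both inserts have been applied. It mimics the distance SELECT over an in-memory triple set; the `distances` helper and the shortened identifiers (`nyc`, `sfo`, `lhr`, `connection`) are assumptions made for illustration.

```python
graph = {
    ("nyc", "connection", "sfo"),
    ("nyc", "connection", "lhr"),
    # Both distances hang off the relationship *type*, not off a specific connection.
    ("connection", "distanceKm", 4100),
    ("connection", "distanceKm", 5600),
}

def distances(graph, start, end):
    """Find the predicates linking start to end, then return their distanceKm values."""
    predicates = {p for (s, p, o) in graph if s == start and o == end}
    return sorted(o for (s, p, o) in graph if s in predicates and p == "distanceKm")

nyc_sfo = distances(graph, "nyc", "sfo")
nyc_lhr = distances(graph, "nyc", "lhr")
```

Both lookups return the same two values, so there is no way to tell which distance belongs to which connection.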
A possible alternative in RDF: a modeling workaround that uses an intermediate node
PREFIX sc: <https://schema.org/>
INSERT DATA {
  <https://nyc> sc:name "NYC" .
  <https://sfo> sc:name "SFO" .
  <https://nyc-sfo> sc:from <https://nyc> .
  <https://nyc-sfo> sc:to <https://sfo> .
  <https://nyc-sfo> sc:distanceKm 4100 .
  <https://nyc-sfo> sc:costUSD 300 .
}
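With an intermediate node, each connection carries its own properties and lookups become unambiguous. The Python sketch below queries such a structure; the `connection_properties` helper is hypothetical, and the `nyc-lhr` triples are an assumed extension of the example, added only to show that a second connection no longer interferes.

```python
graph = {
    ("nyc-sfo", "from", "nyc"), ("nyc-sfo", "to", "sfo"),
    ("nyc-sfo", "distanceKm", 4100), ("nyc-sfo", "costUSD", 300),
    # Assumed second connection, not in the original example.
    ("nyc-lhr", "from", "nyc"), ("nyc-lhr", "to", "lhr"),
    ("nyc-lhr", "distanceKm", 5600),
}

def connection_properties(graph, start, end):
    """Return one {property: value} dict per connection node linking start to end."""
    from_start = {s for (s, p, o) in graph if p == "from" and o == start}
    to_end = {s for (s, p, o) in graph if p == "to" and o == end}
    results = []
    for conn in sorted(from_start & to_end):
        # Everything except the two endpoint links is a property of this connection.
        props = {p: o for (s, p, o) in graph if s == conn and p not in ("from", "to")}
        results.append(props)
    return results

nyc_sfo = connection_properties(graph, "nyc", "sfo")
nyc_lhr = connection_properties(graph, "nyc", "lhr")
```

The trade-off is that the direct edge between the two cities disappears: every traversal now takes two hops through the connection node.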
RDF vs LPG: Model Difference #3
Multi-valued properties are stored as arrays in a property graph
CREATE (s:Album { name: "Sketches of Spain",
genre: [ "Jazz","Orchestral Jazz" ] } )
They can be queried and returned as an array...
MATCH (a:Album)
WHERE a.name= "Sketches of Spain"
RETURN a.genre
...or as individual results
MATCH (a:Album)
WHERE a.name = "Sketches of Spain"
UNWIND a.genre AS genre
RETURN genre
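Multi-valued properties in RDF are simply separate statements (triples)
No special handling is needed: the two genre values are two independent triples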
PREFIX schema: <https://schema.org/>
INSERT DATA {
  <https://g.co/kg/m/0567wt> schema:name "Sketches of Spain" .
  <https://g.co/kg/m/0567wt> schema:genre "Jazz" .
  <https://g.co/kg/m/0567wt> schema:genre "Orchestral Jazz" .
}
They can be queried, and each value comes back as a separate binding
PREFIX schema: <https://schema.org/>
SELECT ?genre
WHERE {
  ?album schema:name "Sketches of Spain" .
  ?album schema:genre ?genre .
}
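The two representations convert into each other mechanically. The Python sketch below groups repeated predicates into an array (the property graph view) and also lists the values one per row (the UNWIND or SPARQL-binding view). The `to_properties` helper and the shortened `album` identifier are assumptions for illustration only.

```python
from collections import defaultdict

triples = [
    ("album", "name", "Sketches of Spain"),
    ("album", "genre", "Jazz"),
    ("album", "genre", "Orchestral Jazz"),
]

def to_properties(triples, subject):
    """Collect a subject's triples into LPG-style properties."""
    grouped = defaultdict(list)
    for s, p, o in triples:
        if s == subject:
            grouped[p].append(o)
    # A single value stays a scalar; a repeated predicate becomes a list (array).
    return {p: values[0] if len(values) == 1 else values for p, values in grouped.items()}

album = to_properties(triples, "album")

# One row per value, comparable to UNWIND in Cypher or to separate SPARQL bindings.
genre_rows = [o for (s, p, o) in triples if s == "album" and p == "genre"]
```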
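Integration #1: Loading RDF Data into Neo4j
Querying a SPARQL endpoint and importing via LOAD CSV
The data lives in a triple store that exposes a SPARQL endpoint
A popular (if somewhat messy) public SPARQL endpoint is DBpedia: https://dbpedia.org/sparql
Here is a SPARQL query that returns the films of Gene Hackman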
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?movie ?title ?dir ?name
WHERE {
  ?movie dbpedia-owl:starring ?actor .
  ?actor rdfs:label "Gene Hackman"@en .
  ?movie rdfs:label ?title .
  ?movie dbpedia-owl:director ?dir .
  ?dir rdfs:label ?name .
  FILTER LANGMATCHES(LANG(?title), "EN")
  FILTER LANGMATCHES(LANG(?name), "EN")
}
We can explore the dataset directly with LOAD CSV
WITH "https://dbpedia.org/sparql?default-graph-uri=http%3A%2F%2Fdbpedia.org&query=prefix+dbpedia-owl%3A+%3Chttp%3A%2F%2Fdbpedia.org%2Fontology%2F%3E+%0D%0A%0D%0ASELECT+%3Fmovie+%3Ftitle+%3Fdir+%3Fname%0D%0AWHERE+%7B%0D%0A++%3Fmovie+dbpedia-owl%3Astarring+%5B+rdfs%3Alabel+%22Gene+Hackman%22%40en+%5D%3B%0D%0A+++++++++rdfs%3Alabel+%3Ftitle%3B%0D%0A+++++++++dbpedia-owl%3Adirector+%3Fdir+.%0D%0A++%3Fdir+rdfs%3Alabel+%3Fname+.%0D%0A++FILTER+LANGMATCHES%28LANG%28%3Ftitle%29%2C+%22EN%22%29%0D%0A++FILTER+LANGMATCHES%28LANG%28%3Fname%29%2C++%22EN%22%29%0D%0A%7D&format=text%2Fcsv&CXML_redir_for_subjs=121&CXML_redir_for_hrefs=&timeout=30000&debug=on" AS url
LOAD CSV WITH HEADERS FROM url AS row
RETURN row
If the data looks good, we can refine the query to create nodes and relationships in Neo4j...
WITH "https://dbpedia.org/sparql?default-graph-uri=http%3A%2F%2Fdbpedia.org&query=prefix+dbpedia-owl%3A+%3Chttp%3A%2F%2Fdbpedia.org%2Fontology%2F%3E+%0D%0A%0D%0ASELECT+%3Fmovie+%3Ftitle+%3Fdir+%3Fname%0D%0AWHERE+%7B%0D%0A++%3Fmovie+dbpedia-owl%3Astarring+%5B+rdfs%3Alabel+%22Gene+Hackman%22%40en+%5D%3B%0D%0A+++++++++rdfs%3Alabel+%3Ftitle%3B%0D%0A+++++++++dbpedia-owl%3Adirector+%3Fdir+.%0D%0A++%3Fdir+rdfs%3Alabel+%3Fname+.%0D%0A++FILTER+LANGMATCHES%28LANG%28%3Ftitle%29%2C+%22EN%22%29%0D%0A++FILTER+LANGMATCHES%28LANG%28%3Fname%29%2C++%22EN%22%29%0D%0A%7D&format=text%2Fcsv&CXML_redir_for_subjs=121&CXML_redir_for_hrefs=&timeout=30000&debug=on" AS url
LOAD CSV WITH HEADERS FROM url AS row
MERGE (m:Movie { id: row.movie, title: row.title })
MERGE (d:Director { id: row.dir, name : row.name })
MERGE (m)-[db:DIRECTED_BY]->(d)
RETURN m, db, d
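The long URL passed to LOAD CSV is just the endpoint plus a URL-encoded query and a CSV format parameter. As a rough sketch, it can be assembled and its result parsed with the Python standard library. The parameter names `default-graph-uri`, `query`, and `format` are taken from the URL above; the helper names, the short test query, and the sample CSV text are made up for illustration, and no network request is issued here.

```python
import csv
import io
from urllib.parse import urlencode, urlsplit, parse_qs

ENDPOINT = "https://dbpedia.org/sparql"

def build_csv_url(query, endpoint=ENDPOINT):
    """Build a GET URL that asks the endpoint to return the query result as CSV."""
    params = {
        "default-graph-uri": "http://dbpedia.org",
        "query": query,
        "format": "text/csv",
    }
    return endpoint + "?" + urlencode(params)

def parse_rows(csv_text):
    """Parse a CSV response with a header row into dicts, like LOAD CSV WITH HEADERS."""
    return list(csv.DictReader(io.StringIO(csv_text)))

# Shortened, hypothetical query used only to demonstrate the encoding.
query = "SELECT ?movie ?title WHERE { ?movie rdfs:label ?title } LIMIT 1"
url = build_csv_url(query)

# Invented sample response; a real response would come from the endpoint.
sample = "movie,title\nhttp://example.org/m1,Example Film\n"
rows = parse_rows(sample)
```

Each parsed row is keyed by the SELECT variable names, which is why the Cypher above can refer to `row.movie` and `row.title`.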
Integration #2: Loading RDF Data into Neo4j
Importing RDF via neosemantics (n10s)
A SPARQL DESCRIBE query returns triples
DESCRIBE <http://dbpedia.org/resource/Air_Jamaica>
With the help of n10s, we can use this capability from Cypher
call n10s.rdf.import.fetch("https://dbpedia.org/data/Air_Jamaica.ttl","Turtle")
One of the things Air Jamaica is connected to...
MATCH (aj:Resource { uri: "http://dbpedia.org/resource/Air_Jamaica" }),
(aj)<-[r:ns2__subsidiary]-(what)
RETURN what.uri
...is Caribbean Airlines
Now we can load the triples associated with Caribbean Airlines in a similar way.
call n10s.rdf.import.fetch("https://dbpedia.org/data/Caribbean_Airlines.ttl","Turtle")