GraphConnect San Francisco 2016 Schedule Graph
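Yey! It's that time of the year again! We are going full steam getting ready for the twice-yearly festival of graphs also known as GraphConnect. San Francisco is getting ready for another great conference. The entire Neo4j crew will be there in full force - and of course we had to create another Schedule Graph, just for fun. The fact that I spent 14 hours cramped in an airplane seat next to a person about my size has nothing to do with it. Nothing at all.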
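Google Sheet as the main repository
Of course I had to start from the schedule on the GraphConnect website and convert it into a Google Sheet holding all the data. That was a bit more work this time (thanks, HTML!), but hey - remember those 14 hours on the plane.
With that data in place, I could easily add it to a graph using this model: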

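Pretty simple - the only thing that is different in this version of the schedule graph is that we no longer have anything called a "track" or the like, but "topic tags" instead. A topic tag is basically like the labels or tags you use in apps such as Gmail or Evernote to indicate that something belongs to one or more categories.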
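To make that model a bit more tangible, here is a rough sketch of the shape of one session, written as a read-only Cypher pattern. The labels and relationship types are the ones used in the load statements further down, so the query only returns rows once that load has run. The `limit 5` is an arbitrary choice for illustration.

```cypher
// Sketch only: one session with its speaker, company, room, topic tag and timeslots.
// All labels and relationship types come from the load script in this graphgist.
match (p:Person)-[:SPEAKS_IN]->(s:Session)-[:HAS_TOPIC]->(tt:TopicTag),
      (p)-[:WORKS_FOR]->(c:Company),
      (s)-[:IN_ROOM]->(r:Room),
      (s)-[:STARTS_AT]->(t1:Time)-[:PART_OF]->(d:Day),
      (s)-[:ENDS_AT]->(t2:Time)-[:PART_OF]->(d)
// A session with several topic tags shows up as several rows, one per tag.
return p.name, c.name, s.name, r.name, tt.name, d.date, t1.time, t2.time
limit 5;
```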
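Of course, it gets a lot more fun when you can make it interactive and load it into Neo4j. So let's do that. Let's load the data into this graphgist. I have added a few comments to every step...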
//add the days - all two of them
load csv with headers from "https://docs.google.com/spreadsheets/u/0/d/1DfPZY4gpK1LjNhD8DSzPBBbHsQN1bqjnQQQZaEA1yQ4/export?format=csv&id=1DfPZY4gpK1LjNhD8DSzPBBbHsQN1bqjnQQQZaEA1yQ4&gid=0" as csv
merge (d:Day {date: toInt(csv.Day)});
//connect the days to one another
match (d:Day), (d2:Day)
where d.date = d2.date-1
merge (d)-[:PRECEDES]->(d2);
//add the rooms, topics, speakers and speaker's companies. Connect the speakers to their Companies.
load csv with headers from "https://docs.google.com/spreadsheets/u/0/d/1DfPZY4gpK1LjNhD8DSzPBBbHsQN1bqjnQQQZaEA1yQ4/export?format=csv&id=1DfPZY4gpK1LjNhD8DSzPBBbHsQN1bqjnQQQZaEA1yQ4&gid=0" as csv
with csv
where csv.Title is not null
merge (r:Room {name: csv.Room})
merge (p:Person {name: csv.Speaker, title: csv.Title})
set p.URL = csv.URL
merge (c:Company {name: csv.Company})
merge (p)-[:WORKS_FOR]->(c);
//add the start- and end-timeslots to each day
load csv with headers from "https://docs.google.com/spreadsheets/u/0/d/1DfPZY4gpK1LjNhD8DSzPBBbHsQN1bqjnQQQZaEA1yQ4/export?format=csv&id=1DfPZY4gpK1LjNhD8DSzPBBbHsQN1bqjnQQQZaEA1yQ4&gid=0" as csv
match (d:Day {date: toInt(csv.Day)})
merge (t1:Time {time: toInt(csv.Starttime)})-[:PART_OF]->(d)
merge (t2:Time {time: toInt(csv.Endtime)})-[:PART_OF]->(d);
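Now we are going to add all the different topic tags. This one is a bit special, because all the topic tags sit in the same "column" of the CSV file, separated by a "special character", §§. So I basically need to: (1) read them from the CSV column, (2) split them into a separate collection per session, (3) unwind them into individual topic tags, (4) remove the "empty" tags (because the CSV column ends with §§), and (5) add them to the graph.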
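Before running this against the real sheet, the split, unwind and filter steps can be tried on a single literal value. This is only a sketch: the string below is a made-up stand-in for one row's `csv.Type` value, and the tag names in it are hypothetical, not taken from the actual schedule.

```cypher
// Hypothetical sample value standing in for one row of the csv.Type column.
with "Keynote§§Graph Theory§§" as type
// Splitting on §§ leaves an empty string at the end, because the value ends with §§.
unwind split(type, "§§") as topictag
with topictag
// Drop that empty trailing entry before doing anything with the tags.
where topictag <> ""
return topictag;
```

Nothing is written to the graph here, so it is safe to run as often as you like while experimenting with the separator.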
So here we go.
//add all the different topictags
load csv with headers from "https://docs.google.com/spreadsheets/u/0/d/1DfPZY4gpK1LjNhD8DSzPBBbHsQN1bqjnQQQZaEA1yQ4/export?format=csv&id=1DfPZY4gpK1LjNhD8DSzPBBbHsQN1bqjnQQQZaEA1yQ4&gid=0" as csv
with split(csv.Type,"§§") as topictagcollection
unwind topictagcollection as topictags
with distinct topictags as topictag
where not topictag = ""
merge (tt:TopicTag {name: topictag})
return tt.name as First10TopicTags
order by tt.name ASC
limit 10;
As you can see, that gives me the right result.
Now we will continue connecting things up.
//add the sessions and connect them up
load csv with headers from "https://docs.google.com/spreadsheets/u/0/d/1DfPZY4gpK1LjNhD8DSzPBBbHsQN1bqjnQQQZaEA1yQ4/export?format=csv&id=1DfPZY4gpK1LjNhD8DSzPBBbHsQN1bqjnQQQZaEA1yQ4&gid=0" as csv
match (t2:Time {time: toInt(csv.Endtime)})-[:PART_OF]->(d:Day {date: toInt(csv.Day)})<-[:PART_OF]-(t1:Time {time: toInt(csv.Starttime)}), (r:Room {name: csv.Room}), (p:Person {name: csv.Speaker, title: csv.Title})
merge (s:Session {name: csv.Topic})
set s.description = csv.Comments
merge (s)<-[:SPEAKS_IN]-(p)
merge (s)-[:IN_ROOM]->(r)
merge (s)-[:STARTS_AT]->(t1)
merge (s)-[:ENDS_AT]->(t2);
//connect the sessions to topictags
load csv with headers from "https://docs.google.com/spreadsheets/u/0/d/1DfPZY4gpK1LjNhD8DSzPBBbHsQN1bqjnQQQZaEA1yQ4/export?format=csv&id=1DfPZY4gpK1LjNhD8DSzPBBbHsQN1bqjnQQQZaEA1yQ4&gid=0" as csv
with split(csv.Type,"§§") as topictagcollection, csv.Topic as session
unwind topictagcollection as topictag
with session, topictag
where not( topictag = "" )
match (s:Session {name: session}), (tt:TopicTag {name: topictag})
merge (s)-[:HAS_TOPIC]->(tt);
//Connecting the timeslots
match (t:Time)--(d:Day {date:20161013})
with t
order by t.time ASC
with collect(t) as times
foreach (i in range(0,length(times)-2) |
foreach (t1 in [times[i]] |
foreach (t2 in [times[i+1]] |
merge (t1)-[:FOLLOWED_BY]->(t2))));
match (t:Time)--(d:Day {date:20161014})
with t
order by t.time ASC
with collect(t) as times
foreach (i in range(0,length(times)-2) |
foreach (t1 in [times[i]] |
foreach (t2 in [times[i+1]] |
merge (t1)-[:FOLLOWED_BY]->(t2))));
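The nested foreach above is really just a trick to give `times[i]` and `times[i+1]` a name, so that merge can use them. As a sketch of an alternative, and assuming your Neo4j version supports `size()` on collections, the same chaining could be written with unwind. This version handles every day in one go instead of one hard-coded date at a time. Because it uses merge, running it after the statements above should not create duplicate relationships.

```cypher
// Sketch: chain the timeslots of each day in chronological order.
match (t:Time)-[:PART_OF]->(d:Day)
with d, t
order by t.time ASC
// One ordered collection of timeslots per day.
with d, collect(t) as times
// Produce one row per pair of neighbouring timeslots; a day with a single
// timeslot gives an empty range and therefore no rows.
unwind range(0, size(times) - 2) as i
with times[i] as t1, times[i+1] as t2
merge (t1)-[:FOLLOWED_BY]->(t2);
```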
Let's take a look at what we have now with the query below, which summarises a small random sample (roughly 10% of the nodes):
MATCH (n) where rand() <= 0.1
MATCH (n)-[r]->(m)
WITH n, type(r) as via, m
RETURN labels(n) as from,
reduce(keys = [], keys_n in collect(keys(n)) | keys + filter(k in keys_n WHERE NOT k IN keys)) as props_from,
via,
labels(m) as to,
reduce(keys = [], keys_m in collect(keys(m)) | keys + filter(k in keys_m WHERE NOT k IN keys)) as props_to,
count(*) as freq
OK - that gives us some idea. So let's try to zoom in a bit and run a simple query on our graph: let's find a few sessions on day 1.
match (d:Day {date:20161013})<--(t:Time)<--(s:Session)--(connections)
return d,t,s,connections
limit 50
This is a sample of the graph.
Let's do another query. Here are the shortest paths between my dear friend Jim Webber and Dan Murphy of the Financial Times:
match path = allshortestpaths( (p1:Person)-[*]-(p2:Person) )
where p1.name contains "MURPHY"
and p2.name contains "WEBBER"
return path
And display the result.
Now let's look at the links between a person (Jim Webber, of Neo fame) and an organisation (ICIJ, of Panama Papers fame).
match (c:Company {name:"ICIJ"}), (p:Person {name:"JIM WEBBER"}),
path = allshortestpaths( (c)-[*]-(p) )
return path
And again display the result.
One last one, just for fun: let's look at the sessions that have more than one speaker.
match (s:Session)-[r:SPEAKS_IN]-(p:Person)
with s, collect(p) as person, count(p) as count
where count > 1
return s,person
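And display it.
This is just the start...
There are plenty of other things we could look at. If you are interested, you can use the console below to explore.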
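As one possible starting point, here is a sketch of a query that counts sessions per topic tag. It uses only the labels and relationship types loaded above; the `limit 10` is an arbitrary cut-off chosen for this example.

```cypher
// Sketch: which topic tags are attached to the most sessions?
match (tt:TopicTag)<-[:HAS_TOPIC]-(s:Session)
return tt.name as topictag, count(s) as sessions
order by sessions desc
limit 10;
```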
I hope this graphgist was interesting for you, and that we will see each other soon.
This graphgist was created by Rik Van Bruggen