GraphGists

GraphConnect San Francisco 2016 Schedule Graph

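Yay! It's that time again! We are going full steam ahead preparing for the biannual festival of graphs also known as GraphConnect. San Francisco has another great conference in store for us. The entire Neo4j team will be there in full force - and of course we had to create another schedule graph, just for fun. This has nothing to do with the fact that I spent 14 hours in a cramped airplane seat next to someone about my size. Nothing at all.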

Google Sheet as the main repository

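Of course I had to start from the schedule on the GraphConnect website and convert it into a Google Sheet containing all the data. It was a bit more work this time (thanks, HTML!), but hey - remember those 14 hours on the plane.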

With that data in place, I could easily add it to the graph using this model:

[Image: the data model]

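Very simple - the only difference in this version of the schedule graph is that we no longer have something called a "track" or similar; instead we have "topic tags". A topic tag is basically like the labels or tags you use in apps such as Gmail or Evernote to indicate that something belongs to one or more categories.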

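Of course, it gets much more interesting once you make it interactive and load it into Neo4j. So let's do that: let's load the data into this GraphGist. I have added some comments to each step...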

//add the days - all two of them
load csv with headers from "https://docs.google.com/spreadsheets/u/0/d/1DfPZY4gpK1LjNhD8DSzPBBbHsQN1bqjnQQQZaEA1yQ4/export?format=csv&id=1DfPZY4gpK1LjNhD8DSzPBBbHsQN1bqjnQQQZaEA1yQ4&gid=0" as csv
merge (d:Day {date: toInt(csv.Day)});

//connect the days to one another
match (d:Day), (d2:Day)
where d.date = d2.date-1
merge (d)-[:PRECEDES]->(d2);
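
The day-linking query relies on the dates being stored as plain integers (such as 20161013), so that "the next day" is simply the date plus one. As a rough illustration outside the database, here is a small Python sketch of that same pairing rule; the function name `preceding_pairs` is made up for this example. It also shows a limitation worth keeping in mind: the trick only works while both days fall in the same month.

```python
def preceding_pairs(dates):
    """Pair every date with the date that is exactly one larger, if present.

    Dates are assumed to be integers in YYYYMMDD form, as in the queries.
    """
    date_set = set(dates)
    return sorted((d, d + 1) for d in date_set if d + 1 in date_set)


# The two conference days used in the queries are consecutive integers.
conference_days = preceding_pairs([20161013, 20161014])
print(conference_days)

# Across a month boundary the integers are no longer consecutive,
# so this rule would not link the two days.
month_boundary = preceding_pairs([20161031, 20161101])
print(month_boundary)
```

For a two-day conference in mid-October this is perfectly fine; a schedule spanning a month boundary would need real date arithmetic instead.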

//add the rooms, speakers and speakers' companies. Connect the speakers to their companies.
load csv with headers from "https://docs.google.com/spreadsheets/u/0/d/1DfPZY4gpK1LjNhD8DSzPBBbHsQN1bqjnQQQZaEA1yQ4/export?format=csv&id=1DfPZY4gpK1LjNhD8DSzPBBbHsQN1bqjnQQQZaEA1yQ4&gid=0" as csv
with csv
where csv.Title is not null
merge (r:Room {name: csv.Room})
merge (p:Person {name: csv.Speaker, title: csv.Title})
set p.URL = csv.URL
merge (c:Company {name: csv.Company})
merge (p)-[:WORKS_FOR]->(c);

//add the start- and end-timeslots to each day
load csv with headers from "https://docs.google.com/spreadsheets/u/0/d/1DfPZY4gpK1LjNhD8DSzPBBbHsQN1bqjnQQQZaEA1yQ4/export?format=csv&id=1DfPZY4gpK1LjNhD8DSzPBBbHsQN1bqjnQQQZaEA1yQ4&gid=0" as csv
match (d:Day {date: toInt(csv.Day)})
merge (t1:Time {time: toInt(csv.Starttime)})-[:PART_OF]->(d)
merge (t2:Time {time: toInt(csv.Endtime)})-[:PART_OF]->(d);

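Now we are going to add all the different topic tags. This is a bit special, because all the topic tags sit in the same "column" of the CSV file, separated by a "special character", §§. So I basically need to:

1. read them from the CSV column,
2. split them into a separate collection per session,
3. unwind them into individual topic tags,
4. remove the "empty" tags (because the CSV column ends with §§), and
5. add them to the graph.

So here we go.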

//add all the different topictags
load csv with headers from "https://docs.google.com/spreadsheets/u/0/d/1DfPZY4gpK1LjNhD8DSzPBBbHsQN1bqjnQQQZaEA1yQ4/export?format=csv&id=1DfPZY4gpK1LjNhD8DSzPBBbHsQN1bqjnQQQZaEA1yQ4&gid=0" as csv
with split(csv.Type,"§§") as topictagcollection
unwind topictagcollection as topictags
with distinct topictags as topictag
where not topictag = ""
merge (tt:TopicTag {name: topictag})
return tt.name as First10TopicTags
order by tt.name ASC
limit 10;

As you can see, this gives me the right result.
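
For readers who want to see the split, unwind, filter and de-duplicate steps without a database at hand, here is a minimal Python sketch of the same logic. The rows and tag names below are invented sample values (the real sheet's contents are not reproduced here), and `extract_topic_tags` is a hypothetical helper, not part of any library.

```python
# Invented sample rows; only the "Type" column layout (tags separated by
# "§§", with a trailing separator) follows what the queries describe.
rows = [
    {"Topic": "Session A", "Type": "Keynote§§Graph Visualization§§"},
    {"Topic": "Session B", "Type": "Graph Visualization§§Fraud Detection§§"},
]


def extract_topic_tags(rows, separator="§§"):
    """Split each Type cell, flatten, drop empty strings and de-duplicate."""
    tags = set()
    for row in rows:
        for tag in row["Type"].split(separator):
            # The trailing separator yields one empty string per cell.
            if tag != "":
                tags.add(tag)
    return sorted(tags)


topic_tags = extract_topic_tags(rows)
print(topic_tags)
```

The set plays the role of `distinct` (and of `merge`, which never creates the same tag twice), and the emptiness check mirrors the `where not topictag = ""` clause.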

Now we will continue connecting things up.

//add the sessions and connect them up
load csv with headers from "https://docs.google.com/spreadsheets/u/0/d/1DfPZY4gpK1LjNhD8DSzPBBbHsQN1bqjnQQQZaEA1yQ4/export?format=csv&id=1DfPZY4gpK1LjNhD8DSzPBBbHsQN1bqjnQQQZaEA1yQ4&gid=0" as csv
match (t2:Time {time: toInt(csv.Endtime)})-[:PART_OF]->(d:Day {date: toInt(csv.Day)})<-[:PART_OF]-(t1:Time {time: toInt(csv.Starttime)}), (r:Room {name: csv.Room}), (p:Person {name: csv.Speaker, title: csv.Title})
merge (s:Session {name: csv.Topic})
set s.description = csv.Comments
merge (s)<-[:SPEAKS_IN]-(p)
merge (s)-[:IN_ROOM]->(r)
merge (s)-[:STARTS_AT]->(t1)
merge (s)-[:ENDS_AT]->(t2);

//connect the sessions to topictags
load csv with headers from "https://docs.google.com/spreadsheets/u/0/d/1DfPZY4gpK1LjNhD8DSzPBBbHsQN1bqjnQQQZaEA1yQ4/export?format=csv&id=1DfPZY4gpK1LjNhD8DSzPBBbHsQN1bqjnQQQZaEA1yQ4&gid=0" as csv
with split(csv.Type,"§§") as topictagcollection, csv.Topic as session
unwind topictagcollection as topictag
with session, topictag
where not( topictag = "" )
match (s:Session {name: session}), (tt:TopicTag {name: topictag})
merge (s)-[:HAS_TOPIC]->(tt);

//Connecting the timeslots
match (t:Time)--(d:Day {date:20161013})
with t
order by t.time ASC
with collect(t) as times
  foreach (i in range(0,length(times)-2) |
    foreach (t1 in [times[i]] |
      foreach (t2 in [times[i+1]] |
        merge (t1)-[:FOLLOWED_BY]->(t2))));

match (t:Time)--(d:Day {date:20161014})
with t
order by t.time ASC
with collect(t) as times
  foreach (i in range(0,length(times)-2) |
    foreach (t1 in [times[i]] |
      foreach (t2 in [times[i+1]] |
        merge (t1)-[:FOLLOWED_BY]->(t2))));
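
The nested `foreach` above is a way of walking through the ordered timeslots and linking each one to its successor. A small Python sketch of the same idea, assuming the times are integers such as 900 or 1045 (the values here are made up, and `consecutive_pairs` is a hypothetical helper):

```python
def consecutive_pairs(times):
    """Sort the timeslots and pair each one with the slot that follows it."""
    ordered = sorted(times)
    # range(len(ordered) - 1) covers indices 0 .. n-2, matching
    # range(0, length(times)-2) in the Cypher, which is inclusive.
    return [(ordered[i], ordered[i + 1]) for i in range(len(ordered) - 1)]


# Made-up timeslots for a single day.
pairs = consecutive_pairs([1000, 900, 1045])
print(pairs)
```

Each pair corresponds to one `FOLLOWED_BY` relationship. With fewer than two timeslots the range is empty and nothing is linked, just as in the query.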

Let's take a look at what we have now, using the query below to take a small sample:

MATCH (n) where rand() <= 0.1
MATCH (n)-[r]->(m)
WITH n, type(r) as via, m
RETURN labels(n) as from,
   reduce(keys = [], keys_n in collect(keys(n)) | keys + filter(k in keys_n WHERE NOT k IN keys)) as props_from,
   via,
   labels(m) as to,
   reduce(keys = [], keys_m in collect(keys(m)) | keys + filter(k in keys_m WHERE NOT k IN keys)) as props_to,
   count(*) as freq

OK - that gives us some idea. So let's try to zoom in a bit and run a simple query on our graph: let's find a few sessions on day 1.

match (d:Day {date:20161013})<--(t:Time)<--(s:Session)--(connections)
return d,t,s,connections
limit 50

Here is a sample of the graph.

Let's do one more query. Here are the paths between my dear friend Jim Webber and Dan Murphy of the Financial Times:

match path = allshortestpaths( (p1:Person)-[*]-(p2:Person) )
where p1.name contains "MURPHY"
and p2.name contains "WEBBER"
return path

And show the result.

Now let's look at the links between a person (Jim Webber, of Neo fame) and an organisation (ICIJ, of Panama Papers fame).

match (c:Company {name:"ICIJ"}), (p:Person {name:"JIM WEBBER"}),
path = allshortestpaths( (c)-[*]-(p) )
return path

And again show the result.

One last one, just for fun: let's look at the sessions that have more than one speaker.

match (s:Session)-[r:SPEAKS_IN]-(p:Person)
with s, collect(p) as person, count(p) as count
where count > 1
return s,person

And show it.
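
The multi-speaker query is a plain group-and-filter: collect the speakers per session, then keep the sessions where the count is greater than one. Here is a rough Python sketch of that aggregation, using invented (speaker, session) pairs in place of the `SPEAKS_IN` relationships; `sessions_with_multiple_speakers` is a hypothetical name.

```python
from collections import defaultdict


def sessions_with_multiple_speakers(speaks_in):
    """Group speakers by session and keep sessions with more than one."""
    speakers_by_session = defaultdict(list)
    for person, session in speaks_in:
        speakers_by_session[session].append(person)
    return {
        session: people
        for session, people in speakers_by_session.items()
        if len(people) > 1
    }


# Invented (speaker, session) pairs standing in for SPEAKS_IN relationships.
speaks_in = [
    ("SPEAKER A", "Session 1"),
    ("SPEAKER B", "Session 1"),
    ("SPEAKER C", "Session 2"),
]
shared = sessions_with_multiple_speakers(speaks_in)
print(shared)
```

The dictionary of lists corresponds to `collect(p)`, and the length check to `where count > 1`.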

This is just the beginning...

There are many more things we could look at. If you are interested, you can use the console below to explore.

I hope this GraphGist was interesting for you, and that we will see each other soon.

This GraphGist was created by Rik Van Bruggen
