GraphGists

GraphConnect 旧金山 2016 时间表图

耶!一年一度的双年一度图节又到了!我们正在全力以赴准备双年一度的图节,也就是 GraphConnect。在旧金山,又有一场很棒的会议在等着我们。整个 Neo4j 团队都将全员到场 - 当然,我们也必须创建另一个时间表图 - 只是为了好玩。我们坐了 14 个小时的飞机,座位很紧,旁边坐着一个和我 **一样大** 的人,这与创建时间表图 **没有任何** 关系。一点都没有。

谷歌表格作为主要存储库

我必须从 GraphConnect 网站 上的时间表开始,并将其转换为 包含所有数据的谷歌表格。这次看来工作量更大了一些(感谢,HTML!),但嘿 - 飞机上坐了 14 个小时,记得吗。

有了表格后,我就可以使用这个模型轻松地添加数据了。

model

非常简单 - 这个时间表图版本与之前版本的唯一区别是,我们不再使用“轨道”或类似的东西,而是使用“TopicTags”。“TopicTags”基本上就像你在 GmailEvernote 等工具中使用的标签或标签一样 - 用于表明某件事属于一个或多个类别。

当然,当你能够将其变为交互式并将其加载到 Neo4j 中时,体验会更好。让我们来做一下。让我们将这些数据加载到这个图gist 中。我已经为每一步添加了一些注释……

//add the days - all two of them
load csv with headers from "https://docs.google.com/spreadsheets/u/0/d/1DfPZY4gpK1LjNhD8DSzPBBbHsQN1bqjnQQQZaEA1yQ4/export?format=csv&id=1DfPZY4gpK1LjNhD8DSzPBBbHsQN1bqjnQQQZaEA1yQ4&gid=0" as csv
merge (d:Day {date: toInt(csv.Day)});

//connect the days to one another
match (d:Day), (d2:Day)
where d.date = d2.date-1
merge (d)-[:PRECEDES]-(d2);

//add the rooms, topics, speakers and speaker's companies. Connect the speakers to their Companies.
load csv with headers from "https://docs.google.com/spreadsheets/u/0/d/1DfPZY4gpK1LjNhD8DSzPBBbHsQN1bqjnQQQZaEA1yQ4/export?format=csv&id=1DfPZY4gpK1LjNhD8DSzPBBbHsQN1bqjnQQQZaEA1yQ4&gid=0" as csv
with csv
where csv.Title is not null
merge (r:Room {name: csv.Room})
merge (p:Person {name: csv.Speaker, title: csv.Title})
set p.URL = csv.URL
merge (c:Company {name: csv.Company})
merge (p)-[:WORKS_FOR]->(c);

//add the start- and end-timeslots to each day
load csv with headers from "https://docs.google.com/spreadsheets/u/0/d/1DfPZY4gpK1LjNhD8DSzPBBbHsQN1bqjnQQQZaEA1yQ4/export?format=csv&id=1DfPZY4gpK1LjNhD8DSzPBBbHsQN1bqjnQQQZaEA1yQ4&gid=0" as csv
match (d:Day {date: toInt(csv.Day)})
merge (t1:Time {time: toInt(csv.Starttime)})-[:PART_OF]->(d)
merge (t2:Time {time: toInt(csv.Endtime)})-[:PART_OF]->(d);

现在,我们要添加所有不同的 TopicTags。这有点特殊,你会看到所有的 TopicTags 都在 CSV 文件的同一“列”中,并且它们以“特殊字符” **§§** 分隔。因此,我基本上需要 . 从 CSV 文件列中提取这些标签,. 将它们拆分为每个会话的单独集合,. 将它们 *展开* 为单独的 TopicTags,. 删除“空”标签(因为 CSV 列以 **§§** 结尾),. 将它们添加到图中。

所以让我们继续吧。

//add all the different topictags
load csv with headers from "https://docs.google.com/spreadsheets/u/0/d/1DfPZY4gpK1LjNhD8DSzPBBbHsQN1bqjnQQQZaEA1yQ4/export?format=csv&id=1DfPZY4gpK1LjNhD8DSzPBBbHsQN1bqjnQQQZaEA1yQ4&gid=0" as csv
with split(csv.Type,"§§") as topictagcollection
unwind topictagcollection as topictags
with distinct topictags as topictag
where not topictag = ""
merge (tt:TopicTag {name: topictag})
return tt.name as First10TopicTags
order by tt.name ASC
limit 10;

如你所见,这给了我正确的结果。

现在,我们将继续将各个部分连接起来。

//add the sessions and connect them up
load csv with headers from "https://docs.google.com/spreadsheets/u/0/d/1DfPZY4gpK1LjNhD8DSzPBBbHsQN1bqjnQQQZaEA1yQ4/export?format=csv&id=1DfPZY4gpK1LjNhD8DSzPBBbHsQN1bqjnQQQZaEA1yQ4&gid=0" as csv
match (t2:Time {time: toInt(csv.Endtime)})-[:PART_OF]->(d:Day {date: toInt(csv.Day)})<-[:PART_OF]-(t1:Time {time: toInt(csv.Starttime)}), (r:Room {name: csv.Room}), (p:Person {name: csv.Speaker, title: csv.Title})
merge (s:Session {name: csv.Topic})
set s.description = csv.Comments
merge (s)<-[:SPEAKS_IN]-(p)
merge (s)-[:IN_ROOM]->(r)
merge (s)-[:STARTS_AT]->(t1)
merge (s)-[:ENDS_AT]->(t2);

//connect the sessions to topictags
load csv with headers from "https://docs.google.com/spreadsheets/u/0/d/1DfPZY4gpK1LjNhD8DSzPBBbHsQN1bqjnQQQZaEA1yQ4/export?format=csv&id=1DfPZY4gpK1LjNhD8DSzPBBbHsQN1bqjnQQQZaEA1yQ4&gid=0" as csv
with split(csv.Type,"§§") as topictagcollection, csv.Topic as session
unwind topictagcollection as topictag
with session, topictag
where not( topictag = "" )
match (s:Session {name: session}), (tt:TopicTag {name: topictag})
merge (s)-[:HAS_TOPIC]->(tt);

//Connecting the timeslots
match (t:Time)--(d:Day {date:20161013})
with t
order by t.time ASC
with collect(t) as times
  foreach (i in range(0,length(times)-2) |
    foreach (t1 in [times[i]] |
      foreach (t2 in [times[i+1]] |
        merge (t1)-[:FOLLOWED_BY]->(t2))));

match (t:Time)--(d:Day {date:20161014})
with t
order by t.time ASC
with collect(t) as times
  foreach (i in range(0,length(times)-2) |
    foreach (t1 in [times[i]] |
      foreach (t2 in [times[i+1]] |
        merge (t1)-[:FOLLOWED_BY]->(t2))));

让我们使用以下查询来查看我们现在所拥有的内容,并获取一个小的样本。

MATCH (n) where rand() <= 0.1
MATCH (n)-[r]->(m)
WITH n, type(r) as via, m
RETURN labels(n) as from,
   reduce(keys = [], keys_n in collect(keys(n)) | keys + filter(k in keys_n WHERE NOT k IN keys)) as props_from,
   via,
   labels(m) as to,
   reduce(keys = [], keys_m in collect(keys(m)) | keys + filter(k in keys_m WHERE NOT k IN keys)) as props_to,
   count(*) as freq

好的 - 这给了我们一些见解。所以让我们尝试放大一点,并在我们的图上运行一个简单的查询:让我们在第一天找到几个会话。

match (d:Day {date:20161013})<--(t:Time)<--(s:Session)--(connections)
return d,t,s,connections
limit 50

这是图的样本。

让我们再进行一个查询。这是我亲爱的朋友 Jim WebberDan Murphy(来自 《金融时报》)之间的路径。

match path = allshortestpaths( (p1:Person)-[*]-(p2:Person) )
where p1.name contains "MURPHY"
and p2.name contains "WEBBER"
return path

并显示结果。

现在,让我们看一下一个人(Jim Webber,Neo 的知名人物)和一个组织(ICIJ巴拿马文件 的知名人物)之间的链接。

match (c:Company {name:"ICIJ"}), (p:Person {name:"JIM WEBBER"}),
path = allshortestpaths( (c)-[*]-(p) )
return path

并再次显示结果。

最后一个为了好玩:让我们看一下有多个演讲者的会话。

match (s:Session)-[r:SPEAKS_IN]-(p:Person)
with s, collect(p) as person, count(p) as count
where count > 1
return s,person

并显示它。

只是一个开始……

我们还有很多其他的东西可以查看。如果你有兴趣了解更多,请使用下面的控制台进行探索。

我希望这个 gist 对你来说很有趣,并且我们很快就会再见。

这个 gist 由 Rik Van Bruggen 创建。