使用 Stattleship Sports API 和 Neo4j 的 NHL 球队排名模型
介绍
作为体育赛道的一部分,这篇文章将使用他们的 R 包 从 Stattleship 的 API 收集数据,并将数据存储在 Neo4j 中。我想展示使用 Cypher 来根据 NHL 球队的赛季表现及其所参加的比赛来对它们进行排名是多么容易。查询 API 中的数据需要一个令牌,但您可以在 这里 注册一个。
为了便于练习,我们只使用一小部分可用数据。更具体地说,我只限制了大西洋赛区内的球队和比赛。
用于大学收集数据的 R 代码包含在文章末尾。在该脚本中,我将数据保存到本地 CSV 文件。为了使此练习可重复,我只需将数据集提交到 Github。
注意: 这篇文章中的数据包括截至 2016 年 1 月 15 日在大西洋赛区内完成的所有比赛。
数据模型
下图表示数据模型。

简而言之,存在两种类型的节点。Game
节点表示两个 Team
节点之间的比赛。每个球队通过 HOME_TEAM
或 AWAY_TEAM
关系类型连接到比赛。在这两种情况下,节点和关系都具有属性。Game
节点具有有关比赛的各种信息,而关系具有有关该球队在比赛中的表现的信息。
将数据加载到 Neo4j
首先,为球队的 id
属性设置约束(这是一个由 API 返回的值)
CREATE CONSTRAINT ON (n:Team) ASSERT n.id IS UNIQUE;
现在导入球队。
// Import the teams
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/Btibert3/neo4j-stattleship-nhl-team-ranking/master/data/teams.csv" as row
MERGE (t:Team {id:row.id, nickname:row.nickname, slug:row.slug})
现在让我们设置比赛约束。
CREATE CONSTRAINT ON (n:Game) ASSERT n.id IS UNIQUE;
并加载比赛。
// Import the games
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/Btibert3/neo4j-stattleship-nhl-team-ranking/master/data/games.csv" as row
MERGE (g:Game {id:row.id,
attendance:toInt(row.attendance),
year:toInt(row.year),
month:toInt(row.month),
day:toInt(row.day)});
最后,加入球队和比赛,并设置关系上的一些属性,这些属性我们将在后面使用。
// Relate the teams and games
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/Btibert3/neo4j-stattleship-nhl-team-ranking/master/data/games.csv" as row
WITH row
MATCH (g:Game {id:row.id})
MATCH (a:Team {id:row.away_team_id})
MATCH (h:Team {id:row.home_team_id})
WITH row, g, a, h
MERGE (a) -[:AWAY_TEAM {goals:toInt(row.away_team_score),
outcome:row.away_team_outcome,
win: CASE WHEN row.away_team_outcome='win' THEN 1 ELSE 0 END,
points: CASE WHEN row.away_team_outcome='win' THEN 2
WHEN row.away_team_outcome='overtime_loss' THEN 1
ELSE 0
END}]-> (g)
MERGE (h) -[:HOME_TEAM {goals:toInt(row.home_team_score),
outcome:row.home_team_outcome,
win: CASE WHEN row.home_team_outcome='win' THEN 1 ELSE 0 END,
points: CASE WHEN row.home_team_outcome='win' THEN 2
WHEN row.home_team_outcome='overtime_loss' THEN 1
ELSE 0
END}]-> (g);
探索数据
为了确保一切正常加载,让我们简单地按类型对节点进行计数。
MATCH (n)
RETURN labels(n) AS `Node Type`, count(*) AS total
因为有时图片更容易理解,所以这里是一小部分图。
MATCH (n)
RETURN n
LIMIT 15
我们甚至可以查看特定球队参加的比赛。下面,我将过滤布鲁因斯。
MATCH (t1:Team {nickname:'Bruins'})-[r1]->(g1)<-[r2]-(t2)
RETURN t1, r1, g1, r2, t2
计算球队的胜率很容易。
// win loss percentage for each team
MATCH (n:Team)-[r]->(g:Game)
WITH n.nickname as team, sum(r.win) as wins, count(r) as games
RETURN team, wins, games, (wins*1.0/games*1.0) as wpc
因为我们将在第二部分考虑主场和客场的表现,所以下面的查询按位置查看比赛。
MATCH (t:Team)-[r]->(x:Game)
WITH t.nickname as team, CASE TYPE(r) WHEN 'AWAY_TEAM' THEN 1 ELSE 0 END as away_game, CASE TYPE(r) WHEN 'HOME_TEAM' THEN 1 ELSE 0 END as home_game, r
RETURN team, sum(away_game) as away_games, sum(home_game) as home_games, count(r) as total_games
ORDER BY total_games DESC
最后,查看球队的净胜球数轻而易举。
MATCH (t1:Team)-[r1]->(g1:Game)<-[r2]-(opponent:Team)
WITH t1.nickname as team, sum(r1.goals) as score, sum(r2.goals) AS opp_score
RETURN team, score, opp_score, score-opp_score as score_diff
ORDER BY score_diff DESC
对球队进行排名
您可以通过多种方法对球队进行排名,但为了便于说明,我将使用与 这里描述的方法类似的方法。唯一的重大区别是,我将使用球队的净胜球数而不是胜负场次。
基本上,排名是 3 个因素的加权函数
-
25% = 球队净胜球数的加权组合,基于主场比赛和客场比赛(GD)。更多内容见下文。
-
50% = 球队的对手净胜球数(OGD)
-
25% = 球队的对手对手净胜球数的简单平均值(OOGD)
净胜球数 GD 还有一个额外的组成部分,即客场比赛(AGD)的权重为 1.4
,主场比赛(HGD)的权重为 .6
。
下面的查询通过遍历图来获取我们需要的信息,从而对球队进行排名。
MATCH (t1:Team)-[r1]->(g1:Game)<-[r2]-(opponent:Team)-[r3]->(g2:Game)<-[r4]-(t2:Team)-[r5]->(g3:Game)<-[r6]-(t3:Team)
WHERE NOT t1=t2 AND NOT t1=t3
WITH t1, sum(r6.goals) AS oop_gf, sum(r5.goals) as oop_ga
// goal differentials for team, and the teams they played (OPP)
MATCH (t1:Team)-[r1]->(g1:Game)<-[r2]-(opponent:Team)-[r3]->(g2:Game)<-[r4]-(t2:Team)
WHERE NOT t1=t2
WITH t1, sum(r3.goals) AS opp_gf, sum(r4.goals) as opp_ga, oop_gf, oop_ga
// goal differential for the team's opponents
MATCH (t1)-[r1]->(g1:Game)<-[r2]-(opponent:Team)
WHERE NOT t1=opponent
WITH t1.nickname as team,
sum(CASE WHEN type(r1) = "HOME_TEAM" THEN r1.goals ELSE 0 END) as home_gf,
sum(CASE WHEN type(r1) = "HOME_TEAM" THEN r2.goals ELSE 0 END) as home_ga,
sum(CASE WHEN type(r1) = "AWAY_TEAM" THEN r1.goals ELSE 0 END) as away_gf,
sum(CASE WHEN type(r1) = "AWAY_TEAM" THEN r2.goals ELSE 0 END) as away_ga,
sum(r1.goals) as gf,
sum(r2.goals) as ga,
opp_gf,
opp_ga,
oop_gf,
oop_ga
WITH team,
home_gf,
home_ga,
away_gf,
away_ga,
gf,
ga,
opp_gf,
opp_ga,
home_gf-home_ga as HGD,
away_gf-away_ga as AGD,
gf-ga as GD,
opp_gf-opp_ga as OGD,
oop_gf,
oop_ga,
oop_gf-oop_ga as OOGD
RETURN team, .25*(1.4*AGD + .6*HGD) + .5*(OGD) + .25*(OOGD) as rpi
ORDER BY rpi DESC
在上面,我们正在遍历图以隔离球队的主场和客场净胜球数、所有对手的净胜球数,甚至对手所参加的球队的净胜球数。只需进行几次遍历,再加上神奇的 WITH
子句,这种排名就变得易如反掌。
虽然排名相对于今天的实际排名来说没有直观的意义,但我们必须记住,我们只使用了大西洋赛区内进行的比赛,并且忽略了联盟中其他球队的实力。除了包括整个 NHL 之外,另一个修改可以只包括“最近”的比赛,或者只是简单地修改上面应用的权重。
注意: 在查询的结尾,您会注意到我使用 WITH
子句将结果链接在一起。我选择了一个更详细的过程,以便可以返回某些计算结果,这对调试查询很有帮助。
附录 1:从 Stattleship API 获取数据的 R 代码
###############################################################################
## Use Stattleship and Neo4j to Rank NHL Teams for the 2015-16 Season
## @brocktibert
###############################################################################
## factors are the devil
options(stringsAsFactors = FALSE)
## packages -- if errors, install with ?install.packages
library(stattleshipR)
library(dplyr)
library(lubridate)
library(stringr)
## set the token from an environment variable
set_token("STATTLE_TOKEN_HERE")
## parse out entries from ss_get_result when walk=T and length > 1
parse_stattle <- function(stattle_list, entry) {
x <- do.call("rbind", lapply(stattle_list, function(x) x[[entry]]))
stopifnot(is.data.frame(x))
return(x)
}
## The Atlantic Division id
atlantic_div <- "15ca4e46-2b49-4f37-84ea-befb62de28c5"
## get the NHL teams in the atlantic division and keep just the key fields
teams <- hockey_teams() %>%
filter(division_id == atlantic_div) %>%
select(id, nickname, slug)
## get the finished games, keep only those between atlantic division teams
all_games <- hockey_games(team_id="")
games <- filter(all_games,
!is.na(ended_at) &
home_team_id %in% teams$id &
away_team_id %in% teams$id)
games <- select(games,
id,
started_at,
scoreline,
home_team_id,
away_team_id,
winning_team_id,
attendance,
duration,
home_team_score,
away_team_score,
score_differential,
home_team_outcome,
away_team_outcome)
## extract dateparts from started date
games <- transform(games,
start_date = strptime(started_at, "%Y-%m-%dT%H:%M:%S"))
games <- transform(games,
year = year(start_date),
month = month(start_date),
day = day(start_date))
games$start_date <- NULL
games$started_at <- NULL
## save the csvs for import -- put on google drive for public access
write.table(teams, file="teams.csv", sep=",", row.names=F)
write.table(games, file="games.csv", sep=",", row.names=F)
此页面是否有用?