使用 Stattleship Sports API 和 Neo4j 的 NHL 球队排名模型
引言
作为体育方向的一部分,本文将使用Stattleship 的 API 及其R 包收集数据,并将数据存储在 Neo4j 中。我想演示使用 Cypher 根据 NHL 球队的赛季表现以及它们对阵过的球队来轻松排名球队。从 API 查询数据需要令牌,你可以在此处注册一个。
出于本次练习的目的,我们将只使用可用数据的一小部分。更具体地说,我将球队和比赛限制在 大西洋赛区 内。
收集数据的 R 代码包含在文章末尾。在该脚本中,我将数据保存到本地 CSV 文件。为了使本次练习可重现,我简单地将数据集提交到 Github。
注意:本文数据包含截至 2016 年 1 月 15 日在大西洋赛区内完成的所有比赛。
数据模型
下图展示了数据模型。

简而言之,节点有两种类型。Game
节点代表两个 Team
节点之间的比赛。每支球队通过 HOME_TEAM
或 AWAY_TEAM
关系类型连接到比赛。在这两种情况下,节点和关系都具有属性。Game
节点包含比赛的各种信息,而关系则包含该球队在比赛中的表现信息。
将数据加载到 Neo4j
首先,为球队的 id
属性设置约束(这是 API 返回的值)
CREATE CONSTRAINT ON (n:Team) ASSERT n.id IS UNIQUE;
现在导入球队。
// Import the teams
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/Btibert3/neo4j-stattleship-nhl-team-ranking/master/data/teams.csv" as row
MERGE (t:Team {id:row.id, nickname:row.nickname, slug:row.slug})
现在设置比赛约束。
CREATE CONSTRAINT ON (n:Game) ASSERT n.id IS UNIQUE;
并加载比赛。
// Import the games
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/Btibert3/neo4j-stattleship-nhl-team-ranking/master/data/games.csv" as row
MERGE (g:Game {id:row.id,
attendance:toInt(row.attendance),
year:toInt(row.year),
month:toInt(row.month),
day:toInt(row.day)});
最后,连接球队和比赛,并在稍后将使用的关系上设置一些属性。
// Relate the teams and games
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/Btibert3/neo4j-stattleship-nhl-team-ranking/master/data/games.csv" as row
WITH row
MATCH (g:Game {id:row.id})
MATCH (a:Team {id:row.away_team_id})
MATCH (h:Team {id:row.home_team_id})
WITH row, g, a, h
MERGE (a) -[:AWAY_TEAM {goals:toInt(row.away_team_score),
outcome:row.away_team_outcome,
win: CASE WHEN row.away_team_outcome='win' THEN 1 ELSE 0 END,
points: CASE WHEN row.away_team_outcome='win' THEN 2
WHEN row.away_team_outcome='overtime_loss' THEN 1
ELSE 0
END}]-> (g)
MERGE (h) -[:HOME_TEAM {goals:toInt(row.home_team_score),
outcome:row.home_team_outcome,
win: CASE WHEN row.home_team_outcome='win' THEN 1 ELSE 0 END,
points: CASE WHEN row.home_team_outcome='win' THEN 2
WHEN row.home_team_outcome='overtime_loss' THEN 1
ELSE 0
END}]-> (g);
探索数据
为了确保一切正常加载,让我们按类型简单计数节点。
MATCH (n)
RETURN labels(n) AS `Node Type`, count(*) AS total
因为有时图片更容易理解,这里是图的一个小子集。
MATCH (n)
RETURN n
LIMIT 15
我们甚至可以查看特定球队参加的比赛。下面,我将过滤 波士顿棕熊队 (Bruins)。
MATCH (t1:Team {nickname:'Bruins'})-[r1]->(g1)<-[r2]-(t2)
RETURN t1, r1, g1, r2, t2
计算球队的胜率很容易。
// win loss percentage for each team
MATCH (n:Team)-[r]->(g:Game)
WITH n.nickname as team, sum(r.win) as wins, count(r) as games
RETURN team, wins, games, (wins*1.0/games*1.0) as wpc
因为稍后我们将考虑主场和客场的表现,下面的查询按地点查看比赛。
MATCH (t:Team)-[r]->(x:Game)
WITH t.nickname as team, CASE TYPE(r) WHEN 'AWAY_TEAM' THEN 1 ELSE 0 END as away_game, CASE TYPE(r) WHEN 'HOME_TEAM' THEN 1 ELSE 0 END as home_game, r
RETURN team, sum(away_game) as away_games, sum(home_game) as home_games, count(r) as total_games
ORDER BY total_games DESC
最后,查看球队的净胜球数轻而易举。
MATCH (t1:Team)-[r1]->(g1:Game)<-[r2]-(opponent:Team)
WITH t1.nickname as team, sum(r1.goals) as score, sum(r2.goals) AS opp_score
RETURN team, score, opp_score, score-opp_score as score_diff
ORDER BY score_diff DESC
球队排名
有很多方法可以对球队进行排名,但出于本文的目的,我将使用一种类似于此处描述的方法。唯一的实质区别是,我将使用球队的净胜球数,而不是胜负场数。
基本上,排名是以下 3 个因素的加权函数
-
25% = 基于主场和客场比赛的球队净胜球数的加权组合(GD)。更多详情如下。
-
50% = 球队对手的净胜球数(OGD)
-
25% 球队对手的对手的净胜球数的简单平均值 (OOGD)
净胜球数 GD 还有一个附加组成部分,即客场比赛(AGD)权重为 1.4
,主场比赛(HGD)权重为 .6
。
下面的查询通过遍历图来捕获我们需要的信息,从而对球队进行排名。
MATCH (t1:Team)-[r1]->(g1:Game)<-[r2]-(opponent:Team)-[r3]->(g2:Game)<-[r4]-(t2:Team)-[r5]->(g3:Game)<-[r6]-(t3:Team)
WHERE NOT t1=t2 AND NOT t1=t3
WITH t1, sum(r6.goals) AS oop_gf, sum(r5.goals) as oop_ga
// goal differentials for team, and the teams they played (OPP)
MATCH (t1:Team)-[r1]->(g1:Game)<-[r2]-(opponent:Team)-[r3]->(g2:Game)<-[r4]-(t2:Team)
WHERE NOT t1=t2
WITH t1, sum(r3.goals) AS opp_gf, sum(r4.goals) as opp_ga, oop_gf, oop_ga
// goal differential for the team's opponents
MATCH (t1)-[r1]->(g1:Game)<-[r2]-(opponent:Team)
WHERE NOT t1=opponent
WITH t1.nickname as team,
sum(CASE WHEN type(r1) = "HOME_TEAM" THEN r1.goals ELSE 0 END) as home_gf,
sum(CASE WHEN type(r1) = "HOME_TEAM" THEN r2.goals ELSE 0 END) as home_ga,
sum(CASE WHEN type(r1) = "AWAY_TEAM" THEN r1.goals ELSE 0 END) as away_gf,
sum(CASE WHEN type(r1) = "AWAY_TEAM" THEN r2.goals ELSE 0 END) as away_ga,
sum(r1.goals) as gf,
sum(r2.goals) as ga,
opp_gf,
opp_ga,
oop_gf,
oop_ga
WITH team,
home_gf,
home_ga,
away_gf,
away_ga,
gf,
ga,
opp_gf,
opp_ga,
home_gf-home_ga as HGD,
away_gf-away_ga as AGD,
gf-ga as GD,
opp_gf-opp_ga as OGD,
oop_gf,
oop_ga,
oop_gf-oop_ga as OOGD
RETURN team, .25*(1.4*AGD + .6*HGD) + .5*(OGD) + .25*(OOGD) as rpi
ORDER BY rpi DESC
如上所示,我们正在遍历图以分离球队在主场和客场的净胜球数,所有对手的净胜球数,甚至对手对阵过的球队的净胜球数。只需几次遍历,以及出色的 WITH
子句,这种排名就很容易实现。
虽然这些排名相对于当前的实际排名可能不那么直观,但我们必须记住,我们只使用了大西洋赛区内的比赛,并且忽略了联盟中其他球队的实力。除了包含整个 NHL 外,另一种修改方式可能只包括“最近”的比赛,或者仅仅修改上面应用的权重。
注意:在查询的末尾,你会注意到我使用 WITH
子句将结果链接在一起。我选择了更详细的过程,以便返回某些计算结果,这有助于调试查询。
附录 1:从 Stattleship API 获取数据的 R 代码
###############################################################################
## Use Stattleship and Neo4j to Rank NHL Teams for the 2015-16 Season
## @brocktibert
###############################################################################
## factors are the devil
options(stringsAsFactors = FALSE)
## packages -- if errors, install with ?install.packages
library(stattleshipR)
library(dplyr)
library(lubridate)
library(stringr)
## set the token from an environment variable
set_token("STATTLE_TOKEN_HERE")
## parse out entries from ss_get_result when walk=T and length > 1
parse_stattle <- function(stattle_list, entry) {
x <- do.call("rbind", lapply(stattle_list, function(x) x[[entry]]))
stopifnot(is.data.frame(x))
return(x)
}
## The Atlantic Division id
atlantic_div <- "15ca4e46-2b49-4f37-84ea-befb62de28c5"
## get the NHL teams in the atlantic division and keep just the key fields
teams <- hockey_teams() %>%
filter(division_id == atlantic_div) %>%
select(id, nickname, slug)
## get the finished games, keep only those between atlantic division teams
all_games <- hockey_games(team_id="")
games <- filter(all_games,
!is.na(ended_at) &
home_team_id %in% teams$id &
away_team_id %in% teams$id)
games <- select(games,
id,
started_at,
scoreline,
home_team_id,
away_team_id,
winning_team_id,
attendance,
duration,
home_team_score,
away_team_score,
score_differential,
home_team_outcome,
away_team_outcome)
## extract dateparts from started date
games <- transform(games,
start_date = strptime(started_at, "%Y-%m-%dT%H:%M:%S"))
games <- transform(games,
year = year(start_date),
month = month(start_date),
day = day(start_date))
games$start_date <- NULL
games$started_at <- NULL
## save the csvs for import -- put on google drive for public access
write.table(teams, file="teams.csv", sep=",", row.names=F)
write.table(games, file="games.csv", sep=",", row.names=F)
此页面有帮助吗?