GraphGists

使用 Stattleship Sports API 和 Neo4j 的 NHL 球队排名模型

引言

作为体育方向的一部分,本文将使用Stattleship 的 API 及其R 包收集数据,并将数据存储在 Neo4j 中。我想演示使用 Cypher 根据 NHL 球队的赛季表现以及它们对阵过的球队来轻松排名球队。从 API 查询数据需要令牌,你可以在此处注册一个。

出于本次练习的目的,我们将只使用可用数据的一小部分。更具体地说,我将球队和比赛限制在 大西洋赛区 内。

收集数据的 R 代码包含在文章末尾。在该脚本中,我将数据保存到本地 CSV 文件。为了使本次练习可重现,我简单地将数据集提交到 Github。

注意:本文数据包含截至 2016 年 1 月 15 日在大西洋赛区内完成的所有比赛。

数据模型

下图展示了数据模型。

NHL Games data model

简而言之,节点有两种类型。Game 节点代表两个 Team 节点之间的比赛。每支球队通过 HOME_TEAMAWAY_TEAM 关系类型连接到比赛。在这两种情况下,节点和关系都具有属性。Game 节点包含比赛的各种信息,而关系则包含该球队在比赛中的表现信息。

将数据加载到 Neo4j

首先,为球队的 id 属性设置约束(这是 API 返回的值)

CREATE CONSTRAINT ON (n:Team) ASSERT n.id IS UNIQUE;

现在导入球队。

// Import the teams
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM  "https://raw.githubusercontent.com/Btibert3/neo4j-stattleship-nhl-team-ranking/master/data/teams.csv" as row
MERGE (t:Team {id:row.id, nickname:row.nickname, slug:row.slug})

现在设置比赛约束。

CREATE CONSTRAINT ON (n:Game) ASSERT n.id IS UNIQUE;

并加载比赛。

// Import the games
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/Btibert3/neo4j-stattleship-nhl-team-ranking/master/data/games.csv" as row
MERGE (g:Game {id:row.id,
	           attendance:toInt(row.attendance),
	           year:toInt(row.year),
	           month:toInt(row.month),
	           day:toInt(row.day)});

最后,连接球队和比赛,并在稍后将使用的关系上设置一些属性。

// Relate the teams and games
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/Btibert3/neo4j-stattleship-nhl-team-ranking/master/data/games.csv" as row
WITH row
MATCH (g:Game {id:row.id})
MATCH (a:Team {id:row.away_team_id})
MATCH (h:Team {id:row.home_team_id})
WITH row, g, a, h
MERGE (a) -[:AWAY_TEAM {goals:toInt(row.away_team_score),
	                    outcome:row.away_team_outcome,
	                    win: CASE WHEN row.away_team_outcome='win' THEN 1 ELSE 0 END,
	                    points: CASE WHEN row.away_team_outcome='win' THEN 2
	                                 WHEN row.away_team_outcome='overtime_loss' THEN 1
	                                 ELSE 0
	                             END}]-> (g)
MERGE (h) -[:HOME_TEAM {goals:toInt(row.home_team_score),
	                    outcome:row.home_team_outcome,
	                    win: CASE WHEN row.home_team_outcome='win' THEN 1 ELSE 0 END,
	                    points: CASE WHEN row.home_team_outcome='win' THEN 2
	                                 WHEN row.home_team_outcome='overtime_loss' THEN 1
	                                 ELSE 0
	                             END}]-> (g);

探索数据

为了确保一切正常加载,让我们按类型简单计数节点。

MATCH (n)
RETURN labels(n) AS `Node Type`, count(*) AS total

因为有时图片更容易理解,这里是图的一个小子集。

MATCH (n)
RETURN n
LIMIT 15

我们甚至可以查看特定球队参加的比赛。下面,我将过滤 波士顿棕熊队 (Bruins)。

MATCH (t1:Team {nickname:'Bruins'})-[r1]->(g1)<-[r2]-(t2)
RETURN t1, r1, g1, r2, t2

计算球队的胜率很容易。

// win loss percentage for each team
MATCH (n:Team)-[r]->(g:Game)
WITH n.nickname as team, sum(r.win) as wins, count(r) as games
RETURN team, wins, games, (wins*1.0/games*1.0) as wpc

因为稍后我们将考虑主场和客场的表现,下面的查询按地点查看比赛。

MATCH (t:Team)-[r]->(x:Game)
WITH t.nickname as team, CASE TYPE(r) WHEN 'AWAY_TEAM' THEN 1 ELSE 0 END as away_game, CASE TYPE(r) WHEN 'HOME_TEAM' THEN 1 ELSE 0 END as home_game, r
RETURN team, sum(away_game) as away_games, sum(home_game) as home_games, count(r) as total_games
ORDER BY total_games DESC

最后,查看球队的净胜球数轻而易举。

MATCH (t1:Team)-[r1]->(g1:Game)<-[r2]-(opponent:Team)
WITH t1.nickname as team, sum(r1.goals) as score, sum(r2.goals) AS opp_score
RETURN team, score, opp_score, score-opp_score as score_diff
ORDER BY score_diff DESC

球队排名

有很多方法可以对球队进行排名,但出于本文的目的,我将使用一种类似于此处描述的方法。唯一的实质区别是,我将使用球队的净胜球数,而不是胜负场数。

基本上,排名是以下 3 个因素的加权函数

  • 25% = 基于主场和客场比赛的球队净胜球数的加权组合(GD)。更多详情如下。

  • 50% = 球队对手的净胜球数(OGD

  • 25% 球队对手的对手的净胜球数的简单平均值 (OOGD)

净胜球数 GD 还有一个附加组成部分,即客场比赛(AGD)权重为 1.4,主场比赛(HGD)权重为 .6

下面的查询通过遍历图来捕获我们需要的信息,从而对球队进行排名。

MATCH (t1:Team)-[r1]->(g1:Game)<-[r2]-(opponent:Team)-[r3]->(g2:Game)<-[r4]-(t2:Team)-[r5]->(g3:Game)<-[r6]-(t3:Team)
WHERE NOT t1=t2 AND NOT t1=t3
WITH t1, sum(r6.goals) AS oop_gf, sum(r5.goals) as oop_ga
// goal differentials for team, and the teams they played (OPP)
MATCH (t1:Team)-[r1]->(g1:Game)<-[r2]-(opponent:Team)-[r3]->(g2:Game)<-[r4]-(t2:Team)
WHERE NOT t1=t2
WITH t1, sum(r3.goals) AS opp_gf, sum(r4.goals) as opp_ga, oop_gf, oop_ga
// goal differential for the team's opponents
MATCH (t1)-[r1]->(g1:Game)<-[r2]-(opponent:Team)
WHERE NOT t1=opponent
WITH t1.nickname as team,
     sum(CASE WHEN type(r1) = "HOME_TEAM" THEN r1.goals ELSE 0 END) as home_gf,
     sum(CASE WHEN type(r1) = "HOME_TEAM" THEN r2.goals ELSE 0 END) as home_ga,
     sum(CASE WHEN type(r1) = "AWAY_TEAM" THEN r1.goals ELSE 0 END) as away_gf,
     sum(CASE WHEN type(r1) = "AWAY_TEAM" THEN r2.goals ELSE 0 END) as away_ga,
     sum(r1.goals) as gf,
     sum(r2.goals) as ga,
     opp_gf,
     opp_ga,
     oop_gf,
     oop_ga
WITH team,
     home_gf,
     home_ga,
     away_gf,
     away_ga,
     gf,
     ga,
     opp_gf,
     opp_ga,
     home_gf-home_ga as HGD,
     away_gf-away_ga as AGD,
     gf-ga as GD,
     opp_gf-opp_ga as OGD,
     oop_gf,
     oop_ga,
     oop_gf-oop_ga as OOGD
RETURN team, .25*(1.4*AGD + .6*HGD) + .5*(OGD) + .25*(OOGD) as rpi
ORDER BY rpi DESC

如上所示,我们正在遍历图以分离球队在主场和客场的净胜球数,所有对手的净胜球数,甚至对手对阵过的球队的净胜球数。只需几次遍历,以及出色的 WITH 子句,这种排名就很容易实现。

虽然这些排名相对于当前的实际排名可能不那么直观,但我们必须记住,我们只使用了大西洋赛区内的比赛,并且忽略了联盟中其他球队的实力。除了包含整个 NHL 外,另一种修改方式可能只包括“最近”的比赛,或者仅仅修改上面应用的权重。

注意:在查询的末尾,你会注意到我使用 WITH 子句将结果链接在一起。我选择了更详细的过程,以便返回某些计算结果,这有助于调试查询。

附录 1:从 Stattleship API 获取数据的 R 代码

###############################################################################
## Use Stattleship and Neo4j to Rank NHL Teams for the 2015-16 Season
## @brocktibert
###############################################################################

## factors are the devil
options(stringsAsFactors = FALSE)

## packages -- if errors, install with ?install.packages
library(stattleshipR)
library(dplyr)
library(lubridate)
library(stringr)

## set the token from an environment variable
set_token("STATTLE_TOKEN_HERE")

## parse out entries from ss_get_result when walk=T and length > 1
parse_stattle <- function(stattle_list, entry) {
  x <- do.call("rbind", lapply(stattle_list, function(x) x[[entry]]))
  stopifnot(is.data.frame(x))
  return(x)
}

## The Atlantic Division id
atlantic_div <- "15ca4e46-2b49-4f37-84ea-befb62de28c5"

## get the NHL teams in the atlantic division and keep just the key fields
teams <- hockey_teams() %>%
  filter(division_id == atlantic_div) %>%
  select(id, nickname, slug)

## get the finished games, keep only those between atlantic division teams
all_games <- hockey_games(team_id="")
games <- filter(all_games,
                !is.na(ended_at) &
                  home_team_id %in% teams$id &
                  away_team_id %in% teams$id)
games <- select(games,
                id,
                started_at,
                scoreline,
                home_team_id,
                away_team_id,
                winning_team_id,
                attendance,
                duration,
                home_team_score,
                away_team_score,
                score_differential,
                home_team_outcome,
                away_team_outcome)

## extract dateparts from started date
games <- transform(games,
                   start_date = strptime(started_at, "%Y-%m-%dT%H:%M:%S"))
games <- transform(games,
                   year = year(start_date),
                   month = month(start_date),
                   day = day(start_date))
games$start_date <- NULL
games$started_at <- NULL

## save the csvs for import -- put on google drive for public access
write.table(teams, file="teams.csv", sep=",", row.names=F)
write.table(games, file="games.csv", sep=",", row.names=F)