GraphGists

使用 Stattleship Sports API 和 Neo4j 的 NHL 球队排名模型

介绍

作为体育赛道的一部分,这篇文章将使用他们的 R 包Stattleship 的 API 收集数据,并将数据存储在 Neo4j 中。我想展示使用 Cypher 来根据 NHL 球队的赛季表现及其所参加的比赛来对它们进行排名是多么容易。查询 API 中的数据需要一个令牌,但您可以在 这里 注册一个。

为了便于练习,我们只使用一小部分可用数据。更具体地说,我只限制了大西洋赛区内的球队和比赛。

用于大学收集数据的 R 代码包含在文章末尾。在该脚本中,我将数据保存到本地 CSV 文件。为了使此练习可重复,我只需将数据集提交到 Github。

注意: 这篇文章中的数据包括截至 2016 年 1 月 15 日在大西洋赛区内完成的所有比赛。

数据模型

下图表示数据模型。

NHL Games data model

简而言之,存在两种类型的节点。Game 节点表示两个 Team 节点之间的比赛。每个球队通过 HOME_TEAMAWAY_TEAM 关系类型连接到比赛。在这两种情况下,节点和关系都具有属性。Game 节点具有有关比赛的各种信息,而关系具有有关该球队在比赛中的表现的信息。

将数据加载到 Neo4j

首先,为球队的 id 属性设置约束(这是一个由 API 返回的值)

CREATE CONSTRAINT ON (n:Team) ASSERT n.id IS UNIQUE;

现在导入球队。

// Import the teams
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM  "https://raw.githubusercontent.com/Btibert3/neo4j-stattleship-nhl-team-ranking/master/data/teams.csv" as row
MERGE (t:Team {id:row.id, nickname:row.nickname, slug:row.slug})

现在让我们设置比赛约束。

CREATE CONSTRAINT ON (n:Game) ASSERT n.id IS UNIQUE;

并加载比赛。

// Import the games
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/Btibert3/neo4j-stattleship-nhl-team-ranking/master/data/games.csv" as row
MERGE (g:Game {id:row.id,
	           attendance:toInt(row.attendance),
	           year:toInt(row.year),
	           month:toInt(row.month),
	           day:toInt(row.day)});

最后,加入球队和比赛,并设置关系上的一些属性,这些属性我们将在后面使用。

// Relate the teams and games
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/Btibert3/neo4j-stattleship-nhl-team-ranking/master/data/games.csv" as row
WITH row
MATCH (g:Game {id:row.id})
MATCH (a:Team {id:row.away_team_id})
MATCH (h:Team {id:row.home_team_id})
WITH row, g, a, h
MERGE (a) -[:AWAY_TEAM {goals:toInt(row.away_team_score),
	                    outcome:row.away_team_outcome,
	                    win: CASE WHEN row.away_team_outcome='win' THEN 1 ELSE 0 END,
	                    points: CASE WHEN row.away_team_outcome='win' THEN 2
	                                 WHEN row.away_team_outcome='overtime_loss' THEN 1
	                                 ELSE 0
	                             END}]-> (g)
MERGE (h) -[:HOME_TEAM {goals:toInt(row.home_team_score),
	                    outcome:row.home_team_outcome,
	                    win: CASE WHEN row.home_team_outcome='win' THEN 1 ELSE 0 END,
	                    points: CASE WHEN row.home_team_outcome='win' THEN 2
	                                 WHEN row.home_team_outcome='overtime_loss' THEN 1
	                                 ELSE 0
	                             END}]-> (g);

探索数据

为了确保一切正常加载,让我们简单地按类型对节点进行计数。

MATCH (n)
RETURN labels(n) AS `Node Type`, count(*) AS total

因为有时图片更容易理解,所以这里是一小部分图。

MATCH (n)
RETURN n
LIMIT 15

我们甚至可以查看特定球队参加的比赛。下面,我将过滤布鲁因斯。

MATCH (t1:Team {nickname:'Bruins'})-[r1]->(g1)<-[r2]-(t2)
RETURN t1, r1, g1, r2, t2

计算球队的胜率很容易。

// win loss percentage for each team
MATCH (n:Team)-[r]->(g:Game)
WITH n.nickname as team, sum(r.win) as wins, count(r) as games
RETURN team, wins, games, (wins*1.0/games*1.0) as wpc

因为我们将在第二部分考虑主场和客场的表现,所以下面的查询按位置查看比赛。

MATCH (t:Team)-[r]->(x:Game)
WITH t.nickname as team, CASE TYPE(r) WHEN 'AWAY_TEAM' THEN 1 ELSE 0 END as away_game, CASE TYPE(r) WHEN 'HOME_TEAM' THEN 1 ELSE 0 END as home_game, r
RETURN team, sum(away_game) as away_games, sum(home_game) as home_games, count(r) as total_games
ORDER BY total_games DESC

最后,查看球队的净胜球数轻而易举。

MATCH (t1:Team)-[r1]->(g1:Game)<-[r2]-(opponent:Team)
WITH t1.nickname as team, sum(r1.goals) as score, sum(r2.goals) AS opp_score
RETURN team, score, opp_score, score-opp_score as score_diff
ORDER BY score_diff DESC

对球队进行排名

您可以通过多种方法对球队进行排名,但为了便于说明,我将使用与 这里描述的方法类似的方法。唯一的重大区别是,我将使用球队的净胜球数而不是胜负场次。

基本上,排名是 3 个因素的加权函数

  • 25% = 球队净胜球数的加权组合,基于主场比赛和客场比赛(GD)。更多内容见下文。

  • 50% = 球队的对手净胜球数(OGD

  • 25% = 球队的对手对手净胜球数的简单平均值(OOGD

净胜球数 GD 还有一个额外的组成部分,即客场比赛(AGD)的权重为 1.4,主场比赛(HGD)的权重为 .6

下面的查询通过遍历图来获取我们需要的信息,从而对球队进行排名。

MATCH (t1:Team)-[r1]->(g1:Game)<-[r2]-(opponent:Team)-[r3]->(g2:Game)<-[r4]-(t2:Team)-[r5]->(g3:Game)<-[r6]-(t3:Team)
WHERE NOT t1=t2 AND NOT t1=t3
WITH t1, sum(r6.goals) AS oop_gf, sum(r5.goals) as oop_ga
// goal differentials for team, and the teams they played (OPP)
MATCH (t1:Team)-[r1]->(g1:Game)<-[r2]-(opponent:Team)-[r3]->(g2:Game)<-[r4]-(t2:Team)
WHERE NOT t1=t2
WITH t1, sum(r3.goals) AS opp_gf, sum(r4.goals) as opp_ga, oop_gf, oop_ga
// goal differential for the team's opponents
MATCH (t1)-[r1]->(g1:Game)<-[r2]-(opponent:Team)
WHERE NOT t1=opponent
WITH t1.nickname as team,
     sum(CASE WHEN type(r1) = "HOME_TEAM" THEN r1.goals ELSE 0 END) as home_gf,
     sum(CASE WHEN type(r1) = "HOME_TEAM" THEN r2.goals ELSE 0 END) as home_ga,
     sum(CASE WHEN type(r1) = "AWAY_TEAM" THEN r1.goals ELSE 0 END) as away_gf,
     sum(CASE WHEN type(r1) = "AWAY_TEAM" THEN r2.goals ELSE 0 END) as away_ga,
     sum(r1.goals) as gf,
     sum(r2.goals) as ga,
     opp_gf,
     opp_ga,
     oop_gf,
     oop_ga
WITH team,
     home_gf,
     home_ga,
     away_gf,
     away_ga,
     gf,
     ga,
     opp_gf,
     opp_ga,
     home_gf-home_ga as HGD,
     away_gf-away_ga as AGD,
     gf-ga as GD,
     opp_gf-opp_ga as OGD,
     oop_gf,
     oop_ga,
     oop_gf-oop_ga as OOGD
RETURN team, .25*(1.4*AGD + .6*HGD) + .5*(OGD) + .25*(OOGD) as rpi
ORDER BY rpi DESC

在上面,我们正在遍历图以隔离球队的主场和客场净胜球数、所有对手的净胜球数,甚至对手所参加的球队的净胜球数。只需进行几次遍历,再加上神奇的 WITH 子句,这种排名就变得易如反掌。

虽然排名相对于今天的实际排名来说没有直观的意义,但我们必须记住,我们只使用了大西洋赛区内进行的比赛,并且忽略了联盟中其他球队的实力。除了包括整个 NHL 之外,另一个修改可以只包括“最近”的比赛,或者只是简单地修改上面应用的权重。

注意: 在查询的结尾,您会注意到我使用 WITH 子句将结果链接在一起。我选择了一个更详细的过程,以便可以返回某些计算结果,这对调试查询很有帮助。

附录 1:从 Stattleship API 获取数据的 R 代码

###############################################################################
## Use Stattleship and Neo4j to Rank NHL Teams for the 2015-16 Season
## @brocktibert
###############################################################################

## factors are the devil
options(stringsAsFactors = FALSE)

## packages -- if errors, install with ?install.packages
library(stattleshipR)
library(dplyr)
library(lubridate)
library(stringr)

## set the token from an environment variable
set_token("STATTLE_TOKEN_HERE")

## parse out entries from ss_get_result when walk=T and length > 1
parse_stattle <- function(stattle_list, entry) {
  x <- do.call("rbind", lapply(stattle_list, function(x) x[[entry]]))
  stopifnot(is.data.frame(x))
  return(x)
}

## The Atlantic Division id
atlantic_div <- "15ca4e46-2b49-4f37-84ea-befb62de28c5"

## get the NHL teams in the atlantic division and keep just the key fields
teams <- hockey_teams() %>%
  filter(division_id == atlantic_div) %>%
  select(id, nickname, slug)

## get the finished games, keep only those between atlantic division teams
all_games <- hockey_games(team_id="")
games <- filter(all_games,
                !is.na(ended_at) &
                  home_team_id %in% teams$id &
                  away_team_id %in% teams$id)
games <- select(games,
                id,
                started_at,
                scoreline,
                home_team_id,
                away_team_id,
                winning_team_id,
                attendance,
                duration,
                home_team_score,
                away_team_score,
                score_differential,
                home_team_outcome,
                away_team_outcome)

## extract dateparts from started date
games <- transform(games,
                   start_date = strptime(started_at, "%Y-%m-%dT%H:%M:%S"))
games <- transform(games,
                   year = year(start_date),
                   month = month(start_date),
                   day = day(start_date))
games$start_date <- NULL
games$started_at <- NULL

## save the csvs for import -- put on google drive for public access
write.table(teams, file="teams.csv", sep=",", row.names=F)
write.table(games, file="games.csv", sep=",", row.names=F)