GraphGists

三月疯狂推荐引擎

5nlEMrk

(三月疯狂)←[:MADE_SANE_WITH]-(Neo4j {version:2.2})

三月疯狂是社交环境中数据极客能够发挥作用的少数时刻之一。它是一个罕见的、记录良好的数据和流行文化相结合的事件。 沃伦·巴菲特的十亿美元赌注吸引了从华尔街量化师到硅谷工程师以及所有热爱棒球的人的关注。

一切都是相对的

对我来说,篮球与关系息息相关——当然,有些球队明显比其他球队强。但是,几乎总是存在某种相对的性能偏差。其中,球队表现比他们平均表现的预期要好或要差,这是由于多种因素的共同影响造成的,无论是球迷狂热程度令人恐惧的球队,还是一个能够解开你联盟领先球队的区域防守的控球后卫,或者一个长达数十年的竞争对手,激发了你的球员更加努力地拼搏。表现是相对的。 这些数据很难在一个赛季内追踪,而且经常难以跨越时间追踪。

其次,能够迭代该模型在编写查询和保持合理性能方面都很费力。我积累了前四个赛季(大约 50,000 场比赛)的数据。包括得分、地点、日期等等。我们可以轻松添加更细致的信息或更多历史数据——尽管出于统计原因,我们并没有这么做,只是因为这让我更轻松,所以我决定在我的模型中,这些“获胜能力关系”应该几乎每四年更新一次(因为现役球员毕业并离开)。我们稍后将详细介绍这些球队之间的“获胜能力”关系,这是根据毕达哥拉斯期望模型得出的。

步骤

步骤 1:从想法到图模型

我并不聪明。

但是,我有一些聪明的工具可以使用。其中最主要的是 Neo4j(Neo Technology 的图数据库)。因此,我开始我的所有图项目的方式——从我最常问的问题和一块白板(或者在这种情况下是纸)开始。

表 1. 想法
eT6VsHM

经过一些努力,它变成了

表 2. 想法 v2
zvTPnin

步骤 2:及时出现!

在将任何数据加载到 Neo4j 之前,我首先需要创建上面模型中显示的时间树。Neo4j 的一位天才工程师为我完成了大部分工作,并编写了一个简短的 Cypher 代码片段来生成我需要的时间模型。

我从他的博客文章中获取了这个代码片段,该文章介绍了如何创建时间树。

WITH range(2013, 2015) AS years, range(1,12) as months
FOREACH(year IN years |
  MERGE (y:Year {year: year})
  FOREACH(month IN months |
    CREATE (m:Month {month: month})
    MERGE (y)-[:HAS_MONTH]->(m)
    FOREACH(day IN (CASE
                      WHEN month IN [1,3,5,7,8,10,12] THEN range(1,31)
                      WHEN month = 2 THEN
                        CASE
                          WHEN year % 4 <> 0 THEN range(1,28)
                          WHEN year % 100 <> 0 THEN range(1,29)
                          WHEN year % 400 <> 0 THEN range(1,29)
                          ELSE range(1,28)
                        END
                      ELSE range(1,30)
                    END) |
      CREATE (d:Day {day: day})
      MERGE (m)-[:HAS_DAY]->(d))))
WITH *
MATCH (year:Year)-[:HAS_MONTH]->(month)-[:HAS_DAY]->(day)
WITH year,month,day
ORDER BY year.year, month.month, day.day
WITH collect(day) as days
FOREACH(i in RANGE(0, length(days)-2) |
    FOREACH(day1 in [days[i]] |
        FOREACH(day2 in [days[i+1]] |
            CREATE UNIQUE (day1)-[:NEXT]->(day2))));

结果如下

表 3. 我的时间模型
lTG2TKf

步骤 3:my.csv 到 Graph.db

Neo4j 附带一个非常强大的 ETL 工具,称为LOAD CSV。我们将使用它。我下载了一堆 NCAA 比分,然后偷偷地将从 Excel 电子表格中下载的数据转换为 CSV 格式。我已经将它们托管在一个公共 Dropbox 中,链接可以在上面链接的 GitHub 存储库中找到。我们将导入多个 CSV 文件,每个文件代表一个给定的赛季,然后根据球队名称将它们缝合在一起。因为这只是一个 Gist,所以我们只加载大约一千场比赛,而不是我们过去四个赛季的 50,000 场比赛。

//Loading 750 games from the 2015 Regular Season
LOAD CSV WITH HEADERS from 'https://dl.dropboxusercontent.com/u/313565755/ncaa2015.csv' AS line
WITH line, toINT(line.Year) as Year, toINT(line.Month) as Month, toINT(line.Day) as Day
LIMIT 750
WHERE line.ignore IS NOT NULL
MATCH (:Year {year:Year})-[:HAS_MONTH]->(:Month {month:Month})-[:HAS_DAY]->(t:Day {day:Day})
CREATE (game:Game {winner:line.winnerName})-[:OCCURED_ON]->(t)
WITH line, game
MERGE (team:Team {name:line.Team})
WITH line, game, team
MERGE(opp:Team {name:line.Opponent})
WITH line, game, team, opp
WITH line, toINT(line.opponentScore) as oppScore, toINT(line.teamScore) as teamScore, team, opp, game, toFLOAT(line.teamDiff) as teamDiff, toFLOAT(line.oppDiff) as oppDiff
CREATE (team)-[:PLAYED_IN {scored:teamScore, differential:teamDiff, location:line.teamLocation}]->(game)<-[:PLAYED_IN {scored:oppScore, differential:oppDiff, location:line.oppLocation}]-(opp)

步骤 4:历史、胜利和一点数学

我决定在每支球队之间创建一种名为:WINPOWER的关系,它基于棒球中的一个概念,称为毕达哥拉斯期望。它本质上是根据得分与失分来分配获胜概率。我在其中添加了一个衰减因子,使最近的比赛的权重高于很久以前的比赛。

表 4. 毕达哥拉斯期望
b0da4b28fff6eb1edcdc59da5cff7934
//Assigning Pythagorean Expectation
MATCH (a:Team)-[aa:PLAYED_IN]->(game)<-[bb:PLAYED_IN]-(b:Team)
WITH toFloat(aa.scored*aa.scored) as team2, toFloat(bb.scored*bb.scored) as opp2, game, a,b
//
//Remember that Pythagorean Expectation is (points_scored^2 / (points scored^2 + points allowed^2))
//
WITH ((team2)/(team2+opp2)) as PyEx, game,a,b
//
// tying the game to the correct day in our time tree
MATCH (game)-[:OCCURED_ON]->(day)<-[:HAS_DAY]-(month)<-[:HAS_MONTH]-(year)
//
//setting March 15th to the day before the tournament and calculating how many days ago the last game was played
WITH (365*2015 + 2015 /4 - 2015 /100 + 2015 /400 + 15 + (153*3+8)/5) as dayBeforeTournament,
(365*(year.year) + (year.year)/4 - (year.year)/100 + (year.year)/400 + (day.day) + (153*(month.month)+8)/5) as oldYear, PyEx,a,b
//
//assuming that "win relevance" decays linearly over 4 years or how long any part of those same teams are playing against one another represented by "weight"
//
WITH ((4*365.25)-(dayBeforeTournament-oldYear))/(4*365.25) as weight, PyEx, a, b
//
//adding up all of the weights*pythagorean expectation for a current win probability
//
WITH SUM(weight*PyEx) as winPower, a, b
//creating a new relationship that stores a team's given probability of defeating another team as of March 15th
//
MERGE (a)-[w:WINPOWER]->(b)
SET w.winPower = winPower;
UBcDDF6
MATCH (a)-[w:WINPOWER]->(b)
RETURN a.name as `Team 1`, w.winPower as `Winpower Against`, b.name as `Team 2`
ORDER BY w.winPower DESC
LIMIT 10

步骤 5:大奖

现在,这才是有趣的部分,让我们看看几场比赛,看看我们能否准确地预测谁会赢。

比赛 1:海军队对密歇根州立大学

//bracketmaker, bracketmaker, make me a billion
//
CREATE (g:SimulatedGame)
WITH g
MATCH (a:Team)-[x:WINPOWER]->(b:Team), (a)<-[y:WINPOWER]-(b)
WHERE a.name = 'Navy' AND b.name = 'Michigan St'
MERGE (a)-[:SIMULATED]->(g)<-[:SIMULATED]-(b)
WITH b,a,g,x,y,
CASE
WHEN x.winPower > y.winPower
THEN a.name
ELSE b.name
END AS winName
SET g.winner=winName
RETURN a.name AS `Team 1`, x.winPower AS `Team 1 WP`, b.name AS `Team 2`, y.winPower AS `Team 2 WP`,  g.winner AS `Winner of Matchup`
MATCH (a)-[:SIMULATED]->(g)<-[:SIMULATED]-(b)
WHERE a.name = 'Navy' AND b.name = 'Michigan St'
RETURN a, b, g;

很明显,我们看到密歇根州立大学是赢家。

比赛 2:肯塔基大学对 UT 阿灵顿大学

//bracketmaker, bracketmaker, make me a billion
//
MATCH (a)-[:PLAYED_IN]->(g:Game)<-[:PLAYED_IN]-(b)
WHERE a.name = 'Kentucky' AND b.name = 'UT Arlington'
RETURN a, g, b

糟了……没有结果。这意味着在我们的常规赛中,肯塔基大学和 UT 阿灵顿大学没有交手过(也没有在过去四年的 NCAA 锦标赛中交手过)。

那么,为什么不比较一下他们都打过的球队,然后取他们“获胜能力分数”的平均值或总和呢?例如,我们可以推断,如果肯塔基大学总是击败密歇根州立大学,而密歇根州立大学总是击败汉普顿大学,那么肯塔基大学很有可能击败汉普顿大学。我们可以轻松地比较所有共同的球队以及他们与这些球队的战绩,并得出谁应该获胜的良好猜测。在完整的数据库中,我们可以对 64 强所有比赛进行此操作,但在截断的“Graph Gist”数据库中,我精心挑选了一个示例来说明。

//What if they've never played each other?!
CREATE (g:SimulatedGame)
WITH g
MATCH (a:Team)-[aa:WINPOWER]->(intermediate:Team)<-[bb:WINPOWER]-(b:Team)
WHERE a.name = 'Kentucky' AND b.name = 'UT Arlington'
WITH g, a, aa, b, bb,
CASE
WHEN SUM(aa.winPower) > SUM(bb.winPower)
THEN a.name
ELSE b.name
END AS winName
SET g.winner=winName
WITH*
MERGE (a)-[:SIMULATED]->(g)<-[:SIMULATED]-(b)
WITH*
RETURN a.name AS `Team 1`, sum(aa.winPower) AS `Team 1 WP`, b.name AS `Team 2`, sum(bb.winPower) AS `Team 2 WP`, g.winner AS `Winner of Matchup`

结束

Z2t3Hee medium

参考

从此博客文章中下载原始数据集和所有代码片段:https://github.com/kvangundy/graph-graph-baby/

三月疯狂推荐引擎

Kevin Van Gundy <kevin@neotechnology.com> :neo4j-version: 2.3.0 :author: Kevin Van Gundy :twitter: @kevinvangundy

5nlEMrk

(三月疯狂)←[:MADE_SANE_WITH]-(Neo4j {version:2.2})

三月疯狂是社交环境中数据极客能够发挥作用的少数时刻之一。它是一个罕见的、记录良好的数据和流行文化相结合的事件。 沃伦·巴菲特的十亿美元赌注吸引了从华尔街量化师到硅谷工程师以及所有热爱棒球的人的关注。

一切都是相对的

对我来说,篮球与关系息息相关——当然,有些球队明显比其他球队强。但是,几乎总是存在某种相对的性能偏差。其中,球队表现比他们平均表现的预期要好或要差,这是由于多种因素的共同影响造成的,无论是球迷狂热程度令人恐惧的球队,还是一个能够解开你联盟领先球队的区域防守的控球后卫,或者一个长达数十年的竞争对手,激发了你的球员更加努力地拼搏。表现是相对的。 这些数据很难在一个赛季内追踪,而且经常难以跨越时间追踪。

其次,能够迭代该模型在编写查询和保持合理性能方面都很费力。我积累了前四个赛季(大约 50,000 场比赛)的数据。包括得分、地点、日期等等。我们可以轻松添加更细致的信息或更多历史数据——尽管出于统计原因,我们并没有这么做,只是因为这让我更轻松,所以我决定在我的模型中,这些“获胜能力关系”应该几乎每四年更新一次(因为现役球员毕业并离开)。我们稍后将详细介绍这些球队之间的“获胜能力”关系,这是根据毕达哥拉斯期望模型得出的。

步骤

步骤 1:从想法到图模型

我并不聪明。

但是,我有一些聪明的工具可以使用。其中最主要的是 Neo4j(Neo Technology 的图数据库)。因此,我开始我的所有图项目的方式——从我最常问的问题和一块白板(或者在这种情况下是纸)开始。

表 5. 想法
eT6VsHM

经过一些努力,它变成了

表 6. 想法 v2
zvTPnin

步骤 2:及时出现!

在将任何数据加载到 Neo4j 之前,我首先需要创建上面模型中显示的时间树。Neo4j 的一位天才工程师为我完成了大部分工作,并编写了一个简短的 Cypher 代码片段来生成我需要的时间模型。

我从他的博客文章中获取了这个代码片段,该文章介绍了如何创建时间树。

WITH range(2013, 2015) AS years, range(1,12) as months
FOREACH(year IN years |
  MERGE (y:Year {year: year})
  FOREACH(month IN months |
    CREATE (m:Month {month: month})
    MERGE (y)-[:HAS_MONTH]->(m)
    FOREACH(day IN (CASE
                      WHEN month IN [1,3,5,7,8,10,12] THEN range(1,31)
                      WHEN month = 2 THEN
                        CASE
                          WHEN year % 4 <> 0 THEN range(1,28)
                          WHEN year % 100 <> 0 THEN range(1,29)
                          WHEN year % 400 <> 0 THEN range(1,29)
                          ELSE range(1,28)
                        END
                      ELSE range(1,30)
                    END) |
      CREATE (d:Day {day: day})
      MERGE (m)-[:HAS_DAY]->(d))))
WITH *
MATCH (year:Year)-[:HAS_MONTH]->(month)-[:HAS_DAY]->(day)
WITH year,month,day
ORDER BY year.year, month.month, day.day
WITH collect(day) as days
FOREACH(i in RANGE(0, length(days)-2) |
    FOREACH(day1 in [days[i]] |
        FOREACH(day2 in [days[i+1]] |
            CREATE UNIQUE (day1)-[:NEXT]->(day2))));

结果如下

表 7. 我的时间模型
lTG2TKf

步骤 3:my.csv 到 Graph.db

Neo4j 附带一个非常强大的 ETL 工具,称为LOAD CSV。我们将使用它。我下载了一堆 NCAA 比分,然后偷偷地将从 Excel 电子表格中下载的数据转换为 CSV 格式。我已经将它们托管在一个公共 Dropbox 中,链接可以在上面链接的 GitHub 存储库中找到。我们将导入多个 CSV 文件,每个文件代表一个给定的赛季,然后根据球队名称将它们缝合在一起。因为这只是一个 Gist,所以我们只加载大约一千场比赛,而不是我们过去四个赛季的 50,000 场比赛。

//Loading 750 games from the 2015 Regular Season
LOAD CSV WITH HEADERS from 'https://dl.dropboxusercontent.com/u/313565755/ncaa2015.csv' AS line
WITH line, toINT(line.Year) as Year, toINT(line.Month) as Month, toINT(line.Day) as Day
LIMIT 750
WHERE line.ignore IS NOT NULL
MATCH (:Year {year:Year})-[:HAS_MONTH]->(:Month {month:Month})-[:HAS_DAY]->(t:Day {day:Day})
CREATE (game:Game {winner:line.winnerName})-[:OCCURED_ON]->(t)
WITH line, game
MERGE (team:Team {name:line.Team})
WITH line, game, team
MERGE(opp:Team {name:line.Opponent})
WITH line, game, team, opp
WITH line, toINT(line.opponentScore) as oppScore, toINT(line.teamScore) as teamScore, team, opp, game, toFLOAT(line.teamDiff) as teamDiff, toFLOAT(line.oppDiff) as oppDiff
CREATE (team)-[:PLAYED_IN {scored:teamScore, differential:teamDiff, location:line.teamLocation}]->(game)<-[:PLAYED_IN {scored:oppScore, differential:oppDiff, location:line.oppLocation}]-(opp)

步骤 4:历史、胜利和一点数学

我决定在每支球队之间创建一种名为:WINPOWER的关系,它基于棒球中的一个概念,称为毕达哥拉斯期望。它本质上是根据得分与失分来分配获胜概率。我在其中添加了一个衰减因子,使最近的比赛的权重高于很久以前的比赛。

表 8. 毕达哥拉斯期望
b0da4b28fff6eb1edcdc59da5cff7934
//Assigning Pythagorean Expectation
MATCH (a:Team)-[aa:PLAYED_IN]->(game)<-[bb:PLAYED_IN]-(b:Team)
WITH toFloat(aa.scored*aa.scored) as team2, toFloat(bb.scored*bb.scored) as opp2, game, a,b
//
//Remember that Pythagorean Expectation is (points_scored^2 / (points scored^2 + points allowed^2))
//
WITH ((team2)/(team2+opp2)) as PyEx, game,a,b
//
// tying the game to the correct day in our time tree
MATCH (game)-[:OCCURED_ON]->(day)<-[:HAS_DAY]-(month)<-[:HAS_MONTH]-(year)
//
//setting March 15th to the day before the tournament and calculating how many days ago the last game was played
WITH (365*2015 + 2015 /4 - 2015 /100 + 2015 /400 + 15 + (153*3+8)/5) as dayBeforeTournament,
(365*(year.year) + (year.year)/4 - (year.year)/100 + (year.year)/400 + (day.day) + (153*(month.month)+8)/5) as oldYear, PyEx,a,b
//
//assuming that "win relevance" decays linearly over 4 years or how long any part of those same teams are playing against one another represented by "weight"
//
WITH ((4*365.25)-(dayBeforeTournament-oldYear))/(4*365.25) as weight, PyEx, a, b
//
//adding up all of the weights*pythagorean expectation for a current win probability
//
WITH SUM(weight*PyEx) as winPower, a, b
//creating a new relationship that stores a team's given probability of defeating another team as of March 15th
//
MERGE (a)-[w:WINPOWER]->(b)
SET w.winPower = winPower;
UBcDDF6
MATCH (a)-[w:WINPOWER]->(b)
RETURN a.name as `Team 1`, w.winPower as `Winpower Against`, b.name as `Team 2`
ORDER BY w.winPower DESC
LIMIT 10

步骤 5:大奖

现在,这才是有趣的部分,让我们看看几场比赛,看看我们能否准确地预测谁会赢。

比赛 1:海军队对密歇根州立大学

//bracketmaker, bracketmaker, make me a billion
//
CREATE (g:SimulatedGame)
WITH g
MATCH (a:Team)-[x:WINPOWER]->(b:Team), (a)<-[y:WINPOWER]-(b)
WHERE a.name = 'Navy' AND b.name = 'Michigan St'
MERGE (a)-[:SIMULATED]->(g)<-[:SIMULATED]-(b)
WITH b,a,g,x,y,
CASE
WHEN x.winPower > y.winPower
THEN a.name
ELSE b.name
END AS winName
SET g.winner=winName
RETURN a.name AS `Team 1`, x.winPower AS `Team 1 WP`, b.name AS `Team 2`, y.winPower AS `Team 2 WP`,  g.winner AS `Winner of Matchup`
MATCH (a)-[:SIMULATED]->(g)<-[:SIMULATED]-(b)
WHERE a.name = 'Navy' AND b.name = 'Michigan St'
RETURN a, b, g;

很明显,我们看到密歇根州立大学是赢家。

比赛 2:肯塔基大学对 UT 阿灵顿大学

//bracketmaker, bracketmaker, make me a billion
//
MATCH (a)-[:PLAYED_IN]->(g:Game)<-[:PLAYED_IN]-(b)
WHERE a.name = 'Kentucky' AND b.name = 'UT Arlington'
RETURN a, g, b

糟了……没有结果。这意味着在我们的常规赛中,肯塔基大学和 UT 阿灵顿大学没有交手过(也没有在过去四年的 NCAA 锦标赛中交手过)。

那么,为什么不比较一下他们都打过的球队,然后取他们“获胜能力分数”的平均值或总和呢?例如,我们可以推断,如果肯塔基大学总是击败密歇根州立大学,而密歇根州立大学总是击败汉普顿大学,那么肯塔基大学很有可能击败汉普顿大学。我们可以轻松地比较所有共同的球队以及他们与这些球队的战绩,并得出谁应该获胜的良好猜测。在完整的数据库中,我们可以对 64 强所有比赛进行此操作,但在截断的“Graph Gist”数据库中,我精心挑选了一个示例来说明。

//What if they've never played each other?!
CREATE (g:SimulatedGame)
WITH g
MATCH (a:Team)-[aa:WINPOWER]->(intermediate:Team)<-[bb:WINPOWER]-(b:Team)
WHERE a.name = 'Kentucky' AND b.name = 'UT Arlington'
WITH g, a, aa, b, bb,
CASE
WHEN SUM(aa.winPower) > SUM(bb.winPower)
THEN a.name
ELSE b.name
END AS winName
SET g.winner=winName
WITH*
MERGE (a)-[:SIMULATED]->(g)<-[:SIMULATED]-(b)
WITH*
RETURN a.name AS `Team 1`, sum(aa.winPower) AS `Team 1 WP`, b.name AS `Team 2`, sum(bb.winPower) AS `Team 2 WP`, g.winner AS `Winner of Matchup`

结束

Z2t3Hee medium

参考

从此博客文章中下载原始数据集和所有代码片段:https://github.com/kvangundy/graph-graph-baby/