GraphGists

使用 Neo4j 分析足球转会

使用 Neo4j 分析足球转会

transfermarkt

在本指南中,我们将了解如何使用 Neo4j 导入和分析足球转会。

模型

这是我们将要构建的转会图模型

footballtransfer-model

查看数据

我们已将一些转会数据写入 CSV 文件,让我们快速查看前几行。运行以下查询

WITH 'https://s3-eu-west-1.amazonaws.com/football-transfers.neo4j.com/transfers-all.csv' AS url
LOAD CSV WITH HEADERS FROM url AS row
RETURN row
LIMIT 5

此查询返回 CSV 文件的前五行。

导入数据:约束

在导入任何数据之前,我们将创建一些约束,以确保我们不会向数据库插入重复记录。

CREATE CONSTRAINT ON (player:Player)
ASSERT player.id IS UNIQUE;
CREATE CONSTRAINT ON (club:Club)
ASSERT club.id IS UNIQUE;
CREATE CONSTRAINT ON (transfer:Transfer)
ASSERT transfer.id IS UNIQUE;
CREATE CONSTRAINT ON (country:Country)
ASSERT country.name IS UNIQUE;

让我们检查它们是否都已创建

CALL db.constraints()

看起来不错。开始导入数据吧。

导入数据:球员

首先,我们将导入球员。运行以下查询

USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'https://s3-eu-west-1.amazonaws.com/football-transfers.neo4j.com/transfers-all.csv' AS row
MERGE (player:Player {id: row.playerUri})
ON CREATE SET player.name =  row.playerName, player.position = row.playerPosition

让我们检查球员是否正确导入。运行以下查询

MATCH (player:Player)
RETURN player
LIMIT 25

看起来不错!

导入数据:球员国家

接下来,我们将导入这些球员代表的国家。运行以下查询

WITH 'https://s3-eu-west-1.amazonaws.com/football-transfers.neo4j.com/transfers-all.csv' AS url
LOAD CSV WITH HEADERS FROM url AS row
WITH row WHERE row.playerNationality <> ''
WITH DISTINCT row.playerNationality AS nationality
MERGE (country:Country {name: nationality })
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'https://s3-eu-west-1.amazonaws.com/football-transfers.neo4j.com/transfers-all.csv' AS row
WITH row WHERE row.playerNationality <> ''
MATCH (player:Player {id: row.playerUri})
MATCH (country:Country {name: row.playerNationality })
MERGE (player)-[:PLAYS_FOR]->(country)

导入数据:俱乐部

接下来,我们将导入俱乐部。俱乐部的数据结构非常有趣。我们有买方和卖方俱乐部,因此最简单的方法是编写两个导入查询,每个一个

WITH 'https://s3-eu-west-1.amazonaws.com/football-transfers.neo4j.com/transfers-all.csv' AS url
LOAD CSV WITH HEADERS FROM url AS row

WITH row where row.sellerClubUri <> ''
MERGE (club:Club {id: row.sellerClubUri})
ON CREATE SET club.name = row.sellerClubName
MERGE (country:Country {name: row.sellerClubCountry})
MERGE (club)-[:PART_OF]->(country)
WITH 'https://s3-eu-west-1.amazonaws.com/football-transfers.neo4j.com/transfers-all.csv' AS url
LOAD CSV WITH HEADERS FROM url AS row
WITH row where row.buyerClubUri <> ''
MERGE (club:Club {id: row.buyerClubUri})
ON CREATE SET club.name = row.buyerClubName
MERGE (country:Country {name: row.buyerClubCountry})
MERGE (club)-[:PART_OF]->(country)

编写两个查询来完成这项工作有点麻烦。也许还有其他方法…​

导入数据:俱乐部 II

确实有!我们的好朋友 UNWIND 可以在这里帮助我们

WITH 'https://s3-eu-west-1.amazonaws.com/football-transfers.neo4j.com/transfers-all.csv' AS url
LOAD CSV WITH HEADERS FROM url AS row
UNWIND [
  {uri: row.sellerClubUri, name: row.sellerClubName, country: row.sellerClubCountry},
  {uri: row.buyerClubUri,  name: row.buyerClubName,  country: row.buyerClubCountry}
] AS club
WITH club WHERE club.uri <> ''
WITH DISTINCT club
MERGE (c:Club {id: club.uri})
ON CREATE SET c.name = club.name
MERGE (country:Country {name: club.country })
MERGE (c)-[:PART_OF]->(country)

导入数据:转会

USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'https://s3-eu-west-1.amazonaws.com/football-transfers.neo4j.com/transfers-all.csv' AS row
MATCH (player:Player {id: row.playerUri})
MATCH (source:Club {id: row.sellerClubUri})
MATCH (destination:Club {id: row.buyerClubUri})
MERGE (t:Transfer {id: row.transferUri})
ON CREATE SET t.season = row.season,
              t.fee = row.transferFee,
              t.timestamp = toInteger(row.timestamp)
MERGE (t)-[ofPlayer:OF_PLAYER]->(player) SET ofPlayer.age = row.playerAge
MERGE (t)-[:FROM_CLUB]->(source)
MERGE (t)-[:TO_CLUB]->(destination);

现在是时候清理转会费了,以便我们可以更轻松地比较不同的转会。

清理数据:转会费

让我们看看目前的转会费是什么样的。运行以下查询

MATCH (transfer:Transfer)
RETURN transfer.fee, COUNT(*) AS occurrences
ORDER BY occurrences DESC
LIMIT 100

这里有很多不同的值,但看起来如果值是"?"或"-",那么我们就不知道实际的转会费是多少。

让我们看看是否有任何有效转会使用了这些值。运行以下查询

MATCH (t:Transfer)
WHERE t.fee contains "?" or t.fee  contains "-"
RETURN t.fee, count(*)

看起来没有!让我们排除这些转会

MATCH (t:Transfer)
WHERE t.fee contains "?" or t.fee  contains "-"
REMOVE t:Transfer
SET t:TransferWithoutFee

标记租借转会

在我们的数据集中,有一些转会是球员暂时在球队之间转移。这是一个使用第二个标签的好时机。让我们为这些转会添加标签 Loan

MATCH (t:Transfer)
WHERE t.fee STARTS WITH 'Loan'
SET t:Loan

清理数据:转会费

现在剩下的都是我们可以转换为数字值的值了。

MATCH (transfer:Transfer)
RETURN transfer.fee, COUNT(*) AS occurrences
ORDER BY occurrences DESC
LIMIT 100

让我们添加一个新属性,为每个转会费设置一个数字值。运行以下查询

MATCH (t:Transfer)
WITH t, replace(replace(replace(replace(t.fee, "k", ""), "m", ""), "Loan fee:", ""), "£", "") AS rawNumeric
WITH t,
CASE
 WHEN t.fee ENDS WITH "k" THEN toFloat(apoc.number.exact.mul(trim(rawNumeric),"1000"))
 WHEN trim(t.fee) IN ["Free transfer", "ablösefrei ", "gratuito", "free", "free transfer", "Ablösefrei", "transfervrij", "ablöserei", "Free Transfer", "Libre", "gratutito", "ablsöefrei", "ablösefrei", "ablösefei", "abösefrei", "Loan", "draft", "Swap deal", "trade", "ablösefrei", "ablösefreei", "Free", "ablosefrei", "Draft", "Trade", "Libre para traspaso", "bez odstępnego", "ablossefrei", "Bez odstępnego", "Gratuito", "ablödefrei", "Bonservissiz", "ablösfrei", "ablõsefrei", "ablösefre", "custo zero", "ablösefrei!", "ablösefreo", "svincolato", "Ablösfrei", "livre", "libre", "Leihe", "abolsfrei", "ablösefrai", "ablösefreil", "abllösefrei", "abölsefrei", "ablöserfrei", "abklösefrei", "ablöaefrei", "Ablosefrei", "Nessuno", "ablösesfrei", "Free Tranfer", "abblösefrei", "Spielertausch", "ablösebrei", "abslösefrei", "spielertausch", "a", "ablöseferi", "ablöserfei", "Tausch"] THEN 0
 WHEN NOT(exists(t.fee)) THEN 0
 WHEN rawNumeric = '' THEN 0
 WHEN t.fee ENDS WITH "m" THEN toFloat(apoc.number.exact.mul(trim(rawNumeric),"1000000"))
 ELSE toFloat(trim(rawNumeric))
END AS numericFee
SET t.numericFee = numericFee

还有一些转会留有令人恼火的值,所以让我们把它们从数据集中排除。

MATCH (t:Transfer)
WHERE not exists(t.numericFee)
REMOVE t:Transfer
SET t:TransferWithoutFee

清理数据:浮点数

您可能已经注意到,我们使用 APOC 函数 apoc.number.exact.mul 来乘以转会费,但为什么我们不能直接在纯 Cypher 中进行计算呢?

浮点数的乐趣!

WITH "8.37" as rawNumeric
RETURN toFloat(rawNumeric) * 1000000 AS numeric

我们期望得到 8370000,但却没有!让我们尝试使用 APOC 函数进行相同的计算

WITH "8.37" as rawNumeric
RETURN apoc.number.exact.mul(rawNumeric,"1000000") AS apocConversion

这可以,但它仍然是一个字符串,所以我们需要将其转换回数字值

WITH "8.37" as rawNumeric
RETURN toFloat(apoc.number.exact.mul(rawNumeric,"1000000")) AS apocConversion

现在我们准备查询图了。

顶级转会

我们将首先找出最昂贵的转会。

MATCH (transfer:Transfer)-[:OF_PLAYER]->(player),
      (from)<-[:FROM_CLUB]-(transfer)-[:TO_CLUB]->(to)
RETURN player.name, from.name, to.name, transfer.numericFee
ORDER BY transfer.numericFee DESC
LIMIT 10

球队转会

现在让我们缩小范围,找到涉及特定球队的转会

MATCH (from:Club)<-[:FROM_CLUB]-(transfer:Transfer)-[:TO_CLUB]->(to:Club),
      (transfer)-[:OF_PLAYER]->(player)
WHERE from.name = "FC Barcelona"
RETURN player.name, to.name, transfer.numericFee, transfer.season
ORDER BY transfer.numericFee DESC

脱欧意味着脱欧

在脱欧的这些日子里,我们可以编写一个查询来查找英格兰球员在英格兰球队之间的转会

MATCH (to:Club)<-[:TO_CLUB]-(t:Transfer)-[:FROM_CLUB]-(from:Club),
      (t)-[:OF_PLAYER]->(player:Player)-[:PLAYS_FOR]->(country),
      (to)-[:PART_OF]->(country:Country)<-[:PART_OF]-(from)
WHERE country.name = "England"
RETURN player.name, from.name, to.name, t.numericFee, t.season
ORDER BY t.numericFee DESC
LIMIT 10

我们也可以轻松更改国家并查看其他国家的转会。

转会足迹最大的球员

我们还可以编写一个聚合查询来查找在他们身上花费最多钱的球员

MATCH (t:Transfer)-[:OF_PLAYER]->(p:Player)
WITH p, sum(t.numericFee) as moneyTrace, COUNT(*) AS numberOfTransfers
RETURN p.name, apoc.number.format(moneyTrace), numberOfTransfers
ORDER BY moneyTrace desc
LIMIT 10

相邻转会

到目前为止,我们的查询还没有特别图化。我们执行了一些包含多次连接的查询,但尚未真正利用图的力量。

一种方法是在涉及单个球员的相邻转会之间引入 NEXT 关系。运行以下查询

MATCH (p:Player)<-[:OF_PLAYER]-(transfer)
WHERE transfer.numericFee > 0

WITH p, transfer
ORDER BY p.name, transfer.timestamp

WITH p, collect(transfer) AS transfers
WHERE size(transfers) > 1
UNWIND range(0, size(transfers)-2) AS idx

WITH transfers[idx] AS t1, transfers[idx+1] AS t2
MERGE (t1)-[:NEXT]->(t2)

最大利润

现在我们可以编写一个查询来查看哪个俱乐部在一名球员身上赚取了最多利润

MATCH (p:Player)<-[:OF_PLAYER]-(t1)-[:NEXT]->(t2),
      (initial)<-[:FROM_CLUB]-(t1)-[:TO_CLUB]->(club1)<-[:FROM_CLUB]-(t2)-[:TO_CLUB]->(club2)
WHERE none(t in [t1, t2] where t:Loan)
RETURN p.name as player, club1.name AS profitMaker , initial.name as buysFrom, club2.name AS sellsTo, t2.numericFee - t1.numericFee as profit, (t2.timestamp - t1.timestamp) / 60 / 60 / 24 AS daysAtClub
ORDER BY profit DESC

每日利润

然后我们可以进一步计算出一名球员被俱乐部拥有的每一天赚取了多少利润

MATCH (p:Player)<-[:OF_PLAYER]-(t1)-[:NEXT]->(t2),
      (club0)<-[:FROM_CLUB]-(t1)-[:TO_CLUB]->(club1)<-[:FROM_CLUB]-(t2)-[:TO_CLUB]->(club2)
WHERE none(t in [t1, t2] where t:Loan)
WITH p, club1.name AS profitMaker, club0.name AS buysFrom, club2.name AS sellsTo, t2.numericFee - t1.numericFee as profit, (t2.timestamp - t1.timestamp) / 60 / 60 / 24 AS daysAtClub
RETURN p.name AS player, profitMaker, buysFrom, sellsTo, profit, daysAtClub, profit / daysAtClub AS profitPerDay
ORDER BY profitPerDay DESC

这显示了一些看起来相当奇怪的转会!Andrea Bertolacci 似乎在连续两天内被转会了两次,利润巨大。大概这些转会都是事先安排好的吧?

循环转会

我们还可以找到回到最初卖出他们的俱乐部的球员。运行以下查询

MATCH (p:Player)<-[:OF_PLAYER]-(t:Transfer)
MATCH path = (t)-[:NEXT*]->(t2)
MATCH (t)-[:FROM_CLUB]->(club)<-[:TO_CLUB]-(t2)
WHERE none(t in [t, t2] where t:Loan)
WITH p, t.numericFee - t2.numericFee AS profit, [transfer in nodes(path) | [(from)<-[:FROM_CLUB]-(transfer)-[:TO_CLUB]->(to) | from.name + "->" + to.name][0]] AS transfers, path
RETURN p.name, apoc.number.format(profit), transfers, (nodes(path)[-1].timestamp - nodes(path)[0].timestamp) / 60 / 60 / 24 AS days
ORDER BY profit DESC

资金流向

特定赛季俱乐部之间的资金流向是怎样的?

MATCH (t:Transfer)
WITH DISTINCT t.season AS season
MATCH (seller)<-[:FROM_CLUB]-(t:Transfer)-[:TO_CLUB]->(buyer)
WHERE t.season = season AND t.numericFee > 0
WITH season, seller, buyer, sum(t.numericFee) AS cash_flow, count(t) AS player_count
RETURN buyer, cash_flow, player_count, season, seller
ORDER BY cash_flow DESC
LIMIT 10

让我们将其持久化到图中,这样我们就不需要每次都重新计算了

MATCH (t:Transfer)
WITH DISTINCT t.season AS season
MATCH (seller)<-[:FROM_CLUB]-(t:Transfer)-[:TO_CLUB]->(buyer)
WHERE t.season = season AND t.numericFee > 0
WITH season, seller, buyer, sum(t.numericFee) AS cash_flow, count(t) AS player_count
MERGE (buyer)-[:CASH_FLOW {total: cash_flow, playerCount: player_count, season: season}]->(seller)

在 2010/2011 赛季,哪些俱乐部从皇家马德里获得了资金?

这种新的关系类型使我们能够找出特定赛季资金的流向——例如,我们可以看到皇家马德里在 2010/2011 赛季从哪些俱乐部购买了球员。

MATCH path = (buyer:Club)-[:CASH_FLOW]->(seller:Club)
WHERE buyer.name = "Real Madrid" AND all(f in relationships(path) WHERE f.season="2010/2011")
RETURN *

转会金额最大的交易

我们可以编写一个查询来找出哪些俱乐部参与了最大的转会金额交易

MATCH (buyer:Club)-[f:CASH_FLOW]->(seller:Club)
RETURN f.season AS season, buyer.name AS buyer, seller.name AS seller, f.total AS amount, f.playerCount AS playersTransferred
ORDER BY amount DESC limit 10

看到曼城/摩纳哥出现在其中令人惊讶,这两支球队之间转会的是谁?

MATCH (from:Club)<-[:FROM_CLUB]-(transfer:Transfer)-[:TO_CLUB]->(to:Club),
      (transfer)-[:OF_PLAYER]->(player)
WHERE from.name = "Monaco" AND to.name = "Man City"
RETURN player.name, to.name, transfer.numericFee, transfer.season
ORDER BY transfer.numericFee DESC

特定赛季的最大金额交易

我们还可以深入查看特定赛季,只关注该赛季的资金流向

MATCH (buyer:Club)-[f:CASH_FLOW]->(seller:Club)
WHERE f.season = "2016/2017"
RETURN f.season AS season, buyer.name AS buyer, seller.name AS seller, f.total AS amount, f.playerCount AS playersTransferred
ORDER BY amount DESC limit 10
MATCH (from:Club)<-[:FROM_CLUB]-(transfer:Transfer)-[:TO_CLUB]->(to:Club),
      (transfer)-[:OF_PLAYER]->(player)
WHERE from.name = "Valencia CF" AND to.name = "FC Barcelona"  AND transfer.season = "2016/2017"
RETURN player.name, to.name, transfer.numericFee, transfer.season
ORDER BY transfer.numericFee DESC

结束

bugs-bunny-the-end