使用 Neo4j 分析足球转会
查看数据
我们已经将一些转会信息写入了一个 CSV 文件,让我们快速浏览一下前几行。运行以下查询。
WITH 'https://s3-eu-west-1.amazonaws.com/football-transfers.neo4j.com/transfers-all.csv' AS url
LOAD CSV WITH HEADERS FROM url AS row
RETURN row
LIMIT 5
此查询返回 CSV 文件的前五行。
导入数据:约束
在导入任何内容之前,我们将创建一些约束,以确保我们不会将重复记录插入数据库。
CREATE CONSTRAINT ON (player:Player)
ASSERT player.id IS UNIQUE;
CREATE CONSTRAINT ON (club:Club)
ASSERT club.id IS UNIQUE;
CREATE CONSTRAINT ON (transfer:Transfer)
ASSERT transfer.id IS UNIQUE;
CREATE CONSTRAINT ON (country:Country)
ASSERT country.name IS UNIQUE;
让我们检查一下它们是否都已创建。
CALL db.constraints()
看起来不错。让我们导入数据。
导入数据:球员
首先,我们将导入球员。运行以下查询。
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'https://s3-eu-west-1.amazonaws.com/football-transfers.neo4j.com/transfers-all.csv' AS row
MERGE (player:Player {id: row.playerUri})
ON CREATE SET player.name = row.playerName, player.position = row.playerPosition
让我们检查一下球员是否正确导入。运行以下查询。
MATCH (player:Player)
RETURN player
LIMIT 25
看起来不错!
导入数据:球员国籍
接下来,我们将导入这些球员代表的国家。运行以下查询。
WITH 'https://s3-eu-west-1.amazonaws.com/football-transfers.neo4j.com/transfers-all.csv' AS url
LOAD CSV WITH HEADERS FROM url AS row
WITH row WHERE row.playerNationality <> ''
WITH DISTINCT row.playerNationality AS nationality
MERGE (country:Country {name: nationality })
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'https://s3-eu-west-1.amazonaws.com/football-transfers.neo4j.com/transfers-all.csv' AS row
WITH row WHERE row.playerNationality <> ''
MATCH (player:Player {id: row.playerUri})
MATCH (country:Country {name: row.playerNationality })
MERGE (player)-[:PLAYS_FOR]->(country)
导入数据:俱乐部
接下来,我们将导入俱乐部。俱乐部的结构方式很有趣。我们有买入和卖出俱乐部,因此最简单的方法是编写两个导入查询,分别针对每个。
WITH 'https://s3-eu-west-1.amazonaws.com/football-transfers.neo4j.com/transfers-all.csv' AS url
LOAD CSV WITH HEADERS FROM url AS row
WITH row where row.sellerClubUri <> ''
MERGE (club:Club {id: row.sellerClubUri})
ON CREATE SET club.name = row.sellerClubName
MERGE (country:Country {name: row.sellerClubCountry})
MERGE (club)-[:PART_OF]->(country)
WITH 'https://s3-eu-west-1.amazonaws.com/football-transfers.neo4j.com/transfers-all.csv' AS url
LOAD CSV WITH HEADERS FROM url AS row
WITH row where row.buyerClubUri <> ''
MERGE (club:Club {id: row.buyerClubUri})
ON CREATE SET club.name = row.buyerClubName
MERGE (country:Country {name: row.buyerClubCountry})
MERGE (club)-[:PART_OF]->(country)
必须编写两个查询来完成这项工作有点麻烦。也许还有其他方法…
导入数据:俱乐部 II
确实有!我们的好朋友 UNWIND
可以帮我们解决这个问题。
WITH 'https://s3-eu-west-1.amazonaws.com/football-transfers.neo4j.com/transfers-all.csv' AS url
LOAD CSV WITH HEADERS FROM url AS row
UNWIND [
{uri: row.sellerClubUri, name: row.sellerClubName, country: row.sellerClubCountry},
{uri: row.buyerClubUri, name: row.buyerClubName, country: row.buyerClubCountry}
] AS club
WITH club WHERE club.uri <> ''
WITH DISTINCT club
MERGE (c:Club {id: club.uri})
ON CREATE SET c.name = club.name
MERGE (country:Country {name: club.country })
MERGE (c)-[:PART_OF]->(country)
导入数据:转会
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'https://s3-eu-west-1.amazonaws.com/football-transfers.neo4j.com/transfers-all.csv' AS row
MATCH (player:Player {id: row.playerUri})
MATCH (source:Club {id: row.sellerClubUri})
MATCH (destination:Club {id: row.buyerClubUri})
MERGE (t:Transfer {id: row.transferUri})
ON CREATE SET t.season = row.season,
t.fee = row.transferFee,
t.timestamp = toInteger(row.timestamp)
MERGE (t)-[ofPlayer:OF_PLAYER]->(player) SET ofPlayer.age = row.playerAge
MERGE (t)-[:FROM_CLUB]->(source)
MERGE (t)-[:TO_CLUB]->(destination);
现在是时候清理转会费,以便我们能够更轻松地比较不同的转会。
清理数据:转会费
让我们看看目前转会费是什么样的。运行以下查询。
MATCH (transfer:Transfer)
RETURN transfer.fee, COUNT(*) AS occurrences
ORDER BY occurrences DESC
LIMIT 100
这里有许多不同的值,但看起来如果值为 "?" 或 "-",那么我们不知道实际的转会费是多少。
让我们看看是否有任何有效的转会包含这些值。运行以下查询。
MATCH (t:Transfer)
WHERE t.fee contains "?" or t.fee contains "-"
RETURN t.fee, count(*)
看起来没有!让我们排除这些转会。
MATCH (t:Transfer)
WHERE t.fee contains "?" or t.fee contains "-"
REMOVE t:Transfer
SET t:TransferWithoutFee
标记租借转会
在我们的数据集中,有些转会是球员在球队之间临时转会。这是一个使用第二个标签的好时机。让我们为这些转会添加标签 Loan
。
MATCH (t:Transfer)
WHERE t.fee STARTS WITH 'Loan'
SET t:Loan
清理数据:转会费
现在,我们剩下的都是可以转换为数值的所有值。
MATCH (transfer:Transfer)
RETURN transfer.fee, COUNT(*) AS occurrences
ORDER BY occurrences DESC
LIMIT 100
让我们添加一个新的属性,该属性为每个转会费提供一个数值。运行以下查询。
MATCH (t:Transfer)
WITH t, replace(replace(replace(replace(t.fee, "k", ""), "m", ""), "Loan fee:", ""), "£", "") AS rawNumeric
WITH t,
CASE
WHEN t.fee ENDS WITH "k" THEN toFloat(apoc.number.exact.mul(trim(rawNumeric),"1000"))
WHEN trim(t.fee) IN ["Free transfer", "ablösefrei ", "gratuito", "free", "free transfer", "Ablösefrei", "transfervrij", "ablöserei", "Free Transfer", "Libre", "gratutito", "ablsöefrei", "ablösefrei", "ablösefei", "abösefrei", "Loan", "draft", "Swap deal", "trade", "ablösefrei", "ablösefreei", "Free", "ablosefrei", "Draft", "Trade", "Libre para traspaso", "bez odstępnego", "ablossefrei", "Bez odstępnego", "Gratuito", "ablödefrei", "Bonservissiz", "ablösfrei", "ablõsefrei", "ablösefre", "custo zero", "ablösefrei!", "ablösefreo", "svincolato", "Ablösfrei", "livre", "libre", "Leihe", "abolsfrei", "ablösefrai", "ablösefreil", "abllösefrei", "abölsefrei", "ablöserfrei", "abklösefrei", "ablöaefrei", "Ablosefrei", "Nessuno", "ablösesfrei", "Free Tranfer", "abblösefrei", "Spielertausch", "ablösebrei", "abslösefrei", "spielertausch", "a", "ablöseferi", "ablöserfei", "Tausch"] THEN 0
WHEN NOT(exists(t.fee)) THEN 0
WHEN rawNumeric = '' THEN 0
WHEN t.fee ENDS WITH "m" THEN toFloat(apoc.number.exact.mul(trim(rawNumeric),"1000000"))
ELSE toFloat(trim(rawNumeric))
END AS numericFee
SET t.numericFee = numericFee
还有一些转会留下了令人讨厌的值,让我们将它们从数据集中排除。
MATCH (t:Transfer)
WHERE not exists(t.numericFee)
REMOVE t:Transfer
SET t:TransferWithoutFee
清理数据:浮点数
您可能已经注意到我们使用 APOC 函数 apoc.number.exact.mul
来乘以转会费,但为什么我们不能在纯 Cypher 中进行计算呢?
浮点运算有趣!
WITH "8.37" as rawNumeric
RETURN toFloat(rawNumeric) * 1000000 AS numeric
我们期望返回 8370000
,但没有!让我们尝试使用 APOC 函数进行相同的计算。
WITH "8.37" as rawNumeric
RETURN apoc.number.exact.mul(rawNumeric,"1000000") AS apocConversion
这有效,但它仍然是一个字符串,因此我们需要将其转换回数值。
WITH "8.37" as rawNumeric
RETURN toFloat(apoc.number.exact.mul(rawNumeric,"1000000")) AS apocConversion
现在,我们准备查询图。
最昂贵的转会
我们将从查找最昂贵的转会开始。
MATCH (transfer:Transfer)-[:OF_PLAYER]->(player),
(from)<-[:FROM_CLUB]-(transfer)-[:TO_CLUB]->(to)
RETURN player.name, from.name, to.name, transfer.numericFee
ORDER BY transfer.numericFee DESC
LIMIT 10
来自球队的转会
现在,让我们缩小范围,查找涉及特定球队的转会。
MATCH (from:Club)<-[:FROM_CLUB]-(transfer:Transfer)-[:TO_CLUB]->(to:Club),
(transfer)-[:OF_PLAYER]->(player)
WHERE from.name = "FC Barcelona"
RETURN player.name, to.name, transfer.numericFee, transfer.season
ORDER BY transfer.numericFee DESC
脱欧意味着脱欧
在当今的脱欧时代,我们可以编写一个查询来查找英格兰球员在英格兰球队之间的转会。
MATCH (to:Club)<-[:TO_CLUB]-(t:Transfer)-[:FROM_CLUB]-(from:Club),
(t)-[:OF_PLAYER]->(player:Player)-[:PLAYS_FOR]->(country),
(to)-[:PART_OF]->(country:Country)<-[:PART_OF]-(from)
WHERE country.name = "England"
RETURN player.name, from.name, to.name, t.numericFee, t.season
ORDER BY t.numericFee DESC
LIMIT 10
我们也可以轻松地更改国家,查看其他国家/地区的转会。
转会影响最大的球员
我们还可以编写一个聚合查询,以查找花费最多转会费的球员。
MATCH (t:Transfer)-[:OF_PLAYER]->(p:Player)
WITH p, sum(t.numericFee) as moneyTrace, COUNT(*) AS numberOfTransfers
RETURN p.name, apoc.number.format(moneyTrace), numberOfTransfers
ORDER BY moneyTrace desc
LIMIT 10
相邻转会
到目前为止,我们的查询并不是特别图形化。我们执行了一些包含多个连接的查询,但我们还没有真正利用图的强大功能。
我们可以通过在涉及单个球员的相邻转会之间引入 NEXT
关系来实现这一点。运行以下查询。
MATCH (p:Player)<-[:OF_PLAYER]-(transfer)
WHERE transfer.numericFee > 0
WITH p, transfer
ORDER BY p.name, transfer.timestamp
WITH p, collect(transfer) AS transfers
WHERE size(transfers) > 1
UNWIND range(0, size(transfers)-2) AS idx
WITH transfers[idx] AS t1, transfers[idx+1] AS t2
MERGE (t1)-[:NEXT]->(t2)
赚取利润最多
现在,我们可以编写一个查询来查看哪个俱乐部在球员身上赚取了最多的利润。
MATCH (p:Player)<-[:OF_PLAYER]-(t1)-[:NEXT]->(t2),
(initial)<-[:FROM_CLUB]-(t1)-[:TO_CLUB]->(club1)<-[:FROM_CLUB]-(t2)-[:TO_CLUB]->(club2)
WHERE none(t in [t1, t2] where t:Loan)
RETURN p.name as player, club1.name AS profitMaker , initial.name as buysFrom, club2.name AS sellsTo, t2.numericFee - t1.numericFee as profit, (t2.timestamp - t1.timestamp) / 60 / 60 / 24 AS daysAtClub
ORDER BY profit DESC
每天的利润
然后,我们可以更进一步,计算出俱乐部拥有球员的每一天赚取了多少利润。
MATCH (p:Player)<-[:OF_PLAYER]-(t1)-[:NEXT]->(t2),
(club0)<-[:FROM_CLUB]-(t1)-[:TO_CLUB]->(club1)<-[:FROM_CLUB]-(t2)-[:TO_CLUB]->(club2)
WHERE none(t in [t1, t2] where t:Loan)
WITH p, club1.name AS profitMaker, club0.name AS buysFrom, club2.name AS sellsTo, t2.numericFee - t1.numericFee as profit, (t2.timestamp - t1.timestamp) / 60 / 60 / 24 AS daysAtClub
RETURN p.name AS player, profitMaker, buysFrom, sellsTo, profit, daysAtClub, profit / daysAtClub AS profitPerDay
ORDER BY profitPerDay DESC
这向我们展示了一些看起来很奇怪的转会!安德烈亚·贝托拉奇 似乎在连续两天内以巨额利润转会了两次。大概这些转会都是事先安排好的?
循环转会
我们还可以找到那些回到最初出售他们的俱乐部的球员。运行以下查询。
MATCH (p:Player)<-[:OF_PLAYER]-(t:Transfer)
MATCH path = (t)-[:NEXT*]->(t2)
MATCH (t)-[:FROM_CLUB]->(club)<-[:TO_CLUB]-(t2)
WHERE none(t in [t, t2] where t:Loan)
WITH p, t.numericFee - t2.numericFee AS profit, [transfer in nodes(path) | [(from)<-[:FROM_CLUB]-(transfer)-[:TO_CLUB]->(to) | from.name + "->" + to.name][0]] AS transfers, path
RETURN p.name, apoc.number.format(profit), transfers, (nodes(path)[-1].timestamp - nodes(path)[0].timestamp) / 60 / 60 / 24 AS days
ORDER BY profit DESC
资金流
某个赛季的俱乐部之间的资金流是如何的?
MATCH (t:Transfer)
WITH DISTINCT t.season AS season
MATCH (seller)<-[:FROM_CLUB]-(t:Transfer)-[:TO_CLUB]->(buyer)
WHERE t.season = season AND t.numericFee > 0
WITH season, seller, buyer, sum(t.numericFee) AS cash_flow, count(t) AS player_count
RETURN buyer, cash_flow, player_count, season, seller
ORDER BY cash_flow DESC
LIMIT 10
让我们将其持久化到图中,这样我们就无需每次都重新计算。
MATCH (t:Transfer)
WITH DISTINCT t.season AS season
MATCH (seller)<-[:FROM_CLUB]-(t:Transfer)-[:TO_CLUB]->(buyer)
WHERE t.season = season AND t.numericFee > 0
WITH season, seller, buyer, sum(t.numericFee) AS cash_flow, count(t) AS player_count
MERGE (buyer)-[:CASH_FLOW {total: cash_flow, playerCount: player_count, season: season}]->(seller)
2010/2011 赛季,谁从皇家马德里获得了资金?
这种新的关系类型使我们能够找出某个赛季的资金流向 - 例如,我们可以看到 2010/2011 赛季皇家马德里从谁那里购买了球员。
MATCH path = (buyer:Club)-[:CASH_FLOW]->(seller:Club)
WHERE buyer.name = "Real Madrid" AND all(f in relationships(path) WHERE f.season="2010/2011")
RETURN *
转会金额最大
我们可以编写一个查询来找出参与最大金额转会的俱乐部。
MATCH (buyer:Club)-[f:CASH_FLOW]->(seller:Club)
RETURN f.season AS season, buyer.name AS buyer, seller.name AS seller, f.total AS amount, f.playerCount AS playersTransferred
ORDER BY amount DESC limit 10
令人惊讶的是,曼城/摩纳哥在其中,这两支球队之间转会的是谁?
MATCH (from:Club)<-[:FROM_CLUB]-(transfer:Transfer)-[:TO_CLUB]->(to:Club),
(transfer)-[:OF_PLAYER]->(player)
WHERE from.name = "Monaco" AND to.name = "Man City"
RETURN player.name, to.name, transfer.numericFee, transfer.season
ORDER BY transfer.numericFee DESC
特定赛季的最高金额
我们还可以深入到特定赛季,只查看该赛季的现金流。
MATCH (buyer:Club)-[f:CASH_FLOW]->(seller:Club)
WHERE f.season = "2016/2017"
RETURN f.season AS season, buyer.name AS buyer, seller.name AS seller, f.total AS amount, f.playerCount AS playersTransferred
ORDER BY amount DESC limit 10
MATCH (from:Club)<-[:FROM_CLUB]-(transfer:Transfer)-[:TO_CLUB]->(to:Club),
(transfer)-[:OF_PLAYER]->(player)
WHERE from.name = "Valencia CF" AND to.name = "FC Barcelona" AND transfer.season = "2016/2017"
RETURN player.name, to.name, transfer.numericFee, transfer.season
ORDER BY transfer.numericFee DESC
此页面对您有帮助吗?