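Entity Resolution - Technical Walkthrough

1. Industry Introduction

2. Introduction

As discussed earlier, entity resolution is a key aspect of any data project, whatever type of data is being analyzed. This includes: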

  • Customers

  • Transactions

  • Products

  • Orders

  • Addresses

  • Policies

  • Product applications

  • And more

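Whenever a person has to type information into a free-text field, inconsistent data can result. This guide demonstrates how a knowledge graph is uniquely placed to help solve this problem. The example focuses on address deduplication, but the same principles can be applied to any area of your organization.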

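3. Modeling

This section shows examples of Cypher queries run against a sample graph. The aim is to illustrate what the queries look like and to offer guidance on how to structure your data in a real setting. We work with a small graph containing only a few nodes, based on the following data model:

3.1. Data Model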

[Figure: agnostic entity resolution data model]

3.1.1. Required Fields

The following fields are required to get started:

Address node

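  • RegAddressAddressLine1: first line of the address

  • RegAddressAddressLine2: second line of the address

  • RegAddressPostTown: town

  • RegAddressPostCode: postcode

  • FullAddress: the address lines, town and postcode combined into a single string (set in the demo data and used by the similarity queries)

  • Latitude: latitude derived from the postcode

  • Longitude: longitude derived from the postcode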
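
A quick way to check that existing data meets these requirements is to count the Address nodes that are missing any of the fields. The query below is a minimal sketch that uses only the property names listed above; it assumes a missing value is stored as an absent property (null) rather than as an empty string.

```cypher
// Count Address nodes that lack at least one required field.
// Assumption: a missing value is an absent property, not an empty string.
MATCH (a:Address)
WHERE a.RegAddressAddressLine1 IS NULL
   OR a.RegAddressAddressLine2 IS NULL
   OR a.RegAddressPostTown IS NULL
   OR a.RegAddressPostCode IS NULL
   OR a.Latitude IS NULL
   OR a.Longitude IS NULL
RETURN count(a) AS incomplete_addresses;
```

A result of 0 suggests every Address node carries the full set of fields.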

3.2. Demo Data

The following Cypher statements create the sample graph in a Neo4j database:

// Create all Address Nodes
CREATE (:Address {`RegAddressAddressLine1`: "37 ALBYN PLACE", `RegAddressAddressLine2`: "ALBYN PLACE", RegAddressPostTown: "ABERDEEN", RegAddressPostCode: "AB101JB", FullAddress: "37 ALBYN PLACE ALBYN PLACE ABERDEEN AB101JB"})
CREATE (:Address {`RegAddressAddressLine1`: "COMPANY NAME", `RegAddressAddressLine2`: "37 ALBYN PLACE", RegAddressPostTown: "ABERDEEN", RegAddressPostCode: "AB101JB", FullAddress: "COMPANY NAME 37 ALBYN PLACE ABERDEEN AB101JB"});

// Update each Address Node with longitude and latitude
MATCH (a:Address)
CALL apoc.spatial.geocode(a.RegAddressPostCode) YIELD location
SET a.Latitude = location.latitude,
    a.Longitude = location.longitude;
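
The demo rows set FullAddress by hand. For larger loads it could be derived from the other fields instead. The sketch below assumes FullAddress is simply the four address fields joined with single spaces, which is consistent with the two demo rows but is not stated by the guide. apoc.text.join comes from the same APOC library used elsewhere in this walkthrough.

```cypher
// Derive FullAddress for nodes that do not have one yet.
// Assumption: FullAddress = line 1 + line 2 + town + postcode, space-separated.
MATCH (a:Address)
WHERE a.FullAddress IS NULL
WITH a, [part IN [a.RegAddressAddressLine1,
                  a.RegAddressAddressLine2,
                  a.RegAddressPostTown,
                  a.RegAddressPostCode]
         WHERE part IS NOT NULL AND trim(part) <> ''] AS parts
SET a.FullAddress = apoc.text.join(parts, ' ');
```

Skipping null or blank parts avoids a null result or doubled spaces when an address has no second line.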

3.3. Neo4j Schema

If you call:

// Show the Neo4j schema
CALL db.schema.visualization()

you will see the following response:

[Figure: agnostic entity resolution schema]

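4. Cypher Queries

4.1. Calculate the Distance Between Addresses (in Meters)

This Cypher query calculates the distance between every pair of Address nodes from their geographic coordinates (latitude and longitude). It calls the point.distance function to compute the distance directly in the query, and uses ID(a1) > ID(a2) so that each pair is compared only once.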

// Calculate the distance between Address Nodes
MATCH (a1:Address), (a2:Address)
WHERE ID(a1) > ID(a2)
RETURN a1.FullAddress AS FullAddress1, a2.FullAddress AS FullAddress2,
       point.distance(point({ latitude: a1.Latitude, longitude: a1.Longitude }),
       point({ latitude: a2.Latitude, longitude: a2.Longitude })) AS DistanceInMeters

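4.1.1. What does this query do?

  1. MATCH (a1:Address), (a2:Address): This part of the query matches all nodes with the Address label. Two separate variables, a1 and a2, stand for the two sides of each pair of Address nodes.

  2. WHERE ID(a1) > ID(a2): This condition stops an address from being compared with itself and, by ordering each pair on Neo4j's internal IDs, ensures that every pair of distinct nodes is compared only once.

  3. RETURN a1.FullAddress AS FullAddress1, a2.FullAddress AS FullAddress2: This part of the query returns the full addresses of the two nodes being compared, aliased as FullAddress1 and FullAddress2 so the output is easier to read.

  4. point.distance(point({ latitude: a1.Latitude, longitude: a1.Longitude }), point({ latitude: a2.Latitude, longitude: a2.Longitude })) AS DistanceInMeters: This is the core of the query. It calculates the geographic distance between the two Address nodes.

    1. point({ latitude: a1.Latitude, longitude: a1.Longitude }) builds a point from the latitude and longitude of a1.

    2. point({ latitude: a2.Latitude, longitude: a2.Longitude }) does the same for a2.

    3. point.distance() then calculates the distance between the two points in meters.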
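
In practice the distance is often used as a filter rather than only reported. The sketch below keeps the same pairing logic and returns only pairs that lie within a cut-off. The 250 m threshold is a hypothetical value chosen for illustration, not a recommendation from this guide.

```cypher
// Return only address pairs that are geographically close.
// id() follows the guide's queries; Neo4j 5 deprecates it in favor of elementId().
MATCH (a1:Address), (a2:Address)
WHERE id(a1) > id(a2)
  AND a1.Latitude IS NOT NULL AND a2.Latitude IS NOT NULL
WITH a1, a2,
     point.distance(
       point({latitude: a1.Latitude, longitude: a1.Longitude}),
       point({latitude: a2.Latitude, longitude: a2.Longitude})) AS distance_in_meters
// Hypothetical cut-off of 250 m; tune it to the data.
WHERE distance_in_meters <= 250
RETURN a1.FullAddress AS FullAddress1,
       a2.FullAddress AS FullAddress2,
       distance_in_meters
ORDER BY distance_in_meters;
```

Because the coordinates are derived from the postcode, addresses that share a postcode will typically resolve to the same point and show a distance of 0.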

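4.2. Address Node Similarity Scoring

This more complex Cypher query calculates similarity scores between Address nodes from several properties, such as the address lines and the postcode. It uses the apoc.cypher.mapParallel2 procedure from the APOC (Awesome Procedures On Cypher) library to run the scoring in parallel and so improve performance. Text similarity is measured with the Levenshtein algorithm, which allows a fine-grained comparison of address fields. The query also applies several layers of selection logic so that only high-quality matches are returned.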

// Parallel Similarity Scoring Version
MATCH (a:Address)
WITH COLLECT(DISTINCT(left(a.RegAddressPostCode, 3))) AS postcodes
CALL apoc.cypher.mapParallel2("
    MATCH (a:Address), (b:Address)
        WHERE id(a) > id(b) AND a.RegAddressPostCode STARTS WITH _ AND b.RegAddressPostCode STARTS WITH _
        // Pass Variables
        WITH a, b,
        // Build similarity scores
        apoc.text.levenshteinSimilarity(a.RegAddressAddressLine1, b.RegAddressAddressLine1) AS line_1_sim,
        apoc.text.levenshteinSimilarity(a.RegAddressAddressLine2, b.RegAddressAddressLine2) AS line_2_sim,
        apoc.text.levenshteinSimilarity(a.RegAddressAddressLine1, b.RegAddressAddressLine2) AS a_b_line_1,
        apoc.text.levenshteinSimilarity(a.RegAddressAddressLine2, b.RegAddressAddressLine1) AS b_a_line_1,
        apoc.text.levenshteinSimilarity(a.RegAddressPostCode, b.RegAddressPostCode) AS post_sim,
        apoc.text.levenshteinSimilarity(a.FullAddress, b.FullAddress) AS full_address_sim
        WITH a, b, line_1_sim, line_2_sim, a_b_line_1, b_a_line_1, post_sim, full_address_sim, ((line_1_sim + line_2_sim) / 2) as add_1_2_calculation

        // Selection logic //

        // Limit the similarity of the full address
        WHERE full_address_sim > 0.6

        // Postcodes can not be too far apart
            AND post_sim > 0.7
            // Looks at addresses that have prefixes, e.g. 37 ALBYN PLACE vs COMPANY NAME 37 ALBYN PLACE
            // This addition pushes the address into Line 2
            AND ((line_1_sim = 1 OR a_b_line_1 = 1 OR b_a_line_1 = 1) AND post_sim > 0.85)
            AND NOT (add_1_2_calculation > 0.6 AND full_address_sim > 0.91 AND post_sim > 0.9)

        RETURN id(a) as a_id, a.FullAddress as a_FullAddress,id(b) as b_id, b.FullAddress as b_FullAddress, full_address_sim;
    ",
    {parallel:True, batchSize:1000, concurrency:6}, postcodes, 6) YIELD value
RETURN value.a_id AS a_id, value.a_FullAddress AS a_full_address, value.b_id AS b_id, value.b_FullAddress AS b_full_address, value.full_address_sim AS full_address_similarity;
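
To get a feel for the individual scores the query above computes, the same function can be called on literal strings. The sketch below uses the address lines of the two demo Address nodes; the variable names are only for illustration.

```cypher
// Address lines taken from the two demo Address nodes.
WITH "37 ALBYN PLACE" AS a_line_1, "ALBYN PLACE" AS a_line_2,
     "COMPANY NAME" AS b_line_1, "37 ALBYN PLACE" AS b_line_2
RETURN apoc.text.levenshteinSimilarity(a_line_1, b_line_1) AS line_1_sim,
       apoc.text.levenshteinSimilarity(a_line_2, b_line_2) AS line_2_sim,
       // Cross-comparison: line 1 of a against line 2 of b
       apoc.text.levenshteinSimilarity(a_line_1, b_line_2) AS a_b_line_1,
       // Cross-comparison: line 2 of a against line 1 of b
       apoc.text.levenshteinSimilarity(a_line_2, b_line_1) AS b_a_line_1;
```

Here a_b_line_1 compares two identical strings, so it scores 1 and satisfies the a_b_line_1 = 1 branch of the selection logic even though the two first lines differ.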

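4.2.1. What does this query do?

  1. MATCH (a:Address): Starts the query by matching all nodes with the Address label.

  2. WITH COLLECT(DISTINCT(left(a.RegAddressPostCode, 3))) AS postcodes: Takes the first three characters of each postcode and collects the distinct values into a list named postcodes.

  3. CALL apoc.cypher.mapParallel2("…", {parallel:True, batchSize:1000, concurrency:6}, postcodes, 6) YIELD value: Runs the nested Cypher query in parallel over the postcodes list. The map argument requests a batch size of 1000 and a concurrency level of 6, and the final argument, 6, is the number of partitions the list is split into. Inside the nested query, _ refers to the current postcode prefix.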
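
Grouping by postcode prefix means addresses are only compared with others in the same group, so the size of each group drives the amount of work. The sketch below is one way to inspect that before running the full query; the column names are illustrative.

```cypher
// Size of each postcode-prefix group and the number of pairs it contributes.
MATCH (a:Address)
WITH left(a.RegAddressPostCode, 3) AS prefix, count(*) AS addresses
RETURN prefix,
       addresses,
       // n addresses produce n * (n - 1) / 2 unordered pairs
       addresses * (addresses - 1) / 2 AS candidate_pairs
ORDER BY candidate_pairs DESC;
```

For the two demo rows this yields a single group, AB1, with one candidate pair.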

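Nested query details

  1. MATCH (a:Address), (b:Address): Matches all pairs of Address nodes for comparison.

  2. WHERE id(a) > id(b) AND a.RegAddressPostCode STARTS WITH _ AND b.RegAddressPostCode STARTS WITH _: Ensures each pair is considered only once and that both postcodes start with the same prefix, taken from the postcodes list and bound to _.

  3. Levenshtein similarity calculation: Uses apoc.text.levenshteinSimilarity to compute similarity scores between properties of a and b, including cross-comparisons of line 1 against line 2.

  4. Selection logic: Applies several conditions to filter the results. For example, the full addresses must score above 0.6 (full_address_sim > 0.6) and the postcodes above 0.7 (post_sim > 0.7).

  5. RETURN id(a) as a_id, a.FullAddress as a_FullAddress, id(b) as b_id, b.FullAddress as b_FullAddress, full_address_sim;: Returns the IDs and full addresses of a and b, together with the full-address similarity score.

By combining a text similarity algorithm with detailed selection logic, this query is well suited to capturing subtle relationships between addresses.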
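
The selection logic is easier to follow when evaluated against a few fixed score sets. The sketch below applies the same conditions to three hypothetical sets of scores; the values and labels are invented for illustration and were not measured from real data.

```cypher
// Illustrative score sets (hypothetical values, not measured from real data).
UNWIND [
  {label: 'prefix shifted into line 2', line_1_sim: 0.2, line_2_sim: 0.5,
   a_b_line_1: 1.0, b_a_line_1: 0.1, post_sim: 1.0, full_address_sim: 0.7},
  {label: 'near-identical records', line_1_sim: 1.0, line_2_sim: 1.0,
   a_b_line_1: 0.3, b_a_line_1: 0.3, post_sim: 1.0, full_address_sim: 1.0},
  {label: 'postcodes differ', line_1_sim: 1.0, line_2_sim: 0.4,
   a_b_line_1: 0.2, b_a_line_1: 0.2, post_sim: 0.8, full_address_sim: 0.8}
] AS s
WITH s, (s.line_1_sim + s.line_2_sim) / 2 AS add_1_2_calculation
// Same conditions as the selection logic in the query above.
RETURN s.label AS scenario,
       s.full_address_sim > 0.6
       AND s.post_sim > 0.7
       AND ((s.line_1_sim = 1 OR s.a_b_line_1 = 1 OR s.b_a_line_1 = 1) AND s.post_sim > 0.85)
       AND NOT (add_1_2_calculation > 0.6 AND s.full_address_sim > 0.91 AND s.post_sim > 0.9)
       AS selected;
```

With these values only the first scenario is selected. The second is removed by the final NOT clause, which drops pairs that are already almost identical, and the third fails post_sim > 0.85. The third case also shows that the post_sim > 0.7 condition is subsumed by the stricter post_sim > 0.85.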

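4.3. Create Similarity Relationships Between Address Nodes

This Cypher query creates relationships of type SIMILAR_ADDRESS between Address nodes, based on several similarity scores calculated with the Levenshtein algorithm. The scores come from the apoc.text.levenshteinSimilarity function in the APOC (Awesome Procedures On Cypher) library. The query also applies selection logic to filter out pairs that do not meet specific similarity conditions. It is particularly useful when addresses share a common prefix or differ only slightly in their address lines.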

// Create Similarity Relationship
MATCH (a:Address), (b:Address)

// Pass Variables
WITH a, b,

// Build similarity scores
apoc.text.levenshteinSimilarity(a.RegAddressAddressLine1, b.RegAddressAddressLine1) AS line_1_sim,
apoc.text.levenshteinSimilarity(a.RegAddressAddressLine2, b.RegAddressAddressLine2) AS line_2_sim,
apoc.text.levenshteinSimilarity(a.RegAddressAddressLine1, b.RegAddressAddressLine2) AS a_b_line_1,
apoc.text.levenshteinSimilarity(a.RegAddressAddressLine2, b.RegAddressAddressLine1) AS b_a_line_1,
apoc.text.levenshteinSimilarity(a.RegAddressPostCode, b.RegAddressPostCode) AS post_sim,
apoc.text.levenshteinSimilarity(a.FullAddress, b.FullAddress) AS full_address_sim

WITH a, b, line_1_sim, line_2_sim, a_b_line_1, b_a_line_1, post_sim, full_address_sim, ((line_1_sim + line_2_sim) / 2) as add_1_2_calculation

// Selection logic

// Limit the similarity of the full address
WHERE full_address_sim > 0.6

    // Postcodes can not be too far apart
    AND post_sim > 0.7

    // Looks at addresses that have prefixes, e.g. 37 ALBYN PLACE vs COMPANY NAME 37 ALBYN PLACE
    // This addition pushes the address into Line 2
    AND ((line_1_sim = 1 OR a_b_line_1 = 1 OR b_a_line_1 = 1) AND post_sim > 0.85)
    AND NOT (add_1_2_calculation > 0.6 AND full_address_sim > 0.91 AND post_sim > 0.9)

MERGE (a)-[:SIMILAR_ADDRESS {
    full_address_similarity: full_address_sim,
    postcode_similarity: post_sim,
    line_2_similarity: line_2_sim,
    line_1_similarity: line_1_sim,
    line_1_2_similarity: a_b_line_1,
    line_2_1_similarity: b_a_line_1
    }]->(b);

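4.3.1. What does this query do?

  • MATCH (a:Address), (b:Address): The query starts by matching all nodes with the Address label, bound to the variables a and b. Unlike the earlier queries there is no id(a) > id(b) filter, so each pair is evaluated in both directions and a qualifying pair receives a relationship in each direction.

  • WITH a, b, …: This clause passes the matched a and b nodes, together with the calculated similarity scores, to the next part of the query.

  • Levenshtein similarity calculation: Uses apoc.text.levenshteinSimilarity to compute similarity scores between properties of a and b, such as the address lines and the postcode.

  • WITH a, b, line_1_sim, …: Carries the original nodes and the calculated scores forward and adds add_1_2_calculation, the average of the line 1 and line 2 scores.

  • Selection logic: This part of the query applies several filter conditions to refine the matches. The conditions cover full-address similarity, postcode similarity and address prefixes, so that only the most meaningful relationships are created.

  • MERGE (a)-[:SIMILAR_ADDRESS {…}]->(b);: Finally, if a and b meet the conditions, a SIMILAR_ADDRESS relationship is created from a to b. The calculated similarity scores are stored as properties on the relationship for later use.

By combining a text similarity algorithm with detailed selection logic, this query is well suited to capturing subtle relationships between addresses.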
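
Once the relationships exist, the stored scores can be read back without recomputing them. The sketch below lists the matches from strongest to weakest, using the relationship type and property names set by the MERGE clause; the output aliases are illustrative.

```cypher
// List the stored matches, strongest first.
MATCH (a:Address)-[r:SIMILAR_ADDRESS]->(b:Address)
RETURN a.FullAddress AS address_1,
       b.FullAddress AS address_2,
       r.full_address_similarity AS full_address_similarity,
       r.postcode_similarity AS postcode_similarity
ORDER BY full_address_similarity DESC;
```

A listing like this is a convenient starting point for manual review before any records are merged.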
