GraphGists

在 Neo4j 中使用分层树

hospitalmeta

介绍

我的名字是 Tomaz Bratanic。我想演示一下如何在 Neo4j 中处理分层位置树。根据我在导入/查询它们过程中学到的知识,我提出了一些应该遵循的基本规则,以便获得正确的查询结果。

位置树规则

  • 所有关系都从子节点指向父节点,向上层级发展。

  • 我们为所有关系使用单一类型。(PARENT;FROM;IS_IN)

  • 每个节点都有一条指向其父节点的出度关系。

  • 每个节点可以有一个或多个来自其子节点的入度关系。

联系方式

导入

让我们将一些数据加载到我们的图中进行探索。

添加约束和索引

首先,我们需要添加索引和约束,因为它们会优化我们的查询。下面过程中的第一个数组设置索引,第二个数组包含唯一约束。您需要安装 APOC 库。

CREATE INDEX ON :County(name);
CREATE INDEX ON :City(name);
CREATE INDEX ON :ZipCode(name);
CREATE INDEX ON :Address(name);

CREATE CONSTRAINT ON (h:Hospital) ASSERT h.id IS UNIQUE;
CREATE CONSTRAINT ON (s:State) ASSERT s.name IS UNIQUE;

位置分层树导入

您可以注意到,我们没有采用标准方法,即分别合并每个节点,而是以模式及其父节点在分层树中的方式合并它们,因为某些县/市/地址具有相同的名称。

LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/tomasonjo/hospitals-neo4j/master/Hospital%20General%20Information.csv" as row
WITH row
WHERE row.State = 'New York'
// state name is unique
MERGE (state:State{name:row.State})
// merge by pattern with their parents
MERGE (state)<-[:IS_IN]-(county:County{name:row.`County Name`})
MERGE (county)<-[:IS_IN]-(city:City{name:row.City})
MERGE (city)<-[:IS_IN]-(zip:ZipCode{name:row.`ZIP Code`})
MERGE (zip)<-[:IS_IN]-(address:Address{name:row.Address})
// for entities, it is best to have an id system
MERGE (h:Hospital{id:row.`Provider ID`})
MERGE (h)-[:IS_IN]->(address)

其他医院信息

我们还将导入有关医院的一些其他信息,例如其评级、所有权等。

LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/tomasonjo/hospitals-neo4j/master/Hospital%20General%20Information.csv" as row
WITH row
WHERE row.State = 'New York'
MATCH (h:Hospital{id:row.`Provider ID`})
SET h.phone=row.`Phone Number`,
    h.emergency_services = row.`Emergency Services`,
    h.name= row.`Hospital Name`,
    h.mortality = row.`Mortality national comparison`,
    h.safety = row.`Safety of care national comparison`,
    h.timeliness = row.`Timeliness of care national comparison`,
    h.experience = row.`Patient experience national comparison`,
    h.effectiveness = row.`Effectiveness of care national comparison`
MERGE (type:HospitalType{name:row.`Hospital Type`})
MERGE (h)-[:HAS_TYPE]->(type)
MERGE (ownership:Ownership{name: row.`Hospital Ownership`})
MERGE (h)-[:HAS_OWNERSHIP]->(ownership)
MERGE (rating:Rating{name:row.`Hospital overall rating`})
MERGE (h)-[:HAS_RATING]->(rating)

地理空间导入

最后要导入的是医院的地理空间信息。

LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/tomasonjo/hospitals-neo4j/master/gpsinfo.csv" as row
WITH row
WHERE row.State = 'New York'
MATCH (hospital:Hospital {id:row.id})
SET hospital.latitude = toFloat(row.latitude),
    hospital.longitude = toFloat(row.longitude)

空间查询示例

假设您在自由岛迷路了,并且想要找到最近的 10 家医院。距离以米为单位。**注意:在 Neo4j 沙箱中不起作用。**

WITH "Liberty Island, Manhattan" as myLocation
CALL apoc.spatial.geocodeOnce(myLocation) YIELD location
WITH point({longitude: location.longitude, latitude: location.latitude}) as myPosition,100 as distanceInKm
MATCH (h:Hospital)-->(rating:Rating)
WHERE exists(h.latitude) and
  distance(myPosition, point({longitude:h.longitude,latitude:h.latitude})) < (distanceInKm * 100)
RETURN h.name as hospital,rating.name as rating,distance(myPosition,
       point({longitude:h.longitude,latitude:h.latitude})) as distance
ORDER BY distance LIMIT 10

数据验证

验证 #1

我们可以检查是否有任何:Address具有多个向上层级发展的关系。每个节点都有一条指向其父节点的出度关系规则。

MATCH (a:Address)
WHERE size((a)-[:IS_IN]->()) > 1
RETURN a

验证 #2

我们还可以检查位置树中所有路径的长度。由于我们制定的规则,每家医院都必须恰好有一条位置路径,因为每家医院都恰好有一个地址。

MATCH path=(h:Hospital)-[:IS_IN*..10]->(location)
WHERE NOT (location)-[:IS_IN]->()
RETURN distinct(length(path)) as length,
       count(*) as numberOfPaths,
       count(distinct(h)) as numberOfHospitals

验证 #3

检查每个节点有多少个标签。这在学习时很有用。您不希望节点没有标签。

MATCH (n)
RETURN size(labels(n)) as size,count(*) as count

查询

让我们运行一些查询并了解我们的数据。

按所有权计算平均评分

MATCH (r)<-[:HAS_RATING]-(h:Hospital)-[:HAS_OWNERSHIP]->(o)
RETURN o.name as ownership,avg(toInteger(r.name)) as averageRating
ORDER BY averageRating DESC LIMIT 15

每个城市的医院数量

MATCH (h:Hospital)-[:IS_IN*3..3]->(city)
RETURN city.name as city,count(h) as NumberOfHospitals
ORDER BY NumberOfHospitals DESC LIMIT 15

按评分排名前 10 的州

MATCH (r)<-[:HAS_RATING]-(h:Hospital)-[:IS_IN*5..5]->(state)
WHERE NOT r.name="Not Available"
RETURN state.name as state,avg(toInteger(r.name)) as averageRating,count(h) as numberOfHospitals
ORDER BY averageRating DESC LIMIT 15

哪些州的有效性高于平均水平的医院最多

MATCH (h:Hospital)-[:IS_IN*5..5]->(state)
WHERE h.effectiveness = "Above the National average"
RETURN state.name as state,h.effectiveness,count(h) as numberOfHospitals
ORDER BY numberOfHospitals DESC LIMIT 15

哪些州的死亡率低于平均水平的医院最多

MATCH (h:Hospital)-[:IS_IN*5..5]->(state)
WHERE h.mortality = "Below the National average"
RETURN state.name as state,h.mortality,count(h) as numberOfHospitals
ORDER BY numberOfHospitals DESC LIMIT 15