Export Apache Parquet - APOC Extended Documentation

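Library Requirements

The Apache Parquet procedures depend on a client library that is not included in the APOC Extended library.

These dependencies are included in apoc-hadoop-dependencies-5.26.1-all.jar, which can be downloaded from the releases page.

Once the file is downloaded, place it in the plugins directory and restart the Neo4j server.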
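After the restart, one way to confirm that the Parquet export procedures are registered is to list them. This is a minimal sketch that assumes Neo4j 5, where `SHOW PROCEDURES` is available. It only confirms registration: a missing dependencies jar may surface only when a procedure is actually called.

```cypher
// List the registered Parquet export procedures.
// An empty result suggests that APOC Extended is not installed in the plugins directory.
SHOW PROCEDURES
YIELD name, description
WHERE name STARTS WITH 'apoc.export.parquet'
RETURN name, description
ORDER BY name;
```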

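Available Procedures

The table below describes the available procedures:

Name	Description
apoc.export.parquet.all	Exports the full database as a Parquet file
apoc.export.parquet.data	Exports the given nodes and relationships as a Parquet file
apoc.export.parquet.graph	Exports the given graph as a Parquet file
apoc.export.parquet.query	Exports the result of the given Cypher query as a Parquet file
apoc.export.parquet.all.stream	Exports the full database as a Parquet byte array
apoc.export.parquet.data.stream	Exports the given nodes and relationships as a Parquet byte array
apoc.export.parquet.graph.stream	Exports the given graph as a Parquet byte array
apoc.export.parquet.query.stream	Exports the result of the given Cypher query as a Parquet byte array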

The exported results can be imported or loaded with one of the Parquet import or load procedures.

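Configuration Parameters

The procedures support the following configuration parameters:

Table 1. Configuration parameters
Name	Type	Default	Description
batchSize	long	20000	Updates the Parquet file/byte array every n results
mapping	Map	{}	Used to map complex files. See the "Mapping config" section below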
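As an illustration of passing a configuration parameter, the sketch below assumes that the configuration map is the last argument of the export procedure. It lowers batchSize from its default; the value 5000 is arbitrary and chosen only for illustration.

```cypher
// Export the whole database, updating the Parquet file every 5000 results
// instead of the default 20000 (assumes the config map is the trailing argument).
CALL apoc.export.parquet.all('test.parquet', {batchSize: 5000})
```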

Usage

The examples in this section are based on the following sample graph:

CREATE (TheMatrix:Movie {title:'The Matrix', released:1999, tagline:'Welcome to the Real World'})
CREATE (Keanu:Person {name:'Keanu Reeves', born:1964})
CREATE (Carrie:Person {name:'Carrie-Anne Moss', born:1967})
CREATE (Laurence:Person {name:'Laurence Fishburne', born:1961})
CREATE (Hugo:Person {name:'Hugo Weaving', born:1960})
CREATE (LillyW:Person {name:'Lilly Wachowski', born:1967})
CREATE (LanaW:Person {name:'Lana Wachowski', born:1965})
CREATE (JoelS:Person {name:'Joel Silver', born:1952})
CREATE
(Keanu)-[:ACTED_IN {roles:['Neo']}]->(TheMatrix),
(Carrie)-[:ACTED_IN {roles:['Trinity']}]->(TheMatrix),
(Laurence)-[:ACTED_IN {roles:['Morpheus']}]->(TheMatrix),
(Hugo)-[:ACTED_IN {roles:['Agent Smith']}]->(TheMatrix),
(LillyW)-[:DIRECTED]->(TheMatrix),
(LanaW)-[:DIRECTED]->(TheMatrix),
(JoelS)-[:PRODUCED]->(TheMatrix);

The following query exports the whole database to the Parquet file test.parquet:

CALL apoc.export.parquet.all('test.parquet')

Table 2. Results
file	source	format	nodes	relationships	properties	time	rows	batchSize	batches	data
"file:///test.parquet"	"graph: nodes(8), rels(7)"	"parquet"	8	7	0	0	0	20000	0	null
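To read an export back, a minimal sketch follows. It assumes that the companion procedure `apoc.load.parquet` from APOC Extended is installed, that file import is enabled in the APOC configuration, and that test.parquet can be resolved from the location where it was written. The shape of each returned row depends on the exported data.

```cypher
// Read a few rows of the exported file back as maps (assumed companion procedure).
CALL apoc.load.parquet('test.parquet')
YIELD value
RETURN value
LIMIT 5;
```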

The following procedure exports the given nodes and relationships to the Parquet file testData.parquet:

MATCH (n:Person)-[r]->()
WITH collect(n) AS nodes, collect(r) AS rels
CALL apoc.export.parquet.data(nodes, rels, 'testData.parquet')
YIELD file RETURN file

Table 3. Results
file
"file:///testData.parquet"
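The same procedure can export a narrower selection. The sketch below collects only the ACTED_IN relationships of the sample graph together with both of their endpoints; the file name actedIn.parquet is hypothetical.

```cypher
// Collect the actors, the movies they acted in, and the ACTED_IN relationships.
MATCH (p:Person)-[r:ACTED_IN]->(m:Movie)
WITH collect(DISTINCT p) + collect(DISTINCT m) AS nodes, collect(r) AS rels
// Export only that selection (hypothetical file name).
CALL apoc.export.parquet.data(nodes, rels, 'actedIn.parquet')
YIELD file
RETURN file;
```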

The following procedure exports the given graph to the Parquet file testGraph.parquet:

CALL apoc.graph.fromDB('neo4j',{}) YIELD graph
CALL apoc.export.parquet.graph(graph, 'testGraph.parquet')
YIELD file RETURN file

Table 4. Results
file
"file:///testGraph.parquet"

The following procedure exports the result of the given query to the Parquet file testQuery.parquet:

CALL apoc.export.parquet.query("MATCH (n:Person) RETURN n", 'testQuery.parquet')

Table 5. Results
file	source	format	nodes	relationships	properties	time	rows	batchSize	batches	data
"file:///testQuery.parquet"	"statement: cols(1)"	"parquet"	8	7	0	0	0	20000	0	null

We can also return the Parquet byte array directly as a result by using the apoc.export.parquet.<type>.stream procedures, for example:

CALL apoc.export.parquet.all.stream

Table 6. Results
value
<byte_array_parquet_file>
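The other stream variants follow the same pattern. As a sketch, this assumes that `apoc.export.parquet.query.stream` takes the Cypher statement as its first argument, mirroring the file-based variant without the file name, and that it returns the bytes in a `value` column like the result shown above.

```cypher
// Stream the result of a query as a Parquet byte array instead of writing a file.
CALL apoc.export.parquet.query.stream("MATCH (n:Person) RETURN n.name AS name, n.born AS born")
YIELD value
RETURN value;
```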