Amazon Redshift

Amazon Redshift uses SQL to analyze structured and semi-structured data across data warehouses, operational databases, and data lakes, using AWS-designed hardware and machine learning to deliver the best price performance at any scale.

Prerequisites

You need a running Amazon Redshift instance. If you don't have one, you can create it here.

Dependencies

If you are in a Databricks Runtime environment, you don't need to add any external dependencies; otherwise, you may need the following packages (see the session-configuration sketch after this list):

  • com.amazon.redshift:redshift-jdbc42:<version>

  • org.apache.spark:spark-avro_<scala_version>:<version>

  • io.github.spark-redshift-community:spark-redshift_<scala_version>:<version>

  • com.amazonaws:aws-java-sdk:<version>
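A minimal sketch of one way to resolve these packages, by listing them in `spark.jars.packages` when the session is built; the coordinates mirror the list above, and every `<version>` / `<scala_version>` placeholder must be filled in to match your cluster:

import org.apache.spark.sql.SparkSession

// Sketch: pull the Redshift JDBC driver, spark-avro, the community
// spark-redshift connector, and the AWS SDK at session start.
// All <version> and <scala_version> values are placeholders.
val spark = SparkSession.builder()
  .appName("redshift-neo4j")
  .config("spark.jars.packages", Seq(
    "com.amazon.redshift:redshift-jdbc42:<version>",
    "org.apache.spark:spark-avro_<scala_version>:<version>",
    "io.github.spark-redshift-community:spark-redshift_<scala_version>:<version>",
    "com.amazonaws:aws-java-sdk:<version>"
  ).mkString(","))
  .getOrCreate()

The same coordinates can equally be passed to spark-submit or spark-shell with the --packages flag.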

From Redshift to Neo4j

In the Databricks Runtime

In this case, a good starting point is the Databricks guide.

import org.apache.spark.sql.{DataFrame, SaveMode}

// Step (1)
// Load a table into a Spark DataFrame
val redshiftDF: DataFrame = spark.read
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://<the-rest-of-the-connection-string>")
  .option("dbtable", "CUSTOMER")
  .option("tempdir", "s3a://<your-bucket>/<your-directory-path>")
  .load()

// Step (2)
// Save the `redshiftDF` as nodes with labels `Person` and `Customer` into Neo4j
redshiftDF.write
  .format("org.neo4j.spark.DataSource")
  .mode(SaveMode.ErrorIfExists)
  .option("url", "neo4j://<host>:<port>")
  .option("labels", ":Person:Customer")
  .save()

# Step (1)
# Load a table into a Spark DataFrame
redshiftDF = (spark.read
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://<the-rest-of-the-connection-string>")
  .option("dbtable", "CUSTOMER")
  .option("tempdir", "s3a://<your-bucket>/<your-directory-path>")
  .load())

# Step (2)
# Save the `redshiftDF` as nodes with labels `Person` and `Customer` into Neo4j
(redshiftDF.write
  .format("org.neo4j.spark.DataSource")
  .mode("ErrorIfExists")
  .option("url", "neo4j://<host>:<port>")
  .option("labels", ":Person:Customer")
  .save())
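The snippets above assume a Neo4j instance that does not require authentication. If yours does, the Neo4j connector also accepts basic-authentication options; a minimal sketch of Step (2) with credentials added, where `<username>` and `<password>` are placeholders:

// Sketch: the same write as Step (2), with basic authentication added.
// <host>, <port>, <username>, and <password> are placeholders.
redshiftDF.write
  .format("org.neo4j.spark.DataSource")
  .mode(SaveMode.ErrorIfExists)
  .option("url", "neo4j://<host>:<port>")
  .option("authentication.type", "basic")
  .option("authentication.basic.username", "<username>")
  .option("authentication.basic.password", "<password>")
  .option("labels", ":Person:Customer")
  .save()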

In any other Spark runtime with the Redshift community dependencies

In this case, a good starting point is the Redshift community repository.

import org.apache.spark.sql.{DataFrame, SaveMode}

// Step (1)
// Load a table into a Spark DataFrame
val redshiftDF: DataFrame = spark.read
  .format("io.github.spark_redshift_community.spark.redshift")
  .option("url", "jdbc:redshift://<the-rest-of-the-connection-string>")
  .option("dbtable", "CUSTOMER")
  .option("tempdir", "s3a://<your-bucket>/<your-directory-path>")
  .load()

// Step (2)
// Save the `redshiftDF` as nodes with labels `Person` and `Customer` into Neo4j
redshiftDF.write
  .format("org.neo4j.spark.DataSource")
  .mode(SaveMode.ErrorIfExists)
  .option("url", "neo4j://<host>:<port>")
  .option("labels", ":Person:Customer")
  .save()

# Step (1)
# Load a table into a Spark DataFrame
redshiftDF = (spark.read
  .format("io.github.spark_redshift_community.spark.redshift")
  .option("url", "jdbc:redshift://<the-rest-of-the-connection-string>")
  .option("dbtable", "CUSTOMER")
  .option("tempdir", "s3a://<your-bucket>/<your-directory-path>")
  .load())

# Step (2)
# Save the `redshiftDF` as nodes with labels `Person` and `Customer` into Neo4j
(redshiftDF.write
  .format("org.neo4j.spark.DataSource")
  .mode("ErrorIfExists")
  .option("url", "neo4j://<host>:<port>")
  .option("labels", ":Person:Customer")
  .save())
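The Redshift cluster also needs permission to read and write the S3 `tempdir`. One common setup with the community connector is to authorize the unload/copy through an IAM role attached to the cluster; a sketch of Step (1) with that option, where the role ARN is a placeholder:

// Sketch: the same read as Step (1), authorizing the S3 tempdir through
// an IAM role attached to the Redshift cluster. The ARN is a placeholder.
val redshiftDF: DataFrame = spark.read
  .format("io.github.spark_redshift_community.spark.redshift")
  .option("url", "jdbc:redshift://<the-rest-of-the-connection-string>")
  .option("dbtable", "CUSTOMER")
  .option("tempdir", "s3a://<your-bucket>/<your-directory-path>")
  .option("aws_iam_role", "arn:aws:iam::<account-id>:role/<role-name>")
  .load()

Alternatively, the connector can reuse the S3 credentials of the Spark session via the `forward_spark_s3_credentials` option.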

From Neo4j to Redshift

In the Databricks Runtime

In this case, a good starting point is the Databricks guide.

import org.apache.spark.sql.DataFrame

// Step (1)
// Load `:Person:Customer` nodes as DataFrame
val neo4jDF: DataFrame = spark.read.format("org.neo4j.spark.DataSource")
  .option("url", "neo4j://<host>:<port>")
  .option("labels", ":Person:Customer")
  .load()

// Step (2)
// Save the `neo4jDF` as table CUSTOMER into Redshift
neo4jDF.write
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://<the-rest-of-the-connection-string>")
  .option("dbtable", "CUSTOMER")
  .option("tempdir", "s3a://<your-bucket>/<your-directory-path>")
  .mode("error")
  .save()

# Step (1)
# Load `:Person:Customer` nodes as DataFrame
neo4jDF = (spark.read.format("org.neo4j.spark.DataSource")
  .option("url", "neo4j://<host>:<port>")
  .option("labels", ":Person:Customer")
  .load())

# Step (2)
# Save the `neo4jDF` as table CUSTOMER into Redshift
(neo4jDF.write
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://<the-rest-of-the-connection-string>")
  .option("dbtable", "CUSTOMER")
  .option("tempdir", "s3a://<your-bucket>/<your-directory-path>")
  .mode("error")
  .save())
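If reading by `labels` returns more than you want to land in Redshift (for example, the connector's internal `<id>`/`<labels>` columns or unneeded properties), the Neo4j connector can also read through an explicit Cypher `query` option; a minimal sketch, where the query and the property names are illustrative assumptions about your graph:

// Sketch: read via a Cypher query so only selected properties are
// projected into the DataFrame before writing to Redshift.
// The property names (name, email) are assumptions about your graph.
val neo4jDF: DataFrame = spark.read
  .format("org.neo4j.spark.DataSource")
  .option("url", "neo4j://<host>:<port>")
  .option("query", "MATCH (c:Person:Customer) RETURN c.name AS name, c.email AS email")
  .load()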

In any other Spark runtime with the Redshift community dependencies

In this case, a good starting point is the Redshift community repository.

import org.apache.spark.sql.DataFrame

// Step (1)
// Load `:Person:Customer` nodes as DataFrame
val neo4jDF: DataFrame = spark.read.format("org.neo4j.spark.DataSource")
  .option("url", "neo4j://<host>:<port>")
  .option("labels", ":Person:Customer")
  .load()

// Step (2)
// Save the `neo4jDF` as table CUSTOMER into Redshift
neo4jDF.write
  .format("io.github.spark_redshift_community.spark.redshift")
  .option("url", "jdbc:redshift://<the-rest-of-the-connection-string>")
  .option("dbtable", "CUSTOMER")
  .option("tempdir", "s3a://<your-bucket>/<your-directory-path>")
  .mode("error")
  .save()

# Step (1)
# Load `:Person:Customer` nodes as DataFrame
neo4jDF = (spark.read.format("org.neo4j.spark.DataSource")
  .option("url", "neo4j://<host>:<port>")
  .option("labels", ":Person:Customer")
  .load())

# Step (2)
# Save the `neo4jDF` as table CUSTOMER into Redshift
(neo4jDF.write
  .format("io.github.spark_redshift_community.spark.redshift")
  .option("url", "jdbc:redshift://<the-rest-of-the-connection-string>")
  .option("dbtable", "CUSTOMER")
  .option("tempdir", "s3a://<your-bucket>/<your-directory-path>")
  .mode("error")
  .save())
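The writes above use save mode `error`, so they fail if the `CUSTOMER` table already exists. For repeated runs you may prefer a different mode; a sketch, assuming that replacing the table is acceptable in your environment:

// Sketch: the same write as Step (2), but replacing the CUSTOMER table on
// each run instead of failing when it already exists.
neo4jDF.write
  .format("io.github.spark_redshift_community.spark.redshift")
  .option("url", "jdbc:redshift://<the-rest-of-the-connection-string>")
  .option("dbtable", "CUSTOMER")
  .option("tempdir", "s3a://<your-bucket>/<your-directory-path>")
  .mode("overwrite")
  .save()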