Create embeddings with cloud AI providers

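The Cypher® function genai.vector.encode and the procedure genai.vector.encodeBatch let you generate embeddings for one or more pieces of text through an external AI provider. You need an API token for one of the supported providers (OpenAI, Vertex AI, Azure OpenAI, Amazon Bedrock).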

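This page assumes that you have already imported the recommendations dataset and set up your environment. It shows how to generate and store embeddings for Movie nodes based on their title and plot.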

Embeddings are always generated outside Neo4j, but are stored in the Neo4j database.
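For a single piece of text, genai.vector.encode can be called directly in a RETURN clause. The following is a minimal sketch, assuming an already created Neo4j Python driver object and the OpenAI provider. The helper name encode_text and its parameters are hypothetical and are not part of the driver or the plugin.

```python
def encode_text(driver, text, token, database='neo4j'):
    """Return the embedding of one string as a list of floats.

    `driver` is assumed to be a neo4j.Driver instance. `encode_text` is a
    hypothetical helper written for this example only.
    """
    records, _, _ = driver.execute_query(
        '''
        RETURN genai.vector.encode($text, 'OpenAI', { token: $token }) AS vector
        ''',
        text=text, token=token, database_=database,
    )
    # The query returns exactly one row with one column named `vector`.
    return list(records[0]['vector'])
```

The embedding is only returned to the caller here. Nothing is written to the database until the vector is stored on a node or relationship.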

Set up the environment

The encoding functions are part of the Neo4j GenAI plugin.

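On Aura instances the plugin is enabled by default, so you do not need to do anything further if you use Neo4j on Aura.
On self-managed instances the plugin must be installed. You can do this either by moving the neo4j-genai.jar file from /products to /plugins in the Neo4j home directory, or by starting the Docker container with the extra parameter --env NEO4J_PLUGINS='["genai"]'.
For more information, see Configuration → Plugins.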
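One way to confirm that the plugin is loaded is to look for the function in the output of SHOW FUNCTIONS. The sketch below assumes an already created Neo4j Python driver object. The helper name genai_plugin_available is hypothetical.

```python
def genai_plugin_available(driver, database='neo4j'):
    """Return True if genai.vector.encode is listed by SHOW FUNCTIONS.

    `driver` is assumed to be a neo4j.Driver instance. This helper is
    hypothetical and only illustrates the check.
    """
    records, _, _ = driver.execute_query(
        '''
        SHOW FUNCTIONS YIELD name
        WHERE name = 'genai.vector.encode'
        RETURN name
        ''',
        database_=database,
    )
    # An empty result means the function is not registered on this instance.
    return len(records) > 0
```

If the check returns False on a self-managed instance, the jar is probably not in the plugins directory, or the server was not restarted after it was moved there.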

Create embeddings for movies

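The following example fetches all Movie nodes from the database, generates an embedding for the concatenation of each movie's title and plot, and adds it to each node as an additional embedding property.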

import neo4j


URI = '<URI for Neo4j database>'
AUTH = ('<Username>', '<Password>')
DB_NAME = '<Database name>'  # examples: 'recommendations-50', 'neo4j'

openAI_token = '<OpenAI API token>'


def main():
    driver = neo4j.GraphDatabase.driver(URI, auth=AUTH)  # (1)
    driver.verify_connectivity()

    batch_size = 100
    batch_n = 1
    movies_batch = []
    with driver.session(database=DB_NAME) as session:
        # Fetch `Movie` nodes
        result = session.run('MATCH (m:Movie) RETURN m.plot AS plot, m.title AS title')
        for record in result:
            title = record.get('title')
            plot = record.get('plot')

            if title is not None and plot is not None:
                movies_batch.append({
                    'title': title,
                    'plot': plot,
                    'to_encode': f'Title: {title}\nPlot: {plot}'  # (2)
                })

            # Import a batch; flush buffer
            if len(movies_batch) == batch_size:  # (3)
                import_batch(driver, movies_batch, batch_n)
                movies_batch = []
                batch_n += 1

        # Flush last batch (skipped if the buffer is empty)
        if movies_batch:
            import_batch(driver, movies_batch, batch_n)

    # Import complete, show counters
    records, _, _ = driver.execute_query('''
    MATCH (m:Movie WHERE m.embedding IS NOT NULL)
    RETURN count(*) AS countMoviesWithEmbeddings, size(m.embedding) AS embeddingSize
    ''', database_=DB_NAME)
    print(f"""
Embeddings generated and attached to nodes.
Movie nodes with embeddings: {records[0].get('countMoviesWithEmbeddings')}.
Embedding size: {records[0].get('embeddingSize')}.
    """)


def import_batch(driver, nodes, batch_n):
    # Generate and store embeddings for Movie nodes
    driver.execute_query('''
    CALL genai.vector.encodeBatch($listToEncode, 'OpenAI', { token: $token }) YIELD index, vector  // (4)
    MATCH (m:Movie {title: $movies[index].title, plot: $movies[index].plot})  // (5)
    CALL db.create.setNodeVectorProperty(m, 'embedding', vector)  // (6)
    ''', movies=nodes, listToEncode=[movie['to_encode'] for movie in nodes], token=openAI_token,
    database_=DB_NAME)
    print(f'Processed batch {batch_n}')


if __name__ == '__main__':
    main()

'''
Movie nodes with embeddings: 9083.
Embedding size: 1536.
'''

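1	The `driver` object is the interface for interacting with your Neo4j instance. For more information, see Build applications with Neo4j and Python.
2	The string that OpenAI should encode into an embedding.
3	A number of movies is collected before the whole batch is submitted for encoding and the results are written to the database. This avoids keeping the whole dataset in memory and reduces the risk of timeouts, which matters especially for large datasets.
4	The procedure `genai.vector.encodeBatch()` submits the batch to OpenAI for encoding. OpenAI's default model is `text-embedding-ada-002`, which embeds text into vectors of size 1536 (that is, lists of 1536 numbers). For the list of supported providers and options, see GenAI providers.
5	The `index` returned by `genai.vector.encodeBatch` relates each embedding to its movie, so that each Movie node can be retrieved and its embedding attached to it.
6	The procedure `db.create.setNodeVectorProperty` stores the embedding `vector` in a property named `embedding` on each Movie node `m`. Adding embeddings through this procedure is more efficient than using the Cypher `SET` clause. To set a vector property on a relationship, use `db.create.setRelationshipVectorProperty`.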
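The buffering described in callout 3 can also be factored into a small generator, which keeps the batching logic separate from the import logic. This is a minimal sketch of that idea and not the way the example above is written. The name batched_items is hypothetical.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar('T')


def batched_items(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError('batch_size must be at least 1')
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    # Yield the final, possibly shorter, batch; nothing is yielded if it is empty.
    if batch:
        yield batch
```

With such a helper, the main loop could iterate over the batches and pass each one to a function like import_batch, and the trailing partial batch would need no special handling.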

Once the embeddings are in the database, you can use them to compare how similar one movie is to another.
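As an illustration of what such a comparison involves, cosine similarity is one common measure for embedding vectors. The sketch below computes it in plain Python, assuming two embeddings have already been read from the `embedding` property as lists of floats. The function name cosine_similarity is chosen for this example, and this is not necessarily how Neo4j computes similarity internally.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equally sized vectors."""
    if len(a) != len(b):
        raise ValueError('vectors must have the same length')
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError('cosine similarity is undefined for zero vectors')
    return dot / (norm_a * norm_b)
```

Values close to 1 indicate vectors pointing in nearly the same direction, which for text embeddings usually corresponds to similar content.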