Create embeddings with open source libraries

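The Python library SentenceTransformers provides pre-trained models for generating embeddings of text and images, and lets you work with embeddings without needing an account with OpenAI or another proprietary service.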

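This page assumes that you have imported the recommendations dataset and set up the environment. It shows how to generate and store embeddings for Movie nodes, based on each movie's title and plot.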

Embeddings are always generated *outside* of Neo4j, but *stored* in the Neo4j database.

Set up the environment

As a last step, install the sentence-transformers package.

pip install sentence-transformers
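
As an optional check that the package works before touching the database, a minimal sketch along the following lines could be used. The helper names build_movie_text, embed_movie, and load_default_encoder are hypothetical and not part of any library; the encoder is passed in as a parameter and is only assumed to expose an .encode() method, as SentenceTransformer does.

```python
def build_movie_text(title, plot):
    """Combine title and plot into the single string that gets embedded."""
    return f"Title: {title}\nPlot: {plot}"


def embed_movie(encoder, title, plot):
    """Return the embedding of a movie as a plain list of numbers.

    `encoder` is assumed to expose `.encode(str)`, like SentenceTransformer.
    """
    vector = encoder.encode(build_movie_text(title, plot))
    # SentenceTransformer.encode returns a NumPy array by default;
    # convert it to a plain list, and accept plain sequences as well.
    return vector.tolist() if hasattr(vector, "tolist") else list(vector)


def load_default_encoder():
    """Load the model used on this page (downloaded on first use)."""
    # Imported lazily so the helpers above can be used without the package.
    from sentence_transformers import SentenceTransformer
    return SentenceTransformer("all-MiniLM-L6-v2")
```

Calling embed_movie(load_default_encoder(), title, plot) with any title and plot should then yield a list of 384 numbers, matching the vector size of the model.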

Create embeddings for movies

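The following example fetches all Movie nodes from the database, generates an embedding from the title and plot of each one, and adds that embedding to the node as an additional embedding property.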

from sentence_transformers import SentenceTransformer
import neo4j


URI = '<URI for Neo4j database>'
AUTH = ('<Username>', '<Password>')
DB_NAME = '<Database name>'  # examples: 'recommendations-50', 'neo4j'


def main():
    driver = neo4j.GraphDatabase.driver(URI, auth=AUTH)  # (1)
    driver.verify_connectivity()

    model = SentenceTransformer('all-MiniLM-L6-v2')  # vector size 384  (2)

    batch_size = 100
    batch_n = 1
    movies_with_embeddings = []
    with driver.session(database=DB_NAME) as session:
        # Fetch `Movie` nodes
        result = session.run('MATCH (m:Movie) RETURN m.plot AS plot, m.title AS title')
        for record in result:
            title = record.get('title')
            plot = record.get('plot')

            # Create embedding for title and plot
            if title is not None and plot is not None:
                movies_with_embeddings.append({
                    'title': title,
                    'plot': plot,
                    'embedding': model.encode(f'''
                        Title: {title}\n
                        Plot: {plot}
                    '''),  # (3)
                })

            # Import when a batch of movies has embeddings ready; flush buffer
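            if len(movies_with_embeddings) == batch_size:  # (4)
                import_batch(driver, movies_with_embeddings, batch_n)
                movies_with_embeddings = []
                batch_n += 1

        # Flush the last, possibly partial, batch
        if movies_with_embeddings:
            import_batch(driver, movies_with_embeddings, batch_n)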

    # Import complete, show counters
    records, _, _ = driver.execute_query('''
    MATCH (m:Movie WHERE m.embedding IS NOT NULL)
    RETURN count(*) AS countMoviesWithEmbeddings, size(m.embedding) AS embeddingSize
    ''', database_=DB_NAME)
    print(f"""
Embeddings generated and attached to nodes.
Movie nodes with embeddings: {records[0].get('countMoviesWithEmbeddings')}.
Embedding size: {records[0].get('embeddingSize')}.
    """)


def import_batch(driver, nodes_with_embeddings, batch_n):
    # Add embeddings to Movie nodes  (5)
    driver.execute_query('''
    UNWIND $movies as movie
    MATCH (m:Movie {title: movie.title, plot: movie.plot})
    CALL db.create.setNodeVectorProperty(m, 'embedding', movie.embedding)
    ''', movies=nodes_with_embeddings, database_=DB_NAME)
    print(f'Processed batch {batch_n}.')


if __name__ == '__main__':
    main()

'''
Movie nodes with embeddings: 9083.
Embedding size: 384.
'''
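1 The driver object is the interface for interacting with your Neo4j instance. For more information, see Build applications with Neo4j and Python.
2 The model all-MiniLM-L6-v2 maps text to vectors of size 384 (that is, lists of 384 numbers).
3 The .encode() method generates an embedding for the given string (here, the title and plot together).
4 A number of embeddings are collected before the whole batch is submitted to the database. This avoids holding the entire dataset in memory and avoids potential timeouts, which matters most for larger datasets.
5 The import query sets a new embedding property on each node m, with the embedding vector movie.embedding as its value. It uses the Cypher procedure db.create.setNodeVectorProperty, which stores vector properties more efficiently than the Cypher SET clause. To set a vector property on a relationship, use db.create.setRelationshipVectorProperty.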
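
The batching pattern from callout 4 can also be isolated into a small generator, which makes it harder to forget the final, partially filled batch. The following is only a sketch in plain Python; iter_batches is a hypothetical helper name, and the integers stand in for the movie dictionaries used above.

```python
def iter_batches(items, batch_size):
    """Yield lists of at most `batch_size` items, including the final partial one."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush whatever is left over after the loop
        yield batch


# Seven placeholder items in batches of three: two full batches and one partial.
for n, batch in enumerate(iter_batches(range(7), 3), start=1):
    print(f"Batch {n}: {batch}")
# Batch 1: [0, 1, 2]
# Batch 2: [3, 4, 5]
# Batch 3: [6]
```

Because the generator only ever holds one batch, memory use stays bounded by the batch size rather than by the size of the dataset.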

Once the embeddings are in the database, you can use them to compare how similar one movie is to another.
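
As a rough illustration of what comparing two embeddings can mean, the sketch below computes cosine similarity, one common similarity measure, in plain Python. It is not how the comparison is performed inside Neo4j; cosine_similarity is a hypothetical helper, and the short vectors are made-up stand-ins for real 384-dimensional embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Made-up toy vectors, not real movie embeddings.
movie_a = [1.0, 0.0]
movie_b = [1.0, 0.0]
movie_c = [0.0, 1.0]

print(cosine_similarity(movie_a, movie_b))  # 1.0 (same direction)
print(cosine_similarity(movie_a, movie_c))  # 0.0 (orthogonal)
```

Values closer to 1 indicate vectors pointing in more similar directions, which for text embeddings is usually read as more similar content.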