Create embeddings with open source libraries

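The Python library SentenceTransformers provides pre-trained models that generate embeddings for text and images, so you can work with embeddings without an account on OpenAI or another proprietary service.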

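This page assumes that you have already imported the recommendations dataset and set up your environment. It shows how to generate embeddings for Movie nodes from each movie's title and plot, and how to store them.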

Embeddings are always generated outside of Neo4j, but they are stored in the Neo4j database.

Set up the environment

As the final setup step, install the sentence-transformers package.

pip install sentence-transformers
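
Before starting a long embedding job, it can help to confirm that the packages are importable in the active Python environment. The following is a minimal sketch using only the standard library; the helper name `is_installed` is hypothetical and not part of either package.

```python
from importlib.util import find_spec


def is_installed(package_name: str) -> bool:
    """Return True if a top-level package can be found on the import path."""
    # find_spec returns None for a top-level module that cannot be located
    return find_spec(package_name) is not None


# Import names used by the example on this page
for name in ("sentence_transformers", "neo4j"):
    status = "found" if is_installed(name) else "MISSING"
    print(f"{name}: {status}")
```

Note that the import name (`sentence_transformers`) uses an underscore, while the package name passed to pip (`sentence-transformers`) uses a hyphen.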

Create embeddings for movies

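The following example fetches all Movie nodes from the database, generates an embedding from each movie's title and plot, and adds it to each node as an additional embedding property.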

from sentence_transformers import SentenceTransformer
import neo4j


URI = '<URI for Neo4j database>'
AUTH = ('<Username>', '<Password>')
DB_NAME = '<Database name>'  # examples: 'recommendations-50', 'neo4j'


def main():
    driver = neo4j.GraphDatabase.driver(URI, auth=AUTH)  # (1)
    driver.verify_connectivity()

    model = SentenceTransformer('all-MiniLM-L6-v2')  # (2) vector size 384

    batch_size = 100
    batch_n = 1
    movies_with_embeddings = []
    with driver.session(database=DB_NAME) as session:
        # Fetch `Movie` nodes
        result = session.run('MATCH (m:Movie) RETURN m.plot AS plot, m.title AS title')
        for record in result:
            title = record.get('title')
            plot = record.get('plot')

            # Create embedding for title and plot
            if title is not None and plot is not None:
                movies_with_embeddings.append({
                    'title': title,
                    'plot': plot,
                    'embedding': model.encode(f'''
                        Title: {title}\n
                        Plot: {plot}
                    '''),  # (3)
                })

            # Import when a batch of movies has embeddings ready; flush buffer
            if len(movies_with_embeddings) == batch_size:  # (4)
                import_batch(driver, movies_with_embeddings, batch_n)
                movies_with_embeddings = []
                batch_n += 1

        # Flush last batch
        import_batch(driver, movies_with_embeddings, batch_n)

    # Import complete, show counters
    records, _, _ = driver.execute_query('''
    MATCH (m:Movie WHERE m.embedding IS NOT NULL)
    RETURN count(*) AS countMoviesWithEmbeddings, size(m.embedding) AS embeddingSize
    ''', database_=DB_NAME)
    print(f"""
Embeddings generated and attached to nodes.
Movie nodes with embeddings: {records[0].get('countMoviesWithEmbeddings')}.
Embedding size: {records[0].get('embeddingSize')}.
    """)


def import_batch(driver, nodes_with_embeddings, batch_n):
    # Add embeddings to Movie nodes  (5)
    driver.execute_query('''
    UNWIND $movies as movie
    MATCH (m:Movie {title: movie.title, plot: movie.plot})
    CALL db.create.setNodeVectorProperty(m, 'embedding', movie.embedding)
    ''', movies=nodes_with_embeddings, database_=DB_NAME)
    print(f'Processed batch {batch_n}.')


if __name__ == '__main__':
    main()

'''
Movie nodes with embeddings: 9083.
Embedding size: 384.
'''
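1 The driver object is the interface for interacting with your Neo4j instance. For more information, see Build applications with Neo4j and Python.
2 The model all-MiniLM-L6-v2 maps text to vectors of size 384 (that is, lists of 384 numbers).
3 The .encode() method generates an embedding for the given string (in this example, the title and plot).
4 A number of embeddings are collected before the whole batch is submitted to the database. This avoids holding the entire dataset in memory and reduces the risk of timeouts, which matters especially for large datasets.
5 The import query sets a new embedding property on each node m, with the embedding vector movie.embedding as its value. It uses the Cypher procedure db.create.setNodeVectorProperty, which stores vector properties more efficiently than adding them with the Cypher SET clause. To set a vector property on a relationship, use db.create.setRelationshipVectorProperty.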
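
The batching described in note 4 can also be isolated into a small generator, which keeps the buffer handling separate from the embedding and import logic. This is a sketch under stated assumptions: the helper name `batched` is hypothetical, and unlike the example above it never yields an empty final batch.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield lists of at most batch_size items, holding one batch in memory."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []  # start a fresh buffer for the next batch
    if batch:
        yield batch  # final, possibly shorter batch


# Stand-in data: in the example above, each item would be a dict with
# 'title', 'plot' and 'embedding' keys.
for batch_n, batch in enumerate(batched(range(5), batch_size=2), start=1):
    print(f"batch {batch_n}: {batch}")
```

With a helper of this shape, the main loop could pass each yielded batch to import_batch instead of tracking the buffer and batch counter by hand.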

Once the embeddings are in the database, you can use them to compare how similar one movie is to another.
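
To make "similarity" concrete, the sketch below computes cosine similarity, a common measure for comparing embeddings, in plain Python. It is an illustration only: the function name `cosine_similarity` and the toy three-dimensional vectors are assumptions, and in practice the comparison would more likely run inside the database over the stored 384-dimensional vectors.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real movie embeddings
movie_a = [0.2, 0.1, 0.7]
movie_b = [0.4, 0.2, 1.4]    # same direction as movie_a, different length
movie_c = [0.7, -0.2, -0.1]  # points in a different direction

print(cosine_similarity(movie_a, movie_b))
print(cosine_similarity(movie_a, movie_c))
```

Vectors that point in the same direction score close to 1 regardless of their length, so the first pair is rated as more similar than the second.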
