Create embeddings with open source libraries
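The Python library SentenceTransformers provides pre-trained models that generate embeddings for text and images, so you can work with embeddings without an account with OpenAI or another proprietary service.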
Embeddings are always generated *outside* of Neo4j, but *stored* in the Neo4j database.
Create embeddings for movies
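The following example fetches all Movie nodes from the database, generates an embedding from the title and plot of each one, and adds that embedding to the node as an additional embedding property.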
from sentence_transformers import SentenceTransformer
import neo4j


URI = '<URI for Neo4j database>'
AUTH = ('<Username>', '<Password>')
DB_NAME = '<Database name>'  # examples: 'recommendations-50', 'neo4j'


def main():
    driver = neo4j.GraphDatabase.driver(URI, auth=AUTH)  # (1)
    driver.verify_connectivity()

    model = SentenceTransformer('all-MiniLM-L6-v2')  # vector size 384  # (2)

    batch_size = 100
    batch_n = 1
    movies_with_embeddings = []
    with driver.session(database=DB_NAME) as session:
        # Fetch `Movie` nodes
        result = session.run('MATCH (m:Movie) RETURN m.plot AS plot, m.title AS title')
        for record in result:
            title = record.get('title')
            plot = record.get('plot')

            # Create embedding for title and plot
            if title is not None and plot is not None:
                movies_with_embeddings.append({
                    'title': title,
                    'plot': plot,
                    # (3) Encode title and plot together as one string
                    'embedding': model.encode(f'''
                        Title: {title}\n
                        Plot: {plot}
                    '''),
                })

            # Import when a batch of movies has embeddings ready; flush buffer
            if len(movies_with_embeddings) == batch_size:  # (4)
                import_batch(driver, movies_with_embeddings, batch_n)
                movies_with_embeddings = []
                batch_n += 1

        # Import the final, partially filled batch (if any)
        if movies_with_embeddings:
            import_batch(driver, movies_with_embeddings, batch_n)

    # Import complete, show counters
    records, _, _ = driver.execute_query('''
    MATCH (m:Movie WHERE m.embedding IS NOT NULL)
    RETURN count(*) AS countMoviesWithEmbeddings, size(m.embedding) AS embeddingSize
    ''', database_=DB_NAME)
    print(f"""
Embeddings generated and attached to nodes.
Movie nodes with embeddings: {records[0].get('countMoviesWithEmbeddings')}.
Embedding size: {records[0].get('embeddingSize')}.
    """)


def import_batch(driver, nodes_with_embeddings, batch_n):
    # Add embeddings to Movie nodes  (5)
    driver.execute_query('''
    UNWIND $movies as movie
    MATCH (m:Movie {title: movie.title, plot: movie.plot})
    CALL db.create.setNodeVectorProperty(m, 'embedding', movie.embedding)
    ''', movies=nodes_with_embeddings, database_=DB_NAME)
    print(f'Processed batch {batch_n}.')


if __name__ == '__main__':
    main()

'''
Movie nodes with embeddings: 9083.
Embedding size: 384.
'''
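(1) The driver object is the interface for interacting with your Neo4j instance. For more information, see Build applications with Neo4j and Python.
(2) The model all-MiniLM-L6-v2 maps text to vectors of size 384 (that is, lists of 384 numbers).
(3) The .encode() method generates an embedding for the given string (in this example, the title and plot together).
(4) A number of embeddings are collected before the whole batch is submitted to the database. This avoids holding the entire dataset in memory and reduces the risk of timeouts, which matters most for larger datasets.
(5) The import query sets a new embedding property on each node m, with the embedding vector movie.embedding as its value. It uses the Cypher procedure db.create.setNodeVectorProperty, which stores vector properties more efficiently than the Cypher SET clause does. To set a vector property on a relationship, use db.create.setRelationshipVectorProperty.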
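The batching step can be looked at on its own, independent of the database and the model. The following is a minimal sketch, not part of the original example: the helper name in_batches and the rows list are hypothetical, introduced only for illustration. It shows one way to split any iterable into fixed-size batches while still emitting the last, partially filled batch.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar('T')


def in_batches(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield lists of up to `batch_size` items, including a final partial batch.

    Hypothetical helper for illustration; not part of the neo4j driver.
    """
    if batch_size < 1:
        raise ValueError('batch_size must be at least 1')
    iterator = iter(items)
    while True:
        # Take the next `batch_size` items (fewer if the input runs out)
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Hypothetical stand-in for records fetched from the database
rows = [{'title': f'Movie {i}'} for i in range(250)]
batch_sizes = [len(batch) for batch in in_batches(rows, 100)]
print(batch_sizes)  # [100, 100, 50]
```

Because the helper consumes its input lazily, only one batch is held in memory at a time, which is the same property the import loop relies on.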
Once the embeddings are in the database, you can use them to compare how similar one movie is to another.
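To give a sense of what comparing two embeddings means, here is a minimal sketch that assumes cosine similarity as the measure (this page does not say which measure is used). The function cosine_similarity and the two-dimensional toy vectors are hypothetical stand-ins for real 384-dimensional embeddings.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError('vectors must have the same length')
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError('cosine similarity is undefined for zero vectors')
    return dot / (norm_a * norm_b)


# Toy vectors (assumption: real embeddings would have 384 dimensions)
movie_a = [3.0, 4.0]
movie_b = [3.0, 4.0]   # same direction as movie_a
movie_c = [4.0, -3.0]  # orthogonal to movie_a

same_direction = cosine_similarity(movie_a, movie_b)
orthogonal = cosine_similarity(movie_a, movie_c)
print(same_direction, orthogonal)
```

Values close to 1 indicate vectors pointing in nearly the same direction, and values close to 0 indicate unrelated directions.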