Description
If BaseVectorStoreDriver.upsert_text_artifacts is used to upsert multiple chunks, and a dict is provided via its meta kwarg, then the TextArtifact stored in the meta column in the embeddings table is not guaranteed to be the artifact used to generate the embedding vector.
To reproduce (requires OpenAI credentials and a local PostgreSQL instance with pgvector):
#!/usr/bin/env python
from griptape.chunkers import TextChunker
from griptape.drivers.vector.pgvector import PgVectorVectorStoreDriver
from griptape.drivers.embedding.openai import OpenAiEmbeddingDriver
# Prepare external deps
embedding_driver = OpenAiEmbeddingDriver(model="text-embedding-3-small")
vector_store = PgVectorVectorStoreDriver(
    connection_string="postgresql://localhost:5432/test_db",
    embedding_driver=embedding_driver,
    table_name="test_embeddings",
)
vector_store.setup()
test_text = """
This is some content for testing embeddings.
It spans multiple lines.
It is otherwise quite uninteresting.
"""
chunker = TextChunker(max_tokens=10)
chunks = chunker.chunk(test_text)
for i, chunk in enumerate(chunks):
    print(f"Chunk {i}: {chunk.to_text()}")
print(f"Upserting {len(chunks)} chunks...")
vector_store.upsert_text_artifacts(chunks, meta={"metadata_field": "metadata_value"})
print("Done!")
In the database we end up with:
test_db=# select (vector::float4[])[0:3],meta->'artifact' from test_embeddings;
vector | ?column?
-----------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
{0.015975304,-0.01191094,0.0093917055} | "{\"type\": \"TextArtifact\", \"id\": \"747c4fff2f00453dbedc64cb7b14b28e\", \"reference\": null, \"meta\": {}, \"name\": \"747c4fff2f00453dbedc64cb7b14b28e\", \"value\": \"It spans multiple lines.\"}"
{-0.012808571,0.011958273,0.09028612} | "{\"type\": \"TextArtifact\", \"id\": \"747c4fff2f00453dbedc64cb7b14b28e\", \"reference\": null, \"meta\": {}, \"name\": \"747c4fff2f00453dbedc64cb7b14b28e\", \"value\": \"It spans multiple lines.\"}"
{-0.041691482,0.023411594,-0.032732744} | "{\"type\": \"TextArtifact\", \"id\": \"747c4fff2f00453dbedc64cb7b14b28e\", \"reference\": null, \"meta\": {}, \"name\": \"747c4fff2f00453dbedc64cb7b14b28e\", \"value\": \"It spans multiple lines.\"}"
(3 rows)
We see different vectors for each row, but the TextArtifact stored in the meta column is always the same chunk.
I spent a bit of time debugging and I think the following is a likely explanation:
- upsert_text_artifacts executes BaseVectorStoreDriver.upsert_text_artifact using worker threads.
- Each thread adds its TextArtifact to the meta dict via meta["artifact"] = artifact.to_json().
- This causes the dict to be modified for all threads.
- When it is time for a thread to send the meta dict to the vector store, the artifact field may contain a TextArtifact handled by a different thread.
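The suspected mechanism can be illustrated without griptape, OpenAI, or PostgreSQL. The following is only a minimal sketch under stated assumptions: upsert_all and upsert_one are hypothetical stand-ins, not griptape's actual code, and a threading.Barrier is used to force the unlucky interleaving (every thread writes before any thread reads) that would otherwise depend on timing.

```python
import threading


def upsert_all(chunks, meta=None):
    """Hypothetical stand-in for upsert_text_artifacts (not griptape code)."""
    if not chunks:
        return []
    rows = []  # simulated table rows: (chunk that was embedded, meta that was stored)
    rows_lock = threading.Lock()
    # Forces every worker to write into meta before any worker reads it back.
    barrier = threading.Barrier(len(chunks))

    def upsert_one(chunk):
        # Mirrors the suspected behaviour: reuse the caller's dict if one was given,
        # otherwise create a fresh dict per call.
        local_meta = meta if meta is not None else {}
        local_meta["artifact"] = chunk
        barrier.wait()
        with rows_lock:
            # Snapshot what would be sent to the vector store at this moment.
            rows.append((chunk, dict(local_meta)))

    threads = [threading.Thread(target=upsert_one, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return rows


chunks = ["chunk 0", "chunk 1", "chunk 2"]

shared = upsert_all(chunks, meta={"metadata_field": "metadata_value"})
print(len({m["artifact"] for _, m in shared}))  # 1: every row stores the same artifact

separate = upsert_all(chunks)
print(len({m["artifact"] for _, m in separate}))  # 3: each row stores its own artifact
```

Which chunk ends up in every row depends on thread scheduling, so the sketch only checks how many distinct artifacts were stored.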
If the meta dict is omitted when calling upsert_text_artifacts then everything works as expected because each thread creates its own meta dict.
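If that explanation is right, one possible remedy is for each worker to copy the caller's dict before adding its artifact. The sketch below only illustrates the idea and is not a patch against griptape: upsert_all_with_copy and upsert_one are hypothetical names, and the barrier again forces all writes to happen before any read so that sharing would be visible if it still occurred.

```python
import threading


def upsert_all_with_copy(chunks, meta=None):
    """Hypothetical stand-in showing a per-worker copy of meta (not griptape code)."""
    if not chunks:
        return []
    rows = []  # simulated table rows: (chunk that was embedded, meta that was stored)
    rows_lock = threading.Lock()
    barrier = threading.Barrier(len(chunks))

    def upsert_one(chunk):
        # Shallow copy: each worker mutates its own dict, never the caller's.
        local_meta = dict(meta) if meta is not None else {}
        local_meta["artifact"] = chunk
        barrier.wait()  # all writes are done before any row is recorded
        with rows_lock:
            rows.append((chunk, local_meta))

    threads = [threading.Thread(target=upsert_one, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return rows


caller_meta = {"metadata_field": "metadata_value"}
rows = upsert_all_with_copy(["chunk 0", "chunk 1", "chunk 2"], meta=caller_meta)

print(all(m["artifact"] == chunk for chunk, m in rows))  # True
print("artifact" in caller_meta)  # False: the caller's dict is left untouched
```

A shallow copy is enough here because only a top-level key is added. Nested values in the caller's dict would still be shared between workers.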