
Race condition when using upsert_text_artifacts with the meta kwarg to upsert multiple chunks #1781

Closed
@mikewallace1979

Description


If BaseVectorStoreDriver.upsert_text_artifacts is used to upsert multiple chunks, and a dict is provided via its meta kwarg, then the TextArtifact stored in the meta column in the embeddings table is not guaranteed to be the artifact used to generate the embedding vector.

To reproduce (requires OpenAI credentials and a local PostgreSQL instance with pgvector):

#!/usr/bin/env python

from griptape.chunkers import TextChunker
from griptape.drivers.vector.pgvector import PgVectorVectorStoreDriver
from griptape.drivers.embedding.openai import OpenAiEmbeddingDriver

# Prepare external deps
embedding_driver = OpenAiEmbeddingDriver(model="text-embedding-3-small")
vector_store = PgVectorVectorStoreDriver(
    connection_string="postgresql://localhost:5432/test_db",
    embedding_driver=embedding_driver,
    table_name="test_embeddings",
)
vector_store.setup()

test_text="""
This is some content for testing embeddings.
It spans multiple lines.
It is otherwise quite uninteresting.
"""

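# Use a small max_tokens so the text is split into several chunks,
# which in turn means several worker threads during the upsert below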
chunker = TextChunker(max_tokens=10)
chunks = chunker.chunk(test_text)

for i, chunk in enumerate(chunks):
    print(f"Chunk {i}: {chunk.to_text()}")

print(f"Upserting {len(chunks)} chunks...")
vector_store.upsert_text_artifacts(chunks, meta={"metadata_field": "metadata_value"})

print("Done!")

In the database we end up with:

test_db=# select (vector::float4[])[0:3],meta->'artifact' from test_embeddings;
                 vector                  |                                                                                                 ?column?                                                                                                 
-----------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 {0.015975304,-0.01191094,0.0093917055}  | "{\"type\": \"TextArtifact\", \"id\": \"747c4fff2f00453dbedc64cb7b14b28e\", \"reference\": null, \"meta\": {}, \"name\": \"747c4fff2f00453dbedc64cb7b14b28e\", \"value\": \"It spans multiple lines.\"}"
 {-0.012808571,0.011958273,0.09028612}   | "{\"type\": \"TextArtifact\", \"id\": \"747c4fff2f00453dbedc64cb7b14b28e\", \"reference\": null, \"meta\": {}, \"name\": \"747c4fff2f00453dbedc64cb7b14b28e\", \"value\": \"It spans multiple lines.\"}"
 {-0.041691482,0.023411594,-0.032732744} | "{\"type\": \"TextArtifact\", \"id\": \"747c4fff2f00453dbedc64cb7b14b28e\", \"reference\": null, \"meta\": {}, \"name\": \"747c4fff2f00453dbedc64cb7b14b28e\", \"value\": \"It spans multiple lines.\"}"
(3 rows)

We see a different vector for each row, but the TextArtifact stored in the meta column is always the same chunk.

I spent a bit of time debugging and I think the following is a likely explanation:

  1. upsert_text_artifacts executes BaseVectorStoreDriver.upsert_text_artifact using worker threads.
  2. Each thread adds its TextArtifact to the shared meta dict via meta["artifact"] = artifact.to_json().
  3. Because every thread writes into the same dict object, each write is visible to all threads.
  4. By the time a thread sends the meta dict to the vector store, the artifact field may therefore contain the TextArtifact handled by a different thread (see the sketch below).
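
The same mechanism can be reproduced without griptape at all. Below is a minimal standalone sketch; upsert_one is a made-up stand-in for upsert_text_artifact, not library code, and the sleep merely simulates the embedding call.

#!/usr/bin/env python
# Standalone illustration of the suspected race: several worker threads
# mutate one shared meta dict before they get around to using it.
import concurrent.futures
import time

shared_meta = {"metadata_field": "metadata_value"}

def upsert_one(artifact_value, meta):
    # Steps 2 and 3 above: every call writes into the *same* dict object.
    meta["artifact"] = artifact_value
    time.sleep(0.01)  # stand-in for the embedding call that runs before the DB write
    # Step 4: by the time this thread reads the dict back, another thread may
    # already have overwritten meta["artifact"].
    return artifact_value, meta["artifact"]

chunks = [f"chunk-{i}" for i in range(5)]
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    results = list(executor.map(upsert_one, chunks, [shared_meta] * len(chunks)))

for embedded, stored in results:
    print(f"embedded={embedded!r} stored_in_meta={stored!r}")

Running this typically prints the same stored_in_meta value on every line, mirroring the duplicated artifact seen in the database rows above.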

If the meta kwarg is omitted when calling upsert_text_artifacts, everything works as expected because each thread then creates its own meta dict.
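
Assuming that explanation is right, the fix presumably belongs in BaseVectorStoreDriver.upsert_text_artifact: write into a per-call copy of meta instead of mutating the caller's dict. A rough sketch only; the surrounding signature is assumed and may not match the actual griptape code:

def upsert_text_artifact(self, artifact, *, meta=None, **kwargs):
    # Copy the caller's dict so concurrent worker threads never share state.
    meta = dict(meta or {})
    meta["artifact"] = artifact.to_json()
    # ...embed artifact.to_text() and upsert the vector together with this per-call meta...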
