fix(chunker): correctly determine chunk midpoint when empty chunks are present #1800

collindutter · 2025-03-04T20:09:36Z

I have read and agree to the contributing guidelines.

Describe your changes

Problem:

When computing the midpoint index, the subchunks ["foo", "", "bar", "baz"] are token-counted as "foobarbaz" rather than "foo bar baz":

griptape/griptape/chunkers/base_chunker.py

Line 106 in 41ad7f5

subchunk_tokens_count = self.tokenizer.count_tokens("".join(subchunks[: index + 1]))

This produces an incorrect midpoint index, which in turn produces an incorrect chunk split. In certain cases the recursive chunking can then hit the maximum recursion depth.
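To illustrate the problem in isolation, here is a minimal sketch. It assumes a single-space separator and uses a hypothetical character-counting `count_tokens` as a stand-in for the real tokenizer, so the numbers are illustrative only:

```python
# Hypothetical stand-in for a tokenizer: counts characters so that
# dropped separators show up directly in the count.
def count_tokens(text: str) -> int:
    return len(text)


separator = " "  # assumption: the chunk was split on a single space

# A doubled space in the input yields an empty subchunk after splitting.
subchunks = "foo  bar baz".split(separator)

# Joining on the empty string silently drops every separator.
joined_without_separator = "".join(subchunks)

# Joining on the original separator restores the text that was split.
joined_with_separator = separator.join(subchunks)

print(subchunks)
print(repr(joined_without_separator), count_tokens(joined_without_separator))
print(repr(joined_with_separator), count_tokens(joined_with_separator))
```

With this stand-in, the separator-joined string round-trips to the original input (including the doubled space), while the empty-string join undercounts by one token per separator.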

Solution:

Join the chunks on the separator that we originally split them on:

griptape/griptape/chunkers/base_chunker.py

Line 56 in 41ad7f5

subchunks = chunk.strip().split(separator.value)

griptape/griptape/chunkers/base_chunker.py

Line 106 in 4b7bb05

subchunk_tokens_count = self.tokenizer.count_tokens(separator.value.join(subchunks[: index + 1]))

This correctly calculates the midpoint index, which results in a correct chunk split.
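The effect on the midpoint can be sketched as follows. This is not the code in `base_chunker.py`: `find_midpoint_index` is a hypothetical helper, the tokenizer is again a character-counting stand-in, and the separator is assumed to be a single space. It picks the index whose prefix token count lands closest to half of the full chunk's count, using either join strategy for the prefix:

```python
from typing import Callable, List


def find_midpoint_index(
    subchunks: List[str],
    separator: str,
    joiner: str,
    count_tokens: Callable[[str], int],
) -> int:
    """Return the index whose prefix token count is closest to half the total.

    Hypothetical helper for illustration only. `joiner` is the string used
    to rebuild each prefix before counting its tokens.
    """
    # Target: half the token count of the chunk as it was before splitting.
    half = count_tokens(separator.join(subchunks)) // 2

    best_index, best_diff = -1, float("inf")
    for index in range(len(subchunks)):
        prefix_count = count_tokens(joiner.join(subchunks[: index + 1]))
        diff = abs(prefix_count - half)
        if diff < best_diff:
            best_index, best_diff = index, diff
    return best_index


# A chunk with a run of separators produces several empty subchunks.
subchunks = ["a", "", "", "", "", "", "bcdef"]

old_index = find_midpoint_index(subchunks, " ", "", len)   # join on ""
new_index = find_midpoint_index(subchunks, " ", " ", len)  # join on separator

print(old_index, new_index)  # prints "6 5"
```

In this sketch, the empty-string join selects the last subchunk as the midpoint, so a split at that index would leave nothing for the second half. Joining on the separator selects an earlier index that actually divides the chunk.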

The other changes in this PR update the tests, since chunk boundaries have shifted slightly.

Issue ticket number and link

Closes #1796

collindutter added the chunkers label Mar 4, 2025

collindutter added this to the 1.5 milestone Mar 4, 2025

collindutter requested a review from a team March 4, 2025 20:09

collindutter self-assigned this Mar 4, 2025

collindutter enabled auto-merge March 4, 2025 20:10

cjkindel approved these changes Mar 4, 2025

collindutter force-pushed the fix/chunker-empty-subchunks branch from 4b7bb05 to 8f11146 March 4, 2025 21:51

fix(chunker): correctly determine chunk midpoint when empty chunks ar…

fd7d8fb

…e present Previously ["foo", '', "bar", 'baz'] would be token counted as 'foobarbaz' rather than 'foo bar baz' when getting the midpoint index

collindutter force-pushed the fix/chunker-empty-subchunks branch from 8f11146 to fd7d8fb March 4, 2025 22:23

collindutter added this pull request to the merge queue Mar 4, 2025

Merged via the queue into main with commit 8ec2a8a Mar 4, 2025
15 checks passed

collindutter deleted the fix/chunker-empty-subchunks branch March 4, 2025 22:32

collindutter mentioned this pull request Mar 4, 2025

chore(main): release 1.5.0 #1768

Merged
