Skip to content

fix(chunker): correctly determine chunk midpoint when empty chunks are present #1800

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Mar 4, 2025

Conversation

collindutter
Copy link
Member

Describe your changes

Problem:

["foo", '', "bar", 'baz'] is token counted as 'foobarbaz' rather than 'foo bar baz' when getting the midpoint index:

subchunk_tokens_count = self.tokenizer.count_tokens("".join(subchunks[: index + 1]))

This leads to an incorrect midpoint index which results in an incorrect chunk split. In certain cases this can lead to hitting recursive max depth.

Solution:

Join the chunks on the separator that we originally split them on:

subchunks = chunk.strip().split(separator.value)

subchunk_tokens_count = self.tokenizer.count_tokens(separator.value.join(subchunks[: index + 1]))

This correctly calculates the midpoint index which results in a correct chunk split.

Other changes in the PR are updates to the tests because chunk boundaries have changed slightly.

Issue ticket number and link

Closes #1796

@collindutter collindutter added this to the 1.5 milestone Mar 4, 2025
@collindutter collindutter requested a review from a team March 4, 2025 20:09
@collindutter collindutter self-assigned this Mar 4, 2025
@collindutter collindutter enabled auto-merge March 4, 2025 20:10
@collindutter collindutter force-pushed the fix/chunker-empty-subchunks branch from 4b7bb05 to 8f11146 Compare March 4, 2025 21:51
…e present

Previously ["foo", '', "bar", 'baz'] would be token counted as
'foobarbaz' rather than 'foo  bar baz' when getting the midpoint index
@collindutter collindutter force-pushed the fix/chunker-empty-subchunks branch from 8f11146 to fd7d8fb Compare March 4, 2025 22:23
@collindutter collindutter added this pull request to the merge queue Mar 4, 2025
Merged via the queue into main with commit 8ec2a8a Mar 4, 2025
15 checks passed
@collindutter collindutter deleted the fix/chunker-empty-subchunks branch March 4, 2025 22:32
collindutter added a commit that referenced this pull request Mar 4, 2025
…e present (#1800)

Previously ["foo", '', "bar", 'baz'] would be token counted as
'foobarbaz' rather than 'foo  bar baz' when getting the midpoint index
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

maximum recursion depth exceeded using MarkdownChunker.chunk with certain inputs
2 participants