feat: TagValueIterator holds RLock for too long (#26369) #26372
It looks like we are holding a reader lock and deferring its release while, at the same time, acquiring that same reader lock within the call to `return tk.TagValueIterator()`. According to Go's `RWMutex` documentation this is prohibited: once a writer is blocked in `Lock()`, no goroutine should expect to acquire a new read lock until the existing read lock is released, which rules out recursive read locking (https://pkg.go.dev/sync#RWMutex).
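As a standalone illustration (not InfluxDB code), the following sketch shows the failure mode those docs warn about: one goroutine already holds the read lock, a writer queues up on `Lock()`, and the reader's second `RLock()` then blocks behind the pending writer.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var mu sync.RWMutex
	readerHoldsLock := make(chan struct{})

	// Reader: takes the read lock, then tries to take it again while a
	// writer is waiting -- the same shape as the deferred RLock plus the
	// nested RLock described above.
	go func() {
		mu.RLock()
		defer mu.RUnlock()
		close(readerHoldsLock)

		// Give the writer time to call Lock() and start waiting.
		time.Sleep(100 * time.Millisecond)

		mu.RLock() // blocks forever: the pending writer stops new readers
		fmt.Println("never reached")
		mu.RUnlock()
	}()

	<-readerHoldsLock
	mu.Lock() // blocks forever: waits for the reader's first RLock to be released
	mu.Unlock()
}
```

Run on its own, the Go runtime notices that every goroutine is blocked and aborts with a deadlock error; inside a running server the two goroutines simply hang instead, which is consistent with the reported symptoms.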
I believe this is what is causing the deadlock we are seeing in a few different tickets. I ran a reproducer that continuously read from and wrote to a bucket whose retention policy I had set to 1 minute and whose shard rollover I had set to 30 seconds, and I was able to reproduce the deadlock consistently this way. With some logging added around the mutexes, I see the same two lock sites involved in every reproduction:
The `TagValueIterator` on `tk` is found within `return tk.TagValueIterator()`:
influxdb/tsdb/index/tsi1/log_file.go, line 1388 at eebe655

The `AddSeriesList` `Lock()` is found here:
influxdb/tsdb/index/tsi1/log_file.go, line 545 at eebe655
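To make the interaction between those two sites concrete, here is a heavily simplified, hypothetical reconstruction; the method names mirror the real ones, but the bodies and the `tagKeyElem` type are illustrative rather than the actual log_file.go contents.

```go
package tsi1sketch

import "sync"

type LogFile struct {
	mu sync.RWMutex
	// in-memory measurement and tag-set maps elided
}

// tagKeyElem stands in for the element returned by the tag-set lookup.
type tagKeyElem struct{ f *LogFile }

// Writer side (around log_file.go:545): takes the write lock to apply entries.
func (f *LogFile) AddSeriesList( /* parameters elided */ ) {
	f.mu.Lock()
	defer f.mu.Unlock()
	// ... append log entries, update in-memory indexes ...
}

// Reader side (around log_file.go:1388): the deferred RUnlock keeps the read
// lock held across the nested tk.TagValueIterator() call, which acquires the
// same read lock again.
func (f *LogFile) TagValueIterator( /* name, key elided */ ) {
	f.mu.RLock()
	defer f.mu.RUnlock()

	tk := &tagKeyElem{f: f} // stands in for the measurement/tag-set lookups
	tk.TagValueIterator()   // recursive RLock on f.mu while it is already held
}

func (tk *tagKeyElem) TagValueIterator() {
	tk.f.mu.RLock()
	defer tk.f.mu.RUnlock()
	// ... build the value iterator ...
}

// Deadlock interleaving:
//  1. goroutine A enters LogFile.TagValueIterator and holds f.mu for reading,
//  2. goroutine B enters AddSeriesList and blocks in f.mu.Lock(),
//  3. goroutine A's nested tk.TagValueIterator() RLock blocks behind B.
// A waits on B, B waits on A, and neither can make progress.
```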
So it appears that we attempt to lock inside both `AddSeriesList` and `TagValueIterator` with different lock types on the same `RWMutex`. After modifying the code so that the read lock is not held across that nested call, I'm no longer seeing the problem. I have a feeling that the recursive call to `RLock()` is the underlying issue.

(cherry picked from commit 2e842ff)
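For illustration only, here is a minimal sketch of the general shape of such a change (the actual diff in this PR may differ in detail): hold the read lock just long enough for the lookups, release it, and only then call `tk.TagValueIterator()`, so its own `RLock()` never nests inside one that is already held. The `LogFile`, `tagKeyElem`, and `lookupTagKey` names here are simplified stand-ins, not the real declarations.

```go
package tsi1fixsketch

import "sync"

type LogFile struct{ mu sync.RWMutex }
type tagKeyElem struct{ f *LogFile }

func (tk *tagKeyElem) TagValueIterator() {
	tk.f.mu.RLock() // acquired on its own, no longer while f.mu is already read-held
	defer tk.f.mu.RUnlock()
	// ... build the value iterator ...
}

// lookupTagKey is a hypothetical stand-in for the measurement/tag-set map
// lookups; it must be called with f.mu held and may return nil when the
// measurement or key does not exist.
func (f *LogFile) lookupTagKey(name, key []byte) *tagKeyElem {
	return &tagKeyElem{f: f}
}

func (f *LogFile) TagValueIterator(name, key []byte) {
	f.mu.RLock()
	tk := f.lookupTagKey(name, key)
	f.mu.RUnlock() // released here instead of deferred to the end of the call

	if tk == nil {
		return
	}
	tk.TagValueIterator()
}
```

The trade-off is that whatever is read under the lock has to remain safe to use after the lock is dropped, which is why the unlock happens only once the lookups have produced a stable reference.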
Closes #
Describe your proposed changes here.