-
Notifications
You must be signed in to change notification settings - Fork 4.7k
Add then resync then Delete cause Pop() panic #14232
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@knobunc PTAL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The changes look good... but the tests are flipped around. Also I have a nit about the error string.
pkg/client/cache/eventqueue_test.go
Outdated
items := q.List() | ||
|
||
if len(items) > 0 { | ||
t.Fatalf("expected the list to be empty after a coalesce, but it was not") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I wrote this code, I'm not sure what I was getting at...
How about: "expected Resync() not to add a duplicate item to the event queue, but it did"
pkg/client/cache/eventqueue_test.go
Outdated
@@ -216,3 +216,17 @@ func TestEventQueue_PopAfterResyncShouldBeTypeModified(t *testing.T) { | |||
t.Fatalf("Expected resynced event to be of type watch.Modified, got %q", eventType) | |||
} | |||
} | |||
|
|||
func TestEventQueue_ResyncsShouldNotRestoreDeletedItems(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the wrong test... it goes with the other commit.
pkg/client/cache/eventqueue_test.go
Outdated
@@ -217,6 +217,24 @@ func TestEventQueue_PopAfterResyncShouldBeTypeModified(t *testing.T) { | |||
} | |||
} | |||
|
|||
func TestEventQueue_ResyncsShouldNotCausePanics(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test goes with the other bug...
@knobunc PTAL |
[Test]ing while waiting on the merge queue |
Hold on, would like an extra set of eyes. |
I think this is ok, but it's making my brain hurt. If @pmorie is around it would be nice to get his review too, and ultimately it will be much better to kill off this event queue once and for all. |
He's on PTO. Sign off good for me [merge]
…On Tue, May 23, 2017 at 2:49 PM, Andy Goldstein ***@***.***> wrote:
I think this is ok, but it's making my brain hurt. If @pmorie
<https://github.com/pmorie> is around it would be nice to get his review
too, and ultimately it will be much better to kill off this event queue
once and for all.
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#14232 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ABG_p2gjR_Y6HljBJDEY-yw5VNlnYnCbks5r8ypNgaJpZM4Neabf>
.
|
Ok, cool, good to know.
…On Tue, May 23, 2017 at 3:23 PM, Ben Bennett ***@***.***> wrote:
@ncdc <https://github.com/ncdc>: The PR to nuke the event queue is merged
at #14030 <#14030>
This really is only going in so that we can back-port it (and in case
anyone else is use it from out of the tree)
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#14232 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAABYofeGfidgzF4Uyi82lDXVHYu8P4nks5r8zIYgaJpZM4Neabf>
.
|
@pecameron -- merge failed "pkg/client/cache/eventqueue.go:298: ok declared and not used" |
We also need a card to nuke event queue from router.
…On Tue, May 23, 2017 at 3:28 PM, Ben Bennett ***@***.***> wrote:
@pecameron <https://github.com/pecameron> -- merge failed
"pkg/client/cache/eventqueue.go:298: ok declared and not used"
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#14232 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ABG_p1I8qUung0Ln8jl5cfbQFrVOENM2ks5r8zNwgaJpZM4Neabf>
.
|
The eventqueue did not properly handle the following sequence of operations: handleEvent ADD default/hello Pop ADD default/hello Resync handleEvent DELETE default/hello Pop DELETE default/hello Pop MODIFIED default/hello <<< this Pop should not happen The last Pop references the default/hello route which has been deleted. This results in a panic in Pop() The root cause of the problem is the resync adds the route to the queue with the default ADD state instead of MODIFY. The previous fix for this in Pop() (46a9328) was moved to Resync(). Fixes 1447928 https://bugzilla.redhat.com/show_bug.cgi?id=1447928
When handleEvent() is called to ADD a route, then called again to DELETE the route before Pop() processes the ADD, there is no need for the route to remain in store. handleEvent() can remove it from store. Deleted routes that are found in store during a Resync() are deleted by Pop() as they are encouneted. This limits the number of deleted routes in store and reduces the number of routes that Resync() adds to the queue.
Evaluated for origin test up to 8dee4ea |
[merge] @smarterclayton Event Queue went away from the router with #14030 card https://trello.com/c/8mv19t4h This is to fix it in case there are any out of tree users, and to give us something to back-port. We'll follow on with a PR to remove the Event Queue entirely soon. |
continuous-integration/openshift-jenkins/test FAILURE (https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin/1662/) (Base Commit: 16c3f11) |
It failed with "FAILURE: Generated bootstrap bindata out of date. Please run hack/update-generated-bindata.sh"... I have no idea how that can be related to this commit. [merge] |
That was my fault, merged something accidentally, didn't catch the problem.
|
Sorry.. wrong card. The replace the event queue is https://trello.com/c/y6SFvOA7 and it has not committed yet. |
[merge][severity: blocker]
this is bad
…On Wed, May 24, 2017 at 9:21 PM, OpenShift Bot ***@***.***> wrote:
continuous-integration/openshift-jenkins/merge FAILURE (
https://ci.openshift.redhat.com/jenkins/job/merge_pull_request_origin/779/)
(Base Commit: 423315b
<423315b>
)
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#14232 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ABG_p0GQZGR9DIU49NSxNsCDvqs3atvtks5r9NexgaJpZM4Neabf>
.
|
Evaluated for origin merge up to 8dee4ea |
continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/merge_pull_request_origin/789/) (Base Commit: 5ef63e6) (Extended Tests: blocker) (Image: devenv-rhel7_6269) |
@@ -152,6 +152,11 @@ func (eq *EventQueue) handleEvent(obj interface{}, newEventType watch.EventType) | |||
case watchEventEffectDelete: | |||
delete(eq.events, key) | |||
eq.queue = eq.queueWithout(key) | |||
// A delete means we added and deleted _before_ we popped, we need to clean up the store here | |||
if err := eq.store.Delete(obj); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When would this fail? I also don't understand the comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If and add event and a delete event come in for the same object before the consumer of the queue pulls the object from the queue, the add puts the event into the store and the queue of events, but the delete event would just remove it from the queue. So when the consumer popped, there would correctly be nothing returned. However, when resync function was called on the queue, it would loop over all items in the store and add them back to the queue... so Added and immediately Deleted things would reappear. This is the problem that your PR was fixing where they would crash the router because they had no entry in the events map. You fixed that by assuming they had state modified, but really, the entries were dead and should have been removed earlier.
@@ -400,6 +397,7 @@ func (eq *EventQueue) Resync() error { | |||
for _, id := range eq.store.ListKeys() { | |||
if !inQueue.Has(id) { | |||
eq.queue = append(eq.queue, id) | |||
eq.events[id] = watch.Modified |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider adding a comment as to why it's desirable to set the state to modified on resync.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good point. We should just say that if they are in the store, but not in the queue, they must be modified (since if the event was created or deleted, there should already be a queue entry)
@maru
There was an add request and a delete of the same item before a pop
happened. This means that it was added and deleted before the backend
found out about it. In this case it doesn't need to be in store. It
should have been deleted here but was not.
There is a side effect from the test case. The test very aggressively
adds and deletes the same route. There were times when sync ended up
with over 10000 unneeded deletes and all processing was effectively
swamped. This effect has been happening on a smaller scale in production
routers.
…On 06/12/2017 03:20 PM, Maru Newby wrote:
***@***.**** commented on this pull request.
------------------------------------------------------------------------
In pkg/client/cache/eventqueue.go
<#14232 (comment)>:
> @@ -152,6 +152,11 @@ func (eq *EventQueue) handleEvent(obj interface{}, newEventType watch.EventType)
case watchEventEffectDelete:
delete(eq.events, key)
eq.queue = eq.queueWithout(key)
+ // A delete means we added and deleted _before_ we popped, we need to clean up the store here
+ if err := eq.store.Delete(obj); err != nil {
When would this fail? I also don't understand the comment.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#14232 (review)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ANUgehpJ5d7LH6G-HCEipZ2YEU-rFvzcks5sDY99gaJpZM4Neabf>.
|
What happens is that the event queue has a producer side (the watch events) and a consumer side (the router pulling events) with a queue in the middle (because the router can't always immediately handle events, so we need a buffer between them). The event queue does a few extra things:
These three behaviors are good... but they had a bug that led to a nasty interaction. When the ADD then DELETE happened before the event was consumed, the event was removed from the queue, but left in the backing cache. When a resync happened, all deleted events were then added in to the queue itself and then had to be processed. Over time, that jams up the event processing in the router and makes it more likely an ADD and DELETE will happen before being processed. So it keeps adding more and more dead items to the queue. These junk entries would never be deleted and just keep building up in the cache, and would be added to the queue on every resyc. |
==================
Don't allow deleted routes in resync list