Skip to content

Add then resync then Delete cause Pop() panic #14232

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
May 25, 2017

Conversation

pecameron
Copy link
Contributor

The eventqueue did not properly handle the following sequence of
operations:

handleEvent ADD default/hello
Pop         ADD default/hello
Resync
handleEvent DELETE default/hello
Pop         DELETE default/hello
Pop         MODIFIED default/hello  <<< this Pop should not happen

The last Pop references the default/hello route which has been deleted.
This results in a panic in Pop()

The root cause of the problem is the resync adds the route to the queue
with the default ADD state instead of MODIFY.

The previous fix for this in Pop() (46a93289) was moved to Resync().

Fixes 1447928
https://bugzilla.redhat.com/show_bug.cgi?id=1447928

==================
Don't allow deleted routes in resync list

When handleEvent() is called to ADD a route, then called again to
DELETE the route before Pop() processes the ADD, there is no need
for the route to remain in store. handleEvent() can remove it from
store. Deleted routes that are found in store during a Resync()
are deleted by Pop() as they are encouneted.

This limits the number of deleted routes in store and reduces the
number of routes that Resync() adds to the queue.

@pecameron
Copy link
Contributor Author

@knobunc PTAL

@pecameron pecameron self-assigned this May 18, 2017
@pecameron pecameron requested a review from marun May 18, 2017 12:07
Copy link
Contributor

@knobunc knobunc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes look good... but the tests are flipped around. Also I have a nit about the error string.

items := q.List()

if len(items) > 0 {
t.Fatalf("expected the list to be empty after a coalesce, but it was not")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I wrote this code, I'm not sure what I was getting at...

How about: "expected Resync() not to add a duplicate item to the event queue, but it did"

@@ -216,3 +216,17 @@ func TestEventQueue_PopAfterResyncShouldBeTypeModified(t *testing.T) {
t.Fatalf("Expected resynced event to be of type watch.Modified, got %q", eventType)
}
}

func TestEventQueue_ResyncsShouldNotRestoreDeletedItems(t *testing.T) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the wrong test... it goes with the other commit.

@@ -217,6 +217,24 @@ func TestEventQueue_PopAfterResyncShouldBeTypeModified(t *testing.T) {
}
}

func TestEventQueue_ResyncsShouldNotCausePanics(t *testing.T) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test goes with the other bug...

@pecameron
Copy link
Contributor Author

@knobunc PTAL

@smarterclayton
Copy link
Contributor

@pmorie and @ncdc for familiarity with the queues

@openshift-bot
Copy link
Contributor

[Test]ing while waiting on the merge queue

@smarterclayton
Copy link
Contributor

Hold on, would like an extra set of eyes.

@ncdc
Copy link
Contributor

ncdc commented May 23, 2017

I think this is ok, but it's making my brain hurt. If @pmorie is around it would be nice to get his review too, and ultimately it will be much better to kill off this event queue once and for all.

@smarterclayton
Copy link
Contributor

smarterclayton commented May 23, 2017 via email

@knobunc
Copy link
Contributor

knobunc commented May 23, 2017

@ncdc: The PR to nuke the event queue is merged at #14030

This really is only going in so that we can back-port it (and in case anyone else is use it from out of the tree)

@ncdc
Copy link
Contributor

ncdc commented May 23, 2017 via email

@knobunc
Copy link
Contributor

knobunc commented May 23, 2017

@pecameron -- merge failed "pkg/client/cache/eventqueue.go:298: ok declared and not used"

@smarterclayton
Copy link
Contributor

smarterclayton commented May 23, 2017 via email

pecameron added 2 commits May 23, 2017 16:24
The eventqueue did not properly handle the following sequence of
operations:

handleEvent ADD default/hello
Pop         ADD default/hello
Resync
handleEvent DELETE default/hello
Pop         DELETE default/hello
Pop         MODIFIED default/hello  <<< this Pop should not happen

The last Pop references the default/hello route which has been
deleted.  This results in a panic in Pop()

The root cause of the problem is the resync adds the route to the
queue with the default ADD state instead of MODIFY.

The previous fix for this in Pop() (46a9328) was moved to Resync().

Fixes 1447928
https://bugzilla.redhat.com/show_bug.cgi?id=1447928
When handleEvent() is called to ADD a route, then called again to
DELETE the route before Pop() processes the ADD, there is no need
for the route to remain in store. handleEvent() can remove it from
store. Deleted routes that are found in store during a Resync()
are deleted by Pop() as they are encouneted.

This limits the number of deleted routes in store and reduces the
number of routes that Resync() adds to the queue.
@openshift-bot
Copy link
Contributor

Evaluated for origin test up to 8dee4ea

@knobunc
Copy link
Contributor

knobunc commented May 23, 2017

[merge] @smarterclayton Event Queue went away from the router with #14030 card https://trello.com/c/8mv19t4h

This is to fix it in case there are any out of tree users, and to give us something to back-port.

We'll follow on with a PR to remove the Event Queue entirely soon.

@openshift-bot
Copy link
Contributor

continuous-integration/openshift-jenkins/test FAILURE (https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin/1662/) (Base Commit: 16c3f11)

@knobunc
Copy link
Contributor

knobunc commented May 24, 2017

It failed with "FAILURE: Generated bootstrap bindata out of date. Please run hack/update-generated-bindata.sh"... I have no idea how that can be related to this commit.

[merge]

@smarterclayton
Copy link
Contributor

smarterclayton commented May 24, 2017 via email

@knobunc
Copy link
Contributor

knobunc commented May 24, 2017

Sorry.. wrong card. The replace the event queue is https://trello.com/c/y6SFvOA7 and it has not committed yet.

@smarterclayton
Copy link
Contributor

smarterclayton commented May 25, 2017 via email

@openshift-bot
Copy link
Contributor

Evaluated for origin merge up to 8dee4ea

@openshift-bot
Copy link
Contributor

openshift-bot commented May 25, 2017

continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/merge_pull_request_origin/789/) (Base Commit: 5ef63e6) (Extended Tests: blocker) (Image: devenv-rhel7_6269)

@openshift-bot openshift-bot merged commit 0bea328 into openshift:master May 25, 2017
@@ -152,6 +152,11 @@ func (eq *EventQueue) handleEvent(obj interface{}, newEventType watch.EventType)
case watchEventEffectDelete:
delete(eq.events, key)
eq.queue = eq.queueWithout(key)
// A delete means we added and deleted _before_ we popped, we need to clean up the store here
if err := eq.store.Delete(obj); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When would this fail? I also don't understand the comment.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If and add event and a delete event come in for the same object before the consumer of the queue pulls the object from the queue, the add puts the event into the store and the queue of events, but the delete event would just remove it from the queue. So when the consumer popped, there would correctly be nothing returned. However, when resync function was called on the queue, it would loop over all items in the store and add them back to the queue... so Added and immediately Deleted things would reappear. This is the problem that your PR was fixing where they would crash the router because they had no entry in the events map. You fixed that by assuming they had state modified, but really, the entries were dead and should have been removed earlier.

@@ -400,6 +397,7 @@ func (eq *EventQueue) Resync() error {
for _, id := range eq.store.ListKeys() {
if !inQueue.Has(id) {
eq.queue = append(eq.queue, id)
eq.events[id] = watch.Modified
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider adding a comment as to why it's desirable to set the state to modified on resync.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point. We should just say that if they are in the store, but not in the queue, they must be modified (since if the event was created or deleted, there should already be a queue entry)

@pecameron
Copy link
Contributor Author

pecameron commented Jun 12, 2017 via email

@pecameron pecameron deleted the bz1447928 branch June 29, 2017 20:18
@knobunc
Copy link
Contributor

knobunc commented Jun 30, 2017

What happens is that the event queue has a producer side (the watch events) and a consumer side (the router pulling events) with a queue in the middle (because the router can't always immediately handle events, so we need a buffer between them).

The event queue does a few extra things:

  • It maintains a cache of all "live" objects, so that when a delete event happens, it can provide the deleted object along with the event
  • It coalesces events, so that when an ADD and a DELETE watch event are produced before the consumer (router) gets around to handling them, the event should just be deleted and never delivered
  • When the event queue is resynchronized, all objects in the cache are added back into the queue in modified state

These three behaviors are good... but they had a bug that led to a nasty interaction.

When the ADD then DELETE happened before the event was consumed, the event was removed from the queue, but left in the backing cache. When a resync happened, all deleted events were then added in to the queue itself and then had to be processed. Over time, that jams up the event processing in the router and makes it more likely an ADD and DELETE will happen before being processed. So it keeps adding more and more dead items to the queue. These junk entries would never be deleted and just keep building up in the cache, and would be added to the queue on every resyc.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants