ManagedChannel does not return from TRANSIENT_FAILURE to CONNECTING #10594

Closed
@programmatix

What version of gRPC-Java are you using?

I first saw this issue with 1.56.0, while 1.55.1 works as expected.
Update: on further testing I found the issue also exists in 1.55.3 and 1.58.0.

What did you expect to see?

I create a ManagedChannel and then run some logic that waits for it to become connected (logic below).

I'm testing this logic when the server is not running. Based on this doc I expect the channel to go from state IDLE to CONNECTING to TRANSIENT_FAILURE, and then after an exponentially increasing delay, back to CONNECTING and so on. With 1.55.1 this is exactly what happens. The notifyWhenStateChanged() callback gets called plenty of times, and my code eventually raises a TimeoutException as expected.

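ConnectivityState state = managedChannel.getState(true);
notify(state, onDone, deadline);

  // Completes onDone once the channel reaches the desired state, or fails it
  // with a TimeoutException once the deadline has passed.
  // Note: the deadline is only checked inside the state-change callback.
  private void notify(ConnectivityState current, CompletableFuture<Void> onDone, Deadline deadline) {
    if (inDesiredState(current)) {
      onDone.complete(null);
    }
    else {
      // One-shot callback: runs when the channel state differs from `current`.
      this.managedChannel.notifyWhenStateChanged(current, () -> {
        ConnectivityState now = this.managedChannel.getState(true);

        if (inDesiredState(now)) {
          onDone.complete(null);
        } else if (deadline.exceeded()) {
          onDone.completeExceptionally(new TimeoutException());
        } else {
          // Re-register, because each callback fires at most once.
          notify(now, onDone, deadline);
        }
      });
    }
  }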
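
For reference, the "exponentially increasing delay" between TRANSIENT_FAILURE and the next CONNECTING could be sketched roughly as below. The constants (1 s initial backoff, 1.6 multiplier, 120 s cap) are what I understand the gRPC connection backoff document to specify as defaults, so treat them as assumptions. Jitter is left out to keep the numbers deterministic, and `BackoffSketch` and `delaySeconds` are hypothetical names, not part of gRPC-Java.

```java
public class BackoffSketch {
  // Assumed defaults from the gRPC connection backoff document (jitter omitted).
  static final double INITIAL_BACKOFF_SECONDS = 1.0;
  static final double MULTIPLIER = 1.6;
  static final double MAX_BACKOFF_SECONDS = 120.0;

  /** Delay before reconnect attempt number `attempt` (0-based), capped at the maximum. */
  static double delaySeconds(int attempt) {
    double delay = INITIAL_BACKOFF_SECONDS * Math.pow(MULTIPLIER, attempt);
    return Math.min(delay, MAX_BACKOFF_SECONDS);
  }

  public static void main(String[] args) {
    // Print the delay expected before each of the first few reconnect attempts.
    for (int attempt = 0; attempt < 5; attempt++) {
      System.out.printf("attempt %d: %.2f s%n", attempt, delaySeconds(attempt));
    }
  }
}
```

With these assumed values, a state-change callback would be expected within a few seconds of each failure early on, and at most about two minutes apart once the cap is reached.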

What did you see instead?

With 1.55.3, the channel goes from IDLE to CONNECTING to TRANSIENT_FAILURE - and then stays there. It never returns to CONNECTING. Because the callback never fires again, my deadline is never checked and this wait-until-ready code hangs indefinitely.

Interestingly - if I start the server, after a short delay the channel goes to READY state. So I think internally it is still going back into CONNECTING state - but perhaps neglecting to publish this to the state notification mechanism?
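
One possible way to keep the wait from hanging, whatever the channel reports, is to enforce the deadline with a timer rather than inside the state-change callback. The sketch below is a minimal illustration under assumptions, not a confirmed fix. `State` and `StateSource` are hypothetical stand-ins for `io.grpc.ConnectivityState` and the two `ManagedChannel` methods used in my logic, so that the example runs without gRPC on the classpath. It relies on `CompletableFuture.orTimeout` (Java 9+).

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class ReadyWaiter {
  /** Hypothetical stand-in for io.grpc.ConnectivityState. */
  enum State { IDLE, CONNECTING, READY, TRANSIENT_FAILURE, SHUTDOWN }

  /** Hypothetical stand-in for the two ManagedChannel methods used in the logic above. */
  interface StateSource {
    State getState(boolean requestConnection);
    void notifyWhenStateChanged(State source, Runnable callback);
  }

  /** Completes when the channel is READY, or fails with TimeoutException after the timeout. */
  static CompletableFuture<Void> waitForReady(StateSource channel, long timeout, TimeUnit unit) {
    CompletableFuture<Void> onDone = new CompletableFuture<>();
    watch(channel, channel.getState(true), onDone);
    // The timer fires even if no state-change callback is ever delivered.
    return onDone.orTimeout(timeout, unit);
  }

  private static void watch(StateSource channel, State current, CompletableFuture<Void> onDone) {
    if (onDone.isDone()) {
      return; // already completed or timed out, so stop re-registering
    }
    if (current == State.READY) {
      onDone.complete(null);
      return;
    }
    // Re-register on every change, because the callback is one-shot.
    channel.notifyWhenStateChanged(current, () -> watch(channel, channel.getState(true), onDone));
  }

  public static void main(String[] args) throws InterruptedException {
    // Fake channel that stays in TRANSIENT_FAILURE and never runs its callbacks.
    StateSource stuck = new StateSource() {
      public State getState(boolean requestConnection) { return State.TRANSIENT_FAILURE; }
      public void notifyWhenStateChanged(State source, Runnable callback) { /* never invoked */ }
    };
    try {
      waitForReady(stuck, 200, TimeUnit.MILLISECONDS).get();
    } catch (ExecutionException e) {
      System.out.println("timed out: " + (e.getCause() instanceof TimeoutException));
    }
  }
}
```

With a real channel the same shape should apply by calling `managedChannel.getState(true)` and `managedChannel.notifyWhenStateChanged(...)` directly. The point of the design is that the timeout no longer depends on the channel ever leaving TRANSIENT_FAILURE.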

Steps to reproduce the bug

Hopefully the above information and logic is sufficient to replicate.
