Skip to content

Include error messages for stale primary #1714

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

vbabanin
Copy link
Member

@vbabanin vbabanin self-assigned this May 20, 2025
@vbabanin vbabanin changed the title Include error messages for stale primary. Include error messages for stale primary May 20, 2025
@vbabanin vbabanin requested a review from Copilot May 22, 2025 15:50
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR propagates specific error messages when marking primaries as stale and extends server invalidation methods to accept a Throwable cause.

  • Added Throwable cause parameters to resetToConnecting/invalidate methods in Server interfaces and implementations
  • Updated AbstractMultiServerCluster to pass detailed exception messages when detecting stale primaries
  • Enhanced unit tests to assert exception message propagation in ServerDiscoveryAndMonitoringTest and adapted TestServer behavior

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
driver-core/src/test/unit/com/mongodb/internal/connection/TestServer.java Updated invalidate/resetToConnecting to include an exception in the ServerDescription
driver-core/src/test/unit/com/mongodb/internal/connection/ServerDiscoveryAndMonitoringTest.java Added assertions for expected exception message when "error" key is present
driver-core/src/test/unit/com/mongodb/internal/connection/DefaultServerSpecification.groovy Invoke invalidate(Throwable) in tests instead of parameterless call
driver-core/src/main/com/mongodb/internal/connection/LoadBalancedServer.java Updated no-op resetToConnecting/invalidate signatures to accept a Throwable
driver-core/src/main/com/mongodb/internal/connection/DefaultServer.java Propagate cause into SDAM update and use SdamIssue.specific for invalidate
driver-core/src/main/com/mongodb/internal/connection/ClusterableServer.java Extended interface methods to accept a Throwable cause
driver-core/src/main/com/mongodb/internal/connection/AbstractMultiServerCluster.java Invalidate stale primaries with specific exception messages via new cause parameter
Comments suppressed due to low confidence (2)

driver-core/src/test/unit/com/mongodb/internal/connection/ServerDiscoveryAndMonitoringTest.java:120

  • [nitpick] Consider adding an else branch to explicitly assert that no exception is present when the "error" key is absent, preventing unexpected exceptions from slipping through tests.
if(expectedServerDescriptionDocument.containsKey("error")){

driver-core/src/test/unit/com/mongodb/internal/connection/DefaultServerSpecification.groovy:139

  • Using a concrete TestServerListener instead of a Mock prevents interaction verification. Use a Mock or Spy to allow verifying serverDescriptionChanged invocation.
def serverListener = new TestServerListener()

Comment on lines -95 to -99
* @see #specific(Throwable, Context)
*/
static SdamIssue unspecified(final Context context) {
return new SdamIssue(null, assertNotNull(context));
}
Copy link
Member Author

@vbabanin vbabanin May 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously, an unspecified()was used in a case where the primary was marked as stale due to the discovery of a newer primary. This behavior was handled in SdamDescriptionManager by updating the server’s description to UNKNOWN and initiating a reconnect:

} else if (sdamIssue.relatedToWriteConcern() || !sdamIssue.specific()) {
updateDescription(sdamIssue.serverDescription());
serverMonitor.connect();
}

This behavior aligns with the following section of the SDAM specification:

“A note on invalidating the old primary: when a new primary is discovered, the client finds the previous primary (there should be none or one) and replaces its description with a default ServerDescription of type Unknown. Additionally, the error field of the new ServerDescription object MUST include a descriptive error explaining that it was invalidated because the primary was determined to be stale. A multi-threaded client MUST request an immediate check for that server as soon as possible.”

The unspecified() method previously implied a transition to UNKNOWN without providing an error. As this is no longer consistent with the current spec - which requires an explanatory error in such transitions - the method has been removed.

Relevant spec changes:
SDAM Error Field Documentation
Discussion in Spec PR #1729

@@ -56,7 +56,7 @@ final class DefaultSdamServerDescriptionManager implements SdamServerDescription
}

@Override
public void update(final ServerDescription candidateDescription) {
public void monitorUpdate(final ServerDescription candidateDescription) {
Copy link
Member Author

@vbabanin vbabanin May 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original update() method performed two checks:

  1. Checks if the server is either a data-bearing connection or a direct connection. This aligns with the Connection Pool Management section of the SDAM spec.
  2. Checks if the ServerDescription contains an exception, and invalidates the connection pool accordingly. This corresponds to the Error Handling section under Server Monitoring.
    Although the spec specifies handling only network and command errors, the implementation currently checks for any Throwable.

These two behaviors conform to the spec requirements for server monitor behavior:

  1. The first check is applied after a successful server check.
  2. The second check applies when a server check fails.

The update() method was invoked from two places:

  • ServerMonitor, which is expected and directly supports the server monitor responsibilities outlined in the spec.
  • DefaultServer.resetToConnecting(), which is used in one specific case: when an RSPrimary is discovered with a stale electionId/setVersion. In that case, update() was called to transition the server to an UNKNOWN state. However, neither check (1 nor 2) was triggered, so the method merely updated the description.

This approach worked until the introduction of a new change where the RSPrimary with a stale electionId/setVersion case began passing an explicit exception, which unintentionally triggered a side effect of connection pool invalidation.

To address this and maintain separation of concerns, the logic has now been split:

monitorUpdate(): now encapsulates the server monitor-specific logic (checks 1 and 2).
update(): is simplified to only update the description to UNKNOWN (currently used for one specific case).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant