DRIVERS-2884 Avoid connection churn when operations timeout #1675

Open

prestonvasquez wants to merge 46 commits into mongodb:master from prestonvasquez:DRIVERS-2884

Member

prestonvasquez commented Oct 14, 2024 •

This PR implements the design for connection pooling improvements described in DRIVERS-2884, based on the CSOT (Client-Side Operation Timeout) spec. It addresses connection churn caused by network timeouts during operations, especially in environments with low client-side timeouts and high latency.

When a connection is checked out after a network timeout, the driver now attempts to resume and complete reading any pending server response (instead of closing and discarding the connection). This may require multiple checkouts.
Each pending response read is subject to a cumulative 3-second static timeout. The timeout is refreshed after each successful read, acknowledging that progress is being made. If no data is read and the timeout is exceeded, the connection is closed.
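
The refresh-on-progress rule above can be sketched as follows. This is a minimal illustration, not the spec's pseudocode: the `PendingResponse` class, its `drain` method, and the `attempt_timeout` parameter are hypothetical names, and the sketch assumes the number of unread reply bytes is already known.

```python
import socket
import time

PENDING_RESPONSE_TIMEOUT = 3.0  # seconds; the static timeout from the description


class PendingResponse:
    """Hypothetical helper tracking a reply left unread after a timeout."""

    def __init__(self, sock, remaining, timeout=PENDING_RESPONSE_TIMEOUT):
        self.sock = sock
        self.remaining = remaining  # bytes of the reply still unread
        self.timeout = timeout
        self.deadline = time.monotonic() + timeout

    def drain(self, attempt_timeout):
        """Make one checkout attempt at finishing the pending read.

        Returns True when the reply is fully read, and False when this
        attempt ran out of time but a later checkout may try again.
        Closes the socket and raises ConnectionError if the server hung up
        or the deadline passed without any data arriving.
        """
        self.sock.settimeout(attempt_timeout)
        try:
            while self.remaining > 0:
                chunk = self.sock.recv(min(self.remaining, 4096))
                if not chunk:
                    self.sock.close()
                    raise ConnectionError("server closed the connection")
                self.remaining -= len(chunk)
                # Progress was made, so refresh the pending-response deadline.
                self.deadline = time.monotonic() + self.timeout
            return True
        except socket.timeout:
            if time.monotonic() >= self.deadline:
                self.sock.close()
                raise ConnectionError("pending response timed out")
            return False
```

A `False` return corresponds to the "may require multiple checkouts" case: the connection goes back to the pool with its pending state intact.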

To reduce unnecessary latency, if the timeout has expired while the connection was idle in the pool, a non-blocking single-byte read is performed; if no data is available, the connection is closed immediately.
This update introduces new CMAP events and logging messages (PendingResponseStarted, PendingResponseSucceeded, PendingResponseFailed) to improve observability of this path.
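
The non-blocking single-byte read described above might look roughly like this. The function names are hypothetical, and using `MSG_PEEK` (so the byte stays in the buffer for the later full read) is an assumption of this sketch rather than something the PR description states.

```python
import socket


def has_pending_data(sock):
    """Return True if at least one byte can be read without blocking."""
    sock.setblocking(False)
    try:
        # MSG_PEEK leaves the byte in the buffer (an assumption of this sketch).
        return len(sock.recv(1, socket.MSG_PEEK)) == 1
    except BlockingIOError:
        return False
    finally:
        sock.setblocking(True)


def check_out_idle(sock, timeout_expired):
    """Return the socket if it is still usable, else close it and return None."""
    if timeout_expired and not has_pending_data(sock):
        sock.close()
        return None
    return sock
```

The probe never waits, so a connection whose timeout lapsed while it sat idle costs the checkout no extra latency before it is closed.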

Please complete the following before merging:

Update changelog.
Make sure there are generated JSON files from the YAML test files.
Test changes in at least one language driver. Go: GODRIVER-3173 Complete pending reads on conn checkout mongo-go-driver#1977
Test these changes against all server versions and topologies (including standalone, replica set, sharded clusters, and serverless).

prestonvasquez added 2 commits

October 14, 2024 15:06


          DRIVERS-2884 Add connection churn spec tests

0f12706


          DRIVERS-2884 Update json

fe18120

prestonvasquez requested a review from ShaneHarvey

October 14, 2024 21:13

ShaneHarvey reviewed

source/client-side-operations-timeout/tests/connection-churn.yml Outdated (resolved)

source/client-side-operations-timeout/tests/connection-churn.yml Outdated (resolved)

prestonvasquez requested a review from ShaneHarvey

October 21, 2024 18:15

prestonvasquez added 7 commits

October 30, 2024 15:59


          DRIVERS-2884 Clean up spec tests

05cc88b


          Update CMAP to include foreground read

98c2a73


          Update changelog


          Add justification for CMAP update

234b729


          Remove unecessary example

ccfbcf1


          Use consistent keys

fed567b


          Update timeouts

8840be4

ShaneHarvey requested changes

source/client-side-operations-timeout/tests/connection-churn.yml Outdated

+                  # after maxTimeMS, whereas mongod returns it after
+                  # max(blockTimeMS, maxTimeMS).  Until this ticket is resolved, these tests
+                  # will not pass on sharded clusters.
+                  topologies: ["standalone", "replicaset"]

Member

ShaneHarvey Apr 17, 2025

standalone -> single

source/client-side-operations-timeout/tests/connection-churn.yml Outdated

+                    - name: findOne
+                      object: *collection
+                      arguments:
+                        timeoutMS: 50

Member

ShaneHarvey Apr 17, 2025

In Python this timeout is too small and causes this find to fail before sending anything to the server. The same problem exists in the other tests too. Perhaps all of these tests should run a setup command (e.g., ping) to ensure a connection is created and available in the pool, then run the finds. What do you think?

prestonvasquez added 15 commits

April 21, 2025 18:11


          DRIVERS-2884 Resolve merge conflicts

c1bee3b


          DRIVERS-2884 Update pending response unified spec tests

c0e5aee


          DRIVERS-2884 Add UML and update wording

dde9e22


          DRIVERS-2884 Remove uneeded text from code snippet

5e0305a


          DRIVERS-2884 Add prose tests

496724c


          DRIVERS-2884 Clean up presentation

258edf8


          DRIVERS-2884 Add logs and events

d217d10


          DRIVERS-2884 Add log part

cc8aec0


          DRIVERS-2884 Add Q&A section

3d98039


          DRIVERS-2884 Add changelog

07e75bd


          DRIVERS-2884 Fix Markdown failures

8d9e71b


          DRIVERS-2884 Update schema

5c68f77


          DRIVERS-2884 Update schema w/ new connection events

e2653cb


          DRIVERS-2884 Remove additional properties

00aa620


          DRIVERS-2884 Remove ignoring extra events

b29d6cc

prestonvasquez marked this pull request as ready for review

April 25, 2025 21:36

prestonvasquez requested a review from a team as a code owner

April 25, 2025 21:36

alcaeus approved these changes

Member

alcaeus left a comment

Changes to the unified test format LGTM.

@prestonvasquez as per our conversation around where to add the missing event names in #1782, this schema version would be an ideal candidate as it already adds new events to the list.

prestonvasquez added 4 commits

May 5, 2025 10:15


          DRIVERS-2884 Add pending response state

9500fd5


          DRIVERS-2884 Move logging tests to csot-specific file

5f3726b


          DRIVERS-2884 Generate connection-logging-csot.json

d286786


          DRIVERS-2884 Fix bug; add test super section

9c5b33a

prestonvasquez requested a review from baileympearson

May 5, 2025 19:34

baileympearson requested changes

source/connection-monitoring-and-pooling/connection-monitoring-and-pooling.md Outdated (resolved)

source/connection-monitoring-and-pooling/connection-monitoring-and-pooling.md (resolved)

source/connection-monitoring-and-pooling/connection-monitoring-and-pooling.md Outdated

+                connectionId: int64;
+                /**
+                 *  The time it took to complete the pending read.

Contributor

baileympearson May 5, 2025

Agreed. Can we clarify that in the description of duration? We can take inspiration from the definitions of duration for checkout failed and checkout succeeded events. Ex:

  /**
   * The time it took to establish the connection.
   * In accordance with the definition of establishment of a connection
   * specified by `ConnectionPoolOptions.maxConnecting`,
   * it is the time elapsed between emitting a `ConnectionCreatedEvent`
   * and emitting this event as part of the same checking out.
   *
   * Naturally, when establishing a connection is part of checking out,
   * this duration is not greater than
   * `ConnectionCheckedOutEvent`/`ConnectionCheckOutFailedEvent.duration`.
   *
   * A driver MAY choose the type idiomatic to the driver.
   * If the type chosen does not convey units, e.g., `int64`,
   * then the driver MAY include units in the name, e.g., `durationMS`.
   */
  duration: Duration;

So, maybe something like:

  /**
   * The time it took to complete the pending read.
   * This duration is defined as the time elapsed between emitting a `PendingResponseStarted` event
   * and emitting this event as part of the same checking out.
   *
   * A driver MAY choose the type idiomatic to the driver.
   * If the type chosen does not convey units, e.g., `int64`,
   * then the driver MAY include units in the name, e.g., `durationMS`.
   */
  duration: Duration;
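
As a rough illustration of the proposed definition, the duration can be measured from the moment the started event is emitted to the moment the succeeded event is emitted. The event class fields, `read_pending_response`, and the `emit` and `clock` parameters below are hypothetical names for this sketch, not the spec's API.

```python
import time
from dataclasses import dataclass


@dataclass
class PendingResponseStarted:
    connection_id: int


@dataclass
class PendingResponseSucceeded:
    connection_id: int
    duration: float  # seconds elapsed since PendingResponseStarted was emitted


def read_pending_response(connection_id, read, emit, clock=time.monotonic):
    """Emit the started event, run the read, then emit success with the duration."""
    start = clock()
    emit(PendingResponseStarted(connection_id))
    read()  # stands in for draining the pending reply
    emit(PendingResponseSucceeded(connection_id, clock() - start))
```

A monotonic clock is used so the duration cannot go negative if the wall clock is adjusted mid-read.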

source/connection-monitoring-and-pooling/connection-monitoring-and-pooling.md Outdated

+                connectionId: int64;
+                /**
+                 *  The time it took to complete the pending read.

Contributor

baileympearson May 5, 2025

(same comment for other definitions of duration in this PR).

source/connection-monitoring-and-pooling/connection-monitoring-and-pooling.md Outdated (resolved)

source/connection-monitoring-and-pooling/tests/README.md (resolved)

source/connection-monitoring-and-pooling/tests/README.md (resolved)

source/connection-monitoring-and-pooling/connection-monitoring-and-pooling.md Outdated (resolved)

ShaneHarvey reviewed

source/client-side-operations-timeout/tests/pending-response.yml (resolved)

source/connection-monitoring-and-pooling/connection-monitoring-and-pooling.md Outdated (resolved)

prestonvasquez removed the request for review from qingyang-hu

May 6, 2025 23:28

prestonvasquez added 5 commits

May 8, 2025 10:55


          DRIVERS-2884 Add commandName: ping

7bd5c00


          DRIVERS-2884 Clarify behavior for exhaust cursors

cc9ec5c


          DRIVERS-2884 Account for both pull and push i/o patterns

a397306


          DRIVERS-2884 Update duration commentary

ef75645


          DRIVERS-2884 Ensure all branches are tested

b5c1202

prestonvasquez commented

source/client-side-operations-timeout/tests/pending-response.yml Outdated

+                        - connectionCheckedInEvent: {} # Second find succeeds.
+                  # If the connection is closed server-side while draining the response, the
+                  # driver must close the connection.
+                - description: "connection closed server-side while draining response"

Member Author

prestonvasquez May 9, 2025

This seems sufficient to check that the connection is closed if the awaitPendingResponse function fails with a non-timeout error.


          DRIVERS-2884 Make Q&A read/receive agnostic

c785d0a

prestonvasquez requested review from ShaneHarvey and baileympearson

May 9, 2025 20:29

ShaneHarvey requested changes

source/client-side-operations-timeout/tests/pending-response.yml Outdated

+                        timeoutMS: 50
+                        filter: {_id: 1}
+                      expectError:
+                        isTimeoutError: false

Member

ShaneHarvey May 20, 2025

Shouldn't this error be considered retryable under the retryable reads/writes specs?

Member Author

prestonvasquez May 20, 2025

CMAP only makes the pool-cleared error retryable at check-out. Since retryable reads occur at the operation layer and this particular network error happens at the connection pool layer (before a read command goes on the wire), I think we would have to extend the CMAP spec to say that network errors while checking out qualify as retryable.

Member

ShaneHarvey May 20, 2025

Checkout errors should already be retryable. For example, a network error when establishing a new connection will cause an automatic retry.

Member Author

prestonvasquez May 20, 2025

The only error type we tag as retryable during checkOut is PoolClearedError. Nothing else in that layer is marked as retryable, which is why the test in question passes in the Go Driver. Am I missing something in the CMAP spec?

Member

ShaneHarvey May 21, 2025 •

Yes, that is a bug in the Go driver; see https://jira.mongodb.org/browse/DRIVERS-746

And the retryable writes spec:

When the driver encounters a network error establishing an initial connection to a server, it MUST add a RetryableWriteError label to that error if the MongoClient performing the operation has the retryWrites configuration option set to true.

https://github.com/mongodb/specifications/blob/master/source/retryable-writes/retryable-writes.md#retryablewriteerror-labels

Member Author

prestonvasquez May 21, 2025

This isn't a network error that occurs during a handshake, it's a network error encountered when trying to drain data from an established connection.

Member

ShaneHarvey May 21, 2025 •

Correct, but my point is that it is the same case in spirit. We can't introduce this new error mode without making it retryable.

Member Author

prestonvasquez May 28, 2025

I've updated the retryable reads and writes specifications to retry for network errors when checking out a connection.
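
The labeling rule discussed in this thread can be sketched as below. `NetworkError` and `label_checkout_error` are hypothetical stand-ins for a driver's own types; only the `RetryableWriteError` label name and the `retryWrites` condition come from the spec text quoted above.

```python
class NetworkError(Exception):
    """Hypothetical stand-in for a driver's network error type."""

    def __init__(self, message):
        super().__init__(message)
        self.labels = set()


def label_checkout_error(error, retry_writes):
    """Add RetryableWriteError to network errors raised during checkout."""
    if isinstance(error, NetworkError) and retry_writes:
        error.labels.add("RetryableWriteError")
    return error
```

For example, a network error hit while reading a pending response during checkout would pass through `label_checkout_error` before it reaches the operation layer's retry logic.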

prestonvasquez requested a review from ShaneHarvey

May 20, 2025 22:25


          DRIVERS-2884 Update retryable reads and writes to retry for network e…

1097f63

…rror on conn c/o

prestonvasquez requested a review from a team as a code owner

May 28, 2025 22:10

prestonvasquez requested review from isabelatkinson and removed request for a team

May 28, 2025 22:10


          DRIVERS-2884 Update 1.24 with open/close schema udpates

49b5508

ShaneHarvey requested changes

source/retryable-writes/retryable-writes.md Outdated (resolved)

source/retryable-reads/retryable-reads.md (resolved)


          DRIVERS-2884 Give pending response examplesin retryable r/w specs

fe79807

prestonvasquez requested a review from ShaneHarvey

June 5, 2025 20:49

ShaneHarvey reviewed

source/retryable-writes/retryable-writes.md

                   RetryableWriteError label to that error if the MongoClient performing the operation has the retryWrites
                   configuration option set to true.
+              - When the driver encounters a network error checking out a connection, it MUST add a RetryableWriteError label to that
+                  error if the MongoClient performing the operation has the retryWrites configuration option set to true. For example,
+                  a network error encountered when checking out a connection that must attempt to discard a pending response from the

Member

ShaneHarvey Jun 13, 2025 •

a network error encountered when checking out a connection that must attempt to discard a pending response

Is this sentence correct? I'm getting tripped up by the "that must attempt". Should it be:

a network error encountered when reading a pending response during connection checkout.

baileympearson requested changes

source/client-side-operations-timeout/tests/pending-response.yml (resolved)

source/client-side-operations-timeout/tests/pending-response.yml

Comment on lines +336 to +346

+                        - connectionCheckedOutEvent: {}
+                        - connectionCheckedInEvent: {} # Ping finishes.
+                        - connectionCheckedOutEvent: {}
+                        - connectionCheckedInEvent: {} # Insert fails.
+                        - connectionPendingResponseStarted: {} # Pending read fails on first find
+                        - connectionPendingResponseFailed:
+                            reason: error
+                        - connectionClosedEvent:
+                            reason: error
+                        - connectionCheckedOutEvent: {}
+                        - connectionCheckedInEvent: {} # Find finishes.

Contributor

baileympearson Jun 13, 2025

Could we add server selection events to this list? (Maybe we could use logging tests for this, even though not all drivers have implemented the CLAM spec.) It would be nice to clarify that the retry happens because the returned error is retryable and goes through the existing retry mechanism, not because we retry checkout directly. Basically:

          - connectionPendingResponseStarted: {} # Pending read fails on first find
          - connectionPendingResponseFailed:
              reason: error
          - connectionClosedEvent:
              reason: error
          - serverSelectionStarted
          - serverSelectionFinished
          - connectionCheckedOutEvent: {}
          - connectionCheckedInEvent: {} # Find finishes.
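
The ordering requested here, where a failed checkout sends the operation back through server selection rather than retrying checkout in place, can be sketched as follows. `RetryableError` and `run_with_retry` are hypothetical names, and the two event strings are copied from the list above rather than from any spec.

```python
class RetryableError(Exception):
    """Hypothetical error type carrying a retryable label."""


def run_with_retry(select_server, check_out, execute, events):
    """Run an operation, retrying once if checkout fails with a retryable error."""
    for attempt in range(2):
        events.append("serverSelectionStarted")
        server = select_server()
        events.append("serverSelectionFinished")
        try:
            conn = check_out(server)
        except RetryableError:
            if attempt == 0:
                continue  # back to server selection, not straight to checkout
            raise
        return execute(conn)
```

Because the retry re-enters the loop at server selection, a second pair of selection events appears between the failed and the successful checkout.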

Reviewers

ShaneHarvey requested changes

baileympearson requested changes

alcaeus approved these changes

isabelatkinson: awaiting requested review (code owner automatically assigned from mongodb/dbx-spec-owners-retryability)

Requested changes must be addressed to merge this pull request.

Labels

None yet