RUST-521 Implement naive streaming and resume token caching for change streams #531

Merged: 27 commits from the RUST-521/change-stream-stream branch into mongodb:master on Dec 15, 2021

Conversation

abr-egn (Contributor) commented on Dec 2, 2021:

RUST-521

The Stream impl just forwards to the underlying cursor; this will need to be updated for RUST-522 to resume on error, but that seemed like a big enough chunk of work to be worth doing separately.

The token-tracking machinery is included in the base Cursor behavior but only exposed via ChangeStream; the other option here would have been introducing a new type parallel to GenericCursor and wrapping that directly in ChangeStream. That didn't seem like a good way to go since it would either involve a lot of duplication of things like the buffering and stream impl, or adding another nested doll to encapsulate the shared logic (i.e. something like Cursor -> CursorCommon -> GenericCursor and ChangeStream -> CursorCommon -> GenericCursorWithToken).
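
For illustration, a minimal sketch of that forwarding shape, using stand-in types rather than the driver's actual ChangeStream and cursor definitions:

```rust
use std::pin::Pin;
use std::task::{Context, Poll};

use futures_core::Stream;

// Stand-in for ChangeStream: it wraps a cursor (and, per this PR, the cached
// resume token) and its Stream impl simply delegates to the cursor.
struct ChangeStreamSketch<C> {
    cursor: C,
    // the cached resume token would live here alongside the wrapped cursor
}

impl<C: Stream + Unpin> Stream for ChangeStreamSketch<C> {
    type Item = C::Item;

    fn poll_next(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Option<Self::Item>> {
        // Naive streaming: just forward to the underlying cursor.
        // Resume-on-error (RUST-522) would wrap this call with retry logic.
        Pin::new(&mut self.get_mut().cursor).poll_next(cx)
    }
}
```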

@@ -15,7 +23,31 @@ use serde::{Deserialize, Serialize};
/// [here](https://docs.mongodb.com/manual/changeStreams/#change-stream-resume-token) for more
/// information on resume tokens.
#[derive(Clone, Debug, Deserialize, Serialize)]
-pub struct ResumeToken(pub(crate) Bson);
+pub struct ResumeToken(pub(crate) RawBson);
abr-egn (author):

Using RawBson here means the token can be handled as an essentially uninterpreted byte blob.

}
}

pub(crate) async fn execute_watch<T>(
abr-egn (author):

Watch needs its own version of these utility functions so the resume token from the initial aggregate command can be preserved; conveniently, this tidied up the call sites quite a bit.

Contributor:

If we move resume token tracking to the ChangeStream per my comment below, we can reuse execute_cursor_operation in the body of this function, ditto for the session one.

abr-egn (author):

Unfortunately we need the CursorSpecification in order to get the initial resume token, and that's not preserved or exposed by Cursor.
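
A stand-alone sketch of that ordering constraint, with illustrative stand-in types (not the driver's CursorSpecification or Cursor):

```rust
// The token has to be read off the specification *before* cursor construction
// consumes it, because the cursor does not keep the spec around.
struct SpecSketch {
    initial_buffer: Vec<String>,
    post_batch_resume_token: Option<String>,
}

struct CursorSketch {
    buffer: Vec<String>, // the spec itself is gone once the cursor exists
}

fn execute_watch_sketch(spec: SpecSketch) -> (CursorSketch, Option<String>) {
    let initial_token = spec.post_batch_resume_token;
    let cursor = CursorSketch { buffer: spec.initial_buffer };
    // The change stream gets both the cursor and the preserved initial token.
    (cursor, initial_token)
}
```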

@@ -37,8 +38,10 @@ where
client: Client,
info: CursorInformation,
buffer: VecDeque<RawDocumentBuf>,
+post_batch_resume_token: Option<ResumeToken>,
abr-egn (author):

Two tokens need to be tracked: the one returned by the most recent getMore call, and the one that will be returned to users via resume_token(). The two only coincide at the end of a batch.
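
As a stand-alone model of that rule (String stands in for the real ResumeToken/RawBson), the user-facing token only catches up to the post-batch token once the current batch has been drained:

```rust
use std::collections::VecDeque;

struct TokenState {
    buffer: VecDeque<String>,                // documents remaining in the current batch
    post_batch_resume_token: Option<String>, // token from the most recent getMore
}

impl TokenState {
    /// What resume_token() should report after yielding a document whose _id is `doc_id`.
    fn user_facing_token(&self, doc_id: &str) -> Option<String> {
        if self.buffer.is_empty() && self.post_batch_resume_token.is_some() {
            // End of batch: the two tokens coincide.
            self.post_batch_resume_token.clone()
        } else {
            // Mid-batch: the user-facing token is the _id of the document just returned.
            Some(doc_id.to_owned())
        }
    }
}
```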

abr-egn marked this pull request as ready for review on December 2, 2021 19:23

if self.buffer.is_empty() && self.post_batch_resume_token.is_some() {
    self.post_batch_resume_token.clone()
} else {
    doc.get("_id")?.map(|val| ResumeToken(val.to_raw_bson()))
Contributor:

This could potentially be an expensive lookup if the user is querying large documents with _id projected out. I think it would probably be better to track these at the ChangeStream level so that only users of ChangeStream (which will be a small %) have to pay this cost. The post_batch_resume_token will still need to be tracked at the cursor level, though.

Another thing to note: the spec requires that we provide a way for users to receive every resume token that gets cached. Currently, this cursor implementation will keep looping until the cursor closes or a document is received, so the tokens that get cached in between empty batches (from post_batch_resume_token) aren't ever made available to the user. Most drivers get around this by providing a tryNext() method which returns null if there aren't any documents available yet, and I think we could do something similar here (and expose it via a separate method on ChangeStream). We'd need to refactor the implementation of this method into a separate function so we could reuse it for both the Stream implementation and tryNext, but otherwise I don't think that should require too much new code. One issue is that we can't name the method try_next, as this would clash with the methods provided by TryStreamExt. Maybe something like next_in_batch? I can't really think of a good name, to be honest.

Implementing a tryNext equivalent on GenericCursor should make the tracking of the resume token at the ChangeStream level much easier too, since I don't think you'd need to introduce any new intermediate types.
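
A tiny synchronous model of the visibility problem described above (everything here is illustrative; String stands in for documents and resume tokens):

```rust
struct ToyChangeStream {
    batches: Vec<(Vec<String>, Option<String>)>, // (documents, post-batch token)
    cached_token: Option<String>,
}

impl ToyChangeStream {
    /// tryNext-style: at a batch boundary, cache the token and return None so
    /// the caller gets a chance to read it via resume_token().
    fn next_in_batch(&mut self) -> Option<String> {
        let (docs, token) = self.batches.first_mut()?;
        if docs.is_empty() {
            self.cached_token = token.take();
            self.batches.remove(0);
            None
        } else {
            Some(docs.remove(0))
        }
    }

    /// Stream-style next: keeps draining batches until a document appears, so
    /// the caller never observes the tokens cached at empty batches in between.
    fn next(&mut self) -> Option<String> {
        while !self.batches.is_empty() {
            if let Some(doc) = self.next_in_batch() {
                return Some(doc);
            }
        }
        None
    }

    fn resume_token(&self) -> Option<&String> {
        self.cached_token.as_ref()
    }
}
```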

abr-egn (author):

Updated! I also realized that I hadn't implemented any of this for the session types, so added that in, and refactored things a bit to be able to share the bulk of the implementation.

abr-egn force-pushed the RUST-521/change-stream-stream branch from 31189e5 to 67b420f on December 13, 2021 18:06
}
}

pub(crate) struct NextInBatchFuture<'a, T>(&'a mut T);
abr-egn (author):

It seems awkward to need a one-off struct to go from the desugared fn() -> Poll<T> form back to .await, but I guess this is the way until poll_fn stabilizes.
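
(std::future::poll_fn has since been stabilized, in Rust 1.64.) For reference, the idiom in stand-alone toy form: a one-off struct holding a mutable borrow whose Future impl just calls the poll-style method, which is roughly what the comment describes NextInBatchFuture doing:

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};

struct Counter(u32);

impl Counter {
    // The desugared, poll-style form of "produce the next value".
    fn poll_next_value(&mut self, _cx: &mut Context<'_>) -> Poll<u32> {
        self.0 += 1;
        Poll::Ready(self.0)
    }
}

// One-off future wrapping a mutable borrow, analogous to NextInBatchFuture<'a, T>.
struct NextValueFuture<'a>(&'a mut Counter);

impl<'a> Future for NextValueFuture<'a> {
    type Output = u32;

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<u32> {
        self.0.poll_next_value(cx)
    }
}

async fn next_value(counter: &mut Counter) -> u32 {
    // Back in .await-land without poll_fn.
    NextValueFuture(counter).await
}
```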

info: self.info,
pinned_connection: self.pinned_connection,
_phantom: Default::default(),
}
}
}

impl<P, T> Stream for GenericCursor<P, T>
pub(crate) trait CursorStream {
abr-egn (author):

This trait allows the implementation of poll_next to be shared via the stream_poll_next fn; open to suggestions for a more descriptive name.
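
Roughly, the sharing pattern looks like this stand-alone sketch; the names mirror the comment (CursorStream, stream_poll_next), but the bodies and the BatchValue variants are assumptions rather than the driver's code:

```rust
use std::pin::Pin;
use std::task::{Context, Poll};

use futures_core::Stream;

// Assumed shape of the internal batch result.
enum BatchValue<T> {
    Some(T),   // a document from the current batch
    Empty,     // batch boundary (a resume token may have been cached)
    Exhausted, // the cursor is closed
}

trait CursorStream {
    type Doc;
    fn poll_next_in_batch(&mut self, cx: &mut Context<'_>) -> Poll<BatchValue<Self::Doc>>;
}

// The shared body of Stream::poll_next: keep polling past batch boundaries
// until a document arrives or the cursor is exhausted.
fn stream_poll_next<C: CursorStream>(this: &mut C, cx: &mut Context<'_>) -> Poll<Option<C::Doc>> {
    loop {
        match this.poll_next_in_batch(cx) {
            Poll::Ready(BatchValue::Some(doc)) => return Poll::Ready(Some(doc)),
            Poll::Ready(BatchValue::Empty) => continue,
            Poll::Ready(BatchValue::Exhausted) => return Poll::Ready(None),
            Poll::Pending => return Poll::Pending,
        }
    }
}

// Each cursor type (e.g. Cursor and SessionCursor in the driver) then gets a
// one-line Stream impl; a unit struct stands in for them here.
struct CursorSketch;

impl CursorStream for CursorSketch {
    type Doc = i32;
    fn poll_next_in_batch(&mut self, _cx: &mut Context<'_>) -> Poll<BatchValue<i32>> {
        Poll::Ready(BatchValue::Exhausted)
    }
}

impl Stream for CursorSketch {
    type Item = i32;
    fn poll_next(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Option<i32>> {
        stream_poll_next(self.get_mut(), cx)
    }
}
```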

impl<P, T> Stream for GenericCursor<P, T>
pub(crate) trait CursorStream {
fn poll_next_in_batch(&mut self, cx: &mut Context<'_>) -> Poll<Result<BatchValue>>;
abr-egn (author):

I used the next_in_batch suggestion for internal methods but named the external one next_if_any, since the batching behavior is not the main point of the method from a user's perspective.

isabelatkinson (Contributor) left a comment:

just one small suggestion! adding my LGTM now so as not to block this while I'm gone

pub struct ResumeToken(pub(crate) RawBson);

impl ResumeToken {
pub(crate) fn initial(
Contributor:

nit: can we update this so that both values don't always need to be created? e.g.

match spec.post_batch_resume_token {
    Some(token) if spec.initial_buffer.is_empty() => token,
    _ => // token from options
}

abr-egn (author):

That's much nicer, thank you!

patrickfreed (Contributor) left a comment:

looks good! just have one suggestion

/// # let coll = client.database("foo").collection("bar");
/// let mut change_stream = coll.watch(None, None).await?;
/// let mut resume_token = None;
/// loop {
Contributor:

this makes me think we'll need to include an is_alive method or something like that to allow this loop to terminate in the event the stream is closed.

abr-egn (author):

Ah, good thought, done.
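
With both changes in, a hedged sketch of how the doc example above might read end to end (based on the methods discussed in this PR; the merged example may differ in its details):

```rust
use mongodb::{bson::Document, error::Result, Client};

async fn tail_changes(client: &Client) -> Result<()> {
    let coll = client.database("foo").collection::<Document>("bar");
    let mut change_stream = coll.watch(None, None).await?;
    let mut resume_token = None;
    while change_stream.is_alive() {
        if let Some(event) = change_stream.next_if_any().await? {
            // process the event
            let _ = event;
        }
        // Cache the latest token, including ones surfaced at empty batches.
        resume_token = change_stream.resume_token();
    }
    // resume_token can be persisted here and used to resume the stream later.
    let _ = resume_token;
    Ok(())
}
```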

patrickfreed (Contributor) left a comment:

LGTM!

abr-egn merged commit 7209033 into mongodb:master on Dec 15, 2021