RUST-580 Add tracing integration and command tracing events #757


Merged (24 commits) on Oct 20, 2022

Conversation

@kmahar (Contributor) commented Oct 5, 2022:

This is the Rust implementation of the changes detailed in mongodb/specifications#1303. Various comments inline.

@@ -1,6 +1,7 @@
#!/bin/bash

set -o errexit
set -o xtrace
kmahar (Contributor Author):

A couple of changes to this script:

  • Switch to using the feature-combinations.sh script to produce the feature combinations to test. As a side effect, we no longer have separate compilation tasks based on runtime; there is just one task, and between all the values in the feature-combinations list we get coverage of both tokio and async-std.
  • Enable xtrace, which seemed useful so that when there is a failure you can tell which combination of feature flags actually caused it.

@@ -548,7 +548,7 @@ functions:
working_dir: "src"
script: |
${PREPARE_SHELL}
ASYNC_RUNTIME=${ASYNC_RUNTIME} RUST_VERSION=${RUST_VERSION} MSRV=${MSRV} .evergreen/compile-only.sh
RUST_VERSION=${RUST_VERSION} MSRV=${MSRV} .evergreen/compile-only.sh
kmahar (Contributor Author):

Related to the changes in the compile-only.sh script: we no longer need to pass a runtime here, and we no longer need to include the async runtime in the matrix definition below.

@@ -10,7 +10,7 @@ if [ "$SINGLE_THREAD" = true ]; then
OPTIONS="$OPTIONS --test-threads=1"
fi

FEATURE_FLAGS="zstd-compression,snappy-compression,zlib-compression,${TLS_FEATURE}"
FEATURE_FLAGS="zstd-compression,snappy-compression,zlib-compression,tracing-unstable,${TLS_FEATURE}"
kmahar (Contributor Author):

I initially planned to introduce separate tasks for testing tracing, to avoid any potential extra overhead from having a tracing subscriber registered in every test. However, I tried just running the tracing tests alongside the other ones, and the total runtime for a full patch seemed to fall within the usual bounds: for the last 10 commits to main, patches took anywhere from 14 days, 7 hours to 16 days, 2 hours of total compute time, and both of mine took around 15 days, 8 hours.

So there does not seem to be a ton of overhead. Also, if we split out the tracing tasks, we would need to separately install rustup, cargo, and the driver's dependencies for even more tasks, which would probably cancel out any speedup in the other tests.

@@ -70,6 +70,11 @@ snappy-compression = ["snap"]
# DO NOT USE; see https://jira.mongodb.org/browse/RUST-569 for the status of CSFLE support in the Rust driver.
csfle = ["mongocrypt", "rayon"]

# DO NOT USE; see https://jira.mongodb.org/browse/RUST-580 for the status of tracing/logging support in the Rust driver.
# Enables support for emitting tracing events.
# TODO: pending https://github.com/tokio-rs/tracing/issues/2036 stop depending directly on log.
kmahar (Contributor Author):

If this doesn't end up landing before we release, I don't think it's a huge deal: we already transitively depend on log via a bunch of our dependencies, so we're not adding anything extra to compile here.

@@ -40,7 +40,7 @@ pub type Result<T> = std::result::Result<T, Error>;
/// [`ErrorKind`](enum.ErrorKind.html) is wrapped in an `Arc` to allow the errors to be
/// cloned.
#[derive(Clone, Debug, Error)]
#[error("{kind}")]
#[error("Kind: {kind}, labels: {labels:?}")]
kmahar (Contributor Author):

Previously we omitted the labels from the Display implementation, so I added them here. I currently use the Display impls to generate the string representations of errors to attach to "command failed" events. (Tracing events only support attaching a few primitive types, like strings and various numbers.)

Otherwise, I wanted to avoid going down a rabbit hole of hashing out the ideal string representation for every error type in the driver, so I just left those as-is. Since they are just string representations, I don't think there are any semver concerns, and if we find issues with them in the future we could improve them as we go.
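(For illustration, a minimal sketch of what this Display format produces, using a hypothetical, simplified stand-in for the driver's error type rather than its actual definition:)

use std::collections::HashSet;
use std::sync::Arc;
use thiserror::Error;

// Hypothetical, simplified stand-in for the driver's Error type, just to show
// the Display output format; the real ErrorKind has many more variants.
#[derive(Clone, Debug, Error)]
#[error("Kind: {kind}, labels: {labels:?}")]
struct Error {
    kind: Arc<ErrorKind>,
    labels: HashSet<String>,
}

#[derive(Debug, Error)]
enum ErrorKind {
    #[error("I/O error: {0}")]
    Io(String),
}

fn main() {
    let error = Error {
        kind: Arc::new(ErrorKind::Io("connection reset".to_string())),
        labels: ["RetryableWriteError".to_string()].into_iter().collect(),
    };
    // Prints: Kind: I/O error: connection reset, labels: {"RetryableWriteError"}
    // This is the kind of string that gets attached to a "command failed" event.
    println!("{}", error);
}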

Contributor:

SGTM. I have seen other Rust devs expressing the opinion that Display impls should be stable and Debug doesn't need to be, but I don't think I really agree with that personally either.

kmahar (Contributor Author):

yeah, it seems weird to me that any string representation besides e.g. a JSON one would be stable 🤔

receiver: tokio::sync::broadcast::Receiver<Event>,
}

impl EventSubscriber<'_> {
kmahar (Contributor Author):

This type has been made generic over the handler and event types and moved into src/test/util/subscriber.rs so we can reuse it for tracing events as well. wait_for_successful_command_execution is only meaningful for an EventSubscriber<'_, EventHandler, Event>, though, so I left that method here.
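(Roughly, the generic shape described here looks like the following sketch; the field and method names are illustrative, not necessarily the exact definition in src/test/util/subscriber.rs:)

use std::time::Duration;
use tokio::sync::broadcast;

// Illustrative sketch only: a subscriber generic over the handler type H and
// the event type E, so the same machinery can serve command events, SDAM
// events, and tracing events.
pub(crate) struct EventSubscriber<'a, H, E: Clone> {
    // Kept so the handler (and its broadcast sender) outlives the subscriber.
    _handler: &'a H,
    receiver: broadcast::Receiver<E>,
}

impl<'a, H, E: Clone> EventSubscriber<'a, H, E> {
    pub(crate) fn new(handler: &'a H, receiver: broadcast::Receiver<E>) -> Self {
        Self { _handler: handler, receiver }
    }

    // Waits up to `timeout` for an event matching `filter`, returning it if found.
    pub(crate) async fn wait_for_event<F>(&mut self, timeout: Duration, filter: F) -> Option<E>
    where
        F: Fn(&E) -> bool,
    {
        tokio::time::timeout(timeout, async {
            while let Ok(event) = self.receiver.recv().await {
                if filter(&event) {
                    return Some(event);
                }
            }
            None
        })
        .await
        .unwrap_or(None)
    }
}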

Contributor:

nice!

// TODO: subject to change based on what exact version of truncation we decide to go with.
/// Truncates the given string at the closest UTF-8 character boundary >= the provided length.
/// If the new length is >= the current length, does nothing.
pub(crate) fn truncate_on_char_boundary(s: &mut String, new_len: usize) {
kmahar (Contributor Author):

I know there was some discussion about this on my WIP PR a few months ago, and a desire to avoid rolling our own truncation logic.

The spec does provide a lot of flexibility in how truncation is implemented:

Drivers MUST support configuring the maximum logged length for extended JSON documents in log messages.
The unit here is flexible and can be bytes, Unicode code points, code units, or graphemes, depending on
what a driver is able to accomplish with its language's string APIs. The default max length is 1000 of
whichever unit is selected.
If the chosen unit is anything other than a Unicode code point, the driver MUST ensure that it
gracefully handles cases where the truncation length falls mid code point, by either rounding the length
up or down to the closest code point boundary or using the Unicode replacement character, to avoid
producing invalid Unicode data.
Truncated extended JSON MUST have a trailing ellipsis ... appended to indicate to the user that
truncation occurred. The ellipsis MUST NOT count toward the max length.

But honestly, I'm not sure this flexibility makes things any easier for us, or that we can really do better until ceil_char_boundary lands.

What I have here is the "byte" approach: basically, we find the first index >= the max length that is a char boundary (and therefore safe to truncate on), and call truncate.

There's no way to determine/iterate over code points that I could find. And for graphemes, there is the unicode-segmentation crate, but from what I understand, using the APIs provided there would require splitting every string up into its individual graphemes and then merging it back together, which seems very inefficient.

So for now I've left this and tried to cover various edge cases in the tracing_truncation test, but I am open to opinions and other ideas for what to do here.
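(For concreteness, a minimal sketch of the byte-based approach described above, built on the standard library's is_char_boundary; it may differ in small ways from the exact code in the PR:)

/// Truncates `s` at the closest UTF-8 character boundary >= `new_len`.
/// If `new_len` >= the current length, does nothing.
pub(crate) fn truncate_on_char_boundary(s: &mut String, new_len: usize) {
    if new_len >= s.len() {
        return;
    }
    // Walk forward from new_len to the first index that is a char boundary;
    // s.len() is always a boundary, so this loop terminates.
    let mut boundary = new_len;
    while !s.is_char_boundary(boundary) {
        boundary += 1;
    }
    s.truncate(boundary);
}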

Contributor:

Sounds good to me to keep this for now. Can we file a RUST ticket to track the ceil_char_boundary portion?

Also, if we did roll our own extJSON serializer that did support graceful truncation, would the spec allow for that? I think it would require a "truncated": true field or something to be added to the JSON instead of an ellipsis to ensure the JSON is valid.

kmahar (Contributor Author):

Filed RUST-1496 and added a reference to it in the TODO here.

We didn't mention anything along those lines in the spec; I think it's implied (though not explicitly stated) that drivers do truncation the "easy way". We chose this route because, IIRC, it was non-trivial to design and implement JSON truncation for the server logs, and I think Jeremy mentioned there have been some bugs in their truncation algorithm as well. So we didn't want to require drivers to do anything that fancy given the complexity, and also given that it's not clear how valuable it would be. For the server, where the entire log message is JSON, it makes sense that they handled truncation off the bat, because the messages might be useless if not parseable. But since this only applies to a command or reply within a log message, and we're not sure there are many use cases for parsing those, it's less clear whether producing valid JSON is worth the effort. As one data point, Java does the "dumb" truncation for their commands/replies, and I don't believe anyone has requested a smarter approach. There is also an escape hatch for anyone who actually needs to see the full untruncated command: crank up the max document size.

If we end up having users ask for truncated yet parseable JSON, it would be additive to permit that in the spec. But I'd rather not permit it off the bat, as we'd need to e.g. create additional prose tests and consider questions like whether we need to standardize the truncation algorithm (e.g. given the document {"a": {"b": 1 }, "c": 2 }, is it better to keep both top-level keys but truncate the subdocument, or cut out "c" but keep "a"'s full value?), etc.

All that said, I think it's probably worth making the "dumb" truncation a requirement rather than an implication, and adding a rationale section to the spec explaining what I just wrote here, so I will do that.

Contributor:

Sounds good, thanks for another clear + detailed explanation. Java not having any complaints seems like pretty good signal that the "dumb" approach should be good for us 👍

@kmahar marked this pull request as ready for review October 5, 2022 19:30
@kmahar requested review from a team and patrickfreed and removed request for a team October 5, 2022 19:30
@patrickfreed (Contributor) left a comment:

Looks great! I only have some minor suggestions about docstrings / comments and a few questions

@kmahar requested a review from patrickfreed October 11, 2022 23:46
@patrickfreed (Contributor) left a comment:

Looks great! I only have a few really minor comments / questions, so going to tag in other reviews so they can get started too.

src/trace.rs Outdated

/// Emits a tracing event at debug level. In tests, the provided client ID will be added as a field to
/// each event to enable distinguishing which client produced the event. Outside of tests, the ID
/// will be ignored.
Contributor:

I wonder if including the client ID in non-test scenarios would actually be useful too, since it would allow a user to distinguish between clients if they happen to use multiple in their application. I feel like we may have discussed this before (and I apologize if that is the case!), but is there a reason why we wouldn't want this in the actual driver?

kmahar (Contributor Author):

I think we did talk about this a while back but never actually reached a conclusion, so no worries! I agree a client ID seems like it would have some general usefulness. It would also make it easy to spot from the logs cases where a user isn't properly caching clients, since there would be multiple IDs showing up over time.

I vaguely recall talking about requiring a client ID at some point in the spec process; we decided against requiring it since drivers varied widely in whether they had an ID type at all, whether it was public, and what type it was. That said, the spec doesn't forbid us from adding extra driver-specific info to log messages, so we could go ahead and add one if we want. I think we may as well do it now if we want to, to save adding the test-specific code. Two questions:

  1. Would we want to make this public API / expose it on Client, or only include it in logs?
  2. In Swift, we just have an atomic counter we use to generate integer IDs (see the sketch below); do you think that makes sense here too, or is there some other way that would be better/more idiomatic to implement it?
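(For reference, a minimal sketch of the counter-based approach mentioned in question 2; this is purely illustrative and, per the follow-up below, not what was ultimately adopted:)

use std::sync::atomic::{AtomicU64, Ordering};

// Process-wide counter handing out small integer client IDs, one per Client.
static NEXT_CLIENT_ID: AtomicU64 = AtomicU64::new(0);

fn next_client_id() -> u64 {
    NEXT_CLIENT_ID.fetch_add(1, Ordering::SeqCst)
}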

Contributor:

  1. I don't think it's strictly necessary to include in the public API right now, though it could definitely be useful. I'd lean towards leaving it out for now in case something like this gets stabilized across drivers.
  2. An ObjectId / UUID would also work fine and not require synchronization across client instances. As an aside, I wonder if we could use the topology ID for this? We already publish it for TopologyOpening events, and AFAIK there is only one generated per client's lifetime.

kmahar (Contributor Author) commented Oct 17, 2022:

Good idea re: the topology ID. It does appear to be generated just once, when the Topology is initialized. I refactored to use that instead, and to add it to events unconditionally (i.e. not based on whether we are in tests).

I named the field topologyId because I figured that's the name we'll go with when including it in SDAM tracing events, and this will make it more obvious that it's the same type of ID and can be used to correlate SDAM events with other events.
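(For illustration, attaching the topology ID to a tracing event might look roughly like this; the target, field names, and message are assumptions for the sketch, not the PR's exact code:)

use bson::oid::ObjectId;

// Illustrative only: emit a "command started" tracing event carrying the
// topology ID so it can be correlated with SDAM events from the same client.
fn emit_command_started(topology_id: ObjectId, command_name: &str, request_id: i32) {
    tracing::debug!(
        target: "mongodb::command",
        topologyId = %topology_id,
        commandName = command_name,
        requestId = request_id,
        "Command started"
    );
}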

Contributor:

Given that we're using topology ID now (which every driver should have), would that be something to consider upstreaming into the spec as a requirement?

kmahar (Contributor Author):

Good question; I started a thread in the spec channel about this.

@kmahar requested a review from patrickfreed October 14, 2022 17:58
TracingEventValue::String(failure_str) => {
// lazy_static saves us having to recompile this regex every time this
// function is called.
lazy_static! {
kmahar (Contributor Author):

As an aside, I had assumed that a lazy_static nested like this would get re-evaluated on every function call, but this is explicitly recommended in the regex crate docs as a way to avoid re-compiling regexes within helper functions that use them: https://docs.rs/regex/latest/regex/index.html#example-avoid-compiling-the-same-regex-in-a-loop
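(A minimal sketch of the pattern; the regex and function name here are made up for illustration:)

use lazy_static::lazy_static;
use regex::Regex;

// The static inside the function body is initialized exactly once, on first
// call, rather than recompiling the regex on every invocation.
fn is_command_failed(failure_str: &str) -> bool {
    lazy_static! {
        static ref COMMAND_FAILED_REGEX: Regex =
            Regex::new(r"^Command failed").unwrap();
    }
    COMMAND_FAILED_REGEX.is_match(failure_str)
}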

Contributor:

That's definitely going in the toolbox 🙂

Might be worth leaving a comment at the error definitions that if the text format changes these regexes will need to change as well.

kmahar (Contributor Author):

Added comments in the relevant places for command errors and I/O errors 👍

@abr-egn (Contributor) left a comment:

Looks good, just a few smallish comments.

@@ -144,3 +144,19 @@ pub trait CommandEventHandler: Send + Sync {
/// whenever a database command fails to complete successfully.
fn handle_command_failed_event(&self, _event: CommandFailedEvent) {}
}

#[derive(Clone)]
pub(crate) enum CommandEvent {
Contributor:

It looks like we now have mirrored event enums for both CommandEvent and SdamEvent in src/test/util/event.rs - might be worth folding these together.

kmahar (Contributor Author):

Good point. I've combined the test CommandEvent type into this one; I'll tackle combining the SDAM ones when adding SDAM logging.
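(The folded-together enum has roughly the shape sketched below; inside the driver it lives alongside the public event types rather than importing them from the mongodb crate, so this is illustrative only:)

use mongodb::event::command::{
    CommandFailedEvent, CommandStartedEvent, CommandSucceededEvent,
};

// One enum covering all command monitoring events, usable both for tracing
// and for the test event machinery.
#[derive(Clone)]
pub(crate) enum CommandEvent {
    Started(CommandStartedEvent),
    Succeeded(CommandSucceededEvent),
    Failed(CommandFailedEvent),
}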

// are command errors (which should be redacted) and
// I/O errors (which should not be redacted). redaction of other errors is
// unit-tested in src/test/spec/trace.rs.
if COMMAND_FAILED_REGEX.is_match(failure_str) {
Contributor:

This can avoid the unwrap by being if let Some(captures) = COMMAND_FAILED_REGEX.captures(failure_str)

kmahar (Contributor Author):

Nice; updated here and for the similar case below with IO_ERROR_REGEX.
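(A sketch of the suggested shape; the pattern and capture group here are illustrative, not the driver's actual regex:)

use lazy_static::lazy_static;
use regex::Regex;

// `captures` both checks for a match and yields the capture groups, so there
// is no separate `is_match` check followed by an `unwrap`.
fn parse_error_code(failure_str: &str) -> Option<i32> {
    lazy_static! {
        static ref COMMAND_FAILED_REGEX: Regex =
            Regex::new(r"code: (?P<code>\d+)").unwrap();
    }
    if let Some(captures) = COMMAND_FAILED_REGEX.captures(failure_str) {
        captures.name("code")?.as_str().parse().ok()
    } else {
        None
    }
}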

@kmahar requested a review from abr-egn October 18, 2022 19:54
@patrickfreed (Contributor) left a comment:

LGTM!

@isabelatkinson (Contributor) left a comment:

looks great! just a few very small comments

/// If truncation of a document at the exact specified length would occur in the middle
/// of a Unicode codepoint, the document will be truncated at the closest larger length
/// which falls on a boundary between codepoints.
/// Note that in cases where truncation occurs the output will not be valid JSON.
Contributor:

Suggested change:
- /// Note that in cases where truncation occurs the output will not be valid JSON.
+ /// Note that in cases where truncation occurs the output may not be valid JSON.

Contributor:

Actually, never mind, I got mixed up here; "will" is the right word to use.

src/test/mod.rs Outdated
/// they are registered in.
/// By default this handler will collect no tracing events.
/// Its minimum severity levels can be configured on a per-component basis using
/// [`TracingHandler:set_levels`]. The test lock MUST be acquired in any test
Contributor:

Does it need to be acquired exclusively? I think we should call that out here if so.

kmahar (Contributor Author):

yes; updated

@@ -164,6 +168,9 @@ pub(crate) struct Client {
#[serde(default, deserialize_with = "deserialize_server_api_test_format")]
pub(crate) server_api: Option<ServerApi>,
pub(crate) store_events_as_entities: Option<Vec<StoreEventsAsEntity>>,
#[cfg(feature = "tracing-unstable")]
#[serde(default, deserialize_with = "deserialize_tracing_level_map")]
pub observe_log_messages: Option<HashMap<String, tracing::Level>>,
Contributor:

Suggested change:
- pub observe_log_messages: Option<HashMap<String, tracing::Level>>,
+ pub(crate) observe_log_messages: Option<HashMap<String, tracing::Level>>,

(small nit: I think we decided on using pub(crate) for internal test stuff)

kmahar (Contributor Author):

oh yeah - updated

@kmahar requested a review from isabelatkinson October 19, 2022 21:50
@kmahar merged commit 6164660 into mongodb:main Oct 20, 2022
@kmahar deleted the RUST-580/logging-commands branch October 20, 2022 15:04