RUST-580 Add tracing integration and command tracing events #757


Merged (24 commits) on Oct 20, 2022

Conversation

@kmahar (Contributor) commented Oct 5, 2022:

This is the Rust implementation of the changes detailed in mongodb/specifications#1303. Various comments inline.

@@ -1,6 +1,7 @@
#!/bin/bash

set -o errexit
set -o xtrace
kmahar (Contributor Author):

A couple of changes to this script:

  • Switch to using the feature-combinations.sh script to produce the feature combinations to test. As a side effect, we no longer have separate compilation tasks based on runtime; there is just one task, and between all the values in the feature-combinations list we get coverage of both tokio and async-std.
  • Enable xtrace, which seemed useful so that when there is a failure you can tell which combination of feature flags actually caused it.

@@ -548,7 +548,7 @@ functions:
working_dir: "src"
script: |
${PREPARE_SHELL}
ASYNC_RUNTIME=${ASYNC_RUNTIME} RUST_VERSION=${RUST_VERSION} MSRV=${MSRV} .evergreen/compile-only.sh
RUST_VERSION=${RUST_VERSION} MSRV=${MSRV} .evergreen/compile-only.sh
kmahar (Contributor Author):

Related to the changes in the compile-only.sh script: we no longer need to pass a runtime here, and we no longer need to include the async runtime in the matrix definition below.

@@ -10,7 +10,7 @@ if [ "$SINGLE_THREAD" = true ]; then
OPTIONS="$OPTIONS --test-threads=1"
fi

FEATURE_FLAGS="zstd-compression,snappy-compression,zlib-compression,${TLS_FEATURE}"
FEATURE_FLAGS="zstd-compression,snappy-compression,zlib-compression,tracing-unstable,${TLS_FEATURE}"
kmahar (Contributor Author):

I initially planned to introduce separate tasks for testing tracing, to avoid any potential extra overhead from having a tracing subscriber registered in every test. However, I tried just running the tracing tests alongside the other ones, and the total runtime for a full patch seemed to fall within the usual bounds: for the last 10 commits to main, patches took anywhere from 14 days, 7 hours to 16 days, 2 hours of total compute time, and both of mine took around 15 days, 8 hours.

So there does not seem to be a ton of overhead. Also, if we split out the tracing tasks, we would need to separately install rustup, cargo, and the driver's dependencies for even more tasks, which would probably cancel out any speedup in the other tests.

@@ -70,6 +70,11 @@ snappy-compression = ["snap"]
# DO NOT USE; see https://jira.mongodb.org/browse/RUST-569 for the status of CSFLE support in the Rust driver.
csfle = ["mongocrypt", "rayon"]

# DO NOT USE; see https://jira.mongodb.org/browse/RUST-580 for the status of tracing/logging support in the Rust driver.
# Enables support for emitting tracing events.
# TODO: pending https://github.com/tokio-rs/tracing/issues/2036 stop depending directly on log.
kmahar (Contributor Author):

If this doesn't end up landing before we release, I don't think it's a huge deal: we already transitively depend on log via a bunch of our dependencies, so we're not adding anything extra to compile here.

@@ -40,7 +40,7 @@ pub type Result<T> = std::result::Result<T, Error>;
/// [`ErrorKind`](enum.ErrorKind.html) is wrapped in an `Arc` to allow the errors to be
/// cloned.
#[derive(Clone, Debug, Error)]
#[error("{kind}")]
#[error("Kind: {kind}, labels: {labels:?}")]
kmahar (Contributor Author):

Previously we omitted the labels from the Display implementation, so I added them here. I currently use the Display impls to generate the string representations of errors to attach to "command failed" events. (Tracing events only support attaching a few primitive types, like strings and various numbers.)

Otherwise, I wanted to avoid going down a rabbit hole of hashing out the ideal string representation for every error type in the driver, so I just left those as-is. Since they are just string representations, I don't think there are any semver concerns, and if we find issues with them in the future we could improve them as we go.
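(For illustration, a minimal sketch of what this Display format produces, using a hypothetical, simplified stand-in for the driver's error type rather than its actual definition:)

use std::collections::HashSet;
use std::sync::Arc;
use thiserror::Error;

// Hypothetical, simplified stand-in for the driver's Error type, just to show
// the Display output format; the real ErrorKind has many more variants.
#[derive(Clone, Debug, Error)]
#[error("Kind: {kind}, labels: {labels:?}")]
struct Error {
    kind: Arc<ErrorKind>,
    labels: HashSet<String>,
}

#[derive(Debug, Error)]
enum ErrorKind {
    #[error("I/O error: {0}")]
    Io(String),
}

fn main() {
    let error = Error {
        kind: Arc::new(ErrorKind::Io("connection reset".to_string())),
        labels: ["RetryableWriteError".to_string()].into_iter().collect(),
    };
    // Prints: Kind: I/O error: connection reset, labels: {"RetryableWriteError"}
    // This is the kind of string that gets attached to a "command failed" event.
    println!("{}", error);
}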

Contributor:

SGTM. I have seen other Rust devs expressing the opinion that Display impls should be stable and Debug doesn't need to be, but I don't think I really agree with that personally either.

kmahar (Contributor Author):

yeah, it seems weird to me that any string representation besides e.g. a JSON one would be stable 🤔

receiver: tokio::sync::broadcast::Receiver<Event>,
}

impl EventSubscriber<'_> {
kmahar (Contributor Author):

This type has been made generic over the handler and event types and moved into src/test/util/subscriber.rs so we can reuse it for tracing events as well. wait_for_successful_command_execution is only meaningful for an EventSubscriber<'_, EventHandler, Event>, though, so I left that method here.
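(Roughly, the generic shape described here looks like the following sketch; the field and method names are illustrative, not necessarily the exact definition in src/test/util/subscriber.rs:)

use std::time::Duration;
use tokio::sync::broadcast;

// Illustrative sketch only: a subscriber generic over the handler type H and
// the event type E, so the same machinery can serve command events, SDAM
// events, and tracing events.
pub(crate) struct EventSubscriber<'a, H, E: Clone> {
    // Kept so the handler (and its broadcast sender) outlives the subscriber.
    _handler: &'a H,
    receiver: broadcast::Receiver<E>,
}

impl<'a, H, E: Clone> EventSubscriber<'a, H, E> {
    pub(crate) fn new(handler: &'a H, receiver: broadcast::Receiver<E>) -> Self {
        Self { _handler: handler, receiver }
    }

    // Waits up to `timeout` for an event matching `filter`, returning it if found.
    pub(crate) async fn wait_for_event<F>(&mut self, timeout: Duration, filter: F) -> Option<E>
    where
        F: Fn(&E) -> bool,
    {
        tokio::time::timeout(timeout, async {
            while let Ok(event) = self.receiver.recv().await {
                if filter(&event) {
                    return Some(event);
                }
            }
            None
        })
        .await
        .unwrap_or(None)
    }
}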

Contributor:

nice!

// TODO: subject to change based on what exact version of truncation we decide to go with.
/// Truncates the given string at the closest UTF-8 character boundary >= the provided length.
/// If the new length is >= the current length, does nothing.
pub(crate) fn truncate_on_char_boundary(s: &mut String, new_len: usize) {
kmahar (Contributor Author):

I know there was some discussion about this on my WIP PR a few months ago, and a desire to avoid rolling our own truncation logic.

The spec does provide a lot of flexibility in how truncation is implemented:

Drivers MUST support configuring the maximum logged length for extended JSON documents in log messages.
The unit here is flexible and can be bytes, Unicode code points, code units, or graphemes, depending on
what a driver is able to accomplish with its language's string APIs. The default max length is 1000 of
whichever unit is selected.
If the chosen unit is anything other than a Unicode code point, the driver MUST ensure that it
gracefully handles cases where the truncation length falls mid code point, by either rounding the length
up or down to the closest code point boundary or using the Unicode replacement character, to avoid
producing invalid Unicode data.
Truncated extended JSON MUST have a trailing ellipsis ... appended to indicate to the user that
truncation occurred. The ellipsis MUST NOT count toward the max length.

But honestly, I'm not sure this flexibility makes things any easier for us, or that we can really do better until ceil_char_boundary lands.

What I have here is the "byte" approach: basically, we find the first index >= the max length that is a char boundary (and therefore safe to truncate on), and call truncate.

There's no way to determine/iterate over code points that I could find. And for graphemes, there is the unicode-segmentation crate, but from what I understand, using the APIs provided there would require splitting every string up into its individual graphemes and then merging it back together, which seems very inefficient.

So for now I've left this and tried to cover various edge cases in the tracing_truncation test, but I am open to opinions and other ideas for what to do here.
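(For concreteness, a minimal sketch of the byte-based approach described above, built on the standard library's is_char_boundary; it may differ in small ways from the exact code in the PR:)

/// Truncates `s` at the closest UTF-8 character boundary >= `new_len`.
/// If `new_len` >= the current length, does nothing.
pub(crate) fn truncate_on_char_boundary(s: &mut String, new_len: usize) {
    if new_len >= s.len() {
        return;
    }
    // Walk forward from new_len to the first index that is a char boundary;
    // s.len() is always a boundary, so this loop terminates.
    let mut boundary = new_len;
    while !s.is_char_boundary(boundary) {
        boundary += 1;
    }
    s.truncate(boundary);
}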

Contributor:

Sounds good to me to keep this for now. Can we file a RUST ticket to track the ceil_char_boundary portion?

Also, if we did roll our own extJSON serializer that did support graceful truncation, would the spec allow for that? I think it would require a "truncated": true field or something to be added to the JSON instead of an ellipsis to ensure the JSON is valid.

kmahar (Contributor Author):

Filed RUST-1496 and added a reference to it in the TODO here.

We didn't mention anything along those lines in the spec; I think it's implied (though not explicitly stated) that drivers do truncation the "easy way". We chose this route because, IIRC, it was non-trivial to design and implement JSON truncation for the server logs, and I think Jeremy mentioned there have been some bugs in their truncation algorithm as well. So we didn't want to require drivers to do anything that fancy given the complexity, and also given that it's not clear how valuable it would be. For the server, where the entire log message is JSON, it makes sense that they handled truncation off the bat, because the messages might be useless if not parseable. But since this only applies to a command or reply within a log message, and we're not sure there are many use cases for parsing those, it's less clear whether producing valid JSON is worth the effort. As one data point, Java does the "dumb" truncation for their commands/replies, and I don't believe anyone has requested a smarter approach. There is also an escape hatch for anyone who actually needs to see the full untruncated command: crank up the max document size.

If we end up having users ask for truncated yet parseable JSON, it would be additive to permit that in the spec. But I'd rather not permit it off the bat, as we'd need to e.g. create additional prose tests and consider questions like whether we need to standardize the truncation algorithm (e.g. given the document {"a": {"b": 1 }, "c": 2 }, is it better to keep both top-level keys but truncate the subdocument, or cut out "c" but keep "a"'s full value?), etc.

All that said, I think it's probably worth making the "dumb" truncation a requirement rather than an implication, and adding a rationale section to the spec explaining what I just wrote here, so I will do that.

Contributor:

Sounds good, thanks for another clear + detailed explanation. Java not having any complaints seems like pretty good signal that the "dumb" approach should be good for us 👍

@kmahar marked this pull request as ready for review October 5, 2022 19:30
@kmahar requested review from a team and patrickfreed and removed request for a team October 5, 2022 19:30
@patrickfreed (Contributor) left a comment:

Looks great! I only have some minor suggestions about docstrings / comments and a few questions

@kmahar requested a review from patrickfreed October 11, 2022 23:46
@patrickfreed (Contributor) left a comment:

Looks great! I only have a few really minor comments / questions, so going to tag in other reviews so they can get started too.

src/trace.rs Outdated

/// Emits a tracing event at debug level. In tests, the provided client ID will be added as a field to
/// each event to enable distinguishing which client produced the event. Outside of tests, the ID
/// will be ignored.
Contributor:

I wonder if including the client ID in non-test scenarios would actually be useful too, since it would allow a user to distinguish between clients if they happen to use multiple in their application. I feel like we may have discussed this before (and I apologize if that is the case!), but is there a reason why we wouldn't want this in the actual driver?

kmahar (Contributor Author):

I think we did talk about this a while back but never actually reached a conclusion, so no worries! I agree a client ID seems like it would have some general usefulness. It would also make it easy to spot from the logs cases where a user isn't properly caching clients, since there would be multiple IDs showing up over time.

I vaguely recall talking about requiring a client ID at some point in the spec process; we decided against requiring it since drivers varied widely in whether they had an ID type at all, whether it was public, and what type it was. That said, the spec doesn't forbid us from adding extra driver-specific info to log messages, so we could go ahead and add one if we want. I think we may as well do it now if we want to, to save adding the test-specific code. Two questions:

  1. Would we want to make this public API / expose it on Client, or only include it in logs?
  2. In Swift, we just have an atomic counter we use to generate integer IDs (see the sketch below); do you think that makes sense here too, or is there some other way that would be better/more idiomatic to implement it?
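(For reference, a minimal sketch of the counter-based approach mentioned in question 2; this is purely illustrative and, per the follow-up below, not what was ultimately adopted:)

use std::sync::atomic::{AtomicU64, Ordering};

// Process-wide counter handing out small integer client IDs, one per Client.
static NEXT_CLIENT_ID: AtomicU64 = AtomicU64::new(0);

fn next_client_id() -> u64 {
    NEXT_CLIENT_ID.fetch_add(1, Ordering::SeqCst)
}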

Contributor:

  1. I don't think it's strictly necessary to include in the public API right now, though it could definitely be useful. I'd lean towards leaving it out for now in case something like this gets stabilized across drivers.
  2. An ObjectId / UUID would also work fine and not require synchronization across client instances. As an aside, I wonder if we could use the topology ID for this? We already publish it for TopologyOpening events, and AFAIK there is only one generated per client's lifetime.

kmahar (Contributor Author) commented Oct 17, 2022:

Good idea re: the topology ID. It does appear to be generated just once, when the Topology is initialized. I refactored to use that instead, and to add it to events unconditionally (i.e. not based on whether we are in tests).

I named the field topologyId because I figured that's the name we'll go with when including it in SDAM tracing events, and this will make it more obvious that it's the same type of ID and can be used to correlate SDAM events with other events.
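(For illustration, attaching the topology ID to a tracing event might look roughly like this; the target, field names, and message are assumptions for the sketch, not the PR's exact code:)

use bson::oid::ObjectId;

// Illustrative only: emit a "command started" tracing event carrying the
// topology ID so it can be correlated with SDAM events from the same client.
fn emit_command_started(topology_id: ObjectId, command_name: &str, request_id: i32) {
    tracing::debug!(
        target: "mongodb::command",
        topologyId = %topology_id,
        commandName = command_name,
        requestId = request_id,
        "Command started"
    );
}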

Contributor:

Given that we're using topology ID now (which every driver should have), would that be something to consider upstreaming into the spec as a requirement?

kmahar (Contributor Author):

Good question; I started a thread in the spec channel about this.

@kmahar requested a review from patrickfreed October 14, 2022 17:58
TracingEventValue::String(failure_str) => {
// lazy_static saves us having to recompile this regex every time this
// function is called.
lazy_static! {
kmahar (Contributor Author):

As an aside, I had assumed that a lazy_static nested like this would get re-evaluated on every function call, but this is explicitly recommended in the regex crate docs as a way to avoid re-compiling regexes within helper functions that use them: https://docs.rs/regex/latest/regex/index.html#example-avoid-compiling-the-same-regex-in-a-loop
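(A minimal sketch of the pattern; the regex and function name here are made up for illustration:)

use lazy_static::lazy_static;
use regex::Regex;

// The static inside the function body is initialized exactly once, on first
// call, rather than recompiling the regex on every invocation.
fn is_command_failed(failure_str: &str) -> bool {
    lazy_static! {
        static ref COMMAND_FAILED_REGEX: Regex =
            Regex::new(r"^Command failed").unwrap();
    }
    COMMAND_FAILED_REGEX.is_match(failure_str)
}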

Contributor:

That's definitely going in the toolbox 🙂

Might be worth leaving a comment at the error definitions that if the text format changes these regexes will need to change as well.

kmahar (Contributor Author):

Added comments in the relevant places for command errors and I/O errors 👍

@abr-egn (Contributor) left a comment:

Looks good, just a few smallish comments.

@@ -144,3 +144,19 @@ pub trait CommandEventHandler: Send + Sync {
/// whenever a database command fails to complete successfully.
fn handle_command_failed_event(&self, _event: CommandFailedEvent) {}
}

#[derive(Clone)]
pub(crate) enum CommandEvent {
Contributor:

It looks like we now have mirrored event enums for both CommandEvent and SdamEvent in src/test/util/event.rs - might be worth folding these together.

kmahar (Contributor Author):

Good point. I've combined the test CommandEvent type into this one; I'll tackle combining the SDAM ones when adding SDAM logging.
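(The folded-together enum has roughly the shape sketched below; inside the driver it lives alongside the public event types rather than importing them from the mongodb crate, so this is illustrative only:)

use mongodb::event::command::{
    CommandFailedEvent, CommandStartedEvent, CommandSucceededEvent,
};

// One enum covering all command monitoring events, usable both for tracing
// and for the test event machinery.
#[derive(Clone)]
pub(crate) enum CommandEvent {
    Started(CommandStartedEvent),
    Succeeded(CommandSucceededEvent),
    Failed(CommandFailedEvent),
}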

// are command errors (which should be redacted) and
// I/O errors (which should not be redacted). redaction of other errors is
// unit-tested in src/test/spec/trace.rs.
if COMMAND_FAILED_REGEX.is_match(failure_str) {
Contributor:

This can avoid the unwrap by being if let Some(captures) = COMMAND_FAILED_REGEX.captures(failure_str)

kmahar (Contributor Author):

Nice; updated here and for the similar case below with IO_ERROR_REGEX.
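(A sketch of the suggested shape; the pattern and capture group here are illustrative, not the driver's actual regex:)

use lazy_static::lazy_static;
use regex::Regex;

// `captures` both checks for a match and yields the capture groups, so there
// is no separate `is_match` check followed by an `unwrap`.
fn parse_error_code(failure_str: &str) -> Option<i32> {
    lazy_static! {
        static ref COMMAND_FAILED_REGEX: Regex =
            Regex::new(r"code: (?P<code>\d+)").unwrap();
    }
    if let Some(captures) = COMMAND_FAILED_REGEX.captures(failure_str) {
        captures.name("code")?.as_str().parse().ok()
    } else {
        None
    }
}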

@kmahar requested a review from abr-egn October 18, 2022 19:54
@patrickfreed (Contributor) left a comment:

LGTM!

@isabelatkinson (Contributor) left a comment:

looks great! just a few very small comments

/// If truncation of a document at the exact specified length would occur in the middle
/// of a Unicode codepoint, the document will be truncated at the closest larger length
/// which falls on a boundary between codepoints.
/// Note that in cases where truncation occurs the output will not be valid JSON.
Contributor:

Suggested change:
- /// Note that in cases where truncation occurs the output will not be valid JSON.
+ /// Note that in cases where truncation occurs the output may not be valid JSON.

Contributor:

Actually, never mind, I got mixed up here; "will" is the right word to use.

src/test/mod.rs Outdated
/// they are registered in.
/// By default this handler will collect no tracing events.
/// Its minimum severity levels can be configured on a per-component basis using
/// [`TracingHandler:set_levels`]. The test lock MUST be acquired in any test
Contributor:

Does it need to be acquired exclusively? I think we should call that out here if so.

kmahar (Contributor Author):

yes; updated

@@ -164,6 +168,9 @@ pub(crate) struct Client {
#[serde(default, deserialize_with = "deserialize_server_api_test_format")]
pub(crate) server_api: Option<ServerApi>,
pub(crate) store_events_as_entities: Option<Vec<StoreEventsAsEntity>>,
#[cfg(feature = "tracing-unstable")]
#[serde(default, deserialize_with = "deserialize_tracing_level_map")]
pub observe_log_messages: Option<HashMap<String, tracing::Level>>,
Contributor:

Suggested change:
- pub observe_log_messages: Option<HashMap<String, tracing::Level>>,
+ pub(crate) observe_log_messages: Option<HashMap<String, tracing::Level>>,

(small nit: I think we decided on using pub(crate) for internal test stuff)

kmahar (Contributor Author):

oh yeah - updated

@kmahar requested a review from isabelatkinson October 19, 2022 21:50
@kmahar merged commit 6164660 into mongodb:main Oct 20, 2022
@kmahar deleted the RUST-580/logging-commands branch October 20, 2022 15:04