-
Notifications
You must be signed in to change notification settings - Fork 1.6k
[ENH] Reorganize load-service traces #4518
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Reviewer ChecklistPlease leverage this checklist to ensure your code review is thorough before approving Testing, Bugs, Errors, Logs, Documentation
System Compatibility
Quality
|
@@ -1271,13 +1271,11 @@ impl LoadService { | |||
for declared in declared { | |||
if let Entry::Vacant(entry) = running.entry(declared.uuid) { | |||
tracing::info!("spawning workload {}", declared.uuid); | |||
let root = tracing::info_span!(parent: None, "workload"); | |||
let this = Arc::clone(self); | |||
let done = Arc::new(AtomicBool::new(false)); | |||
let done_p = Arc::clone(&done); | |||
let inhibit = Arc::clone(&self.inhibit); | |||
let task = tokio::task::spawn(async move { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[BestPractice]
Consider using a descriptive span name that includes the workload UUID for better tracing context. This would make debugging and log analysis easier by clearly identifying which workload a particular trace belongs to.
@@ -1408,30 +1376,45 @@ impl LoadService { | |||
.await | |||
.map_err(|err| Error::FailWorkload(err.to_string())) | |||
{ | |||
Ok(()) => Ok(()), | |||
Ok(()) => (), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[CriticalError]
It appears that you're returning ()
from the Ok path but not handling the error path properly. The pattern matching suggests you should either return something or propagate the error, but the error is effectively swallowed here. Consider adding explicit error handling or returning a Result from this closure.
Err(err) => { | ||
if err.to_string().contains("invalid request: No results") { | ||
if format!("{err:?}").contains("invalid request: No results") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[BestPractice]
For consistent error handling, replace format!("{err:?}").contains()
with a more structured approach. String matching on error messages is fragile and could break if error messages change. Consider adding error types or codes that can be checked more reliably.
Reorganizing Load-Service Traces for Better Debugging This PR restructures the tracing architecture in the load-service by removing the single long-lasting 'workload' trace span that contained too many sub-spans. Instead, it creates individual 'step' root traces for each workload step, making spans easier to read and debug. The change also improves error handling and logging, with more descriptive tracing for different error conditions. Key Changes: Affected Areas: This summary was automatically generated by @propel-code-bot |
}; | ||
tx.send(tokio::spawn(fut)).await.unwrap(); | ||
let span = tracing::info_span!(parent: None, "step", workload_uuid = %spec.uuid); | ||
tokio::spawn(fut.instrument(span)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[BestPractice]
Using instrument(span)
without properly closing or dropping the span may lead to resource leaks. Ensure that the span is properly managed in the spawned task, especially when dealing with many workloads.
Err(err) => { | ||
if err.to_string().contains("invalid request: No results") { | ||
if format!("{err:?}").contains("invalid request: No results") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we shouldn't make this drop. Not in this PR, but we should remove this case of error handling. It's error-masking.
cacb9d7
to
8984f88
Compare
Remove the parent "workload" trace for the entire workload. It's too long-lasting, and contains too many sub-spans. Instead, generate a "step" root trace for each workload step. Events for each workload step will be associated with the step's root trace.
28d2363
to
7d9863b
Compare
## Description of changes Remove the parent "workload" trace for the entire workload. It's too long-lasting, and contains too many sub-spans. Instead, generate a "step" root trace for each workload step. Events for each workload step will be associated with the step's root trace. This will make spans easier to read, and debug. ## Test plan Tested locally via tilt and jaeger. ## Documentation Changes N/A --------- Co-authored-by: propel-code-bot[bot] <203372662+propel-code-bot[bot]@users.noreply.github.com>
## Description of changes Remove the parent "workload" trace for the entire workload. It's too long-lasting, and contains too many sub-spans. Instead, generate a "step" root trace for each workload step. Events for each workload step will be associated with the step's root trace. This will make spans easier to read, and debug. ## Test plan Tested locally via tilt and jaeger. ## Documentation Changes N/A --------- Co-authored-by: propel-code-bot[bot] <203372662+propel-code-bot[bot]@users.noreply.github.com>
Description of changes
Remove the parent "workload" trace for the entire workload. It's too long-lasting, and contains too many sub-spans. Instead, generate a "step" root trace for each workload step. Events for each workload step will be associated with the step's root trace.
This will make spans easier to read, and debug.
Test plan
Tested locally via tilt and jaeger.
Documentation Changes
N/A