[ENH] Reorganize load-service traces #4518

jasonvigil · 2025-05-09T21:14:51Z

Description of changes

Remove the parent "workload" trace that spans the entire workload. It is too long-lived and contains too many sub-spans. Instead, generate a "step" root trace for each workload step. Events for each workload step will be associated with that step's root trace.

This will make spans easier to read and debug.

Test plan

Tested locally via tilt and jaeger.

Documentation Changes

N/A

github-actions · 2025-05-09T21:15:01Z

propel-code-bot · 2025-05-09T21:15:20Z

rust/load/src/lib.rs

@@ -1271,13 +1271,11 @@ impl LoadService {
            for declared in declared {
                if let Entry::Vacant(entry) = running.entry(declared.uuid) {
                    tracing::info!("spawning workload {}", declared.uuid);
-                    let root = tracing::info_span!(parent: None, "workload");
                    let this = Arc::clone(self);
                    let done = Arc::new(AtomicBool::new(false));
                    let done_p = Arc::clone(&done);
                    let inhibit = Arc::clone(&self.inhibit);
                    let task = tokio::task::spawn(async move {


[BestPractice]

Consider using a descriptive span name that includes the workload UUID for better tracing context. This would make debugging and log analysis easier by clearly identifying which workload a particular trace belongs to.

propel-code-bot · 2025-05-09T21:15:21Z

rust/load/src/lib.rs

@@ -1408,30 +1376,45 @@ impl LoadService {
                        .await
                        .map_err(|err| Error::FailWorkload(err.to_string()))
                    {
-                        Ok(()) => Ok(()),
+                        Ok(()) => (),


[CriticalError]

It appears that you're returning () from the Ok path but not handling the error path properly. The pattern matching suggests you should either return something or propagate the error, but the error is effectively swallowed here. Consider adding explicit error handling or returning a Result from this closure.
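For context on the change from `Ok(()) => Ok(())` to `Ok(()) => ()`: once the `Ok` arm evaluates to `()`, the whole `match` has type `()`, so the `Err` arm has to consume the error in place (for example by logging it) rather than return it. The sketch below shows that shape using only the standard library. `StepError`, `run_step`, `handle`, and the `log` vector are hypothetical stand-ins, not names from `rust/load/src/lib.rs`.

```rust
use std::fmt;

// Hypothetical error type standing in for the load service's error.
#[derive(Debug)]
struct StepError(String);

impl fmt::Display for StepError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "step failed: {}", self.0)
    }
}

// Hypothetical step: fails when `ok` is false.
fn run_step(ok: bool) -> Result<(), StepError> {
    if ok {
        Ok(())
    } else {
        Err(StepError("backend unavailable".to_string()))
    }
}

// Both arms evaluate to `()`, so the error is consumed inside the `Err` arm.
// Here it is recorded in `log`; real code would more likely emit an event.
fn handle(ok: bool, log: &mut Vec<String>) {
    match run_step(ok) {
        Ok(()) => (),
        Err(err) => log.push(err.to_string()),
    }
}

fn main() {
    let mut log = Vec::new();
    handle(true, &mut log);
    handle(false, &mut log);
    println!("{log:?}");
}
```

Whether the error counts as swallowed therefore depends on what the `Err` arm does with it, not on the `Ok` arm returning `()`.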

propel-code-bot · 2025-05-09T21:15:22Z

rust/load/src/lib.rs

                        Err(err) => {
-                            if err.to_string().contains("invalid request: No results") {
+                            if format!("{err:?}").contains("invalid request: No results") {


[BestPractice]

For consistent error handling, replace format!("{err:?}").contains() with a more structured approach. String matching on error messages is fragile and could break if error messages change. Consider adding error types or codes that can be checked more reliably.
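As a rough illustration of the suggestion above, the sketch below contrasts matching on an error's rendered text with matching on an error variant. It uses only the standard library. `QueryError` and both helper functions are hypothetical; the real error type in `rust/load/src/lib.rs` may look quite different.

```rust
use std::fmt;

// Hypothetical error enum; the variant names are illustrative only.
#[derive(Debug)]
enum QueryError {
    NoResults,
    Other(String),
}

impl fmt::Display for QueryError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            QueryError::NoResults => write!(f, "invalid request: No results"),
            QueryError::Other(msg) => write!(f, "{msg}"),
        }
    }
}

impl std::error::Error for QueryError {}

// Fragile: depends on the exact wording of the rendered message.
fn is_no_results_by_text(err: &QueryError) -> bool {
    err.to_string().contains("invalid request: No results")
}

// Sturdier: depends only on the variant, not on any message text.
fn is_no_results(err: &QueryError) -> bool {
    matches!(err, QueryError::NoResults)
}

fn main() {
    let err = QueryError::Other("wrapped: invalid request: No results expected".to_string());
    // The text check fires on an unrelated error that happens to contain the phrase.
    println!("by text: {}", is_no_results_by_text(&err));
    println!("by variant: {}", is_no_results(&err));
}
```

The variant check keeps working if the message wording changes, and it does not fire on unrelated errors that happen to contain the same phrase.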

propel-code-bot · 2025-05-09T21:15:28Z

Reorganizing Load-Service Traces for Better Debugging

This PR restructures the tracing architecture in the load-service by removing the single long-lasting 'workload' trace span that contained too many sub-spans. Instead, it creates individual 'step' root traces for each workload step, making spans easier to read and debug. The change also improves error handling and logging, with more descriptive tracing for different error conditions.

Key Changes:
• Removed parent 'workload' trace span in favor of individual 'step' spans
• Eliminated task reaper mechanism in favor of direct error handling
• Added workload UUID to trace spans for better identification
• Improved error logging with specific messages for different error types

Affected Areas:
• rust/load/src/lib.rs - workload tracing infrastructure

This summary was automatically generated by @propel-code-bot

propel-code-bot · 2025-05-09T21:30:37Z

rust/load/src/lib.rs

                };
-                tx.send(tokio::spawn(fut)).await.unwrap();
+                let span = tracing::info_span!(parent: None, "step", workload_uuid = %spec.uuid);
+                tokio::spawn(fut.instrument(span));


[BestPractice]

Using instrument(span) without properly closing or dropping the span may lead to resource leaks. Ensure that the span is properly managed in the spawned task, especially when dealing with many workloads.

rescrv · 2025-05-09T21:41:51Z

rust/load/src/lib.rs

                        Err(err) => {
-                            if err.to_string().contains("invalid request: No results") {
+                            if format!("{err:?}").contains("invalid request: No results") {


I think we shouldn't make this drop. Not in this PR, but we should remove this case of error handling. It's error-masking.

## Description of changes

Remove the parent "workload" trace for the entire workload. It's too long-lasting, and contains too many sub-spans. Instead, generate a "step" root trace for each workload step. Events for each workload step will be associated with the step's root trace.

This will make spans easier to read, and debug.

## Test plan

Tested locally via tilt and jaeger.

## Documentation Changes

N/A

---------

Co-authored-by: propel-code-bot[bot] <203372662+propel-code-bot[bot]@users.noreply.github.com>

jasonvigil requested a review from rescrv May 9, 2025 21:14

propel-code-bot bot reviewed May 9, 2025

propel-code-bot bot reviewed May 9, 2025

rescrv approved these changes May 9, 2025

jasonvigil enabled auto-merge (squash) May 9, 2025 21:45

jasonvigil mentioned this pull request May 9, 2025

[ENH] (cherry-pick) Reorganize load-service traces #4519

Closed

jasonvigil force-pushed the jason/cv-trace-improvements branch from cacb9d7 to 8984f88 Compare May 9, 2025 22:09

jasonvigil added 2 commits May 9, 2025 15:33

[BUG] Make chroma-load workload done an info event

7d9863b

jasonvigil force-pushed the jason/cv-trace-improvements branch from 28d2363 to 7d9863b Compare May 9, 2025 22:34

jasonvigil merged commit 8170f91 into main May 12, 2025
70 checks passed

jasonvigil deleted the jason/cv-trace-improvements branch May 12, 2025 16:36

github-actions bot commented May 9, 2025

Reviewer Checklist

Testing, Bugs, Errors, Logs, Documentation

System Compatibility

Quality
