Skip to content

TraceQL Metrics: right exemplars for histogram and quantiles #5145

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

ruslan-mikhailov
Copy link
Contributor

@ruslan-mikhailov ruslan-mikhailov commented May 20, 2025

What this PR does

  1. Fixes a bug with unmarshalling of TraceQL Metrics response in tests.
  2. Fixes a bug when for histogram and quantiles exemplars were put in wrong series.

Root cause

In this loop: https://github.com/ruslan-mikhailov/tempo/blob/17e20a43af0881ef0841d2e33eaeb190c422e549/pkg/traceql/engine_metrics.go#L1477
it iterates over the series set while putting all the exemplars into each resulting series without any filtration: https://github.com/ruslan-mikhailov/tempo/blob/17e20a43af0881ef0841d2e33eaeb190c422e549/pkg/traceql/engine_metrics.go#L1488

For example, if in total, we have the following exemplars:
E1 - span_attr=val_A
E2 - span_attr=val_A
E3 - span_attr=val_B
E4 - span_attr=val_C
For the following query:

{span.span_attr!=nil} | quantile_over_time(span:duration, 0.9) by(span.span_attr)

we have:
val_A time series exemplars: E1,E2,E3,E4
val_B time series exemplars: E1,E2,E3,E4
val_C time series exemplars: E1,E2,E3,E4

Expected results:
val_A time series exemplars: E1,E2
val_B time series exemplars: E3
val_C time series exemplars: E4

Impact

In Grafana UI, after filtering by series, it provides wrong exemplars from another series, misleading users and providing a bad UX.

How it has been tested

Unit tests (in the PR)

Check that all exemplars have an attribute label that matches the corresponding time series label.

Manual testing

  1. Run local tempo using local docker compose from example/docker-compose/local
  2. Run the following script to generate random traces
Python script
import random
import time
from datetime import datetime
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.semconv.resource import ResourceAttributes

# Initialize trace provider
resource = Resource.create({
    ResourceAttributes.SERVICE_NAME: "fake-service",
    ResourceAttributes.SERVICE_VERSION: "0.1.0",
})

provider = TracerProvider(resource=resource)
processor = BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4318/v1/traces"))
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("fake-service-tracer")

# Database systems to include in spans
DB_SYSTEMS = ["mongodb", "redis", "elasticsearch", "postgres"]

def generate_nested_spans(depth=0, max_depth=3):
    """Generate nested spans with random operations"""
    if depth >= max_depth:
        return
    
    operations = ["query", "update", "delete", "insert", "get", "set", "scan"]
    operation = random.choice(operations)
    
    # Decide if this should be a database span
    is_db_span = random.random() < 0.3 and depth > 0
    
    # Create context for the span
    span_name = f"operation.{operation}"
    if is_db_span:
        span_name = f"db.{operation}"
    
    with tracer.start_as_current_span(span_name) as span:
        # Add random attributes
        span.set_attribute("operation.type", operation)
        span.set_attribute("timestamp", datetime.now().isoformat())
        
        # Add DB specific attributes for some spans
        if is_db_span:
            db_system = random.choice(DB_SYSTEMS)
            span.set_attribute("db.system", db_system)
            span.set_attribute("db.operation", operation)
            span.set_attribute("db.instance", f"{db_system}-instance-{random.randint(1, 5)}")
            
            # Add specific attributes based on DB type
            if db_system == "mongodb":
                span.set_attribute("db.mongodb.collection", f"collection-{random.randint(1, 10)}")
            elif db_system == "postgres":
                span.set_attribute("db.sql.table", f"table-{random.randint(1, 10)}")
            elif db_system == "redis":
                span.set_attribute("db.redis.database_index", random.randint(0, 15))
        
        # Random delay to simulate work
        time.sleep(random.uniform(0.01, 0.1))
        
        # Randomly generate errors
        if random.random() < 0.1:
            span.set_status(trace.Status(trace.StatusCode.ERROR))
            span.record_exception(Exception(f"Simulated error in {span_name}"))
        
        # Generate nested spans
        nested_count = random.randint(0, 3)
        for _ in range(nested_count):
            generate_nested_spans(depth + 1, max_depth)

def generate_trace():
    """Generate a complete trace with nested spans"""
    request_paths = ["/api/users", "/api/products", "/api/orders", "/api/search", "/api/auth"]
    methods = ["GET", "POST", "PUT", "DELETE"]
    
    path = random.choice(request_paths)
    method = random.choice(methods)
    
    with tracer.start_as_current_span(f"{method} {path}") as root_span:
        root_span.set_attribute("http.method", method)
        root_span.set_attribute("http.url", f"http://fake-service{path}")
        root_span.set_attribute("http.status_code", random.choice([200, 200, 200, 400, 500]))
        root_span.set_attribute("test.version", "1.0.0")
        
        # generate_nested_spans(max_depth=1)
        # Generate between 1-5 nested spans
        for _ in range(random.randint(1, 5)):
            generate_nested_spans(max_depth=5)

def main():
    print("Starting to generate fake traces...")
    try:
        while True:
            generate_trace()
            # Wait between 1-3 seconds between traces
            time.sleep(random.uniform(1, 3))
    except KeyboardInterrupt:
        print("Trace generation stopped.")

if __name__ == "__main__":
    main()
  1. Run the following query:
{span.db.system!=nil} | quantile_over_time(span:duration, 0.9) by(span.db.system)
  1. Check that span.db.system label in exemplars matches "promLabels" of the series

Main branch results:

            "promLabels": "{p=\"0.9\", \"span.db.system\"=\"elasticsearch\"}", <------ elasicsearch
            "exemplars": [
                {
                    "labels": [
                        {
                            "key": "trace:id",
                            "value": {
                                "stringValue": "21c22617e0fb45fcc3cfafb6b792c52"
                            }
                        },
                        {
                            "key": "span.db.system",
                            "value": {
                                "stringValue": "mongodb" <----- wrong span, this is for mongodb, not ES. 
                            }
                        },
                        {
                            "key": "duration",
                            "value": {
                                "stringValue": "210.047ms" <--- value is also from the span that belongs to mongodb
                            }
                        }
                    ],
                    "value": 0.210047,
                    "timestampMs": "1747659434445"
                },

After the fix, only exemplars like this:

            "promLabels": "{p=\"0.9\", \"span.db.system\"=\"elasticsearch\"}",
            "exemplars": [
                {
                    "labels": [
                        {
                            "key": "span.db.system",
                            "value": {
                                "stringValue": "elasticsearch"
                            }
                        },
                        {
                            "key": "duration",
                            "value": {
                                "stringValue": "286.236ms"
                            }
                        },
                        {
                            "key": "trace:id",
                            "value": {
                                "stringValue": "2c5cab26dda70db8f4fc5438f2ec1265"
                            }
                        }
                    ],
                    "value": 0.286236,
                    "timestampMs": "1747666540680"
                },

Which issue(s) this PR fixes

Fixes #

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@ruslan-mikhailov ruslan-mikhailov force-pushed the bugfix/histogram-exemplars branch 2 times, most recently from e707cd9 to 30d22d1 Compare May 20, 2025 11:00
@ruslan-mikhailov ruslan-mikhailov marked this pull request as ready for review May 20, 2025 11:33
@mdisibio mdisibio self-assigned this May 20, 2025
@ruslan-mikhailov ruslan-mikhailov force-pushed the bugfix/histogram-exemplars branch from 04755c4 to ed202d0 Compare May 23, 2025 14:41
@ruslan-mikhailov
Copy link
Contributor Author

+ review fixes

@ruslan-mikhailov ruslan-mikhailov force-pushed the bugfix/histogram-exemplars branch from ed202d0 to 68316ef Compare May 23, 2025 14:41
@ruslan-mikhailov
Copy link
Contributor Author

+ rebase from latest master

@ruslan-mikhailov ruslan-mikhailov requested a review from mdisibio May 23, 2025 15:30
This was referenced May 26, 2025
@ruslan-mikhailov
Copy link
Contributor Author

+ benchmarks

@ruslan-mikhailov
Copy link
Contributor Author

+ review fix

@ruslan-mikhailov ruslan-mikhailov requested a review from mdisibio May 27, 2025 12:14
@ruslan-mikhailov ruslan-mikhailov force-pushed the bugfix/histogram-exemplars branch from d00396c to fad8276 Compare May 27, 2025 12:18
@ruslan-mikhailov
Copy link
Contributor Author

+ gosec overflow linter mute for testing function

@ruslan-mikhailov ruslan-mikhailov force-pushed the bugfix/histogram-exemplars branch from fad8276 to 2bf9905 Compare May 27, 2025 12:43
@ruslan-mikhailov ruslan-mikhailov enabled auto-merge (squash) June 2, 2025 07:58
@ruslan-mikhailov ruslan-mikhailov disabled auto-merge June 2, 2025 08:01
@ruslan-mikhailov ruslan-mikhailov force-pushed the bugfix/histogram-exemplars branch from 2bf9905 to e4faa25 Compare June 2, 2025 08:03
@ruslan-mikhailov
Copy link
Contributor Author

+ rebase from latest main

@ruslan-mikhailov ruslan-mikhailov merged commit 2ef94ef into grafana:main Jun 2, 2025
20 checks passed
@ruslan-mikhailov ruslan-mikhailov deleted the bugfix/histogram-exemplars branch June 2, 2025 08:24
ruslan-mikhailov added a commit to ruslan-mikhailov/tempo that referenced this pull request Jun 2, 2025
…#5145)

* Minor: remove forgotten comment

* [Bugfix] TraceQL Metrics: right exemplars for histogram and quantiles

* Test fixes: correct unmarshaller

* Tests: send traces with different attributes

* Tests: by operation

* Test attribute values in exemplars

* Changelog

* Test attribute values in exemplars: additional cases

* Benchmarks: HistogramAggregator.Combine

* Minor review improvement
ruslan-mikhailov added a commit to ruslan-mikhailov/tempo that referenced this pull request Jun 2, 2025
…#5145)

* Minor: remove forgotten comment

* [Bugfix] TraceQL Metrics: right exemplars for histogram and quantiles

* Test fixes: correct unmarshaller

* Tests: send traces with different attributes

* Tests: by operation

* Test attribute values in exemplars

* Changelog

* Test attribute values in exemplars: additional cases

* Benchmarks: HistogramAggregator.Combine

* Minor review improvement
ruslan-mikhailov added a commit that referenced this pull request Jun 2, 2025
…5197)

* Minor: remove forgotten comment

* [Bugfix] TraceQL Metrics: right exemplars for histogram and quantiles

* Test fixes: correct unmarshaller

* Tests: send traces with different attributes

* Tests: by operation

* Test attribute values in exemplars

* Changelog

* Test attribute values in exemplars: additional cases

* Benchmarks: HistogramAggregator.Combine

* Minor review improvement
knylander-grafana pushed a commit to knylander-grafana/tempo-doc-work that referenced this pull request Jun 2, 2025
…#5145)

* Minor: remove forgotten comment

* [Bugfix] TraceQL Metrics: right exemplars for histogram and quantiles

* Test fixes: correct unmarshaller

* Tests: send traces with different attributes

* Tests: by operation

* Test attribute values in exemplars

* Changelog

* Test attribute values in exemplars: additional cases

* Benchmarks: HistogramAggregator.Combine

* Minor review improvement
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants