TraceQL Metrics: right exemplars for histogram and quantiles #5145

ruslan-mikhailov · 2025-05-20T10:44:26Z

What this PR does

Fixes a bug with unmarshalling of TraceQL Metrics response in tests.
Fixes a bug when for histogram and quantiles exemplars were put in wrong series.

Root cause

In this loop: https://github.com/ruslan-mikhailov/tempo/blob/17e20a43af0881ef0841d2e33eaeb190c422e549/pkg/traceql/engine_metrics.go#L1477
it iterates over the series set while putting all the exemplars into each resulting series without any filtration: https://github.com/ruslan-mikhailov/tempo/blob/17e20a43af0881ef0841d2e33eaeb190c422e549/pkg/traceql/engine_metrics.go#L1488

For example, if in total, we have the following exemplars:
E1 - span_attr=val_A
E2 - span_attr=val_A
E3 - span_attr=val_B
E4 - span_attr=val_C
For the following query:

{span.span_attr!=nil} | quantile_over_time(span:duration, 0.9) by(span.span_attr)

we have:
val_A time series exemplars: E1,E2,E3,E4
val_B time series exemplars: E1,E2,E3,E4
val_C time series exemplars: E1,E2,E3,E4

Expected results:
val_A time series exemplars: E1,E2
val_B time series exemplars: E3
val_C time series exemplars: E4

Impact

In Grafana UI, after filtering by series, it provides wrong exemplars from another series, misleading users and providing a bad UX.

How it has been tested

Unit tests (in the PR)

Check that all exemplars have an attribute label that matches the corresponding time series label.

Manual testing

Run local tempo using local docker compose from example/docker-compose/local
Run the following script to generate random traces

Python script

import random
import time
from datetime import datetime
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.semconv.resource import ResourceAttributes

# Initialize trace provider
resource = Resource.create({
    ResourceAttributes.SERVICE_NAME: "fake-service",
    ResourceAttributes.SERVICE_VERSION: "0.1.0",
})

provider = TracerProvider(resource=resource)
processor = BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4318/v1/traces"))
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("fake-service-tracer")

# Database systems to include in spans
DB_SYSTEMS = ["mongodb", "redis", "elasticsearch", "postgres"]

def generate_nested_spans(depth=0, max_depth=3):
    """Generate nested spans with random operations"""
    if depth >= max_depth:
        return
    
    operations = ["query", "update", "delete", "insert", "get", "set", "scan"]
    operation = random.choice(operations)
    
    # Decide if this should be a database span
    is_db_span = random.random() < 0.3 and depth > 0
    
    # Create context for the span
    span_name = f"operation.{operation}"
    if is_db_span:
        span_name = f"db.{operation}"
    
    with tracer.start_as_current_span(span_name) as span:
        # Add random attributes
        span.set_attribute("operation.type", operation)
        span.set_attribute("timestamp", datetime.now().isoformat())
        
        # Add DB specific attributes for some spans
        if is_db_span:
            db_system = random.choice(DB_SYSTEMS)
            span.set_attribute("db.system", db_system)
            span.set_attribute("db.operation", operation)
            span.set_attribute("db.instance", f"{db_system}-instance-{random.randint(1, 5)}")
            
            # Add specific attributes based on DB type
            if db_system == "mongodb":
                span.set_attribute("db.mongodb.collection", f"collection-{random.randint(1, 10)}")
            elif db_system == "postgres":
                span.set_attribute("db.sql.table", f"table-{random.randint(1, 10)}")
            elif db_system == "redis":
                span.set_attribute("db.redis.database_index", random.randint(0, 15))
        
        # Random delay to simulate work
        time.sleep(random.uniform(0.01, 0.1))
        
        # Randomly generate errors
        if random.random() < 0.1:
            span.set_status(trace.Status(trace.StatusCode.ERROR))
            span.record_exception(Exception(f"Simulated error in {span_name}"))
        
        # Generate nested spans
        nested_count = random.randint(0, 3)
        for _ in range(nested_count):
            generate_nested_spans(depth + 1, max_depth)

def generate_trace():
    """Generate a complete trace with nested spans"""
    request_paths = ["/api/users", "/api/products", "/api/orders", "/api/search", "/api/auth"]
    methods = ["GET", "POST", "PUT", "DELETE"]
    
    path = random.choice(request_paths)
    method = random.choice(methods)
    
    with tracer.start_as_current_span(f"{method} {path}") as root_span:
        root_span.set_attribute("http.method", method)
        root_span.set_attribute("http.url", f"http://fake-service{path}")
        root_span.set_attribute("http.status_code", random.choice([200, 200, 200, 400, 500]))
        root_span.set_attribute("test.version", "1.0.0")
        
        # generate_nested_spans(max_depth=1)
        # Generate between 1-5 nested spans
        for _ in range(random.randint(1, 5)):
            generate_nested_spans(max_depth=5)

def main():
    print("Starting to generate fake traces...")
    try:
        while True:
            generate_trace()
            # Wait between 1-3 seconds between traces
            time.sleep(random.uniform(1, 3))
    except KeyboardInterrupt:
        print("Trace generation stopped.")

if __name__ == "__main__":
    main()

Run the following query:

{span.db.system!=nil} | quantile_over_time(span:duration, 0.9) by(span.db.system)

Check that span.db.system label in exemplars matches "promLabels" of the series

Main branch results:

            "promLabels": "{p=\"0.9\", \"span.db.system\"=\"elasticsearch\"}", <------ elasicsearch
            "exemplars": [
                {
                    "labels": [
                        {
                            "key": "trace:id",
                            "value": {
                                "stringValue": "21c22617e0fb45fcc3cfafb6b792c52"
                            }
                        },
                        {
                            "key": "span.db.system",
                            "value": {
                                "stringValue": "mongodb" <----- wrong span, this is for mongodb, not ES. 
                            }
                        },
                        {
                            "key": "duration",
                            "value": {
                                "stringValue": "210.047ms" <--- value is also from the span that belongs to mongodb
                            }
                        }
                    ],
                    "value": 0.210047,
                    "timestampMs": "1747659434445"
                },

After the fix, only exemplars like this:

            "promLabels": "{p=\"0.9\", \"span.db.system\"=\"elasticsearch\"}",
            "exemplars": [
                {
                    "labels": [
                        {
                            "key": "span.db.system",
                            "value": {
                                "stringValue": "elasticsearch"
                            }
                        },
                        {
                            "key": "duration",
                            "value": {
                                "stringValue": "286.236ms"
                            }
                        },
                        {
                            "key": "trace:id",
                            "value": {
                                "stringValue": "2c5cab26dda70db8f4fc5438f2ec1265"
                            }
                        }
                    ],
                    "value": 0.286236,
                    "timestampMs": "1747666540680"
                },

Which issue(s) this PR fixes

Fixes #

Checklist

Tests updated
Documentation added
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

pkg/traceql/engine_metrics.go

ruslan-mikhailov · 2025-05-23T14:41:22Z

+ review fixes

ruslan-mikhailov · 2025-05-23T14:41:40Z

+ rebase from latest master

pkg/traceql/engine_metrics.go

ruslan-mikhailov · 2025-05-27T11:45:55Z

+ benchmarks

ruslan-mikhailov · 2025-05-27T11:58:57Z

+ review fix

ruslan-mikhailov · 2025-05-27T12:19:16Z

+ gosec overflow linter mute for testing function

ruslan-mikhailov · 2025-06-02T08:03:25Z

+ rebase from latest main

…#5145) * Minor: remove forgotten comment * [Bugfix] TraceQL Metrics: right exemplars for histogram and quantiles * Test fixes: correct unmarshaller * Tests: send traces with different attributes * Tests: by operation * Test attribute values in exemplars * Changelog * Test attribute values in exemplars: additional cases * Benchmarks: HistogramAggregator.Combine * Minor review improvement

…5197) * Minor: remove forgotten comment * [Bugfix] TraceQL Metrics: right exemplars for histogram and quantiles * Test fixes: correct unmarshaller * Tests: send traces with different attributes * Tests: by operation * Test attribute values in exemplars * Changelog * Test attribute values in exemplars: additional cases * Benchmarks: HistogramAggregator.Combine * Minor review improvement

…#5145) * Minor: remove forgotten comment * [Bugfix] TraceQL Metrics: right exemplars for histogram and quantiles * Test fixes: correct unmarshaller * Tests: send traces with different attributes * Tests: by operation * Test attribute values in exemplars * Changelog * Test attribute values in exemplars: additional cases * Benchmarks: HistogramAggregator.Combine * Minor review improvement

ruslan-mikhailov force-pushed the bugfix/histogram-exemplars branch 2 times, most recently from e707cd9 to 30d22d1 Compare May 20, 2025 11:00

ruslan-mikhailov marked this pull request as ready for review May 20, 2025 11:33

ruslan-mikhailov requested review from joe-elliott, mdisibio, mapno, yvrhdn, zalegrala, electron0zero, ie-pham, stoewer, javiermolinar and carles-grafana as code owners May 20, 2025 11:33

mdisibio self-assigned this May 20, 2025

mdisibio reviewed May 22, 2025

View reviewed changes

pkg/traceql/engine_metrics.go Outdated Show resolved Hide resolved

ruslan-mikhailov force-pushed the bugfix/histogram-exemplars branch from 04755c4 to ed202d0 Compare May 23, 2025 14:41

ruslan-mikhailov force-pushed the bugfix/histogram-exemplars branch from ed202d0 to 68316ef Compare May 23, 2025 14:41

ruslan-mikhailov requested a review from mdisibio May 23, 2025 15:30

mdisibio reviewed May 23, 2025

View reviewed changes

pkg/traceql/engine_metrics.go Show resolved Hide resolved

pkg/traceql/engine_metrics.go Outdated Show resolved Hide resolved

This was referenced May 26, 2025

[Epic] Query correctness #5165

Open

Exemplars UX improvements #5158

Merged

ruslan-mikhailov requested a review from mdisibio May 27, 2025 12:14

ruslan-mikhailov force-pushed the bugfix/histogram-exemplars branch from d00396c to fad8276 Compare May 27, 2025 12:18

ruslan-mikhailov force-pushed the bugfix/histogram-exemplars branch from fad8276 to 2bf9905 Compare May 27, 2025 12:43

mdisibio approved these changes May 29, 2025

View reviewed changes

ruslan-mikhailov enabled auto-merge (squash) June 2, 2025 07:58

ruslan-mikhailov disabled auto-merge June 2, 2025 08:01

ruslan-mikhailov added 10 commits June 2, 2025 10:01

Minor: remove forgotten comment

fda9a48

[Bugfix] TraceQL Metrics: right exemplars for histogram and quantiles

bbb6421

Test fixes: correct unmarshaller

977f284

Tests: send traces with different attributes

2ee6772

Tests: by operation

92be1fc

Test attribute values in exemplars

4398689

Changelog

cc8260b

Test attribute values in exemplars: additional cases

2bd96aa

Benchmarks: HistogramAggregator.Combine

2ce91ea

Minor review improvement

e4faa25

ruslan-mikhailov force-pushed the bugfix/histogram-exemplars branch from 2bf9905 to e4faa25 Compare June 2, 2025 08:03

ruslan-mikhailov merged commit 2ef94ef into grafana:main Jun 2, 2025
20 checks passed

ruslan-mikhailov deleted the bugfix/histogram-exemplars branch June 2, 2025 08:24

ruslan-mikhailov added backport release-v2.8 and removed backport release-v2.8 labels Jun 2, 2025

ruslan-mikhailov mentioned this pull request Jun 2, 2025

TraceQL Metrics: right exemplars for histogram and quantiles (#5145) #5197

Merged

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

TraceQL Metrics: right exemplars for histogram and quantiles #5145

TraceQL Metrics: right exemplars for histogram and quantiles #5145

Uh oh!

ruslan-mikhailov commented May 20, 2025 •

edited

Loading

Uh oh!

Uh oh!

ruslan-mikhailov commented May 23, 2025

Uh oh!

ruslan-mikhailov commented May 23, 2025

Uh oh!

Uh oh!

Uh oh!

ruslan-mikhailov commented May 27, 2025

Uh oh!

ruslan-mikhailov commented May 27, 2025

Uh oh!

ruslan-mikhailov commented May 27, 2025

Uh oh!

ruslan-mikhailov commented Jun 2, 2025

Uh oh!

Uh oh!

Uh oh!

TraceQL Metrics: right exemplars for histogram and quantiles #5145

TraceQL Metrics: right exemplars for histogram and quantiles #5145

Uh oh!

Conversation

ruslan-mikhailov commented May 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What this PR does

Root cause

Impact

How it has been tested

Unit tests (in the PR)

Manual testing

Which issue(s) this PR fixes

Checklist

Uh oh!

Uh oh!

ruslan-mikhailov commented May 23, 2025

Uh oh!

ruslan-mikhailov commented May 23, 2025

Uh oh!

Uh oh!

Uh oh!

ruslan-mikhailov commented May 27, 2025

Uh oh!

ruslan-mikhailov commented May 27, 2025

Uh oh!

ruslan-mikhailov commented May 27, 2025

Uh oh!

ruslan-mikhailov commented Jun 2, 2025

Uh oh!

Uh oh!

Uh oh!

ruslan-mikhailov commented May 20, 2025 •

edited

Loading