| 1 | +# Sampling API |
| 2 | + |
| 3 | +*Status: proposed* |
| 4 | + |
| 5 | +## TL;DR |
| 6 | +This section summarizes the changes proposed in this RFC:
| 7 | + 1. Move the `Sampler` interface from the API to SDK package. |
| 8 | + 1. Add a new `SamplerHint` concept to the API package. |
| 9 | + 1. Add the capability to record `Attributes` that can be used for the sampling decision at `Span`
| 10 | + creation time.
| 11 | + 1. Add the capability to start building a `Span` with a delayed `build` method. This is useful for
| 12 | + cases where some `Attributes` that are useful for sampling are not yet available when building of
| 13 | + the `Span` starts. For example, in Java the current `Span.Builder` uses as the start time of the
| 14 | + `Span` the moment when the builder is created, not the moment when the `build()` method is
| 15 | + called.
| 16 | + |
| 17 | +## Motivation |
| 18 | + |
| 19 | +Different users of OpenTelemetry (library developers, packaged infrastructure binary
| 20 | +developers, application developers, operators, and telemetry system owners) have separate use cases
| 21 | +for OpenTelemetry that have gotten muddled in the design of the original Sampling API. Thus, we need
| 22 | +to clarify which APIs each group should be able to depend upon, and how they will configure sampling
| 23 | +and OpenTelemetry according to their needs.
| 24 | + |
| 25 | + |
| 26 | + |
| 27 | +## Explanation |
| 28 | + |
| 29 | +We outline five different groups of users (which may be overlapping sets of people), and how each
| 30 | +should interact with OpenTelemetry:
| 31 | + |
| 32 | +### Library developer |
| 33 | +Examples: gRPC, Express, Django developers. |
| 34 | + |
| 35 | + * They must only depend upon the OTel API and not upon the SDK. |
| 36 | + * They are shipping source code that will be linked into others' applications. |
| 37 | + * They have no explicit runtime control over the application. |
| 38 | + * They have some signal about which traces may be interesting (e.g. unusual control plane requests)
| 39 | + or uninteresting (e.g. health-checks), but must write fully generic code.
| 40 | + |
| 41 | +**Solution:** |
| 42 | + |
| 43 | + * On the start Span operation, the OpenTelemetry API will allow marking a span with one of three |
| 44 | + choices for the SamplingHint, with "don't care" as the default: [`don't care`, `suggest keeping`, |
| 45 | + `suggest discarding`] |
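The three hint values above could be modelled as a small enumeration. The following is a minimal sketch, not the proposed API: the member names and the `hint_for_request` helper (including the `/healthz` path) are hypothetical and only illustrate how a library might choose a hint without depending on the SDK.

```python
from enum import Enum


class SamplingHint(Enum):
    """The three hints a library may attach when starting a span."""

    DONT_CARE = "don't care"  # the default
    SUGGEST_KEEPING = "suggest keeping"
    SUGGEST_DISCARDING = "suggest discarding"


def hint_for_request(path: str, is_control_plane: bool = False) -> SamplingHint:
    """Hypothetical library-side helper that maps a request to a hint."""
    if path == "/healthz":
        # Health checks are usually uninteresting.
        return SamplingHint.SUGGEST_DISCARDING
    if is_control_plane:
        # Unusual control plane requests may be worth keeping.
        return SamplingHint.SUGGEST_KEEPING
    return SamplingHint.DONT_CARE
```

The hint is only a suggestion: whoever configures the sampler remains free to ignore it.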
| 46 | + |
| 47 | +### Infrastructure package/binary developer |
| 48 | +Examples: HBase, Envoy developers. |
| 49 | + |
| 50 | + * They are shipping self-contained binaries that may accept YAML or similar run-time configuration, |
| 51 | + but are not expected to support extensibility/plugins beyond the default OTel SDK, OTel SDKTracer, |
| 52 | + and OTel wire format exporter. |
| 53 | + * They may have their own recommendations for sampling rates, but they don't run the binaries in
| 54 | + production; they only provide packaged binaries. So their sampling rate configs and sampling
| 55 | + strategies need to be a finite "built in" set from OpenTelemetry's SDK.
| 56 | + * They need to deal with upstream sampling decisions made by services calling them. |
| 57 | + |
| 58 | +**Solution:** |
| 59 | + * Provide several built-in sampling strategies in the OTel SDK, all easily configurable via YAML,
| 60 | + future flags, etc.:
| 61 | + * Trust parent sampling decision (trusting & propagating parent SpanContext SampleBit) |
| 62 | + * Always keep |
| 63 | + * Never keep |
| 64 | + * Keep with 1/N probability |
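As an illustration, the four built-in strategies could be expressed as plain decision functions. This is a sketch under stated assumptions: `ParentContext` is a hypothetical stand-in for the parent `SpanContext`, `trust_parent` drops spans that have no parent (the RFC does not say what should happen in that case), and `keep_one_in` derives its 1/N decision from the trace ID, which assumes trace IDs are uniformly distributed.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class ParentContext:
    """Hypothetical stand-in for the parent SpanContext and its sample bit."""

    sampled: bool


Decision = Callable[[int, Optional[ParentContext]], bool]


def always_keep(trace_id: int, parent: Optional[ParentContext]) -> bool:
    return True


def never_keep(trace_id: int, parent: Optional[ParentContext]) -> bool:
    return False


def trust_parent(trace_id: int, parent: Optional[ParentContext]) -> bool:
    # Assumption: a span without a parent is dropped.
    return parent is not None and parent.sampled


def keep_one_in(n: int) -> Decision:
    """Return a strategy that keeps roughly one trace out of every n."""
    if n < 1:
        raise ValueError("n must be at least 1")

    def decide(trace_id: int, parent: Optional[ParentContext]) -> bool:
        # Assumption: trace IDs are uniformly distributed integers.
        return trace_id % n == 0

    return decide
```

Deriving the decision from the trace ID rather than a random draw makes every span of a trace reach the same verdict, which is one possible design choice and not something this RFC mandates.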
| 65 | + |
| 66 | +### Application developer |
| 67 | +These are the folks we've been thinking the most about for OTel in general. |
| 68 | + |
| 69 | + * They have full control over the OTel implementation or SDK configuration. When using the SDK they |
| 70 | + can configure custom exporters, custom code/samplers, etc. |
| 71 | + * They can choose to implement runtime configuration via a variety of means (e.g. baking in feature |
| 72 | + flags, reading YAML files, etc.), or even configure the library in code. |
| 73 | + * They make heavy usage of OTel for instrumenting application-specific behavior, beyond what may be |
| 74 | + provided by the libraries they use such as gRPC, Django, etc. |
| 75 | + |
| 76 | +**Solution:** |
| 77 | + * Allow application developers to link in custom samplers or write their own when using the |
| 78 | + official SDK. |
| 79 | + * These might include dynamic per-field sampling to achieve a target rate |
| 80 | + (e.g. https://github.com/honeycombio/dynsampler-go) |
| 81 | + * Sampling decisions are made within the start Span operation, after attributes relevant to the |
| 82 | + span have been added to the Span start operation but before a concrete Span object exists (so that |
| 83 | + either a NoOpSpan can be made, or an actual Span instance can be produced depending upon the |
| 84 | + sampler's decision). |
| 85 | + * Span.IsRecording() needs to be present to allow costly span attribute/log computation to be |
| 86 | + skipped if the span is a NoOp span. |
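The interplay between the start Span operation, the no-op span, and `IsRecording()` can be sketched as follows. All names here (`NoOpSpan`, `RecordingSpan`, `start_span`, `annotate`, the `payload.summary` key) are hypothetical, and the sampler is simplified to a callable that only sees the initial attributes.

```python
from typing import Callable, Dict

Attributes = Dict[str, object]


class NoOpSpan:
    """Returned when the sampler decides to drop the span."""

    def is_recording(self) -> bool:
        return False

    def set_attribute(self, key: str, value: object) -> None:
        pass  # Nothing is recorded.


class RecordingSpan:
    """Returned when the sampler decides to keep the span."""

    def __init__(self, name: str, attributes: Attributes) -> None:
        self.name = name
        self.attributes = dict(attributes)

    def is_recording(self) -> bool:
        return True

    def set_attribute(self, key: str, value: object) -> None:
        self.attributes[key] = value


def start_span(name: str, attributes: Attributes,
               sampler: Callable[[Attributes], bool]):
    # The decision is made after the initial attributes are known,
    # but before any concrete span object exists.
    if sampler(attributes):
        return RecordingSpan(name, attributes)
    return NoOpSpan()


def annotate(span, expensive: Callable[[], object]) -> None:
    # Skip the costly computation entirely for no-op spans.
    if span.is_recording():
        span.set_attribute("payload.summary", expensive())
```

Passing the expensive computation as a callable means it is never evaluated for a dropped span.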
| 87 | + |
| 88 | +### Application operator |
| 89 | +These are often the same people as the application developers, but not necessarily.
| 90 | + |
| 91 | + * They care about adjusting sampling rates and strategies to meet operational needs, debugging, |
| 92 | + and cost. |
| 93 | + |
| 94 | +**Solution:** |
| 95 | + * Use config files or feature flags written by the application developers to control the |
| 96 | + application sampling logic. |
| 97 | + * Use the config files to configure libraries and infrastructure package behavior. |
| 98 | + |
| 99 | +### Telemetry infrastructure owner |
| 100 | +They are the people who provide an implementation for the OTel API by using the SDK with custom |
| 101 | +`Exporter`s, `Sampler`s, hooks, etc. or by writing a custom implementation, as well as running the |
| 102 | +infrastructure for collecting exported traces. |
| 103 | + |
| 104 | + * They care about a variety of things, including efficiency, cost effectiveness, and being able to |
| 105 | + gather spans in a way that makes sense for them. |
| 106 | + |
| 107 | +**Solution:** |
| 108 | + * Infrastructure owners receive information attached to the span, after sampling hooks have already |
| 109 | + been run. |
| 110 | + |
| 111 | +## Internal details |
| 112 | +The interface for the Sampler class takes in: |
| 113 | + * `TraceID` |
| 114 | + * `SpanID` |
| 115 | + * Parent `SpanContext` if any |
| 116 | + * `Links` |
| 117 | + * Initial set of `Attributes` for the `Span` being constructed |
| 118 | + |
| 119 | +It produces as output:
| 120 | + * A boolean indicating whether to sample or drop the span.
| 121 | + * The new set of initial span `Attributes` (or the original `Attributes`, passed along unmodified).
| 122 | + * (under discussion in a separate RFC) the SamplingRate float.
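The inputs and outputs listed above suggest an interface along the following lines. This is a sketch only: the method name `should_sample`, the `SamplingResult` type, the field layout of `SpanContext`, and the example `TagDecisionSampler` are assumptions for illustration. The sampling rate is left out because it is still under discussion.

```python
from dataclasses import dataclass
from typing import Mapping, Optional, Protocol, Sequence


@dataclass(frozen=True)
class SpanContext:
    """Hypothetical minimal span context."""

    trace_id: int
    span_id: int
    sampled: bool


@dataclass(frozen=True)
class SamplingResult:
    sampled: bool                     # whether to sample or drop the span
    attributes: Mapping[str, object]  # new or unmodified initial attributes


class Sampler(Protocol):
    def should_sample(
        self,
        trace_id: int,
        span_id: int,
        parent: Optional[SpanContext],
        links: Sequence[SpanContext],
        attributes: Mapping[str, object],
    ) -> SamplingResult:
        ...


class TagDecisionSampler:
    """Example sampler: keeps every span and records which sampler ran."""

    def should_sample(self, trace_id, span_id, parent, links, attributes):
        updated = dict(attributes)  # copy, so the caller's mapping is untouched
        updated["sampler.name"] = "tag-decision"
        return SamplingResult(sampled=True, attributes=updated)
```

Returning the attributes as part of the result lets a sampler either enrich them or hand them back unchanged.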
| 123 | + |
| 124 | +## Trade-offs |
| 125 | + * We considered, instead of using the `SpanBuilder`, setting the sampler on the Span constructor, and |
| 126 | + requiring any `Attributes` to be populated prior to the start of the span's default start time. |
| 127 | + * We considered, instead of using the `SpanBuilder`, setting the `Sampler` and the `Attributes` |
| 128 | + used for the sampler before running an explicit MakeSamplingDecision() on the span. Attempts to |
| 129 | + create a child of the span would fail if MakeSamplingDecision() had not yet been run. |
| 130 | + * We considered allowing the sampling decision to be arbitrarily delayed. |
| 131 | + |
| 132 | +## Prior art and alternatives |
| 133 | +In Zipkin and other Dapper-based systems, all client-side sampling decisions are made at the
| 134 | +head of the trace. Thus, we need to retain compatibility with this model.
| 135 | + |
| 136 | +## Open questions |
| 137 | +This RFC does not necessarily resolve the question of how to propagate sampling rate values between |
| 138 | +different spans and processes. A separate RFC will be opened to cover this case. |
| 139 | + |
| 140 | +## Future possibilities |
| 141 | +In the future, we propose that library developers may be able to defer the decision on whether to
| 142 | +recommend the trace be sampled or not sampled until mid-way through execution.
| 143 | + |
| 144 | +## Related Issues |
| 145 | + * [opentelemetry-specification/189](https://github.com/open-telemetry/opentelemetry-specification/issues/189) |
| 146 | + * [opentelemetry-specification/187](https://github.com/open-telemetry/opentelemetry-specification/issues/187) |
| 147 | + * [opentelemetry-specification/164](https://github.com/open-telemetry/opentelemetry-specification/issues/164) |
| 148 | + * [opentelemetry-specification/125](https://github.com/open-telemetry/opentelemetry-specification/issues/125) |
| 149 | + * [opentelemetry-specification/87](https://github.com/open-telemetry/opentelemetry-specification/issues/87) |
| 150 | + * [opentelemetry-specification/66](https://github.com/open-telemetry/opentelemetry-specification/issues/66) |
| 151 | + * [opentelemetry-specification/65](https://github.com/open-telemetry/opentelemetry-specification/issues/65) |
| 152 | + * [opentelemetry-specification/53](https://github.com/open-telemetry/opentelemetry-specification/issues/53) |
| 153 | + * [opentelemetry-specification/33](https://github.com/open-telemetry/opentelemetry-specification/issues/33) |
| 154 | + * [opentelemetry-specification/32](https://github.com/open-telemetry/opentelemetry-specification/issues/32) |
| 155 | + * [opentelemetry-specification/31](https://github.com/open-telemetry/opentelemetry-specification/issues/31) |