This repository was archived by the owner on Dec 6, 2024. It is now read-only.

Commit 370f306

Propose sampling API changes
1 parent 579ced5 commit 370f306

1 file changed: +155 -0 lines changed
text/0006-sampling.md

Lines changed: 155 additions & 0 deletions
@@ -0,0 +1,155 @@
# Sampling API

*Status: proposed*

## TL;DR

This section summarizes the changes proposed in this RFC:

1. Move the `Sampler` interface from the API package to the SDK package.
1. Add a new `SamplingHint` concept to the API package.
1. Add the capability to record `Attributes` that can be used for the sampling decision at `Span`
   creation time.
1. Add the capability to start building a `Span` with a delayed `build` method. This is useful for
   cases where some `Attributes` that are useful for sampling are not yet available when the `Span`
   starts being built. For example, in Java the current `Span.Builder` uses as the start time of the
   `Span` the moment when the builder is created, not the moment when the `build()` method is
   called.
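
The delayed `build` idea in the last item can be sketched as follows. This is a minimal illustration, not the proposed API: `SpanBuilder`, `Span`, `set_attribute`, and `drop_health_checks` are hypothetical names, and capturing the start time in the constructor is an assumption that mirrors the Java note above.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict

# All names here are hypothetical stand-ins, not a real OpenTelemetry API.
Attributes = Dict[str, object]


@dataclass
class Span:
    name: str
    start_time_ns: int
    attributes: Attributes
    recording: bool


class SpanBuilder:
    """Collects sampling-relevant attributes before the span is built."""

    def __init__(self, name: str, should_sample: Callable[[Attributes], bool]):
        self._name = name
        self._should_sample = should_sample
        self._attributes: Attributes = {}
        # Assumption mirroring the Java note: the start time is taken when
        # the builder is created, not when build() is called.
        self._start_time_ns = time.time_ns()

    def set_attribute(self, key: str, value: object) -> "SpanBuilder":
        self._attributes[key] = value
        return self

    def build(self) -> Span:
        # The sampling decision is deferred to build(), so it can see every
        # attribute recorded on the builder so far.
        recording = self._should_sample(self._attributes)
        return Span(self._name, self._start_time_ns, dict(self._attributes), recording)


def drop_health_checks(attributes: Attributes) -> bool:
    """Hypothetical sampling rule: keep everything except health checks."""
    return attributes.get("http.route") != "/healthz"


builder = SpanBuilder("handle_request", drop_health_checks)
builder.set_attribute("http.route", "/healthz")  # known only after the builder exists
span = builder.build()
```

Because the route is recorded before `build()` runs, the sampling rule can reject the health-check span even though the attribute was not available when the builder was created.
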

## Motivation

Different users of OpenTelemetry, ranging from library developers, packaged infrastructure binary
developers, and application developers to operators and telemetry system owners, have separate use
cases for OpenTelemetry that have gotten muddled in the design of the original Sampling API. Thus,
we need to clarify which APIs each should be able to depend upon, and how they will configure
sampling and OpenTelemetry according to their needs.

![Personas](https://i.imgur.com/w1H0CfH.png)

## Explanation

We outline five different personas (which may be overlapping sets of people) and how they should
interact with OpenTelemetry:

### Library developer
Examples: gRPC, Express, Django developers.

* They must only depend upon the OTel API and not upon the SDK.
* They are shipping source code that will be linked into others' applications.
* They have no explicit runtime control over the application.
* They know some signal about which traces may be interesting (e.g. unusual control plane requests)
  or uninteresting (e.g. health-checks), but have to write fully generically.

**Solution:**

* On the start Span operation, the OpenTelemetry API will allow marking a span with one of three
  choices for the `SamplingHint`, with "don't care" as the default: [`don't care`, `suggest keeping`,
  `suggest discarding`]
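
One way to picture the three-valued hint is as an enumeration that a library sets from whatever generic signal it has. The sketch below is an assumption-laden illustration: the member names, the `hint_for_request` helper, and the health-check paths are all hypothetical and not part of the RFC.

```python
from enum import Enum


class SamplingHint(Enum):
    """The three hints named in the RFC; member names are illustrative."""
    DONT_CARE = "don't care"
    SUGGEST_KEEPING = "suggest keeping"
    SUGGEST_DISCARDING = "suggest discarding"


def hint_for_request(path: str, is_control_plane: bool = False) -> SamplingHint:
    """Hypothetical library-side helper mapping a generic signal to a hint."""
    if path in ("/healthz", "/health"):
        return SamplingHint.SUGGEST_DISCARDING  # health checks are rarely interesting
    if is_control_plane:
        return SamplingHint.SUGGEST_KEEPING  # unusual control plane requests
    return SamplingHint.DONT_CARE  # the default when the library has no opinion
```

The hint is only a suggestion: the library never sees or depends on the sampler that eventually makes the decision.
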

### Infrastructure package/binary developer
Examples: HBase, Envoy developers.

* They are shipping self-contained binaries that may accept YAML or similar run-time configuration,
  but are not expected to support extensibility/plugins beyond the default OTel SDK, OTel SDKTracer,
  and OTel wire format exporter.
* They may have their own recommendations for sampling rates, but they don't run the binaries in
  production; they only provide packaged binaries. So their sampling rate configs and sampling
  strategies need to be a finite "built in" set from OpenTelemetry's SDK.
* They need to deal with upstream sampling decisions made by services calling them.

**Solution:**
* Allow different sampling strategies by default in OTel SDK, all configurable easily via YAML or
  future flags, etc.:
  * Trust parent sampling decision (trusting & propagating parent SpanContext SampleBit)
  * Always keep
  * Never keep
  * Keep with 1/N probability
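
The four built-in strategies listed above could look roughly like this. It is a sketch under stated assumptions: the sampler signature (parent sampled bit in, keep/drop out), the config keys `strategy`, `n`, and `root`, and the fallback used for root spans under "trust parent" are all hypothetical. A plain dict stands in for parsed YAML so the example stays within the standard library.

```python
import random
from typing import Callable, Optional

# A sampler here takes the parent's sampled bit (None for a root span) and
# returns True to keep the span. This signature is a simplifying assumption.
Sampler = Callable[[Optional[bool]], bool]


def always_keep(parent_sampled: Optional[bool]) -> bool:
    return True


def never_keep(parent_sampled: Optional[bool]) -> bool:
    return False


def keep_one_in(n: int, rng: Optional[random.Random] = None) -> Sampler:
    if n < 1:
        raise ValueError("n must be at least 1")
    rng = rng or random.Random()

    def sample(parent_sampled: Optional[bool]) -> bool:
        # random() is uniform on [0.0, 1.0), so this is True with probability 1/n.
        return rng.random() < 1.0 / n

    return sample


def trust_parent(root_fallback: Sampler) -> Sampler:
    def sample(parent_sampled: Optional[bool]) -> bool:
        if parent_sampled is None:
            # Root span: no upstream decision exists, so defer to the fallback.
            return root_fallback(None)
        return parent_sampled

    return sample


def sampler_from_config(config: dict) -> Sampler:
    """Build a sampler from a parsed config mapping (hypothetical keys)."""
    strategy = config["strategy"]
    if strategy == "always":
        return always_keep
    if strategy == "never":
        return never_keep
    if strategy == "probability":
        return keep_one_in(int(config["n"]))
    if strategy == "parent":
        return trust_parent(sampler_from_config(config.get("root", {"strategy": "always"})))
    raise ValueError(f"unknown sampling strategy: {strategy!r}")
```

Keeping the set finite means a packaged binary can expose sampling through configuration alone, without loading user code.
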

### Application developer
These are the folks we've been thinking the most about for OTel in general.

* They have full control over the OTel implementation or SDK configuration. When using the SDK they
  can configure custom exporters, custom code/samplers, etc.
* They can choose to implement runtime configuration via a variety of means (e.g. baking in feature
  flags, reading YAML files, etc.), or even configure the library in code.
* They make heavy use of OTel for instrumenting application-specific behavior, beyond what may be
  provided by the libraries they use such as gRPC, Django, etc.

**Solution:**
* Allow application developers to link in custom samplers or write their own when using the
  official SDK.
  * These might include dynamic per-field sampling to achieve a target rate
    (e.g. https://github.com/honeycombio/dynsampler-go)
* Sampling decisions are made within the start Span operation, after attributes relevant to the
  span have been added to the Span start operation but before a concrete Span object exists (so that
  either a NoOpSpan can be made, or an actual Span instance can be produced depending upon the
  sampler's decision).
* `Span.IsRecording()` needs to be present to allow costly span attribute/log computation to be
  skipped if the span is a NoOp span.
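
The last two points can be illustrated together: the sampler runs inside the start operation, which returns either a recording span or a no-op span, and callers check `is_recording()` before doing expensive work. Everything named here (`start_span`, `RecordingSpan`, `NoOpSpan`, `keep_server_errors`, `describe_payload`) is a hypothetical stand-in, not the SDK's API.

```python
from typing import Callable, Dict

Attributes = Dict[str, object]
Sampler = Callable[[str, Attributes], bool]


class RecordingSpan:
    def __init__(self, name: str, attributes: Attributes):
        self.name = name
        self.attributes = dict(attributes)

    def is_recording(self) -> bool:
        return True

    def set_attribute(self, key: str, value: object) -> None:
        self.attributes[key] = value


class NoOpSpan:
    def is_recording(self) -> bool:
        return False

    def set_attribute(self, key: str, value: object) -> None:
        pass  # dropped spans ignore all writes


def start_span(name: str, attributes: Attributes, sampler: Sampler):
    # The decision happens here, after the initial attributes are known but
    # before any concrete span object has been created.
    if sampler(name, attributes):
        return RecordingSpan(name, attributes)
    return NoOpSpan()


def keep_server_errors(name: str, attributes: Attributes) -> bool:
    """Hypothetical custom sampler: keep only spans for 5xx responses."""
    status = attributes.get("http.status_code")
    return isinstance(status, int) and status >= 500


def describe_payload(payload: dict) -> str:
    """Stand-in for a costly attribute computation."""
    return ",".join(sorted(payload))


span = start_span("checkout", {"http.status_code": 503}, keep_server_errors)
if span.is_recording():
    # Only pay for the computation when the span will actually be recorded.
    span.set_attribute("payload.keys", describe_payload({"b": 1, "a": 2}))
```
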

### Application operator
Often the same people as the application developers, but not necessarily.

* They care about adjusting sampling rates and strategies to meet operational needs, debugging,
  and cost.

**Solution:**
* Use config files or feature flags written by the application developers to control the
  application sampling logic.
* Use the config files to configure libraries and infrastructure package behavior.

### Telemetry infrastructure owner
They are the people who provide an implementation for the OTel API by using the SDK with custom
`Exporter`s, `Sampler`s, hooks, etc. or by writing a custom implementation, as well as running the
infrastructure for collecting exported traces.

* They care about a variety of things, including efficiency, cost effectiveness, and being able to
  gather spans in a way that makes sense for them.

**Solution:**
* Infrastructure owners receive information attached to the span, after sampling hooks have already
  been run.

## Internal details
The interface for the `Sampler` class takes in:
* `TraceID`
* `SpanID`
* Parent `SpanContext` if any
* `Links`
* Initial set of `Attributes` for the `Span` being constructed

It produces as an output:
* A boolean indicating whether to sample or drop the span.
* The new set of initial span Attributes (or passes along the SpanAttributes unmodified)
* (under discussion in separate RFC) the SamplingRate float.
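
The inputs and outputs listed above map onto an interface along these lines. This is a sketch, not the proposed signature: the method name `should_sample`, the `SamplingResult` type, the toy `ModuloSampler`, and the `sampler.decision_source` attribute key are assumptions made for illustration, and the sampling rate output is omitted because it is still under discussion.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Protocol, Sequence

Attributes = Dict[str, object]


@dataclass(frozen=True)
class SpanContext:
    trace_id: int
    span_id: int
    sampled: bool


@dataclass(frozen=True)
class Link:
    context: SpanContext


@dataclass
class SamplingResult:
    sampled: bool           # whether to sample or drop the span
    attributes: Attributes  # the (possibly modified) initial span attributes


class Sampler(Protocol):
    def should_sample(
        self,
        trace_id: int,
        span_id: int,
        parent: Optional[SpanContext],
        links: Sequence[Link],
        attributes: Attributes,
    ) -> SamplingResult: ...


class ModuloSampler:
    """Toy sampler: follows the parent if present, else keeps trace IDs divisible by n."""

    def __init__(self, n: int):
        if n < 1:
            raise ValueError("n must be at least 1")
        self._n = n

    def should_sample(self, trace_id, span_id, parent, links, attributes):
        if parent is not None:
            sampled, source = parent.sampled, "parent"
        else:
            sampled, source = trace_id % self._n == 0, "trace_id"
        # Return a new attribute set rather than mutating the caller's dict.
        return SamplingResult(sampled, {**attributes, "sampler.decision_source": source})
```

A sampler that has nothing to add would simply return the incoming attributes unchanged as the second output.
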

## Trade-offs
* We considered, instead of using the `SpanBuilder`, setting the sampler on the Span constructor, and
  requiring any `Attributes` to be populated prior to the span's default start time.
* We considered, instead of using the `SpanBuilder`, setting the `Sampler` and the `Attributes`
  used for the sampler before running an explicit `MakeSamplingDecision()` on the span. Attempts to
  create a child of the span would fail if `MakeSamplingDecision()` had not yet been run.
* We considered allowing the sampling decision to be arbitrarily delayed.

## Prior art and alternatives
Prior art from Zipkin and other Dapper-based systems: all client-side sampling decisions are made at
the head. Thus, we need to retain compatibility with this.

## Open questions
This RFC does not necessarily resolve the question of how to propagate sampling rate values between
different spans and processes. A separate RFC will be opened to cover this case.

## Future possibilities
In the future, we propose that library developers may be able to defer the decision on whether to
recommend the trace be sampled or not sampled until mid-way through execution.

## Related Issues
* [opentelemetry-specification/189](https://github.com/open-telemetry/opentelemetry-specification/issues/189)
* [opentelemetry-specification/187](https://github.com/open-telemetry/opentelemetry-specification/issues/187)
* [opentelemetry-specification/164](https://github.com/open-telemetry/opentelemetry-specification/issues/164)
* [opentelemetry-specification/125](https://github.com/open-telemetry/opentelemetry-specification/issues/125)
* [opentelemetry-specification/87](https://github.com/open-telemetry/opentelemetry-specification/issues/87)
* [opentelemetry-specification/66](https://github.com/open-telemetry/opentelemetry-specification/issues/66)
* [opentelemetry-specification/65](https://github.com/open-telemetry/opentelemetry-specification/issues/65)
* [opentelemetry-specification/53](https://github.com/open-telemetry/opentelemetry-specification/issues/53)
* [opentelemetry-specification/33](https://github.com/open-telemetry/opentelemetry-specification/issues/33)
* [opentelemetry-specification/32](https://github.com/open-telemetry/opentelemetry-specification/issues/32)
* [opentelemetry-specification/31](https://github.com/open-telemetry/opentelemetry-specification/issues/31)
