Description
Describe your environment
OS: Darwin 22.6.0
Python version: 3.13.2
SDK version: v1.31.1
API version: v1.31.1
opentelemetry-exporter-otlp v1.31.1
What happened?
After an otel-collector restart, the app failed to send traces and did not recover from the poisoned state. All subsequent gRPC requests fail.
Steps to Reproduce
- Start the otel-collector (with the grpc receiver):
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.123.0
    ports:
      - 1888:1888 # pprof extension
      - 8888:8888 # Prometheus metrics exposed by the Collector
      - 8889:8889 # Prometheus exporter metrics
      - 13133:13133 # health_check extension
      - 4317:4317 # OTLP gRPC receiver
      - 4318:4318 # OTLP http receiver
      - 55679:55679 # zpages extension
- Start your instrumented app to send traces via grpc. For example:
import os
import time

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.trace.id_generator import RandomIdGenerator
from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator


def main() -> None:
    otel_collector_grpc_endpoint = "localhost:4317"
    ping_interval = 3

    trace_provider = TracerProvider()
    trace_provider.add_span_processor(
        BatchSpanProcessor(
            OTLPSpanExporter(
                endpoint=os.path.join(otel_collector_grpc_endpoint, "v1/traces"),
                insecure=True,
            )
        )
    )
    trace.set_tracer_provider(trace_provider)

    while True:
        traceparent = random_traceparent()
        ping(traceparent)
        time.sleep(ping_interval)


def ping(traceparent: str) -> None:
    tracer = trace.get_tracer(__name__)
    carrier = {"traceparent": traceparent}
    print(carrier)
    ctx = TraceContextTextMapPropagator().extract(carrier)
    with tracer.start_as_current_span("ping", ctx):
        pass


def random_traceparent() -> str:
    to_hex = lambda s: hex(s)[2:]
    gen = RandomIdGenerator()
    trace_id = to_hex(gen.generate_trace_id()).zfill(32)
    span_id = to_hex(gen.generate_span_id()).zfill(16)
    return f"00-{trace_id}-{span_id}-01"


if __name__ == "__main__":
    main()
- Pause the otel-collector for more than ping_interval.
- Resume the otel-collector.
- Check that the app fails to send new traces.
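For anyone reproducing this, the pause and resume steps can be scripted. The sketch below assumes the collector runs under Docker Compose with the service name `otel-collector` from the compose file above, and that "pause" means `docker compose pause`. The helper names `compose_cmd` and `pause_and_resume` are hypothetical and not part of the original report.

```python
import subprocess
import time

SERVICE = "otel-collector"  # service name taken from the compose file above


def compose_cmd(action: str, service: str = SERVICE) -> list[str]:
    """Build a `docker compose <action> <service>` command line."""
    if action not in {"pause", "unpause"}:
        raise ValueError(f"unsupported action: {action}")
    return ["docker", "compose", action, service]


def pause_and_resume(pause_seconds: float) -> None:
    """Pause the collector, wait, then resume it (assumes Docker Compose is installed)."""
    subprocess.run(compose_cmd("pause"), check=True)
    time.sleep(pause_seconds)
    subprocess.run(compose_cmd("unpause"), check=True)
```

Calling `pause_and_resume(10)` from the directory containing the compose file keeps the collector paused for 10 seconds, which is longer than the `ping_interval` of 3 used in the example app.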
Expected Result
The instrumented app should send traces successfully after the otel-collector restart.
Actual Result
The app fails to send new traces after the otel-collector is resumed.
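To tell a collector that has not come back apart from an exporter that is stuck, it may help to check that the gRPC port accepts connections again. This is a minimal sketch using only the standard library. The helper name `port_is_open` is hypothetical, and a successful TCP connect only shows that something is listening on the port, not that the OTLP receiver is healthy.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

For the setup above, `port_is_open("localhost", 4317)` returning `True` while the app keeps logging export errors would suggest the problem is on the client side.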
Additional context
After the otel-collector restart, the Python app logs the following for every batch:
Transient error StatusCode.UNAVAILABLE encountered while exporting traces to localhost:4317/v1/traces, retrying in 8s.
The only way to restore tracing is to restart the app itself, which is painful.
Also, the http mode OTLPSpanExporter seems to work correctly, unlike grpc.
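As a possible workaround until the gRPC behavior is fixed, the span processor could be built on the HTTP exporter instead. This is a sketch, not a verified fix: it assumes the collector's OTLP http receiver on port 4318 from the compose file above and that the `opentelemetry-exporter-otlp-proto-http` package is installed. The helper names `http_traces_endpoint` and `build_http_processor` are hypothetical.

```python
def http_traces_endpoint(host: str = "localhost", port: int = 4318) -> str:
    """Build the OTLP/HTTP traces URL (4318 is the http receiver port in the compose file)."""
    return f"http://{host}:{port}/v1/traces"


def build_http_processor():
    """Create a BatchSpanProcessor that exports spans over OTLP/HTTP."""
    # Imported lazily so the URL helper above works even where the
    # HTTP exporter package is not installed.
    from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
    from opentelemetry.sdk.trace.export import BatchSpanProcessor

    return BatchSpanProcessor(OTLPSpanExporter(endpoint=http_traces_endpoint()))
```

In the example app, `trace_provider.add_span_processor(build_http_processor())` would replace the gRPC-based processor.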
Would you like to implement a fix?
None