Description
Describe your environment
OS: (e.g, Ubuntu)
Python version: 3.10.9
SDK version: 1.31.0
API version: 1.31.0
OpenTelemetry Collector: 0.115.1
Our application runs as a Kubernetes StatefulSet with 200 replicas and exports metrics through PeriodicExportingMetricReader. During OpenTelemetry Collector redeployments, a subset of replicas repeatedly logs:
Transient error StatusCode.UNAVAILABLE encountered while exporting metrics to opentelemetry-collector.monitor.svc.cluster.local:4317, retrying in 8s
These replicas never re-establish the connection after the collector has recovered; they stay in the retry state indefinitely even though the collector service is available again. Restarting the affected application instance restores metric export.
Relevant setup code:
import os

import psutil
from opentelemetry import trace, metrics
from opentelemetry.metrics import CallbackOptions, Observation
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
from opentelemetry.instrumentation.tornado import TornadoInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.environment_variables import (
    OTEL_EXPORTER_OTLP_METRICS_ENDPOINT,
    OTEL_EXPORTER_OTLP_TRACES_ENDPOINT,
)
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import ConsoleMetricExporter, PeriodicExportingMetricReader
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.semconv.resource import ResourceAttributes

# Default to a console exporter whose output is discarded.
otel_metrics_exporter = ConsoleMetricExporter(out=open(os.devnull, 'w'), formatter=lambda metrics_data: "")
# Switch to OTLP over gRPC only when a metrics endpoint is configured.
if os.getenv(OTEL_EXPORTER_OTLP_METRICS_ENDPOINT, None):
    otel_metrics_exporter = OTLPMetricExporter(
        insecure=True,
        max_export_batch_size=512,
    )
# Export every 15 seconds.
otel_metrics_reader = PeriodicExportingMetricReader(otel_metrics_exporter, export_interval_millis=15000)
# SERVICE_NAME, EG_REPLICA_ID and DEPLOYMENT_ENV are application constants defined elsewhere (not shown).
metrics.set_meter_provider(
    MeterProvider(
        resource=Resource.create(attributes={
            ResourceAttributes.SERVICE_NAME: SERVICE_NAME,
            ResourceAttributes.SERVICE_INSTANCE_ID: EG_REPLICA_ID,
            ResourceAttributes.SERVICE_NAMESPACE: DEPLOYMENT_ENV,
        }),
        metric_readers=[otel_metrics_reader],
    )
)
otel_meter = metrics.get_meter(__name__)


def _net_connections_established(options: CallbackOptions):
    # Count inet sockets currently in the ESTABLISHED state.
    connections = psutil.net_connections(kind='inet')
    established = sum(1 for conn in connections if conn.status == 'ESTABLISHED')
    yield Observation(int(established), {})


NET_CONNECTIONS_ESTABLISHED = otel_meter.create_observable_gauge(
    'net_connections_established',
    unit='1',
    callbacks=[_net_connections_established],
    description='Current established connections count',
)
What happened?
The following error repeats indefinitely:
Transient error StatusCode.UNAVAILABLE encountered while exporting metrics to opentelemetry-collector.monitor.svc.cluster.local:4317, retrying in 8s
Export does not recover until I restart the application instance.
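To help narrow this down, here is a minimal, stdlib-only sketch of a check that could be run inside an affected pod while the error is being logged. It resolves the collector host and attempts a plain TCP connection to each resolved address. `check_endpoint` is a hypothetical helper written for this report, not part of OpenTelemetry or gRPC. A successful TCP connect only shows that the Service address resolves and accepts connections from that pod; it says nothing about the state of the exporter's existing gRPC channel.

```python
import socket


def check_endpoint(host: str, port: int, timeout: float = 3.0) -> dict:
    """Resolve `host` and try a plain TCP connect to every resolved address.

    Hypothetical diagnostic helper: it does not speak gRPC or OTLP.
    """
    result = {"resolved": [], "reachable": [], "errors": []}
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        result["errors"].append(f"DNS resolution failed: {exc}")
        return result

    # Deduplicate addresses while keeping the resolver's order.
    addresses = list(dict.fromkeys(info[4][0] for info in infos))
    result["resolved"] = addresses

    for address in addresses:
        try:
            # The connection is closed again as soon as it succeeds.
            with socket.create_connection((address, port), timeout=timeout):
                result["reachable"].append(address)
        except OSError as exc:
            result["errors"].append(f"{address}: {exc}")
    return result
```

Calling `check_endpoint("opentelemetry-collector.monitor.svc.cluster.local", 4317)` from a stuck replica and from a healthy one would show whether both resolve the same addresses and can reach them.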
Steps to Reproduce
Happens occasionally during OpenTelemetry Collector redeployments; I have no deterministic reproduction steps.
Expected Result
The exporter reconnects automatically once the collector is available again.
Actual Result
The following error keeps repeating:
Transient error StatusCode.UNAVAILABLE encountered while exporting metrics to opentelemetry-collector.monitor.svc.cluster.local:4317, retrying in 8s
Export does not recover until I restart the application instance (the application, not the OpenTelemetry Collector).
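Since restarting the process restores export, one possible stopgap is to rebuild the exporter in-process after several consecutive failed exports instead of restarting the pod. The sketch below shows only that idea, in plain Python with no OpenTelemetry imports. `RecreatingExporter`, `factory`, `is_failure` and `max_failures` are hypothetical names; I have not verified this against the bug. To plug it into PeriodicExportingMetricReader it would presumably need to subclass the SDK's MetricExporter and treat `MetricExportResult.FAILURE` as the failure case, which is not shown here.

```python
from typing import Any, Callable


class RecreatingExporter:
    """Delegates to an inner exporter and rebuilds it after repeated failures.

    Hypothetical wrapper: `factory` builds a fresh exporter, `is_failure`
    decides whether an export result counts as a failure.
    """

    def __init__(
        self,
        factory: Callable[[], Any],
        is_failure: Callable[[Any], bool],
        max_failures: int = 3,
    ) -> None:
        self._factory = factory
        self._is_failure = is_failure
        self._max_failures = max_failures
        self._failures = 0
        self._exporter = factory()

    def export(self, *args: Any, **kwargs: Any) -> Any:
        result = self._exporter.export(*args, **kwargs)
        if not self._is_failure(result):
            # Any successful export resets the consecutive-failure counter.
            self._failures = 0
            return result
        self._failures += 1
        if self._failures >= self._max_failures:
            self._rebuild()
        return result

    def _rebuild(self) -> None:
        old = self._exporter
        self._exporter = self._factory()
        self._failures = 0
        try:
            old.shutdown()
        except Exception:
            # A failing shutdown of the old exporter must not block the swap.
            pass

    def shutdown(self, *args: Any, **kwargs: Any) -> Any:
        return self._exporter.shutdown(*args, **kwargs)
```

With the setup above, `factory` would be something like `lambda: OTLPMetricExporter(insecure=True, max_export_batch_size=512)`, so that each rebuild constructs a new exporter object.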
Additional context
No response
Would you like to implement a fix?
None