Transient error StatusCode.UNAVAILABLE #4517

Open
@caydenwei

Describe your environment

OS: (e.g, Ubuntu)
Python version: 3.10.9
SDK version: 1.31.0
API version: 1.31.0
OpenTelemetry Collector version: 0.115.1

Our application runs as a Kubernetes StatefulSet with 200 replicas using PeriodicExportingMetricReader for metrics export. During OpenTelemetry Collector redeployments, a subset of replicas persistently log:

Transient error StatusCode.UNAVAILABLE encountered while exporting metrics to opentelemetry-collector.monitor.svc.cluster.local:4317, retrying in 8s.

These replicas fail to re-establish the connection after the collector recovers, and they remain in a permanent retry state even though the collector service is available again. Restarting the affected application instance restores metrics export.
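One thing that may be worth checking from inside a stuck replica is whether the collector's Service name still resolves, and to which addresses. The sketch below is only a diagnostic aid: `resolve_ipv4` is a hypothetical helper, and the idea that the exporter is holding on to a stale address is an assumption that this report does not confirm.

```python
import socket


def resolve_ipv4(host: str, port: int) -> list[str]:
    """Return the sorted, de-duplicated IPv4 addresses that `host` resolves to right now."""
    infos = socket.getaddrinfo(
        host, port, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr is (ip, port).
    return sorted({info[4][0] for info in infos})


# Example call from a shell inside the affected pod:
#   resolve_ipv4("opentelemetry-collector.monitor.svc.cluster.local", 4317)
```

If the addresses returned here are reachable while the exporter keeps reporting UNAVAILABLE, that would point at the client-side channel rather than at cluster DNS.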

import os

import psutil
from opentelemetry import trace, metrics
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
from opentelemetry.instrumentation.tornado import TornadoInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.metrics import CallbackOptions, Observation
from opentelemetry.sdk.environment_variables import OTEL_EXPORTER_OTLP_TRACES_ENDPOINT, \
    OTEL_EXPORTER_OTLP_METRICS_ENDPOINT
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import ConsoleMetricExporter, PeriodicExportingMetricReader
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.semconv.resource import ResourceAttributes

# SERVICE_NAME, EG_REPLICA_ID and DEPLOYMENT_ENV are application-specific
# constants defined elsewhere in the application.

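# Default to a console exporter that discards its output; switch to OTLP over
# gRPC only when a metrics endpoint is configured in the environment.
otel_metrics_exporter = ConsoleMetricExporter(out=open(os.devnull, 'w'), formatter=lambda metrics_data: "")
if os.getenv(OTEL_EXPORTER_OTLP_METRICS_ENDPOINT):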
    otel_metrics_exporter = OTLPMetricExporter(
        insecure=True,
        max_export_batch_size=512
    )

otel_metrics_reader = PeriodicExportingMetricReader(otel_metrics_exporter, export_interval_millis=15000)
metrics.set_meter_provider(
    MeterProvider(
        resource=Resource.create(attributes={
            ResourceAttributes.SERVICE_NAME: SERVICE_NAME,
            ResourceAttributes.SERVICE_INSTANCE_ID: EG_REPLICA_ID,
            ResourceAttributes.SERVICE_NAMESPACE: DEPLOYMENT_ENV
        }),
        metric_readers=[otel_metrics_reader]
    )
)

otel_meter = metrics.get_meter(__name__)


def _net_connections_established(options: CallbackOptions):
    connections = psutil.net_connections(kind='inet')
    established = sum(1 for conn in connections if conn.status == 'ESTABLISHED')
    yield Observation(int(established), {})


NET_CONNECTIONS_ESTABLISHED = otel_meter.create_observable_gauge(
    'net_connections_established',
    unit='1',
    callbacks=[_net_connections_established],
    description='Current established connections count',
)

What happened?

The affected replicas keep logging "Transient error StatusCode.UNAVAILABLE encountered while exporting metrics to opentelemetry-collector.monitor.svc.cluster.local:4317, retrying in 8s" and never recover unless I restart the instance.
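The "retrying in 8s" in the message is consistent with a doubling backoff schedule. The following is a minimal sketch of such a schedule, assuming delays start at 1 second and are capped at 64 seconds. Both values are assumptions for illustration and were not read from the SDK source; `exp_backoff_delays` is a hypothetical name.

```python
from typing import Iterator


def exp_backoff_delays(max_value: int = 64) -> Iterator[int]:
    """Yield 1, 2, 4, ... seconds, ending with max_value itself."""
    delay = 1
    while delay < max_value:
        yield delay
        delay *= 2
    # The final delay is clamped to the cap.
    yield max_value


print(list(exp_backoff_delays()))  # [1, 2, 4, 8, 16, 32, 64]
```

If that assumption holds, an 8-second delay would be the fourth consecutive failed attempt for a single export, which would mean each export call is failing repeatedly rather than failing once and then succeeding.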

Steps to Reproduce

Happens occasionally, during OpenTelemetry Collector redeployments.

Expected Result

The exporter recovers automatically once the collector is available again.
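Until the root cause is known, one possible stopgap is to do in-process what the restart currently does: throw away the exporter after several consecutive failures and build a new one. The sketch below shows only that idea. `RecreatingExporter`, `factory`, `is_success` and `max_failures` are all hypothetical names, and the class is not part of the OpenTelemetry SDK.

```python
from typing import Any, Callable


class RecreatingExporter:
    """Delegate export() to an inner exporter; rebuild it after repeated failures."""

    def __init__(
        self,
        factory: Callable[[], Any],
        is_success: Callable[[Any], bool],
        max_failures: int = 3,
    ) -> None:
        self._factory = factory
        self._is_success = is_success
        self._max_failures = max_failures
        self._failures = 0
        self._inner = factory()

    def export(self, data: Any, **kwargs: Any) -> Any:
        result = self._inner.export(data, **kwargs)
        if self._is_success(result):
            self._failures = 0
            return result
        self._failures += 1
        if self._failures >= self._max_failures:
            # Discard the old exporter (and whatever connection it holds)
            # and start over with a fresh one.
            try:
                self._inner.shutdown()
            except Exception:
                pass  # best effort; the old exporter is being dropped anyway
            self._inner = self._factory()
            self._failures = 0
        return result
```

With the SDK, `factory` could be something like `lambda: OTLPMetricExporter(insecure=True)` and `is_success` a comparison against `MetricExportResult.SUCCESS`. The wrapper as written is duck-typed and is not a drop-in `MetricExporter`; it would still need to be adapted to whatever `PeriodicExportingMetricReader` expects from its exporter, which is not attempted here.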

Actual Result

The affected replicas keep logging "Transient error StatusCode.UNAVAILABLE encountered while exporting metrics to opentelemetry-collector.monitor.svc.cluster.local:4317, retrying in 8s" and do not recover unless I restart the instance (the application instance, not the OpenTelemetry Collector instance).
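To get more detail out of a stuck replica, gRPC's own logging can be turned up through the `GRPC_VERBOSITY` and `GRPC_TRACE` environment variables. A minimal sketch, assuming it is acceptable to set them from Python at the top of the application's entry point (setting them in the pod spec would work as well):

```python
import os

# To be safe, set these before the grpc package is first imported,
# i.e. before any OpenTelemetry gRPC exporter import.
os.environ["GRPC_VERBOSITY"] = "debug"
# Trace channel connectivity state changes (IDLE, CONNECTING, READY, ...).
os.environ["GRPC_TRACE"] = "connectivity_state"
```

The resulting output should show whether the channel ever leaves the failure state after the collector is back, which would help narrow down where the export is stuck.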

Additional context

No response

Would you like to implement a fix?

None

Labels: bug
