OpenTelemetry SDK Setup for Python

Implementing a production-grade observability pipeline begins with precise OpenTelemetry SDK configuration: the order in which you build the resource, create the provider, attach a processor, and register propagators determines whether traces arrive intact or fragment under load. This page is part of the Distributed Tracing and OpenTelemetry in Python guide and covers dependency management, provider initialization, and exporter routing for Python workloads. It feeds directly into framework integrations such as instrumenting Python web frameworks and the focused walkthrough for setting up OpenTelemetry in FastAPI, and it underpins the span lifecycle and attributes you record on top of it.

[Figure: SDK initialization order. 1. Resource (service.name); 2. Provider (set global once); 3. Batch processor (OTLP export); 4. Propagator (W3C trace context + baggage); 5. Instrument (after the global provider is set).]
The fixed initialization order: resource, provider, batch processor, propagator, then framework instrumentation last.

Key implementation priorities are dependency isolation, global provider bootstrapping, semantic resource mapping, and OTLP exporter tuning. Get these four right and telemetry ingests reliably under high concurrency; get the order wrong and spans silently route to a no-op provider.

The most common failure in SDK setup is not a crash — it is silence. When initialization runs in the wrong order or the provider is never promoted to global, tracers return non-recording spans, the application behaves normally, and no error appears anywhere. Nothing reaches the collector, and the absence looks like a networking problem rather than a bootstrap bug. The discipline this guide enforces — a fixed five-step order, an explicit resource, a batch processor, and a global registration that happens before any instrumentation attaches — exists specifically to make that silent failure impossible.

Prerequisites

Pin every OpenTelemetry package to a bounded range so a transitive upgrade cannot change instrumentation behavior mid-deploy. The API and SDK are versioned together, while contrib instrumentation packages track a separate 0.x beta line.

# pyproject.toml — production pinning strategy
[project]
dependencies = [
  "opentelemetry-api>=1.30.0,<2.0.0",
  "opentelemetry-sdk>=1.30.0,<2.0.0",
  "opentelemetry-exporter-otlp-proto-grpc>=1.30.0,<2.0.0",
  "opentelemetry-semantic-conventions>=0.51b0,<1.0.0",
]

# Shell: deployment-time environment variables read by the SDK
export OTEL_SERVICE_NAME="payment-service"
export OTEL_EXPORTER_OTLP_ENDPOINT="otel-collector:4317"
export OTEL_RESOURCE_ATTRIBUTES="deployment.environment=production,team=platform"

The SDK reads these variables automatically, so they are the deployment-time source of truth that overrides the code defaults shown later. Set them in the orchestration manifest — a Kubernetes env block, an ECS task definition, a systemd unit — rather than baking values into the image, so the same artifact promotes cleanly from staging to production. Keep secrets out of OTEL_RESOURCE_ATTRIBUTES; it is replicated onto every span and is meant for low-cardinality routing dimensions, not credentials.
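
A startup guard can catch a missing variable or a credential-shaped attribute before the first span is emitted. The sketch below uses only the standard library; `check_otel_env`, `parse_resource_attributes`, and the `SUSPICIOUS` word list are hypothetical names chosen for illustration, not part of the OpenTelemetry SDK.

```python
import os

# Hypothetical word list for illustration; tune it to your own naming rules.
SUSPICIOUS = ("password", "secret", "token", "key")


def parse_resource_attributes(raw: str) -> dict[str, str]:
    """Split the comma-separated key=value format used by OTEL_RESOURCE_ATTRIBUTES."""
    attrs = {}
    for pair in raw.split(","):
        if not pair.strip():
            continue
        key, _, value = pair.partition("=")
        attrs[key.strip()] = value.strip()
    return attrs


def check_otel_env(env) -> list[str]:
    """Return human-readable problems found in a mapping of environment variables."""
    problems = []
    for name in ("OTEL_SERVICE_NAME", "OTEL_EXPORTER_OTLP_ENDPOINT"):
        if not env.get(name):
            problems.append(f"{name} is not set")
    attrs = parse_resource_attributes(env.get("OTEL_RESOURCE_ATTRIBUTES", ""))
    for key in attrs:
        if any(word in key.lower() for word in SUSPICIOUS):
            problems.append(f"resource attribute {key!r} looks like a credential")
    return problems


if __name__ == "__main__":
    for problem in check_otel_env(os.environ):
        print("telemetry config:", problem)
```

Running this as a container entrypoint check, or as a unit test against the manifest values, keeps the failure visible at deploy time instead of surfacing later as missing traces.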

Concept and Architecture

The OpenTelemetry ecosystem strictly separates opentelemetry-api from opentelemetry-sdk. Libraries depend only on the API, which is a no-op until an application installs and configures the SDK. This decoupling lets you instrument a library without forcing a tracing runtime on its consumers, and it means your application — not your dependencies — owns the export pipeline.

This split has a direct consequence for version management. The opentelemetry-api and opentelemetry-sdk packages share a stable 1.x version line and must match, because the SDK implements the exact API surface the version declares; a mismatch is normally rejected at install time, since each SDK release pins its matching API release, and a forced mismatch tends to surface as import or attribute errors at startup. The contrib instrumentation packages (opentelemetry-instrumentation-*) and the semantic-convention package track a separate 0.x beta line that advances faster, so pin them to their own bounded range rather than assuming they move in lockstep with the core. Keeping these two lines pinned independently is what prevents a routine dependency bump from silently changing which spans your libraries emit.
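
The lockstep requirement can be asserted at startup rather than trusted. This is a minimal sketch using `importlib.metadata` from the standard library; the helper names `installed_version` and `same_release` are hypothetical, and the check assumes the two core packages are expected to carry identical version strings.

```python
from importlib import metadata


def installed_version(dist: str):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(dist)
    except metadata.PackageNotFoundError:
        return None


def same_release(api_version, sdk_version) -> bool:
    """True only when both packages are installed at the identical version."""
    return api_version is not None and api_version == sdk_version


if __name__ == "__main__":
    api = installed_version("opentelemetry-api")
    sdk = installed_version("opentelemetry-sdk")
    if not same_release(api, sdk):
        raise SystemExit(
            f"opentelemetry-api={api} and opentelemetry-sdk={sdk} are out of step"
        )
```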

Four SDK objects do the work. The Resource is an immutable bag of identity attributes attached to the TracerProvider. The TracerProvider is the global factory that hands out tracers and owns the processor chain. A SpanProcessor receives spans as they end and decides how to export them; the BatchSpanProcessor is the production choice because it queues and flushes on a background thread. The exporter — here OTLP over gRPC — serializes spans to Protobuf and ships them to the collector. Because the provider is global state, fragmenting it across local instances complicates downstream querying and breaks service-topology generation, which is why one provider per process is the rule. Proper dependency resolution and a clean provider lifecycle directly shape how you later manage the span lifecycle and attributes across rolling deployments.

These objects form a one-directional pipeline. A tracer obtained from the provider creates a span; when the span ends, the provider hands it to every registered processor in turn; the batch processor enqueues it and, on its schedule, drains the queue into the exporter, which serializes the batch and writes it to the collector. Nothing on this path runs on your request thread except the cheap enqueue, which is the whole point: the expensive work — serialization, the network round trip, retries — happens on the processor's background daemon thread. Understanding this flow explains every tuning parameter later, because each one controls a different stage of the same pipeline.

A useful mental model is that the resource answers "who am I", the provider answers "where do tracers come from", the processor answers "when and how do finished spans leave", and the exporter answers "in what format and to where". Each is replaceable in isolation, which is why the same application code runs unchanged across local development, CI, and production — only the processor and exporter pair differs between environments.

The provider also owns the sampler. The default is ParentBased(ALWAYS_ON), which records every trace the service roots and otherwise follows the parent's decision; that is fine in development but rarely what you want in production. Configure a ParentBased(TraceIdRatioBased(ratio)) sampler so the service honors an upstream sampling decision and only makes its own probabilistic choice for traces it roots. Set the sampler when you construct the provider, because, like the resource, it is meant to stay fixed for the provider's lifetime. The exporter, by contrast, is the one piece you can swap freely: a console exporter for local debugging, an in-memory exporter for tests, and the gRPC OTLP exporter in production all plug into the same BatchSpanProcessor without other changes.

Step-by-Step Implementation

  1. Define the resource. Build it from environment-aware defaults using the official semantic-convention keys so a single codebase produces distinct identities per environment. Always include service.name, service.version, and deployment.environment. Using the ResourceAttributes constants instead of raw strings protects you from typos that would otherwise create silent duplicate dimensions — service.name and service_name are different keys to a backend, and only one will populate the service map. The resource you build here is merged with whatever OTEL_RESOURCE_ATTRIBUTES provides at runtime, so deployment manifests can add service.instance.id or cloud.region without a code change.
import os
from opentelemetry.sdk.resources import Resource
from opentelemetry.semconv.resource import ResourceAttributes

resource = Resource.create({
    ResourceAttributes.SERVICE_NAME: os.getenv("OTEL_SERVICE_NAME", "payment-service"),
    ResourceAttributes.SERVICE_VERSION: os.getenv("SERVICE_VERSION", "2.4.1"),
    ResourceAttributes.DEPLOYMENT_ENVIRONMENT: os.getenv("DEPLOYMENT_ENV", "production"),
})
  2. Create the provider. Instantiate exactly one TracerProvider with the resource. Doing this during module import is safe; defer expensive resource detection to a startup hook in containers to keep cold starts fast.
from opentelemetry.sdk.trace import TracerProvider

provider = TracerProvider(resource=resource)
  3. Configure the OTLP exporter. Prefer gRPC in high-throughput services for its multiplexed connections and Protobuf framing; the HTTP/Protobuf exporter is the better choice only when a proxy or service mesh on the path cannot handle long-lived gRPC streams. Set a bounded timeout and keep insecure=False so a misconfigured TLS setting fails loudly instead of sending plaintext. The endpoint should point at a collector reachable on the local network (localhost for a sidecar, the node address for a daemonset, or an internal DNS name for a gateway pool), never directly at a public backend, because the collector is what provides the retries and buffering the exporter alone lacks.
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

exporter = OTLPSpanExporter(
    endpoint=os.getenv("OTEL_EXPORTER_OTLP_ENDPOINT", "otel-collector:4317"),
    insecure=False,
    timeout=10,
)
  4. Attach the batch processor. Wrap the exporter in a BatchSpanProcessor with queue and batch sizes tuned to your concurrency, then register it on the provider. The processor flushes on a background daemon thread, so it is safe inside an asyncio event loop. The three knobs interact: max_queue_size caps memory and is the buffer that absorbs a brief collector outage, max_export_batch_size bounds the size of each OTLP request, and schedule_delay_millis caps how long a finished span waits before it is sent. Size the queue to roughly twice your peak concurrent spans so a traffic spike does not overflow it and start dropping spans, and keep the delay short enough that an unexpected crash costs only a few seconds of telemetry rather than a full batch interval.
from opentelemetry.sdk.trace.export import BatchSpanProcessor

provider.add_span_processor(BatchSpanProcessor(
    exporter,
    max_queue_size=2048,
    max_export_batch_size=512,
    schedule_delay_millis=5000,
))
  5. Set the global provider, then register propagators. Promote the provider to global state before any instrumentation attaches, and register a composite propagator so trace context and baggage both survive every hop. Routing always goes to a local collector first so buffering, retries, header injection, and sampling happen before data leaves your VPC; this is the same baseline the instrumenting Python web frameworks integrations build on. Order matters here as much as anywhere: set_tracer_provider must run before the first span is started or any instrument_app invocation, because spans created while the default provider is still in place are non-recording, and the global provider can be set only once per process, so a later attempt to replace it is ignored with a warning.
from opentelemetry import trace
from opentelemetry.propagate import set_global_textmap
from opentelemetry.propagators.composite import CompositePropagator
from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator
from opentelemetry.baggage.propagation import W3CBaggagePropagator

trace.set_tracer_provider(provider)
set_global_textmap(CompositePropagator([
    TraceContextTextMapPropagator(),
    W3CBaggagePropagator(),
]))

Configuration Reference

| Parameter / env var | Type | Default | Production-recommended |
|---|---|---|---|
| OTEL_EXPORTER_OTLP_ENDPOINT | string | localhost:4317 | local collector address, e.g. otel-collector:4317 |
| max_queue_size | int | 2048 | 2× peak concurrent spans in flight |
| max_export_batch_size | int | 512 | 512; raise only if the collector keeps pace |
| schedule_delay_millis | int | 5000 | 2000–5000 to amortize network I/O |
| OTLPSpanExporter.timeout | int (s) | 10 | 5–10 so retries cannot block indefinitely |
| OTLPSpanExporter.insecure | bool | False | False in production; TLS to the collector |
| OTEL_SERVICE_NAME | string | unknown_service | explicit service identifier per deployment |
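
The queue and delay recommendations can be sanity-checked with a little arithmetic. The helper below is a hypothetical planning aid, not an SDK function, and it models the pessimistic case in which spans accumulate for one full schedule interval before any export happens; the traffic figures are made up for illustration.

```python
import math


def batch_plan(
    peak_concurrent_spans: int,
    spans_per_second: float,
    schedule_delay_millis: int = 5000,
    max_export_batch_size: int = 512,
) -> dict:
    """Estimate queue size and export batches for one schedule interval."""
    max_queue_size = 2 * peak_concurrent_spans  # the 2x rule from the table
    spans_per_interval = spans_per_second * schedule_delay_millis / 1000
    return {
        "max_queue_size": max_queue_size,
        "spans_per_interval": spans_per_interval,
        "batches_per_interval": math.ceil(spans_per_interval / max_export_batch_size),
        "interval_fits_in_queue": spans_per_interval <= max_queue_size,
    }


# Illustrative workload: 1024 spans in flight at peak, 300 finished spans per second.
plan = batch_plan(peak_concurrent_spans=1024, spans_per_second=300)
print(plan)
# {'max_queue_size': 2048, 'spans_per_interval': 1500.0,
#  'batches_per_interval': 3, 'interval_fits_in_queue': True}
```

If `interval_fits_in_queue` comes out False for your numbers, either the queue needs to grow or the delay needs to shrink before the collector ever becomes the bottleneck.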

Async and Concurrency Considerations

Trace continuity across HTTP, gRPC, and async queues requires the propagators registered in step five; without W3CBaggagePropagator in the composite, baggage is silently dropped on every outbound call. The batch processor's background thread is event-loop-safe, but long-running workers still need disciplined span lifecycle management — an unclosed span pins its context and leaks memory across the worker's lifetime.

Use contextvars to keep span context isolated across concurrent tasks, and copy the context before scheduling work on an executor so child spans parent correctly rather than starting orphan roots. For non-instrumented libraries, wrap external calls in tracer.start_as_current_span() with explicit error handling so a raised exception still closes the span. The deeper mechanics of header injection, extraction, and baggage limits live in context propagation and baggage.
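
The executor advice is easiest to see with plain contextvars, which is the mechanism the Python SDK relies on for its current-span context. In this standard-library sketch, `current_span_name` is a hypothetical stand-in for that context: the first submission runs in the worker thread's own context, while the second carries a copy of the caller's context along with it.

```python
import contextvars
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the span context the SDK stores in a ContextVar.
current_span_name = contextvars.ContextVar("current_span_name", default=None)


def child_work():
    """Report which 'span' the worker thread believes is current."""
    return current_span_name.get()


def run_demo():
    current_span_name.set("process_transaction")
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Submitted bare: the worker thread does not see the caller's context
        # on standard CPython builds, so a child span would start an orphan root.
        bare = pool.submit(child_work).result()
        # Copy the caller's context and run the task inside that copy.
        ctx = contextvars.copy_context()
        carried = pool.submit(ctx.run, child_work).result()
    return bare, carried


bare, carried = run_demo()
print("bare:", bare, "carried:", carried)
```

The same `copy_context().run` pattern applies when the callable starts a real span: the span it creates then parents to whatever was current in the submitting task.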

The batch processor is safe for asyncio precisely because it never blocks the event loop: the loop only enqueues spans, and a separate OS thread performs the gRPC export. This does mean the export thread competes for the GIL, but the contention is negligible because serialization is fast and the network call releases the GIL while waiting. The one operation that does block is force_flush, which you should call only during graceful shutdown — never on the request path — to drain the queue before the process exits. In an ASGI application this belongs in the lifespan shutdown handler, exactly the pattern shown for setting up OpenTelemetry in FastAPI and reused across the instrumenting Python web frameworks integrations.

A practical workflow is to develop against the console exporter (Example 3), then flip a single environment flag to route the same code through the OTLP exporter and a real collector. Because the only thing that changes between the two is the processor and exporter pair, your span names, attributes, and propagation behavior are identical in both, so what you verify locally is exactly what runs in production.

Production Code Examples

Example 1: Production SDK Initialization with OTLP Exporter

This assembles the full bootstrap — resource, provider, tuned batch processor, global registration — in the order the diagram prescribes.

import os
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.semconv.resource import ResourceAttributes
# pip install "opentelemetry-sdk>=1.30.0,<2.0.0" \
#   "opentelemetry-exporter-otlp-proto-grpc>=1.30.0,<2.0.0"

resource = Resource.create({                                  # 1. identity
    ResourceAttributes.SERVICE_NAME: os.getenv("OTEL_SERVICE_NAME", "payment-service"),
    ResourceAttributes.SERVICE_VERSION: os.getenv("SERVICE_VERSION", "2.4.1"),
    ResourceAttributes.DEPLOYMENT_ENVIRONMENT: os.getenv("DEPLOYMENT_ENV", "production"),
})
provider = TracerProvider(resource=resource)                  # 2. provider
exporter = OTLPSpanExporter(                                   # 3. exporter
    endpoint=os.getenv("OTEL_EXPORTER_OTLP_ENDPOINT", "otel-collector:4317"),
    insecure=False, timeout=10,
)
provider.add_span_processor(BatchSpanProcessor(               # 4. async batch export
    exporter, max_queue_size=2048, max_export_batch_size=512, schedule_delay_millis=5000,
))
trace.set_tracer_provider(provider)                           # 5. global, before instrumenting
tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("process_transaction") as span:
    span.set_attribute("payment.amount_cents", 4999)

Expected Output:

# Spans queue in memory and flush asynchronously to the OTLP endpoint.
# Representative payload received by the collector
# (status stays unset because the example never calls set_status):
{
  "resourceSpans": [{
    "resource": {"attributes": [
      {"key": "service.name", "value": {"stringValue": "payment-service"}},
      {"key": "deployment.environment", "value": {"stringValue": "production"}}
    ]},
    "scopeSpans": [{"spans": [{
      "name": "process_transaction",
      "kind": "SPAN_KIND_INTERNAL",
      "status": {"code": "STATUS_CODE_UNSET"}
    }]}]
  }]
}

Example 2: Async-Compatible Context Propagation Setup

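This registers W3C-compliant propagators globally so trace and baggage headers are injected and extracted on every hop between services. Within a process, continuity across asyncio tasks comes from contextvars, and work handed to concurrent.futures needs its context copied explicitly.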

from opentelemetry.propagate import set_global_textmap
from opentelemetry.propagators.composite import CompositePropagator
from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator
from opentelemetry.baggage.propagation import W3CBaggagePropagator

# Combine trace context and baggage, then register before instrumentation attaches.
set_global_textmap(CompositePropagator([
    TraceContextTextMapPropagator(),
    W3CBaggagePropagator(),
]))

Expected Output:

# Headers the SDK now injects on outbound HTTP/gRPC requests:
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
baggage: user_id=usr_98765,tenant_id=acme_corp

Example 3: Local Verification with the Console Exporter

Before pointing at a collector, confirm the pipeline emits spans at all by swapping in the console exporter. This is the fastest way to prove your bootstrap order is correct, because a misconfigured provider prints nothing.

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter
from opentelemetry.sdk.resources import Resource
# pip install "opentelemetry-sdk>=1.30.0,<2.0.0"

# SimpleSpanProcessor is fine here: this is a local debugging path, not production.
provider = TracerProvider(resource=Resource.create({"service.name": "bootstrap-check"}))
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

with trace.get_tracer(__name__).start_as_current_span("smoke-test") as span:
    span.set_attribute("check.ok", True)

Expected Output:

{
  "name": "smoke-test",
  "context": {"trace_id": "0x0af7651916cd43dd8448eb211c80319c", "span_id": "0xb7ad6b7169203331"},
  "kind": "SpanKind.INTERNAL",
  "attributes": {"check.ok": true},
  "resource": {"attributes": {"service.name": "bootstrap-check"}},
  "status": {"status_code": "UNSET"}
}

Seeing this JSON on stdout proves the resource, provider, and tracer are wired correctly. Once it appears, switch the processor to BatchSpanProcessor and the exporter to OTLP for production, leaving everything else unchanged.

Common Mistakes

Spans Vanish Because Instrumentation Ran First

Error signature: No spans reach the collector despite a healthy exporter; tracers return non-recording spans. Root cause: Spans were started, or auto-instrumentation attached, before set_tracer_provider ran, so they were created against the default no-op provider, or a provider installed earlier caused the later set_tracer_provider call to be ignored. Remediation: Bootstrap the SDK at the very top of process startup and call set_tracer_provider before importing or attaching any framework instrumentation.

Event Loop Stalls Under Load

Error signature: Request latency spikes and export timeout errors appear in the logs during peak traffic. Root cause: A SimpleSpanProcessor is exporting synchronously on every span end, blocking the event loop on network I/O. Remediation: Replace it with BatchSpanProcessor, size max_queue_size to roughly twice peak concurrency, and keep schedule_delay_millis between 2000 and 5000.

Service Map Shows unknown_service

Error signature: Traces aggregate under unknown_service or the executable name; topology generation fails. Root cause: No Resource with service.name was attached before constructing the provider. Remediation: Build the Resource explicitly and set OTEL_SERVICE_NAME in deployment manifests so both code and environment agree.

Baggage Silently Disappears Across Services

Error signature: traceparent propagates correctly but custom baggage keys never reach downstream services. Root cause: A custom propagator was registered without including W3CBaggagePropagator. Remediation: Always register baggage inside a CompositePropagator alongside the trace-context propagator.

Global Provider Contamination in Tests

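Error signature: Duplicate spans or cross-test telemetry bleed between test runs. Root cause: The global provider persists across test cases, and because set_tracer_provider takes effect only once per process, later tests cannot simply swap it out. Remediation: Give each test its own TracerProvider wired to an in-memory span exporter and obtain tracers from that provider directly rather than through the global; in multi-process servers, initialize providers only after worker forking.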

Duplicate Spans After a Fork

Error signature: Every request appears twice in the backend, or the exporter deadlocks shortly after startup under Gunicorn. Root cause: The provider and its exporter connection were created in the master process and inherited by forked workers, so every worker starts from a copy of the same batch queue and gRPC channel, and a gRPC channel is not safe to reuse across a fork. Remediation: Build the provider inside a post-fork hook (Gunicorn's post_fork, or a per-worker startup callback) so each worker owns an independent exporter connection and batch queue.
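
A sketch of the post-fork remediation as a Gunicorn configuration file. `post_fork(server, worker)` is Gunicorn's server hook; `configure_tracing` is a hypothetical helper, and the exporter and resource are left to pick up their settings from the OTEL_* environment variables. The imports are deferred into the function so the master process never creates an exporter or background thread.

```python
# gunicorn.conf.py (sketch)


def configure_tracing():
    """Build the tracing pipeline inside the current worker process."""
    # Deferred imports: nothing tracing-related is constructed in the master.
    from opentelemetry import trace
    from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
    from opentelemetry.sdk.resources import Resource
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor

    # Resource.create({}) still merges OTEL_SERVICE_NAME and OTEL_RESOURCE_ATTRIBUTES.
    provider = TracerProvider(resource=Resource.create({}))
    # OTLPSpanExporter() with no arguments reads its endpoint from the environment.
    provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
    trace.set_tracer_provider(provider)


def post_fork(server, worker):
    """Gunicorn calls this in each worker right after it is forked."""
    configure_tracing()
    server.log.info("tracing initialised in worker pid %s", worker.pid)
```

If the application module also bootstraps tracing at import time, preloading the app in the master would reintroduce the problem, so keep the bootstrap in exactly one of the two places.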

Frequently Asked Questions

How do I handle SDK initialization in a multi-process worker environment?

Initialize the provider after the worker process forks, using a post-fork or per-worker startup hook. This gives each process its own exporter connections and batch buffers and avoids sharing file descriptors across the fork.

What is the performance impact of synchronous versus asynchronous exporters?

Synchronous export blocks the calling thread on network I/O for every span, adding latency to each request. The batch processor flushes on a background daemon thread, which removes export from the request path and is the only safe choice for production.

Can I mix auto-instrumentation with manual SDK setup?

Yes, but your manual provider must be set as the global provider before auto-instrumentation attaches, otherwise spans go to the no-op default. Disable overlapping framework instrumentations to avoid duplicate spans.

How do I configure fallback behavior when the collector is unreachable?

Set a bounded exporter timeout so retries cannot pile up, rely on the batch queue to absorb short outages, and let spans drop once the queue fills rather than blocking the application. Alert on sustained export failures from the collector's own health metrics.
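
The drop-rather-than-block policy can be illustrated with a toy bounded buffer. This is not the SDK's implementation, and which end of the queue the SDK discards from is not modeled here; `DroppingBuffer` is a hypothetical class that only shows the principle that enqueueing never waits and overflow is counted.

```python
import queue


class DroppingBuffer:
    """Toy bounded buffer: offer() never blocks, overflow is counted and dropped."""

    def __init__(self, max_queue_size: int):
        self._queue = queue.Queue(maxsize=max_queue_size)
        self.dropped = 0

    def offer(self, item) -> bool:
        """Try to enqueue; return False and count a drop if the buffer is full."""
        try:
            self._queue.put_nowait(item)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def drain(self, max_batch: int) -> list:
        """Remove up to max_batch items, oldest first."""
        batch = []
        while len(batch) < max_batch:
            try:
                batch.append(self._queue.get_nowait())
            except queue.Empty:
                break
        return batch


buffer = DroppingBuffer(max_queue_size=4)
accepted = [buffer.offer(f"span-{i}") for i in range(6)]
first_batch = buffer.drain(max_batch=3)

print(accepted.count(True), buffer.dropped)  # 4 2
print(first_batch)                           # ['span-0', 'span-1', 'span-2']
```

A drop counter like `dropped` is the kind of signal worth exporting as a metric, because it distinguishes a full queue from a healthy but idle one.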