Python Metrics and Instrumentation: A Production Guide for SREs

Metrics are the cheapest, highest-leverage observability signal for Python backends: a handful of counters and histograms tell you request rate, error rate, and latency distribution at a fraction of the storage cost of traces or logs. This guide is for backend engineers and SREs who need to instrument production Python services correctly, and it links the focused guides that go deeper: instrumenting with the prometheus_client library, recording metrics with the OpenTelemetry metrics SDK, choosing metric types and controlling cardinality, and deciding between OpenTelemetry and Prometheus for Python metrics. Metrics also close the loop with the other two signals: they pair with distributed tracing in Python through exemplars, and with structured logging fundamentals through shared trace_id correlation.

The two production metric paths: a Prometheus-scraped exposition endpoint (pull) and an OpenTelemetry SDK that pushes OTLP to a collector.

Key architectural principles:

Treat metrics as aggregates, not events: a metric is a pre-aggregated number per time series, never one data point per request.
Keep label sets small and bounded; cardinality, not volume, is what breaks a metrics backend.
Use histograms with deliberately chosen buckets so latency SLOs can be computed server-side across all instances.
Pick one transport model per service intentionally — pull (scrape) or push (OTLP) — and make multiprocess aggregation explicit when you run worker pools.

Foundational Architecture & Metric Standards

A metric is a numeric measurement identified by a name and a set of key/value labels, sampled over time. The combination of metric name plus a unique set of label values defines a single time series. This is the unit your backend stores, indexes, and queries, and it is the unit that determines cost. Understanding that one metric expands into many time series — one per label combination — is the single most important mental model for instrumentation.

There are two dominant collection models in the Python ecosystem. The Prometheus pull model has each process expose a plain-text /metrics endpoint; a Prometheus server scrapes that endpoint on a fixed interval (commonly every 15 seconds) and stores the result. The application is passive — it only maintains current values in an in-process registry. The OpenTelemetry push model runs a metrics SDK inside the process that periodically reads instrument values and exports them over OTLP to a collector or backend. The application is active — it owns the export cadence.

The pull model gives the monitoring system control over scrape timing, makes liveness obvious (a failed scrape is a signal), and needs no per-app egress configuration. The push model fits short-lived jobs, serverless functions, and environments where the app cannot be reached by a scraper, and it unifies metrics with the same OTLP pipeline used for traces and logs. The detailed trade-offs are covered in OpenTelemetry vs Prometheus for Python metrics.

Both models share the same core instrument families. A counter is monotonic: it only goes up (or resets to zero on restart), and it answers "how many" questions such as requests served, errors raised, or bytes written. A gauge can go up or down and captures a current value: queue depth, in-flight requests, memory in use. A histogram buckets observations of a value (usually latency or size) into ranges so quantiles can be computed later. A summary computes quantiles in-process at observation time; it is a Prometheus type, and the OpenTelemetry metrics API has no summary instrument. Choosing correctly between these is consequential enough that it has its own guide on picking counter, gauge, histogram, and summary.

The Prometheus exposition format is the wire contract for the pull model. It is line-oriented UTF-8 text with # HELP and # TYPE comments followed by metric_name{label="value"} number samples. Because it is just text over HTTP, anything that can serve a response can expose metrics, which is why the format became a de facto standard well beyond Prometheus itself.
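To make the wire contract concrete, here is a minimal sketch of a renderer for one counter family in the text format. It is illustrative only: `render_counter` and `escape_label_value` are hypothetical helper names, not part of prometheus_client, and a real service should let the client library produce this output.

```python
def escape_label_value(value: str) -> str:
    # The text format escapes backslash, double quote, and newline in label values.
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_counter(name: str, help_text: str, samples: list) -> str:
    """Render one counter family; `samples` is a list of (label_dict, value) pairs."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        label_str = ",".join(
            f'{key}="{escape_label_value(val)}"' for key, val in sorted(labels.items())
        )
        lines.append(f"{name}{{{label_str}}} {float(value)}")
    return "\n".join(lines) + "\n"


text = render_counter(
    "http_requests_total",
    "Total HTTP requests.",
    [({"method": "GET", "status": "200"}, 2), ({"method": "POST", "status": "500"}, 1)],
)
print(text, end="")
# HELP http_requests_total Total HTTP requests.
# TYPE http_requests_total counter
# http_requests_total{method="GET",status="200"} 2.0
# http_requests_total{method="POST",status="500"} 1.0
```

Each distinct label set becomes its own sample line, which is the same expansion from one metric into many time series described earlier.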

Instrumentation Strategy & SDK Configuration

The prometheus_client library is the canonical Python implementation of the pull model. You declare instruments once at module scope, mutate them inside request handlers, and expose the default registry over HTTP. Because instruments live in a global REGISTRY by default, declaring the same metric name twice raises a duplicate-timeseries error — a deliberate guard against accidental double registration. The companion guide on instrumenting with prometheus_client walks through registry management and the framework integrations, including adding Prometheus metrics to Flask and exposing custom application metrics.

The OpenTelemetry path centers on a MeterProvider configured with one or more metric readers. The PeriodicExportingMetricReader collects instrument values on an interval and hands them to an exporter such as OTLPMetricExporter. You acquire a Meter from the provider and create instruments — create_counter, create_histogram, create_up_down_counter, create_observable_gauge — from that meter. Observable (asynchronous) instruments take a callback that the reader invokes at collection time, which is the right pattern for values you sample rather than increment, like resident memory or pool size. The mechanics are detailed in the OpenTelemetry metrics SDK guide, with focused walkthroughs on exporting OTLP metrics to the collector and recording counters and histograms.

Whichever SDK you choose, resource attributes such as service.name and deployment.environment must be attached so that series from different deployments stay distinguishable. In Prometheus these typically arrive as scrape-target labels or relabel rules; in OpenTelemetry they are set on the Resource passed to the MeterProvider, identically to how the tracing guide sets resources on the TracerProvider. Sharing one resource definition across signals is what makes cross-signal correlation work.

Label & Cardinality Discipline

Labels are the most powerful and most dangerous feature of dimensional metrics. A label like method or status_code has a small, fixed set of values and produces a manageable number of series. A label like user_id, session_id, a raw URL path with embedded IDs, or a full exception message is unbounded: it grows without limit as traffic flows, and each new value allocates another time series that the backend must index and hold in memory. This is the number-one cause of Prometheus out-of-memory incidents.

The discipline is simple to state and easy to violate: every label must draw from a small, enumerable set known at design time. Normalize before labeling. Replace /orders/84321 with a route template /orders/{id}. Bucket a continuous quantity into ranges rather than labeling the raw value. Strip user-supplied strings entirely. The dedicated guide on controlling label cardinality in Prometheus covers route normalization, allow-lists, and how to find offending series before they cost you an outage.
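The normalization step can be sketched with the standard library alone. This is a rough illustration under stated assumptions: `normalize_route`, `normalize_method`, and the `{uuid}` placeholder are hypothetical names, and the regular expressions only cover numeric and UUID path segments. Where a framework exposes the matched route template, prefer that over pattern matching.

```python
import re

# Path segments that look like numeric IDs or UUIDs are replaced by placeholders.
_UUID_SEGMENT = re.compile(
    r"/[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}(?=/|$)"
)
_NUMERIC_SEGMENT = re.compile(r"/\d+(?=/|$)")

# Allow-list: anything outside this set collapses into one "OTHER" value.
ALLOWED_METHODS = frozenset({"GET", "POST", "PUT", "PATCH", "DELETE"})


def normalize_route(path: str) -> str:
    path = path.split("?", 1)[0]  # drop the query string entirely
    path = _UUID_SEGMENT.sub("/{uuid}", path)
    return _NUMERIC_SEGMENT.sub("/{id}", path)


def normalize_method(method: str) -> str:
    method = method.upper()
    return method if method in ALLOWED_METHODS else "OTHER"


print(normalize_route("/orders/84321"))                 # /orders/{id}
print(normalize_route("/orders/84321/items/7?page=2"))  # /orders/{id}/items/{id}
print(normalize_method("propfind"))                     # OTHER
```

The allow-list matters as much as the route template: a client can send arbitrary method strings, and without the fallback value each one would create a new series.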

A practical ceiling: estimate the cartesian product of all label values per metric before shipping it. A metric with method (5) × status (6) × endpoint (40) is 1,200 series — fine. The same metric with endpoint replaced by raw path is unbounded — a latent incident. When in doubt, drop the label; you can always add a dimension later, but you cannot cheaply reclaim the memory a bad one has already cost.
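The estimate is simple enough to script as a pre-merge check. The sketch below uses the label counts from the paragraph above; `estimate_series` and `SERIES_BUDGET` are hypothetical names, and the budget value is an arbitrary illustration rather than a recommended limit.

```python
from math import prod

SERIES_BUDGET = 10_000  # assumed per-metric budget, chosen only for illustration


def estimate_series(label_cardinalities: dict, series_per_combination: int = 1) -> int:
    """Worst-case series count: the cartesian product of all label value counts."""
    return prod(label_cardinalities.values()) * series_per_combination


labels = {"method": 5, "status": 6, "endpoint": 40}

counter_series = estimate_series(labels)
print(counter_series)  # 1200

# A histogram emits one series per bucket (including +Inf) plus _sum and _count.
histogram_series = estimate_series(labels, series_per_combination=12 + 2)
print(histogram_series)  # 16800
print(histogram_series > SERIES_BUDGET)  # True
```

The product is an upper bound, since not every combination occurs in practice, but budgeting against the worst case is what keeps a metric safe when traffic patterns change.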

Histogram Bucket Design for Latency SLOs

A histogram is only as useful as its bucket boundaries. prometheus_client ships default buckets tuned for sub-second web latency (.005, .01, .025, .05, .075, .1, .25, .5, .75, 1, 2.5, 5, 7.5, 10 seconds), but defaults rarely match a specific SLO. If your SLO is "99% of requests under 300 ms," you need a bucket boundary at or very near 0.3, because Prometheus computes quantiles by linear interpolation within a bucket: when the nearest boundaries are far from your target, the estimate can be wrong by up to the width of the bucket that contains it.
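The effect of boundary placement is easier to see with numbers. The function below is a simplified sketch of the linear interpolation idea, not Prometheus's actual implementation, and the bucket counts are invented for illustration: 100 requests, of which 90 finish within 250 ms and the remaining 10 finish between 250 and 300 ms.

```python
INF = float("inf")


def histogram_quantile(q: float, buckets: list) -> float:
    """Estimate a quantile from (upper_bound, cumulative_count) pairs sorted by bound.

    The last pair must be the +Inf bucket. The lowest bucket is assumed to start at 0.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == INF:
                return prev_bound  # cannot interpolate into an open-ended bucket
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Same traffic, two bucket layouts.
coarse = [(0.25, 90), (0.5, 100), (INF, 100)]
aligned = [(0.25, 90), (0.3, 100), (0.5, 100), (INF, 100)]

print(f"{histogram_quantile(0.99, coarse):.3f}")   # 0.475
print(f"{histogram_quantile(0.99, aligned):.3f}")  # 0.295
```

With the coarse layout the estimated p99 lands at 475 ms and the 300 ms SLO looks violated; with a boundary at 0.3 the estimate is 295 ms, which matches what actually happened.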

Design buckets around the thresholds you actually report on. Place explicit boundaries at your SLO targets (for example 0.1, 0.3, 0.5, 1, 2, 5) and add resolution where your traffic concentrates. Logarithmically spaced buckets give good coverage across orders of magnitude when latency varies widely. Remember that every bucket is a separate time series multiplied by every label combination, so a 12-bucket histogram with 1,200 label combinations is over 14,000 series for one metric — bucket count is a cardinality lever too.
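One way to combine both ideas is to generate a logarithmic ladder and then merge in the exact SLO thresholds. The helper below is a sketch; `slo_buckets` and its default parameters are assumptions made for the example, not an API of any metrics library.

```python
def slo_buckets(slo_targets, start=0.025, factor=2.0, count=7) -> tuple:
    """Log-spaced boundaries merged with explicit SLO thresholds, sorted and de-duplicated."""
    log_spaced = [start * factor**i for i in range(count)]
    merged = {round(bound, 6) for bound in [*log_spaced, *slo_targets]}
    return tuple(sorted(merged))


buckets = slo_buckets([0.3, 1.0])
print(buckets)  # (0.025, 0.05, 0.1, 0.2, 0.3, 0.4, 0.8, 1.0, 1.6)

# Series per label combination: finite buckets, the +Inf bucket, _sum, and _count.
series_per_combination = len(buckets) + 1 + 2
print(series_per_combination)  # 12
```

The resulting tuple can be passed as the buckets argument of a histogram. Printing the per-combination series count next to it keeps the cardinality cost of each extra boundary visible.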

The reason histograms beat summaries for SLOs is aggregation. Histogram buckets are counters, and counters from many instances can be summed by the backend before computing a quantile, giving you a correct fleet-wide p99. Summary quantiles are computed per process and cannot be averaged into a meaningful global quantile. For any latency metric you intend to alert on across a fleet, use a histogram.
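A small worked example shows why summing works and averaging does not. The numbers are invented: instance A serves 990 fast requests, instance B serves 10 requests of which only 2 meet the 300 ms threshold. `merge_buckets` and `fraction_within` are hypothetical helpers standing in for what the backend does with cumulative bucket counters.

```python
INF = float("inf")

# Cumulative bucket counts per instance: {upper_bound: count}.
instance_a = {0.1: 900, 0.3: 990, INF: 990}
instance_b = {0.1: 0, 0.3: 2, INF: 10}


def merge_buckets(*instances: dict) -> dict:
    """Sum bucket counters across instances, boundary by boundary."""
    merged = {}
    for buckets in instances:
        for upper_bound, count in buckets.items():
            merged[upper_bound] = merged.get(upper_bound, 0) + count
    return merged


def fraction_within(buckets: dict, threshold: float) -> float:
    """Share of observations at or below a bucket boundary."""
    return buckets[threshold] / buckets[INF]


fleet = merge_buckets(instance_a, instance_b)
fleet_ratio = fraction_within(fleet, 0.3)
naive_ratio = (fraction_within(instance_a, 0.3) + fraction_within(instance_b, 0.3)) / 2

print(f"{fleet_ratio:.3f}")  # 0.992
print(f"{naive_ratio:.3f}")  # 0.600
```

Summed buckets weight each instance by its traffic, so the fleet meets a 99% target at 99.2%. Averaging per-instance results treats the nearly idle instance B as half the fleet and reports 60%, which is the same distortion that makes averaged summary quantiles meaningless.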

Multiprocess Collection Under Gunicorn and Uvicorn

The pull model has a sharp edge with WSGI/ASGI worker pools. Under gunicorn (or uvicorn with multiple workers), each worker is a separate OS process with its own in-memory registry. A single scrape of one shared port reaches exactly one randomly chosen worker, so counters appear to bounce around and reset — you are sampling one worker's view, not the service total.

prometheus_client solves this with multiprocess mode. You set the PROMETHEUS_MULTIPROC_DIR environment variable to a writable directory; each worker writes its metric state to memory-mapped files there. The scrape endpoint then builds a fresh CollectorRegistry, attaches a MultiProcessCollector pointed at that directory, and aggregates across all workers at scrape time. Counters and histograms sum correctly; gauges support modes like livesum, max, and liveall to control how per-process values combine.

The OpenTelemetry push model sidesteps this differently: each worker process runs its own MeterProvider and exports independently over OTLP, tagging exports with a process or instance identifier. The backend aggregates across instances. There is no shared-file dance, but you must ensure the SDK is initialized after the worker forks, because exporter background threads and HTTP connections do not survive fork(). Initialize in a gunicorn post_fork hook or an ASGI lifespan startup, never at import time before the master forks.
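A post-fork initialization might look like the following gunicorn configuration sketch. It assumes the opentelemetry-sdk and opentelemetry-exporter-otlp-proto-grpc packages are installed, that the exporter endpoint comes from the standard OTEL_EXPORTER_OTLP_ENDPOINT environment variable, and that "order-service" and the worker-PID instance ID are acceptable placeholder values. Imports are deferred into the hook so nothing from the SDK is created in the master process.

```python
# gunicorn.conf.py (load with: gunicorn -c gunicorn.conf.py app:app)
import os


def post_fork(server, worker):
    """Gunicorn server hook: runs inside each worker process right after fork()."""
    from opentelemetry import metrics
    from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
    from opentelemetry.sdk.metrics import MeterProvider
    from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
    from opentelemetry.sdk.resources import Resource

    resource = Resource.create({
        "service.name": os.getenv("OTEL_SERVICE_NAME", "order-service"),
        # Distinguishes workers so the backend can aggregate across them.
        "service.instance.id": f"worker-{worker.pid}",
    })
    # With no arguments the exporter reads its endpoint from the environment.
    reader = PeriodicExportingMetricReader(
        OTLPMetricExporter(),
        export_interval_millis=10_000,
    )
    metrics.set_meter_provider(MeterProvider(resource=resource, metric_readers=[reader]))
```

Each worker then owns its own export thread and connection, created after the fork, which is the property the paragraph above requires.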

Scraping and the /metrics Endpoint

For the pull model, the contract is a single HTTP GET /metrics returning exposition text with a Content-Type of text/plain; version=0.0.4. prometheus_client provides start_http_server(port) for a standalone thread, make_wsgi_app() to mount inside an existing WSGI app, and generate_latest(registry) to render the body yourself for custom routes. Mounting inside your app reuses its port and TLS; a separate port isolates metrics from request traffic and lets you firewall it independently.

Keep the endpoint cheap. It should read current registry state and serialize — never trigger database queries or recompute expensive values inline. For values that are expensive to sample (queue depth from an external system, for instance), refresh them on a background timer or via an observable instrument and let the endpoint serve the cached number. A /metrics handler that blocks on I/O will time out scrapes under load and create gaps exactly when you most need data.
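The background-timer pattern can be sketched with a small holder class that a metrics callback reads from. Everything here is illustrative: `CachedSample` and `fetch_queue_depth` are hypothetical names, and the 50 ms interval is only short so the example finishes quickly.

```python
import threading
import time


class CachedSample:
    """Refreshes an expensive measurement on a timer and serves the last good value."""

    def __init__(self, sample_fn, interval_seconds: float):
        self._sample_fn = sample_fn
        self._interval = interval_seconds
        self._value = float("nan")  # reported until the first successful sample
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self) -> None:
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()
        self._thread.join()

    def _run(self) -> None:
        while not self._stop.is_set():
            try:
                value = float(self._sample_fn())
            except Exception:
                value = None  # keep serving the previous value on failure
            if value is not None:
                with self._lock:
                    self._value = value
            self._stop.wait(self._interval)

    def get(self) -> float:
        with self._lock:
            return self._value


def fetch_queue_depth() -> int:
    # Stand-in for a slow call to an external system.
    return 42


queue_depth = CachedSample(fetch_queue_depth, interval_seconds=0.05)
queue_depth.start()
time.sleep(0.2)
print(queue_depth.get())  # 42.0
queue_depth.stop()
```

With prometheus_client, a gauge can read the cached number at scrape time through Gauge.set_function(queue_depth.get), so the scrape itself never waits on the external system.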

Data Volume and Cost Control

Metrics cost is dominated by active series count, not request rate, because storage and memory scale with the number of distinct time series a backend must hold in its head block. The levers are therefore all about series count: fewer labels, bounded label values, fewer histogram buckets, and dropping series you never query. Audit your metrics periodically and delete instruments and labels nothing alerts or dashboards on — unused series are pure cost.

On the pull side, control cost with scrape interval and metric_relabel_configs that drop noisy series at ingestion. On the push side, the OpenTelemetry SDK supports views that rename instruments, drop attributes, or change histogram bucket boundaries before export, plus delta vs cumulative temporality choices that affect backend storage. Exemplars add a small, bounded cost (a trace_id attached to a sampled bucket) and are worth it for the trace correlation they unlock. The same trace_id injected into your structured logs lets a latency spike jump straight to the offending logs and traces.

Production Code Examples

Prometheus Instrumentation with a Latency Histogram

Declares instruments once at module scope, records request count and latency with bounded labels, and exposes them on a dedicated metrics port.

# pip install "prometheus-client>=0.20.0,<1.0.0"
import time
from prometheus_client import Counter, Histogram, start_http_server

# Bounded labels only: method, a normalized route template, and the status code.
REQUESTS = Counter(
    "http_requests_total",
    "Total HTTP requests.",
    ["method", "route", "status"],
)
LATENCY = Histogram(
    "http_request_duration_seconds",
    "Request latency in seconds.",
    ["method", "route"],
    # Buckets placed around a 300ms p99 SLO target.
    buckets=(0.05, 0.1, 0.2, 0.3, 0.5, 1.0, 2.5),
)


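def handle(method: str, route: str) -> int:
    start = time.perf_counter()
    status = "500"  # assume failure until the handler completes
    try:
        time.sleep(0.12)  # simulated work
        status = "200"
        return 200
    finally:
        # Record latency and count even when the handler raises.
        LATENCY.labels(method, route).observe(time.perf_counter() - start)
        REQUESTS.labels(method, route, status).inc()


if __name__ == "__main__":
    start_http_server(9100)  # serves GET /metrics on :9100 from a daemon thread
    handle("GET", "/orders/{id}")
    handle("GET", "/orders/{id}")
    # Keep the process alive so the endpoint can be scraped; Ctrl+C to exit.
    while True:
        time.sleep(1)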

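Expected Output: A scrape of http://localhost:9100/metrics returns exposition text that includes the following lines (abridged; the default process and Python runtime collectors and any _created samples are omitted):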

# HELP http_requests_total Total HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="GET",route="/orders/{id}",status="200"} 2.0
# HELP http_request_duration_seconds Request latency in seconds.
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.1",method="GET",route="/orders/{id}"} 0.0
http_request_duration_seconds_bucket{le="0.2",method="GET",route="/orders/{id}"} 2.0
http_request_duration_seconds_bucket{le="0.3",method="GET",route="/orders/{id}"} 2.0
http_request_duration_seconds_bucket{le="+Inf",method="GET",route="/orders/{id}"} 2.0
http_request_duration_seconds_count{method="GET",route="/orders/{id}"} 2.0
http_request_duration_seconds_sum{method="GET",route="/orders/{id}"} 0.24...

OpenTelemetry Metrics SDK with OTLP Export

Builds a MeterProvider with a periodic reader and OTLP exporter, then records a counter and a histogram from a meter.

# pip install "opentelemetry-sdk>=1.30.0,<2.0.0" \
#   "opentelemetry-exporter-otlp-proto-grpc>=1.30.0,<2.0.0"
import os
import time
from opentelemetry import metrics
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter

resource = Resource.create({
    "service.name": os.getenv("OTEL_SERVICE_NAME", "order-service"),
    "deployment.environment": os.getenv("DEPLOY_ENV", "production"),
})

# Reader collects every 10s and pushes over OTLP. Initialize AFTER any fork.
reader = PeriodicExportingMetricReader(
    # The http:// scheme selects a plaintext gRPC channel; a bare host:port defaults to TLS.
    OTLPMetricExporter(endpoint=os.getenv("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4317")),
    export_interval_millis=10_000,
)
metrics.set_meter_provider(MeterProvider(resource=resource, metric_readers=[reader]))

meter = metrics.get_meter("order-service")
requests = meter.create_counter("http.server.requests", unit="1")
latency = meter.create_histogram("http.server.duration", unit="s")


def handle(route: str) -> None:
    start = time.perf_counter()
    time.sleep(0.12)
    attrs = {"http.route": route, "http.status_code": 200}
    latency.record(time.perf_counter() - start, attrs)
    requests.add(1, attrs)


handle("/orders/{id}")

Expected Output: No console output; every 10 seconds the reader exports an OTLP payload. A debug collector logs a representative metric:

{
  "name": "http.server.duration",
  "unit": "s",
  "histogram": {
    "dataPoints": [
      {
        "attributes": {"http.route": "/orders/{id}", "http.status_code": 200},
        "count": 1,
        "sum": 0.121,
        "bucketCounts": [0, 0, 1, 0]
      }
    ],
    "aggregationTemporality": "CUMULATIVE"
  }
}

Multiprocess Aggregation Under Gunicorn

Aggregates per-worker metric files into one registry at scrape time so a multi-worker deployment reports correct service-wide totals.

# pip install "prometheus-client>=0.20.0,<1.0.0" "gunicorn>=21.2.0,<23.0.0"
# Run with: PROMETHEUS_MULTIPROC_DIR=/tmp/prom gunicorn -w 4 app:app
# The directory must exist and should be emptied before each start.
from prometheus_client import CollectorRegistry, generate_latest, CONTENT_TYPE_LATEST
from prometheus_client import multiprocess


def metrics_app(environ, start_response):
    # Build a fresh registry per scrape and aggregate all worker files.
    registry = CollectorRegistry()
    multiprocess.MultiProcessCollector(registry)  # reads PROMETHEUS_MULTIPROC_DIR
    data = generate_latest(registry)
    start_response("200 OK", [("Content-Type", CONTENT_TYPE_LATEST)])
    return [data]


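# Gunicorn server hook: define it in the gunicorn config file (for example
# gunicorn.conf.py, loaded with -c), not in the application module.
def child_exit(server, worker):
    # Required so a dead worker's live gauge files are cleaned up.
    multiprocess.mark_process_dead(worker.pid)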

Expected Output: A scrape returns counters summed across all four workers rather than one worker's partial count, e.g. http_requests_total{...} 812.0 instead of a number that resets each scrape.

Common Mistakes

Putting unbounded values in labels: Labeling with user_id, raw URLs, request IDs, or exception messages creates an unbounded number of time series and is the most common cause of Prometheus running out of memory. Normalize to bounded templates before labeling.
Running multiple gunicorn/uvicorn workers without multiprocess mode: Without PROMETHEUS_MULTIPROC_DIR and MultiProcessCollector, each scrape hits one random worker, so counters look like they reset and totals are silently wrong.
Using a summary when you need fleet-wide quantiles: Summary quantiles are computed per process and cannot be aggregated, so your "p99" is really one instance's p99. Use a histogram so the backend can compute a correct global quantile.
Default histogram buckets that miss the SLO threshold: Quantiles are interpolated within a bucket, so if no boundary sits near your SLO target the p99 estimate can be off by up to the bucket width. Place explicit boundaries at the thresholds you alert on.
Initializing the OTel MeterProvider before the worker forks: Exporter threads and connections do not survive fork(), so a provider built at import time silently stops exporting in forked workers. Initialize in a post_fork hook or lifespan startup.
Doing expensive work in the /metrics handler: Querying a database or recomputing values inline blocks scrapes and creates data gaps under load. Sample expensive values on a background timer or observable instrument and serve the cached number.

Frequently Asked Questions

Should I use prometheus_client or the OpenTelemetry metrics SDK for a new Python service?

If your platform already runs Prometheus and you want the simplest path, prometheus_client with a scraped /metrics endpoint is the least friction. If you want a single vendor-neutral pipeline shared with traces and logs, use the OpenTelemetry metrics SDK with OTLP export. The two can coexist because OpenTelemetry provides a Prometheus exporter (the opentelemetry-exporter-prometheus package) that serves SDK metrics on a scrape endpoint.

Why are my Prometheus metrics empty or wrong under gunicorn with multiple workers?

Each gunicorn worker is a separate process with its own in-memory registry, so a scrape only ever hits one random worker. Set PROMETHEUS_MULTIPROC_DIR and use prometheus_client.multiprocess.MultiProcessCollector so counts and histograms are aggregated across all workers.

How many label combinations is too many for a single metric?

Each unique combination of label values creates a separate time series stored independently. A few thousand series per metric is usually fine; tens or hundreds of thousands from unbounded labels like user_id, raw URLs, or request IDs will overwhelm Prometheus memory. Keep label values to a small, bounded set.

What is the difference between a histogram and a summary in prometheus_client?

A histogram counts observations into predefined buckets and lets Prometheus compute quantiles across many instances server-side, which is what you want for SLOs. A summary computes quantiles locally in the process and cannot be aggregated across instances. Prefer histograms for latency.

How do I correlate a metric spike with a specific trace?

Use exemplars: prometheus_client and the OpenTelemetry SDK can attach a trace_id to a sampled histogram observation. Your backend then links a point on the latency graph directly to the trace that produced it, joining the metrics and tracing signals.

