Metric Types and Cardinality Control in Python
Choosing the right metric type and bounding label cardinality are the two decisions that determine whether a Python service produces cheap, queryable telemetry or an unmaintainable time-series explosion. This guide covers counter, gauge, histogram, and summary semantics, when each applies, how labels multiply series count, the high-cardinality anti-patterns that destroy Prometheus, histogram bucket design for latency SLOs, the quantile trade-off between summaries and histograms, and recording rules for query-time aggregation. It is part of the Python Metrics and Instrumentation guide. For the library mechanics referenced throughout, see Prometheus client instrumentation and the OpenTelemetry metrics SDK guides, and for two focused deep dives see controlling label cardinality in Prometheus and choosing between counter, gauge, histogram, and summary.
Prerequisites
Install pinned client libraries. The examples use the official Prometheus client and the OpenTelemetry metrics SDK.
pip install "prometheus-client>=0.20.0,<1.0.0"
pip install "opentelemetry-sdk>=1.30.0,<2.0.0" \
"opentelemetry-exporter-otlp-proto-grpc>=1.30.0,<2.0.0"
A Prometheus server (>=2.50,<3.0) scraping the exposition endpoint is assumed for the recording-rule and relabeling material. Python 3.10+ is assumed for the type annotations used in the snippets.
Concept and Architecture
A metric in the Prometheus data model is a named numeric measurement plus an optional set of key-value labels. The unit of storage is the time series: one append-only stream of timestamped samples, identified by the metric name and the exact set of label values. The cost of a metric on the backend is therefore not the number of metric names you define but the number of distinct label combinations those metrics emit. This is the single fact that drives every design decision below.
Prometheus distinguishes four metric types. A Counter is monotonic: it only increases and resets to zero when the process restarts, which is why queries always wrap it in rate() or increase(). A Gauge is a free-floating snapshot that rises and falls. A Histogram observes values into a fixed set of cumulative buckets and exposes _bucket, _sum, and _count series so that quantiles can be computed at query time. A Summary computes configurable quantiles inside the client process and exposes them directly alongside _sum and _count. The semantic differences and a full decision table live in choosing between counter, gauge, histogram, and summary.
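To make the reset behaviour concrete, here is a minimal sketch of how a reset-aware increase can be computed from raw counter samples. The function name `reset_aware_increase` and the sample values are illustrative assumptions, and the sketch omits the window extrapolation that the real `increase()` performs on the server.

```python
from typing import Sequence


def reset_aware_increase(samples: Sequence[float]) -> float:
    """Total increase across samples, treating any drop as a counter reset."""
    total = 0.0
    for prev, curr in zip(samples, samples[1:]):
        if curr >= prev:
            total += curr - prev
        else:
            # The counter restarted from zero, so all of curr is new growth.
            total += curr
    return total


# Hypothetical raw counter values scraped before and after a process restart.
scraped = [100.0, 140.0, 180.0, 5.0, 25.0]
print(reset_aware_increase(scraped))  # 105.0
```

Reading the raw value would show a drop from 180 to 5 at the restart; the reset-aware view reports 105 events instead.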
The reason histograms dominate latency monitoring is aggregatability. Because every histogram exports raw bucket counts, the Prometheus server can sum buckets across all replicas of a service and then apply histogram_quantile() to the merged result. Summary quantiles are computed per process and cannot be averaged or summed without statistical nonsense, so they describe one instance only. For service-level objectives spanning a fleet, that distinction is decisive.
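The aggregation argument can be sketched in plain Python. The example below sums cumulative bucket counts from two hypothetical replicas and then interpolates linearly inside the bucket that contains the target rank, which is the general approach `histogram_quantile()` takes. The helper names, the bucket layout, and the counts are assumptions for illustration, not the server's implementation.

```python
import math
from typing import Dict, List

# Cumulative bucket counts keyed by upper bound (le), one dict per replica.
Buckets = Dict[float, float]


def merge(replicas: List[Buckets]) -> Buckets:
    """Sum cumulative counts per upper bound across replicas."""
    merged: Buckets = {}
    for buckets in replicas:
        for le, count in buckets.items():
            merged[le] = merged.get(le, 0.0) + count
    return merged


def quantile(q: float, buckets: Buckets) -> float:
    """Linear interpolation inside the bucket holding rank q * total, for 0 < q <= 1."""
    bounds = sorted(buckets)
    total = buckets[bounds[-1]]  # the +Inf bucket holds the total count
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for le in bounds:
        count = buckets[le]
        if count >= rank:
            if math.isinf(le):
                return lower_bound  # cannot interpolate into the +Inf bucket
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (le - lower_bound) * fraction
        lower_bound, lower_count = le, count
    return lower_bound


replica_a = {0.1: 50, 0.25: 90, 0.5: 100, math.inf: 100}
replica_b = {0.1: 10, 0.25: 50, 0.5: 90, math.inf: 100}
fleet = merge([replica_a, replica_b])
print(round(quantile(0.9, fleet), 3))  # 0.45
```

Because the inputs are plain counts, merging is simple addition. A summary exposes only the finished quantile per process, so there is nothing comparable to add together.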
The four types also differ in what they cost the process that emits them. A Counter or Gauge is a single atomic number, so observing it is a lock-free increment in the common case and the exposition footprint is one line per series. A Histogram allocates one counter per bucket plus a sum and a count, so its observe path is a bucket lookup and an increment, still cheap, but its exposition footprint is the bucket count plus two extra series per label combination. A Summary is the heaviest: it maintains a streaming quantile estimator per series, which costs more CPU per observation and more memory to hold the sketch. These costs scale with cardinality, so a histogram with many buckets attached to a high-cardinality label is doubly expensive — once for the buckets and once for the label fan-out. The right type is therefore a joint decision about semantics and cost, never semantics alone.
Two more model details matter in production. First, the OpenTelemetry data model maps onto these same shapes — a Counter maps to a monotonic Sum, a Gauge to a Gauge, and a Histogram to an explicit-bucket Histogram — but OpenTelemetry adds the notion of temporality (cumulative versus delta) that the Prometheus exposition format does not expose directly; the OpenTelemetry metrics SDK guide covers that mapping. Second, none of these types tolerate an unbounded label, because every type multiplies its base series count by the cardinality of its labels, and for histograms that multiplier is applied to every bucket.
Step-by-Step Implementation
Step 1 — Define a counter for event totals. Counters answer "how many" over time. Keep label sets small and bounded; the method and status fields below are bounded enumerations, not free text.
from prometheus_client import Counter
# method and status are bounded enumerations -> safe, low cardinality
REQUESTS = Counter(
    "http_requests_total",
    "Total HTTP requests processed",
    labelnames=("method", "status"),
)
REQUESTS.labels(method="GET", status="200").inc()
REQUESTS.labels(method="POST", status="500").inc()
Step 2 — Define a gauge for current state. Gauges represent a value at a moment: queue depth, in-flight requests, connection pool size. Use inc, dec, or set.
from prometheus_client import Gauge
IN_FLIGHT = Gauge(
    "http_requests_in_flight",
    "Requests currently being served",
)
IN_FLIGHT.inc() # request started
# ... handle request ...
IN_FLIGHT.dec() # request finished
Step 3 — Define a histogram with SLO-aligned buckets. The default buckets are general-purpose. For a latency SLO you must place bucket boundaries on the thresholds you actually report against, because histogram_quantile() interpolates linearly within a bucket and is only as precise as the boundary spacing.
from prometheus_client import Histogram
# Buckets chosen around a 250ms p99 SLO target, in seconds.
REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds",
    "Request latency in seconds",
    labelnames=("route",),
    buckets=(0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0),
)
with REQUEST_LATENCY.labels(route="/checkout").time():
    handle_checkout()
Step 4 — Use a summary only for single-instance, client-side quantiles. A summary is appropriate when you genuinely need a quantile from one process and will never aggregate it, and when you can tolerate the higher CPU cost of streaming quantile estimation.
from prometheus_client import Summary
# The Python client's Summary exports only _count and _sum (no quantile
# series); the values describe THIS replica only.
GC_PAUSE = Summary(
    "gc_pause_seconds",
    "Garbage collection pause duration",
)
with GC_PAUSE.time():
    run_gc_cycle()
Step 5 — Compute query-time quantiles from the histogram. With buckets in place, the server derives the p99 across all replicas. This is the query a latency SLO panel runs.
histogram_quantile(
  0.99,
  sum by (le, route) (rate(http_request_duration_seconds_bucket[5m]))
)
Configuration Reference
| Concern | Type / setting | Default | Recommended |
|---|---|---|---|
| Event totals | Counter | n/a | Always pair with rate() in queries |
| Current snapshot | Gauge | n/a | Use for depths, pools, in-flight counts |
| Distribution, aggregatable | Histogram buckets | (.005 … 10.0) | Override to straddle SLO thresholds |
| Distribution, single instance | Summary quantiles | none | Avoid for fleet SLOs |
| Label cardinality | labelnames | none | Bounded enumerations only |
| Multiprocess exposition | PROMETHEUS_MULTIPROC_DIR | unset | Set for gunicorn/uwsgi workers |
| OTel histogram boundaries | explicit bucket view | SDK default | Set via View + ExplicitBucketHistogramAggregation |
| Server query cost | recording rule interval | none | Precompute heavy aggregations |
For the OpenTelemetry equivalents of these instruments and exporter wiring, see recording counters and histograms with OpenTelemetry.
Async and Concurrency Considerations
The Prometheus Python client metric objects are process-global and thread-safe: inc, dec, set, and observe guard each value with an internal lock, so multiple threads or coroutines can update the same instrument without additional locking in application code. That safety does not extend across processes. Under a multi-process server such as gunicorn or uwsgi, each worker holds its own copy of every counter, and a default start_http_server would report only the worker that happened to serve the scrape. The fix is multiprocess mode: set PROMETHEUS_MULTIPROC_DIR to a writable directory and have each worker write its samples to memory-mapped files that a single collector aggregates at scrape time.
from prometheus_client import (
    CONTENT_TYPE_LATEST,
    CollectorRegistry,
    generate_latest,
    multiprocess,
)
# Each gunicorn worker writes to PROMETHEUS_MULTIPROC_DIR; a registry
# backed by MultiProcessCollector merges them for the scrape response.
def metrics_app(environ, start_response):
    registry = CollectorRegistry()
    multiprocess.MultiProcessCollector(registry)
    data = generate_latest(registry)
    # Use the client's own content type so scrapers parse the format correctly.
    start_response("200 OK", [("Content-Type", CONTENT_TYPE_LATEST)])
    return [data]
Multiprocess mode imposes a cardinality discipline of its own. Gauges should declare a multiprocess mode explicitly (for example livesum, liveall, min, or max) because there is no single live value across workers; the default mode, all, exports a separate series per process. Each worker's distinct label children are stored separately, so the on-disk file contents grow with the union of label combinations across all workers. High-cardinality labels therefore hurt twice under multiprocess: once in stored series and once in mmap file size and aggregation time on every scrape.
For asyncio services, observation is safe from any coroutine, but be deliberate about what the timer spans. The with histogram.time(): context manager measures wall-clock time, so inside a coroutine it includes every interval the task spends suspended at an await. That is fine around one narrowly scoped awaited block, but wrapping the whole handler hides where the time was actually spent. Take time.perf_counter() deltas around the specific awaited work you want to attribute. Record one observation per logical operation; emitting an observation inside a tight inner loop multiplies sample volume without adding signal.
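A minimal sketch of the perf_counter pattern in a coroutine follows. The `observe` callback stands in for a histogram's observe method, and `asyncio.sleep` stands in for an awaited downstream call; both are placeholders chosen so the example runs with the standard library alone.

```python
import asyncio
import time
from typing import Callable, List


async def timed_await(observe: Callable[[float], None]) -> None:
    """Time only the awaited dependency call, not the whole handler."""
    start = time.perf_counter()
    try:
        await asyncio.sleep(0.01)  # stand-in for an awaited downstream call
    finally:
        # One observation per logical operation, recorded even on failure.
        observe(time.perf_counter() - start)


observations: List[float] = []  # stand-in for Histogram.observe
asyncio.run(timed_await(observations.append))
print(len(observations))  # 1
```

In a real service the callback would be something like the observe method of a labelled histogram child, passed in or referenced directly.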
Label Cardinality and Why It Explodes
The number of time series a metric can produce is bounded by the product of the distinct value counts of all its labels. A metric with a method label (about 5 values) and a status label (about 6 values) tops out near 30 series, which is trivial. Add a customer_id label drawn from a hundred thousand customers and the same metric can now produce up to 30 × 100,000 = 3 million series, each one an in-memory index entry on the Prometheus server. Series count, not sample rate, is what exhausts Prometheus memory.
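The multiplication is easy to check with a few lines of Python. The helper name `max_series` and the cardinalities are taken from the example figures above and are illustrative only; real workloads rarely hit every combination, so treat the result as an upper bound.

```python
from math import prod
from typing import Mapping


def max_series(label_cardinalities: Mapping[str, int]) -> int:
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())


print(max_series({"method": 5, "status": 6}))  # 30
print(max_series({"method": 5, "status": 6, "customer_id": 100_000}))  # 3000000
```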
Three values are almost always cardinality bombs and must never become labels: user or customer IDs, request or trace IDs, and any free-form string such as a full URL path with embedded IDs, an email address, or an error message. Each of these is effectively unbounded, so every new request mints a new series that is never reclaimed. The same caution applies to span attributes, as covered in span lifecycle and attributes; high-cardinality identifiers belong in traces and structured logs, where storage is per-event rather than per-series.
Bound every label to a small, predictable set. Replace raw URL paths with the matched route template (/users/{id}, not /users/8123). Map open-ended error strings to a closed enum of error classes. If a value can grow with traffic or with your customer base, it does not belong in a label. The mechanics of detecting and trimming offending labels server-side — including metric_relabel_configs and dropping or aggregating labels with relabeling — are covered in controlling label cardinality in Prometheus.
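As a rough sketch of both bounding techniques, the helpers below collapse ID-like path segments into a template and map exceptions onto a closed set of error classes. The regular expression, the `ErrorClass` members, and the function names are hypothetical; where a web framework exposes the matched route template, prefer that over pattern matching on the raw path.

```python
import re
from enum import Enum

# Assumption: numeric IDs and UUIDs are the only dynamic path segments.
_ID_SEGMENT = re.compile(
    r"/(\d+|[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12})(?=/|$)"
)


def route_label(path: str) -> str:
    """Replace ID-like segments so the label value set stays bounded."""
    return _ID_SEGMENT.sub("/{id}", path)


class ErrorClass(str, Enum):
    TIMEOUT = "timeout"
    CONNECTION = "connection"
    VALIDATION = "validation"
    OTHER = "other"


def error_label(exc: BaseException) -> str:
    """Map any exception to one of a fixed set of label values."""
    if isinstance(exc, TimeoutError):
        return ErrorClass.TIMEOUT.value
    if isinstance(exc, ConnectionError):
        return ErrorClass.CONNECTION.value
    if isinstance(exc, ValueError):
        return ErrorClass.VALIDATION.value
    return ErrorClass.OTHER.value


print(route_label("/users/8123/orders/42"))  # /users/{id}/orders/{id}
print(error_label(ConnectionResetError("peer reset")))  # connection
```

The catch-all OTHER member is what keeps the set closed: an unexpected exception type lands in an existing series instead of creating a new one.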
Histogram Bucket Design for Latency SLOs
Histogram accuracy is entirely a function of bucket placement. histogram_quantile() assumes values are uniformly distributed within each bucket and interpolates linearly between boundaries. If your SLO is "99% of checkout requests under 250ms" but your nearest bucket boundaries are 100ms and 500ms, the computed p99 can be off by hundreds of milliseconds because the estimator has no resolution between those edges.
Place a boundary exactly on each SLO threshold, then add a few boundaries on either side to capture the shape of the distribution. Spacing should be roughly geometric across the operating range and tight near the threshold you report against. Remember the cost: every bucket is an extra time series, and that count is multiplied by every other label on the histogram. A histogram with 12 buckets and a route label of 20 values produces 240 _bucket series plus 20 _sum and 20 _count series, 280 in total. Keep bucket counts modest and route cardinality bounded together.
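The footprint arithmetic can be sketched as follows. The helper `histogram_series` is hypothetical, and it assumes the 12 buckets in the figure above include the implicit +Inf bucket that the Python client appends to the configured boundaries, so 11 finite boundaries are passed in.

```python
from math import prod
from typing import Dict, Mapping, Sequence


def histogram_series(
    finite_buckets: Sequence[float], label_cardinalities: Mapping[str, int]
) -> Dict[str, int]:
    """Series exported by one histogram, split by suffix."""
    combos = prod(label_cardinalities.values())
    bucket_series = (len(finite_buckets) + 1) * combos  # +1 for the +Inf bucket
    return {
        "bucket": bucket_series,
        "sum": combos,
        "count": combos,
        "total": bucket_series + 2 * combos,
    }


boundaries = (0.01, 0.025, 0.05, 0.1, 0.2, 0.25, 0.3, 0.5, 1.0, 2.5, 5.0)
print(histogram_series(boundaries, {"route": 20}))
# {'bucket': 240, 'sum': 20, 'count': 20, 'total': 280}
```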
A useful starting layout for HTTP latency is a roughly geometric ladder from ten milliseconds to a few seconds with extra resolution clustered around the SLO threshold. If the SLO is a 250ms p99, boundaries at 100ms, 200ms, 250ms, 300ms, and 500ms give the estimator three nearby edges to interpolate against, which keeps the computed p99 within a small fraction of the true value. Boundaries far above the threshold still matter for catching tail blowups, but they need not be dense. Resist the temptation to add buckets everywhere "to be safe": each one is a permanent series multiplied by every label, and twenty buckets on a metric with a moderate route label can quietly become the largest metric in the system.
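One way to build such a layout is to generate a sparse geometric ladder and merge in the boundaries clustered around the SLO threshold. The helper `slo_buckets`, the growth factor of 4, and the range are assumptions chosen for illustration, not a recommended default.

```python
from typing import Iterable, Tuple


def slo_buckets(
    low: float, high: float, factor: float, around_slo: Iterable[float]
) -> Tuple[float, ...]:
    """Geometric ladder from low to high, merged with extra SLO-adjacent edges."""
    ladder = set()
    edge = low
    while edge <= high:
        ladder.add(round(edge, 6))
        edge *= factor
    ladder.update(round(b, 6) for b in around_slo)
    return tuple(sorted(ladder))


# 250ms p99 SLO: dense edges near 0.25s, sparse edges elsewhere.
buckets = slo_buckets(0.01, 5.0, 4.0, around_slo=(0.1, 0.2, 0.25, 0.3, 0.5))
print(buckets)
# (0.01, 0.04, 0.1, 0.16, 0.2, 0.25, 0.3, 0.5, 0.64, 2.56)
```

The resulting tuple can be passed as the buckets argument of a Histogram. Ten boundaries keep the per-route series count modest while still placing an edge exactly on the threshold.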
Validate bucket placement empirically rather than by intuition. After deploying a new layout, compare the histogram-derived p99 against a short-lived summary or against raw request logs for the same window. If they disagree by more than your error tolerance at the SLO threshold, the buckets are too coarse there and need a tighter boundary. This is the same discipline applied to span attribute limits described in span lifecycle and attributes: measure the cost and accuracy of your telemetry shape, do not assume it.
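The comparison can be rehearsed offline before touching production. The sketch below takes synthetic latencies (evenly spread from 1ms to 300ms, purely an assumption), computes the exact p99 from the raw samples, and compares it with what linear interpolation would report under a coarse and a fine bucket layout. The function names are hypothetical.

```python
import bisect
import math
from typing import Sequence


def estimate_quantile(q: float, samples: Sequence[float], bounds: Sequence[float]) -> float:
    """Quantile as classic buckets would report it: linear interpolation in a bucket."""
    ordered = sorted(samples)
    rank = q * len(ordered)
    lower_bound, lower_count = 0.0, 0
    for le in bounds:
        count = bisect.bisect_right(ordered, le)  # cumulative count of samples <= le
        if count >= rank:
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (le - lower_bound) * fraction
        lower_bound, lower_count = le, count
    return lower_bound  # rank falls in the +Inf bucket


def exact_quantile(q: float, samples: Sequence[float]) -> float:
    """Nearest-rank quantile from raw samples."""
    ordered = sorted(samples)
    return ordered[math.ceil(q * len(ordered) - 1e-9) - 1]


latencies = [i / 1000 for i in range(1, 301)]  # synthetic: 1ms .. 300ms
exact = exact_quantile(0.99, latencies)
coarse = estimate_quantile(0.99, latencies, (0.1, 0.5))
fine = estimate_quantile(0.99, latencies, (0.1, 0.2, 0.25, 0.3, 0.5))
print(round(exact, 3), round(coarse, 3), round(fine, 3))  # 0.297 0.494 0.297
```

With edges only at 100ms and 500ms the estimate overshoots by roughly 200ms, while the layout with edges at 250ms and 300ms matches the raw-sample value for this data.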
Native histograms (the newer Prometheus exponential-bucket format) sidestep manual boundary selection by storing exponentially spaced buckets compactly, but classic explicit buckets remain the portable default for the Python client and are what most existing dashboards expect.
Summary vs Histogram Trade-offs
| Property | Histogram | Summary |
|---|---|---|
| Where quantiles compute | Prometheus server, query time | Client process, scrape time |
| Aggregatable across replicas | Yes, via bucket sums | No |
| Configurable quantiles after the fact | Yes | No, fixed at definition |
| Client CPU cost | Low (bucket increment) | Higher (streaming estimation) |
| Error bound | Bucket-width interpolation | Per-quantile target error |
| Best for | Fleet-wide SLOs, burn rate | Single-instance diagnostics |
The practical rule: reach for a histogram by default, and only choose a summary when you need a precise quantile from exactly one process and will never aggregate it.
Recording Rules for Query-Time Aggregation
histogram_quantile() over a high-cardinality rate() of bucket series is one of the most expensive queries a dashboard can run, and re-running it on every panel refresh is wasteful. Recording rules precompute the expression on the server at a fixed interval and write the result to a new, lower-cardinality series that dashboards and alerts read cheaply.
groups:
  - name: http_slo
    interval: 30s
    rules:
      # Precompute per-route p99 once; panels read this series directly.
      - record: route:http_request_duration_seconds:p99
        expr: |
          histogram_quantile(
            0.99,
            sum by (le, route) (rate(http_request_duration_seconds_bucket[5m]))
          )
Expected Output: querying the recorded series returns the precomputed quantile per route.
route:http_request_duration_seconds:p99{route="/checkout"} 0.214
route:http_request_duration_seconds:p99{route="/search"} 0.087
Recording rules reduce query cost and the cardinality of the derived series your dashboards touch, but they do not change the cardinality of the raw scraped series. To cut the raw series count you must fix the instrumentation or relabel at scrape time.
Production Code Examples
This end-to-end Flask-style handler uses the right type for each measurement, keeps every label bounded, and exposes the metrics on a dedicated port. For the framework integration details, see instrumenting Flask with Prometheus metrics.
import time
from prometheus_client import Counter, Gauge, Histogram, start_http_server
REQUESTS = Counter(
    "http_requests_total", "Total HTTP requests",
    labelnames=("method", "route", "status"),
)
IN_FLIGHT = Gauge(
    "http_requests_in_flight", "In-flight requests",
    labelnames=("route",),
)
LATENCY = Histogram(
    "http_request_duration_seconds", "Request latency",
    labelnames=("route",),
    buckets=(0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0),
)
def handle(method: str, route: str) -> int:
    # route is the matched template, never the raw path -> bounded label
    IN_FLIGHT.labels(route=route).inc()
    start = time.perf_counter()
    status = 500  # reported if the handler raises before setting a status
    try:
        time.sleep(0.03)  # stand-in for real request handling
        status = 200
        return status
    finally:
        LATENCY.labels(route=route).observe(time.perf_counter() - start)
        # Label with the actual outcome instead of a hardcoded "200".
        REQUESTS.labels(method=method, route=route, status=str(status)).inc()
        IN_FLIGHT.labels(route=route).dec()
if __name__ == "__main__":
    start_http_server(8000)  # exposition endpoint on :8000/metrics
    handle("GET", "/checkout")
    while True:  # keep the process alive so the endpoint can be scraped
        time.sleep(1)
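Expected Output: scraping http://localhost:8000/metrics returns exposition text that includes lines like the following (abridged; the client also exports the remaining buckets, _created series, and its default process and runtime metrics).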
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",route="/checkout",status="200"} 1.0
# HELP http_request_duration_seconds Request latency
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{route="/checkout",le="0.025"} 0.0
http_request_duration_seconds_bucket{route="/checkout",le="0.05"} 1.0
http_request_duration_seconds_bucket{route="/checkout",le="+Inf"} 1.0
http_request_duration_seconds_sum{route="/checkout"} 0.0306
http_request_duration_seconds_count{route="/checkout"} 1.0
http_requests_in_flight{route="/checkout"} 0.0
Common Mistakes
Putting an unbounded identifier in a label
Symptom: Prometheus memory climbs continuously and prometheus_tsdb_head_series grows without bound. Root cause: a label such as user_id or request_id mints a new series per request. Remediation: remove the label from the instrument and move the identifier into traces or logs; if it is already deployed, drop it at scrape time with metric_relabel_configs.
Using default histogram buckets for a specific SLO
Symptom: the dashboard p99 disagrees with measured latency at the SLO threshold. Root cause: no bucket boundary sits on the threshold, so the interpolated quantile is coarse. Remediation: override buckets to place a boundary on the SLO value and tighten spacing nearby.
Choosing a Summary then trying to aggregate it
Symptom: a fleet-wide p99 panel returns meaningless or per-instance values. Root cause: summary quantiles are computed in each process and cannot be merged. Remediation: switch to a Histogram and compute the quantile with histogram_quantile() over summed buckets.
Treating a Counter as a Gauge
Symptom: graphs show jagged drops to zero on every deploy. Root cause: reading a counter's raw value instead of its rate, so restarts look like data loss. Remediation: always query counters through rate() or increase(), which handle resets correctly.
Per-worker metrics under gunicorn without multiprocess mode
Symptom: counters appear to halve or jump randomly between scrapes, and totals are far below reality. Root cause: each worker holds an independent copy and the scrape hits one worker at a time. Remediation: enable multiprocess mode with PROMETHEUS_MULTIPROC_DIR and a MultiProcessCollector, and declare a multiprocess mode on every gauge.
When to Reach for Each Type
A short field guide ties the semantics back to everyday decisions. Reach for a Counter whenever you would naturally say "number of," because the rate of a counter is the throughput or error rate you actually want on a dashboard. Reach for a Gauge whenever a single instantaneous reading is the answer and the value can fall as well as rise; if you find yourself resetting a counter to model a falling value, you wanted a gauge. Reach for a Histogram whenever the question is "how is this value distributed across requests," especially for latency and payload size, and whenever the answer must hold across more than one replica. Reach for a Summary only in the narrow case of a precise quantile from a single, long-lived process where aggregation will never apply, accepting its higher CPU cost in exchange for a tight per-quantile error bound.
The cardinality lens overrides all of the above when they conflict. A type that is semantically perfect but attached to an unbounded label is the wrong choice, because the series explosion will cost more than the missing signal. When that tension appears, keep the high-cardinality dimension out of metrics entirely and recover it from traces or structured logs, then choose the metric type for the bounded view that remains. This is why the type decision and the cardinality decision in this guide are two halves of one design step, not separate concerns.
Frequently Asked Questions
How many time series does one labeled metric actually create?
One time series exists for every unique combination of metric name and label values. A metric with two labels of 50 and 20 distinct values produces up to 1000 series, and adding a third label multiplies that figure again.
Should I use a Histogram or a Summary for latency SLOs?
Use a Histogram. Histogram buckets are aggregatable across instances on the Prometheus server, so you can compute global quantiles and burn rates, while Summary quantiles are pre-computed per process and cannot be merged.
Why are user IDs and request IDs bad as metric labels?
They are unbounded high-cardinality values. Each new ID creates a fresh permanent time series, which inflates server memory, slows queries, and can crash Prometheus. Keep that detail in traces and structured logs instead.
What is the difference between a Counter and a Gauge?
A Counter only ever increases and resets to zero on restart, so it answers how many events have happened over time. A Gauge can go up and down and represents a current value such as queue depth or memory usage.
Do recording rules reduce cardinality?
Recording rules precompute expensive expressions and can aggregate away labels, lowering query cost and the cardinality of the derived series. They do not reduce the cardinality of the raw series being scraped.