Controlling Label Cardinality in Prometheus

Q: How do I find which metric is causing a cardinality blowup?

Query topk(10, count by (__name__)({__name__=~".+"})) to rank metric names by series count. Then, for each offending metric, run count(count by (<label>) (<metric>)) once per label to find the label with the most distinct values.

Q: Can I drop a label without changing my application code?

Yes. Use metric_relabel_configs in the Prometheus scrape config to drop or rewrite labels at ingestion time. The series is reduced before it is stored, with no redeploy of the service required.

Q: Does dropping a label aggregate the series or discard data?

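Neither, strictly. labeldrop removes the label from every sample it matches; it does not sum anything and it does not discard the metric. Series that differed only in that label end up with identical label sets, and Prometheus keeps one sample per scrape for each label set and treats the others as duplicates. Use labeldrop only when the remaining labels still identify each series uniquely; when you need a true total, aggregate in the application or with sum by (...) at query time.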

Q: What is a safe upper bound for label cardinality?

Aim for any single metric to stay in the low thousands of series. A label whose value set grows with traffic or customer count is unbounded and should never be a label regardless of the current number.

A single label carrying an unbounded value such as a user ID or a raw URL path can turn one metric into hundreds of thousands of time series and exhaust a Prometheus server's memory. This guide shows how to identify the blowup, bound label values in Python, and drop or rewrite offending labels with relabeling. It is part of the Python Metrics and Instrumentation guide and supports the metric types and cardinality control guide; for the underlying type semantics see choosing between counter, gauge, histogram, and summary.

A relabeling stage rewrites unbounded path values into one bounded route series.

Prerequisites

pip install "prometheus-client>=0.20.0,<1.0.0"

A Prometheus server (>=2.50,<3.0) with edit access to its scrape configuration is required for the relabeling sections. The diagnostic queries run in the Prometheus expression browser. No application environment variables are needed beyond the standard exposition port.

Why Cardinality Is the Cost Driver

Prometheus stores one independent append-only stream — a time series — for every unique combination of metric name and label values. Each active series occupies an entry in the in-memory head index plus per-sample storage, so the resource a series consumes is roughly constant regardless of how often it changes. The consequence is blunt: a metric scraped once every fifteen seconds with a million label combinations costs vastly more than a metric scraped every second with ten combinations. Sample frequency is cheap; series count is expensive.

The number of series a metric can emit is bounded by the product of the distinct value counts of all its labels. Two bounded labels of five and six values produce at most thirty series. Introduce one unbounded label (a user ID, a request ID, a raw path with embedded identifiers, an email address, or a full error message) and that bound is multiplied by the size of the unbounded set, which grows with traffic. Series are not reclaimed the moment a label value stops appearing: they stay in the in-memory head until the head is next compacted and truncated, typically within a few hours at default settings, and their samples stay on disk until the retention window expires. A brief spike of unique IDs therefore leaves a lingering trail of dead series, and a steady stream of new IDs keeps memory climbing. This is why an unbounded label acts as a ratchet on memory, and why the durable fix always lives in the application rather than in the query layer.
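
The multiplication is easy to check by hand. The following is a minimal sketch of that arithmetic; the label names and value counts are hypothetical, and max_series is an illustrative helper, not part of any library.

```python
from math import prod


def max_series(label_value_counts: dict[str, int]) -> int:
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(label_value_counts.values())


# Two bounded labels (hypothetical counts).
bounded = {"method": 5, "status": 6}
print(max_series(bounded))  # 30

# Adding one unbounded label multiplies the bound by its value count.
with_user_id = {**bounded, "user_id": 50_000}
print(max_series(with_user_id))  # 1500000
```

The bound is only reached if every combination actually occurs, but with an identifier-like label most of the growth comes from that one factor.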

Implementation

Step 1 — Confirm a blowup exists and rank by metric. In the Prometheus expression browser, rank metric names by series count. A handful of names usually account for the bulk of the head series.

topk(10, count by (__name__)({__name__=~".+"}))

Step 2 — Find the offending label on that metric. For the worst metric, count distinct values per label to identify which dimension is unbounded.

count(count by (path) (http_requests_total))

A result in the tens of thousands for a single label means that label is unbounded. Compare it against prometheus_tsdb_head_series to gauge the share of total series the metric consumes. Two complementary signals confirm the diagnosis. The scrape_samples_scraped series reports how many samples a target returns per scrape; a target that returns hundreds of thousands of samples is emitting a high-cardinality metric at the source. The Prometheus TSDB status page (Status, then TSDB Status in the UI) lists the highest-cardinality label names and label-value pairs directly, which often pinpoints the offender faster than ad hoc queries.

When several labels share blame, count them together to see the multiplicative effect rather than inspecting each in isolation.

count(count by (method, route, status, user_id) (http_requests_total))

If removing user_id from that grouping drops the count by three orders of magnitude, you have isolated the unbounded dimension and can target it precisely in the steps below.
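
The same ranking query can be run from a script through the Prometheus HTTP API instead of the expression browser. This is a rough sketch using only the standard library; it assumes an unauthenticated server at a base URL you supply, and the helper names (build_query_url, parse_vector, top_metrics) are made up for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the /api/v1/query endpoint."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_vector(payload: dict) -> list[tuple[dict, float]]:
    """Extract (labels, value) pairs from an instant-vector API response."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector, got {data['resultType']}")
    # Each result value is a [timestamp, "value-as-string"] pair.
    return [(item["metric"], float(item["value"][1])) for item in data["result"]]


def top_metrics(base_url: str, limit: int = 10) -> list[tuple[str, int]]:
    """Rank metric names by series count (performs a network call)."""
    promql = f'topk({limit}, count by (__name__)({{__name__=~".+"}}))'
    with urlopen(build_query_url(base_url, promql), timeout=30) as resp:
        payload = json.load(resp)
    rows = [(m.get("__name__", ""), int(v)) for m, v in parse_vector(payload)]
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

Calling top_metrics("http://localhost:9090") would return a list of (metric name, series count) pairs, largest first, which is convenient for a scheduled cardinality report.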

Step 3 — Bound the label in application code. The durable fix is to never emit the unbounded value. Map raw paths to the matched route template and free-form strings to a closed enum before they reach a label.

from prometheus_client import Counter

REQUESTS = Counter(
    "http_requests_total", "Total HTTP requests",
    labelnames=("method", "route", "status"),
)

# Bound the label: use the route TEMPLATE, not the concrete path.
def record(method: str, raw_path: str, route_template: str, status: int) -> None:
    REQUESTS.labels(
        method=method,
        route=route_template,          # "/users/{id}", not "/users/8123"
        status=str(status),
    ).inc()

Bounding in code has a second benefit beyond cardinality: the exposition payload itself shrinks, so each scrape transfers and parses fewer bytes, which lowers both client memory in the Python process and scrape duration on the server. A process that builds a fresh labels() child for every unique identifier also leaks memory inside the client, because the client retains every child it has ever created for the lifetime of the process. Bounding the label set therefore caps client-side memory as well as server-side series count.
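
When the web framework does not expose the matched route template, a fallback normalizer can derive one from the raw path. The sketch below is one possible approach, not the guide's prescribed method: the ID patterns and the KNOWN_ROUTES set are hypothetical and would need to match your own service. Checking the result against a closed set is what guarantees the label stays bounded.

```python
import re

# Hypothetical: segments that look like numeric IDs or UUIDs are treated as identifiers.
_ID_SEGMENT = re.compile(
    r"\d+|[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)

# Hypothetical closed set of templates this service is known to serve.
KNOWN_ROUTES = frozenset({"/health", "/users/{id}", "/users/{id}/orders/{id}"})


def normalize_path(raw_path: str) -> str:
    """Replace identifier-like segments with {id} and strip the query string."""
    path = raw_path.split("?", 1)[0]
    segments = [s for s in path.split("/") if s]
    normalized = ["{id}" if _ID_SEGMENT.fullmatch(s) else s for s in segments]
    return "/" + "/".join(normalized)


def bounded_route(raw_path: str) -> str:
    """Return a known route template, or the single bucket "other"."""
    template = normalize_path(raw_path)
    return template if template in KNOWN_ROUTES else "other"


print(bounded_route("/users/8123"))                    # /users/{id}
print(bounded_route("/users/8123/orders/77?page=2"))   # /users/{id}/orders/{id}
print(bounded_route("/wp-admin/setup.php"))            # other
```

The value returned by bounded_route is what would be passed as the route label; the catch-all "other" keeps scanner traffic and typos from creating new series.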

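Step 4 — Drop a label at scrape time when you cannot redeploy. If the bad label is already in production, remove the dimension server-side. metric_relabel_configs runs after scrape and before storage, so it shrinks series before they are indexed. This is a stopgap that buys time for a code fix, not a substitute for it: the application keeps generating the wide payload on every scrape, so client memory and network cost are unchanged even though stored series shrink. Note also that labeldrop only removes the label; it does not sum the affected series. Samples from one scrape that end up with identical label sets are treated as duplicates and only one is kept, so use it when the remaining labels still distinguish the series you care about, or when losing accurate totals for that metric is acceptable in the interim.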

scrape_configs:
  - job_name: "python-app"
    static_configs:
      - targets: ["app:8000"]
    metric_relabel_configs:
      # Collapse the unbounded "path" dimension entirely.
      - regex: "path"
        action: labeldrop
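
Before applying a labeldrop, it can help to check offline whether the remaining labels would still be unique. The following sketch takes label sets (for example, copied from a query result) and reports which ones would collide; collisions_after_drop and the sample data are hypothetical, and this models the uniqueness check only, not Prometheus itself.

```python
import collections


def collisions_after_drop(series: list[dict], dropped: set[str]) -> dict:
    """Map each label set that would collide to the number of series sharing it."""
    counts = collections.Counter(
        tuple(sorted((k, v) for k, v in labels.items() if k not in dropped))
        for labels in series
    )
    return {key: n for key, n in counts.items() if n > 1}


# Hypothetical label sets for one metric.
series = [
    {"method": "GET", "path": "/users/1"},
    {"method": "GET", "path": "/users/2"},
    {"method": "POST", "path": "/users/1"},
]

print(collisions_after_drop(series, {"path"}))  # {(('method', 'GET'),): 2}
print(collisions_after_drop(series, set()))     # {}
```

A non-empty result means several series would share one label set after the drop, so the stored value for that label set would come from only one of them.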

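Step 5 — Rewrite a label to a bounded form instead of dropping it. When the dimension is useful but the raw value is too granular, replace it with a regex-extracted bounded value rather than discarding it. The regex is fully anchored, so it must match the entire label value. As with labeldrop, samples that end up with identical label sets after the rewrite are not summed, so treat the rewritten series as a bounded placeholder until the application emits the route template itself.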

    metric_relabel_configs:
      # Rewrite /users/123 -> /users/{id} into a new "route" label.
      - source_labels: [path]
        regex: "(/users/)[0-9]+"
        target_label: route
        replacement: "${1}{id}"
      - regex: "path"
        action: labeldrop
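
Because relabel regexes are fully anchored, it is worth testing a pattern against real path values before deploying it. The sketch below approximates the rule above with Python's re module; Prometheus uses Go's RE2 syntax and the ${1} replacement form, so this is an approximation for simple patterns only, and rewrite is a hypothetical helper.

```python
import re
from typing import Optional

# Same pattern as the relabel rule; fullmatch mimics Prometheus's anchoring.
PATTERN = re.compile(r"(/users/)[0-9]+")


def rewrite(path: str) -> Optional[str]:
    """Return the rewritten route, or None when the rule would be a no-op."""
    match = PATTERN.fullmatch(path)
    if match is None:
        return None
    # Python spells the first capture group as \g<1>; Prometheus uses ${1}.
    return match.expand(r"\g<1>{id}")


print(rewrite("/users/123"))          # /users/{id}
print(rewrite("/users/123/orders"))   # None
print(rewrite("/api/users/123"))      # None
```

The two None results show paths the rule would leave without a route label, which is a hint that the pattern needs more alternatives or that the application should emit the template directly.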

Step 6 — Allowlist known-good values when a pattern is hard to express. Some labels carry a small set of legitimate values mixed with occasional garbage from malformed requests. Rather than enumerate every bad value, keep only the good ones. A keep action drops every sample whose value is not in the allowed set; it does not fold the rejected values into a catch-all bucket, so requests with unexpected values disappear from the metric entirely. If you need them counted under a single default value, do that mapping in application code.

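    metric_relabel_configs:
      # Keep only the known HTTP methods; anything else is dropped.
      # The trailing "?" also matches an empty value, so series that
      # have no method label at all are kept rather than discarded.
      - source_labels: [method]
        regex: "(GET|POST|PUT|DELETE|PATCH)?"
        action: keep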
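
Where a catch-all bucket is wanted instead of dropping samples, the mapping can be done in application code before the value reaches a label. This is a minimal sketch; ALLOWED_METHODS, the bucket name "other", and bound_method are illustrative choices rather than anything defined by the client library.

```python
# Closed set of values the label is allowed to take (illustrative).
ALLOWED_METHODS = frozenset({"GET", "POST", "PUT", "DELETE", "PATCH"})


def bound_method(raw_method: str) -> str:
    """Return a known HTTP method, or "other" for anything unexpected."""
    method = raw_method.strip().upper()
    return method if method in ALLOWED_METHODS else "other"


print(bound_method("get"))    # GET
print(bound_method("BREW"))   # other
print(bound_method(""))       # other
```

The label then has at most six values, and malformed requests remain visible as a single "other" series instead of vanishing.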

Step 7 — Drop an entire noisy metric. When a whole metric is not worth its cardinality, drop it with a __name__ match so it is never stored.

    metric_relabel_configs:
      - source_labels: [__name__]
        regex: "expensive_debug_metric"
        action: drop

Configuration Options

Relabel action	Effect	Use when
`labeldrop`	Removes a label; series that differ only by it then share one label set and are not summed	The dimension is pure noise
`replace`	Writes a derived value into a target label	The raw value is too granular
`keep`	Keeps only series whose labels match	Allowlisting known-good values
`drop` (on `__name__`)	Discards the whole series	An entire metric is too costly
`labelkeep`	Keeps only listed labels, drops the rest	Whitelisting a metric's dimensions

These run under metric_relabel_configs (post-scrape, affects stored data). Distinguish them from relabel_configs, which run during target discovery and shape which targets are scraped, not which samples are kept. Order matters: rules execute top to bottom, and each operates on the label set produced by the preceding rule. A replace that derives a bounded label must therefore appear before the labeldrop that removes the source it reads from, or the source will already be gone. When a regex in a replace does not match, the rule is a no-op and the target label is left untouched, so an allowlisting keep rule is often safer than a brittle replace for values that do not follow a single pattern.
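
The ordering rule can be seen in a toy model of the pipeline. The sketch below is a simplified simulation, not Prometheus code: it supports only a single-source replace and labeldrop, uses Python's re in place of RE2, and apply_rules and the rule dictionaries are hypothetical.

```python
import re


def apply_rules(labels: dict, rules: list[dict]) -> dict:
    """Apply simplified relabel rules top to bottom, each seeing the prior result."""
    out = dict(labels)
    for rule in rules:
        if rule["action"] == "replace":
            # A missing source label is treated as the empty string.
            value = out.get(rule["source"], "")
            match = re.fullmatch(rule["regex"], value)
            if match:  # no match means the rule is a no-op
                out[rule["target"]] = match.expand(rule["replacement"])
        elif rule["action"] == "labeldrop":
            out = {k: v for k, v in out.items() if not re.fullmatch(rule["regex"], k)}
    return out


REPLACE = {
    "action": "replace",
    "source": "path",
    "regex": r"(/users/)[0-9]+",
    "target": "route",
    "replacement": r"\g<1>{id}",
}
DROP = {"action": "labeldrop", "regex": "path"}
sample = {"__name__": "http_requests_total", "path": "/users/123"}

print(apply_rules(sample, [REPLACE, DROP]))
# {'__name__': 'http_requests_total', 'route': '/users/{id}'}
print(apply_rules(sample, [DROP, REPLACE]))
# {'__name__': 'http_requests_total'}
```

With the drop first, the replace sees an empty source value, fails to match, and never creates the route label.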

For very large fleets, consider enforcing limits rather than relying solely on relabeling. The scrape-level sample_limit rejects an entire scrape that returns more than a configured number of samples, turning a silent cardinality leak into a loud, alertable failure before it floods the TSDB. Pair it with the diagnostic queries above so a tripped limit points directly at the responsible job.

Verification

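After reloading the Prometheus config (curl -X POST http://localhost:9090/-/reload, which requires the server to be started with --web.enable-lifecycle; sending SIGHUP to the process also works), re-run the per-label count. The unbounded dimension should be gone or collapsed to a small bounded set.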

count(count by (path) (http_requests_total))

Expected Output: the path label no longer exists after labeldrop, and the series collapse to the bounded route set.

# before relabeling
count by (route) (http_requests_total)  -> {route="/users/{id}"}  142031 series

# after labeldrop / replace + reload
count by (route) (http_requests_total)  -> {route="/users/{id}"}  1 series

Watch prometheus_tsdb_head_series after the change. Growth and decline respond on different timescales: relabeling stops new offending series from being created as soon as the reloaded config takes effect, so total head series should stop growing within one or two scrape intervals, but series already in the head persist until the head is next compacted and truncated. Expect the count to plateau first and then decline over the following hours as the stale series are garbage-collected. If the count keeps growing, a second metric or a second label is still unbounded, so repeat the ranking query from Step 1 to find it.

A good closing check is to confirm the relabeling did not silently break a dashboard. Run the queries your panels use against the rewritten label and verify they still return the expected shape. A replace that introduced a new route label, for example, means panels grouping by the old path label now return nothing, so those panels must be updated to group by route as part of the same change.

Common Mistakes

Relabeling but leaving the source intact. Symptom: head series do not drop after adding a replace rule. Root cause: the original high-cardinality label is still being stored alongside the new bounded one. Remediation: add a labeldrop for the source label after the replace rule.

Fixing it only at scrape time. Symptom: cardinality returns whenever the scrape config is reset or the metric is scraped by another server. Root cause: the application still emits the unbounded value, so relabeling is a patch, not a fix. Remediation: bound the label in code as the primary fix and treat relabeling as defense in depth.

Dropping a label that breaks counter aggregation. Symptom: after a labeldrop, a counter shows impossible decreases or inflated rates. Root cause: relabeling does not sum anything. After the drop, several monotonic series share one label set, only one sample per scrape is stored for that label set, and the surviving value can come from a different source series from one scrape to the next, so the stored stream is no longer monotonic and rate() sees spurious resets. Remediation: drop labels on counters only when the remaining labels still identify each series uniquely; otherwise aggregate in the application, or with sum by (...) at query time when correctness across resets matters.

Frequently Asked Questions

How do I find which metric is causing a cardinality blowup?