Prometheus Client Instrumentation in Python

Exposing numeric telemetry from a Python service starts with the official prometheus_client library: it owns the in-process registry, the four metric types, and the text exposition format that a Prometheus server scrapes. Backend engineers reach for it because it has no broker, no agent, and no push step in the common case: the process simply holds counters in memory and renders them on demand. This page is part of the Python Metrics and Instrumentation guide, and it pairs with the deeper material on choosing the right metric type and controlling cardinality, and on how the same workload looks under the OpenTelemetry metrics SDK.

The hard part is rarely the first counter. It is the lifecycle: where instruments live, which registry owns them, how the /metrics endpoint is wired into a real WSGI or ASGI app, and how prefork servers like gunicorn aggregate values across workers without resetting them on every scrape.

Instruments write into a registry; the metrics endpoint renders the text exposition that Prometheus scrapes, with an optional multiprocess directory bridging prefork workers.

Prerequisites & Dependency Pinning

The exposition format and the multiprocess API are stable within the 0.x series, but minor releases have changed default collector behavior. Pin the range so a scrape that worked in staging renders identically in production.

pip install "prometheus-client>=0.20.0,<1.0.0"

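For an in-app endpoint you also need a server. Flask and Django are WSGI frameworks, typically served by gunicorn; FastAPI and Starlette are ASGI frameworks, typically served by uvicorn. The client provides scrape handlers for both interfaces, so no extra exposition dependency is required.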

# WSGI app (Flask, Django)
pip install "gunicorn>=21.2.0,<23.0.0"
# ASGI app (FastAPI, Starlette)
pip install "uvicorn>=0.29.0,<0.35.0"

Concept & Architecture

A prometheus_client deployment has three moving parts: instruments, a registry, and an exposition endpoint.

Instruments are the four metric types. A Counter only goes up and resets on process restart; use it for totals like requests served or errors raised. A Gauge goes up and down; use it for instantaneous values like in-flight requests, queue depth, or memory. A Histogram buckets observations and exposes _bucket, _sum, and _count series so quantiles can be computed server-side, which aggregates correctly across many instances. A Summary exposes _sum and _count; the Python client does not compute or expose quantiles for it, and in client libraries that do, summary quantiles cannot be averaged across instances. The trade-offs between these are covered in depth under choosing between Counter, Gauge, Histogram, and Summary.
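To see the four types side by side, here is a minimal sketch that registers one of each on a private CollectorRegistry and renders the exposition. The metric names (demo_requests and so on) are illustrative; note that the Summary contributes only _sum and _count (plus a _created timestamp), with no quantile lines.

```python
from prometheus_client import (
    CollectorRegistry, Counter, Gauge, Histogram, Summary, generate_latest,
)

# Private registry so the demo does not touch the global REGISTRY.
registry = CollectorRegistry()

served = Counter("demo_requests", "Requests served", registry=registry)
in_flight = Gauge("demo_in_flight", "Requests in flight", registry=registry)
latency = Histogram(
    "demo_latency_seconds", "Request latency",
    buckets=(0.1, 0.5, 1.0), registry=registry,
)
payload = Summary("demo_payload_bytes", "Request payload size", registry=registry)

served.inc()          # counters only go up
in_flight.set(3)      # gauges can be set, raised, or lowered
in_flight.dec()
latency.observe(0.2)  # counted in le="0.5", le="1.0", and le="+Inf"
payload.observe(512)  # only the count and sum are recorded

exposition = generate_latest(registry).decode("utf-8")
print(exposition)
```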

The registry is the container that knows about every instrument. By default each instrument registers itself with the module-level REGISTRY at construction, which is why you almost never touch the registry directly. You pass an explicit CollectorRegistry only for test isolation or for multiprocess aggregation.
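A short sketch of the two registration paths, with hypothetical demo_* metric names: the first counter lands on the global REGISTRY implicitly, the second only on a private registry passed via registry=, which is what keeps test cases from seeing each other's values.

```python
from prometheus_client import REGISTRY, CollectorRegistry, Counter

# Default path: the constructor registers the instrument with the global REGISTRY.
default_counter = Counter("demo_default_events", "Events on the global registry")
default_counter.inc()

# Explicit path: nothing leaks into REGISTRY, which is handy for test isolation.
test_registry = CollectorRegistry()
isolated_counter = Counter(
    "demo_isolated_events", "Events on a private registry", registry=test_registry
)
isolated_counter.inc(2)

print(REGISTRY.get_sample_value("demo_default_events_total"))        # 1.0
print(REGISTRY.get_sample_value("demo_isolated_events_total"))       # None
print(test_registry.get_sample_value("demo_isolated_events_total"))  # 2.0
```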

Default collectors are registered automatically: ProcessCollector exports process_cpu_seconds_total, process_resident_memory_bytes, and open file descriptors (it reads /proc, so these series appear only on Linux), while PlatformCollector and GCCollector export interpreter and garbage-collector internals. You get these for free the moment you import the library.

Labels turn one metric name into many time series. Calling .labels(method="GET", status="200") returns a child you then increment or observe. Each unique label-value combination is a distinct series stored in memory and on the wire, which is why uncontrolled label values are the primary cause of cardinality blowups — see controlling label cardinality in Prometheus before you label anything with a user ID or URL path.
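A sketch of how labels multiply series, using a hypothetical demo_http_requests counter on a private registry: four requests across three distinct (method, status) pairs produce three series, and each child stays addressable through get_sample_value.

```python
from prometheus_client import CollectorRegistry, Counter

registry = CollectorRegistry()
http_requests = Counter(
    "demo_http_requests", "Requests by method and status",
    ["method", "status"], registry=registry,
)

# Each distinct (method, status) pair creates a new child, i.e. a new series.
traffic = [("GET", "200"), ("GET", "200"), ("GET", "404"), ("POST", "201")]
for method, status in traffic:
    http_requests.labels(method=method, status=status).inc()

def total_series(reg: CollectorRegistry, sample_name: str) -> int:
    """Count the distinct label sets exported under one sample name."""
    return sum(
        1
        for family in reg.collect()
        for sample in family.samples
        if sample.name == sample_name
    )

print(total_series(registry, "demo_http_requests_total"))  # 3 series for 4 requests
```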

The Exposition Format

What a scrape returns is the Prometheus text exposition format: a flat, line-oriented document where each line is one sample. A metric carries an optional # HELP line (human description), a # TYPE line (counter, gauge, histogram, or summary), and one or more sample lines of the shape name{label="value",...} number. The format is deliberately simple: there is no nesting, no timestamps in the common case, and no compression in the format itself (the HTTP layer may still gzip the response). That is what makes it cheap to generate and trivial to debug with curl.
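Because the format is plain text, the client's own parser can read a scrape back into structured samples, which is handy for asserting on exposition in tests. A minimal sketch, assuming a hand-written body for a hypothetical worker_queue_depth gauge:

```python
from prometheus_client.parser import text_string_to_metric_families

# A hand-written scrape body in the text exposition format.
scrape_body = """\
# HELP worker_queue_depth Pending jobs in the broker queue
# TYPE worker_queue_depth gauge
worker_queue_depth{queue="default"} 12
worker_queue_depth{queue="priority"} 3
"""

# Each family groups its HELP/TYPE header with the sample lines beneath it.
families = list(text_string_to_metric_families(scrape_body))
for family in families:
    print(family.name, family.type, family.documentation)
    for sample in family.samples:
        print("  ", sample.name, sample.labels, sample.value)
```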

The composite types expand into several series. A Histogram named x emits x_bucket{le="..."} for each bucket boundary plus a synthetic le="+Inf", an x_sum, and an x_count. A Summary named y emits y_sum and y_count; the Python client does not emit y{quantile="..."} lines, although some other client libraries do. Counters render with a _total suffix appended automatically, so Counter("requests_total", ...) and Counter("requests", ...) both expose requests_total. Knowing this expansion is what lets you predict series count: a histogram with 10 buckets and one label of 4 values is 4 * (10 + 1 + 2) = 52 series, not one. On top of that, the Python client emits one _created series per child (per label combination) of a counter, histogram, or summary unless you set PROMETHEUS_DISABLE_CREATED_SERIES=True or call disable_created_metrics().
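The 52-series arithmetic can be checked directly. This sketch builds a hypothetical demo_route_latency_seconds histogram with ten explicit buckets, touches four label values, and counts the _bucket, _sum, and _count samples; the per-child _created samples are deliberately left out of the count.

```python
from prometheus_client import CollectorRegistry, Histogram

registry = CollectorRegistry()
latency = Histogram(
    "demo_route_latency_seconds", "Latency by route", ["route"],
    buckets=(0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0),  # 10 buckets
    registry=registry,
)

# Touch four label values; each child materialises its full set of series.
for route in ("/a", "/b", "/c", "/d"):
    latency.labels(route=route).observe(0.03)

suffixes = ("_bucket", "_sum", "_count")  # _created is excluded on purpose
series = [
    sample
    for family in registry.collect()
    for sample in family.samples
    if sample.name.endswith(suffixes)
]
# 4 label values * (10 buckets + le="+Inf" + _sum + _count) = 52
print(len(series))
```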

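generate_latest(registry) produces this document as bytes, and CONTENT_TYPE_LATEST is the matching Content-Type header (text/plain; version=0.0.4; charset=utf-8). Every exposition path renders from the same registry, so the samples are the same however you serve them, but the bytes can differ: start_http_server and the WSGI/ASGI apps negotiate the format from the request's Accept header (returning OpenMetrics text when the scraper asks for it) and gzip the body when Accept-Encoding allows, while a hand-rolled route that calls generate_latest always returns the classic text format, uncompressed.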
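For comparison with the built-in handlers, a hand-rolled WSGI route is only a few lines. This sketch (metrics_route and the demo_up gauge are illustrative names) always returns generate_latest output with CONTENT_TYPE_LATEST, does no Accept negotiation or gzip, and is driven in-process by a stub start_response rather than a real server.

```python
from prometheus_client import CONTENT_TYPE_LATEST, CollectorRegistry, Gauge, generate_latest

registry = CollectorRegistry()
Gauge("demo_up", "Hand-rolled endpoint is alive", registry=registry).set(1)

def metrics_route(environ, start_response):
    """Minimal WSGI callable: classic text format, no content negotiation."""
    body = generate_latest(registry)
    start_response("200 OK", [
        ("Content-Type", CONTENT_TYPE_LATEST),
        ("Content-Length", str(len(body))),
    ])
    return [body]

# Drive the WSGI callable directly, without a server, to inspect the response.
captured = {}

def start_response(status, headers):
    captured["status"] = status
    captured["headers"] = dict(headers)

environ = {"REQUEST_METHOD": "GET", "PATH_INFO": "/metrics"}
body = b"".join(metrics_route(environ, start_response))
print(captured["status"], captured["headers"]["Content-Type"])
print(body.decode("utf-8"))
```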

Default collectors and what they cost

Importing the library registers three collectors against REGISTRY automatically. ProcessCollector exposes process_cpu_seconds_total, process_resident_memory_bytes, process_virtual_memory_bytes, process_start_time_seconds, process_open_fds, and process_max_fds; it reads /proc, so it reports only on Linux, and it is invaluable for spotting a leaking worker without any extra code. PlatformCollector exposes a static python_info series with version and implementation labels. GCCollector (CPython only) exposes python_gc_objects_collected_total, python_gc_objects_uncollectable_total, and python_gc_collections_total, each labeled by generation.

These are nearly free at normal cardinality, but two caveats matter. First, under multiprocess mode they are per-process and are deliberately excluded from the merged scrape, so if you need process memory you expose it per worker on a side channel. Second, in a tightly scraped, many-replica fleet the default series add up; you can unregister a collector with REGISTRY.unregister(GC_COLLECTOR) if you do not use it. Import the singletons (PROCESS_COLLECTOR, PLATFORM_COLLECTOR, GC_COLLECTOR) from prometheus_client to unregister selectively.
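A sketch of selective unregistration, assuming you want to drop only the GC families from the global REGISTRY. The helper name drop_collector is hypothetical; it tolerates the KeyError that unregister raises for a collector that was never registered (GCCollector registers nothing outside CPython).

```python
from prometheus_client import GC_COLLECTOR, REGISTRY, generate_latest

def drop_collector(collector) -> bool:
    """Unregister a default collector; return False if it was not registered."""
    try:
        REGISTRY.unregister(collector)
        return True
    except KeyError:
        # e.g. GCCollector skips registration on non-CPython interpreters
        return False

drop_collector(GC_COLLECTOR)
exposition = generate_latest(REGISTRY).decode("utf-8")
print("python_gc_" in exposition)  # False once the GC collector is gone
```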

Step-by-Step Implementation

Step 1 — Define instruments once, at import time. Instruments are stateful objects keyed by name within the registry. Constructing the same metric twice raises ValueError: Duplicated timeseries, so declare them at module scope and import them where needed rather than recreating them per request.

# metrics.py — single source of truth for instrument objects
from prometheus_client import Counter, Gauge, Histogram

# Counter: monotonically increasing total of handled requests
REQUESTS = Counter(
    "http_requests_total",
    "Total HTTP requests processed",
    ["method", "status"],          # labels: keep values bounded
)

# Gauge: instantaneous count of requests currently being served
IN_PROGRESS = Gauge(
    "http_requests_in_progress",
    "HTTP requests currently in flight",
)

# Histogram: latency distribution with explicit, domain-tuned buckets
LATENCY = Histogram(
    "http_request_duration_seconds",
    "HTTP request latency in seconds",
    ["method"],
    buckets=(0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0),
)
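To make the failure mode concrete, the sketch below constructs the same instrument twice against one private registry, as happens when instruments are built inside a function that runs per request. make_counter and demo_jobs are hypothetical names.

```python
from prometheus_client import CollectorRegistry, Counter

registry = CollectorRegistry()

def make_counter() -> Counter:
    # Anti-pattern: building the instrument inside a function that runs repeatedly.
    return Counter("demo_jobs", "Jobs processed", registry=registry)

make_counter()  # first call registers demo_jobs_total and its companions
error = None
try:
    make_counter()  # second call collides with the first registration
except ValueError as exc:
    error = str(exc)

print(error.split(":")[0])  # Duplicated timeseries in CollectorRegistry
```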

Step 2 — Record observations from your code. Increment counters, set or shift gauges, and observe histogram values. The time() context manager on a histogram measures wall-clock duration and observes it on exit; track_inprogress() increments a gauge on entry and decrements on exit.

from metrics import REQUESTS, IN_PROGRESS, LATENCY

def do_work() -> str:
    # hypothetical stand-in for the real handler; returns the HTTP status label
    return "200"

def handle(method: str) -> None:
    with IN_PROGRESS.track_inprogress():           # +1 on enter, -1 on exit
        with LATENCY.labels(method=method).time():  # observes elapsed seconds
            status = do_work()
            REQUESTS.labels(method=method, status=status).inc()
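A self-contained version of the same pattern, assuming a private registry and hypothetical demo_* names so it runs outside the metrics.py layout. It checks that the gauge reads 1 inside the block and 0 afterwards, and shows that both helpers also work as decorators.

```python
import time

from prometheus_client import CollectorRegistry, Gauge, Histogram

registry = CollectorRegistry()
IN_PROGRESS = Gauge("demo_in_progress", "Calls in flight", registry=registry)
LATENCY = Histogram("demo_call_seconds", "Call latency", ["method"], registry=registry)

def handle(method: str) -> float:
    with IN_PROGRESS.track_inprogress():
        # Inside the block the gauge reads 1.0 for this single caller.
        seen = registry.get_sample_value("demo_in_progress")
        with LATENCY.labels(method=method).time():
            time.sleep(0.01)  # stand-in for real work
    return seen

# Both helpers also work as decorators on a function.
@IN_PROGRESS.track_inprogress()
@LATENCY.labels(method="POST").time()
def decorated() -> None:
    time.sleep(0.01)

print(handle("GET"))                                   # 1.0
print(registry.get_sample_value("demo_in_progress"))   # 0.0
```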

Step 3 — Expose the endpoint. For a standalone worker with no web server, start_http_server spins up a background thread serving the exposition on a dedicated port. It is the simplest path for batch jobs, Celery workers, and scripts.

import time
from prometheus_client import start_http_server
from handlers import handle        # hypothetical module exposing Step 2's handle()

start_http_server(8000)            # serves /metrics (and /) on :8000
while True:
    handle("GET")
    time.sleep(1)
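A sketch that starts the standalone server and scrapes it from the same process with urllib. It assumes port 18000 on loopback is free and passes a private registry through start_http_server's registry parameter, so the output contains only the demo counter and none of the default collectors.

```python
import urllib.request

from prometheus_client import CollectorRegistry, Counter, start_http_server

registry = CollectorRegistry()
beats = Counter("demo_heartbeats", "Loop iterations", registry=registry)

# Assumption: port 18000 is free. The server socket is bound before this call
# returns, and requests are handled on a background daemon thread.
PORT = 18000
start_http_server(PORT, addr="127.0.0.1", registry=registry)

beats.inc(3)
with urllib.request.urlopen(f"http://127.0.0.1:{PORT}/metrics", timeout=5) as resp:
    body = resp.read().decode("utf-8")
print(body)
```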

Step 4 — Or mount a scrape handler in an existing app. When you already run a web server, do not open a second port unless you must. The client exposes generate_latest() (the rendered bytes) and CONTENT_TYPE_LATEST (the correct content type), plus ready-made WSGI and ASGI apps you can mount on /metrics. This shares the app's port, TLS, and auth. The concrete WSGI wiring for a real framework is walked through in instrumenting Flask with Prometheus metrics.

# ASGI: mount the client's app on a Starlette/FastAPI route
from prometheus_client import make_asgi_app
from fastapi import FastAPI

app = FastAPI()
app.mount("/metrics", make_asgi_app())   # WSGI equiv: make_wsgi_app()
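The mounted app is an ordinary ASGI callable, so it can be exercised without uvicorn. This sketch drives make_asgi_app in-process with a minimal hand-built HTTP scope and collects the messages it sends back; call_asgi and the demo_ready gauge are illustrative, not part of the client's API.

```python
import asyncio

from prometheus_client import CollectorRegistry, Gauge, make_asgi_app

registry = CollectorRegistry()
Gauge("demo_ready", "Service readiness", registry=registry).set(1)
metrics_app = make_asgi_app(registry=registry)

async def call_asgi(app, path: str = "/metrics"):
    """Invoke an ASGI app in-process and collect what it sends back."""
    scope = {
        "type": "http", "method": "GET", "path": path,
        "query_string": b"", "headers": [],
    }
    sent = []

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        sent.append(message)

    await app(scope, receive, send)
    status = sent[0]["status"]  # from the http.response.start message
    body = b"".join(m.get("body", b"") for m in sent[1:])
    return status, body

status, body = asyncio.run(call_asgi(metrics_app))
print(status, body.decode("utf-8"))
```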

Step 5 — Enable multiprocess mode for prefork servers. Gunicorn and multi-worker uvicorn run several worker processes, so each worker has its own in-memory registry. A single scrape hits one random worker and sees only that worker's data. The fix is a shared on-disk directory plus a MultiProcessCollector, covered in the configuration section below.

Configuration Reference

Setting / API	Where it applies	Purpose	Default / Notes
`PROMETHEUS_MULTIPROC_DIR`	env var	Directory for per-worker metric files	Must exist and be writable; required for prefork aggregation
`start_http_server(port, addr)`	standalone	Background HTTP server thread	`addr` defaults to `0.0.0.0`; one port per process
`make_wsgi_app(registry)`	WSGI	App that renders the exposition	Mount on `/metrics`; uses `REGISTRY` if omitted
`make_asgi_app(registry)`	ASGI	ASGI scrape app	Mount with `app.mount("/metrics", ...)`
`generate_latest(registry)`	manual	Returns exposition as bytes	Pair with `CONTENT_TYPE_LATEST`
`CollectorRegistry()`	isolation	Standalone registry	Pass to instruments via `registry=`
`multiprocess.MultiProcessCollector(reg)`	multiproc	Aggregates per-worker files	Use a fresh `CollectorRegistry` for the scrape
`Histogram(..., buckets=...)`	instrument	Fixed bucket boundaries	Default buckets target web latency in seconds
`Gauge(..., multiprocess_mode=...)`	gauge	Cross-worker reduction	Default `all`; modes include `liveall`, `min`, `max`, `sum`, `livesum`

Multiprocess wiring

Under multiprocess mode, instruments write to memory-mapped files in PROMETHEUS_MULTIPROC_DIR instead of an in-memory registry. The variable must be set before prometheus_client is first imported, because the client chooses its value storage at import time. The scrape endpoint must build a fresh registry and attach a MultiProcessCollector that reads and merges those files. The default ProcessCollector and GCCollector are per-process and do not aggregate, so a clean registry avoids double-counting.

# metrics_endpoint.py — multiprocess-aware scrape handler
import os
from prometheus_client import CollectorRegistry, generate_latest, CONTENT_TYPE_LATEST
from prometheus_client import multiprocess

def render_metrics() -> tuple[bytes, str]:
    if os.environ.get("PROMETHEUS_MULTIPROC_DIR"):
        registry = CollectorRegistry()                  # fresh, empty registry
        multiprocess.MultiProcessCollector(registry)    # merges per-worker files
    else:
        from prometheus_client import REGISTRY as registry
    return generate_latest(registry), CONTENT_TYPE_LATEST
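The aggregation can be observed end to end in one script. This sketch assumes a POSIX system with fork() (Linux in particular), sets PROMETHEUS_MULTIPROC_DIR to a temporary directory before importing the client, lets two forked children increment a hypothetical demo_mp_jobs counter, and then scrapes through a fresh registry the same way render_metrics does.

```python
import os
import tempfile

# Must happen BEFORE prometheus_client is imported: the client picks
# file-backed value storage at import time when this variable is present.
os.environ["PROMETHEUS_MULTIPROC_DIR"] = tempfile.mkdtemp(prefix="prom_multiproc_")

import multiprocessing

from prometheus_client import CollectorRegistry, Counter, multiprocess

JOBS = Counter("demo_mp_jobs", "Jobs handled by any worker")

def worker(n: int) -> None:
    # Runs in a forked child; increments land in that child's own .db file.
    for _ in range(n):
        JOBS.inc()

def scrape_registry() -> CollectorRegistry:
    # Same shape as render_metrics(): a fresh registry fed only by the merger.
    registry = CollectorRegistry()
    multiprocess.MultiProcessCollector(registry)
    return registry

def run_workers(counts) -> CollectorRegistry:
    ctx = multiprocessing.get_context("fork")  # assumption: fork() is available
    procs = [ctx.Process(target=worker, args=(n,)) for n in counts]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return scrape_registry()

registry = run_workers([3, 4])
print(registry.get_sample_value("demo_mp_jobs_total"))  # 7.0, summed across workers
```

The temporary directory is never wiped in this sketch; in a deployment, clear it before the server starts.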

A gunicorn child_exit hook must call multiprocess.mark_process_dead(worker.pid) so the dead worker's files for live gauge modes (livesum, liveall and similar) are removed; otherwise its last values keep contributing to those gauges after it exits.

# gunicorn.conf.py
from prometheus_client import multiprocess

def child_exit(server, worker):
    multiprocess.mark_process_dead(worker.pid)

Async & Concurrency Considerations

The client is thread-safe: inc(), set(), and observe() take internal locks, so concurrent threads sharing one instrument are correct without extra synchronization. Under asyncio, a single event loop runs instrument mutations on one thread, so there is no contention and no need to offload to an executor — recording a metric is a cheap in-memory operation.
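A quick sketch of that guarantee: eight threads hammer one counter on a private registry with no locking of their own, and the total comes out exact. demo_thread_hits is an illustrative name.

```python
import threading

from prometheus_client import CollectorRegistry, Counter

registry = CollectorRegistry()
hits = Counter("demo_thread_hits", "Increments from many threads", registry=registry)

def hammer(times: int) -> None:
    for _ in range(times):
        hits.inc()  # the client's internal lock makes this safe without our own

threads = [threading.Thread(target=hammer, args=(10_000,)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(registry.get_sample_value("demo_thread_hits_total"))  # 80000.0
```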

The endpoint is the subtlety. generate_latest() walks the entire registry and renders text; under multiprocess mode it reads every per-worker file. On a registry with high cardinality this is measurable, so keep the scrape interval sane (10–30s) and keep label cardinality bounded. For ASGI apps, note that make_asgi_app() renders synchronously inside the event loop, so the whole render, including the file merge in multiprocess mode, blocks the loop for its duration. That is acceptable at normal cardinality and painful if you have hundreds of thousands of series.

For per-request HTTP latency, prefer a Histogram over a Summary. The Python client's Summary exports only _sum and _count, which give you an average but no percentiles; and where a client library does compute summary quantiles, they are per-process and cannot be aggregated across uvicorn workers or replicas, so a load-balanced p99 built from them is meaningless. Histograms aggregate cleanly because bucket counts sum across instances.
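To show why bucket counts merge while quantiles do not, this sketch simulates two replicas as two private registries, sums their cumulative buckets the way sum by (le) would, and estimates p99 with quantile(), a simplified re-implementation of the linear interpolation that PromQL's histogram_quantile performs (not the real function). The latency values are invented for the illustration.

```python
import math

from prometheus_client import CollectorRegistry, Histogram

BUCKETS = (0.1, 0.25, 0.5, 1.0)

def instance(observations) -> CollectorRegistry:
    """One simulated replica: its own registry holding one latency histogram."""
    reg = CollectorRegistry()
    h = Histogram("demo_lat_seconds", "Latency", buckets=BUCKETS, registry=reg)
    for value in observations:
        h.observe(value)
    return reg

def buckets_of(reg: CollectorRegistry) -> dict:
    """Map upper bound -> cumulative count, read from the _bucket samples."""
    return {
        float(s.labels["le"]): s.value
        for fam in reg.collect() for s in fam.samples
        if s.name == "demo_lat_seconds_bucket"
    }

def quantile(q: float, cumulative: dict) -> float:
    """Simplified histogram_quantile-style interpolation (positive bounds only)."""
    rank = q * cumulative[math.inf]
    prev_bound, prev_count = 0.0, 0.0
    for bound in sorted(cumulative):
        count = cumulative[bound]
        if count >= rank and count > prev_count:
            if math.isinf(bound):
                return prev_bound  # rank falls in +Inf: report highest finite bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan  # no observations at all

replica_a = buckets_of(instance([0.05] * 90 + [0.3] * 10))
replica_b = buckets_of(instance([0.05] * 50 + [0.7] * 50))

# Histograms merge by summing cumulative counts per bound, like sum by (le).
merged = {b: replica_a[b] + replica_b[b] for b in replica_a}

p99_a, p99_b = quantile(0.99, replica_a), quantile(0.99, replica_b)
print(round(quantile(0.99, merged), 4))  # 0.98: fleet-wide estimate
print(round((p99_a + p99_b) / 2, 4))     # 0.7325: averaged per-replica p99s
```

The merged estimate reflects the fleet-wide distribution, whereas averaging the two per-replica estimates understates the tail.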

Production Code Examples

This end-to-end example runs an ASGI service under uvicorn with multiprocess mode enabled, records request totals and latency, and exposes an aggregated scrape endpoint.

# app.py
import os
import random
import time
from fastapi import FastAPI, Request, Response
from prometheus_client import Counter, Histogram, Gauge
from prometheus_client import CollectorRegistry, generate_latest, CONTENT_TYPE_LATEST
from prometheus_client import multiprocess

# 1. Instruments declared once at import time
REQUESTS = Counter(
    "http_requests_total", "Total HTTP requests", ["method", "status"]
)
LATENCY = Histogram(
    "http_request_duration_seconds", "Request latency in seconds", ["method"],
    buckets=(0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5),
)
IN_PROGRESS = Gauge(
    "http_requests_in_progress", "In-flight requests",
    multiprocess_mode="livesum",          # sum live workers under multiproc
)

app = FastAPI()

# 2. Middleware records every request without per-route boilerplate
@app.middleware("http")
async def record_metrics(request: Request, call_next):
    method = request.method
    status = "500"                        # recorded if call_next raises
    IN_PROGRESS.inc()
    start = time.perf_counter()
    try:
        response = await call_next(request)
        status = str(response.status_code)
        return response
    finally:
        IN_PROGRESS.dec()
        LATENCY.labels(method=method).observe(time.perf_counter() - start)
        REQUESTS.labels(method=method, status=status).inc()

# 3. Business route (sync def: FastAPI runs it in a threadpool, so the
#    blocking sleep does not stall the event loop)
@app.get("/work")
def work():
    time.sleep(random.uniform(0.01, 0.2))
    return {"ok": True}

# 4. Multiprocess-aware scrape endpoint
@app.get("/metrics")
async def metrics():
    if os.environ.get("PROMETHEUS_MULTIPROC_DIR"):
        registry = CollectorRegistry()
        multiprocess.MultiProcessCollector(registry)
    else:
        from prometheus_client import REGISTRY as registry
    data = generate_latest(registry)
    return Response(content=data, media_type=CONTENT_TYPE_LATEST)

Run it with the multiprocess directory set so several uvicorn workers aggregate:

export PROMETHEUS_MULTIPROC_DIR=/tmp/prom_multiproc
mkdir -p "$PROMETHEUS_MULTIPROC_DIR" && rm -f "$PROMETHEUS_MULTIPROC_DIR"/*
uvicorn app:app --workers 4 --port 8080

Expected Output: scraping http://localhost:8080/metrics after a few requests returns the text exposition format:

# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200"} 42.0
# HELP http_request_duration_seconds Request latency in seconds
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{method="GET",le="0.01"} 3.0
http_request_duration_seconds_bucket{method="GET",le="0.05"} 11.0
http_request_duration_seconds_bucket{method="GET",le="0.1"} 24.0
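http_request_duration_seconds_bucket{method="GET",le="0.25"} 41.0
http_request_duration_seconds_bucket{method="GET",le="0.5"} 42.0
http_request_duration_seconds_bucket{method="GET",le="1.0"} 42.0
http_request_duration_seconds_bucket{method="GET",le="2.5"} 42.0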
http_request_duration_seconds_bucket{method="GET",le="+Inf"} 42.0
http_request_duration_seconds_sum{method="GET"} 4.713
http_request_duration_seconds_count{method="GET"} 42.0
# HELP http_requests_in_progress In-flight requests
# TYPE http_requests_in_progress gauge
http_requests_in_progress 0.0

Note that under multiprocess mode the default process_* and python_gc_* series are absent from this scrape, because the fresh registry holds only the MultiProcessCollector. Expose those separately per worker if you need them.

Standalone worker with a custom collector

A background worker with no web server uses start_http_server directly. When a value lives in an external system rather than being mutated by your code — a queue depth in Redis, a row count in a database — write a custom collector that reads it at scrape time instead of maintaining a gauge on every change. This avoids drift between the reported value and reality, and it means the expensive read only happens when Prometheus actually scrapes.

# worker.py
import random
import time
from prometheus_client import start_http_server, Counter
from prometheus_client.core import GaugeMetricFamily, REGISTRY

JOBS = Counter("worker_jobs_total", "Jobs handled", ["result"])

def read_queue_depth() -> float:
    # hypothetical stand-in for an external read, e.g. a Redis LLEN
    return float(random.randint(0, 20))

def run_one_job() -> str:
    # hypothetical stand-in for real job logic; returns a bounded label value
    return "ok" if random.random() > 0.05 else "retry"

class QueueDepthCollector:
    # collect() runs once per scrape and yields one or more metric families
    def collect(self):
        depth = read_queue_depth()                 # external read, scrape-time
        g = GaugeMetricFamily(
            "worker_queue_depth", "Pending jobs in the broker queue",
            labels=["queue"],
        )
        g.add_metric(["default"], depth)
        yield g

REGISTRY.register(QueueDepthCollector())           # custom collector, no state

def main():
    start_http_server(9000)                         # exposition on :9000
    while True:
        result = run_one_job()
        JOBS.labels(result=result).inc()
        time.sleep(0.5)

if __name__ == "__main__":
    main()

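Expected Output: an abridged scrape of :9000/metrics shows the counter, the scrape-time gauge, and one of the default process series (the full output also includes the _created series, python_info, and the python_gc_* families):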

# HELP worker_jobs_total Jobs handled
# TYPE worker_jobs_total counter
worker_jobs_total{result="ok"} 137.0
worker_jobs_total{result="retry"} 4.0
# HELP worker_queue_depth Pending jobs in the broker queue
# TYPE worker_queue_depth gauge
worker_queue_depth{queue="default"} 12.0
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 5.1273728e+07

The custom collector pattern is the right tool whenever the value is owned elsewhere; for values your own code changes, a plain Gauge with .inc()/.dec() is simpler and cheaper.
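One refinement to the collector above: the default REGISTRY auto-describes collectors, so registering one without a describe() method calls collect() once at registration, which means one external read at import time. A sketch that adds describe() and counts backend reads, with read_queue_depth as a hypothetical stand-in and a private registry created with auto_describe=True to mirror the default:

```python
from prometheus_client import CollectorRegistry, generate_latest
from prometheus_client.core import GaugeMetricFamily

reads = {"count": 0}

def read_queue_depth() -> float:
    """Hypothetical stand-in for an external read (Redis, SQL, broker API)."""
    reads["count"] += 1
    return 12.0

class QueueDepthCollector:
    def describe(self):
        # Having describe() stops an auto-describing registry from calling
        # collect() at registration. Returning [] also skips name-clash checks.
        return []

    def collect(self):
        g = GaugeMetricFamily("worker_queue_depth", "Pending jobs", labels=["queue"])
        g.add_metric(["default"], read_queue_depth())
        yield g

registry = CollectorRegistry(auto_describe=True)  # mirrors the default REGISTRY
registry.register(QueueDepthCollector())
print(reads["count"])  # 0: registration did not touch the backend

body = generate_latest(registry).decode("utf-8")
print(reads["count"])  # 1: one read per scrape
```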

Common Mistakes

Duplicated timeseries on import

Error signature: ValueError: Duplicated timeseries in CollectorRegistry: ... Root cause: the same metric name is constructed twice — typically because a module that defines instruments is imported under two different names, or instruments are created inside a function that runs per request. Remediation: define every instrument exactly once at module scope in a dedicated metrics.py and import the objects. For test suites that reload modules, construct instruments against a dedicated CollectorRegistry() you can discard between cases.

Counters reset on every scrape under gunicorn

Error signature: values jump between scrapes and never accumulate; http_requests_total looks tiny relative to real traffic. Root cause: prefork workers each hold a private in-memory registry, and the scrape lands on a random worker. Remediation: set PROMETHEUS_MULTIPROC_DIR, render through a MultiProcessCollector, and add the child_exit hook calling mark_process_dead. Clear the directory on startup so a redeploy does not resurrect stale files.

Using a Summary for cross-instance latency

Error signature: dashboards show an average latency but no usable percentiles, or a service-wide p99 that is wrong or unaggregatable. Root cause: the Python client's Summary exports only _sum and _count, and summary quantiles computed in-process by other client libraries cannot be averaged across replicas or workers. Remediation: switch to a Histogram with buckets tuned to your latency SLO, and compute quantiles in PromQL with histogram_quantile. Reserve Summary for cases where a count and a sum (and therefore an average) are all you need.

Frequently Asked Questions

Should I use start_http_server or a /metrics route inside my app?

Use start_http_server for scripts, workers, and batch jobs that have no web server of their own. Use an in-app /metrics route mounted on WSGI or ASGI when you already run a web framework, so scraping shares the same port, TLS, and access controls.

Why do my counters reset to zero under gunicorn?

Each prefork worker keeps its own in-memory registry, so the scrape hits a random worker and sees only that worker's values. Set PROMETHEUS_MULTIPROC_DIR and expose a MultiProcessCollector so the values aggregate across all workers.

What is the difference between a Histogram and a Summary?

A Histogram records observations into fixed buckets you define and lets Prometheus compute quantiles server-side, so it aggregates correctly across instances. The Python client's Summary records only a count and a sum; in client libraries whose summaries do compute streaming quantiles, those quantiles cannot be aggregated across instances. Prefer Histogram for latency in distributed services.

Do I need to register metrics with a registry explicitly?

No. By default every instrument registers itself with the global REGISTRY at construction time. You only pass a custom registry when you need isolation, such as in tests or in multiprocess mode.
