Best practices for log rotation in Python

Implementing reliable log rotation in Python requires careful handler configuration to prevent disk exhaustion, log loss, and I/O bottlenecks. This guide delivers production-tested patterns for backend engineers and SREs, focusing on thread-safe rotation, multi-process synchronization, and minimal latency overhead. Proper rotation is foundational to structured logging and ensures clean integration with observability pipelines.

Key implementation priorities:

Handler Selection: RotatingFileHandler vs TimedRotatingFileHandler

Use RotatingFileHandler when disk capacity is the primary constraint. It enforces strict maxBytes limits and predictable archival cycles. Configure backupCount explicitly to cap storage consumption.

Use TimedRotatingFileHandler for compliance-driven retention windows. Set when='midnight' and interval=1 to align with audit requirements. This handler relies on system clock synchronization.
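Both handlers can be configured side by side; a minimal sketch (the paths, size limit, and retention counts are illustrative):

```python
from logging.handlers import RotatingFileHandler, TimedRotatingFileHandler

# Size-bound rotation: caps disk use at roughly (backupCount + 1) * maxBytes.
size_handler = RotatingFileHandler(
    "service.log",
    maxBytes=10 * 1024 * 1024,  # rotate at 10 MiB
    backupCount=5,              # keep service.log.1 .. service.log.5
    encoding="utf-8",
    delay=True,                 # defer opening until the first record
)

# Time-bound rotation: one file per day, 30 days of retention.
timed_handler = TimedRotatingFileHandler(
    "audit.log",
    when="midnight",
    interval=1,
    backupCount=30,
    encoding="utf-8",
    delay=True,
)
```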

Avoid mixing Python rotation with external log shippers like Fluent Bit or Vector. Race conditions during file moves trigger FileNotFoundError: [Errno 2] No such file or directory when the shipper tries to open a path the Python process has just renamed.

Always validate rotation boundaries against your service SLA. Time-based handlers only roll over when a record is emitted, so an idle process can hold an oversized file past its boundary and then produce an unexpected disk spike when traffic resumes.

Multi-Process and Thread-Safe Rotation Patterns

The standard RotatingFileHandler lacks multiprocessing safeguards. Concurrent writes during doRollover() produce interleaved JSON lines, and high-throughput workers may surface errors such as OSError: [Errno 11] Resource temporarily unavailable.

Implement OS-level advisory locking via fcntl.flock() to serialize access. Each worker must acquire an exclusive lock before truncating or renaming the active log file. This guarantees atomic state transitions.
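A sketch of the locking pattern, assuming every worker agrees on a shared lock-file path (LOCK_PATH and the function name are illustrative, and fcntl is POSIX-only):

```python
import fcntl
import os

LOCK_PATH = "service.log.lock"  # hypothetical lock file shared by all workers

def locked_rollover(active: str, archived: str) -> None:
    # Exclusive advisory lock: only one worker performs the rename at a time.
    with open(LOCK_PATH, "w") as lock:
        fcntl.flock(lock.fileno(), fcntl.LOCK_EX)
        try:
            if os.path.exists(active):
                os.replace(active, archived)  # atomic within one filesystem
        finally:
            fcntl.flock(lock.fileno(), fcntl.LOCK_UN)
```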

Disable copytruncate in external logrotate configurations. Python retains the original file descriptor, and any records written between logrotate's copy and its truncate are silently lost.

Ensure atomic file moves by verifying os.rename() semantics on your target filesystem. Network mounts (NFS/EFS) may break atomicity. Use temporary staging directories to prevent partial writes during rollover.
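One way to guard against partial writes is to stage in the target's own directory and publish with os.replace(), which is atomic on POSIX filesystems when source and destination share a mount. The helper below is a sketch, not part of the stdlib:

```python
import os
import tempfile

def atomic_write(path: str, data: bytes) -> None:
    # Stage in the target's directory so os.replace() never crosses filesystems.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # persist bytes before publishing the name
        os.replace(tmp, path)     # readers see the old or new file, never a partial one
    except BaseException:
        os.unlink(tmp)
        raise
```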

I/O Optimization and Buffering Strategies

Disable line buffering in production environments. Rely on the OS page cache to batch disk writes and reduce syscall overhead. Configure handlers with mode='a' and encoding='utf-8'.

Decouple log emission from disk I/O using logging.handlers.QueueHandler and QueueListener. This prevents request thread blocking during synchronous rollover operations. It isolates the critical path from storage latency.

Set explicit flush intervals to balance durability and latency. Overly aggressive flushing increases I/O wait times. Delayed flushing risks data loss during OOM kills or container preemption.
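The standard library's MemoryHandler is one way to express this trade-off: it buffers up to capacity records and flushes early when a record at flushLevel or above arrives (the capacity of 200 and the logger name are illustrative):

```python
import logging
from logging.handlers import MemoryHandler

target = logging.StreamHandler()  # stands in for a rotating file handler

# Buffer up to 200 records; an ERROR (or worse) flushes the buffer immediately.
buffered = MemoryHandler(capacity=200, flushLevel=logging.ERROR, target=target)

logger = logging.getLogger("batched")
logger.setLevel(logging.INFO)
logger.addHandler(buffered)

logger.info("held in the buffer")          # no target I/O yet
logger.error("flushes the whole buffer")   # both records reach the target
```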

Monitor file descriptor counts with /proc/self/fd or lsof. Rotation-induced FD leaks eventually trigger OSError: [Errno 24] Too many open files. This crashes the service abruptly.
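A minimal Linux-only sketch of such a check (the 0.8 headroom threshold is an arbitrary example):

```python
import os
import resource

def open_fd_count() -> int:
    # Each entry in /proc/self/fd is one open descriptor (Linux-specific).
    return len(os.listdir("/proc/self/fd"))

def fd_headroom_ok(threshold: float = 0.8) -> bool:
    # Alert before rotation-induced leaks reach the ulimit -n ceiling.
    soft_limit, _ = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft_limit == resource.RLIM_INFINITY:
        return True
    return open_fd_count() < threshold * soft_limit
```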

Diagnostics and Validation Workflows

Enable internal diagnostics by setting logging.raiseExceptions = True. This surfaces silent formatter or handler failures directly to stderr during development. It prevents masked configuration errors.

Validate backup file naming conventions post-rotation. Mismatched suffixes indicate misconfigured retention policies. External interference often corrupts the expected .log.1 sequence.
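A sketch of such a validation, assuming the default numeric-suffix naming of RotatingFileHandler (the helper names are illustrative):

```python
import glob

def backup_indices(base: str) -> list:
    # Collect the numeric suffixes of base.1, base.2, ... backup files.
    indices = []
    for path in glob.glob(base + ".*"):
        suffix = path.rsplit(".", 1)[1]
        if suffix.isdigit():
            indices.append(int(suffix))
    return sorted(indices)

def sequence_is_contiguous(base: str) -> bool:
    # A healthy RotatingFileHandler leaves base.1 .. base.N with no gaps.
    idx = backup_indices(base)
    return idx == list(range(1, len(idx) + 1))
```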

Execute synthetic load tests using locust or k6 to measure p95 latency spikes during rollover. Target sub-millisecond impact on the critical request path. Verify W3C Trace Context propagation remains intact.

Audit file permissions and ownership in containerized deployments. Default umask settings may restrict log aggregation agents. This breaks downstream OTel Collector ingestion pipelines.

Production Code Examples

The following implementation combines thread-safe JSON formatting, QueueHandler decoupling, and fcntl-based file locking (POSIX only). Rollover is serialized with an advisory lock, and trace identifiers are attached per record via extra.

```python
import json
import logging
import queue
from logging.handlers import QueueHandler, QueueListener, RotatingFileHandler

import fcntl  # POSIX-only advisory locking

# Bounded queue decouples log emission from disk I/O
log_queue = queue.Queue(maxsize=10000)


class JSONFormatter(logging.Formatter):
    def format(self, record):
        log_obj = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
            "module": record.module,
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
        }
        return json.dumps(log_obj, separators=(",", ":"))


class SafeRotatingFileHandler(RotatingFileHandler):
    def _open(self):
        stream = super()._open()
        # Blocks until the lock is free; other holders delay this worker.
        fcntl.flock(stream.fileno(), fcntl.LOCK_EX)
        return stream

    def doRollover(self):
        # Release the lock before the base class closes and renames the file;
        # the base class reopens the stream via _open(), re-acquiring the lock.
        if self.stream:
            fcntl.flock(self.stream.fileno(), fcntl.LOCK_UN)
        super().doRollover()


def setup_production_logger(log_path: str = "/var/log/app/service.log"):
    handler = SafeRotatingFileHandler(
        filename=log_path,
        maxBytes=50 * 1024 * 1024,
        backupCount=5,
        encoding="utf-8",
    )
    handler.setFormatter(JSONFormatter())

    listener = QueueListener(log_queue, handler, respect_handler_level=True)
    listener.start()

    queue_handler = QueueHandler(log_queue)
    logger = logging.getLogger("app")
    logger.setLevel(logging.INFO)
    logger.addHandler(queue_handler)
    return logger, listener


if __name__ == "__main__":
    logger, listener = setup_production_logger()
    # W3C Trace Context fields attached per record for log/trace correlation
    logger.info(
        "Payment processed",
        extra={
            "trace_id": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
            "span_id": "00f067aa0ba902b7",
        },
    )
    listener.stop()
```

Expected output in /var/log/app/service.log (the timestamp varies, and the module field reflects the script's filename stem, e.g. service for a file saved as service.py):

{"timestamp":"2024-05-15 10:00:00,123","level":"INFO","message":"Payment processed","module":"service","trace_id":"00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01","span_id":"00f067aa0ba902b7"}

Common Mistakes

Relying on the default FileHandler without rotation in long-running services: unbounded log growth exhausts disk space and eventually causes write failures across the host. Always enforce maxBytes and backupCount to cap storage consumption.

Using copytruncate=true with OS logrotate on Python apps: Python holds an open file descriptor and keeps writing to the same inode after truncation, so records written during the copy window are silently dropped.

Synchronous rotation blocking the main request thread: rollover involves file moves and metadata updates, which cause latency spikes under high load. Offload to async queues or background workers using QueueHandler.

Ignoring file descriptor leaks during rotation: improper stream closure leaves dangling FDs that eventually hit the ulimit -n threshold and crash the service. Always verify the old stream is closed after rollover.

FAQ

How do I prevent log loss during rotation in multi-process Python applications? Use a concurrent-safe handler with explicit file locking (fcntl/LockFile). Avoid copytruncate. Ensure each worker reopens the file descriptor post-rotation via SIGHUP or programmatic handler reset.

Should I use Python's built-in rotation or rely on OS-level logrotate? For containerized or ephemeral environments, use Python's built-in RotatingFileHandler with explicit size limits. For bare-metal/VM deployments, OS logrotate with postrotate scripts is preferred for centralized management.

How can I verify rotation integrity without impacting production performance? Enable internal logging diagnostics. Monitor backupCount file creation. Track FD usage. Run synthetic load tests to measure rollover latency impact before deployment.