Case Study: Python SaaS app returning 502 errors behind NGINX and Gunicorn

This example shows how a production infrastructure problem can be investigated methodically, improved safely and turned into clearer operational practice.

Context

A Python SaaS application was running behind NGINX with Gunicorn serving the app through systemd. Users were seeing intermittent 502 Bad Gateway errors, but the failures did not look like a normal application crash.

The site could run normally for hours, then suddenly produce a burst of 502s before recovering. There were no straightforward Python tracebacks explaining the outage. NGINX showed upstream failures, while Gunicorn workers appeared to disappear, restart or stop responding after memory usage had climbed over time.

The problem

NGINX was returning 502 errors because the upstream Gunicorn worker had vanished, closed the connection unexpectedly or stopped responding.
The failures were not being caused by NGINX itself. NGINX was only reporting that the Python backend had become unavailable.
Individual Gunicorn workers were slowly increasing in memory usage during the day.
The memory growth appeared to come from a mix of in-process caching, request-level objects being retained too long and long-lived database/session objects.
When memory pressure became high enough, the Linux OOM killer terminated the largest Gunicorn workers to protect the server.
Because the workers were killed by the operating system, the application did not produce a useful Python-level exception.
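
Because the kill happens outside Python, the evidence sits in the kernel log rather than in the application log. The following is a minimal sketch of how such kill lines might be picked out of `journalctl -k` or `dmesg` output. The function name, the sample log lines and the assumption that the killed process is named `gunicorn` are illustrative, and the exact wording of the kernel message varies between kernel versions.

```python
import re

# Matches the kill line that recent Linux kernels log, for example:
# "Out of memory: Killed process 1234 (gunicorn) total-vm:...kB, anon-rss:...kB, ..."
# The anon-rss part is optional because older kernels format the line differently.
OOM_KILL = re.compile(
    r"Killed process (?P<pid>\d+) \((?P<name>[^)]+)\)"
    r"(?:.*?anon-rss:(?P<anon_rss_kb>\d+)kB)?"
)


def find_oom_kills(lines, process_name="gunicorn"):
    """Return one dict per OOM kill whose process name starts with process_name."""
    kills = []
    for line in lines:
        match = OOM_KILL.search(line)
        if match and match.group("name").startswith(process_name):
            rss = match.group("anon_rss_kb")
            kills.append(
                {
                    "pid": int(match.group("pid")),
                    "name": match.group("name"),
                    "anon_rss_kb": int(rss) if rss else None,
                }
            )
    return kills


# Invented sample lines, for illustration only (not taken from the incident).
SAMPLE = [
    "Mar 01 14:02:11 web1 kernel: Out of memory: Killed process 4312 (gunicorn) "
    "total-vm:2104512kB, anon-rss:1536000kB, file-rss:0kB, shmem-rss:0kB",
    "Mar 01 14:02:12 web1 systemd[1]: Started Session 42 of user deploy.",
]

if __name__ == "__main__":
    for kill in find_oom_kills(SAMPLE):
        print(kill)
```

In practice the lines would come from the kernel log (for example the output of `journalctl -k`), and the timestamps of any matches can then be compared with the times of the 502 bursts in the NGINX error log.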

Our approach

Matched NGINX upstream 502 errors with Gunicorn worker restarts, dropped connections and rising per-worker memory usage.
Checked journalctl, kernel logs and OOM killer events to confirm the workers were being killed outside the Python application.
Reviewed caching, database/session lifecycle, worker count and systemd limits to reduce avoidable memory growth.
Added controlled Gunicorn worker recycling with max-requests and max-requests-jitter, then monitored worker RSS, OOM events and upstream failures.
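
The worker recycling step can be sketched as a Gunicorn configuration file. Gunicorn config files are plain Python, and `max_requests`, `max_requests_jitter`, `graceful_timeout` and `timeout` are standard Gunicorn settings. Every value below is an assumption chosen for illustration, not the setting used in this case study; suitable numbers depend on traffic volume and on how quickly worker memory grows.

```python
# gunicorn.conf.py
# Illustrative values only; tune them against observed worker memory growth.

bind = "127.0.0.1:8000"  # assumed local address that NGINX proxies to
workers = 4              # assumed count; size it so workers * peak RSS fits in RAM

# Restart each worker after it has handled roughly this many requests,
# so slow memory growth is reset before the kernel has to step in.
max_requests = 1000

# Add a random offset between 0 and this value to each worker's limit,
# so the workers do not all restart at the same moment.
max_requests_jitter = 100

# Seconds a worker gets to finish in-flight requests during a graceful restart.
graceful_timeout = 30

# Seconds of silence after which the master kills and replaces a worker.
timeout = 30
```

The same two recycling settings are also available on the command line as `--max-requests` and `--max-requests-jitter`. Recycling limits the effect of memory growth but does not remove its cause, which is why the caching and session lifecycle review stays part of the approach.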

Hands-on outcomes

502 errors stopped: Gunicorn workers were no longer being killed unpredictably by the operating system during memory pressure.

Root cause made visible: The team could see the link between worker memory growth, OOM events and NGINX upstream failures.

Controlled worker recycling: Worker restarts became planned and graceful using max-requests and jitter, rather than emergency kills by Linux.

Better incident checks: Future 502 investigations included system logs, memory growth and worker behaviour, not just NGINX configuration.
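
As one way to make the memory growth check concrete, the sketch below reads a worker's resident set size from `/proc/<pid>/status` on Linux. The helper names are hypothetical, and how worker PIDs are discovered and how often they are sampled are left open; a monitoring agent or a library such as psutil could serve the same purpose.

```python
import re
from pathlib import Path


def parse_vmrss_kb(status_text):
    """Return the VmRSS value in kB from /proc/<pid>/status text, or None."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def worker_rss_kb(pid):
    """Return the current RSS of a process in kB, or None if it cannot be read."""
    try:
        return parse_vmrss_kb(Path(f"/proc/{pid}/status").read_text())
    except OSError:
        # The process has exited, or /proc is not available on this system.
        return None


if __name__ == "__main__":
    import os

    # Sample this process itself, as a stand-in for a Gunicorn worker PID.
    print(worker_rss_kb(os.getpid()))
```

Sampling each worker at a fixed interval and recording the values shows whether RSS climbs steadily between restarts or levels off after the recycling change.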

Relevant technologies and keywords

These are the main technologies, solutions and search terms connected to this case study.

Python, Gunicorn, NGINX, systemd, 502 Bad Gateway, OOM killer, Memory leak, max-requests, Linux, SaaS engineering assistance, Worker recycling, Upstream errors

Want help with a similar issue?

Send the symptoms, the affected system, any recent changes and the impact on your organisation. We will suggest the most appropriate route: emergency engineering assistance, a fixed-scope engineering fix, an infrastructure review or a wider project.
