Case Study: Python SaaS app returning 502 errors behind NGINX and Gunicorn
This example shows how a production infrastructure problem can be investigated methodically, improved safely and turned into clearer operational practice.
Context
A Python SaaS application was running behind NGINX with Gunicorn serving the app through systemd. Users were seeing intermittent 502 Bad Gateway errors, but the failures did not look like a normal application crash.
The site could run normally for hours, then suddenly produce a burst of 502s before recovering. There were no straightforward Python tracebacks explaining the outage. NGINX showed upstream failures, while Gunicorn workers appeared to disappear, restart or stop responding after memory usage had climbed over time.
The problem
- NGINX was returning 502 errors because the upstream Gunicorn worker had vanished, closed the connection unexpectedly or stopped responding.
- The failures were not being caused by NGINX itself. NGINX was only reporting that the Python backend had become unavailable.
- Individual Gunicorn workers were slowly increasing in memory usage during the day.
- The memory growth appeared to come from a mix of in-process caching, request-level objects being retained too long and long-lived database/session objects.
- When memory pressure became high enough, the Linux OOM killer terminated the largest Gunicorn workers to protect the server.
- Because the workers were killed by the operating system, the application did not produce a useful Python-level exception.
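The last point can be reproduced in isolation. The sketch below is a minimal demonstration, not the application's code: it starts a throwaway child interpreter (a hypothetical stand-in for a Gunicorn worker), sends it SIGKILL, the signal the Linux OOM killer delivers, and shows that no Python-level cleanup or traceback is produced. The names `CHILD_SOURCE` and `kill_child_with_sigkill` are invented for illustration, and the example assumes a POSIX system.

```python
import signal
import subprocess
import sys

# Hypothetical stand-in for a worker: it announces readiness, then blocks.
# The `finally` block represents any Python-level cleanup or error reporting.
CHILD_SOURCE = """
import time
print("ready", flush=True)
try:
    time.sleep(60)
finally:
    print("cleanup ran", flush=True)
"""


def kill_child_with_sigkill():
    """Start a child interpreter, SIGKILL it, return (returncode, stdout, stderr)."""
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD_SOURCE],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
    )
    # Wait until the child is running its own code before killing it.
    assert proc.stdout.readline().strip() == "ready"
    # SIGKILL cannot be caught or handled, so no Python code runs in response.
    proc.send_signal(signal.SIGKILL)
    stdout, stderr = proc.communicate(timeout=10)
    return proc.returncode, stdout, stderr


if __name__ == "__main__":
    returncode, stdout, stderr = kill_child_with_sigkill()
    # On POSIX, a negative return code is the number of the terminating signal.
    print("return code:", returncode)
    print("cleanup ran:", "cleanup ran" in stdout)
    print("stderr:", repr(stderr))
```

The only evidence left behind is the exit status seen by the parent process, which is why the investigation had to rely on system-level logs rather than application logs.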
Our approach
- Matched NGINX upstream 502 errors with Gunicorn worker restarts, dropped connections and rising per-worker memory usage.
- Checked journalctl, kernel logs and OOM killer events to confirm the workers were being killed outside the Python application.
- Reviewed caching, database/session lifecycle, worker count and systemd limits to reduce avoidable memory growth.
- Added controlled Gunicorn worker recycling with max-requests and max-requests-jitter, then monitored worker RSS, OOM events and upstream failures.
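The recycling step above can be sketched as a Gunicorn configuration file, which is plain Python. Every value here is an illustrative assumption, not the setting used in this case study. In a config file the options are spelled with underscores (`max_requests`, `max_requests_jitter`); the hyphenated forms are the command-line flags. The helper `simulated_restart_thresholds` is hypothetical and only approximates the documented behaviour, where each worker adds a random value between 0 and the jitter to its restart threshold. It would not normally live in a real config file.

```python
# gunicorn.conf.py (illustrative values only)
import random

bind = "127.0.0.1:8000"     # assumed upstream address that NGINX proxies to
workers = 4                 # assumed worker count; size it to available memory
max_requests = 1000         # recycle a worker after roughly this many requests
max_requests_jitter = 100   # random 0..100 added per worker to stagger restarts


def simulated_restart_thresholds(n_workers, base, jitter, seed=0):
    """Approximate the per-worker restart points that jitter produces.

    Simulation only: it mirrors the idea of base + randint(0, jitter)
    and does not call into Gunicorn.
    """
    rng = random.Random(seed)
    return [base + rng.randint(0, jitter) for _ in range(n_workers)]


if __name__ == "__main__":
    # Each worker gets its own threshold between base and base + jitter.
    print(simulated_restart_thresholds(workers, max_requests, max_requests_jitter))
```

Without jitter, workers that started together would tend to reach the limit at about the same time and restart together, briefly reducing capacity. The right base value depends on how quickly memory grows per request, so it is best chosen from the measured per-worker RSS.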
Hands-on outcomes
- Controlled Gunicorn worker recycling with max-requests and jitter, rather than emergency kills by Linux.
Want help with a similar issue?
Send the symptoms, affected system, recent changes and business impact. We will suggest the most appropriate route: emergency engineering assistance, a fixed-scope engineering fix, an infrastructure review or a wider project.