Debugging a Production Incident at 3 AM
A single Slack alert kicked off a late-night incident response. The root cause was embarrassingly simple.
The Slack alert went off at 2:47 AM
I was asleep. Like, deep asleep. Then my phone started buzzing like crazy. 47 unread messages in the Slack channel. The Grafana dashboard was solid red. API response times had jumped from the usual 120ms to 14,300ms.
I opened my laptop in my pajamas. Honestly, this is the worst moment. Your brain isn't even awake yet, but you have to read logs.
First suspect: the database
Naturally, I assumed it was a DB issue. Must be a slow query, right? Dug through the PostgreSQL logs -- no slow queries. Connection pool looked normal. CPU usage was only at 12%.
(At this point, 30 minutes had already passed.)
Checked Redis too. Memory usage normal, no evictions, connections healthy. I genuinely had no idea what was wrong.
The culprit was an external API timeout
Found it much later. The payment module was calling an external payment gateway's API, and their servers weren't responding. The timeout was set to 30 seconds. Thirty seconds. I asked who set it to 30 seconds, and it was me.
The real problem was that these API calls were being processed synchronously. When the payment gateway slowed down, all our server's worker threads got blocked there, and every other request piled up behind them.
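To make the failure mode concrete, here's a minimal sketch (not our actual codebase) of why synchronous calls are so dangerous: a handful of slow gateway calls occupy every worker, and even requests that never touch the gateway get stuck waiting. The names and numbers are illustrative; `GATEWAY_DELAY` stands in for that 30-second timeout.

```python
import time
from concurrent.futures import ThreadPoolExecutor

WORKERS = 4          # stand-in for the app server's worker pool
GATEWAY_DELAY = 0.2  # stands in for the 30-second gateway timeout

def call_payment_gateway():
    # The gateway has stopped responding; each call blocks a worker
    # for the full timeout before giving up.
    time.sleep(GATEWAY_DELAY)
    return "timeout"

def handle_health_check():
    # A request that doesn't touch the gateway at all.
    return "ok"

pool = ThreadPoolExecutor(max_workers=WORKERS)

# A few payment requests arrive first and occupy every worker...
payment_futures = [pool.submit(call_payment_gateway) for _ in range(WORKERS)]

# ...so even an unrelated health check has to wait for a free worker.
start = time.monotonic()
result = pool.submit(handle_health_check).result()
waited = time.monotonic() - start

print(result, round(waited, 2))  # the health check waits ~GATEWAY_DELAY
pool.shutdown()
```

Scale `GATEWAY_DELAY` back up to 30 seconds and `WORKERS` up to a real pool size, and you get exactly the pileup we saw: every endpoint, payment-related or not, timing out together.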
But why at 3 AM?
Turns out the payment gateway had scheduled server maintenance at 2 AM. They'd sent a notice -- by email. That email went to the account of a former employee who'd already left. Nobody read it.
This is the truly absurd part. It wasn't a technical problem. It was a communication problem.
Emergency fixes
Reduced the timeout from 30 seconds to 3. Slapped on a circuit breaker pattern in a hurry. Added fallback logic so that if the external API fails, only payment-related features get disabled while everything else keeps running normally.
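The emergency fix boils down to a few lines. Below is a minimal circuit-breaker sketch in Python, not the production code: after a few consecutive failures it stops calling the gateway entirely and goes straight to the fallback, so workers fail fast instead of blocking. `charge_card` and `payments_disabled` are hypothetical stand-ins for the real gateway call and the degraded-mode response.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: after `max_failures` consecutive
    failures, short-circuit all calls for `reset_after` seconds."""

    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                return fallback()      # circuit open: fail fast, no gateway call
            self.opened_at = None      # cooldown over: allow a trial call
            self.failures = 0
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip the breaker
            return fallback()
        self.failures = 0
        return result

def charge_card():
    # Stand-in for the real gateway call (now with a 3-second timeout);
    # during the incident it failed every time.
    raise TimeoutError("gateway not responding")

def payments_disabled():
    # Fallback: only the payment feature degrades; everything else runs.
    return {"status": "payments_unavailable"}

breaker = CircuitBreaker(max_failures=3, reset_after=30.0)
results = [breaker.call(charge_card, payments_disabled) for _ in range(5)]
```

The first three calls hit the (dead) gateway and trip the breaker; calls four and five return the fallback immediately without touching it. That immediacy is the whole point: blocked workers were the outage, not the failed payments.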
Deployed at 4:23 AM. Total time to service recovery: 1 hour and 36 minutes.
What came out of the postmortem
We did a postmortem the next day. Seven action items came out of it -- three technical, four process-related. Among them: updating the contact email for external service integrations, writing an incident response runbook, and documenting all external API dependencies.
Technically, changing the timeout to 3 seconds and adding a circuit breaker was all it took. But nobody had done it for 6 months. Nobody cares until things blow up.
(Myself included.)
Habits that changed after that
Every time I see code that calls an external API, I check the timeout settings first. What's the default? Is there a fallback? Is there a circuit breaker? One late-night incident was enough to make this a habit.
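The habit is easy to check mechanically, because Python's own defaults are the trap: the socket layer defaults to no timeout at all, meaning block forever. A sketch of what I look for now; `fetch_with_timeout` is a hypothetical helper, not a library function.

```python
import socket
import urllib.request

# Python's default socket timeout is None -- block indefinitely.
# Any external call that doesn't set its own timeout inherits this.
print(socket.getdefaulttimeout())  # None

def fetch_with_timeout(url, timeout=3.0):
    """Hypothetical wrapper: every external API call goes through a
    helper that forces an explicit, short timeout."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()
```

If a code review shows an external call with no explicit timeout argument, that's now an automatic comment from me, before anything else.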
But honestly, I'm still anxious. I never know where the next incident will come from. I think I just heard a Slack notification sound. Probably just my imagination.