Production Incident Response: A Developer's Runbook

Production incidents are stressful. Having a documented process means you don't have to think about procedure while you're simultaneously trying to fix a problem and communicate with stakeholders. This runbook is a template — adapt it to your team.

Severity Levels

Level	Definition	Response time
SEV-1	Production down, all users affected, revenue impact	Immediate, 24/7
SEV-2	Major feature broken, many users affected	Within 30 min
SEV-3	Minor feature degraded, some users affected	Within business hours
SEV-4	Minor issue, minimal user impact	Next sprint

Phase 1: Detection & Triage (first 5 minutes)

□ Acknowledge the alert / confirm the incident is real
□ Determine severity level
□ Assign an Incident Commander (IC) — one person in charge
□ Open an incident channel (#incident-YYYY-MM-DD-short-desc)
□ Post initial update: "We are investigating reports of [X]. More info in 10 min."
□ Check status page — update if SEV-1 or SEV-2

Phase 2: Investigation (ongoing)

□ What changed recently? (recent deploys, config changes, external services)
   git log --since="2 hours ago" --oneline
   Check deployment logs

□ Check dashboards and logs
   Error rate spike? When did it start?
   Which specific requests are failing?
   Is it all users or a subset?

□ Hypothesis: what might be causing this?
□ Test hypothesis: can you reproduce it?
□ Share findings in incident channel — narrate what you're doing

Phase 3: Mitigation (ASAP)

□ Can we rollback? (fastest path to recovery)
   git revert HEAD && git push
   Feature flag to disable the affected feature
   Revert config change

□ Can we route around it?
   Disable a failing external service integration
   Scale up servers to handle load
   Serve cached responses while DB is slow

□ Is the fix safe to deploy right now, or riskier than the current issue?

Phase 4: Resolution & Communication

□ Confirm the issue is resolved (monitor for 10-15 min)
□ Post resolution: "The issue has been resolved. [Brief description of fix]."
□ Update status page to "Resolved"
□ Notify stakeholders
□ Schedule postmortem within 48 hours (SEV-1/SEV-2)

Status Page Update Templates

# Investigating
We are aware of reports of [issue description] and are currently investigating.
Next update: [time].

# Identified
We have identified the cause of [issue description] and are working on a fix.
Affected: [scope of impact]. Next update: [time].

# Resolved
The [issue] has been resolved as of [time]. 
The root cause was [brief explanation].
We are conducting a post-incident review to prevent recurrence.

Blameless Postmortem Template

# Incident Postmortem: [Title]
Date: YYYY-MM-DD | Severity: SEV-X | Duration: X hours

## Summary
[2-3 sentence description of what happened and impact]

## Timeline
- HH:MM — [what happened]
- HH:MM — [who was alerted, what action was taken]
- HH:MM — [mitigation applied]
- HH:MM — [resolved]

## Root Cause
[Technical description of the actual cause]

## Contributing Factors
[What conditions made this possible / made it worse]

## What Went Well
- [Thing that helped]

## What Could Have Gone Better
- [Gap identified]

## Action Items
| Action | Owner | Due |
|--------|-------|-----|
| Add monitoring for X | @alice | 2025-08-15 |
| Write runbook for Y  | @bob   | 2025-08-20 |