Incidents
Document and track service incidents
Last updated: March 15, 2026
Incidents
Incidents document service issues and their resolution.
File Structure
Place incidents in __status__/incidents/:
__status__/
└── incidents/
├── 2025-01-28-api-outage.md
├── 2025-01-15-database-latency.md
└── 2025-01-10-cdn-issue.md
Filename format: YYYY-MM-DD-short-description.md
Basic Format
---
title: API Service Outage
status: resolved
severity: major
affected_components:
- api
created_at: 2025-01-28T10:00:00Z
resolved_at: 2025-01-28T12:30:00Z
---
## Update - 12:30 UTC
Issue resolved. The root cause was identified as a misconfigured load balancer.
## Update - 11:15 UTC
Identified the issue. Deploying fix.
## Update - 10:00 UTC
Investigating reports of API unavailability.
Frontmatter Fields
| Field | Type | Required | Description |
|---|---|---|---|
title |
string | Yes | Incident title |
status |
string | Yes | Current status |
severity |
string | Yes | Severity level |
affected_components |
[]string | No | Component IDs |
created_at |
datetime | Yes | Start time (ISO 8601) |
updated_at |
datetime | No | Last update time |
resolved_at |
datetime | No | Resolution time |
Status Values
| Status | Description | Display |
|---|---|---|
investigating |
Initial investigation | Yellow |
identified |
Root cause found | Orange |
monitoring |
Fix applied, monitoring | Blue |
resolved |
Fully resolved | Green |
Severity Levels
| Severity | Description | Impact |
|---|---|---|
minor |
Low impact | Some users affected |
major |
Significant impact | Most users affected |
critical |
Service down | All users affected |
Update Timeline
Write updates as H2 headings with timestamps:
## Update - 14:30 UTC
Service fully restored. Post-mortem scheduled.
## Update - 13:45 UTC
Fix deployed to production. Monitoring.
## Update - 12:00 UTC
Identified root cause as database connection pool exhaustion.
## Update - 11:30 UTC
Investigating elevated error rates on API endpoints.
Updates display newest first on the status page.
Timestamp Formats
## Update - 14:30 UTC
## Update - 2:30 PM UTC
## Update - January 28, 14:30 UTC
## Update - 2025-01-28 14:30 UTC
Affected Components
Link incidents to components:
affected_components:
- api
- web
- mobile-api
Component IDs must match those in components.yaml.
Effects:
- Components show incident indicator
- Uptime calculations include downtime
- Filtering by component
Examples
Investigating
---
title: Elevated API Latency
status: investigating
severity: minor
affected_components:
- api
created_at: 2025-01-28T15:00:00Z
---
## Update - 15:00 UTC
We are investigating reports of slow API responses. Some users may experience delays.
Identified
---
title: Database Connection Issues
status: identified
severity: major
affected_components:
- database
- api
created_at: 2025-01-28T10:00:00Z
---
## Update - 10:45 UTC
Root cause identified: connection pool exhaustion due to long-running queries. Working on fix.
## Update - 10:00 UTC
Investigating database connectivity issues affecting API responses.
Monitoring
---
title: CDN Outage
status: monitoring
severity: critical
affected_components:
- cdn
created_at: 2025-01-28T08:00:00Z
---
## Update - 09:30 UTC
CDN service restored. Monitoring for stability.
## Update - 08:30 UTC
Identified issue with CDN provider. Working with vendor.
## Update - 08:00 UTC
CDN serving errors for static assets.
Resolved
---
title: Authentication Service Outage
status: resolved
severity: major
affected_components:
- api
- web
created_at: 2025-01-28T06:00:00Z
resolved_at: 2025-01-28T07:15:00Z
---
## Update - 07:15 UTC
Service fully restored. Root cause was an expired TLS certificate on the auth service.
## Update - 06:45 UTC
Certificate renewed. Restarting services.
## Update - 06:15 UTC
Identified expired certificate causing auth failures.
## Update - 06:00 UTC
Users unable to log in. Investigating.
Post-Mortem
Include post-mortem in resolved incidents:
---
title: Major API Outage
status: resolved
severity: critical
affected_components:
- api
created_at: 2025-01-28T10:00:00Z
resolved_at: 2025-01-28T14:00:00Z
---
## Post-Mortem
**Duration:** 4 hours
**Root Cause:** A deployment introduced a memory leak that caused API servers to crash under load.
**Impact:** API was unavailable for all users during the incident.
**Timeline:**
- 10:00 - Deployment completed
- 10:15 - First alerts triggered
- 10:30 - Rollback initiated
- 14:00 - Service restored
**Action Items:**
- Add memory usage monitoring
- Improve deployment canary process
- Update runbook for similar incidents
## Update - 14:00 UTC
Service fully restored after rollback.
## Update - 10:30 UTC
Rolling back recent deployment.
## Update - 10:00 UTC
Investigating API unavailability.
Display
Incidents appear on:
- Dashboard - Active and recent incidents
- History Page - All incidents, grouped by month
- Individual Pages - Full incident details
- RSS Feed - Updates for subscribers