Incident Posture
What happens when things go wrong
Last reviewed: 2025-01-25
This document describes exactly what happens when ΔOS components fail. No marketing language, just the failure modes and their consequences.
You need to know what breaks and how. We're documenting failure because understanding failure is how you build trust.
Failure Philosophy
Fail Closed, Not Open
When ΔOS cannot evaluate an Intent, the default is to block, not allow.
// If evaluation fails for any reason
intent.judgment = 'block';
intent.blockReason = 'Evaluation unavailable';
This trades availability for safety: when evaluation fails, ΔOS may block valid actions, but it does not allow dangerous actions by default.
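A minimal sketch of the fail-closed rule, assuming a hypothetical `evaluate` callback that may throw or return garbage. The name `evaluateFailClosed` is illustrative and not part of the ΔOS SDK; only the `judgment` and `blockReason` fields come from the snippet above.

```javascript
// Hypothetical wrapper: any evaluation error results in a block judgment.
function evaluateFailClosed(intent, evaluate) {
  try {
    const judgment = evaluate(intent); // assumed to return a judgment string
    if (typeof judgment !== 'string') {
      throw new TypeError('Evaluator returned no judgment');
    }
    return { ...intent, judgment };
  } catch {
    // Fail closed: the Intent is blocked, never silently allowed.
    return { ...intent, judgment: 'block', blockReason: 'Evaluation unavailable' };
  }
}

const ok = evaluateFailClosed({ id: 'a' }, () => 'allow');
const failed = evaluateFailClosed({ id: 'b' }, () => { throw new Error('timeout'); });
console.log(ok.judgment, failed.judgment, failed.blockReason);
```

The point of the sketch is that the error path and the "no usable answer" path converge on the same blocked result.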
Degrade Gracefully
Partial failures reduce capability rather than causing total outage:
| State | Intent Evaluation | Audit Recording | Escalation Routing | Value Attribution |
|---|---|---|---|---|
| Healthy | ✓ | ✓ | ✓ | ✓ |
| Partial Failure | ✓ | ✓ | ✓ | – |
| Total Failure | – | ✓ | – | – |
Failure Scenarios
1. Evaluation Service Unavailable
LIM Evaluation Cannot Complete
Intents cannot be evaluated against policy. All new Intents are blocked until the service recovers.
The service recovers automatically once the underlying infrastructure stabilizes. Manual override remains available for urgent actions.
Your options during this failure:
- Use manual override for critical actions
- Wait for service recovery
- Activate emergency approval workflow
2. Audit Service Unavailable
Audit Trail Cannot Record
Intents can be evaluated, but decisions cannot be recorded. The system enters audit-pending mode.
The system queues audit records locally. Upon recovery, the records are flushed to permanent storage with their original timestamps.
Important: Decisions made during this period are still valid. They are simply recorded later.
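The queue-then-flush behavior can be pictured with a small sketch. Everything here is an assumption for illustration: the `PendingAuditQueue` class, the `decidedAt` field, and the `write` callback are hypothetical and do not describe how ΔOS stores its queue internally.

```javascript
// Hypothetical local buffer for audit records while the audit service is down.
class PendingAuditQueue {
  constructor() {
    this.records = [];
  }

  // Capture the decision time at enqueue, so a late flush keeps the original timestamp.
  enqueue(decision, now = Date.now()) {
    this.records.push({ decision, decidedAt: new Date(now).toISOString() });
  }

  // Write records in order; stop at the first failure and keep the rest queued.
  flush(write) {
    let flushed = 0;
    while (this.records.length > 0) {
      try {
        write(this.records[0]);
      } catch {
        break;
      }
      this.records.shift();
      flushed += 1;
    }
    return flushed;
  }
}

const queue = new PendingAuditQueue();
queue.enqueue({ intentId: 'i-1', judgment: 'allow' }, Date.UTC(2025, 0, 25, 12, 0, 0));
queue.enqueue({ intentId: 'i-2', judgment: 'block' }, Date.UTC(2025, 0, 25, 12, 0, 5));

const stored = [];
const flushedCount = queue.flush((record) => stored.push(record));
console.log(flushedCount, stored[0].decidedAt);
```

A record is removed from the queue only after its write succeeds, so a flush that fails partway leaves the remaining records in place for the next attempt.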
3. Escalation Routing Fails
Cannot Route to Human Reviewers
Escalated Intents cannot reach human queues. Escalated actions are blocked.
Escalations are queued, and human reviewers receive the backlog when routing recovers. SLA timers pause during the outage.
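To make the paused SLA timer concrete, here is a sketch that computes elapsed SLA time with outage windows excluded. The helper `slaElapsedMs` is hypothetical, and it assumes outage windows are given as non-overlapping `[start, end)` pairs in epoch milliseconds.

```javascript
// Hypothetical helper: elapsed SLA time excluding routing outages.
// All times are epoch milliseconds; outages must not overlap each other.
function slaElapsedMs(escalatedAt, now, outages) {
  let paused = 0;
  for (const [start, end] of outages) {
    // Only count the part of the outage that falls inside the escalation's lifetime.
    const from = Math.max(start, escalatedAt);
    const to = Math.min(end, now);
    if (to > from) paused += to - from;
  }
  return (now - escalatedAt) - paused;
}

const minute = 60000;
// Escalated at t=0, checked at t=60 min, routing was down from minute 10 to minute 30.
const elapsed = slaElapsedMs(0, 60 * minute, [[10 * minute, 30 * minute]]);
console.log(elapsed / minute); // 40
```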
4. Evidence Collection Fails
Cannot Gather Evidence for Evaluation
LIMs cannot access the context needed for evaluation, so conservative judgments are applied.
The system retries evidence collection and falls back to conservative evaluation rules when evidence remains unavailable.
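A rough sketch of the retry-then-fallback pattern follows. The function name, the `mode` values, and the default of three attempts are all assumptions made for the example. The actual retry policy is not specified in this document.

```javascript
// Hypothetical: try evidence collection a few times, then fall back.
function collectEvidenceOrFallback(collect, maxAttempts = 3) {
  for (let attempt = 1; attempt <= maxAttempts; attempt += 1) {
    try {
      return { evidence: collect(), mode: 'standard', attempts: attempt };
    } catch {
      // Ignore the error and retry; a real client would likely wait between attempts.
    }
  }
  // No evidence after all attempts: signal that conservative rules should apply.
  return { evidence: null, mode: 'conservative', attempts: maxAttempts };
}

let calls = 0;
const flaky = () => {
  calls += 1;
  if (calls < 3) throw new Error('context store unreachable');
  return { source: 'example' };
};

const recovered = collectEvidenceOrFallback(flaky);
const degraded = collectEvidenceOrFallback(() => { throw new Error('down'); });
console.log(recovered.mode, recovered.attempts, degraded.mode);
```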
5. Total System Failure
ΔOS Completely Unavailable
No governance functions are available, and agents cannot submit Intents.
Full recovery is required. Agents configured to fail closed will halt. The kill switch remains available via infrastructure controls.
Response Procedures
Automatic Responses
| Condition | Automatic Response |
|---|---|
| Evaluation latency > 5s | Alert + log degradation |
| Evaluation error rate > 1% | Alert + investigate |
| Audit lag > 1 minute | Alert + monitor queue |
| Escalation queue > 100 | Alert + increase routing capacity |
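The table above can be read as a simple threshold check. The sketch below encodes it that way. The metric field names (`evaluationLatencySeconds` and so on) are assumptions for illustration, while the limits and response strings are copied from the table.

```javascript
// Limits and responses come from the table; metric names are hypothetical.
const AUTOMATIC_RESPONSES = [
  { metric: 'evaluationLatencySeconds', limit: 5, response: 'Alert + log degradation' },
  { metric: 'evaluationErrorRate', limit: 0.01, response: 'Alert + investigate' },
  { metric: 'auditLagSeconds', limit: 60, response: 'Alert + monitor queue' },
  { metric: 'escalationQueueLength', limit: 100, response: 'Alert + increase routing capacity' },
];

// Return the responses whose condition is strictly exceeded.
function triggeredResponses(metrics) {
  return AUTOMATIC_RESPONSES
    .filter(({ metric, limit }) => metrics[metric] > limit)
    .map(({ response }) => response);
}

const triggered = triggeredResponses({
  evaluationLatencySeconds: 6.2,
  evaluationErrorRate: 0.004,
  auditLagSeconds: 90,
  escalationQueueLength: 100, // equal to the limit, so not triggered
});
console.log(triggered);
```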
Human Responses
| Severity | Response Time | Actions |
|---|---|---|
| P0 (Critical) | 15 minutes | All hands, customer notification |
| P1 (High) | 1 hour | On-call response, status update |
| P2 (Medium) | 4 hours | Next business day if after hours |
| P3 (Low) | 24 hours | Scheduled maintenance window |
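As an illustration of how the response times translate into deadlines, the sketch below maps each severity to its response window in minutes. The helper name is hypothetical, and the P2 after-hours rule ("next business day") is deliberately not modeled.

```javascript
// Response windows from the table, in minutes.
const RESPONSE_TIME_MINUTES = { P0: 15, P1: 60, P2: 240, P3: 1440 };

// Hypothetical helper: latest time by which a response is due.
function responseDeadline(severity, detectedAt) {
  const minutes = RESPONSE_TIME_MINUTES[severity];
  if (minutes === undefined) throw new RangeError(`Unknown severity: ${severity}`);
  return new Date(detectedAt.getTime() + minutes * 60000);
}

const deadline = responseDeadline('P1', new Date('2025-01-25T12:00:00Z'));
console.log(deadline.toISOString()); // 2025-01-25T13:00:00.000Z
```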
What You Should Do
Configure Fail Behavior
deltaos.configure({
  onEvaluationUnavailable: 'block',  // or 'allow-with-audit'
  onAuditUnavailable: 'queue',       // or 'block'
  onEscalationUnavailable: 'block',  // or 'allow-with-flag'
  healthCheckInterval: '30s'
});
Monitor Health Endpoints
const health = await deltaos.health.check();
// {
//   evaluation: { status: 'healthy', latencyP99: '45ms' },
//   audit: { status: 'healthy', queueDepth: 0 },
//   escalation: { status: 'healthy', pendingCount: 3 }
// }
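One way to act on the health response is to reduce it to a list of components that need attention. The helper below is a sketch over the response shape shown above. The `'degraded'` status value in the sample is an assumption, since only `'healthy'` appears in this document.

```javascript
// Hypothetical helper: names of components whose status is not 'healthy'.
function unhealthyComponents(health) {
  return Object.entries(health)
    .filter(([, component]) => component.status !== 'healthy')
    .map(([name]) => name);
}

const sample = {
  evaluation: { status: 'healthy', latencyP99: '45ms' },
  audit: { status: 'degraded', queueDepth: 1200 }, // assumed status value
  escalation: { status: 'healthy', pendingCount: 3 },
};
console.log(unhealthyComponents(sample)); // [ 'audit' ]
```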
Set Up Alerts
await deltaos.alerts.configure({
  channels: ['pagerduty:team-oncall', 'slack:#governance-alerts'],
  thresholds: {
    evaluationLatencyP99: '500ms',
    auditQueueDepth: 1000,
    escalationBacklog: 50
  }
});
Recovery Verification
After any incident, verify:
- Audit completeness: there are no gaps in the audit trail
- Judgment consistency: a replayed sample of decisions produces the same judgments
- Escalation processing: all queued escalations have been handled
- Value attribution: metrics have caught up
const verification = await deltaos.recovery.verify({
  incident: 'INC-2025-001',
  timeRange: { start, end } // start and end bound the incident window (defined by the caller)
});
console.log(verification);
// {
//   auditComplete: true,
//   gapsFound: 0,
//   decisionReplayMatch: '100%',
//   escalationsProcessed: 47,
//   valueAttributionLag: '0s'
// }
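The verification result could be turned into a pass/fail gate along these lines. This is a sketch only: `recoveryIssues` is a hypothetical helper, and it assumes `decisionReplayMatch` is reported as a string such as `'100%'` and `valueAttributionLag` as `'0s'` when caught up.

```javascript
// Hypothetical gate: list anything in the verification result that still needs attention.
function recoveryIssues(verification) {
  const issues = [];
  if (!verification.auditComplete) issues.push('audit trail incomplete');
  if (verification.gapsFound > 0) issues.push(`${verification.gapsFound} audit gap(s)`);
  if (verification.decisionReplayMatch !== '100%') issues.push('decision replay mismatch');
  if (verification.valueAttributionLag !== '0s') issues.push('value attribution still lagging');
  return issues;
}

const clean = recoveryIssues({
  auditComplete: true,
  gapsFound: 0,
  decisionReplayMatch: '100%',
  escalationsProcessed: 47,
  valueAttributionLag: '0s',
});
const dirty = recoveryIssues({
  auditComplete: false,
  gapsFound: 2,
  decisionReplayMatch: '98%',
  escalationsProcessed: 47,
  valueAttributionLag: '0s',
});
console.log(clean.length, dirty);
```

An empty list means every check in the list above passed. Anything else names what to investigate before closing the incident.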
See Also
- Guarantees — What we always provide
- SLOs — Measurable commitments
- Authority Boundaries — Decision authority during failure