Incident Posture

What happens when things go wrong

Last reviewed: 2025-01-25

This document describes exactly what happens when ΔOS components fail. No marketing language, just the failure modes and their consequences.

Why This Document Exists

You need to know what breaks and how. We're documenting failure because understanding failure is how you build trust.

Failure Philosophy

Fail Closed, Not Open

When ΔOS cannot evaluate an Intent, the default is to block, not allow.

// If evaluation fails for any reason
intent.judgment = 'block';
intent.blockReason = 'Evaluation unavailable';

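This means ΔOS accepts false negatives (blocking valid actions) in order to avoid false positives (allowing dangerous actions). The cost of a failed evaluation is lost availability, never an unreviewed action.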
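The fail-closed rule can be sketched as a wrapper around whatever performs the evaluation. This is a minimal illustration, not the ΔOS implementation: the `PostureIntent` type, the `evaluateFailClosed` helper, and the synchronous evaluator signature are all assumptions made for the example. Only the `judgment` and `blockReason` fields and the `'Evaluation unavailable'` reason come from the snippet above.

```typescript
// Hypothetical types: the real ΔOS Intent shape is not specified in this document.
type PostureJudgment = 'allow' | 'block' | 'escalate';

interface PostureIntent {
  id: string;
  judgment?: PostureJudgment;
  blockReason?: string;
}

// Runs an evaluator and fails closed: any thrown error results in a block.
function evaluateFailClosed(
  intent: PostureIntent,
  evaluate: (intent: PostureIntent) => PostureJudgment,
): PostureIntent {
  try {
    return { ...intent, judgment: evaluate(intent) };
  } catch {
    // Evaluation failed for any reason: block rather than allow.
    return { ...intent, judgment: 'block', blockReason: 'Evaluation unavailable' };
  }
}

// Usage: an evaluator that always fails, standing in for an outage.
const failingEvaluator = (): PostureJudgment => {
  throw new Error('evaluation service timeout');
};
const blockedIntent = evaluateFailClosed({ id: 'intent-1' }, failingEvaluator);
console.log(blockedIntent.judgment, blockedIntent.blockReason);
```

The important design point is that the `catch` branch has no path that returns `'allow'`, so no class of evaluator error can turn into a permitted action.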

Degrade Gracefully

Partial failures reduce capability rather than causing total outage:

State | Intent Evaluation | Audit Recording | Escalation Routing | Value Attribution
Healthy
Partial Failure
Total Failure

Failure Scenarios

1. Evaluation Service Unavailable

LIM Evaluation Cannot Complete

What Happens

Intents cannot be evaluated against policy. All new Intents are blocked until service recovers.

Capability Status
  • Intent submission: degraded
  • Policy evaluation: blocked
  • Audit recording: normal
  • Human override: normal
Public Message
Automated governance temporarily unavailable. Actions require manual approval.
Recovery Path

Service auto-recovers when underlying infrastructure stabilizes. Manual override remains available for urgent actions.

Your options during this failure:

  • Use manual override for critical actions
  • Wait for service recovery
  • Activate emergency approval workflow

2. Audit Service Unavailable

Audit Trail Cannot Record

What Happens

Intents can be evaluated but decisions cannot be recorded. System enters audit-pending mode.

Capability Status
  • Intent submission: degraded
  • Policy evaluation: normal
  • Audit recording: blocked
  • Value attribution: blocked
Public Message
Governance active but audit recording delayed. All decisions will be recorded when service recovers.
Recovery Path

System queues audit records locally. Upon recovery, records are flushed to permanent storage with original timestamps.

Important: Decisions made during this period are still valid; they are just recorded later.
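The queue-then-flush behavior described above can be sketched as follows. This is an illustrative in-memory model only: `LocalAuditQueue`, the `QueuedAuditRecord` fields, and the `write` callback are hypothetical names, and a real implementation would need durable local storage so that queued records survive a process restart.

```typescript
// Hypothetical record shape: field names are assumptions for this sketch.
interface QueuedAuditRecord {
  intentId: string;
  judgment: string;
  decidedAt: string; // ISO timestamp captured when the decision was made
}

class LocalAuditQueue {
  private pending: QueuedAuditRecord[] = [];

  // Capture the decision time at enqueue, not at flush,
  // so the original timestamp is what reaches permanent storage.
  enqueue(intentId: string, judgment: string, now: Date = new Date()): void {
    this.pending.push({ intentId, judgment, decidedAt: now.toISOString() });
  }

  depth(): number {
    return this.pending.length;
  }

  // Writes records in order. Stops at the first failure and keeps the rest queued.
  flush(write: (record: QueuedAuditRecord) => void): number {
    let flushed = 0;
    while (this.pending.length > 0) {
      const next = this.pending[0];
      if (next === undefined) break;
      try {
        write(next);
      } catch {
        break; // storage still unavailable; retry on the next flush
      }
      this.pending.shift();
      flushed++;
    }
    return flushed;
  }
}

// Usage: the decision is made during the outage and recorded after recovery.
const auditQueue = new LocalAuditQueue();
auditQueue.enqueue('intent-1', 'allow', new Date('2025-01-25T10:00:00Z'));
const storedRecords: QueuedAuditRecord[] = [];
auditQueue.flush((record) => { storedRecords.push(record); });
console.log(storedRecords[0]?.decidedAt);
```

A record is removed from the queue only after its write succeeds, so a flush that fails partway through leaves the remaining records in their original order.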

3. Escalation Routing Fails

Cannot Route to Human Reviewers

What Happens

Escalated Intents cannot reach human queues. The affected actions stay blocked and are queued until routing recovers.

Capability Status
  • Allow judgments: normal
  • Block judgments: normal
  • Escalate judgments: blocked
  • Human notification: blocked
Public Message
Human review temporarily unavailable. Actions requiring approval are queued.
Recovery Path

Escalations are queued. Humans receive backlog when routing recovers. SLA timers pause during outage.
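One way to read "SLA timers pause during outage" is that time spent inside an outage window does not count against the review deadline. The sketch below computes the counted time under that reading. The `OutageWindow` type and `slaElapsedMs` function are hypothetical, and the sketch assumes outage windows do not overlap one another.

```typescript
// Hypothetical shape: start and end are epoch milliseconds.
interface OutageWindow {
  start: number;
  end: number;
}

// Time counted against an SLA between enqueuedAt and now, excluding outages.
// Assumes the outage windows do not overlap each other.
function slaElapsedMs(enqueuedAt: number, now: number, outages: OutageWindow[]): number {
  let paused = 0;
  for (const outage of outages) {
    // Clip each outage to the period the escalation was actually waiting.
    const overlapStart = Math.max(outage.start, enqueuedAt);
    const overlapEnd = Math.min(outage.end, now);
    if (overlapEnd > overlapStart) paused += overlapEnd - overlapStart;
  }
  return now - enqueuedAt - paused;
}

// Usage: queued for 60 minutes, of which 20 minutes fell inside an outage.
const MINUTE_MS = 60000;
const countedMs = slaElapsedMs(0, 60 * MINUTE_MS, [
  { start: 10 * MINUTE_MS, end: 30 * MINUTE_MS },
]);
console.log(countedMs / MINUTE_MS); // 40
```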

4. Evidence Collection Fails

Cannot Gather Evidence for Evaluation

What Happens

LIMs cannot access the context needed for evaluation. Conservative judgments are applied.

Capability Status
  • Intent submission: normal
  • Full policy evaluation: degraded
  • Audit recording: normal
Public Message
Operating with reduced context. Some actions may require manual approval.
Recovery Path

The system retries evidence collection and falls back to conservative evaluation rules when evidence is unavailable.
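The retry-then-fallback path might look like the sketch below. The `collectWithFallback` helper, the `EvidenceOutcome` type, and the fixed attempt budget are assumptions for illustration. The document does not say how many retries ΔOS makes or what the conservative rules are, and a production version would normally add a delay between attempts.

```typescript
// Hypothetical result type: either full evidence, or a signal to evaluate conservatively.
type EvidenceOutcome<T> =
  | { mode: 'full'; evidence: T }
  | { mode: 'conservative' };

// Tries to collect evidence up to maxAttempts times, then falls back.
function collectWithFallback<T>(collect: () => T, maxAttempts: number): EvidenceOutcome<T> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return { mode: 'full', evidence: collect() };
    } catch {
      // Collection failed; loop again until the attempt budget is exhausted.
    }
  }
  // No evidence available: the caller should apply conservative rules.
  return { mode: 'conservative' };
}

// Usage: a collector that never succeeds, standing in for an outage.
const outcome = collectWithFallback<string>(() => {
  throw new Error('context store unreachable');
}, 3);
console.log(outcome.mode);
```

Returning an explicit `mode` forces the caller to decide what to do without evidence, instead of silently evaluating with missing context.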

5. Total System Failure

ΔOS Completely Unavailable

What Happens

No governance functions available. Agents cannot submit Intents.

Capability Status
  • Intent submission: blocked
  • Policy evaluation: blocked
  • Audit recording: blocked
  • All governance: blocked
Public Message
Governance infrastructure unavailable. Agent actions blocked pending recovery.
Recovery Path

Full recovery required. Agents configured with fail-closed will halt. Kill switch remains available via infrastructure controls.

Response Procedures

Automatic Responses

Condition | Automatic Response
Evaluation latency > 5s | Alert + log degradation
Evaluation error rate > 1% | Alert + investigate
Audit lag > 1 minute | Alert + monitor queue
Escalation queue > 100 | Alert + increase routing capacity
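The table above can be expressed as a small rule function, which is handy for mirroring the same thresholds in your own monitoring. The `GovernanceMetrics` shape and `automaticResponses` name are hypothetical. Only the thresholds and response labels are taken from the table.

```typescript
// Hypothetical metrics snapshot: field names and units are assumptions.
interface GovernanceMetrics {
  evaluationLatencyMs: number;
  evaluationErrorRate: number; // fraction, so 0.01 means 1%
  auditLagMs: number;
  escalationQueueLength: number;
}

// Returns the automatic responses whose conditions are currently met.
function automaticResponses(m: GovernanceMetrics): string[] {
  const responses: string[] = [];
  if (m.evaluationLatencyMs > 5000) responses.push('Alert + log degradation');
  if (m.evaluationErrorRate > 0.01) responses.push('Alert + investigate');
  if (m.auditLagMs > 60000) responses.push('Alert + monitor queue');
  if (m.escalationQueueLength > 100) responses.push('Alert + increase routing capacity');
  return responses;
}

// Usage: slow evaluation and a long escalation queue, everything else normal.
console.log(
  automaticResponses({
    evaluationLatencyMs: 6200,
    evaluationErrorRate: 0.002,
    auditLagMs: 1500,
    escalationQueueLength: 140,
  }),
);
```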

Human Responses

Severity | Response Time | Actions
P0 (Critical) | 15 minutes | All hands, customer notification
P1 (High) | 1 hour | On-call response, status update
P2 (Medium) | 4 hours | Next business day if after hours
P3 (Low) | 24 hours | Scheduled maintenance window
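If you track these commitments in your own tooling, the response times reduce to a lookup. The sketch below is illustrative only: `IncidentSeverity` and `responseDeadline` are hypothetical names, and the "next business day if after hours" rule for P2 is deliberately not modeled.

```typescript
type IncidentSeverity = 'P0' | 'P1' | 'P2' | 'P3';

// Response times from the table above, in minutes.
const RESPONSE_MINUTES: Record<IncidentSeverity, number> = {
  P0: 15,
  P1: 60,
  P2: 240,
  P3: 1440,
};

// Latest time by which a human response is expected.
// Does not model the P2 after-hours exception.
function responseDeadline(severity: IncidentSeverity, detectedAt: Date): Date {
  return new Date(detectedAt.getTime() + RESPONSE_MINUTES[severity] * 60000);
}

// Usage: a P0 detected at 10:00 UTC is due a response by 10:15 UTC.
const p0Deadline = responseDeadline('P0', new Date('2025-01-25T10:00:00Z'));
console.log(p0Deadline.toISOString());
```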

What You Should Do

Configure Fail Behavior

deltaos.configure({
  onEvaluationUnavailable: 'block',  // or 'allow-with-audit'
  onAuditUnavailable: 'queue',       // or 'block'
  onEscalationUnavailable: 'block',  // or 'allow-with-flag'
  healthCheckInterval: '30s'
});

Monitor Health Endpoints

const health = await deltaos.health.check();
// {
//   evaluation: { status: 'healthy', latencyP99: '45ms' },
//   audit: { status: 'healthy', queueDepth: 0 },
//   escalation: { status: 'healthy', pendingCount: 3 }
// }
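A health report shaped like the one above can be reduced to the list of components that need attention. This sketch assumes every component exposes a `status` string and that `'healthy'` is the only good value. The document does not list the other status values, so anything else is treated as unhealthy. `ComponentHealth`, `HealthReport`, and `unhealthyComponents` are hypothetical names.

```typescript
// Hypothetical types modeled on the sample response above.
interface ComponentHealth {
  status: string;
  [detail: string]: unknown; // latencyP99, queueDepth, pendingCount, and so on
}
type HealthReport = Record<string, ComponentHealth>;

// Names of components whose status is anything other than 'healthy'.
function unhealthyComponents(report: HealthReport): string[] {
  return Object.keys(report).filter((component) => report[component]?.status !== 'healthy');
}

// Usage: 'degraded' is an assumed status value used only for illustration.
const sampleReport: HealthReport = {
  evaluation: { status: 'healthy', latencyP99: '45ms' },
  audit: { status: 'degraded', queueDepth: 1200 },
  escalation: { status: 'healthy', pendingCount: 3 },
};
console.log(unhealthyComponents(sampleReport));
```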

Set Up Alerts

await deltaos.alerts.configure({
  channels: ['pagerduty:team-oncall', 'slack:#governance-alerts'],
  thresholds: {
    evaluationLatencyP99: '500ms',
    auditQueueDepth: 1000,
    escalationBacklog: 50
  }
});

Recovery Verification

After any incident, verify:

  1. Audit completeness: no gaps in the audit trail
  2. Judgment consistency: replay a sample of decisions
  3. Escalation processing: all queued escalations handled
  4. Value attribution: metrics caught up

// start and end bound the incident window; their values are not shown here
const verification = await deltaos.recovery.verify({
  incident: 'INC-2025-001',
  timeRange: { start, end }
});

console.log(verification);
// {
//   auditComplete: true,
//   gapsFound: 0,
//   decisionReplayMatch: '100%',
//   escalationsProcessed: 47,
//   valueAttributionLag: '0s'
// }
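The four checks in the list can be applied mechanically to a verification result. The sketch below assumes the field names from the sample output, assumes that `decisionReplayMatch` and `valueAttributionLag` arrive as strings such as `'100%'` and `'0s'`, and takes the number of escalations queued during the incident from your own records. `VerificationResult` and `verificationFailures` are hypothetical names.

```typescript
// Hypothetical type modeled on the sample output above.
interface VerificationResult {
  auditComplete: boolean;
  gapsFound: number;
  decisionReplayMatch: string; // assumed format, e.g. '100%'
  escalationsProcessed: number;
  valueAttributionLag: string; // assumed format, e.g. '0s'
}

// Returns one message per failed check. An empty array means recovery is verified.
function verificationFailures(v: VerificationResult, expectedEscalations: number): string[] {
  const failures: string[] = [];
  if (!v.auditComplete || v.gapsFound > 0) failures.push('audit trail has gaps');
  if (v.decisionReplayMatch !== '100%') failures.push('replayed decisions diverge');
  if (v.escalationsProcessed < expectedEscalations) failures.push('escalations still queued');
  if (v.valueAttributionLag !== '0s') failures.push('value attribution lagging');
  return failures;
}

// Usage with the sample values shown above.
const sampleVerification: VerificationResult = {
  auditComplete: true,
  gapsFound: 0,
  decisionReplayMatch: '100%',
  escalationsProcessed: 47,
  valueAttributionLag: '0s',
};
console.log(verificationFailures(sampleVerification, 47));
```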

See Also