How does ΔOS integrate with existing agents?

ΔOS provides a lightweight SDK that wraps agent actions. Agents submit intents to ΔOS, receive judgments, and report outcomes. For most agent frameworks, integration takes less than a day.
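
The submit, judge, and report loop described above can be sketched as follows. This is a minimal illustration, not the actual SDK surface: `governedAction`, `submitIntent`, `reportOutcome`, and the in-memory `makeStubClient` are hypothetical names chosen for the example.

```javascript
// Hypothetical sketch: wraps an agent action in a submit -> judge -> report loop.
// `client` is any object exposing submitIntent() and reportOutcome(); these
// method names are assumptions for illustration, not the documented SDK API.
async function governedAction(client, intent, action) {
  const { judgment } = await client.submitIntent(intent);
  if (judgment !== 'allow') {
    // Blocked or escalated intents never execute the action.
    return { judgment, executed: false };
  }
  try {
    const result = await action();
    await client.reportOutcome(intent.id, { status: 'succeeded' });
    return { judgment, executed: true, result };
  } catch (err) {
    await client.reportOutcome(intent.id, { status: 'failed', error: String(err) });
    throw err;
  }
}

// In-memory stand-in for the real service, for local experimentation only.
function makeStubClient(policy) {
  const outcomes = [];
  return {
    outcomes,
    async submitIntent(intent) { return { judgment: policy(intent) }; },
    async reportOutcome(id, outcome) { outcomes.push({ id, ...outcome }); },
  };
}
```

The stub makes it possible to exercise the wrapper without a running ΔOS instance; a real integration would pass the SDK client instead.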

What deployment options does ΔOS support?

ΔOS supports Cloud (fully managed SaaS), VPC (deployed in your virtual private cloud), Hybrid (control plane in cloud, data plane in your infrastructure), and On-Premises (fully deployed in your data center) deployment modes.

ΔOS pricing scales with governance capacity and verified value, not headcount. Self-serve evaluation is available with read-only access and simulated data.

Incident Posture

What happens when things go wrong

Last reviewed: 2025-01-25

This document describes exactly what happens when ΔOS components fail. No marketing language—just the failure modes and their consequences.

Why This Document Exists

You need to know what breaks and how. We're documenting failure because understanding failure is how you build trust.

Failure Philosophy

Fail Closed, Not Open

When ΔOS cannot evaluate an Intent, the default is to block, not allow.

// If evaluation fails for any reason
intent.judgment = 'block';
intent.blockReason = 'Evaluation unavailable';

This trades availability for safety: ΔOS accepts wrongly blocking a valid action in order to avoid wrongly allowing a dangerous one.
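
A fail-closed wrapper on the agent side might look like the sketch below. The function name `evaluateFailClosed`, the `evaluate` callback, and the 5-second default timeout are assumptions made for illustration; only the `block` judgment and the `Evaluation unavailable` reason come from the snippet above.

```javascript
// Hypothetical sketch of a fail-closed wrapper. `evaluate` is any async
// function that returns a judgment object; its name and shape are assumptions.
async function evaluateFailClosed(evaluate, intent, timeoutMs = 5000) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error('Evaluation timed out')), timeoutMs);
  });
  try {
    // Whichever settles first wins: the evaluation or the timeout.
    return await Promise.race([evaluate(intent), timeout]);
  } catch (err) {
    // Any failure (error, timeout, unreachable service) results in a block.
    return { judgment: 'block', blockReason: 'Evaluation unavailable' };
  } finally {
    clearTimeout(timer);
  }
}
```

The key design point is that there is no code path in which an error yields `allow`: the only way to obtain an `allow` judgment is for the evaluation itself to return one.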

Degrade Gracefully

Partial failures reduce capability rather than causing total outage:

State	Intent Evaluation	Audit Recording	Escalation Routing	Value Attribution
Healthy	✓	✓	✓	✓
Partial Failure	✓	✓	✓	–
Total Failure	–	✓	–	–

Failure Scenarios

1. Evaluation Service Unavailable

LIM Evaluation Cannot Complete

What Happens

Intents cannot be evaluated against policy. All new Intents are blocked until service recovers.

Capability Status

Intent submission: degraded
Policy evaluation: blocked
Audit recording: normal
Human override: normal

Public Message

Automated governance temporarily unavailable. Actions require manual approval.

Recovery Path

Service auto-recovers when underlying infrastructure stabilizes. Manual override remains available for urgent actions.

Your options during this failure:

Use manual override for critical actions
Wait for service recovery
Activate emergency approval workflow

2. Audit Service Unavailable

Audit Trail Cannot Record

What Happens

Intents can be evaluated but decisions cannot be recorded. System enters audit-pending mode.

Capability Status

Intent submission: degraded
Policy evaluation: normal
Audit recording: blocked
Value attribution: blocked

Public Message

Governance active but audit recording delayed. All decisions will be recorded when service recovers.

Recovery Path

System queues audit records locally. Upon recovery, records are flushed to permanent storage with original timestamps.

Important: Decisions made during this period are still valid—they're just recorded later.
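
The queue-then-flush behavior can be illustrated with a small buffer that stamps each record at decision time and replays records in order once writes succeed again. This is a sketch under assumptions: `AuditBuffer`, the `write` callback, and the `decidedAt` field are hypothetical names, not part of the ΔOS SDK.

```javascript
// Hypothetical sketch: buffer audit records while the audit service is down,
// then flush them oldest-first with their original decision timestamps.
class AuditBuffer {
  constructor(write) {
    this.write = write;     // async function that persists one record (assumed)
    this.pending = [];
    this.flushing = false;
  }

  // Stamps the record at decision time, queues it, and attempts a flush.
  async record(decision, decidedAt = new Date()) {
    this.pending.push({ ...decision, decidedAt: decidedAt.toISOString() });
    return this.flush();
  }

  // Returns true when the queue is empty after this call, false otherwise.
  async flush() {
    if (this.flushing) return false; // another flush is already draining the queue
    this.flushing = true;
    try {
      while (this.pending.length > 0) {
        try {
          await this.write(this.pending[0]);
        } catch {
          return false; // still unavailable; records stay queued in order
        }
        this.pending.shift();
      }
      return true;
    } finally {
      this.flushing = false;
    }
  }
}
```

Because the timestamp is fixed when the decision is made rather than when the write succeeds, records flushed after recovery still reflect when each judgment actually happened.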

3. Escalation Routing Fails

Cannot Route to Human Reviewers

What Happens

Escalated Intents cannot reach human review queues. The corresponding actions stay blocked while their escalations are queued.

Capability Status

Allow judgments: normal
Block judgments: normal
Escalate judgments: blocked
Human notification: blocked

Public Message

Human review temporarily unavailable. Actions requiring approval are queued.

Recovery Path

Escalations are queued. Humans receive backlog when routing recovers. SLA timers pause during outage.
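
To make the paused-timer behavior concrete, here is a minimal sketch of an SLA clock that only counts time while routing is available. The class name `PausableSlaTimer` and its methods are hypothetical; all times are plain millisecond values supplied by the caller.

```javascript
// Hypothetical sketch: an SLA clock that stops counting while routing is down.
class PausableSlaTimer {
  constructor(limitMs, startedAt) {
    this.limitMs = limitMs;
    this.elapsedMs = 0;             // time already counted against the SLA
    this.runningSince = startedAt;  // null while the timer is paused
  }

  // Called when the routing outage begins.
  pause(now) {
    if (this.runningSince !== null) {
      this.elapsedMs += now - this.runningSince;
      this.runningSince = null;
    }
  }

  // Called when routing recovers.
  resume(now) {
    if (this.runningSince === null) this.runningSince = now;
  }

  remainingMs(now) {
    const running = this.runningSince === null ? 0 : now - this.runningSince;
    return Math.max(0, this.limitMs - this.elapsedMs - running);
  }
}
```

With this approach, time spent in the outage is never charged to the reviewer: only the intervals before `pause` and after `resume` reduce the remaining budget.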

4. Evidence Collection Fails

Cannot Gather Evidence for Evaluation

What Happens

LIMs cannot access context needed for evaluation. Conservative judgments applied.

Capability Status

Intent submission: normal
Full policy evaluation: degraded
Audit recording: normal

Public Message

Operating with reduced context. Some actions may require manual approval.

Recovery Path

The system retries evidence collection and falls back to conservative evaluation rules when evidence remains unavailable.
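
The retry-then-fallback pattern could be sketched as below. The function name, the `collect` callback, the retry count, the linear backoff, and the `mode` field are all assumptions made for illustration; the source does not specify how many retries occur or how they are spaced.

```javascript
// Hypothetical sketch: retry evidence collection, then fall back to
// conservative evaluation when evidence is still unavailable.
async function collectEvidenceOrFallback(collect, { attempts = 3, delayMs = 200 } = {}) {
  for (let attempt = 1; attempt <= attempts; attempt += 1) {
    try {
      return { evidence: await collect(), mode: 'full' };
    } catch {
      if (attempt < attempts) {
        // Linear backoff between attempts (an assumed policy).
        await new Promise((resolve) => setTimeout(resolve, delayMs * attempt));
      }
    }
  }
  // No evidence available: the caller should apply conservative rules.
  return { evidence: null, mode: 'conservative' };
}
```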

5. Total System Failure

ΔOS Completely Unavailable

What Happens

No governance functions available. Agents cannot submit Intents.

Capability Status

Intent submission: blocked
Policy evaluation: blocked
Audit recording: blocked
All governance: blocked

Public Message

Governance infrastructure unavailable. Agent actions blocked pending recovery.

Recovery Path

Full recovery required. Agents configured to fail closed will halt. Kill switch remains available via infrastructure controls.

Response Procedures

Automatic Responses

Condition	Automatic Response
Evaluation latency > 5s	Alert + log degradation
Evaluation error rate > 1%	Alert + investigate
Audit lag > 1 minute	Alert + monitor queue
Escalation queue > 100	Alert + increase routing capacity
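
The conditions in the table above can be expressed as a simple rule list. The thresholds and responses are taken from the table; the metric names (`evaluationLatencyMs`, `evaluationErrorRate`, `auditLagMs`, `escalationQueueSize`) and the function `triggeredResponses` are hypothetical, chosen only for this sketch.

```javascript
// Thresholds and responses transcribed from the table; metric names are assumed.
const AUTOMATIC_RESPONSES = [
  { metric: 'evaluationLatencyMs', limit: 5000, response: 'Alert + log degradation' },
  { metric: 'evaluationErrorRate', limit: 0.01, response: 'Alert + investigate' },
  { metric: 'auditLagMs', limit: 60000, response: 'Alert + monitor queue' },
  { metric: 'escalationQueueSize', limit: 100, response: 'Alert + increase routing capacity' },
];

// Returns the responses whose condition is exceeded by the given metrics.
// Metrics missing from the input never trigger a response.
function triggeredResponses(metrics) {
  return AUTOMATIC_RESPONSES
    .filter(({ metric, limit }) => metrics[metric] > limit)
    .map(({ response }) => response);
}
```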

Human Responses

Severity	Response Time	Actions
P0 (Critical)	15 minutes	All hands, customer notification
P1 (High)	1 hour	On-call response, status update
P2 (Medium)	4 hours	Next business day if after hours
P3 (Low)	24 hours	Scheduled maintenance window
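
For teams that track these targets in their own tooling, the response times can be turned into a deadline calculation. This is a sketch only: `responseDeadline` is a hypothetical helper, and it does not model the "next business day if after hours" rule for P2.

```javascript
// Response times from the table above, in minutes.
const RESPONSE_MINUTES = { P0: 15, P1: 60, P2: 240, P3: 1440 };

// Returns the latest time by which a response is due for an incident.
function responseDeadline(severity, openedAt) {
  const minutes = RESPONSE_MINUTES[severity];
  if (minutes === undefined) {
    throw new RangeError(`Unknown severity: ${severity}`);
  }
  return new Date(openedAt.getTime() + minutes * 60 * 1000);
}
```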

What You Should Do

Configure Fail Behavior

deltaos.configure({
  onEvaluationUnavailable: 'block',  // or 'allow-with-audit'
  onAuditUnavailable: 'queue',       // or 'block'
  onEscalationUnavailable: 'block',  // or 'allow-with-flag'
  healthCheckInterval: '30s'
});

Monitor Health Endpoints

const health = await deltaos.health.check();
// {
//   evaluation: { status: 'healthy', latencyP99: '45ms' },
//   audit: { status: 'healthy', queueDepth: 0 },
//   escalation: { status: 'healthy', pendingCount: 3 }
// }
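
Given a response shaped like the one above, a small helper can reduce it to the list of components that need attention. The helper name `unhealthyComponents` is hypothetical, and the sketch assumes that any status other than `'healthy'` indicates a problem; the source shows only the healthy case.

```javascript
// Hypothetical helper: lists components whose status is not 'healthy'.
// Expects the object shape shown in the health check example.
function unhealthyComponents(health) {
  return Object.entries(health)
    .filter(([, component]) => component.status !== 'healthy')
    .map(([name]) => name);
}
```

A polling loop could call this on each health check result and switch the agent into manual-approval mode whenever `evaluation` appears in the returned list.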

Set Up Alerts

await deltaos.alerts.configure({
  channels: ['pagerduty:team-oncall', 'slack:#governance-alerts'],
  thresholds: {
    evaluationLatencyP99: '500ms',
    auditQueueDepth: 1000,
    escalationBacklog: 50
  }
});
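
Latency values in these examples are strings such as `'45ms'` and `'500ms'`, so comparing a health reading against a threshold locally requires parsing them first. The sketch below assumes only the `ms`, `s`, and `m` suffixes are used; `parseDuration` and `exceedsLatencyThreshold` are hypothetical helpers, not SDK functions.

```javascript
// Milliseconds per supported unit suffix (an assumed set of units).
const UNIT_MS = { ms: 1, s: 1000, m: 60000 };

// Converts a string such as '500ms' or '30s' to a number of milliseconds.
function parseDuration(text) {
  const match = /^(\d+(?:\.\d+)?)(ms|s|m)$/.exec(text);
  if (!match) throw new Error(`Unrecognized duration: ${text}`);
  return Number(match[1]) * UNIT_MS[match[2]];
}

// True when the reported P99 latency is above the configured threshold.
function exceedsLatencyThreshold(health, thresholds) {
  return parseDuration(health.evaluation.latencyP99)
    > parseDuration(thresholds.evaluationLatencyP99);
}
```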

Recovery Verification

After any incident, verify:

Audit completeness — No gaps in the audit trail
Judgment consistency — Replay sample of decisions
Escalation processing — All queued escalations handled
Value attribution — Metrics caught up

// `start` and `end` are the bounds of the incident window, supplied by the caller
const verification = await deltaos.recovery.verify({
  incident: 'INC-2025-001',
  timeRange: { start, end }
});

console.log(verification);
// {
//   auditComplete: true,
//   gapsFound: 0,
//   decisionReplayMatch: '100%',
//   escalationsProcessed: 47,
//   valueAttributionLag: '0s'
// }
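
The four post-incident checks can be applied to a verification result mechanically. The sketch below is hypothetical: `failedRecoveryChecks` is not an SDK function, it assumes `decisionReplayMatch` and `valueAttributionLag` are strings as in the sample output, and it takes the expected number of queued escalations as a caller-supplied value because the result alone does not say how many were queued.

```javascript
// Hypothetical helper: returns the names of recovery checks that did not pass.
function failedRecoveryChecks(verification, expectedEscalations) {
  const failures = [];
  if (!verification.auditComplete || verification.gapsFound > 0) {
    failures.push('Audit completeness');
  }
  if (verification.decisionReplayMatch !== '100%') {
    failures.push('Judgment consistency');
  }
  if (verification.escalationsProcessed < expectedEscalations) {
    failures.push('Escalation processing');
  }
  if (verification.valueAttributionLag !== '0s') {
    failures.push('Value attribution');
  }
  return failures;
}
```

An empty array means all four checks passed; anything else names the areas that need follow-up before the incident is closed.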
