LLM-Agent-First Health Check Protocol Specification
Authoritative standard for agent-native health monitoring and recovery

Version: 1.0.0 | Created: 2025-12-20
Status: DRAFT - Pending Implementation
Supersedes: Portions of VERSION-GUARD-FINAL-DESIGN.md
RFC 2119 Conformance
| Keyword | Meaning |
|---|---|
| MUST | Absolute requirement. Non-compliance is a specification violation. |
| MUST NOT | Absolute prohibition. |
| SHOULD | Recommended but not mandatory. Valid reasons may exist to ignore. |
| MAY | Optional. Implementations can choose to include or omit. |
Related Specifications
| Document | Relationship |
|---|---|
| LLM-AGENT-FIRST-SPEC.md | Parent specification for all LLM-agent-first design |
| VERSION-GUARD-FINAL-DESIGN.md | SUPERSEDED for error output format |
| VERSION-GUARD-SPEC.md | Original version guard proposal |
Executive Summary
Problem Statement
VERSION-GUARD-FINAL-DESIGN.md contains critical violations of LLM-Agent-First principles:

- Warnings suppressed in non-TTY mode: agents receive no notification of schema mismatch
- Errors output as plain text: no JSON error envelope for machine parsing
- Single exit code (25): cannot differentiate between recovery actions
- No comprehensive health check: only reactive validation, no proactive monitoring
- No multi-agent coordination: a lock exists, but there is no agent identity or handoff protocol
Solution Overview
This specification defines:

- `ct health` command: proactive health monitoring with structured output
- Granular exit codes: semantic codes enabling agent decision-making
- JSON error output: all errors as structured JSON from day 1
- Multi-agent coordination: agent identity, ownership, and handoff protocols
- Self-healing patterns: auto-fix with dry-run and backup guarantees
Part 1: Critical Gaps in VERSION-GUARD-FINAL-DESIGN
Gap 1: JSON Error Output Deferred to Phase 2
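Current Design (Line 274): JSON error output is deferred to Phase 2.
Gap 2: Warning Suppression in Non-TTY
Current Design: Warnings are suppressed in non-TTY mode, so agents receive no notification of a schema mismatch.
Gap 3: Single Exit Code Insufficient
Current Design: Only `EXIT_MIGRATION_REQUIRED=25`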
Problem: Agent cannot differentiate between:
- “Run migrate” (schema outdated)
- “Upgrade CLI” (project ahead of CLI)
- “Wait and retry” (migration in progress)
- “Human required” (major version mismatch)
Part 2: Exit Code Architecture
Exit Code Ranges
Complete Exit Code Table
Exit Code Recoverability Matrix
| Exit Code | Name | Recoverable | Recovery Action |
|---|---|---|---|
| 30 | SCHEMA_OUTDATED | Yes | ct migrate run |
| 31 | SCHEMA_INCOMPATIBLE | No | Human: major migration |
| 32 | SCHEMA_AHEAD | Yes | Upgrade CLI |
| 33 | SCHEMA_CORRUPT | No | Human: fix JSON manually |
| 34 | SCHEMA_UNKNOWN | No | Human: investigate |
| 35 | MIGRATION_IN_PROGRESS | Yes | Wait 1s, retry (max 5) |
| 36 | MIGRATION_FAILED | Yes | ct restore --latest |
| 37 | MIGRATION_ROLLBACK | No | Human: manual rollback |
| 40 | LOCK_HELD | Yes | Wait 100ms, retry (max 3) |
| 41 | SESSION_OWNED | No | Human: coordinate agents |
| 42 | TASK_CLAIMED | Yes | Request handoff or wait |
| 43 | HANDOFF_PENDING | Yes | Wait for handoff completion |
| 44 | AGENT_CONFLICT | No | Human: resolve conflict |
| 50 | HEALTH_ERROR | Partial | ct health --fix |
| 51 | HEALTH_WARNING | Yes | ct health --fix (optional) |
| 52 | HEALTH_UNFIXABLE | No | Human intervention required |
| 53 | FIX_FAILED | Yes | ct restore, retry |
| 54 | FIX_PARTIAL | Partial | Review, retry remaining |
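The matrix above can be turned into a lookup that an agent consults after each `ct` invocation. The following is a minimal sketch, not the specified Recovery Decision Algorithm: the `MATRIX` entries are copied from the table, while the `classify` function and its four outcome labels (`run`, `retry`, `review`, `escalate`) are assumptions introduced here for illustration.

```python
# Exit code -> (name, recoverable, recovery action), copied from the matrix.
MATRIX = {
    30: ("SCHEMA_OUTDATED", "yes", "ct migrate run"),
    31: ("SCHEMA_INCOMPATIBLE", "no", "Human: major migration"),
    32: ("SCHEMA_AHEAD", "yes", "Upgrade CLI"),
    33: ("SCHEMA_CORRUPT", "no", "Human: fix JSON manually"),
    34: ("SCHEMA_UNKNOWN", "no", "Human: investigate"),
    35: ("MIGRATION_IN_PROGRESS", "yes", "Wait 1s, retry (max 5)"),
    36: ("MIGRATION_FAILED", "yes", "ct restore --latest"),
    37: ("MIGRATION_ROLLBACK", "no", "Human: manual rollback"),
    40: ("LOCK_HELD", "yes", "Wait 100ms, retry (max 3)"),
    41: ("SESSION_OWNED", "no", "Human: coordinate agents"),
    42: ("TASK_CLAIMED", "yes", "Request handoff or wait"),
    43: ("HANDOFF_PENDING", "yes", "Wait for handoff completion"),
    44: ("AGENT_CONFLICT", "no", "Human: resolve conflict"),
    50: ("HEALTH_ERROR", "partial", "ct health --fix"),
    51: ("HEALTH_WARNING", "yes", "ct health --fix (optional)"),
    52: ("HEALTH_UNFIXABLE", "no", "Human intervention required"),
    53: ("FIX_FAILED", "yes", "ct restore, retry"),
    54: ("FIX_PARTIAL", "partial", "Review, retry remaining"),
}


def classify(exit_code: int) -> str:
    """Map an exit code to a coarse next step (labels are hypothetical)."""
    entry = MATRIX.get(exit_code)
    if entry is None:
        return "escalate"  # assumption: unknown codes are handled conservatively
    _name, recoverable, action = entry
    if recoverable == "no":
        return "escalate"  # matrix says a human is required
    if action.startswith("Wait"):
        return "retry"     # wait, then re-run the same command
    if action.startswith("ct "):
        return "run"       # matrix names a ct command to run
    return "review"        # recoverable, but not by one fixed command


if __name__ == "__main__":
    for code in (30, 35, 41, 54):
        print(code, MATRIX[code][0], "->", classify(code))
```

Keeping the table as data rather than as branching logic means a change to the matrix only requires editing one dictionary entry.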
Error Code Mapping
Part 3: Health Check Command
Command Specification
Health Check Categories
1. Schema Health (schema)
| Check ID | Description | Auto-Fixable |
|---|---|---|
| `schema.version.compatibility` | CLI vs project version | Yes (minor) |
| `schema.version.parse` | Version field parseable | No |
| `schema.validation` | JSON Schema validation | Partial |
| `schema.checksum` | Checksum integrity | Yes |
2. Data Health (data)
| Check ID | Description | Auto-Fixable |
|---|---|---|
| `data.task.id_unique` | No duplicate task IDs | No |
| `data.task.id_format` | All IDs match pattern | No |
| `data.dependency.valid` | All dependencies exist | Yes (remove) |
| `data.dependency.acyclic` | No circular dependencies | Yes (break) |
| `data.status.valid` | All statuses in enum | No |
| `data.timestamp.sane` | No future timestamps | Yes (set now) |
| `data.hierarchy.valid` | Parent/child relationships | Partial |
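To make the two dependency checks concrete, here is a minimal sketch of how `data.dependency.valid` and `data.dependency.acyclic` could be evaluated. It assumes tasks are available as a mapping from task ID to a list of dependency IDs; that shape, the function names, and the `T001`-style IDs are illustrative assumptions, not part of this specification.

```python
from typing import Dict, List, Optional


def find_missing_dependencies(tasks: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """data.dependency.valid: return {task_id: [dependency IDs that do not exist]}."""
    missing = {}
    for task_id, deps in tasks.items():
        absent = [dep for dep in deps if dep not in tasks]
        if absent:
            missing[task_id] = absent
    return missing


def find_cycle(tasks: Dict[str, List[str]]) -> Optional[List[str]]:
    """data.dependency.acyclic: return one cycle as a list of IDs, or None."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on current path, fully explored
    color = {task_id: WHITE for task_id in tasks}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        color[node] = GRAY
        path.append(node)
        for dep in tasks[node]:
            if dep not in tasks:
                continue  # missing dependencies are reported by the other check
            if color[dep] == GRAY:
                # dep is already on the current path, so the path loops back to it
                return path[path.index(dep):] + [dep]
            if color[dep] == WHITE:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        color[node] = BLACK
        return None

    for task_id in tasks:
        if color[task_id] == WHITE:
            cycle = visit(task_id)
            if cycle:
                return cycle
    return None


if __name__ == "__main__":
    example = {"T001": ["T002"], "T002": ["T003"], "T003": ["T001"], "T004": ["T999"]}
    print(find_missing_dependencies(example))
    print(find_cycle(example))
```

An auto-fix for the acyclic check would still need a rule for choosing which edge of the reported cycle to break; the sketch only detects the cycle.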
3. File Health (files)
| Check ID | Description | Auto-Fixable |
|---|---|---|
| `files.todo.exists` | todo.json exists | No |
| `files.todo.readable` | todo.json readable | No |
| `files.todo.writable` | todo.json writable | No |
| `files.todo.parseable` | Valid JSON | No |
| `files.config.exists` | Config file exists | Yes (create) |
| `files.backup.available` | Recent backup exists | Yes (create) |
4. Session Health (session)
| Check ID | Description | Auto-Fixable |
|---|---|---|
| `session.active.single` | Max 1 active task | Yes |
| `session.focus.valid` | Focus references valid task | Yes (clear) |
| `session.lock.stale` | No stale locks | Yes (release) |
| `session.state.consistent` | Session state consistent | Yes (end) |
5. Sync Health (sync)
| Check ID | Description | Auto-Fixable |
|---|---|---|
| `sync.todowrite.state` | Sync state valid | Yes |
| `sync.todowrite.conflicts` | No unresolved conflicts | No |
| `sync.todowrite.timestamp` | Reasonable last sync | No |
6. Coordination Health (coordination)
| Check ID | Description | Auto-Fixable |
|---|---|---|
| `coordination.lock.valid` | Lock file valid | Yes |
| `coordination.session.owner` | Session ownership clear | No |
| `coordination.agents.conflict` | No agent conflicts | No |
JSON Output Schema
Example Output
Part 4: Agent Recovery Protocol
Session Lifecycle with Health Checks
Recovery Decision Algorithm
Retry Protocol
| Exit Code Range | Initial Delay | Backoff Factor | Max Retries | Max Wait |
|---|---|---|---|---|
| 20-29 (Concurrency) | 100ms | 2x | 3 | 1.4s |
| 30-39 (Schema) | 1000ms | 1.5x | 5 | 7.6s |
| 40-49 (Coordination) | 100ms | 2x | 5 | 3.1s |
| 50-59 (Health) | 500ms | 1.5x | 3 | 1.6s |
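The retry parameters above describe exponential backoff per exit code range. The sketch below is one possible agent-side reading of the table: the initial delay is multiplied by the backoff factor after each attempt, up to the retry limit. The names `RETRY_POLICY`, `delays_ms`, and `run_with_retry` are assumptions for illustration. Summing the delays reproduces the 3.1s Max Wait for the 40-49 row; the other Max Wait figures may be derived differently, and the sketch does not try to reproduce them.

```python
import time
from typing import Callable, List

# (first code, last code, initial delay in ms, backoff factor, max retries)
RETRY_POLICY = [
    (20, 29, 100, 2.0, 3),    # Concurrency
    (30, 39, 1000, 1.5, 5),   # Schema
    (40, 49, 100, 2.0, 5),    # Coordination
    (50, 59, 500, 1.5, 3),    # Health
]


def delays_ms(exit_code: int) -> List[float]:
    """Return the wait before each retry for the range containing exit_code."""
    for low, high, initial, factor, max_retries in RETRY_POLICY:
        if low <= exit_code <= high:
            return [initial * factor ** attempt for attempt in range(max_retries)]
    return []  # success (0) and codes outside the table are not retried


def run_with_retry(run: Callable[[], int],
                   sleep: Callable[[float], None] = time.sleep) -> int:
    """Call run() and retry on the schedule chosen by its first exit code."""
    code = run()
    for delay in delays_ms(code):
        sleep(delay / 1000.0)
        code = run()
        if code == 0:
            break
    return code


if __name__ == "__main__":
    print(delays_ms(40))        # delays for LOCK_HELD, in milliseconds
    print(sum(delays_ms(40)))   # cumulative wait if every retry is used
```

A real agent would consult the recoverability matrix first and only enter this loop for codes whose recovery action is to wait and retry.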
Part 5: Self-Healing Patterns
Auto-Fix Classification
| Issue Type | Auto-Fixable | Risk Level | Rationale |
|---|---|---|---|
| Schema version outdated (minor) | YES | Low | Non-breaking, additive |
| Schema version outdated (major) | NO | Critical | Potential data loss |
| Checksum mismatch | YES | Low | Checksum is derived |
| Orphan dependencies | YES | Low | Cleans invalid refs |
| Duplicate task IDs | NO | Critical | Requires human choice |
| Circular dependencies | YES | Medium | Auto-picks edge to break |
| Future timestamps | YES | Low | Obvious correction |
| Invalid status enum | NO | High | Unknown intent |
| Missing required fields | PARTIAL | Medium | Can set defaults |
| File permission errors | NO | Critical | System-level |
| Stale lock | YES | Low | Cleanup orphaned state |
| Orphaned session | YES | Low | End orphaned session |
Risk Level Definitions
| Risk Level | Definition | Auto-Fix Policy |
|---|---|---|
| `low` | Fully reversible, no data loss | Auto-fix allowed |
| `medium` | Reversible with backup, minor modification | Auto-fix with backup |
| `high` | Potentially destructive, may lose data | Require `--force` |
| `critical` | Irreversible, significant impact | Human required |
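The auto-fix policy per risk level can be expressed as a small gate that a fixer consults before touching any file. This is a sketch under stated assumptions: the `Risk` enum and `may_auto_fix` function are hypothetical names, and "auto-fix with backup" is interpreted here as "only proceed when a backup can be created".

```python
from enum import Enum


class Risk(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    CRITICAL = "critical"


def may_auto_fix(risk: Risk, *, backup_available: bool, force: bool = False) -> bool:
    """Decide whether an automatic fix may proceed for the given risk level."""
    if risk is Risk.LOW:
        return True                 # auto-fix allowed
    if risk is Risk.MEDIUM:
        return backup_available     # auto-fix only with a backup (interpretation)
    if risk is Risk.HIGH:
        return force                # requires an explicit --force
    return False                    # critical: human required


if __name__ == "__main__":
    print(may_auto_fix(Risk.MEDIUM, backup_available=True))
    print(may_auto_fix(Risk.HIGH, backup_available=True))
```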
Dry-Run Semantics
All fix operations MUST support `--dry-run`.
Backup-First Guarantee
Every auto-fix operation MUST:

- Create backup before any modification
- Log operation to audit trail
- Verify success after fix
- Provide rollback command in output
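The four obligations above, together with the dry-run requirement, can be sketched as a single wrapper around any fix. Everything named here is an assumption for illustration: the `apply_fix` function, the backup file naming, the audit log format, the result keys, and the plain `cp` rollback command (a real implementation would presumably point at the CLI's own restore command instead).

```python
import copy
import json
import shutil
import time
from pathlib import Path
from typing import Callable


def apply_fix(path: Path, fix: Callable[[dict], dict], *,
              audit_log: Path, dry_run: bool = False) -> dict:
    """Apply fix() to the JSON document at path, backup-first."""
    original = json.loads(path.read_text())
    fixed = fix(copy.deepcopy(original))  # copy so the comparison below is valid
    result = {"dryRun": dry_run, "changed": fixed != original,
              "backup": None, "rollback": None}
    if dry_run or not result["changed"]:
        return result  # dry-run reports the outcome without writing anything

    # 1. Create backup before any modification.
    backup = path.with_name(f"{path.name}.{int(time.time())}.bak")
    shutil.copy2(path, backup)

    # 2. Log the operation to the audit trail (one JSON object per line).
    with audit_log.open("a") as log:
        log.write(json.dumps({"file": str(path), "backup": str(backup),
                              "time": int(time.time())}) + "\n")

    path.write_text(json.dumps(fixed, indent=2))

    # 3. Verify success after the fix; restore the backup if it did not stick.
    if json.loads(path.read_text()) != fixed:
        shutil.copy2(backup, path)
        raise RuntimeError(f"fix verification failed, restored {backup}")

    # 4. Provide a rollback command in the output.
    result["backup"] = str(backup)
    result["rollback"] = f"cp {backup} {path}"
    return result
```

Computing the fixed document before deciding whether to write lets the same code path serve both `--dry-run` and the real run, so the dry-run report cannot drift from what the fix would actually do.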
Part 6: Multi-Agent Coordination
Agent Identity
Every agent operation SHOULD include agent identity.
Lock File Schema
Coordination Commands
Part 7: Revised VERSION-GUARD-FINAL-DESIGN
Required Changes
| Original | Issue | Required Change |
|---|---|---|
| Exit code 25 only | Cannot differentiate | Exit codes 30-37 |
| JSON output in Phase 2 | Too late | JSON output in Phase 1 |
| Plain text errors | Not parseable | JSON error envelope |
| Warning suppression in non-TTY | Silent failure | JSON warnings to stderr |
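The last two rows amount to one rule: diagnostics are always written as JSON to stderr, whether or not a terminal is attached. The sketch below illustrates that rule only. The field names (`level`, `code`, `exitCode`, `message`), the `emit` helper, and the error code string `E_SCHEMA_OUTDATED` are hypothetical; the actual envelope is defined by the parent specification and is not reproduced here.

```python
import json
import sys
from typing import Optional, TextIO


def emit(level: str, code: str, exit_code: int, message: str,
         stream: Optional[TextIO] = None) -> int:
    """Write one JSON diagnostic line and return the exit code to use.

    There is deliberately no isatty() check: agents running in non-TTY
    mode receive exactly the same warnings and errors as interactive users.
    """
    out = sys.stderr if stream is None else stream
    payload = {"level": level, "code": code,
               "exitCode": exit_code, "message": message}
    out.write(json.dumps(payload) + "\n")
    return exit_code


if __name__ == "__main__":
    # Exit code 30 is SCHEMA_OUTDATED in the recoverability matrix.
    sys.exit(emit("error", "E_SCHEMA_OUTDATED", 30,
                  "Project schema is older than the CLI; run: ct migrate run"))
```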
Revised Fast Version Check
Revised Phase Plan
Phase 1 (v0.24.0) - CRITICAL:

- Add exit codes 30-37 (schema/version)
- JSON error output for version check (MOVED from Phase 2)
- JSON warning output to stderr
- Basic `ct health --quick` with schema check
- `_meta.lastWriterVersion` tracking

Phase 2:

- Full `ct health` command with all categories
- `ct health --fix` with dry-run and backup
- Exit codes 50-54 (health)
- Integrate version check into write scripts

Phase 3:

- Multi-agent coordination (exit codes 40-45)
- Lock/session/task ownership commands
- Conflict detection
- `ct migrate wizard` for batch migration
Part 8: Testing Requirements
Exit Code Tests
Health Check Tests
Part 9: Implementation Checklist
Phase 1 Checklist
- Add exit codes 30-37 to `lib/exit-codes.sh`
- Add error codes E_SCHEMA_* to `lib/error-codes.sh`
- Create `output_schema_error()` function
- Create `output_schema_warning()` function
- Update `fast_version_check()` for JSON output
- Create `scripts/health.sh` with `--quick` mode
- Add schema category checks
- Add `health.schema.json`
- Add tests for exit codes 30-37
- Add tests for `ct health --quick`
- Update VERSION-GUARD-FINAL-DESIGN.md
Phase 2 Checklist
- Add exit codes 50-54 to `lib/exit-codes.sh`
- Add error codes E_HEALTH_* to `lib/error-codes.sh`
- Implement full health check categories
- Implement `ct health --fix`
- Implement `ct health --fix --dry-run`
- Add `health-fix.schema.json`
- Integrate version check into write scripts
- Add backup-first guarantee
- Add tests for all health checks
- Add tests for fix operations
Phase 3 Checklist
- Add exit codes 40-45 to `lib/exit-codes.sh`
- Add error codes E_LOCK_*, E_SESSION_*, E_TASK_*
- Implement agent identity in lock files
- Implement `ct lock status`
- Implement `ct task owner`
- Implement `ct session handoff`
- Add coordination category to health check
- Add `coordination.schema.json`
- Add tests for coordination commands
Appendix A: JSON Schema Files
health.schema.json
See Part 3 for complete schema.
health-fix.schema.json
coordination.schema.json
Appendix B: Quick Reference
Exit Code Quick Reference
Health Command Quick Reference
Agent Recovery Quick Reference
Specification v1.0.0 - LLM-Agent-First Health Check Protocol | Created: 2025-12-20 | Status: DRAFT - Pending Implementation
