
CLEO Metrics Value Proof Specification

Version: 0.1.0
Status: DRAFT
Created: 2026-02-01
Epic: T2833

Problem Statement

CLEO claims to save context tokens and prevent hallucinations, but there is no mechanism to prove these claims:
  1. Token consumption: All metrics show 0 because there is no data source
  2. Manifest savings: In theory, manifest reads save tokens, but the savings have never been measured
  3. Hallucination prevention: Validators exist, but there is no before/after comparison
  4. Skill composition: Only a single skill is supported, and progressive loading is not measured

Goals

  1. Measure actual token usage - Before and after CLEO
  2. Prove manifest efficiency - Full file vs manifest-only reads
  3. Track validation impact - Violations caught, fixes applied
  4. Enable skill composition - Multiple skills with progressive disclosure

Part 1: Token Consumption Tracking

The Solution: OpenTelemetry Integration

Claude Code does track actual token usage and exposes it through its OpenTelemetry metrics:
claude_code.token.usage (tokens)
├── type: "input" | "output" | "cacheRead" | "cacheCreation"
└── model: "claude-sonnet-4-5-20250929" etc.
Available data per API request:
  • input_tokens - Actual input tokens consumed
  • output_tokens - Actual output tokens generated
  • cache_read_tokens - Tokens read from cache
  • cache_creation_tokens - Tokens used to create cache
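
As a rough illustration of how per-request counts could be rolled up by the `type` attribute, the sketch below sums token counts per type. The input format (one `<type> <count>` pair per line) and the function name `sum_tokens_by_type` are assumptions made for this example; they do not describe the exporter's actual output format.

```shell
#!/usr/bin/env bash
# Hypothetical helper: sums token counts per type.
# Assumed input on stdin: one "<type> <count>" pair per line, e.g. "input 1200".
sum_tokens_by_type() {
    awk 'NF == 2 { totals[$1] += $2 }
         END { for (t in totals) print t, totals[t] }' | sort
}

# Usage with illustrative (not measured) numbers
printf '%s\n' "input 1200" "output 300" "cacheRead 5000" "input 800" | sum_tokens_by_type
```

The output is piped through `sort` because awk does not guarantee the iteration order of array keys.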

How to Enable Telemetry

Option 1: Console Export (development)
export CLAUDE_CODE_ENABLE_TELEMETRY=1
export OTEL_METRICS_EXPORTER=console
export OTEL_METRIC_EXPORT_INTERVAL=5000
Option 2: File Export (CLEO integration)
export CLAUDE_CODE_ENABLE_TELEMETRY=1
export OTEL_METRICS_EXPORTER=otlp
export OTEL_EXPORTER_OTLP_PROTOCOL=http/json
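# OTLP exporters send over HTTP or gRPC, so the endpoint must be an http(s) URL, not file://.
# Assumption: a local OpenTelemetry Collector listens on port 4318 and writes to .cleo/metrics/otel/
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318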

Part 2: Manifest Token Savings

Hypothesis

Reading manifest summaries instead of full agent output files saves significant tokens.

Measurement Approach

# Track manifest reads
track_manifest_read() {
    local entry_id="$1"
    local entry_content="$2"  # text of the manifest entry that was read
    local tokens
    tokens=$(estimate_tokens "$entry_content")
    log_token_event "manifest_read" "$tokens" "$entry_id"
}

# Estimate full file equivalent
estimate_full_file_tokens() {
    local file_path="$1"
    local byte_count
    byte_count=$(wc -c < "$file_path")  # wc -c counts bytes, close to chars for ASCII text
    echo $((byte_count / 4))  # ~4 chars per token
}
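
`track_manifest_read` calls an `estimate_tokens` helper that is not defined in this section. A minimal sketch of what such a fallback could look like is shown below, assuming it uses the same ~4 characters per token heuristic as `estimate_full_file_tokens`; the real helper may differ.

```shell
#!/usr/bin/env bash
# Hypothetical fallback for the estimate_tokens helper referenced above.
# Assumption: the same ~4 characters-per-token heuristic applies to strings.
estimate_tokens() {
    local text="$1"
    # ${#text} is the string length in characters for the current locale
    echo $(( ${#text} / 4 ))
}

# Usage
estimate_tokens "Manifest entry summary text"
```

Integer division rounds down, so very short strings estimate to 0 tokens. This is a heuristic only; actual counts should come from telemetry where it is available.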

Expected Results

| Approach      | Tokens per Entry | 10 Entries |
|---------------|------------------|------------|
| Manifest only | ~200             | ~2,000     |
| Full file     | ~2,000           | ~20,000    |
| Savings       | 90%              | ~18,000    |
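
The savings figure follows from (full - manifest) / full. The sketch below works through that arithmetic; `savings_percent` is a hypothetical helper name, and the inputs are the expected values from the table, not measurements.

```shell
#!/usr/bin/env bash
# Hypothetical helper: percentage of tokens saved by reading the manifest
# instead of the full file. Uses integer arithmetic (truncates toward zero).
savings_percent() {
    local manifest_tokens="$1"
    local full_tokens="$2"
    if [ "$full_tokens" -le 0 ]; then
        echo 0          # avoid division by zero
        return 1
    fi
    echo $(( (full_tokens - manifest_tokens) * 100 / full_tokens ))
}

# Usage with the expected 10-entry totals
savings_percent 2000 20000   # prints 90
```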

Part 3: Validation Impact Measurement

Tracking Violations Caught

{
  "timestamp": "2026-02-01T01:23:45Z",
  "source_id": "T1234",
  "validation_result": {
    "passed": false,
    "score": 75,
    "violations": [
      {
        "requirement": "RSCH-001",
        "severity": "error",
        "message": "Research task modified code",
        "fix": "Revert code changes"
      }
    ]
  }
}

Value Demonstration

  • Violations caught: Count per period
  • Prevention rate: Violations / Total completions
  • By protocol: Which protocols catch most issues
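
The prevention rate is a simple ratio of violations to total completions. A minimal sketch, assuming both counts have already been extracted from the validation log (`prevention_rate` is a hypothetical name, and the usage numbers are illustrative):

```shell
#!/usr/bin/env bash
# Hypothetical helper: violations as a percentage of total completions,
# printed with one decimal place.
prevention_rate() {
    local violations="$1"
    local completions="$2"
    awk -v v="$violations" -v c="$completions" 'BEGIN {
        if (c <= 0) { print "0.0"; exit }   # no completions yet
        printf "%.1f\n", v * 100 / c
    }'
}

# Usage: 3 violations across 40 completions
prevention_rate 3 40   # prints 7.5
```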

Part 4: A/B Testing Framework

Test Scenarios

| Scenario  | Description                          |
|-----------|--------------------------------------|
| Baseline  | Direct implementation without CLEO   |
| With CLEO | Orchestrator + subagents + manifest  |

Metrics to Compare

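| Metric              | Baseline | With CLEO | Expected |
|---------------------|----------|-----------|----------|
| Total tokens        | Higher   | Lower     | -50%+    |
| Files read          | Many     | Few       | -80%+    |
| Validation failures | N/A      | Caught    | >0       |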
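
One way to turn the expected reductions into a pass/fail check is sketched below. The function name `meets_reduction_target` and the sample numbers are assumptions for illustration; no A/B results exist yet.

```shell
#!/usr/bin/env bash
# Hypothetical check: did the CLEO run reduce a metric by at least
# target_pct percent relative to the baseline run?
# Returns 0 if the target is met, 1 if missed, 2 if the baseline is invalid.
meets_reduction_target() {
    local baseline="$1" with_cleo="$2" target_pct="$3"
    [ "$baseline" -gt 0 ] || return 2
    # Cross-multiplied form of (baseline - with_cleo) / baseline >= target_pct / 100
    [ $(( (baseline - with_cleo) * 100 )) -ge $(( baseline * target_pct )) ]
}

# Usage with illustrative token totals and the -50% target
if meets_reduction_target 40000 18000 50; then
    echo "total tokens: target met"
else
    echo "total tokens: target missed"
fi
```

Cross-multiplying avoids integer division, so a reduction just under the target is not rounded up into a pass.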

Implementation Status

  • OpenTelemetry integration design
  • Token estimation fallback
  • Manifest validation logging
  • A/B testing framework
  • Metrics dashboard
