
Web Aggregation Pipeline Specification

Version: 1.0.0 · Status: DRAFT · Effective: v1.0.0+ · Last Updated: 2025-12-19

RFC 2119 Conformance

The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “NOT RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals.

Preamble

This specification defines a unified web aggregation pipeline system designed for LLM agents. The system aggregates content from multiple web sources (websites, articles, blogs, discussions) and distills it into actionable insights while minimizing context window consumption. The specification draws on research spanning 1000+ documented developer experiences and on analysis of production tools including Tavily, Exa, Firecrawl, Crawl4AI, and their associated MCP servers.
Design Principle: Stage-based architecture with specialized tools per stage, not monolithic solutions.

Executive Summary

Mission

Enable LLM agents to aggregate web content from diverse sources and synthesize actionable insights without context window bloat.

Core Principles

  1. Stage Separation: Discovery, Extraction, Normalization, Distillation, Output
  2. Tool Specialization: Best tool per stage, not one tool for everything
  3. Token Efficiency: 60-80% reduction through selective extraction and caching
  4. Graceful Degradation: Fallback chains when primary tools fail
  5. Citation Preservation: Every insight traceable to source

Key Decisions

| Decision | Choice | Rationale |
|----------|--------|-----------|
| Primary search | Exa (neural) or Tavily (general) | Semantic discovery vs. quick facts |
| Primary extraction | Firecrawl or Tavily Extract | LLM-ready markdown output |
| Schema extraction | Firecrawl + Zod/Pydantic | Built-in structured extraction |
| Cost optimization | Crawl4AI + Ollama | Zero API cost for self-hosted |
| Reasoning | Sequential Thinking MCP | Complex multi-step synthesis |
| Documentation | Context7 MCP | Official library docs |

Part 1: Pipeline Architecture

1.1 Stage Model

The pipeline MUST implement five sequential stages:
┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│  DISCOVERY  │ →  │ EXTRACTION  │ →  │NORMALIZATION│ →  │ DISTILLATION│ →  │   OUTPUT    │
└─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘
     Find              Fetch             Clean              Synthesize         Format
     sources           content           structure          insights           report

1.2 Stage Responsibilities

| Stage | Input | Output | Failure Behavior |
|-------|-------|--------|------------------|
| Discovery | Query or seed URLs | Prioritized source list | Empty list → halt pipeline |
| Extraction | Source references | Raw content + metadata | Partial data → continue with available |
| Normalization | Raw content | Clean markdown + metadata | Degraded quality → warn, continue |
| Distillation | Normalized content | Synthesized insights | Reduced confidence → flag uncertainty |
| Output | Insights + citations | Formatted report | Format fallback → plain text |

1.3 Data Flow Contract

Each stage MUST accept input conforming to:
StageInput {
  data: any                    // Stage-specific payload
  metadata: {
    source_count: integer      // Number of sources processed
    confidence: float          // 0.0-1.0 quality score
    degraded: boolean          // Whether fallbacks were used
  }
  context: {
    original_query: string     // User's original request
    stage_history: string[]    // Stages completed
  }
}
Each stage MUST produce output conforming to:
StageOutput {
  data: any                    // Stage-specific result
  metadata: {
    duration_ms: integer       // Processing time
    items_processed: integer   // Count of items handled
    items_failed: integer      // Count of failures
    confidence: float          // Quality score
  }
  errors: Error[]              // Non-fatal errors encountered
}
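The two contracts above can be sketched as a thin stage wrapper. This is an illustrative sketch, not a mandated implementation: the field names mirror the spec, but the flattened dataclass layout, the per-item error handling, and the confidence heuristic (fraction of items that succeeded) are assumptions.

```python
from dataclasses import dataclass, field
import time

@dataclass
class StageInput:
    data: object                              # stage-specific payload
    source_count: int = 0
    confidence: float = 1.0
    degraded: bool = False
    original_query: str = ""
    stage_history: list = field(default_factory=list)

@dataclass
class StageOutput:
    data: object                              # stage-specific result
    duration_ms: int
    items_processed: int
    items_failed: int
    confidence: float
    errors: list = field(default_factory=list)

def run_stage(name, fn, stage_in: StageInput) -> StageOutput:
    """Apply `fn` to each input item; non-fatal errors are collected,
    never raised, so later stages can continue with partial data."""
    start = time.monotonic()
    results, errors = [], []
    for item in stage_in.data:
        try:
            results.append(fn(item))
        except Exception as exc:              # record and continue
            errors.append(f"{name}: {exc}")
    stage_in.stage_history.append(name)
    return StageOutput(
        data=results,
        duration_ms=int((time.monotonic() - start) * 1000),
        items_processed=len(results),
        items_failed=len(errors),
        confidence=len(results) / max(len(stage_in.data), 1),
        errors=errors,
    )
```

A stage that fails on every item still returns a well-formed `StageOutput` with `confidence` 0.0, matching the non-fatal error semantics above.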

Part 2: Discovery Stage

2.1 Purpose

The Discovery stage identifies relevant sources from a natural language query or expands from seed URLs.

2.2 Input Modes

The system MUST support two input modes:
| Mode | Input | Behavior |
|------|-------|----------|
| Query Mode | Natural language question | Execute search, return ranked URLs |
| URL Mode | List of seed URLs | Skip search, pass URLs to Extraction |

2.3 Tool Requirements

2.3.1 Primary Search Tools

The system SHOULD use one of the following for query-based discovery:

Exa Search (semantic/research-focused):
  • Neural embeddings for semantic matching
  • findSimilar for expanding from seed URLs
  • Category filtering: research_paper, news, tweet, company, github
  • Date range filtering via startPublishedDate/endPublishedDate
Tavily Search (general-purpose):
  • AI-optimized results with domain filtering
  • Topic modes: general, news, finance
  • Time range: day, week, month, year
  • Up to 300 include domains, 150 exclude domains

2.3.2 Tool Selection Criteria

| Query Type | Recommended Tool | Rationale |
|------------|------------------|-----------|
| Academic/research | Exa (neural + category=research_paper) | Semantic matching |
| Find similar content | Exa (findSimilar) | Unique capability |
| Quick factual lookup | Tavily | Fast, reliable |
| News/current events | Tavily (topic=news) | Time filtering |
| Technical documentation | Context7 | Curated, version-specific |

2.4 Output Requirements

Discovery MUST produce a ranked list of sources:
DiscoveryResult {
  sources: [{
    url: string                // Canonical URL
    title: string              // Page title
    relevance_score: float     // 0.0-1.0 ranking
    source_type: enum          // article|discussion|documentation|research
    published_date: date?      // If available
    domain: string             // Extracted domain
  }]
  query_metadata: {
    original_query: string
    search_provider: string
    result_count: integer
  }
}

2.5 Constraints

  • Discovery SHOULD return 5-20 sources per query
  • Discovery MUST NOT return more than 100 sources without pagination
  • Discovery SHOULD deduplicate URLs from multiple search providers
  • Discovery MUST preserve the original query for citation purposes
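The deduplication requirement above can be sketched with the standard library. The canonicalization rules (lowercase host, drop fragment, strip trailing slash) are assumptions about what counts as "the same URL"; tracking-parameter stripping is deliberately omitted.

```python
from urllib.parse import urlsplit, urlunsplit

def canonicalize(url: str) -> str:
    """Normalize a URL for deduplication across search providers."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )

def dedupe_sources(sources):
    """Keep the highest-relevance entry per canonical URL, ranked by score."""
    best = {}
    for s in sources:
        key = canonicalize(s["url"])
        if key not in best or s["relevance_score"] > best[key]["relevance_score"]:
            best[key] = s
    return sorted(best.values(), key=lambda s: s["relevance_score"], reverse=True)
```

When the same page surfaces from both Exa and Tavily with different scores, the higher-scoring entry survives.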

Part 3: Extraction Stage

3.1 Purpose

The Extraction stage fetches content from discovered URLs and prepares it for normalization.

3.2 Tool Requirements

3.2.1 Primary Extraction Tools

Tavily Extract:
  • Up to 20 URLs per request
  • Output formats: markdown, text
  • Extraction depths: basic, advanced
  • Handles JavaScript rendering
Firecrawl Scrape:
  • Schema-based LLM extraction (Zod/Pydantic)
  • Batch processing with webhooks
  • Actions support (click, wait, type)
  • Self-hosting option available
Crawl4AI (self-hosted):
  • Zero API cost with Ollama
  • JsonCssExtractionStrategy for structured data
  • LLMExtractionStrategy for complex layouts
  • Memory-adaptive batch processing

3.2.2 Tool Selection by Content Type

| Content Type | Primary Tool | Fallback |
|--------------|--------------|----------|
| Static HTML | Tavily Extract | Firecrawl |
| JavaScript SPA | Firecrawl (with actions) | Playwright direct |
| Structured data | Crawl4AI (JsonCssExtractionStrategy) | Firecrawl schema |
| Documentation | Context7 | Direct markdown fetch |
| Reddit/forums | Tavily (domain filter) | PRAW API |

3.3 Output Requirements

Extraction MUST produce:
ExtractionResult {
  content: [{
    url: string                // Source URL
    title: string              // Page title
    raw_content: string        // Full extracted text
    content_type: string       // text/html, text/markdown
    extracted_at: timestamp    // Fetch time
    metadata: {
      author: string?
      published_date: date?
      word_count: integer
      language: string?
    }
  }]
  failures: [{
    url: string
    error: string
    recoverable: boolean
  }]
}

3.4 Constraints

  • Extraction MUST handle JavaScript-rendered content
  • Extraction SHOULD cache successful fetches for 1-24 hours based on content type
  • Extraction MUST respect rate limits (see Part 8)
  • Extraction MUST NOT store content from paywalled sources without authorization
  • Extraction SHOULD timeout after 30 seconds per URL
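The 1-24 hour caching constraint above can be sketched as a TTL cache keyed by URL. The per-type TTL mapping is an assumption (the spec only gives the 1-24h range), and the injectable clock exists purely to make the sketch testable.

```python
import time

# Assumed TTLs within the spec's 1-24h range; tune per deployment.
TTL_BY_TYPE = {
    "news": 3600,             # 1h: fast-changing
    "discussion": 4 * 3600,   # 4h
    "article": 12 * 3600,     # 12h
    "documentation": 24 * 3600,
}

class FetchCache:
    def __init__(self, clock=time.monotonic):
        self._store = {}
        self._clock = clock

    def put(self, url, content, source_type="article"):
        ttl = TTL_BY_TYPE.get(source_type, 3600)
        self._store[url] = (content, self._clock() + ttl)

    def get(self, url):
        """Return cached content, or None if absent or expired."""
        entry = self._store.get(url)
        if entry is None:
            return None
        content, expires = entry
        if self._clock() > expires:
            del self._store[url]   # expired: force a re-fetch
            return None
        return content
```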

Part 4: Normalization Stage

4.1 Purpose

The Normalization stage cleans extracted content, removes boilerplate, and structures it for LLM consumption.

4.2 Processing Requirements

4.2.1 Boilerplate Removal

The system MUST remove:
  • Navigation elements
  • Footer content
  • Advertisements
  • Cookie banners
  • Sidebar widgets
The system SHOULD use:
  • Trafilatura (97.8% precision) for general content
  • Mozilla Readability for articles
  • Custom selectors for known sites

4.2.2 Markdown Conversion

Normalized content MUST be in markdown format with:
  • Preserved heading hierarchy (H1-H6)
  • Code blocks with language hints
  • Lists and tables
  • Links with original URLs
  • No inline styles or scripts

4.2.3 Metadata Extraction

The system SHOULD extract metadata from:
  1. JSON-LD structured data (priority)
  2. Schema.org microdata
  3. Open Graph tags
  4. HTML meta tags
  5. Heuristic patterns (bylines, dates)
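The priority order above is a first-wins merge: a field filled by JSON-LD is never overwritten by Open Graph or meta tags. A minimal sketch, assuming each upstream parser has already produced a flat dict of fields:

```python
def extract_metadata(candidates):
    """Merge metadata candidates in priority order; the first source that
    supplies a non-empty value for a field wins. `candidates` is an
    ordered list of (source_name, fields_dict) pairs."""
    merged, provenance = {}, {}
    for source_name, fields in candidates:
        for key, value in fields.items():
            if value and key not in merged:
                merged[key] = value
                provenance[key] = source_name   # record where each field came from
    return merged, provenance
```

Tracking provenance per field makes it cheap to audit why, say, a byline disagrees with the visible page.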

4.3 Output Requirements

NormalizedResult {
  documents: [{
    url: string
    title: string
    content_markdown: string    // Clean markdown
    metadata: {
      author: string?
      published_date: date?
      modified_date: date?
      description: string?
      tags: string[]?
      word_count: integer
      estimated_reading_time: integer
    }
    quality_score: float        // 0.0-1.0
  }]
  statistics: {
    total_input_tokens: integer
    total_output_tokens: integer
    compression_ratio: float
  }
}

4.4 Token Optimization

  • Normalization SHOULD achieve 90-96% token reduction from raw HTML
  • Normalization MUST preserve semantic structure
  • Normalization SHOULD target 500-2000 tokens per source after cleaning

Part 5: Distillation Stage

5.1 Purpose

The Distillation stage synthesizes normalized content into actionable insights with proper attribution.

5.2 Processing Patterns

5.2.1 Map-Reduce Summarization

For multiple sources, the system SHOULD:
  1. Map: Summarize each source independently (parallel, cheap model)
  2. Reduce: Merge summaries, resolve conflicts (expensive model)
Sources[1..N] → Map(summarize, 500 tokens each) → Reduce(synthesize) → Output
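The map/reduce split above can be sketched as follows. `summarize` and `synthesize` are injected stand-ins for the cheap and expensive model calls respectively, not real APIs; only the parallel-map / single-reduce shape is the point.

```python
from concurrent.futures import ThreadPoolExecutor

def map_reduce_summarize(sources, summarize, synthesize, max_workers=4):
    """Map: summarize each source independently in parallel (cheap model).
    Reduce: merge the per-source summaries once (expensive model)."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        summaries = list(pool.map(summarize, sources))  # preserves input order
    return synthesize(summaries)
```

Because the map step is embarrassingly parallel, wall-clock time is dominated by the slowest single source plus one reduce call.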

5.2.2 Hierarchical Summarization

For long documents, the system SHOULD:
  1. Level 0: Raw content
  2. Level 1: Section summaries
  3. Level 2: Document summary
  4. Level 3: Executive insight

5.2.3 Claim Extraction

The system MUST extract claims with provenance:
Claim {
  statement: string
  confidence: enum(high|medium|low)
  support_type: enum(direct|inferred|contested)
  sources: [{
    url: string
    excerpt: string
    reliability_tier: integer  // 1-5
  }]
}

5.3 Sequential Thinking Integration

For complex synthesis tasks, the system SHOULD use Sequential Thinking MCP:
| Complexity | Use Sequential Thinking? | Rationale |
|------------|--------------------------|-----------|
| Simple factual | No | Direct extraction sufficient |
| Multi-source comparison | Yes | Requires structured reasoning |
| Consensus detection | Yes | Multi-step analysis |
| Contradiction resolution | Yes | Explicit reasoning trail |

5.4 Output Requirements

DistillationResult {
  executive_summary: string       // 2-3 sentences
  key_findings: [{
    finding: string
    confidence: float
    source_count: integer
    citations: string[]           // URLs
  }]
  themes: string[]                // Common topics across sources
  contradictions: [{
    claim_a: string
    claim_b: string
    resolution: string?
  }]
  actionable_items: string[]      // Concrete recommendations
  gaps: string[]                  // What wasn't found
  source_reliability: {
    tier_1_count: integer         // Official docs, specs
    tier_2_count: integer         // Academic, authoritative
    tier_3_count: integer         // Reputable blogs
    tier_4_count: integer         // Community content
    tier_5_count: integer         // General web
  }
}

5.5 Constraints

  • Distillation MUST NOT produce claims without source attribution
  • Distillation MUST flag contradictions explicitly
  • Distillation SHOULD express uncertainty when sources disagree
  • Distillation MUST preserve original URLs for all citations

Part 6: Output Stage

6.1 Purpose

The Output stage formats distilled insights into a structured, actionable document.

6.2 Output Formats

The system MUST support:
| Format | Use Case |
|--------|----------|
| Markdown Report | Human reading, documentation |
| Structured JSON | Programmatic consumption |
| Compact Summary | Token-constrained contexts |

6.3 Report Schema

6.3.1 Markdown Report Template

# Research: {topic}

## Key Takeaways
1. {finding_1} [confidence: {score}]
2. {finding_2} [confidence: {score}]
3. {finding_3} [confidence: {score}]

## Evidence Summary
| Finding | Sources | Confidence |
|---------|---------|------------|
| {claim} | {count} | {level} |

## Recommendations
- **Immediate**: {action}
- **Follow-up**: {action}

## Caveats
- {limitation_1}
- {limitation_2}

## Sources
1. [{title}]({url}) - {tier} source

6.3.2 JSON Schema

OutputReport {
  metadata: {
    query: string
    generated_at: timestamp
    sources_analyzed: integer
    overall_confidence: float
  }
  executive_summary: {
    findings: string[]
    recommendations: string[]
    confidence_notes: string
  }
  detailed_findings: [{
    topic: string
    claims: [{
      statement: string
      support_level: enum
      source_ids: string[]
    }]
    contradictions: [{
      claim_a: string
      claim_b: string
      resolution: string?
    }]
  }]
  sources: [{
    id: string
    url: string
    title: string
    reliability_tier: integer
    accessed: timestamp
  }]
  gaps: string[]
  next_steps: string[]
}

Part 7: MCP Server Configuration

7.1 Required MCP Servers

The system MUST configure these MCP servers:
| Server | Purpose | Required |
|--------|---------|----------|
| Tavily MCP | Search and extraction | Yes |
| Sequential Thinking | Complex reasoning | Yes |
| Context7 | Documentation lookup | Recommended |
| Exa MCP | Semantic search | Optional |
| Firecrawl MCP | Schema extraction | Optional |

7.2 Configuration Schema

{
  "mcpServers": {
    "tavily": {
      "command": "npx",
      "args": ["-y", "tavily-mcp@latest"],
      "env": {
        "TAVILY_API_KEY": "{api_key}"
      }
    },
    "sequential-thinking": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-sequential-thinking"]
    },
    "context7": {
      "command": "npx",
      "args": ["-y", "@upstash/context7-mcp@latest"]
    },
    "exa": {
      "command": "npx",
      "args": ["-y", "exa-mcp-server"],
      "env": {
        "EXA_API_KEY": "{api_key}"
      }
    },
    "firecrawl": {
      "command": "npx",
      "args": ["-y", "firecrawl-mcp"],
      "env": {
        "FIRECRAWL_API_KEY": "{api_key}"
      }
    }
  }
}

7.3 Tool Invocation Patterns

7.3.1 Discovery

// Query mode
tavily_search({
  query: "user question",
  max_results: 10,
  search_depth: "advanced",
  include_domains: ["trusted.com"],
  time_range: "month"
})

// Similarity expansion
exa_find_similar({
  url: "seed_url",
  num_results: 5,
  exclude_source_domain: true
})

7.3.2 Extraction

// Multi-URL extraction
tavily_extract({
  urls: ["url1", "url2", "url3"],
  extract_depth: "advanced",
  format: "markdown"
})

// Schema-based extraction
firecrawl_extract({
  urls: ["url"],
  schema: {/* JSON Schema */},
  prompt: "extraction instructions"
})

7.3.3 Synthesis

// Complex reasoning
sequential_thinking({
  thought: "Analyzing sources for consensus...",
  thought_number: 1,
  total_thoughts: 5,
  next_thought_needed: true
})

Part 8: Rate Limits and Constraints

8.1 Provider Rate Limits

| Provider | Endpoint | Free Tier | Paid Tier |
|----------|----------|-----------|-----------|
| Tavily | Search | 100 RPM | 1000 RPM |
| Tavily | Extract | 100 RPM | 1000 RPM |
| Tavily | Crawl | 100 RPM | 100 RPM |
| Exa | Search | 5 QPS | Custom |
| Exa | Contents | 50 QPS | Custom |
| Firecrawl | Scrape | 10-5000 RPM | Plan-based |

8.2 Batch Size Limits

| Operation | Maximum |
|-----------|---------|
| Tavily Extract URLs | 20 per request |
| Firecrawl Batch | Plan-dependent |
| Exa Search Results | 100 per query |

8.3 Timeout Requirements

| Operation | Default | Maximum |
|-----------|---------|---------|
| Single URL fetch | 30s | 60s |
| Batch operation | 120s | 300s |
| Search query | 10s | 30s |
| LLM extraction | 60s | 120s |

Part 9: Token Optimization

9.1 Budget Allocation

The system SHOULD allocate tokens as follows:
| Component | Budget | Flexibility |
|-----------|--------|-------------|
| System prompt | 2K | Fixed |
| Retrieved context | 8K | Variable |
| Conversation history | 4K | Rolling |
| Current query | 1K | Variable |
| Reserved output | 4K | Fixed |
| Buffer | 2K | Safety |

9.2 Optimization Strategies

9.2.1 Content Compression

The system MUST achieve:
  • 90-96% reduction from raw HTML via boilerplate removal
  • 50-70% reduction via fit_markdown format
  • Semantic chunking with 10-15% overlap

9.2.2 Caching Layers

| Layer | Purpose | TTL |
|-------|---------|-----|
| Exact match | Identical queries | 1h |
| Semantic | Similar queries | 15m |
| Source | URL content | 1-24h |
| Embedding | Text chunks | 7d |

9.2.3 Model Routing

| Query Complexity | Model Tier | Cost Impact |
|------------------|------------|-------------|
| Simple (70%) | Nano (gpt-4.1-nano, haiku) | Baseline |
| Medium (20%) | Standard (gpt-4.1-mini, sonnet) | 10x |
| Complex (10%) | Premium (gpt-4.1, opus) | 100x |
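A router matching the tiers above could look like the sketch below. The model names follow the table; the complexity heuristic (source count plus whether synthesis is needed) is entirely an assumption, standing in for whatever classifier a deployment actually uses.

```python
# Tier names from the table above; heuristic thresholds are assumptions.
MODEL_TIERS = {
    "simple": "gpt-4.1-nano",
    "medium": "gpt-4.1-mini",
    "complex": "gpt-4.1",
}

def route_model(source_count: int, needs_synthesis: bool) -> str:
    """Route cheap by default, escalating only when the task demands it."""
    if needs_synthesis and source_count > 5:
        return MODEL_TIERS["complex"]   # multi-source synthesis
    if source_count > 1:
        return MODEL_TIERS["medium"]    # comparison across a few sources
    return MODEL_TIERS["simple"]        # single-source factual lookup
```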

9.3 Progressive Disclosure

The system SHOULD implement progressive context loading:
Level 0: Summaries only (500 tokens)
  ↓ (if insufficient)
Level 1: Relevant sections (2000 tokens)
  ↓ (if insufficient)
Level 2: Full content (unlimited)
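The escalation ladder above can be sketched as a loop over lazy loaders. The `sufficient` predicate stands in for whatever judgment call (often an LLM check) decides that a level answers the query; both names are illustrative.

```python
def load_progressively(levels, sufficient):
    """Walk context levels from cheapest to fullest, returning the first
    level the caller judges sufficient. `levels` maps level number to a
    zero-argument loader; loaders for deeper levels are never invoked
    unless needed, which is the token saving."""
    context = ""
    for level in sorted(levels):
        context = levels[level]()
        if sufficient(context):
            return level, context
    return max(levels), context   # fall through to full content
```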

Part 10: Error Handling

10.1 Error Categories

| Category | Retry Strategy | Fallback |
|----------|----------------|----------|
| Transient (429, 503, timeout) | Exponential backoff, 3-5 attempts | Wait and retry |
| Permanent (404, 401, invalid) | No retry | Skip source, log |
| Partial (incomplete response) | No retry | Use available data |
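The transient/permanent split and the backoff strategy can be sketched as follows. The set of transient status codes follows the categories above; the full-jitter variant of exponential backoff and the injectable `rng` are assumptions made for testability.

```python
import random

TRANSIENT_CODES = {429, 503}

def classify(status_code=None, timed_out=False) -> str:
    """Map a failure to its retry category."""
    if timed_out or status_code in TRANSIENT_CODES:
        return "transient"
    return "permanent"

def backoff_delays(attempts=4, base=1.0, cap=30.0, rng=random.random):
    """Exponential backoff with full jitter: each delay is a random
    fraction of min(cap, base * 2**n), avoiding retry stampedes."""
    return [rng() * min(cap, base * 2 ** n) for n in range(attempts)]
```

Only `transient` failures enter the backoff loop; `permanent` failures are skipped and logged immediately.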

10.2 Fallback Chains

Stage: Extraction
├── Primary: Tavily Extract
│   ├── Failure → Fallback 1: Firecrawl Scrape
│   │   ├── Failure → Fallback 2: Direct fetch + Trafilatura
│   │   │   └── Failure → Return URL only, no content

10.3 Circuit Breaker

The system SHOULD implement circuit breakers:
  • OPEN after 5 failures in 60 seconds
  • HALF_OPEN after 30 seconds cooldown
  • CLOSED after 2 consecutive successes
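The three transitions above can be sketched as a small state machine. Thresholds match the bullets; the sliding-window failure tracking and the injectable clock are implementation assumptions.

```python
import time

class CircuitBreaker:
    """OPEN after `threshold` failures within `window` seconds; HALF_OPEN
    after `cooldown` seconds; CLOSED after `needed` consecutive successes."""
    def __init__(self, threshold=5, window=60.0, cooldown=30.0,
                 needed=2, clock=time.monotonic):
        self.threshold, self.window = threshold, window
        self.cooldown, self.needed = cooldown, needed
        self.clock = clock
        self.failures, self.successes = [], 0
        self.state, self.opened_at = "CLOSED", None

    def allow(self) -> bool:
        if self.state == "OPEN" and self.clock() - self.opened_at >= self.cooldown:
            self.state, self.successes = "HALF_OPEN", 0   # probe traffic
        return self.state != "OPEN"

    def record_success(self):
        if self.state == "HALF_OPEN":
            self.successes += 1
            if self.successes >= self.needed:
                self.state, self.failures = "CLOSED", []

    def record_failure(self):
        now = self.clock()
        if self.state == "HALF_OPEN":                      # probe failed: reopen
            self.state, self.opened_at = "OPEN", now
            return
        self.failures = [t for t in self.failures if now - t < self.window]
        self.failures.append(now)
        if len(self.failures) >= self.threshold:
            self.state, self.opened_at = "OPEN", now
```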

Part 11: Source Reliability

11.1 Reliability Tiers

| Tier | Source Type | Base Score | Examples |
|------|-------------|------------|----------|
| 1 | Official docs, RFCs, specs | 0.95 | W3C, IETF, vendor docs |
| 2 | Academic, peer-reviewed | 0.90 | arXiv, journals |
| 3 | Authoritative blogs | 0.80 | Major tech company blogs |
| 4 | Community resources | 0.70 | Stack Overflow, GitHub |
| 5 | General web | 0.50 | General articles, forums |

11.2 Freshness Decay

reliability = base_score * freshness_multiplier
freshness_multiplier = 1.0 - (days_since_update / 365 * decay_rate)

decay_rate = {
  fast_moving: 0.3    // JS frameworks, AI tools
  moderate: 0.1       // HTTP specs, databases
  stable: 0.02        // TCP/IP, core protocols
}
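The decay formula above, directly in code. One addition not in the formula: the multiplier is clamped to [0, 1], an assumption so that very stale fast-moving sources bottom out at zero rather than going negative.

```python
DECAY_RATE = {
    "fast_moving": 0.3,   # JS frameworks, AI tools
    "moderate": 0.1,      # HTTP specs, databases
    "stable": 0.02,       # TCP/IP, core protocols
}

def reliability(base_score: float, days_since_update: float, category: str) -> float:
    """base_score * freshness_multiplier, with the multiplier clamped
    to [0, 1] (clamping is an added assumption, not in the spec)."""
    multiplier = 1.0 - (days_since_update / 365 * DECAY_RATE[category])
    return base_score * max(0.0, min(1.0, multiplier))
```

A one-year-old fast-moving source keeps 70% of its base score; a fresh stable source keeps essentially all of it.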

11.3 Consensus Detection

The system SHOULD calculate consensus scores:
consensus_score = Σ(source_weight × agrees) / Σ(source_weight)

Thresholds:
  > 0.8  → Strong consensus (state as established)
  0.5-0.8 → Moderate agreement (note majority view)
  0.2-0.5 → Mixed/contested (present multiple views)
  < 0.2  → Strong disagreement (highlight controversy)
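The weighted-agreement formula and thresholds above, as a minimal sketch. Source weights would typically be the reliability scores from Section 11.2; the `(weight, agrees)` pair representation is an assumption.

```python
def consensus_score(votes) -> float:
    """Σ(source_weight × agrees) / Σ(source_weight).
    `votes` is a list of (source_weight, agrees: bool) pairs."""
    total = sum(w for w, _ in votes)
    if total == 0:
        return 0.0
    return sum(w for w, agrees in votes if agrees) / total

def consensus_label(score: float) -> str:
    """Map a score onto the thresholds above."""
    if score > 0.8:
        return "strong consensus"
    if score >= 0.5:
        return "moderate agreement"
    if score >= 0.2:
        return "mixed/contested"
    return "strong disagreement"
```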

Part 12: llms.txt Integration

12.1 Discovery

The system SHOULD check for llms.txt at:
  • https://{domain}/llms.txt
  • https://{domain}/llms-full.txt
  • https://{domain}/docs/llms.txt
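The probe sequence above can be sketched with an injected fetcher so no network I/O is baked in. `fetch` is a stand-in (e.g. a wrapped HTTP GET returning the body on 200, None otherwise); both helper names are illustrative.

```python
def llms_txt_candidates(domain: str):
    """The three probe locations, in priority order."""
    return [
        f"https://{domain}/llms.txt",
        f"https://{domain}/llms-full.txt",
        f"https://{domain}/docs/llms.txt",
    ]

def find_llms_txt(domain: str, fetch):
    """Return (url, body) for the first candidate that resolves,
    or (None, None) to signal fallback to semantic HTML parsing."""
    for url in llms_txt_candidates(domain):
        body = fetch(url)
        if body:
            return url, body
    return None, None
```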

12.2 Usage

When llms.txt exists:
  1. Parse section structure
  2. Follow links to .md versions
  3. Prioritize llms.txt content over scraped content

12.3 Fallback

When llms.txt is absent:
  1. Use semantic HTML parsing
  2. Follow sitemap.xml structure
  3. Target documentation framework patterns (Docusaurus, Sphinx, GitBook)

Appendix A: Decision Rationale

| Decision | Alternatives Considered | Why Chosen |
|----------|-------------------------|------------|
| Stage-based pipeline | Monolithic tool | Flexibility, best tool per task |
| Exa for semantic search | Google/Bing scraping | Neural embeddings, findSimilar unique |
| Markdown as interchange | JSON, HTML | Token-efficient, LLM-friendly |
| Multi-level caching | Single cache | Different TTLs per content type |
| Model routing | Single model | 60-85% cost reduction |

Appendix B: Industry Precedents

| System | Pattern | Relevance |
|--------|---------|-----------|
| LangChain RAG | Chunk → Embed → Retrieve | Chunking strategies |
| Perplexity | Search → Extract → Synthesize | Pipeline model |
| Firecrawl | Schema-based extraction | Structured output |
| RAPTOR | Hierarchical summarization | Token efficiency |
| RouteLLM | Model cascade | Cost optimization |

Appendix C: Glossary

| Term | Definition |
|------|------------|
| Discovery | Finding relevant URLs from a query |
| Extraction | Fetching content from URLs |
| Normalization | Cleaning content for LLM consumption |
| Distillation | Synthesizing insights from content |
| Boilerplate | Navigation, footers, ads: non-content HTML |
| Semantic chunking | Splitting by meaning, not character count |
| Circuit breaker | Pattern to prevent cascading failures |
| Consensus score | Weighted agreement across sources |

Appendix D: Related Documents

| Document | Relationship |
|----------|--------------|
| SPEC-BIBLE-GUIDELINES.md | Authoritative for specification format |
| RCSD-PIPELINE-SPEC.md | This spec implements the RESEARCH stage of the RCSD Pipeline (Part 2) |
| WEB-AGGREGATION-PIPELINE-IMPLEMENTATION-REPORT.md | Tracks implementation status |

End of Specification