Web Aggregation Pipeline Specification
Version: 1.0.0 | Status: DRAFT | Effective: v1.0.0+ | Last Updated: 2025-12-19
RFC 2119 Conformance
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals.
Preamble
This specification defines a unified web aggregation pipeline system designed for LLM agents. The system aggregates content from multiple web sources (websites, articles, blogs, discussions) and distills it into actionable insights while minimizing context window consumption. The specification draws on research across 1000+ developer experience reports and analysis of production tools including Tavily, Exa, Firecrawl, Crawl4AI, and associated MCP servers.
Design Principle: Stage-based architecture with specialized tools per stage, not monolithic solutions.
Executive Summary
Mission
Enable LLM agents to aggregate web content from diverse sources and synthesize actionable insights without context window bloat.
Core Principles
- Stage Separation: Discovery, Extraction, Normalization, Distillation, Output
- Tool Specialization: Best tool per stage, not one tool for everything
- Token Efficiency: 60-80% reduction through selective extraction and caching
- Graceful Degradation: Fallback chains when primary tools fail
- Citation Preservation: Every insight traceable to source
Key Decisions
| Decision | Choice | Rationale |
|---|---|---|
| Primary search | Exa (neural) or Tavily (general) | Semantic discovery vs. quick facts |
| Primary extraction | Firecrawl or Tavily Extract | LLM-ready markdown output |
| Schema extraction | Firecrawl + Zod/Pydantic | Built-in structured extraction |
| Cost optimization | Crawl4AI + Ollama | Zero API cost for self-hosted |
| Reasoning | Sequential Thinking MCP | Complex multi-step synthesis |
| Documentation | Context7 MCP | Official library docs |
Part 1: Pipeline Architecture
1.1 Stage Model
The pipeline MUST implement five sequential stages: Discovery, Extraction, Normalization, Distillation, and Output.
1.2 Stage Responsibilities
| Stage | Input | Output | Failure Behavior |
|---|---|---|---|
| Discovery | Query or seed URLs | Prioritized source list | Empty list → halt pipeline |
| Extraction | Source references | Raw content + metadata | Partial data → continue with available |
| Normalization | Raw content | Clean markdown + metadata | Degraded quality → warn, continue |
| Distillation | Normalized content | Synthesized insights | Reduced confidence → flag uncertainty |
| Output | Insights + citations | Formatted report | Format fallback → plain text |
1.3 Data Flow Contract
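Each stage exchanges a common envelope with the next. The schema block itself did not survive into this revision, so the following non-normative sketch stands in; the field names are illustrative assumptions, not part of the spec:

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class StageEnvelope:
    """Illustrative inter-stage contract: payload plus provenance."""
    stage: str                                         # producing stage, e.g. "discovery"
    payload: Any                                       # stage-specific data (URLs, markdown, claims)
    sources: list[str] = field(default_factory=list)   # originating URLs, kept for citation
    warnings: list[str] = field(default_factory=list)  # degraded-quality notes (see 1.2)
```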
Each stage MUST accept input conforming to the pipeline's common data flow contract, carrying the stage payload together with its source provenance.
Part 2: Discovery Stage
2.1 Purpose
The Discovery stage identifies relevant sources from a natural language query or expands from seed URLs.
2.2 Input Modes
The system MUST support two input modes:

| Mode | Input | Behavior |
|---|---|---|
| Query Mode | Natural language question | Execute search, return ranked URLs |
| URL Mode | List of seed URLs | Skip search, pass URLs to Extraction |
2.3 Tool Requirements
2.3.1 Primary Search Tools
The system SHOULD use one of the following for query-based discovery:

Exa Search (semantic/research-focused):
- Neural embeddings for semantic matching
- `findSimilar` for expanding from seed URLs
- Category filtering: `research_paper`, `news`, `tweet`, `company`, `github`
- Date range filtering via `startPublishedDate` / `endPublishedDate`

Tavily Search (general-purpose):
- AI-optimized results with domain filtering
- Topic modes: `general`, `news`, `finance`
- Time ranges: `day`, `week`, `month`, `year`
- Up to 300 include domains, 150 exclude domains
2.3.2 Tool Selection Criteria
| Query Type | Recommended Tool | Rationale |
|---|---|---|
| Academic/research | Exa (neural + category=research_paper) | Semantic matching |
| Find similar content | Exa (findSimilar) | Unique capability |
| Quick factual lookup | Tavily | Fast, reliable |
| News/current events | Tavily (topic=news) | Time filtering |
| Technical documentation | Context7 | Curated, version-specific |
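The selection rules in the table above reduce to a simple routing function. The query-type labels below are illustrative, not a normative taxonomy:

```python
def select_discovery_tool(query_type: str) -> str:
    """Map a classified query type to the recommended search tool (per 2.3.2)."""
    routes = {
        "academic": "exa:category=research_paper",
        "similar": "exa:findSimilar",
        "factual": "tavily",
        "news": "tavily:topic=news",
        "docs": "context7",
    }
    # Tavily is the general-purpose default when classification is uncertain.
    return routes.get(query_type, "tavily")
```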
2.4 Output Requirements
Discovery MUST produce a ranked list of sources.
2.5 Constraints
- Discovery SHOULD return 5-20 sources per query
- Discovery MUST NOT return more than 100 sources without pagination
- Discovery SHOULD deduplicate URLs from multiple search providers
- Discovery MUST preserve the original query for citation purposes
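Cross-provider deduplication (required above as SHOULD) typically normalizes scheme, host case, and trailing slashes before comparing. A minimal stdlib sketch:

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url: str) -> str:
    """Canonicalize a URL enough to catch cross-provider duplicates."""
    parts = urlsplit(url)
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path.rstrip("/") or "/",
        parts.query,
        "",  # drop fragments: they never change the fetched content
    ))

def dedupe(urls: list[str]) -> list[str]:
    """Keep the first occurrence of each normalized URL, preserving rank order."""
    seen: set[str] = set()
    out: list[str] = []
    for url in urls:
        key = normalize_url(url)
        if key not in seen:
            seen.add(key)
            out.append(url)
    return out
```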
Part 3: Extraction Stage
3.1 Purpose
The Extraction stage fetches content from discovered URLs and prepares it for normalization.
3.2 Tool Requirements
3.2.1 Primary Extraction Tools
Tavily Extract:
- Up to 20 URLs per request
- Output formats: `markdown`, `text`
- Extraction depths: `basic`, `advanced`
- Handles JavaScript rendering

Firecrawl:
- Schema-based LLM extraction (Zod/Pydantic)
- Batch processing with webhooks
- Actions support (click, wait, type)
- Self-hosting option available

Crawl4AI:
- Zero API cost with Ollama
- JsonCssExtractionStrategy for structured data
- LLMExtractionStrategy for complex layouts
- Memory-adaptive batch processing
3.2.2 Tool Selection by Content Type
| Content Type | Primary Tool | Fallback |
|---|---|---|
| Static HTML | Tavily Extract | Firecrawl |
| JavaScript SPA | Firecrawl (with actions) | Playwright direct |
| Structured data | Crawl4AI (JsonCssExtractionStrategy) | Firecrawl schema |
| Documentation | Context7 | Direct markdown fetch |
| Reddit/forums | Tavily (domain filter) | PRAW API |
3.3 Output Requirements
Extraction MUST produce raw content together with fetch metadata for each source.
3.4 Constraints
- Extraction MUST handle JavaScript-rendered content
- Extraction SHOULD cache successful fetches for 1-24 hours based on content type
- Extraction MUST respect rate limits (see Part 8)
- Extraction MUST NOT store content from paywalled sources without authorization
- Extraction SHOULD timeout after 30 seconds per URL
Part 4: Normalization Stage
4.1 Purpose
The Normalization stage cleans extracted content, removes boilerplate, and structures it for LLM consumption.
4.2 Processing Requirements
4.2.1 Boilerplate Removal
The system MUST remove:
- Navigation elements
- Footer content
- Advertisements
- Cookie banners
- Sidebar widgets

Recommended tools:
- Trafilatura (97.8% precision) for general content
- Mozilla Readability for articles
- Custom selectors for known sites
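Where a dedicated extractor is unavailable, tag-level boilerplate can be dropped with a small HTML filter. This sketch handles only semantic tags (nav, footer, aside); ads and cookie banners need site-specific selectors, and this is no substitute for Trafilatura or Readability:

```python
from html.parser import HTMLParser

BOILERPLATE_TAGS = {"nav", "footer", "aside", "script", "style"}

class BoilerplateStripper(HTMLParser):
    """Collect text that lies outside nav/footer/aside/script/style subtrees."""
    def __init__(self):
        super().__init__()
        self.depth = 0       # >0 while inside a boilerplate subtree
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in BOILERPLATE_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in BOILERPLATE_TAGS and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())

def strip_boilerplate(html: str) -> str:
    parser = BoilerplateStripper()
    parser.feed(html)
    return "\n".join(parser.chunks)
```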
4.2.2 Markdown Conversion
Normalized content MUST be in markdown format with:
- Preserved heading hierarchy (H1-H6)
- Code blocks with language hints
- Lists and tables
- Links with original URLs
- No inline styles or scripts
4.2.3 Metadata Extraction
The system SHOULD extract metadata from:
- JSON-LD structured data (priority)
- Schema.org microdata
- Open Graph tags
- HTML meta tags
- Heuristic patterns (bylines, dates)
4.3 Output Requirements
Normalization MUST produce clean markdown together with the extracted metadata for each source.
4.4 Token Optimization
- Normalization SHOULD achieve 90-96% token reduction from raw HTML
- Normalization MUST preserve semantic structure
- Normalization SHOULD target 500-2000 tokens per source after cleaning
Part 5: Distillation Stage
5.1 Purpose
The Distillation stage synthesizes normalized content into actionable insights with proper attribution.
5.2 Processing Patterns
5.2.1 Map-Reduce Summarization
For multiple sources, the system SHOULD:
- Map: Summarize each source independently (parallel, cheap model)
- Reduce: Merge summaries, resolve conflicts (expensive model)
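With the model calls stubbed out, the map-reduce pattern above reduces to a few lines; `summarize` and `merge` stand in for the actual LLM calls:

```python
from concurrent.futures import ThreadPoolExecutor

def map_reduce_summarize(sources, summarize, merge):
    """Map: summarize each source in parallel (cheap model, per 9.2.3).
    Reduce: merge the per-source summaries in one call (expensive model)."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(summarize, sources))  # order is preserved
    return merge(partials)
```

In practice the map step targets the nano tier and the reduce step the premium tier, matching the model routing in 9.2.3.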
5.2.2 Hierarchical Summarization
For long documents, the system SHOULD:
- Level 0: Raw content
- Level 1: Section summaries
- Level 2: Document summary
- Level 3: Executive insight
5.2.3 Claim Extraction
The system MUST extract claims with provenance, linking every claim to the source it came from.
5.3 Sequential Thinking Integration
For complex synthesis tasks, the system SHOULD use Sequential Thinking MCP:

| Complexity | Use Sequential Thinking? | Rationale |
|---|---|---|
| Simple factual | No | Direct extraction sufficient |
| Multi-source comparison | Yes | Requires structured reasoning |
| Consensus detection | Yes | Multi-step analysis |
| Contradiction resolution | Yes | Explicit reasoning trail |
5.4 Output Requirements
Distillation MUST produce synthesized insights with source citations, flagging uncertainty where sources disagree.
5.5 Constraints
- Distillation MUST NOT produce claims without source attribution
- Distillation MUST flag contradictions explicitly
- Distillation SHOULD express uncertainty when sources disagree
- Distillation MUST preserve original URLs for all citations
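The attribution constraints above can be enforced at the type level, so an unattributed claim cannot be constructed at all. Field names here are illustrative:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Claim:
    """A distilled claim that cannot exist without attribution (5.5)."""
    text: str
    source_urls: tuple[str, ...]  # original URLs, preserved for citation
    confidence: float             # lowered when sources disagree
    contradicted: bool = False    # set when a contradiction is detected

    def __post_init__(self):
        if not self.source_urls:
            raise ValueError("claim rejected: no source attribution")
```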
Part 6: Output Stage
6.1 Purpose
The Output stage formats distilled insights into a structured, actionable document.
6.2 Output Formats
The system MUST support:

| Format | Use Case |
|---|---|
| Markdown Report | Human reading, documentation |
| Structured JSON | Programmatic consumption |
| Compact Summary | Token-constrained contexts |
6.3 Report Schema
6.3.1 Markdown Report Template
6.3.2 JSON Schema
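The JSON schema body is not reproduced in this revision. A plausible shape, consistent with the citation and confidence requirements of Part 5 (all field names are illustrative assumptions):

```python
import json

# Hypothetical report shape: insights with citations, plus run metadata.
report = {
    "query": "example query",
    "generated_at": "2025-12-19T00:00:00Z",
    "insights": [
        {
            "claim": "example claim text",
            "confidence": 0.8,
            "citations": ["https://example.com/source"],
        }
    ],
    "sources_consulted": 7,
}

# Per 5.5, every insight must carry at least one citation.
assert all(item["citations"] for item in report["insights"])
serialized = json.dumps(report, indent=2)
```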
Part 7: MCP Server Configuration
7.1 Required MCP Servers
The system MUST configure these MCP servers:

| Server | Purpose | Required |
|---|---|---|
| Tavily MCP | Search and extraction | Yes |
| Sequential Thinking | Complex reasoning | Yes |
| Context7 | Documentation lookup | Recommended |
| Exa MCP | Semantic search | Optional |
| Firecrawl MCP | Schema extraction | Optional |
7.2 Configuration Schema
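The configuration body is not reproduced in this revision. MCP clients conventionally accept an `mcpServers` map; a hedged sketch follows (package identifiers shown are believed current but should be checked against each server's README):

```json
{
  "mcpServers": {
    "tavily": {
      "command": "npx",
      "args": ["-y", "tavily-mcp"],
      "env": { "TAVILY_API_KEY": "..." }
    },
    "sequential-thinking": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-sequential-thinking"]
    }
  }
}
```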
7.3 Tool Invocation Patterns
7.3.1 Discovery
7.3.2 Extraction
7.3.3 Synthesis
Part 8: Rate Limits and Constraints
8.1 Provider Rate Limits
| Provider | Endpoint | Free Tier | Paid Tier |
|---|---|---|---|
| Tavily | Search | 100 RPM | 1000 RPM |
| Tavily | Extract | 100 RPM | 1000 RPM |
| Tavily | Crawl | 100 RPM | 100 RPM |
| Exa | Search | 5 QPS | Custom |
| Exa | Contents | 50 QPS | Custom |
| Firecrawl | Scrape | 10-5000 RPM | Plan-based |
8.2 Batch Size Limits
| Operation | Maximum |
|---|---|
| Tavily Extract URLs | 20 per request |
| Firecrawl Batch | Plan-dependent |
| Exa Search Results | 100 per query |
8.3 Timeout Requirements
| Operation | Default | Maximum |
|---|---|---|
| Single URL fetch | 30s | 60s |
| Batch operation | 120s | 300s |
| Search query | 10s | 30s |
| LLM extraction | 60s | 120s |
Part 9: Token Optimization
9.1 Budget Allocation
The system SHOULD allocate tokens as follows:

| Component | Budget | Flexibility |
|---|---|---|
| System prompt | 2K | Fixed |
| Retrieved context | 8K | Variable |
| Conversation history | 4K | Rolling |
| Current query | 1K | Variable |
| Reserved output | 4K | Fixed |
| Buffer | 2K | Safety |
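The allocations above sum to 21K tokens; a deployment should verify that this total fits the target model's context window. A quick check using the table's numbers:

```python
BUDGET = {  # tokens, from the 9.1 table
    "system_prompt": 2_000,
    "retrieved_context": 8_000,
    "history": 4_000,
    "query": 1_000,
    "reserved_output": 4_000,
    "buffer": 2_000,
}

def fits(context_limit: int) -> bool:
    """True when the full allocation fits inside the model's window."""
    return sum(BUDGET.values()) <= context_limit
```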
9.2 Optimization Strategies
9.2.1 Content Compression
The system MUST achieve:
- 90-96% reduction from raw HTML via boilerplate removal
- 50-70% reduction via the `fit_markdown` format
- Semantic chunking with 10-15% overlap
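The 10-15% overlap named above amounts to starting each chunk slightly before the previous one ended. True semantic chunking would split on meaning boundaries; this sketch shows only the overlap arithmetic on character offsets:

```python
def chunk_with_overlap(text: str, size: int = 1000, overlap_frac: float = 0.1) -> list[str]:
    """Split text into `size`-char chunks; each chunk begins overlap_frac*size
    characters before the previous chunk ended, so boundary context is kept."""
    step = int(size * (1 - overlap_frac))
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break
    return chunks
```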
9.2.2 Caching Layers
| Layer | Purpose | TTL |
|---|---|---|
| Exact match | Identical queries | 1h |
| Semantic | Similar queries | 15m |
| Source | URL content | 1-24h |
| Embedding | Text chunks | 7d |
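Each layer in the table above can be backed by a small timestamped cache; this sketch implements a single layer (the exact-match layer, with its 1h TTL passed in as seconds):

```python
import time

class TTLCache:
    """Exact-match cache with per-entry expiry (one layer from 9.2.2)."""
    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._store: dict[str, tuple[float, object]] = {}

    def get(self, key: str):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() > expires_at:
            del self._store[key]  # lazily evict on read
            return None
        return value

    def put(self, key: str, value) -> None:
        self._store[key] = (time.monotonic() + self.ttl, value)
```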
9.2.3 Model Routing
| Query Complexity | Model Tier | Cost Impact |
|---|---|---|
| Simple (70%) | Nano (gpt-4.1-nano, haiku) | Baseline |
| Medium (20%) | Standard (gpt-4.1-mini, sonnet) | 10x |
| Complex (10%) | Premium (gpt-4.1, opus) | 100x |
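The traffic split in the table implies a blended cost of 0.70 x 1 + 0.20 x 10 + 0.10 x 100 = 12.7x the nano baseline, versus 100x for sending everything to the premium tier; real traffic mixes vary, which is consistent with the 60-85% reduction figure in Appendix A. The arithmetic:

```python
# Traffic shares and relative cost multipliers from the 9.2.3 table.
TIERS = {"nano": (0.70, 1), "standard": (0.20, 10), "premium": (0.10, 100)}

def blended_cost() -> float:
    """Expected cost per query, in multiples of the nano baseline."""
    return sum(share * cost for share, cost in TIERS.values())

savings_vs_premium_only = 1 - blended_cost() / 100
```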
9.3 Progressive Disclosure
The system SHOULD implement progressive context loading, starting from compact summaries and expanding to full content only when needed.
Part 10: Error Handling
10.1 Error Categories
| Category | Retry Strategy | Fallback |
|---|---|---|
| Transient (429, 503, timeout) | Exponential backoff, 3-5 attempts | Wait and retry |
| Permanent (404, 401, invalid) | No retry | Skip source, log |
| Partial (incomplete response) | No retry | Use available data |
10.2 Fallback Chains
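The chain bodies are not reproduced in this revision. A generic sketch that tries providers in order and treats any exception as a signal to fall through (the provider callables are stand-ins for real tool clients; permanent errors per 10.1 should be filtered out upstream rather than retried here):

```python
def with_fallbacks(providers, *args, **kwargs):
    """Call each provider in order; return the first success.
    Raises only when every provider in the chain has failed."""
    last_error = None
    for provider in providers:
        try:
            return provider(*args, **kwargs)
        except Exception as err:
            last_error = err  # remember why, then fall through to the next tool
    raise RuntimeError("all providers failed") from last_error
```

Per 3.2.2, a static-HTML chain might be `[tavily_extract, firecrawl_scrape]`, with a JavaScript SPA falling back from Firecrawl to direct Playwright.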
10.3 Circuit Breaker
The system SHOULD implement circuit breakers:
- OPEN after 5 failures in 60 seconds
- HALF_OPEN after 30 seconds cooldown
- CLOSED after 2 consecutive successes
- HALF_OPEN after 30 seconds cooldown
- CLOSED after 2 consecutive successes
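The three transitions above can be sketched directly, with the thresholds from the list as defaults; this is a minimal single-threaded illustration, not a production breaker:

```python
import time

class CircuitBreaker:
    """OPEN after `threshold` failures inside `window` seconds,
    HALF_OPEN after `cooldown` seconds, CLOSED after `close_after`
    consecutive successes (thresholds per 10.3)."""
    def __init__(self, threshold=5, window=60.0, cooldown=30.0, close_after=2):
        self.threshold, self.window = threshold, window
        self.cooldown, self.close_after = cooldown, close_after
        self.failures: list[float] = []
        self.successes = 0
        self.opened_at: float | None = None

    @property
    def state(self) -> str:
        if self.opened_at is None:
            return "CLOSED"
        if time.monotonic() - self.opened_at >= self.cooldown:
            return "HALF_OPEN"
        return "OPEN"

    def record_failure(self) -> None:
        now = time.monotonic()
        # Only failures inside the rolling window count toward the threshold.
        self.failures = [t for t in self.failures if now - t < self.window]
        self.failures.append(now)
        self.successes = 0
        if len(self.failures) >= self.threshold:
            self.opened_at = now

    def record_success(self) -> None:
        self.successes += 1
        if self.successes >= self.close_after:
            self.opened_at = None  # back to CLOSED
            self.failures.clear()
            self.successes = 0
```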
Part 11: Source Reliability
11.1 Reliability Tiers
| Tier | Source Type | Base Score | Examples |
|---|---|---|---|
| 1 | Official docs, RFCs, specs | 0.95 | W3C, IETF, vendor docs |
| 2 | Academic, peer-reviewed | 0.90 | arXiv, journals |
| 3 | Authoritative blogs | 0.80 | Major tech company blogs |
| 4 | Community resources | 0.70 | Stack Overflow, GitHub |
| 5 | General web | 0.50 | General articles, forums |
11.2 Freshness Decay
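The decay function itself is not reproduced in this revision. Exponential half-life decay over the tier base scores of 11.1 is a common choice; the 180-day half-life below is an assumed tuning knob, not a spec value:

```python
import math  # noqa: F401  (shown for clarity; 0.5 ** x needs no import)

def freshness_score(base_score: float, age_days: float,
                    half_life_days: float = 180.0) -> float:
    """Decay a tier base score (11.1) so a source loses half its weight
    every `half_life_days` days."""
    return base_score * 0.5 ** (age_days / half_life_days)
```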
11.3 Consensus Detection
The system SHOULD calculate consensus scores as reliability-weighted agreement across sources.
Part 12: llms.txt Integration
12.1 Discovery
The system SHOULD check for llms.txt at:
- https://{domain}/llms.txt
- https://{domain}/llms-full.txt
- https://{domain}/docs/llms.txt
12.2 Usage
When llms.txt exists:
- Parse section structure
- Follow links to `.md` versions
- Prioritize llms.txt content over scraped content
12.3 Fallback
When llms.txt is absent:
- Use semantic HTML parsing
- Follow sitemap.xml structure
- Target documentation framework patterns (Docusaurus, Sphinx, GitBook)
Appendix A: Decision Rationale
| Decision | Alternatives Considered | Why Chosen |
|---|---|---|
| Stage-based pipeline | Monolithic tool | Flexibility, best tool per task |
| Exa for semantic search | Google/Bing scraping | Neural embeddings, findSimilar unique |
| Markdown as interchange | JSON, HTML | Token-efficient, LLM-friendly |
| Multi-level caching | Single cache | Different TTLs per content type |
| Model routing | Single model | 60-85% cost reduction |
Appendix B: Industry Precedents
| System | Pattern | Relevance |
|---|---|---|
| LangChain RAG | Chunk → Embed → Retrieve | Chunking strategies |
| Perplexity | Search → Extract → Synthesize | Pipeline model |
| Firecrawl | Schema-based extraction | Structured output |
| RAPTOR | Hierarchical summarization | Token efficiency |
| RouteLLM | Model cascade | Cost optimization |
Appendix C: Glossary
| Term | Definition |
|---|---|
| Discovery | Finding relevant URLs from a query |
| Extraction | Fetching content from URLs |
| Normalization | Cleaning content for LLM consumption |
| Distillation | Synthesizing insights from content |
| Boilerplate | Navigation, footer, ads - non-content HTML |
| Semantic chunking | Splitting by meaning, not character count |
| Circuit breaker | Pattern to prevent cascading failures |
| Consensus score | Weighted agreement across sources |
Related Specifications
| Document | Relationship |
|---|---|
| SPEC-BIBLE-GUIDELINES.md | Authoritative for specification format |
| RCSD-PIPELINE-SPEC.md | This spec implements the RESEARCH stage of the RCSD Pipeline (Part 2) |
| WEB-AGGREGATION-PIPELINE-IMPLEMENTATION-REPORT.md | Tracks implementation status |
End of Specification
