RAG Systems
Modern production RAG systems use:
- HyDE - Generate hypothetical answer, embed that
- Query rewriting - LLM rewrites question as declarative
- Trained dense retrievers - Models specifically trained on Q&A pairs
- Reranking - Use a cross-encoder after retrieval to re-score
- Hybrid search - Combine dense embeddings with keyword search (BM25)
Modern RAG Systems - The Evolution
The Problem with Naive RAG (What This Codebase Has)
Question → Embed → Vector Search → Retrieve → Generate
Issues:
- Semantic gap (questions ≠ documents)
- No relevance verification
- Single-shot retrieval (can’t adapt)
- No multi-source fusion
Modern RAG Architecture
1. Pre-Retrieval: Query Enhancement
A. Query Rewriting
User: “What’s the best model for cats and dogs?”
  ↓
Rewritten: “image classification models for pet recognition”
B. HyDE (Hypothetical Document Embeddings)
Question: “What models work for sentiment analysis?”
  ↓
LLM generates a fake answer: “BERT, RoBERTa, and DistilBERT are commonly used for sentiment analysis tasks. They achieve 90%+ accuracy…”
  ↓
Embed the FAKE ANSWER (not the question!)
  ↓
Search for documents similar to the fake answer
Why it works: Fake answers look like real documents!
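A minimal HyDE sketch in Python. The `llm`, `embed`, and `index` arguments are hypothetical stand-ins for whatever LLM client, embedding model, and vector store you use; only the flow (generate → embed the fake answer → search) comes from the technique itself.

```python
def hyde_retrieve(question: str, llm, embed, index, top_k: int = 10):
    """HyDE: embed a hypothetical answer instead of the question."""
    # 1. Ask the LLM for a plausible (possibly wrong) answer. Its content
    #    doesn't need to be correct; its *phrasing* resembles real documents.
    fake_answer = llm(f"Write a short passage answering: {question}")
    # 2. Embed the hypothetical answer, not the question.
    query_vector = embed(fake_answer)
    # 3. Search the vector index with that embedding.
    return index.search(query_vector, top_k=top_k)
```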
C. Query Decomposition
Complex: “Compare ResNet and ViT for medical imaging”
  ↓
Sub-queries:
1. “What is ResNet architecture?”
2. “What is Vision Transformer (ViT)?”
3. “Medical imaging model benchmarks”
D. Multi-Query Generation
Original: “best NLP models”
  ↓
Generate variants:
- “state-of-the-art natural language processing models”
- “top-performing text understanding architectures”
- “latest transformer models for NLP”
Retrieve with ALL variants, merge results.
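A sketch of multi-query retrieval, assuming a hypothetical `llm` callable that returns text, a `retrieve(query, top_k)` function, and docs that expose an `id` attribute for deduplication:

```python
def multi_query_retrieve(question: str, llm, retrieve, n_variants: int = 3, top_k: int = 10):
    """Retrieve with several paraphrases of the question and merge the hits."""
    prompt = (
        f"Rewrite this search query in {n_variants} different ways, one per line:\n"
        f"{question}"
    )
    variants = [question] + [v.strip() for v in llm(prompt).splitlines() if v.strip()]

    seen, merged = set(), []
    for variant in variants:
        for doc in retrieve(variant, top_k=top_k):
            if doc.id not in seen:      # deduplicate across variants
                seen.add(doc.id)
                merged.append(doc)
    return merged
```

Reciprocal Rank Fusion (shown later under Fusion RAG) is a common alternative to plain deduplication when the merged ranking order matters.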
2. Retrieval: Hybrid Multi-Stage
Stage 1: Coarse Retrieval (Fast)
- BM25 (keyword search): top 1000 docs
- Dense retrieval: top 1000 docs
= Combined candidate pool: ~1500 docs (after deduplicating the overlap)
BM25 (sparse, keyword-based):
- Good for exact matches, rare terms
- Fast, no ML needed
- Handles terms the embedding model never saw (product codes, new jargon)
Dense (embeddings):
- Good for semantic similarity
- Handles paraphrasing
- Cross-lingual (with multilingual embedding models) — both channels are combined in the sketch below
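A minimal hybrid-scoring sketch using `rank_bm25` and `sentence-transformers` (assumed to be installed); the model name and the `alpha` blend weight are arbitrary choices, and a production system would normally do this inside a search engine rather than in-memory:

```python
import numpy as np
from rank_bm25 import BM25Okapi                         # sparse keyword channel
from sentence_transformers import SentenceTransformer   # dense embedding channel

docs = [
    "BERT and RoBERTa are widely used for sentiment analysis.",
    "ResNet-50 is a convolutional network for image classification.",
    "ViT applies transformers to image patches.",
]
query = "best models for sentiment analysis"

# Sparse scores: BM25 over whitespace-tokenized text.
bm25 = BM25Okapi([d.lower().split() for d in docs])
sparse = np.asarray(bm25.get_scores(query.lower().split()))

# Dense scores: cosine similarity of normalized embeddings.
model = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = model.encode(docs, normalize_embeddings=True)
query_vec = model.encode(query, normalize_embeddings=True)
dense = doc_vecs @ query_vec

# Normalize each channel to [0, 1], then blend with a tunable weight.
def minmax(x):
    return (x - x.min()) / (x.max() - x.min() + 1e-9)

alpha = 0.5
hybrid = alpha * minmax(sparse) + (1 - alpha) * minmax(dense)
print([docs[i] for i in np.argsort(-hybrid)])   # best match first
```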
Stage 2: Reranking (Accurate)
1500 candidates → Cross-Encoder → Score each
  ↓
Top 10-20 docs
Cross-encoder rerankers (e.g., BGE-reranker, monoT5; ColBERT is a related late-interaction model rather than a true cross-encoder) — see the sketch after this list:
- Processes (query, document) pairs together
- Much more accurate than bi-encoder
- Too slow for large corpus (hence 2-stage)
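A reranking sketch with the `sentence-transformers` `CrossEncoder` class; the MS MARCO checkpoint named here is one common choice and can be swapped for a BGE reranker:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_k: int = 10) -> list[str]:
    """Jointly score every (query, doc) pair and keep the top_k best."""
    scores = reranker.predict([(query, doc) for doc in candidates])
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return [candidates[i] for i in order[:top_k]]
```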
Stage 3: Diversity Filter
Top 20 docs → MMR (similar to the query, but different from each other) → 5 final docs
Prevents redundancy (otherwise you feed the model five near-identical passages that all say the same thing).
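A plain-numpy MMR sketch, assuming all vectors are L2-normalized so dot products are cosine similarities; `lam` trades off relevance against novelty:

```python
import numpy as np

def mmr(query_vec, doc_vecs, k: int = 5, lam: float = 0.7):
    """Maximal Marginal Relevance: pick docs similar to the query but unlike each other."""
    doc_vecs = np.asarray(doc_vecs)
    sim_to_query = doc_vecs @ query_vec
    selected, remaining = [], list(range(len(doc_vecs)))
    while remaining and len(selected) < k:
        if selected:
            # Penalize similarity to the closest already-selected doc.
            redundancy = np.max(doc_vecs[remaining] @ doc_vecs[selected].T, axis=1)
        else:
            redundancy = np.zeros(len(remaining))
        scores = lam * sim_to_query[remaining] - (1 - lam) * redundancy
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected   # indices into doc_vecs, in selection order
```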
3. Post-Retrieval: Context Refinement
A. Document Filtering
for doc in retrieved_docs:
    relevance_score = llm.score(query, doc)
    if relevance_score < threshold:
        drop(doc)
B. Context Compression
Long doc (10,000 tokens) → Extract only relevant sentences → 500 tokens
Uses extractive summarization or relevance scoring.
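One way to sketch extractive compression: split the document into sentences, score each against the query with an (assumed) `embed` function that returns normalized vectors, and keep only the best few in their original order:

```python
import re
import numpy as np

def compress_context(query: str, document: str, embed, budget: int = 5) -> str:
    """Keep only the sentences most relevant to the query."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    sims = np.asarray(embed(sentences)) @ np.asarray(embed([query])[0])
    keep = sorted(np.argsort(-sims)[:budget])        # top sentences, original order
    return " ".join(sentences[i] for i in keep)
```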
C. Context Ordering
Arrange by:
- Date (newest first)
- Relevance score (most relevant first)
- Reasoning chain (logical order)
4. Generation: Enhanced Context Integration
A. Structured Prompting
System: You are an expert. Use ONLY the provided context.
Context: [Doc 1 - Source: paper X] … [Doc 2 - Source: model Y] …
Question: {question}
Instructions:
- Cite sources [Doc 1], [Doc 2]
- Say “I don’t know” if context insufficient
B. Chain-of-Thought Reasoning
Let’s think step by step:
- From [Doc 1], we know X
- From [Doc 2], we know Y
- Therefore, the answer is Z
C. Self-Consistency
Generate 5 different answers → Vote → Return most common
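A bare-bones self-consistency sketch with a hypothetical `llm(prompt, temperature=...)` callable; real systems usually normalize or cluster the sampled answers rather than voting on exact strings:

```python
from collections import Counter

def self_consistent_answer(question: str, context: str, llm, n: int = 5) -> str:
    """Sample several answers at non-zero temperature and return the most common one."""
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer briefly:"
    answers = [llm(prompt, temperature=0.8).strip().lower() for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```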
Modern Production Patterns
Pattern 1: Corrective RAG (CRAG)
Query → Retrieve → [Relevance Check]
- Low relevance → fall back to web search
- Medium relevance → refine the query and re-retrieve
- High relevance → use the retrieved docs as-is
All paths → Generate answer
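The routing above can be sketched as a simple dispatch; every callable here (`retrieve`, `web_search`, `grade`, `rewrite`, `generate`) is a hypothetical stand-in, and the 0.3/0.7 thresholds are illustrative, not from the CRAG paper:

```python
def corrective_rag(query: str, retrieve, web_search, grade, rewrite, generate):
    """CRAG-style routing: grade the retrieval, then correct it if needed."""
    docs = retrieve(query)
    score = grade(query, docs)           # LLM relevance grade in [0, 1]
    if score < 0.3:                      # low relevance: fall back to web search
        docs = web_search(query)
    elif score < 0.7:                    # medium: refine the query and re-retrieve
        docs = retrieve(rewrite(query))
    # high relevance: use the docs as-is
    return generate(query, docs)
```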
Pattern 2: Self-RAG
Generate answer chunk → [Should I retrieve more?]
  ↓ Yes
Retrieve → [Is it relevant?]
  ↓ Yes
[Is answer supported?]
  ↓
Continue generating
The model decides when to retrieve during generation!
Pattern 3: FLARE (Forward-Looking Active REtrieval)
Generate (tentatively): “The capital of France is …”
  ↓
Check token confidence: LOW on the completion
  ↓
Retrieve: “France geography”
  ↓
Regenerate: “The capital of France is Paris”
Retrieves only when uncertain during generation.
Pattern 4: Iterative RAG (like ReSP in this codebase)
Round 1: Question → Retrieve → Summarize → [Need more info?] → Yes
  ↓
Round 2: Sub-question → Retrieve → Summarize → [Need more?] → No
  ↓
Final answer
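A generic sketch of that loop (not the codebase's actual ReSP implementation); all callables are hypothetical stand-ins:

```python
def iterative_rag(question: str, retrieve, summarize, needs_more,
                  next_subquestion, answer, max_rounds: int = 3) -> str:
    """Retrieve, summarize, and decide whether another round is needed."""
    notes, query = [], question
    for _ in range(max_rounds):
        docs = retrieve(query)
        notes.append(summarize(question, docs))          # accumulate evidence
        if not needs_more(question, notes):              # LLM judges sufficiency
            break
        query = next_subquestion(question, notes)        # plan the next lookup
    return answer(question, notes)
```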
Advanced Techniques
1. Agentic RAG
Agent has tools:
- Vector search
- Keyword search
- SQL database
- Calculator
- Web browser
The agent decides which tool(s) to use for each question.
Examples: LangChain Agents, AutoGPT
2. GraphRAG (Microsoft)
Documents → Extract entities & relationships → Build knowledge graph
  ↓
Query → Graph traversal → Find connected entities → Summarize
Better for: Multi-hop reasoning, complex relationships
3. Fusion RAG
Question → Generate N variations
  ↓
Retrieve for each
  ↓
Merge results (Reciprocal Rank Fusion)
  ↓
Generate final answer
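Reciprocal Rank Fusion itself is only a few lines; this sketch merges ranked lists of doc ids, with k=60 as the conventional constant:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Each doc scores sum(1 / (k + rank)) over every list it appears in."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: results from three query variants; docs ranked highly in several
# lists (doc1, doc3) float to the top of the fused ranking.
print(reciprocal_rank_fusion([
    ["doc3", "doc1", "doc7"],
    ["doc1", "doc3", "doc9"],
    ["doc1", "doc2", "doc3"],
])[:3])
```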
4. Multi-Vector Retrieval
Document → Split into chunks
  ↓
Embed each chunk
  ↓
Also embed: title, summary, keywords
  ↓
Store all embeddings for the same doc
  ↓
Query → Search across all embedding types → Merge hits
5. Late Chunking
Traditional: Chunk → Embed each chunk independently
Problem: Chunks lose their surrounding context
Late Chunking: Embed the whole doc → Extract chunk embeddings from the full-context representation
Result: Chunks retain global context
Real Production Stacks
Example 1: Perplexity AI
Query rewriting
  ↓
Web search (BM25) + Vector search
  ↓
Reranking (cross-encoder)
  ↓
LLM generation with citations
  ↓
Follow-up question generation
Example 2: Anthropic’s Retrieval (Hypothetical)
Multi-query expansion
  ↓
Hybrid retrieval (sparse + dense)
  ↓
Temporal filtering (recent docs weighted higher)
  ↓
Relevance filtering (Claude judges relevance)
  ↓
Context compression
  ↓
Claude generates with citations
Example 3: Enterprise RAG
Query → [Router] → Which knowledge base?
  ↓
Specialized retrievers:
- Legal docs DB
- Technical docs DB
- Customer support DB
  ↓
Fusion & reranking
  ↓
Answer + confidence score
Key Metrics for Modern RAG
- Retrieval Metrics (a small Recall@K / MRR sketch follows this metric list):
  - Recall@K: Are the relevant docs in the top K?
  - MRR (Mean Reciprocal Rank): Position of the first relevant doc
  - NDCG: Ranking quality with graded, position-discounted relevance
- Generation Metrics:
  - Faithfulness: Does the answer match the retrieved context?
  - Answer relevance: Does the answer address the question?
  - Context precision: Are the retrieved docs relevant?
- End-to-End:
  - Latency (p50, p95, p99)
  - Cost per query
  - User satisfaction / thumbs-up rate
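The retrieval metrics are straightforward to compute offline; a minimal sketch for Recall@K and MRR (doc ids and relevance labels are assumed to come from your evaluation set):

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant docs that appear in the top-k retrieved list."""
    return len(set(retrieved[:k]) & relevant) / len(relevant) if relevant else 0.0

def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average of 1/rank of the first relevant doc, over (retrieved, relevant) pairs."""
    total = 0.0
    for retrieved, relevant in runs:
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(runs)
```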
The Evolution Timeline
2018: Naive RAG
  ↓
2020: Dense retrieval (DPR, REALM)
  ↓
2021: Hybrid retrieval + reranking
  ↓
2022: Query transformation (HyDE)
  ↓
2023: Agentic RAG, Self-RAG, GraphRAG
  ↓
2024: Multi-modal RAG (text + images + tables)
  ↓
2025: You are here → Compound AI systems
What This Codebase Has vs. Modern RAG
| Feature | This Codebase | Modern RAG |
|---|---|---|
| Query processing | Direct embed | HyDE, rewriting, decomposition |
| Retrieval | Single-stage dense | Multi-stage hybrid (BM25 + dense + rerank) |
| Diversity | MMR only | MMR + filtering + clustering |
| Iteration | ReSP has basic iteration | Agentic, self-correcting |
| Citations | Model IDs only | Document snippets, page numbers, URLs |
| Evaluation | External script | Built-in relevance checking |
The Bottom Line
Naive RAG: 40-60% accuracy on complex questions
Modern RAG: 70-85% accuracy on the same questions
But: 5-10x more complex, 2-5x slower, 3-10x more expensive
The trick is choosing which techniques fit your specific use case.
Want to implement any of these improvements to benchmark against the current system?