RAG Systems

Modern production RAG systems use:

  1. HyDE - Generate hypothetical answer, embed that
  2. Query rewriting - LLM rewrites question as declarative
  3. Trained dense retrievers - Models specifically trained on Q&A pairs
  4. Reranking - Use a cross-encoder after retrieval to re-score
  5. Hybrid search - Combine dense embeddings with keyword search (BM25)

Modern RAG Systems - The Evolution

The Problem with Naive RAG (What This Codebase Has)

Question → Embed → Vector Search → Retrieve → Generate

Issues:

  • Semantic gap (questions ≠ documents)
  • No relevance verification
  • Single-shot retrieval (can’t adapt)
  • No multi-source fusion

Modern RAG Architecture

  1. Pre-Retrieval: Query Enhancement

A. Query Rewriting

User: “What’s the best model for cats and dogs?”
    ↓
Rewritten: “image classification models for pet recognition”

B. HyDE (Hypothetical Document Embeddings)

Question: “What models work for sentiment analysis?”
    ↓
LLM generates fake answer: “BERT, RoBERTa, and DistilBERT are commonly used for sentiment analysis tasks. They achieve 90%+ accuracy…”
    ↓
Embed the FAKE ANSWER (not the question!)
    ↓
Search for documents similar to the fake answer

Why it works: the fake answer is phrased like a real document, so it lands closer to relevant documents in embedding space than the question does.
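
A minimal sketch of that flow, assuming a placeholder llm_generate callable and a sentence-transformers embedding model (neither is prescribed by this codebase; any LLM client and embedder would do):

from sentence_transformers import SentenceTransformer, util

def hyde_search(question, corpus_texts, llm_generate, top_k=5):
    """HyDE: retrieve by embedding a hypothetical answer instead of the question."""
    embedder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice
    # 1. Ask the LLM to write a plausible (possibly wrong) answer.
    fake_answer = llm_generate(
        f"Write a short passage that answers this question:\n{question}"
    )
    # 2. Embed the FAKE ANSWER, not the question.
    query_emb = embedder.encode(fake_answer, convert_to_tensor=True)
    corpus_emb = embedder.encode(corpus_texts, convert_to_tensor=True)
    # 3. Retrieve documents whose embeddings are closest to the fake answer.
    hits = util.semantic_search(query_emb, corpus_emb, top_k=top_k)[0]
    return [(corpus_texts[h["corpus_id"]], h["score"]) for h in hits]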

C. Query Decomposition

Complex: “Compare ResNet and ViT for medical imaging”
    ↓
Sub-queries:
  1. “What is ResNet architecture?”
  2. “What is Vision Transformer (ViT)?”
  3. “Medical imaging model benchmarks”

D. Multi-Query Generation

Original: “best NLP models”
    ↓
Generate variants:

  • “state-of-the-art natural language processing models”
  • “top-performing text understanding architectures”
  • “latest transformer models for NLP”

Retrieve with ALL variants, merge results.
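
A sketch of the merge step, with generate_variants and retrieve as placeholder callables (retrieve is assumed to yield (doc_id, doc, score) tuples; neither name comes from this codebase):

def multi_query_retrieve(question, generate_variants, retrieve, top_k=5):
    """Retrieve with the original question plus LLM-generated paraphrases,
    then merge results, keeping each document's best score."""
    queries = [question] + generate_variants(question)  # e.g. 3-5 paraphrases
    best = {}  # doc_id -> (score, doc)
    for q in queries:
        for doc_id, doc, score in retrieve(q):  # any underlying retriever works
            if doc_id not in best or score > best[doc_id][0]:
                best[doc_id] = (score, doc)
    ranked = sorted(best.values(), key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in ranked[:top_k]]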


  2. Retrieval: Hybrid Multi-Stage

Stage 1: Coarse Retrieval (Fast)

  • BM25 (keyword search): top 1000 docs
  • Dense retrieval: top 1000 docs
  • Combined candidate pool (after deduplication): ~1500 docs

BM25 (sparse, keyword-based):

  • Good for exact matches, rare terms
  • Fast, no ML needed
  • Handles out-of-vocabulary words

Dense (embeddings):

  • Good for semantic similarity
  • Handles paraphrasing
  • Language-agnostic
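
One way to sketch this coarse hybrid stage, using the rank_bm25 package for the sparse side and sentence-transformers for the dense side (pool sizes and the model name are illustrative, not prescriptions):

import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

def hybrid_candidates(query, docs, n_sparse=1000, n_dense=1000):
    """Stage 1: union of BM25 and dense-retrieval candidates (indices into docs)."""
    # Sparse: BM25 over whitespace-tokenized documents.
    bm25 = BM25Okapi([d.lower().split() for d in docs])
    sparse_scores = bm25.get_scores(query.lower().split())
    sparse_top = np.argsort(sparse_scores)[::-1][:n_sparse]

    # Dense: cosine similarity of normalized embeddings.
    model = SentenceTransformer("all-MiniLM-L6-v2")
    doc_emb = model.encode(docs, normalize_embeddings=True)
    q_emb = model.encode(query, normalize_embeddings=True)
    dense_top = np.argsort(doc_emb @ q_emb)[::-1][:n_dense]

    # Union of both candidate sets (~1.5x one pool when the overlap is partial).
    return sorted(set(sparse_top.tolist()) | set(dense_top.tolist()))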

Stage 2: Reranking (Accurate)

1500 candidates → Cross-Encoder → Score each
    ↓
Top 10-20 docs

Cross-encoder rerankers (e.g., BGE-reranker; ColBERT is a related late-interaction model):

  • Processes (query, document) pairs together
  • Much more accurate than bi-encoder
  • Too slow for large corpus (hence 2-stage)
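
A Stage 2 sketch with sentence-transformers’ CrossEncoder; the MS MARCO reranker name is just one common choice, not a requirement:

from sentence_transformers import CrossEncoder

def rerank(query, candidates, top_k=10):
    """Score (query, document) pairs jointly and keep the highest-scoring docs."""
    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]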

Stage 3: Diversity Filter

Top 20 docs → MMR (Maximal Marginal Relevance) or a similar diversity re-ranker → 5 final docs

Prevents redundancy (5 versions of the same answer).
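
A minimal MMR sketch over L2-normalized numpy embeddings (query_emb is a (d,) vector, doc_embs an (N, d) matrix); lambda_param trades relevance against novelty, and the default here is only illustrative:

import numpy as np

def mmr(query_emb, doc_embs, k=5, lambda_param=0.7):
    """Greedily pick docs that are relevant to the query but dissimilar
    to the docs already selected. Returns indices into doc_embs."""
    relevance = doc_embs @ query_emb  # cosine similarity to the query
    selected, remaining = [], list(range(len(doc_embs)))
    while remaining and len(selected) < k:
        if not selected:
            best = remaining[int(np.argmax(relevance[remaining]))]
        else:
            sim_to_selected = doc_embs[remaining] @ doc_embs[selected].T
            scores = (lambda_param * relevance[remaining]
                      - (1 - lambda_param) * sim_to_selected.max(axis=1))
            best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected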


  3. Post-Retrieval: Context Refinement

A. Document Filtering

filtered_docs = []
for doc in retrieved_docs:
    relevance_score = llm.score(query, doc)  # LLM-as-judge relevance score
    if relevance_score >= threshold:         # threshold chosen empirically
        filtered_docs.append(doc)            # keep only sufficiently relevant docs

B. Context Compression

Long doc (10,000 tokens) → Extract only relevant sentences → 500 tokens

Uses extractive summarization or relevance scoring.
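
A sketch of the extractive variant: score each sentence against the query and keep only the top ones, in their original order (the period-based sentence split and the sentence budget are deliberate simplifications):

import numpy as np
from sentence_transformers import SentenceTransformer

def compress_context(query, document, max_sentences=8):
    """Extractive compression: keep the sentences most similar to the query,
    preserving document order."""
    model = SentenceTransformer("all-MiniLM-L6-v2")
    sentences = [s.strip() for s in document.split(".") if s.strip()]  # naive split
    sent_emb = model.encode(sentences, normalize_embeddings=True)
    q_emb = model.encode(query, normalize_embeddings=True)
    scores = sent_emb @ q_emb
    keep = sorted(np.argsort(scores)[::-1][:max_sentences])  # restore original order
    return ". ".join(sentences[i] for i in keep) + "."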

C. Context Ordering

Arrange by:

  • Date (newest first)
  • Relevance score (most relevant first)
  • Reasoning chain (logical order)

  4. Generation: Enhanced Context Integration

A. Structured Prompting

System: You are an expert. Use ONLY the provided context.

Context: [Doc 1 - Source: paper X] … [Doc 2 - Source: model Y] …

Question: {question}

Instructions:

  • Cite sources [Doc 1], [Doc 2]
  • Say “I don’t know” if context insufficient

B. Chain-of-Thought Reasoning

Let’s think step by step:

  1. From [Doc 1], we know X
  2. From [Doc 2], we know Y
  3. Therefore, the answer is Z

C. Self-Consistency

Generate 5 different answers → Vote → Return most common
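
A sketch of the voting step, assuming a placeholder generate callable that samples with nonzero temperature:

from collections import Counter

def self_consistent_answer(prompt, generate, n=5):
    """Sample several answers and return the most common one (majority vote)."""
    answers = [generate(prompt, temperature=0.8) for _ in range(n)]
    normalized = [a.strip().lower() for a in answers]
    winner, _ = Counter(normalized).most_common(1)[0]
    # Return the first original answer that matches the winning normalized form.
    return next(a for a, norm in zip(answers, normalized) if norm == winner)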


Modern Production Patterns

Pattern 1: Corrective RAG (CRAG)

Query → Retrieve → [Relevance check]
  • Low relevance    → Web search
  • Medium relevance → Refine & re-retrieve
  • High relevance   → Use as-is
    ↓
Generate answer

Pattern 2: Self-RAG

Generate answer chunk → [Should I retrieve more?]
    ↓ Yes
Retrieve → [Is it relevant?]
    ↓ Yes
[Is the answer supported?]
    ↓
Continue generating

The model decides when to retrieve during generation!

Pattern 3: FLARE (Forward-Looking Active REtrieval)

Generate: “The capital of France”
    ↓
Check confidence: LOW on “France”
    ↓
Retrieve: “France geography”
    ↓
Continue: “The capital of France is Paris”

Retrieves only when uncertain during generation.

Pattern 4: Iterative RAG (like ReSP in this codebase)

Round 1: Question → Retrieve → Summarize → [Need more info?] → Yes
    ↓
Round 2: Sub-question → Retrieve → Summarize → [Need more?] → No
    ↓
Final answer


Advanced Techniques

  1. Agentic RAG

Agent has tools:

  • Vector search
  • Keyword search
  • SQL database
  • Calculator
  • Web browser

The agent decides which tool(s) to use for each question.

Example: LangChain Agents, AutoGPT

  2. GraphRAG (Microsoft)

Documents → Extract entities & relationships → Build knowledge graph
    ↓
Query → Graph traversal → Find connected entities → Summarize

Better for: Multi-hop reasoning, complex relationships

  3. Fusion RAG

Question → Generate N variations
    ↓
Retrieve for each
    ↓
Merge results (Reciprocal Rank Fusion)
    ↓
Generate final answer
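
Reciprocal Rank Fusion is small enough to show in full: each document’s fused score is the sum of 1/(k + rank) over every ranked list it appears in (k = 60 is the constant from the original RRF paper):

def reciprocal_rank_fusion(ranked_lists, k=60, top_n=10):
    """Fuse several ranked lists of doc ids: score(d) = sum over lists of 1/(k + rank)."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]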

  4. Multi-Vector Retrieval

Document → Split into chunks
    ↓
Embed each chunk
    ↓
Also embed: title, summary, keywords
    ↓
Store all embeddings for same doc
    ↓
Query → Search across all embedding types → Merge hits

  5. Late Chunking

Traditional: Chunk → Embed each chunk independently
Problem: Chunks lose context

Late Chunking: Embed whole doc → Extract chunk embeddings from full context
Result: Chunks retain global context
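
A rough sketch of the idea with Hugging Face transformers: encode the whole document once, then mean-pool the contextualized token embeddings per fixed-size window (the model name and window size are illustrative; real implementations use long-context embedding models rather than truncating at 512 tokens):

import torch
from transformers import AutoModel, AutoTokenizer

def late_chunk_embeddings(document, chunk_tokens=128,
                          model_name="sentence-transformers/all-MiniLM-L6-v2"):
    """Encode the full document once, then pool token embeddings per chunk so each
    chunk vector is computed with the whole document as context."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    # NOTE: MiniLM caps at 512 tokens; this truncation is a simplification for the sketch.
    inputs = tokenizer(document, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        token_emb = model(**inputs).last_hidden_state[0]  # (seq_len, hidden)
    chunks = []
    for start in range(0, token_emb.shape[0], chunk_tokens):
        pooled = token_emb[start:start + chunk_tokens].mean(dim=0)
        chunks.append(torch.nn.functional.normalize(pooled, dim=0))
    return torch.stack(chunks)  # one document-aware embedding per chunk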


Real Production Stacks

Example 1: Perplexity AI

Query rewriting
    ↓
Web search (BM25) + Vector search
    ↓
Reranking (cross-encoder)
    ↓
LLM generation with citations
    ↓
Follow-up question generation

Example 2: Anthropic’s Retrieval (Hypothetical)

Multi-query expansion
    ↓
Hybrid retrieval (sparse + dense)
    ↓
Temporal filtering (recent docs weighted higher)
    ↓
Relevance filtering (Claude judges relevance)
    ↓
Context compression
    ↓
Claude generates with citations

Example 3: Enterprise RAG

Query → [Router] → Which knowledge base?
    ↓
Specialized retrievers:
  • Legal docs DB
  • Technical docs DB
  • Customer support DB
    ↓
Fusion & reranking
    ↓
Answer + confidence score


Key Metrics for Modern RAG

  1. Retrieval metrics:
     • Recall@K: Are the relevant docs in the top K?
     • MRR (Mean Reciprocal Rank): Position of the first relevant doc
     • NDCG: Ranking quality
  2. Generation metrics:
     • Faithfulness: Does the answer match the retrieved context?
     • Answer relevance: Does the answer address the question?
     • Context precision: Are the retrieved docs relevant?
  3. End-to-end:
     • Latency (p50, p95, p99)
     • Cost per query
     • User satisfaction / thumbs-up rate
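
Recall@K and MRR from the retrieval metrics above are straightforward to compute from per-query lists of retrieved and relevant doc ids; a small sketch:

def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant docs that appear in the top-k retrieved list."""
    return len(set(retrieved[:k]) & set(relevant)) / max(len(relevant), 1)

def mean_reciprocal_rank(all_retrieved, all_relevant):
    """Average of 1/rank of the first relevant doc per query (0 if none retrieved)."""
    reciprocal_ranks = []
    for retrieved, relevant in zip(all_retrieved, all_relevant):
        rank = next((i for i, d in enumerate(retrieved, 1) if d in relevant), None)
        reciprocal_ranks.append(1.0 / rank if rank else 0.0)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)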

The Evolution Timeline

2018: Naive RAG
    ↓
2020: Dense retrieval (DPR, REALM)
    ↓
2021: Hybrid retrieval + reranking
    ↓
2022: Query transformation (HyDE)
    ↓
2023: Agentic RAG, Self-RAG, GraphRAG
    ↓
2024: Multi-modal RAG (text + images + tables)
    ↓
2025: You are here → Compound AI systems


What This Codebase Has vs. Modern RAG

Feature          | This Codebase            | Modern RAG
-----------------|--------------------------|--------------------------------------------
Query processing | Direct embed             | HyDE, rewriting, decomposition
Retrieval        | Single-stage dense       | Multi-stage hybrid (BM25 + dense + rerank)
Diversity        | MMR only                 | MMR + filtering + clustering
Iteration        | ReSP has basic iteration | Agentic, self-correcting
Citations        | Model IDs only           | Document snippets, page numbers, URLs
Evaluation       | External script          | Built-in relevance checking

The Bottom Line

Naive RAG: 40-60% accuracy on complex questions
Modern RAG: 70-85% accuracy on the same questions
But: 5-10x more complex, 2-5x slower, 3-10x more expensive

The trick is choosing which techniques fit your specific use case.

Want to implement any of these improvements to benchmark against the current system?