RAG Systems
Modern production RAG systems use:
- HyDE - Generate hypothetical answer, embed that
- Query rewriting - LLM rewrites question as declarative
- Trained dense retrievers - Models specifically trained on Q&A pairs
- Reranking - Use a cross-encoder after retrieval to re-score
- Hybrid search - Combine dense embeddings with keyword search (BM25)
Modern RAG Systems - The Evolution
The Problem with Naive RAG (What This Codebase Has)
Question → Embed → Vector Search → Retrieve → Generate
Issues:
- Semantic gap (questions ≠ documents)
- No relevance verification
- Single-shot retrieval (can’t adapt)
- No multi-source fusion
Modern RAG Architecture
1. Pre-Retrieval: Query Enhancement
A. Query Rewriting
User: “What’s the best model for cats and dogs?”
  ↓
Rewritten: “image classification models for pet recognition”
B. HyDE (Hypothetical Document Embeddings)
Question: “What models work for sentiment analysis?”
  ↓
LLM generates a fake answer: “BERT, RoBERTa, and DistilBERT are commonly used for sentiment analysis tasks. They achieve 90%+ accuracy…”
  ↓
Embed the FAKE ANSWER (not the question!)
  ↓
Search for documents similar to the fake answer
Why it works: Fake answers look like real documents!
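A minimal HyDE sketch in Python. The `llm`, `embed`, and `index` arguments are hypothetical stand-ins for whatever LLM client, embedding model, and vector store you use; only the flow (generate → embed the fake answer → search) comes from the technique itself.

```python
def hyde_retrieve(question: str, llm, embed, index, top_k: int = 10):
    """HyDE: embed a hypothetical answer instead of the question."""
    # 1. Ask the LLM for a plausible (possibly wrong) answer. Its content
    #    doesn't need to be correct; its *phrasing* resembles real documents.
    fake_answer = llm(f"Write a short passage answering: {question}")
    # 2. Embed the hypothetical answer, not the question.
    query_vector = embed(fake_answer)
    # 3. Search the vector index with that embedding.
    return index.search(query_vector, top_k=top_k)
```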
C. Query Decomposition
Complex: “Compare ResNet and ViT for medical imaging”
  ↓
Sub-queries:
1. “What is ResNet architecture?”
2. “What is Vision Transformer (ViT)?”
3. “Medical imaging model benchmarks”
D. Multi-Query Generation
Original: “best NLP models”
  ↓
Generate variants:
- “state-of-the-art natural language processing models”
- “top-performing text understanding architectures”
- “latest transformer models for NLP”
Retrieve with ALL variants, merge results.
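A sketch of multi-query retrieval, assuming a hypothetical `llm` callable that returns text, a `retrieve(query, top_k)` function, and docs that expose an `id` attribute for deduplication:

```python
def multi_query_retrieve(question: str, llm, retrieve, n_variants: int = 3, top_k: int = 10):
    """Retrieve with several paraphrases of the question and merge the hits."""
    prompt = (
        f"Rewrite this search query in {n_variants} different ways, one per line:\n"
        f"{question}"
    )
    variants = [question] + [v.strip() for v in llm(prompt).splitlines() if v.strip()]

    seen, merged = set(), []
    for variant in variants:
        for doc in retrieve(variant, top_k=top_k):
            if doc.id not in seen:      # deduplicate across variants
                seen.add(doc.id)
                merged.append(doc)
    return merged
```

Reciprocal Rank Fusion (shown later under Fusion RAG) is a common alternative to plain deduplication when the merged ranking order matters.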
2. Retrieval: Hybrid Multi-Stage
Stage 1: Coarse Retrieval (Fast)
- BM25 (keyword search): top 1000 docs
- Dense retrieval: top 1000 docs
= Combined candidate pool: ~1500 docs (after deduplicating the overlap)
BM25 (sparse, keyword-based):
- Good for exact matches, rare terms
- Fast, no ML needed
- Handles terms the embedding model never saw (product codes, new jargon)
Dense (embeddings):
- Good for semantic similarity
- Handles paraphrasing
- Cross-lingual (with multilingual embedding models) — both channels are combined in the sketch below
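A minimal hybrid-scoring sketch using `rank_bm25` and `sentence-transformers` (assumed to be installed); the model name and the `alpha` blend weight are arbitrary choices, and a production system would normally do this inside a search engine rather than in-memory:

```python
import numpy as np
from rank_bm25 import BM25Okapi                         # sparse keyword channel
from sentence_transformers import SentenceTransformer   # dense embedding channel

docs = [
    "BERT and RoBERTa are widely used for sentiment analysis.",
    "ResNet-50 is a convolutional network for image classification.",
    "ViT applies transformers to image patches.",
]
query = "best models for sentiment analysis"

# Sparse scores: BM25 over whitespace-tokenized text.
bm25 = BM25Okapi([d.lower().split() for d in docs])
sparse = np.asarray(bm25.get_scores(query.lower().split()))

# Dense scores: cosine similarity of normalized embeddings.
model = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = model.encode(docs, normalize_embeddings=True)
query_vec = model.encode(query, normalize_embeddings=True)
dense = doc_vecs @ query_vec

# Normalize each channel to [0, 1], then blend with a tunable weight.
def minmax(x):
    return (x - x.min()) / (x.max() - x.min() + 1e-9)

alpha = 0.5
hybrid = alpha * minmax(sparse) + (1 - alpha) * minmax(dense)
print([docs[i] for i in np.argsort(-hybrid)])   # best match first
```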
Stage 2: Reranking (Accurate)
1500 candidates → Cross-Encoder → Score each
  ↓
Top 10-20 docs
Cross-encoder rerankers (e.g., BGE-reranker, monoT5; ColBERT is a related late-interaction model rather than a true cross-encoder) — see the sketch after this list:
- Processes (query, document) pairs together
- Much more accurate than bi-encoder
- Too slow for large corpus (hence 2-stage)
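A reranking sketch with the `sentence-transformers` `CrossEncoder` class; the MS MARCO checkpoint named here is one common choice and can be swapped for a BGE reranker:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_k: int = 10) -> list[str]:
    """Jointly score every (query, doc) pair and keep the top_k best."""
    scores = reranker.predict([(query, doc) for doc in candidates])
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return [candidates[i] for i in order[:top_k]]
```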
Stage 3: Diversity Filter
Top 20 docs → MMR (similar to the query, but different from each other) → 5 final docs
Prevents redundancy (otherwise you feed the model five near-identical passages that all say the same thing).
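A plain-numpy MMR sketch, assuming all vectors are L2-normalized so dot products are cosine similarities; `lam` trades off relevance against novelty:

```python
import numpy as np

def mmr(query_vec, doc_vecs, k: int = 5, lam: float = 0.7):
    """Maximal Marginal Relevance: pick docs similar to the query but unlike each other."""
    doc_vecs = np.asarray(doc_vecs)
    sim_to_query = doc_vecs @ query_vec
    selected, remaining = [], list(range(len(doc_vecs)))
    while remaining and len(selected) < k:
        if selected:
            # Penalize similarity to the closest already-selected doc.
            redundancy = np.max(doc_vecs[remaining] @ doc_vecs[selected].T, axis=1)
        else:
            redundancy = np.zeros(len(remaining))
        scores = lam * sim_to_query[remaining] - (1 - lam) * redundancy
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected   # indices into doc_vecs, in selection order
```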
3. Post-Retrieval: Context Refinement
A. Document Filtering
for doc in retrieved_docs:
    relevance_score = llm.score(query, doc)
    if relevance_score < threshold:
        drop(doc)
B. Context Compression
Long doc (10,000 tokens) → Extract only relevant sentences → 500 tokens
Uses extractive summarization or relevance scoring.
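One way to sketch extractive compression: split the document into sentences, score each against the query with an (assumed) `embed` function that returns normalized vectors, and keep only the best few in their original order:

```python
import re
import numpy as np

def compress_context(query: str, document: str, embed, budget: int = 5) -> str:
    """Keep only the sentences most relevant to the query."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    sims = np.asarray(embed(sentences)) @ np.asarray(embed([query])[0])
    keep = sorted(np.argsort(-sims)[:budget])        # top sentences, original order
    return " ".join(sentences[i] for i in keep)
```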
C. Context Ordering
Arrange by:
- Date (newest first)
- Relevance score (most relevant first)
- Reasoning chain (logical order)
4. Generation: Enhanced Context Integration
A. Structured Prompting
System: You are an expert. Use ONLY the provided context.
Context: [Doc 1 - Source: paper X] … [Doc 2 - Source: model Y] …
Question: {question}
Instructions:
- Cite sources [Doc 1], [Doc 2]
- Say “I don’t know” if context insufficient
B. Chain-of-Thought Reasoning
Let’s think step by step:
- From [Doc 1], we know X
- From [Doc 2], we know Y
- Therefore, the answer is Z
C. Self-Consistency
Generate 5 different answers → Vote → Return most common
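A bare-bones self-consistency sketch with a hypothetical `llm(prompt, temperature=...)` callable; real systems usually normalize or cluster the sampled answers rather than voting on exact strings:

```python
from collections import Counter

def self_consistent_answer(question: str, context: str, llm, n: int = 5) -> str:
    """Sample several answers at non-zero temperature and return the most common one."""
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer briefly:"
    answers = [llm(prompt, temperature=0.8).strip().lower() for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```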
Modern Production Patterns
Pattern 1: Corrective RAG (CRAG)
Query → Retrieve → [Relevance Check]
- Low relevance → fall back to web search
- Medium relevance → refine the query and re-retrieve
- High relevance → use the retrieved docs as-is
All paths → Generate answer
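The routing above can be sketched as a simple dispatch; every callable here (`retrieve`, `web_search`, `grade`, `rewrite`, `generate`) is a hypothetical stand-in, and the 0.3/0.7 thresholds are illustrative, not from the CRAG paper:

```python
def corrective_rag(query: str, retrieve, web_search, grade, rewrite, generate):
    """CRAG-style routing: grade the retrieval, then correct it if needed."""
    docs = retrieve(query)
    score = grade(query, docs)           # LLM relevance grade in [0, 1]
    if score < 0.3:                      # low relevance: fall back to web search
        docs = web_search(query)
    elif score < 0.7:                    # medium: refine the query and re-retrieve
        docs = retrieve(rewrite(query))
    # high relevance: use the docs as-is
    return generate(query, docs)
```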
Pattern 2: Self-RAG
Generate answer chunk → [Should I retrieve more?]
  ↓ Yes
Retrieve → [Is it relevant?]
  ↓ Yes
[Is answer supported?]
  ↓
Continue generating
The model decides when to retrieve during generation!
Pattern 3: FLARE (Forward-Looking Active REtrieval)
Generate (tentatively): “The capital of France is …”
  ↓
Check token confidence: LOW on the completion
  ↓
Retrieve: “France geography”
  ↓
Regenerate: “The capital of France is Paris”
Retrieves only when uncertain during generation.
Pattern 4: Iterative RAG (like ReSP in this codebase)
Round 1: Question → Retrieve → Summarize → [Need more info?] → Yes
  ↓
Round 2: Sub-question → Retrieve → Summarize → [Need more?] → No
  ↓
Final answer
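A generic sketch of that loop (not the codebase's actual ReSP implementation); all callables are hypothetical stand-ins:

```python
def iterative_rag(question: str, retrieve, summarize, needs_more,
                  next_subquestion, answer, max_rounds: int = 3) -> str:
    """Retrieve, summarize, and decide whether another round is needed."""
    notes, query = [], question
    for _ in range(max_rounds):
        docs = retrieve(query)
        notes.append(summarize(question, docs))          # accumulate evidence
        if not needs_more(question, notes):              # LLM judges sufficiency
            break
        query = next_subquestion(question, notes)        # plan the next lookup
    return answer(question, notes)
```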
Advanced Techniques
1. Agentic RAG
Agent has tools:
- Vector search
- Keyword search
- SQL database
- Calculator
- Web browser
The agent decides which tool(s) to use for each question.
Examples: LangChain Agents, AutoGPT
2. GraphRAG (Microsoft)
Documents → Extract entities & relationships → Build knowledge graph
  ↓
Query → Graph traversal → Find connected entities → Summarize
Better for: Multi-hop reasoning, complex relationships
3. Fusion RAG
Question → Generate N variations
  ↓
Retrieve for each
  ↓
Merge results (Reciprocal Rank Fusion)
  ↓
Generate final answer
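Reciprocal Rank Fusion itself is only a few lines; this sketch merges ranked lists of doc ids, with k=60 as the conventional constant:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Each doc scores sum(1 / (k + rank)) over every list it appears in."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: results from three query variants; docs ranked highly in several
# lists (doc1, doc3) float to the top of the fused ranking.
print(reciprocal_rank_fusion([
    ["doc3", "doc1", "doc7"],
    ["doc1", "doc3", "doc9"],
    ["doc1", "doc2", "doc3"],
])[:3])
```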
4. Multi-Vector Retrieval
Document → Split into chunks
  ↓
Embed each chunk
  ↓
Also embed: title, summary, keywords
  ↓
Store all embeddings for the same doc
  ↓
Query → Search across all embedding types → Merge hits
5. Late Chunking
Traditional: Chunk → Embed each chunk independently
Problem: Chunks lose their surrounding context
Late Chunking: Embed the whole doc → Extract chunk embeddings from the full-context representation
Result: Chunks retain global context
Real Production Stacks
Example 1: Perplexity AI
Query rewriting
  ↓
Web search (BM25) + Vector search
  ↓
Reranking (cross-encoder)
  ↓
LLM generation with citations
  ↓
Follow-up question generation
Example 2: Anthropic’s Retrieval (Hypothetical)
Multi-query expansion
  ↓
Hybrid retrieval (sparse + dense)
  ↓
Temporal filtering (recent docs weighted higher)
  ↓
Relevance filtering (Claude judges relevance)
  ↓
Context compression
  ↓
Claude generates with citations
Example 3: Enterprise RAG
Query → [Router] → Which knowledge base?
  ↓
Specialized retrievers:
- Legal docs DB
- Technical docs DB
- Customer support DB
  ↓
Fusion & reranking
  ↓
Answer + confidence score
Key Metrics for Modern RAG
- Retrieval Metrics (a small Recall@K / MRR sketch follows this metric list):
  - Recall@K: Are the relevant docs in the top K?
  - MRR (Mean Reciprocal Rank): Position of the first relevant doc
  - NDCG: Ranking quality with graded, position-discounted relevance
- Generation Metrics:
  - Faithfulness: Does the answer match the retrieved context?
  - Answer relevance: Does the answer address the question?
  - Context precision: Are the retrieved docs relevant?
- End-to-End:
  - Latency (p50, p95, p99)
  - Cost per query
  - User satisfaction / thumbs-up rate
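The retrieval metrics are straightforward to compute offline; a minimal sketch for Recall@K and MRR (doc ids and relevance labels are assumed to come from your evaluation set):

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant docs that appear in the top-k retrieved list."""
    return len(set(retrieved[:k]) & relevant) / len(relevant) if relevant else 0.0

def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average of 1/rank of the first relevant doc, over (retrieved, relevant) pairs."""
    total = 0.0
    for retrieved, relevant in runs:
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(runs)
```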
The Evolution Timeline
2018: Naive RAG
  ↓
2020: Dense retrieval (DPR, REALM)
  ↓
2021: Hybrid retrieval + reranking
  ↓
2022: Query transformation (HyDE)
  ↓
2023: Agentic RAG, Self-RAG, GraphRAG
  ↓
2024: Multi-modal RAG (text + images + tables)
  ↓
2025: You are here → Compound AI systems
What This Codebase Has vs. Modern RAG
| Feature | This Codebase | Modern RAG |
|---|---|---|
| Query processing | Direct embed | HyDE, rewriting, decomposition |
| Retrieval | Single-stage dense | Multi-stage hybrid (BM25 + dense + rerank) |
| Diversity | MMR only | MMR + filtering + clustering |
| Iteration | ReSP has basic iteration | Agentic, self-correcting |
| Citations | Model IDs only | Document snippets, page numbers, URLs |
| Evaluation | External script | Built-in relevance checking |
The Bottom Line
Naive RAG: 40-60% accuracy on complex questions
Modern RAG: 70-85% accuracy on the same questions
But: 5-10x more complex, 2-5x slower, 3-10x more expensive
The trick is choosing which techniques fit your specific use case.
Want to implement any of these improvements to benchmark against the current system?