Long Context LLMs: Memory vs Retrieval
How do we handle conversations and documents that exceed context windows? Two approaches dominate.
Modern LLMs have expanding context windows (GPT-4 Turbo: 128k, Claude 3: 200k, Gemini 1.5: 1M tokens), but naive approaches fail at scale. This post explores when to use retrieval vs when to leverage long context directly.
The Long Context Problem
Scenario: User wants to chat with a 500-page technical manual (≈500k tokens).
Challenge: How do we make this work?
Options:
1. Retrieval-Augmented Generation (RAG): Retrieve relevant chunks, feed them to the LLM
2. Long Context: Feed the entire document into a 1M-token context window
3. Hybrid: Combine both approaches
Let's explore the trade-offs.
Approach 1: Retrieval-Augmented Generation (RAG)
Strategy: Retrieve only relevant sections, pass to LLM.
Architecture
def rag_query(query, document_chunks, top_k=5):
    """
    RAG: Retrieve then generate
    Context usage: ~5k tokens (top-k chunks)
    Cost: Low ($0.01 per query)
    Latency: Medium (retrieval + generation)
    """
    # 1. Embed query
    query_embedding = embed(query)

    # 2. Retrieve top-k relevant chunks
    relevant_chunks = vector_db.search(
        query_embedding,
        top_k=top_k,
    )

    # 3. Generate with retrieved context
    context = "\n\n".join(chunk.text for chunk in relevant_chunks)
    prompt = f"""
Context:
{context}

Question: {query}

Answer based only on the context above:
"""
    return llm.generate(prompt)
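The `vector_db.search` call above does the heavy lifting. As a self-contained sketch, brute-force cosine similarity over bag-of-words vectors shows the mechanics; a real system would use an embedding model plus FAISS or Chroma, so `build_vocab`, `embed`, and `search` here are toy stand-ins:

```python
import numpy as np

def build_vocab(texts):
    """Map each distinct token to a vector dimension."""
    tokens = sorted({tok for text in texts for tok in text.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}

def embed(text, vocab):
    """Toy bag-of-words 'embedding', L2-normalized so dot product = cosine."""
    vec = np.zeros(len(vocab))
    for tok in text.lower().split():
        if tok in vocab:
            vec[vocab[tok]] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def search(query_vec, chunk_matrix, top_k=5):
    """Brute-force nearest-neighbor search by cosine similarity."""
    scores = chunk_matrix @ query_vec
    top_idx = np.argsort(scores)[::-1][:top_k]
    return top_idx, scores[top_idx]

chunks = [
    "the cat sat on the mat",
    "stock prices fell sharply today",
    "a cat chased a small mouse",
]
vocab = build_vocab(chunks)
chunk_matrix = np.stack([embed(c, vocab) for c in chunks])
top_idx, scores = search(embed("cat", vocab), chunk_matrix, top_k=2)
# Both cat-related chunks outrank the finance chunk.
```

Dedicated vector databases replace the exhaustive dot product with approximate nearest-neighbor indexes, which is what keeps retrieval around ~50ms at scale.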
Advantages
✅ Cost-effective: Only a small context is processed
- 5k tokens vs 500k tokens (100× cheaper)
- Example: $0.01 vs $1.00 per query

✅ Fast: Minimal latency
- Retrieval: ~50ms (FAISS/Chroma)
- Generation: ~1s (5k context)
- Total: ~1.05s

✅ Scalable: Works for unlimited document sizes
- A 1GB PDF costs the same per query as a 1MB PDF
- Just add more chunks to the vector DB

✅ Dynamic updates: Easy to add/remove content
- Add new documents without reprocessing the entire corpus
Disadvantages
❌ Lossy: Might miss important context
- Top-k retrieval might not capture everything
- Cross-reference information gets lost

❌ Chunking challenges: How to split optimally?
- Fixed-size: Breaks semantic units
- Semantic: More expensive, still imperfect
- Sentence-level: Loses paragraph context

❌ Multi-hop reasoning fails: Can't connect distant facts
- "What did the author say about X in Chapter 2 vs Chapter 10?"
- Retrieval might get Chapter 2 or Chapter 10, but not both

❌ Ordering matters: Chronological queries break
- "How did the methodology evolve throughout the paper?"
- Chunks lose temporal structure
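To make the chunking trade-off concrete, here is a minimal fixed-size chunker with overlap (character-based for simplicity; token-based works the same way). Overlap softens, but does not eliminate, the risk of splitting a sentence or cross-reference across a boundary:

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Fixed-size chunking with overlap between consecutive chunks."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

doc = "".join(str(i % 10) for i in range(500))  # toy 500-char document
chunks = chunk_text(doc)
# Chunks start at offsets 0, 150, 300, 450; each consecutive pair
# shares a 50-character overlap, and the last chunk is shorter.
```

Note that a sentence straddling offset 200 is still split between chunks 0 and 1; only content inside the 50-character overlap survives in both.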
Approach 2: Long Context LLMs
Strategy: Feed entire document into context window.
Architecture
def long_context_query(query, full_document):
    """
    Long context: Process entire document
    Context usage: ~500k tokens (full doc)
    Cost: High ($1.00 per query)
    Latency: High (500k token processing)
    """
    prompt = f"""
Document:
{full_document}

Question: {query}

Answer based on the document above:
"""
    return llm.generate(
        prompt,
        max_context=1_000_000,  # 1M token window
    )
Advantages
✅ No information loss: Entire document available
- Model sees everything
- Cross-references preserved
- Chronological order maintained

✅ Multi-hop reasoning works: Can connect distant facts
- "Compare methodology in Chapter 2 vs Chapter 10" ✓
- Model can traverse the entire document

✅ Simpler architecture: No chunking, no retrieval
- Just prompt engineering
- Fewer moving parts

✅ Better for summarization: Holistic view
- "Summarize key themes across the entire book"
- Model has global context
Disadvantages
❌ Expensive: Cost grows linearly with document size
- 500k tokens @ $0.002/1k = $1.00 per query
- 100× more expensive than RAG

❌ Slow: Processing time scales with length
- 500k tokens ≈ 30-60s latency
- User experience suffers

❌ Attention dilution: The "lost in the middle" problem
- Models perform worse on middle sections
- Attention spreads thin over long contexts

❌ Hard limits: Still bounded by the context window
- 1M tokens ≈ 750k words (roughly 1,000-3,000 pages, depending on density)
- What about larger corpora?
The "Lost in the Middle" Problem
Research finding: LLMs struggle with information in the middle of long contexts.
Experiment
def test_retrieval_position(model, context_length):
    """
    Insert a key fact at different positions in a long context,
    then test whether the model can retrieve it.
    """
    positions = [0.0, 0.25, 0.5, 0.75, 1.0]  # start, 25%, 50%, 75%, end
    results = {}
    for position in positions:
        # Insert fact at position
        context = create_context_with_fact_at(position, context_length)
        # Ask model to retrieve fact
        response = model.generate(f"{context}\n\nQuestion: What was the key fact?")
        # Evaluate accuracy
        results[position] = check_accuracy(response)
    return results

# Typical results (GPT-4 Turbo, 100k context):
# Position 0.0  (start):  95% accuracy
# Position 0.25:          85% accuracy
# Position 0.5  (middle): 60% accuracy  ← Lost in the middle!
# Position 0.75:          85% accuracy
# Position 1.0  (end):    95% accuracy
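The `create_context_with_fact_at` helper is left undefined above. A minimal sketch builds a synthetic haystack and drops a needle fact at a fractional position; the filler text and the fact itself are arbitrary placeholders, and `context_length` is interpreted here as a number of filler sentences:

```python
def create_context_with_fact_at(position, context_length,
                                fact="The key fact is: the vault code is 7491."):
    """Build `context_length` filler sentences and insert `fact` at a
    fractional position (0.0 = start, 1.0 = end)."""
    filler = [f"This is filler sentence number {i}." for i in range(context_length)]
    insert_at = min(int(position * context_length), context_length)
    return " ".join(filler[:insert_at] + [fact] + filler[insert_at:])

ctx = create_context_with_fact_at(0.5, 100)
# The fact lands roughly in the middle of the haystack.
```

Real needle-in-a-haystack benchmarks use natural-looking filler (e.g. essay text) so the model cannot spot the needle by style alone.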
Implication: Even with long context, middle information gets "lost".
Mitigation:
- Put important info at the start or end of the prompt
- Use retrieval to surface key facts to the top
- Hierarchical summarization
Approach 3: Hybrid (Best of Both Worlds)
Strategy: Use retrieval to filter, then leverage long context for reasoning.
Architecture
def hybrid_query(query, document_chunks, context_budget=100000):
    """
    Hybrid: Retrieve broader context, use long-context reasoning
    Context usage: ~100k tokens (top-20 chunks)
    Cost: Medium ($0.20 per query)
    Latency: Medium-High (retrieval + long context)
    """
    # 1. Retrieve top-20 chunks (broader than pure RAG)
    query_embedding = embed(query)
    relevant_chunks = vector_db.search(
        query_embedding,
        top_k=20,  # more chunks than pure RAG
    )

    # 2. Include surrounding context for each chunk
    expanded_chunks = []
    for chunk in relevant_chunks:
        # Get ±2 chunks around each retrieved chunk
        surrounding = get_surrounding_chunks(chunk, window=2)
        expanded_chunks.extend(surrounding)

    # 3. Deduplicate and assemble
    unique_chunks = deduplicate(expanded_chunks)
    context = assemble_with_structure(unique_chunks)

    # 4. Use a long-context LLM for reasoning
    prompt = f"""
Relevant sections from document:
{context}

Question: {query}

Answer based on the sections above. You may need to connect
information across multiple sections.
"""
    return llm.generate(prompt, max_context=200000)
When Hybrid Wins
Use case 1: Multi-hop questions
- Query: "How did the author's views on X evolve from Chapter 1 to Chapter 10?"
- Retrieval: Get both chapters
- Long context: Reason across them

Use case 2: Comparison queries
- Query: "Compare approach A vs approach B"
- Retrieval: Get both approaches
- Long context: Detailed comparison

Use case 3: Evidence gathering
- Query: "Find all mentions of X and summarize the key takeaways"
- Retrieval: Get all relevant sections
- Long context: Synthesize across them
Decision Framework: Which Approach?
Use RAG when:
✅ Cost-sensitive applications
- High query volume (1M+ queries/day)
- Budget constraints

✅ Simple lookup queries
- "What is the definition of X?"
- "When was Y published?"
- Single-hop reasoning

✅ Extremely large corpora
- 10,000+ documents
- Multi-TB knowledge bases

✅ Low-latency requirements
- Chatbots with <2s response time
- Real-time applications

✅ Dynamic content
- Frequently updated documents
- Need to add/remove content easily
Use Long Context when:
✅ Complex reasoning required
- Multi-hop questions
- Comparative analysis
- Temporal reasoning

✅ High accuracy critical
- Cannot afford to miss information
- Medical, legal, scientific domains

✅ Small document sets
- Single large document (book, manual, thesis)
- <100k tokens total

✅ Cost not a constraint
- Research projects
- High-value use cases
Use Hybrid when:
✅ Medium complexity
- More than simple lookup, less than full analysis
- Need both retrieval efficiency and reasoning power

✅ Moderate document sizes
- 100k-1M tokens
- The sweet spot for hybrid

✅ Quality > cost (but cost matters)
- Willing to pay 10× more than RAG
- But not 100× more for full long context
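One way to operationalize this framework is a lightweight router that inspects each query before choosing a pipeline. The keyword cues and the 100k-token threshold below are illustrative placeholders, not a tuned classifier:

```python
def route_query(query, corpus_tokens):
    """Pick an approach per the decision framework above (crude heuristics)."""
    multi_hop_cues = ("compare", "evolve", "throughout", " vs ", "across")
    is_multi_hop = any(cue in query.lower() for cue in multi_hop_cues)
    if not is_multi_hop:
        return "rag"           # simple lookup: cheap and fast wins
    if corpus_tokens < 100_000:
        return "long_context"  # small corpus: just feed it all in
    return "hybrid"            # multi-hop over a large corpus

route_query("What is the definition of X?", 500_000)      # -> "rag"
route_query("Compare approach A vs approach B", 500_000)  # -> "hybrid"
route_query("How did X evolve in the paper?", 50_000)     # -> "long_context"
```

In production the keyword check would typically be replaced by a small classifier or an LLM call that labels query complexity.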
Implementation Patterns
Pattern 1: Hierarchical Retrieval + Long Context
def hierarchical_hybrid(query, documents):
    """
    Step 1: Retrieve relevant documents (coarse)
    Step 2: Use long context within those documents (fine)
    """
    # Coarse: which documents?
    relevant_docs = retrieve_documents(query, top_k=3)

    # Fine: long-context reasoning within them
    combined_docs = "\n\n---\n\n".join(relevant_docs)
    return llm.generate(
        f"Documents:\n{combined_docs}\n\nQuestion: {query}",
        max_context=200000,
    )
Pattern 2: Iterative Retrieval + Context Expansion
def iterative_expansion(query, vector_db, max_iterations=3):
    """
    Iteratively expand context until sufficient information is found
    """
    context_chunks = []
    current_query = query
    for i in range(max_iterations):
        # Retrieve based on current query
        new_chunks = vector_db.search(current_query, top_k=5)
        context_chunks.extend(new_chunks)

        # Generate intermediate answer
        context = assemble(context_chunks)
        response = llm.generate(
            f"Context: {context}\n\nQuestion: {query}\n\n"
            "If you need more information, say 'NEED_MORE: <what info>'."
        )

        # Check if model needs more info
        if "NEED_MORE:" in response:
            current_query = extract_info_need(response)
        else:
            return response  # done!
    return response
Pattern 3: Summarization Pyramid
def summarization_pyramid(large_document, max_context=100000):
    """
    Compress a long document via hierarchical summarization
    so the result fits into a long context window.
    """
    # Level 1: Chunk document
    chunks = chunk_document(large_document, chunk_size=5000)

    # Level 2: Summarize each chunk
    summaries = [
        llm.generate(f"Summarize:\n{chunk}")
        for chunk in chunks
    ]

    # Level 3: Summarize the summaries if still too large
    # (assumes ~500 tokens per summary)
    if len(summaries) * 500 > max_context:
        summaries = [
            llm.generate(f"Summarize these summaries:\n{group}")
            for group in batch(summaries, n=10)
        ]

    # Final: long-context reasoning over the compressed representation
    compressed_doc = "\n\n".join(summaries)
    return compressed_doc  # can now fit in the context window
Cost-Performance Trade-offs
Scenario: 500k token document, 10k queries/day
| Approach | Cost/Query | Total/Day | Latency | Accuracy |
|---|---|---|---|---|
| RAG (top-5) | $0.01 | $100 | 1.0s | 75% |
| RAG (top-20) | $0.04 | $400 | 1.2s | 85% |
| Hybrid | $0.20 | $2,000 | 3.0s | 92% |
| Full Long Context | $1.00 | $10,000 | 30s | 95% |
Observations:
- RAG: 100× cheaper than long context, but a 20-point accuracy loss
- Hybrid: 5× cheaper than long context, only a 3-point accuracy loss
- Long context: Highest accuracy, but extreme cost and latency
Recommendation: Hybrid for most production use cases.
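The cost column of the table follows directly from the $0.002/1k-token input price assumed earlier (output tokens ignored for simplicity):

```python
def query_cost(context_tokens, price_per_1k=0.002):
    """Input-token cost of a single query, in dollars."""
    return context_tokens / 1000 * price_per_1k

def daily_cost(context_tokens, queries_per_day=10_000):
    """Daily spend for the 10k-queries/day scenario above."""
    return query_cost(context_tokens) * queries_per_day

daily_cost(5_000)    # RAG top-5:           $100/day
daily_cost(100_000)  # hybrid:            $2,000/day
daily_cost(500_000)  # full long context: $10,000/day
```

Because cost is linear in context tokens, every chunk you can filter out before generation translates directly into savings at volume.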
Optimizations
For RAG
1. Query Expansion
def expand_query(query):
    """Generate multiple query variants for better retrieval"""
    return llm.generate(
        f"Generate 3 alternative phrasings of: {query}"
    ).split('\n')

expanded_queries = [user_query] + expand_query(user_query)
all_chunks = [chunk for q in expanded_queries for chunk in retrieve(q)]
deduplicated = deduplicate(all_chunks)
2. Reranking
def rerank(query, chunks):
    """Use a cross-encoder to rerank retrieved chunks"""
    scores = cross_encoder.predict([
        (query, chunk.text) for chunk in chunks
    ])
    return sorted(zip(chunks, scores), key=lambda x: x[1], reverse=True)
3. Contextual Chunk Embeddings
def embed_with_context(chunk, prev_chunk, next_chunk):
    """Embed a chunk together with its neighbors for better retrieval"""
    context = f"Previous: {prev_chunk}\n\n{chunk}\n\nNext: {next_chunk}"
    return embed(context)
For Long Context
1. Prompt Positioning
def position_optimized_prompt(query, document):
    """Put the query at the start AND end to avoid 'lost in the middle'"""
    return f"""
Question: {query}

Document:
{document}

Answer the question above based on the document.
Question (reminder): {query}
"""
2. Chunked Processing with Memory
def process_with_memory(large_doc, chunk_size=50000):
    """Process a large doc in chunks, maintaining a running summary"""
    chunks = split(large_doc, chunk_size)
    memory = ""
    for chunk in chunks:
        prompt = f"Memory: {memory}\n\nNew content: {chunk}\n\nUpdate memory:"
        memory = llm.generate(prompt)
    return memory  # compressed representation
Future Directions
1. Sparse Attention Mechanisms
- Longformer, BigBird: O(n) instead of O(n²)
- Trade accuracy for efficiency
2. Retrieval-Interleaved Generation
- Models that retrieve mid-generation
- RETRO, kNN-LM patterns
3. External Memory Systems
- Vector DBs as external memory
- Learn when to retrieve vs use parametric knowledge
4. Infinite Context
- RMT (Recurrent Memory Transformer)
- Compress arbitrarily long contexts
Key Takeaways
- RAG wins on cost/latency: roughly 100× cheaper and up to 30× faster
- Long context wins on accuracy: No information loss, multi-hop reasoning
- Hybrid is the sweet spot: 5× cheaper than long context with nearly the same accuracy (92% vs 95%)
- "Lost in the middle" is real: Position matters in long contexts
- Use case determines approach: Simple lookup → RAG, complex reasoning → long context
- Optimize retrieval first: Query expansion, reranking, contextual embeddings
- Context budgets matter: Track tokens like you track dollars
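"Track tokens like you track dollars" can start as a simple running counter per query type. This sketch assumes token counts arrive from your tokenizer upstream and reuses the $0.002/1k price from earlier:

```python
from collections import defaultdict

class ContextBudgetTracker:
    """Minimal per-query-type token and cost accounting."""

    def __init__(self, price_per_1k=0.002):
        self.price_per_1k = price_per_1k
        self.tokens = defaultdict(int)
        self.queries = defaultdict(int)

    def record(self, query_type, context_tokens):
        self.tokens[query_type] += context_tokens
        self.queries[query_type] += 1

    def cost(self, query_type):
        return self.tokens[query_type] / 1000 * self.price_per_1k

    def avg_tokens(self, query_type):
        return self.tokens[query_type] / max(self.queries[query_type], 1)

tracker = ContextBudgetTracker()
tracker.record("simple_lookup", 5_000)
tracker.record("simple_lookup", 5_000)
tracker.record("multi_hop", 100_000)
```

Feeding these per-type averages back into the routing decision closes the loop: query types that consistently burn large contexts are the first candidates for tighter retrieval filtering.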
Production Recommendations
Start with: Hybrid approach (retrieval + moderate long context)
Optimize for:
- Cost: Aggressive retrieval filtering, smaller top-k
- Latency: Parallel retrieval, streaming generation
- Quality: Larger context budgets, reranking

Monitor:
- Query complexity distribution (simple vs multi-hop)
- Retrieval quality (precision@k, recall@k)
- Cost per query (track context usage)
- User satisfaction (thumbs up/down)

Iterate:
- A/B test RAG vs hybrid vs long context
- Tune retrieval parameters (top-k, reranking threshold)
- Adjust context budgets based on query type
Further Reading
- Lost in the Middle: How Language Models Use Long Contexts
- Retrieval-Augmented Generation for Large Language Models: A Survey
- Long-Context LLMs: A Survey
- GitHub: Production RAG Implementation
Part of my technical blog exploring practical AI engineering. See my AI Research Portfolio for implementations.