
Long Context LLMs: Memory vs Retrieval

How do we handle conversations and documents that exceed context windows? Two approaches dominate: retrieval and long context, plus a hybrid that combines them.

Modern LLMs have expanding context windows (GPT-4 Turbo: 128k, Claude 3: 200k, Gemini 1.5: 1M tokens), but naive approaches fail at scale. This post explores when to use retrieval vs when to leverage long context directly.

The Long Context Problem

Scenario: User wants to chat with a 500-page technical manual (≈500k tokens).

Challenge: How do we make this work?

Options:
  1. Retrieval-Augmented Generation (RAG): Retrieve relevant chunks, feed them to the LLM
  2. Long Context: Feed the entire document into a 1M-token context window
  3. Hybrid: Combine both approaches

Let's explore the trade-offs.


Approach 1: Retrieval-Augmented Generation (RAG)

Strategy: Retrieve only relevant sections, pass to LLM.

Architecture

def rag_query(query, document_chunks, top_k=5):
    """
    RAG: Retrieve then generate

    Context usage: ~5k tokens (top-k chunks)
    Cost: Low ($0.01 per query)
    Latency: Medium (retrieval + generation)
    """
    # 1. Embed query
    query_embedding = embed(query)

    # 2. Retrieve top-k relevant chunks
    relevant_chunks = vector_db.search(
        query_embedding,
        top_k=top_k
    )

    # 3. Generate with retrieved context
    context = "\n\n".join([chunk.text for chunk in relevant_chunks])

    prompt = f"""
    Context:
    {context}

    Question: {query}

    Answer based only on the context above:
    """

    return llm.generate(prompt)

Advantages

Cost-effective: Only process a small context
  • 5k tokens vs 500k tokens (100× cheaper)
  • Example: $0.01 vs $1.00 per query

Fast: Minimal latency
  • Retrieval: ~50ms (FAISS/Chroma)
  • Generation: ~1s (5k context)
  • Total: ~1.05s

Scalable: Works for unlimited document sizes
  • 1GB PDF? Same cost as 1MB PDF
  • Just add more chunks to the vector DB

Dynamic updates: Easy to add/remove content
  • Add new documents without reprocessing the entire corpus

Disadvantages

Lossy: Might miss important context
  • Top-k retrieval might not capture everything
  • Cross-reference information gets lost

Chunking challenges: How to split optimally?
  • Fixed-size: Breaks semantic units
  • Semantic: More expensive, still imperfect
  • Sentence-level: Loses paragraph context

Multi-hop reasoning fails: Can't connect distant facts
  • "What did the author say about X in Chapter 2 vs Chapter 10?"
  • Retrieval might get Chapter 2 or Chapter 10, but not both

Ordering matters: Chronological queries break
  • "How did the methodology evolve throughout the paper?"
  • Chunks lose temporal structure
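
The chunking trade-offs above can be sketched with a minimal fixed-size chunker with overlap. This is illustrative only: it splits on characters, and the `chunk_size`/`overlap` defaults are assumptions; production systems typically split on token counts or semantic boundaries.

```python
def chunk_text(text, chunk_size=1000, overlap=200):
    """Fixed-size chunking with overlap.

    Simple and fast, but can split sentences and paragraphs
    mid-thought (the 'breaks semantic units' problem above).
    Overlap preserves some context across chunk boundaries.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

# Adjacent chunks share `overlap` characters:
print(chunk_text("abcdef", chunk_size=4, overlap=2))  # ['abcd', 'cdef', 'ef']
```

Larger overlap reduces boundary loss at the cost of more redundant tokens in the vector DB.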


Approach 2: Long Context LLMs

Strategy: Feed entire document into context window.

Architecture

def long_context_query(query, full_document):
    """
    Long context: Process entire document

    Context usage: ~500k tokens (full doc)
    Cost: High ($1.00 per query)
    Latency: High (500k token processing)
    """
    prompt = f"""
    Document:
    {full_document}

    Question: {query}

    Answer based on the document above:
    """

    return llm.generate(
        prompt,
        max_context=1000000  # 1M token window
    )

Advantages

No information loss: Entire document available
  • Model sees everything
  • Cross-references preserved
  • Chronological order maintained

Multi-hop reasoning works: Can connect distant facts
  • "Compare methodology in Chapter 2 vs Chapter 10" ✓
  • Model can traverse the entire document

Simpler architecture: No chunking, no retrieval
  • Just prompt engineering
  • Fewer moving parts

Better for summarization: Holistic view
  • "Summarize key themes across the entire book"
  • Model has global context

Disadvantages

Expensive: Linear cost with document size
  • 500k tokens @ $0.002/1k = $1.00 per query
  • 100× more expensive than RAG

Slow: Processing time scales with length
  • 500k tokens ≈ 30-60s latency
  • User experience suffers

Attention dilution: The "lost in the middle" problem
  • Models perform worse on middle sections
  • Attention spreads thin over long contexts

Hard limits: Still bounded by the context window
  • 1M tokens ≈ 750k words ≈ 3,000 pages
  • What about larger corpora?
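
The linear scaling can be made concrete with a rough per-query estimator. The $0.002/1k input price and the prefill throughput figure are illustrative assumptions, not any provider's actual numbers.

```python
def estimate_query_cost(context_tokens, price_per_1k=0.002,
                        tokens_per_second=15_000):
    """Rough per-query cost and prefill latency for a given context size.

    Both constants are illustrative assumptions; substitute your
    provider's real pricing and observed throughput.
    """
    cost = context_tokens / 1000 * price_per_1k
    latency_s = context_tokens / tokens_per_second
    return cost, latency_s

# RAG-sized context vs full document:
print(estimate_query_cost(5_000))    # cheap, sub-second prefill
print(estimate_query_cost(500_000))  # ~$1.00 and tens of seconds
```

Because both cost and prefill latency are linear in context length, every token you can filter out before generation pays off twice.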


The "Lost in the Middle" Problem

Research finding: LLMs struggle with information in the middle of long contexts.

Experiment

def test_retrieval_position(model, context_length):
    """
    Insert key fact at different positions in long context
    Test if model can retrieve it
    """
    positions = [0.0, 0.25, 0.5, 0.75, 1.0]  # start, 25%, 50%, 75%, end

    results = {}
    for position in positions:
        # Insert fact at position
        context = create_context_with_fact_at(position, context_length)

        # Ask model to retrieve fact
        response = model.generate(f"{context}\n\nQuestion: What was the key fact?")

        # Evaluate accuracy
        results[position] = check_accuracy(response)

    return results

# Typical results (GPT-4 Turbo, 100k context):
# Position 0.0 (start):  95% accuracy
# Position 0.25:         85% accuracy
# Position 0.5 (middle): 60% accuracy  ← Lost in the middle!
# Position 0.75:         85% accuracy
# Position 1.0 (end):    95% accuracy

Implication: Even with long context, middle information gets "lost".

Mitigation:
  • Put important info at the start or end of the prompt
  • Use retrieval to surface key facts to the top
  • Hierarchical summarization


Approach 3: Hybrid (Best of Both Worlds)

Strategy: Use retrieval to filter, then leverage long context for reasoning.

Architecture

def hybrid_query(query, document_chunks, context_budget=100000):
    """
    Hybrid: Retrieve broader context, use long-context reasoning

    Context usage: ~100k tokens (top-20 chunks)
    Cost: Medium ($0.20 per query)
    Latency: Medium-High (retrieval + long context)
    """
    # 1. Retrieve top-20 chunks (broader than RAG)
    query_embedding = embed(query)
    relevant_chunks = vector_db.search(
        query_embedding,
        top_k=20  # More chunks than pure RAG
    )

    # 2. Include surrounding context for each chunk
    expanded_chunks = []
    for chunk in relevant_chunks:
        # Get ±2 chunks around each retrieved chunk
        surrounding = get_surrounding_chunks(chunk, window=2)
        expanded_chunks.extend(surrounding)

    # 3. Deduplicate and assemble
    unique_chunks = deduplicate(expanded_chunks)
    context = assemble_with_structure(unique_chunks)

    # 4. Use long context LLM for reasoning
    prompt = f"""
    Relevant sections from document:
    {context}

    Question: {query}

    Answer based on the sections above. You may need to connect
    information across multiple sections.
    """

    return llm.generate(prompt, max_context=200000)

When Hybrid Wins

Use case 1: Multi-hop questions
  • Query: "How did the author's views on X evolve from Chapter 1 to Chapter 10?"
  • Retrieval: Get both chapters
  • Long context: Reason across them

Use case 2: Comparison queries
  • Query: "Compare approach A vs approach B"
  • Retrieval: Get both approaches
  • Long context: Detailed comparison

Use case 3: Evidence gathering
  • Query: "Find all mentions of X and summarize the key takeaways"
  • Retrieval: Get all relevant sections
  • Long context: Synthesize across them


Decision Framework: Which Approach?

Use RAG when:

Cost-sensitive applications
  • High query volume (1M+ queries/day)
  • Budget constraints

Simple lookup queries
  • "What is the definition of X?"
  • "When was Y published?"
  • Single-hop reasoning

Extremely large corpora
  • 10,000+ documents
  • Multi-TB knowledge bases

Low-latency requirements
  • Chatbots with <2s response time
  • Real-time applications

Dynamic content
  • Frequently updated documents
  • Need to add/remove content easily

Use Long Context when:

Complex reasoning required
  • Multi-hop questions
  • Comparative analysis
  • Temporal reasoning

High accuracy critical
  • Cannot afford to miss information
  • Medical, legal, scientific domains

Small document sets
  • Single large document (book, manual, thesis)
  • <100k tokens total

Cost not a constraint
  • Research projects
  • High-value use cases

Use Hybrid when:

Medium complexity
  • More than simple lookup, less than full analysis
  • Need both retrieval efficiency and reasoning power

Moderate document sizes
  • 100k - 1M tokens
  • Sweet spot for hybrid

Quality > cost (but cost matters)
  • Willing to pay 10× more than RAG
  • But not 100× more for full long context
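
The framework above can be encoded as a simple router. This is a sketch: the token thresholds and the keyword-based complexity check are assumptions, and a production system would more likely use a trained query classifier.

```python
def choose_approach(corpus_tokens, query, cost_sensitive=False):
    """Route a query to RAG, hybrid, or long context using the
    heuristics from the decision framework above.

    Thresholds and keyword markers are illustrative assumptions.
    """
    # Crude proxy for multi-hop / comparative queries
    multi_hop_markers = ("compare", "evolve", "across", " vs ", "throughout")
    complex_query = any(m in query.lower() for m in multi_hop_markers)

    if corpus_tokens < 100_000 and not cost_sensitive:
        return "long_context"  # small doc: feed it all in
    if complex_query and corpus_tokens <= 1_000_000:
        return "hybrid"        # needs cross-section reasoning
    return "rag"               # simple lookup or very large corpus

print(choose_approach(500_000, "Compare approach A vs approach B"))  # hybrid
print(choose_approach(50_000_000, "When was Y published?"))          # rag
```

Routing per query (rather than picking one approach globally) lets cheap lookups stay cheap while reserving large context budgets for the queries that need them.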


Implementation Patterns

Pattern 1: Hierarchical Retrieval + Long Context

def hierarchical_hybrid(query, documents):
    """
    Step 1: Retrieve relevant documents (coarse)
    Step 2: Use long context within those documents (fine)
    """
    # Coarse: Which documents?
    relevant_docs = retrieve_documents(query, top_k=3)

    # Fine: Long context reasoning within them
    combined_docs = "\n\n---\n\n".join(relevant_docs)

    return llm.generate(
        f"Documents:\n{combined_docs}\n\nQuestion: {query}",
        max_context=200000
    )

Pattern 2: Iterative Retrieval + Context Expansion

def iterative_expansion(query, vector_db, max_iterations=3):
    """
    Iteratively expand context until sufficient information found
    """
    context_chunks = []
    current_query = query

    for i in range(max_iterations):
        # Retrieve based on current query
        new_chunks = vector_db.search(current_query, top_k=5)
        context_chunks.extend(new_chunks)

        # Generate intermediate answer
        context = assemble(context_chunks)
        response = llm.generate(
            f"Context: {context}\n\nQuestion: {query}\n\n"
            "If you need more information, say 'NEED_MORE: <what info>'."
        )

        # Check if model needs more info
        if "NEED_MORE:" in response:
            current_query = extract_info_need(response)
        else:
            return response  # Done!

    return response  # Max iterations reached; return best-effort answer

Pattern 3: Summarization Pyramid

def summarization_pyramid(large_document, max_context=100000):
    """
    Compress long documents via hierarchical summarization
    Fit result into long context window
    """
    # Level 1: Chunk document
    chunks = chunk_document(large_document, chunk_size=5000)

    # Level 2: Summarize each chunk
    summaries = [
        llm.generate(f"Summarize:\n{chunk}")
        for chunk in chunks
    ]

    # Level 3: Summarize summaries (assumes ~500 tokens per summary)
    if len(summaries) * 500 > max_context:
        summaries = [
            llm.generate(f"Summarize these summaries:\n{group}")
            for group in batch(summaries, n=10)
        ]

    # Final: Long context reasoning over compressed representation
    compressed_doc = "\n\n".join(summaries)

    return compressed_doc  # Can now fit in context window

Cost-Performance Trade-offs

Scenario: 500k token document, 10k queries/day

Approach             Cost/Query   Total/Day   Latency   Accuracy
RAG (top-5)          $0.01        $100        1.0s      75%
RAG (top-20)         $0.04        $400        1.2s      85%
Hybrid               $0.20        $2,000      3.0s      92%
Full Long Context    $1.00        $10,000     30s       95%

Observations:
  • RAG: 100× cheaper than long context, but a 20-point accuracy loss
  • Hybrid: 5× cheaper than long context, only a 3-point accuracy loss
  • Long context: Highest accuracy, but extreme cost and latency

Recommendation: Hybrid for most production use cases.


Optimizations

For RAG

1. Query Expansion

def expand_query(query):
    """Generate multiple query variants for better retrieval"""
    return llm.generate(
        f"Generate 3 alternative phrasings of: {query}"
    ).split('\n')

expanded_queries = expand_query(user_query)
# Flatten: retrieve() returns a list of chunks per query variant
all_chunks = [chunk for q in expanded_queries for chunk in retrieve(q)]
deduplicated = deduplicate(all_chunks)

2. Reranking

def rerank(query, chunks):
    """Use cross-encoder to rerank retrieved chunks"""
    scores = cross_encoder.predict([
        (query, chunk.text) for chunk in chunks
    ])
    return sorted(zip(chunks, scores), key=lambda x: x[1], reverse=True)

3. Contextual Chunk Embeddings

def embed_with_context(chunk, prev_chunk, next_chunk):
    """Embed chunk with its neighbors for better retrieval"""
    context = f"Previous: {prev_chunk}\n\n{chunk}\n\nNext: {next_chunk}"
    return embed(context)

For Long Context

1. Prompt Positioning

def position_optimized_prompt(query, document):
    """Put query at start AND end to avoid 'lost in the middle'"""
    return f"""
    Question: {query}

    Document:
    {document}

    Answer the question above based on the document.
    Question (reminder): {query}
    """

2. Chunked Processing with Memory

def process_with_memory(large_doc, chunk_size=50000):
    """Process large doc in chunks, maintain running summary"""
    chunks = split(large_doc, chunk_size)
    memory = ""

    for chunk in chunks:
        prompt = f"Memory: {memory}\n\nNew content: {chunk}\n\nUpdate memory:"
        memory = llm.generate(prompt)

    return memory  # Compressed representation


Future Directions

1. Sparse Attention Mechanisms

  • Longformer, BigBird: O(n) instead of O(n²)
  • Trade accuracy for efficiency

2. Retrieval-Interleaved Generation

  • Models that retrieve mid-generation
  • RETRO, kNN-LM patterns

3. External Memory Systems

  • Vector DBs as external memory
  • Learn when to retrieve vs use parametric knowledge

4. Infinite Context

  • RMT (Recurrent Memory Transformer)
  • Compress arbitrarily long contexts

Key Takeaways

  1. RAG wins on cost/latency: 100× cheaper, 10× faster
  2. Long context wins on accuracy: No information loss, multi-hop reasoning
  3. Hybrid is the sweet spot: 5× cheaper than long context, with nearly the same accuracy (92% vs 95%)
  4. "Lost in the middle" is real: Position matters in long contexts
  5. Use case determines approach: Simple lookup → RAG, complex reasoning → long context
  6. Optimize retrieval first: Query expansion, reranking, contextual embeddings
  7. Context budgets matter: Track tokens like you track dollars

Production Recommendations

Start with: Hybrid approach (retrieval + moderate long context)

Optimize for:
  • Cost: Aggressive retrieval filtering, smaller top-k
  • Latency: Parallel retrieval, streaming generation
  • Quality: Larger context budgets, reranking

Monitor:
  • Query complexity distribution (simple vs multi-hop)
  • Retrieval quality (precision@k, recall@k)
  • Cost per query (track context usage)
  • User satisfaction (thumbs up/down)

Iterate:
  • A/B test RAG vs hybrid vs long context
  • Tune retrieval parameters (top-k, reranking threshold)
  • Adjust context budgets based on query type
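
The retrieval-quality metrics mentioned above (precision@k, recall@k) can be computed offline against a labeled evaluation set. A minimal sketch, assuming `retrieved` is an ordered list of chunk IDs and `relevant` is a set of ground-truth relevant IDs:

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved chunks that are relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    return sum(1 for c in top_k if c in relevant) / len(top_k)

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant chunks found in the top-k."""
    if not relevant:
        return 0.0
    return sum(1 for c in retrieved[:k] if c in relevant) / len(relevant)

retrieved = ["c1", "c7", "c3", "c9", "c2"]
relevant = {"c3", "c2", "c8"}
print(precision_at_k(retrieved, relevant, 5))  # 0.4
print(recall_at_k(retrieved, relevant, 5))     # 2 of 3 relevant found
```

Tracking these over time tells you whether a quality regression comes from retrieval (falling recall@k) or from generation, which determines where to spend tuning effort.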



Part of my technical blog exploring practical AI engineering. See my AI Research Portfolio for implementations.