# Evaluating LLM Outputs: Beyond BLEU Scores

A practical guide to evaluating Large Language Model outputs using automated metrics, LLM-as-judge, and hybrid approaches.
## The Challenge
Traditional NLG metrics like BLEU and ROUGE were designed for machine translation and extractive summarization. They fail to capture:
- Semantic equivalence: "Paris is France's capital" vs "The capital of France is Paris"
- Factual correctness: Plausible but wrong answers
- Instruction following: Did it actually answer the question?
- Stylistic quality: Coherence, conciseness, helpfulness
## Evaluation Approaches

### 1. Automated Metrics

#### N-gram Overlap (BLEU, ROUGE)
```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "The cat sat on the mat".split()
candidate = "The cat is on the mat".split()

# Smoothing is needed here: without it, the zero 4-gram match drives
# the sentence-level score to near zero despite the obvious overlap.
bleu = sentence_bleu([reference], candidate,
                     smoothing_function=SmoothingFunction().method1)
```
- **Pros:** Fast, reproducible, no API costs
- **Cons:** Misses semantic similarity, favors lexical overlap
#### Semantic Similarity
```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')
ref_emb = model.encode("Paris is France's capital")
pred_emb = model.encode("The capital of France is Paris")

# cos_sim returns a 1x1 tensor; .item() extracts the float
similarity = util.cos_sim(ref_emb, pred_emb).item()  # ~0.95
```
- **Pros:** Captures semantic equivalence
- **Cons:** Not task-specific, no reasoning about correctness
### 2. LLM-as-Judge
Use a strong LLM (GPT-4, Claude Opus) to evaluate other LLM outputs.
```python
import anthropic

def build_judge_prompt(question: str, response: str) -> str:
    return f"""Evaluate this response on CORRECTNESS (1-5):

Question: {question}
Response: {response}

Score 1-5 where:
1 = Completely incorrect
3 = Partially correct
5 = Completely correct

Respond with JSON: {{"score": <1-5>, "reasoning": "<brief explanation>"}}"""

# Use Claude to judge (q and r are the question/response under evaluation)
client = anthropic.Anthropic()
judge_response = client.messages.create(
    model="claude-opus-4.6",
    max_tokens=512,
    messages=[{"role": "user", "content": build_judge_prompt(q, r)}],
)
```
- **Pros:** Captures nuanced quality, task-specific evaluation
- **Cons:** Expensive, slower, potential bias
### 3. Hybrid Approach
Combine both for cost-effective evaluation:
- Filter with automated metrics: Remove obviously bad outputs (low BLEU/similarity)
- Judge high-stakes cases: Use LLM-as-judge for borderline or production-critical outputs
- Sample for human eval: Validate judge scores on representative subset
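The three-stage routing above can be sketched as a single dispatch function. Here `semantic_sim` and `llm_judge` are hypothetical caller-supplied callables standing in for whatever metric and judge you actually use:

```python
import random

def hybrid_eval(output, semantic_sim, llm_judge, sample_rate=0.05):
    """Route one output through the hybrid pipeline.

    semantic_sim(output) -> float in [0, 1]; llm_judge(output) -> float
    in [1, 5]. Both are hypothetical placeholders for your own metrics.
    """
    sim = semantic_sim(output)
    if sim < 0.3:
        # Stage 1: automated filter removes obviously bad outputs
        return {"verdict": "fail", "stage": "automated", "sim": sim}
    if sim > 0.95:
        # Clearly good: skip the expensive judge entirely
        result = {"verdict": "pass", "stage": "automated", "sim": sim}
    else:
        # Stage 2: borderline cases go to the LLM judge
        score = llm_judge(output)
        result = {"verdict": "pass" if score >= 3 else "fail",
                  "stage": "judge", "sim": sim, "judge_score": score}
    # Stage 3: flag a random sample for human validation of judge scores
    result["human_review"] = random.random() < sample_rate
    return result
```

The thresholds (0.3, 0.95, 3) are illustrative and should be calibrated against a human-labeled sample for your task.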
## LLM-as-Judge Best Practices

### Multi-Criteria Evaluation

Don't ask for a single "quality" score. Separate concerns:
```python
CRITERIA = {
    "correctness": "Is the information factually accurate?",
    "completeness": "Does it address all parts of the question?",
    "coherence": "Is the logic clear and easy to follow?",
    "conciseness": "Is it appropriately brief without redundancy?",
    "helpfulness": "Would this actually help the user?",
}
```
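A small builder can turn a criteria dict like `CRITERIA` into one judge prompt that asks for a separate score per criterion. This is a sketch; the exact wording is an assumption you would tune:

```python
def build_multi_criteria_prompt(question: str, response: str,
                                criteria: dict) -> str:
    """Ask the judge for one 1-5 score per criterion, returned as JSON."""
    bullet_lines = "\n".join(f"- {name}: {desc}"
                             for name, desc in criteria.items())
    json_keys = ", ".join(f'"{name}": <1-5>' for name in criteria)
    return (
        f"Question: {question}\n"
        f"Response: {response}\n\n"
        f"Score the response 1-5 on each criterion:\n{bullet_lines}\n\n"
        f"Respond ONLY with JSON: {{{json_keys}}}"
    )
```

Keeping criteria in one call is cheaper than one call per criterion, at the cost of some cross-criterion contamination in the judge's scores.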
### Use Reference Answers Carefully

**With reference:**

```
Question: What is 2+2?
Response: Four
Reference: The answer is 4

Judge: "Response is semantically correct despite wording difference. 5/5"
```

**Without reference** (when references are low-quality or unavailable):

```
Question: Explain quantum entanglement simply
Response: <candidate answer>

Judge: "Evaluate based on accuracy, clarity, and appropriate simplicity"
```
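One way to handle both cases is to make the reference optional in the prompt builder. A sketch, extending the earlier `build_judge_prompt`; the rubric wording is an assumption:

```python
from typing import Optional

def build_judge_prompt(question: str, response: str,
                       reference: Optional[str] = None) -> str:
    """Include the reference answer only when one is available and trusted."""
    prompt = f"Question: {question}\nResponse: {response}\n"
    if reference is not None:
        prompt += f"Reference answer: {reference}\n"
        rubric = ("Score 1-5 for semantic agreement with the reference; "
                  "ignore superficial wording differences.")
    else:
        rubric = ("No reference is available. Score 1-5 based on accuracy, "
                  "clarity, and appropriate simplicity.")
    return (prompt + rubric +
            '\nRespond with JSON: {"score": <1-5>, "reasoning": "<brief>"}')
```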
### Structured Output Format

Always request JSON for parsing:

```python
JUDGE_TEMPLATE = """
Evaluate the response:

{evaluation_criteria}

Respond ONLY with valid JSON:
{{
  "score": <1-5>,
  "reasoning": "<1-2 sentence explanation>",
  "key_issues": ["issue1", "issue2"]  // optional
}}
"""
```
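Even with "Respond ONLY with valid JSON" in the prompt, judges occasionally wrap the object in prose or a markdown fence, so it pays to parse defensively. A minimal sketch:

```python
import json
import re

def parse_judge_json(raw: str) -> dict:
    """Extract the first JSON object from a judge reply.

    Tries a direct parse first, then falls back to the first {...} span,
    which handles replies wrapped in prose or ```json fences.
    """
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        match = re.search(r"\{.*\}", raw, re.DOTALL)
        if match is None:
            raise ValueError(f"no JSON object found in: {raw!r}")
        return json.loads(match.group(0))
```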
## Validation: Does LLM-as-Judge Work?

### Correlation with Human Judgment

From my experiments on TruthfulQA (n=200):
| Judge Model | Correlation (ρ) | Cost/eval |
|---|---|---|
| GPT-4 | 0.79 | $0.003 |
| Claude Opus | 0.82 | $0.004 |
| Claude Sonnet | 0.76 | $0.001 |
| GPT-3.5 | 0.61 | $0.0002 |
**Finding:** Claude Sonnet offers the best cost-quality tradeoff for most cases.
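Assuming ρ here denotes Spearman rank correlation (the usual convention for that symbol), it can be computed on paired judge/human scores with just the standard library:

```python
def rankdata(values):
    """Average ranks, 1-indexed; tied values share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1  # extend the tie group
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks

def spearman_rho(judge_scores, human_scores):
    """Spearman's rho = Pearson correlation of the rank vectors."""
    rx, ry = rankdata(judge_scores), rankdata(human_scores)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

In practice `scipy.stats.spearmanr` does the same thing and also returns a p-value.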
### Judge Consistency

Test the judge with the same inputs across repeated runs:

```python
import numpy as np

scores = [judge_eval(q, r) for _ in range(10)]
std_dev = np.std(scores)  # should be < 0.3 at temperature=0
```

**Rule of thumb:** use temperature=0 for reproducible evaluation.
### Bias Detection

Judges can be biased toward:

- **Length:** Longer = better (even if verbose)
- **Formatting:** Markdown and bullets score higher
- **Position:** The first option in pairwise comparisons

Mitigation:

- Test with intentionally verbose/terse examples
- Randomize order in pairwise comparisons
- Include explicit anti-verbosity criteria
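Position bias in pairwise comparisons can be averaged out by randomizing presentation order. A sketch, where `judge(first, second)` is a hypothetical callable returning which position won:

```python
import random

def pairwise_judge_debiased(judge, item_a, item_b, n_trials=4, rng=None):
    """Randomize presentation order so position bias averages out.

    judge(first, second) -> "first" or "second" (a placeholder for your
    real pairwise judge call); the verdict is mapped back to A vs. B.
    """
    rng = rng or random.Random()
    wins_a = 0
    for _ in range(n_trials):
        if rng.random() < 0.5:
            # A shown first
            wins_a += judge(item_a, item_b) == "first"
        else:
            # A shown second
            wins_a += judge(item_b, item_a) == "second"
    return wins_a / n_trials  # fraction of trials won by A
```

A judge with pure position bias converges to ~0.5 here, while a judge that tracks content gives the same verdict regardless of order.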
## Cost Optimization

### 1. Stratified Sampling

Don't evaluate everything with LLM-as-judge:
```python
def should_judge(auto_scores: dict) -> bool:
    """Decide whether the expensive LLM judge is needed."""
    if auto_scores['semantic_sim'] > 0.95:
        return False  # clearly good
    if auto_scores['semantic_sim'] < 0.3:
        return False  # clearly bad
    return True  # borderline: needs the judge
```
### 2. Batch Evaluation

Evaluate multiple items in one API call:

```python
batch_prompt = f"""Evaluate these 5 responses in one JSON array:
1. Q: {q1}
   R: {r1}
2. Q: {q2}
   R: {r2}
...
Respond: [{{"id": 1, "score": X, "reasoning": "..."}}, ...]"""
```

**Savings:** ~40% compared to individual calls (the fixed prompt overhead is amortized)
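A sketch of the batch pattern, including a validation step that catches the most common failure mode (the judge skipping or duplicating item ids); the prompt wording is an assumption:

```python
import json

def build_batch_prompt(pairs):
    """pairs: list of (question, response) tuples judged in one call."""
    items = "\n".join(
        f"{i}. Q: {q}\n   R: {r}" for i, (q, r) in enumerate(pairs, start=1)
    )
    return (
        f"Evaluate these {len(pairs)} responses on correctness (1-5).\n"
        f"{items}\n"
        'Respond ONLY with a JSON array: '
        '[{"id": 1, "score": <1-5>, "reasoning": "..."}, ...]'
    )

def parse_batch_scores(raw, expected_n):
    """Check the returned array covers every item exactly once."""
    results = json.loads(raw)
    ids = sorted(item["id"] for item in results)
    if ids != list(range(1, expected_n + 1)):
        raise ValueError(f"judge returned ids {ids}, expected 1..{expected_n}")
    return {item["id"]: item["score"] for item in results}
```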
### 3. Cheaper Models for Simpler Tasks
| Task Complexity | Recommended Judge |
|---|---|
| Factual QA | Sonnet / GPT-4o-mini |
| Creative writing | Opus / GPT-4 |
| Code correctness | Opus + execution tests |
| Summarization | Sonnet |
## Production Patterns

### Pattern 1: Fast Automated + Sampled Judge
```python
import random
import warnings

# Evaluate everything with automated metrics (fast)
auto_scores = [automated_eval(r) for r in responses]

# Judge a sample for calibration
sample = random.sample(responses, min(50, len(responses)))
judge_scores = [llm_judge(r) for r in sample]

# Compare automated vs. judge scores on the *same* sampled items
auto_on_sample = [automated_eval(r) for r in sample]
correlation = compute_correlation(auto_on_sample, judge_scores)
if correlation < 0.7:
    warnings.warn("Automated metrics may not be reliable for this task")
```
### Pattern 2: Progressive Evaluation

```python
def progressive_eval(response, is_production_critical: bool = False):
    # Stage 1: cheap automated filter
    if automated_score(response) < 0.3:
        return {"passed": False, "stage": "automated"}

    # Stage 2: mid-tier judge
    judge_score = llm_judge(response, model="sonnet")
    if judge_score < 3.0:
        return {"passed": False, "stage": "sonnet_judge"}

    # Stage 3: human review (production-critical only)
    if is_production_critical:
        human_score = request_human_eval(response)
        return {"passed": human_score >= 4, "stage": "human"}

    return {"passed": True, "stage": "sonnet_judge"}
```
### Pattern 3: A/B Testing with Confidence

```python
import random
import numpy as np
from scipy.stats import ttest_ind

def compare_models(model_a_outputs, model_b_outputs, n_judge=100):
    """Statistical comparison with mixed evaluation."""
    # Automated metrics on everything
    auto_a = [automated_eval(r) for r in model_a_outputs]
    auto_b = [automated_eval(r) for r in model_b_outputs]

    # Judge a sample for a higher-confidence comparison
    sample_indices = random.sample(range(len(model_a_outputs)), n_judge)
    judge_a = [llm_judge(model_a_outputs[i]) for i in sample_indices]
    judge_b = [llm_judge(model_b_outputs[i]) for i in sample_indices]

    # Statistical test (the samples are index-paired, so scipy's
    # ttest_rel would give a more powerful paired test)
    t_stat, p_value = ttest_ind(judge_a, judge_b)
    return {
        "auto_mean_a": np.mean(auto_a),
        "auto_mean_b": np.mean(auto_b),
        "judge_mean_a": np.mean(judge_a),
        "judge_mean_b": np.mean(judge_b),
        "significant": p_value < 0.05,
    }
```
## Common Pitfalls

### ❌ Using BLEU for non-translation tasks

- **Problem:** BLEU penalizes valid paraphrases
- **Fix:** Use semantic similarity or LLM-as-judge

### ❌ Trusting the judge without validation

- **Problem:** Judges can be confidently wrong
- **Fix:** Validate on a human-annotated subset

### ❌ Evaluating on training data

- **Problem:** Overfitting to specific phrasings
- **Fix:** Hold out an evaluation set, test distribution shift

### ❌ Ignoring evaluation cost

- **Problem:** Expensive evaluation limits experimentation
- **Fix:** Use stratified sampling, cheaper models for filtering
## Resources
- Judging LLM-as-a-Judge (Zheng et al., 2023)
- G-Eval: NLG Evaluation using GPT-4
- My evaluation framework implementation
## Key Takeaways
- No single metric is sufficient - use multiple complementary approaches
- LLM-as-judge is powerful but expensive - use strategically
- Validate judge scores against human judgment on representative samples
- Optimize for cost with stratified sampling and model selection
- Separate evaluation criteria - correctness, completeness, coherence, etc.
*This is part of my AI research portfolio exploring practical approaches to production LLM systems.*