How to Implement RAG Evaluation with RAGAS
Difficulty: ⭐⭐ Intermediate | Time: 2 hours
🎯 The Problem
Your RAG system is running, but you have no idea if it's actually good. You're deploying blind without metrics, can't measure improvements, and have no way to catch quality regressions. Manual testing doesn't scale.
This guide solves: Setting up automated evaluation with RAGAS metrics so you can measure quality, track improvements, and catch issues before users do.
⚡ TL;DR - Quick Start
```python
from packages.rag.evaluators import RAGASEvaluator, EvaluationSample

# 1. Create evaluator
evaluator = RAGASEvaluator(
    langsmith_api_key="your-key",
    langsmith_project="my-evals"
)

# 2. Create test samples
samples = [
    EvaluationSample(
        question="What is RecoAgent?",
        ground_truth="RecoAgent is an enterprise RAG platform...",
        answer="",    # Will be generated
        contexts=[]   # Will be retrieved
    )
]

# 3. Run evaluation
results = evaluator.evaluate_samples(samples)
print(f"✅ Precision: {results['context_precision']:.2f}")
print(f"✅ Recall: {results['context_recall']:.2f}")
```
Expected output: precision, recall, faithfulness, and relevancy scores.
Full Guide
This guide shows you how to set up comprehensive evaluation for your RAG system using RAGAS metrics, LangSmith integration, and automated evaluation pipelines.
Prerequisites
- Python 3.8+ installed
- RecoAgent installed (`pip install recoagent`)
- LangSmith API key (see the environment check below)
- Sample evaluation dataset
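Before diving in, confirm your credentials are available as environment variables. The snippet below is a quick sanity check; it assumes you use `LANGSMITH_API_KEY` and `OPENAI_API_KEY`, the variable names used later in this guide.

```python
import os
import sys

# Sanity check for the environment variables used later in this guide.
# Adjust the names if your deployment stores credentials differently.
required_vars = ["LANGSMITH_API_KEY", "OPENAI_API_KEY"]
missing = [name for name in required_vars if not os.getenv(name)]

if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
    sys.exit(1)
print("All required credentials found.")
```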
Step 1: Understanding RAGAS Metrics
RAGAS provides several key metrics for evaluating RAG systems:
- Context Precision: How relevant are the retrieved contexts?
- Context Recall: How much of the ground truth is covered by retrieved contexts?
- Faithfulness: How faithful is the generated answer to the retrieved contexts?
- Answer Relevancy: How relevant is the answer to the question?
- Answer Similarity: How similar is the answer to the ground truth?
```python
from packages.rag.evaluators import RAGASEvaluator, EvaluationSample
import os

# Initialize RAGAS evaluator
evaluator = RAGASEvaluator(
    langsmith_api_key=os.getenv("LANGSMITH_API_KEY"),
    langsmith_project="recoagent-eval"
)

print("RAGAS evaluator initialized with metrics:")
print("- Context Precision: Measures relevance of retrieved contexts")
print("- Context Recall: Measures coverage of ground truth")
print("- Faithfulness: Measures answer fidelity to contexts")
print("- Answer Relevancy: Measures answer relevance to question")
print("- Answer Similarity: Measures similarity to ground truth")
```
Step 2: Creating Evaluation Datasets
Create comprehensive evaluation datasets for your domain:
```python
from datetime import datetime
from typing import List, Dict, Any

def create_evaluation_dataset() -> List[EvaluationSample]:
    """Create a sample evaluation dataset."""
    samples = [
        EvaluationSample(
            question="What is RecoAgent and how does it work?",
            ground_truth="RecoAgent is an enterprise RAG platform built with LangGraph that combines hybrid retrieval, agent orchestration, and safety guardrails for production-ready AI applications.",
            answer="",    # Will be generated by the system
            contexts=[],  # Will be retrieved by the system
            source="knowledge_base",
            metadata={
                "domain": "general",
                "difficulty": "easy",
                "expected_sources": 3,
                "sample_id": "sample_001"
            }
        ),
        EvaluationSample(
            question="How do I implement hybrid retrieval with Reciprocal Rank Fusion?",
            ground_truth="Hybrid retrieval combines BM25 and vector search using Reciprocal Rank Fusion to merge results. Implement by creating BM25Retriever and VectorRetriever, then use HybridRetriever with RRF for optimal ranking.",
            answer="",
            contexts=[],
            source="technical_docs",
            metadata={
                "domain": "technical",
                "difficulty": "hard",
                "expected_sources": 5,
                "sample_id": "sample_002"
            }
        ),
        EvaluationSample(
            question="What safety features does RecoAgent provide?",
            ground_truth="RecoAgent includes NVIDIA NeMo Guardrails for input/output filtering, PII detection, topic restrictions, cost tracking, and escalation policies for enterprise safety requirements.",
            answer="",
            contexts=[],
            source="safety_docs",
            metadata={
                "domain": "safety",
                "difficulty": "medium",
                "expected_sources": 4,
                "sample_id": "sample_003"
            }
        )
    ]
    return samples

# Create evaluation dataset
eval_dataset = create_evaluation_dataset()
print(f"Created evaluation dataset with {len(eval_dataset)} samples")
```
Step 3: Setting Up Automated Evaluation Pipeline
Create an automated pipeline that generates answers and evaluates them:
```python
from packages.agents import RAGAgentGraph, AgentConfig, ToolRegistry
from packages.rag.retrievers import HybridRetriever, VectorRetriever, BM25Retriever
from packages.rag.stores import OpenSearchStore
import asyncio

async def evaluate_rag_system(agent, eval_samples):
    """Run evaluation on a RAG system."""
    evaluated_samples = []
    for sample in eval_samples:
        print(f"Evaluating sample: {sample.metadata['sample_id']}")
        try:
            # Generate answer using the agent
            result = await agent.run(sample.question, user_id="eval_user")

            # Update sample with generated answer
            sample.answer = result.get("answer", "")

            # Extract retrieved contexts from result metadata
            if "retrieved_docs" in result.get("metadata", {}):
                sample.contexts = [
                    doc["content"] for doc in result["metadata"]["retrieved_docs"]
                ]

            evaluated_samples.append(sample)
            print(f"✓ Generated answer for {sample.metadata['sample_id']}")
        except Exception as e:
            print(f"✗ Error evaluating {sample.metadata['sample_id']}: {e}")
            # Still add the sample for error tracking
            evaluated_samples.append(sample)
    return evaluated_samples

# Set up agent for evaluation
config = AgentConfig(
    model_name="gpt-4",
    temperature=0.1,
    max_tokens=1000,
    max_steps=5,
    cost_limit=0.50
)

# Initialize retrieval components
vector_store = OpenSearchStore(endpoint="http://localhost:9200", index_name="knowledge_base")
vector_retriever = VectorRetriever(vector_store=vector_store)
bm25_retriever = BM25Retriever(index_name="knowledge_base")  # constructor arguments depend on your corpus/index setup
hybrid_retriever = HybridRetriever(vector_retriever=vector_retriever, bm25_retriever=bm25_retriever)

tool_registry = ToolRegistry()
tool_registry.register_retrieval_tool(hybrid_retriever)

agent = RAGAgentGraph(config=config, tool_registry=tool_registry)

# Run evaluation (use `await` directly instead of asyncio.run if you are already in an async context, e.g. a notebook)
print("Starting automated evaluation...")
evaluated_samples = asyncio.run(evaluate_rag_system(agent, eval_dataset))
print(f"Completed evaluation for {len(evaluated_samples)} samples")
```
Step 4: Running RAGAS Evaluation
Now run the RAGAS evaluation on the generated answers:
```python
def run_ragas_evaluation(evaluator, evaluated_samples):
    """Run RAGAS evaluation on evaluated samples."""
    print("Running RAGAS evaluation...")

    # Evaluate samples with RAGAS
    eval_results = evaluator.evaluate_samples(evaluated_samples)

    # Compute aggregate metrics
    aggregate_metrics = evaluator.compute_aggregate_metrics(eval_results)
    return eval_results, aggregate_metrics

# Run RAGAS evaluation
eval_results, aggregate_metrics = run_ragas_evaluation(evaluator, evaluated_samples)

# Display results
print("\n=== RAGAS Evaluation Results ===")
print(f"Context Precision: {aggregate_metrics.get('context_precision', 0):.3f}")
print(f"Context Recall: {aggregate_metrics.get('context_recall', 0):.3f}")
print(f"Faithfulness: {aggregate_metrics.get('faithfulness', 0):.3f}")
print(f"Answer Relevancy: {aggregate_metrics.get('answer_relevancy', 0):.3f}")
print(f"Answer Similarity: {aggregate_metrics.get('answer_similarity', 0):.3f}")
print(f"Total Samples: {aggregate_metrics.get('total_samples', 0)}")
print(f"Valid Samples: {aggregate_metrics.get('valid_samples', 0)}")
print(f"Error Rate: {aggregate_metrics.get('error_rate', 0):.3f}")
```
Step 5: Detailed Results Analysis
Analyze individual sample results for deeper insights:
```python
def analyze_evaluation_results(eval_results):
    """Analyze detailed evaluation results."""
    print("\n=== Detailed Results Analysis ===")

    # Group by domain. Domain and difficulty come from the sample metadata set in
    # Step 2; adjust the attribute access if your result objects expose it differently.
    domain_results = {}
    for result in eval_results:
        domain = (getattr(result, "metadata", None) or {}).get("domain", "unknown")
        if domain not in domain_results:
            domain_results[domain] = []
        domain_results[domain].append(result)

    # Analyze by domain
    for domain, results in domain_results.items():
        print(f"\n--- {domain.upper()} DOMAIN ---")
        avg_precision = sum(r.metrics.get('context_precision', 0) for r in results) / len(results)
        avg_recall = sum(r.metrics.get('context_recall', 0) for r in results) / len(results)
        avg_faithfulness = sum(r.metrics.get('faithfulness', 0) for r in results) / len(results)

        print(f"Average Context Precision: {avg_precision:.3f}")
        print(f"Average Context Recall: {avg_recall:.3f}")
        print(f"Average Faithfulness: {avg_faithfulness:.3f}")

        # Identify problematic samples
        low_precision = [r for r in results if r.metrics.get('context_precision', 0) < 0.7]
        low_recall = [r for r in results if r.metrics.get('context_recall', 0) < 0.7]

        if low_precision:
            print(f"⚠️ {len(low_precision)} samples with low context precision")
        if low_recall:
            print(f"⚠️ {len(low_recall)} samples with low context recall")

    # Analyze by difficulty
    difficulty_results = {}
    for result in eval_results:
        difficulty = (getattr(result, "metadata", None) or {}).get("difficulty", "unknown")
        if difficulty not in difficulty_results:
            difficulty_results[difficulty] = []
        difficulty_results[difficulty].append(result)

    print("\n--- DIFFICULTY ANALYSIS ---")
    for difficulty, results in difficulty_results.items():
        avg_similarity = sum(r.metrics.get('answer_similarity', 0) for r in results) / len(results)
        print(f"{difficulty.title()} questions: Avg Answer Similarity = {avg_similarity:.3f}")

# Run detailed analysis
analyze_evaluation_results(eval_results)
```
Step 6: Creating Evaluation Reports
Generate comprehensive evaluation reports:
```python
import json
from datetime import datetime

def generate_evaluation_report(eval_results, aggregate_metrics, output_path="evaluation_report.json"):
    """Generate comprehensive evaluation report."""
    report = {
        "timestamp": datetime.utcnow().isoformat(),
        "summary": {
            "total_samples": aggregate_metrics.get('total_samples', 0),
            "valid_samples": aggregate_metrics.get('valid_samples', 0),
            "error_rate": aggregate_metrics.get('error_rate', 0),
            "overall_score": sum([
                aggregate_metrics.get('context_precision', 0),
                aggregate_metrics.get('context_recall', 0),
                aggregate_metrics.get('faithfulness', 0),
                aggregate_metrics.get('answer_relevancy', 0),
                aggregate_metrics.get('answer_similarity', 0)
            ]) / 5
        },
        "metrics": {
            "context_precision": aggregate_metrics.get('context_precision', 0),
            "context_recall": aggregate_metrics.get('context_recall', 0),
            "faithfulness": aggregate_metrics.get('faithfulness', 0),
            "answer_relevancy": aggregate_metrics.get('answer_relevancy', 0),
            "answer_similarity": aggregate_metrics.get('answer_similarity', 0)
        },
        "detailed_results": [
            {
                "sample_id": result.sample_id,
                "question": result.question,
                "answer": result.answer,
                "ground_truth": result.ground_truth,
                "metrics": result.metrics,
                "contexts_count": len(result.contexts),
                "cost": result.cost,
                "latency_ms": result.latency_ms
            }
            for result in eval_results
        ],
        "recommendations": generate_recommendations(aggregate_metrics)
    }

    # Save report
    with open(output_path, 'w') as f:
        json.dump(report, f, indent=2)

    print(f"Evaluation report saved to {output_path}")
    return report

def generate_recommendations(metrics):
    """Generate improvement recommendations based on metrics."""
    recommendations = []

    if metrics.get('context_precision', 0) < 0.7:
        recommendations.append({
            "metric": "Context Precision",
            "score": metrics.get('context_precision', 0),
            "recommendation": "Improve retrieval quality by tuning hybrid retrieval parameters or adding more relevant documents"
        })

    if metrics.get('context_recall', 0) < 0.7:
        recommendations.append({
            "metric": "Context Recall",
            "score": metrics.get('context_recall', 0),
            "recommendation": "Increase retrieval diversity by adjusting k values or adding more comprehensive documents"
        })

    if metrics.get('faithfulness', 0) < 0.8:
        recommendations.append({
            "metric": "Faithfulness",
            "score": metrics.get('faithfulness', 0),
            "recommendation": "Improve answer generation to better follow retrieved contexts, consider prompt engineering"
        })

    if metrics.get('answer_relevancy', 0) < 0.7:
        recommendations.append({
            "metric": "Answer Relevancy",
            "score": metrics.get('answer_relevancy', 0),
            "recommendation": "Enhance answer generation to better address the specific question asked"
        })

    return recommendations

# Generate evaluation report
report = generate_evaluation_report(eval_results, aggregate_metrics)
print(f"Overall RAGAS Score: {report['summary']['overall_score']:.3f}")

if report['recommendations']:
    print("\n=== Improvement Recommendations ===")
    for rec in report['recommendations']:
        print(f"• {rec['metric']} ({rec['score']:.3f}): {rec['recommendation']}")
```
Step 7: Setting Up Continuous Evaluation
Set up automated evaluation that runs regularly:
```python
import asyncio
import json
import schedule
import time
from datetime import datetime

def continuous_evaluation_job():
    """Job to run continuous evaluation."""
    print(f"Starting continuous evaluation at {datetime.now()}")

    # Load evaluation dataset
    eval_dataset = create_evaluation_dataset()

    # Run evaluation (the job is synchronous, so wrap the async pipeline in asyncio.run)
    evaluated_samples = asyncio.run(evaluate_rag_system(agent, eval_dataset))
    eval_results, aggregate_metrics = run_ragas_evaluation(evaluator, evaluated_samples)

    # Generate report
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    report_path = f"evaluation_report_{timestamp}.json"
    report = generate_evaluation_report(eval_results, aggregate_metrics, report_path)

    # Check for regressions
    check_for_regressions(report)
    print(f"Continuous evaluation completed. Report saved to {report_path}")

def check_for_regressions(current_report):
    """Check for performance regressions against the most recent report."""
    try:
        with open("latest_evaluation_report.json", 'r') as f:
            previous_report = json.load(f)

        current_score = current_report['summary']['overall_score']
        previous_score = previous_report['summary']['overall_score']

        if current_score < previous_score - 0.05:  # regression threshold: 0.05 absolute drop in overall score
            print("⚠️ Performance regression detected!")
            print(f"Previous score: {previous_score:.3f}")
            print(f"Current score: {current_score:.3f}")
            print(f"Regression: {previous_score - current_score:.3f}")
        else:
            print("✅ No significant regression detected")
            print(f"Score change: {current_score - previous_score:+.3f}")
    except FileNotFoundError:
        print("No previous report found for comparison")

    # Save as latest report
    with open("latest_evaluation_report.json", 'w') as f:
        json.dump(current_report, f, indent=2)

# Schedule continuous evaluation
schedule.every().day.at("02:00").do(continuous_evaluation_job)
schedule.every().sunday.at("03:00").do(continuous_evaluation_job)  # Weekly deep evaluation

print("Continuous evaluation scheduled:")
print("- Daily evaluation at 2:00 AM")
print("- Weekly deep evaluation on Sundays at 3:00 AM")

# Run the scheduler
while True:
    schedule.run_pending()
    time.sleep(60)
```
Step 8: Integration with CI/CD
Integrate evaluation into your CI/CD pipeline:
```yaml
# .github/workflows/evaluation.yml
name: RAG Evaluation

on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]
  schedule:
    - cron: '0 2 * * *'  # Daily at 2 AM

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.9'

      - name: Install dependencies
        run: |
          pip install recoagent
          pip install pytest

      - name: Run RAG Evaluation
        env:
          LANGSMITH_API_KEY: ${{ secrets.LANGSMITH_API_KEY }}
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          python scripts/run_evaluation.py

      - name: Check evaluation thresholds
        run: |
          python scripts/check_evaluation_thresholds.py

      - name: Upload evaluation report
        uses: actions/upload-artifact@v3
        with:
          name: evaluation-report
          path: evaluation_report.json
```
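The workflow calls `scripts/run_evaluation.py`, which is not shown above. Here is a minimal sketch of what it might look like, assuming the helpers from Steps 2 through 6 live in an importable module in your repository; the module path `my_project.evaluation` and the `build_agent` helper are hypothetical names, so adapt them to your layout.

```python
# scripts/run_evaluation.py (sketch; adapt imports to wherever you keep the Step 2-6 helpers)
import asyncio
import os

from packages.rag.evaluators import RAGASEvaluator
# Hypothetical module holding the functions defined earlier in this guide:
from my_project.evaluation import (
    create_evaluation_dataset,
    evaluate_rag_system,
    run_ragas_evaluation,
    generate_evaluation_report,
    build_agent,  # hypothetical helper wrapping the Step 3 agent setup
)

def main():
    evaluator = RAGASEvaluator(
        langsmith_api_key=os.getenv("LANGSMITH_API_KEY"),
        langsmith_project="recoagent-eval-ci",
    )
    agent = build_agent()
    samples = create_evaluation_dataset()

    # Generate answers, score them with RAGAS, and write the report the CI checks read.
    evaluated = asyncio.run(evaluate_rag_system(agent, samples))
    eval_results, aggregate_metrics = run_ragas_evaluation(evaluator, evaluated)
    generate_evaluation_report(eval_results, aggregate_metrics, "evaluation_report.json")

if __name__ == "__main__":
    main()
```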
```python
# scripts/check_evaluation_thresholds.py
import json
import sys

def check_evaluation_thresholds(report_path="evaluation_report.json"):
    """Check if evaluation meets minimum thresholds."""
    with open(report_path, 'r') as f:
        report = json.load(f)

    thresholds = {
        'context_precision': 0.7,
        'context_recall': 0.7,
        'faithfulness': 0.8,
        'answer_relevancy': 0.7,
        'answer_similarity': 0.7,
        'overall_score': 0.7
    }

    failed_checks = []

    # Check individual metrics
    for metric, threshold in thresholds.items():
        if metric == 'overall_score':
            actual_score = report['summary']['overall_score']
        else:
            actual_score = report['metrics'][metric]

        if actual_score < threshold:
            failed_checks.append(f"{metric}: {actual_score:.3f} < {threshold}")

    # Check error rate
    error_rate = report['summary']['error_rate']
    if error_rate > 0.1:  # 10% error rate threshold
        failed_checks.append(f"error_rate: {error_rate:.3f} > 0.1")

    if failed_checks:
        print("❌ Evaluation failed thresholds:")
        for check in failed_checks:
            print(f"  - {check}")
        sys.exit(1)
    else:
        print("✅ All evaluation thresholds met")
        sys.exit(0)

if __name__ == "__main__":
    check_evaluation_thresholds()
```
What You've Learned
You now know how to:
✅ Set up RAGAS evaluation with comprehensive metrics
✅ Create evaluation datasets for your domain
✅ Run automated evaluation pipelines
✅ Analyze results and identify improvement areas
✅ Generate evaluation reports with recommendations
✅ Set up continuous evaluation and regression detection
✅ Integrate evaluation into CI/CD pipelines
Best Practices
Evaluation Dataset Design
- Diverse questions: Cover different difficulty levels and domains
- Realistic scenarios: Use questions your users actually ask
- Regular updates: Keep dataset current with your knowledge base
Evaluation Frequency
- Daily: Automated evaluation for regression detection
- Weekly: Comprehensive evaluation with detailed analysis
- Per release: Full evaluation before production deployment
Threshold Management
- Start conservative: Set achievable thresholds initially
- Gradual improvement: Increase thresholds as system improves
- Domain-specific: Different thresholds for different use cases (see the sketch below)
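One way to keep thresholds domain-specific is a plain dictionary keyed by domain, checked against per-domain aggregates from Step 5. A minimal sketch; the domain names match the sample dataset above, but the numeric values are examples, not recommendations.

```python
# Example per-domain thresholds; tune the numbers to your own baselines.
DOMAIN_THRESHOLDS = {
    "general":   {"faithfulness": 0.80, "answer_relevancy": 0.70},
    "technical": {"faithfulness": 0.85, "answer_relevancy": 0.75},
    "safety":    {"faithfulness": 0.90, "answer_relevancy": 0.80},
}

def check_domain(domain, domain_metrics):
    """Return the metrics that fall below the thresholds for this domain."""
    thresholds = DOMAIN_THRESHOLDS.get(domain, {})
    return [
        f"{name}: {domain_metrics.get(name, 0):.3f} < {minimum}"
        for name, minimum in thresholds.items()
        if domain_metrics.get(name, 0) < minimum
    ]

print(check_domain("technical", {"faithfulness": 0.82, "answer_relevancy": 0.78}))
# ['faithfulness: 0.820 < 0.85']
```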
Troubleshooting
Common Issues
Low Context Precision
- Check document quality and relevance
- Tune retrieval parameters (alpha, k values), as sketched below
- Improve document chunking strategy
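For example, a small parameter sweep can show which retrieval setting maximizes context precision on your eval set. The sketch below reuses the evaluator, config, retrievers, and helper functions from Steps 1 through 4, and assumes HybridRetriever accepts `alpha` (BM25 vs. vector weighting) and `k` (results per retriever); check your version's constructor before copying.

```python
# Sweep hybrid retrieval settings and re-run the evaluation for each combination.
# Assumes HybridRetriever exposes `alpha` and `k`; adjust to your actual signature.
import asyncio
import itertools

best = None
for alpha, k in itertools.product([0.3, 0.5, 0.7], [5, 10, 20]):
    retriever = HybridRetriever(
        vector_retriever=vector_retriever,
        bm25_retriever=bm25_retriever,
        alpha=alpha,
        k=k,
    )
    tool_registry = ToolRegistry()
    tool_registry.register_retrieval_tool(retriever)
    agent = RAGAgentGraph(config=config, tool_registry=tool_registry)

    evaluated = asyncio.run(evaluate_rag_system(agent, create_evaluation_dataset()))
    _, metrics = run_ragas_evaluation(evaluator, evaluated)

    score = metrics.get("context_precision", 0)
    if best is None or score > best[0]:
        best = (score, alpha, k)

print(f"Best context precision {best[0]:.3f} at alpha={best[1]}, k={best[2]}")
```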
Low Context Recall
- Add more comprehensive documents
- Increase retrieval diversity
- Use multiple retrieval strategies
Low Faithfulness
- Improve prompt engineering
- Add context citation requirements
- Use smaller, more focused chunks
Evaluation Failures
- Check API keys and credentials
- Verify vector store connectivity
- Ensure sufficient compute resources
Getting Help
- Check the Examples for working implementations
- Browse Reference for complete API documentation
- See the Comprehensive LangSmith Evaluation Implementation for advanced features
- Review the Operational Runbook for production operations
- Contact support@recohut.com for assistance