How to Implement RAG Evaluation with RAGAS
Difficulty: ⭐⭐ Intermediate | Time: 2 hours
🎯 The Problem
Your RAG system is running, but you have no idea if it's actually good. You're deploying blind without metrics, can't measure improvements, and have no way to catch quality regressions. Manual testing doesn't scale.
This guide solves: Setting up automated evaluation with RAGAS metrics so you can measure quality, track improvements, and catch issues before users do.
⚡ TL;DR - Quick Start
```python
from packages.rag.evaluators import RAGASEvaluator, EvaluationSample

# 1. Create evaluator
evaluator = RAGASEvaluator(
    langsmith_api_key="your-key",
    langsmith_project="my-evals"
)

# 2. Create test samples
samples = [
    EvaluationSample(
        question="What is RecoAgent?",
        ground_truth="RecoAgent is an enterprise RAG platform...",
        answer="",    # Will be generated
        contexts=[]   # Will be retrieved
    )
]

# 3. Run evaluation
results = evaluator.evaluate_samples(samples)
print(f"✅ Precision: {results['context_precision']:.2f}")
print(f"✅ Recall: {results['context_recall']:.2f}")
```
Expected output: precision, recall, faithfulness, and relevancy scores.
Full Guide
This guide shows you how to set up comprehensive evaluation for your RAG system using RAGAS metrics, LangSmith integration, and automated evaluation pipelines.
Prerequisites
- Python 3.8+ installed
- RecoAgent installed (`pip install recoagent`)
- LangSmith API key (see the environment check below)
- Sample evaluation dataset
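Before diving in, confirm your credentials are available as environment variables. The snippet below is a quick sanity check; it assumes you use `LANGSMITH_API_KEY` and `OPENAI_API_KEY`, the variable names used later in this guide.

```python
import os
import sys

# Sanity check for the environment variables used later in this guide.
# Adjust the names if your deployment stores credentials differently.
required_vars = ["LANGSMITH_API_KEY", "OPENAI_API_KEY"]
missing = [name for name in required_vars if not os.getenv(name)]

if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
    sys.exit(1)
print("All required credentials found.")
```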
Step 1: Understanding RAGAS Metrics
RAGAS provides several key metrics for evaluating RAG systems:
- Context Precision: How relevant are the retrieved contexts?
- Context Recall: How much of the ground truth is covered by retrieved contexts?
- Faithfulness: How faithful is the generated answer to the retrieved contexts?
- Answer Relevancy: How relevant is the answer to the question?
- Answer Similarity: How similar is the answer to the ground truth?
```python
from packages.rag.evaluators import RAGASEvaluator, EvaluationSample
import os

# Initialize RAGAS evaluator
evaluator = RAGASEvaluator(
    langsmith_api_key=os.getenv("LANGSMITH_API_KEY"),
    langsmith_project="recoagent-eval"
)

print("RAGAS evaluator initialized with metrics:")
print("- Context Precision: Measures relevance of retrieved contexts")
print("- Context Recall: Measures coverage of ground truth")
print("- Faithfulness: Measures answer fidelity to contexts")
print("- Answer Relevancy: Measures answer relevance to question")
print("- Answer Similarity: Measures similarity to ground truth")
```
Step 2: Creating Evaluation Datasets
Create comprehensive evaluation datasets for your domain:
```python
from datetime import datetime
from typing import List, Dict, Any

def create_evaluation_dataset() -> List[EvaluationSample]:
    """Create a sample evaluation dataset."""
    samples = [
        EvaluationSample(
            question="What is RecoAgent and how does it work?",
            ground_truth="RecoAgent is an enterprise RAG platform built with LangGraph that combines hybrid retrieval, agent orchestration, and safety guardrails for production-ready AI applications.",
            answer="",    # Will be generated by the system
            contexts=[],  # Will be retrieved by the system
            source="knowledge_base",
            metadata={
                "domain": "general",
                "difficulty": "easy",
                "expected_sources": 3,
                "sample_id": "sample_001"
            }
        ),
        EvaluationSample(
            question="How do I implement hybrid retrieval with Reciprocal Rank Fusion?",
            ground_truth="Hybrid retrieval combines BM25 and vector search using Reciprocal Rank Fusion to merge results. Implement by creating BM25Retriever and VectorRetriever, then use HybridRetriever with RRF for optimal ranking.",
            answer="",
            contexts=[],
            source="technical_docs",
            metadata={
                "domain": "technical",
                "difficulty": "hard",
                "expected_sources": 5,
                "sample_id": "sample_002"
            }
        ),
        EvaluationSample(
            question="What safety features does RecoAgent provide?",
            ground_truth="RecoAgent includes NVIDIA NeMo Guardrails for input/output filtering, PII detection, topic restrictions, cost tracking, and escalation policies for enterprise safety requirements.",
            answer="",
            contexts=[],
            source="safety_docs",
            metadata={
                "domain": "safety",
                "difficulty": "medium",
                "expected_sources": 4,
                "sample_id": "sample_003"
            }
        )
    ]
    return samples

# Create evaluation dataset
eval_dataset = create_evaluation_dataset()
print(f"Created evaluation dataset with {len(eval_dataset)} samples")
```
Step 3: Setting Up Automated Evaluation Pipeline
Create an automated pipeline that generates answers and evaluates them:
```python
from packages.agents import RAGAgentGraph, AgentConfig, ToolRegistry
from packages.rag.retrievers import HybridRetriever, VectorRetriever, BM25Retriever
from packages.rag.stores import OpenSearchStore
import asyncio

async def evaluate_rag_system(agent, eval_samples):
    """Run evaluation on a RAG system."""
    evaluated_samples = []
    for sample in eval_samples:
        print(f"Evaluating sample: {sample.metadata['sample_id']}")
        try:
            # Generate answer using the agent
            result = await agent.run(sample.question, user_id="eval_user")

            # Update sample with generated answer
            sample.answer = result.get("answer", "")

            # Extract retrieved contexts from result metadata
            if "retrieved_docs" in result.get("metadata", {}):
                sample.contexts = [
                    doc["content"] for doc in result["metadata"]["retrieved_docs"]
                ]

            evaluated_samples.append(sample)
            print(f"✓ Generated answer for {sample.metadata['sample_id']}")
        except Exception as e:
            print(f"✗ Error evaluating {sample.metadata['sample_id']}: {e}")
            # Still add the sample for error tracking
            evaluated_samples.append(sample)
    return evaluated_samples

# Set up agent for evaluation
config = AgentConfig(
    model_name="gpt-4",
    temperature=0.1,
    max_tokens=1000,
    max_steps=5,
    cost_limit=0.50
)

# Initialize retrieval components
vector_store = OpenSearchStore(endpoint="http://localhost:9200", index_name="knowledge_base")
vector_retriever = VectorRetriever(vector_store=vector_store)
bm25_retriever = BM25Retriever(index_name="knowledge_base")  # constructor arguments depend on your corpus/index setup
hybrid_retriever = HybridRetriever(vector_retriever=vector_retriever, bm25_retriever=bm25_retriever)

tool_registry = ToolRegistry()
tool_registry.register_retrieval_tool(hybrid_retriever)

agent = RAGAgentGraph(config=config, tool_registry=tool_registry)

# Run evaluation (use `await` directly instead of asyncio.run if you are already in an async context, e.g. a notebook)
print("Starting automated evaluation...")
evaluated_samples = asyncio.run(evaluate_rag_system(agent, eval_dataset))
print(f"Completed evaluation for {len(evaluated_samples)} samples")
```
Step 4: Running RAGAS Evaluation
Now run the RAGAS evaluation on the generated answers:
```python
def run_ragas_evaluation(evaluator, evaluated_samples):
    """Run RAGAS evaluation on evaluated samples."""
    print("Running RAGAS evaluation...")

    # Evaluate samples with RAGAS
    eval_results = evaluator.evaluate_samples(evaluated_samples)

    # Compute aggregate metrics
    aggregate_metrics = evaluator.compute_aggregate_metrics(eval_results)
    return eval_results, aggregate_metrics

# Run RAGAS evaluation
eval_results, aggregate_metrics = run_ragas_evaluation(evaluator, evaluated_samples)

# Display results
print("\n=== RAGAS Evaluation Results ===")
print(f"Context Precision: {aggregate_metrics.get('context_precision', 0):.3f}")
print(f"Context Recall: {aggregate_metrics.get('context_recall', 0):.3f}")
print(f"Faithfulness: {aggregate_metrics.get('faithfulness', 0):.3f}")
print(f"Answer Relevancy: {aggregate_metrics.get('answer_relevancy', 0):.3f}")
print(f"Answer Similarity: {aggregate_metrics.get('answer_similarity', 0):.3f}")
print(f"Total Samples: {aggregate_metrics.get('total_samples', 0)}")
print(f"Valid Samples: {aggregate_metrics.get('valid_samples', 0)}")
print(f"Error Rate: {aggregate_metrics.get('error_rate', 0):.3f}")
```
Step 5: Detailed Results Analysis
Analyze individual sample results for deeper insights:
```python
def analyze_evaluation_results(eval_results):
    """Analyze detailed evaluation results."""
    print("\n=== Detailed Results Analysis ===")

    # Group by domain. Domain and difficulty come from the sample metadata set in
    # Step 2; adjust the attribute access if your result objects expose it differently.
    domain_results = {}
    for result in eval_results:
        domain = (getattr(result, "metadata", None) or {}).get("domain", "unknown")
        if domain not in domain_results:
            domain_results[domain] = []
        domain_results[domain].append(result)

    # Analyze by domain
    for domain, results in domain_results.items():
        print(f"\n--- {domain.upper()} DOMAIN ---")
        avg_precision = sum(r.metrics.get('context_precision', 0) for r in results) / len(results)
        avg_recall = sum(r.metrics.get('context_recall', 0) for r in results) / len(results)
        avg_faithfulness = sum(r.metrics.get('faithfulness', 0) for r in results) / len(results)

        print(f"Average Context Precision: {avg_precision:.3f}")
        print(f"Average Context Recall: {avg_recall:.3f}")
        print(f"Average Faithfulness: {avg_faithfulness:.3f}")

        # Identify problematic samples
        low_precision = [r for r in results if r.metrics.get('context_precision', 0) < 0.7]
        low_recall = [r for r in results if r.metrics.get('context_recall', 0) < 0.7]

        if low_precision:
            print(f"⚠️ {len(low_precision)} samples with low context precision")
        if low_recall:
            print(f"⚠️ {len(low_recall)} samples with low context recall")

    # Analyze by difficulty
    difficulty_results = {}
    for result in eval_results:
        difficulty = (getattr(result, "metadata", None) or {}).get("difficulty", "unknown")
        if difficulty not in difficulty_results:
            difficulty_results[difficulty] = []
        difficulty_results[difficulty].append(result)

    print("\n--- DIFFICULTY ANALYSIS ---")
    for difficulty, results in difficulty_results.items():
        avg_similarity = sum(r.metrics.get('answer_similarity', 0) for r in results) / len(results)
        print(f"{difficulty.title()} questions: Avg Answer Similarity = {avg_similarity:.3f}")

# Run detailed analysis
analyze_evaluation_results(eval_results)
```
Step 6: Creating Evaluation Reports
Generate comprehensive evaluation reports:
```python
import json
from datetime import datetime

def generate_evaluation_report(eval_results, aggregate_metrics, output_path="evaluation_report.json"):
    """Generate comprehensive evaluation report."""
    report = {
        "timestamp": datetime.utcnow().isoformat(),
        "summary": {
            "total_samples": aggregate_metrics.get('total_samples', 0),
            "valid_samples": aggregate_metrics.get('valid_samples', 0),
            "error_rate": aggregate_metrics.get('error_rate', 0),
            "overall_score": sum([
                aggregate_metrics.get('context_precision', 0),
                aggregate_metrics.get('context_recall', 0),
                aggregate_metrics.get('faithfulness', 0),
                aggregate_metrics.get('answer_relevancy', 0),
                aggregate_metrics.get('answer_similarity', 0)
            ]) / 5
        },
        "metrics": {
            "context_precision": aggregate_metrics.get('context_precision', 0),
            "context_recall": aggregate_metrics.get('context_recall', 0),
            "faithfulness": aggregate_metrics.get('faithfulness', 0),
            "answer_relevancy": aggregate_metrics.get('answer_relevancy', 0),
            "answer_similarity": aggregate_metrics.get('answer_similarity', 0)
        },
        "detailed_results": [
            {
                "sample_id": result.sample_id,
                "question": result.question,
                "answer": result.answer,
                "ground_truth": result.ground_truth,
                "metrics": result.metrics,
                "contexts_count": len(result.contexts),
                "cost": result.cost,
                "latency_ms": result.latency_ms
            }
            for result in eval_results
        ],
        "recommendations": generate_recommendations(aggregate_metrics)
    }

    # Save report
    with open(output_path, 'w') as f:
        json.dump(report, f, indent=2)

    print(f"Evaluation report saved to {output_path}")
    return report

def generate_recommendations(metrics):
    """Generate improvement recommendations based on metrics."""
    recommendations = []

    if metrics.get('context_precision', 0) < 0.7:
        recommendations.append({
            "metric": "Context Precision",
            "score": metrics.get('context_precision', 0),
            "recommendation": "Improve retrieval quality by tuning hybrid retrieval parameters or adding more relevant documents"
        })

    if metrics.get('context_recall', 0) < 0.7:
        recommendations.append({
            "metric": "Context Recall",
            "score": metrics.get('context_recall', 0),
            "recommendation": "Increase retrieval diversity by adjusting k values or adding more comprehensive documents"
        })

    if metrics.get('faithfulness', 0) < 0.8:
        recommendations.append({
            "metric": "Faithfulness",
            "score": metrics.get('faithfulness', 0),
            "recommendation": "Improve answer generation to better follow retrieved contexts, consider prompt engineering"
        })

    if metrics.get('answer_relevancy', 0) < 0.7:
        recommendations.append({
            "metric": "Answer Relevancy",
            "score": metrics.get('answer_relevancy', 0),
            "recommendation": "Enhance answer generation to better address the specific question asked"
        })

    return recommendations

# Generate evaluation report
report = generate_evaluation_report(eval_results, aggregate_metrics)
print(f"Overall RAGAS Score: {report['summary']['overall_score']:.3f}")

if report['recommendations']:
    print("\n=== Improvement Recommendations ===")
    for rec in report['recommendations']:
        print(f"• {rec['metric']} ({rec['score']:.3f}): {rec['recommendation']}")
```
Step 7: Setting Up Continuous Evaluation
Set up automated evaluation that runs regularly:
```python
import asyncio
import json
import schedule
import time
from datetime import datetime

def continuous_evaluation_job():
    """Job to run continuous evaluation."""
    print(f"Starting continuous evaluation at {datetime.now()}")

    # Load evaluation dataset
    eval_dataset = create_evaluation_dataset()

    # Run evaluation (the job is synchronous, so wrap the async pipeline in asyncio.run)
    evaluated_samples = asyncio.run(evaluate_rag_system(agent, eval_dataset))
    eval_results, aggregate_metrics = run_ragas_evaluation(evaluator, evaluated_samples)

    # Generate report
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    report_path = f"evaluation_report_{timestamp}.json"
    report = generate_evaluation_report(eval_results, aggregate_metrics, report_path)

    # Check for regressions
    check_for_regressions(report)
    print(f"Continuous evaluation completed. Report saved to {report_path}")

def check_for_regressions(current_report):
    """Check for performance regressions against the most recent report."""
    try:
        with open("latest_evaluation_report.json", 'r') as f:
            previous_report = json.load(f)

        current_score = current_report['summary']['overall_score']
        previous_score = previous_report['summary']['overall_score']

        if current_score < previous_score - 0.05:  # regression threshold: 0.05 absolute drop in overall score
            print("⚠️ Performance regression detected!")
            print(f"Previous score: {previous_score:.3f}")
            print(f"Current score: {current_score:.3f}")
            print(f"Regression: {previous_score - current_score:.3f}")
        else:
            print("✅ No significant regression detected")
            print(f"Score change: {current_score - previous_score:+.3f}")
    except FileNotFoundError:
        print("No previous report found for comparison")

    # Save as latest report
    with open("latest_evaluation_report.json", 'w') as f:
        json.dump(current_report, f, indent=2)

# Schedule continuous evaluation
schedule.every().day.at("02:00").do(continuous_evaluation_job)
schedule.every().sunday.at("03:00").do(continuous_evaluation_job)  # Weekly deep evaluation

print("Continuous evaluation scheduled:")
print("- Daily evaluation at 2:00 AM")
print("- Weekly deep evaluation on Sundays at 3:00 AM")

# Run the scheduler
while True:
    schedule.run_pending()
    time.sleep(60)
```
Step 8: Integration with CI/CD
Integrate evaluation into your CI/CD pipeline:
```yaml
# .github/workflows/evaluation.yml
name: RAG Evaluation

on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]
  schedule:
    - cron: '0 2 * * *'  # Daily at 2 AM

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.9'

      - name: Install dependencies
        run: |
          pip install recoagent
          pip install pytest

      - name: Run RAG Evaluation
        env:
          LANGSMITH_API_KEY: ${{ secrets.LANGSMITH_API_KEY }}
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          python scripts/run_evaluation.py

      - name: Check evaluation thresholds
        run: |
          python scripts/check_evaluation_thresholds.py

      - name: Upload evaluation report
        uses: actions/upload-artifact@v3
        with:
          name: evaluation-report
          path: evaluation_report.json
```
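The workflow calls `scripts/run_evaluation.py`, which is not shown above. Here is a minimal sketch of what it might look like, assuming the helpers from Steps 2 through 6 live in an importable module in your repository; the module path `my_project.evaluation` and the `build_agent` helper are hypothetical names, so adapt them to your layout.

```python
# scripts/run_evaluation.py (sketch; adapt imports to wherever you keep the Step 2-6 helpers)
import asyncio
import os

from packages.rag.evaluators import RAGASEvaluator
# Hypothetical module holding the functions defined earlier in this guide:
from my_project.evaluation import (
    create_evaluation_dataset,
    evaluate_rag_system,
    run_ragas_evaluation,
    generate_evaluation_report,
    build_agent,  # hypothetical helper wrapping the Step 3 agent setup
)

def main():
    evaluator = RAGASEvaluator(
        langsmith_api_key=os.getenv("LANGSMITH_API_KEY"),
        langsmith_project="recoagent-eval-ci",
    )
    agent = build_agent()
    samples = create_evaluation_dataset()

    # Generate answers, score them with RAGAS, and write the report the CI checks read.
    evaluated = asyncio.run(evaluate_rag_system(agent, samples))
    eval_results, aggregate_metrics = run_ragas_evaluation(evaluator, evaluated)
    generate_evaluation_report(eval_results, aggregate_metrics, "evaluation_report.json")

if __name__ == "__main__":
    main()
```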
```python
# scripts/check_evaluation_thresholds.py
import json
import sys

def check_evaluation_thresholds(report_path="evaluation_report.json"):
    """Check if evaluation meets minimum thresholds."""
    with open(report_path, 'r') as f:
        report = json.load(f)

    thresholds = {
        'context_precision': 0.7,
        'context_recall': 0.7,
        'faithfulness': 0.8,
        'answer_relevancy': 0.7,
        'answer_similarity': 0.7,
        'overall_score': 0.7
    }

    failed_checks = []

    # Check individual metrics
    for metric, threshold in thresholds.items():
        if metric == 'overall_score':
            actual_score = report['summary']['overall_score']
        else:
            actual_score = report['metrics'][metric]

        if actual_score < threshold:
            failed_checks.append(f"{metric}: {actual_score:.3f} < {threshold}")

    # Check error rate
    error_rate = report['summary']['error_rate']
    if error_rate > 0.1:  # 10% error rate threshold
        failed_checks.append(f"error_rate: {error_rate:.3f} > 0.1")

    if failed_checks:
        print("❌ Evaluation failed thresholds:")
        for check in failed_checks:
            print(f"  - {check}")
        sys.exit(1)
    else:
        print("✅ All evaluation thresholds met")
        sys.exit(0)

if __name__ == "__main__":
    check_evaluation_thresholds()
```
What You've Learned
You now know how to:
✅ Set up RAGAS evaluation with comprehensive metrics
✅ Create evaluation datasets for your domain
✅ Run automated evaluation pipelines
✅ Analyze results and identify improvement areas
✅ Generate evaluation reports with recommendations
✅ Set up continuous evaluation and regression detection
✅ Integrate evaluation into CI/CD pipelines
Best Practices
Evaluation Dataset Design
- Diverse questions: Cover different difficulty levels and domains
- Realistic scenarios: Use questions your users actually ask
- Regular updates: Keep dataset current with your knowledge base
Evaluation Frequency
- Daily: Automated evaluation for regression detection
- Weekly: Comprehensive evaluation with detailed analysis
- Per release: Full evaluation before production deployment
Threshold Management
- Start conservative: Set achievable thresholds initially
- Gradual improvement: Increase thresholds as system improves
- Domain-specific: Different thresholds for different use cases (see the sketch below)
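One way to keep thresholds domain-specific is a plain dictionary keyed by domain, checked against per-domain aggregates from Step 5. A minimal sketch; the domain names match the sample dataset above, but the numeric values are examples, not recommendations.

```python
# Example per-domain thresholds; tune the numbers to your own baselines.
DOMAIN_THRESHOLDS = {
    "general":   {"faithfulness": 0.80, "answer_relevancy": 0.70},
    "technical": {"faithfulness": 0.85, "answer_relevancy": 0.75},
    "safety":    {"faithfulness": 0.90, "answer_relevancy": 0.80},
}

def check_domain(domain, domain_metrics):
    """Return the metrics that fall below the thresholds for this domain."""
    thresholds = DOMAIN_THRESHOLDS.get(domain, {})
    return [
        f"{name}: {domain_metrics.get(name, 0):.3f} < {minimum}"
        for name, minimum in thresholds.items()
        if domain_metrics.get(name, 0) < minimum
    ]

print(check_domain("technical", {"faithfulness": 0.82, "answer_relevancy": 0.78}))
# ['faithfulness: 0.820 < 0.85']
```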
Troubleshooting
Common Issues
Low Context Precision
- Check document quality and relevance
- Tune retrieval parameters (alpha, k values), as sketched below
- Improve document chunking strategy
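For example, a small parameter sweep can show which retrieval setting maximizes context precision on your eval set. The sketch below reuses the evaluator, config, retrievers, and helper functions from Steps 1 through 4, and assumes HybridRetriever accepts `alpha` (BM25 vs. vector weighting) and `k` (results per retriever); check your version's constructor before copying.

```python
# Sweep hybrid retrieval settings and re-run the evaluation for each combination.
# Assumes HybridRetriever exposes `alpha` and `k`; adjust to your actual signature.
import asyncio
import itertools

best = None
for alpha, k in itertools.product([0.3, 0.5, 0.7], [5, 10, 20]):
    retriever = HybridRetriever(
        vector_retriever=vector_retriever,
        bm25_retriever=bm25_retriever,
        alpha=alpha,
        k=k,
    )
    tool_registry = ToolRegistry()
    tool_registry.register_retrieval_tool(retriever)
    agent = RAGAgentGraph(config=config, tool_registry=tool_registry)

    evaluated = asyncio.run(evaluate_rag_system(agent, create_evaluation_dataset()))
    _, metrics = run_ragas_evaluation(evaluator, evaluated)

    score = metrics.get("context_precision", 0)
    if best is None or score > best[0]:
        best = (score, alpha, k)

print(f"Best context precision {best[0]:.3f} at alpha={best[1]}, k={best[2]}")
```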
Low Context Recall
- Add more comprehensive documents
- Increase retrieval diversity
- Use multiple retrieval strategies
Low Faithfulness
- Improve prompt engineering
- Add context citation requirements
- Use smaller, more focused chunks
Evaluation Failures
- Check API keys and credentials
- Verify vector store connectivity
- Ensure sufficient compute resources
Getting Help
- Check the Examples for working implementations
- Browse Reference for complete API documentation
- See the Comprehensive LangSmith Evaluation Implementation for advanced features
- Review the Operational Runbook for production operations
- Contact support@recohut.com for assistance