RAG Evaluators

Evaluation components for RAG systems using RAGAS metrics to measure context precision, recall, faithfulness, answer relevancy, and other quality indicators.

Core Classes

RAGASEvaluator

Description: Main evaluator using RAGAS metrics for comprehensive RAG evaluation

Parameters:

  • metrics (List[str]): List of metrics to evaluate (default: ["context_precision", "context_recall", "faithfulness", "answer_relevancy"])
  • enable_langsmith (bool): Enable LangSmith integration (default: True)
  • langsmith_project (str): LangSmith project name (default: "rag-evaluation")

Returns: RAGASEvaluator instance

Example:

from recoagent.rag.evaluators import RAGASEvaluator, EvaluationSample

# Create evaluator
evaluator = RAGASEvaluator(
    metrics=["context_precision", "context_recall", "faithfulness", "answer_relevancy"],
    enable_langsmith=True,
    langsmith_project="my-rag-evaluation"
)

# Create evaluation samples
samples = [
    EvaluationSample(
        question="What is machine learning?",
        ground_truth="Machine learning is a subset of artificial intelligence that enables computers to learn from data.",
        answer="Machine learning is a branch of AI that allows systems to learn and improve from experience.",
        contexts=["ML is a subset of AI...", "Machine learning algorithms..."],
        source="knowledge_base",
        metadata={"domain": "AI", "difficulty": "beginner"}
    )
]

# Evaluate samples
results = evaluator.evaluate(samples)
print(f"Context Precision: {results['context_precision']:.3f}")
print(f"Faithfulness: {results['faithfulness']:.3f}")

CustomEvaluator

Description: Custom evaluator for domain-specific evaluation metrics

Parameters:

  • evaluation_functions (List[Callable]): List of custom evaluation functions
  • weights (List[float]): Weights for each evaluation function
  • aggregation_method (str): Aggregation method ("mean", "weighted_mean", "median")

Returns: CustomEvaluator instance

Example:

from recoagent.rag.evaluators import CustomEvaluator

# Define custom evaluation functions
def domain_accuracy(answer, ground_truth, domain):
"""Check if answer is accurate for specific domain."""
if domain == "medical":
return 1.0 if "medical" in answer.lower() else 0.0
return 0.5

def technical_depth(answer, question):
"""Evaluate technical depth of answer."""
technical_terms = ["algorithm", "model", "neural", "optimization"]
return sum(1 for term in technical_terms if term in answer.lower()) / len(technical_terms)

# Create custom evaluator
custom_evaluator = CustomEvaluator(
    evaluation_functions=[domain_accuracy, technical_depth],
    weights=[0.6, 0.4],
    aggregation_method="weighted_mean"
)

# Evaluate with custom metrics
results = custom_evaluator.evaluate(samples)

EvaluationDataset

Description: Dataset management for RAG evaluation

Parameters:

  • samples (List[EvaluationSample]): List of evaluation samples
  • metadata (Dict): Dataset metadata
  • validation_split (float): Validation split ratio (default: 0.2)

Returns: EvaluationDataset instance

Example:

from recoagent.rag.evaluators import EvaluationDataset, EvaluationSample

# Create evaluation samples
samples = [
    EvaluationSample(
        question="How does deep learning work?",
        ground_truth="Deep learning uses neural networks with multiple layers...",
        answer="Deep learning employs multi-layer neural networks...",
        contexts=["Deep learning is a subset of ML...", "Neural networks consist of layers..."],
        source="ai_textbook",
        metadata={"topic": "deep_learning", "level": "intermediate"}
    ),
    # ... more samples
]

# Create dataset
dataset = EvaluationDataset(
    samples=samples,
    metadata={"name": "AI Knowledge Base", "version": "1.0"},
    validation_split=0.2
)

# Split into train/validation
train_dataset, val_dataset = dataset.split()

Usage Examples

Basic RAGAS Evaluation

from recoagent.rag.evaluators import RAGASEvaluator, EvaluationSample

# Create evaluator with standard metrics
evaluator = RAGASEvaluator(
    metrics=["context_precision", "context_recall", "faithfulness", "answer_relevancy"]
)

# Prepare evaluation data
evaluation_samples = [
    EvaluationSample(
        question="What are the benefits of machine learning?",
        ground_truth="Machine learning provides automation, pattern recognition, and predictive capabilities.",
        answer="ML offers automation, helps identify patterns in data, and enables predictions.",
        contexts=[
            "Machine learning automates decision-making processes...",
            "Pattern recognition is a key benefit of ML...",
            "Predictive analytics uses ML algorithms..."
        ],
        source="ml_handbook",
        metadata={"domain": "AI", "complexity": "medium"}
    ),
    EvaluationSample(
        question="How do neural networks learn?",
        ground_truth="Neural networks learn through backpropagation and gradient descent.",
        answer="Neural networks learn using backpropagation to adjust weights.",
        contexts=[
            "Backpropagation is the learning algorithm for neural networks...",
            "Gradient descent optimizes network parameters...",
            "Weight adjustment happens during training..."
        ],
        source="neural_networks_guide",
        metadata={"domain": "AI", "complexity": "high"}
    )
]

# Run evaluation
results = evaluator.evaluate(evaluation_samples)

# Display results
print("=== RAGAS Evaluation Results ===")
for metric, score in results.items():
print(f"{metric}: {score:.3f}")

# Get detailed results per sample
detailed_results = evaluator.get_detailed_results()
for result in detailed_results:
print(f"\nQuestion: {result.question}")
print(f"Context Precision: {result.metrics['context_precision']:.3f}")
print(f"Faithfulness: {result.metrics['faithfulness']:.3f}")

Advanced Evaluation with Custom Metrics

from recoagent.rag.evaluators import RAGASEvaluator, CustomEvaluator
import numpy as np

# Create RAGAS evaluator
ragas_evaluator = RAGASEvaluator(
    metrics=["context_precision", "context_recall", "faithfulness", "answer_relevancy"]
)

# Define custom evaluation functions
def factual_consistency(answer, contexts):
"""Check if answer is factually consistent with contexts."""
# Simple implementation - check for contradictions
answer_lower = answer.lower()
context_text = " ".join(contexts).lower()

# Check for key facts alignment
key_terms = ["machine learning", "artificial intelligence", "neural network"]
consistency_score = 0.0

for term in key_terms:
if term in answer_lower and term in context_text:
consistency_score += 1.0

return consistency_score / len(key_terms)

def completeness_score(answer, question):
"""Evaluate how complete the answer is."""
question_words = set(question.lower().split())
answer_words = set(answer.lower().split())

# Check if answer addresses key question terms
addressed_terms = question_words.intersection(answer_words)
return len(addressed_terms) / len(question_words) if question_words else 0.0

# Create custom evaluator
custom_evaluator = CustomEvaluator(
    evaluation_functions=[factual_consistency, completeness_score],
    weights=[0.7, 0.3],
    aggregation_method="weighted_mean"
)

# Run both evaluations
ragas_results = ragas_evaluator.evaluate(evaluation_samples)
custom_results = custom_evaluator.evaluate(evaluation_samples)

# Combine results
combined_results = {
    **ragas_results,
    "factual_consistency": custom_results["factual_consistency"],
    "completeness": custom_results["completeness"]
}

print("=== Combined Evaluation Results ===")
for metric, score in combined_results.items():
print(f"{metric}: {score:.3f}")

Evaluation with LangSmith Integration

from recoagent.rag.evaluators import RAGASEvaluator
from langsmith import Client

# Create evaluator with LangSmith integration
evaluator = RAGASEvaluator(
    metrics=["context_precision", "context_recall", "faithfulness", "answer_relevancy"],
    enable_langsmith=True,
    langsmith_project="rag-evaluation-prod"
)

# Initialize LangSmith client
langsmith_client = Client()

# Create evaluation run
evaluation_run = langsmith_client.create_run(
name="RAG Evaluation Run",
run_type="evaluator",
inputs={"evaluation_samples": len(evaluation_samples)}
)

# Run evaluation with tracing
results = evaluator.evaluate_with_tracing(
    samples=evaluation_samples,
    run_id=evaluation_run.id
)

# Log results to LangSmith
evaluator.log_results_to_langsmith(
    results=results,
    run_id=evaluation_run.id,
    metadata={"model": "gpt-4", "temperature": 0.1}
)

print(f"Evaluation completed. Run ID: {evaluation_run.id}")
print(f"Results logged to LangSmith project: rag-evaluation-prod")

Batch Evaluation and Analysis

from recoagent.rag.evaluators import RAGASEvaluator, EvaluationDataset
import pandas as pd

# Create large evaluation dataset
large_dataset = EvaluationDataset(
    samples=evaluation_samples * 10,  # 20 samples
    metadata={"name": "Large RAG Evaluation", "version": "2.0"}
)

# Create evaluator
evaluator = RAGASEvaluator(
    metrics=["context_precision", "context_recall", "faithfulness", "answer_relevancy"]
)

# Run batch evaluation
batch_results = evaluator.evaluate_batch(
    dataset=large_dataset,
    batch_size=5,
    parallel=True
)

# Analyze results
results_df = pd.DataFrame(batch_results)

print("=== Batch Evaluation Analysis ===")
print(f"Total samples evaluated: {len(results_df)}")
print(f"Average context precision: {results_df['context_precision'].mean():.3f}")
print(f"Average faithfulness: {results_df['faithfulness'].mean():.3f}")
print(f"Standard deviation: {results_df['context_precision'].std():.3f}")

# Performance by domain
domain_analysis = results_df.groupby('domain').agg({
    'context_precision': ['mean', 'std'],
    'faithfulness': ['mean', 'std']
}).round(3)

print("\n=== Performance by Domain ===")
print(domain_analysis)

# Identify problematic samples
low_performance = results_df[
    (results_df['context_precision'] < 0.7) |
    (results_df['faithfulness'] < 0.7)
]

print(f"\nSamples with low performance: {len(low_performance)}")
for idx, row in low_performance.iterrows():
print(f"Sample {idx}: CP={row['context_precision']:.3f}, F={row['faithfulness']:.3f}")

Continuous Evaluation Pipeline

from recoagent.rag.evaluators import RAGASEvaluator
import asyncio
from datetime import datetime, timedelta

# Create evaluator for continuous monitoring
evaluator = RAGASEvaluator(
    metrics=["context_precision", "context_recall", "faithfulness"],
    enable_langsmith=True
)

async def continuous_evaluation():
"""Run continuous evaluation on new data."""
while True:
# Get new evaluation samples (from your data pipeline)
new_samples = get_new_evaluation_samples()

if new_samples:
# Run evaluation
results = evaluator.evaluate(new_samples)

# Check for quality degradation
if results['context_precision'] < 0.8:
alert_quality_degradation(results)

# Log results
evaluator.log_evaluation_results(
results=results,
timestamp=datetime.utcnow(),
metadata={"source": "continuous_evaluation"}
)

print(f"Evaluated {len(new_samples)} samples. Context Precision: {results['context_precision']:.3f}")

# Wait before next evaluation
await asyncio.sleep(3600) # Evaluate every hour

def alert_quality_degradation(results):
"""Alert when quality metrics drop below thresholds."""
alerts = []

if results['context_precision'] < 0.8:
alerts.append(f"Context Precision below threshold: {results['context_precision']:.3f}")

if results['faithfulness'] < 0.8:
alerts.append(f"Faithfulness below threshold: {results['faithfulness']:.3f}")

if alerts:
print("🚨 Quality Alert:")
for alert in alerts:
print(f" - {alert}")

# Send to monitoring system
send_alert_to_monitoring(alerts)

# Run continuous evaluation
asyncio.run(continuous_evaluation())

API Reference

RAGASEvaluator Methods

evaluate(samples: List[EvaluationSample]) -> Dict[str, float]

Evaluate samples using RAGAS metrics

Parameters:

  • samples (List[EvaluationSample]): Evaluation samples

Returns: Dictionary with metric scores

evaluate_batch(dataset: EvaluationDataset, batch_size: int = 10) -> List[Dict]

Evaluate large datasets in batches

Parameters:

  • dataset (EvaluationDataset): Evaluation dataset
  • batch_size (int): Batch size for processing

Returns: List of evaluation results

evaluate_with_tracing(samples: List[EvaluationSample], run_id: str) -> Dict

Evaluate with LangSmith tracing

Parameters:

  • samples (List[EvaluationSample]): Evaluation samples
  • run_id (str): LangSmith run ID

Returns: Evaluation results with tracing

CustomEvaluator Methods

evaluate(samples: List[EvaluationSample]) -> Dict[str, float]

Evaluate samples using custom metrics

Parameters:

  • samples (List[EvaluationSample]): Evaluation samples

Returns: Dictionary with custom metric scores

add_evaluation_function(func: Callable, weight: float) -> None

Add new evaluation function

Parameters:

  • func (Callable): Evaluation function
  • weight (float): Weight for the function
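
Example (a minimal sketch that reuses the custom_evaluator and samples from the examples above; the response_length metric is hypothetical, and it is assumed the evaluator re-normalizes weights after a function is added):

def response_length(answer, question):
    """Hypothetical metric: favor answers that are neither too short nor too long."""
    word_count = len(answer.split())
    return 1.0 if 20 <= word_count <= 200 else 0.5

# Register the new function alongside the existing ones, then re-run evaluation
custom_evaluator.add_evaluation_function(response_length, weight=0.2)
results = custom_evaluator.evaluate(samples)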

EvaluationDataset Methods

split(validation_split: float = 0.2) -> Tuple[EvaluationDataset, EvaluationDataset]

Split dataset into train and validation

Parameters:

  • validation_split (float): Validation split ratio

Returns: Tuple of (train_dataset, validation_dataset)

filter_by_metadata(metadata_filter: Dict) -> EvaluationDataset

Filter samples by metadata

Parameters:

  • metadata_filter (Dict): Metadata filter criteria

Returns: Filtered dataset
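
Example (a minimal sketch that reuses the dataset and evaluator from the examples above; it assumes metadata_filter matches samples by exact key/value equality):

# Keep only intermediate-level deep learning samples, then evaluate that subset in batches
dl_subset = dataset.filter_by_metadata({"topic": "deep_learning", "level": "intermediate"})
results = evaluator.evaluate_batch(dataset=dl_subset, batch_size=5)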

See Also