Source Attribution System

Overview

The Source Attribution System tracks sources and scores confidence for RAG responses, enabling enterprise users to verify where information came from and judge how much to trust it for business decisions and compliance requirements.

Key Features

  • Granular Source Tracking: Links response segments to specific document sections
  • Confidence Scoring: Based on retrieval relevance and generation uncertainty
  • Uncertainty Quantification: Per-segment uncertainty estimates across a response
  • Source Ranking: By reliability and recency
  • Citation Formatting: Multiple formats with verification
  • Confidence-based Filtering: With warnings for low-confidence information
  • Real-time Monitoring: Dashboard for attribution quality and confidence distribution

Architecture

Core Components

1. Source Attribution System

The main system that orchestrates all attribution functionality.

from packages.rag.source_attribution import SourceAttributionSystem

# Initialize the system
attribution_system = SourceAttributionSystem()

# Perform attribution
result = attribution_system.attribute_sources(
    query="What is machine learning?",
    retrieval_results=retrieval_results,
    response_text="Machine learning is a subset of AI..."
)

2. Confidence Scoring

Calculates confidence scores based on multiple factors:

  • Retrieval Relevance: How well sources match the query
  • Generation Uncertainty: Quality of the generated response
  • Source Reliability: Credibility and authority of sources
  • Temporal Factors: Recency and freshness of information

from packages.rag.source_attribution import ConfidenceCalculator

calculator = ConfidenceCalculator()

# Calculate retrieval confidence
retrieval_confidences = calculator.calculate_retrieval_confidence(
    query, retrieval_results
)

# Calculate generation confidence
gen_confidence = calculator.calculate_generation_confidence(
    query, context, response
)
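
For intuition, the per-factor scores can be blended into a single value. A minimal sketch, assuming a fixed weighted average (the weights and the actual combination rule inside SourceAttributionSystem are assumptions here):

# Illustrative only: blend the two factor scores computed above.
overall = 0.6 * max(retrieval_confidences) + 0.4 * gen_confidence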

3. Uncertainty Quantification

Quantifies different types of uncertainty:

  • Semantic Uncertainty: Alignment between response and sources
  • Factual Uncertainty: Confidence in factual claims
  • Temporal Uncertainty: Age-based uncertainty

from packages.rag.source_attribution import UncertaintyQuantifier

quantifier = UncertaintyQuantifier()

# Calculate semantic uncertainty
semantic_uncertainty = quantifier.calculate_semantic_uncertainty(
    response_segment, source_segments
)

# Calculate factual uncertainty
factual_uncertainty = quantifier.calculate_factual_uncertainty(
    response_segment, sources
)
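
The quantifier also exposes a temporal measure (see the API Reference below); it operates on the source list directly:

# Calculate temporal (age-based) uncertainty for the same sources
temporal_uncertainty = quantifier.calculate_temporal_uncertainty(sources)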

Configuration

Basic Configuration

from packages.rag.source_attribution import create_source_attribution_system

# Create with default settings
attribution_system = create_source_attribution_system()

# Create with custom confidence calculator
from packages.rag.source_attribution import ConfidenceCalculator

custom_calculator = ConfidenceCalculator(model_name="custom-model")
attribution_system = create_source_attribution_system(
    confidence_calculator=custom_calculator
)

Advanced Configuration

from packages.rag.source_attribution import (
    SourceAttributionSystem, ConfidenceCalculator,
    UncertaintyQuantifier, SourceRanker, CitationFormatter
)

# Create custom components
confidence_calculator = ConfidenceCalculator(model_name="cross-encoder/ms-marco-MiniLM-L-6-v2")
uncertainty_quantifier = UncertaintyQuantifier()
source_ranker = SourceRanker()
citation_formatter = CitationFormatter()

# Create system with custom components
attribution_system = SourceAttributionSystem(
    confidence_calculator=confidence_calculator,
    uncertainty_quantifier=uncertainty_quantifier,
    source_ranker=source_ranker,
    citation_formatter=citation_formatter
)

Usage Examples

Basic Attribution

# Query and retrieval results
query = "What are the benefits of renewable energy?"
retrieval_results = [
    RetrievalResult(
        chunk=Chunk(
            content="Renewable energy reduces carbon emissions...",
            metadata={"source": "energy_report.pdf", "page": 1},
            chunk_id="chunk_1",
            source="energy_report.pdf",
            start_char=0,
            end_char=100
        ),
        score=0.9,
        retrieval_method="vector"
    )
]

response_text = "Renewable energy offers environmental and economic benefits."

# Perform attribution
attribution_result = attribution_system.attribute_sources(
    query=query,
    retrieval_results=retrieval_results,
    response_text=response_text
)

print(f"Overall confidence: {attribution_result.overall_confidence:.3f}")
print(f"Number of sources: {len(attribution_result.source_ranking)}")
print(f"Uncertainty metrics: {attribution_result.uncertainty_metrics}")

Confidence-based Filtering

from packages.rag.confidence_filtering import ConfidenceFilter, FilteringPolicy

# Create confidence filter
confidence_filter = ConfidenceFilter(policy=FilteringPolicy.MODERATE)

# Apply filtering
filtering_result = confidence_filter.filter_response(attribution_result)

print(f"Final action: {filtering_result.final_action}")
print(f"Warnings: {len(filtering_result.warnings)}")
print(f"Failed gates: {filtering_result.failed_gates}")

Source Verification

from packages.rag.source_verification import SourceVerifier

# Create verifier
verifier = SourceVerifier()

# Verify sources
for source in attribution_result.source_ranking:
    verification_result = verifier.verify_source(source)
    print(f"Source {source.source_id}: {verification_result.status.value}")
    print(f"Confidence: {verification_result.confidence:.3f}")

Citation Formatting

from packages.rag.source_verification import EnhancedCitationFormatter

# Create formatter
formatter = EnhancedCitationFormatter()

# Format citations with verification
enhanced_citations = formatter.format_enhanced_citations(
    attribution_result.source_ranking,
    style="apa",
    include_verification=True
)

for citation in enhanced_citations:
    print(f"Citation: {citation.citation_text}")
    print(f"Verification: {citation.verification_status.value}")
    print(f"Credibility: {citation.credibility_score:.3f}")

API Reference

SourceAttributionSystem

Main class for source attribution functionality.

Methods

  • attribute_sources(query, retrieval_results, response_text, rerank_results=None): Perform comprehensive source attribution
  • _convert_to_source_segments(retrieval_results): Convert retrieval results to source segments (internal helper)
  • _segment_response(response_text, source_segments, confidences): Segment the response and attribute each segment to sources (internal helper)

AttributionResult

Result container for attribution analysis.

Attributes

  • response_text: The original response text
  • response_segments: List of response segments with attribution
  • source_ranking: Ranked list of sources
  • overall_confidence: Overall confidence score (0-1)
  • uncertainty_metrics: Dictionary of uncertainty measurements
  • warnings: List of warnings about the response
  • metadata: Additional metadata
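
A quick way to inspect per-segment attribution from these attributes (the segment field names `text` and `confidence` below are assumptions for illustration; only `response_segments` itself is documented above):

for segment in attribution_result.response_segments:
    # `text` and `confidence` are assumed field names, shown for illustration
    print(f"{segment.text} (confidence: {segment.confidence:.3f})")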

ConfidenceCalculator

Calculates confidence scores for various components.

Methods

  • calculate_retrieval_confidence(query, retrieval_results): Calculate retrieval confidence
  • calculate_generation_confidence(query, context, response): Calculate generation confidence
  • calculate_source_reliability(source): Calculate source reliability

UncertaintyQuantifier

Quantifies uncertainty in different aspects of responses.

Methods

  • calculate_semantic_uncertainty(response_segment, source_segments): Calculate semantic uncertainty
  • calculate_factual_uncertainty(response_segment, sources): Calculate factual uncertainty
  • calculate_temporal_uncertainty(sources): Calculate temporal uncertainty

Monitoring and Analytics

Dashboard Integration

from packages.rag.dashboard import SourceAttributionDashboard

# Create dashboard
dashboard = SourceAttributionDashboard()

# Add attribution results
dashboard.add_attribution_result(attribution_result, query_id="query_123")

# Get dashboard data
dashboard_data = dashboard.get_dashboard_data(hours_back=24)
print(f"Total queries: {dashboard_data['current_metrics']['total_queries']}")
print(f"Average confidence: {dashboard_data['current_metrics']['avg_confidence']:.3f}")

Performance Monitoring

# Get performance report
performance_report = dashboard.get_performance_report(hours_back=24)

print(f"Performance score: {performance_report['summary']['performance_score']:.3f}")
print(f"Recommendations: {performance_report['recommendations']}")

Calibration

Confidence Calibration

from packages.rag.confidence_calibration import ConfidenceCalibrator, FeedbackType

# Create calibrator
calibrator = ConfidenceCalibrator()

# Add calibration data
calibrator.add_calibration_point(
    query_id="query_123",
    predicted_confidence=0.8,
    actual_accuracy=0.7,
    feedback_type=FeedbackType.USER_RATING,
    feedback_source="user_feedback"
)

# Train calibration model
calibrator.train_calibration_model()

# Apply calibration
calibrated_result = calibrator.calibrate_attribution_result(attribution_result)

Calibration Monitoring

# Get calibration report
calibration_report = calibrator.get_calibration_report(days_back=7)

print(f"Calibration status: {calibration_report['metrics']['calibration_status']}")
print(f"Brier score: {calibration_report['metrics']['brier_score']:.3f}")
print(f"Recommendations: {calibration_report['recommendations']}")

Troubleshooting

Common Issues

  1. Low Confidence Scores

    • Check retrieval quality
    • Review confidence calculation parameters
    • Implement calibration
  2. Poor Attribution Coverage

    • Increase retrieval result count
    • Improve source matching
    • Review filtering thresholds
  3. High Uncertainty

    • Improve source-response alignment
    • Add conflict resolution
    • Enhance source quality

Diagnostic Tools

from packages.rag.troubleshooting_attribution import AttributionTroubleshooter

# Create troubleshooter
troubleshooter = AttributionTroubleshooter()

# Run diagnostics
diagnostics = troubleshooter.diagnose_attribution_system(
    attribution_results=[attribution_result]
)

# Generate recommendations
recommendations = troubleshooter.generate_optimization_recommendations(diagnostics)

# Create troubleshooting report
report = troubleshooter.create_troubleshooting_report(diagnostics, recommendations)

Best Practices

Source Quality

  • Maintain diverse source collections
  • Implement source quality scoring
  • Verify sources regularly
  • Track source performance

Confidence Scoring

  • Collect user feedback for calibration
  • Monitor confidence drift
  • Use multiple calibration methods
  • Validate on held-out data

Uncertainty Quantification

  • Distinguish epistemic from aleatoric uncertainty
  • Include query complexity factors
  • Validate with human judgments
  • Provide uncertainty explanations

Performance

  • Implement intelligent caching (see the sketch after this list)
  • Use parallel processing
  • Monitor system performance
  • Optimize retrieval algorithms
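
A minimal caching sketch for the first item, assuming repeated query/response pairs should reuse a prior attribution (the `run_attribution` helper and the `retriever` object are hypothetical, not part of the package):

from functools import lru_cache

@lru_cache(maxsize=1024)
def run_attribution(query: str, response_text: str):
    # Hypothetical helper: re-retrieve, then attribute; repeated
    # query/response pairs are served straight from the cache.
    results = retriever.retrieve(query)
    return attribution_system.attribute_sources(
        query=query,
        retrieval_results=results,
        response_text=response_text,
    )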

Security Considerations

  • Validate all source URLs (see the sketch after this list)
  • Implement rate limiting
  • Sanitize user inputs
  • Monitor for malicious sources
  • Implement access controls
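
For the URL validation item, a minimal allow-list check (the HTTPS-only policy is an assumption; adapt to your deployment):

from urllib.parse import urlparse

ALLOWED_SCHEMES = {"https"}  # assumption: require TLS for source links

def is_safe_source_url(url: str) -> bool:
    # Reject URLs with unexpected schemes or missing hosts before fetching
    parsed = urlparse(url)
    return parsed.scheme in ALLOWED_SCHEMES and bool(parsed.netloc)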

Performance Optimization

  • Cache frequent queries
  • Use lazy loading for large datasets
  • Implement result pagination
  • Optimize database queries
  • Monitor memory usage

Deployment

Environment Variables

# Confidence calculation
CONFIDENCE_MODEL_NAME=cross-encoder/ms-marco-MiniLM-L-6-v2
CONFIDENCE_THRESHOLD=0.5

# Source verification
SOURCE_VERIFICATION_TIMEOUT=10
SOURCE_VERIFICATION_MAX_REDIRECTS=5

# Calibration
CALIBRATION_WINDOW_DAYS=30
MIN_SAMPLES_FOR_CALIBRATION=100

# Dashboard
DASHBOARD_UPDATE_INTERVAL=30
DASHBOARD_MAX_HISTORY_HOURS=24
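
A sketch of consuming these variables at startup, assuming they map directly onto the constructor parameters shown earlier (the mapping is illustrative):

import os

from packages.rag.source_attribution import ConfidenceCalculator

# Fall back to the defaults listed above when a variable is unset
model_name = os.getenv("CONFIDENCE_MODEL_NAME", "cross-encoder/ms-marco-MiniLM-L-6-v2")
confidence_threshold = float(os.getenv("CONFIDENCE_THRESHOLD", "0.5"))

calculator = ConfidenceCalculator(model_name=model_name)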

Docker Deployment

FROM python:3.9-slim

COPY requirements.txt .
RUN pip install -r requirements.txt

COPY packages/ /app/packages/
COPY config/ /app/config/

ENV PYTHONPATH=/app
CMD ["python", "-m", "packages.rag.source_attribution"]

Testing

Unit Tests

# Run unit tests
pytest packages/rag/tests/test_source_attribution.py -v

Integration Tests

# Run integration tests
pytest packages/rag/tests/test_source_attribution.py::TestIntegration -v

Performance Tests

# Run performance tests
pytest packages/rag/tests/test_source_attribution.py -k "performance" -v

Contributing

  1. Follow the existing code style
  2. Add tests for new functionality
  3. Update documentation
  4. Ensure all tests pass
  5. Submit pull request

License

This module is part of the RecoAgent project and follows the same licensing terms.