Source Attribution System
Overview
The Source Attribution System provides source tracking and confidence scoring for RAG responses, so enterprise users can verify where information came from and judge how much to trust it when making business decisions or meeting compliance requirements.
Key Features
- Granular Source Tracking: Links response segments to specific document sections
- Confidence Scoring: Based on retrieval relevance and generation uncertainty
- Uncertainty Quantification: For different parts of responses
- Source Ranking: By reliability and recency
- Citation Formatting: Multiple formats with verification
- Confidence-based Filtering: With warnings for low-confidence information
- Real-time Monitoring: Dashboard for attribution quality and confidence distribution
Architecture
Core Components
1. Source Attribution System
The main system that orchestrates all attribution functionality.
from packages.rag.source_attribution import SourceAttributionSystem
# Initialize the system
attribution_system = SourceAttributionSystem()
# Perform attribution
result = attribution_system.attribute_sources(
    query="What is machine learning?",
    retrieval_results=retrieval_results,
    response_text="Machine learning is a subset of AI..."
)
2. Confidence Scoring
Calculates confidence scores based on multiple factors:
- Retrieval Relevance: How well sources match the query
- Generation Uncertainty: Quality of the generated response
- Source Reliability: Credibility and authority of sources
- Temporal Factors: Recency and freshness of information
from packages.rag.source_attribution import ConfidenceCalculator
calculator = ConfidenceCalculator()
# Calculate retrieval confidence
retrieval_confidences = calculator.calculate_retrieval_confidence(
    query, retrieval_results
)
# Calculate generation confidence
gen_confidence = calculator.calculate_generation_confidence(
    query, context, response
)
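The factor scores above can then be blended into a single confidence value. A minimal sketch of one way to do this, assuming calculate_retrieval_confidence returns one score per retrieved chunk; the equal weights are illustrative, not the library's actual formula:
# Blend retrieval and generation confidence (weights are illustrative assumptions)
avg_retrieval = sum(retrieval_confidences) / len(retrieval_confidences)
overall_confidence = 0.5 * avg_retrieval + 0.5 * gen_confidence
print(f"Blended confidence: {overall_confidence:.3f}")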
3. Uncertainty Quantification
Quantifies different types of uncertainty:
- Semantic Uncertainty: Alignment between response and sources
- Factual Uncertainty: Confidence in factual claims
- Temporal Uncertainty: Age-based uncertainty
from packages.rag.source_attribution import UncertaintyQuantifier
quantifier = UncertaintyQuantifier()
# Calculate semantic uncertainty
semantic_uncertainty = quantifier.calculate_semantic_uncertainty(
    response_segment, source_segments
)
# Calculate factual uncertainty
factual_uncertainty = quantifier.calculate_factual_uncertainty(
    response_segment, sources
)
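Temporal (age-based) uncertainty is computed the same way via calculate_temporal_uncertainty (see the API reference below). A short sketch; the simple averaging at the end is illustrative only:
# Calculate temporal uncertainty from the age of the underlying sources
temporal_uncertainty = quantifier.calculate_temporal_uncertainty(sources)
# Combine the three measurements for reporting (simple average, illustrative only)
combined_uncertainty = (semantic_uncertainty + factual_uncertainty + temporal_uncertainty) / 3
print(f"Combined uncertainty: {combined_uncertainty:.3f}")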
Configuration
Basic Configuration
from packages.rag.source_attribution import create_source_attribution_system
# Create with default settings
attribution_system = create_source_attribution_system()
# Create with custom confidence calculator
from packages.rag.source_attribution import ConfidenceCalculator
custom_calculator = ConfidenceCalculator(model_name="custom-model")
attribution_system = create_source_attribution_system(
    confidence_calculator=custom_calculator
)
Advanced Configuration
from packages.rag.source_attribution import (
    SourceAttributionSystem, ConfidenceCalculator,
    UncertaintyQuantifier, SourceRanker, CitationFormatter
)
# Create custom components
confidence_calculator = ConfidenceCalculator(model_name="cross-encoder/ms-marco-MiniLM-L-6-v2")
uncertainty_quantifier = UncertaintyQuantifier()
source_ranker = SourceRanker()
citation_formatter = CitationFormatter()
# Create system with custom components
attribution_system = SourceAttributionSystem(
    confidence_calculator=confidence_calculator,
    uncertainty_quantifier=uncertainty_quantifier,
    source_ranker=source_ranker,
    citation_formatter=citation_formatter
)
Usage Examples
Basic Attribution
# Query and retrieval results
query = "What are the benefits of renewable energy?"
retrieval_results = [
    RetrievalResult(
        chunk=Chunk(
            content="Renewable energy reduces carbon emissions...",
            metadata={"source": "energy_report.pdf", "page": 1},
            chunk_id="chunk_1",
            source="energy_report.pdf",
            start_char=0,
            end_char=100
        ),
        score=0.9,
        retrieval_method="vector"
    )
]
response_text = "Renewable energy offers environmental and economic benefits."
# Perform attribution
attribution_result = attribution_system.attribute_sources(
    query=query,
    retrieval_results=retrieval_results,
    response_text=response_text
)
print(f"Overall confidence: {attribution_result.overall_confidence:.3f}")
print(f"Number of sources: {len(attribution_result.source_ranking)}")
print(f"Uncertainty metrics: {attribution_result.uncertainty_metrics}")
Confidence-based Filtering
from packages.rag.confidence_filtering import ConfidenceFilter, FilteringPolicy
# Create confidence filter
confidence_filter = ConfidenceFilter(policy=FilteringPolicy.MODERATE)
# Apply filtering
filtering_result = confidence_filter.filter_response(attribution_result)
print(f"Final action: {filtering_result.final_action}")
print(f"Warnings: {len(filtering_result.warnings)}")
print(f"Failed gates: {filtering_result.failed_gates}")
Source Verification
from packages.rag.source_verification import SourceVerifier
# Create verifier
verifier = SourceVerifier()
# Verify sources
for source in attribution_result.source_ranking:
    verification_result = verifier.verify_source(source)
    print(f"Source {source.source_id}: {verification_result.status.value}")
    print(f"Confidence: {verification_result.confidence:.3f}")
Citation Formatting
from packages.rag.source_verification import EnhancedCitationFormatter
# Create formatter
formatter = EnhancedCitationFormatter()
# Format citations with verification
enhanced_citations = formatter.format_enhanced_citations(
    attribution_result.source_ranking,
    style="apa",
    include_verification=True
)
for citation in enhanced_citations:
    print(f"Citation: {citation.citation_text}")
    print(f"Verification: {citation.verification_status.value}")
    print(f"Credibility: {citation.credibility_score:.3f}")
API Reference
SourceAttributionSystem
Main class for source attribution functionality.
Methods
attribute_sources(query, retrieval_results, response_text, rerank_results=None)
: Perform comprehensive source attribution
_convert_to_source_segments(retrieval_results)
: Convert retrieval results to source segments
_segment_response(response_text, source_segments, confidences)
: Segment response and attribute sources
AttributionResult
Result container for attribution analysis.
Attributes
response_text
: The original response text
response_segments
: List of response segments with attribution
source_ranking
: Ranked list of sources
overall_confidence
: Overall confidence score (0-1)
uncertainty_metrics
: Dictionary of uncertainty measurements
warnings
: List of warnings about the response
metadata
: Additional metadata
ConfidenceCalculator
Calculates confidence scores for various components.
Methods
calculate_retrieval_confidence(query, retrieval_results)
: Calculate retrieval confidence
calculate_generation_confidence(query, context, response)
: Calculate generation confidence
calculate_source_reliability(source)
: Calculate source reliability
UncertaintyQuantifier
Quantifies uncertainty in different aspects of responses.
Methods
calculate_semantic_uncertainty(response_segment, source_segments)
: Calculate semantic uncertainty
calculate_factual_uncertainty(response_segment, sources)
: Calculate factual uncertainty
calculate_temporal_uncertainty(sources)
: Calculate temporal uncertainty
Monitoring and Analytics
Dashboard Integration
from packages.rag.dashboard import SourceAttributionDashboard
# Create dashboard
dashboard = SourceAttributionDashboard()
# Add attribution results
dashboard.add_attribution_result(attribution_result, query_id="query_123")
# Get dashboard data
dashboard_data = dashboard.get_dashboard_data(hours_back=24)
print(f"Total queries: {dashboard_data['current_metrics']['total_queries']}")
print(f"Average confidence: {dashboard_data['current_metrics']['avg_confidence']:.3f}")
Performance Monitoring
# Get performance report
performance_report = dashboard.get_performance_report(hours_back=24)
print(f"Performance score: {performance_report['summary']['performance_score']:.3f}")
print(f"Recommendations: {performance_report['recommendations']}")
Calibration
Confidence Calibration
from packages.rag.confidence_calibration import ConfidenceCalibrator, FeedbackType
# Create calibrator
calibrator = ConfidenceCalibrator()
# Add calibration data
calibrator.add_calibration_point(
    query_id="query_123",
    predicted_confidence=0.8,
    actual_accuracy=0.7,
    feedback_type=FeedbackType.USER_RATING,
    feedback_source="user_feedback"
)
# Train calibration model
calibrator.train_calibration_model()
# Apply calibration
calibrated_result = calibrator.calibrate_attribution_result(attribution_result)
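In practice, calibration points are usually collected in batches from logged user feedback and the model is retrained periodically. A sketch, assuming feedback is logged as simple records with a predicted confidence and an observed accuracy:
# Batch calibration from logged feedback (the record format is an illustrative assumption)
feedback_log = [
    {"query_id": "query_124", "predicted": 0.9, "accuracy": 0.85},
    {"query_id": "query_125", "predicted": 0.6, "accuracy": 0.40},
]
for record in feedback_log:
    calibrator.add_calibration_point(
        query_id=record["query_id"],
        predicted_confidence=record["predicted"],
        actual_accuracy=record["accuracy"],
        feedback_type=FeedbackType.USER_RATING,
        feedback_source="feedback_log"
    )
calibrator.train_calibration_model()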
Calibration Monitoring
# Get calibration report
calibration_report = calibrator.get_calibration_report(days_back=7)
print(f"Calibration status: {calibration_report['metrics']['calibration_status']}")
print(f"Brier score: {calibration_report['metrics']['brier_score']:.3f}")
print(f"Recommendations: {calibration_report['recommendations']}")
Troubleshooting
Common Issues
- Low Confidence Scores
  - Check retrieval quality
  - Review confidence calculation parameters
  - Implement calibration
- Poor Attribution Coverage
  - Increase retrieval result count
  - Improve source matching
  - Review filtering thresholds
- High Uncertainty
  - Improve source-response alignment
  - Add conflict resolution
  - Enhance source quality
Diagnostic Tools
from packages.rag.troubleshooting_attribution import AttributionTroubleshooter
# Create troubleshooter
troubleshooter = AttributionTroubleshooter()
# Run diagnostics
diagnostics = troubleshooter.diagnose_attribution_system(
    attribution_results=[attribution_result]
)
# Generate recommendations
recommendations = troubleshooter.generate_optimization_recommendations(diagnostics)
# Create troubleshooting report
report = troubleshooter.create_troubleshooting_report(diagnostics, recommendations)
Best Practices
Source Quality
- Maintain diverse source collections
- Implement source quality scoring
- Regular source verification
- Track source performance
Confidence Scoring
- Collect user feedback for calibration
- Monitor confidence drift
- Use multiple calibration methods
- Validate on held-out data
Uncertainty Quantification
- Distinguish epistemic vs aleatoric uncertainty
- Include query complexity factors
- Validate with human judgments
- Provide uncertainty explanations
Performance
- Implement intelligent caching
- Use parallel processing
- Monitor system performance
- Optimize retrieval algorithms
Security Considerations
- Validate all source URLs
- Implement rate limiting
- Sanitize user inputs
- Monitor for malicious sources
- Implement access controls
Performance Optimization
- Cache frequent queries (see the caching sketch after this list)
- Use lazy loading for large datasets
- Implement result pagination
- Optimize database queries
- Monitor memory usage
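A simple in-process cache over repeated queries avoids recomputing attribution for identical inputs. A minimal sketch (not part of the library; keyed only on the query and response text):
# Illustrative memoization of attribution results for repeated (query, response) pairs
_attribution_cache = {}

def cached_attribution(query, retrieval_results, response_text):
    key = (query, response_text)
    if key not in _attribution_cache:
        _attribution_cache[key] = attribution_system.attribute_sources(
            query=query,
            retrieval_results=retrieval_results,
            response_text=response_text
        )
    return _attribution_cache[key]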
Deployment
Environment Variables
# Confidence calculation
CONFIDENCE_MODEL_NAME=cross-encoder/ms-marco-MiniLM-L-6-v2
CONFIDENCE_THRESHOLD=0.5
# Source verification
SOURCE_VERIFICATION_TIMEOUT=10
SOURCE_VERIFICATION_MAX_REDIRECTS=5
# Calibration
CALIBRATION_WINDOW_DAYS=30
MIN_SAMPLES_FOR_CALIBRATION=100
# Dashboard
DASHBOARD_UPDATE_INTERVAL=30
DASHBOARD_MAX_HISTORY_HOURS=24
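These variables can be read at startup and passed to the component constructors shown earlier. A sketch, assuming the constructors accept the parameters used in this document:
import os

from packages.rag.source_attribution import ConfidenceCalculator, create_source_attribution_system

# Read deployment settings from the environment (defaults mirror the values above)
model_name = os.environ.get("CONFIDENCE_MODEL_NAME", "cross-encoder/ms-marco-MiniLM-L-6-v2")
confidence_threshold = float(os.environ.get("CONFIDENCE_THRESHOLD", "0.5"))  # would feed the confidence filter (illustrative)

calculator = ConfidenceCalculator(model_name=model_name)
attribution_system = create_source_attribution_system(confidence_calculator=calculator)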
Docker Deployment
FROM python:3.9-slim
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY packages/ /app/packages/
COPY config/ /app/config/
ENV PYTHONPATH=/app
CMD ["python", "-m", "packages.rag.source_attribution"]
Testing
Unit Tests
# Run unit tests
pytest packages/rag/tests/test_source_attribution.py -v
Integration Tests
# Run integration tests
pytest packages/rag/tests/test_source_attribution.py::TestIntegration -v
Performance Tests
# Run performance tests
pytest packages/rag/tests/test_source_attribution.py -k "performance" -v
Contributing
- Follow the existing code style
- Add tests for new functionality
- Update documentation
- Ensure all tests pass
- Submit pull request
License
This module is part of the RecoAgent project and follows the same licensing terms.