Operational Runbooks
This guide provides comprehensive operational procedures for managing the document ingestion pipeline in production environments.
Table of Contents
- Daily Operations
- Monitoring and Alerting
- Troubleshooting
- Maintenance Procedures
- Incident Response
- Performance Tuning
- Backup and Recovery
- Security Operations
Daily Operations
Morning Health Check
Procedure: Daily system health verification
Duration: 15 minutes
Steps:
- Check System Status
# Check pipeline status
curl -X GET "https://ingestion-api.company.com/health" | jq
# Check database connectivity
curl -X GET "https://ingestion-api.company.com/health/database" | jq
# Check file system access
curl -X GET "https://ingestion-api.company.com/health/filesystem" | jq
- Review Overnight Metrics
# Get processing statistics
curl -X GET "https://ingestion-api.company.com/stats" | jq
# Check error rates
curl -X GET "https://ingestion-api.company.com/metrics/error_rate" | jq
- Review Dead Letter Queue
# Check DLQ statistics
curl -X GET "https://ingestion-api.company.com/dlq/stats" | jq
# List pending items
curl -X GET "https://ingestion-api.company.com/dlq/pending?limit=10" | jq
- Check Active Alerts
# List active alerts
curl -X GET "https://ingestion-api.company.com/alerts/active" | jq
Expected Results:
- All health checks return "healthy" status
- Error rate < 5%
- DLQ items < 50
- No critical alerts active
Escalation: If any check fails, escalate to the on-call engineer.
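The three checks above lend themselves to a single script that prints a pass/fail summary for the morning review. The sketch below is illustrative only and assumes each health endpoint returns a JSON body with a top-level "status" field, as in the examples above.
#!/usr/bin/env bash
# morning_health_check.sh -- illustrative sketch; assumes each endpoint
# returns JSON with a top-level "status" field.
BASE="https://ingestion-api.company.com"
failed=0
for endpoint in health health/database health/filesystem; do
  status=$(curl -sf "$BASE/$endpoint" | jq -r '.status // "unknown"' 2>/dev/null)
  status=${status:-unreachable}
  echo "$endpoint: $status"
  [ "$status" = "healthy" ] || failed=1
done
if [ "$failed" -eq 1 ]; then
  echo "One or more health checks failed -- escalate to the on-call engineer."
  exit 1
fi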
Document Processing Review
Procedure: Review document processing performance
Duration: 10 minutes
Steps:
- Check Processing Volume
# Get processing statistics for last 24 hours
curl -X GET "https://ingestion-api.company.com/stats?period=24h" | jq
- Review Processing Times
# Check processing time percentiles
curl -X GET "https://ingestion-api.company.com/metrics/processing_time" | jq
- Identify Slow Documents
# Get slow processing documents
curl -X GET "https://ingestion-api.company.com/documents/slow?threshold=30000" | jq
Expected Results:
- Processing volume within expected range
- P95 processing time < 30 seconds
- No documents stuck in processing
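If you want a single number to scan for, the snippet below counts documents over the 30-second threshold; it is a sketch and assumes /documents/slow returns a JSON array of document records.
# Sketch: count documents over the 30 s threshold (assumes a JSON array response).
slow=$(curl -sf "https://ingestion-api.company.com/documents/slow?threshold=30000" | jq 'length')
echo "Documents over 30s: ${slow:-unknown}"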
Source System Health Check
Procedure: Verify source system connectivity
Duration: 10 minutes
Steps:
- Check SharePoint Connectivity
# Test SharePoint connection
curl -X POST "https://ingestion-api.company.com/test/sharepoint" \
-H "Content-Type: application/json" \
-d '{"endpoint": "sharepoint.company.com"}'
- Check S3 Connectivity
# Test S3 connection
curl -X POST "https://ingestion-api.company.com/test/s3" \
-H "Content-Type: application/json" \
-d '{"bucket": "documents-bucket"}'
- Check Database Connectivity
# Test database connections
curl -X POST "https://ingestion-api.company.com/test/database" \
-H "Content-Type: application/json" \
-d '{"source": "hr_database"}'
Expected Results:
- All source systems accessible
- Response times < 5 seconds
- No authentication errors
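The individual connectivity tests can be run as one pass with timing, which makes the 5-second response target easy to verify. This is a sketch that reuses the payloads shown above; confirm them against your environment.
#!/usr/bin/env bash
# source_checks.sh -- illustrative sketch reusing the payloads shown above.
BASE="https://ingestion-api.company.com"
check() {  # usage: check <path> <json-payload>
  local t
  if t=$(curl -s -o /dev/null -w '%{time_total}' --max-time 5 \
         -X POST "$BASE/test/$1" -H "Content-Type: application/json" -d "$2"); then
    echo "$1: ${t}s"
  else
    echo "$1: FAILED"
  fi
}
check sharepoint '{"endpoint": "sharepoint.company.com"}'
check s3 '{"bucket": "documents-bucket"}'
check database '{"source": "hr_database"}'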
Monitoring and Alerting
Alert Management
Procedure: Respond to and manage alerts
Severity Levels:
- Critical: System down, data loss, security breach
- High: High error rates, performance degradation
- Medium: Warning conditions, capacity issues
- Low: Informational alerts
Critical Alert Response
Procedure: Respond to critical alerts
Response Time: 15 minutes
Steps:
- Immediate Assessment
# Check system status
curl -X GET "https://ingestion-api.company.com/health" | jq
# Check recent errors
tail -100 /var/log/ingestion/error.log
- Identify Root Cause
# Check system resources
top -p $(pgrep -f ingestion)
df -h
free -h
# Check database status
psql -h localhost -U pipeline_user -d ingestion -c "SELECT * FROM pg_stat_activity;"
- Implement Fix
- Follow specific runbook for the alert type
- Document actions taken
- Monitor for resolution
- Escalation
- If not resolved in 15 minutes, escalate to senior engineer
- If not resolved in 30 minutes, escalate to manager
- If not resolved in 60 minutes, escalate to director
Alert Acknowledgment
Procedure: Acknowledge and track alerts
Steps:
- Acknowledge Alert
# Acknowledge alert
curl -X POST "https://ingestion-api.company.com/alerts/{alert_id}/acknowledge" \
-H "Content-Type: application/json" \
-d '{"acknowledged_by": "engineer_name", "notes": "Investigating issue"}'
- Update Alert Status
# Resolve alert
curl -X POST "https://ingestion-api.company.com/alerts/{alert_id}/resolve" \
-H "Content-Type: application/json" \
-d '{"resolution_notes": "Issue resolved by restarting service"}'
Troubleshooting
Common Issues
High Error Rate
Symptoms:
- Error rate > 10%
- Multiple failed documents
- DLQ items increasing
Diagnosis:
# Check error logs
grep "ERROR" /var/log/ingestion/application.log | tail -50
# Check error categories
curl -X GET "https://ingestion-api.company.com/errors/categories" | jq
# Check recent failures
curl -X GET "https://ingestion-api.company.com/documents/failed?limit=20" | jq
Common Causes:
- Source system connectivity issues
- File format changes
- Authentication failures
- Resource constraints
Resolution:
- Check source system connectivity
- Verify file formats haven't changed
- Check authentication credentials
- Monitor system resources
- Review recent configuration changes
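To speed up triage, the diagnosis steps can be condensed into a quick summary of the dominant error categories and messages. The jq filter below assumes /errors/categories returns an object of category-to-count pairs; adjust it to the actual response shape.
# Sketch: top error categories (assumes an object of category -> count).
curl -sf "https://ingestion-api.company.com/errors/categories" \
  | jq 'to_entries | sort_by(-.value) | .[:5]'
# Most frequent ERROR messages in the last 1000 log lines
grep "ERROR" /var/log/ingestion/application.log | tail -1000 \
  | sed 's/.*ERROR//' | sort | uniq -c | sort -rn | head -10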
Slow Processing
Symptoms:
- Processing time > 30 seconds
- Queue backup
- High CPU/memory usage
Diagnosis:
# Check processing times
curl -X GET "https://ingestion-api.company.com/metrics/processing_time" | jq
# Check system resources
top -p $(pgrep -f ingestion)
iostat -x 1 5
# Check database performance
psql -h localhost -U pipeline_user -d ingestion -c "
SELECT query, mean_time, calls
FROM pg_stat_statements
ORDER BY mean_time DESC
LIMIT 10;"
Resolution:
- Scale up processing capacity
- Optimize database queries
- Check for resource contention
- Review document sizes
- Consider horizontal scaling
Dead Letter Queue Issues
Symptoms:
- DLQ items not decreasing
- Items stuck in review
- High priority items aging
Diagnosis:
# Check DLQ statistics
curl -X GET "https://ingestion-api.company.com/dlq/stats" | jq
# Check aging items
curl -X GET "https://ingestion-api.company.com/dlq/aging" | jq
# Check review workflows
curl -X GET "https://ingestion-api.company.com/dlq/reviews" | jq
Resolution:
- Review pending items
- Assign items to reviewers
- Process retry candidates
- Escalate high-priority items
- Clean up resolved items
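A quick way to start the review is to list the oldest pending items first. The snippet below is a sketch; the id, priority, and created_at fields are assumptions about the /dlq/pending response and should be checked against the API.
# Sketch: oldest pending DLQ items first (field names are assumptions).
curl -sf "https://ingestion-api.company.com/dlq/pending?limit=100" \
  | jq 'sort_by(.created_at) | .[:10] | map({id, priority, created_at})'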
Database Issues
Connection Pool Exhaustion
Symptoms:
- "Too many connections" errors
- Slow database responses
- Application timeouts
Diagnosis:
-- Check active connections
SELECT count(*) FROM pg_stat_activity WHERE state = 'active';
-- Check connection limits
SHOW max_connections;
-- Check connection usage
SELECT datname, usename, count(*)
FROM pg_stat_activity
GROUP BY datname, usename;
Resolution:
- Increase connection pool size
- Optimize long-running queries
- Add database read replicas
- Implement connection pooling
- Review connection timeout settings
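If the pool cannot be resized immediately, long-idle sessions can be reclaimed directly in PostgreSQL. The statement below is a cautious sketch; confirm the sessions are safe to terminate before running it.
# Sketch: terminate sessions idle for more than 10 minutes on the ingestion DB.
psql -h localhost -U pipeline_user -d ingestion -c "
  SELECT pg_terminate_backend(pid)
  FROM pg_stat_activity
  WHERE datname = 'ingestion'
    AND state = 'idle'
    AND state_change < now() - interval '10 minutes';"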
Lock Contention
Symptoms:
- Queries hanging
- Deadlock errors
- Slow updates
Diagnosis:
-- Check for locks
SELECT
blocked_locks.pid AS blocked_pid,
blocked_activity.usename AS blocked_user,
blocking_locks.pid AS blocking_pid,
blocking_activity.usename AS blocking_user,
blocked_activity.query AS blocked_statement,
blocking_activity.query AS blocking_statement
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks ON blocking_locks.locktype = blocked_locks.locktype
AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database
AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid
AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid
AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid
AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid
AND blocking_locks.pid != blocked_locks.pid
JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted;
Resolution:
- Kill blocking queries
- Optimize query performance
- Review transaction isolation levels
- Implement proper indexing
- Consider read replicas for reporting
File System Issues
Disk Space Issues
Symptoms:
- "No space left on device" errors
- Slow file operations
- Application failures
Diagnosis:
# Check disk usage
df -h
# Find large files
find /var/log/ingestion -type f -size +100M -exec ls -lh {} \;
# Check for log rotation
ls -la /var/log/ingestion/
Resolution:
- Clean up old log files
- Archive old documents
- Increase disk space
- Implement log rotation
- Move data to external storage
Permission Issues
Symptoms:
- "Permission denied" errors
- File access failures
- Authentication errors
Diagnosis:
# Check file permissions
ls -la /var/data/ingestion/
# Check directory permissions
ls -ld /var/data/ingestion/
# Check process user
ps aux | grep ingestion
Resolution:
- Fix file permissions
- Check user/group ownership
- Verify process user
- Review security policies
- Update access controls
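A typical fix restores ownership to the service account and tightens directory and file modes. The commands below are a sketch; the "ingestion" user and group are assumptions, so confirm the actual service account and data path before running them.
# Sketch: restore ownership and permissions (the "ingestion" account is assumed).
chown -R ingestion:ingestion /var/data/ingestion
find /var/data/ingestion -type d -exec chmod 750 {} \;
find /var/data/ingestion -type f -exec chmod 640 {} \;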
Maintenance Procedures
Daily Maintenance
Procedure: Daily system maintenance
Duration: 30 minutes
Steps:
- Log Rotation
# Rotate application logs
logrotate -f /etc/logrotate.d/ingestion
# Clean up old logs
find /var/log/ingestion -name "*.log.*" -mtime +7 -delete
- Database Maintenance
-- Update table statistics
ANALYZE;
-- Clean up old data
DELETE FROM metrics WHERE timestamp < NOW() - INTERVAL '30 days';
DELETE FROM health_checks WHERE timestamp < NOW() - INTERVAL '7 days';
- Cache Cleanup
# Clear application cache
curl -X POST "https://ingestion-api.company.com/admin/cache/clear"
# Clean up temporary files
find /tmp -name "ingestion_*" -mtime +1 -delete
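These daily steps are good candidates for automation. A possible crontab layout is sketched below; the times are arbitrary and the paths are taken from the commands above, so adapt both to your environment.
# Sketch: cron schedule for the daily maintenance steps (times are arbitrary).
# m  h  dom mon dow  command
30 2   *   *   *    logrotate -f /etc/logrotate.d/ingestion
45 2   *   *   *    find /var/log/ingestion -name "*.log.*" -mtime +7 -delete
0  3   *   *   *    find /tmp -name "ingestion_*" -mtime +1 -delete
15 3   *   *   *    curl -s -X POST "https://ingestion-api.company.com/admin/cache/clear"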
Weekly Maintenance
Procedure: Weekly system maintenance
Duration: 2 hours
Steps:
- Database Optimization
-- Reindex tables
REINDEX DATABASE ingestion;
-- Vacuum tables
VACUUM ANALYZE;
-- Update statistics
ANALYZE;
- Dead Letter Queue Cleanup
# Archive old DLQ items
curl -X POST "https://ingestion-api.company.com/dlq/cleanup" \
-H "Content-Type: application/json" \
-d '{"days_old": 30}' -
Document Version Cleanup
# Archive old versions
curl -X POST "https://ingestion-api.company.com/versions/cleanup" \
-H "Content-Type: application/json" \
-d '{"days_old": 90, "keep_latest": 5}'
- Performance Review
# Generate performance report
curl -X GET "https://ingestion-api.company.com/reports/performance" > performance_report.json
Monthly Maintenance
Procedure: Monthly system maintenance
Duration: 4 hours
Steps:
- Security Updates
# Update system packages
apt update && apt upgrade -y
# Update Python dependencies
pip install --upgrade -r requirements.txt
# Run security scan
bandit -r recoagent/
- Backup Verification
# Verify backup integrity
sqlite3 backup_test.db "SELECT count(*) FROM documents;"
# Test restore procedure
./scripts/test_restore.sh
- Capacity Planning
# Generate capacity report
curl -X GET "https://ingestion-api.company.com/reports/capacity" > capacity_report.json
# Review growth trends
curl -X GET "https://ingestion-api.company.com/metrics/growth" | jq
- Documentation Update
- Review and update runbooks
- Update system diagrams
- Document lessons learned
- Update contact information
Incident Response
Incident Classification
Severity Levels:
- P1 (Critical): System down, data loss, security breach
- P2 (High): Major functionality affected, high error rates
- P3 (Medium): Minor functionality affected, performance issues
- P4 (Low): Cosmetic issues, minor bugs
P1 Incident Response
Procedure: Critical incident response
Response Time: 15 minutes
Steps:
- Immediate Response
# Check system status
curl -X GET "https://ingestion-api.company.com/health" | jq
# Check recent errors
tail -100 /var/log/ingestion/error.log
# Check system resources
top -p $(pgrep -f ingestion)
df -h
free -h
- Assess Impact
- Number of users affected
- Data at risk
- Business impact
- Estimated resolution time
- Communicate Status
- Notify stakeholders
- Update status page
- Send incident notifications
- Implement Fix
- Follow incident-specific runbook
- Document all actions
- Test resolution
- Post-Incident
- Conduct post-mortem
- Update runbooks
- Implement preventive measures
Communication Templates
Incident Notification
Subject: [P1] Document Ingestion System Down
Incident Summary:
- Severity: P1 (Critical)
- System: Document Ingestion Pipeline
- Status: Investigating
- Impact: All document processing affected
- ETA: TBD
Next Update: 30 minutes
Incident Commander: [Name]
Status Update
Subject: [P1] Document Ingestion System - Status Update
Current Status: Investigating
Progress: Identified root cause as database connection pool exhaustion
Next Steps: Scaling up database connections and implementing connection pooling
ETA: 2 hours
Next Update: 1 hour
Incident Commander: [Name]
Resolution Notification
Subject: [P1] Document Ingestion System - RESOLVED
Status: RESOLVED
Resolution: Increased database connection pool size and implemented connection pooling
Duration: 1.5 hours
Root Cause: Database connection pool exhaustion under high load
Prevention: Implemented connection pooling and monitoring
Post-mortem scheduled for: [Date/Time]
Incident Commander: [Name]
Performance Tuning
Database Performance
Query Optimization
Procedure: Optimize slow database queries
Steps:
- Identify Slow Queries
-- Find slow queries
SELECT query, mean_time, calls, total_time
FROM pg_stat_statements
ORDER BY mean_time DESC
LIMIT 10;
- Analyze Query Plans
-- Analyze query execution plan
EXPLAIN ANALYZE SELECT * FROM documents WHERE status = 'processing';
- Add Indexes
-- Add missing indexes
CREATE INDEX CONCURRENTLY idx_documents_status ON documents(status);
CREATE INDEX CONCURRENTLY idx_documents_created_at ON documents(created_at);
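Before adding indexes, it can help to confirm which tables are being scanned sequentially most often. The query below is a sketch against PostgreSQL's standard statistics views.
# Sketch: tables with the heaviest sequential scan activity (index candidates).
psql -h localhost -U pipeline_user -d ingestion -c "
  SELECT relname, seq_scan, idx_scan, n_live_tup
  FROM pg_stat_user_tables
  ORDER BY seq_scan DESC
  LIMIT 10;"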
Connection Pooling
Configuration:
# database.yml
pool:
  min_connections: 5
  max_connections: 50
  connection_timeout: 30
  idle_timeout: 600
Application Performance
Memory Optimization
Procedure: Optimize memory usage
Steps:
- Profile Memory Usage
# Monitor memory usage
ps aux | grep ingestion
pmap -x $(pgrep -f ingestion)
- Optimize Chunk Sizes
# Adjust chunk size based on available memory (assumes psutil, imported in the next snippet)
available_memory_mb = psutil.virtual_memory().available // (1024 * 1024)
chunk_size = min(1000, available_memory_mb // 10)
- Implement Memory Monitoring
import psutil
def check_memory_usage():
    memory = psutil.virtual_memory()
    if memory.percent > 80:
        # Trigger cleanup or scaling
        pass
CPU Optimization
Procedure: Optimize CPU usage
Steps:
- Monitor CPU Usage
# Monitor CPU usage
top -p $(pgrep -f ingestion)
htop -p $(pgrep -f ingestion)
- Optimize Concurrency
import os
# Adjust concurrency based on CPU cores
max_concurrent = min(10, os.cpu_count() * 2)
- Implement CPU Monitoring
import psutil
def check_cpu_usage():
    cpu_percent = psutil.cpu_percent(interval=1)
    if cpu_percent > 80:
        # Scale down or optimize
        pass
Backup and Recovery
Backup Procedures
Daily Backup
Procedure: Daily database backup
Duration: 30 minutes
Steps:
- Create Database Backup
# SQLite backup
sqlite3 ingestion.db ".backup backup_$(date +%Y%m%d).db"
# PostgreSQL backup
pg_dump -h localhost -U pipeline_user ingestion > backup_$(date +%Y%m%d).sql
- Verify Backup
# Verify backup size
ls -lh backup_$(date +%Y%m%d).db
# Verify backup integrity
sqlite3 backup_$(date +%Y%m%d).db "SELECT count(*) FROM documents;"
- Upload to Cloud Storage
# Upload to S3
aws s3 cp backup_$(date +%Y%m%d).db s3://ingestion-backups/
# Set retention policy
aws s3api put-bucket-lifecycle-configuration \
--bucket ingestion-backups \
--lifecycle-configuration file://lifecycle.json
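The backup, verification, and upload steps can be combined into one script. The sketch below assumes the PostgreSQL instance and the ingestion-backups bucket shown above; adjust the retention window to your policy.
#!/usr/bin/env bash
# daily_backup.sh -- illustrative sketch based on the commands above.
set -euo pipefail
stamp=$(date +%Y%m%d)
pg_dump -h localhost -U pipeline_user ingestion > "backup_${stamp}.sql"
gzip -f "backup_${stamp}.sql"
aws s3 cp "backup_${stamp}.sql.gz" "s3://ingestion-backups/"
# Keep local copies for 7 days only
find . -maxdepth 1 -name "backup_*.sql.gz" -mtime +7 -delete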
Weekly Backup
Procedure: Weekly full system backup
Duration: 2 hours
Steps:
- Backup Application Code
# Create code backup
tar -czf code_backup_$(date +%Y%m%d).tar.gz /opt/ingestion/
- Backup Configuration
# Backup configuration files
tar -czf config_backup_$(date +%Y%m%d).tar.gz /etc/ingestion/
- Backup Logs
# Backup application logs
tar -czf logs_backup_$(date +%Y%m%d).tar.gz /var/log/ingestion/
Recovery Procedures
Database Recovery
Procedure: Restore database from backup
Duration: 1 hour
Steps:
- Stop Services
# Stop ingestion services
systemctl stop ingestion-api
systemctl stop ingestion-worker
- Restore Database
# Restore SQLite database
cp backup_20231201.db ingestion.db
# Restore PostgreSQL database
psql -h localhost -U pipeline_user -d ingestion < backup_20231201.sql
- Verify Recovery
# Verify data integrity
sqlite3 ingestion.db "SELECT count(*) FROM documents;"
# Check for corruption
sqlite3 ingestion.db "PRAGMA integrity_check;"
- Restart Services
# Start services
systemctl start ingestion-api
systemctl start ingestion-worker
# Verify service health
curl -X GET "https://ingestion-api.company.com/health" | jq
Full System Recovery
Procedure: Complete system recovery
Duration: 4 hours
Steps:
- Provision New Server
- Launch new instance
- Install operating system
- Configure network
- Install Application
# Install dependencies
apt update && apt install -y python3 python3-pip postgresql
# Install application
pip install -r requirements.txt
- Restore Data
# Restore database
# Restore configuration
# Restore logs
- Configure Services
# Configure systemd services
# Configure monitoring
# Configure backup
- Verify Recovery
# Run health checks
# Test functionality
# Monitor for issues
Security Operations
Security Monitoring
Daily Security Review
Procedure: Daily security monitoring
Duration: 15 minutes
Steps:
- Check Security Logs
# Review authentication logs
grep "authentication" /var/log/ingestion/security.log | tail -50
# Check for failed login attempts
grep "failed" /var/log/ingestion/auth.log | tail -20
- Review Access Logs
# Check API access logs
tail -100 /var/log/ingestion/access.log | grep -E "(POST|PUT|DELETE)"
# Check for suspicious activity
grep -E "(\.\.\/|script|union|select)" /var/log/ingestion/access.log
- Verify Security Metrics
# Check security metrics
curl -X GET "https://ingestion-api.company.com/metrics/security" | jq
Vulnerability Management
Weekly Vulnerability Scan
Procedure: Weekly security scanning
Duration: 1 hour
Steps:
- Dependency Scan
# Scan Python dependencies
safety check
# Scan system packages
apt list --upgradable
- Code Security Scan
# Run Bandit security linter
bandit -r recoagent/
# Run Semgrep
semgrep --config=auto recoagent/
- Infrastructure Scan
# Run nmap scan
nmap -sS -O localhost
# Check open ports
netstat -tulpn
Incident Response
Security Incident Response
Procedure: Respond to security incidents
Response Time: 5 minutes
Steps:
- Immediate Containment
# Block suspicious IPs
iptables -A INPUT -s 192.168.1.100 -j DROP
# Disable compromised accounts
curl -X POST "https://ingestion-api.company.com/admin/users/disable" \
-d '{"username": "compromised_user"}' -
Evidence Collection
# Collect system logs
tar -czf security_evidence_$(date +%Y%m%d_%H%M%S).tar.gz /var/log/
# Collect network logs
tcpdump -i eth0 -w security_capture.pcap
- Investigation
# Analyze logs
grep -E "192.168.1.100" /var/log/ingestion/access.log
# Check system integrity
aide --check
- Recovery
# Restore from backup if necessary
# Patch vulnerabilities
# Update security controls
Support Contacts
Escalation Matrix
| Issue Type | Primary | Secondary | Tertiary |
|---|---|---|---|
| P1 Critical | On-call Engineer | Senior Engineer | Engineering Manager |
| P2 High | On-call Engineer | Senior Engineer | Engineering Manager |
| P3 Medium | On-call Engineer | Senior Engineer | - |
| P4 Low | On-call Engineer | - | - |
Contact Information
- On-call Engineer: +1-555-0123 (24/7)
- Senior Engineer: +1-555-0124 (Business hours)
- Engineering Manager: +1-555-0125 (Business hours)
- Security Team: security@company.com
- Incident Response: incident@company.com
External Support
- Database Support: db-support@company.com
- Infrastructure Support: infra-support@company.com
- Security Vendor: security-vendor@company.com
Documentation Updates
Runbook Maintenance
Procedure: Keep runbooks current
Frequency: Monthly
Steps:
- Review Runbooks
- Check for outdated procedures
- Verify contact information
- Update system-specific details
- Update Procedures
- Document new procedures
- Remove obsolete steps
- Clarify ambiguous instructions
- Version Control
- Track changes in Git
- Tag releases
- Maintain change log
- Team Review
- Get feedback from team members
- Test procedures in staging
- Update based on lessons learned
Change Management
Procedure: Manage runbook changes
Steps:
- Propose Changes
- Create pull request
- Document rationale
- Get team review
- Test Changes
- Validate in staging environment
- Update test procedures
- Document test results
- Deploy Changes
- Merge to main branch
- Update documentation site
- Notify team of changes
- Monitor Impact
- Track usage of updated procedures
- Collect feedback
- Iterate based on results
Next Steps
- Deployment Guide - Production deployment
- Security Guide - Security best practices
- Performance Guide - High-volume optimization