Operational Runbooks

This guide provides comprehensive operational procedures for managing the document ingestion pipeline in production environments.

Table of Contents

  1. Daily Operations
  2. Monitoring and Alerting
  3. Troubleshooting
  4. Maintenance Procedures
  5. Incident Response
  6. Performance Tuning
  7. Backup and Recovery
  8. Security Operations

Daily Operations

Morning Health Check

Procedure: Daily system health verification

Duration: 15 minutes

Steps:

  1. Check System Status

     # Check pipeline status
     curl -X GET "https://ingestion-api.company.com/health" | jq

     # Check database connectivity
     curl -X GET "https://ingestion-api.company.com/health/database" | jq

     # Check file system access
     curl -X GET "https://ingestion-api.company.com/health/filesystem" | jq

  2. Review Overnight Metrics

     # Get processing statistics
     curl -X GET "https://ingestion-api.company.com/stats" | jq

     # Check error rates
     curl -X GET "https://ingestion-api.company.com/metrics/error_rate" | jq

  3. Review Dead Letter Queue

     # Check DLQ statistics
     curl -X GET "https://ingestion-api.company.com/dlq/stats" | jq

     # List pending items
     curl -X GET "https://ingestion-api.company.com/dlq/pending?limit=10" | jq

  4. Check Active Alerts

     # List active alerts
     curl -X GET "https://ingestion-api.company.com/alerts/active" | jq

Expected Results:

  • All health checks return "healthy" status
  • Error rate < 5%
  • DLQ items < 50
  • No critical alerts active

Escalation: If any checks fail, escalate to on-call engineer
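The four checks and the pass criteria above lend themselves to a small script. A minimal sketch, assuming the endpoints return JSON as shown (the helper names here are illustrative, not an existing tool):

```python
import json
import urllib.request

# Thresholds from the morning health-check expected results
ERROR_RATE_MAX = 0.05   # error rate < 5%
DLQ_MAX = 50            # DLQ items < 50

def fetch_json(base_url, path, timeout=10):
    """GET a JSON endpoint; the scripted equivalent of `curl ... | jq`."""
    with urllib.request.urlopen(base_url + path, timeout=timeout) as resp:
        return json.load(resp)

def evaluate_morning_check(health_statuses, error_rate, dlq_items, critical_alerts):
    """Return a list of failure descriptions; an empty list means all checks passed."""
    failures = []
    for path, status in health_statuses.items():
        if status != "healthy":
            failures.append(f"{path} reported {status!r}")
    if error_rate >= ERROR_RATE_MAX:
        failures.append(f"error rate {error_rate:.1%} >= 5%")
    if dlq_items >= DLQ_MAX:
        failures.append(f"{dlq_items} DLQ items >= 50")
    if critical_alerts:
        failures.append(f"{len(critical_alerts)} critical alert(s) active")
    return failures
```

Any non-empty result is grounds for escalation to the on-call engineer.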

Document Processing Review

Procedure: Review document processing performance

Duration: 10 minutes

Steps:

  1. Check Processing Volume

     # Get processing statistics for the last 24 hours
     curl -X GET "https://ingestion-api.company.com/stats?period=24h" | jq

  2. Review Processing Times

     # Check processing time percentiles
     curl -X GET "https://ingestion-api.company.com/metrics/processing_time" | jq

  3. Identify Slow Documents

     # Get slow-processing documents (threshold in milliseconds)
     curl -X GET "https://ingestion-api.company.com/documents/slow?threshold=30000" | jq

Expected Results:

  • Processing volume within expected range
  • P95 processing time < 30 seconds
  • No documents stuck in processing
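The P95 target can be checked mechanically from a sample of per-document durations. A sketch using the nearest-rank percentile (the metrics endpoint's exact response shape is not specified here, so this operates on a plain list of milliseconds):

```python
import math

P95_TARGET_MS = 30_000  # expected: P95 processing time < 30 seconds

def p95(durations_ms):
    """Nearest-rank 95th percentile of a non-empty sample."""
    ordered = sorted(durations_ms)
    rank = math.ceil(len(ordered) * 0.95)
    return ordered[rank - 1]

def within_target(durations_ms):
    """True when the sample's P95 is under the 30-second target."""
    return p95(durations_ms) < P95_TARGET_MS
```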

Source System Health Check

Procedure: Verify source system connectivity

Duration: 10 minutes

Steps:

  1. Check SharePoint Connectivity

     # Test SharePoint connection
     curl -X POST "https://ingestion-api.company.com/test/sharepoint" \
       -H "Content-Type: application/json" \
       -d '{"endpoint": "sharepoint.company.com"}'

  2. Check S3 Connectivity

     # Test S3 connection
     curl -X POST "https://ingestion-api.company.com/test/s3" \
       -H "Content-Type: application/json" \
       -d '{"bucket": "documents-bucket"}'

  3. Check Database Connectivity

     # Test database connections
     curl -X POST "https://ingestion-api.company.com/test/database" \
       -H "Content-Type: application/json" \
       -d '{"source": "hr_database"}'

Expected Results:

  • All source systems accessible
  • Response times < 5 seconds
  • No authentication errors

Monitoring and Alerting

Alert Management

Procedure: Respond to and manage alerts

Severity Levels:

  • Critical: System down, data loss, security breach
  • High: High error rates, performance degradation
  • Medium: Warning conditions, capacity issues
  • Low: Informational alerts

Critical Alert Response

Procedure: Respond to critical alerts

Response Time: 15 minutes

Steps:

  1. Immediate Assessment

     # Check system status
     curl -X GET "https://ingestion-api.company.com/health" | jq

     # Check recent errors
     tail -100 /var/log/ingestion/error.log

  2. Identify Root Cause

     # Check system resources
     top -p $(pgrep -f ingestion)
     df -h
     free -h

     # Check database status
     psql -h localhost -U pipeline_user -d ingestion -c "SELECT * FROM pg_stat_activity;"

  3. Implement Fix

     • Follow the specific runbook for the alert type
     • Document actions taken
     • Monitor for resolution

  4. Escalation

     • If not resolved in 15 minutes, escalate to a senior engineer
     • If not resolved in 30 minutes, escalate to the engineering manager
     • If not resolved in 60 minutes, escalate to the director
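The escalation timeline above is simple enough to encode, so tooling (for example, a paging bot) can compute who should be engaged next. A sketch; the role names follow the contact list in this runbook:

```python
def escalation_target(minutes_unresolved):
    """Map minutes since a critical alert fired to the escalation role
    per the ladder above: 15 min -> senior engineer, 30 -> manager,
    60 -> director."""
    ladder = [
        (60, "director"),
        (30, "engineering manager"),
        (15, "senior engineer"),
    ]
    for threshold, role in ladder:
        if minutes_unresolved >= threshold:
            return role
    return "on-call engineer"
```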

Alert Acknowledgment

Procedure: Acknowledge and track alerts

Steps:

  1. Acknowledge Alert

     # Acknowledge alert
     curl -X POST "https://ingestion-api.company.com/alerts/{alert_id}/acknowledge" \
       -H "Content-Type: application/json" \
       -d '{"acknowledged_by": "engineer_name", "notes": "Investigating issue"}'

  2. Update Alert Status

     # Resolve alert
     curl -X POST "https://ingestion-api.company.com/alerts/{alert_id}/resolve" \
       -H "Content-Type: application/json" \
       -d '{"resolution_notes": "Issue resolved by restarting service"}'
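The two calls can be wrapped in small helpers that build the request bodies shown above (a stdlib-only sketch; the base URL and field names mirror the curl examples, and the function names are illustrative):

```python
import json
import urllib.request

BASE_URL = "https://ingestion-api.company.com"

def acknowledge_request(alert_id, engineer, notes):
    """Build the URL and JSON body for the acknowledge call above."""
    url = f"{BASE_URL}/alerts/{alert_id}/acknowledge"
    body = json.dumps({"acknowledged_by": engineer, "notes": notes})
    return url, body

def resolve_request(alert_id, resolution_notes):
    """Build the URL and JSON body for the resolve call above."""
    url = f"{BASE_URL}/alerts/{alert_id}/resolve"
    body = json.dumps({"resolution_notes": resolution_notes})
    return url, body

def post_json(url, body):
    """Send a built request; equivalent to the curl -X POST above."""
    req = urllib.request.Request(
        url, data=body.encode(), method="POST",
        headers={"Content-Type": "application/json"},
    )
    return urllib.request.urlopen(req)
```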

Troubleshooting

Common Issues

High Error Rate

Symptoms:

  • Error rate > 10%
  • Multiple failed documents
  • DLQ items increasing

Diagnosis:

# Check error logs
grep "ERROR" /var/log/ingestion/application.log | tail -50

# Check error categories
curl -X GET "https://ingestion-api.company.com/errors/categories" | jq

# Check recent failures
curl -X GET "https://ingestion-api.company.com/documents/failed?limit=20" | jq

Common Causes:

  • Source system connectivity issues
  • File format changes
  • Authentication failures
  • Resource constraints

Resolution:

  1. Check source system connectivity
  2. Verify file formats haven't changed
  3. Check authentication credentials
  4. Monitor system resources
  5. Review recent configuration changes

Slow Processing

Symptoms:

  • Processing time > 30 seconds
  • Queue backup
  • High CPU/memory usage

Diagnosis:

# Check processing times
curl -X GET "https://ingestion-api.company.com/metrics/processing_time" | jq

# Check system resources
top -p $(pgrep -f ingestion)
iostat -x 1 5

# Check database performance
psql -h localhost -U pipeline_user -d ingestion -c "
SELECT query, mean_time, calls
FROM pg_stat_statements
ORDER BY mean_time DESC
LIMIT 10;"

Resolution:

  1. Scale up processing capacity
  2. Optimize database queries
  3. Check for resource contention
  4. Review document sizes
  5. Consider horizontal scaling
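One refinement when triaging: rank queries by cumulative time (mean_time × calls), since a moderately slow but very hot query can cost more overall than the single slowest one. A sketch over tuples shaped like the pg_stat_statements output above:

```python
def rank_by_total_impact(rows, limit=10):
    """rows: iterable of (query, mean_time_ms, calls) tuples, as returned
    by the pg_stat_statements query above. Returns (query, total_ms)
    pairs, heaviest cumulative consumers first."""
    scored = [(query, mean_time * calls) for query, mean_time, calls in rows]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:limit]
```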

Dead Letter Queue Issues

Symptoms:

  • DLQ items not decreasing
  • Items stuck in review
  • High priority items aging

Diagnosis:

# Check DLQ statistics
curl -X GET "https://ingestion-api.company.com/dlq/stats" | jq

# Check aging items
curl -X GET "https://ingestion-api.company.com/dlq/aging" | jq

# Check review workflows
curl -X GET "https://ingestion-api.company.com/dlq/reviews" | jq

Resolution:

  1. Review pending items
  2. Assign items to reviewers
  3. Process retry candidates
  4. Escalate high-priority items
  5. Clean up resolved items

Database Issues

Connection Pool Exhaustion

Symptoms:

  • "Too many connections" errors
  • Slow database responses
  • Application timeouts

Diagnosis:

-- Check active connections
SELECT count(*) FROM pg_stat_activity WHERE state = 'active';

-- Check connection limits
SHOW max_connections;

-- Check connection usage
SELECT datname, usename, count(*)
FROM pg_stat_activity
GROUP BY datname, usename;

Resolution:

  1. Increase connection pool size
  2. Optimize long-running queries
  3. Add database read replicas
  4. Implement connection pooling
  5. Review connection timeout settings

Lock Contention

Symptoms:

  • Queries hanging
  • Deadlock errors
  • Slow updates

Diagnosis:

-- Check for locks
SELECT
blocked_locks.pid AS blocked_pid,
blocked_activity.usename AS blocked_user,
blocking_locks.pid AS blocking_pid,
blocking_activity.usename AS blocking_user,
blocked_activity.query AS blocked_statement,
blocking_activity.query AS blocking_statement
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks ON blocking_locks.locktype = blocked_locks.locktype
AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database
AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid
AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid
AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid
AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid
AND blocking_locks.pid != blocked_locks.pid
JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted;

Resolution:

  1. Kill blocking queries
  2. Optimize query performance
  3. Review transaction isolation levels
  4. Implement proper indexing
  5. Consider read replicas for reporting

File System Issues

Disk Space Issues

Symptoms:

  • "No space left on device" errors
  • Slow file operations
  • Application failures

Diagnosis:

# Check disk usage
df -h

# Find large files
find /var/log/ingestion -type f -size +100M -exec ls -lh {} \;

# Check for log rotation
ls -la /var/log/ingestion/

Resolution:

  1. Clean up old log files
  2. Archive old documents
  3. Increase disk space
  4. Implement log rotation
  5. Move data to external storage

Permission Issues

Symptoms:

  • "Permission denied" errors
  • File access failures
  • Authentication errors

Diagnosis:

# Check file permissions
ls -la /var/data/ingestion/

# Check directory permissions
ls -ld /var/data/ingestion/

# Check process user
ps aux | grep ingestion

Resolution:

  1. Fix file permissions
  2. Check user/group ownership
  3. Verify process user
  4. Review security policies
  5. Update access controls

Maintenance Procedures

Daily Maintenance

Procedure: Daily system maintenance

Duration: 30 minutes

Steps:

  1. Log Rotation

    # Rotate application logs
    logrotate -f /etc/logrotate.d/ingestion

    # Clean up old logs
    find /var/log/ingestion -name "*.log.*" -mtime +7 -delete
  2. Database Maintenance

    -- Update table statistics
    ANALYZE;

    -- Clean up old data
    DELETE FROM metrics WHERE timestamp < NOW() - INTERVAL '30 days';
    DELETE FROM health_checks WHERE timestamp < NOW() - INTERVAL '7 days';
  3. Cache Cleanup

     # Clear application cache
     curl -X POST "https://ingestion-api.company.com/admin/cache/clear"

     # Clean up temporary files
     find /tmp -name "ingestion_*" -mtime +1 -delete
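The age-based deletions above can also run from a maintenance script. A stdlib sketch equivalent to `find DIR -name PATTERN -mtime +N -delete`:

```python
import time
from pathlib import Path

def remove_older_than(directory, pattern, days):
    """Delete matching files whose modification time is older than
    `days` days; return the names removed."""
    cutoff = time.time() - days * 86_400
    removed = []
    for path in Path(directory).glob(pattern):
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path.name)
    return removed
```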

Weekly Maintenance

Procedure: Weekly system maintenance

Duration: 2 hours

Steps:

  1. Database Optimization

    -- Reindex tables
    REINDEX DATABASE ingestion;

    -- Vacuum tables
    VACUUM ANALYZE;

    -- Update statistics
    ANALYZE;
  2. Dead Letter Queue Cleanup

    # Archive old DLQ items
    curl -X POST "https://ingestion-api.company.com/dlq/cleanup" \
    -H "Content-Type: application/json" \
    -d '{"days_old": 30}'
  3. Document Version Cleanup

     # Archive old versions
     curl -X POST "https://ingestion-api.company.com/versions/cleanup" \
       -H "Content-Type: application/json" \
       -d '{"days_old": 90, "keep_latest": 5}'

  4. Performance Review

     # Generate performance report
     curl -X GET "https://ingestion-api.company.com/reports/performance" > performance_report.json

Monthly Maintenance

Procedure: Monthly system maintenance

Duration: 4 hours

Steps:

  1. Security Updates

     # Update system packages
     apt update && apt upgrade -y

     # Update Python dependencies
     pip install --upgrade -r requirements.txt

     # Run security scan
     bandit -r recoagent/

  2. Backup Verification

     # Verify backup integrity
     sqlite3 backup_test.db "SELECT count(*) FROM documents;"

     # Test restore procedure
     ./scripts/test_restore.sh

  3. Capacity Planning

     # Generate capacity report
     curl -X GET "https://ingestion-api.company.com/reports/capacity" > capacity_report.json

     # Review growth trends
     curl -X GET "https://ingestion-api.company.com/metrics/growth" | jq

  4. Documentation Update

     • Review and update runbooks
     • Update system diagrams
     • Document lessons learned
     • Update contact information

Incident Response

Incident Classification

Severity Levels:

  • P1 (Critical): System down, data loss, security breach
  • P2 (High): Major functionality affected, high error rates
  • P3 (Medium): Minor functionality affected, performance issues
  • P4 (Low): Cosmetic issues, minor bugs

P1 Incident Response

Procedure: Critical incident response

Response Time: 15 minutes

Steps:

  1. Immediate Response

     # Check system status
     curl -X GET "https://ingestion-api.company.com/health" | jq

     # Check recent errors
     tail -100 /var/log/ingestion/error.log

     # Check system resources
     top -p $(pgrep -f ingestion)
     df -h
     free -h

  2. Assess Impact

     • Number of users affected
     • Data at risk
     • Business impact
     • Estimated resolution time

  3. Communicate Status

     • Notify stakeholders
     • Update status page
     • Send incident notifications

  4. Implement Fix

     • Follow the incident-specific runbook
     • Document all actions
     • Test the resolution

  5. Post-Incident

     • Conduct a post-mortem
     • Update runbooks
     • Implement preventive measures

Communication Templates

Incident Notification

Subject: [P1] Document Ingestion System Down

Incident Summary:
- Severity: P1 (Critical)
- System: Document Ingestion Pipeline
- Status: Investigating
- Impact: All document processing affected
- ETA: TBD

Next Update: 30 minutes

Incident Commander: [Name]

Status Update

Subject: [P1] Document Ingestion System - Status Update

Current Status: Investigating
Progress: Identified root cause as database connection pool exhaustion
Next Steps: Scaling up database connections and implementing connection pooling
ETA: 2 hours

Next Update: 1 hour

Incident Commander: [Name]

Resolution Notification

Subject: [P1] Document Ingestion System - RESOLVED

Status: RESOLVED
Resolution: Increased database connection pool size and implemented connection pooling
Duration: 1.5 hours
Root Cause: Database connection pool exhaustion under high load
Prevention: Implemented connection pooling and monitoring

Post-mortem scheduled for: [Date/Time]

Incident Commander: [Name]

Performance Tuning

Database Performance

Query Optimization

Procedure: Optimize slow database queries

Steps:

  1. Identify Slow Queries

    -- Find slow queries
    SELECT query, mean_time, calls, total_time
    FROM pg_stat_statements
    ORDER BY mean_time DESC
    LIMIT 10;
  2. Analyze Query Plans

    -- Analyze query execution plan
    EXPLAIN ANALYZE SELECT * FROM documents WHERE status = 'processing';
  3. Add Indexes

    -- Add missing indexes
    CREATE INDEX CONCURRENTLY idx_documents_status ON documents(status);
    CREATE INDEX CONCURRENTLY idx_documents_created_at ON documents(created_at);

Connection Pooling

Configuration:

# database.yml
pool:
  min_connections: 5
  max_connections: 50
  connection_timeout: 30
  idle_timeout: 600
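To make the semantics of these settings concrete, here is a toy single-threaded pool built on the stdlib (an illustration of min/max sizing and the exhaustion timeout, not the production pooler; `connect` stands in for any DB-API connection factory):

```python
import queue

class ConnectionPool:
    """Sketch of the settings above: pre-open min_connections, grow up to
    max_connections on demand, and make callers wait at most
    connection_timeout seconds when the pool is exhausted.
    Single-threaded illustration; a real pool would lock its counters."""

    def __init__(self, connect, min_connections=5, max_connections=50,
                 connection_timeout=30):
        self._connect = connect
        self._timeout = connection_timeout
        self._max = max_connections
        self._open = 0
        self._idle = queue.Queue(maxsize=max_connections)
        for _ in range(min_connections):
            self._idle.put(connect())
            self._open += 1

    def acquire(self):
        try:
            return self._idle.get_nowait()
        except queue.Empty:
            if self._open < self._max:
                self._open += 1
                return self._connect()
            # Exhausted: block up to connection_timeout, then raise queue.Empty
            return self._idle.get(timeout=self._timeout)

    def release(self, conn):
        self._idle.put(conn)
```

Exhaustion surfacing as a timeout here mirrors the "Too many connections" symptom described under Troubleshooting.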

Application Performance

Memory Optimization

Procedure: Optimize memory usage

Steps:

  1. Profile Memory Usage

     # Monitor memory usage
     ps aux | grep ingestion
     pmap -x $(pgrep -f ingestion)

  2. Optimize Chunk Sizes

     # Adjust chunk sizes based on available memory
     chunk_size = min(1000, available_memory_mb // 10)

  3. Implement Memory Monitoring

     import psutil

     def check_memory_usage():
         memory = psutil.virtual_memory()
         if memory.percent > 80:
             # Trigger cleanup or scaling
             pass

CPU Optimization

Procedure: Optimize CPU usage

Steps:

  1. Monitor CPU Usage

     # Monitor CPU usage
     top -p $(pgrep -f ingestion)
     htop -p $(pgrep -f ingestion)

  2. Optimize Concurrency

     # Adjust concurrency based on CPU cores
     max_concurrent = min(10, cpu_count * 2)

  3. Implement CPU Monitoring

     import psutil

     def check_cpu_usage():
         cpu_percent = psutil.cpu_percent(interval=1)
         if cpu_percent > 80:
             # Scale down or optimize
             pass

Backup and Recovery

Backup Procedures

Daily Backup

Procedure: Daily database backup

Duration: 30 minutes

Steps:

  1. Create Database Backup

     # SQLite backup
     sqlite3 ingestion.db ".backup backup_$(date +%Y%m%d).db"

     # PostgreSQL backup
     pg_dump -h localhost -U pipeline_user ingestion > backup_$(date +%Y%m%d).sql

  2. Verify Backup

     # Verify backup size
     ls -lh backup_$(date +%Y%m%d).db

     # Verify backup integrity
     sqlite3 backup_$(date +%Y%m%d).db "SELECT count(*) FROM documents;"

  3. Upload to Cloud Storage

     # Upload to S3
     aws s3 cp backup_$(date +%Y%m%d).db s3://ingestion-backups/

     # Set retention policy
     aws s3api put-bucket-lifecycle-configuration \
       --bucket ingestion-backups \
       --lifecycle-configuration file://lifecycle.json
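Steps 1 and 2 can be combined in Python with the stdlib sqlite3 online-backup API (a sketch; the `documents` table name comes from the verification command above):

```python
import sqlite3

def backup_and_verify(source_path, backup_path):
    """Copy the live SQLite database with the online backup API, then
    check the copy's integrity and row count before shipping it to S3."""
    src = sqlite3.connect(source_path)
    dst = sqlite3.connect(backup_path)
    with dst:
        src.backup(dst)  # equivalent of the sqlite3 ".backup" command
    integrity = dst.execute("PRAGMA integrity_check;").fetchone()[0]
    rows = dst.execute("SELECT count(*) FROM documents;").fetchone()[0]
    src.close()
    dst.close()
    if integrity != "ok":
        raise RuntimeError(f"backup failed integrity check: {integrity}")
    return rows
```

Running this against a snapshot rather than only `ls -lh` catches a truncated or corrupt copy before the good backups age out.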

Weekly Backup

Procedure: Weekly full system backup

Duration: 2 hours

Steps:

  1. Backup Application Code

     # Create code backup
     tar -czf code_backup_$(date +%Y%m%d).tar.gz /opt/ingestion/

  2. Backup Configuration

     # Backup configuration files
     tar -czf config_backup_$(date +%Y%m%d).tar.gz /etc/ingestion/

  3. Backup Logs

     # Backup application logs
     tar -czf logs_backup_$(date +%Y%m%d).tar.gz /var/log/ingestion/

Recovery Procedures

Database Recovery

Procedure: Restore database from backup

Duration: 1 hour

Steps:

  1. Stop Services

     # Stop ingestion services
     systemctl stop ingestion-api
     systemctl stop ingestion-worker

  2. Restore Database

     # Restore SQLite database
     cp backup_20231201.db ingestion.db

     # Restore PostgreSQL database
     psql -h localhost -U pipeline_user -d ingestion < backup_20231201.sql

  3. Verify Recovery

     # Verify data integrity
     sqlite3 ingestion.db "SELECT count(*) FROM documents;"

     # Check for corruption
     sqlite3 ingestion.db "PRAGMA integrity_check;"

  4. Restart Services

     # Start services
     systemctl start ingestion-api
     systemctl start ingestion-worker

     # Verify service health
     curl -X GET "https://ingestion-api.company.com/health" | jq

Full System Recovery

Procedure: Complete system recovery

Duration: 4 hours

Steps:

  1. Provision New Server

     • Launch new instance
     • Install operating system
     • Configure network

  2. Install Application

     # Install dependencies
     apt update && apt install -y python3 python3-pip postgresql

     # Install application
     pip install -r requirements.txt

  3. Restore Data

     # Restore database
     # Restore configuration
     # Restore logs

  4. Configure Services

     # Configure systemd services
     # Configure monitoring
     # Configure backup

  5. Verify Recovery

     # Run health checks
     # Test functionality
     # Monitor for issues

Security Operations

Security Monitoring

Daily Security Review

Procedure: Daily security monitoring

Duration: 15 minutes

Steps:

  1. Check Security Logs

     # Review authentication logs
     grep "authentication" /var/log/ingestion/security.log | tail -50

     # Check for failed login attempts
     grep "failed" /var/log/ingestion/auth.log | tail -20

  2. Review Access Logs

     # Check API access logs
     tail -100 /var/log/ingestion/access.log | grep -E "(POST|PUT|DELETE)"

     # Check for suspicious activity
     grep -E "(\.\.\/|script|union|select)" /var/log/ingestion/access.log

  3. Verify Security Metrics

     # Check security metrics
     curl -X GET "https://ingestion-api.company.com/metrics/security" | jq
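The grep patterns for suspicious activity can live in one place and be reused by alerting jobs. A sketch matching the same indicators (path traversal, script injection, SQL-injection probes), slightly tightened over the bare `script`/`select` terms to reduce false positives:

```python
import re

# Indicators from the daily security review above
SUSPICIOUS = re.compile(r"(\.\./|<script|union\s+select)", re.IGNORECASE)

def flag_suspicious(log_lines):
    """Return the access-log lines that deserve a closer look."""
    return [line for line in log_lines if SUSPICIOUS.search(line)]
```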

Vulnerability Management

Weekly Vulnerability Scan

Procedure: Weekly security scanning

Duration: 1 hour

Steps:

  1. Dependency Scan

     # Scan Python dependencies
     safety check

     # Scan system packages
     apt list --upgradable

  2. Code Security Scan

     # Run Bandit security linter
     bandit -r recoagent/

     # Run Semgrep
     semgrep --config=auto recoagent/

  3. Infrastructure Scan

     # Run nmap scan
     nmap -sS -O localhost

     # Check open ports
     netstat -tulpn

Incident Response

Security Incident Response

Procedure: Respond to security incidents

Response Time: 5 minutes

Steps:

  1. Immediate Containment

    # Block suspicious IPs
    iptables -A INPUT -s 192.168.1.100 -j DROP

    # Disable compromised accounts
    curl -X POST "https://ingestion-api.company.com/admin/users/disable" \
    -d '{"username": "compromised_user"}'
  2. Evidence Collection

    # Collect system logs
    tar -czf security_evidence_$(date +%Y%m%d_%H%M%S).tar.gz /var/log/

    # Collect network logs
    tcpdump -i eth0 -w security_capture.pcap
  3. Investigation

    # Analyze logs
    grep -E "192.168.1.100" /var/log/ingestion/access.log

    # Check system integrity
    aide --check
  4. Recovery

     # Restore from backup if necessary
     # Patch vulnerabilities
     # Update security controls

Support Contacts

Escalation Matrix

| Issue Type  | Primary          | Secondary       | Tertiary            |
|-------------|------------------|-----------------|---------------------|
| P1 Critical | On-call Engineer | Senior Engineer | Engineering Manager |
| P2 High     | On-call Engineer | Senior Engineer | Engineering Manager |
| P3 Medium   | On-call Engineer | Senior Engineer | -                   |
| P4 Low      | On-call Engineer | -               | -                   |

Contact Information

  • On-call Engineer: +1-555-0123 (24/7)
  • Senior Engineer: +1-555-0124 (Business hours)
  • Engineering Manager: +1-555-0125 (Business hours)
  • Security Team: security@company.com
  • Incident Response: incident@company.com

Documentation Updates

Runbook Maintenance

Procedure: Keep runbooks current

Frequency: Monthly

Steps:

  1. Review Runbooks

    • Check for outdated procedures
    • Verify contact information
    • Update system-specific details
  2. Update Procedures

    • Document new procedures
    • Remove obsolete steps
    • Clarify ambiguous instructions
  3. Version Control

    • Track changes in Git
    • Tag releases
    • Maintain change log
  4. Team Review

    • Get feedback from team members
    • Test procedures in staging
    • Update based on lessons learned

Change Management

Procedure: Manage runbook changes

Steps:

  1. Propose Changes

    • Create pull request
    • Document rationale
    • Get team review
  2. Test Changes

    • Validate in staging environment
    • Update test procedures
    • Document test results
  3. Deploy Changes

    • Merge to main branch
    • Update documentation site
    • Notify team of changes
  4. Monitor Impact

    • Track usage of updated procedures
    • Collect feedback
    • Iterate based on results
