Operational Runbooks
This guide provides comprehensive operational procedures for managing the document ingestion pipeline in production environments.
Table of Contents
- Daily Operations
- Monitoring and Alerting
- Troubleshooting
- Maintenance Procedures
- Incident Response
- Performance Tuning
- Backup and Recovery
- Security Operations
Daily Operations
Morning Health Check
Procedure: Daily system health verification
Duration: 15 minutes
Steps:
- Check System Status
# Check pipeline status
curl -X GET "https://ingestion-api.company.com/health" | jq
# Check database connectivity
curl -X GET "https://ingestion-api.company.com/health/database" | jq
# Check file system access
curl -X GET "https://ingestion-api.company.com/health/filesystem" | jq
- Review Overnight Metrics
# Get processing statistics
curl -X GET "https://ingestion-api.company.com/stats" | jq
# Check error rates
curl -X GET "https://ingestion-api.company.com/metrics/error_rate" | jq
- Review Dead Letter Queue
# Check DLQ statistics
curl -X GET "https://ingestion-api.company.com/dlq/stats" | jq
# List pending items
curl -X GET "https://ingestion-api.company.com/dlq/pending?limit=10" | jq
- Check Active Alerts
# List active alerts
curl -X GET "https://ingestion-api.company.com/alerts/active" | jq
Expected Results:
- All health checks return "healthy" status
- Error rate < 5%
- DLQ items < 50
- No critical alerts active
Escalation: If any check fails, escalate to the on-call engineer.
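The three checks above lend themselves to a single script that prints a pass/fail summary for the morning review. The sketch below is illustrative only and assumes each health endpoint returns a JSON body with a top-level "status" field, as in the examples above.
#!/usr/bin/env bash
# morning_health_check.sh -- illustrative sketch; assumes each endpoint
# returns JSON with a top-level "status" field.
BASE="https://ingestion-api.company.com"
failed=0
for endpoint in health health/database health/filesystem; do
  status=$(curl -sf "$BASE/$endpoint" | jq -r '.status // "unknown"' 2>/dev/null)
  status=${status:-unreachable}
  echo "$endpoint: $status"
  [ "$status" = "healthy" ] || failed=1
done
if [ "$failed" -eq 1 ]; then
  echo "One or more health checks failed -- escalate to the on-call engineer."
  exit 1
fi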
Document Processing Review
Procedure: Review document processing performance
Duration: 10 minutes
Steps:
- Check Processing Volume
# Get processing statistics for last 24 hours
curl -X GET "https://ingestion-api.company.com/stats?period=24h" | jq
- Review Processing Times
# Check processing time percentiles
curl -X GET "https://ingestion-api.company.com/metrics/processing_time" | jq
- Identify Slow Documents
# Get slow processing documents
curl -X GET "https://ingestion-api.company.com/documents/slow?threshold=30000" | jq
Expected Results:
- Processing volume within expected range
- P95 processing time < 30 seconds
- No documents stuck in processing
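If you want a single number to scan for, the snippet below counts documents over the 30-second threshold; it is a sketch and assumes /documents/slow returns a JSON array of document records.
# Sketch: count documents over the 30 s threshold (assumes a JSON array response).
slow=$(curl -sf "https://ingestion-api.company.com/documents/slow?threshold=30000" | jq 'length')
echo "Documents over 30s: ${slow:-unknown}"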
Source System Health Check
Procedure: Verify source system connectivity
Duration: 10 minutes
Steps:
- Check SharePoint Connectivity
# Test SharePoint connection
curl -X POST "https://ingestion-api.company.com/test/sharepoint" \
-H "Content-Type: application/json" \
-d '{"endpoint": "sharepoint.company.com"}'
- Check S3 Connectivity
# Test S3 connection
curl -X POST "https://ingestion-api.company.com/test/s3" \
-H "Content-Type: application/json" \
-d '{"bucket": "documents-bucket"}'
- Check Database Connectivity
# Test database connections
curl -X POST "https://ingestion-api.company.com/test/database" \
-H "Content-Type: application/json" \
-d '{"source": "hr_database"}'
Expected Results:
- All source systems accessible
- Response times < 5 seconds
- No authentication errors
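The individual connectivity tests can be run as one pass with timing, which makes the 5-second response target easy to verify. This is a sketch that reuses the payloads shown above; confirm them against your environment.
#!/usr/bin/env bash
# source_checks.sh -- illustrative sketch reusing the payloads shown above.
BASE="https://ingestion-api.company.com"
check() {  # usage: check <path> <json-payload>
  local t
  if t=$(curl -s -o /dev/null -w '%{time_total}' --max-time 5 \
         -X POST "$BASE/test/$1" -H "Content-Type: application/json" -d "$2"); then
    echo "$1: ${t}s"
  else
    echo "$1: FAILED"
  fi
}
check sharepoint '{"endpoint": "sharepoint.company.com"}'
check s3 '{"bucket": "documents-bucket"}'
check database '{"source": "hr_database"}'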
Monitoring and Alerting
Alert Management
Procedure: Respond to and manage alerts
Severity Levels:
- Critical: System down, data loss, security breach
- High: High error rates, performance degradation
- Medium: Warning conditions, capacity issues
- Low: Informational alerts
Critical Alert Response
Procedure: Respond to critical alerts
Response Time: 15 minutes
Steps:
- Immediate Assessment
# Check system status
curl -X GET "https://ingestion-api.company.com/health" | jq
# Check recent errors
tail -100 /var/log/ingestion/error.log
- Identify Root Cause
# Check system resources
top -p $(pgrep -f ingestion)
df -h
free -h
# Check database status
psql -h localhost -U pipeline_user -d ingestion -c "SELECT * FROM pg_stat_activity;"
- Implement Fix
- Follow specific runbook for the alert type
- Document actions taken
- Monitor for resolution
- Escalation
- If not resolved in 15 minutes, escalate to senior engineer
- If not resolved in 30 minutes, escalate to manager
- If not resolved in 60 minutes, escalate to director
Alert Acknowledgment
Procedure: Acknowledge and track alerts
Steps:
- Acknowledge Alert
# Acknowledge alert
curl -X POST "https://ingestion-api.company.com/alerts/{alert_id}/acknowledge" \
-H "Content-Type: application/json" \
-d '{"acknowledged_by": "engineer_name", "notes": "Investigating issue"}'
- Update Alert Status
# Resolve alert
curl -X POST "https://ingestion-api.company.com/alerts/{alert_id}/resolve" \
-H "Content-Type: application/json" \
-d '{"resolution_notes": "Issue resolved by restarting service"}'
Troubleshooting
Common Issues
High Error Rate
Symptoms:
- Error rate > 10%
- Multiple failed documents
- DLQ items increasing
Diagnosis:
# Check error logs
grep "ERROR" /var/log/ingestion/application.log | tail -50
# Check error categories
curl -X GET "https://ingestion-api.company.com/errors/categories" | jq
# Check recent failures
curl -X GET "https://ingestion-api.company.com/documents/failed?limit=20" | jq
Common Causes:
- Source system connectivity issues
- File format changes
- Authentication failures
- Resource constraints
Resolution:
- Check source system connectivity
- Verify file formats haven't changed
- Check authentication credentials
- Monitor system resources
- Review recent configuration changes
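To speed up triage, the diagnosis steps can be condensed into a quick summary of the dominant error categories and messages. The jq filter below assumes /errors/categories returns an object of category-to-count pairs; adjust it to the actual response shape.
# Sketch: top error categories (assumes an object of category -> count).
curl -sf "https://ingestion-api.company.com/errors/categories" \
  | jq 'to_entries | sort_by(-.value) | .[:5]'
# Most frequent ERROR messages in the last 1000 log lines
grep "ERROR" /var/log/ingestion/application.log | tail -1000 \
  | sed 's/.*ERROR//' | sort | uniq -c | sort -rn | head -10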
Slow Processing
Symptoms:
- Processing time > 30 seconds
- Queue backup
- High CPU/memory usage
Diagnosis:
# Check processing times
curl -X GET "https://ingestion-api.company.com/metrics/processing_time" | jq
# Check system resources
top -p $(pgrep -f ingestion)
iostat -x 1 5
# Check database performance
psql -h localhost -U pipeline_user -d ingestion -c "
SELECT query, mean_time, calls
FROM pg_stat_statements
ORDER BY mean_time DESC
LIMIT 10;"
Resolution:
- Scale up processing capacity
- Optimize database queries
- Check for resource contention
- Review document sizes
- Consider horizontal scaling
Dead Letter Queue Issues
Symptoms:
- DLQ items not decreasing
- Items stuck in review
- High priority items aging
Diagnosis:
# Check DLQ statistics
curl -X GET "https://ingestion-api.company.com/dlq/stats" | jq
# Check aging items
curl -X GET "https://ingestion-api.company.com/dlq/aging" | jq
# Check review workflows
curl -X GET "https://ingestion-api.company.com/dlq/reviews" | jq
Resolution:
- Review pending items
- Assign items to reviewers
- Process retry candidates
- Escalate high-priority items
- Clean up resolved items
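A quick way to start the review is to list the oldest pending items first. The snippet below is a sketch; the id, priority, and created_at fields are assumptions about the /dlq/pending response and should be checked against the API.
# Sketch: oldest pending DLQ items first (field names are assumptions).
curl -sf "https://ingestion-api.company.com/dlq/pending?limit=100" \
  | jq 'sort_by(.created_at) | .[:10] | map({id, priority, created_at})'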
Database Issues
Connection Pool Exhaustion
Symptoms:
- "Too many connections" errors
- Slow database responses
- Application timeouts
Diagnosis:
-- Check active connections
SELECT count(*) FROM pg_stat_activity WHERE state = 'active';
-- Check connection limits
SHOW max_connections;
-- Check connection usage
SELECT datname, usename, count(*)
FROM pg_stat_activity
GROUP BY datname, usename;
Resolution:
- Increase connection pool size
- Optimize long-running queries
- Add database read replicas
- Implement connection pooling
- Review connection timeout settings
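If the pool cannot be resized immediately, long-idle sessions can be reclaimed directly in PostgreSQL. The statement below is a cautious sketch; confirm the sessions are safe to terminate before running it.
# Sketch: terminate sessions idle for more than 10 minutes on the ingestion DB.
psql -h localhost -U pipeline_user -d ingestion -c "
  SELECT pg_terminate_backend(pid)
  FROM pg_stat_activity
  WHERE datname = 'ingestion'
    AND state = 'idle'
    AND state_change < now() - interval '10 minutes';"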
Lock Contention
Symptoms:
- Queries hanging
- Deadlock errors
- Slow updates
Diagnosis:
-- Check for locks
SELECT
blocked_locks.pid AS blocked_pid,
blocked_activity.usename AS blocked_user,
blocking_locks.pid AS blocking_pid,
blocking_activity.usename AS blocking_user,
blocked_activity.query AS blocked_statement,
blocking_activity.query AS blocking_statement
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks ON blocking_locks.locktype = blocked_locks.locktype
AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database
AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid
AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid
AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid
AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid
AND blocking_locks.pid != blocked_locks.pid
JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted;
Resolution:
- Kill blocking queries
- Optimize query performance
- Review transaction isolation levels
- Implement proper indexing
- Consider read replicas for reporting
File System Issues
Disk Space Issues
Symptoms:
- "No space left on device" errors
- Slow file operations
- Application failures
Diagnosis:
# Check disk usage
df -h
# Find large files
find /var/log/ingestion -type f -size +100M -exec ls -lh {} \;
# Check for log rotation
ls -la /var/log/ingestion/
Resolution:
- Clean up old log files
- Archive old documents
- Increase disk space
- Implement log rotation
- Move data to external storage
Permission Issues
Symptoms:
- "Permission denied" errors
- File access failures
- Authentication errors
Diagnosis:
# Check file permissions
ls -la /var/data/ingestion/
# Check directory permissions
ls -ld /var/data/ingestion/
# Check process user
ps aux | grep ingestion
Resolution:
- Fix file permissions
- Check user/group ownership
- Verify process user
- Review security policies
- Update access controls
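A typical fix restores ownership to the service account and tightens directory and file modes. The commands below are a sketch; the "ingestion" user and group are assumptions, so confirm the actual service account and data path before running them.
# Sketch: restore ownership and permissions (the "ingestion" account is assumed).
chown -R ingestion:ingestion /var/data/ingestion
find /var/data/ingestion -type d -exec chmod 750 {} \;
find /var/data/ingestion -type f -exec chmod 640 {} \;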
Maintenance Procedures
Daily Maintenance
Procedure: Daily system maintenance
Duration: 30 minutes
Steps:
- Log Rotation
# Rotate application logs
logrotate -f /etc/logrotate.d/ingestion
# Clean up old logs
find /var/log/ingestion -name "*.log.*" -mtime +7 -delete
- Database Maintenance
-- Update table statistics
ANALYZE;
-- Clean up old data
DELETE FROM metrics WHERE timestamp < NOW() - INTERVAL '30 days';
DELETE FROM health_checks WHERE timestamp < NOW() - INTERVAL '7 days';
- Cache Cleanup
# Clear application cache
curl -X POST "https://ingestion-api.company.com/admin/cache/clear"
# Clean up temporary files
find /tmp -name "ingestion_*" -mtime +1 -delete
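These daily steps are good candidates for automation. A possible crontab layout is sketched below; the times are arbitrary and the paths are taken from the commands above, so adapt both to your environment.
# Sketch: cron schedule for the daily maintenance steps (times are arbitrary).
# m  h  dom mon dow  command
30 2   *   *   *    logrotate -f /etc/logrotate.d/ingestion
45 2   *   *   *    find /var/log/ingestion -name "*.log.*" -mtime +7 -delete
0  3   *   *   *    find /tmp -name "ingestion_*" -mtime +1 -delete
15 3   *   *   *    curl -s -X POST "https://ingestion-api.company.com/admin/cache/clear"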
Weekly Maintenance
Procedure: Weekly system maintenance
Duration: 2 hours
Steps:
- Database Optimization
-- Reindex tables
REINDEX DATABASE ingestion;
-- Vacuum tables
VACUUM ANALYZE;
-- Update statistics
ANALYZE;
- Dead Letter Queue Cleanup
# Archive old DLQ items
curl -X POST "https://ingestion-api.company.com/dlq/cleanup" \
-H "Content-Type: application/json" \
-d '{"days_old": 30}' -
Document Version Cleanup
# Archive old versions
curl -X POST "https://ingestion-api.company.com/versions/cleanup" \
-H "Content-Type: application/json" \
-d '{"days_old": 90, "keep_latest": 5}'
- Performance Review
# Generate performance report
curl -X GET "https://ingestion-api.company.com/reports/performance" > performance_report.json
Monthly Maintenance
Procedure: Monthly system maintenance
Duration: 4 hours
Steps:
- Security Updates
# Update system packages
apt update && apt upgrade -y
# Update Python dependencies
pip install --upgrade -r requirements.txt
# Run security scan
bandit -r recoagent/
- Backup Verification
# Verify backup integrity
sqlite3 backup_test.db "SELECT count(*) FROM documents;"
# Test restore procedure
./scripts/test_restore.sh
- Capacity Planning
# Generate capacity report
curl -X GET "https://ingestion-api.company.com/reports/capacity" > capacity_report.json
# Review growth trends
curl -X GET "https://ingestion-api.company.com/metrics/growth" | jq
- Documentation Update
- Review and update runbooks
- Update system diagrams
- Document lessons learned
- Update contact information
Incident Response
Incident Classification
Severity Levels:
- P1 (Critical): System down, data loss, security breach
- P2 (High): Major functionality affected, high error rates
- P3 (Medium): Minor functionality affected, performance issues
- P4 (Low): Cosmetic issues, minor bugs
P1 Incident Response
Procedure: Critical incident response
Response Time: 15 minutes
Steps:
- Immediate Response
# Check system status
curl -X GET "https://ingestion-api.company.com/health" | jq
# Check recent errors
tail -100 /var/log/ingestion/error.log
# Check system resources
top -p $(pgrep -f ingestion)
df -h
free -h
- Assess Impact
- Number of users affected
- Data at risk
- Business impact
- Estimated resolution time
- Communicate Status
- Notify stakeholders
- Update status page
- Send incident notifications
- Implement Fix
- Follow incident-specific runbook
- Document all actions
- Test resolution
- Post-Incident
- Conduct post-mortem
- Update runbooks
- Implement preventive measures
Communication Templates
Incident Notification
Subject: [P1] Document Ingestion System Down
Incident Summary:
- Severity: P1 (Critical)
- System: Document Ingestion Pipeline
- Status: Investigating
- Impact: All document processing affected
- ETA: TBD
Next Update: 30 minutes
Incident Commander: [Name]
Status Update
Subject: [P1] Document Ingestion System - Status Update
Current Status: Investigating
Progress: Identified root cause as database connection pool exhaustion
Next Steps: Scaling up database connections and implementing connection pooling
ETA: 2 hours
Next Update: 1 hour
Incident Commander: [Name]
Resolution Notification
Subject: [P1] Document Ingestion System - RESOLVED
Status: RESOLVED
Resolution: Increased database connection pool size and implemented connection pooling
Duration: 1.5 hours
Root Cause: Database connection pool exhaustion under high load
Prevention: Implemented connection pooling and monitoring
Post-mortem scheduled for: [Date/Time]
Incident Commander: [Name]
Performance Tuning
Database Performance
Query Optimization
Procedure: Optimize slow database queries
Steps:
- Identify Slow Queries
-- Find slow queries
SELECT query, mean_time, calls, total_time
FROM pg_stat_statements
ORDER BY mean_time DESC
LIMIT 10;
- Analyze Query Plans
-- Analyze query execution plan
EXPLAIN ANALYZE SELECT * FROM documents WHERE status = 'processing';
- Add Indexes
-- Add missing indexes
CREATE INDEX CONCURRENTLY idx_documents_status ON documents(status);
CREATE INDEX CONCURRENTLY idx_documents_created_at ON documents(created_at);
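Before adding indexes, it can help to confirm which tables are being scanned sequentially most often. The query below is a sketch against PostgreSQL's standard statistics views.
# Sketch: tables with the heaviest sequential scan activity (index candidates).
psql -h localhost -U pipeline_user -d ingestion -c "
  SELECT relname, seq_scan, idx_scan, n_live_tup
  FROM pg_stat_user_tables
  ORDER BY seq_scan DESC
  LIMIT 10;"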
Connection Pooling
Configuration:
# database.yml
pool:
  min_connections: 5
  max_connections: 50
  connection_timeout: 30
  idle_timeout: 600
Application Performance
Memory Optimization
Procedure: Optimize memory usage
Steps:
- Profile Memory Usage
# Monitor memory usage
ps aux | grep ingestion
pmap -x $(pgrep -f ingestion)
- Optimize Chunk Sizes
# Adjust chunk size based on available memory (assumes psutil, imported in the next snippet)
available_memory_mb = psutil.virtual_memory().available // (1024 * 1024)
chunk_size = min(1000, available_memory_mb // 10)
- Implement Memory Monitoring
import psutil
def check_memory_usage():
    memory = psutil.virtual_memory()
    if memory.percent > 80:
        # Trigger cleanup or scaling
        pass
CPU Optimization
Procedure: Optimize CPU usage
Steps:
- Monitor CPU Usage
# Monitor CPU usage
top -p $(pgrep -f ingestion)
htop -p $(pgrep -f ingestion)
- Optimize Concurrency
import os
# Adjust concurrency based on CPU cores
max_concurrent = min(10, os.cpu_count() * 2)
- Implement CPU Monitoring
import psutil
def check_cpu_usage():
    cpu_percent = psutil.cpu_percent(interval=1)
    if cpu_percent > 80:
        # Scale down or optimize
        pass
Backup and Recovery
Backup Procedures
Daily Backup
Procedure: Daily database backup
Duration: 30 minutes
Steps:
- Create Database Backup
# SQLite backup
sqlite3 ingestion.db ".backup backup_$(date +%Y%m%d).db"
# PostgreSQL backup
pg_dump -h localhost -U pipeline_user ingestion > backup_$(date +%Y%m%d).sql
- Verify Backup
# Verify backup size
ls -lh backup_$(date +%Y%m%d).db
# Verify backup integrity
sqlite3 backup_$(date +%Y%m%d).db "SELECT count(*) FROM documents;"
- Upload to Cloud Storage
# Upload to S3
aws s3 cp backup_$(date +%Y%m%d).db s3://ingestion-backups/
# Set retention policy
aws s3api put-bucket-lifecycle-configuration \
--bucket ingestion-backups \
--lifecycle-configuration file://lifecycle.json
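The backup, verification, and upload steps can be combined into one script. The sketch below assumes the PostgreSQL instance and the ingestion-backups bucket shown above; adjust the retention window to your policy.
#!/usr/bin/env bash
# daily_backup.sh -- illustrative sketch based on the commands above.
set -euo pipefail
stamp=$(date +%Y%m%d)
pg_dump -h localhost -U pipeline_user ingestion > "backup_${stamp}.sql"
gzip -f "backup_${stamp}.sql"
aws s3 cp "backup_${stamp}.sql.gz" "s3://ingestion-backups/"
# Keep local copies for 7 days only
find . -maxdepth 1 -name "backup_*.sql.gz" -mtime +7 -delete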
Weekly Backup
Procedure: Weekly full system backup
Duration: 2 hours
Steps:
- Backup Application Code
# Create code backup
tar -czf code_backup_$(date +%Y%m%d).tar.gz /opt/ingestion/
- Backup Configuration
# Backup configuration files
tar -czf config_backup_$(date +%Y%m%d).tar.gz /etc/ingestion/
- Backup Logs
# Backup application logs
tar -czf logs_backup_$(date +%Y%m%d).tar.gz /var/log/ingestion/
Recovery Procedures
Database Recovery
Procedure: Restore database from backup
Duration: 1 hour
Steps:
- Stop Services
# Stop ingestion services
systemctl stop ingestion-api
systemctl stop ingestion-worker
- Restore Database
# Restore SQLite database
cp backup_20231201.db ingestion.db
# Restore PostgreSQL database
psql -h localhost -U pipeline_user -d ingestion < backup_20231201.sql
- Verify Recovery
# Verify data integrity
sqlite3 ingestion.db "SELECT count(*) FROM documents;"
# Check for corruption
sqlite3 ingestion.db "PRAGMA integrity_check;"
- Restart Services
# Start services
systemctl start ingestion-api
systemctl start ingestion-worker
# Verify service health
curl -X GET "https://ingestion-api.company.com/health" | jq
Full System Recovery
Procedure: Complete system recovery
Duration: 4 hours
Steps:
- Provision New Server
- Launch new instance
- Install operating system
- Configure network
- Install Application
# Install dependencies
apt update && apt install -y python3 python3-pip postgresql
# Install application
pip install -r requirements.txt
- Restore Data
# Restore database
# Restore configuration
# Restore logs
- Configure Services
# Configure systemd services
# Configure monitoring
# Configure backup
- Verify Recovery
# Run health checks
# Test functionality
# Monitor for issues
Security Operations
Security Monitoring
Daily Security Review
Procedure: Daily security monitoring
Duration: 15 minutes
Steps:
- Check Security Logs
# Review authentication logs
grep "authentication" /var/log/ingestion/security.log | tail -50
# Check for failed login attempts
grep "failed" /var/log/ingestion/auth.log | tail -20
- Review Access Logs
# Check API access logs
tail -100 /var/log/ingestion/access.log | grep -E "(POST|PUT|DELETE)"
# Check for suspicious activity
grep -E "(\.\.\/|script|union|select)" /var/log/ingestion/access.log
- Verify Security Metrics
# Check security metrics
curl -X GET "https://ingestion-api.company.com/metrics/security" | jq
Vulnerability Management
Weekly Vulnerability Scan
Procedure: Weekly security scanning
Duration: 1 hour
Steps:
- Dependency Scan
# Scan Python dependencies
safety check
# Scan system packages
apt list --upgradable
- Code Security Scan
# Run Bandit security linter
bandit -r recoagent/
# Run Semgrep
semgrep --config=auto recoagent/
- Infrastructure Scan
# Run nmap scan
nmap -sS -O localhost
# Check open ports
netstat -tulpn
Incident Response
Security Incident Response
Procedure: Respond to security incidents
Response Time: 5 minutes
Steps:
- Immediate Containment
# Block suspicious IPs
iptables -A INPUT -s 192.168.1.100 -j DROP
# Disable compromised accounts
curl -X POST "https://ingestion-api.company.com/admin/users/disable" \
-d '{"username": "compromised_user"}' -
Evidence Collection
# Collect system logs
tar -czf security_evidence_$(date +%Y%m%d_%H%M%S).tar.gz /var/log/
# Collect network logs
tcpdump -i eth0 -w security_capture.pcap
- Investigation
# Analyze logs
grep -E "192.168.1.100" /var/log/ingestion/access.log
# Check system integrity
aide --check
- Recovery
# Restore from backup if necessary
# Patch vulnerabilities
# Update security controls
Support Contacts
Escalation Matrix
| Issue Type | Primary | Secondary | Tertiary |
|---|---|---|---|
| P1 Critical | On-call Engineer | Senior Engineer | Engineering Manager |
| P2 High | On-call Engineer | Senior Engineer | Engineering Manager |
| P3 Medium | On-call Engineer | Senior Engineer | - |
| P4 Low | On-call Engineer | - | - |
Contact Information
- On-call Engineer: +1-555-0123 (24/7)
- Senior Engineer: +1-555-0124 (Business hours)
- Engineering Manager: +1-555-0125 (Business hours)
- Security Team: security@company.com
- Incident Response: incident@company.com
External Support
- Database Support: db-support@company.com
- Infrastructure Support: infra-support@company.com
- Security Vendor: security-vendor@company.com
Documentation Updates
Runbook Maintenance
Procedure: Keep runbooks current
Frequency: Monthly
Steps:
- Review Runbooks
- Check for outdated procedures
- Verify contact information
- Update system-specific details
- Update Procedures
- Document new procedures
- Remove obsolete steps
- Clarify ambiguous instructions
- Version Control
- Track changes in Git
- Tag releases
- Maintain change log
- Team Review
- Get feedback from team members
- Test procedures in staging
- Update based on lessons learned
Change Management
Procedure: Manage runbook changes
Steps:
- Propose Changes
- Create pull request
- Document rationale
- Get team review
- Test Changes
- Validate in staging environment
- Update test procedures
- Document test results
- Deploy Changes
- Merge to main branch
- Update documentation site
- Notify team of changes
- Monitor Impact
- Track usage of updated procedures
- Collect feedback
- Iterate based on results
Next Steps
- Deployment Guide - Production deployment
- Security Guide - Security best practices
- Performance Guide - High-volume optimization