Query Expansion System - Comprehensive Guide
Table of Contents
- Overview
- Architecture
- Expansion Strategies
- Synonym Management
- API Reference
- Configuration
- Analytics and Monitoring
- Best Practices
- Troubleshooting
- Optimization Guide
Overview
The Query Expansion System addresses vocabulary mismatch between user queries and document content in enterprise search. It expands queries using domain-specific synonym management and adaptive learning from user feedback.
Key Features
- Domain-Specific Synonym Management: Curated synonym dictionaries for different domains
- Dynamic Query Expansion: Multiple expansion strategies using embeddings and semantic similarity
- User Feedback Integration: Adaptive learning from user interactions
- Acronym and Abbreviation Expansion: Automatic resolution of technical abbreviations
- Contextual Synonym Selection: Role and department-based expansion
- Confidence Scoring: Relevance weighting for expansion quality
- Conflict Resolution: Handling of ambiguous terms and conflicting synonyms
- Analytics Dashboard: Comprehensive monitoring and effectiveness tracking
- Management Interface: Admin tools for synonym curation and maintenance
Problem Statement
Enterprise users often search using their own terminology, abbreviations, and domain-specific language that differs from how documents are written. This vocabulary mismatch significantly reduces search recall, causing relevant content to be missed.
Solution Approach
The system addresses this through:
- Intelligent Expansion: Multiple expansion strategies working together
- Domain Awareness: Context-aware expansion based on user domain
- Continuous Learning: Adaptive improvement through user feedback
- Quality Assurance: Confidence scoring and conflict resolution
Architecture
System Components
Core Modules
- Query Expansion Engine (query_expansion.py)
  - Main expansion orchestration
  - Strategy selection and execution
  - Result ranking and filtering
- Synonym Database (query_expansion.py)
  - SQLite-based storage
  - CRUD operations
  - Audit logging
- Expansion Strategies
  - Synonym-based expansion
  - Acronym resolution
  - Semantic similarity
  - Contextual expansion
- Analytics System (synonym_analytics.py)
  - Performance metrics
  - Usage analytics
  - Quality monitoring
- Management Interface (synonym_management.py)
  - Admin tools
  - Bulk operations
  - Conflict resolution
- API Layer (query_expansion_api.py)
  - REST endpoints
  - Authentication
  - Request/response handling
Expansion Strategies
1. Synonym Expansion Strategy
Purpose: Replace terms with their synonyms to improve recall.
How it works:
- Tokenizes the input query
- Looks up synonyms for each meaningful token
- Filters synonyms by confidence and relevance
- Creates expanded queries with synonym alternatives
Example:
Input: "API documentation"
Expansion: "API documentation OR Application Programming Interface documentation"
Configuration:
strategy = SynonymExpansionStrategy(synonym_db)
# Confidence threshold: 0.5
# Relevance threshold: 0.6
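The four steps listed under "How it works" can be sketched as follows. This is a minimal illustration and not the system's implementation: the in-memory `SYNONYMS` table, the `STOP_WORDS` set, and the `expand_with_synonyms` helper are hypothetical names; only the 0.5 and 0.6 thresholds come from the configuration above.

```python
import re

# Hypothetical in-memory stand-in for the synonym database:
# lowercase term -> list of (synonym, confidence, relevance_score)
SYNONYMS = {
    "api": [
        ("Application Programming Interface", 0.9, 0.8),
        ("endpoint", 0.4, 0.7),  # below the confidence threshold
    ],
}
STOP_WORDS = {"the", "a", "an", "of", "for", "and", "or"}


def expand_with_synonyms(query, confidence_threshold=0.5, relevance_threshold=0.6):
    """Return the query plus one OR-alternative per qualifying synonym."""
    tokens = re.findall(r"\w+", query)  # 1. tokenize
    alternatives = [query]
    for token in tokens:
        if token.lower() in STOP_WORDS:  # 2. only meaningful tokens are looked up
            continue
        for synonym, confidence, relevance in SYNONYMS.get(token.lower(), []):
            # 3. filter by confidence and relevance
            if confidence < confidence_threshold or relevance < relevance_threshold:
                continue
            # 4. substitute the synonym for the whole-word token
            pattern = rf"\b{re.escape(token)}\b"
            alternatives.append(re.sub(pattern, lambda _m: synonym, query))
    return " OR ".join(alternatives)


print(expand_with_synonyms("API documentation"))
```

Joining the alternatives with OR mirrors the example expansion shown earlier; a real engine might instead return each alternative separately with its own score.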
2. Acronym Expansion Strategy
Purpose: Resolve acronyms and abbreviations to their full forms.
How it works:
- Identifies potential acronyms using regex patterns
- Looks up expansions in the acronym database
- Creates expansions with both acronym and full form
Example:
Input: "REST API documentation"
Expansion: "REST API documentation OR Representational State Transfer API documentation"
Supported Patterns:
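- \b[A-Z]{2,6}\b: Standard acronyms (API, SQL, HTTP)
- \b[A-Z]\.?[A-Z]\.?[A-Z]?\.?\b: Abbreviated forms (U.S.A., U.K.); mixed-case forms such as Ph.D. are not matched because the pattern accepts uppercase letters only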
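As a rough sketch of the detection and lookup steps, the two patterns above can be applied with Python's `re` module. The `ACRONYMS` dictionary and both helper functions are hypothetical stand-ins for the acronym database and the strategy class, which are not shown in this guide.

```python
import re

# Patterns copied from the list above.
ACRONYM_PATTERNS = [
    re.compile(r"\b[A-Z]{2,6}\b"),
    re.compile(r"\b[A-Z]\.?[A-Z]\.?[A-Z]?\.?\b"),
]

# Hypothetical stand-in for the acronym database.
ACRONYMS = {"REST": "Representational State Transfer"}


def find_acronyms(query):
    """Return candidate acronyms in order of discovery, without duplicates."""
    found = []
    for pattern in ACRONYM_PATTERNS:
        for match in pattern.finditer(query):
            if match.group() not in found:
                found.append(match.group())
    return found


def expand_acronyms(query):
    """Keep the original query and add one alternative per known acronym."""
    alternatives = [query]
    for acronym in find_acronyms(query):
        full_form = ACRONYMS.get(acronym)
        if full_form is None:
            continue  # candidate matched a pattern but has no known expansion
        pattern = rf"\b{re.escape(acronym)}\b"
        alternatives.append(re.sub(pattern, lambda _m: full_form, query, count=1))
    return " OR ".join(alternatives)


print(expand_acronyms("REST API documentation"))
```

Candidates that match a pattern but are missing from the lookup table (here, API) are left untouched, so the original query is always preserved.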
3. Semantic Expansion Strategy
Purpose: Find semantically similar terms using embeddings.
How it works:
- Generates embeddings for the query
- Finds semantically similar terms in the knowledge base
- Creates expansions with related concepts
Example:
Input: "machine learning algorithms"
Expansion: "machine learning algorithms OR artificial intelligence algorithms OR ML algorithms"
Configuration:
- Similarity threshold: 0.7
- Maximum similar terms: 3
4. Contextual Expansion Strategy
Purpose: Expand based on user role and department context.
How it works:
- Analyzes user role and department
- Identifies role-specific terminology
- Adds contextual terms to the query
Example:
Input: "code documentation" (Developer role)
Expansion: "code documentation OR programming documentation OR software documentation"
Role Mappings:
- Developer: code, programming, software, development
- Manager: strategy, planning, team, leadership
- Analyst: data, analysis, metrics, reporting
- Designer: UI, UX, interface, user experience
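One plausible reading of "adds contextual terms" is to swap a role term found in the query for its sibling terms from the same role mapping. The sketch below assumes that reading; `ROLE_TERMS`, `expand_for_role`, and the `max_terms` cap are illustrative names, not part of the documented API.

```python
# Role mappings copied from the list above.
ROLE_TERMS = {
    "developer": ["code", "programming", "software", "development"],
    "manager": ["strategy", "planning", "team", "leadership"],
    "analyst": ["data", "analysis", "metrics", "reporting"],
    "designer": ["UI", "UX", "interface", "user experience"],
}


def expand_for_role(query, user_role, max_terms=2):
    """Replace each role term in the query with up to max_terms sibling terms."""
    role_terms = ROLE_TERMS.get(user_role.lower(), [])
    lowered = [term.lower() for term in role_terms]
    tokens = query.split()
    alternatives = [query]
    for index, token in enumerate(tokens):
        if token.lower() not in lowered:
            continue
        siblings = [t for t in role_terms if t.lower() != token.lower()][:max_terms]
        for sibling in siblings:
            rewritten = tokens[:index] + [sibling] + tokens[index + 1:]
            alternatives.append(" ".join(rewritten))
    return " OR ".join(alternatives)


print(expand_for_role("code documentation", "Developer"))
```

Unknown roles fall through with the query unchanged, which keeps the strategy safe to enable for every user.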
Synonym Management
Database Schema
The synonym database uses SQLite with the following key tables:
Synonyms Table
CREATE TABLE synonyms (
id INTEGER PRIMARY KEY AUTOINCREMENT,
term TEXT NOT NULL,
synonym TEXT NOT NULL,
domain TEXT NOT NULL,
confidence REAL NOT NULL,
relevance_score REAL NOT NULL,
usage_count INTEGER DEFAULT 0,
success_rate REAL DEFAULT 0.0,
source TEXT NOT NULL,
context TEXT,
user_role TEXT,
department TEXT,
metadata TEXT,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
UNIQUE(term, synonym, domain)
);
Expansion History Table
CREATE TABLE expansion_history (
id INTEGER PRIMARY KEY AUTOINCREMENT,
user_id TEXT NOT NULL,
original_query TEXT NOT NULL,
expanded_query TEXT NOT NULL,
expansion_type TEXT NOT NULL,
synonyms_used TEXT,
confidence_score REAL,
relevance_score REAL,
was_successful BOOLEAN,
user_feedback TEXT,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
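To show how the synonyms table might be queried, the sketch below builds an in-memory SQLite database from a trimmed copy of the schema and filters by the default thresholds. The helper names, the sample rows, and the `source` values are assumptions made for illustration.

```python
import sqlite3

# Trimmed copy of the synonyms schema above (optional columns omitted).
SCHEMA = """
CREATE TABLE synonyms (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    term TEXT NOT NULL,
    synonym TEXT NOT NULL,
    domain TEXT NOT NULL,
    confidence REAL NOT NULL,
    relevance_score REAL NOT NULL,
    usage_count INTEGER DEFAULT 0,
    success_rate REAL DEFAULT 0.0,
    source TEXT NOT NULL,
    UNIQUE(term, synonym, domain)
);
"""

INSERT_SQL = (
    "INSERT INTO synonyms (term, synonym, domain, confidence, relevance_score, source) "
    "VALUES (?, ?, ?, ?, ?, ?)"
)


def build_demo_db():
    """Create an in-memory database with a few hypothetical rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(INSERT_SQL, [
        ("API", "Application Programming Interface", "technical", 0.9, 0.8, "manual"),
        ("API", "endpoint", "technical", 0.4, 0.7, "ml"),
        ("API", "Active Pharmaceutical Ingredient", "medical", 0.9, 0.9, "manual"),
    ])
    conn.commit()
    return conn


def lookup_synonyms(conn, term, domain, min_confidence=0.5, min_relevance=0.6):
    """Return (synonym, confidence) pairs above both thresholds, best first."""
    return conn.execute(
        """
        SELECT synonym, confidence FROM synonyms
        WHERE term = ? AND domain = ?
          AND confidence >= ? AND relevance_score >= ?
        ORDER BY confidence DESC
        """,
        (term, domain, min_confidence, min_relevance),
    ).fetchall()


conn = build_demo_db()
print(lookup_synonyms(conn, "API", "technical"))
```

Because of the UNIQUE(term, synonym, domain) constraint, inserting the same pair twice in one domain raises `sqlite3.IntegrityError`, while the same term can carry different synonyms in different domains.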
Adding Synonyms
Programmatic Addition
from packages.rag.query_expansion import QueryExpansionSystem, SynonymSource
system = QueryExpansionSystem()
await system.add_synonym(
term="API",
synonym="Application Programming Interface",
domain="technical",
source=SynonymSource.MANUAL_CURATION,
confidence=0.9,
context="software development",
user_role="developer",
department="engineering"
)
Bulk Import
from packages.rag.synonym_management import SynonymManagementSystem
management = SynonymManagementSystem()
result = await management.bulk_import_synonyms(
file_path="synonyms.csv",
import_format="csv",
created_by="admin_user",
domain="technical"
)
CSV Format
term,synonym,domain,confidence,context,user_role,department,tags,notes
API,Application Programming Interface,technical,0.9,software development,developer,engineering,"programming,interface","Common programming interface"
SQL,Structured Query Language,technical,0.95,database queries,developer,engineering,"database,query","Database query language"
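Before importing a file it can help to check that it parses as expected. The sketch below reads the sample rows with the standard `csv` module; `parse_synonym_csv` is a hypothetical helper and is not how `bulk_import_synonyms` is known to work internally.

```python
import csv
import io

# The sample rows from the CSV format above.
SAMPLE_CSV = """\
term,synonym,domain,confidence,context,user_role,department,tags,notes
API,Application Programming Interface,technical,0.9,software development,developer,engineering,"programming,interface","Common programming interface"
SQL,Structured Query Language,technical,0.95,database queries,developer,engineering,"database,query","Database query language"
"""


def parse_synonym_csv(text):
    """Parse CSV text into dicts, converting confidence and splitting tags."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        confidence = float(record["confidence"])
        if not 0.0 <= confidence <= 1.0:
            raise ValueError(f"confidence out of range for {record['term']!r}")
        record["confidence"] = confidence
        # The quoted tags column holds a comma-separated list.
        record["tags"] = [tag.strip() for tag in record["tags"].split(",") if tag.strip()]
        rows.append(record)
    return rows


for row in parse_synonym_csv(SAMPLE_CSV):
    print(row["term"], row["confidence"], row["tags"])
```

The quoting around the tags and notes columns matters: without it, the embedded commas would be read as column separators.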
Quality Management
Confidence Scoring
- High (0.8-1.0): Manually curated, expert-verified
- Medium (0.5-0.8): User-suggested, community-validated
- Low (0.0-0.5): Auto-generated, needs review
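The tiers as listed overlap at 0.5 and 0.8. The sketch below assigns boundary values to the higher tier, which is an assumption rather than documented behavior; `confidence_tier` is a hypothetical helper.

```python
def confidence_tier(confidence):
    """Map a confidence score in [0, 1] to a tier name."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0.0 and 1.0")
    if confidence >= 0.8:
        return "high"    # manually curated, expert-verified
    if confidence >= 0.5:
        return "medium"  # user-suggested, community-validated
    return "low"         # auto-generated, needs review


print(confidence_tier(0.9), confidence_tier(0.6), confidence_tier(0.2))
```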
Relevance Weighting
- Based on usage frequency and success rate
- Updated dynamically through user feedback
- Influences expansion ranking
Conflict Resolution
- Exact Match Conflicts: Duplicate entries
- Semantic Conflicts: Contradictory meanings
- Domain Conflicts: Different meanings across domains
Resolution strategies:
- Merge entries with higher confidence
- Add disambiguation context
- Create domain-specific versions
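Exact-match and domain conflicts can be detected mechanically; semantic conflicts generally need human review or an embedding comparison and are not covered here. The following is a minimal sketch over (term, synonym, domain) tuples, with `find_conflicts` as a hypothetical helper name.

```python
from collections import defaultdict


def find_conflicts(entries):
    """Detect duplicate entries and terms whose synonyms differ across domains."""
    seen = set()
    exact = []
    by_term = defaultdict(lambda: defaultdict(set))
    for term, synonym, domain in entries:
        key = (term.lower(), synonym.lower(), domain.lower())
        if key in seen:
            exact.append((term, synonym, domain))  # exact-match conflict
        seen.add(key)
        by_term[term.lower()][domain.lower()].add(synonym.lower())
    # A domain conflict: the same term maps to different synonym sets per domain.
    domain_conflicts = sorted(
        term for term, by_domain in by_term.items()
        if len({frozenset(synonyms) for synonyms in by_domain.values()}) > 1
    )
    return {"exact": exact, "domain": domain_conflicts}


entries = [
    ("API", "Application Programming Interface", "technical"),
    ("API", "Application Programming Interface", "technical"),
    ("API", "Active Pharmaceutical Ingredient", "medical"),
    ("SQL", "Structured Query Language", "technical"),
]
print(find_conflicts(entries))
```

A flagged domain conflict is not necessarily an error: it is the case where the "create domain-specific versions" resolution already applies, so the report mainly tells a curator where disambiguation context is worth adding.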
API Reference
Authentication
All API endpoints require authentication. Include the user ID in the request context.
Core Endpoints
Expand Query
POST /query-expansion/expand
Content-Type: application/json
{
  "query": "API documentation",
  "user_id": "user123",
  "user_role": "developer",
  "department": "engineering",
  "domain": "technical",
  "session_id": "session123",
  "enabled_strategies": ["synonym_expansion", "acronym_expansion"],
  "expansion_settings": {
    "max_expansions": 3,
    "confidence_threshold": 0.7
  }
}
Response:
{
  "original_query": "API documentation",
  "expanded_queries": [
    {
      "expanded_query": "API documentation OR Application Programming Interface documentation",
      "expansion_type": "synonym_expansion",
      "confidence_score": 0.9,
      "relevance_score": 0.8,
      "synonyms_used": [
        {
          "term": "API",
          "synonym": "Application Programming Interface",
          "confidence": 0.9
        }
      ],
      "metadata": {
        "expanded_token": "API",
        "synonym_count": 1
      }
    }
  ],
  "expansion_metadata": {
    "total_expansions": 1,
    "strategies_used": ["synonym_expansion"],
    "avg_confidence": 0.9
  },
  "success": true
}
Add Synonym
POST /query-expansion/synonyms
Content-Type: application/json
{
"term": "ML",
"synonym": "Machine Learning",
"domain": "technical",
"confidence": 0.9,
"context": "artificial intelligence",
"user_role": "developer",
"department": "engineering",
"tags": ["ai", "learning"],
"notes": "Machine learning abbreviation"
}
Submit Feedback
POST /query-expansion/feedback
Content-Type: application/json
{
"expansion_id": "exp_123",
"was_helpful": true,
"rating": 5,
"comment": "Very helpful expansion"
}
Analytics Endpoints
Get Expansion Analytics
GET /query-expansion/analytics/expansion?start_date=2024-01-01&end_date=2024-01-31&domain=technical
Get Synonym Analytics
GET /query-expansion/analytics/synonyms?domain=technical
Get Comprehensive Report
GET /query-expansion/analytics/report?start_date=2024-01-01&end_date=2024-01-31
Admin Endpoints
Get Dashboard Data
GET /query-expansion/admin/dashboard
Bulk Import
POST /query-expansion/admin/bulk-import
Content-Type: application/json
{
"file_path": "/path/to/synonyms.csv",
"import_format": "csv",
"domain": "technical"
}
Export Synonyms
POST /query-expansion/admin/export
Content-Type: application/json
{
"output_path": "/path/to/export.csv",
"export_format": "csv",
"domain": "technical",
"status": "active"
}
Configuration
Environment Variables
# Database Configuration
SYNONYM_DB_PATH=synonyms.db
# Expansion Settings
DEFAULT_CONFIDENCE_THRESHOLD=0.5
DEFAULT_RELEVANCE_THRESHOLD=0.6
MAX_EXPANSIONS_PER_QUERY=5
SIMILARITY_THRESHOLD=0.7
# Analytics Settings
ANALYTICS_CACHE_TTL=300
ENABLE_ANALYTICS=true
ANALYTICS_RETENTION_DAYS=90
# Management Settings
ENABLE_ADMIN_INTERFACE=true
REQUIRE_APPROVAL=true
AUTO_APPROVE_CONFIDENCE_THRESHOLD=0.9
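A small settings object is one way to read these variables with typed defaults. The sketch below covers a subset of them; the `ExpansionSettings` class and its field names are hypothetical, while the variable names and default values are taken from the listing above.

```python
import os
from dataclasses import dataclass


def _as_bool(value):
    """Interpret common truthy strings; anything else is False."""
    return str(value).strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class ExpansionSettings:
    db_path: str = "synonyms.db"
    confidence_threshold: float = 0.5
    relevance_threshold: float = 0.6
    max_expansions: int = 5
    similarity_threshold: float = 0.7
    analytics_enabled: bool = True

    @classmethod
    def from_env(cls, environ=None):
        """Build settings from a mapping (defaults to os.environ)."""
        env = os.environ if environ is None else environ
        return cls(
            db_path=env.get("SYNONYM_DB_PATH", cls.db_path),
            confidence_threshold=float(env.get("DEFAULT_CONFIDENCE_THRESHOLD", cls.confidence_threshold)),
            relevance_threshold=float(env.get("DEFAULT_RELEVANCE_THRESHOLD", cls.relevance_threshold)),
            max_expansions=int(env.get("MAX_EXPANSIONS_PER_QUERY", cls.max_expansions)),
            similarity_threshold=float(env.get("SIMILARITY_THRESHOLD", cls.similarity_threshold)),
            analytics_enabled=_as_bool(env.get("ENABLE_ANALYTICS", "true")),
        )


settings = ExpansionSettings.from_env({"MAX_EXPANSIONS_PER_QUERY": "3"})
print(settings)
```

Passing a plain dictionary instead of reading `os.environ` directly makes the loader easy to exercise in tests.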
Configuration File
Create config/query_expansion.yml:
expansion:
  strategies:
    synonym_expansion:
      enabled: true
      confidence_threshold: 0.5
      relevance_threshold: 0.6
    acronym_expansion:
      enabled: true
      patterns:
        # Single quotes keep backslashes literal; inside double quotes YAML reads \b as a backspace escape
        - '\b[A-Z]{2,6}\b'
        - '\b[A-Z]\.?[A-Z]\.?[A-Z]?\.?\b'
    semantic_expansion:
      enabled: true
      similarity_threshold: 0.7
      max_similar_terms: 3
    contextual_expansion:
      enabled: true
      role_mappings:
        developer: ["code", "programming", "software"]
        manager: ["strategy", "planning", "team"]
        analyst: ["data", "analysis", "metrics"]
        designer: ["UI", "UX", "interface"]
synonym_management:
  default_confidence: 0.8
  require_approval: true
  auto_approve_threshold: 0.9
  max_synonyms_per_term: 10
analytics:
  enabled: true
  cache_ttl: 300
  retention_days: 90
  enable_visualizations: true
database:
  path: "synonyms.db"
  backup_enabled: true
  backup_interval_hours: 24
Analytics and Monitoring
Key Metrics
Expansion Effectiveness
- Success Rate: Percentage of expansions marked as helpful
- Confidence Score: Average confidence of expansions
- Relevance Score: Average relevance of expansions
- Usage Frequency: How often expansions are used
Synonym Quality
- Active Synonyms: Synonyms with usage > 0
- Success Rates: Per-synonym success rates
- Domain Distribution: Synonym distribution across domains
- Source Breakdown: Synonyms by source (manual, user, ML)
User Engagement
- Active Users: Users with recent activity
- Role Preferences: Expansion usage by role
- Department Usage: Expansion usage by department
- Feedback Patterns: User feedback trends
Dashboard Views
Overview Dashboard
- Total expansions and success rate
- Recent activity and trends
- Top performing synonyms
- Domain distribution
Domain-Specific Dashboard
- Domain-specific metrics
- Popular terms and synonyms
- Quality indicators
- Improvement opportunities
User Analytics Dashboard
- User engagement metrics
- Role-based usage patterns
- Feedback analysis
- Satisfaction trends
Monitoring Alerts
Set up alerts for:
- Low expansion success rate (< 50%)
- High synonym conflict rate (> 10%)
- Low user engagement (< 30% active users)
- Quality degradation (F1 score < 0.6)
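The thresholds above translate directly into a rule table. In the sketch below the metric keys (`success_rate`, `conflict_rate`, `active_user_ratio`, `f1_score`) and the alert names are assumptions; the guide does not specify how the analytics system exposes these values.

```python
# (alert name, metric key, predicate that is True when the threshold is breached)
ALERT_RULES = [
    ("low_success_rate", "success_rate", lambda value: value < 0.50),
    ("high_conflict_rate", "conflict_rate", lambda value: value > 0.10),
    ("low_user_engagement", "active_user_ratio", lambda value: value < 0.30),
    ("quality_degradation", "f1_score", lambda value: value < 0.6),
]


def check_alerts(metrics):
    """Return the names of all alerts triggered by the given metric values."""
    return [
        name
        for name, key, breached in ALERT_RULES
        if key in metrics and breached(metrics[key])
    ]


print(check_alerts({
    "success_rate": 0.45,
    "conflict_rate": 0.05,
    "active_user_ratio": 0.50,
    "f1_score": 0.55,
}))
```

Metrics missing from the input are skipped rather than treated as breaches, so a partial report does not raise false alarms.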
Best Practices
Synonym Curation
Do's
- Use domain-specific terminology
- Include context and usage examples
- Set appropriate confidence scores
- Review and maintain synonyms regularly
- Integrate user feedback
Don'ts
- Add overly broad synonyms
- Use ambiguous terms without context
- Publish low-confidence synonyms without review
- Ignore user feedback
- Create duplicate entries
Expansion Strategy Selection
For Technical Domains
- Prioritize acronym expansion
- Use high-confidence synonyms
- Include contextual expansion for roles
- Focus on precision over recall
For Business Domains
- Emphasize semantic expansion
- Use broader synonym sets
- Include contextual expansion for departments
- Balance precision and recall
For General Domains
- Use all strategies equally
- Focus on user feedback
- Maintain high quality standards
- Regular performance review
Quality Assurance
Regular Reviews
- Weekly synonym quality review
- Monthly expansion effectiveness analysis
- Quarterly strategy performance evaluation
- Annual comprehensive system audit
User Feedback Integration
- Prompt for feedback on expansions
- Track feedback trends
- Use feedback for continuous improvement
- Reward high-quality contributions
Troubleshooting
Common Issues
Low Expansion Success Rate
Symptoms:
- Success rate < 50%
- High user dissatisfaction
- Low confidence scores
Causes:
- Poor quality synonyms
- Incorrect confidence thresholds
- Outdated terminology
- Domain mismatch
Solutions:
- Review and curate synonyms
- Adjust confidence thresholds
- Update terminology regularly
- Verify domain assignments
High Synonym Conflicts
Symptoms:
- Multiple conflicting synonyms
- User confusion
- Inconsistent expansions
Causes:
- Duplicate entries
- Ambiguous terms
- Domain overlap
- Poor conflict resolution
Solutions:
- Implement conflict detection
- Add disambiguation context
- Create domain-specific versions
- Improve resolution algorithms
Performance Issues
Symptoms:
- Slow expansion response
- High database load
- Memory usage spikes
Causes:
- Large synonym database
- Inefficient queries
- No caching
- Poor indexing
Solutions:
- Implement caching
- Optimize database queries
- Add proper indexing
- Consider database partitioning
Debugging Tools
Expansion Debugging
# Enable debug logging
import logging
logging.getLogger("query_expansion").setLevel(logging.DEBUG)
# Test expansion with debug info
context = ExpansionContext(...)
expansions = await system.expand_query(query, context)
print(f"Expansion debug info: {expansions[0].expansion_metadata}")
Database Inspection
# Check synonym quality
from packages.rag.synonym_analytics import create_synonym_analytics
analytics = create_synonym_analytics()
metrics = await analytics.get_synonym_metrics()
print(f"Low confidence synonyms: {metrics.confidence_distribution['low']}")
Performance Monitoring
# Monitor expansion performance
import time
start_time = time.perf_counter()  # monotonic clock, better suited to timing than time.time()
expansions = await system.expand_query(query, context)
elapsed = time.perf_counter() - start_time
print(f"Expansion time: {elapsed:.3f}s")
Optimization Guide
Performance Optimization
Database Optimization
- Indexing: Add indexes on frequently queried columns
CREATE INDEX idx_synonyms_term_domain ON synonyms(term, domain);
CREATE INDEX idx_synonyms_confidence ON synonyms(confidence);
CREATE INDEX idx_expansion_history_user_date ON expansion_history(user_id, created_at);
- Caching: Implement Redis caching for frequent queries
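import json
import redis

redis_client = redis.Redis(host='localhost', port=6379, db=0)

async def get_cached_synonyms(term, domain):
    # This client is synchronous, so each call blocks the event loop briefly;
    # redis.asyncio provides an awaitable client if that becomes a bottleneck.
    cache_key = f"synonyms:{term}:{domain}"
    cached = redis_client.get(cache_key)
    if cached is not None:
        return json.loads(cached)
    # synonym_db is assumed to be the shared SynonymDatabase instance
    synonyms = await synonym_db.get_synonyms(term, domain)
    redis_client.setex(cache_key, 300, json.dumps(synonyms))  # 5 min cache
    return synonyms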
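- Connection Handling: Scope each database connection with a context manager so it is always closed (note that this opens a new connection per call rather than pooling)
import sqlite3
from contextlib import contextmanager

@contextmanager
def get_db_connection(db_path):
    # Yield a connection and guarantee it is closed, even if the caller raises
    conn = sqlite3.connect(db_path, check_same_thread=False)
    try:
        yield conn
    finally:
        conn.close()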
Algorithm Optimization
- Parallel Processing: Run expansion strategies in parallel
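import asyncio

async def expand_query_parallel(self, query, context):
    tasks = [
        strategy.expand_query(query, context)
        for strategy in self.strategies.values()
    ]
    results = await asyncio.gather(*tasks, return_exceptions=True)
    # Each strategy returns a list of expansions: skip strategies that
    # raised and flatten the remaining lists into one
    expansions = []
    for result in results:
        if not isinstance(result, BaseException):
            expansions.extend(result)
    return expansions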
- Early Termination: Stop expansion when confidence is sufficient
async def expand_with_early_termination(self, query, context, min_confidence=0.8):
    expansions = []
    for strategy in self.strategies.values():
        strategy_expansions = await strategy.expand_query(query, context)
        expansions.extend(strategy_expansions)
        # Check if we have sufficient confidence
        if expansions and max(e.confidence_score for e in expansions) >= min_confidence:
            break
    return expansions
Quality Optimization
Synonym Quality Improvement
- Automated Quality Scoring
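def calculate_synonym_quality(synonym):
    # Weighted sum in [0, 1]; the four weights total 1.0
    usage_factor = min(synonym.usage_count / 100, 1.0)  # saturates at 100 uses
    quality_score = (
        synonym.confidence * 0.4 +
        synonym.relevance_score * 0.3 +
        synonym.success_rate * 0.2 +
        usage_factor * 0.1
    )
    return quality_score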
- Active Learning: Use user feedback to improve synonyms
async def update_synonym_from_feedback(synonym_db, synonym, feedback):
    # Running average: fold this outcome (1 = helpful, 0 = not helpful)
    # into the success rate accumulated over usage_count earlier uses
    outcome = 1 if feedback.was_helpful else 0
    synonym.success_rate = (
        (synonym.success_rate * synonym.usage_count + outcome)
        / (synonym.usage_count + 1)
    )
    synonym.usage_count += 1
    await synonym_db.add_synonym(synonym)
Expansion Strategy Optimization
- Dynamic Strategy Selection: Choose strategies based on query characteristics
import re

def select_strategies(query, context):
    # Always include synonym expansion
    strategies = [ExpansionType.SYNONYM_EXPANSION]
    # Add acronym expansion when a known acronym appears as a whole word
    # (a substring test would also fire on words such as "capital")
    words = set(re.findall(r"\w+", query.upper()))
    if words & {'API', 'SQL', 'HTTP', 'XML'}:
        strategies.append(ExpansionType.ACRONYM_EXPANSION)
    # Add semantic expansion for complex queries
    if len(query.split()) > 3:
        strategies.append(ExpansionType.SEMANTIC)
    # Add contextual expansion based on user role
    if context.user_role in ['developer', 'manager', 'analyst']:
        strategies.append(ExpansionType.CONTEXTUAL)
    return strategies
- Confidence-Based Filtering: Filter expansions by confidence
def filter_expansions_by_confidence(expansions, min_confidence=0.6):
    return [e for e in expansions if e.confidence_score >= min_confidence]
Scalability Optimization
Horizontal Scaling
- Database Sharding: Shard by domain
class ShardedSynonymDatabase:
    def __init__(self, shard_config):
        # shard_config maps domain -> database path and must include a
        # 'default' entry, which serves domains without their own shard
        self.shards = {}
        for domain, db_path in shard_config.items():
            self.shards[domain] = SynonymDatabase(db_path)

    async def get_synonyms(self, term, domain):
        shard = self.shards.get(domain, self.shards['default'])
        return await shard.get_synonyms(term, domain)
- Load Balancing: Distribute expansion requests
class LoadBalancedExpansionSystem:
    def __init__(self, expansion_systems):
        self.systems = expansion_systems
        self.current_index = 0

    async def expand_query(self, query, context):
        # Round-robin: pick the next system, then advance the index
        system = self.systems[self.current_index]
        self.current_index = (self.current_index + 1) % len(self.systems)
        return await system.expand_query(query, context)
Vertical Scaling
- Memory Optimization: Use efficient data structures
from collections import defaultdict
import pickle

class OptimizedSynonymCache:
    def __init__(self):
        self.cache = defaultdict(dict)
        # When enabled, entries are stored as pickled bytes. Note that pickle
        # serializes but does not compress; wrap it with zlib to shrink entries.
        self.compression_enabled = True

    def get(self, term, domain):
        entry = self.cache[domain].get(term)
        if entry is None:
            return None  # cache miss
        return pickle.loads(entry) if self.compression_enabled else entry

    def set(self, term, domain, synonyms):
        if self.compression_enabled:
            self.cache[domain][term] = pickle.dumps(synonyms)
        else:
            self.cache[domain][term] = synonyms
- CPU Optimization: Use efficient algorithms
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

class OptimizedSemanticExpansion:
    def __init__(self):
        self.vectorizer = TfidfVectorizer(max_features=1000)
        self.term_vectors = {}

    def precompute_vectors(self, terms):
        # Fit once and keep a dense vector per term for fast comparison
        vectors = self.vectorizer.fit_transform(terms)
        for i, term in enumerate(terms):
            self.term_vectors[term] = vectors[i].toarray().flatten()

    def find_similar_terms(self, query, threshold=0.7):
        query_vector = self.vectorizer.transform([query]).toarray().flatten()
        query_norm = np.linalg.norm(query_vector)
        if query_norm == 0:
            return []  # query shares no vocabulary with the indexed terms
        similarities = []
        for term, vector in self.term_vectors.items():
            term_norm = np.linalg.norm(vector)
            if term_norm == 0:
                continue  # avoid dividing by zero for empty vectors
            # Cosine similarity between the query and the stored term
            similarity = np.dot(query_vector, vector) / (query_norm * term_norm)
            if similarity >= threshold:
                similarities.append((term, similarity))
        return sorted(similarities, key=lambda x: x[1], reverse=True)
This comprehensive guide provides everything needed to implement, configure, and optimize the query expansion system for enterprise use cases.