📝 Session 4: Query Expansion Practice¶
📝 PARTICIPANT PATH CONTENT
Prerequisites: Complete the 🎯 Observer path and 📝 HyDE Implementation
Time Investment: 45-60 minutes
Outcome: A multi-strategy query expansion system
Learning Outcomes¶
After completing this practice guide, you will:
- Build intelligent query expansion systems
- Implement multiple expansion strategies (semantic, contextual, domain-specific)
- Create multi-query generation from different perspectives
- Develop query decomposition for complex questions
Complete Query Expansion System¶
Step 1: Intelligent Query Expander Setup¶
Build a comprehensive query expansion system with multiple strategies:
from typing import Any, Dict, List, Optional  # used in the signatures below

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import nltk
from nltk.corpus import wordnet
from collections import defaultdict
class IntelligentQueryExpander:
"""Advanced query expansion using multiple strategies."""
def __init__(self, llm_model, domain_corpus: Optional[List[str]] = None):
self.llm_model = llm_model
self.domain_corpus = domain_corpus
Query Expansion Libraries: These imports provide statistical analysis, semantic relationships, and domain-specific term extraction capabilities.
# Initialize expansion strategies
self.expansion_strategies = {
'synonym': self._synonym_expansion,
'semantic': self._semantic_expansion,
'contextual': self._contextual_expansion,
'domain_specific': self._domain_specific_expansion
}
# Domain-specific TF-IDF if corpus provided
if domain_corpus:
self.domain_tfidf = TfidfVectorizer(
max_features=10000,
stop_words='english',
ngram_range=(1, 3)
)
self.domain_tfidf.fit(domain_corpus)
Strategy Framework: Multiple expansion approaches ensure comprehensive coverage - from simple synonyms to complex semantic relationships and domain-specific terminology.
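The strategy registry above also wires in a _synonym_expansion method that this guide does not walk through. A minimal sketch, assuming the WordNet data has been downloaded via nltk.download('wordnet'):

def _synonym_expansion(self, query: str, max_expansions: int) -> List[str]:
    """Generate synonym expansions from WordNet (illustrative sketch)."""
    synonyms = []
    for term in set(query.lower().split()):
        for synset in wordnet.synsets(term):
            for lemma in synset.lemmas():
                candidate = lemma.name().replace('_', ' ')
                # Keep only new terms that differ from the query word
                if candidate != term and candidate not in synonyms:
                    synonyms.append(candidate)
    return synonyms[:max_expansions]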
Step 2: Core Expansion Workflow¶
Implement the main query expansion coordination system:
def expand_query(self, query: str,
strategies: List[str] = ['semantic', 'contextual'],
max_expansions: int = 5) -> Dict[str, Any]:
"""Expand query using multiple strategies."""
expansion_results = {}
combined_expansions = set()
# Apply each expansion strategy
for strategy in strategies:
if strategy in self.expansion_strategies:
expansions = self.expansion_strategies[strategy](
query, max_expansions
)
expansion_results[strategy] = expansions
combined_expansions.update(expansions)
Multi-Strategy Coordination: Running several expansion strategies and merging their output lets each method contribute its strengths to a single enhanced query.
# Create final expanded query
expanded_query = self._create_expanded_query(
query, list(combined_expansions)
)
return {
'original_query': query,
'expansions_by_strategy': expansion_results,
'all_expansions': list(combined_expansions),
'expanded_query': expanded_query,
'expansion_count': len(combined_expansions)
}
Step 3: Semantic Expansion Using LLM¶
Generate semantically related terms using LLM understanding:
def _semantic_expansion(self, query: str, max_expansions: int) -> List[str]:
"""Generate semantic expansions using LLM understanding."""
semantic_prompt = f"""
Given this query, generate {max_expansions} semantically related terms or phrases:
Query: {query}
Requirements:
1. Include synonyms and related concepts
2. Add domain-specific terminology if applicable
3. Include both broader and narrower terms
4. Focus on terms likely to appear in relevant documents
Return only the expanded terms, one per line:
"""
Semantic Expansion Strategy: Using LLM understanding to generate related terms creates expansions that capture conceptual relationships beyond simple synonyms.
try:
response = self.llm_model.predict(semantic_prompt)
expansions = [
term.strip()
for term in response.strip().split('\n')
if term.strip() and not term.strip().startswith(('-', '*', '•'))
]
return expansions[:max_expansions]
except Exception as e:
print(f"Semantic expansion error: {e}")
return []
Response Processing: Filtering and cleaning LLM output ensures we get high-quality expansion terms while removing formatting artifacts.
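To make the filtering concrete, here is a hypothetical raw response and its cleaned form:

# Hypothetical LLM output
raw = "database tuning\n- index optimization\nquery caching"
# The bulleted line is dropped by the startswith filter:
# ['database tuning', 'query caching']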
Step 4: Contextual Query Reformulation¶
Create multiple ways to express the same information need:
def _contextual_expansion(self, query: str, max_expansions: int) -> List[str]:
"""Generate contextual reformulations of the query."""
reformulation_prompt = f"""
Reformulate this query in {max_expansions} different ways that express the same information need:
Original Query: {query}
Create variations that:
1. Use different phrasing and vocabulary
2. Approach the question from different angles
3. Include both specific and general formulations
4. Maintain the original intent and meaning
Reformulations:
"""
Reformulation Strategy: Creating multiple ways to express the same information need increases the likelihood of matching documents with different linguistic styles.
try:
response = self.llm_model.predict(reformulation_prompt)
reformulations = [
reform.strip().rstrip('.')
for reform in response.strip().split('\n')
if reform.strip() and ('?' in reform or len(reform.split()) > 3)
]
return reformulations[:max_expansions]
except Exception as e:
print(f"Contextual expansion error: {e}")
return []
Quality Filtering: Ensuring reformulations are substantive questions rather than fragments improves the quality of query variations.
Step 5: Domain-Specific Expansion¶
Use domain knowledge to enhance queries with specialized terminology:
def _domain_specific_expansion(self, query: str, max_expansions: int) -> List[str]:
"""Generate domain-specific expansions using corpus knowledge."""
if not hasattr(self, 'domain_tfidf'):
return []
# Transform query to TF-IDF vector
query_vector = self.domain_tfidf.transform([query])
# Get feature names and scores
feature_names = self.domain_tfidf.get_feature_names_out()
tfidf_scores = query_vector.toarray()[0]
# Find highly relevant terms
term_scores = list(zip(feature_names, tfidf_scores))
term_scores.sort(key=lambda x: x[1], reverse=True)
Domain Knowledge Integration: TF-IDF analysis against the domain corpus scores the query's own unigrams and n-grams; only corpus-attested phrases receive nonzero scores, so this strategy surfaces specialized wording that overlaps the query rather than arbitrary corpus terms.
# Extract top domain terms not in original query
query_terms = set(query.lower().split())
domain_expansions = []
for term, score in term_scores[:max_expansions * 3]:
if score > 0 and term not in query_terms:
domain_expansions.append(term)
if len(domain_expansions) >= max_expansions:
break
return domain_expansions
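As a quick sanity check, with a small illustrative corpus the strategy surfaces corpus-attested phrases that overlap the query (the corpus below is hypothetical):

# Hypothetical mini-corpus
corpus = [
    "Index fragmentation degrades query execution plans.",
    "Tuning query execution improves database performance.",
]
expander = IntelligentQueryExpander(llm_model, domain_corpus=corpus)
expander._domain_specific_expansion("query execution performance", 3)
# -> ['query execution']: the bigram scores nonzero in the query vector
#    and is not caught by the single-word overlap check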
Step 6: Multi-Query Generation System¶
Generate comprehensive query sets from multiple perspectives:
class MultiQueryGenerator:
"""Generate multiple query perspectives for comprehensive retrieval."""
def __init__(self, llm_model):
self.llm_model = llm_model
self.query_perspectives = {
'decomposition': self._decompose_complex_query,
'specificity_levels': self._generate_specificity_variants,
'temporal_variants': self._generate_temporal_variants,
'perspective_shifts': self._generate_perspective_variants,
'domain_focused': self._generate_domain_variants
}
Perspective Framework: Multiple query generation approaches ensure comprehensive coverage by viewing the same information need from different angles and specificity levels.
def generate_multi_query_set(self, query: str,
                             perspectives: Optional[List[str]] = None,
                             total_queries: int = 8) -> Dict[str, Any]:
"""Generate comprehensive query set from multiple perspectives."""
if perspectives is None:
perspectives = ['decomposition', 'specificity_levels', 'perspective_shifts']
all_queries = {'original': query}
generation_metadata = {}
# Distribute query generation across perspectives
queries_per_perspective = total_queries // len(perspectives)
remaining_queries = total_queries % len(perspectives)
Query Distribution Strategy: The query budget is split evenly across the chosen perspectives, with any remainder assigned to the first few, keeping coverage of the information space balanced.
for i, perspective in enumerate(perspectives):
num_queries = queries_per_perspective
if i < remaining_queries:
num_queries += 1
generated = self.query_perspectives[perspective](query, num_queries)
all_queries[perspective] = generated
generation_metadata[perspective] = {
'count': len(generated),
'method': perspective
}
# Flatten and deduplicate
flattened_queries = self._flatten_and_deduplicate(all_queries)
return {
'original_query': query,
'query_variants': flattened_queries,
'queries_by_perspective': all_queries,
'generation_metadata': generation_metadata,
'total_variants': len(flattened_queries)
}
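The method above relies on a _flatten_and_deduplicate helper that is not shown in this guide. A minimal sketch that preserves first-seen order:

def _flatten_and_deduplicate(self, all_queries: Dict[str, Any]) -> List[str]:
    """Flatten per-perspective query lists into one deduplicated list."""
    seen = set()
    flattened = []
    for value in all_queries.values():
        # 'original' maps to a single string; perspectives map to lists
        variants = [value] if isinstance(value, str) else value
        for variant in variants:
            key = variant.strip().lower()
            if key and key not in seen:
                seen.add(key)
                flattened.append(variant.strip())
    return flattened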
Step 7: Complex Query Decomposition¶
Break complex questions into manageable sub-questions:
def _decompose_complex_query(self, query: str, num_queries: int) -> List[str]:
"""Decompose complex queries into simpler sub-questions."""
decomposition_prompt = f"""
Break down this complex query into {num_queries} simpler, focused sub-questions:
Complex Query: {query}
Requirements:
1. Each sub-question should be independently searchable
2. Sub-questions should cover different aspects of the main query
3. Avoid redundancy between sub-questions
4. Maintain logical flow and completeness
Sub-questions:
"""
Decomposition Strategy: Breaking complex questions into focused sub-questions enables more precise retrieval and comprehensive coverage of multifaceted information needs.
try:
    response = self.llm_model.predict(decomposition_prompt)
    # Normalize each kept line to end with exactly one "?"
    sub_questions = [
        q.strip().rstrip('?') + '?'
        for q in response.strip().split('\n')
        if q.strip() and ('?' in q or len(q.split()) > 3)
    ]
    return sub_questions[:num_queries]
except Exception as e:
    print(f"Decomposition error: {e}")
    return []
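A hypothetical run (the exact wording depends entirely on the LLM):

multi_gen._decompose_complex_query(
    "How do microservices handle authentication and scaling?", 2
)
# Plausible output:
# ['How do microservices handle authentication?',
#  'How do microservices scale under load?']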
Step 8: Specificity Level Variants¶
Generate queries at different levels of granularity:
def _generate_specificity_variants(self, query: str, num_queries: int) -> List[str]:
"""Generate queries at different levels of specificity."""
specificity_prompt = f"""
Generate {num_queries} variants of this query at different specificity levels:
Original Query: {query}
Create variants that range from:
1. Very broad/general versions
2. Medium specificity versions
3. Very specific/detailed versions
Each variant should maintain the core intent but adjust the scope:
"""
Specificity Adjustment: Creating query variants at different granularity levels ensures retrieval of both overview information and detailed specifics.
try:
response = self.llm_model.predict(specificity_prompt)
variants = [
variant.strip()
for variant in response.strip().split('\n')
if variant.strip() and len(variant.split()) > 2
]
return variants[:num_queries]
except Exception as e:
print(f"Specificity variant error: {e}")
return []
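Step 6 also registers temporal, perspective-shift, and domain-focused generators that are not walked through here. They follow the same prompt-and-parse pattern; a minimal sketch of _generate_perspective_variants, for example:

def _generate_perspective_variants(self, query: str, num_queries: int) -> List[str]:
    """Rephrase the query from different stakeholder viewpoints (sketch)."""
    perspective_prompt = f"""
    Rewrite this query from {num_queries} different stakeholder perspectives
    (e.g., beginner, practitioner, architect), one per line:

    Query: {query}
    """
    try:
        response = self.llm_model.predict(perspective_prompt)
        variants = [
            v.strip() for v in response.strip().split('\n')
            if v.strip() and len(v.split()) > 2
        ]
        return variants[:num_queries]
    except Exception as e:
        print(f"Perspective variant error: {e}")
        return []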
Step 9: Enhanced Query Assembly¶
Create final expanded queries that combine all enhancement techniques:
def _create_expanded_query(self, original_query: str,
expansions: List[str]) -> str:
"""Create final expanded query combining original and expansions."""
# Remove duplicates and very similar terms
filtered_expansions = self._filter_similar_terms(
original_query, expansions
)
# Limit expansion count to prevent query bloat
max_expansions = 8
if len(filtered_expansions) > max_expansions:
filtered_expansions = filtered_expansions[:max_expansions]
# Create expanded query with OR logic
if filtered_expansions:
expansion_text = ' OR '.join(f'"{term}"' for term in filtered_expansions)
expanded_query = f"{original_query} OR ({expansion_text})"
else:
expanded_query = original_query
return expanded_query
Query Assembly: Intelligent combination of original query and expansions creates enhanced search queries while preventing query bloat.
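For a concrete sense of the result (the expansion terms here are hypothetical):

expander._create_expanded_query(
    "How to optimize database performance?",
    ["index tuning", "query caching"]
)
# -> 'How to optimize database performance? OR ("index tuning" OR "query caching")'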
Step 10: Similarity Filtering¶
Remove redundant or overly similar expansion terms:
def _filter_similar_terms(self, original_query: str,
                          expansions: List[str]) -> List[str]:
    """Filter out terms too similar to original query or each other."""
    original_words = set(original_query.lower().split())
    filtered_expansions = []

    for expansion in expansions:
        expansion_words = set(expansion.lower().split())
        if not expansion_words:
            continue  # guard against empty or whitespace-only terms

        # Skip if too much overlap with original
        overlap_ratio = len(expansion_words & original_words) / len(expansion_words)
        if overlap_ratio > 0.7:
            continue

        # Skip if too similar to already selected expansions
        is_too_similar = False
        for selected in filtered_expansions:
            selected_words = set(selected.lower().split())
            similarity = len(expansion_words & selected_words) / len(expansion_words | selected_words)
            if similarity > 0.6:
                is_too_similar = True
                break

        if not is_too_similar:
            filtered_expansions.append(expansion)

    return filtered_expansions
Similarity Control: Preventing redundant expansions ensures query enhancement adds value without creating noise.
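As a worked example of the second check: the candidate "database index tuning" compared against an already selected "index tuning guide" shares {index, tuning} out of a union of {database, index, tuning, guide}, giving a Jaccard similarity of 2/4 = 0.5. That falls below the 0.6 threshold, so both expansions are kept.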
Testing Your Query Expansion System¶
Test the complete expansion system with various query types:
# Initialize the expansion system
expander = IntelligentQueryExpander(llm_model, domain_corpus)
multi_gen = MultiQueryGenerator(llm_model)
# Test semantic expansion
semantic_result = expander.expand_query(
"How to optimize database performance?",
strategies=['semantic', 'contextual']
)
# Test multi-query generation
multi_result = multi_gen.generate_multi_query_set(
"What are the best practices for microservices architecture?",
perspectives=['decomposition', 'specificity_levels'],
total_queries=6
)
print("Expansion Results:")
print(f"Original: {semantic_result['original_query']}")
print(f"Expanded: {semantic_result['expanded_query']}")
print(f"Expansion count: {semantic_result['expansion_count']}")
print("\nMulti-Query Results:")
print(f"Total variants: {multi_result['total_variants']}")
print("Query variants:", multi_result['query_variants'][:3])
Integration with Search Systems¶
Connect your expansion system to vector search:
# Assumes this method lives on a wrapper class that holds both an
# IntelligentQueryExpander (self.expander) and a MultiQueryGenerator (self.multi_gen).
def search_with_expansion(self, query: str, vector_store,
                          expansion_strategies: List[str] = ['semantic'],
                          top_k: int = 10):
    """Perform enhanced search using query expansion."""
    # Generate expanded query
    expansion_result = self.expander.expand_query(
        query, strategies=expansion_strategies
    )

    # Search with both original and expanded queries
    original_results = vector_store.similarity_search(query, k=top_k)
    expanded_results = vector_store.similarity_search(
        expansion_result['expanded_query'], k=top_k
    )

    # Generate multiple query variants
    multi_results = self.multi_gen.generate_multi_query_set(query)
    variant_results = []
    for variant in multi_results['query_variants'][:3]:
        variant_results.extend(
            # max(1, ...) guards against a zero k when top_k < 3
            vector_store.similarity_search(variant, k=max(1, top_k // 3))
        )

    return {
        'original_results': original_results,
        'expanded_results': expanded_results,
        'variant_results': variant_results,
        'expansion_metadata': expansion_result,
        'multi_query_metadata': multi_results
    }
Comprehensive Search: Combining multiple expansion strategies with variant generation provides maximum retrieval coverage.
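The three result lists returned above can overlap heavily. A minimal merge pass, assuming LangChain-style Document objects whose page_content identifies duplicates:

def merge_search_results(result_sets: List[List], top_k: int = 10) -> List:
    """Merge overlapping result lists, keeping the first occurrence of each doc."""
    seen = set()
    merged = []
    for results in result_sets:
        for doc in results:
            if doc.page_content not in seen:  # assumes Document-style objects
                seen.add(doc.page_content)
                merged.append(doc)
    return merged[:top_k]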
Practice Exercises¶
- Custom Strategies: Implement domain-specific expansion strategies
- Performance Comparison: Compare expansion results with baseline search (a simple starting point is sketched after this list)
- Query Complexity: Test with increasingly complex multi-part questions
- Expansion Quality: Develop metrics to assess expansion effectiveness
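For the performance comparison, one simple starting point is recall against a hand-labeled relevance set. The helper below assumes each Document carries an 'id' in its metadata; both the labels and that field are assumptions for illustration:

def recall_at_k(results: List, relevant_ids: set, k: int = 10) -> float:
    """Fraction of hand-labeled relevant docs appearing in the top-k results."""
    retrieved = {doc.metadata.get('id') for doc in results[:k]}  # assumed metadata field
    return len(retrieved & relevant_ids) / max(1, len(relevant_ids))

# Hypothetical comparison for one query:
# baseline = vector_store.similarity_search(query, k=10)
# expanded = vector_store.similarity_search(expanded_query, k=10)
# print(recall_at_k(baseline, relevant_ids), recall_at_k(expanded, relevant_ids))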
🧭 Navigation¶
Previous: ← Session 3 - Advanced Patterns
Next: Session 5 - Type-Safe Development →