📝 Session 5: Automated Testing Practice¶
📝 PARTICIPANT PATH - Practical Testing Implementation

Prerequisites: Complete 🎯 Observer Path and RAGAS Practice
Time Investment: 2.5-3 hours
Outcome: Implement scientific A/B testing for RAG optimization
Learning Outcomes¶
By completing this section, you will:
- Set up A/B testing frameworks for RAG component comparison
- Implement statistical significance testing for enhancement validation
- Create multi-armed bandit systems for adaptive optimization
- Build automated test pipelines with regression detection
Prerequisites Check¶
Before starting implementation, ensure you have:
- Completed 🎯 RAG Evaluation Essentials
- Completed 📝 RAGAS Implementation Practice
- Working RAGAS evaluation setup
- Multiple RAG system variants to compare
📝 A/B Testing Framework Implementation¶
Scientific Enhancement Validation¶
The enhancement techniques from Session 4 (HyDE, query expansion, context optimization) are theoretically sound, but do they actually improve user outcomes? A/B testing provides the scientific rigor to answer this question definitively by comparing enhancement strategies under controlled conditions with statistical significance testing.
Let's implement a comprehensive A/B testing framework:
import numpy as np
import time
from scipy import stats
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Any, Optional
@dataclass
class ABTestResult:
"""Results from A/B testing comparison."""
test_name: str
variant_performance: Dict[str, float]
statistical_significance: Dict[str, bool]
effect_sizes: Dict[str, float]
winner: Optional[str]
confidence_level: float
recommendation: str
This data structure captures all essential A/B test information in a format that's easy to analyze and report on.
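To make the structure concrete, here's how a completed result might look and be summarized; the scores below are invented purely for illustration:

```python
# Hypothetical result for a two-variant test (all values invented)
result = ABTestResult(
    test_name="hyde_vs_baseline",
    variant_performance={"baseline": 0.71, "hyde_enhanced": 0.78},
    statistical_significance={"baseline_vs_hyde_enhanced": True},
    effect_sizes={"baseline_vs_hyde_enhanced": 0.42},
    winner="hyde_enhanced",
    confidence_level=0.95,
    recommendation="Deploy hyde_enhanced: significantly outperforms alternatives",
)

for name, score in result.variant_performance.items():
    marker = " (winner)" if name == result.winner else ""
    print(f"{name}: {score:.3f}{marker}")
```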
A/B Testing Framework Core Implementation¶
Now let's build the main A/B testing system that can compare different RAG configurations:
class RAGABTestFramework:
"""Scientific A/B testing framework for RAG system optimization."""
def __init__(self, evaluation_framework):
self.evaluation_framework = evaluation_framework
self.active_tests = {}
self.test_history = []
self.significance_threshold = 0.05 # 95% confidence level
def create_ab_test(self, test_name, variant_configs, test_dataset):
"""Create new A/B test comparing RAG variants."""
test_setup = {
'test_name': test_name,
'variants': variant_configs,
'dataset': test_dataset,
'start_time': time.time(),
'status': 'created',
'results': {}
}
# Validate test setup
if len(variant_configs) < 2:
raise ValueError("A/B test requires at least 2 variants")
if len(test_dataset) < 20:
print("Warning: Small dataset size may affect statistical power")
self.active_tests[test_name] = test_setup
print(f"Created A/B test '{test_name}' with {len(variant_configs)} variants")
return test_setup
This framework provides the foundation for scientific comparison of RAG system variants, ensuring proper experimental design from the start.
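For instance, assuming `evaluation_framework` is your RAGAS-backed evaluator from the previous session, setting up a test might look like this (the variant keys and dataset schema are placeholders you'd adapt to your system):

```python
ab = RAGABTestFramework(evaluation_framework)  # your evaluation framework instance

test_dataset = [
    {"question": "What is retrieval-augmented generation?", "ground_truth": "..."},
    # ... aim for well over 20 examples for reasonable statistical power
]

ab.create_ab_test(
    test_name="hyde_vs_baseline",
    variant_configs={
        "baseline": {"query_enhancement": None},
        "hyde_enhanced": {"query_enhancement": "hyde"},
    },
    test_dataset=test_dataset,
)
```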
Test Execution with Performance Measurement¶
Let's implement the core test execution that systematically evaluates each variant:
def execute_ab_test(self, test_name, evaluation_metrics=None):
"""Execute A/B test and collect performance data."""
if test_name not in self.active_tests:
raise ValueError(f"Test '{test_name}' not found")
test_setup = self.active_tests[test_name]
test_setup['status'] = 'running'
if evaluation_metrics is None:
evaluation_metrics = ['faithfulness', 'answer_relevancy', 'context_precision']
variant_results = {}
print(f"Executing A/B test: {test_name}")
print(f"Testing {len(test_setup['variants'])} variants on {len(test_setup['dataset'])} examples")
# Test each variant systematically
for variant_name, variant_config in test_setup['variants'].items():
print(f" Evaluating variant: {variant_name}")
# Create RAG system with variant configuration
variant_system = self._create_rag_variant(variant_config)
# Generate responses for test dataset
rag_responses = self._generate_variant_responses(
variant_system, test_setup['dataset']
)
This systematic approach ensures each variant is tested under identical conditions, providing reliable comparison data.
We continue with evaluation and results collection:
# Evaluate variant performance
evaluation_result = self.evaluation_framework.run_comprehensive_evaluation(
rag_responses, include_ground_truth=True
)
variant_results[variant_name] = {
'evaluation': evaluation_result,
'response_times': [r.get('response_time', 0) for r in rag_responses],
'success_rate': len([r for r in rag_responses if r['generated_answer']]) / len(rag_responses),
'timestamp': time.time()
}
print(f" Overall Score: {evaluation_result.get('overall_score', 'N/A'):.3f}")
# Analyze results statistically
analysis_result = self._analyze_statistical_significance(variant_results)
# Complete test
test_result = ABTestResult(
test_name=test_name,
variant_performance={name: result['evaluation']['overall_score']
for name, result in variant_results.items()},
statistical_significance=analysis_result['significance'],
effect_sizes=analysis_result['effect_sizes'],
winner=analysis_result['winner'],
confidence_level=1 - self.significance_threshold,
recommendation=analysis_result['recommendation']
)
test_setup['results'] = test_result
test_setup['status'] = 'completed'
self.test_history.append(test_result)
return test_result
The comprehensive analysis ensures you get actionable insights from your A/B tests, not just raw performance numbers.
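Running the test and reading the result is then a single call; this sketch assumes the `_create_rag_variant` and `_generate_variant_responses` helpers are wired to your RAG stack:

```python
result = ab.execute_ab_test("hyde_vs_baseline")

print(f"Winner: {result.winner or 'no significant winner'}")
print(f"Recommendation: {result.recommendation}")
for comparison, significant in result.statistical_significance.items():
    effect = result.effect_sizes[comparison]
    print(f"{comparison}: significant={significant}, effect size={effect:.2f}")
```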
📝 Statistical Significance Implementation¶
Rigorous Statistical Analysis¶
Let's implement proper statistical testing to ensure your optimization decisions are based on real differences, not random variation:
def _analyze_statistical_significance(self, variant_results):
"""Analyze A/B test results with statistical significance testing."""
analysis = {
'significance': {},
'effect_sizes': {},
'winner': None,
'confidence_intervals': {},
'recommendation': ''
}
variant_names = list(variant_results.keys())
# Extract performance metrics for each variant
performance_data = {}
for variant_name, result in variant_results.items():
# For proper statistical testing, we need individual response scores
# Here we approximate with overall score (in practice, collect individual scores)
overall_score = result['evaluation']['overall_score']
response_times = result['response_times']
performance_data[variant_name] = {
'quality_scores': [overall_score] * len(response_times), # Approximation
'response_times': response_times,
'sample_size': len(response_times)
}
This statistical foundation enables rigorous comparison that accounts for sample size and natural variation in performance.
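Sample size drives statistical power, so it's worth estimating how many test questions you need before a difference of a given size becomes detectable. A minimal sketch using statsmodels (an extra dependency, not imported above):

```python
from statsmodels.stats.power import TTestIndPower

# Per-variant sample size needed to detect a medium effect (Cohen's d = 0.5)
# with 80% power at the 0.05 significance level
n_per_variant = TTestIndPower().solve_power(
    effect_size=0.5, alpha=0.05, power=0.8
)
print(f"~{n_per_variant:.0f} examples per variant")  # roughly 64
```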
We implement pairwise comparisons with proper statistical tests:
# Perform pairwise statistical comparisons
for i, variant_a in enumerate(variant_names):
for variant_b in variant_names[i+1:]:
data_a = performance_data[variant_a]
data_b = performance_data[variant_b]
# Quality score comparison (t-test approximation)
score_a = np.mean(data_a['quality_scores'])
score_b = np.mean(data_b['quality_scores'])
# Calculate effect size (Cohen's d approximation)
pooled_std = np.sqrt(
(np.var(data_a['quality_scores']) + np.var(data_b['quality_scores'])) / 2
)
effect_size = abs(score_a - score_b) / (pooled_std + 1e-8)
# Use Welch's t-test when per-response scores carry variance; fall back
# to a simple threshold for the duplicated-score approximation above
if np.var(data_a['quality_scores']) > 0 or np.var(data_b['quality_scores']) > 0:
    _, p_value = stats.ttest_ind(
        data_a['quality_scores'], data_b['quality_scores'], equal_var=False
    )
    is_significant = p_value < self.significance_threshold
else:
    is_significant = abs(score_a - score_b) > 0.05  # Simplified threshold
comparison_key = f"{variant_a}_vs_{variant_b}"
analysis['significance'][comparison_key] = is_significant
analysis['effect_sizes'][comparison_key] = effect_size
Proper effect size calculation helps you understand not just whether differences are statistically significant, but whether they're practically meaningful.
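A common rule of thumb for reading Cohen's d treats 0.2, 0.5, and 0.8 as the small/medium/large boundaries; the sketch below encodes those conventions (treat the cutoffs as guidance, not hard rules):

```python
def interpret_effect_size(d: float) -> str:
    """Map Cohen's d to the conventional qualitative label."""
    if d < 0.2:
        return "negligible"
    elif d < 0.5:
        return "small"
    elif d < 0.8:
        return "medium"
    return "large"

print(interpret_effect_size(0.42))  # "small"
```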
Finally, we determine the winning variant and generate recommendations:
# Determine overall winner based on performance and significance
best_variant = max(
variant_names,
key=lambda v: variant_results[v]['evaluation']['overall_score']
)
# Check if winner is significantly better (pairwise keys may be stored
# in either order, so look up both directions)
winner_is_significant = any(
    analysis['significance'].get(f"{best_variant}_vs_{other}", False)
    or analysis['significance'].get(f"{other}_vs_{best_variant}", False)
    for other in variant_names if other != best_variant
)
if winner_is_significant:
analysis['winner'] = best_variant
analysis['recommendation'] = f"Deploy {best_variant}: significantly outperforms alternatives"
else:
analysis['winner'] = None
analysis['recommendation'] = "No significant difference detected: consider longer test or larger sample"
return analysis
This analysis provides clear, actionable guidance on whether to adopt new RAG enhancements or continue with existing configurations.
📝 Multi-Armed Bandit Implementation¶
Adaptive Optimization Strategy¶
Multi-armed bandit algorithms provide an alternative to traditional A/B testing by adaptively learning which variants perform best while continuing to serve users:
class RAGMultiArmedBandit:
"""Adaptive multi-armed bandit for RAG system optimization."""
def __init__(self, variant_configs, exploration_rate=0.1):
self.variants = list(variant_configs.keys())
self.variant_configs = variant_configs
self.exploration_rate = exploration_rate
# Initialize bandit arms
self.arm_counts = {variant: 0 for variant in self.variants}
self.arm_rewards = {variant: 0.0 for variant in self.variants}
self.arm_avg_rewards = {variant: 0.0 for variant in self.variants}
# Tracking
self.total_trials = 0
self.trial_history = []
self.rag_systems = {}
# Initialize RAG systems for each variant
self._initialize_rag_variants()
The bandit approach balances exploration of potentially better variants with exploitation of currently best-performing ones.
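One common refinement is to decay the exploration rate as trials accumulate, so early traffic explores broadly and later traffic mostly exploits. A sketch of such a schedule (the decay constant is arbitrary and should be tuned):

```python
import math

def decayed_exploration_rate(initial_rate: float, total_trials: int,
                             decay: float = 0.001) -> float:
    """Exponentially decay epsilon as trials accumulate."""
    return initial_rate * math.exp(-decay * total_trials)

# After 1000 trials, a 0.1 starting rate drops to roughly 0.037
print(decayed_exploration_rate(0.1, 1000))
```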
Epsilon-Greedy Selection Strategy¶
Let's implement the core selection algorithm that decides which variant to use for each query:
def select_variant_for_query(self, query):
"""Select RAG variant using epsilon-greedy strategy."""
# Exploration: randomly select variant to gather more data
if np.random.random() < self.exploration_rate:
selected_variant = np.random.choice(self.variants)
selection_reason = "exploration"
else:
# Exploitation: select best performing variant
if self.total_trials == 0:
selected_variant = np.random.choice(self.variants)
selection_reason = "initial_random"
else:
best_variant = max(
self.arm_avg_rewards.items(),
key=lambda x: x[1]
)[0]
selected_variant = best_variant
selection_reason = "exploitation"
return {
'variant': selected_variant,
'reason': selection_reason,
'rag_system': self.rag_systems[selected_variant]
}
This selection strategy ensures you collect data on all variants while increasingly favoring those that perform better.
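Epsilon-greedy is the simplest policy; UCB1 is a popular alternative that prioritizes under-explored arms without a fixed exploration rate. A sketch of the selection rule, reusing the same `arm_counts` and `arm_avg_rewards` bookkeeping:

```python
import math

def select_variant_ucb1(arm_counts, arm_avg_rewards, total_trials):
    """UCB1: pick the arm with the highest upper confidence bound."""
    # Play every arm once before applying the UCB formula
    for variant, count in arm_counts.items():
        if count == 0:
            return variant
    return max(
        arm_counts,
        key=lambda v: arm_avg_rewards[v]
        + math.sqrt(2 * math.log(total_trials) / arm_counts[v]),
    )
```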
Performance Feedback Integration¶
Now let's implement the reward update mechanism that learns from user interactions:
def update_performance(self, variant, query, response, user_feedback=None):
"""Update variant performance based on query results."""
# Calculate reward based on available feedback
reward = self._calculate_reward(query, response, user_feedback)
# Update bandit statistics
self.arm_counts[variant] += 1
self.arm_rewards[variant] += reward
self.arm_avg_rewards[variant] = (
self.arm_rewards[variant] / self.arm_counts[variant]
)
self.total_trials += 1
# Record trial for analysis
trial_record = {
'trial': self.total_trials,
'variant': variant,
'query': query[:100], # Truncate for storage
'reward': reward,
'avg_reward': self.arm_avg_rewards[variant],
'timestamp': time.time()
}
self.trial_history.append(trial_record)
return trial_record
The feedback mechanism enables the bandit to learn which variants provide better user experiences over time.
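Putting selection and feedback together, a serving loop might look like the sketch below; the query stream, the RAG system's `answer` method, and `collect_user_feedback` are hypothetical stand-ins for your own plumbing:

```python
bandit = RAGMultiArmedBandit(variant_configs, exploration_rate=0.1)

for query in incoming_queries:  # your query stream
    selection = bandit.select_variant_for_query(query)
    response = selection["rag_system"].answer(query)   # hypothetical API
    feedback = collect_user_feedback(response)         # 'positive'/'negative'/None
    bandit.update_performance(selection["variant"], query, response, feedback)
```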
Reward Calculation Strategy¶
Let's implement a practical reward function that combines multiple quality signals:
def _calculate_reward(self, query, response, user_feedback=None):
"""Calculate reward based on response quality indicators."""
reward = 0.0
# User feedback (if available) - highest weight
if user_feedback is not None:
if user_feedback == 'positive':
reward += 1.0
elif user_feedback == 'negative':
reward -= 0.5
# neutral feedback adds 0
# Response quality heuristics
response_length = len(response.split())
# Reward appropriate response length
if 50 <= response_length <= 300:
reward += 0.3
elif response_length < 20:
reward -= 0.2 # Too brief
elif response_length > 500:
reward -= 0.1 # Too verbose
# Reward presence of citations or references
citation_patterns = ['[', 'according to', 'source:', 'reference:']
if any(pattern.lower() in response.lower() for pattern in citation_patterns):
reward += 0.2
# Basic content quality check
if 'I don\'t know' in response or 'I cannot' in response:
reward -= 0.3 # Penalize non-answers
# Clamp reward to the [0, 1] range
return max(0, min(1, reward))
This reward function combines explicit user feedback with implicit quality signals, enabling learning even when direct feedback isn't available.
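Given a constructed bandit instance, a couple of sanity checks with invented responses illustrate how the signals combine:

```python
short_refusal = "I don't know."
good_answer = " ".join(["detail"] * 120) + " according to the cited source."

# Brief non-answer: length and non-answer penalties, clamped to 0.0
print(bandit._calculate_reward("query", short_refusal))             # 0.0
# Well-sized, cited answer with positive feedback: clamped to 1.0
print(bandit._calculate_reward("query", good_answer, "positive"))   # 1.0
```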
📝 Automated Test Pipeline Implementation¶
Continuous Testing Infrastructure¶
Let's create an automated pipeline that runs regular tests to detect performance regressions:
class AutomatedRAGTestPipeline:
"""Automated testing pipeline for continuous RAG evaluation."""
def __init__(self, evaluation_framework, ab_testing_framework, test_configs):
self.evaluation_framework = evaluation_framework
self.ab_testing_framework = ab_testing_framework
self.test_configs = test_configs
# Test scheduling and history
self.test_schedule = {}
self.test_history = []
self.performance_baselines = {}
self.regression_alerts = []
def schedule_regression_test(self, test_name, rag_system, baseline_performance,
schedule_hours=24):
"""Schedule automated regression testing."""
test_config = {
'test_name': f"regression_{test_name}",
'rag_system': rag_system,
'baseline': baseline_performance,
'schedule_interval': schedule_hours,
'last_run': 0,
'test_dataset': self.test_configs.get('regression_dataset', [])
}
self.test_schedule[test_name] = test_config
self.performance_baselines[test_name] = baseline_performance
print(f"Scheduled regression test '{test_name}' every {schedule_hours} hours")
This automated pipeline ensures your RAG system performance doesn't degrade as you make changes or as data evolves over time.
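A minimal dispatch loop that honors the schedule might look like this; where you host it (cron job, worker thread, CI step) is up to you:

```python
import time

def run_due_tests(pipeline):
    """Run every scheduled regression test whose interval has elapsed."""
    now = time.time()
    for test_name, config in pipeline.test_schedule.items():
        interval_seconds = config['schedule_interval'] * 3600
        if now - config['last_run'] >= interval_seconds:
            pipeline.run_regression_detection(test_name)
            config['last_run'] = now
```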
Regression Detection Implementation¶
Now let's implement the core regression detection logic:
def run_regression_detection(self, test_name, change_threshold=0.05):
"""Run regression testing and flag metric drops larger than change_threshold (an absolute score change, not a p-value)."""
if test_name not in self.test_schedule:
raise ValueError(f"Test '{test_name}' not scheduled")
test_config = self.test_schedule[test_name]
baseline = self.performance_baselines[test_name]
print(f"Running regression test: {test_name}")
# Generate current performance data
current_results = self.evaluation_framework.evaluate_rag_system(
test_config['test_dataset'],
test_config['rag_system'],
{'include_ragas': True}
)
current_performance = current_results['aggregate_metrics']
# Compare against baseline
regression_analysis = {
'test_name': test_name,
'current_performance': current_performance,
'baseline_performance': baseline,
'regressions_detected': [],
'improvements_detected': [],
'timestamp': time.time()
}
The regression detector systematically compares current performance against established baselines to identify quality degradation.
We analyze performance changes across different metrics:
# Analyze each metric for regression
for metric_name in baseline:
if metric_name in current_performance:
baseline_score = baseline[metric_name]
current_score = current_performance[metric_name]
# Calculate change
change = current_score - baseline_score
change_percent = (change / baseline_score) * 100 if baseline_score > 0 else 0
# Detect significant regression (performance drop)
if change < -change_threshold:
regression_analysis['regressions_detected'].append({
'metric': metric_name,
'baseline_score': baseline_score,
'current_score': current_score,
'change': change,
'change_percent': change_percent,
'severity': 'high' if change_percent < -10 else 'medium'
})
# Detect improvements
elif change > change_threshold:
regression_analysis['improvements_detected'].append({
'metric': metric_name,
'baseline_score': baseline_score,
'current_score': current_score,
'change': change,
'change_percent': change_percent
})
# Update test history and trigger alerts if needed
self.test_history.append(regression_analysis)
if regression_analysis['regressions_detected']:
self._trigger_regression_alerts(regression_analysis)
return regression_analysis
This comprehensive analysis provides detailed insight into performance changes and automatically flags concerning regressions.
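The `_trigger_regression_alerts` hook referenced above isn't shown; a minimal sketch that records alerts and prints them (swap the print for your paging or chat integration) could be:

```python
def _trigger_regression_alerts(self, regression_analysis):
    """Record detected regressions and surface them (print stands in for paging)."""
    for regression in regression_analysis['regressions_detected']:
        alert = {
            'test_name': regression_analysis['test_name'],
            'metric': regression['metric'],
            'change_percent': regression['change_percent'],
            'severity': regression['severity'],
            'timestamp': regression_analysis['timestamp']
        }
        self.regression_alerts.append(alert)
        print(f"REGRESSION ALERT [{alert['severity']}] {alert['metric']} "
              f"changed {alert['change_percent']:.1f}% from baseline")
```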
📝 Practical Integration Examples¶
Production A/B Test Setup¶
Here's a practical example of setting up A/B tests for common RAG enhancements:
# Example: Testing Query Enhancement Strategies
def setup_query_enhancement_ab_test():
"""Example A/B test setup for query enhancement comparison."""
# Define variant configurations
variant_configs = {
'baseline': {
'query_enhancement': None,
'retrieval_method': 'standard',
'context_optimization': False
},
'hyde_enhanced': {
'query_enhancement': 'hyde',
'retrieval_method': 'standard',
'context_optimization': False
},
'query_expansion': {
'query_enhancement': 'expansion',
'retrieval_method': 'standard',
'context_optimization': True
},
'full_optimization': {
'query_enhancement': 'hyde',
'retrieval_method': 'hybrid',
'context_optimization': True
}
}
# Create test dataset (you would load your actual test data)
test_dataset = load_test_dataset('query_enhancement_evaluation')
# Setup A/B test
ab_framework = RAGABTestFramework(evaluation_framework)
test_result = ab_framework.create_ab_test(
'query_enhancement_comparison',
variant_configs,
test_dataset
)
return test_result
This example shows how to structure real-world A/B tests comparing different enhancement strategies.
Practice Exercises¶
Exercise 1: A/B Test Implementation¶
- Choose two RAG configurations to compare (e.g., with/without HyDE)
- Create test dataset with 50+ examples
- Implement complete A/B test with statistical analysis
- Document results and make deployment recommendations
Exercise 2: Multi-Armed Bandit Setup¶
- Implement 3-variant bandit system
- Define reward function based on your quality criteria
- Run simulation with 100+ queries
- Analyze learning curve and variant performance
Exercise 3: Automated Testing Pipeline¶
- Set up automated regression testing
- Establish performance baselines
- Implement alerting for quality degradation
- Test pipeline with intentional performance changes
Learning Path Summary¶
📝 Participant Path Complete: You've implemented scientific A/B testing frameworks, multi-armed bandit optimization, and automated regression detection pipelines. You can now systematically validate RAG enhancements and maintain quality over time.
Next Steps for Enterprise Implementation:
- ⚙️ Implementer Path: Advanced Custom Metrics → Build sophisticated domain-specific evaluators
- ⚙️ Implementer Path: Enterprise Monitoring → Production-scale monitoring and alerting systems
🧭 Navigation¶
Previous: Session 4 - Team Orchestration →
Next: Session 6 - Modular Architecture →