📝 Session 5: Automated Testing Practice

📝 PARTICIPANT PATH - Practical Testing Implementation

Prerequisites: Complete 🎯 Observer Path and RAGAS Practice
Time Investment: 2.5-3 hours
Outcome: Implement scientific A/B testing for RAG optimization

Learning Outcomes

By completing this section, you will:

  • Set up A/B testing frameworks for RAG component comparison
  • Implement statistical significance testing for enhancement validation
  • Create multi-armed bandit systems for adaptive optimization
  • Build automated test pipelines with regression detection

Prerequisites Check

Before starting implementation, ensure you have:

  • Completed the 🎯 Observer Path and the RAGAS evaluation practice from the earlier sessions
  • A working evaluation framework from the previous sessions (it is passed into the testing classes below)
  • A Python environment with numpy and scipy installed

📝 A/B Testing Framework Implementation

Scientific Enhancement Validation

The enhancement techniques from Session 4 (HyDE, query expansion, context optimization) are theoretically sound, but do they actually improve user outcomes? A/B testing provides the scientific rigor to answer this question definitively by comparing enhancement strategies under controlled conditions with statistical significance testing.

Let's implement a comprehensive A/B testing framework:

import numpy as np
import time
from scipy import stats
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Any, Optional

@dataclass
class ABTestResult:
    """Results from A/B testing comparison."""
    test_name: str
    variant_performance: Dict[str, float]
    statistical_significance: Dict[str, bool]
    effect_sizes: Dict[str, float]
    winner: Optional[str]
    confidence_level: float
    recommendation: str

This data structure captures all essential A/B test information in a format that's easy to analyze and report on.
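
As a quick illustration (not part of the framework itself), a small helper like the following could turn a completed ABTestResult into a readable summary for reporting; the function name and formatting are just one possible choice:

def summarize_ab_test(result: ABTestResult) -> str:
    """Produce a short, human-readable summary of an A/B test result."""
    lines = [f"A/B Test: {result.test_name} (confidence level: {result.confidence_level:.0%})"]
    # List variants from best to worst performance
    for variant, score in sorted(result.variant_performance.items(), key=lambda x: -x[1]):
        lines.append(f"  {variant}: {score:.3f}")
    lines.append(f"  Winner: {result.winner if result.winner else 'no clear winner'}")
    lines.append(f"  Recommendation: {result.recommendation}")
    return "\n".join(lines)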

A/B Testing Framework Core Implementation

Now let's build the main A/B testing system that can compare different RAG configurations:

class RAGABTestFramework:
    """Scientific A/B testing framework for RAG system optimization."""

    def __init__(self, evaluation_framework):
        self.evaluation_framework = evaluation_framework
        self.active_tests = {}
        self.test_history = []
        self.significance_threshold = 0.05  # 95% confidence level

    def create_ab_test(self, test_name, variant_configs, test_dataset):
        """Create new A/B test comparing RAG variants."""

        test_setup = {
            'test_name': test_name,
            'variants': variant_configs,
            'dataset': test_dataset,
            'start_time': time.time(),
            'status': 'created',
            'results': {}
        }

        # Validate test setup
        if len(variant_configs) < 2:
            raise ValueError("A/B test requires at least 2 variants")

        if len(test_dataset) < 20:
            print("Warning: Small dataset size may affect statistical power")

        self.active_tests[test_name] = test_setup

        print(f"Created A/B test '{test_name}' with {len(variant_configs)} variants")
        return test_setup

This framework provides the foundation for scientific comparison of RAG system variants, ensuring proper experimental design from the start.

Test Execution with Performance Measurement

Let's implement the core test execution that systematically evaluates each variant:

    def execute_ab_test(self, test_name, evaluation_metrics=None):
        """Execute A/B test and collect performance data."""

        if test_name not in self.active_tests:
            raise ValueError(f"Test '{test_name}' not found")

        test_setup = self.active_tests[test_name]
        test_setup['status'] = 'running'

        if evaluation_metrics is None:
            evaluation_metrics = ['faithfulness', 'answer_relevancy', 'context_precision']

        variant_results = {}

        print(f"Executing A/B test: {test_name}")
        print(f"Testing {len(test_setup['variants'])} variants on {len(test_setup['dataset'])} examples")

        # Test each variant systematically
        for variant_name, variant_config in test_setup['variants'].items():
            print(f"  Evaluating variant: {variant_name}")

            # Create RAG system with variant configuration
            variant_system = self._create_rag_variant(variant_config)

            # Generate responses for test dataset
            rag_responses = self._generate_variant_responses(
                variant_system, test_setup['dataset']
            )

This systematic approach ensures each variant is tested under identical conditions, providing reliable comparison data.

We continue with evaluation and results collection:

            # Evaluate variant performance
            evaluation_result = self.evaluation_framework.run_comprehensive_evaluation(
                rag_responses, include_ground_truth=True
            )

            variant_results[variant_name] = {
                'evaluation': evaluation_result,
                'response_times': [r.get('response_time', 0) for r in rag_responses],
                'success_rate': len([r for r in rag_responses if r['generated_answer']]) / len(rag_responses),
                'timestamp': time.time()
            }

            print(f"    Overall Score: {evaluation_result.get('overall_score', 'N/A'):.3f}")

        # Analyze results statistically
        analysis_result = self._analyze_statistical_significance(variant_results)

        # Complete test
        test_result = ABTestResult(
            test_name=test_name,
            variant_performance={name: result['evaluation']['overall_score']
                               for name, result in variant_results.items()},
            statistical_significance=analysis_result['significance'],
            effect_sizes=analysis_result['effect_sizes'],
            winner=analysis_result['winner'],
            confidence_level=1 - self.significance_threshold,
            recommendation=analysis_result['recommendation']
        )

        test_setup['results'] = test_result
        test_setup['status'] = 'completed'
        self.test_history.append(test_result)

        return test_result

The comprehensive analysis ensures you get actionable insights from your A/B tests, not just raw performance numbers.
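
The execution method above relies on two helpers, _create_rag_variant and _generate_variant_responses, whose details depend on your own RAG stack and test data format. A minimal sketch, assuming a hypothetical build_rag_system factory, a RAG object exposing a query() method, and test examples keyed by 'question', might look like this:

    def _create_rag_variant(self, variant_config):
        """Build a RAG system for a variant configuration (sketch)."""
        # build_rag_system is a placeholder for however your project constructs
        # a RAG pipeline from a configuration dictionary.
        return build_rag_system(**variant_config)

    def _generate_variant_responses(self, variant_system, test_dataset):
        """Run each test example through the variant and record responses (sketch)."""
        responses = []
        for example in test_dataset:
            start = time.time()
            answer = variant_system.query(example['question'])  # assumed interface
            responses.append({
                'question': example['question'],
                'generated_answer': answer,
                'contexts': example.get('contexts', []),
                'ground_truth': example.get('ground_truth', ''),
                'response_time': time.time() - start
            })
        return responses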

📝 Statistical Significance Implementation

Rigorous Statistical Analysis

Let's implement proper statistical testing to ensure your optimization decisions are based on real differences, not random variation:

    def _analyze_statistical_significance(self, variant_results):
        """Analyze A/B test results with statistical significance testing."""

        analysis = {
            'significance': {},
            'effect_sizes': {},
            'winner': None,
            'confidence_intervals': {},
            'recommendation': ''
        }

        variant_names = list(variant_results.keys())

        # Extract performance metrics for each variant
        performance_data = {}
        for variant_name, result in variant_results.items():
            # For proper statistical testing, we need individual response scores
            # Here we approximate with overall score (in practice, collect individual scores)
            overall_score = result['evaluation']['overall_score']
            response_times = result['response_times']

            performance_data[variant_name] = {
                'quality_scores': [overall_score] * len(response_times),  # Approximation
                'response_times': response_times,
                'sample_size': len(response_times)
            }

This statistical foundation enables rigorous comparison that accounts for sample size and natural variation in performance.

We implement pairwise comparisons with proper statistical tests:

        # Perform pairwise statistical comparisons
        for i, variant_a in enumerate(variant_names):
            for variant_b in variant_names[i+1:]:

                data_a = performance_data[variant_a]
                data_b = performance_data[variant_b]

                # Quality score comparison (t-test approximation)
                score_a = np.mean(data_a['quality_scores'])
                score_b = np.mean(data_b['quality_scores'])

                # Calculate effect size (Cohen's d approximation)
                pooled_std = np.sqrt(
                    (np.var(data_a['quality_scores']) + np.var(data_b['quality_scores'])) / 2
                )
                effect_size = abs(score_a - score_b) / (pooled_std + 1e-8)

                # Simple significance test (replace with proper t-test in practice)
                difference = abs(score_a - score_b)
                is_significant = difference > 0.05  # Simplified threshold

                comparison_key = f"{variant_a}_vs_{variant_b}"
                analysis['significance'][comparison_key] = is_significant
                analysis['effect_sizes'][comparison_key] = effect_size

Proper effect size calculation helps you understand not just whether differences are statistically significant, but whether they're practically meaningful.

Finally, we determine the winning variant and generate recommendations:

        # Determine overall winner based on performance and significance
        best_variant = max(
            variant_names,
            key=lambda v: variant_results[v]['evaluation']['overall_score']
        )

        # Check if winner is significantly better (pairwise keys may be stored in either order)
        winner_is_significant = any(
            analysis['significance'].get(f"{best_variant}_vs_{other}", False)
            or analysis['significance'].get(f"{other}_vs_{best_variant}", False)
            for other in variant_names if other != best_variant
        )

        if winner_is_significant:
            analysis['winner'] = best_variant
            analysis['recommendation'] = f"Deploy {best_variant}: significantly outperforms alternatives"
        else:
            analysis['winner'] = None
            analysis['recommendation'] = "No significant difference detected: consider longer test or larger sample"

        return analysis

This analysis provides clear, actionable guidance on whether to adopt new RAG enhancements or continue with existing configurations.
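
The simplified difference threshold used above is a stand-in. Once you collect per-query quality scores for each variant (rather than a single overall score), the comparison can be replaced with a proper two-sample test. A sketch using Welch's t-test from scipy (already imported as stats above) could look like this; the function is illustrative, not part of the framework:

def compare_variants_ttest(scores_a, scores_b, alpha=0.05):
    """Welch's t-test plus Cohen's d on per-query quality scores (sketch)."""
    # equal_var=False selects Welch's t-test, which does not assume equal variances
    t_stat, p_value = stats.ttest_ind(scores_a, scores_b, equal_var=False)

    # Cohen's d with a pooled standard deviation (sample variances, ddof=1)
    pooled_std = np.sqrt((np.var(scores_a, ddof=1) + np.var(scores_b, ddof=1)) / 2)
    effect_size = abs(np.mean(scores_a) - np.mean(scores_b)) / (pooled_std + 1e-8)

    return {
        'p_value': p_value,
        'significant': p_value < alpha,
        'effect_size': effect_size
    }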

📝 Multi-Armed Bandit Implementation

Adaptive Optimization Strategy

Multi-armed bandit algorithms provide an alternative to traditional A/B testing by adaptively learning which variants perform best while continuing to serve users:

class RAGMultiArmedBandit:
    """Adaptive multi-armed bandit for RAG system optimization."""

    def __init__(self, variant_configs, exploration_rate=0.1):
        self.variants = list(variant_configs.keys())
        self.variant_configs = variant_configs
        self.exploration_rate = exploration_rate

        # Initialize bandit arms
        self.arm_counts = {variant: 0 for variant in self.variants}
        self.arm_rewards = {variant: 0.0 for variant in self.variants}
        self.arm_avg_rewards = {variant: 0.0 for variant in self.variants}

        # Tracking
        self.total_trials = 0
        self.trial_history = []
        self.rag_systems = {}

        # Initialize RAG systems for each variant
        self._initialize_rag_variants()

The bandit approach balances exploration of potentially better variants with exploitation of currently best-performing ones.
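
The constructor calls _initialize_rag_variants, which is left to your own stack. A minimal sketch, again assuming a hypothetical build_rag_system factory, could be:

    def _initialize_rag_variants(self):
        """Construct one RAG system per variant configuration (sketch)."""
        for variant_name, config in self.variant_configs.items():
            # build_rag_system is a placeholder for your own pipeline constructor
            self.rag_systems[variant_name] = build_rag_system(**config)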

Epsilon-Greedy Selection Strategy

Let's implement the core selection algorithm that decides which variant to use for each query:

    def select_variant_for_query(self, query):
        """Select RAG variant using epsilon-greedy strategy."""

        # Exploration: randomly select variant to gather more data
        if np.random.random() < self.exploration_rate:
            selected_variant = np.random.choice(self.variants)
            selection_reason = "exploration"
        else:
            # Exploitation: select best performing variant
            if self.total_trials == 0:
                selected_variant = np.random.choice(self.variants)
                selection_reason = "initial_random"
            else:
                best_variant = max(
                    self.arm_avg_rewards.items(),
                    key=lambda x: x[1]
                )[0]
                selected_variant = best_variant
                selection_reason = "exploitation"

        return {
            'variant': selected_variant,
            'reason': selection_reason,
            'rag_system': self.rag_systems[selected_variant]
        }

This selection strategy ensures you collect data on all variants while increasingly favoring those that perform better.

Performance Feedback Integration

Now let's implement the reward update mechanism that learns from user interactions:

    def update_performance(self, variant, query, response, user_feedback=None):
        """Update variant performance based on query results."""

        # Calculate reward based on available feedback
        reward = self._calculate_reward(query, response, user_feedback)

        # Update bandit statistics
        self.arm_counts[variant] += 1
        self.arm_rewards[variant] += reward
        self.arm_avg_rewards[variant] = (
            self.arm_rewards[variant] / self.arm_counts[variant]
        )

        self.total_trials += 1

        # Record trial for analysis
        trial_record = {
            'trial': self.total_trials,
            'variant': variant,
            'query': query[:100],  # Truncate for storage
            'reward': reward,
            'avg_reward': self.arm_avg_rewards[variant],
            'timestamp': time.time()
        }

        self.trial_history.append(trial_record)

        return trial_record

The feedback mechanism enables the bandit to learn which variants provide better user experiences over time.

Reward Calculation Strategy

Let's implement a practical reward function that combines multiple quality signals:

    def _calculate_reward(self, query, response, user_feedback=None):
        """Calculate reward based on response quality indicators."""

        reward = 0.0

        # User feedback (if available) - highest weight
        if user_feedback is not None:
            if user_feedback == 'positive':
                reward += 1.0
            elif user_feedback == 'negative':
                reward -= 0.5
            # neutral feedback adds 0

        # Response quality heuristics
        response_length = len(response.split())

        # Reward appropriate response length
        if 50 <= response_length <= 300:
            reward += 0.3
        elif response_length < 20:
            reward -= 0.2  # Too brief
        elif response_length > 500:
            reward -= 0.1  # Too verbose

        # Reward presence of citations or references
        citation_patterns = ['[', 'according to', 'source:', 'reference:']
        if any(pattern.lower() in response.lower() for pattern in citation_patterns):
            reward += 0.2

        # Basic content quality check (case-insensitive)
        response_lower = response.lower()
        if "i don't know" in response_lower or 'i cannot' in response_lower:
            reward -= 0.3  # Penalize non-answers

        # Clamp reward to the [0, 1] range
        return max(0, min(1, reward))

This reward function combines explicit user feedback with implicit quality signals, enabling learning even when direct feedback isn't available.
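
Putting the pieces together, a serving loop that routes live queries through the bandit might look roughly like the sketch below. The incoming_queries stream, the query() method on the selected RAG system, and the collect_feedback_if_available helper are all placeholders for your own application code:

bandit = RAGMultiArmedBandit(variant_configs, exploration_rate=0.1)

for query in incoming_queries:  # placeholder: your stream of user queries
    selection = bandit.select_variant_for_query(query)

    # Assumed interface: the selected RAG system answers the query directly
    response = selection['rag_system'].query(query)

    # user_feedback may be None when no explicit signal is available
    user_feedback = collect_feedback_if_available(query, response)  # hypothetical helper
    bandit.update_performance(selection['variant'], query, response, user_feedback)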

📝 Automated Test Pipeline Implementation

Continuous Testing Infrastructure

Let's create an automated pipeline that runs regular tests to detect performance regressions:

class AutomatedRAGTestPipeline:
    """Automated testing pipeline for continuous RAG evaluation."""

    def __init__(self, evaluation_framework, ab_testing_framework, test_configs):
        self.evaluation_framework = evaluation_framework
        self.ab_testing_framework = ab_testing_framework
        self.test_configs = test_configs

        # Test scheduling and history
        self.test_schedule = {}
        self.test_history = []
        self.performance_baselines = {}
        self.regression_alerts = []

    def schedule_regression_test(self, test_name, rag_system, baseline_performance,
                               schedule_hours=24):
        """Schedule automated regression testing."""

        test_config = {
            'test_name': f"regression_{test_name}",
            'rag_system': rag_system,
            'baseline': baseline_performance,
            'schedule_interval': schedule_hours,
            'last_run': 0,
            'test_dataset': self.test_configs.get('regression_dataset', [])
        }

        self.test_schedule[test_name] = test_config
        self.performance_baselines[test_name] = baseline_performance

        print(f"Scheduled regression test '{test_name}' every {schedule_hours} hours")

This automated pipeline ensures your RAG system performance doesn't degrade as you make changes or as data evolves over time.
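
The pipeline stores each test's schedule_interval (in hours) and last_run timestamp, so a simple driver can check what is due and run it. A sketch of such a method, intended to be called periodically from a cron job or background worker, might be:

    def run_due_tests(self):
        """Run every scheduled regression test whose interval has elapsed (sketch)."""
        results = []
        now = time.time()
        for test_name, config in self.test_schedule.items():
            due_at = config['last_run'] + config['schedule_interval'] * 3600  # hours -> seconds
            if now >= due_at:
                results.append(self.run_regression_detection(test_name))
                config['last_run'] = now
        return results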

Regression Detection Implementation

Now let's implement the core regression detection logic:

    def run_regression_detection(self, test_name, significance_threshold=0.05):
        """Run regression testing and detect performance drops."""

        if test_name not in self.test_schedule:
            raise ValueError(f"Test '{test_name}' not scheduled")

        test_config = self.test_schedule[test_name]
        baseline = self.performance_baselines[test_name]

        print(f"Running regression test: {test_name}")

        # Generate current performance data
        current_results = self.evaluation_framework.evaluate_rag_system(
            test_config['test_dataset'],
            test_config['rag_system'],
            {'include_ragas': True}
        )

        current_performance = current_results['aggregate_metrics']

        # Compare against baseline
        regression_analysis = {
            'test_name': test_name,
            'current_performance': current_performance,
            'baseline_performance': baseline,
            'regressions_detected': [],
            'improvements_detected': [],
            'timestamp': time.time()
        }

The regression detector systematically compares current performance against established baselines to identify quality degradation.

We analyze performance changes across different metrics:

        # Analyze each metric for regression
        for metric_name in baseline:
            if metric_name in current_performance:
                baseline_score = baseline[metric_name]
                current_score = current_performance[metric_name]

                # Calculate change
                change = current_score - baseline_score
                change_percent = (change / baseline_score) * 100 if baseline_score > 0 else 0

                # Flag a regression when the score drop exceeds the threshold
                if change < -significance_threshold:
                    regression_analysis['regressions_detected'].append({
                        'metric': metric_name,
                        'baseline_score': baseline_score,
                        'current_score': current_score,
                        'change': change,
                        'change_percent': change_percent,
                        'severity': 'high' if change_percent < -10 else 'medium'
                    })

                # Detect improvements
                elif change > significance_threshold:
                    regression_analysis['improvements_detected'].append({
                        'metric': metric_name,
                        'baseline_score': baseline_score,
                        'current_score': current_score,
                        'change': change,
                        'change_percent': change_percent
                    })

        # Update test history and trigger alerts if needed
        self.test_history.append(regression_analysis)

        if regression_analysis['regressions_detected']:
            self._trigger_regression_alerts(regression_analysis)

        return regression_analysis

This comprehensive analysis provides detailed insight into performance changes and automatically flags concerning regressions.
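
The detection method calls _trigger_regression_alerts, whose delivery channel (email, chat, logging) is up to you. A minimal sketch that records alerts on the pipeline and prints them could be:

    def _trigger_regression_alerts(self, regression_analysis):
        """Record detected regressions and emit a simple console alert (sketch)."""
        for regression in regression_analysis['regressions_detected']:
            alert = {
                'test_name': regression_analysis['test_name'],
                'metric': regression['metric'],
                'severity': regression['severity'],
                'change_percent': regression['change_percent'],
                'timestamp': regression_analysis['timestamp']
            }
            self.regression_alerts.append(alert)
            print(f"ALERT [{alert['severity']}] {alert['test_name']}: "
                  f"{alert['metric']} dropped {alert['change_percent']:.1f}%")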

📝 Practical Integration Examples

Production A/B Test Setup

Here's a practical example of setting up A/B tests for common RAG enhancements:

# Example: Testing Query Enhancement Strategies
def setup_query_enhancement_ab_test():
    """Example A/B test setup for query enhancement comparison."""

    # Define variant configurations
    variant_configs = {
        'baseline': {
            'query_enhancement': None,
            'retrieval_method': 'standard',
            'context_optimization': False
        },
        'hyde_enhanced': {
            'query_enhancement': 'hyde',
            'retrieval_method': 'standard',
            'context_optimization': False
        },
        'query_expansion': {
            'query_enhancement': 'expansion',
            'retrieval_method': 'standard',
            'context_optimization': True
        },
        'full_optimization': {
            'query_enhancement': 'hyde',
            'retrieval_method': 'hybrid',
            'context_optimization': True
        }
    }

    # Create test dataset (you would load your actual test data)
    test_dataset = load_test_dataset('query_enhancement_evaluation')

    # Setup A/B test
    ab_framework = RAGABTestFramework(evaluation_framework)
    test_result = ab_framework.create_ab_test(
        'query_enhancement_comparison',
        variant_configs,
        test_dataset
    )

    return test_result

This example shows how to structure real-world A/B tests comparing different enhancement strategies.
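
Once the test is created, running it and acting on the outcome follows the same pattern shown earlier. A brief sketch, assuming the ab_framework instance from the setup above:

# Execute the test and inspect the result (sketch)
result = ab_framework.execute_ab_test('query_enhancement_comparison')

print(f"Winner: {result.winner or 'no significant winner'}")
print(f"Recommendation: {result.recommendation}")
for variant, score in result.variant_performance.items():
    print(f"  {variant}: {score:.3f}")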

Practice Exercises

Exercise 1: A/B Test Implementation

  1. Choose two RAG configurations to compare (e.g., with/without HyDE)
  2. Create test dataset with 50+ examples
  3. Implement complete A/B test with statistical analysis
  4. Document results and make deployment recommendations

Exercise 2: Multi-Armed Bandit Setup

  1. Implement 3-variant bandit system
  2. Define reward function based on your quality criteria
  3. Run simulation with 100+ queries
  4. Analyze learning curve and variant performance

Exercise 3: Automated Testing Pipeline

  1. Set up automated regression testing
  2. Establish performance baselines
  3. Implement alerting for quality degradation
  4. Test pipeline with intentional performance changes

Learning Path Summary

📝 Participant Path Complete: You've implemented scientific A/B testing frameworks, multi-armed bandit optimization, and automated regression detection pipelines. You can now systematically validate RAG enhancements and maintain quality over time.

Next Steps for Enterprise Implementation:


Previous: Session 4 - Team Orchestration →
Next: Session 6 - Modular Architecture →