# ⚙️ Session 6 Advanced: Code GraphRAG Implementation
⚙️ IMPLEMENTER PATH CONTENT

- Prerequisites: Complete 🎯 Observer and 📝 Participant paths
- Time Investment: 2-3 hours
- Outcome: Master code analysis with graph-based reasoning
## Advanced Learning Outcomes
After completing this module, you will master:
- AST-based code analysis for graph construction
- Software dependency modeling with GraphRAG
- Code pattern recognition through graph traversal
- Integration with development workflow tools
## Understanding Software Knowledge Graphs
Code GraphRAG transforms software repositories into queryable knowledge graphs where:
- Nodes represent functions, classes, modules, and dependencies
- Edges represent calls, imports, inheritance, and data flow
- Attributes capture metrics, documentation, and metadata
This enables complex queries like "Find all functions that use deprecated APIs" or "Trace data flow from input validation to database operations."
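To make this concrete, here is a minimal sketch of such a graph using networkx. The node names and attributes are illustrative, not taken from any particular codebase:

```python
import networkx as nx

# Build a toy code graph: nodes are code entities, edges are relationships
code_graph = nx.DiGraph()
code_graph.add_node("validate_input", type="function", uses_deprecated_api=False)
code_graph.add_node("save_user", type="function", uses_deprecated_api=False)
code_graph.add_node("legacy_hash", type="function", uses_deprecated_api=True)
code_graph.add_edge("validate_input", "save_user", relationship="calls")
code_graph.add_edge("save_user", "legacy_hash", relationship="calls")

# Query 1: which functions use deprecated APIs?
deprecated = [n for n, d in code_graph.nodes(data=True)
              if d.get("uses_deprecated_api")]

# Query 2: trace call paths from input validation to a given function
paths = list(nx.all_simple_paths(code_graph, "validate_input", "legacy_hash"))

print(deprecated)  # ['legacy_hash']
print(paths)       # [['validate_input', 'save_user', 'legacy_hash']]
```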
## Advanced AST-Based Graph Construction

### Multi-Language AST Processing

Different programming languages require specialized AST processing approaches:
```python
class MultiLanguageASTProcessor:
    """Advanced AST processor for multiple programming languages"""

    def __init__(self):
        self.language_processors = {
            'python': PythonASTProcessor(),
            'javascript': JavaScriptASTProcessor(),
            'java': JavaASTProcessor(),
            'typescript': TypeScriptASTProcessor()
        }
```
The multi-language approach enables consistent graph representation across polyglot codebases while respecting language-specific semantics.
```python
    def process_file(self, file_path, content):
        """Process source file based on language detection"""
        language = self.detect_language(file_path)
        processor = self.language_processors.get(language)

        if not processor:
            return self.fallback_processing(file_path, content)

        # Language-specific AST parsing
        ast_tree = processor.parse_ast(content)

        # Extract unified node representations
        code_nodes = self.extract_unified_nodes(ast_tree, language)
        code_edges = self.extract_relationships(ast_tree, language)

        return {
            'nodes': code_nodes,
            'edges': code_edges,
            'metadata': {
                'language': language,
                'file_path': file_path,
                'complexity_metrics': processor.calculate_complexity(ast_tree)
            }
        }
```
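As a reference point, a minimal `PythonASTProcessor` could wrap the standard library's `ast` module. The class follows the `parse_ast`/`calculate_complexity` interface assumed above, but the implementation details here are a sketch, not the course's canonical version:

```python
import ast

class PythonASTProcessor:
    """Minimal Python backend for the processor interface sketched above."""

    def parse_ast(self, content):
        # Parse source text into a standard-library AST
        return ast.parse(content)

    def calculate_complexity(self, ast_tree):
        # Rough complexity proxy: count branching constructs in the tree
        branch_types = (ast.If, ast.For, ast.While, ast.Try, ast.BoolOp)
        branches = sum(isinstance(node, branch_types)
                       for node in ast.walk(ast_tree))
        return {'branch_count': branches}
```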
### Advanced Function Analysis

Function nodes require rich metadata for meaningful graph queries:
```python
import ast

def analyze_function_advanced(func_node, ast_tree, file_context):
    """Extract comprehensive function metadata"""
    # Basic function information
    func_info = {
        'name': func_node.name,
        'line_start': func_node.lineno,
        'line_end': func_node.end_lineno,
        'type': 'function'
    }
```
Basic information provides the foundation for graph construction and source code linking.
```python
    # Advanced static analysis
    func_info.update({
        'cyclomatic_complexity': calculate_cyclomatic_complexity(func_node),
        'parameter_count': len(func_node.args.args),
        'return_complexity': analyze_return_patterns(func_node),
        'side_effects': detect_side_effects(func_node),
        'external_dependencies': find_external_calls(func_node)
    })
```
Static analysis metrics enable quality assessment and architectural pattern detection through graph queries.
```python
    # Documentation and semantic analysis
    docstring = ast.get_docstring(func_node)
    if docstring:
        func_info.update({
            'documentation': docstring,
            'semantic_tags': extract_semantic_tags(docstring),
            'api_stability': assess_api_stability(docstring)
        })
```
Semantic analysis of documentation enables natural language queries about code functionality and API stability assessments.
```python
    # Control flow analysis
    func_info['control_flow'] = {
        'branches': count_conditional_branches(func_node),
        'loops': identify_loop_structures(func_node),
        'error_handling': analyze_exception_handling(func_node),
        'async_patterns': detect_async_patterns(func_node)
    }

    return func_info
```
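The helper names above (`calculate_cyclomatic_complexity`, `detect_side_effects`, and so on) are left abstract in this section. As one example, a straightforward cyclomatic complexity computation over a Python AST might look like this; treat it as a sketch rather than the section's actual helper:

```python
import ast

def calculate_cyclomatic_complexity(func_node):
    """Cyclomatic complexity = 1 + the number of decision points."""
    complexity = 1
    for node in ast.walk(func_node):
        # Each branch point adds one independent path through the function
        if isinstance(node, (ast.If, ast.For, ast.While,
                             ast.ExceptHandler, ast.IfExp, ast.Assert)):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # 'a and b or c' adds one path per extra boolean operand
            complexity += len(node.values) - 1
    return complexity
```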
## Dependency Modeling and Analysis

### Advanced Dependency Graph Construction

Software dependencies form complex networks requiring sophisticated modeling:
```python
import networkx as nx

class AdvancedDependencyAnalyzer:
    """Comprehensive dependency analysis for code graphs"""

    def __init__(self, project_root):
        self.project_root = project_root
        self.dependency_types = {
            'import': 'direct module import',
            'call': 'function/method invocation',
            'inheritance': 'class inheritance relationship',
            'composition': 'object composition pattern',
            'data_flow': 'data passing between components'
        }
```
Different dependency types require different analysis approaches and have different implications for code understanding.
```python
    def analyze_project_dependencies(self):
        """Build comprehensive dependency graph"""
        all_files = self.discover_source_files()
        dependency_graph = nx.DiGraph()

        # First pass: identify all code entities
        for file_path in all_files:
            entities = self.extract_code_entities(file_path)
            for entity in entities:
                dependency_graph.add_node(
                    entity['id'],
                    **entity['metadata']
                )
```
The two-pass approach ensures all entities are registered before relationships are analyzed, so forward references and circular dependencies resolve correctly.
```python
        # Second pass: analyze relationships
        for file_path in all_files:
            relationships = self.analyze_file_relationships(
                file_path,
                dependency_graph
            )
            for rel in relationships:
                if rel['target'] in dependency_graph.nodes:
                    dependency_graph.add_edge(
                        rel['source'],
                        rel['target'],
                        relationship_type=rel['type'],
                        strength=rel['strength'],
                        evidence=rel['evidence']
                    )

        return dependency_graph
```
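For the first pass, an entity extractor for Python files might use the `ast` module along these lines. This is a sketch under the assumption that entity IDs are file-qualified names; the real `extract_code_entities` is not shown in this section:

```python
import ast

def extract_code_entities(file_path):
    """Yield function and class entities from one Python source file."""
    with open(file_path, encoding='utf-8') as f:
        tree = ast.parse(f.read())

    entities = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef,
                             ast.ClassDef)):
            entities.append({
                # File-qualified ID keeps names unique across modules
                'id': f"{file_path}::{node.name}",
                'metadata': {
                    'name': node.name,
                    'type': 'class' if isinstance(node, ast.ClassDef)
                            else 'function',
                    'line_start': node.lineno,
                    'file_path': file_path
                }
            })
    return entities
```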
### Circular Dependency Detection

Circular dependencies indicate architectural issues and can be detected through graph analysis:
```python
def detect_circular_dependencies(dependency_graph, max_cycle_length=10):
    """Identify circular dependencies with detailed analysis"""
    cycles = []

    # Find all strongly connected components
    scc_list = list(nx.strongly_connected_components(dependency_graph))

    for scc in scc_list:
        # True circular dependency, within the size limit
        if 1 < len(scc) <= max_cycle_length:
            # Analyze the cycle structure
            subgraph = dependency_graph.subgraph(scc)
            cycle_info = {
                'components': list(scc),
                'cycle_length': len(scc),
                'complexity_score': calculate_cycle_complexity(subgraph)
            }
```
Strongly connected components reveal the structure of circular dependencies and their complexity impact.
```python
            # Find an example cycle path for explanation
            try:
                first_node = next(iter(scc))
                # nx.shortest_path(G, n, n) trivially returns [n],
                # so use find_cycle to get an actual cycle
                cycle_edges = nx.find_cycle(subgraph, source=first_node)
                cycle_path = [edge[0] for edge in cycle_edges]
                cycle_path.append(cycle_edges[-1][1])
                cycle_info['example_path'] = cycle_path
                cycle_info['path_relationships'] = [
                    dependency_graph[cycle_path[i]][cycle_path[i + 1]]
                    for i in range(len(cycle_path) - 1)
                ]
            except nx.NetworkXNoCycle:
                cycle_info['example_path'] = None

            cycles.append(cycle_info)

    return sorted(cycles, key=lambda x: x['complexity_score'], reverse=True)
```
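A quick usage sketch on a toy graph (the module names are illustrative, and a stand-in is supplied for `calculate_cycle_complexity`, which the section leaves undefined):

```python
import networkx as nx

# Stand-in for the helper the section leaves undefined
def calculate_cycle_complexity(subgraph):
    return nx.density(subgraph)

g = nx.DiGraph()
g.add_edges_from([
    ('auth', 'users'), ('users', 'billing'), ('billing', 'auth'),  # 3-cycle
    ('billing', 'email')                                           # acyclic branch
])

for cycle in detect_circular_dependencies(g):
    print(cycle['components'], cycle['example_path'])
# e.g. ['auth', 'users', 'billing'] ['auth', 'users', 'billing', 'auth']
```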
## Advanced Code Pattern Recognition

### Architectural Pattern Detection

Graph traversal can identify common architectural patterns in codebases:
```python
def detect_architectural_patterns(code_graph):
    """Identify common architectural patterns through graph analysis"""
    patterns_detected = {}

    # Singleton pattern detection
    singletons = detect_singleton_pattern(code_graph)
    if singletons:
        patterns_detected['singleton'] = {
            'instances': singletons,
            'confidence': calculate_pattern_confidence(singletons, 'singleton')
        }
```
Pattern detection uses graph structure and code analysis to identify design patterns with confidence scores.
```python
    # Factory pattern detection
    factories = detect_factory_pattern(code_graph)
    if factories:
        patterns_detected['factory'] = {
            'instances': factories,
            'variants': classify_factory_variants(factories),
            'complexity': assess_factory_complexity(factories)
        }
```
Factory pattern detection identifies different variants (Simple Factory, Factory Method, Abstract Factory) based on graph structure.
```python
    # Observer pattern detection
    observers = detect_observer_pattern(code_graph)
    if observers:
        patterns_detected['observer'] = {
            'subjects': [obs['subject'] for obs in observers],
            'observers': [obs['observers'] for obs in observers],
            'event_flow': trace_event_propagation(observers, code_graph)
        }

    return patterns_detected
```
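Only the Singleton detector is shown in full below. As a flavor of what `detect_factory_pattern` might check, here is one simple heuristic; it assumes the graph records `instantiates` edges from classes to the types they create, which this section does not specify:

```python
def detect_factory_pattern(code_graph):
    """Heuristic: a class is factory-like if it creates and returns
    instances of several distinct product types."""
    candidates = []
    for node_id, data in code_graph.nodes(data=True):
        if data.get('type') != 'class':
            continue

        # Collect outgoing 'instantiates' edges from this class
        created = [
            target for _, target, edge_data
            in code_graph.out_edges(node_id, data=True)
            if edge_data.get('relationship_type') == 'instantiates'
        ]

        if len(set(created)) >= 2:  # at least two distinct product types
            candidates.append({
                'class_id': node_id,
                'products': sorted(set(created))
            })
    return candidates
```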
### Singleton Pattern Detection Algorithm
```python
def detect_singleton_pattern(code_graph):
    """Detect Singleton pattern through graph analysis"""
    potential_singletons = []

    for node_id, node_data in code_graph.nodes(data=True):
        if node_data.get('type') != 'class':
            continue

        class_info = node_data

        # Check for singleton characteristics
        singleton_indicators = {
            'private_constructor': False,
            'static_instance_method': False,
            'instance_storage': False,
            'thread_safety': False  # assessed via analyze_thread_safety below
        }
```
Singleton detection analyzes class structure for pattern-specific characteristics rather than relying on naming conventions.
```python
        # Analyze class methods
        class_methods = get_class_methods(code_graph, node_id)
        for method in class_methods:
            method_data = code_graph.nodes[method]

            # Check for private constructor
            if (method_data.get('name') == '__init__' and
                    method_data.get('visibility') == 'private'):
                singleton_indicators['private_constructor'] = True

            # Check for getInstance-style method
            if (method_data.get('is_static') and
                    'instance' in method_data.get('name', '').lower()):
                singleton_indicators['static_instance_method'] = True
```
The analysis examines method characteristics to identify the structural elements of the Singleton pattern.
```python
        # Check for instance storage (class-level variable)
        class_variables = get_class_variables(code_graph, node_id)
        for var in class_variables:
            if ('instance' in var.get('name', '').lower() and
                    var.get('is_static')):
                singleton_indicators['instance_storage'] = True
                break

        # Calculate confidence score
        confidence = sum(singleton_indicators.values()) / len(singleton_indicators)

        if confidence >= 0.5:  # At least half the indicators present
            potential_singletons.append({
                'class_id': node_id,
                'class_name': class_info.get('name'),
                'confidence': confidence,
                'indicators': singleton_indicators,
                'thread_safety': analyze_thread_safety(code_graph, node_id)
            })

    return potential_singletons
```
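The helper `get_class_methods` can be a simple edge filter. The sketch below assumes classes link to their methods via `contains` edges, a convention this section does not pin down:

```python
def get_class_methods(code_graph, class_id):
    """Return method node IDs reachable from a class via 'contains' edges."""
    return [
        target for _, target, edge_data
        in code_graph.out_edges(class_id, data=True)
        if edge_data.get('relationship_type') == 'contains'
        and code_graph.nodes[target].get('type') == 'method'
    ]
```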
## Integration with Development Workflows

### Git History Integration

Code GraphRAG can incorporate version control history for temporal analysis:
```python
import git  # GitPython
import networkx as nx

class GitIntegratedCodeAnalysis:
    """Integrate Git history with code graph analysis"""

    def __init__(self, repo_path):
        self.repo = git.Repo(repo_path)
        self.code_graph = nx.DiGraph()
        self.temporal_data = {}
```
Git integration enables analysis of code evolution, author contributions, and change impact assessment.
```python
    def analyze_code_evolution(self, file_path, commit_range=None):
        """Analyze how code structure evolved over time"""
        if commit_range is None:
            commit_range = "HEAD~10..HEAD"  # Last 10 commits

        # iter_commits yields newest first; reverse for chronological deltas
        commits = list(self.repo.iter_commits(commit_range, paths=file_path))
        commits.reverse()
        evolution_data = []

        for commit in commits:
            try:
                # Get file content at this commit
                file_content = self.repo.git.show(f"{commit.hexsha}:{file_path}")

                # Analyze code structure at this point in time
                ast_analysis = self.analyze_file_ast(file_content)

                evolution_data.append({
                    'commit': commit.hexsha,
                    'timestamp': commit.committed_datetime,
                    'author': commit.author.name,
                    'message': commit.message.strip(),
                    'code_metrics': ast_analysis['metrics'],
                    'structure_changes': self.detect_structural_changes(
                        ast_analysis,
                        evolution_data[-1] if evolution_data else None
                    )
                })
            except git.exc.GitCommandError:
                continue  # Skip commits where the file doesn't exist

        return evolution_data
```
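Usage might look like the following; the repository path and file are placeholders, and the helper methods `analyze_file_ast` and `detect_structural_changes` are assumed to be implemented elsewhere:

```python
analyzer = GitIntegratedCodeAnalysis("/path/to/repo")
history = analyzer.analyze_code_evolution("src/services/payment.py")

for snapshot in history:
    print(snapshot['commit'][:8], snapshot['author'], snapshot['code_metrics'])
```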
### Code Quality Assessment Through Graph Metrics

Graph metrics provide objective code quality indicators:
```python
def assess_code_quality_through_graph(code_graph):
    """Calculate code quality metrics using graph analysis"""
    quality_metrics = {}

    # Coupling metrics
    quality_metrics['coupling'] = {
        'afferent_coupling': calculate_afferent_coupling(code_graph),
        'efferent_coupling': calculate_efferent_coupling(code_graph),
        'instability': calculate_instability_metric(code_graph)
    }
```
Coupling metrics assess how interconnected different code components are, indicating maintainability challenges.
```python
    # Cohesion analysis
    quality_metrics['cohesion'] = {
        'module_cohesion': analyze_module_cohesion(code_graph),
        'class_cohesion': analyze_class_cohesion(code_graph),
        'functional_cohesion': assess_functional_cohesion(code_graph)
    }
```
Cohesion metrics evaluate how well-focused individual components are, indicating design quality.
```python
    # Complexity indicators
    quality_metrics['complexity'] = {
        'structural_complexity': nx.density(code_graph),
        'cyclomatic_complexity': aggregate_cyclomatic_complexity(code_graph),
        'dependency_depth': calculate_max_dependency_depth(code_graph),
        'fan_out_complexity': analyze_fan_out_patterns(code_graph)
    }

    # Generate quality score
    quality_metrics['overall_score'] = calculate_composite_quality_score(
        quality_metrics
    )

    return quality_metrics
```
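The coupling helpers follow standard definitions. For instance, Robert Martin's instability metric I = Ce / (Ca + Ce) can be computed per node directly from edge counts; this is a sketch, and the section's own `calculate_instability_metric` may aggregate differently:

```python
def calculate_instability_metric(code_graph):
    """Per-node instability I = Ce / (Ca + Ce), where Ce counts outgoing
    (efferent) dependencies and Ca counts incoming (afferent) ones."""
    instability = {}
    for node in code_graph.nodes:
        ce = code_graph.out_degree(node)  # who this node depends on
        ca = code_graph.in_degree(node)   # who depends on this node
        instability[node] = ce / (ca + ce) if (ca + ce) else 0.0
    return instability
```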
## Advanced Query Patterns for Code Analysis

### Multi-Hop Code Queries

GraphRAG enables sophisticated code queries that span multiple relationships:
```python
def find_security_vulnerable_paths(code_graph, entry_points, sink_functions):
    """Find potential security vulnerability paths in code"""
    vulnerability_paths = []

    for entry_point in entry_points:
        for sink in sink_functions:
            try:
                # Find all paths from entry points to dangerous sinks
                paths = nx.all_simple_paths(
                    code_graph,
                    entry_point,
                    sink,
                    cutoff=8  # Maximum path length to prevent explosion
                )

                for path in paths:
                    path_analysis = analyze_security_path(code_graph, path)

                    if path_analysis['risk_score'] > 0.6:
                        vulnerability_paths.append({
                            'path': path,
                            'risk_score': path_analysis['risk_score'],
                            'vulnerabilities': path_analysis['vulnerabilities'],
                            'entry_point': entry_point,
                            'sink_function': sink,
                            'mitigation_suggestions': suggest_mitigations(
                                path_analysis
                            )
                        })
            except nx.NodeNotFound:
                # all_simple_paths yields nothing when no path exists;
                # it only raises if a node is missing from the graph
                continue

    return sorted(vulnerability_paths, key=lambda x: x['risk_score'],
                  reverse=True)
```
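A usage sketch follows; the entry-point and sink names are hypothetical examples of web handlers and dangerous operations, not functions defined in this section:

```python
entry_points = ['handle_login_request', 'handle_search_request']
sink_functions = ['execute_raw_sql', 'run_shell_command']

findings = find_security_vulnerable_paths(code_graph, entry_points,
                                          sink_functions)
for finding in findings[:5]:
    print(f"{finding['risk_score']:.2f}", " -> ".join(finding['path']))
```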
### API Usage Analysis

Track how APIs are used across the codebase through graph traversal:
```python
def analyze_api_usage_patterns(code_graph, api_functions):
    """Analyze how specific APIs are used across the codebase"""
    usage_analysis = {}

    for api_func in api_functions:
        if api_func not in code_graph.nodes:
            continue

        # Find all callers of this API
        callers = list(code_graph.predecessors(api_func))

        usage_patterns = {
            'total_usage_count': len(callers),
            'usage_contexts': [],
            'parameter_patterns': {},
            'error_handling_analysis': {}
        }
```
API usage analysis helps understand how interfaces are consumed and identifies potential improvement opportunities.
```python
        for caller in callers:
            caller_info = code_graph.nodes[caller]

            # Analyze the calling context
            context_analysis = {
                'caller_name': caller_info.get('name'),
                'caller_type': caller_info.get('type'),
                'call_frequency': estimate_call_frequency(
                    code_graph, caller, api_func
                ),
                'error_handling': has_error_handling(
                    code_graph, caller, api_func
                )
            }

            usage_patterns['usage_contexts'].append(context_analysis)

        # Aggregate pattern analysis
        usage_patterns['most_common_contexts'] = identify_common_contexts(
            usage_patterns['usage_contexts']
        )
        usage_patterns['error_handling_coverage'] = calculate_error_coverage(
            usage_patterns['usage_contexts']
        )

        usage_analysis[api_func] = usage_patterns

    return usage_analysis
```
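The aggregation helpers can be simple. One plausible `identify_common_contexts`, sketched here as grouping by caller type, is:

```python
from collections import Counter

def identify_common_contexts(usage_contexts, top_n=3):
    """Return the most frequent caller types among an API's usage contexts."""
    counts = Counter(ctx['caller_type'] for ctx in usage_contexts)
    return counts.most_common(top_n)
```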
## Performance Optimization for Code Graphs

### Incremental Graph Updates

Large codebases require efficient incremental updates rather than full rebuilds:
```python
class IncrementalCodeGraphUpdater:
    """Efficiently update code graphs for modified files"""

    def __init__(self, base_graph):
        self.base_graph = base_graph
        self.change_tracker = {}
        self.dependency_cache = {}
```
Incremental updates maintain performance for large codebases by only reprocessing changed components and their direct dependencies.
```python
    def update_for_modified_file(self, file_path, new_content):
        """Update graph for a single modified file"""
        # Identify existing nodes from this file
        existing_nodes = self.get_nodes_from_file(file_path)

        # Analyze the new file content
        new_analysis = self.analyze_file_content(file_path, new_content)

        # Remove obsolete nodes and edges
        self.remove_obsolete_elements(existing_nodes, new_analysis)

        # Add new nodes and edges
        self.add_new_elements(new_analysis)

        # Update affected dependencies
        self.update_dependent_relationships(file_path)

        # Invalidate affected caches
        self.invalidate_analysis_caches(file_path)
```
The incremental update process minimizes computational overhead while maintaining graph consistency and correctness.
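In practice this updater would be driven by an editor hook or file watcher. A minimal manual trigger might look like this; the paths are placeholders, and the helper methods used above are assumed to be implemented:

```python
updater = IncrementalCodeGraphUpdater(base_graph=code_graph)

changed_file = "src/services/payment.py"
with open(changed_file, encoding="utf-8") as f:
    updater.update_for_modified_file(changed_file, f.read())
```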
## 🧭 Navigation

← Previous: Session 5 - Type-Safe Development

Next: Session 7 - Agent Systems →