🎯📝⚙️ Session 2: Advanced Chunking & Preprocessing¶
🎯📝⚙️ Learning Path Overview¶
This session offers three distinct learning paths designed to match your goals and time investment:
🎯 Observer Path
- Focus: Understanding concepts and architecture
- Activities: Advanced chunking principles, document structure analysis
- Ideal for: Decision makers, architects, overview learners

📝 Participant Path
- Focus: Guided implementation and analysis
- Activities: Implement structure-aware chunking, metadata extraction systems
- Ideal for: Developers, technical leads, hands-on learners

⚙️ Implementer Path
- Focus: Complete implementation and customization
- Activities: Enterprise-grade preprocessing systems, quality assessment frameworks
- Ideal for: Senior engineers, architects, specialists
Session Introduction¶
In Session 1, you implemented a working RAG system that splits documents into manageable chunks. But that's where many RAG implementations stop – and where they start to fail in production. When your chunker splits a table down the middle, or breaks apart a code block so it loses the function definition that gives it meaning, you've hit the fundamental limitation of naive text splitting.
This session transforms your RAG system from a simple text splitter into an intelligent document-understanding engine. You'll implement structure-aware chunking that preserves document hierarchy, extract rich metadata that enhances retrieval precision, and handle complex content types including tables, code blocks, and domain-specific formatting. The goal is to ensure that every chunk carries not just content, but context.
Figure 1: Common problems with naive chunking approaches that advanced preprocessing solves.
🎯 Observer Path: Core Concepts Overview¶
The Challenge with Naive Chunking¶
Standard text splitting destroys document structure:
- Tables broken mid-row: Revenue data becomes meaningless
- Code blocks fragmented: Function definitions lose context
- Headings separated: Topics lose their organizational hierarchy
- Lists split arbitrarily: Enumerated items lose sequence
Advanced chunking solves these problems by understanding document structure before making splitting decisions.
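To see the failure concretely, here is a minimal sketch (the sample table and chunk size are illustrative, not from the session code) of a fixed-size character splitter cutting a markdown table mid-row:

```python
# A minimal sketch of naive fixed-size splitting breaking a table mid-row.
# The sample table and chunk size are illustrative only.
doc = (
    "| Quarter | Revenue |\n"
    "|---------|---------|\n"
    "| Q1      | $1.2M   |\n"
    "| Q2      | $1.5M   |\n"
)

def naive_split(text, chunk_size=60):
    """Split purely by character count, ignoring structure."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

for i, chunk in enumerate(naive_split(doc)):
    print(f"--- chunk {i} ---\n{chunk}")
# The second chunk begins mid-row, so the Q2 revenue figure is separated
# from its column headers and becomes meaningless in isolation.
```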
Core Processing Stages¶
Intelligent document preprocessing follows four key stages:
1. Structure Analysis: Identify content types (headings, tables, code)
2. Hierarchical Chunking: Respect document organization and relationships
3. Metadata Extraction: Add context through entities, keywords, topics
4. Quality Assessment: Measure and optimize chunk effectiveness
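Taken together, the stages compose into a single pipeline. The sketch below uses trivial stub stages to show the flow only; the real components are built throughout this session:

```python
# Conceptual sketch of the four stages composed into one pipeline.
# Each stage is a trivial stub; later sections build the real versions.
def analyze_structure(text):          # 1. Structure Analysis
    return [line for line in text.splitlines() if line.strip()]

def hierarchical_chunk(elements):     # 2. Hierarchical Chunking
    return ["\n".join(elements)]

def extract_metadata(chunks):         # 3. Metadata Extraction
    return [{"content": c, "keywords": []} for c in chunks]

def assess_quality(enriched):         # 4. Quality Assessment
    return [dict(c, quality=1.0) for c in enriched]

def preprocess_document(raw_text):
    """Run all four stages in order."""
    return assess_quality(extract_metadata(
        hierarchical_chunk(analyze_structure(raw_text))))

print(preprocess_document("# Title\nSome body text."))
```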
For detailed implementation, continue to the 📝 Participant path below.
📝 Participant Path: Implementation Overview¶
Prerequisites: Complete 🎯 Observer path sections above
For hands-on implementation of structure-aware chunking:
- 📝 Hierarchical Chunking Practice: Build structure-aware chunkers
- 📝 Metadata Extraction Implementation: Extract rich context from content
⚙️ Implementer Path: Advanced Systems¶
Prerequisites: Complete 🎯 Observer and 📝 Participant paths
For enterprise-grade preprocessing and optimization:
- ⚙️ Advanced Processing Pipeline: Complete enterprise systems
- ⚙️ Quality Assessment Systems: Comprehensive quality control
Basic Content Type Detection Example¶
Here's a simple example of how content type detection works:
```python
from enum import Enum

class ContentType(Enum):
    HEADING = "heading"
    PARAGRAPH = "paragraph"
    TABLE = "table"
    CODE = "code"
    LIST = "list"
```
This enumeration defines the basic content types our system needs to recognize.
```python
def detect_content_type(text_line):
    """Detect content type from a single line."""
    text = text_line.strip()

    # Check for markdown heading
    if text.startswith('#'):
        return ContentType.HEADING

    # Check for table (pipe-separated)
    if '|' in text and text.count('|') >= 2:
        return ContentType.TABLE
```
The detection logic examines text patterns to classify content. Markdown headers start with '#' symbols, while tables use '|' characters as column separators.
```python
    # (continuation of detect_content_type)
    # Check for code: inspect the raw line, since `text` was stripped
    if text_line.startswith('    ') or text_line.startswith('\t'):
        return ContentType.CODE

    # Check for list item
    if text.startswith('- ') or text.startswith('* '):
        return ContentType.LIST

    return ContentType.PARAGRAPH
```
Code blocks are identified by indentation (4 spaces or tabs), list items by bullet markers, and everything else defaults to paragraph content.
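A quick sanity check of the detector on representative lines (expected results shown in comments):

```python
# Exercising detect_content_type on one line of each kind.
samples = [
    "# Overview",          # -> ContentType.HEADING
    "| Q1 | $1.2M |",      # -> ContentType.TABLE
    "    return x + 1",    # -> ContentType.CODE
    "- first item",        # -> ContentType.LIST
    "Plain prose here.",   # -> ContentType.PARAGRAPH
]
for line in samples:
    print(f"{line!r:25} -> {detect_content_type(line)}")
```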
Document Element Structure¶
Structured representation preserves content relationships:
```python
from dataclasses import dataclass
from typing import Dict, Any

@dataclass
class DocumentElement:
    """Structured document element."""
    content: str
    element_type: ContentType
    level: int
    metadata: Dict[str, Any]
    position: int
```
This structure captures content, type, hierarchy level, and metadata for each document element.
```python
    # (method of DocumentElement)
    def get_hierarchy_context(self):
        """Get readable hierarchy info."""
        hierarchy_labels = {
            0: "Document Root",
            1: "Major Section",
            2: "Subsection"
        }
        # Fall back to a generic label for deeper nesting levels
        return hierarchy_labels.get(self.level, f"Level {self.level}")
```
The hierarchy context method provides human-readable descriptions of element levels in the document structure.
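Putting the pieces together, a hand-built element shows how type, level, and metadata travel with the content (the field values here are illustrative):

```python
# Constructing a DocumentElement by hand; values are illustrative only.
element = DocumentElement(
    content="## Supervised Learning",
    element_type=ContentType.HEADING,
    level=2,
    metadata={"keywords": ["supervised", "learning"]},
    position=4,
)
print(element.get_hierarchy_context())  # -> "Subsection"
```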
📝 Participant Path: Implementation Guide¶
Prerequisites: Complete 🎯 Observer path sections above
Hierarchical Chunking Fundamentals¶
The key insight of hierarchical chunking is respecting document structure rather than treating all text equally. Documents have natural boundaries:
- Headings introduce new topics
- Paragraphs develop those topics
- Sections relate hierarchically
For detailed implementation guide: 📝 Hierarchical Chunking Practice →
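Before chunking, raw text must be parsed into elements. A minimal sketch, reusing detect_content_type and DocumentElement from the Observer path (deriving heading depth from the number of leading '#' characters is an assumption of this sketch):

```python
# Minimal parser sketch: turn raw text into DocumentElement objects,
# reusing detect_content_type and DocumentElement from the Observer path.
def parse_document(raw_text):
    elements = []
    for position, line in enumerate(raw_text.splitlines()):
        if not line.strip():
            continue  # skip blank lines
        ctype = detect_content_type(line)
        # Heading depth follows the number of leading '#' characters.
        level = (len(line) - len(line.lstrip("#"))
                 if ctype == ContentType.HEADING else 0)
        elements.append(DocumentElement(
            content=line.strip(),
            element_type=ctype,
            level=level,
            metadata={},
            position=position,
        ))
    return elements
```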
Simple Chunking Example¶
Here's the core concept demonstrated with a basic implementation:
```python
def simple_hierarchical_chunk(elements, max_size=500):
    """Create structure-aware chunks."""
    chunks = []
    current_chunk = []
    current_size = 0
```
The chunker maintains a current chunk and tracks its size while processing document elements.
```python
    for element in elements:
        element_size = len(element.content)

        # Start new chunk on major headings
        if (element.element_type == ContentType.HEADING
                and element.level <= 1 and current_chunk):
            chunks.append('\n'.join(current_chunk))
            current_chunk = []
            current_size = 0
```
Major headings (level 0-1) trigger new chunks, preserving topic boundaries rather than splitting arbitrarily.
```python
        # Add element if size permits
        if current_size + element_size <= max_size:
            current_chunk.append(element.content)
            current_size += element_size
        else:
            # Finalize current, start new chunk
            if current_chunk:
                chunks.append('\n'.join(current_chunk))
            current_chunk = [element.content]
            current_size = element_size

    # Flush the final chunk so trailing content is not dropped
    if current_chunk:
        chunks.append('\n'.join(current_chunk))
    return chunks
```
Size management ensures chunks remain manageable while preferring structural boundaries over arbitrary cuts.
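Running the chunker end to end on a small document (this uses the parse_document sketch above; the sample text is illustrative):

```python
# End-to-end run of the sketch: parse, then chunk with structure awareness.
text = """# Machine Learning Overview
Machine learning algorithms learn patterns from data.
## Supervised Learning
Supervised learning uses labeled training data.
# Deep Learning
Neural networks stack layers of learned representations."""

chunks = simple_hierarchical_chunk(parse_document(text), max_size=200)
for i, chunk in enumerate(chunks, 1):
    print(f"--- chunk {i} ---\n{chunk}")
# The '# Deep Learning' heading starts a new chunk, so each top-level
# topic stays intact instead of being split at an arbitrary character.
```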
Key Benefits¶
- Natural boundaries: Respects document organization
- Complete sections: Keeps related content together
- Size management: Balances structure with practical limits
📝 Practice Exercises¶
Exercise 1: Content Type Detection¶
Implement a function that identifies content types in a sample document. Test with various formats including tables, code blocks, and lists.
Exercise 2: Basic Hierarchical Chunking¶
Create a simple chunker that respects heading boundaries. Compare results with naive text splitting on a structured document.
Exercise 3: Metadata Extraction¶
Build a metadata extractor that identifies entities, topics, and difficulty levels. Test on technical and business documents.
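As a starting point for Exercise 3, here is a minimal keyword-frequency sketch; the stop-word list and the word-length difficulty heuristic are illustrative assumptions, not the session's full extractor:

```python
from collections import Counter
import re

# Minimal metadata-extraction sketch for Exercise 3. The stop words and
# the difficulty heuristic are illustrative assumptions only.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "are"}

def extract_chunk_metadata(chunk_text):
    words = re.findall(r"[a-zA-Z][a-zA-Z-]+", chunk_text.lower())
    keywords = [w for w, _ in Counter(
        w for w in words if w not in STOP_WORDS).most_common(5)]
    # Crude difficulty proxy: average word length in the chunk.
    avg_len = sum(len(w) for w in words) / max(len(words), 1)
    difficulty = "advanced" if avg_len > 7 else "introductory"
    return {"keywords": keywords, "difficulty": difficulty}

print(extract_chunk_metadata("Supervised learning uses labeled training data."))
```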
Testing Your Implementation¶
```python
# Test with sample document
sample_doc = """
# Machine Learning Overview
Machine learning algorithms learn patterns from data.
## Supervised Learning
Supervised learning uses labeled training data.
### Classification
Classification predicts discrete categories.
"""

# Test chunking results
chunks = your_chunker.create_chunks(sample_doc)
print(f"Created {len(chunks)} chunks")
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}: {chunk.metadata}")
```
Discussion¶
Key Takeaways¶
🎯 Observer Path Summary¶
Core Concepts Mastered:
- Document structure matters more than naive text splitting
- Content types (headings, tables, code) require specialized handling
- Hierarchical organization preserves meaning and relationships
- Quality assessment enables optimization and monitoring
📝 Participant Path Summary¶
Implementation Skills Gained:
- Built structure-aware chunking systems
- Implemented metadata extraction for enhanced context
- Created quality assessment metrics
- Balanced structure preservation with size constraints
⚙️ Implementer Path Summary¶
Enterprise Capabilities Achieved:
- Deployed production-grade preprocessing pipelines
- Implemented comprehensive quality control systems
- Created domain-specific analysis capabilities
- Built automated optimization feedback loops
📝 Multiple Choice Test - Session 2¶
Test your understanding of advanced chunking and preprocessing concepts:
Question 1: What is the primary benefit of detecting content types (headings, tables, code) during document analysis?
A) Reduces processing time
B) Enables structure-aware chunking that preserves meaning
C) Reduces storage requirements
D) Improves embedding quality
Question 2: In hierarchical chunking, why is it important to track element hierarchy levels?
A) To reduce memory usage
B) To simplify the codebase
C) To improve processing speed
D) To preserve document structure and create meaningful chunk boundaries
Question 3: What is the main advantage of extracting entities, keywords, and topics during preprocessing?
A) Reduces chunk size
B) Enables more precise retrieval through enriched context
C) Simplifies the chunking process
D) Improves computational efficiency
Question 4: Why do tables require specialized processing in RAG systems?
A) Tables use different encoding formats
B) Tables contain more text than paragraphs
C) Tables are always larger than the chunk size
D) Tables have structured relationships that are lost in naive chunking
Question 5: When processing documents with images, what is the best practice for RAG systems?
A) Store images as binary data in chunks
B) Create separate chunks for each image
C) Replace image references with descriptive text
D) Ignore images completely
Question 6: Which metric is most important for measuring chunk coherence in hierarchical chunking?
A) Topic consistency between related chunks
B) Number of chunks created
C) Average chunk size
D) Processing speed
Question 7: What is the optimal overlap ratio for hierarchical chunks?
A) 100% - complete duplication
B) 0% - no overlap needed
C) 10-20% - balanced context and efficiency
D) 50% - maximum context preservation
Question 8: Why should the advanced processing pipeline analyze document complexity before choosing a processing strategy?
A) To select the most appropriate processing approach for the content type
B) To set the embedding model parameters
C) To reduce computational costs
D) To determine the number of chunks to create
🧭 Navigation¶
← Previous: Session 1 - Foundations
Next: Session 3 - Advanced Patterns →