Session 2: Advanced Chunking & Preprocessing¶
Learning Outcomes¶
By the end of this session, you will be able to:

- Design hierarchical chunking strategies that preserve document structure and meaning
- Extract and utilize rich metadata to enhance retrieval quality and context
- Process multi-modal content including tables, images, and structured data effectively
- Implement intelligent preprocessing pipelines that adapt to document complexity
- Evaluate chunk quality using comprehensive metrics and assessment frameworks
Chapter Overview¶
What You'll Learn: Enterprise-Grade Document Understanding¶
Advanced chunking transforms RAG systems from simple text splitters into intelligent document understanding engines. Unlike naive approaches that arbitrarily split text, enterprise systems preserve document structure, extract rich metadata, and handle complex content types that include tables, images, and domain-specific formatting.
Why It Matters: Real-World Document Complexity¶
Enterprise documents contain structured relationships that naive chunking destroys. Tables split across chunks lose meaning, hierarchical information gets fragmented, and rich contextual metadata is discarded. Advanced preprocessing preserves these relationships, leading to 50-70% better context preservation and 30-40% improvement in retrieval relevance.
How It Stands Out: 2025 Enterprise Standards¶
Based on current industry trends, advanced chunking in 2025 employs:

- Neural boundary detection that learns optimal split points
- Context-aware processing that adapts to document types
- Multi-modal understanding for mixed content formats
- Dynamic window summarization for enhanced context
Where You'll Apply It: Critical Enterprise Use Cases¶
- Legal Document Processing: Preserving citations and statutory references
- Medical Record Analysis: Maintaining dosage and treatment relationships
- Technical Documentation: Keeping code blocks and API references intact
- Research Paper Processing: Preserving academic structure and citations
Figure 1: This diagram illustrates common problems with naive chunking approaches, including fragmented content, lost context, and structural information loss that advanced preprocessing solves.
Learning Path Options¶
Observer Path (30 minutes): Understand concepts with simple examples
- Focus: Core concepts and practical applications with clear demonstrations
- Best for: Getting oriented with intelligent document processing fundamentals

Participant Path (65 minutes): Implement structure-aware chunking
- Focus: Hands-on implementation of hierarchical and metadata-enhanced chunking
- Best for: Building practical preprocessing pipelines with real-world applications

Implementer Path (105 minutes): Advanced patterns and enterprise deployment
- Focus: Domain-specific processors and enterprise-scale optimization
- Best for: Deep technical mastery with production-ready implementations
Part 1: Document Structure Analysis (Observer: 8 min | Participant: 15 min)¶
Understanding the Enterprise Challenge¶
Enterprise documents contain complex structures that naive chunking destroys. Consider a typical enterprise scenario:
The Problem: A legal contract with multiple sections, numbered clauses, and cross-references gets processed by simple text splitting. The result? Clause numbers are separated from their content, cross-references become meaningless, and the hierarchical structure that legal professionals rely on is completely lost.
The Solution: Structure-aware document analysis that preserves these critical relationships.
Why Document Structure Analysis Matters¶
Intelligent document analysis provides several critical capabilities:

- Preserves semantic relationships: Keeps related content together
- Maintains hierarchy: Uses document structure as natural boundaries
- Enables specialized processing: Different content types get appropriate handling
- Enhances retrieval quality: Includes structural metadata for better matching
Simple Content Type Detection (Observer Focus)¶
Here's a straightforward example of how we can detect different content types in documents:
```python
# Simple content type detection
from enum import Enum

class ContentType(Enum):
    HEADING = "heading"
    PARAGRAPH = "paragraph"
    TABLE = "table"
    CODE = "code"
    LIST = "list"

def detect_simple_content_type(text_line):
    """Detect content type from a single line of text."""
    text = text_line.strip()

    # Check for markdown heading
    if text.startswith('#'):
        return ContentType.HEADING

    # Check for table (pipe-separated)
    if '|' in text and text.count('|') >= 2:
        return ContentType.TABLE

    # Check for code (original line starts with 4 spaces or a tab)
    if text_line.startswith('    ') or text_line.startswith('\t'):
        return ContentType.CODE

    # Check for list item
    if text.startswith('- ') or text.startswith('* '):
        return ContentType.LIST

    return ContentType.PARAGRAPH
```
Code Explanation:

- Heading check: markdown headings are detected by a leading '#' symbol
- Table check: tables are identified by counting the pipe characters ('|') that separate columns
- Code check: code blocks are recognized by leading indentation (four spaces or a tab) on the original line
- List check: list items are spotted by bullet-point markers ('- ' or '* ')
This classification enables our system to make intelligent decisions about how to process each type of content.
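To see the classifier in action, here is a minimal usage sketch run over a few illustrative lines (the sample strings are made up for demonstration):

```python
# Quick check of detect_simple_content_type on sample lines (illustrative input)
sample_lines = [
    "# Introduction",
    "| Name | Role |",
    "    return total",
    "- first item",
    "Plain prose sentence.",
]

for line in sample_lines:
    print(f"{detect_simple_content_type(line).value:10s} <- {line!r}")
# Expected order: heading, table, code, list, paragraph
```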
PARTICIPANT PATH: Building Document Elements¶
For hands-on implementation, let's create a more sophisticated structure to represent document elements:
```python
from dataclasses import dataclass
from typing import Dict, Any

@dataclass
class DocumentElement:
    """Represents a structured document element with metadata."""
    content: str
    element_type: ContentType
    level: int  # Hierarchy level (0=top, 1=section, 2=subsection)
    metadata: Dict[str, Any]
    position: int  # Position in document

    def get_hierarchy_context(self):
        """Get human-readable hierarchy information."""
        hierarchy_labels = {
            0: "Document Root",
            1: "Major Section",
            2: "Subsection",
            3: "Minor Section"
        }
        return hierarchy_labels.get(self.level, f"Level {self.level}")
```
Implementation Benefits:

- Structured representation: Each element carries its content, type, and context
- Hierarchy tracking: Level information enables structure-aware processing
- Rich metadata: Position and additional context for enhanced chunking
- Processing intelligence: Different element types can receive specialized handling
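As a quick orientation, a single element might be constructed and inspected like this (the field values are illustrative, not from a real document):

```python
# Illustrative DocumentElement for a second-level heading
element = DocumentElement(
    content="## Payment Terms",
    element_type=ContentType.HEADING,
    level=2,
    metadata={"line_number": 42, "char_count": 16, "word_count": 3},
    position=7,
)

print(element.get_hierarchy_context())  # -> "Subsection"
```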
Now let's implement the structure analyzer that processes documents line by line:
```python
import re
from typing import List

class DocumentStructureAnalyzer:
    """Analyzes document structure and content types."""

    def __init__(self):
        self.heading_patterns = [
            r'^#{1,6}\s+(.+)$',      # Markdown headers
            r'^([A-Z][^a-z]*)\s*$',  # ALL CAPS headers
            r'^\d+\.\s+(.+)$',       # Numbered headers
        ]

    def analyze_structure(self, document_text: str) -> List[DocumentElement]:
        """Analyze document structure and create structured elements."""
        lines = document_text.split('\n')
        elements = []
        current_level = 0
        position = 0

        for i, line in enumerate(lines):
            if not line.strip():
                continue

            content_type = detect_simple_content_type(line)
            level = self._determine_level(line, current_level, content_type)

            element = DocumentElement(
                content=line.strip(),
                element_type=content_type,
                level=level,
                metadata={
                    "line_number": i + 1,
                    "char_count": len(line),
                    "word_count": len(line.split())
                },
                position=position
            )

            elements.append(element)
            position += 1
            current_level = level

        return elements
```
This analyzer creates a structured representation of the document that preserves hierarchy and content relationships.
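Note that `analyze_structure` calls a `_determine_level` helper that is not shown above. A minimal sketch, assuming markdown-style headings where the number of leading '#' characters gives the depth and non-heading content inherits the current level, could be added to the class like this:

```python
    def _determine_level(self, line: str, current_level: int,
                         content_type: ContentType) -> int:
        """Minimal heuristic (assumption: markdown '#' depth maps to hierarchy level)."""
        stripped = line.strip()
        if content_type == ContentType.HEADING and stripped.startswith('#'):
            # '## Section' -> level 2, '### Subsection' -> level 3, and so on
            return len(stripped) - len(stripped.lstrip('#'))
        # Non-heading content stays at the level of the section it belongs to
        return current_level
```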
IMPLEMENTER PATH: Advanced Pattern Recognition¶
For enterprise applications, implement sophisticated pattern recognition:
```python
import re
from typing import Any, Dict, List, Tuple

class AdvancedDocumentAnalyzer:
    """Enterprise-grade document analysis with domain-specific patterns."""

    def __init__(self):
        self.domain_patterns = {
            "legal": {
                "section_markers": [r"§\s*\d+", r"Article\s+[IVX]+", r"Section\s+\d+"],
                "citations": [r"\d+\s+U\.S\.C\.\s+§\s+\d+", r"\d+\s+F\.\d+d\s+\d+"]
            },
            "medical": {
                "dosages": [r"\d+\s*mg", r"\d+\s*ml", r"\d+\s*cc"],
                "medications": [r"[A-Z][a-z]+(?:in|ol|ide|ine)"]
            },
            "technical": {
                "apis": [r"[A-Z][a-zA-Z]+\.[a-zA-Z]+\(\)", r"HTTPS?://"],
                "versions": [r"v?\d+\.\d+\.\d+", r"version\s+\d+"]
            }
        }

    def analyze_with_domain_knowledge(self, document_text: str,
                                      domain: str = "general") -> Dict[str, Any]:
        """Analyze document with domain-specific intelligence."""
        analysis = {
            "domain": domain,
            "structure": self._analyze_structure(document_text),
            "complexity_score": self._calculate_complexity_score(document_text),
            "processing_strategy": "standard"
        }

        if domain in self.domain_patterns:
            domain_features = self._extract_domain_features(document_text, domain)
            analysis["domain_features"] = domain_features
            analysis["processing_strategy"] = self._recommend_strategy(domain_features)

        return analysis
```
This advanced analyzer adapts to different document domains and provides intelligent processing recommendations.
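The helper methods referenced above (`_analyze_structure`, `_calculate_complexity_score`, `_extract_domain_features`, `_recommend_strategy`) are left to the implementer. As one possible sketch, the domain-feature extraction can simply collect regex matches from the `domain_patterns` table; the scoring threshold and the strategy name below are assumptions:

```python
    def _extract_domain_features(self, document_text: str, domain: str) -> Dict[str, List[str]]:
        """Collect matches for each domain-specific pattern group (illustrative sketch)."""
        features: Dict[str, List[str]] = {}
        for feature_name, patterns in self.domain_patterns[domain].items():
            matches: List[str] = []
            for pattern in patterns:
                matches.extend(re.findall(pattern, document_text))
            features[feature_name] = matches
        return features

    def _recommend_strategy(self, domain_features: Dict[str, List[str]]) -> str:
        """Recommend a strategy based on how many domain features were found (assumed threshold)."""
        total_matches = sum(len(v) for v in domain_features.values())
        return "domain_aware" if total_matches > 5 else "standard"
```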
Part 2: Hierarchical Chunking Implementation (Observer: 8 min | Participant: 20 min)¶
Why Hierarchical Chunking Transforms RAG Performance¶
Traditional chunking is like cutting a newspaper with a paper shredder - it destroys the logical structure authors carefully created. Hierarchical chunking cuts along natural boundaries, preserving meaning and relationships.
Performance Impact:

- 60-80% better context preservation because chunks contain complete thoughts
- 40-50% improved retrieval accuracy because queries match naturally coherent content units
- Enhanced user experience because retrieved content makes logical sense
Simple Hierarchical Chunking (Observer Focus)¶
Here's a basic example of how hierarchical chunking works:
```python
def simple_hierarchical_chunk(elements: List[DocumentElement],
                              max_chunk_size: int = 500) -> List[str]:
    """Create simple hierarchical chunks based on document structure."""
    chunks = []
    current_chunk = []
    current_size = 0

    for element in elements:
        element_size = len(element.content)

        # Start new chunk on major headings if current chunk has content
        if (element.element_type == ContentType.HEADING and
                element.level <= 1 and current_chunk):
            # Save current chunk
            if current_chunk:
                chunks.append('\n'.join(current_chunk))
                current_chunk = []
                current_size = 0

        # Add element to current chunk if size permits
        if current_size + element_size <= max_chunk_size:
            current_chunk.append(element.content)
            current_size += element_size
        else:
            # Save current chunk and start new one
            if current_chunk:
                chunks.append('\n'.join(current_chunk))
            current_chunk = [element.content]
            current_size = element_size

    # Save final chunk
    if current_chunk:
        chunks.append('\n'.join(current_chunk))

    return chunks
```
Key Benefits:

- Natural boundaries: Uses document structure instead of arbitrary splits
- Complete sections: Keeps related content together
- Size management: Respects maximum chunk size while preserving structure
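Putting the structure analyzer and the simple chunker together on a small markdown snippet (illustrative input; this assumes the `_determine_level` sketch from Part 1):

```python
sample_doc = """# Contract Overview
This agreement covers payment terms and deliverables.

## Payment Terms
Invoices are due within 30 days.

## Deliverables
- Initial report
- Final implementation
"""

analyzer = DocumentStructureAnalyzer()
elements = analyzer.analyze_structure(sample_doc)
chunks = simple_hierarchical_chunk(elements, max_chunk_size=200)

for i, chunk in enumerate(chunks, 1):
    print(f"--- Chunk {i} ({len(chunk)} chars) ---")
    print(chunk)
```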
PARTICIPANT PATH: Advanced Hierarchical Chunker¶
Let's implement a production-ready hierarchical chunker with overlap and metadata:
```python
from langchain.schema import Document

class HierarchicalChunker:
    """Creates intelligent chunks based on document hierarchy."""

    def __init__(self, max_chunk_size: int = 1000, overlap_ratio: float = 0.1):
        self.max_chunk_size = max_chunk_size
        self.overlap_ratio = overlap_ratio
        self.analyzer = DocumentStructureAnalyzer()

    def create_hierarchical_chunks(self, document: Document) -> List[Document]:
        """Create chunks that preserve document hierarchy."""
        # Analyze document structure
        elements = self.analyzer.analyze_structure(document.page_content)

        # Group elements into logical sections
        sections = self._group_elements_by_hierarchy(elements)

        # Create chunks from sections
        chunks = []
        for section in sections:
            section_chunks = self._chunk_section(section, document.metadata)
            chunks.extend(section_chunks)

        return chunks

    def _group_elements_by_hierarchy(self, elements: List[DocumentElement]) -> List[List[DocumentElement]]:
        """Group elements into hierarchical sections."""
        sections = []
        current_section = []
        current_level = -1

        for element in elements:
            # Start new section on same or higher level heading
            if (element.element_type == ContentType.HEADING and
                    element.level <= current_level and current_section):
                sections.append(current_section)
                current_section = [element]
                current_level = element.level
            elif element.element_type == ContentType.HEADING and not current_section:
                current_section = [element]
                current_level = element.level
            else:
                current_section.append(element)

        # Add final section
        if current_section:
            sections.append(current_section)

        return sections

    def _chunk_section(self, section: List[DocumentElement],
                       base_metadata: Dict) -> List[Document]:
        """Create chunks from a document section with intelligent overlap."""
        chunks = []
        current_chunk_elements = []
        current_size = 0
        section_title = self._extract_section_title(section)

        for element in section:
            element_size = len(element.content)

            # Check if adding this element would exceed size limit
            if current_size + element_size > self.max_chunk_size and current_chunk_elements:
                # Create chunk from current elements
                chunk = self._create_chunk_from_elements(
                    current_chunk_elements, base_metadata, section_title
                )
                chunks.append(chunk)

                # Start new chunk with overlap for continuity
                overlap_elements = self._get_overlap_elements(current_chunk_elements)
                current_chunk_elements = overlap_elements + [element]
                current_size = sum(len(e.content) for e in current_chunk_elements)
            else:
                current_chunk_elements.append(element)
                current_size += element_size

        # Create final chunk
        if current_chunk_elements:
            chunk = self._create_chunk_from_elements(
                current_chunk_elements, base_metadata, section_title
            )
            chunks.append(chunk)

        return chunks

    def _create_chunk_from_elements(self, elements: List[DocumentElement],
                                    base_metadata: Dict, section_title: str) -> Document:
        """Create a document chunk with rich metadata."""
        # Combine element content with proper formatting
        content_parts = []
        for element in elements:
            if element.element_type == ContentType.HEADING:
                content_parts.append(f"\n{element.content}\n")
            else:
                content_parts.append(element.content)

        content = "\n".join(content_parts).strip()

        # Build enhanced metadata
        content_types = [e.element_type.value for e in elements]
        hierarchy_levels = [e.level for e in elements]

        enhanced_metadata = {
            **base_metadata,
            "section_title": section_title,
            "chunk_type": "hierarchical",
            "content_types": list(set(content_types)),
            "hierarchy_levels": hierarchy_levels,
            "element_count": len(elements),
            "has_heading": ContentType.HEADING.value in content_types,
            "has_table": ContentType.TABLE.value in content_types,
            "has_code": ContentType.CODE.value in content_types,
            "min_hierarchy_level": min(hierarchy_levels),
            "max_hierarchy_level": max(hierarchy_levels)
        }

        return Document(page_content=content, metadata=enhanced_metadata)
```
This implementation provides intelligent section grouping, size management, and context preservation through overlap.
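Two helpers used above, `_extract_section_title` and `_get_overlap_elements`, are referenced but not shown. Minimal sketches, assuming the section title is the first heading in the section and the overlap is the trailing elements that fit within `overlap_ratio` of the chunk budget:

```python
    def _extract_section_title(self, section: List[DocumentElement]) -> str:
        """Use the first heading in the section as its title (fallback: first element)."""
        for element in section:
            if element.element_type == ContentType.HEADING:
                return element.content
        return section[0].content[:50] if section else ""

    def _get_overlap_elements(self, elements: List[DocumentElement]) -> List[DocumentElement]:
        """Take trailing elements until roughly overlap_ratio of the chunk budget is filled."""
        overlap_budget = int(self.max_chunk_size * self.overlap_ratio)
        overlap: List[DocumentElement] = []
        size = 0
        for element in reversed(elements):
            if size + len(element.content) > overlap_budget:
                break
            overlap.insert(0, element)
            size += len(element.content)
        return overlap
```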
IMPLEMENTER PATH: Enterprise Chunking Pipeline¶
For production environments, implement a comprehensive pipeline with quality assessment:
```python
class EnterpriseChunkingPipeline:
    """Enterprise-grade chunking pipeline with quality assessment."""

    def __init__(self, config: Dict[str, Any]):
        self.config = config
        self.chunker = HierarchicalChunker(
            max_chunk_size=config.get("max_chunk_size", 1000),
            overlap_ratio=config.get("overlap_ratio", 0.1)
        )
        self.quality_assessor = ChunkQualityAssessor()

    def process_document_with_quality_control(self, document: Document) -> Dict[str, Any]:
        """Process document with comprehensive quality assessment."""
        # Create initial chunks
        chunks = self.chunker.create_hierarchical_chunks(document)

        # Assess chunk quality
        quality_metrics = self.quality_assessor.assess_chunk_quality(chunks)

        # Apply quality-based optimization if needed
        if quality_metrics["overall_quality"] < self.config.get("min_quality_threshold", 0.7):
            chunks = self._optimize_chunks(chunks, quality_metrics)
            quality_metrics = self.quality_assessor.assess_chunk_quality(chunks)

        return {
            "chunks": chunks,
            "quality_metrics": quality_metrics,
            "processing_stats": {
                "original_length": len(document.page_content),
                "chunk_count": len(chunks),
                "avg_chunk_size": sum(len(c.page_content) for c in chunks) / len(chunks),
                "quality_score": quality_metrics["overall_quality"]
            }
        }
```
This enterprise pipeline includes quality control and optimization feedback loops.
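The `_optimize_chunks` step is not specified here. One simple interpretation is to merge undersized neighbouring chunks so the size distribution evens out; the `min_chunk_size` config key below is an assumption:

```python
    def _optimize_chunks(self, chunks: List[Document],
                         quality_metrics: Dict[str, float]) -> List[Document]:
        """Illustrative optimization: fold chunks well below the target size into their predecessor."""
        min_size = self.config.get("min_chunk_size", 200)  # assumed config key
        optimized: List[Document] = []
        for chunk in chunks:
            if optimized and len(chunk.page_content) < min_size:
                previous = optimized[-1]
                optimized[-1] = Document(
                    page_content=previous.page_content + "\n\n" + chunk.page_content,
                    metadata=previous.metadata,
                )
            else:
                optimized.append(chunk)
        return optimized
```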
Part 3: Metadata Extraction & Enhancement (Observer: 6 min | Participant: 15 min)¶
The Power of Rich Metadata¶
Metadata significantly improves retrieval quality by creating additional searchable context. When users search for concepts, the system can match not just original text but also extracted entities, keywords, and inferred topics.
Simple Metadata Extraction (Observer Focus)¶
Here's a basic example of extracting useful metadata from document chunks:
```python
import re
from typing import Any, Dict, List

def extract_simple_metadata(text: str) -> Dict[str, Any]:
    """Extract basic metadata from text content."""
    words = text.split()

    # Basic statistics
    metadata = {
        "word_count": len(words),
        "char_count": len(text),
        "sentence_count": len(text.split('.')),
    }

    # Extract capitalized words (potential entities)
    capitalized_words = re.findall(r'\b[A-Z][a-z]+\b', text)
    metadata["potential_entities"] = list(set(capitalized_words))[:5]

    # Extract numbers and dates
    numbers = re.findall(r'\b\d+(?:\.\d+)?\b', text)
    metadata["numbers"] = [float(n) for n in numbers[:5]]

    dates = re.findall(r'\b\d{1,2}[/-]\d{1,2}[/-]\d{4}\b', text)
    metadata["dates"] = dates

    # Assess content difficulty
    long_words = [w for w in words if len(w) > 6]
    long_word_ratio = len(long_words) / len(words) if words else 0
    metadata["difficulty_level"] = "advanced" if long_word_ratio > 0.3 else "intermediate"

    return metadata
```
Metadata Benefits:

- Enhanced searchability: Additional context for retrieval matching
- Content understanding: Insight into chunk characteristics
- Quality assessment: Metrics for evaluating chunk usefulness
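A quick illustration on a short passage (output values depend entirely on the input text):

```python
text = ("Acme Corp signed the agreement on 12/03/2025. "
        "The contract value is 150000 dollars, payable within 30 days.")

meta = extract_simple_metadata(text)
print(meta["word_count"], meta["potential_entities"], meta["dates"])
```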
PARTICIPANT PATH: Advanced Metadata Extractor¶
For production systems, implement comprehensive metadata extraction:
```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class ExtractedMetadata:
    """Container for extracted metadata."""
    entities: List[str]
    keywords: List[str]
    topics: List[str]
    dates: List[str]
    numbers: List[float]
    technical_terms: List[str]
    difficulty_level: str
    content_summary: str

class MetadataExtractor:
    """Extracts rich metadata from document content."""

    def __init__(self):
        self.technical_patterns = [
            r'\b[A-Z]{2,}\b',                   # Acronyms
            r'\b\w+\(\)\b',                     # Function calls
            r'\b[a-zA-Z_]\w*\.[a-zA-Z_]\w*\b',  # Object notation
            r'\b\d+\.\d+\.\d+\b',               # Version numbers
        ]

        self.topic_keywords = {
            "technology": ["software", "computer", "digital", "algorithm", "data", "system"],
            "business": ["market", "customer", "revenue", "strategy", "company", "industry"],
            "legal": ["contract", "agreement", "clause", "statute", "regulation", "compliance"],
            "medical": ["patient", "treatment", "diagnosis", "medication", "therapy", "clinical"]
        }

    def extract_enhanced_metadata(self, text: str) -> ExtractedMetadata:
        """Extract comprehensive metadata from text."""
        # Extract different types of information
        entities = self._extract_entities(text)
        keywords = self._extract_keywords(text)
        topics = self._infer_topics(text)
        dates = self._extract_dates(text)
        numbers = self._extract_numbers(text)
        technical_terms = self._extract_technical_terms(text)
        difficulty_level = self._assess_difficulty(text)
        content_summary = self._generate_summary(text)

        return ExtractedMetadata(
            entities=entities,
            keywords=keywords,
            topics=topics,
            dates=dates,
            numbers=numbers,
            technical_terms=technical_terms,
            difficulty_level=difficulty_level,
            content_summary=content_summary
        )

    def _extract_entities(self, text: str) -> List[str]:
        """Extract named entities using pattern matching."""
        entities = []

        # Extract capitalized words (potential proper nouns)
        capitalized_words = re.findall(r'\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\b', text)
        entities.extend(capitalized_words)

        # Extract quoted terms
        quoted_terms = re.findall(r'"([^"]*)"', text)
        entities.extend(quoted_terms)

        # Remove duplicates and filter by length
        entities = list(set([e for e in entities if 2 < len(e) < 50]))

        return entities[:10]  # Limit to top 10

    def _infer_topics(self, text: str) -> List[str]:
        """Infer topics from content using keyword analysis."""
        text_lower = text.lower()
        topic_scores = {}

        for topic, keywords in self.topic_keywords.items():
            score = sum(text_lower.count(keyword) for keyword in keywords)
            if score > 0:
                topic_scores[topic] = score

        # Return topics sorted by relevance
        return sorted(topic_scores.keys(), key=lambda x: topic_scores[x], reverse=True)[:3]

    def _assess_difficulty(self, text: str) -> str:
        """Assess content difficulty level."""
        words = text.split()
        sentences = text.split('.')

        if not words or not sentences:
            return "unknown"

        # Calculate readability metrics
        avg_words_per_sentence = len(words) / len(sentences)
        long_words = len([w for w in words if len(w) > 6])
        long_word_ratio = long_words / len(words) if words else 0

        # Technical term density
        technical_terms = len(self._extract_technical_terms(text))
        technical_density = technical_terms / len(words) if words else 0

        # Determine difficulty
        if avg_words_per_sentence > 20 or long_word_ratio > 0.3 or technical_density > 0.1:
            return "advanced"
        elif avg_words_per_sentence > 15 or long_word_ratio > 0.2:
            return "intermediate"
        else:
            return "beginner"
```
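Several helpers used by `extract_enhanced_metadata` (`_extract_keywords`, `_extract_dates`, `_extract_numbers`, `_extract_technical_terms`, `_generate_summary`) are not shown. Minimal pattern-based sketches, with thresholds and limits chosen arbitrarily, could look like this:

```python
    def _extract_keywords(self, text: str) -> List[str]:
        """Frequent non-trivial words as keywords (simple frequency heuristic)."""
        words = [w.lower().strip('.,;:!?') for w in text.split() if len(w) > 4]
        counts: Dict[str, int] = {}
        for word in words:
            counts[word] = counts.get(word, 0) + 1
        return sorted(counts, key=counts.get, reverse=True)[:10]

    def _extract_dates(self, text: str) -> List[str]:
        """Simple numeric date formats only (e.g. 12/03/2025)."""
        return re.findall(r'\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b', text)

    def _extract_numbers(self, text: str) -> List[float]:
        """First ten numeric values found in the text."""
        return [float(n) for n in re.findall(r'\b\d+(?:\.\d+)?\b', text)[:10]]

    def _extract_technical_terms(self, text: str) -> List[str]:
        """Collect matches for the technical patterns defined in __init__."""
        terms: List[str] = []
        for pattern in self.technical_patterns:
            terms.extend(re.findall(pattern, text))
        return list(set(terms))[:10]

    def _generate_summary(self, text: str) -> str:
        """Naive summary: the first sentence, truncated to 200 characters."""
        return text.split('.')[0].strip()[:200]
```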
Metadata-Enhanced Chunking Integration¶
Now let's combine hierarchical chunking with metadata extraction:
```python
class MetadataEnhancedChunker:
    """Chunker that enriches chunks with extracted metadata."""

    def __init__(self, max_chunk_size: int = 1000):
        self.hierarchical_chunker = HierarchicalChunker(max_chunk_size=max_chunk_size)
        self.metadata_extractor = MetadataExtractor()

    def create_enhanced_chunks(self, document: Document) -> List[Document]:
        """Create chunks with rich metadata."""
        # First, create hierarchical chunks
        chunks = self.hierarchical_chunker.create_hierarchical_chunks(document)

        # Enhance each chunk with extracted metadata
        enhanced_chunks = []
        for chunk in chunks:
            enhanced_chunk = self._enhance_chunk_metadata(chunk)
            enhanced_chunks.append(enhanced_chunk)

        return enhanced_chunks

    def _enhance_chunk_metadata(self, chunk: Document) -> Document:
        """Enhance chunk with extracted metadata."""
        # Extract metadata from chunk content
        extracted_metadata = self.metadata_extractor.extract_enhanced_metadata(chunk.page_content)

        # Merge extracted metadata with existing metadata
        enhanced_metadata = {
            **chunk.metadata,
            "entities": extracted_metadata.entities,
            "keywords": extracted_metadata.keywords,
            "topics": extracted_metadata.topics,
            "technical_terms": extracted_metadata.technical_terms,
            "difficulty_level": extracted_metadata.difficulty_level,
            "content_summary": extracted_metadata.content_summary,
            "enhanced_at": datetime.now().isoformat()
        }

        # Create searchable content that includes metadata
        searchable_content = self._create_searchable_content(chunk.page_content, extracted_metadata)

        return Document(page_content=searchable_content, metadata=enhanced_metadata)

    def _create_searchable_content(self, original_content: str, metadata: ExtractedMetadata) -> str:
        """Create enhanced searchable content."""
        metadata_text_parts = []

        if metadata.keywords:
            metadata_text_parts.append(f"Keywords: {', '.join(metadata.keywords)}")

        if metadata.topics:
            metadata_text_parts.append(f"Topics: {', '.join(metadata.topics)}")

        if metadata.entities:
            metadata_text_parts.append(f"Entities: {', '.join(metadata.entities[:5])}")

        metadata_text = "\n".join(metadata_text_parts)

        # Combine original content with metadata
        if metadata_text:
            return f"{original_content}\n\n--- Metadata ---\n{metadata_text}"
        else:
            return original_content
```
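End to end, the metadata-enhanced chunker is driven from a LangChain `Document`; a small, self-contained usage sketch (the sample content and source name are made up):

```python
from langchain.schema import Document

doc = Document(
    page_content=(
        "# Overview\n"
        "Acme Corp provides software and data services.\n\n"
        "## Pricing\n"
        "The annual license is 12000 dollars, payable in advance.\n"
    ),
    metadata={"source": "contract.md"},
)

chunker = MetadataEnhancedChunker(max_chunk_size=800)
enhanced_chunks = chunker.create_enhanced_chunks(doc)

for chunk in enhanced_chunks:
    print(chunk.metadata.get("section_title"), chunk.metadata.get("topics"))
```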
Part 4: Multi-Modal & Quality Assessment (Observer: 8 min | Participant: 15 min)¶
Handling Complex Document Types¶
Real-world documents often contain tables, images, and mixed content that require specialized processing to maintain their meaning and relationships.
Table-Aware Processing (Observer Focus)¶
Tables are particularly challenging because they contain structured relationships that must be preserved:
```python
def process_simple_table(table_text: str) -> Dict[str, Any]:
    """Process table content while preserving structure."""
    lines = table_text.strip().split('\n')
    table_lines = [line for line in lines if '|' in line]

    if not table_lines:
        return {"error": "No table structure found"}

    # Extract headers from first row
    header_row = table_lines[0]
    headers = [cell.strip() for cell in header_row.split('|') if cell.strip()]

    # Count data rows (excluding header and separator)
    data_rows = len(table_lines) - 2 if len(table_lines) > 2 else 0

    # Create enhanced description
    description = f"Table with {data_rows} rows and {len(headers)} columns"
    if headers:
        description += f" containing data about: {', '.join(headers)}"

    return {
        "enhanced_content": f"{description}\n\n{table_text}",
        "metadata": {
            "content_type": "table",
            "row_count": data_rows,
            "column_count": len(headers),
            "headers": headers
        }
    }
```
Table Processing Benefits:

- Structure preservation: Tables remain intact and meaningful
- Enhanced searchability: Descriptive text makes tables discoverable
- Metadata enrichment: Table characteristics available for retrieval
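For example, applied to a small markdown table:

```python
table_text = """| Product | Price | Stock |
|---------|-------|-------|
| Widget  | 9.99  | 120   |
| Gadget  | 24.50 | 35    |"""

result = process_simple_table(table_text)
print(result["metadata"])
# -> {'content_type': 'table', 'row_count': 2, 'column_count': 3,
#     'headers': ['Product', 'Price', 'Stock']}
```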
Basic Quality Assessment¶
Let's implement simple metrics to evaluate chunk quality:
```python
def assess_basic_quality(chunks: List[str]) -> Dict[str, Any]:
    """Assess basic quality metrics for chunks."""
    if not chunks:
        return {"error": "No chunks to assess"}

    # Calculate size consistency
    chunk_sizes = [len(chunk) for chunk in chunks]
    avg_size = sum(chunk_sizes) / len(chunk_sizes)
    size_variance = sum((size - avg_size) ** 2 for size in chunk_sizes) / len(chunk_sizes)
    size_consistency = 1.0 / (1.0 + size_variance / (avg_size ** 2))

    # Calculate information density
    densities = []
    for chunk in chunks:
        words = chunk.split()
        unique_words = set(words)
        if words:
            density = len(unique_words) / len(words)
            densities.append(density)

    avg_density = sum(densities) / len(densities) if densities else 0

    return {
        "size_consistency": size_consistency,
        "avg_information_density": avg_density,
        "chunk_count": len(chunks),
        "avg_chunk_size": avg_size
    }
```
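Applied to a few plain-text chunks (for `Document` chunks you would pass each `chunk.page_content` instead):

```python
raw_chunks = [
    "Payment is due within 30 days of the invoice date.",
    "Late payments accrue interest at 1.5 percent per month.",
    "Either party may terminate the agreement with 60 days notice.",
]

metrics = assess_basic_quality(raw_chunks)
print(metrics["size_consistency"], metrics["avg_information_density"])
```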
PARTICIPANT PATH: Complete Processing Pipeline¶
Let's build a comprehensive pipeline that combines all our advanced techniques:
```python
class AdvancedProcessingPipeline:
    """Complete advanced document processing pipeline."""

    def __init__(self, max_chunk_size: int = 1000, enable_quality_assessment: bool = True):
        self.max_chunk_size = max_chunk_size
        self.enable_quality_assessment = enable_quality_assessment

        # Initialize processors
        self.metadata_chunker = MetadataEnhancedChunker(max_chunk_size=max_chunk_size)
        self.quality_assessor = ChunkQualityAssessor() if enable_quality_assessment else None

    def process_document(self, document: Document) -> Dict[str, Any]:
        """Process document using the most appropriate strategy."""
        # Analyze document characteristics
        doc_analysis = self._analyze_document_complexity(document)

        # Choose processing strategy
        if doc_analysis["has_tables"]:
            print("Detected tables - using table-aware processing...")
            processed_chunks = self._process_with_table_awareness(document)
        else:
            print("Using standard hierarchical processing...")
            processed_chunks = self.metadata_chunker.create_enhanced_chunks(document)

        # Assess quality if enabled
        quality_metrics = {}
        if self.enable_quality_assessment and self.quality_assessor:
            quality_metrics = self.quality_assessor.assess_chunk_quality(processed_chunks)

        # Add processing metadata
        for chunk in processed_chunks:
            chunk.metadata.update({
                "processing_strategy": doc_analysis["recommended_strategy"],
                "document_complexity": doc_analysis["complexity_score"],
                "processing_pipeline": "advanced_v2"
            })

        return {
            "chunks": processed_chunks,
            "document_analysis": doc_analysis,
            "quality_metrics": quality_metrics,
            "processing_stats": {
                "chunk_count": len(processed_chunks),
                "total_processed_chars": sum(len(c.page_content) for c in processed_chunks),
                "avg_chunk_size": sum(len(c.page_content) for c in processed_chunks) / len(processed_chunks)
            }
        }

    def _analyze_document_complexity(self, document: Document) -> Dict[str, Any]:
        """Analyze document to determine optimal processing strategy."""
        content = document.page_content

        # Detect various content types
        has_tables = "|" in content and content.count("|") > 5
        has_code = "```" in content or content.count("    ") > 3  # fenced blocks or 4-space indents
        has_lists = content.count("- ") > 3 or content.count("* ") > 3
        has_headings = content.count("#") > 2

        # Calculate complexity score
        complexity_score = 0
        if has_tables: complexity_score += 3
        if has_code: complexity_score += 2
        if has_lists: complexity_score += 1
        if has_headings: complexity_score += 2

        # Determine strategy
        if has_tables:
            strategy = "table_aware"
        elif complexity_score > 4:
            strategy = "hierarchical"
        else:
            strategy = "standard"

        return {
            "has_tables": has_tables,
            "has_code": has_code,
            "has_lists": has_lists,
            "has_headings": has_headings,
            "complexity_score": complexity_score,
            "recommended_strategy": strategy
        }
```
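The table-aware branch calls a `_process_with_table_awareness` method that is left unspecified here. One possible sketch is to run `process_simple_table` over contiguous table regions so their descriptions travel with the content, then hand the enriched document to the metadata chunker; the line-based table detection below is an assumption:

```python
    def _process_with_table_awareness(self, document: Document) -> List[Document]:
        """Illustrative sketch: describe tables in place, then chunk as usual."""
        enriched_lines: List[str] = []
        table_buffer: List[str] = []

        def flush_table():
            if table_buffer:
                result = process_simple_table("\n".join(table_buffer))
                enriched_lines.append(result.get("enhanced_content", "\n".join(table_buffer)))
                table_buffer.clear()

        for line in document.page_content.split("\n"):
            if "|" in line:
                table_buffer.append(line)
            else:
                flush_table()
                enriched_lines.append(line)
        flush_table()

        enriched_doc = Document(page_content="\n".join(enriched_lines),
                                metadata=document.metadata)
        return self.metadata_chunker.create_enhanced_chunks(enriched_doc)
```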
IMPLEMENTER PATH: Enterprise Quality Control¶
For production systems, implement comprehensive quality assessment:
```python
class ChunkQualityAssessor:
    """Comprehensive chunk quality assessment."""

    def assess_chunk_quality(self, chunks: List[Document]) -> Dict[str, float]:
        """Multi-dimensional quality assessment."""
        if not chunks:
            return {metric: 0.0 for metric in
                    ["coherence_score", "information_density", "size_consistency",
                     "metadata_richness", "overall_quality"]}

        # Calculate individual metrics
        coherence = self._calculate_coherence_score(chunks)
        density = self._calculate_information_density(chunks)
        consistency = self._calculate_size_consistency(chunks)
        metadata_richness = self._calculate_metadata_richness(chunks)

        overall_quality = (coherence + density + consistency + metadata_richness) / 4

        return {
            "coherence_score": coherence,
            "information_density": density,
            "size_consistency": consistency,
            "metadata_richness": metadata_richness,
            "overall_quality": overall_quality
        }

    def _calculate_coherence_score(self, chunks: List[Document]) -> float:
        """Calculate topic coherence between adjacent chunks."""
        if len(chunks) < 2:
            return 1.0

        coherence_scores = []
        for i in range(len(chunks) - 1):
            current_topics = set(chunks[i].metadata.get("topics", []))
            next_topics = set(chunks[i + 1].metadata.get("topics", []))

            if current_topics and next_topics:
                overlap = len(current_topics & next_topics)
                union = len(current_topics | next_topics)
                score = overlap / union if union > 0 else 0
                coherence_scores.append(score)

        return sum(coherence_scores) / len(coherence_scores) if coherence_scores else 0.0

    def _calculate_information_density(self, chunks: List[Document]) -> float:
        """Calculate average information density across chunks."""
        densities = []
        for chunk in chunks:
            words = chunk.page_content.split()
            unique_words = set(words)
            if words:
                density = len(unique_words) / len(words)
                densities.append(density)

        return sum(densities) / len(densities) if densities else 0.0

    def _calculate_metadata_richness(self, chunks: List[Document]) -> float:
        """Assess metadata completeness across chunks."""
        expected_fields = ["topics", "entities", "keywords", "difficulty_level"]
        richness_scores = []

        for chunk in chunks:
            present_fields = sum(1 for field in expected_fields
                                 if field in chunk.metadata and chunk.metadata[field])
            score = present_fields / len(expected_fields)
            richness_scores.append(score)

        return sum(richness_scores) / len(richness_scores) if richness_scores else 0.0
```
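The `_calculate_size_consistency` helper is not shown above; a sketch that mirrors the variance formula from the simpler `assess_basic_quality` function:

```python
    def _calculate_size_consistency(self, chunks: List[Document]) -> float:
        """Score in [0, 1]; higher means chunk sizes cluster tightly around the mean."""
        sizes = [len(chunk.page_content) for chunk in chunks]
        avg_size = sum(sizes) / len(sizes)
        if avg_size == 0:
            return 0.0
        variance = sum((size - avg_size) ** 2 for size in sizes) / len(sizes)
        return 1.0 / (1.0 + variance / (avg_size ** 2))
```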
Optional Deep-Dive Modules¶
OPTIONAL CONTENT - Choose based on your goals:
- Module A: Advanced Document Analytics - Specialized techniques and optimization patterns
- Module B: Enterprise Content Processing - Production deployment and scaling strategies
Multiple Choice Test - Session 2¶
Test your understanding of advanced chunking and preprocessing concepts:
Question 1: What is the primary benefit of detecting content types (headings, tables, code) during document analysis?
A) Reduces processing time
B) Enables structure-aware chunking that preserves meaning
C) Reduces storage requirements
D) Improves embedding quality
Question 2: In hierarchical chunking, why is it important to track element hierarchy levels?
A) To reduce memory usage
B) To simplify the codebase
C) To improve processing speed
D) To preserve document structure and create meaningful chunk boundaries
Question 3: What is the main advantage of extracting entities, keywords, and topics during preprocessing?
A) Reduces chunk size
B) Enables more precise retrieval through enriched context
C) Simplifies the chunking process
D) Improves computational efficiency
Question 4: Why do tables require specialized processing in RAG systems?
A) Tables use different encoding formats
B) Tables contain more text than paragraphs
C) Tables are always larger than the chunk size
D) Tables have structured relationships that are lost in naive chunking
Question 5: When processing documents with images, what is the best practice for RAG systems?
A) Store images as binary data in chunks
B) Create separate chunks for each image
C) Replace image references with descriptive text
D) Ignore images completely
Question 6: Which metric is most important for measuring chunk coherence in hierarchical chunking?
A) Topic consistency between related chunks
B) Number of chunks created
C) Average chunk size
D) Processing speed
Question 7: What is the optimal overlap ratio for hierarchical chunks?
A) 100% - complete duplication
B) 0% - no overlap needed
C) 10-20% - balanced context and efficiency
D) 50% - maximum context preservation
Question 8: Why should the advanced processing pipeline analyze document complexity before choosing a processing strategy?
A) To select the most appropriate processing approach for the content type
B) To set the embedding model parameters
C) To reduce computational costs
D) To determine the number of chunks to create
View Test Solutions →
Navigation¶
Previous: Session 1 - Basic RAG Implementation
Optional Deep Dive Modules:
- Module A: Advanced Document Analytics - Deep analytics for document structure analysis
- Module B: Enterprise Content Processing - Enterprise-scale content processing strategies
Next: Session 3 - Vector Databases & Search Optimization →