📝 Session 2: Hierarchical Chunking Practice¶

📝 PARTICIPANT PATH CONTENT Prerequisites: Complete 🎯 Observer path concepts Time Investment: 2-3 hours Outcome: Build structure-aware chunking systems

Learning Outcomes¶

After completing this implementation guide, you will:

Build hierarchical chunkers that respect document structure
Implement intelligent overlap management for context preservation
Create section-aware grouping algorithms
Handle edge cases in document structure analysis

Advanced Hierarchical Chunker Implementation¶

The simple approach demonstrates the concept, but production systems need more sophisticated handling of edge cases, overlap management, and metadata preservation. Our advanced chunker addresses these requirements.

Hierarchical Chunker - Core Initialization¶

Let's start with the chunker's initialization and main entry point:

from langchain.schema import Document

class HierarchicalChunker:
    """Creates intelligent chunks based on document hierarchy."""

    def __init__(self, max_chunk_size: int = 1000, overlap_ratio: float = 0.1):
        self.max_chunk_size = max_chunk_size
        self.overlap_ratio = overlap_ratio
        self.analyzer = DocumentStructureAnalyzer()

The hierarchical chunker initializes with configurable parameters for chunk size and overlap. The overlap_ratio (typically 0.1 = 10%) ensures continuity between chunks by including some content from the previous chunk in the next one.

This prevents context loss at chunk boundaries, which is crucial for maintaining semantic coherence.

Three-Step Chunking Process¶

    def create_hierarchical_chunks(self, document: Document) -> List[Document]:
        """Create chunks that preserve document hierarchy."""
        # Step 1: Analyze document structure
        elements = self.analyzer.analyze_structure(document.page_content)

        # Step 2: Group elements into logical sections
        sections = self._group_elements_by_hierarchy(elements)

        # Step 3: Create chunks from sections
        chunks = []
        for section in sections:
            section_chunks = self._chunk_section(section, document.metadata)
            chunks.extend(section_chunks)

        return chunks

This main method orchestrates the three-step chunking process: analyze structure, group by hierarchy, and create chunks. Notice how we preserve the original document metadata throughout the process.

This ensures each chunk maintains its provenance and context information.

Section Grouping Logic¶

The heart of hierarchical chunking is understanding document structure:

Hierarchical Section Grouping¶

    def _group_elements_by_hierarchy(self, elements: List[DocumentElement]) -> List[List[DocumentElement]]:
        """Group elements into hierarchical sections."""
        sections = []
        current_section = []
        current_level = -1

        for element in elements:
            # Start new section on same or higher level heading
            if (element.element_type == ContentType.HEADING and
                element.level <= current_level and current_section):

                sections.append(current_section)
                current_section = [element]
                current_level = element.level

This logic implements the core hierarchical principle: start a new section when you encounter a heading at the same level or higher (closer to root) than the current section.

This respects document hierarchy - a new "## Introduction" section should close any previous "### Details" subsection.

Edge Case Handling¶

            elif element.element_type == ContentType.HEADING and not current_section:
                # First heading initializes first section
                current_section = [element]
                current_level = element.level
            else:
                # Add element to current section
                current_section.append(element)

        # Add final section
        if current_section:
            sections.append(current_section)

        return sections

The grouping logic handles edge cases: the first heading initializes the first section, and we ensure the final section isn't lost. This creates logical document sections that can be processed independently while maintaining their internal structure.

Intelligent Section Chunking¶

Once sections are identified, we chunk them with size management and overlap:

Section Chunking with Size Management¶

    def _chunk_section(self, section: List[DocumentElement],
                      base_metadata: Dict) -> List[Document]:
        """Create chunks from a document section with intelligent overlap."""
        chunks = []
        current_chunk_elements = []
        current_size = 0

        section_title = self._extract_section_title(section)

        for element in section:
            element_size = len(element.content)

            # Check if adding this element would exceed size limit
            if current_size + element_size > self.max_chunk_size and current_chunk_elements:

This method balances structure preservation with size constraints. We track both the elements and their cumulative size, making decisions based on content boundaries rather than arbitrary character counts.

The section title extraction provides context for each chunk.

Intelligent Overlap Management¶

                # Create chunk from current elements
                chunk = self._create_chunk_from_elements(
                    current_chunk_elements, base_metadata, section_title
                )
                chunks.append(chunk)

                # Start new chunk with overlap for continuity
                overlap_elements = self._get_overlap_elements(current_chunk_elements)
                current_chunk_elements = overlap_elements + [element]
                current_size = sum(len(e.content) for e in current_chunk_elements)

When a size limit is reached, we create a chunk and start the next one with intelligent overlap. The overlap elements typically include the last few sentences or the section heading, ensuring context continuity.

This prevents information loss at chunk boundaries.

Element Accumulation Logic¶

            else:
                current_chunk_elements.append(element)
                current_size += element_size

        # Create final chunk
        if current_chunk_elements:
            chunk = self._create_chunk_from_elements(
                current_chunk_elements, base_metadata, section_title
            )
            chunks.append(chunk)

        return chunks

The method handles the accumulation case (element fits in current chunk) and ensures the final chunk is created. This systematic approach ensures no content is lost while respecting both structural and size constraints.

Rich Chunk Creation with Metadata¶

The final step creates chunks with comprehensive metadata for enhanced retrieval:

Content Assembly with Formatting¶

    def _create_chunk_from_elements(self, elements: List[DocumentElement],
                                  base_metadata: Dict, section_title: str) -> Document:
        """Create a document chunk with rich metadata."""
        # Combine element content with proper formatting
        content_parts = []
        for element in elements:
            if element.element_type == ContentType.HEADING:
                content_parts.append(f"\n{element.content}\n")
            else:
                content_parts.append(element.content)

        content = "\n".join(content_parts).strip()

Content assembly preserves formatting by treating headings specially - they get extra spacing to maintain their visual prominence. This formatting preservation helps both human readers and embedding models understand the content structure.

Enhanced Metadata Creation¶

        # Build enhanced metadata
        content_types = [e.element_type.value for e in elements]
        hierarchy_levels = [e.level for e in elements]

        enhanced_metadata = {
            **base_metadata,
            "section_title": section_title,
            "chunk_type": "hierarchical",
            "content_types": list(set(content_types)),
            "hierarchy_levels": hierarchy_levels,
            "element_count": len(elements),
            "has_heading": ContentType.HEADING.value in content_types,
            "has_table": ContentType.TABLE.value in content_types,
            "has_code": ContentType.CODE.value in content_types,
            "min_hierarchy_level": min(hierarchy_levels),
            "max_hierarchy_level": max(hierarchy_levels)
        }

        return Document(page_content=content, metadata=enhanced_metadata)

The metadata enhancement provides multiple search dimensions: content types enable filtering ("find chunks with code"), hierarchy levels support structure-aware retrieval, and boolean flags enable quick filtering.

This rich metadata transforms simple text chunks into searchable, contextual knowledge units.

Supporting Helper Methods¶

Section Title Extraction¶

    def _extract_section_title(self, section: List[DocumentElement]) -> str:
        """Extract title from section elements."""
        for element in section:
            if element.element_type == ContentType.HEADING:
                return element.content.strip('#').strip()
        return "Untitled Section"

Section title extraction identifies the main heading within a section to provide context.

Overlap Element Selection¶

    def _get_overlap_elements(self, elements: List[DocumentElement]) -> List[DocumentElement]:
        """Get elements for chunk overlap."""
        if not elements:
            return []

        # Calculate overlap size
        total_size = sum(len(e.content) for e in elements)
        overlap_size = int(total_size * self.overlap_ratio)

        # Select last elements that fit in overlap
        overlap_elements = []
        current_size = 0

        for element in reversed(elements):
            element_size = len(element.content)
            if current_size + element_size <= overlap_size:
                overlap_elements.insert(0, element)
                current_size += element_size
            else:
                break

        return overlap_elements

Overlap element selection ensures the next chunk begins with some context from the previous chunk, maintaining semantic continuity while respecting the configured overlap ratio.

📝 Practice Exercises¶

Exercise 1: Basic Hierarchical Chunker¶

Implement a simplified hierarchical chunker and test it on a structured document:

# Test with sample document
sample_doc = """
# Machine Learning Guide

Machine learning is a subset of artificial intelligence.

## Supervised Learning

Supervised learning uses labeled training data to make predictions.

### Classification

Classification algorithms predict discrete categories.

#### Decision Trees

Decision trees split data based on feature values.

#### Random Forest

Random forest combines multiple decision trees.

### Regression

Regression algorithms predict continuous values.

## Unsupervised Learning

Unsupervised learning finds patterns without labels.
"""

# Test your implementation
chunker = HierarchicalChunker(max_chunk_size=300)
chunks = chunker.create_hierarchical_chunks(Document(page_content=sample_doc))

print(f"Created {len(chunks)} chunks:")
for i, chunk in enumerate(chunks):
    print(f"\nChunk {i+1}:")
    print(f"Size: {len(chunk.page_content)} characters")
    print(f"Content types: {chunk.metadata.get('content_types', [])}")
    print(f"Section title: {chunk.metadata.get('section_title', 'N/A')}")
    print(f"Preview: {chunk.page_content[:100]}...")

Exercise 2: Overlap Testing¶

Create chunks with different overlap ratios and analyze the continuity:

# Test different overlap ratios
overlap_ratios = [0.0, 0.1, 0.2, 0.3]

for ratio in overlap_ratios:
    chunker = HierarchicalChunker(max_chunk_size=200, overlap_ratio=ratio)
    chunks = chunker.create_hierarchical_chunks(Document(page_content=sample_doc))

    print(f"\nOverlap ratio: {ratio}")
    print(f"Number of chunks: {len(chunks)}")

    # Analyze overlap between adjacent chunks
    for i in range(len(chunks) - 1):
        chunk1_words = set(chunks[i].page_content.split())
        chunk2_words = set(chunks[i + 1].page_content.split())
        overlap = len(chunk1_words & chunk2_words)
        total = len(chunk1_words | chunk2_words)
        overlap_percentage = (overlap / total) * 100 if total > 0 else 0
        print(f"  Chunks {i+1}-{i+2} overlap: {overlap_percentage:.1f}%")

Exercise 3: Section Boundary Analysis¶

Test how the chunker handles different heading hierarchies:

# Document with complex hierarchy
complex_doc = """
# Chapter 1: Introduction

This is the introduction to our topic.

## Section 1.1: Background

Background information here.

### Subsection 1.1.1: History

Historical context.

### Subsection 1.1.2: Evolution

How things evolved.

## Section 1.2: Current State

Current situation analysis.

# Chapter 2: Methods

New chapter begins here.

## Section 2.1: Methodology

Our approach to the problem.
"""

# Analyze section boundaries
chunker = HierarchicalChunker(max_chunk_size=400)
elements = chunker.analyzer.analyze_structure(complex_doc)

print("Document structure analysis:")
for element in elements:
    if element.element_type == ContentType.HEADING:
        indent = "  " * element.level
        print(f"{indent}Level {element.level}: {element.content}")

Troubleshooting Common Issues¶

Issue 1: Chunks Too Small¶

Problem: Chunks contain only single elements or very little content. Solution: Adjust max_chunk_size or modify section grouping logic:

# Add minimum chunk size constraint
def _chunk_section(self, section, base_metadata, min_chunk_size=150):
    # ... existing code ...

    # Only create chunk if minimum size is met
    if current_size >= min_chunk_size or not chunks:
        chunk = self._create_chunk_from_elements(...)
        chunks.append(chunk)

Issue 2: Overlap Too Large¶

Problem: Overlap creates excessive duplication between chunks. Solution: Implement smart overlap that prioritizes important content:

def _get_smart_overlap_elements(self, elements):
    """Get overlap elements prioritizing headings and key content."""
    overlap_elements = []

    # Always include section heading if present
    for element in elements:
        if element.element_type == ContentType.HEADING:
            overlap_elements.append(element)
            break

    # Add last few sentences up to overlap limit
    # ... implementation details ...

    return overlap_elements

Issue 3: Metadata Inconsistency¶

Problem: Chunk metadata varies inconsistently across similar content. Solution: Standardize metadata extraction with validation:

def _validate_chunk_metadata(self, metadata):
    """Ensure metadata consistency."""
    required_fields = ["section_title", "content_types", "chunk_type"]

    for field in required_fields:
        if field not in metadata:
            metadata[field] = self._get_default_value(field)

    return metadata

Advanced Optimization Techniques¶

Dynamic Chunk Sizing¶

Adapt chunk size based on content complexity:

def _calculate_adaptive_chunk_size(self, section):
    """Calculate optimal chunk size for section."""
    base_size = self.max_chunk_size

    # Increase size for code-heavy sections
    code_elements = sum(1 for e in section if e.element_type == ContentType.CODE)
    if code_elements > 2:
        return base_size * 1.3

    # Decrease size for list-heavy sections
    list_elements = sum(1 for e in section if e.element_type == ContentType.LIST)
    if list_elements > 5:
        return base_size * 0.8

    return base_size

Quality-Based Chunk Validation¶

Validate chunk quality during creation:

def _validate_chunk_quality(self, chunk):
    """Validate chunk meets quality standards."""
    content = chunk.page_content
    words = content.split()

    # Check minimum information content
    if len(words) < 20:
        return False, "Chunk too short"

    # Check for incomplete sentences
    if not content.strip().endswith(('.', '!', '?', ':')):
        return False, "Chunk ends mid-sentence"

    # Check for balanced content types
    metadata = chunk.metadata
    if len(metadata.get('content_types', [])) == 0:
        return False, "No content type detected"

    return True, "Quality check passed"

Key Implementation Tips¶

Best Practices¶

Size Management: Balance structure preservation with practical limits
Overlap Strategy: Use 10-15% overlap for optimal context preservation
Metadata Richness: Include comprehensive metadata for retrieval enhancement
Edge Case Handling: Test with various document structures and formats

Performance Optimization¶

Lazy Loading: Process elements on-demand for large documents
Caching: Cache structure analysis results for repeated processing
Parallel Processing: Process independent sections concurrently
Memory Management: Clear intermediate data structures regularly

Testing Strategy¶

Unit Tests: Test individual methods with controlled inputs
Integration Tests: Test full pipeline with realistic documents
Edge Case Tests: Handle malformed documents and unusual structures
Performance Tests: Measure processing time and memory usage

Previous: Session 1 - Foundations →
Next: Session 3 - Advanced Patterns →