Overview

DocSlicer transforms HTML documents into structured, semantic chunks ready for RAG applications.

What is DocSlicer?

DocSlicer is a document processing pipeline that takes HTML documents (such as SEC filings, legal documents, technical documentation, or any structured HTML) and breaks them down into semantic chunks that carry their hierarchy, metadata, and optional embeddings.

Unlike simple text splitters that naively cut documents at fixed character counts, DocSlicer understands document structure and preserves semantic meaning across chunks. Each chunk maintains context through heading paths, section references, and page labels.
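
To make the contrast concrete, here is a minimal sketch that compares a fixed-size splitter with a heading-aware one built on Python's standard html.parser module. It is not DocSlicer's implementation: the names fixed_size_split and heading_aware_split, the rule "start a new chunk at every h1 to h6 tag", and the output dictionary shape are assumptions chosen for illustration only.

```python
from html.parser import HTMLParser
from typing import Optional

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


def fixed_size_split(text: str, size: int) -> list[str]:
    """Naive splitter: cut every `size` characters, ignoring structure."""
    return [text[i:i + size] for i in range(0, len(text), size)]


class _HeadingSplitter(HTMLParser):
    """Illustrative parser: one chunk per run of text under a heading."""

    def __init__(self) -> None:
        super().__init__()
        self.path: list[tuple[int, str]] = []  # (heading level, heading text)
        self.chunks: list[dict] = []
        self._buffer: list[str] = []
        self._heading_level: Optional[int] = None
        self._heading_text: list[str] = []

    def _flush(self) -> None:
        # Emit the buffered body text as one chunk, with whitespace normalized.
        text = " ".join(" ".join(self._buffer).split())
        if text:
            self.chunks.append({
                "text": text,
                "chunk_heading": self.path[-1][1] if self.path else "",
                "chunk_heading_path": " > ".join(h for _, h in self.path),
            })
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self._flush()  # a new heading closes the previous chunk
            self._heading_level = int(tag[1])
            self._heading_text = []

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS and self._heading_level is not None:
            level = self._heading_level
            # Drop headings at the same or a deeper level before descending.
            while self.path and self.path[-1][0] >= level:
                self.path.pop()
            self.path.append((level, "".join(self._heading_text).strip()))
            self._heading_level = None

    def handle_data(self, data):
        if self._heading_level is not None:
            self._heading_text.append(data)
        else:
            self._buffer.append(data)

    def close(self):
        super().close()
        self._flush()  # emit whatever follows the last heading


def heading_aware_split(html: str) -> list[dict]:
    """Split HTML at headings, recording each chunk's heading path."""
    parser = _HeadingSplitter()
    parser.feed(html)
    parser.close()
    return parser.chunks
```

The fixed-size splitter can cut a sentence in half and loses all notion of where the text came from, while the heading-aware version keeps each chunk inside one section and records the path that led to it.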

Why DocSlicer?

Structure-Aware Chunking

Respects document hierarchy and semantic boundaries instead of arbitrarily splitting mid-sentence or mid-paragraph.

Rich Metadata

Every chunk includes its heading, full heading path, page references, and position in the document hierarchy.

Production Ready

Simple API, multiple output formats (JSON, CSV, JSONL, Parquet), and optional embedding generation.

What You Get

Processing a document returns an array of chunks. Each chunk contains:

{
  "text": "The actual text content...",
  "chunk_heading": "Section 3.1",
  "chunk_heading_path": "Legal Terms > Section 3 > Section 3.1",
  "page_label": "Page 12",
  "chunk_index": 0,
  "embedding": [0.123, -0.456, ...]  // Optional
}
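
As one way to consume this output, the sketch below loads an array of chunks from a JSON string into a small dataclass and prefixes each chunk's text with its heading path and page label before it is handed to a retriever or prompt. The field names come from the sample above; the Chunk class and the load_chunks and with_context helpers are hypothetical and not part of DocSlicer's API.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class Chunk:
    """Mirror of the chunk fields shown in the sample above."""
    text: str
    chunk_heading: str
    chunk_heading_path: str
    page_label: str
    chunk_index: int
    embedding: Optional[list[float]] = None  # None when embeddings are absent


def load_chunks(raw: str) -> list[Chunk]:
    """Parse a JSON array of chunk objects (hypothetical helper)."""
    return [
        Chunk(
            text=item["text"],
            chunk_heading=item["chunk_heading"],
            chunk_heading_path=item["chunk_heading_path"],
            page_label=item["page_label"],
            chunk_index=item["chunk_index"],
            embedding=item.get("embedding"),  # optional field
        )
        for item in json.loads(raw)
    ]


def with_context(chunk: Chunk) -> str:
    """Prefix the chunk text with where it sits in the document."""
    return f"[{chunk.chunk_heading_path} | {chunk.page_label}]\n{chunk.text}"
```

Carrying the heading path alongside the text lets a downstream model see that a passage belongs to, say, "Legal Terms > Section 3 > Section 3.1" even when the passage itself never mentions the section name.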

Use Cases

  • RAG Applications: Build question-answering systems with better retrieval accuracy
  • Document Analysis: Process legal, financial, or technical documents at scale
  • Knowledge Bases: Convert documentation into searchable, structured data
  • Semantic Search: Generate embeddings for vector similarity search
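
For the semantic search use case, retrieval over chunk embeddings usually comes down to ranking chunks by cosine similarity to a query embedding. The sketch below is a minimal pure-Python version under stated assumptions: chunks are dictionaries with an optional "embedding" field as in the sample output, the query has been embedded with the same model as the chunks, and the top_k helper is a hypothetical name. A production system would more likely delegate this to a vector database.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_embedding: list[float], chunks: list[dict], k: int = 3) -> list[dict]:
    """Return the k chunks most similar to the query (hypothetical helper).

    Chunks without an embedding are skipped, since the field is optional.
    """
    scored = [
        (cosine_similarity(query_embedding, chunk["embedding"]), chunk)
        for chunk in chunks
        if chunk.get("embedding")
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:k]]
```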