No chunking for markdown files — entire document sent as single LLM message, fails on context overflow #73

@jr2804

Problem

OpenKB sends the entire source document as one LLM message when compiling markdown files. There is no splitting, truncation, or chunking strategy for .md sources.

In agent/compiler.py, compile_short_doc() injects the full document text into a single prompt:

content = source_path.read_text(encoding="utf-8")
doc_msg = {"role": "user", "content": _cached_text(_SUMMARY_USER.format(
    doc_name=doc_name, content=content,
))}

The long-document path (index_long_document() via PageIndex) only triggers for PDFs with page count ≥ pageindex_threshold (default 20). Markdown files have no such fallback — they always take the short-doc code path regardless of size.

Impact: Unusable with local/private LLMs

The README and defaults suggest models like gpt-5.4-mini, implying cloud-scale models with 1M+ token context windows. But for private knowledge bases — the stated use case — users often need to run local models with context windows of 4K–32K tokens.

With a 32K context model, any markdown file exceeding ~24K tokens (roughly 96K characters / ~30 pages) will:

  • Fail outright with a context-length-exceeded API error, or
  • Produce truncated/garbled output when the LLM silently drops the tail of the prompt
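
The arithmetic behind those figures can be sketched as follows. This is a rough estimate only: it assumes about 4 characters per token (a common heuristic for English text, not a real tokenizer) and reserves 75% of the context window for the document. The names `estimate_tokens` and `fits_context` are hypothetical and not part of OpenKB.

```python
CHARS_PER_TOKEN = 4  # assumption: rough heuristic, real tokenizers vary by model


def estimate_tokens(text: str) -> int:
    """Approximate the token count as ceil(len(text) / CHARS_PER_TOKEN)."""
    return -(-len(text) // CHARS_PER_TOKEN)


def fits_context(text: str, context_tokens: int, budget_ratio: float = 0.75) -> bool:
    """Return True if the estimated token count stays within the prompt budget.

    budget_ratio leaves headroom for the system prompt and the model's output.
    """
    return estimate_tokens(text) <= int(context_tokens * budget_ratio)


# A 96K-character document just fits a 32K-token window at a 75% budget.
print(estimate_tokens("a" * 96_000))
print(fits_context("a" * 96_000, 32_000))
```

A check like this before the API call would at least turn a context overflow into an explicit, early error.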

For reference, our corpus of 3GPP technical specification documents includes converted markdown files ranging from a few KB to 14 MB. We had 3 documents that could never be ingested because they exceed any reasonable context window. By the same estimate, even mid-sized documents (~50 pages, roughly 40K tokens) do not fit 8K–16K context models.

What a fix could look like

A chunking strategy for markdown (and other text-based formats) similar to what other wiki frameworks implement:

  1. Heading-aware splitting: Split on #/##/### boundaries so chunks respect document structure
  2. Token-aware sizing: Estimate the token count of each chunk and split further once it exceeds a configurable threshold (e.g., 75% of model context)
  3. Hierarchical synthesis: Summarize each chunk individually, then synthesize chunk summaries into a final document summary
  4. Graceful degradation: At minimum, truncate with a [...truncated at N tokens...] marker instead of sending an oversized prompt that will fail
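
A minimal sketch of how the four points above could fit together, under stated assumptions: tokens are estimated at about 4 characters each rather than with a real tokenizer, and `summarize` is a hypothetical callable standing in for whatever LLM call the compiler makes. None of these function names exist in OpenKB.

```python
import re
from typing import Callable

CHARS_PER_TOKEN = 4  # assumption: rough heuristic, not a real tokenizer
HEADING_RE = re.compile(r"^#{1,3} ", re.MULTILINE)  # matches #, ## and ### headings


def split_on_headings(text: str) -> list[str]:
    """Split text so that each section starts at a #/##/### heading."""
    starts = [m.start() for m in HEADING_RE.finditer(text)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # keep any preamble before the first heading
    bounds = starts + [len(text)]
    return [text[a:b] for a, b in zip(bounds, bounds[1:]) if text[a:b]]


def chunk_markdown(text: str, max_tokens: int) -> list[str]:
    """Pack heading-delimited sections into chunks of at most max_tokens (estimated)."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    max_chars = max_tokens * CHARS_PER_TOKEN
    chunks: list[str] = []
    current = ""
    for section in split_on_headings(text):
        # A single oversized section is hard-split at character boundaries.
        for i in range(0, len(section), max_chars):
            piece = section[i:i + max_chars]
            if current and len(current) + len(piece) > max_chars:
                chunks.append(current)
                current = ""
            current += piece
    if current:
        chunks.append(current)
    return chunks


def summarize_hierarchically(
    text: str, max_tokens: int, summarize: Callable[[str], str]
) -> str:
    """Summarize each chunk, then summarize the joined chunk summaries."""
    chunks = chunk_markdown(text, max_tokens)
    if len(chunks) <= 1:
        return summarize(text)
    partials = [summarize(chunk) for chunk in chunks]
    return summarize("\n\n".join(partials))


def truncate_with_marker(text: str, max_tokens: int) -> str:
    """Fallback: cut the text at the budget and append a visible marker."""
    max_chars = max_tokens * CHARS_PER_TOKEN
    if len(text) <= max_chars:
        return text
    return text[:max_chars] + f"\n[...truncated at {max_tokens} tokens...]"
```

Two limitations of this sketch: the heading regex also matches `# ` lines inside fenced code blocks, and the synthesis step is single-level, so for very large documents the joined chunk summaries may themselves need another chunking pass.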

The existing pageindex_threshold config key could be extended to apply to markdown files (e.g., character or token count threshold), or a new config key could be introduced.
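
As an illustration of the second option, the routing decision could look roughly like this. Everything here is hypothetical: `CompileConfig`, the `markdown_token_threshold` key, and `choose_path` are invented for the sketch. Only `pageindex_threshold` and its default of 20 come from the current behavior described above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CompileConfig:
    """Hypothetical config object, not OpenKB's real one."""
    pageindex_threshold: int = 20            # existing key: PDF page count
    markdown_token_threshold: int = 24_000   # hypothetical new key


def choose_path(suffix: str, page_count: Optional[int], text: str,
                cfg: CompileConfig) -> str:
    """Return which compile path a document would take: long, chunked or short."""
    if suffix == ".pdf" and page_count is not None:
        return "long" if page_count >= cfg.pageindex_threshold else "short"
    # Assumption: ~4 characters per token, a heuristic rather than a tokenizer.
    estimated_tokens = len(text) // 4
    if estimated_tokens > cfg.markdown_token_threshold:
        return "chunked"
    return "short"


cfg = CompileConfig()
print(choose_path(".md", None, "a" * 200_000, cfg))
```

Keeping the markdown threshold in tokens rather than characters would let users derive it directly from their model's context size.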

Environment

  • OpenKB version: latest (pip)
  • Model: deepseek-v4-flash via Ollama cloud (128K context, but reasoning tokens consume significant budget)
  • Document corpus: 475 markdown files converted from 3GPP ATIAS specifications
  • 3 documents too large for any practical context window (3–14 MB)
