Problem
OpenKB sends the entire source document as one LLM message when compiling markdown files. There is no splitting, truncation, or chunking strategy for .md sources.
In agent/compiler.py, compile_short_doc() injects the full document text into a single prompt:
    content = source_path.read_text(encoding="utf-8")
    doc_msg = {"role": "user", "content": _cached_text(_SUMMARY_USER.format(
        doc_name=doc_name, content=content,
    ))}
The long-document path (index_long_document() via PageIndex) only triggers for PDFs with page count ≥ pageindex_threshold (default 20). Markdown files have no such fallback — they always take the short-doc code path regardless of size.
Impact: Unusable with local/private LLMs
The README and defaults suggest models like gpt-5.4-mini, implying cloud-scale models with 1M+ token context windows. But for private knowledge bases — the stated use case — users often need to run local models with context windows of 4K–32K tokens.
With a 32K context model, any markdown file exceeding ~24K tokens (roughly 96K characters / ~30 pages) will:
- Fail outright with a context-length-exceeded API error, or
- Produce truncated/garbled output when the LLM silently drops the tail of the prompt
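The arithmetic behind those figures can be sketched as a pre-flight check. This is only an illustration: the 4-characters-per-token ratio is the rough heuristic implied by the numbers above (96K characters for about 24K tokens), the 75% budget is an assumed safety margin, and the function names are hypothetical, not part of OpenKB.

```python
# Rough pre-flight check: will a document fit in the model's context window?
# Assumption: ~4 characters per token, which matches the 96K chars ~ 24K tokens
# estimate above. A real tokenizer would give more accurate counts.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str, chars_per_token: int = CHARS_PER_TOKEN) -> int:
    """Cheap token estimate using ceiling division on the character count."""
    return -(-len(text) // chars_per_token)


def prompt_budget(context_window: int, fraction: float = 0.75) -> int:
    """Tokens usable for the document, leaving the rest for the template and reply."""
    return int(context_window * fraction)


def fits(text: str, context_window: int) -> bool:
    """True if the estimated token count stays within the budget."""
    return estimate_tokens(text) <= prompt_budget(context_window)


if __name__ == "__main__":
    doc = "x" * 96_000  # stand-in for a ~30 page markdown file
    print(estimate_tokens(doc), prompt_budget(32_768), fits(doc, 32_768))
```

A check like this could run before the prompt is built, so an oversized file is reported clearly instead of surfacing as an API error.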
For reference, our corpus of 3GPP technical specification documents includes converted markdown files ranging from a few KB to 14 MB. We had 3 documents that could never be ingested because they exceed any reasonable context window. Even mid-sized documents (~50 pages) are risky with 8K–16K context models.
What a fix could look like
A chunking strategy for markdown (and other text-based formats) similar to what other wiki frameworks implement:
- Heading-aware splitting: Split on #/##/### boundaries so chunks respect document structure
- Token-aware sizing: Estimate token count per chunk and split when exceeding a configurable threshold (e.g., 75% of model context)
- Hierarchical synthesis: Summarize each chunk individually, then synthesize chunk summaries into a final document summary
- Graceful degradation: At minimum, truncate with a [...truncated at N tokens...] marker instead of sending an oversized prompt that will fail
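To make the splitting, sizing, and truncation points concrete, here is a minimal sketch. It is not OpenKB code: all function names are hypothetical, the token count is the same rough 4-characters-per-token estimate rather than a real tokenizer, and the heading regex does not skip `#` lines inside fenced code blocks, which a real implementation would need to handle.

```python
import re

# Matches #, ## and ### headings at the start of a line (deeper levels are ignored).
# Limitation: also matches "# comment" lines inside fenced code blocks.
HEADING_RE = re.compile(r"^#{1,3} ", re.MULTILINE)


def estimate_tokens(text: str) -> int:
    """Assumed heuristic: ~4 characters per token, rounded up."""
    return -(-len(text) // 4)


def split_on_headings(text: str) -> list[str]:
    """Cut before each heading; text before the first heading is its own section."""
    starts = [m.start() for m in HEADING_RE.finditer(text)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)
    bounds = starts + [len(text)]
    return [text[a:b] for a, b in zip(bounds, bounds[1:]) if text[a:b].strip()]


def chunk_markdown(text: str, max_tokens: int) -> list[str]:
    """Greedily pack heading sections into chunks of at most max_tokens."""
    chunks: list[str] = []
    current = ""
    for section in split_on_headings(text):
        if estimate_tokens(section) > max_tokens:
            # A single section is too big: flush, then hard-split it by characters.
            if current:
                chunks.append(current)
                current = ""
            step = max_tokens * 4
            chunks.extend(section[i:i + step] for i in range(0, len(section), step))
        elif current and estimate_tokens(current + section) > max_tokens:
            chunks.append(current)
            current = section
        else:
            current += section
    if current:
        chunks.append(current)
    return chunks


def truncate_with_marker(text: str, max_tokens: int) -> str:
    """Fallback: cut the text and say so, rather than send an oversized prompt."""
    if estimate_tokens(text) <= max_tokens:
        return text
    return text[: max_tokens * 4] + f"\n[...truncated at {max_tokens} tokens...]"
```

Packing whole sections greedily keeps related headings together where they fit; the character-based hard split is only a last resort for a single section that is larger than the budget.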
The existing pageindex_threshold config key could be extended to apply to markdown files (e.g., character or token count threshold), or a new config key could be introduced.
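One way such a threshold could drive the routing, combined with hierarchical synthesis, is sketched below. Everything here is an assumption for illustration: `md_token_threshold` is a hypothetical config value, `summarize` stands in for whatever LLM call the compiler makes, and `split` is any chunking function supplied by the caller.

```python
from typing import Callable


def compile_markdown(
    text: str,
    summarize: Callable[[str], str],
    split: Callable[[str], list[str]],
    md_token_threshold: int,
) -> str:
    """Route by size: one call for small documents, map-reduce for large ones.

    md_token_threshold is a hypothetical setting, analogous in spirit to
    pageindex_threshold but measured in estimated tokens.
    """
    est_tokens = -(-len(text) // 4)  # assumed ~4 characters per token
    if est_tokens <= md_token_threshold:
        # Small enough: a single summarization call, as in the short-doc path.
        return summarize(text)

    # Map step: summarize each chunk on its own.
    partials = [summarize(chunk) for chunk in split(text)]

    # Reduce step: synthesize the chunk summaries into one document summary.
    # This sketch assumes the combined summaries fit in the context window;
    # a fuller version would repeat the reduction if they do not.
    return summarize("\n\n".join(partials))


if __name__ == "__main__":
    def fake_llm(prompt: str) -> str:
        # Stand-in for a real model call: keep the first 20 characters.
        return prompt[:20]

    def fixed_split(t: str) -> list[str]:
        return [t[i:i + 400] for i in range(0, len(t), 400)]

    print(compile_markdown("word " * 500, fake_llm, fixed_split, md_token_threshold=100))
```

Passing `summarize` and `split` in as callables keeps the routing logic testable without a model and leaves the choice of chunking strategy open.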
Environment
- OpenKB version: latest (pip)
- Model: deepseek-v4-flash via Ollama cloud (128K context, but reasoning tokens consume significant budget)
- Document corpus: 475 markdown files converted from 3GPP ATIAS specifications
- 3 documents too large for any practical context window (3–14 MB)