Description
When adding a Chinese EPUB-to-Markdown document (《给青年编剧的信3.0》, ~80k chars), the compiler fails to generate concept pages in two different ways across two attempts, while other documents process fine.
Environment
- openkb v0.1.dev1+g51b6d4f88
- Python 3.12
- macOS 24.2.0 (darwin)
- Model: deepseek-v4-pro (via LiteLLM)
Steps to Reproduce
- Convert an EPUB to Markdown via pandoc:
pandoc book.epub -t markdown -o raw/book.md
- Add to knowledge base:
openkb add raw/book.md
- Observe compiler warnings
First attempt — 'list' object has no attribute 'get'
The LLM successfully returned a concepts plan (3 concepts + 1 update), but all 3 concept generations failed:
Compiling short doc...
summary.............................. 23.6s (in=80249, out=1149)
concepts-plan................ 16.3s (in=81675, out=733, cached=80187)
Generating 4 concept(s) (concurrency=5)...
concept: 剧作中心制... 18.3s (in=81423, out=1062, cached=80635)
concept: 喜剧创作... 18.7s (in=81417, out=924, cached=80635)
concept: 钩子与阻力... 24.2s (in=81420, out=1287, cached=80635)
update: dramatic-conflict... 35.0s (in=82579, out=2483, cached=80635)
openkb.agent.compiler WARNING: Concept generation failed: 'list' object has no attribute 'get'
openkb.agent.compiler WARNING: Concept generation failed: 'list' object has no attribute 'get'
openkb.agent.compiler WARNING: Concept generation failed: 'list' object has no attribute 'get'
The root cause appears to be in compiler.py, lines 946-947: when _parse_json returns a list, the fallback assumes it is a flat list of dicts and builds plan = {"create": parsed, ...}. If any item in create is itself a list rather than a dict, line 1002 (concept.get("title", name)) raises 'list' object has no attribute 'get':
# compiler.py:946-947
if isinstance(parsed, list):
plan = {"create": parsed, "update": [], "related": []}
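The failure mode described above can be reproduced in isolation. The following is a minimal sketch, not the actual compiler code: the nested-list payload is a made-up stand-in for what _parse_json might have returned, and first_title_error is a hypothetical helper that only mimics the .get() call on each create item.

```python
# Hypothetical stand-in for the value returned by _parse_json:
# a list whose items are themselves lists rather than dicts.
parsed = [["剧作中心制", "placeholder description"]]

# Same shape as the fallback quoted above.
if isinstance(parsed, list):
    plan = {"create": parsed, "update": [], "related": []}


def first_title_error(plan):
    """Return the message of the first AttributeError raised by .get(), or None."""
    for concept in plan["create"]:
        try:
            concept.get("title", "fallback-name")
        except AttributeError as exc:
            return str(exc)
    return None


error = first_title_error(plan)
print(error)
```

The printed message matches the warning in the log, which is consistent with the create items being lists, although the raw plan output would be needed to confirm this.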
Second attempt — Failed to parse concepts plan: JSON parse error
After removing and re-adding, the plan JSON couldn't parse at all:
Compiling short doc...
summary.............................. 30.4s (in=80249, out=1440)
concepts-plan...................... 22.2s (in=81945, out=1024, cached=80187)
openkb.agent.compiler WARNING: Failed to parse concepts plan: Expecting value: line 1 column 1 (char 0)
This triggers the fallback at compiler.py line 938, which writes the v1 summary without any concept pages. The raw LLM output is logged at DEBUG level and not visible in default output.
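For context, "Expecting value: line 1 column 1 (char 0)" is the message json.loads produces when the input is empty or starts with something that is not JSON, such as a prose preamble or a Markdown code fence. A more tolerant parse could look like the sketch below. It is an assumption-laden illustration, not a drop-in replacement for _parse_json, whose current behavior is not shown here; parse_plan_or_warn and the logger name are hypothetical.

```python
import json
import logging

logger = logging.getLogger("plan_parse_sketch")  # hypothetical logger name


def parse_plan_or_warn(raw: str):
    """Return the first JSON object or array found in raw, or None.

    On failure, the raw text is logged at WARNING level so it is visible
    without enabling DEBUG output.
    """
    decoder = json.JSONDecoder()
    # Try every position where a JSON object or array could start, so that
    # leading prose or a ```json fence does not break the parse.
    for start, ch in enumerate(raw):
        if ch in "{[":
            try:
                value, _end = decoder.raw_decode(raw, start)
                return value
            except json.JSONDecodeError:
                continue
    # Truncate the logged text to keep the warning readable.
    logger.warning("Failed to parse concepts plan; raw output: %r", raw[:2000])
    return None


fenced = parse_plan_or_warn('```json\n{"create": [], "update": []}\n```')
empty = parse_plan_or_warn("")
```

Here fenced holds the parsed dict despite the surrounding fence, while the empty string yields None and a warning containing the raw text.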
Expected Behavior
Concept pages should be generated and written to wiki/concepts/, even when the LLM returns edge-case JSON structures. The fallback at lines 946-947 should validate that each item in the list is a dict before treating it as one.
Suggestions
- Add type validation for each item in the flat-list fallback path (lines 946-947). Items that are lists rather than dicts suggest the LLM used the wrong JSON structure; they should be skipped with a clearer warning.
- Log the raw plan JSON at WARNING level (not just DEBUG) when parse fails, so users can inspect what the LLM actually returned.
- Consider making the concept plan prompt more robust for non-English content. Both failures occurred on a Chinese document while other documents processed fine, so the Chinese input may make the model more likely to deviate from the expected JSON structure.
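The first suggestion could be sketched roughly as follows. This is an illustration under assumptions, not the project's code: plan_from_flat_list and the logger name are hypothetical, and the real fallback may need to handle more cases, such as unwrapping a nested list instead of skipping it.

```python
import logging

logger = logging.getLogger("plan_validate_sketch")  # hypothetical logger name


def plan_from_flat_list(parsed: list) -> dict:
    """Build a plan from a flat list, keeping only dict items.

    Non-dict items are skipped with a warning that names their position
    and type, instead of failing later on .get().
    """
    create = []
    for index, item in enumerate(parsed):
        if isinstance(item, dict):
            create.append(item)
        else:
            logger.warning(
                "Skipping concept plan item %d: expected dict, got %s",
                index,
                type(item).__name__,
            )
    return {"create": create, "update": [], "related": []}


plan = plan_from_flat_list([{"name": "喜剧创作"}, ["unexpected", "list"], "text"])
```

With this input, plan["create"] contains only the one dict, and two warnings identify the skipped items by index and type.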
Related
- Source file: openkb/agent/compiler.py
- Affected lines: ~946-947 (list fallback), ~937-942 (JSON parse failure), ~1000-1020 (concept generation, where .get() is called on items)