compiler: concept generation fails with 'list' object has no attribute 'get' and JSON parse error on same document #71

@moyu12-ae

Description

When adding a Chinese book converted from EPUB to Markdown (《给青年编剧的信3.0》, ~80k chars), the compiler fails to generate concept pages, and it fails differently on each of two attempts. Other documents process fine.

Environment

  • openkb v0.1.dev1+g51b6d4f88
  • Python 3.12
  • macOS 24.2.0 (darwin)
  • Model: deepseek-v4-pro (via LiteLLM)

Steps to Reproduce

  1. Convert an EPUB to Markdown via pandoc: pandoc book.epub -t markdown -o raw/book.md
  2. Add to knowledge base: openkb add raw/book.md
  3. Observe compiler warnings

First attempt — 'list' object has no attribute 'get'

The LLM successfully returned a concepts plan (3 concepts + 1 update), but all 3 concept generations failed:

Compiling short doc...
  summary.............................. 23.6s (in=80249, out=1149)
  concepts-plan................ 16.3s (in=81675, out=733, cached=80187)
  Generating 4 concept(s) (concurrency=5)...
  concept: 剧作中心制... 18.3s (in=81423, out=1062, cached=80635)
  concept: 喜剧创作... 18.7s (in=81417, out=924, cached=80635)
  concept: 钩子与阻力... 24.2s (in=81420, out=1287, cached=80635)
  update: dramatic-conflict... 35.0s (in=82579, out=2483, cached=80635)
openkb.agent.compiler WARNING: Concept generation failed: 'list' object has no attribute 'get'
openkb.agent.compiler WARNING: Concept generation failed: 'list' object has no attribute 'get'
openkb.agent.compiler WARNING: Concept generation failed: 'list' object has no attribute 'get'

The root cause appears to be in compiler.py, lines 946-947. When _parse_json returns a list, the fallback wraps it as plan = {"create": parsed, ...} and assumes every item is a dict. If any item in create is itself a list, line 1002 (concept.get("title", name)) raises 'list' object has no attribute 'get':

# compiler.py:946-947
if isinstance(parsed, list):
    plan = {"create": parsed, "update": [], "related": []}
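
The failure can be reproduced outside the compiler with a few lines. This is only a sketch: the raw string is a made-up example of a nested-list response (the real LLM output was not captured), and the loop merely imitates the .get() call quoted from line 1002.

```python
import json

# Hypothetical model output: a top-level list whose items are lists, not dicts.
raw = '[["script-centered-production", "Script-Centered Production"]]'
parsed = json.loads(raw)

# Same shape as the fallback quoted above.
if isinstance(parsed, list):
    plan = {"create": parsed, "update": [], "related": []}

errors = []
for concept in plan["create"]:
    name = "fallback-name"  # placeholder for whatever name the compiler passes
    try:
        title = concept.get("title", name)
    except AttributeError as exc:
        errors.append(str(exc))

print(errors)  # ["'list' object has no attribute 'get'"]
```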

Second attempt — Failed to parse concepts plan: JSON parse error

After removing and re-adding, the plan JSON couldn't parse at all:

Compiling short doc...
  summary.............................. 30.4s (in=80249, out=1440)
  concepts-plan...................... 22.2s (in=81945, out=1024, cached=80187)
openkb.agent.compiler WARNING: Failed to parse concepts plan: Expecting value: line 1 column 1 (char 0)

This triggers the fallback at compiler.py line 938, which writes the v1 summary without any concept pages. The raw LLM output is logged at DEBUG level and not visible in default output.
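
For context, "Expecting value: line 1 column 1 (char 0)" is the message Python's json module gives when the text is empty or begins with a non-JSON character, such as prose or a markdown fence. A minimal sketch of surfacing the raw output at WARNING level follows. It assumes the plan is parsed with json.loads; parse_plan_or_warn and max_chars are hypothetical names, not part of openkb.

```python
import json
import logging

# Logger name taken from the log lines above.
logger = logging.getLogger("openkb.agent.compiler")


def parse_plan_or_warn(raw: str, max_chars: int = 2000):
    """Return the parsed JSON, or None after logging a truncated copy of the raw text."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError as exc:
        # Truncate so an ~80k-char response does not flood the console.
        logger.warning(
            "Failed to parse concepts plan: %s; raw output (first %d chars): %r",
            exc,
            max_chars,
            raw[:max_chars],
        )
        return None
```

With this in place, an empty response or one that starts with explanatory text would show up directly in the default output instead of requiring a DEBUG rerun.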

Expected Behavior

Concept pages should be generated and written to wiki/concepts/, even when the LLM returns edge-case JSON structures. The fallback at lines 946-947 should check that each item in the list is a dict before treating it as one.

Suggestions

  1. Add type validation for each item in the flat-list fallback path (lines 946-947). Items that are lists rather than dicts indicate that the LLM used the wrong JSON structure, and should be skipped with a clearer warning.
  2. Log the raw plan output at WARNING level (not just DEBUG) when parsing fails, so users can inspect what the LLM actually returned.
  3. Consider making the concepts plan prompt more robust for non-English content. The Chinese source text may make the model more likely to deviate from the expected JSON format.
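
Suggestion 1 could look roughly like the sketch below. The helper names keep_dict_items and plan_from_flat_list are hypothetical, and only the create items are filtered because that is the only path this report shows failing.

```python
import logging
from typing import Any

logger = logging.getLogger(__name__)


def keep_dict_items(items: list[Any], label: str) -> list[dict]:
    """Keep only dict entries, warning once for each entry that is skipped."""
    kept = []
    for index, item in enumerate(items):
        if isinstance(item, dict):
            kept.append(item)
        else:
            logger.warning(
                "Skipping %s[%d]: expected dict, got %s",
                label,
                index,
                type(item).__name__,
            )
    return kept


def plan_from_flat_list(parsed: list[Any]) -> dict[str, list]:
    """Build the fallback plan from a flat list, dropping malformed entries."""
    return {"create": keep_dict_items(parsed, "create"), "update": [], "related": []}
```

A list such as [{"name": "a"}, ["b"]] would then yield one concept to create and one warning naming the skipped index, rather than an AttributeError later in concept generation.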

Related

  • Source file: openkb/agent/compiler.py
  • Affected lines: ~946-947 (list fallback), ~937-942 (JSON parse failure), ~1000-1020 (concept generation where .get() is called on items)
