
Feature: Optional image understanding / vision for inline and referenced images #74

@jr2804

Feature Request

OpenKB currently treats images in documents as text-only syntax — the LLM sees ![alt](path/to/image.png) but never the actual image content. This significantly reduces knowledge base quality for technical and scientific documents where figures, diagrams, and charts carry essential information.

Current Behavior

Image path rewriting (images.py)

  • copy_relative_images() scans for ![alt](relative/path) references
  • Copies referenced image files into wiki/sources/images/<doc_name>/
  • Rewrites links to sources/images/<doc_name>/<filename>
  • Skips http/https/data: URIs and any referenced image that is not found on disk
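
For readers unfamiliar with this step, the behavior listed above can be approximated with the sketch below. It is not the actual code in images.py; the function name `rewrite_image_links`, its parameters, and the regular expression are assumptions made for illustration only.

```python
import re
import shutil
from pathlib import Path

# Matches ![alt](path); a simplification that ignores optional link titles.
IMAGE_REF = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)\)")
SKIPPED_PREFIXES = ("http://", "https://", "data:")


def rewrite_image_links(markdown: str, source_dir: Path, wiki_dir: Path, doc_name: str) -> str:
    """Copy relative images into the wiki and rewrite their links (hypothetical helper)."""
    target_dir = wiki_dir / "sources" / "images" / doc_name

    def replace(match: re.Match) -> str:
        alt, ref = match.group(1), match.group(2)
        if ref.startswith(SKIPPED_PREFIXES):
            return match.group(0)  # remote or inline image: leave untouched
        src = source_dir / ref
        if not src.is_file():
            return match.group(0)  # not found on disk: leave untouched
        target_dir.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, target_dir / src.name)
        return f"![{alt}](sources/images/{doc_name}/{src.name})"

    return IMAGE_REF.sub(replace, markdown)
```

The point of the sketch is that only the link text changes; the image bytes are copied but never read, which is why the LLM never sees them.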

During LLM compilation

  • The markdown with image references is sent to the LLM as plain text
  • The LLM sees ![Figure 3: SBA rendering pipeline](sources/images/doc_name/fig3.png) but cannot see the actual image
  • No image bytes are ever sent to the LLM
  • No vision/multimodal capability is used

Why This Matters

For technical and scientific documents — the kind that benefit most from a knowledge base — figures are often irreplaceable:

  • Architecture diagrams: Show signal flow, system topology, protocol stacks
  • Tables rendered as images: Contain normative reference data that isn't in the text
  • Charts and plots: Performance benchmarks, measurement results
  • Schematics: Circuit diagrams, filter responses, encoder block diagrams

A knowledge base that ignores all of this produces summaries and concept articles that are missing critical information. For example, a 3GPP spec document on "Immersive Audio Rendering" might have 15+ figures showing rendering pipelines, binaural processing chains, and speaker layouts — none of which would be captured.

Proposed Solution (Optional / Configurable)

Since not all users need image understanding (and it requires a vision-capable model), this should be opt-in:

  1. Config flag: image_understanding: true (default: false)
  2. Detection: During compilation, identify ![]() references in the markdown
  3. Vision pass: For each referenced image file found on disk, send the image to a vision-capable LLM with a prompt like: "Describe this figure from document {doc_name}. Include: caption, what it depicts, key information conveyed, visible text/labels, related concepts."
  4. Injection: Prepend the vision-generated description as a text block before the image reference in the prompt sent to the summarization LLM
  5. Wiki output: Include the description in the generated summary/concept pages alongside the image reference
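
A minimal sketch of steps 1 to 4, assuming the vision model is reached through a caller-supplied function. Everything named here (`inject_image_descriptions`, `DescribeFn`, the `enabled` parameter standing in for the `image_understanding` flag, and the `[Image description: ...]` wrapper) is hypothetical and not part of OpenKB today.

```python
import re
from pathlib import Path
from typing import Callable

IMAGE_REF = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)\)")

PROMPT_TEMPLATE = (
    "Describe this figure from document {doc_name}. Include: caption, "
    "what it depicts, key information conveyed, visible text/labels, "
    "related concepts."
)

# Assumed signature for the vision backend: (image_bytes, prompt) -> description.
DescribeFn = Callable[[bytes, str], str]


def inject_image_descriptions(
    markdown: str,
    wiki_dir: Path,
    doc_name: str,
    describe: DescribeFn,
    enabled: bool = False,  # mirrors the proposed opt-in flag, off by default
) -> str:
    """Prepend a vision-generated description before each local image reference."""
    if not enabled:
        return markdown
    prompt = PROMPT_TEMPLATE.format(doc_name=doc_name)
    cache: dict[str, str] = {}  # describe each image once, even if referenced twice

    def replace(match: re.Match) -> str:
        ref = match.group(2)
        if ref.startswith(("http://", "https://", "data:")):
            return match.group(0)
        path = wiki_dir / ref
        if not path.is_file():
            return match.group(0)
        if ref not in cache:
            cache[ref] = describe(path.read_bytes(), prompt).strip()
        return f"[Image description: {cache[ref]}]\n{match.group(0)}"

    return IMAGE_REF.sub(replace, markdown)
```

Passing the backend in as a callable keeps the sketch independent of any particular provider SDK, which matches the intent that any vision-capable model can be plugged in.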

This approach is framework-agnostic — it works with any vision-capable model (GPT-4o, Claude 3.5+, Gemini, LLaVA via local Ollama, etc.) and doesn't require changes to the wiki output format.

Alternative (Minimal)

If full vision integration is too complex, a simpler approach is to add an image_caption_step config option that lets the user provide pre-generated captions in a sidecar file (e.g., doc_name.images.yaml), which are then injected into the LLM prompt. This avoids the vision dependency entirely while still giving the LLM access to image content descriptions.
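
A rough sketch of the sidecar idea, assuming the file is a flat `filename: caption` mapping with one entry per line. The sidecar schema, the helper names `load_flat_captions` and `inject_captions`, and the `[Image caption: ...]` wrapper are all assumptions; a real implementation would more likely use a proper YAML parser.

```python
import re
from pathlib import Path

IMAGE_REF = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)\)")


def load_flat_captions(sidecar: Path) -> dict[str, str]:
    """Read 'filename: caption' lines; a stand-in for a real YAML loader."""
    captions: dict[str, str] = {}
    for line in sidecar.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, sep, caption = line.partition(":")
        if sep:
            captions[name.strip()] = caption.strip().strip("\"'")
    return captions


def inject_captions(markdown: str, captions: dict[str, str]) -> str:
    """Prepend the user-supplied caption before each matching image reference."""

    def replace(match: re.Match) -> str:
        filename = match.group(2).rsplit("/", 1)[-1]
        caption = captions.get(filename)
        if caption is None:
            return match.group(0)  # no caption provided: leave untouched
        return f"[Image caption: {caption}]\n{match.group(0)}"

    return IMAGE_REF.sub(replace, markdown)
```

Keying captions by file name rather than full path keeps them valid after the links are rewritten to sources/images/<doc_name>/<filename>.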

Environment

  • Document corpus: 3GPP ATIAS technical specifications (converted PDF → markdown with inline image references)
  • Many documents contain critical figures (protocol diagrams, test setups, signal flow charts) that are essential for understanding the content
