[V1.3.3] Image/Video generation interfaces #294

@ahmad-ajmal

Requesting changes - architectural concern + a few correctness issues

Main blocker: provider branching duplicates infrastructure we already have

This change hardcodes if provider == "openai" / elif "gemini" inside the action, with two SDK imports, two client inits, and two error-mapping blocks. We already have provider abstraction elsewhere in the repo:

  • MODEL_REGISTRY + InterfaceType enum (used by VLM/LLM)
  • LLMInterface / VLMInterface wrappers in app/llm_interface.py and app/vlm_interface.py that hide the provider-specific SDK calls
  • describe_image.py (lines 62–70) is the reference pattern: read the configured provider from MODEL_REGISTRY[provider][InterfaceType.VLM], then delegate

As it stands, this PR builds a third parallel provider system, and every future image provider (Stability, Replicate, xAI, OpenRouter image, etc.) will have to extend it by adding another elif branch here. It also introduces a new image_generation.preferred_provider setting that runs alongside the existing vlm_provider / llm_provider pattern instead of joining it.

Could we route this through MODEL_REGISTRY with a new InterfaceType.IMAGE_GEN and an ImageGenInterface in agent_core, mirroring how VLMInterface is set up, so generate_image.py ends up looking like describe_image.py? Reusing InterfaceType.VLM directly is tempting since some providers serve both through one endpoint, but the capability sets differ (Claude / ByteDance support VLM but not image generation), and users will want to pick a provider independently for each.
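Roughly what I have in mind, as a minimal standalone sketch rather than the actual repo code. I'm assuming the registry is indexed as MODEL_REGISTRY[provider][InterfaceType.X], as in describe_image.py; the registry values, the ImageResult type, the stand-in backends, and the "anthropic" / "gemini" / "openai" keys are all placeholders I made up for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict, List


class InterfaceType(Enum):
    LLM = "llm"
    VLM = "vlm"
    IMAGE_GEN = "image_gen"  # proposed new member


@dataclass
class ImageResult:
    # Hypothetical result type; the real action may return something else.
    provider: str
    image_bytes: bytes
    warnings: List[str] = field(default_factory=list)


# Hypothetical backend signature: (prompt, aspect_ratio) -> ImageResult.
Backend = Callable[[str, str], ImageResult]


def _fake_gemini(prompt: str, aspect_ratio: str) -> ImageResult:
    # Stand-in for the provider-specific SDK call.
    return ImageResult(provider="gemini", image_bytes=b"")


def _fake_openai(prompt: str, aspect_ratio: str) -> ImageResult:
    # Stand-in for the provider-specific SDK call.
    return ImageResult(provider="openai", image_bytes=b"")


# Placeholder registry. "anthropic" has a VLM entry but no IMAGE_GEN entry,
# which is why image generation needs its own InterfaceType.
MODEL_REGISTRY: Dict[str, Dict[InterfaceType, object]] = {
    "gemini": {InterfaceType.VLM: "vlm-model", InterfaceType.IMAGE_GEN: _fake_gemini},
    "openai": {InterfaceType.VLM: "vlm-model", InterfaceType.IMAGE_GEN: _fake_openai},
    "anthropic": {InterfaceType.VLM: "vlm-model"},
}


class ImageGenInterface:
    """Hides the provider-specific call behind one generate() method."""

    def __init__(self, provider: str) -> None:
        backend = MODEL_REGISTRY.get(provider, {}).get(InterfaceType.IMAGE_GEN)
        if backend is None:
            raise ValueError(f"provider {provider!r} has no image generation backend")
        self.provider = provider
        self._backend = backend

    def generate(self, prompt: str, aspect_ratio: str = "1:1") -> ImageResult:
        return self._backend(prompt, aspect_ratio)
```

With something like this in place, generate_image.py would only need to read the configured provider and call ImageGenInterface(provider).generate(...); adding a provider becomes a registry entry instead of another elif.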

Other issues worth fixing while you're in here

  • OpenAI aspect-ratio map is wrong. "16:9": "1536x1024" is actually 3:2, and "9:16": "1024x1536" is 2:3. The canvas constraint is real (gpt-image only has 3 sizes), but the mismatch shouldn't be silent: at least append a note to warnings. (I only skimmed this, so please verify.)
  • Silent 4K downgrade for OpenAI. "4K": "high" returns at most 1536×1024. Either reject 4K for OpenAI or warn.
  • quality dropped on the edit path. quality=... is passed to images.generate(...) but not to images.edit(...), so reference-image runs silently ignore the requested quality and render at the API default.
  • images.edit ≠ "style reference." The existing reference_images field is documented as style guidance (how Gemini uses them). OpenAI's images.edit treats inputs as compositional/mask inputs. Same input, very different output between providers.
  • Provider-selection UX doesn't match the PR description. The description says "asks the user" when both keys are present, but the code silently defaults to Gemini - there's no signal in the response telling the calling LLM that a choice is available. Once provider_preference is saved, there's also no way to clear it.
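
For the aspect-ratio point, here is a small sketch of how the mapping could pick the nearest supported canvas and record the mismatch instead of hiding it. The helper names and the warning text are hypothetical, and I'm assuming the three supported sizes are 1024x1024, 1536x1024, and 1024x1536; please check that against the current OpenAI docs.

```python
from fractions import Fraction
from typing import List

# Assumed list of supported canvases; verify against the provider docs.
OPENAI_SIZES = ("1024x1024", "1536x1024", "1024x1536")


def size_ratio(size: str) -> Fraction:
    """Reduce a 'WxH' pixel size to its exact aspect ratio."""
    width, height = size.split("x")
    return Fraction(int(width), int(height))


def parse_ratio(ratio: str) -> Fraction:
    """Parse a 'W:H' aspect-ratio string."""
    width, height = ratio.split(":")
    return Fraction(int(width), int(height))


def resolve_size(requested: str, warnings: List[str]) -> str:
    """Pick the closest supported size; warn when it is not an exact match."""
    target = parse_ratio(requested)
    best = min(OPENAI_SIZES, key=lambda s: abs(size_ratio(s) - target))
    actual = size_ratio(best)
    if actual != target:
        warnings.append(
            f"requested {requested}; nearest supported size is {best} "
            f"({actual.numerator}:{actual.denominator})"
        )
    return best


warnings: List[str] = []
print(resolve_size("16:9", warnings))  # 1536x1024
print(warnings[0])  # requested 16:9; nearest supported size is 1536x1024 (3:2)
```

The same pattern would work for the 4K case: resolve to what the provider can actually return and append a warning, or raise if we'd rather reject the request outright.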
