feat(eval): add gooddata-eval model-evaluation CLI by zdenekmusil-gd · Pull Request #1639 · gooddata/gooddata-python-sdk

zdenekmusil-gd · 2026-06-03T11:09:35Z

New public package gooddata-eval with a gd-eval CLI that evaluates the GoodData AI agent against a dataset of natural-language questions.

Phase 1 (visualization evaluation):
- Layered core + thin argparse CLI; SSE agentic chat client (httpx); workspace LLM provider/model resolution and activation via GoodData SDK; local-folder and Langfuse dataset sources; visualization evaluator with strict checks (metrics/dimensions/filters/type, cross-ref, pass@K); console + JSON reports.
- Streaming per-item progress with latency (total, avg) and quality score.
- Provider flag accepts name or id; auto-switches workspace to the provider that offers the requested model.
- SSE fallback: captures visualization from create_adhoc_visualization tool call args when the data source is inaccessible.

Phase 2 (remaining agentic test kinds):
- metric_skill (create_metric result, MAQL + format exact match)
- alert_skill (create_metric_alert args, 6 conditional scored fields)
- search_tool (tool_selection hard gate, tool_correctness soft)
- general_question + guardrail (LLM-as-judge via openai [llm-judge] extra)
- Shared helpers: _deep_subset for alert filter comparison, LLMJudge, _text_utils.

Langfuse integration:
- Scoring sink (--langfuse, requires --langfuse-dataset): posts one ingestion batch (trace + 4 scores) + one dataset-run-item per evaluated item, creating the named experiment run automatically in Langfuse.
- Scores: pass_at_k (bool), quality_score (fraction of strict checks), value_score (0.6×quality + 0.2×speed), latency_s.
- quality_score / avg_quality_score exposed on ItemReport / EvalReport and shown in CLI progress line, final table, and summary.
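
For readers unfamiliar with the two simplest scores, here is a minimal sketch of how `quality_score` (fraction of strict checks passed) and `pass_at_k` could be computed. The `Attempt` dataclass, its `checks` mapping, and the reading of pass@K as "at least one of K attempts passes every strict check" are assumptions made for illustration, not the package's actual types or logic. `value_score` is left out because the description does not define its speed term.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    """Hypothetical record of one agent attempt.

    `checks` maps a strict-check name (e.g. "metrics") to pass/fail.
    """

    checks: dict[str, bool]
    latency_s: float


def quality_score(attempt: Attempt) -> float:
    """Fraction of strict checks that passed; 0.0 when there are no checks."""
    if not attempt.checks:
        return 0.0
    return sum(attempt.checks.values()) / len(attempt.checks)


def pass_at_k(attempts: list[Attempt]) -> bool:
    """Assumed semantics: True if any of the K attempts passed every strict check."""
    return any(bool(a.checks) and all(a.checks.values()) for a in attempts)
```

With four strict checks and one failure, `quality_score` comes out at 0.75, while `pass_at_k` stays false until some attempt passes all four.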

JIRA: GDAI-1766
Risk: low — new isolated package; no changes to existing packages.

hkad98 · 2026-06-03T12:30:30Z

Looks good! We also need to add the infra wiring.

Please also add the new package in the following places:

Also, the package is missing the common Makefile. See https://github.com/gooddata/gooddata-python-sdk/blob/master/packages/gooddata-dbt/Makefile#L1-L1 for an example. Just copy it into the new package.

codecov · 2026-06-03T13:06:05Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 79.10%. Comparing base (f9639cb) to head (6ab0749).
⚠️ Report is 11 commits behind head on master.

Additional details and impacted files

@@           Coverage Diff           @@
##           master    #1639   +/-   ##
=======================================
  Coverage   79.10%   79.10%           
=======================================
  Files         231      231           
  Lines       15718    15718           
=======================================
  Hits        12433    12433           
  Misses       3285     3285


… + Langfuse)

New public package `gooddata-eval` with a `gd-eval` CLI that evaluates the GoodData AI agent against a dataset of natural-language questions.

Phase 1 (visualization evaluation):
- Layered core + thin argparse CLI; SSE agentic chat client (httpx); workspace LLM provider/model resolution and activation via GoodData SDK; local-folder and Langfuse dataset sources; visualization evaluator with strict checks (metrics/dimensions/filters/type, cross-ref, pass@K); console + JSON reports.
- Streaming per-item progress with latency (total, avg) and quality score.
- Provider flag accepts name or id; auto-switches workspace to the provider that offers the requested model.
- SSE fallback: captures visualization from create_adhoc_visualization tool call args when the data source is inaccessible.

Phase 2 (remaining agentic test kinds):
- metric_skill, alert_skill, search_tool: scored via tool call arguments.
- general_question + guardrail: LLM-as-judge via openai [llm-judge] extra, lazily imported so CLI starts without openai installed.
- Shared helpers: _deep_subset, LLMJudge, _text_utils.

Langfuse integration:
- Dataset source uses REST API via httpx (no Langfuse SDK; broken on Python 3.14). Requires LANGFUSE_PUBLIC_KEY / SECRET_KEY / HOST env vars.
- Scoring sink (--langfuse, requires --langfuse-dataset): posts trace + 4 scores + dataset-run-item per evaluated item, creating the named experiment run automatically in Langfuse.
- Scores: pass_at_k, quality_score, value_score, latency_s.

Bug fixes from code review:
- general_question/guardrail items SKIPPED (not ERRORED) when openai absent: supported_test_kinds() now checks openai availability via find_spec().
- Guardrail quality_score was inverted: visualization_returned renamed to no_visualization (True=good); judge_passed added so prose compliance scores 0.5 rather than 1.0.
- _coerce_number truncated float thresholds: float(int(x)) -> float(x).
- Falsy-zero threshold: 'or' fallback replaced with 'in' key check.
- conversationId KeyError on malformed 200: raises ValueError with body.
- Scoring math in sink.py was duplicated inline: now calls compute_scores().
- _deep_subset docstring corrected: greedy first-fit, not bipartite match.

Infra wiring:
- Add to fossa.yaml matrix, build-release/dev-release COMPONENTS, codecov.
- Add Makefile (include ../../project_common.mk).

102 tests, ruff + ty clean. CLI starts without openai installed.

JIRA: GDAI-1766
Risk: low; new isolated package, no changes to existing packages.
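
The two threshold fixes in the commit message are easy to get wrong, so here is a small sketch of the pattern they describe. The function names, the `"threshold"` key, and `DEFAULT_THRESHOLD` are hypothetical and chosen for illustration only; the actual `_coerce_number` and its callers may look different.

```python
from typing import Any

DEFAULT_THRESHOLD = 100.0  # hypothetical default, for illustration only


def coerce_number(value: Any) -> float:
    """Convert a numeric value to float without dropping the fractional part."""
    # For a float input such as 0.75, float(int(value)) yields 0.0;
    # float(value) keeps 0.75.
    return float(value)


def read_threshold(args: dict[str, Any]) -> float:
    """Return the threshold from tool-call args, treating 0 as a real value."""
    # `args.get("threshold") or DEFAULT_THRESHOLD` would discard a legitimate 0
    # because 0 is falsy, so test for key presence instead of truthiness.
    if "threshold" in args:
        return coerce_number(args["threshold"])
    return DEFAULT_THRESHOLD
```

The key-presence check means an explicit threshold of 0 is compared as 0, and only a genuinely missing key falls back to the default.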

zdenekmusil-gd requested review from hkad98, jaceksan, lupko and pcerny as code owners June 3, 2026 11:09

zdenekmusil-gd force-pushed the zmu/gdai-1766-gooddata-eval-cli branch from a7d2af0 to decddb3 Compare June 3, 2026 12:26

zdenekmusil-gd force-pushed the zmu/gdai-1766-gooddata-eval-cli branch from decddb3 to c5b69b5 Compare June 3, 2026 13:01

zdenekmusil-gd force-pushed the zmu/gdai-1766-gooddata-eval-cli branch from c5b69b5 to d90f37d Compare June 3, 2026 13:22

zdenekmusil-gd force-pushed the zmu/gdai-1766-gooddata-eval-cli branch from d90f37d to 6ab0749 Compare June 3, 2026 13:58

hkad98 approved these changes Jun 3, 2026

zdenekmusil-gd merged commit c028c31 into master Jun 3, 2026
13 checks passed

zdenekmusil-gd deleted the zmu/gdai-1766-gooddata-eval-cli branch June 3, 2026 14:26

zdenekmusil-gd merged 1 commit into master from zmu/gdai-1766-gooddata-eval-cli
