fix: include grounding metadata in rubric judge prompt by he-yufeng · Pull Request #5834 · google/adk-python

he-yufeng · 2026-05-24T20:03:12Z

Summary

This updates the rubric-based final response quality evaluator so model-supplied grounding metadata is available to the LLM-as-judge prompt.

The issue is easiest to hit with model-internal tools such as google_search: the evaluator currently tells the judge to trust only function tool_response values, but those raw search results may not appear as normal function tool responses. ADK events can still carry grounding metadata, so this patch preserves that metadata in eval invocation events and serializes it into the judge prompt as trusted evidence.

Final answer text is still not treated as evidence.
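
As a rough illustration of the idea above (not the code in this PR), the sketch below builds a judge-prompt evidence block from tool responses and grounding metadata while leaving the final answer text out. `SketchEvent`, its fields, and `build_evidence_block` are hypothetical names chosen for this example; the real ADK event and evaluator types may differ.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class SketchEvent:
  """Hypothetical stand-in for an eval invocation event (not the ADK class)."""

  author: str
  final_text: Optional[str] = None
  tool_responses: list[dict[str, Any]] = field(default_factory=list)
  grounding_metadata: Optional[dict[str, Any]] = None


def build_evidence_block(events: list[SketchEvent]) -> str:
  """Serializes only trusted evidence; final answer text is never included."""
  evidence: list[dict[str, Any]] = []
  for event in events:
    # Function tool responses were already treated as evidence.
    for response in event.tool_responses:
      evidence.append({"type": "tool_response", "content": response})
    # Grounding metadata (e.g. from a model-internal search tool) is added too.
    if event.grounding_metadata is not None:
      evidence.append(
          {"type": "grounding_metadata", "content": event.grounding_metadata}
      )
  return json.dumps(evidence, indent=2, sort_keys=True)


# Usage: an event with grounding metadata but no function tool response.
events = [
    SketchEvent(
        author="agent",
        final_text="The tower is 330 m tall.",
        grounding_metadata={"web_search_queries": ["eiffel tower height"]},
    )
]
block = build_evidence_block(events)
```

In this sketch the judge would see the search query recorded in the grounding metadata, but not the agent's own claim about the height, which matches the stated rule that final answer text is not evidence.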

Fixes #5831.

To verify

python -m py_compile src/google/adk/evaluation/eval_case.py src/google/adk/evaluation/evaluation_generator.py src/google/adk/evaluation/llm_as_judge_utils.py src/google/adk/evaluation/rubric_based_final_response_quality_v1.py tests/unittests/evaluation/test_evaluation_generator.py tests/unittests/evaluation/test_llm_as_judge_utils.py tests/unittests/evaluation/test_rubric_based_final_response_quality_v1.py
.venv\Scripts\python.exe -m pyink --check src\google\adk\evaluation\eval_case.py src\google\adk\evaluation\evaluation_generator.py src\google\adk\evaluation\llm_as_judge_utils.py src\google\adk\evaluation\rubric_based_final_response_quality_v1.py tests\unittests\evaluation\test_evaluation_generator.py tests\unittests\evaluation\test_llm_as_judge_utils.py tests\unittests\evaluation\test_rubric_based_final_response_quality_v1.py
.venv\Scripts\python.exe -m isort --check-only src\google\adk\evaluation\eval_case.py src\google\adk\evaluation\evaluation_generator.py src\google\adk\evaluation\llm_as_judge_utils.py src\google\adk\evaluation\rubric_based_final_response_quality_v1.py tests\unittests\evaluation\test_evaluation_generator.py tests\unittests\evaluation\test_llm_as_judge_utils.py tests\unittests\evaluation\test_rubric_based_final_response_quality_v1.py
.venv\Scripts\python.exe -m pytest tests\unittests\evaluation\test_eval_case.py tests\unittests\evaluation\test_llm_as_judge_utils.py tests\unittests\evaluation\test_rubric_based_final_response_quality_v1.py tests\unittests\evaluation\test_evaluation_generator.py -q
git diff --check

I also ran targeted pylint on the touched files. It still reports existing module-wide style warnings in these evaluation tests/modules, but no unused-import or grounding-metadata-specific issue remains.

ftnext

Thanks!❤️
#5831 (comment)

he-yufeng · 2026-05-27T09:07:12Z

Rebased onto current upstream/main and pushed a small follow-up for the CI typing failures. Current head is fa508ea.

The follow-up fixes the two new mypy-diff errors from the previous run:

explicitly types the stored final event and event list before checking grounding metadata;
casts the new grounding metadata JSON serialization result back to str.
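
The two typing fixes listed above follow a common pattern, sketched here under assumptions: `dump_untyped` is a hypothetical helper standing in for any serialization call that a type checker sees as returning `Any`, and the variable names are illustrative rather than taken from the PR.

```python
import json
from typing import Any, Optional, cast


def dump_untyped(payload: dict[str, Any]) -> Any:
  """Hypothetical serializer whose declared return type is Any."""
  return json.dumps(payload, sort_keys=True)


def serialize_grounding(payload: Optional[dict[str, Any]]) -> str:
  # Explicit annotations keep the checker from inferring a broader type
  # for the event list and the stored final event.
  events: list[dict[str, Any]] = [] if payload is None else [payload]
  final_event: Optional[dict[str, Any]] = events[-1] if events else None
  if final_event is None:
    return ""
  # cast() has no runtime effect; it only tells the checker the value is a str.
  return cast(str, dump_untyped(final_event))
```

Without the cast, a checker configured to warn on returning `Any` from a function declared to return `str` would be expected to flag the last line.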

Validation on Windows:

uv run --no-sync pytest tests\unittests\evaluation\test_evaluation_generator.py tests\unittests\evaluation\test_llm_as_judge_utils.py tests\unittests\evaluation\test_rubric_based_final_response_quality_v1.py -q --basetemp .tmp\pytest-5834-20260527b -p no:cacheprovider -> 51 passed, 18 warnings
uv run --no-sync python -m py_compile src\google\adk\evaluation\eval_case.py src\google\adk\evaluation\evaluation_generator.py src\google\adk\evaluation\llm_as_judge_utils.py src\google\adk\evaluation\rubric_based_final_response_quality_v1.py tests\unittests\evaluation\test_evaluation_generator.py tests\unittests\evaluation\test_llm_as_judge_utils.py tests\unittests\evaluation\test_rubric_based_final_response_quality_v1.py
uv run --no-sync pyink --check src\google\adk\evaluation\eval_case.py src\google\adk\evaluation\evaluation_generator.py src\google\adk\evaluation\llm_as_judge_utils.py src\google\adk\evaluation\rubric_based_final_response_quality_v1.py tests\unittests\evaluation\test_evaluation_generator.py tests\unittests\evaluation\test_llm_as_judge_utils.py tests\unittests\evaluation\test_rubric_based_final_response_quality_v1.py
uv run --no-sync isort --check-only on the same files
git diff --check upstream/main...HEAD

A full local mypy run on these two modules still reports existing unrelated module-level errors, but the two new CI-diff errors are gone.

rohityan · 2026-05-27T18:21:56Z

Hi @he-yufeng, thank you for your contribution! We appreciate you taking the time to submit this pull request. Please fix the formatting errors by running autoformat.sh.

he-yufeng · 2026-05-27T18:25:06Z

Thanks, addressed in the latest push (2d17c2f).

I rebased onto current upstream/main and ran the formatter checks over the files touched by this PR. I do not see an autoformat.sh script in this checkout, so I used the repository's configured pyink + isort path directly.

Validation on Windows:

uv run --no-sync pyink src\google\adk\evaluation\eval_case.py src\google\adk\evaluation\evaluation_generator.py src\google\adk\evaluation\llm_as_judge_utils.py src\google\adk\evaluation\rubric_based_final_response_quality_v1.py tests\unittests\evaluation\test_evaluation_generator.py tests\unittests\evaluation\test_llm_as_judge_utils.py tests\unittests\evaluation\test_rubric_based_final_response_quality_v1.py
uv run --no-sync isort on the same files
uv run --no-sync pyink --check on the same files
uv run --no-sync isort --check-only on the same files
uv run --no-sync pytest tests\unittests\evaluation\test_evaluation_generator.py tests\unittests\evaluation\test_llm_as_judge_utils.py tests\unittests\evaluation\test_rubric_based_final_response_quality_v1.py -q --basetemp .tmp\pytest-5834-run -p no:cacheprovider -> 51 passed
uv run --no-sync python -m py_compile on the same files
git diff --check

he-yufeng · 2026-05-27T18:26:41Z

Rebased once more after the upstream formatting fix landed; current head is 2682c6b.

Revalidated the same focused checks:

uv run --no-sync pytest tests\unittests\evaluation\test_evaluation_generator.py tests\unittests\evaluation\test_llm_as_judge_utils.py tests\unittests\evaluation\test_rubric_based_final_response_quality_v1.py -q --basetemp .tmp\pytest-5834-run2 -p no:cacheprovider -> 51 passed
uv run --no-sync pyink --check on the touched files
uv run --no-sync isort --check-only on the touched files
uv run --no-sync python -m py_compile on the touched files
git diff --check

adk-bot added the "eval" label ([Component] This issue is related to evaluation) on May 24, 2026

sanketpatil06 mentioned this pull request May 25, 2026

rubric_based_final_response_quality_v1 is hard to use for factual evaluation of google_search agents because its judge prompt requires tool_response evidence #5831

Open

ftnext approved these changes May 25, 2026

rohityan self-assigned this May 26, 2026

rohityan added the "request clarification" label ([Status] The maintainer need clarification or more information from the author) on May 26, 2026

he-yufeng force-pushed the fix/google-search-rubric-evidence branch from b8618c6 to fa508ea on May 27, 2026 09:06

he-yufeng force-pushed the fix/google-search-rubric-evidence branch from 6898a5e to 2d17c2f on May 27, 2026 18:24

he-yufeng added 2 commits May 28, 2026 02:25

fix: include grounding metadata in rubric judge prompt

a737e53

fix: satisfy eval grounding typing

2682c6b

he-yufeng force-pushed the fix/google-search-rubric-evidence branch from 2d17c2f to 2682c6b on May 27, 2026 18:26

he-yufeng wants to merge 2 commits into google:main from he-yufeng:fix/google-search-rubric-evidence
