fix: include grounding metadata in rubric judge prompt#5834

Open
he-yufeng wants to merge 2 commits into
google:mainfrom
he-yufeng:fix/google-search-rubric-evidence

@he-yufeng

Summary

This updates the rubric-based final response quality evaluator so model-supplied grounding metadata is available to the LLM-as-judge prompt.

The issue is easiest to hit with model-internal tools such as google_search: the evaluator currently tells the judge to trust only function tool_response values, but those raw search results may not appear as normal function tool responses. ADK events can still carry grounding metadata, so this patch preserves that metadata in eval invocation events and serializes it into the judge prompt as trusted evidence.

Final answer text is still not treated as evidence.
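The evidence-selection rule described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `SketchEvent`, `collect_trusted_evidence`, and the dictionary shapes are hypothetical stand-ins, and the real ADK event and grounding metadata types are not used here.

```python
import json
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class SketchEvent:
    """Hypothetical stand-in for an eval invocation event (not the ADK type)."""

    author: str
    text: str = ""
    tool_response: Optional[dict[str, Any]] = None
    grounding_metadata: Optional[dict[str, Any]] = None


def collect_trusted_evidence(events: list[SketchEvent]) -> str:
    """Serialize tool responses and grounding metadata for a judge prompt.

    The event's answer text is deliberately left out, so the final answer
    is never presented to the judge as evidence for itself.
    """
    evidence: list[dict[str, Any]] = []
    for event in events:
        if event.tool_response is not None:
            evidence.append({"tool_response": event.tool_response})
        if event.grounding_metadata is not None:
            evidence.append({"grounding_metadata": event.grounding_metadata})
    return json.dumps(evidence, indent=2, sort_keys=True)


# A model-internal search produces no function tool response, only metadata.
events = [
    SketchEvent(
        author="agent",
        text="Paris is the capital of France.",
        grounding_metadata={"web_search_queries": ["capital of France"]},
    ),
]
prompt_evidence = collect_trusted_evidence(events)
print(prompt_evidence)
```

In this sketch the search query reaches the judge through the grounding metadata entry, while the agent's own sentence does not appear in the serialized evidence.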

Fixes #5831.

To verify

  • python -m py_compile src/google/adk/evaluation/eval_case.py src/google/adk/evaluation/evaluation_generator.py src/google/adk/evaluation/llm_as_judge_utils.py src/google/adk/evaluation/rubric_based_final_response_quality_v1.py tests/unittests/evaluation/test_evaluation_generator.py tests/unittests/evaluation/test_llm_as_judge_utils.py tests/unittests/evaluation/test_rubric_based_final_response_quality_v1.py
  • .venv\Scripts\python.exe -m pyink --check src\google\adk\evaluation\eval_case.py src\google\adk\evaluation\evaluation_generator.py src\google\adk\evaluation\llm_as_judge_utils.py src\google\adk\evaluation\rubric_based_final_response_quality_v1.py tests\unittests\evaluation\test_evaluation_generator.py tests\unittests\evaluation\test_llm_as_judge_utils.py tests\unittests\evaluation\test_rubric_based_final_response_quality_v1.py
  • .venv\Scripts\python.exe -m isort --check-only src\google\adk\evaluation\eval_case.py src\google\adk\evaluation\evaluation_generator.py src\google\adk\evaluation\llm_as_judge_utils.py src\google\adk\evaluation\rubric_based_final_response_quality_v1.py tests\unittests\evaluation\test_evaluation_generator.py tests\unittests\evaluation\test_llm_as_judge_utils.py tests\unittests\evaluation\test_rubric_based_final_response_quality_v1.py
  • .venv\Scripts\python.exe -m pytest tests\unittests\evaluation\test_eval_case.py tests\unittests\evaluation\test_llm_as_judge_utils.py tests\unittests\evaluation\test_rubric_based_final_response_quality_v1.py tests\unittests\evaluation\test_evaluation_generator.py -q
  • git diff --check

I also ran targeted pylint on the touched files. It still reports existing module-wide style warnings in these evaluation tests/modules, but no unused-import or grounding-metadata-specific issue remains.

Contributor

@ftnext ftnext left a comment

Thanks!❤️
#5831 (comment)

@rohityan rohityan self-assigned this May 26, 2026
@rohityan rohityan added the request clarification [Status] The maintainer need clarification or more information from the author label May 26, 2026
@he-yufeng he-yufeng force-pushed the fix/google-search-rubric-evidence branch from b8618c6 to fa508ea Compare May 27, 2026 09:06
@he-yufeng
Author

Rebased onto current upstream/main and pushed a small follow-up for the CI typing failures. Current head is fa508ea.

The follow-up fixes the two new mypy-diff errors from the previous run:

  • explicitly types the stored final event and event list before checking grounding metadata;
  • casts the new grounding metadata JSON serialization result back to str.
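For readers unfamiliar with this kind of mypy fix, the two changes listed above follow a common pattern that might look like the sketch below. The class and function names are hypothetical and the actual diff may differ; the sketch only assumes that the metadata object is typed loosely enough for its serialization call to be inferred as `Any`.

```python
import json
from typing import Any, Optional, cast


class SketchMetadata:
    """Hypothetical metadata object with a JSON serialization method."""

    def __init__(self, queries: list[str]) -> None:
        self.queries = queries

    def model_dump_json(self) -> str:
        return json.dumps({"web_search_queries": self.queries})


class SketchEvent:
    """Hypothetical event whose metadata attribute is loosely typed."""

    def __init__(self, grounding_metadata: Optional[Any] = None) -> None:
        self.grounding_metadata = grounding_metadata


def serialize_final_grounding(raw_events: list[Any]) -> Optional[str]:
    # Explicit annotations give the type checker a concrete type for the
    # event list and the final event before any attribute access.
    events: list[SketchEvent] = [
        e for e in raw_events if isinstance(e, SketchEvent)
    ]
    final_event: Optional[SketchEvent] = events[-1] if events else None
    if final_event is None or final_event.grounding_metadata is None:
        return None
    # The attribute is typed Any, so the call result is also Any; cast()
    # tells the type checker it is a str and does nothing at runtime.
    return cast(str, final_event.grounding_metadata.model_dump_json())


print(serialize_final_grounding([SketchEvent(SketchMetadata(["q"]))]))
```

`cast` changes nothing at runtime, so it only suits cases where the author already knows the value is a `str`.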

Validation on Windows:

  • uv run --no-sync pytest tests\unittests\evaluation\test_evaluation_generator.py tests\unittests\evaluation\test_llm_as_judge_utils.py tests\unittests\evaluation\test_rubric_based_final_response_quality_v1.py -q --basetemp .tmp\pytest-5834-20260527b -p no:cacheprovider -> 51 passed, 18 warnings
  • uv run --no-sync python -m py_compile src\google\adk\evaluation\eval_case.py src\google\adk\evaluation\evaluation_generator.py src\google\adk\evaluation\llm_as_judge_utils.py src\google\adk\evaluation\rubric_based_final_response_quality_v1.py tests\unittests\evaluation\test_evaluation_generator.py tests\unittests\evaluation\test_llm_as_judge_utils.py tests\unittests\evaluation\test_rubric_based_final_response_quality_v1.py
  • uv run --no-sync pyink --check src\google\adk\evaluation\eval_case.py src\google\adk\evaluation\evaluation_generator.py src\google\adk\evaluation\llm_as_judge_utils.py src\google\adk\evaluation\rubric_based_final_response_quality_v1.py tests\unittests\evaluation\test_evaluation_generator.py tests\unittests\evaluation\test_llm_as_judge_utils.py tests\unittests\evaluation\test_rubric_based_final_response_quality_v1.py
  • uv run --no-sync isort --check-only on the same files
  • git diff --check upstream/main...HEAD

A full local mypy run on these two modules still reports existing unrelated module-level errors, but the two new CI-diff errors are gone.

@rohityan
Collaborator

Hi @he-yufeng, thank you for your contribution! We appreciate you taking the time to submit this pull request. Please fix the formatting errors by running autoformat.sh.

@he-yufeng he-yufeng force-pushed the fix/google-search-rubric-evidence branch from 6898a5e to 2d17c2f Compare May 27, 2026 18:24
@he-yufeng
Author

Thanks, addressed in the latest push (2d17c2f).

I rebased onto current upstream/main and ran the formatter checks over the files touched by this PR. I do not see an autoformat.sh script in this checkout, so I used the repository's configured pyink + isort path directly.

Validation on Windows:

  • uv run --no-sync pyink src\google\adk\evaluation\eval_case.py src\google\adk\evaluation\evaluation_generator.py src\google\adk\evaluation\llm_as_judge_utils.py src\google\adk\evaluation\rubric_based_final_response_quality_v1.py tests\unittests\evaluation\test_evaluation_generator.py tests\unittests\evaluation\test_llm_as_judge_utils.py tests\unittests\evaluation\test_rubric_based_final_response_quality_v1.py
  • uv run --no-sync isort on the same files
  • uv run --no-sync pyink --check on the same files
  • uv run --no-sync isort --check-only on the same files
  • uv run --no-sync pytest tests\unittests\evaluation\test_evaluation_generator.py tests\unittests\evaluation\test_llm_as_judge_utils.py tests\unittests\evaluation\test_rubric_based_final_response_quality_v1.py -q --basetemp .tmp\pytest-5834-run -p no:cacheprovider -> 51 passed
  • uv run --no-sync python -m py_compile on the same files
  • git diff --check

@he-yufeng he-yufeng force-pushed the fix/google-search-rubric-evidence branch from 2d17c2f to 2682c6b Compare May 27, 2026 18:26
@he-yufeng
Author

Rebased once more after the upstream formatting fix landed; current head is 2682c6b.

Revalidated the same focused checks:

  • uv run --no-sync pytest tests\unittests\evaluation\test_evaluation_generator.py tests\unittests\evaluation\test_llm_as_judge_utils.py tests\unittests\evaluation\test_rubric_based_final_response_quality_v1.py -q --basetemp .tmp\pytest-5834-run2 -p no:cacheprovider -> 51 passed
  • uv run --no-sync pyink --check on the touched files
  • uv run --no-sync isort --check-only on the touched files
  • uv run --no-sync python -m py_compile on the touched files
  • git diff --check

Labels

eval [Component] This issue is related to evaluation request clarification [Status] The maintainer need clarification or more information from the author

4 participants