fix: include grounding metadata in rubric judge prompt #5834
ftnext
left a comment
Thanks!❤️
#5831 (comment)
b8618c6 to fa508ea
Rebased onto current upstream/main and pushed a small follow-up for the CI typing failures. Current head is fa508ea. The follow-up fixes the two new mypy-diff errors from the previous run:
Validation on Windows:
A full local mypy run on these two modules still reports existing unrelated module-level errors, but the two new CI-diff errors are gone.
Hi @he-yufeng, thank you for your contribution! We appreciate you taking the time to submit this pull request. Please fix the formatting errors by running autoformat.sh.
6898a5e to 2d17c2f
Thanks, addressed in the latest push (2d17c2f). I rebased onto current upstream/main and ran the formatter checks over the files touched by this PR. I do not see an autoformat.sh script in this checkout, so I used the repository's configured pyink + isort path directly. Validation on Windows:
2d17c2f to 2682c6b
Rebased once more after the upstream formatting fix landed; current head is 2682c6b. Revalidated the same focused checks:
Summary
This updates the rubric-based final response quality evaluator so model-supplied grounding metadata is available to the LLM-as-judge prompt.
The issue is easiest to hit with model-internal tools such as google_search: the evaluator currently tells the judge to trust only function tool_response values, but those raw search results may not appear as normal function tool responses. ADK events can still carry grounding metadata, so this patch preserves that metadata in eval invocation events and serializes it into the judge prompt as trusted evidence.
Final answer text is still not treated as evidence.
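For reviewers, the idea can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in this PR: SketchEvent, build_trusted_evidence, and the web_search_queries key are hypothetical stand-ins, and the real change operates on ADK's own event and invocation types.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class SketchEvent:
  """Hypothetical stand-in for an eval invocation event (not the ADK class)."""

  final_text: Optional[str] = None
  tool_responses: list[dict[str, Any]] = field(default_factory=list)
  grounding_metadata: Optional[dict[str, Any]] = None


def build_trusted_evidence(events: list[SketchEvent]) -> str:
  """Serializes tool responses and grounding metadata as judge evidence."""
  sections: list[str] = []
  for index, event in enumerate(events):
    for response in event.tool_responses:
      payload = json.dumps(response, sort_keys=True)
      sections.append(f"[event {index}] tool_response: {payload}")
    if event.grounding_metadata:
      payload = json.dumps(event.grounding_metadata, sort_keys=True)
      sections.append(f"[event {index}] grounding_metadata: {payload}")
    # final_text is deliberately skipped: it is the answer being judged,
    # so it must not be presented to the judge as evidence.
  return "\n".join(sections)


# Illustrative input: a search-grounded answer with no function tool response.
events = [
    SketchEvent(
        final_text="Paris is the capital.",
        grounding_metadata={"web_search_queries": ["capital of France"]},
    )
]
evidence = build_trusted_evidence(events)
```

In this sketch the evidence string carries the grounding metadata even though the event has no function tool response, while the final answer text stays out of it.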
Fixes #5831.
To verify
python -m py_compile src/google/adk/evaluation/eval_case.py src/google/adk/evaluation/evaluation_generator.py src/google/adk/evaluation/llm_as_judge_utils.py src/google/adk/evaluation/rubric_based_final_response_quality_v1.py tests/unittests/evaluation/test_evaluation_generator.py tests/unittests/evaluation/test_llm_as_judge_utils.py tests/unittests/evaluation/test_rubric_based_final_response_quality_v1.py
.venv\Scripts\python.exe -m pyink --check src\google\adk\evaluation\eval_case.py src\google\adk\evaluation\evaluation_generator.py src\google\adk\evaluation\llm_as_judge_utils.py src\google\adk\evaluation\rubric_based_final_response_quality_v1.py tests\unittests\evaluation\test_evaluation_generator.py tests\unittests\evaluation\test_llm_as_judge_utils.py tests\unittests\evaluation\test_rubric_based_final_response_quality_v1.py
.venv\Scripts\python.exe -m isort --check-only src\google\adk\evaluation\eval_case.py src\google\adk\evaluation\evaluation_generator.py src\google\adk\evaluation\llm_as_judge_utils.py src\google\adk\evaluation\rubric_based_final_response_quality_v1.py tests\unittests\evaluation\test_evaluation_generator.py tests\unittests\evaluation\test_llm_as_judge_utils.py tests\unittests\evaluation\test_rubric_based_final_response_quality_v1.py
.venv\Scripts\python.exe -m pytest tests\unittests\evaluation\test_eval_case.py tests\unittests\evaluation\test_llm_as_judge_utils.py tests\unittests\evaluation\test_rubric_based_final_response_quality_v1.py tests\unittests\evaluation\test_evaluation_generator.py -q
git diff --check
I also ran targeted pylint on the touched files. It still reports existing module-wide style warnings in these evaluation tests/modules, but no unused-import or grounding-metadata-specific issue remains.