fix(chat_format): parse Gemma 4 native tool-call tokens into tool_calls (#2227) #2232
Anai-Guo wants to merge 3 commits into …
…ls (abetlen#2227)

Adds @register_chat_completion_handler("gemma4") that:

1. Uses the GGUF-embedded Jinja2 chat template to render prompts (Gemma 4 GGUFs ship a correct one out of the box).
2. After generation, parses Gemma 4 native tool-call tokens <|tool_call>call:NAME{key:value,...}<tool_call|> into OpenAI-compatible tool_calls on the assistant message, and strips the optional <|channel>thought ... <channel|> block emitted when thinking mode is enabled.

Argument-value grammar follows the spec at https://ai.google.dev/gemma/docs/core/prompt-formatting-gemma4 : strings via <|"|>...<|"|>, primitives (int/float/bool/null) bare, lists via [v1,v2,...]. The multi-character <|"|> delimiter means a literal double quote inside a string value never terminates it, so no escaping is needed.

Mirrors the PEG-grammar fix the C++ side already shipped in ggml-org/llama.cpp#21326.

Non-streaming responses get parsed tool calls; streaming responses pass chunks through unchanged for now (callers can re-parse with the public helper).

Tests cover: issue repro, mixed primitives, list-of-strings, thought-block stripping, plain-text passthrough, multiple calls, surrounding plain text, and embedded quotes in string values.

Closes abetlen#2227

🤖 Generated with [Claude Code](https://claude.com/claude-code)
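The grammar described in this commit message can be illustrated with a small standalone parser. This is a minimal sketch, not the PR's implementation: the names `parse_tool_calls_sketch`, `_parse_args`, `_parse_value` and `_parse_primitive` are hypothetical, the `(content, tool_calls)` return shape and the `call_N` ids are assumptions, and nested objects or malformed input are not handled.

```python
import json
import re

STR = '<|"|>'  # string delimiter named in the commit message
CALL_RE = re.compile(r"<\|tool_call>call:([\w.\-]+)\{(.*?)\}<tool_call\|>", re.DOTALL)
THOUGHT_RE = re.compile(r"<\|channel>thought.*?<channel\|>", re.DOTALL)


def _skip_ws(s, i):
    while i < len(s) and s[i].isspace():
        i += 1
    return i


def _parse_primitive(tok):
    """Map a bare token to bool / None / int / float."""
    if tok in ("true", "false"):
        return tok == "true"
    if tok == "null":
        return None
    for cast in (int, float):
        try:
            return cast(tok)
        except ValueError:
            pass
    return tok  # assumption: unknown bare tokens are kept as raw text


def _parse_value(s, i):
    """Parse one value starting at s[i]; return (value, next_index)."""
    i = _skip_ws(s, i)
    if s.startswith(STR, i):
        # Everything up to the next delimiter is literal, including '"'.
        end = s.index(STR, i + len(STR))
        return s[i + len(STR):end], end + len(STR)
    if i < len(s) and s[i] == "[":
        items, i = [], _skip_ws(s, i + 1)
        while s[i] != "]":  # raises IndexError on an unterminated list
            item, i = _parse_value(s, i)
            items.append(item)
            i = _skip_ws(s, i)
            if s[i] == ",":
                i = _skip_ws(s, i + 1)
        return items, i + 1
    # Bare primitive: runs until the next ',' or ']' or end of input.
    j = i
    while j < len(s) and s[j] not in ",]":
        j += 1
    return _parse_primitive(s[i:j].strip()), j


def _parse_args(body):
    """Parse 'key:value,key:value' into a dict."""
    args, i = {}, 0
    while i < len(body):
        colon = body.index(":", i)
        key = body[i:colon].strip()
        value, i = _parse_value(body, colon + 1)
        args[key] = value
        i = _skip_ws(body, i)
        if i < len(body) and body[i] == ",":
            i += 1
    return args


def parse_tool_calls_sketch(text):
    """Return (content, tool_calls) in an OpenAI-style shape (assumed)."""
    text = THOUGHT_RE.sub("", text)  # drop the optional thought block
    calls = []
    for n, m in enumerate(CALL_RE.finditer(text)):
        arguments = json.dumps(_parse_args(m.group(2).strip()))
        calls.append({
            "id": f"call_{n}",  # hypothetical id scheme
            "type": "function",
            "function": {"name": m.group(1), "arguments": arguments},
        })
    content = CALL_RE.sub("", text).strip()
    return (content or None), calls
```

Because string values are located by searching for the next `<|"|>` delimiter rather than a single quote character, a value such as `San "Fran"` passes through untouched, which is the property the commit message relies on.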
…E402

The Gemma 4 parser tests were appended below the existing test_hf_tokenizer_config_str_to_chat_formatter, with their own module-level docstring and re-imports of json / llama_cpp.llama_chat_format that ruff flagged as E402 (module-level import not at top of file). Both imports are already at lines 1 and 9 respectively, so deleting the duplicate block is a no-op for the runtime behaviour. The orientation note that used to live in the stray docstring is preserved as an inline comment block above the new test functions.
Closing this PR. After fixing the initial …

Withdrawing orphan PR; #2227 remains open for a clean re-attempt.
Resolves the ruff format --check drift that blocked the original PR; no logic changes.
Reopening after resolving the … 🤖 Generated with Claude Code
Summary
Closes #2227.
Adds `@register_chat_completion_handler("gemma4")` so that `create_chat_completion()` with Gemma 4 + tools actually returns parsed `tool_calls` instead of dumping native tokens into `message.content`.

What changes
- `llama_cpp/llama_chat_format.py`
  - `_parse_gemma4_native_tool_calls(text)`: pure-Python parser for the Gemma 4 native tool-call grammar, including the optional `<|channel>thought…<channel|>` block that thinking mode adds.
  - `gemma4_chat_completion` handler that uses the GGUF-embedded Jinja2 chat template for prompt rendering, runs `llama.create_completion`, and post-parses the output.
  - `import re`.
- `tests/test_llama_chat_format.py`: 8 new tests covering the issue repro, mixed primitives (int/float/bool/null), list of strings, thought-block stripping, plain-text passthrough, multiple sequential calls, surrounding plain text, and string values with embedded `"`.

Why this design
Reuse the GGUF Jinja template. Gemma 4 GGUFs already ship a correct chat template that produces the right tool-prompt tokens; the bug was strictly on the parsing side, not the formatting side. Re-using `Jinja2ChatFormatter` keeps prompt rendering in lockstep with whatever the model author shipped, instead of hard-coding another copy that can drift.

Match the C++ side. ggml-org/llama.cpp#21326 already added the equivalent PEG parser to `llama-server`. This PR is the Python port, with the same grammar:

- `key:<|"|>value<|"|>`
- `key:30`
- `key:3.5`
- `key:true` / `key:false`
- `key:null`
- `key:[v1,v2,...]`

The multi-character `<|"|>` delimiter means a literal `"` inside a string value never terminates it, so no escape handling is needed.

Known limitation
Streaming responses currently pass chunks through unchanged; the caller still gets the raw native tokens. A streaming tool-call parser needs the same incremental PEG state machine the C++ side uses, which is a bigger change. The public `_parse_gemma4_native_tool_calls` helper is documented so callers can buffer chunks and re-parse if they need streaming today.

Test plan
- … `_parse_gemma4_native_tool_calls` directly, matching the style of the existing tests in this file).
- … `gemma-4-*.gguf` and a tools request, to confirm the Jinja-template path renders correctly and the handler returns `tool_calls`.

References
- … `content` instead of `tool_calls` #2227

🤖 Generated with Claude Code. AI-assisted, human reviewed.
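For the streaming workaround mentioned under "Known limitation", the buffering step could look roughly like the sketch below. It assumes streamed chunks follow the OpenAI-style `choices[0].delta.content` layout; `collect_stream_text` is a hypothetical helper name, not part of this PR.

```python
from typing import Any, Dict, Iterable


def collect_stream_text(chunks: Iterable[Dict[str, Any]]) -> str:
    """Concatenate the content deltas of a streamed chat completion.

    Assumes each chunk looks like {"choices": [{"delta": {...}}]}; chunks
    without a "content" delta (role or finish chunks) are skipped.
    """
    parts = []
    for chunk in chunks:
        delta = chunk["choices"][0].get("delta", {})
        piece = delta.get("content")
        if piece:
            parts.append(piece)
    return "".join(parts)
```

The buffered string can then be handed to the parser helper once the stream ends. Because a native token such as `<|tool_call>` may be split across chunks, parsing only after the full text is assembled avoids needing an incremental state machine, at the cost of not surfacing tool calls until generation finishes.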