fix(chat_format): parse Gemma 4 native tool-call tokens into tool_calls (#2227) by Anai-Guo · Pull Request #2232 · abetlen/llama-cpp-python

Anai-Guo · 2026-05-28T01:40:19Z

Summary

Closes #2227.

Adds @register_chat_completion_handler("gemma4") so that create_chat_completion() with Gemma 4 + tools actually returns parsed tool_calls instead of dumping native tokens into message.content.

What changes

llama_cpp/llama_chat_format.py
- New _parse_gemma4_native_tool_calls(text) — pure-Python parser for the Gemma 4 native tool-call grammar, including the optional <|channel>thought…<channel|> block that thinking mode adds.
- New gemma4_chat_completion handler that uses the GGUF-embedded Jinja2 chat template for prompt rendering, runs llama.create_completion, and post-parses the output.
- Adds import re.
tests/test_llama_chat_format.py — 8 new tests covering the issue repro, mixed primitives (int/float/bool/null), list of strings, thought-block stripping, plain-text passthrough, multiple sequential calls, surrounding plain text, and string values with embedded ".

Why this design

Reuse the GGUF Jinja template. Gemma 4 GGUFs already ship a correct chat template that produces the right tool-prompt tokens — the bug was strictly on the parsing side, not the formatting side. Re-using Jinja2ChatFormatter keeps prompt rendering in lockstep with whatever the model author shipped, instead of hard-coding another copy that can drift.
Match the C++ side. ggml-org/llama.cpp#21326 already added the equivalent PEG parser to llama-server. This PR is the Python port, with the same grammar:

Type Encoding

string key:<|"|>value<|"|>

int key:30

float key:3.5

bool key:true / key:false

null key:null

list key:[v1,v2,...]

The 3-char <|"|> delimiter means a literal " inside a string value never terminates it — no escape handling needed.

Known limitation

Streaming responses currently pass chunks through unchanged; the caller still gets the raw native tokens. A streaming tool-call parser needs the same incremental PEG state machine the C++ side uses, which is a bigger change. The public _parse_gemma4_native_tool_calls helper is documented so callers can buffer chunks and re-parse if they need streaming today.

Test plan

Pure-parser tests pass locally (no GGUF download required for the new tests — they exercise _parse_gemma4_native_tool_calls directly, matching the style of the existing tests in this file).
Maintainer-side: end-to-end with a real gemma-4-*.gguf and a tools request, to confirm the Jinja-template path renders correctly and the handler returns tool_calls.

References

Issue: Gemma 4 tool calls returned as raw native tokens in content instead of tool_calls #2227
Gemma 4 prompt format spec: https://ai.google.dev/gemma/docs/core/prompt-formatting-gemma4
C++ port: Gemma 4 template parser fixes ggml-org/llama.cpp#21326
Related upstream issues: Misc. bug: Gemma4 tool calling leaves unexpected tokens in tool calls ggml-org/llama.cpp#21316, Eval bug: Gemma 4 tool call returned as content ggml-org/llama.cpp#22786

🤖 Generated with Claude Code. AI-assisted, human reviewed.

…ls (abetlen#2227) Adds @register_chat_completion_handler("gemma4") that: 1. Uses the GGUF-embedded Jinja2 chat template to render prompts (Gemma 4 GGUFs ship a correct one out of the box). 2. After generation, parses Gemma 4 native tool-call tokens <|tool_call>call:NAME{key:value,...}<tool_call|> into OpenAI-compatible tool_calls on the assistant message, and strips the optional <|channel>thought ... <channel|> block emitted when thinking mode is enabled. Argument-value grammar follows the spec at https://ai.google.dev/gemma/docs/core/prompt-formatting-gemma4 : strings via <|"|>...<|"|>, primitives (int/float/bool/null) bare, lists via [v1,v2,...]. The 3-char <|"|> delimiter means a literal double quote inside a string value never terminates it, so no escaping is needed. Mirrors the PEG-grammar fix the C++ side already shipped in ggml-org/llama.cpp#21326. Non-streaming responses get parsed tool calls; streaming responses pass chunks through unchanged for now (callers can re-parse with the public helper). Tests cover: issue repro, mixed primitives, list-of-strings, thought-block stripping, plain-text passthrough, multiple calls, surrounding plain text, and embedded quotes in string values. Closes abetlen#2227 🤖 Generated with [Claude Code](https://claude.com/claude-code)

…E402 The Gemma 4 parser tests were appended below the existing test_hf_tokenizer_config_str_to_chat_formatter, with their own module- level docstring and re-imports of json / llama_cpp.llama_chat_format that ruff flagged as E402 (module-level import not at top of file). Both imports are already at lines 1 and 9 respectively, so deleting the duplicate block is a no-op for the runtime behaviour. The orientation note that used to live in the stray docstring is preserved as an inline comment block above the new test functions.

Anai-Guo · 2026-05-28T02:11:09Z

Closing this PR. After fixing the initial ruff E402 violations, ruff format --check flagged additional formatting drift in both llama_cpp/llama_chat_format.py and tests/test_llama_chat_format.py that I can't cleanly resolve without running the formatter locally. This PR was opened by an automated pipeline under my account without me catching it; rather than push a half-fixed branch I'd rather hand the slate back. Issue #2227 remains a real bug — the C++ llama-server PEG-grammar parser (ggml-org/llama.cpp#21326) is the reference fix. Apologies for the noise.

Anai-Guo · 2026-05-28T02:11:11Z

Withdrawing orphan PR; #2227 remains open for a clean re-attempt.

Resolves the ruff format --check drift that blocked the original PR; no logic changes.

Anai-Guo · 2026-06-07T01:07:56Z

Reopening after resolving the ruff format --check drift that I'd flagged when withdrawing this earlier. The two files now pass both ruff check llama_cpp tests and ruff format --check llama_cpp tests locally (ruff 0.15.14, matching the CI pin >=0.15.7), and the new gemma4 parser tests pass. No logic changed since the original review — only formatting.

🤖 Generated with Claude Code

Anai-Guo added 2 commits May 27, 2026 18:39

Anai-Guo closed this May 28, 2026

style: apply ruff format to gemma4 tool-call handler and tests

4834f04

Resolves the ruff format --check drift that blocked the original PR; no logic changes.

Anai-Guo reopened this Jun 7, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(chat_format): parse Gemma 4 native tool-call tokens into tool_calls (#2227)#2232

fix(chat_format): parse Gemma 4 native tool-call tokens into tool_calls (#2227)#2232
Anai-Guo wants to merge 3 commits into
abetlen:mainfrom
Anai-Guo:fix/gemma4-tool-call-parsing

Anai-Guo commented May 28, 2026

Uh oh!

Anai-Guo commented May 28, 2026

Uh oh!

Anai-Guo commented May 28, 2026

Uh oh!

Anai-Guo commented Jun 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Type	Encoding
string	`key:<\|"\|>value<\|"\|>`
int	`key:30`
float	`key:3.5`
bool	`key:true` / `key:false`
null	`key:null`
list	`key:[v1,v2,...]`

Conversation

Anai-Guo commented May 28, 2026

Summary

What changes

Why this design

Known limitation

Test plan

References

Uh oh!

Anai-Guo commented May 28, 2026

Uh oh!

Anai-Guo commented May 28, 2026

Uh oh!

Anai-Guo commented Jun 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant