
extract_answer: prefer boxed{N} extraction, fall back to legacy tags#4150

Open: py4 wants to merge 1 commit into main from pr/extract-answer-boxed



py4 (Collaborator) commented Jun 11, 2026

Summary

extract_answer (used by the built-in check_numbers reward) returned the raw <answer> content. Modern reasoning models (Qwen3, DeepSeek-R1, etc.) emit the final answer as \boxed{N} (the maxtext GSM8K chat template itself asks for {solution_start_token}\boxed{}{solution_end_token}). So the extractor returned the literal string \boxed{42}, which math_verify cannot match against a bare numeric gold like 42. Result: ~0% accuracy on Qwen3/GSM8K even when the model's numeric answer is correct, a silent scoring failure rather than a model failure.

This makes the extractor consistent with maxtext's own chat template.
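The failure mode described above can be reproduced in a few lines. This is a minimal sketch, not the code from `utils_rl.py`: `legacy_extract` is a hypothetical stand-in for the old behavior, the token values `<answer>` / `</answer>` are assumed, and a plain string comparison stands in for the `math_verify` check.

```python
import re
from typing import Optional

# Assumed token values; the real ones come from the recipe's configuration.
SOLUTION_START = "<answer>"
SOLUTION_END = "</answer>"

# Hypothetical stand-in for the legacy extractor: return the raw tag content.
LEGACY_RE = re.compile(
    re.escape(SOLUTION_START) + r"(.*?)" + re.escape(SOLUTION_END), re.DOTALL
)


def legacy_extract(response: str) -> Optional[str]:
  """Return the stripped content of the last answer block, or None."""
  matches = LEGACY_RE.findall(response)
  return matches[-1].strip() if matches else None


response = r"<reasoning>6 * 7 = 42</reasoning><answer>\boxed{42}</answer>"
extracted = legacy_extract(response)
gold = "42"

# The model's number is right, but the extracted string still carries the
# LaTeX wrapper, so a comparison against the bare gold value fails.
print(repr(extracted), extracted == gold)
```

The extracted value keeps the `\boxed{...}` wrapper, which is why a numerically correct response is scored as wrong.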

Strategy (priority order)

  1. If <answer>...</answer> is present, scope to the last block's content; otherwise use the full response.
  2. Inside the scope, extract the last \boxed{N} via a brace-balanced scan (handles nested LaTeX like \boxed{\frac{1}{2}}), with a permissive regex fallback.
  3. If no \boxed is found, fall back to the legacy {solution_start_token}...{solution_end_token} regex, so recipes that emit plain-text answers are unaffected.

Step 3 is what keeps this backward compatible.
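The three tiers above could be sketched roughly as follows. This is an illustration under stated assumptions rather than the code in this PR: the token values, the `FALLBACK_ANSWER` placeholder value, the helper name `last_boxed`, and in particular the permissive regex are guesses at what such an implementation might look like.

```python
import re
from typing import Optional

# Assumed values; the real tokens and fallback constant live in the repo.
SOLUTION_START = "<answer>"
SOLUTION_END = "</answer>"
FALLBACK_ANSWER = ""  # placeholder, the real value is not shown in this PR

_ANSWER_RE = re.compile(
    re.escape(SOLUTION_START) + r"(.*?)" + re.escape(SOLUTION_END), re.DOTALL
)
# Hypothetical permissive pattern: tolerates a missing closing brace or a
# space before the opening brace.
_BOXED_LOOSE_RE = re.compile(r"\\boxed\s*\{?\s*([^\s{}$]+)")


def last_boxed(text: str) -> Optional[str]:
  """Return the content of the last boxed expression in text, or None."""
  marker = "\\boxed{"
  start = text.rfind(marker)
  while start != -1:
    depth = 1
    begin = start + len(marker)
    for i in range(begin, len(text)):
      if text[i] == "{":
        depth += 1
      elif text[i] == "}":
        depth -= 1
        if depth == 0:  # matching close brace found
          return text[begin:i].strip()
    # Unbalanced braces: try the previous occurrence, if any.
    start = text.rfind(marker, 0, start)
  loose = _BOXED_LOOSE_RE.findall(text)
  return loose[-1].strip() if loose else None


def extract_answer(response: str) -> str:
  blocks = _ANSWER_RE.findall(response)
  scope = blocks[-1] if blocks else response  # tier 1: pick the search scope
  boxed = last_boxed(scope)  # tier 2: last boxed value inside the scope
  if boxed is not None:
    return boxed
  if blocks:  # tier 3: legacy plain-text answer
    return blocks[-1].strip()
  return FALLBACK_ANSWER


print(extract_answer(r"<answer>\boxed{42}</answer>"))
print(extract_answer(r"so the result is \boxed{\frac{1}{2}}"))
print(extract_answer("<answer>41</answer><answer>42</answer>"))
```

Scanning for the matching brace rather than using a single regex is what lets nested LaTeX such as `\boxed{\frac{1}{2}}` come back intact. Scoping to the last answer block first keeps a `\boxed` that appears only in the reasoning from being picked up.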

Tests

tests/post_training/unit/extract_answer_test.py (10 cases, cpu_only + post_training):

Boxed extraction:

  • inside <answer> tags, without tags, nested LaTeX, multiple boxed (last wins), whitespace stripping, negatives, and <answer>-tag scoping over a \boxed that appears in <reasoning>.

Legacy fallback (no \boxed):

  • plain-text answer inside <answer> tags still extracts; last-answer-wins preserved; no-answer returns FALLBACK_ANSWER.

All 10 executed and passed against the real function (Ran 10 tests ... OK).

Files

| File | Change |
| --- | --- |
| src/maxtext/trainers/post_train/rl/utils_rl.py | extract_answer: boxed extraction + legacy fallback (+40/-4) |
| tests/post_training/unit/extract_answer_test.py | new, 10 cases |

Checklist

  • Pyink-clean (--pyink-indentation=2 --line-length=122)
  • Backward compatible: legacy plain-text answers still extract via the tier-3 fallback
  • No effect on non-RL paths
  • Unit test covering boxed extraction + legacy fallback, verified passing

Modern reasoning models (Qwen3, DeepSeek-R1, etc.) emit `\boxed{N}`
inside `<answer>...</answer>` (or with no answer tags at all). The
legacy regex returned the raw `<answer>` content — e.g. `\boxed{42}`
as a string — which math_verify cannot match against a bare numeric
gold like "42". Result: ~0% accuracy on Qwen3/GSM8K even when the
model's numeric answer is correct.

New strategy (priority order):
  1. If `<answer>...</answer>` is present, use the last block's
     content as the search scope; otherwise use the full response.
  2. Inside the scope, extract the last `\boxed{N}` via brace-balanced
     scan + permissive regex fallback.
  3. If no `\boxed` is found, fall back to the legacy
     `{solution_start_token}...{solution_end_token}` regex
     (backward-compat for recipes that emit plain-text answers).

codecov Bot commented Jun 11, 2026


Codecov Report

❌ Patch coverage is 80.95238% with 4 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/maxtext/trainers/post_train/rl/utils_rl.py | 80.95% | 2 Missing and 2 partials ⚠️ |
