fix(plotting): render Chinese formula previews without tofu boxes (CJK mathtext)#71
Conversation
…K mathtext)
A formula preview whose identifiers are Chinese (e.g. 质量 * 加速度) rendered
every CJK char as an empty box: the formula render service put the raw string
inside a matplotlib mathtext ($...$) span, and mathtext uses its OWN font set
(mathtext.fontset=dejavusans + cm fallback) which has no CJK glyphs — the GUI
logged "Font 'rm' does not have a glyph …".
Root fix in shared/plotting.py (single source of truth): when a CJK family is
resolved, set mathtext.fontset='custom' with mathtext.rm/it/bf/… pointed at that
family (which also carries Greek/common math symbols) and keep mathtext.fallback
='cm' so any math glyph the CJK font lacks still renders. Note: matplotlib's
\text{} does NOT use font.family for mathtext, so the custom fontset — not
\text{} wrapping — is what actually resolves CJK; the wrapping below is kept for
semantic correctness (Chinese is text, not a math variable).
- shared/formula_mathtext_png.py: import matplotlib THROUGH shared.plotting
(lazily, preserving import purity) so the bare Figure it builds inherits the
CJK-aware mathtext config; also fixes its CLAUDE.md violation of calling
matplotlib.use() locally instead of via shared.plotting.
- datalab_latex/formula_render_service.py: _protect_non_ascii_for_mathtext wraps
non-ASCII runs in \text{} and normalizes full-width punctuation → ASCII in the
mathtext field only; the plain `latex` field stays raw (real LaTeX handles CJK
via the document's CJK package).
Tests: CJK identifier → \text{}-wrapped mathtext + no bare CJK in $...$; PNG
render of a Chinese formula emits no "missing glyph" in a fresh subprocess (the
real GUI scenario where shared.plotting isn't otherwise preloaded).
Multi-model + CodeRabbit review of the CJK-mathtext fix surfaced real defects:
- [HIGH] _protect_non_ascii_for_mathtext double-wrapped CJK already inside a
\text{...} (LaTeX-language source), producing \text{\text{...}} which crashes
matplotlib's parser (ok=False, empty PNG). Now split on existing \text{} spans
and only wrap runs outside them.
- [MAJOR] mathtext.it/bf were set to the plain CJK family, flattening italic
variables / bold math for EVERY formula app-wide. Restore the :italic / :bold
style modifiers (rm/cal/sf/tt stay plain; fallback=cm kept).
- [MAJOR] full-width braces {} mapped to bare { } — invisible TeX grouping
delimiters, so {x} silently vanished. Map to escaped \{ \} (visible), and
escape %→\% too; also normalize full-width operators +-*/=<>|.
- [MINOR] the full-width-punctuation test fed ASCII-only input (vacuous). Now
feeds real full-width chars and asserts normalization.
Tests: no-double-wrap (metadata + render-to-PNG), visible escaped braces,
italic/bold mathtext styles preserved, real full-width normalization.
Note: bare `%` in a source (e.g. a%b) still fails to render — that is a
pre-existing issue on main (mathtext treats % as a comment), out of scope here.
…tless hosts
Second-round review findings:
- [HIGH] _protect_non_ascii_for_mathtext ran the full-width→ASCII translate
GLOBALLY before the \text{} span scan, so a full-width brace inside an existing
\text{...} became \{ mid-span, which broke _MATHTEXT_TEXT_SPAN_RE's [^{}]*
match, defeated the double-wrap guard, and re-nested \text{\text{...}} →
ParseException, empty PNG. Now translate + wrap PER outside-segment only, so
existing \text{} spans stay intact. (\text{质量{x}} now renders, ok=True.)
- [MEDIUM] test_render_mathtext_png_renders_cjk_without_missing_glyphs had no
skip guard for hosts with no CJK font (CI ubuntu installs no CJK fonts), so it
would FAIL rather than skip on a fontless runner — turning CI red for a
best-effort feature that legitimately can't render CJK there. Add the
cjk_font_family() is None skip guard used by the other CJK-dependent tests.
Tests: full-width brace inside a \text{} span stays a single span (metadata +
render-to-PNG); the CJK render test skips (not fails) when no CJK font resolves.
…e regex
Third-round review finding: the double-wrap guard regex \\text\{[^{}]*\} could
not match a \text{...} span containing inner braces (e.g. \text{质量\{x\}} with
escaped literal braces), so it was treated as "outside" and its CJK re-wrapped
into unparseable \text{\text{...}}, crashing that (valid) preview.
Replace the regex with a brace-counting scanner (_find_text_spans) that matches
\text{...} at any nesting depth and skips escaped \{ \}. \text{质量\{x\}} now
renders (ok=True) instead of crashing.
Scope note verified against main: inputs like \text{质量_{a}} / \text{面积^{2}}
still fail to render — matplotlib mathtext itself rejects _{}/^{} inside \text{}
(main behaves identically). The fix's guarantee for those is graceful failure
with NO \text{\text{ re-nesting, which the tests assert; genuinely-valid spans
(escaped braces) now render.
📝 WalkthroughWalkthroughAdds full-width/CJK character protection to LaTeX mathtext rendering by detecting existing \text{...} spans, normalizing full-width punctuation, and wrapping remaining non-ASCII runs. Configures custom CJK mathtext fonts in shared.plotting and updates the PNG renderer's matplotlib import path. Adds corresponding tests. ChangesCJK mathtext protection and rendering
Estimated code review effort: 3 (Moderate) | ~25 minutes Sequence Diagram(s)sequenceDiagram
participant Caller
participant FormulaRenderService
participant TextSpanTransformer
participant PlottingConfig
participant MathtextPNGRenderer
Caller->>FormulaRenderService: request formula preview metadata
FormulaRenderService->>TextSpanTransformer: transform latex for mathtext
TextSpanTransformer->>TextSpanTransformer: detect existing \text{...} spans
TextSpanTransformer->>TextSpanTransformer: normalize full-width punctuation outside spans
TextSpanTransformer->>TextSpanTransformer: wrap remaining non-ASCII runs in \text{...}
TextSpanTransformer-->>FormulaRenderService: protected mathtext string
FormulaRenderService->>MathtextPNGRenderer: render_mathtext_png(mathtext)
MathtextPNGRenderer->>PlottingConfig: import shared.plotting (lazy)
PlottingConfig-->>MathtextPNGRenderer: CJK-configured rcParams
MathtextPNGRenderer-->>FormulaRenderService: PNG bytes
FormulaRenderService-->>Caller: metadata with latex and mathtext PNG
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (2)
shared/formula_mathtext_png.py (1)
14-19: 🩺 Stability & Availability | 🔵 Trivial | 💤 Low valueBroad
except Exceptioncan mask real bugs inshared.plotting, not just import failures.
shared.plottingmutates rcParams and resolves fonts at import time; any bug there (not just "module missing") is silently swallowed here, falling back to a config without CJK mathtext support with no diagnostic trail.♻️ Suggested narrowing
try: from shared.plotting import matplotlib - except Exception: # noqa: BLE001 - CJK mathtext support is best-effort + except ImportError: # CJK mathtext support is best-effort import matplotlib matplotlib.use("Agg", force=False)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@shared/formula_mathtext_png.py` around lines 14 - 19, The import fallback in the top-level matplotlib setup is too broad and hides real bugs from shared.plotting. Narrow the handling around the shared.plotting import in formula_mathtext_png so it only falls back on an actual import-related failure, and let other exceptions surface or be logged with context. Keep the existing Agg fallback behavior for the genuine missing-module case, but avoid swallowing unexpected errors during shared.plotting initialization.datalab_latex/formula_render_service.py (1)
142-169: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low valueConsider extending the punctuation map with common CJK punctuation.
_FULL_WIDTH_PUNCT_MAPnormalizes full-width Latin punctuation/operators but omits native CJK punctuation like。(ideographic full stop) and、(ideographic comma). These currently fall through to the non-ASCII wrap (rendered via the CJK font, not tofu), so this isn't a correctness bug — just incomplete normalization for punctuation that arguably belongs in the "visible literal" bucket rather than being swept into a\text{}run with adjacent characters.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@datalab_latex/formula_render_service.py` around lines 142 - 169, Extend `_FULL_WIDTH_PUNCT_MAP` in `formula_render_service.py` to include common CJK punctuation such as ideographic full stop and ideographic comma. Keep the change confined to the existing normalization table used by the math rendering flow so these characters are translated like the other visible literal punctuation instead of falling through to the non-ASCII path.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@datalab_latex/formula_render_service.py`:
- Around line 142-169: Extend `_FULL_WIDTH_PUNCT_MAP` in
`formula_render_service.py` to include common CJK punctuation such as
ideographic full stop and ideographic comma. Keep the change confined to the
existing normalization table used by the math rendering flow so these characters
are translated like the other visible literal punctuation instead of falling
through to the non-ASCII path.
In `@shared/formula_mathtext_png.py`:
- Around line 14-19: The import fallback in the top-level matplotlib setup is
too broad and hides real bugs from shared.plotting. Narrow the handling around
the shared.plotting import in formula_mathtext_png so it only falls back on an
actual import-related failure, and let other exceptions surface or be logged
with context. Keep the existing Agg fallback behavior for the genuine
missing-module case, but avoid swallowing unexpected errors during
shared.plotting initialization.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: 101c871a-e7ea-4945-9a64-1c0982503cf7
📒 Files selected for processing (6)
datalab_latex/formula_render_service.pyshared/formula_mathtext_png.pyshared/plotting.pytests/test_formula_mathtext_png.pytests/test_formula_render_service.pytests/test_plotting_backend.py
Summary
Chinese-identifier formula previews (e.g.
质量 * 加速度) rendered every CJKcharacter as an empty tofu box. The formula render service put the raw string
inside a matplotlib mathtext (
$...$) span, and mathtext uses its own fontset (
mathtext.fontset=dejavusans+cmfallback) which has no CJK glyphs — theGUI logged
Font 'rm' does not have a glyph ….Fix
shared/plotting.py(single source of truth): when a CJK family resolves,set
mathtext.fontset='custom'withmathtext.rm/it/bf/cal/sf/ttpointed atthat family, preserving the
:italic/:boldstyle modifiers (so mathvariables stay italic and bold math stays bold app-wide), and keep
mathtext.fallback='cm'for any math glyph the CJK font lacks. Applied atimport so every figure inherits it.
shared/formula_mathtext_png.py: import matplotlib throughshared.plotting(lazily, preserving import purity) so its bare Figureinherits the CJK-aware mathtext config; also fixes its CLAUDE.md violation of
calling
matplotlib.use()locally.datalab_latex/formula_render_service.py:_protect_non_ascii_for_mathtextwraps non-ASCII runs in
\text{}and normalizes full-width punctuation/operators → ASCII (braces → escaped
\{ \}, not invisible grouping braces).Existing
\text{...}spans are matched by a brace-counting scanner (anynesting depth) so their CJK is not re-wrapped into unparseable
\text{\text{...}}.The plain
latexfield stays raw (real LaTeX handles CJK via the document's CJKpackage); only the mathtext field feeding matplotlib is protected.
Review
Converged over four rounds of a two-track review — a multi-model adversarial
swarm and CodeRabbit — fixing every confirmed finding with TDD:
\text{\text{...}}crash when a LaTeX source already wrappedCJK; and
\text{}spans with inner braces (\text{质量\{x\}}) missed by thefirst regex → replaced with a brace-counting scanner.
mathtext.it/bflost their:italic/:boldmodifiers (flattenedmath styling app-wide) — restored.
{}mapped to bare{ }(invisible grouping) →escaped
\{ \}; also normalized full-width operators.cjk_font_family() is Noneskip guard,so it would fail (not skip) on a CJK-fontless CI runner — guarded.
now feeds real full-width chars.
Final round: three-model 0 findings, CodeRabbit No findings.
Scope note:
\text{质量_{a}}(sub/superscript inside\text{}) still fails torender — matplotlib mathtext itself rejects that syntax, identically on
main;the fix guarantees graceful failure with no re-nesting, not that matplotlib
accepts invalid input.
Test plan
ruffclean, targeted TDD tests (render service + mathtext PNG + plotting backend)🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Bug Fixes
Tests