Skip to content

fix(plotting): render Chinese formula previews without tofu boxes (CJK mathtext)#71

Merged
yilibinbin merged 4 commits into
mainfrom
fix/cjk-mathtext-formula-preview
Jul 1, 2026
Merged

fix(plotting): render Chinese formula previews without tofu boxes (CJK mathtext)#71
yilibinbin merged 4 commits into
mainfrom
fix/cjk-mathtext-formula-preview

Conversation

@yilibinbin

@yilibinbin yilibinbin commented Jul 1, 2026

Copy link
Copy Markdown
Owner

Summary

Chinese-identifier formula previews (e.g. 质量 * 加速度) rendered every CJK
character as an empty tofu box. The formula render service put the raw string
inside a matplotlib mathtext ($...$) span, and mathtext uses its own font
set (mathtext.fontset=dejavusans + cm fallback) which has no CJK glyphs — the
GUI logged Font 'rm' does not have a glyph ….

Fix

  • shared/plotting.py (single source of truth): when a CJK family resolves,
    set mathtext.fontset='custom' with mathtext.rm/it/bf/cal/sf/tt pointed at
    that family, preserving the :italic / :bold style modifiers (so math
    variables stay italic and bold math stays bold app-wide), and keep
    mathtext.fallback='cm' for any math glyph the CJK font lacks. Applied at
    import so every figure inherits it.
  • shared/formula_mathtext_png.py: import matplotlib through
    shared.plotting (lazily, preserving import purity) so its bare Figure
    inherits the CJK-aware mathtext config; also fixes its CLAUDE.md violation of
    calling matplotlib.use() locally.
  • datalab_latex/formula_render_service.py: _protect_non_ascii_for_mathtext
    wraps non-ASCII runs in \text{} and normalizes full-width punctuation/
    operators → ASCII (braces → escaped \{ \}, not invisible grouping braces).
    Existing \text{...} spans are matched by a brace-counting scanner (any
    nesting depth) so their CJK is not re-wrapped into unparseable
    \text{\text{...}}.

The plain latex field stays raw (real LaTeX handles CJK via the document's CJK
package); only the mathtext field feeding matplotlib is protected.

Review

Converged over four rounds of a two-track review — a multi-model adversarial
swarm and CodeRabbit — fixing every confirmed finding with TDD:

  • [HIGH] nested \text{\text{...}} crash when a LaTeX source already wrapped
    CJK; and \text{} spans with inner braces (\text{质量\{x\}}) missed by the
    first regex → replaced with a brace-counting scanner.
  • [MAJOR] mathtext.it/bf lost their :italic/:bold modifiers (flattened
    math styling app-wide) — restored.
  • [MAJOR] full-width braces {} mapped to bare { } (invisible grouping) →
    escaped \{ \}; also normalized full-width operators.
  • [MEDIUM] CJK PNG test lacked the cjk_font_family() is None skip guard,
    so it would fail (not skip) on a CJK-fontless CI runner — guarded.
  • [MINOR] the full-width-punctuation test was vacuous (ASCII-only input) —
    now feeds real full-width chars.

Final round: three-model 0 findings, CodeRabbit No findings.

Scope note: \text{质量_{a}} (sub/superscript inside \text{}) still fails to
render — matplotlib mathtext itself rejects that syntax, identically on main;
the fix guarantees graceful failure with no re-nesting, not that matplotlib
accepts invalid input.

Test plan

  • ruff clean, targeted TDD tests (render service + mathtext PNG + plotting backend)
  • Full local suite green (~290 files, zero failures)
  • CI green on 3.11/3.12/3.13

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Improved formula preview rendering so Chinese/Japanese/Korean text and full-width punctuation display more reliably in generated images.
    • Preserved existing LaTeX output while making image rendering more robust for mixed-language expressions.
  • Bug Fixes

    • Reduced missing-glyph/tofu issues in math-based formula previews.
    • Prevented double-wrapping of existing text spans and handled full-width braces more safely.
  • Tests

    • Added regression coverage for CJK rendering, punctuation normalization, and formula preview output.

…K mathtext)

A formula preview whose identifiers are Chinese (e.g. 质量 * 加速度) rendered
every CJK char as an empty box: the formula render service put the raw string
inside a matplotlib mathtext ($...$) span, and mathtext uses its OWN font set
(mathtext.fontset=dejavusans + cm fallback) which has no CJK glyphs — the GUI
logged "Font 'rm' does not have a glyph …".

Root fix in shared/plotting.py (single source of truth): when a CJK family is
resolved, set mathtext.fontset='custom' with mathtext.rm/it/bf/… pointed at that
family (which also carries Greek/common math symbols) and keep mathtext.fallback
='cm' so any math glyph the CJK font lacks still renders. Note: matplotlib's
\text{} does NOT use font.family for mathtext, so the custom fontset — not
\text{} wrapping — is what actually resolves CJK; the wrapping below is kept for
semantic correctness (Chinese is text, not a math variable).

- shared/formula_mathtext_png.py: import matplotlib THROUGH shared.plotting
  (lazily, preserving import purity) so the bare Figure it builds inherits the
  CJK-aware mathtext config; also fixes its CLAUDE.md violation of calling
  matplotlib.use() locally instead of via shared.plotting.
- datalab_latex/formula_render_service.py: _protect_non_ascii_for_mathtext wraps
  non-ASCII runs in \text{} and normalizes full-width punctuation → ASCII in the
  mathtext field only; the plain `latex` field stays raw (real LaTeX handles CJK
  via the document's CJK package).

Tests: CJK identifier → \text{}-wrapped mathtext + no bare CJK in $...$; PNG
render of a Chinese formula emits no "missing glyph" in a fresh subprocess (the
real GUI scenario where shared.plotting isn't otherwise preloaded).
Multi-model + CodeRabbit review of the CJK-mathtext fix surfaced real defects:

- [HIGH] _protect_non_ascii_for_mathtext double-wrapped CJK already inside a
  \text{...} (LaTeX-language source), producing \text{\text{...}} which crashes
  matplotlib's parser (ok=False, empty PNG). Now split on existing \text{} spans
  and only wrap runs outside them.
- [MAJOR] mathtext.it/bf were set to the plain CJK family, flattening italic
  variables / bold math for EVERY formula app-wide. Restore the :italic / :bold
  style modifiers (rm/cal/sf/tt stay plain; fallback=cm kept).
- [MAJOR] full-width braces {} mapped to bare { } — invisible TeX grouping
  delimiters, so {x} silently vanished. Map to escaped \{ \} (visible), and
  escape %→\% too; also normalize full-width operators +-*/=<>|.
- [MINOR] the full-width-punctuation test fed ASCII-only input (vacuous). Now
  feeds real full-width chars and asserts normalization.

Tests: no-double-wrap (metadata + render-to-PNG), visible escaped braces,
italic/bold mathtext styles preserved, real full-width normalization.

Note: bare `%` in a source (e.g. a%b) still fails to render — that is a
pre-existing issue on main (mathtext treats % as a comment), out of scope here.
…tless hosts

Second-round review findings:

- [HIGH] _protect_non_ascii_for_mathtext ran the full-width→ASCII translate
  GLOBALLY before the \text{} span scan, so a full-width brace inside an existing
  \text{...} became \{ mid-span, which broke _MATHTEXT_TEXT_SPAN_RE's [^{}]*
  match, defeated the double-wrap guard, and re-nested \text{\text{...}} →
  ParseException, empty PNG. Now translate + wrap PER outside-segment only, so
  existing \text{} spans stay intact. (\text{质量{x}} now renders, ok=True.)
- [MEDIUM] test_render_mathtext_png_renders_cjk_without_missing_glyphs had no
  skip guard for hosts with no CJK font (CI ubuntu installs no CJK fonts), so it
  would FAIL rather than skip on a fontless runner — turning CI red for a
  best-effort feature that legitimately can't render CJK there. Add the
  cjk_font_family() is None skip guard used by the other CJK-dependent tests.

Tests: full-width brace inside a \text{} span stays a single span (metadata +
render-to-PNG); the CJK render test skips (not fails) when no CJK font resolves.
…e regex

Third-round review finding: the double-wrap guard regex \\text\{[^{}]*\} could
not match a \text{...} span containing inner braces (e.g. \text{质量\{x\}} with
escaped literal braces), so it was treated as "outside" and its CJK re-wrapped
into unparseable \text{\text{...}}, crashing that (valid) preview.

Replace the regex with a brace-counting scanner (_find_text_spans) that matches
\text{...} at any nesting depth and skips escaped \{ \}. \text{质量\{x\}} now
renders (ok=True) instead of crashing.

Scope note verified against main: inputs like \text{质量_{a}} / \text{面积^{2}}
still fail to render — matplotlib mathtext itself rejects _{}/^{} inside \text{}
(main behaves identically). The fix's guarantee for those is graceful failure
with NO \text{\text{ re-nesting, which the tests assert; genuinely-valid spans
(escaped braces) now render.
@coderabbitai

coderabbitai Bot commented Jul 1, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

Adds full-width/CJK character protection to LaTeX mathtext rendering by detecting existing \text{...} spans, normalizing full-width punctuation, and wrapping remaining non-ASCII runs. Configures custom CJK mathtext fonts in shared.plotting and updates the PNG renderer's matplotlib import path. Adds corresponding tests.

Changes

CJK mathtext protection and rendering

Layer / File(s) Summary
Text-span detection and normalization
datalab_latex/formula_render_service.py
Adds brace-balancing detection of top-level \text{...} spans and normalization/wrapping of full-width and non-ASCII content outside those spans.
Mathtext metadata wiring
datalab_latex/formula_render_service.py
Applies the new transformation when computing FormulaPreviewMetadata.mathtext, replacing direct $latex$ wrapping.
CJK mathtext font configuration
shared/plotting.py
Sets mathtext.fontset to custom and maps rm/it/bf/cal/sf/tt to the resolved CJK font, with mathtext.fallback set to "cm".
PNG renderer import path update
shared/formula_mathtext_png.py
Switches to a lazy shared.plotting import with fallback to direct matplotlib import before configuring the Agg backend.
Regression tests for text/mathtext behavior
tests/test_formula_render_service.py, tests/test_formula_mathtext_png.py, tests/test_plotting_backend.py
Adds tests covering CJK wrapping, full-width normalization, double-wrap guards, PNG rendering success, missing-glyph checks, and mathtext font style preservation.

Estimated code review effort: 3 (Moderate) | ~25 minutes

Sequence Diagram(s)

sequenceDiagram
  participant Caller
  participant FormulaRenderService
  participant TextSpanTransformer
  participant PlottingConfig
  participant MathtextPNGRenderer

  Caller->>FormulaRenderService: request formula preview metadata
  FormulaRenderService->>TextSpanTransformer: transform latex for mathtext
  TextSpanTransformer->>TextSpanTransformer: detect existing \text{...} spans
  TextSpanTransformer->>TextSpanTransformer: normalize full-width punctuation outside spans
  TextSpanTransformer->>TextSpanTransformer: wrap remaining non-ASCII runs in \text{...}
  TextSpanTransformer-->>FormulaRenderService: protected mathtext string
  FormulaRenderService->>MathtextPNGRenderer: render_mathtext_png(mathtext)
  MathtextPNGRenderer->>PlottingConfig: import shared.plotting (lazy)
  PlottingConfig-->>MathtextPNGRenderer: CJK-configured rcParams
  MathtextPNGRenderer-->>FormulaRenderService: PNG bytes
  FormulaRenderService-->>Caller: metadata with latex and mathtext PNG
Loading
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly captures the main change: improving Chinese formula preview rendering by fixing CJK mathtext tofu boxes.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/cjk-mathtext-formula-preview

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
shared/formula_mathtext_png.py (1)

14-19: 🩺 Stability & Availability | 🔵 Trivial | 💤 Low value

Broad except Exception can mask real bugs in shared.plotting, not just import failures.

shared.plotting mutates rcParams and resolves fonts at import time; any bug there (not just "module missing") is silently swallowed here, falling back to a config without CJK mathtext support with no diagnostic trail.

♻️ Suggested narrowing
     try:
         from shared.plotting import matplotlib
-    except Exception:  # noqa: BLE001 - CJK mathtext support is best-effort
+    except ImportError:  # CJK mathtext support is best-effort
         import matplotlib
 
         matplotlib.use("Agg", force=False)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@shared/formula_mathtext_png.py` around lines 14 - 19, The import fallback in
the top-level matplotlib setup is too broad and hides real bugs from
shared.plotting. Narrow the handling around the shared.plotting import in
formula_mathtext_png so it only falls back on an actual import-related failure,
and let other exceptions surface or be logged with context. Keep the existing
Agg fallback behavior for the genuine missing-module case, but avoid swallowing
unexpected errors during shared.plotting initialization.
datalab_latex/formula_render_service.py (1)

142-169: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Consider extending the punctuation map with common CJK punctuation.

_FULL_WIDTH_PUNCT_MAP normalizes full-width Latin punctuation/operators but omits native CJK punctuation like (ideographic full stop) and (ideographic comma). These currently fall through to the non-ASCII wrap (rendered via the CJK font, not tofu), so this isn't a correctness bug — just incomplete normalization for punctuation that arguably belongs in the "visible literal" bucket rather than being swept into a \text{} run with adjacent characters.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@datalab_latex/formula_render_service.py` around lines 142 - 169, Extend
`_FULL_WIDTH_PUNCT_MAP` in `formula_render_service.py` to include common CJK
punctuation such as ideographic full stop and ideographic comma. Keep the change
confined to the existing normalization table used by the math rendering flow so
these characters are translated like the other visible literal punctuation
instead of falling through to the non-ASCII path.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@datalab_latex/formula_render_service.py`:
- Around line 142-169: Extend `_FULL_WIDTH_PUNCT_MAP` in
`formula_render_service.py` to include common CJK punctuation such as
ideographic full stop and ideographic comma. Keep the change confined to the
existing normalization table used by the math rendering flow so these characters
are translated like the other visible literal punctuation instead of falling
through to the non-ASCII path.

In `@shared/formula_mathtext_png.py`:
- Around line 14-19: The import fallback in the top-level matplotlib setup is
too broad and hides real bugs from shared.plotting. Narrow the handling around
the shared.plotting import in formula_mathtext_png so it only falls back on an
actual import-related failure, and let other exceptions surface or be logged
with context. Keep the existing Agg fallback behavior for the genuine
missing-module case, but avoid swallowing unexpected errors during
shared.plotting initialization.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 101c871a-e7ea-4945-9a64-1c0982503cf7

📥 Commits

Reviewing files that changed from the base of the PR and between 2cecc5f and a5db0b4.

📒 Files selected for processing (6)
  • datalab_latex/formula_render_service.py
  • shared/formula_mathtext_png.py
  • shared/plotting.py
  • tests/test_formula_mathtext_png.py
  • tests/test_formula_render_service.py
  • tests/test_plotting_backend.py

@yilibinbin yilibinbin merged commit 2eabd8c into main Jul 1, 2026
6 checks passed
@yilibinbin yilibinbin deleted the fix/cjk-mathtext-formula-preview branch July 1, 2026 18:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants