Crosslingual Reasoning & Test-Time Scaling — ACL'26 Tutorial Demo

Live-demo materials in two parts, both driven through OpenRouter so they run on a laptop with no local model or GPU:

Crosslingual notebook — Does a model reason in the language you ask in?
Live reasoning-strategies demo — Which strategy wins under a latency budget?

Setup

uv sync                      # install dependencies (pyproject.toml / uv.lock)

Create a .env file with your OpenRouter key:

OPENROUTER_API_KEY=sk-or-...
MODEL=openai/gpt-4o-mini     # optional: the default model

Part 1 — Crosslingual notebook

crosslingual_demo.ipynb chases one question: when you pose a math problem in French, Russian, or Chinese, what language does the model actually reason in?

Open it in Jupyter or VS Code (with this project's uv venv selected as the kernel) and run top to bottom. It walks through:

Solve GSM8K — one English problem, end to end (prompt → answer).
Solve MGSM — the same problem in other languages; watch accuracy degrade.
A SOTA reasoning model — load the paper's real shipped s1.1 generations and see it answer well across languages (best in English).
Inside look — highlight a real chain-of-thought by writing system: the model reasons in English even on a Russian problem, then translates the final answer back.
A reasoning model live — split a model's thinking vs output channels and compare them across languages.

It runs on the bundled MGSM data (data/mgsm/, 250 problems/language, offline) and the precomputed s1.1 results in repos/ — only the live cells call OpenRouter.

Part 2 — Live reasoning-strategies demo

Three test-time reasoning strategies on GSM8K, compared under a latency budget:

Strategy	What it does	Latency knob	Why it costs latency
Iterative (self-refine)	Answer, then critique-and-revise N times	`rounds`	Strictly sequential
Self-consistency	Sample K chains, majority vote	`k`	Chains run in parallel → latency ≈ slowest chain
Agentic (tool use)	Reason + call a calculator until confident	`max_steps`	Sequential tool-call turns

The headline artifact is an accuracy-vs-latency Pareto plot: read it at any latency budget to see which strategy wins there.

# one problem, all three strategies side by side:
uv run python run_demo.py demo --id gsm8k-3

# budget sweep (the numbers behind the plot) -> results/results.json:
uv run python run_demo.py bench            # bundled 12-problem sample, offline
uv run python run_demo.py bench --source hf --n 100   # real GSM8K test split (needs `datasets`)

# the plot:
uv run python run_demo.py plot             # writes results/pareto.png

# static trace explorer (self-contained results/traces.html):
uv run python run_demo.py trace --id gsm8k-3

# LIVE streaming trace server (streams every step over SSE as it runs):
uv run python run_demo.py live                       # http://127.0.0.1:8000
uv run python run_demo.py live --no-open --port 8011 # headless / custom port

Knobs to turn live:

--model — swap the backing model (smaller models make self-consistency and tools matter more).
--strategies iterative,agentic — restrict the sweep to save time on stage.
Sweep ranges live in reasoning_demo/strategies.py (STRATEGIES registry).

The punchline: there is no single best strategy — the winner depends on whether your budget is latency, tokens, or dollars.

Layout

crosslingual_demo.ipynb    # Part 1 notebook
run_demo.py                # Part 2 CLI: demo | bench | plot | trace | live
reasoning_demo/
  client.py                # OpenRouter wrapper (sync + async), latency/token accounting
  mgsm.py                  # Part 1: faithful MGSM eval + solvers
  traces.py                # Part 1: load real s1.1 traces, highlight CoT by script
  strategies.py            # Part 2: the three strategies + sweep registry
  bench.py / plot.py       # Part 2: budget sweep -> results.json -> pareto.png
  extract.py / tools.py    # Part 2: answer parsing, majority vote, calculator tool
  tracevis.py / liveserver.py  # Part 2: static + live SSE trace explorers
  data.py                  # GSM8K loader (bundled sample or HF)
data/
  gsm8k_sample.jsonl       # 12 GSM8K problems (offline-safe)
  mgsm/                    # MGSM TSVs, 250 problems/language
repos/                     # vendored crosslingual-test-time-scaling (read-only data source)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Crosslingual Reasoning & Test-Time Scaling — ACL'26 Tutorial Demo

Setup

Part 1 — Crosslingual notebook

Part 2 — Live reasoning-strategies demo

Layout

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
data		data
reasoning_demo		reasoning_demo
results		results
.gitignore		.gitignore
.python-version		.python-version
README.md		README.md
crosslingual_demo.ipynb		crosslingual_demo.ipynb
pyproject.toml		pyproject.toml
run_demo.py		run_demo.py
uv.lock		uv.lock

Folders and files

Latest commit

History

Repository files navigation

Crosslingual Reasoning & Test-Time Scaling — ACL'26 Tutorial Demo

Setup

Part 1 — Crosslingual notebook

Part 2 — Live reasoning-strategies demo

Layout

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages