fix(cli): pin UTF-8 encoding on init-options and .extensionignore I/O#2686
Open
Quratulain-bilal wants to merge 1 commit into
Open
fix(cli): pin UTF-8 encoding on init-options and .extensionignore I/O#2686Quratulain-bilal wants to merge 1 commit into
Quratulain-bilal wants to merge 1 commit into
Conversation
``Path.read_text`` / ``Path.write_text`` default to the system locale
codec, which is cp1252 / gb2312 / cp932 on Windows. Two user-facing
file paths in spec-kit were calling them without an explicit
``encoding=`` argument:
- ``src/specify_cli/__init__.py:400,412`` —
``save_init_options`` / ``load_init_options`` for
``.specify/init-options.json``. A peer machine with a different
default locale (or a UTF-8 Unix CI runner reading a file written on
a cp1252 Windows host) cannot decode the file, raising
``UnicodeDecodeError``. ``UnicodeDecodeError`` is a subclass of
``ValueError`` — not ``OSError`` / ``json.JSONDecodeError`` — so
the existing fall-back ``except`` tuple in ``load_init_options``
also misses it and the error propagates raw to the CLI.
- ``src/specify_cli/extensions.py:764`` — ``.extensionignore``
pattern reader. The very next line already normalises
backslashes "so Windows-authored files work", proving the codebase
expects Windows authors to write this file. Multibyte UTF-8
patterns (Chinese filenames, accented directory names) silently
mojibake when the host locale is not UTF-8, so the patterns fail
to match and unintended files are shipped with the extension.
The sibling integration-catalog reader at
``src/specify_cli/integrations/catalog.py:150,156,193,202,374``
already pins ``encoding="utf-8"`` everywhere. PR github#2280 fixed the
symmetric PowerShell-template BOM bug. This change brings the two
remaining drifted paths in line with that precedent.
Regression tests:
- ``tests/test_presets.py::TestInitOptions`` — parametrized non-ASCII
round-trip (CJK, Latin-1, Greek, emoji) plus a corrupted-file case
that asserts the existing "fall back to {}" contract still holds
when a peer file contains bytes invalid as UTF-8.
- ``tests/test_extensions.py::TestExtensionIgnore`` — Japanese
(``ドキュメント/``) and Latin-1 (``café/``) ignore patterns
correctly exclude their directories during install.
Contributor
There was a problem hiding this comment.
Pull request overview
Pins UTF-8 explicitly for a couple of user-authored/portable files to avoid Windows locale-dependent decoding/encoding behavior, and adds regression tests around non-ASCII content.
Changes:
- Use
encoding="utf-8"for.specify/init-options.jsonread/write and catchUnicodeDecodeErrorin the loader fallback path. - Use
encoding="utf-8"when reading.extensionignoreso multibyte patterns round-trip across platforms/locales. - Add tests covering UTF-8 patterns in
.extensionignoreand fallback behavior for invalid UTF-8 init-options.
Show a summary per file
| File | Description |
|---|---|
src/specify_cli/__init__.py |
Pins UTF-8 for init-options I/O; extends error handling to include decode failures. |
src/specify_cli/extensions.py |
Pins UTF-8 when reading .extensionignore patterns. |
tests/test_presets.py |
Adds init-options tests for non-ASCII values and corrupted-byte fallback. |
tests/test_extensions.py |
Adds .extensionignore tests ensuring non-ASCII patterns exclude intended paths. |
Copilot's findings
Tip
Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Files reviewed: 4/4 changed files
- Comments generated: 3
Comment on lines
+400
to
+408
| # Pin UTF-8 explicitly: ``Path.write_text`` defaults to the system | ||
| # locale codec, which is cp1252 / gb2312 / cp932 on Windows. A | ||
| # locale-encoded write succeeds locally but produces a file a peer | ||
| # machine (different locale) or Unix CI cannot decode. The sibling | ||
| # integration-catalog writer in ``integrations/catalog.py`` already | ||
| # pins ``encoding="utf-8"`` for the same reason. | ||
| dest.write_text( | ||
| json.dumps(options, indent=2, sort_keys=True), encoding="utf-8" | ||
| ) |
Comment on lines
+2272
to
+2292
| @pytest.mark.parametrize( | ||
| "value", | ||
| ["名前-プロジェクト", "café-résumé", "Ωmega-Δelta", "🚀-launch"], | ||
| ) | ||
| def test_save_load_round_trip_preserves_non_ascii(self, project_dir, value): | ||
| """Non-ASCII values round-trip via explicit UTF-8 encoding. | ||
|
|
||
| ``Path.write_text`` / ``Path.read_text`` default to the system | ||
| locale codec on Windows (cp1252 / gb2312 / cp932). Without | ||
| ``encoding="utf-8"`` pinned on both ends, a project name like | ||
| ``café`` written on a UTF-8 host becomes garbled or unreadable on | ||
| a cp1252 host (and vice versa). Pin UTF-8 explicitly so init | ||
| options round-trip across machines and CI. | ||
| """ | ||
| from specify_cli import save_init_options, load_init_options | ||
|
|
||
| save_init_options(project_dir, {"ai": "claude", "project_name": value}) | ||
|
|
||
| loaded = load_init_options(project_dir) | ||
| assert loaded["project_name"] == value | ||
|
|
Comment on lines
+764
to
+770
| # Pin UTF-8 explicitly: ``Path.read_text`` defaults to the system | ||
| # locale codec on Windows (cp1252 / gb2312 / cp932), which silently | ||
| # corrupts multibyte patterns when the file is shared across | ||
| # machines with different locales. The next line already | ||
| # normalises backslashes "so Windows-authored files work" — the | ||
| # codebase already expects Windows authors to write this file. | ||
| lines: List[str] = ignore_file.read_text(encoding="utf-8").splitlines() |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
Path.read_text()/Path.write_text()default to the system locale codec, which is cp1252 / gb2312 / cp932 on Windows.Two user-facing file paths in spec-kit were calling them without an explicit
encoding=argument:src/specify_cli/__init__.py:400, 412—save_init_options/load_init_optionsfor.specify/init-options.json. A peer machine with a different default locale (or a UTF-8 Unix CI runner reading a filewritten on a cp1252 Windows host) cannot decode the file, raising
UnicodeDecodeError.UnicodeDecodeErroris asubclass of
ValueError— notOSError/json.JSONDecodeError— so the existing fall-backexcepttuple inload_init_optionsalso misses it and the error propagates raw to the CLI.src/specify_cli/extensions.py:764—.extensionignorepattern reader. The very next line already normalisesbackslashes "so Windows-authored files work", proving the codebase expects Windows authors to write this file.
Multibyte UTF-8 patterns (Chinese filenames, accented directory names) silently mojibake when the host locale is not
UTF-8, so the patterns fail to match and unintended files are shipped with the extension.
Reproducer
Why this matters
src/specify_cli/integrations/catalog.py:150,156,193,202,374already pinsencoding="utf-8"everywhere. PR fix(powershell): ensure UTF-8 templates are written without BOM #2280 (f684305) fixed the symmetric PowerShell-template BOM bug. The two paths inthis PR were the remaining drifted ones.
init-options.jsonis meant to be a portable record of how a project was scaffolded — a peer cloning the repo on adifferent OS / locale must be able to read it. Today they can't if the original author's project name (or any future
field) contains non-ASCII.
.extensionignorealready explicitly supports Windows authors (see line 766). UTF-8 patterns are part of that samecontract.
The change
src/specify_cli/__init__.py— pinencoding="utf-8"on bothwrite_textandread_text; extend the existingexcepttuple inload_init_optionsto includeUnicodeDecodeErrorso a peer file written in a non-UTF-8 codec fallsback to
{}per the existing contract instead of crashing.src/specify_cli/extensions.py— pinencoding="utf-8"on the.extensionignorereader.Tests
tests/test_presets.py::TestInitOptions— parametrized non-ASCII round-trip (CJK / Latin-1 / Greek / emoji) plus a0xe9-byte corrupted-file fallback test.tests/test_extensions.py::TestExtensionIgnore— Japanese (ドキュメント/) and Latin-1 (café/) ignore patternscorrectly exclude their directories during install.
Scope
Intentionally narrow: no behaviour change for ASCII content (UTF-8 is a superset). Only non-ASCII content that previously
round-tripped accidentally (when host locale happened to be UTF-8) or silently mojibaked (when it wasn't) now
round-trips reliably across all hosts.