gh-53144: Improve charset support in the email package by serhiy-storchaka · Pull Request #149942 · python/cpython

serhiy-storchaka · 2026-05-17T09:42:02Z

Defer to the codecs module for all aliases.
Use MIME/IANA names for all IANA registered charsets.
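
As a rough illustration of what deferring to the codecs module could look like, the sketch below asks `codecs.lookup()` to resolve a few charset labels. The helper `python_codec_name` and the sample labels are assumptions made for this example and are not taken from the PR. Note that the names Python reports are its own codec names rather than MIME/IANA names, so a separate mapping step would still be needed for the second point.

```python
import codecs


def python_codec_name(charset):
    """Return Python's canonical codec name for a charset label, or None if unknown."""
    try:
        return codecs.lookup(charset).name
    except LookupError:
        return None


# Illustrative labels only; the last one is deliberately bogus.
for label in ("latin-1", "utf8", "ks_c_5601-1987", "no-such-charset"):
    print(label, "->", python_codec_name(label))
```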

Issue: The email package should defer to the codecs module for all aliases #53144

bitdancer

Overall this looks good.

I'm getting two test failures against current main. One of them (test_chinese_codecs) looks like charset name changes and is probably correct, but the other (test_korean_codecs) I don't understand.

bitdancer · 2026-06-01T19:42:23Z

@@ -61,40 +62,73 @@
    'utf-8':       (SHORTEST,  BASE64, 'utf-8'),


Suggested change

-    'utf-8':       (SHORTEST,  BASE64, 'utf-8'),
+    'utf-8':       (SHORTEST,  BASE64, None),

I don't know why this is set to a non-None value when all the others are None, except for the one that wants to use a different charset for header encoding. So I think we should "fix" it to avoid confusion later (assuming I'm not missing something here).

serhiy-storchaka · 2026-06-01T20:38:09Z

This item can be removed. (SHORTEST, BASE64, None) is the default value.
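
To make the point about the default concrete, here is a simplified model of the table lookup. It assumes, as the reply states, that a missing entry falls back to (SHORTEST, BASE64, None), and that a None third element means "use the input charset". The helper `resolve` and the three local tables are hypothetical; only the final check uses the real `email.charset.Charset` class.

```python
from email.charset import BASE64, SHORTEST, Charset

DEFAULT = (SHORTEST, BASE64, None)


def resolve(table, charset):
    """Model of the lookup: missing entries use DEFAULT, None means 'same as input'."""
    henc, benc, conv = table.get(charset, DEFAULT)
    return henc, benc, conv or charset


explicit = {'utf-8': (SHORTEST, BASE64, 'utf-8')}   # the original line
suggested = {'utf-8': (SHORTEST, BASE64, None)}     # the suggested change
removed = {}                                        # the entry dropped entirely

# All three variants resolve to the same triple in this model.
results = [resolve(table, 'utf-8') for table in (explicit, suggested, removed)]
print(results)

# Cross-check against the real class.
cs = Charset('utf-8')
print(cs.header_encoding, cs.body_encoding, cs.output_charset)
```

Under these assumptions the three variants are indistinguishable to callers, which is why dropping the entry should not change behavior.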

bitdancer · 2026-06-01T19:45:20Z

        self.header_encoding = henc
        self.body_encoding = benc
-        self.output_charset = ALIASES.get(conv, conv)
+        self.output_charset = conv


The comment says "but let the user override it", which is what the call to ALIASES does. While it is unlikely anyone is using that facility, updating the aliases table is part of the API, so we should maybe leave it. The place it might get used is exactly the place it might be wanted: overriding the iso-2022-jp 7bit codec. So, since it is cheap and a no-op in all other cases, maybe we shouldn't risk breaking anyone's legacy code?
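
A minimal sketch of the argument, using a local dictionary in place of the real `ALIASES` table so that no global state is touched. The function `output_charset` and the name `x-legacy-jp` are made up for illustration; in real code the table would be updated through `email.charset.add_alias()`.

```python
def output_charset(conv, aliases):
    """Apply a user-editable alias table; names without an entry pass through unchanged."""
    return aliases.get(conv, conv)


no_overrides = {}
# Hypothetical user override; 'x-legacy-jp' is not a real charset name.
user_overrides = {'iso-2022-jp': 'x-legacy-jp'}

print(output_charset('iso-2022-jp', no_overrides))    # no entry: unchanged
print(output_charset('iso-2022-jp', user_overrides))  # entry present: overridden
print(output_charset('utf-8', user_overrides))        # unrelated charset: unchanged
```

The lookup costs one dictionary access and only changes the result when a user has registered an entry, which is the "cheap and a no-op in all other cases" property mentioned above.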

bitdancer · 2026-06-01T19:45:39Z

-    msg.set_param('charset',
-                  email.charset.ALIASES.get(charset, charset),
-                  replace=True)
+    msg.set_param('charset', charset, replace=True)


Your reordering of the operations here looks correct, which presumably means there is a missing test that would show the bug. Do you want to add one? If not I'll make a note and try to remember to do it some day ;)

serhiy-storchaka · 2026-06-01T22:08:44Z

Added test_set_text_charset_cp949. Note that charset="euc-kr", even though ALIASES maps 'cp949' to 'ks_c_5601-1987'.

But when I tried to add a similar test with shift_jis or euc-jp, it failed (trying to encode surrogates). It also fails with the current code.

Opened #150771.

bedevere-app · 2026-06-01T19:49:50Z

When you're done making the requested changes, leave the comment: I have made the requested changes; please review again.

serhiy-storchaka

I'm getting two test failures against current main. One of them (test_chinese_codecs) looks like charset name changes and is probably correct, but the other (test_korean_codecs) I don't understand.

This is interesting. There are three chunks with encodings euc-kr, ks_c_5601-1987 and cp949. In main, cp949 is translated to ks_c_5601-1987, so the 2nd and the 3rd chunks are merged. With this PR, the second chunk, ks_c_5601-1987, is translated to euc-kr, so the 1st and the 2nd chunks are merged.

This is because in Python ks_c_5601-1987 is an alias of euc-kr, but in IANA they are separate codecs.
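
The merging described above can be modeled in a few lines. This is only a sketch of the reasoning, not the header-encoding code itself: `merged_runs` and the two translation maps are hypothetical, and leaving cp949 untranslated in the PR case is an assumption of this model.

```python
from itertools import groupby


def merged_runs(chunk_charsets, translate):
    """Translate each chunk's charset, then count adjacent chunks sharing an output charset."""
    outputs = [translate.get(name, name) for name in chunk_charsets]
    return [(name, len(list(run))) for name, run in groupby(outputs)]


chunks = ['euc-kr', 'ks_c_5601-1987', 'cp949']

# Behavior on main as described above: cp949 becomes ks_c_5601-1987.
main_map = {'cp949': 'ks_c_5601-1987'}
# Behavior with the PR as described above: ks_c_5601-1987 becomes euc-kr.
# cp949 is left unchanged here, which is an assumption of this sketch.
pr_map = {'ks_c_5601-1987': 'euc-kr'}

on_main = merged_runs(chunks, main_map)
with_pr = merged_runs(chunks, pr_map)
print(on_main)
print(with_pr)
```

In this model main yields one euc-kr chunk followed by a merged pair, while the PR yields a merged pair followed by one cp949 chunk, matching the change seen in test_korean_codecs.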

serhiy-storchaka · 2026-06-01T22:24:14Z

I have made the requested changes; please review again.

bedevere-app · 2026-06-01T22:24:19Z

Thanks for making the requested changes!

@bitdancer: please review the changes made to this pull request.

serhiy-storchaka · 2026-06-02T09:21:40Z

This is because in Python ks_c_5601-1987 is an alias of euc-kr, but in IANA they are separate codecs.

See #62825.

bitdancer

LGTM

bitdancer · 2026-06-02T13:35:16Z

This is because in Python ks_c_5601-1987 is an alias of euc-kr, but in IANA they are separate codecs.

See #62825.

It seems to me the change in behavior introduced by this PR at least partially addresses #62825, in that it makes the output encoding for the case in question the encoding that handles more characters. Do you agree, or am I misunderstanding?

serhiy-storchaka · 2026-06-02T14:21:36Z

I do not think it makes test_korean_codecs better or worse. At least we cannot see this until the other issues with Korean charsets are fixed. But test_set_text_charset_cp949 fails on main, so there is at least some progress. And test_chinese_codecs is now more correct.

pythongh-53144: Imporove charset support in the email package

4248309

Defer to the codecs module for all aliases. Use MIME/IANA names for all IANA registered charsets.

serhiy-storchaka requested a review from a team as a code owner May 17, 2026 09:42

bedevere-app Bot mentioned this pull request May 17, 2026

The email package should defer to the codecs module for all aliases #53144

Open

bedevere-app Bot added the awaiting core review label May 17, 2026

malemburg changed the title ~~gh-53144: Imporove charset support in the email package~~ gh-53144: Improve charset support in the email package May 17, 2026

malemburg reviewed May 17, 2026

Comment thread Lib/email/charset.py Outdated

Normalize charset in set_text_content().

3331ac9

bitdancer requested changes Jun 1, 2026

bedevere-app Bot added awaiting changes and removed awaiting core review labels Jun 1, 2026

serhiy-storchaka added 3 commits June 1, 2026 23:55

Merge branch 'main' into email-charset-aliases

08c2110

Fix test_chinese_codecs and test_korean_codecs.

cece1e7

Address review comments.

73b1836

serhiy-storchaka commented Jun 1, 2026

bedevere-app Bot added awaiting change review and removed awaiting changes labels Jun 1, 2026

bedevere-app Bot requested a review from bitdancer June 1, 2026 22:24

bitdancer approved these changes Jun 2, 2026

bedevere-app Bot added awaiting merge and removed awaiting change review labels Jun 2, 2026

