Skip to content

gh-53144: Improve charset support in the email package#149942

Open
serhiy-storchaka wants to merge 5 commits into
python:mainfrom
serhiy-storchaka:email-charset-aliases
Open

gh-53144: Improve charset support in the email package#149942
serhiy-storchaka wants to merge 5 commits into
python:mainfrom
serhiy-storchaka:email-charset-aliases

Conversation

@serhiy-storchaka
Copy link
Copy Markdown
Member

@serhiy-storchaka serhiy-storchaka commented May 17, 2026

Defer to the codecs module for all aliases.
Use MIME/IANA names for all IANA registered charsets.

Defer to the codecs module for all aliases.
Use MIME/IANA names for all IANA registered charsets.
@serhiy-storchaka serhiy-storchaka requested a review from a team as a code owner May 17, 2026 09:42
@malemburg malemburg changed the title gh-53144: Imporove charset support in the email package gh-53144: Improve charset support in the email package May 17, 2026
Comment thread Lib/email/charset.py Outdated
Copy link
Copy Markdown
Member

@bitdancer bitdancer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this looks good.

I'm getting two test failures against current main, one of which (test_chinese_codecs) looks like charset name changes and is probably correct, but the other(test_korean_codecs) I don't understand.

Comment thread Lib/email/charset.py Outdated
@@ -61,40 +62,73 @@
'utf-8': (SHORTEST, BASE64, 'utf-8'),
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
'utf-8': (SHORTEST, BASE64, 'utf-8'),
'utf-8': (SHORTEST, BASE64, None),

I don't know why this is set to a non-None value when all the others are None, except for the one that wants to use a different charset for header encoding. So I think we should "fix" it to avoid confusion later (assuming I'm not missing something here).

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This item can be removed. (SHORTEST, BASE64, None) is the default value.

Comment thread Lib/email/charset.py Outdated
self.header_encoding = henc
self.body_encoding = benc
self.output_charset = ALIASES.get(conv, conv)
self.output_charset = conv
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment says "but let the user override it", which is what the call to ALIASES does. While it is unlikely anyone is using that facility, it is part of the API to update the aliases table, so we should maybe leave it. The place it might get used is exactly the place it might be wanted: overriding iso-2022-jp 7bit codec. So, since it is cheap, and a no-op in all other cases, maybe we shouldn't risk breaking anyone's legacy code?

msg.set_param('charset',
email.charset.ALIASES.get(charset, charset),
replace=True)
msg.set_param('charset', charset, replace=True)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your reordering of the operations here looks correct, which presumably means there is a missing test that would show the bug. Do you want to add one? If not I'll make a note and try to remember to do it some day ;)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added test_set_text_charset_cp949. Note that charset="euc-kr", even if ALIASES maps 'cp949' to 'ks_c_5601-1987'.

But when I tried to add similar test with shif_jis or euc-jp, it failed (trying to encode surrogates). It fails also with the current code.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opened #150771.

Comment thread Lib/test/test_email/test_email.py
@bedevere-app
Copy link
Copy Markdown

bedevere-app Bot commented Jun 1, 2026

When you're done making the requested changes, leave the comment: I have made the requested changes; please review again.

Copy link
Copy Markdown
Member Author

@serhiy-storchaka serhiy-storchaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm getting two test failures against current main, one of which (test_chinese_codecs) looks like charset name changes and is probably correct, but the other(test_korean_codecs) I don't understand.

This is interesting. There are three chunks with encodings euc-kr, ks_c_5601-1987 and cp949. In main, cp949 is translated to ks_c_5601-1987, so the 2nd and the 3rd chunks are merged. With this PR, the second chunk, ks_c_5601-1987, translated to euc-kr, so the 1st and the 2nd chunks are merged.

This is because in Python ks_c_5601-1987 is an alias of euc-kr, but in IANA they are separate codecs.

Comment thread Lib/email/charset.py Outdated
Comment thread Lib/email/charset.py Outdated
Comment thread Lib/email/charset.py Outdated
@@ -61,40 +62,73 @@
'utf-8': (SHORTEST, BASE64, 'utf-8'),
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This item can be removed. (SHORTEST, BASE64, None) is the default value.

msg.set_param('charset',
email.charset.ALIASES.get(charset, charset),
replace=True)
msg.set_param('charset', charset, replace=True)
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added test_set_text_charset_cp949. Note that charset="euc-kr", even if ALIASES maps 'cp949' to 'ks_c_5601-1987'.

But when I tried to add similar test with shif_jis or euc-jp, it failed (trying to encode surrogates). It fails also with the current code.

@serhiy-storchaka
Copy link
Copy Markdown
Member Author

I have made the requested changes; please review again.

@bedevere-app
Copy link
Copy Markdown

bedevere-app Bot commented Jun 1, 2026

Thanks for making the requested changes!

@bitdancer: please review the changes made to this pull request.

@bedevere-app bedevere-app Bot requested a review from bitdancer June 1, 2026 22:24
@serhiy-storchaka
Copy link
Copy Markdown
Member Author

This is because in Python ks_c_5601-1987 is an alias of euc-kr, but in IANA they are separate codecs.

See #62825.

Copy link
Copy Markdown
Member

@bitdancer bitdancer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@bitdancer
Copy link
Copy Markdown
Member

This is because in Python ks_c_5601-1987 is an alias of euc-kr, but in IANA they are separate codecs.

See #62825.

It seems to me the change in behavior introduced by this PR at least partially addresses 62825 in that it makes the output encoding for the case in question the encoding that handles more characters. Do you agree or am I misunderstanding?

@serhiy-storchaka
Copy link
Copy Markdown
Member Author

I do not think it makes test_korean_codecs better or worse. At least we cannot see this until other issue with Korean charsets are fixed. But test_set_text_charset_cp949 fails on main, so there is at least some progress. And test_chinese_codecs is now more correct.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants