gh-53144: Improve charset support in the email package#149942
gh-53144: Improve charset support in the email package#149942serhiy-storchaka wants to merge 5 commits into
Conversation
Defer to the codecs module for all aliases. Use MIME/IANA names for all IANA registered charsets.
bitdancer
left a comment
There was a problem hiding this comment.
Overall this looks good.
I'm getting two test failures against current main, one of which (test_chinese_codecs) looks like charset name changes and is probably correct, but the other(test_korean_codecs) I don't understand.
| @@ -61,40 +62,73 @@ | |||
| 'utf-8': (SHORTEST, BASE64, 'utf-8'), | |||
There was a problem hiding this comment.
| 'utf-8': (SHORTEST, BASE64, 'utf-8'), | |
| 'utf-8': (SHORTEST, BASE64, None), |
I don't know why this is set to a non-None value when all the others are None, except for the one that wants to use a different charset for header encoding. So I think we should "fix" it to avoid confusion later (assuming I'm not missing something here).
There was a problem hiding this comment.
This item can be removed. (SHORTEST, BASE64, None) is the default value.
| self.header_encoding = henc | ||
| self.body_encoding = benc | ||
| self.output_charset = ALIASES.get(conv, conv) | ||
| self.output_charset = conv |
There was a problem hiding this comment.
The comment says "but let the user override it", which is what the call to ALIASES does. While it is unlikely anyone is using that facility, it is part of the API to update the aliases table, so we should maybe leave it. The place it might get used is exactly the place it might be wanted: overriding iso-2022-jp 7bit codec. So, since it is cheap, and a no-op in all other cases, maybe we shouldn't risk breaking anyone's legacy code?
| msg.set_param('charset', | ||
| email.charset.ALIASES.get(charset, charset), | ||
| replace=True) | ||
| msg.set_param('charset', charset, replace=True) |
There was a problem hiding this comment.
Your reordering of the operations here looks correct, which presumably means there is a missing test that would show the bug. Do you want to add one? If not I'll make a note and try to remember to do it some day ;)
There was a problem hiding this comment.
Added test_set_text_charset_cp949. Note that charset="euc-kr", even if ALIASES maps 'cp949' to 'ks_c_5601-1987'.
But when I tried to add similar test with shif_jis or euc-jp, it failed (trying to encode surrogates). It fails also with the current code.
|
When you're done making the requested changes, leave the comment: |
serhiy-storchaka
left a comment
There was a problem hiding this comment.
I'm getting two test failures against current main, one of which (test_chinese_codecs) looks like charset name changes and is probably correct, but the other(test_korean_codecs) I don't understand.
This is interesting. There are three chunks with encodings euc-kr, ks_c_5601-1987 and cp949. In main, cp949 is translated to ks_c_5601-1987, so the 2nd and the 3rd chunks are merged. With this PR, the second chunk, ks_c_5601-1987, translated to euc-kr, so the 1st and the 2nd chunks are merged.
This is because in Python ks_c_5601-1987 is an alias of euc-kr, but in IANA they are separate codecs.
| @@ -61,40 +62,73 @@ | |||
| 'utf-8': (SHORTEST, BASE64, 'utf-8'), | |||
There was a problem hiding this comment.
This item can be removed. (SHORTEST, BASE64, None) is the default value.
| msg.set_param('charset', | ||
| email.charset.ALIASES.get(charset, charset), | ||
| replace=True) | ||
| msg.set_param('charset', charset, replace=True) |
There was a problem hiding this comment.
Added test_set_text_charset_cp949. Note that charset="euc-kr", even if ALIASES maps 'cp949' to 'ks_c_5601-1987'.
But when I tried to add similar test with shif_jis or euc-jp, it failed (trying to encode surrogates). It fails also with the current code.
|
I have made the requested changes; please review again. |
|
Thanks for making the requested changes! @bitdancer: please review the changes made to this pull request. |
See #62825. |
It seems to me the change in behavior introduced by this PR at least partially addresses 62825 in that it makes the output encoding for the case in question the encoding that handles more characters. Do you agree or am I misunderstanding? |
|
I do not think it makes |
Defer to the codecs module for all aliases.
Use MIME/IANA names for all IANA registered charsets.