gh-119182: Use strict error handler in PyUnicode_FromFormat() #120307
vstinner wants to merge 1 commit into python:main
PyUnicode_FromFormat() now decodes the "%s" format argument from UTF-8 with the "strict" error handler, instead of the "replace" error handler. Remove the unused 'consumed' parameter of unicode_decode_utf8_writer().
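The difference between the two error handlers can be illustrated from Python with plain `bytes.decode()`. This is only a sketch of how the handlers behave, not of the C implementation; the `raw` value is a made-up input chosen for illustration.

```python
# Illustrative only: bytes.decode() stands in for the C-level UTF-8 decoding.
raw = b"abc\xff"  # hypothetical input: 0xff is never a valid byte in UTF-8

# "replace" substitutes U+FFFD (REPLACEMENT CHARACTER) for the undecodable byte.
replaced = raw.decode("utf-8", errors="replace")

# "strict" raises UnicodeDecodeError instead of producing a string.
try:
    raw.decode("utf-8", errors="strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```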
|
What happens with truncated strings, like |
There are two tests on that: UnicodeDecodeError is raised in this case. |
|
Example of test: test_capi.test_unicode
    # test "%s" format with precision
    check_format('abc',
                 b'%.3s', b'abcdef')
    with self.assertRaises(UnicodeDecodeError):
        PyUnicode_FromFormat(b'%.5s', 'abc[\u20ac]'.encode('utf8'))
    check_format('abc[\u20ac',
                 b'%.7s', 'abc[\u20ac]'.encode('utf8')) |
|
This is bad. Such formats are common in error-formatting code (not only in CPython, but also in third-party code), and now you will get a UnicodeDecodeError instead of the original error, even if the original string was correctly encoded. In this case I think it is better to truncate the string before the truncated sequence. But even without truncation, it may be better to get a replacement character in the error message of the correct exception than a UnicodeDecodeError. |
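One way to picture "truncate the string before the truncated sequence" is the following Python sketch. It is not the CPython implementation: `decode_prefix` is a hypothetical helper, and it relies on the standard `codecs` incremental decoder, which with `final=False` holds back an incomplete trailing multi-byte sequence instead of raising.

```python
import codecs


def decode_prefix(data: bytes, precision: int) -> str:
    """Hypothetical helper: decode at most `precision` bytes of UTF-8,
    dropping a multi-byte sequence that the byte limit cut in half."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    # final=False: an incomplete trailing sequence is buffered, not reported.
    # Bytes that are genuinely invalid (not merely incomplete) still raise.
    return decoder.decode(data[:precision], final=False)


encoded = "abc[\u20ac]".encode("utf-8")  # the euro sign takes 3 bytes

cut_inside = decode_prefix(encoded, 5)   # limit falls inside the euro sign
cut_after = decode_prefix(encoded, 7)    # limit falls right after it
```

With the same inputs as the test above, a precision of 5 would yield `'abc['` rather than an exception, while a precision of 7 still yields `'abc[\u20ac'`.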
|
On my PR gh-120248, @methane wrote:
So I created this PR. @methane: What do you think? I can modify the |
|
I think that @methane's comment was only related to the format string (which currently is ASCII-only), not to arguments. |
|
I think 100 codepoints is the best option. As for the error handler, there is no correct answer. Theoretically speaking, "strict" follows "Errors should never pass silently." |
|
Precision should specify the length in bytes. This feature can be used to format strings that are not null-terminated:
    char buffer[100];
    PyUnicode_FromFormat("%.100s", buffer);
If you start to count codepoints, you can read past the end of the array. |
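The mismatch between bytes and codepoints can be checked with a little arithmetic in Python. This is an illustrative sketch only: the buffer contents are made up, and the figure of 4 bytes per codepoint is the UTF-8 maximum.

```python
BUFFER_SIZE = 100  # mirrors `char buffer[100]` above

# A buffer completely filled by 3-byte characters (U+20AC) plus one ASCII byte.
text = "\u20ac" * 33 + "x"            # 33 * 3 + 1 = 100 bytes
buffer = text.encode("utf-8")

# The full buffer holds far fewer than 100 codepoints.
codepoints_in_buffer = len(buffer.decode("utf-8"))

# Reading 100 codepoints could require up to 4 bytes each in UTF-8.
worst_case_bytes = 100 * 4
```

Here the 100-byte buffer contains only 34 codepoints, so a decoder asked for 100 codepoints would keep reading beyond the array, and in the worst case would need 400 bytes.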
|
I abandon this PR. It seems like using |
📚 Documentation preview 📚: https://cpython-previews--120307.org.readthedocs.build/