gh-119182: Use strict error handler in PyUnicode_FromFormat() by vstinner · Pull Request #120307 · python/cpython

vstinner · 2024-06-10T09:08:12Z

PyUnicode_FromFormat() now decodes the "%s" format argument from UTF-8 with the "strict" error handler, instead of the "replace" error handler.

Remove the unused 'consumed' parameter of
unicode_decode_utf8_writer().
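
To make the difference between the two handlers concrete, here is a minimal Python sketch. It uses `bytes.decode()` as a stand-in for the C-level UTF-8 decoding of the "%s" argument; the byte string `raw` is a made-up example, not something taken from this PR.

```python
# 0xff can never appear in well-formed UTF-8, so this input is invalid.
raw = b"caf\xff"

# "replace" (the existing behavior for "%s"): the bad byte becomes U+FFFD.
replaced = raw.decode("utf-8", errors="replace")
print(repr(replaced))

# "strict" (the behavior proposed here): the bad byte raises an exception.
try:
    raw.decode("utf-8", errors="strict")
    strict_error = None
except UnicodeDecodeError as exc:
    strict_error = exc
print(type(strict_error).__name__)
```

With "replace" the caller always gets a string back; with "strict" the caller has to be prepared for a UnicodeDecodeError.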

Issue: [C API] Add an efficient public PyUnicodeWriter API #119182

📚 Documentation preview 📚: https://cpython-previews--120307.org.readthedocs.build/

vstinner · 2024-06-10T09:08:25Z

cc @methane @serhiy-storchaka

serhiy-storchaka · 2024-06-10T09:14:25Z

What happens with truncated strings, like %.50s, if they are truncated in the middle of a multibyte UTF-8 sequence?

vstinner · 2024-06-10T09:39:32Z

> What happens with truncated strings, like %.50s, if they are truncated in the middle of a multibyte UTF-8 sequence?

There are two tests on that: UnicodeDecodeError is raised in this case.

vstinner · 2024-06-10T09:55:57Z

Example of test: test_capi.test_unicode

        # test "%s" format with precision
        check_format('abc',
                     b'%.3s', b'abcdef')
        with self.assertRaises(UnicodeDecodeError):
            PyUnicode_FromFormat(b'%.5s', 'abc[\u20ac]'.encode('utf8'))
        check_format('abc[\u20ac',
                     b'%.7s', 'abc[\u20ac]'.encode('utf8'))
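
The same truncation can be approximated in pure Python, without the C API test helpers. This is only a sketch: `decode_prefix` is a hypothetical helper that emulates a byte precision by slicing, and it is not how PyUnicode_FromFormat() is implemented.

```python
def decode_prefix(data: bytes, precision: int, errors: str) -> str:
    """Hypothetical stand-in for "%.<precision>s": keep at most
    `precision` bytes, then decode them as UTF-8."""
    return data[:precision].decode("utf-8", errors=errors)


data = "abc[\u20ac]".encode("utf-8")  # the euro sign takes 3 bytes

# 7 bytes end exactly after the euro sign, so strict decoding succeeds.
whole = decode_prefix(data, 7, "strict")
print(repr(whole))

# 5 bytes cut the euro sign after its first byte.
try:
    decode_prefix(data, 5, "strict")
    cut_error = None
except UnicodeDecodeError as exc:
    cut_error = exc
print(type(cut_error).__name__)

# With "replace", the partial sequence becomes U+FFFD instead.
cut_replaced = decode_prefix(data, 5, "replace")
print(repr(cut_replaced))
```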

serhiy-storchaka · 2024-06-10T10:11:47Z

This is bad. Such formats are common in error formatting code (not only in CPython, but also in third-party code), and now you will get a UnicodeDecodeError instead of the original error even if the encoding of the whole string was fine. In this case I think that it is better to truncate the string before the truncated sequence.

But even without truncation, it may be better to get a replacement character in the error message of the correct exception than a UnicodeDecodeError.
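
One way to read the suggestion of truncating before the cut sequence is sketched below. `trim_incomplete_utf8_tail` is a hypothetical helper written for illustration. It assumes the untruncated input was valid UTF-8 and leaves every other kind of invalid input to the decoder.

```python
def trim_incomplete_utf8_tail(prefix: bytes) -> bytes:
    """Drop a trailing UTF-8 sequence that was cut off by byte truncation."""
    # Walk back over at most 3 trailing continuation bytes (0b10xxxxxx).
    i = len(prefix)
    while i > 0 and len(prefix) - i < 3 and (prefix[i - 1] & 0xC0) == 0x80:
        i -= 1
    if i == 0:
        return prefix  # empty, or only continuation bytes: nothing to trim
    lead = prefix[i - 1]
    if lead >= 0xF0:
        needed = 4
    elif lead >= 0xE0:
        needed = 3
    elif lead >= 0xC0:
        needed = 2
    else:
        return prefix  # ASCII byte or stray continuation byte: leave as is
    available = len(prefix) - (i - 1)
    # Remove the lead byte and its continuations only if the sequence is short.
    return prefix[: i - 1] if available < needed else prefix


data = "abc[\u20ac]".encode("utf-8")
print(trim_incomplete_utf8_tail(data[:5]).decode("utf-8"))  # strict decode works
```

With this kind of trimming, a "%.5s" style truncation of the example string would give a shorter but valid string rather than an exception or a replacement character.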

vstinner · 2024-06-10T13:04:22Z

On my PR gh-120248, @methane wrote:

> I prefer "strict" because "hard to notice" is also hard to debug.

So I created this PR. @methane: What do you think?

I can modify the %.100s format ("%s" with precision) to truncate to 100 characters instead of 100 bytes, to avoid the risk of creating invalid UTF-8 strings.

serhiy-storchaka · 2024-06-10T16:25:58Z

I think that @methane's comment was only related to the format string (which currently is ASCII-only), not to arguments.

methane · 2024-06-11T13:29:43Z

I think 100 codepoints is the best option.

As for the error handler, there is no correct answer. Theoretically speaking, "strict" follows "Errors should never pass silently."
But both "replace" and "backslashreplace" are acceptable.
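
For comparison, a small Python sketch of the two lenient handlers mentioned here. The input bytes are an invented example; `bytes.decode()` accepts "backslashreplace" for decoding since Python 3.5.

```python
raw = b"bad byte: \xff"

# "replace": the invalid byte becomes U+FFFD and its value is lost.
with_replace = raw.decode("utf-8", errors="replace")

# "backslashreplace": the invalid byte becomes a \xNN escape,
# so the original value stays visible in the message.
with_backslash = raw.decode("utf-8", errors="backslashreplace")

print(repr(with_replace))
print(repr(with_backslash))
```

Neither handler raises, which matters for error-formatting code; "backslashreplace" simply keeps more information for debugging.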

serhiy-storchaka · 2024-06-11T18:31:46Z

Precision should specify the length in bytes. This feature can be used to format non-null-terminated strings.

char buffer[100];
PyUnicode_FromFormat("%.100s", buffer);

If you start to count codepoints, you can read past the end of the array.
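
The over-read can be simulated in Python, which cannot itself read out of bounds. In this sketch, `memory` models a 100-byte buffer followed by whatever happens to sit after it, and `bytes_scanned` is a hypothetical helper that reports how far a reader would go under each counting rule.

```python
def bytes_scanned(memory: bytes, limit: int, count_codepoints: bool) -> int:
    """Return how many bytes a "%.<limit>s"-style reader would consume.

    The reader stops at a NUL byte, or once `limit` units were read.
    A unit is a byte, or a code point when `count_codepoints` is true.
    """
    units = 0
    for offset, byte in enumerate(memory):
        if byte == 0:
            return offset
        # In code point mode, continuation bytes (0b10xxxxxx) are free.
        starts_unit = (byte & 0xC0) != 0x80 if count_codepoints else True
        if starts_unit:
            if units == limit:
                return offset
            units += 1
    return len(memory)


buffer = ("\u00e9" * 50).encode("utf-8")  # 100 bytes, no NUL terminator
adjacent = b"SECRET" * 20 + b"\0"         # unrelated data after the buffer
memory = buffer + adjacent

by_bytes = bytes_scanned(memory, 100, count_codepoints=False)
by_codepoints = bytes_scanned(memory, 100, count_codepoints=True)
print(by_bytes, by_codepoints)
```

Counting bytes stays inside the 100-byte buffer, while counting code points keeps reading into the adjacent data because the buffer only holds 50 code points.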

vstinner · 2024-06-17T20:02:18Z

I'm abandoning this PR. It seems that using the "replace" error handler is more appropriate here.

Commit 3541237: gh-119182: Use strict error handler in PyUnicode_FromFormat()

bedevere-app bot mentioned this pull request Jun 10, 2024

[C API] Add an efficient public PyUnicodeWriter API #119182

Closed

bedevere-app bot added the awaiting core review label Jun 10, 2024

vstinner mentioned this pull request Jun 10, 2024

gh-119182: Decode PyUnicode_FromFormat() format string from UTF-8 #120248

Closed

vstinner closed this Jun 17, 2024

vstinner deleted the format_strict branch June 17, 2024 20:02

vstinner wants to merge 1 commit into python:main from vstinner:format_strict
