Skip to content

gh-88886: Remove excessive encoding name normalization #137167

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

serhiy-storchaka
Copy link
Member

@serhiy-storchaka serhiy-storchaka commented Jul 28, 2025

The codecs lookup function now performs only minimal normalization of the encoding name before passing it to the search functions: all ASCII letters are converted to lower case, spaces are replaced with hyphens.

Excessive normalization broke third-party codecs providers, like python-iconv.

Revert "bpo-37751: Fix codecs.lookup() normalization (GH-15092)"

This reverts commit 20f59fe.


📚 Documentation preview 📚: https://cpython-previews--137167.org.readthedocs.build/

Copy link
Member

@StanFromIreland StanFromIreland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need to update normalize_encodings doc.

@malemburg Which order should the two PRs be merged in, switching it to the C implementation and simplifying the implementation?

@serhiy-storchaka
Copy link
Member Author

serhiy-storchaka commented Jul 28, 2025

This mainly restores the status quo prior to bpo-37751 and updates the documentation. But I am planning more changes.

I was not sure about replacing spaces with hyphens? Should we kept this here, on leave it to the search function? Should we convert spaces to underscores instead? Or should we remove any transformation?

I plan also to change _Py_normalize_encoding() -- only convert to lower case and replace spaces, hyphens and underscores, so encoding like "utf#$^(&8" will no longer be accepted. And change encodings.normalize_encoding() in a similar way. Names that are not valid module names after normalization should not be found.

The codecs lookup function now performs only minimal normalization of
the encoding name before passing it to the search functions:
all ASCII letters are converted to lower case, spaces are replaced
with hyphens.

Excessive normalization broke third-party codecs providers, like
python-iconv.

Revert "bpo-37751: Fix codecs.lookup() normalization (pythonGH-15092)"

This reverts commit 20f59fe.
Copy link
Member

@malemburg malemburg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM modulo the one change to the news entry.

@@ -0,0 +1,3 @@
The codecs lookup function now performs only minimal normalization of the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd write "The codecs lookup function now again performs only minimal normalization of the encoding name before passing it to the search functions: all ASCII letters are converted to lower case, spaces are replaced with hyphens. This restores the pre-Python 3.9 behavior.".

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it restores the pre-Python 3.9 behavior. I wonder if we can go further -- do not replace spaces, or replace them with underscores instead of hyphens, or replace also hyphens or underscores. We can handle this in normalize_encoding() or just add aliases. iconv on FreeBSD do not convert any non-letter character. iconv on Linux preserves alphanumerics and "_-.,:/", and removes all others.

Should it still convert to lower case? Most standard encoding names in the IANA registry are in upper case. iconv encoding names are all in upper case (on Linux and FreeBSD). But this is potentially a breaking change without much benefit.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, let's not change the behavior again. It's been like this for ages before the Python 3.9 change and we really only want to roll back that change.

The lower casing was added to make sure that encoding names are handled case insensitive. It's rather common to default to lower casing rather then upper casing to do this.

Copy link
Member

@StanFromIreland StanFromIreland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants