gh-137146: Validate IPv6 ZoneID characters against RFC 6874 in urllib.parse #137148

Open
wants to merge 6 commits into
base: main

Contributor

@mauricelambert mauricelambert commented Jul 27, 2025

This PR tightens the validation of IPv6 Zone Identifiers (ZoneIDs) in bracketed hostnames handled by urllib.parse (#137146).

Problem

Currently, urllib.parse accepts any non-null string as a ZoneID, because it delegates IPv6 parsing to the ipaddress module, which follows RFC 4007. However, RFC 6874 §2.1 defines a stricter character set for ZoneIDs when used in URLs:

Characters allowed in ZoneIDs (after percent-decoding):
ALPHA / DIGIT / "-" / "." / "_" / "~"

In a URI, the % that introduces a ZoneID must itself be percent-encoded as %25. The RFC also suggests that parsers accept a bare % where possible.

Fix

This patch adds an explicit validation step to check that any ZoneID in a URL conforms to the allowed character set.
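A minimal sketch of what such a validation step could look like, assuming the check runs on the text between the brackets and operates on the still-encoded form. The helper name `check_zone_id` and the regular expression are hypothetical illustrations, not the code from this PR:

```python
import re

# Assumption: a ZoneID is one or more unreserved characters or
# percent-encoded octets. This is not the regex used by the actual patch.
_ZONE_ID_RE = re.compile(r"(?:[A-Za-z0-9._~-]|%[0-9A-Fa-f]{2})+")


def check_zone_id(bracketed_host: str) -> None:
    """Raise ValueError if the ZoneID part of an IPv6 host is invalid.

    `bracketed_host` is the text between "[" and "]",
    for example "fe80::1%25eth0".
    """
    _addr, sep, zone = bracketed_host.partition("%")
    if not sep:
        return  # no "%" means no ZoneID, so there is nothing to check
    if not _ZONE_ID_RE.fullmatch(zone):
        raise ValueError("IPv6 ZoneID is invalid")
```

With this sketch, `check_zone_id("fe80::1%25eth0")` returns silently, while `check_zone_id("fe80::1%zone|bad")` raises `ValueError` because `|` is outside the allowed set. Everything after the first `%` is treated as the ZoneID, so both the `%25` delimiter and a bare `%` pass through the same check.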

Before the fix:

>>> import urllib.parse
>>> urllib.parse.urlparse("http://[fe80::1%zone|bad]/")
ParseResult(scheme='http', netloc='[fe80::1%zone|bad]', path='/', ...)

After the fix:

>>> urllib.parse.urlparse("http://[fe80::1%zone|bad]/")
ValueError: IPv6 ZoneID is invalid

Notes

  • This does not affect parsing of valid IPv6 addresses or ZoneIDs that comply with RFC 6874.
  • The new check is only triggered if a % is present in the hostname (i.e., it's a ZoneID).
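As a quick illustration of the inputs these notes describe as unaffected, the following snippet parses one address without a ZoneID and one with a ZoneID made only of RFC 6874 characters. The URLs are made-up examples; on releases without the patch they already parse, and per the notes they should keep doing so:

```python
from urllib.parse import urlparse

# One host without a ZoneID, one with a "%25"-delimited ZoneID.
for url in ("http://[::1]/", "http://[fe80::1%25eth0]/path"):
    parts = urlparse(url)
    print(parts.netloc, parts.path)
```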

This improves RFC compliance, reduces risk of incorrect or insecure behavior, and ensures more predictable URL parsing.

…pliant set

The current parsing logic for IPv6 addresses with Zone Identifiers (ZoneIDs)
uses the `ipaddress` module, which validates ZoneIDs according to RFC 4007,
allowing any non-null string. However, when used in URLs, ZoneIDs must follow
the percent-encoded format defined in RFC 6874.
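
The permissive behaviour described above can be observed directly in `ipaddress`. This is a small illustration rather than part of the patch, and it assumes Python 3.9 or later, where `IPv6Address` exposes `scope_id`:

```python
import ipaddress

# "|" is outside the RFC 6874 character set, yet ipaddress accepts it
# as part of the scope ID.
addr = ipaddress.ip_address("fe80::1%zone|bad")
scope = addr.scope_id
print(scope)
```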

This patch adds a check to restrict ZoneIDs to the allowed characters:

  ALPHA / DIGIT / "-" / "." / "_" / "~" / "%" HEXDIG HEXDIG

RFC 6874 §2.1 specifies the format of an IPv6 address with a ZoneID in a URI as:
  `IPv6addrz = IPv6address "%25" ZoneID`

Additionally, RFC 6874 recommends accepting a bare `%` without hex digits as a
liberal extension, but that flexibility still requires ZoneID content to conform
to a safe character set. This patch enforces that ZoneIDs do not include
characters outside the permitted range.

### Before the fix:

```py
>>> import urllib.parse
>>> urllib.parse.urlparse("http://[::1%2|test]/path")
ParseResult(scheme='http', netloc='[::1%2|test]', path='/path', ...)
```

Invalid characters such as `|` were incorrectly accepted in ZoneIDs.

### After the fix:

```py
>>> import urllib.parse
>>> urllib.parse.urlparse("http://[::1%2|test]/path")
Traceback (most recent call last):
    ...
ValueError: IPv6 ZoneID is invalid
```

This patch ensures `urllib.parse` properly rejects ZoneIDs with invalid characters,
improving compliance with the URI standards and helping prevent subtle bugs
or security vulnerabilities.
@StanFromIreland StanFromIreland changed the title from "#137146: Validate IPv6 ZoneID characters against RFC 6874 in urllib.parse" to "gh-137146: Validate IPv6 ZoneID characters against RFC 6874 in urllib.parse" Jul 27, 2025
@StanFromIreland
Member

In the future, please use the title format I have changed your title to, so that our automation can recognise it.

Member

@StanFromIreland StanFromIreland left a comment

This needs a blurb entry.

Member

@ZeroIntensity ZeroIntensity left a comment

Please add a test case.

@@ -0,0 +1 @@
Validate IPv6 ZoneID characters in bracketed hostnames to match RFC 6874. `urllib.parse` now rejects ZoneIDs containing invalid or unsafe characters.
Member

@ZeroIntensity ZeroIntensity Jul 27, 2025

This is reStructuredText, not Markdown, so references look like this:

Suggested change
- Validate IPv6 ZoneID characters in bracketed hostnames to match RFC 6874. `urllib.parse` now rejects ZoneIDs containing invalid or unsafe characters.
+ Validate IPv6 ZoneID characters in bracketed hostnames to match RFC 6874. :mod:`urllib.parse` now rejects ZoneIDs containing invalid or unsafe characters.

Member

urllib.parse is a module; if you want to talk about the function, it's urllib.parse.urlparse. I've edited Zero's answer to change the role, as I didn't check which functions are affected (if it's the entire module, it's fine to reference only the module).

Member

Oops, thanks.

Contributor Author

I corrected the blurb and added the tests, thank you for your help.
