gh-137146: Validate IPv6 ZoneID characters against RFC 6874 in urllib.parse #137148
…pliant set

The current parsing logic for IPv6 addresses with Zone Identifiers (ZoneIDs) uses the `ipaddress` module, which validates ZoneIDs according to RFC 4007, allowing any non-empty string. However, when used in URLs, ZoneIDs must follow the percent-encoded format defined in RFC 6874. This patch adds a check to restrict ZoneIDs to the allowed characters:

`ALPHA / DIGIT / "-" / "." / "_" / "~" / "%" HEXDIG HEXDIG`

RFC 6874 §2.1 specifies the format of an IPv6 address with a ZoneID in a URI as:

`IPv6addrz = IPv6address "%25" ZoneID`

Additionally, RFC 6874 recommends accepting a bare `%` without hex digits as a liberal extension, but that flexibility still requires ZoneID content to conform to a safe character set. This patch enforces that ZoneIDs do not include characters outside the permitted range.

### Before the fix:

```py
>>> import urllib.parse
>>> urllib.parse.urlparse("http://[::1%2|test]/path")
ParseResult(scheme='http', netloc='[::1%2|test]', path='/path', ...)
```

Invalid characters such as `|` were incorrectly accepted in ZoneIDs.

### After the fix:

```py
>>> import urllib.parse
>>> urllib.parse.urlparse("http://[::1%2|test]/path")
Traceback (most recent call last):
  ...
ValueError: IPv6 ZoneID is invalid
```

This patch ensures `urllib.parse` properly rejects ZoneIDs with invalid characters, improving compliance with the URI standards and helping prevent subtle bugs or security vulnerabilities.
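To illustrate the permissive behaviour the description attributes to the `ipaddress` module, here is a small sketch. It assumes Python 3.9 or later, where `IPv6Address` exposes a `scope_id` attribute; the helper name `scope_id_of` is made up for this example and is not part of the patch.

```python
import ipaddress


def scope_id_of(host: str):
    """Return the scope (zone) ID that ipaddress extracts from an IPv6 literal.

    Hypothetical helper for illustration only. Returns None when the
    literal carries no "%<zone>" suffix.
    """
    return ipaddress.IPv6Address(host).scope_id


# The zone part is taken verbatim; characters such as "|" are not rejected.
print(scope_id_of("::1%2|test"))
print(scope_id_of("::1"))

# An empty zone after "%" is rejected by ipaddress itself.
try:
    scope_id_of("::1%")
except ValueError as exc:
    print("rejected:", exc)
```

Because the zone is accepted as-is at this layer, any URL-specific character restriction has to be applied separately by the caller.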
In the future, please use the required title format. I have edited your title so that our automation can recognise it.
This needs a blurb entry.
Please add a test case.
@@ -0,0 +1 @@
+Validate IPv6 ZoneID characters in bracketed hostnames to match RFC 6874. `urllib.parse` now rejects ZoneIDs containing invalid or unsafe characters.
This is reStructuredText, not Markdown, so references look like this:
-Validate IPv6 ZoneID characters in bracketed hostnames to match RFC 6874. `urllib.parse` now rejects ZoneIDs containing invalid or unsafe characters.
+Validate IPv6 ZoneID characters in bracketed hostnames to match RFC 6874. :mod:`urllib.parse` now rejects ZoneIDs containing invalid or unsafe characters.
`urllib.parse` is a module; if you want to refer to the function, it's `urllib.parse.urlparse`. I've edited Zero's suggestion to change the role, as I didn't check which functions are affected (if it's the entire module, it's fine to reference only the module).
Oops, thanks.
I corrected the blurb and added the tests, thank you for your help.
This PR tightens the validation of IPv6 Zone Identifiers (ZoneIDs) in bracketed hostnames handled by `urllib.parse` (#137146).

### Problem

Currently, `urllib.parse` accepts any non-null string as a ZoneID, because it delegates IPv6 parsing to the `ipaddress` module, which follows RFC 4007. However, RFC 6874 §2.1 defines a stricter character set for ZoneIDs when used in URLs:

ZoneIDs in URIs must be percent-encoded and may optionally begin with a literal `%` (e.g., `%25`) as described in the RFC.

### Fix

This patch adds an explicit validation step to check that any ZoneID in a URL conforms to the allowed character set.

Before the fix:

After the fix:

### Notes

`%` is present in the hostname (i.e., it's a ZoneID).

This improves RFC compliance, reduces risk of incorrect or insecure behavior, and ensures more predictable URL parsing.

`urllib.parse` accepts invalid characters in IPv6 ZoneIDs and IPvFuture addresses #137146
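The validation step described in the PR summary could look roughly like the sketch below. This is not the code from the patch: the function name `is_valid_zone_id` and the regular expression are assumptions, written only from the character set quoted in the description (unreserved characters plus percent-encoded octets). It checks the ZoneID text alone, and it does not implement the liberal bare-`%` handling mentioned in the description.

```python
import re

# Assumed pattern: one or more of ALPHA / DIGIT / "-" / "." / "_" / "~",
# or a percent-encoded octet ("%" followed by two hex digits).
_ZONE_ID_RE = re.compile(r"(?:[A-Za-z0-9\-._~]|%[0-9A-Fa-f]{2})+")


def is_valid_zone_id(zone_id: str) -> bool:
    """Return True if zone_id uses only the characters listed above.

    Hypothetical helper for illustration; the real patch may be
    structured differently.
    """
    return _ZONE_ID_RE.fullmatch(zone_id) is not None


for candidate in ("eth0", "2|test", "a%2Fb", "a%zz", ""):
    print(repr(candidate), is_valid_zone_id(candidate))
```

Using `fullmatch` rather than `match` means a single disallowed character anywhere in the ZoneID, such as the `|` from the example in the description, causes the check to fail.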