Description
Bug report
Bug description:
Issue Summary
The `urllib.parse` module currently allows invalid characters in both IPv6 Zone Identifiers and IPvFuture addresses due to discrepancies between how it validates these components and what is defined in relevant RFCs.
Details
- IPv6 ZoneID Parsing
  - According to RFC 6874, the IPv6 ZoneID should follow a restricted format when used in a URL. Specifically, it must use percent-encoding (`%` followed by two hex digits) for non-allowed characters, and only the following characters are permitted in the decoded form: `ALPHA / DIGIT / "-" / "." / "_" / "~"`
  - However, `urllib.parse` relies on the `ipaddress` module to parse ZoneIDs, which follows the broader rules from RFC 4007, accepting any non-null string.
  - This results in invalid ZoneIDs being accepted in parsed URLs, which could cause compatibility issues or security concerns in strict RFC-compliant applications.
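As an illustration of the stricter check, here is a minimal sketch of a ZoneID validator. It assumes the RFC 6874 rule `ZoneID = 1*( unreserved / pct-encoded )` and is applied to the still-encoded ZoneID text. The names `_ZONE_ID_RE` and `is_valid_zone_id` are hypothetical and are not part of `urllib.parse`.

```python
import re

# Hypothetical helper, not part of urllib.parse.
# unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"
# pct-encoded = "%" HEXDIG HEXDIG
_ZONE_ID_RE = re.compile(r"(?:[A-Za-z0-9\-._~]|%[0-9A-Fa-f]{2})+")


def is_valid_zone_id(zone_id: str) -> bool:
    """Return True if zone_id matches 1*( unreserved / pct-encoded )."""
    return _ZONE_ID_RE.fullmatch(zone_id) is not None
```

With this sketch, `is_valid_zone_id("en0")` is accepted while `is_valid_zone_id("en0@")` and the empty string are rejected.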
- IPvFuture Parsing
  - RFC 3986 defines the format of an `IPvFuture` address as: `"v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" )` where:
    - `unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"`
    - `sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="`
  - However, the current regex used by `urllib.parse` for validating `IPvFuture` uses a `.+` pattern, which matches any character sequence. This allows completely invalid strings (including those with spaces or other illegal characters) to be accepted as valid IPvFuture addresses.
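For comparison, a regex that follows the RFC 3986 ABNF could look roughly like the sketch below. The names `_IPVFUTURE_RE` and `is_valid_ipvfuture` are hypothetical. Accepting an uppercase `V` is an assumption based on ABNF string literals being case-insensitive.

```python
import re

# Hypothetical helper, not part of urllib.parse.
# IPvFuture = "v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" )
_IPVFUTURE_RE = re.compile(
    r"[vV][0-9A-Fa-f]+\.[A-Za-z0-9\-._~!$&'()*+,;=:]+"
)


def is_valid_ipvfuture(host: str) -> bool:
    """Return True if host (without the surrounding brackets) is an IPvFuture."""
    return _IPVFUTURE_RE.fullmatch(host) is not None
```

Under this pattern `"v1.fe80::a+en1"` is accepted, while `"v1.invalid space"` is rejected because a space is in neither `unreserved` nor `sub-delims`.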
Expected Behavior
`urllib.parse` should:
- Enforce the character restrictions for ZoneIDs as defined in RFC 6874 when parsing URL hosts.
- Strictly validate IPvFuture addresses based on the character classes defined in RFC 3986.
Actual Behavior
- Invalid characters are accepted in both ZoneIDs and IPvFuture segments.
Suggested Fix
Update `urllib.parse`:
- To decode and validate ZoneID characters according to RFC 6874.
- To adjust the regex or parsing logic for IPvFuture addresses to comply with the ABNF in RFC 3986.
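One possible shape for such a check is sketched below. It is not the actual `urllib.parse` implementation: `check_bracketed_host` and both regex names are hypothetical, and the function takes the text between `[` and `]`. It assumes a bare `%` separates the address from the ZoneID, as in the example in this report, whereas RFC 6874 writes the separator as `%25` in a URL.

```python
import ipaddress
import re

# Hypothetical names; a sketch of the suggested fix, not CPython code.
_ZONE_ID_RE = re.compile(r"(?:[A-Za-z0-9\-._~]|%[0-9A-Fa-f]{2})+")
_IPVFUTURE_RE = re.compile(
    r"[vV][0-9A-Fa-f]+\.[A-Za-z0-9\-._~!$&'()*+,;=:]+"
)


def check_bracketed_host(host: str) -> None:
    """Raise ValueError unless host (the text inside the brackets) is valid."""
    if host[:1] in ("v", "V"):
        # IPvFuture: validate against the RFC 3986 character classes.
        if not _IPVFUTURE_RE.fullmatch(host):
            raise ValueError(f"invalid IPvFuture address: {host!r}")
        return
    # Assumption: a bare "%" separates the address from the ZoneID.
    address, sep, zone_id = host.partition("%")
    # ipaddress validates only the address part here; it raises
    # AddressValueError (a ValueError subclass) on failure.
    ipaddress.IPv6Address(address)
    # The ZoneID is checked separately against the stricter URL rules.
    if sep and not _ZONE_ID_RE.fullmatch(zone_id):
        raise ValueError(f"invalid ZoneID: {zone_id!r}")
```

Splitting the ZoneID off before calling `ipaddress` keeps the address validation unchanged while applying the URL-specific character restrictions to the zone part only.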
Example
from urllib.parse import urlparse
# Invalid ZoneID containing `@`
url = urlparse("http://[fe80::1%en0@]:8080")
print(url.hostname) # Should raise or reject invalid ZoneID
# Invalid IPvFuture containing ` ` (space)
url = urlparse("http://[v1.invalid space]/")
print(url.hostname) # Should raise or reject invalid IPvFuture
Impact
This behavior may lead to security vulnerabilities or unexpected behavior in applications that rely on `urllib.parse` for URL validation or sanitization based on RFC compliance.
In particular:
- A developer may incorrectly assume that `urllib.parse` enforces RFC 3986, RFC 6874, or RFC 4007 character restrictions.
- If the parser accepts invalid ZoneIDs or malformed IPvFuture components, applications could:
  - Accept and process invalid or malicious URLs.
  - Misroute requests, leading to access control issues.
  - Be exposed to injection attacks (e.g., if the ZoneID is reused unsanitized in shell commands or logging).
  - Fail silently in contexts where strict compliance is expected, introducing logic bugs or interoperability issues.
Example scenario:
If a developer trusts `urllib.parse` to reject invalid hostnames, and then interpolates the parsed `url.hostname` or `url.netloc` into system commands, proxy configurations, or DNS lookups, an attacker could exploit improperly validated input.
Recommendation:
Until this behavior is corrected, developers should not rely solely on `urllib.parse` for validation of ZoneIDs or IPvFuture addresses, and should consider adding additional sanitization layers when handling user-supplied URLs.
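A minimal sketch of such an extra layer is shown below. The helper `safe_hostname` and the allowlist `_SAFE_HOST_RE` are hypothetical, and the allowlist is a deliberately conservative assumption (unreserved characters plus `:` and `%`), not a rule taken from any RFC. Adjust it to what the application actually needs.

```python
import re
from typing import Optional
from urllib.parse import urlsplit

# Hypothetical, conservative allowlist for hostnames that will be reused
# elsewhere (commands, configs, logs). This is an assumption, not an RFC rule.
_SAFE_HOST_RE = re.compile(r"[A-Za-z0-9\-._~:%]+")


def safe_hostname(url: str) -> Optional[str]:
    """Return the parsed hostname, or None if it fails the allowlist."""
    try:
        hostname = urlsplit(url).hostname
    except ValueError:
        # The parser may already reject some malformed hosts.
        return None
    if hostname is None or not _SAFE_HOST_RE.fullmatch(hostname):
        return None
    return hostname
```

Callers would treat a `None` result as a rejected URL rather than passing the raw `url.hostname` or `url.netloc` on.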
CPython versions tested on:
3.13
Operating systems tested on:
Linux