`urllib.parse` accepts invalid characters in IPv6 ZoneIDs and IPvFuture addresses #137146

Open

Labels

stdlib, type-security

Opened on Jul 27, 2025

Bug report

Bug description:

Issue Summary

The urllib.parse module currently accepts invalid characters in both IPv6 Zone Identifiers and IPvFuture addresses, because its validation of these components is looser than the grammars defined in RFC 6874 and RFC 3986.

Details

IPv6 ZoneID Parsing
- According to RFC 6874, an IPv6 ZoneID in a URL follows a restricted grammar: ZoneID = 1*( unreserved / pct-encoded ), introduced by the delimiter "%25" (a percent-encoded "%"). Any other character must be percent-encoded (% followed by two hex digits); only the following unreserved characters may appear literally:
```
ALPHA / DIGIT / "-" / "." / "_" / "~"
```
- However, urllib.parse delegates ZoneID parsing to the ipaddress module, which follows the broader rules of RFC 4007 and accepts any non-empty string.
- This results in invalid ZoneIDs being accepted in parsed URLs, which could cause compatibility issues or security concerns in strict RFC-compliant applications.
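
For illustration, the RFC 6874 ZoneID production can be written as a small regex check. This is only a sketch: the name is_valid_zone_id is hypothetical and is not part of urllib.parse or ipaddress, and it checks the ZoneID text alone (whatever follows the delimiter), not the address in front of it.

```python
import re

# ZoneID = 1*( unreserved / pct-encoded ), per the RFC 6874 ABNF.
# unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"
# pct-encoded = "%" HEXDIG HEXDIG
_ZONE_ID_RE = re.compile(r"(?:[A-Za-z0-9\-._~]|%[0-9A-Fa-f]{2})+")


def is_valid_zone_id(zone_id: str) -> bool:
    """Hypothetical helper: True if zone_id matches the ZoneID grammar."""
    return _ZONE_ID_RE.fullmatch(zone_id) is not None


for candidate in ("en0", "eth0.1", "en%200", "en0@", "en 0", ""):
    print(repr(candidate), is_valid_zone_id(candidate))
```

Using fullmatch rather than match matters here: a single disallowed character anywhere in the ZoneID should cause rejection, not just one at the start.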

IPvFuture Parsing
- RFC 3986 defines the format of an IPvFuture address as:
```
"v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" )
```
  where:
  - unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
  - sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
- However, the current regex used by urllib.parse for validating IPvFuture uses a .+ pattern, which matches any character sequence. This allows completely invalid strings (including those with spaces or other illegal characters) to be accepted as valid IPvFuture addresses.
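
To make the difference concrete, the sketch below contrasts a pattern of the permissive .+ shape described above with one built directly from the RFC 3986 ABNF. Both patterns are written for this illustration and are not copied from the CPython source; the names LOOSE_IPVFUTURE, STRICT_IPVFUTURE and is_ipvfuture are hypothetical.

```python
import re

# RFC 3986: IPvFuture = "v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" )
_UNRESERVED = r"A-Za-z0-9\-._~"
_SUB_DELIMS = r"!$&'()*+,;="

# ABNF string literals are case-insensitive (RFC 5234), so "V" is allowed too.
STRICT_IPVFUTURE = re.compile(
    rf"[vV][0-9A-Fa-f]+\.[{_UNRESERVED}{_SUB_DELIMS}:]+"
)

# Illustrative stand-in for a permissive ".+" tail: any characters pass.
LOOSE_IPVFUTURE = re.compile(r"v[0-9A-Fa-f]+\..+")


def is_ipvfuture(text: str) -> bool:
    """Hypothetical helper: True if text matches the RFC 3986 production."""
    return STRICT_IPVFUTURE.fullmatch(text) is not None


for candidate in ("v1.fe80::a+en1", "v1.invalid space", "v.abc", "v1."):
    loose = LOOSE_IPVFUTURE.fullmatch(candidate) is not None
    print(repr(candidate), "loose:", loose, "strict:", is_ipvfuture(candidate))
```

The string with a space passes the loose pattern and fails the strict one, which is the gap this report describes.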

Expected Behavior

urllib.parse should:
- Enforce the character restrictions for ZoneIDs as defined in RFC 6874 when parsing URL hosts.
- Strictly validate IPvFuture addresses based on the character classes defined in RFC 3986.

Actual Behavior

Invalid characters are accepted in both ZoneIDs and IPvFuture segments.

Suggested Fix

Update urllib.parse:
- To decode and validate ZoneID characters according to RFC 6874.
- To adjust the regex or parsing logic for IPvFuture addresses to comply with the ABNF in RFC 3986.

Example

```python
from urllib.parse import urlparse

# Invalid ZoneID containing `@`
url = urlparse("http://[fe80::1%en0@]:8080")
print(url.hostname)  # Should raise or reject invalid ZoneID

# Invalid IPvFuture containing ` ` (space)
url = urlparse("http://[v1.invalid space]/")
print(url.hostname)  # Should raise or reject invalid IPvFuture
```

Impact

This behavior may lead to security vulnerabilities or unexpected behavior in applications that rely on urllib.parse for URL validation or sanitization based on RFC compliance.

In particular:

- A developer may incorrectly assume that urllib.parse enforces RFC 3986, RFC 6874, or RFC 4007 character restrictions.
- If the parser accepts invalid ZoneIDs or malformed IPvFuture components, applications could:
  - Accept and process invalid or malicious URLs.
  - Misroute requests, leading to access control issues.
  - Be exposed to injection attacks (e.g., if the ZoneID is reused unsanitized in shell commands or logging).
  - Fail silently in contexts where strict compliance is expected, introducing logic bugs or interoperability issues.

Example scenario:

If a developer trusts urllib.parse to reject invalid hostnames, and then interpolates the parsed url.hostname or url.netloc into system commands, proxy configurations, or DNS lookups, an attacker could exploit improperly validated input.

Recommendation:

Until this behavior is corrected, developers should not rely solely on urllib.parse for validation of ZoneIDs or IPvFuture addresses, and should consider adding additional sanitization layers when handling user-supplied URLs.
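
One possible shape for such an additional layer is sketched below. It is an assumption-laden example, not a proposed patch: is_strict_ip_literal and url_has_strict_host are hypothetical names, the check covers only bracketed IP literals, and the text after the first "%" is validated as-is, so both the bare "%en0" form used in the example above and the RFC 6874 "%25en0" form pass.

```python
import ipaddress
import re
from urllib.parse import urlsplit

# Grammars from RFC 6874 (ZoneID) and RFC 3986 (IPvFuture).
_ZONE_ID_RE = re.compile(r"(?:[A-Za-z0-9\-._~]|%[0-9A-Fa-f]{2})+")
_IPVFUTURE_RE = re.compile(r"[vV][0-9A-Fa-f]+\.[A-Za-z0-9\-._~!$&'()*+,;=:]+")


def is_strict_ip_literal(literal: str) -> bool:
    """Hypothetical helper: check the text between '[' and ']'."""
    if literal[:1] in ("v", "V"):
        return _IPVFUTURE_RE.fullmatch(literal) is not None
    addr, sep, zone = literal.partition("%")
    try:
        ipaddress.IPv6Address(addr)  # address part only, zone stripped
    except ValueError:
        return False
    # No "%" means no ZoneID to check; otherwise the zone must match.
    return not sep or _ZONE_ID_RE.fullmatch(zone) is not None


def url_has_strict_host(url: str) -> bool:
    """Hypothetical wrapper: False if a bracketed host fails the strict check."""
    try:
        netloc = urlsplit(url).netloc
    except ValueError:
        return False  # urllib.parse already rejected the URL
    hostport = netloc.rpartition("@")[2]  # drop any userinfo
    if not hostport.startswith("["):
        return True  # not an IP literal; outside the scope of this check
    literal, closing, _rest = hostport[1:].partition("]")
    return bool(closing) and is_strict_ip_literal(literal)


print(url_has_strict_host("http://[fe80::1%en0]:8080/"))
print(url_has_strict_host("http://[fe80::1%en0!]:8080/"))
print(url_has_strict_host("http://[v1.invalid space]/"))
```

The wrapper treats a ValueError from urlsplit as a rejection, so it gives the same answer whether or not a given Python version already rejects the input.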

CPython versions tested on:

3.13

Operating systems tested on:

Linux
