-
Notifications
You must be signed in to change notification settings - Fork 1.4k
windows codecs #6337
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
windows codecs #6337
Conversation
WalkthroughThis change adds Windows-specific implementations for MBCS and OEM codec encode/decode functions with dedicated argument structs and typed returns, and retains non-Windows fallbacks that delegate to Python codecs for portability. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~22 minutes
Poem
Pre-merge checks and finishing touches❌ Failed checks (1 inconclusive)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
a204467 to
c623c56
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 4
📜 Review details
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
crates/vm/src/stdlib/codecs.rs(1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.rs
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.rs: Follow the default rustfmt code style by runningcargo fmtto format Rust code
Always run clippy to lint Rust code (cargo clippy) before completing tasks and fix any warnings or lints introduced by changes
Follow Rust best practices for error handling and memory management
Use the macro system (pyclass,pymodule,pyfunction, etc.) when implementing Python functionality in Rust
Files:
crates/vm/src/stdlib/codecs.rs
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
- GitHub Check: Run snippets and cpython tests (windows-latest)
- GitHub Check: Run snippets and cpython tests (macos-latest)
- GitHub Check: Run snippets and cpython tests (ubuntu-latest)
- GitHub Check: Ensure compilation on various targets
- GitHub Check: Run rust tests (windows-latest)
- GitHub Check: Run rust tests (ubuntu-latest)
- GitHub Check: Run rust tests (macos-latest)
- GitHub Check: Check Rust code with clippy
- GitHub Check: Check the WASM package and demo
🔇 Additional comments (5)
crates/vm/src/stdlib/codecs.rs (5)
228-235: LGTM!The argument struct follows the established pattern for encode operations and is correctly configured.
316-326: Verify that the unusedfinalparameter aligns with Python's codec API.The
r#finalparameter is marked as dead code. While this might be intentional for API compatibility with Python's incremental decoder interface, please confirm this matches the expected signature formbcs_decodein CPython.
417-424: LGTM!The argument struct is correctly defined and follows the established pattern.
505-515: Verify that the unusedfinalparameter aligns with Python's codec API.Same as
MbcsDecodeArgs, ther#finalparameter is unused. Confirm this matches the expected signature foroem_decodein CPython.
310-314: LGTM!The non-Windows fallback implementations correctly delegate to Python's
_pycodecsmodule, ensuring cross-platform compatibility. The configuration attributes are properly set.Also applies to: 411-415, 499-503, 600-604
| #[cfg(windows)] | ||
| #[pyfunction] | ||
| fn mbcs_encode(args: MbcsEncodeArgs, vm: &VirtualMachine) -> PyResult<(Vec<u8>, usize)> { | ||
| use std::os::windows::ffi::OsStrExt; | ||
| use windows_sys::Win32::Globalization::{ | ||
| CP_ACP, WC_NO_BEST_FIT_CHARS, WideCharToMultiByte, | ||
| }; | ||
|
|
||
| let errors = args.errors.as_ref().map(|s| s.as_str()).unwrap_or("strict"); | ||
| let s = args.s.as_str(); | ||
| let char_len = args.s.char_len(); | ||
|
|
||
| if s.is_empty() { | ||
| return Ok((Vec::new(), char_len)); | ||
| } | ||
|
|
||
| // Convert UTF-8 string to UTF-16 | ||
| let wide: Vec<u16> = std::ffi::OsStr::new(s).encode_wide().collect(); | ||
|
|
||
| // Get the required buffer size | ||
| let size = unsafe { | ||
| WideCharToMultiByte( | ||
| CP_ACP, | ||
| WC_NO_BEST_FIT_CHARS, | ||
| wide.as_ptr(), | ||
| wide.len() as i32, | ||
| std::ptr::null_mut(), | ||
| 0, | ||
| std::ptr::null(), | ||
| std::ptr::null_mut(), | ||
| ) | ||
| }; | ||
|
|
||
| if size == 0 { | ||
| let err = std::io::Error::last_os_error(); | ||
| return Err(vm.new_os_error(format!("mbcs_encode failed: {}", err))); | ||
| } | ||
|
|
||
| let mut buffer = vec![0u8; size as usize]; | ||
| let mut used_default_char: i32 = 0; | ||
|
|
||
| let result = unsafe { | ||
| WideCharToMultiByte( | ||
| CP_ACP, | ||
| WC_NO_BEST_FIT_CHARS, | ||
| wide.as_ptr(), | ||
| wide.len() as i32, | ||
| buffer.as_mut_ptr().cast(), | ||
| size, | ||
| std::ptr::null(), | ||
| if errors == "strict" { | ||
| &mut used_default_char | ||
| } else { | ||
| std::ptr::null_mut() | ||
| }, | ||
| ) | ||
| }; | ||
|
|
||
| if result == 0 { | ||
| let err = std::io::Error::last_os_error(); | ||
| return Err(vm.new_os_error(format!("mbcs_encode failed: {err}"))); | ||
| } | ||
|
|
||
| if errors == "strict" && used_default_char != 0 { | ||
| return Err(vm.new_unicode_encode_error( | ||
| "'mbcs' codec can't encode characters: invalid character", | ||
| )); | ||
| } | ||
|
|
||
| buffer.truncate(result as usize); | ||
| Ok((buffer, char_len)) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
🏁 Script executed:
# Search for tests related to mbcs codec
rg -n "mbcs" --type=rs --type=py -iRepository: RustPython/RustPython
Length of output: 186
🏁 Script executed:
# Look for error handler validation patterns in the codebase
rg -n "errors.*strict|ignore|replace" --type=rs | head -30Repository: RustPython/RustPython
Length of output: 186
🏁 Script executed:
# Check the broader context around codecs.rs to understand how other encode functions handle errors
rg -n "errors.*as_ref|errors.*unwrap_or" --type=rs -B2 -A2Repository: RustPython/RustPython
Length of output: 186
🏁 Script executed:
# Check the codecs.rs file structure to see if there are other encode functions
cat -n crates/vm/src/stdlib/codecs.rs | head -100Repository: RustPython/RustPython
Length of output: 186
🏁 Script executed:
# Check the entire mbcs_encode function and surrounding context
sed -n '237,310p' crates/vm/src/stdlib/codecs.rsRepository: RustPython/RustPython
Length of output: 186
🏁 Script executed:
# Check if there are other encode functions to see how they handle errors
grep -n "fn.*encode" crates/vm/src/stdlib/codecs.rsRepository: RustPython/RustPython
Length of output: 186
🏁 Script executed:
# Let's see what other encoding functions look like to understand error handling patterns
sed -n '1,150p' crates/vm/src/stdlib/codecs.rsRepository: RustPython/RustPython
Length of output: 186
🏁 Script executed:
# Let's also check if there are any other error-related constants being used
rg -n "CP_ACP|WC_NO_BEST_FIT_CHARS" --type=rs -B5 -A5Repository: RustPython/RustPython
Length of output: 186
🏁 Script executed:
# Look for any test files that test this function
find . -name "*.rs" -type f | xargs grep -l "mbcs" 2>/dev/nullRepository: RustPython/RustPython
Length of output: 186
🏁 Script executed:
# Check if there are Python stdlib tests we can reference
find . -name "*.py" -type f | xargs grep -l "mbcs" 2>/dev/null | head -10Repository: RustPython/RustPython
Length of output: 186
🏁 Script executed:
# Let's check the entire file to understand the error handling context
wc -l crates/vm/src/stdlib/codecs.rsRepository: RustPython/RustPython
Length of output: 186
🏁 Script executed:
# Check the full codecs.rs file for other encode functions and their error handling
sed -n '100,250p' crates/vm/src/stdlib/codecs.rsRepository: RustPython/RustPython
Length of output: 186
🏁 Script executed:
# Let's check if there are any comments or documentation about error handling in this file
grep -n "error\|Error\|ERROR" crates/vm/src/stdlib/codecs.rs | head -20Repository: RustPython/RustPython
Length of output: 186
🏁 Script executed:
# Let's understand what the "errors" parameter should control by checking Python documentation
# and looking at the actual Windows API behavior
# Check if this function validates the errors parameter properly
sed -n '237,250p' crates/vm/src/stdlib/codecs.rsRepository: RustPython/RustPython
Length of output: 186
🏁 Script executed:
# Check what MbcsEncodeArgs structure looks like
rg -n "struct MbcsEncodeArgs" --type=rs -A10Repository: RustPython/RustPython
Length of output: 186
🏁 Script executed:
# Let's look at the broader context of the file
sed -n '1,50p' crates/vm/src/stdlib/codecs.rsRepository: RustPython/RustPython
Length of output: 186
🏁 Script executed:
# Find the MbcsEncodeArgs definition
grep -n "MbcsEncodeArgs" crates/vm/src/stdlib/codecs.rsRepository: RustPython/RustPython
Length of output: 186
🏁 Script executed:
# Get the definition of MbcsEncodeArgs
sed -n '210,236p' crates/vm/src/stdlib/codecs.rsRepository: RustPython/RustPython
Length of output: 186
🏁 Script executed:
# Let's check if there's a utf_8_encode or similar function to see how errors are handled there
sed -n '50,150p' crates/vm/src/stdlib/codecs.rs | grep -A20 "fn utf"Repository: RustPython/RustPython
Length of output: 186
🏁 Script executed:
# Let's look for any utf_8_encode function in the file
rg -n "fn utf_8_encode" --type=rs -A30Repository: RustPython/RustPython
Length of output: 186
🏁 Script executed:
# Let's understand the complete behavior by looking at lines around mbcs_encode
sed -n '230,330p' crates/vm/src/stdlib/codecs.rsRepository: RustPython/RustPython
Length of output: 186
🏁 Script executed:
# Check commit history for mbcs_encode changes
git log --all -p --follow -S "mbcs_encode" -- crates/vm/src/stdlib/codecs.rs | head -100Repository: RustPython/RustPython
Length of output: 186
🏁 Script executed:
# Check the git blame to understand the history
git blame -L237,310 crates/vm/src/stdlib/codecs.rs | head -20Repository: RustPython/RustPython
Length of output: 186
Implement error handling for non-strict error modes in mbcs_encode.
The current implementation only properly handles "strict" error mode. Non-strict modes ("ignore", "replace", "xmlcharrefreplace", "backslashreplace") lack proper implementation:
- Invalid error mode names are silently treated as non-strict (no validation at line 245)
- Only "strict" mode tracks unmappable characters via
used_default_char(lines 294-302) - Non-strict modes don't implement their specified behavior: "ignore" should skip unmappable characters, "replace" should substitute them, and other modes require custom handling
- The
WC_NO_BEST_FIT_CHARSflag is always set regardless of error mode, but the code doesn't explicitly handle the consequences in non-strict modes
This causes non-strict error modes to exhibit undefined behavior dependent on the Windows MBCS implementation rather than Python's codec specification.
🤖 Prompt for AI Agents
In crates/vm/src/stdlib/codecs.rs around lines 237 to 308, the mbcs_encode
currently validates only "strict" and doesn't implement behavior for non-strict
modes: validate the errors argument against allowed modes
("strict","ignore","replace","xmlcharrefreplace","backslashreplace") and return
a ValueError for invalid names; always request WideCharToMultiByte to report
default-char usage (track used_default_char) and, after conversion, if
used_default_char!=0 handle unmappable characters according to the selected
mode: for "strict" raise a UnicodeEncodeError as now; for "ignore" remove the
bytes corresponding to unmappable input characters (skip them from the output
and adjust returned char count appropriately); for "replace" substitute
unmappable characters with a single replacement byte (e.g. b'?' or the system
best-fit char provided by the conversion) while preserving char length
semantics; for "backslashreplace" and "xmlcharrefreplace" post-process the
original UTF-8 input to replace each unmappable character with the corresponding
ASCII escape or XML character reference and then re-encode that sequence to
bytes; ensure WC_NO_BEST_FIT_CHARS is cleared or set appropriately per mode so
Windows does/doesn't do best-fit mapping, and update buffer
construction/truncation and returned char length to reflect any removals or
insertions.
| #[cfg(windows)] | ||
| #[pyfunction] | ||
| fn oem_decode(args: OemDecodeArgs, vm: &VirtualMachine) -> PyResult<(String, usize)> { | ||
| use windows_sys::Win32::Globalization::{ | ||
| CP_OEMCP, MB_ERR_INVALID_CHARS, MultiByteToWideChar, | ||
| }; | ||
|
|
||
| let _errors = args.errors.as_ref().map(|s| s.as_str()).unwrap_or("strict"); | ||
| let data = args.data.borrow_buf(); | ||
| let len = data.len(); | ||
|
|
||
| if data.is_empty() { | ||
| return Ok((String::new(), 0)); | ||
| } | ||
|
|
||
| // Get the required buffer size for UTF-16 | ||
| let size = unsafe { | ||
| MultiByteToWideChar( | ||
| CP_OEMCP, | ||
| MB_ERR_INVALID_CHARS, | ||
| data.as_ptr().cast(), | ||
| len as i32, | ||
| std::ptr::null_mut(), | ||
| 0, | ||
| ) | ||
| }; | ||
|
|
||
| if size == 0 { | ||
| // Try without MB_ERR_INVALID_CHARS for non-strict mode | ||
| let size = unsafe { | ||
| MultiByteToWideChar( | ||
| CP_OEMCP, | ||
| 0, | ||
| data.as_ptr().cast(), | ||
| len as i32, | ||
| std::ptr::null_mut(), | ||
| 0, | ||
| ) | ||
| }; | ||
| if size == 0 { | ||
| let err = std::io::Error::last_os_error(); | ||
| return Err(vm.new_os_error(format!("oem_decode failed: {}", err))); | ||
| } | ||
| } | ||
|
|
||
| let size = unsafe { | ||
| MultiByteToWideChar( | ||
| CP_OEMCP, | ||
| 0, // Use 0 flags for replacement behavior | ||
| data.as_ptr().cast(), | ||
| len as i32, | ||
| std::ptr::null_mut(), | ||
| 0, | ||
| ) | ||
| }; | ||
|
|
||
| let mut buffer = vec![0u16; size as usize]; | ||
|
|
||
| let result = unsafe { | ||
| MultiByteToWideChar( | ||
| CP_OEMCP, | ||
| 0, | ||
| data.as_ptr().cast(), | ||
| len as i32, | ||
| buffer.as_mut_ptr(), | ||
| size, | ||
| ) | ||
| }; | ||
|
|
||
| if result == 0 { | ||
| let err = std::io::Error::last_os_error(); | ||
| return Err(vm.new_os_error(format!("oem_decode failed: {}", err))); | ||
| } | ||
|
|
||
| buffer.truncate(result as usize); | ||
|
|
||
| // Convert UTF-16 to UTF-8 String | ||
| let s = String::from_utf16(&buffer) | ||
| .map_err(|e| vm.new_unicode_decode_error(format!("oem_decode failed: {}", e)))?; | ||
|
|
||
| Ok((s, len)) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same critical issues as mbcs_decode, plus code duplication.
This function has identical issues to mbcs_decode:
- Critical: Lines 562-571 unconditionally recalculate the buffer size, making the error-checking code path (lines 533-560) ineffective.
- Major: The
_errorsparameter is unused.
Additionally, oem_decode and mbcs_decode differ only in the code page constant. Consider refactoring to share a common implementation.
Apply the same fix pattern as suggested for mbcs_decode, and consider extracting a shared helper function:
fn windows_decode(
data: &[u8],
code_page: u32,
errors: &str,
codec_name: &str,
vm: &VirtualMachine,
) -> PyResult<(String, usize)> {
// Shared implementation
}🤖 Prompt for AI Agents
In crates/vm/src/stdlib/codecs.rs around lines 517 to 598, oem_decode duplicates
mbcs_decode issues: it unconditionally recalculates the UTF‑16 buffer size
(making the earlier error-checking branch ineffective) and ignores the errors
parameter; refactor to follow the mbcs_decode fix by 1) compute required size
once using MB_ERR_INVALID_CHARS when errors == "strict", otherwise try
MB_ERR_INVALID_CHARS first and fall back to 0 and propagate the OS error only if
both calls return 0; 2) remove the redundant second size calculation and use the
returned size to allocate the buffer and perform the conversion; 3) use the
errors value (drop the leading underscore) to control strict vs replacement
behavior; and 4) extract a shared helper (e.g. windows_decode(data, code_page,
errors, codec_name, vm)) that contains the common logic and call it from
oem_decode with CP_OEMCP as the code_page to eliminate duplication.
c623c56 to
86e1a0b
Compare
ShaharNaveh
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gonna approve it, although I don't understand anything in Windows tbh:/
|
oh, sorry. I will try not to bring you to windows topics too much |
|
Windows is fine, it's just that my knowledge there is so little (compared to *nix systems), I don't think I can provide useful insights 😅 |
05e888e to
cdf39bf
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
♻️ Duplicate comments (4)
crates/vm/src/stdlib/codecs.rs (4)
237-316: Non-strict error modes lack proper implementation.The function only implements "strict" error handling (lines 308-312). Non-strict modes ("ignore", "replace", "xmlcharrefreplace", "backslashreplace") silently receive Windows default behavior rather than Python codec semantics, and invalid error mode names are not validated.
As per past review comments, proper error mode implementations should:
- Validate the errors parameter against allowed modes
- Implement "ignore" to skip unmappable characters
- Implement "replace" with appropriate substitution
- Implement "backslashreplace" and "xmlcharrefreplace" with proper escape sequences
439-518: Extract shared implementation to eliminate duplication.This function duplicates
mbcs_encode(lines 237-316), differing only in the code page constant (CP_OEMCPvsCP_ACP) and error messages. Additionally, it has the same incomplete error handling for non-strict modes.As per past review comments and coding guidelines, extract a shared helper function:
fn windows_encode( s: &str, char_len: usize, errors: &str, code_page: u32, codec_name: &str, vm: &VirtualMachine, ) -> PyResult<(Vec<u8>, usize)> { // Shared implementation with code_page parameter }Then call from both
mbcs_encodewithCP_ACPandoem_encodewithCP_OEMCP.
538-624: Extract shared implementation and fix unused errors parameter.This function duplicates
mbcs_decode(lines 336-422), differing only in the code page constant. Additionally, it has the same unused_errorsparameter (line 545).As per past review comments and coding guidelines, extract a shared helper:
fn windows_decode( data: &[u8], code_page: u32, errors: &str, codec_name: &str, vm: &VirtualMachine, ) -> PyResult<(String, usize)> { // Shared implementation }Call from both functions with appropriate code page constants. Also remove the underscore prefix from the
errorsparameter and use it to control decode behavior.
336-422: Unusederrorsparameter prevents proper error handling control.The
_errorsparameter (line 343) is extracted but never used to control decode behavior. The function unconditionally triesMB_ERR_INVALID_CHARSfirst (line 355) and falls back to flags=0 (line 368), regardless of the user's requested error mode. This means users cannot select strict-only behavior or other specific error handling strategies.Consider removing the underscore prefix and using the
errorsvalue to control whether to attempt strict validation or directly use replacement behavior:- let _errors = args.errors.as_ref().map(|s| s.as_str()).unwrap_or("strict"); + let errors = args.errors.as_ref().map(|s| s.as_str()).unwrap_or("strict");Then use
errorsto determine flags:
- If
errors == "strict", use onlyMB_ERR_INVALID_CHARSand fail on invalid sequences- Otherwise, use flags=0 for replacement behavior
🧹 Nitpick comments (3)
crates/vm/src/stdlib/codecs.rs (3)
324-334: Consider implementing thefinalparameter.The
r#finalparameter is currently unused (line 333). While this matches the function signature for codec compatibility, consider whether incremental decoding support should be implemented for completeness.
430-437: LGTM with note on duplication.The
OemEncodeArgsstruct is identical toMbcsEncodeArgs. Consider whether these could be consolidated if the encode functions are refactored to share implementation.
526-536: LGTM with duplication note.The
OemDecodeArgsstruct duplicatesMbcsDecodeArgs. If decode functions are refactored to share implementation, these structs could potentially be consolidated.
📜 Review details
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
Lib/test/test_subprocess.pyis excluded by!Lib/**
📒 Files selected for processing (1)
crates/vm/src/stdlib/codecs.rs(1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.rs
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.rs: Follow the default rustfmt code style by runningcargo fmtto format Rust code
Always run clippy to lint Rust code (cargo clippy) before completing tasks and fix any warnings or lints introduced by changes
Follow Rust best practices for error handling and memory management
Use the macro system (pyclass,pymodule,pyfunction, etc.) when implementing Python functionality in Rust
Files:
crates/vm/src/stdlib/codecs.rs
🧬 Code graph analysis (1)
crates/vm/src/stdlib/codecs.rs (1)
crates/vm/src/stdlib/nt.rs (1)
from_utf16(327-329)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
- GitHub Check: Run snippets and cpython tests (windows-latest)
- GitHub Check: Run snippets and cpython tests on wasm-wasi
- GitHub Check: Run snippets and cpython tests (ubuntu-latest)
- GitHub Check: Check the WASM package and demo
- GitHub Check: Run snippets and cpython tests (macos-latest)
- GitHub Check: Check Rust code with clippy
- GitHub Check: Run rust tests (macos-latest)
- GitHub Check: Run rust tests (windows-latest)
- GitHub Check: Run rust tests (ubuntu-latest)
- GitHub Check: Ensure compilation on various targets
🔇 Additional comments (1)
crates/vm/src/stdlib/codecs.rs (1)
228-235: LGTM!The
MbcsEncodeArgsstruct follows the established pattern for codec argument structures.
Summary by CodeRabbit
✏️ Tip: You can customize this high-level summary in your review settings.