Skip to content

Add kde function and tests to RustPython statistics module #6030

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

jackoconnordev
Copy link
Contributor

@jackoconnordev jackoconnordev commented Jul 24, 2025

Description

Adds the kde function to RustPython's standard library alongside Python unit tests. Function and tests were copied as is from CPython 3.13, but in order for the new tests to complete on the order of seconds instead of hours I adjusted some "magic constants".

From my local testing, the builtin sorted function is the bottleneck here. CPython will sort a list of 1 million random integers in a few hundred milliseconds making it suitable to be used in a unit test. RustPython is not yet capable of this.

Using kde in the interpreter

This example is taken from kde function's docstring.

>>>>> from statistics import kde
>>>>> sample = [-2.1, -1.3, -0.4, 1.9, 5.1, 6.2]
>>>>> f_hat = kde(sample, h=1.5)
>>>>> sum(f_hat(x) for x in range(-20, 20))
1.0
>>>>> # ^ sum of probabilities == 1
>>>>> 
>>>>> cdf = kde(sample, h=1.5, cumulative=True)
>>>>> round(cdf(7.5) - cdf(4.5), 2)
0.22
>>>>> # ^ probability of a value between 7.5 and 4.5 being drawn from the sample's estimated probability distribution

whats_left.py diff

Following is the result of diffing the output of python -I whats_left.py with and without the kde function added.

✦ ❯ diff with-kde.txt without-kde.txt 
1449a1450,1467
> statistics._SQRT2
> statistics.__annotations__
> statistics._decimal_sqrt_of_frac
> statistics._float_sqrt_of_frac
> statistics._integer_sqrt_of_frac_rto
> statistics._kernel_invcdfs
> statistics._mean_stdev
> statistics._newton_raphson
> statistics._quartic_invcdf
> statistics._quartic_invcdf_estimate
> statistics._rank
> statistics._sqrt_bit_width
> statistics._sqrtprod
> statistics._triweight_invcdf
> statistics._triweight_invcdf_estimate
> statistics.kde
> statistics.kde_random
> statistics.pi
1710c1728
< #  implemented 94
---
> #  implemented 93
1713,1715c1731,1733
< #  missing_items 115
< #  mismatched_items 67
< #  mismatched_doc_items 60
---
> #  missing_items 116
> #  mismatched_items 68
> #  mismatched_doc_items 61

Example output showing original numbers are too slow to be appropriate

# Example of running TestKDE without adjusting the numbers

✦ ❯ time cargo run --release -- -m test -v -u cpu test_statistics -m TestKDE
    Finished `release` profile [optimized] target(s) in 0.17s
     Running `target/release/rustpython -m test -v -u cpu test_statistics -m TestKDE`
== RustPython 3.13.0alpha (heads/main:96f47a415, Jul 23 2025, 04:28:38) [RustPython 0.4.0 with rustc 1.88.0 (6b00bc388 2025-06-23)]
== Linux-6.12.8-200.fc41.x86_64-x86_64-with-glibc2.35 little-endian
== cwd: /tmp/test_python_318016
== CPU count: 12
== encodings: locale=utf-8, FS=utf-8
Run tests sequentially
0:00:00 load avg: 1.28 [1/1] test_statistics
test_kde (test.test_statistics.TestKDE.test_kde) ... ok
test_kde_kernel_invcdfs (test.test_statistics.TestKDE.test_kde_kernel_invcdfs) ... ok
test_kde_random (test.test_statistics.TestKDE.test_kde_random) ... ^C
^Z
[2]+  Stopped                 cargo run --release -- -m test -v -u cpu test_statistics -m TestKDE

real	55m28.234s
user	0m0.000s
sys	0m0.007s

Summary by CodeRabbit

  • Bug Fixes
    • Improved input validation for the normal distribution inverse CDF function, allowing more flexibility in parameter values.

Copy link
Contributor

coderabbitai bot commented Jul 24, 2025

Walkthrough

The input validation in the normal_dist_inv_cdf function was updated to remove the explicit check for sigma <= 0.0. Now, the function only returns None if the probability p is not strictly between 0 and 1. All other logic in the function remains unchanged.

Changes

File(s) Change Summary
stdlib/src/statistics.rs Modified input validation in normal_dist_inv_cdf to remove the sigma <= 0.0 check.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~2 minutes

Poem

A bunny hopped through fields of code,
Where sigma's check was once bestowed.
Now lighter bounds on p remain,
Simpler logic, less to explain!
In statistics' gentle spring,
The rabbit smiles at what small tweaks can bring.

Note

⚡️ Unit Test Generation is now available in beta!

Learn more here, or try it out under "Finishing Touches" below.


📜 Recent review details

Configuration used: .coderabbit.yml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 96f47a4 and 3ef5264.

⛔ Files ignored due to path filters (2)
  • Lib/statistics.py is excluded by !Lib/**
  • Lib/test/test_statistics.py is excluded by !Lib/**
📒 Files selected for processing (1)
  • stdlib/src/statistics.rs (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.rs

📄 CodeRabbit Inference Engine (.github/copilot-instructions.md)

**/*.rs: Follow the default rustfmt code style (cargo fmt to format)
Always run clippy to lint code (cargo clippy) before completing tasks. Fix any warnings or lints that are introduced by your changes
Follow Rust best practices for error handling and memory management
Use the macro system (pyclass, pymodule, pyfunction, etc.) when implementing Python functionality in Rust

Files:

  • stdlib/src/statistics.rs
🔇 Additional comments (1)
stdlib/src/statistics.rs (1)

10-12: LGTM: Improved CPython compatibility by removing early sigma validation.

Removing the sigma <= 0.0 check from the initial validation aligns with the CPython fix mentioned in the PR objectives. This change makes the function more permissive and mathematically consistent - the inverse CDF is primarily constrained by the probability p needing to be in the interval (0,1), while sigma validation can be handled at higher levels if needed.

✨ Finishing Touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Explain this complex logic.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai explain this code block.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and explain its main purpose.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai generate unit tests to generate unit tests for this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@jackoconnordev jackoconnordev changed the title [DRAFT] Add kde function and tests to RustPython statistics module Add kde function and tests to RustPython statistics module Jul 24, 2025
@jackoconnordev jackoconnordev marked this pull request as ready for review July 24, 2025 16:47
@jackoconnordev
Copy link
Contributor Author

I don't know if the tests will pass with this. I still haven't fully figured out a fully working local development setup. I can build a rustpython binary which works as excpected, but cargo test --no-default-features --features stdlib,importlib,stdio,encodings,sqlite,ssl which I grabbed from the CI yaml file gives me the following output.

RUSTPYTHONPATH or PYTHONPATH is set, but it wasn't loaded to `Settings::path_list`. If you are going to customize the RustPython vm/interpreter, those environment variables are not loaded in the Settings struct by default. Please try creating a customized instance of the Settings struct. If you are developing the RustPython interpreter, it might be a bug during development.
If you don't have access to a consistent external environment (e.g. targeting wasm, embedding rustpython in another application), try enabling the `freeze-stdlib` feature.
If this is intended and you want to exclude the encodings module from your interpreter, please remove the `encodings` feature from `rustpython-vm` crate.

Any pointers would be appreciated if anyone spots this and is familiar with the message 🫡.

@ShaharNaveh
Copy link
Contributor

ShaharNaveh commented Jul 24, 2025

Hey @jackoconnordev ty for the PR:)

Does:

cargo run --release -- -m test -j1 -v -u cpu test_statistics -m "TestKDE"

works for you? (assuming you're checked to your branch)


Also, it would be better if you could update both Lib/statistics.py & Lib/test/test_statistics.py from CPython 3.13, as we are trying to refrain from doing partial updates of code from CPython.

If any tests fails you can mark them with:

# TODO: RUSTPYTHON
@unittest.expectedFailure
def test_X(self):
  ...

@jackoconnordev
Copy link
Contributor Author

Happy to and thanks for the quick review! I tried that test command now and got two AttributeErrors:

  • AttributeError: module 'statistics' has no attribute '_kernel_invcdfs'
  • AttributeError: module 'statistics' has no attribute 'kde_random'

I'll put this back into Draft and try porting the entire files from CPython 💪.

@jackoconnordev jackoconnordev marked this pull request as draft July 24, 2025 17:43
## test_kde

I'm not too sure why but this one takes a few seconds to run the second
for loop which calculates the cumulative distribution and does a rough
calculation of the area under the curve.

## test_kde_random

I have a lower bound for RustPython to sort a random list of 1_000_000
numbers on my laptop of > 1 hour. By dropping n to 30_000 sort will not
take an egregious amount of time to run. It is then necessary to lower
the tolerance for the math.isclose check, or the computed values may
**randomly** fail due to the higher variance caused by the smaller
sample size.
@jackoconnordev jackoconnordev force-pushed the statistics-module-kde-function branch from d8ff7f0 to bdf71bd Compare July 24, 2025 22:21
@jackoconnordev jackoconnordev marked this pull request as ready for review July 24, 2025 22:36
youknowone
youknowone previously approved these changes Jul 25, 2025
Copy link
Member

@youknowone youknowone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for contributing!

@ShaharNaveh Thank you so much for the review!

@youknowone youknowone dismissed their stale review July 25, 2025 01:59

Test is not run yet

Copy link
Member

@youknowone youknowone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test_inv_cdf requires investigation. Adding back @unittest.expectedFailure will be enough for test_slots

@jackoconnordev
Copy link
Contributor Author

Relevant CPython PR which introduced the failing assertion to test_inv_cdf. I will try to find the equivalent for the CPython C function _normal_dist_inv_cdf in RustPython and try make a similar change.

See python/cpython#95265.

To quote:
> Restores alignment with random.gauss(mu, sigma) and
random.normalvariate(mu, sigma) both. of which are equivalent to
sampling from NormalDist(mu, sigma).inv_cdf(random()). The two functions
in the random module happy accept sigma=0 and give a well-defined
result.

> This also lets the function gently handle a sigma getting smaller,
eventually becoming zero. As sigma decrease, NormalDist(mu,
sigma).inv_cdf(p) forms a tighter and tighter internal around mu and
becoming exactly mu in the limit. For example, NormalDist(100,
1E-300).inv_cdf(0.3) cleanly evaluates to 100.0but withsigma=1e-500``
the function previously would raised an unexpected error.
@jackoconnordev
Copy link
Contributor Author

I've run the full test_statistics suite and it's passing on my laptop now cargo run --release -- -m test -v -u cpu test_statistics.

The pytest tests in extra_snippets are passing too. I added a hardcoded path to /Lib in impl Default for Settings and the cargo tests are passing locally too. Is that the normal solution for running cargo test?

@youknowone
Copy link
Member

I usually don't run cargo test, but I expect it works without any local patch.
What was the problem? Fixing the problem will be better

Copy link
Member

@youknowone youknowone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

@youknowone youknowone merged commit 4079776 into RustPython:main Jul 25, 2025
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants