MNT Refactor `_average_weighted_percentile` to avoid double sort #31775

lucyleeow · 2025-07-17T11:21:00Z

Reference Issues/PRs

Supersedes #30945

What does this implement/fix? Explain your changes.

Refactor _average_weighted_percentile so that it no longer simply calls _weighted_percentile twice, which avoids sorting and computing the cumulative sum twice.

#30945 essentially uses the sorted indices and calculates _weighted_percentile(-array, 100-percentile_rank). This was verbose and required computing the cumulative sum again on the negated array (symmetry could have been used to avoid recomputing the cumulative sum when the fraction above is greater than 0, i.e., g>0 in Hyndman and Fan).

I've followed the Hyndman and Fan computation more closely: calculate g and just use j+1 (since we already know j). This did make it more complex to handle the case where j+1 has a sample weight of 0 (or where there are sample weights of 0 at the end of the array).
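To make the difference between the two approaches concrete, here is a minimal pure-Python sketch for the 1-D case. It is not the code in this PR: the function names are hypothetical, it uses `bisect_left` and `itertools.accumulate` instead of the array API, and it sidesteps the zero-weight handling by simply filtering out zero-weight samples before sorting (the vectorised implementation cannot do that as easily). It also compares floats exactly to detect g = 0, which is only reasonable for small illustrative inputs.

```python
from bisect import bisect_left
from itertools import accumulate


def _sorted_nonzero(values, weights):
    """Sort (value, weight) pairs by value, dropping zero-weight samples.

    Dropping zero weights up front is a simplification for this sketch.
    """
    pairs = sorted((v, w) for v, w in zip(values, weights) if w > 0)
    xs = [v for v, _ in pairs]
    cum = list(accumulate(w for _, w in pairs))
    return xs, cum


def lower_weighted_percentile(values, weights, rank):
    """Inverted-CDF percentile: first value whose cumulative weight >= target."""
    xs, cum = _sorted_nonzero(values, weights)
    target = rank / 100 * cum[-1]
    i = min(bisect_left(cum, target), len(xs) - 1)
    return xs[i]


def averaged_two_pass(values, weights, rank):
    """Average of the lower percentile and the mirrored one on the negated data.

    This sorts and accumulates twice, once per call.
    """
    low = lower_weighted_percentile(values, weights, rank)
    high = -lower_weighted_percentile([-v for v in values], weights, 100 - rank)
    return (low + high) / 2


def averaged_one_pass(values, weights, rank):
    """Same result with one sort and one cumulative sum.

    If the target lands exactly on a cumulative weight (g == 0), average the
    value at that position with the next one; otherwise return the value at
    the position found by the search.
    """
    xs, cum = _sorted_nonzero(values, weights)
    target = rank / 100 * cum[-1]
    i = min(bisect_left(cum, target), len(xs) - 1)
    g_is_zero = cum[i] == target  # exact comparison, fine for this sketch
    if g_is_zero and i + 1 < len(xs):
        return (xs[i] + xs[i + 1]) / 2
    return xs[i]
```

For example, `averaged_one_pass([1, 2, 3, 4], [1, 1, 1, 1], 50)` gives `2.5`, matching `averaged_two_pass` on the same input, while a rank of 40 gives `2` because the target falls strictly between two cumulative weights.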

Any other comments?

github-actions · 2025-07-17T11:21:58Z

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

_Generated for commit: bba43c4. Link to the linter CI: here_

lucyleeow · 2025-07-17T11:25:28Z

sklearn/utils/stats.py

+
+        result = xp.where(
+            is_fraction_above,
+            array[percentile_in_sorted, col_indices],


I initially thought this should be percentile_plus_one_in_sorted since, in the paper, when g>0, $\gamma=1$, ~~but searchsorted defaults to left (equals is on the right), whereas the paper defined j <= pn < j+1~~ but searchsorted effectively gives i-1 < pn <= i whereas the paper has j <= pn < j+1. This means that when pn is strictly greater than the lower bound j (g>0), searchsorted's i equals the paper's j+1.

When the quantile exactly matches an index (g=0), searchsorted's i equals the paper's j (as the equality is on opposite sides in the paper vs searchsorted).
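The index relationship described above can be checked with a small stdlib sketch. It assumes unit sample weights, so the cumulative weights are simply 1..n, and it uses `bisect_left` as a stand-in for `searchsorted` with its default `side="left"`; the helper names are hypothetical and indices are converted to the paper's 1-based convention.

```python
import math
from bisect import bisect_left


def paper_j_and_g(pn):
    """Hyndman and Fan with m = 0: j = floor(pn), g = pn - j, so j <= pn < j + 1."""
    j = math.floor(pn)
    return j, pn - j


def searchsorted_left_one_based(cum_weights, pn):
    """1-based index i with cum_weights[i-1] < pn <= cum_weights[i] (0-based)."""
    return bisect_left(cum_weights, pn) + 1


# Unit weights for n = 5 samples: cumulative weights are 1, 2, 3, 4, 5.
cum_weights = [1, 2, 3, 4, 5]

# g > 0: pn lies strictly between two cumulative weights, so i == j + 1.
j, g = paper_j_and_g(2.5)
i = searchsorted_left_one_based(cum_weights, 2.5)
assert g > 0 and i == j + 1

# g == 0: pn matches a cumulative weight exactly, so i == j.
j, g = paper_j_and_g(2.0)
i = searchsorted_left_one_based(cum_weights, 2.0)
assert g == 0 and i == j
```

So the position returned by the search already points at x_{j+1} whenever g>0, which is why no extra `+1` is needed in that branch.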

lucyleeow · 2025-07-21T01:50:31Z

I think this is ready for review, maybe @ogrisel @betatim ?

lucyleeow added 3 commits July 14, 2025 14:58

try reverse cum sum

8fe6ae2

initial implementation, wip tests

b9c0c7b

fix and add tests, update use

b56fab0

github-actions bot added module:metrics module:preprocessing module:utils labels Jul 17, 2025

lucyleeow mentioned this pull request Jul 17, 2025

Refactor weighted percentile functions to avoid redundant sorting #30945

Closed

lucyleeow commented Jul 17, 2025

lucyleeow added 2 commits July 18, 2025 23:51

fixes and add tests

f99366c

simplify zero sample code

ba57727

lucyleeow added the No Changelog Needed label Jul 19, 2025

typos

bba43c4

lucyleeow added the Array API label Jul 24, 2025
