BUG: Fix _most_frequent to safely handle incomparable types (e.g. Non… #31737

AHB30 · 2025-07-10T17:50:43Z

Hi maintainers

This PR addresses a TypeError in _most_frequent when arrays contain mixed, incomparable types (e.g., None, str, int).

The patch ensures safe, deterministic tie-breaking using type + hash, in line with scipy.stats.mode.
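For illustration, a minimal sketch of the failure and of the tie-breaking key described above. The helper name `tie_key` is hypothetical and is not taken from the patch; only the `(str(type(x)), hash(x))` shape comes from the commit messages in this PR.

```python
def tie_key(value):
    # Hypothetical helper: order by the type's string form first, then by hash,
    # so values of different types are never compared with each other directly.
    return (str(type(value)), hash(value))


# Direct comparison of incomparable types fails.
try:
    min(None, "a")
    direct_comparison_failed = False
except TypeError:
    direct_comparison_failed = True

# With the key, min() compares tuples of (str, int) instead of the values.
winner = min([None, "a"], key=tie_key)
print(direct_comparison_failed, winner)
```

Here `None` wins because `"<class 'NoneType'>"` sorts before `"<class 'str'>"`. When the tied values share a type, the result depends on `hash`, which for `str` varies between interpreter runs unless `PYTHONHASHSEED` is fixed.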

Unit tests are included to cover:

- Tie between None and str
- Dominant None
- Empty array handling
- Tie-breaking against extra_value
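The four cases above could be exercised roughly as follows. This is a sketch against a simplified, pure-Python stand-in (`most_frequent_sketch`, a hypothetical name), not against scikit-learn's `_most_frequent` and not the tests shipped in this PR. The stand-in assumes hashable values and reuses the `(str(type(x)), hash(x))` key from the description.

```python
import math
from collections import Counter


def _tie_key(value):
    # Hypothetical key: type string first, then hash.
    return (str(type(value)), hash(value))


def most_frequent_sketch(values, extra_value, n_repeat):
    """Mode of `values` extended with [extra_value] * n_repeat (simplified stand-in)."""
    counts = Counter(values)
    if counts:
        top_count = max(counts.values())
        # Break ties inside `values` with the key, not a direct comparison.
        tied = [v for v, c in counts.items() if c == top_count]
        top_value = min(tied, key=_tie_key)
    else:
        top_count, top_value = 0, None

    if top_count == 0 and n_repeat == 0:
        return math.nan  # nothing to choose from
    if top_count < n_repeat:
        return extra_value
    if top_count > n_repeat:
        return top_value
    # Equal counts: tie between the array's mode and extra_value.
    return min(top_value, extra_value, key=_tie_key)


# Tie between None and str
assert most_frequent_sketch([None, "a"], 0, 0) is None
# Dominant None
assert most_frequent_sketch([None, None, "a"], 0, 0) is None
# Empty array handling
assert math.isnan(most_frequent_sketch([], 0, 0))
# Tie-breaking against extra_value
assert most_frequent_sketch(["a"], None, 1) is None
```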

Thanks for your time and review!

…e vs str)

This patch improves `_most_frequent` to gracefully handle arrays with mixed, incomparable types (e.g. `None`, `str`, `int`), which previously raised `TypeError` due to Python's `min()` not supporting heterogeneous comparisons.

Fix:
- Uses `type` + `hash` as a tie-breaking key to avoid direct comparison between incompatible types.
- Keeps behaviour consistent with `scipy.stats.mode` tie-breaking logic.
- Maintains deterministic and efficient output.

Adds unit tests for edge cases:
- Tie between `None` and `str`
- Dominant `None`
- Empty array handling
- Tie involving extra_value

Closes: scikit-learn#31717

github-actions · 2025-07-10T17:51:23Z

❌ Linting issues

This PR is introducing linting issues. Here's a summary of the issues. Note that you can avoid having linting issues by enabling pre-commit hooks. Instructions to enable them can be found here.

You can see the details of the linting issues under the lint job here

`ruff check`

ruff detected issues. Please run `ruff check --fix --output-format=full` locally, fix the remaining issues, and push the changes. Here you can see the detected issues. Note that the installed ruff version is ruff=0.11.7.


sklearn/impute/_base.py:18:27: F401 [*] `..utils.fixes._mode` imported but unused
   |
16 | from ..utils._missing import is_pandas_na, is_scalar_nan
17 | from ..utils._param_validation import MissingValues, StrOptions
18 | from ..utils.fixes import _mode
   |                           ^^^^^ F401
19 | from ..utils.sparsefuncs import _get_median
20 | from ..utils.validation import (
   |
   = help: Remove unused import: `..utils.fixes._mode`

Found 1 error.
[*] 1 fixable with the `--fix` option.

Generated for commit: 574b23d. Link to the linter CI: here

This update improves the _most_frequent function to gracefully handle arrays containing mixed, incomparable types (e.g., None, str, int). The patch uses a deterministic tie-breaking strategy based on type and hash to avoid TypeErrors, consistent with scipy.stats.mode behaviour.

…linting

This update improves the _most_frequent function by:
- Ensuring safe, deterministic tie-breaking for mixed, incomparable types (e.g., None, str, int) using (type, hash) as a key.
- Fixing docstring formatting and style issues to meet project linting requirements.
- Aligning logic with scipy.stats.mode tie-breaking behavior.
- Improving code readability and PEP-8 compliance.

Ensure safe comparison in _most_frequent by using (str(type(x)), hash(x)) as a tie-breaking key.
- Handles mixed types like None, str, int.
- Prevents TypeError from Python's min() on incomparable values.
- Aligns with scipy.stats.mode behaviour.
- Ready for review and linted locally.

…aking This patch improves the _most_frequent function to safely process arrays with mixed, incomparable types (e.g., None, str, int) without raising TypeError.

0xs1d

LGTM with vectorization and type check. Check the CI linting issue, though.

jeremiedbb · 2025-07-22T13:49:10Z

Thanks for the PR @AHB30. The hash-based comparison solution is interesting, but it breaks the existing behavior when types are not mixed. We'd like to preserve the existing behavior if possible. So the idea would be to use hash-based comparison only when the types are not comparable, something like what is proposed here #31717 (comment).
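A minimal sketch of that idea, assuming a hypothetical helper `safe_min`. This is not the code proposed in #31717, and the fallback key is simply the one from this PR's description.

```python
def _tie_key(value):
    # Hypothetical fallback key from the PR description: type string, then hash.
    return (str(type(value)), hash(value))


def safe_min(a, b):
    """Return min(a, b); use the key only when a and b cannot be compared."""
    try:
        # Comparable values keep the existing behavior.
        return min(a, b)
    except TypeError:
        # Incomparable values (e.g. None vs str) fall back to the key.
        return min(a, b, key=_tie_key)


print(safe_min(2, 1))        # comparable ints: plain min()
print(safe_min("b", "a"))    # comparable strings: plain min()
print(safe_min(None, "a"))   # incomparable: falls back to the key
```

With this shape, same-type ties still resolve through the ordinary `min`, so results for non-mixed arrays are unchanged, and the key is consulted only after a `TypeError`.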

jeremiedbb

There are linting issues that you can fix by following the instructions here #31737 (comment)

jeremiedbb · 2025-07-22T13:50:20Z

sklearn/impute/_base.py

-    elif most_frequent_count == n_repeat:
-        # tie breaking similarly to scipy.stats.mode
-        return min(most_frequent_value, extra_value)
+    else:


Suggested change

-    else:
+    else:  # most_frequent_count == n_repeat

jeremiedbb · 2025-07-22T13:50:23Z

sklearn/impute/_base.py

+def _most_frequent(array: np.ndarray, extra_value, n_repeat: int):
+    """Compute the most frequent value in a 1D array extended with
    [extra_value] * n_repeat, where extra_value is assumed to be not part
-    of the array."""
-    # Compute the most frequent value in array only
+    of the array.
+    """


Please revert these unrelated changes

betatim · 2025-07-23T06:52:37Z

Let's see if we get responses to the review comments. There is also #31820 which was created as a result of the discussion in the issue, so might need less back and forth. Plus I want to reward people who participate in and wait for discussions to come to a consensus before creating PRs - I find those PRs need a lot less iterating.

github-actions bot added the module:impute label Jul 10, 2025

AHB30 added 5 commits July 10, 2025 20:59

Merge branch 'main' into main

e4a9cba

BUG: Fix _most_frequent to handle mixed-type arrays with safe tie-bre…

574b23d

…aking This patch improves the _most_frequent function to safely process arrays with mixed, incomparable types (e.g., None, str, int) without raising TypeError.

0xs1d reviewed Jul 12, 2025

jeremiedbb mentioned this pull request Jul 22, 2025

SimpleImputer fails in "most_frequent" if incomparable types only if ties #31717

Open

jeremiedbb reviewed Jul 22, 2025
