Skip to content

BUG: Fix _most_frequent to safely handle incomparable types (e.g. Non… #31737

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

AHB30
Copy link

@AHB30 AHB30 commented Jul 10, 2025

Fixes #31717

Hi maintainers

This PR addresses a TypeError in _most_frequent when arrays contain mixed, incomparable types (e.g., None, str, int).

The patch ensures safe, deterministic tie-breaking using type + hash, in line with scipy.stats.mode.

Unit tests are included to cover:

  • Tie between None and str
  • Dominant None
  • Empty array handling
  • Tie-breaking against extra_value

Thanks for your time and review!

…e vs str)

This patch improves `_most_frequent` to gracefully handle arrays with mixed, incomparable types (e.g. `None`, `str`, `int`), which previously raised `TypeError` due to Python's `min()` not supporting heterogeneous comparisons.

Fix:
- Uses `type` + `hash` as a tie-breaking key to avoid direct comparison between incompatible types.
- Keeps behaviour consistent with `scipy.stats.mode` tie-breaking logic.
- Maintains deterministic and efficient output.

Adds unit tests for edge cases:
- Tie between `None` and `str`
- Dominant `None`
- Empty array handling
- Tie involving extra_value

Closes: scikit-learn#31717
Copy link

github-actions bot commented Jul 10, 2025

❌ Linting issues

This PR is introducing linting issues. Here's a summary of the issues. Note that you can avoid having linting issues by enabling pre-commit hooks. Instructions to enable them can be found here.

You can see the details of the linting issues under the lint job here


ruff check

ruff detected issues. Please run ruff check --fix --output-format=full locally, fix the remaining issues, and push the changes. Here you can see the detected issues. Note that the installed ruff version is ruff=0.11.7.


sklearn/impute/_base.py:18:27: F401 [*] `..utils.fixes._mode` imported but unused
   |
16 | from ..utils._missing import is_pandas_na, is_scalar_nan
17 | from ..utils._param_validation import MissingValues, StrOptions
18 | from ..utils.fixes import _mode
   |                           ^^^^^ F401
19 | from ..utils.sparsefuncs import _get_median
20 | from ..utils.validation import (
   |
   = help: Remove unused import: `..utils.fixes._mode`

Found 1 error.
[*] 1 fixable with the `--fix` option.

Generated for commit: 574b23d. Link to the linter CI: here

AHB30 added 5 commits July 10, 2025 20:59
This update improves the _most_frequent function to gracefully handle arrays containing mixed, incomparable types (e.g., None, str, int). The patch uses a deterministic tie-breaking strategy based on type and hash to avoid TypeErrors, consistent with scipy.stats.mode behaviour.
…linting

This update improves the _most_frequent function by:

Ensuring safe, deterministic tie-breaking for mixed, incomparable types (e.g., None, str, int) using (type, hash) as a key.

Fixing docstring formatting and style issues to meet project linting requirements.

Aligning logic with scipy.stats.mode tie-breaking behavior.

Improving code readability and PEP-8 compliance.
Ensure safe comparison in _most_frequent by using (str(type(x)), hash(x)) as a tie-breaking key.

- Handles mixed types like None, str, int.
- Prevents TypeError from Python's min() on incomparable values.
- Aligns with scipy.stats.mode behaviour.
- Ready for review and linted locally.
…aking

This patch improves the _most_frequent function to safely process arrays with mixed, incomparable types (e.g., None, str, int) without raising TypeError.
Copy link
Contributor

@0xs1d 0xs1d left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm with vectorization and type check. check the ci linting issue, though

@jeremiedbb
Copy link
Member

Thanks for the PR @AHB30. The hash based comparison solution is interesting but breaks the existing behavior when types are not mixed. We'd like to preserve the existing behavior if possible. So the idea would be to only use hash based comparison when types are not comparable, something like what is proposed here #31717 (comment).

Copy link
Member

@jeremiedbb jeremiedbb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are linting issues that you can fix by following the instructions here #31737 (comment)

elif most_frequent_count == n_repeat:
# tie breaking similarly to scipy.stats.mode
return min(most_frequent_value, extra_value)
else:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
else:
else: # most_frequent_count == n_repeat

Comment on lines +41 to +45
def _most_frequent(array: np.ndarray, extra_value, n_repeat: int):
"""Compute the most frequent value in a 1D array extended with
[extra_value] * n_repeat, where extra_value is assumed to be not part
of the array."""
# Compute the most frequent value in array only
of the array.
"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please revert these unrelated changes

@betatim
Copy link
Member

betatim commented Jul 23, 2025

Let's see if we get responses to the review comments. There is also #31820 which was created as a result of the discussion in the issue, so might need less back and forth. Plus I want to reward people who participate in and wait for discussions to come to a consensus before creating PRs - I find those PRs need a lot less iterating.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

SimpleImputer fails in "most_frequent" if incomparable types only if ties
4 participants