-
-
Notifications
You must be signed in to change notification settings - Fork 26.1k
BUG: Fix _most_frequent to safely handle incomparable types (e.g. Non… #31737
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…e vs str) This patch improves `_most_frequent` to gracefully handle arrays with mixed, incomparable types (e.g. `None`, `str`, `int`), which previously raised `TypeError` due to Python's `min()` not supporting heterogeneous comparisons. Fix: - Uses `type` + `hash` as a tie-breaking key to avoid direct comparison between incompatible types. - Keeps behaviour consistent with `scipy.stats.mode` tie-breaking logic. - Maintains deterministic and efficient output. Adds unit tests for edge cases: - Tie between `None` and `str` - Dominant `None` - Empty array handling - Tie involving extra_value Closes: scikit-learn#31717
❌ Linting issuesThis PR is introducing linting issues. Here's a summary of the issues. Note that you can avoid having linting issues by enabling You can see the details of the linting issues under the
|
This update improves the _most_frequent function to gracefully handle arrays containing mixed, incomparable types (e.g., None, str, int). The patch uses a deterministic tie-breaking strategy based on type and hash to avoid TypeErrors, consistent with scipy.stats.mode behaviour.
…linting This update improves the _most_frequent function by: Ensuring safe, deterministic tie-breaking for mixed, incomparable types (e.g., None, str, int) using (type, hash) as a key. Fixing docstring formatting and style issues to meet project linting requirements. Aligning logic with scipy.stats.mode tie-breaking behavior. Improving code readability and PEP-8 compliance.
Ensure safe comparison in _most_frequent by using (str(type(x)), hash(x)) as a tie-breaking key. - Handles mixed types like None, str, int. - Prevents TypeError from Python's min() on incomparable values. - Aligns with scipy.stats.mode behaviour. - Ready for review and linted locally.
…aking This patch improves the _most_frequent function to safely process arrays with mixed, incomparable types (e.g., None, str, int) without raising TypeError.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm with vectorization and type check. check the ci linting issue, though
Thanks for the PR @AHB30. The hash based comparison solution is interesting but breaks the existing behavior when types are not mixed. We'd like to preserve the existing behavior if possible. So the idea would be to only use hash based comparison when types are not comparable, something like what is proposed here #31717 (comment). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are linting issues that you can fix by following the instructions here #31737 (comment)
elif most_frequent_count == n_repeat: | ||
# tie breaking similarly to scipy.stats.mode | ||
return min(most_frequent_value, extra_value) | ||
else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
else: | |
else: # most_frequent_count == n_repeat |
def _most_frequent(array: np.ndarray, extra_value, n_repeat: int): | ||
"""Compute the most frequent value in a 1D array extended with | ||
[extra_value] * n_repeat, where extra_value is assumed to be not part | ||
of the array.""" | ||
# Compute the most frequent value in array only | ||
of the array. | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please revert these unrelated changes
Let's see if we get responses to the review comments. There is also #31820 which was created as a result of the discussion in the issue, so might need less back and forth. Plus I want to reward people who participate in and wait for discussions to come to a consensus before creating PRs - I find those PRs need a lot less iterating. |
Fixes #31717
Hi maintainers
This PR addresses a
TypeError
in_most_frequent
when arrays contain mixed, incomparable types (e.g.,None
,str
,int
).The patch ensures safe, deterministic tie-breaking using
type
+hash
, in line withscipy.stats.mode
.Unit tests are included to cover:
None
andstr
None
extra_value
Thanks for your time and review!