-
-
Notifications
You must be signed in to change notification settings - Fork 26.1k
WIP Add function to convert array namespace and device to reference array #31829
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
sklearn/metrics/_classification.py
Outdated
xp, _, device = get_namespace_and_device(y_true, y_pred, sample_weight) | ||
y_true, sample_weight = _convert_to_reference( | ||
reference=y_pred, arrays=(y_true, sample_weight) | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't get_namespace_and_device
raise if they're not already on the same device ?
If so, the second line is a noop
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was like this in the past (that it raised), but since #29476 it only returns an empty list in this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't get_namespace_and_device raise if they're not already on the same device ?
Ah yes, I'd change to:
xp, _, device = get_namespace_and_device(y_pred)
y_true, sample_weight = _convert_to_reference(
reference=y_pred, arrays=(y_true, sample_weight)
)
It was like this in the past (that it raised), but since #29476 it only returns an empty list in this case.
I thought #29476 filtered out sparse arrays in _remove_non_arrays
? I think get_namespace_and_device
will error with different namespace/devices
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does raise. get_namespace_and_device
calls sklearn.utils._array_api.device
with the list of arrays which raises when they're not all on the same device
scikit-learn/sklearn/utils/_array_api.py
Lines 198 to 209 in aa680bc
device_ = _single_array_device(array_list[0]) | |
# Note: here we cannot simply use a Python `set` as it requires | |
# hashable members which is not guaranteed for Array API device | |
# objects. In particular, CuPy devices are not hashable at the | |
# time of writing. | |
for array in array_list[1:]: | |
device_other = _single_array_device(array) | |
if device_ != device_other: | |
raise ValueError( | |
f"Input arrays use different devices: {device_}, {device_other}" | |
) |
try: | ||
# Note will copy if required | ||
array_converted = xp_ref.from_dlpack(array, device=device_ref) | ||
except AttributeError: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I decided to only except AttributeError
, which I think occurs if input or output namespace does not support dlpack.
from_dlpack
can give 2 (or more) other errors
- BufferError - The dlpack and dlpack_device methods on the input array may raise BufferError when the data cannot be exported as DLPack (e.g., incompatible dtype, strides, or device). It may also raise other errors when export fails for other reasons (e.g., not enough memory available to materialize the data). from_dlpack must propagate such exceptions.
- I thought that if dlpack fails to convert due to one of the above errors, it would not make sense to try ourselves manually.
- ValueError - If data exchange is possible via an explicit copy but copy is set to False.
- I've left
copy=None
, allowing it to copy if need be, so this error is not relevant. I am not sure about thecopy=None
setting though, it is a lot of memory usage.
- I've left
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Talking to Evgeni about what could possibly cause a BufferError
- here are some ideas, but but I don't know if any of these will cause an error:
- numpy supports negative strides but torch does not (https://discuss.pytorch.org/t/negative-strides-in-tensor-error/134287) (I don't think torch supports DLPack atm, so hard to figure out)
- numpy structured array?
- numpy memmapped array
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suspect we could simplify _convert_to_reference
a bit when xp_ref
is NumPy:
if _is_numpy_namespace(xp_ref):
return tuple([_convert_to_numpy(array, get_namespace(array)) for array in arrays])
However, I'm not sure this offers much benefit beyond readability, since in most cases there are only two arrays to convert: y_true
and sample_weight
.
I'll have to give it some more thought.
At the moment that would be simpler, but I think
For metrics I think this would mostly be it, but for estimators there could be other arrays to convert. |
@@ -466,6 +466,34 @@ def get_namespace_and_device( | |||
return xp, False, arrays_device | |||
|
|||
|
|||
def _convert_to_reference(*, reference, arrays): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a nit/pet peeve: sklearn.utils._array_api
is a private module, we don't need to make functions private here. It feels like repeating ourselves. I've made this argument before and you can tell I've not convinced everyone because there are many functions here that start with a _
but I will keep trying :D
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also tried at some point. It's a lost battle 😄
def _convert_to_reference(*, reference, arrays): | ||
"""Convert `arrays` to `reference` array's namespace and device.""" | ||
xp_ref, _, device_ref = get_namespace_and_device(reference) | ||
arrays_converted_list = [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about calling it converted_arrays
? arrays_converted_list
feels backwards and we can use the plural "s" to let people know it is a sequence/collection of more than one thing
xp, _, device = get_namespace_and_device(y_pred) | ||
y_true, sample_weight = _convert_to_reference( | ||
reference=y_pred, arrays=(y_true, sample_weight) | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that returning xp
and device
from convert_to_reference
feels like too much. But it also feels weird to have this "duplication" of get_namespace_and_device
and convert_to_reference
. The only alternative I can think of, though I'm not sure I like it, is to have convert_to_reference(arrays, namespace=xp, device=device)
(but then maybe rename it to move_to
or some such.
@@ -466,6 +466,34 @@ def get_namespace_and_device( | |||
return xp, False, arrays_device | |||
|
|||
|
|||
def _convert_to_reference(*, reference, arrays): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not a great fan of the *
. Let people use or not use keyword arguments. You can foo(bar=42)
with def foo(bar):
People are grownups :D
# Note will copy if required | ||
array_converted = xp_ref.from_dlpack(array, device=device_ref) | ||
except AttributeError: | ||
# Convert to numpy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# Convert to numpy | |
# Converting to numpy is tricky, handle this via a dedicated function |
# Convert to numpy | ||
if _is_numpy_namespace(xp_ref): | ||
array_converted = _convert_to_numpy(array, xp_array) | ||
# Convert from numpy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# Convert from numpy | |
# Convert from numpy, all array libraries know how to use a Numpy array |
elif _is_numpy_namespace(xp_array): | ||
array_converted = xp_ref.asarray(array, device=device_ref) | ||
else: | ||
# Convert to numpy then to reference |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# Convert to numpy then to reference | |
# There is no generic way to convert from namespace A to B | |
# So we first convert from A to numpy and then from numpy to B | |
# The way to avoid this round trip is to lobby for DLpack support | |
# in libraries A and B |
What's the difference with |
I think the goal of both is the same. I also think that |
Reference Issues/PRs
Towards #28668 and #31274
What does this implement/fix? Explain your changes.
Adds a function that converts arrays to the namespace and device of the reference array.
Tries DLPack first, and if either array does not support it, tries to convert manually.
Any other comments?
This is an initial attempt, and what it would look like in a simple metric. Feedback welcome. (Tests to come)
I thought about also outputting the namespace and device of the reference array, to avoid the second call to
get_namespace_and_device
, but I thought it would make the outputs too messy.cc @ogrisel @betatim @StefanieSenger @virchan @lesteve