Skip to content

WIP Add function to convert array namespace and device to reference array #31829

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

lucyleeow
Copy link
Member

@lucyleeow lucyleeow commented Jul 24, 2025

Reference Issues/PRs

Towards #28668 and #31274

What does this implement/fix? Explain your changes.

Adds a function that converts arrays to the namespace and device of the reference array.

Tries DLPack first, and if either array does not support it, tries to convert manually.

Any other comments?

This is an initial attempt, and what it would look like in a simple metric. Feedback welcome. (Tests to come)

I thought about also outputting the namespace and device of the reference array, to avoid the second call to get_namespace_and_device, but I thought it would make the outputs too messy.

cc @ogrisel @betatim @StefanieSenger @virchan @lesteve

Copy link

github-actions bot commented Jul 24, 2025

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: e99209e. Link to the linter CI: here

Comment on lines 357 to 360
xp, _, device = get_namespace_and_device(y_true, y_pred, sample_weight)
y_true, sample_weight = _convert_to_reference(
reference=y_pred, arrays=(y_true, sample_weight)
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't get_namespace_and_device raise if they're not already on the same device ?
If so, the second line is a noop

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was like this in the past (that it raised), but since #29476 it only returns an empty list in this case.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't get_namespace_and_device raise if they're not already on the same device ?

Ah yes, I'd change to:

    xp, _, device = get_namespace_and_device(y_pred)
    y_true, sample_weight = _convert_to_reference(
        reference=y_pred, arrays=(y_true, sample_weight)
    )

It was like this in the past (that it raised), but since #29476 it only returns an empty list in this case.

I thought #29476 filtered out sparse arrays in _remove_non_arrays ? I think get_namespace_and_device will error with different namespace/devices

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does raise. get_namespace_and_device calls sklearn.utils._array_api.device with the list of arrays which raises when they're not all on the same device

device_ = _single_array_device(array_list[0])
# Note: here we cannot simply use a Python `set` as it requires
# hashable members which is not guaranteed for Array API device
# objects. In particular, CuPy devices are not hashable at the
# time of writing.
for array in array_list[1:]:
device_other = _single_array_device(array)
if device_ != device_other:
raise ValueError(
f"Input arrays use different devices: {device_}, {device_other}"
)

try:
# Note will copy if required
array_converted = xp_ref.from_dlpack(array, device=device_ref)
except AttributeError:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I decided to only except AttributeError, which I think occurs if input or output namespace does not support dlpack.

from_dlpack can give 2 (or more) other errors

  • BufferError - The dlpack and dlpack_device methods on the input array may raise BufferError when the data cannot be exported as DLPack (e.g., incompatible dtype, strides, or device). It may also raise other errors when export fails for other reasons (e.g., not enough memory available to materialize the data). from_dlpack must propagate such exceptions.
    • I thought that if dlpack fails to convert due to one of the above errors, it would not make sense to try ourselves manually.
  • ValueError - If data exchange is possible via an explicit copy but copy is set to False.
    • I've left copy=None, allowing it to copy if need be, so this error is not relevant. I am not sure about the copy=None setting though, it is a lot of memory usage.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Talking to Evgeni about what could possibly cause a BufferError - here are some ideas, but but I don't know if any of these will cause an error:

Copy link
Member

@virchan virchan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect we could simplify _convert_to_reference a bit when xp_ref is NumPy:

if _is_numpy_namespace(xp_ref):
   return tuple([_convert_to_numpy(array, get_namespace(array)) for array in arrays])

However, I'm not sure this offers much benefit beyond readability, since in most cases there are only two arrays to convert: y_true and sample_weight.

I'll have to give it some more thought.

@lucyleeow
Copy link
Member Author

I suspect we could simplify _convert_to_reference a bit when xp_ref is NumPy:

At the moment that would be simpler, but I think _convert_to_numpy is considered a bit of a 'hack'. It's got numpy conversions for specific array namespaces hard coded (needs maintenance, though I do doubt any of the APIs will change) and it doesn't necessarily work for all array API arrays. I think DLPack is more future proof and it has a copy parameter (we could specify to avoid copying if we wanted to) - at least this was the thinking when I decided to just try DLPack first...

since in most cases there are only two arrays to convert: y_true and sample_weight.

For metrics I think this would mostly be it, but for estimators there could be other arrays to convert.

@@ -466,6 +466,34 @@ def get_namespace_and_device(
return xp, False, arrays_device


def _convert_to_reference(*, reference, arrays):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a nit/pet peeve: sklearn.utils._array_api is a private module, we don't need to make functions private here. It feels like repeating ourselves. I've made this argument before and you can tell I've not convinced everyone because there are many functions here that start with a _ but I will keep trying :D

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also tried at some point. It's a lost battle 😄

def _convert_to_reference(*, reference, arrays):
"""Convert `arrays` to `reference` array's namespace and device."""
xp_ref, _, device_ref = get_namespace_and_device(reference)
arrays_converted_list = []
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about calling it converted_arrays? arrays_converted_list feels backwards and we can use the plural "s" to let people know it is a sequence/collection of more than one thing

Comment on lines +357 to +360
xp, _, device = get_namespace_and_device(y_pred)
y_true, sample_weight = _convert_to_reference(
reference=y_pred, arrays=(y_true, sample_weight)
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that returning xp and device from convert_to_reference feels like too much. But it also feels weird to have this "duplication" of get_namespace_and_device and convert_to_reference. The only alternative I can think of, though I'm not sure I like it, is to have convert_to_reference(arrays, namespace=xp, device=device) (but then maybe rename it to move_to or some such.

@@ -466,6 +466,34 @@ def get_namespace_and_device(
return xp, False, arrays_device


def _convert_to_reference(*, reference, arrays):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a great fan of the *. Let people use or not use keyword arguments. You can foo(bar=42) with def foo(bar):

People are grownups :D

# Note will copy if required
array_converted = xp_ref.from_dlpack(array, device=device_ref)
except AttributeError:
# Convert to numpy
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# Convert to numpy
# Converting to numpy is tricky, handle this via a dedicated function

# Convert to numpy
if _is_numpy_namespace(xp_ref):
array_converted = _convert_to_numpy(array, xp_array)
# Convert from numpy
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# Convert from numpy
# Convert from numpy, all array libraries know how to use a Numpy array

elif _is_numpy_namespace(xp_array):
array_converted = xp_ref.asarray(array, device=device_ref)
else:
# Convert to numpy then to reference
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# Convert to numpy then to reference
# There is no generic way to convert from namespace A to B
# So we first convert from A to numpy and then from numpy to B
# The way to avoid this round trip is to lobby for DLpack support
# in libraries A and B

@jeremiedbb
Copy link
Member

What's the difference with sklearn.utils._array_api.ensure_common_namespace_device ? Looks like both are trying to do the same thing.

@betatim
Copy link
Member

betatim commented Jul 25, 2025

What's the difference with sklearn.utils._array_api.ensure_common_namespace_device ? Looks like both are trying to do the same thing.

I think the goal of both is the same. I also think that ensure_common_namespace_device is broken right now, at least for the application that Lucy has in mind. Maybe the thing to do is to replace the code of ensure_common_namespace_device with that of the new function?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants