
Array API standard and Numpy compatibility #400

Closed
vnmabus opened this issue Mar 2, 2022 · 19 comments
Labels
Question General question.

Comments

@vnmabus

vnmabus commented Mar 2, 2022

This issue is created as a continuation of numpy#21135, at the request of @mattip.

The idea is for the array API standard and NumPy communities to discuss how to write code that is compatible with both the array API standard and NumPy's existing functionality, in order to avoid code duplication and ease the move towards the standard.

@rgommers
Member

Thanks for opening this issue here @vnmabus. I agree with your comments on the NumPy issue that it's more a NumPy topic/problem than an API standard one. However, visibility is good because (a) this is going to be important to resolve since it may present a hurdle to adoption, and (b) other libraries like CuPy and Dask are copying NumPy's approach (not surprising, that's their general API design approach).

Let me add a short summary and a few cross-links here:

  • The original intent of at least @shoyer and myself when writing NEP 47 was to add support to the main numpy namespace. However, we came to the conclusion that this would be difficult to do on a reasonable timescale due to a few issues around backwards compatibility. The ndarray object in particular is hard to change (there's not much in the functions that's a real problem, I believe), most notably the dtype casting rules.
  • A downside of having a separate namespace is that there can now be two code paths, so more maintenance.
  • @thomasjpfan has done the most work on exploring and comparing potential solutions - see the write-up in Path for Adopting the Array API spec scikit-learn/scikit-learn#22352. The preferred solution is used in ENH Adds Array API support to LinearDiscriminantAnalysis scikit-learn/scikit-learn#22554, which is the main PR for adoption of the array API standard in scikit-learn.
  • A first PR for SciPy is ENH: port scipy.signal._arraytools to be Array API compatible scipy/scipy#15395. It shows a similar issue (xp.concat vs np.concatenate), which is papered over with a helper function (see the sketch after this list). There's probably less to learn from that PR than from the scikit-learn ones at this point.
  • It seems clear, though, that it's undesirable to have SciPy, scikit-learn and other libraries all individually deal with differences between the numpy and numpy.array_api namespaces (and the same for cupy/cupy.array_api and other libraries that may introduce a separate namespace).
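As a rough illustration, the kind of helper used in that SciPy PR might look like the following (a hypothetical sketch, not the actual helper from the PR):

import numpy as np

def concat(arrays, /, *, axis=0):
    # The array API standard names this function `concat`, while NumPy
    # calls it `concatenate`; this shim papers over the difference.
    return np.concatenate(arrays, axis=axis)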

I suggest continuing the discussion around numpy.array_api on the NumPy issue linked by @vnmabus above, and using this issue to keep track of related issues/PRs in other projects and concerns that are not specific to NumPy.

@shoyer
Contributor

shoyer commented Mar 10, 2022

I wonder if we could at least extend the Array API standard to allow for arrays with novel dtypes for storage but not computation. For example, you could convert a NumPy object array into a numpy.array_api array and then do operations like slicing and transpose, but arithmetic would raise an error.

This would be quite helpful for downstream libraries like Xarray that do need basic operations on object arrays, e.g., to handle strings. The actual string-specific computation could be performed on NumPy arrays, of course, but it would be nice to be able to switch the core manipulation routines to use the standard array API.

@rgommers
Member

I wonder if we could at least extend the Array API standard to allow for arrays with novel dtypes for storage but not computation.

I think this is always allowed? Every existing library is going to provide a superset of the standard, and we don't require that exceptions are raised for things that are not included in the standard. So string/object dtypes should be perfectly fine.

Given that object or string dtypes are mostly numpy-specific, I don't see how it would help to explicitly name them in the standard though.

For example, you could convert a NumPy object array into a numpy.array_api array and then do operations like slicing and transpose, but arithmetic would raise an error.

So this is another issue: it's whether the numpy.array_api module should be restricted to only what's in the standard (which is what it is now, to serve as a reference) or whether it should be a superset. That'd be an option for NumPy (and what @vnmabus suggested in numpy#21135) - take the current numpy.array_api and convert that into a standalone reference package, and expand numpy.array_api in ways like you are suggesting here.

@rgommers
Member

@shoyer I followed up on this in numpy/numpy#21135 (comment). Thoughts there would be much appreciated.

@seberg
Contributor

seberg commented Jul 18, 2022

@asmeurer this issue is basically the existing discussion on the two points. I will summarize my view here:

  • np.array_api does not seem useful to users or libraries, with the exception of testing. For libraries to use it, they would have to unpack and repack incoming numpy arrays and drop any potential support for float16, object, or even normal promotion! (IMO, there is no reason for this minimal namespace to even be part of NumPy. The only advantage is forcing NumPy people to have a look and maybe making it easier to distribute.)
    To be clear: the namespace is useful and required, but only really for libraries to test their compliance.
  • It seems undesirable to have every library create its own namespace that pretends NumPy supports the array API and maps things like concat -> concatenate. But such a namespace is necessary unless you are OK with forcing libraries to duplicate their Python code.
    Libraries do not need a compliant namespace; they already live with whatever NumPy gives them. They don't need something "better", they need something that does exactly the same thing they get right now. If that is wonky promotion, then that namespace should give them wonky promotion.

So, we need a non-minimal namespace that does not need to be fully compatible with the array API. We could make that np.array_api, or we could make it np.ndarray.__array_namespace__. But maybe we do not want that (because it is only almost compliant).

So, as an alternative, maybe it should be:

def get_array_namespace(*array_likes, allow_numpy=False):
    # Return the namespace of the inputs if they implement the array API;
    # otherwise fall back to an almost-compliant namespace provided by NumPy.
    if any_array_like_implements_api:
        return best_api
    return np._new_almost_compliant_namespace

That could even live inside NumPy.

The other point is the *array_likes. It would be nice to allow very basic "promotion" between array objects. That is, basically what __array_function__ does. Allow the cupy implementation to signal that it can deal with numpy arrays, allow dask to deal with numpy+cupy...

There would need to be some best practices (i.e. get consent from the library you wish to coerce). I would do no actual coercion: if cupy wishes to support NumPy arrays, its namespace must coerce NumPy arrays, so that

ns = get_array_namespace(cupy_array, numpy_array)
ns.function(np.arange(10))

must work.

@rgommers
Member

rgommers commented Jul 21, 2022

this issue is basically the existing discussion on the two points

I guess this was a reference to a conversation at SciPy'22? I feel like I'm missing some context or assumptions made in this reply.


np.array_api does not seem useful to users or libraries, with the exception of testing. For libraries to use it, they would have to unpack and repack incoming numpy arrays and drop any potential support for float16, object, or even normal promotion!

The unpack/repack is annoying and not desired, but I would not a priori say it's not useful to do so (EDIT: as of right now, of course; if numpy had array API support in its main namespace, that'd be much better). If, as an author of an array-consuming library (e.g., scikit-learn), you want to (a) support multiple array libraries and (b) not have duplicate code paths in many places, then you kinda need this namespace and unpack/repack.
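To illustrate, here is a rough sketch of that unpack/repack pattern (hypothetical code, not taken from any of the PRs above; it assumes numpy.array_api's Array converts back to an ndarray via np.asarray):

import numpy as np
import numpy.array_api as xp  # the minimal, strict namespace

def library_function(a):
    # "Unpack": wrap the incoming ndarray in the strict Array object.
    a_xp = xp.asarray(a)
    # A single code path written against the standard (concat, not concatenate).
    result = xp.concat([a_xp, a_xp])
    # "Repack": convert back to a plain ndarray for the caller.
    return np.asarray(result)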

Good points regarding float16 and object - not sure how widely they're used, but indeed you can't support them via the unpack/repack method. While in principle it's entirely possible to do so in a single code path using only array API standard compliant inputs/code.

"normal promotion" is different - that is not portable and hence not supportable in a code path that supports multiple array libraries. So you should not be relying on that - use explicit casts instead of cross-kind casting.

It seems undesirable to have every library create its own namespace that pretends NumPy supports the array API and maps things like concat -> concatenate.

Agreed

Libraries do not need a compliant namespace; they already live with whatever NumPy gives them. They don't need something "better", they need something that does exactly the same thing they get right now. If that is wonky promotion, then that namespace should give them wonky promotion.

This I don't quite get. You are actively trying to change/improve the NumPy promotion rules because they are indeed, well, wonky :) Maybe you're reasoning from a different goal here than writing portable code in a downstream library, but I think casting rules in any namespace you want to use here should be compliant.

The other point is the *array_likes. It would be nice to allow very basic "promotion" between array objects.

This is something I disagree with as a goal. There's a reason that no other array library allows this; it's hard to even be precise about what this means, and it's an endless source of problems and bug reports. What CuPy, PyTorch, JAX et al. all do is better imho. Having to be explicit about conversions and not mixing different array objects directly is healthy, and not a missing feature.

if cupy wishes to support NumPy arrays

but it doesn't?

@seberg
Contributor

seberg commented Jul 21, 2022

Yes, it was a reference to SciPy'22, although I hope it isn't particularly detached from the discussion here; everything has been mentioned before.

Good points regarding float16 and object - not sure how widely they're used, but indeed you can't support them via the unpack/repack method. While in principle it's entirely possible to do so in a single code path using only array API standard compliant inputs/code.

Well, you might be able to if it were not a minimal implementation. The main point is that libraries should not have to make (many) backwards-compatibility breaks for existing NumPy code. Even rare/strange BC issues seem bound to seriously hinder adoption by existing libraries, and the minimal implementation has a lot of BC limitations compared to vanilla NumPy.

[NumPy promotion] is not portable and hence not supportable in a code path that supports multiple array libraries.

The array-API already leaves many promotions as undefined. So you already get different dtypes/precision results out depending on what you pass in (if you mix dtypes).

The current NumPy promotion just adds a few quirks to that. So the user already needs to be a bit careful about promotion if they globally swap out the array object. I am not convinced that there is a problem with being pragmatic about it, especially from a library adoption point of view.

Maybe you're reasoning from a different goal here than writing portable code in a downstream library, but I think casting rules in any namespace you want to use here should be compliant.

I am reasoning purely from the goal of getting portable code into existing downstream libraries. The point is that the dtype promotion rules don't matter much in practice: currently, sklearn uses the NumPy promotion rules. If they do nothing, NumPy will eventually switch them over, and they are (presumably) happy with that.

For downstream/sklearn it doesn't matter if they get a "wrong" namespace as long as they get unchanged behavior.
This could, and likely should, be explicit, i.e. get_namespace(..., accept_plain_numpy=True).

That is what both the SciPy and the sklearn PRs that tried to adopt the array-api ended up doing, so it seems like a pragmatic approach. They just did it independently and "incrementally" (e.g. only adding concat because they happened to use it).
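A sketch of that shape of helper (illustrative names only; the real helpers in those PRs differ in detail):

import numpy as np

def get_namespace(*arrays, accept_plain_numpy=True):
    # Collect the array API namespaces advertised by the inputs.
    namespaces = {a.__array_namespace__() for a in arrays
                  if hasattr(a, "__array_namespace__")}
    if len(namespaces) == 1:
        return namespaces.pop()
    if not namespaces and accept_plain_numpy:
        # Fall back to plain NumPy: unchanged behavior for existing users,
        # even where that behavior is not strictly standard-compliant.
        return np
    raise TypeError("mixed or unsupported array inputs")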

In other words: libraries need a namespace that gives easy NumPy compatibility right now (with as few backward-compat concerns as possible). This would give the library the ability to have a single code path that supports NumPy (unmodified) and any API compatible object. Yes, that might be explicitly not "compliant" for NumPy arrays.

np.array_api is useful and needed, but it is not that namespace.

if cupy wishes to support NumPy arrays

but it doesn't?

cupy.asarray(np.arange(100)) works, as do many other calls. But yes, it is a choice, since cupy could refuse to do it on the grounds that it may slow some calls down.
For dask.Array, I see very little reason not to accept a stray NumPy/CuPy array (e.g. as a kernel for a convolution).

I don't care too much about this, but dask.Array people may know better if it is useful for them. Promotion was explicitly mentioned also in gh-403.

@shoyer
Contributor

shoyer commented Jul 21, 2022

I don't care too much about this, but dask.Array people may know better if it is useful for them. Promotion was explicitly mentioned also in #403.

For the most part, I have come around to @rgommers's perspective on preferring explicit promotion. This is the way JAX handles things, for example, and it really is a joy to be able to reliably compose different array types. (I similarly prefer explicit broadcasting with jax.vmap in most cases over NumPy-style implicit broadcasting.)

In JAX, it is easy to avoid implicit array promotion because different array types are created via transformations (at a single function call) rather than by creating array objects separately. This works great for "computation oriented" code like most deep learning programs, and means you always have a well defined "order" in which array wrapping should happen.

On the other hand, libraries like Xarray that are focused on "data oriented" use-cases don't have the luxury of being able to use transformations for handling different array types. Implicit casting definitely seems more appealing here (so users can write more generic code), but it still has some serious scaling issues if more than a few array types are involved. For example, users want to be able to compose stacks of wrapped arrays like cupy/dask/pint inside Xarray objects (pydata/xarray#5648). I'm still not entirely sure what the right long term fix looks like, though definitely leaving it up to each array library separately with protocols like __array_ufunc__ is not going to scale -- there are just too many potential combinations to anticipate.

@rgommers
Member

The main point is that libraries should not have to make (many) backwards-compatibility breaks for existing NumPy code. Even rare/strange BC issues seem bound to seriously hinder adoption by existing libraries, and the minimal implementation has a lot of BC limitations compared to vanilla NumPy.

Completely agreed, the fewer changes they have to make the better.

I am reasoning purely from the goal of getting portable code into existing downstream libraries.

Okay, we're on the same page there. And with my SciPy maintainer hat on, I'm very much interested in the details of how that looks too.

The array-API already leaves many promotions as undefined. So you already get different dtypes/precision results out depending on what you pass in (if you mix dtypes).
....
The point is that the dtype promotion rules don't matter much in practice: currently, sklearn uses the NumPy promotion rules. If they do nothing, NumPy will eventually switch them over, and they are (presumably) happy with that.

This is probably not true beyond numpy usage (pass in mixed-dtype PyTorch tensors, for example, and it won't work), and is where the strict numpy.array_api implementation will help.

In other words: libraries need a namespace that gives easy NumPy compatibility right now (with as little backward compat concerns as possible). This would give the library the ability to have a single code path that supports NumPy (unmodified) and any API compatible object.

Thanks @seberg, yes, this makes sense to me and is quite important to have. Something to also consider here, given the "right now", is that it's probably pointless for this namespace to live in NumPy itself, because it cannot be used unconditionally for several years, given the need to support older numpy versions. Having that "compat namespace" in scikit-learn, SciPy etc. avoids this. So what is probably needed is a separate package that can simply be vendored into each library that needs it.

@ilan-gold

Sorry for resurrecting this for a specific question, but what would some sort of "array-api-compliant" string container in numpy hypothetically look like, at least at a high level?

For example, as things stand, strings are simply not supported:

from numpy import array_api as np
np.asarray(['foo', 'bar'])
"""
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[9], line 1
----> 1 np.asarray(['foo', 'bar'])

File ~/Projects/Theis/array-api-tests/venv/lib/python3.10/site-packages/numpy/array_api/_creation_functions.py:72, in asarray(obj, dtype, device, copy)
     70     raise OverflowError("Integer out of bounds for array dtypes")
     71 res = np.asarray(obj, dtype=dtype)
---> 72 return Array._new(res)

File ~/Projects/Theis/array-api-tests/venv/lib/python3.10/site-packages/numpy/array_api/_array_object.py:81, in Array._new(cls, x)
     79     x = np.asarray(x)
     80 if x.dtype not in _all_dtypes:
---> 81     raise TypeError(
     82         f"The array_api namespace does not support the dtype '{x.dtype}'"
     83     )
     84 obj._array = x
     85 return obj

TypeError: The array_api namespace does not support the dtype '<U3'
"""

To me this makes sense given the API specification and its intention. My reaction was similar to @shoyer's: the spec would need to be extended to support this somehow. However, @rgommers seems to suggest otherwise:

I wonder if we could at least extend the Array API standard to allow for arrays with novel dtypes for storage but not computation.

I think this is always allowed? Every existing library is going to provide a superset of the standard, and we don't require that exceptions are raised for things that are not included in the standard. So string/object dtypes should be perfectly fine.

Given that object or string dtypes are mostly numpy-specific, I don't see how it would help to explicitly name them in the standard though.

I could see allowing the few operations that make sense for strings from the array API, like addition as string concatenation (although a lot of this currently lives in np.char rather than the top-level np package), and then just raising NotImplementedError otherwise. But this seems to go against the doc, which specifies that the array container must implement the mentioned methods ("A conforming implementation of the array API standard must provide and support an array object having the following attributes and methods." from https://data-apis.org/array-api/latest/API_specification/array_object.html#arithmetic-operators).
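A toy sketch of that idea (purely hypothetical, not a proposed API):

import numpy as np

class StringArray:
    # Storage-only string container: structural operations and
    # addition-as-concatenation work; other arithmetic raises.
    def __init__(self, data):
        self._array = np.asarray(data, dtype=np.str_)

    def __getitem__(self, key):
        # Slicing/indexing is fine for a storage-only container.
        return StringArray(self._array[key])

    def __add__(self, other):
        # Addition defined as elementwise string concatenation, via np.char.
        return StringArray(np.char.add(self._array, other._array))

    def __mul__(self, other):
        raise NotImplementedError("'*' is not supported for string dtypes")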

@lucascolley
Contributor

lucascolley commented Jan 12, 2024

what would some sort of "array-api-compliant" string container in numpy hypothetically look like

The answer is different depending on what you mean by "array-api-compliant".

If you mean 'how could an array-provider library both support the array API and string dtypes', then that is already possible. See numpy/numpy#25542 (and NEP 55 I suppose!). numpy.array_api is a strict minimal implementation of the standard, so indeed string dtypes would have to be added to the standard for them to appear there. Conforming libraries are free to implement any functionality which the standard leaves unspecified, though.

If you mean 'how could an array-consumer library use string dtypes with arrays, regardless of which array type they get (as long as it conforms to the standard)', then yes, the standard would have to be extended. This would likely have to be an optional extension, though (note that the only extensions so far are just extra namespaces of functions; this would be a bit more involved given the extra dtypes), as Ralf said:

Given that object or string dtypes are mostly numpy-specific, I don't see how it would help to explicitly name them in the standard though.

@asmeurer
Member

The first question is whether libraries other than NumPy would implement it. The main purpose of the standard is interoperability between libraries.

@ilan-gold

ilan-gold commented Jan 16, 2024

So what it sounds like is that while string arrays can work with libraries that are array-api compliant, the actual array container for a string array will not be array-api compliant. Is this accurate?

Thanks for the replies!

@lucascolley
Contributor

lucascolley commented Jan 16, 2024

the actual array container for a string array will not be array-api compliant

Since string dtypes are not in the standard, it is slightly confusing to ask whether a certain container would be compliant or not. The tl;dr is:

  • string dtypes are not in the standard (and there aren't any plans for them to be right now, since really only NumPy implements them). So you can't assume any string dtype functionality if you write array-agnostic code.
  • This doesn't stop any compatible library from doing whatever they like with string dtypes, so NumPy is free to implement them, and you can write code which makes use of them as long as you expect it to only work with NumPy (or e.g. your own custom library which decides to support the array API and string dtypes).

@ilan-gold

ilan-gold commented Jan 16, 2024

Right, this was my impression. So it sounds like the answer is basically "no" at the moment to whether one can create an interoperable container, but "yes" to whether a library (which has an array-api compliant Array object) can handle them.

@lucascolley
Contributor

Yes, and said array-api-compliant Array object is allowed to have additional attributes and methods to support features like this (it just isn't required/guaranteed to have them).

@ilan-gold

Thanks! Really appreciate the fast replies (making my life infinitely easier).

@lucascolley
Contributor

I think this issue can be closed now that NEP 56 was accepted!

@rgommers
Member

Good point @lucascolley, closing. Thanks everyone!
