
Separate eager and lazy APIs #249

Closed

Conversation

MarcoGorelli
Contributor

@MarcoGorelli commented Sep 4, 2023

A lot of thought and effort is going into this, so I hope people will be open to at least considering it. No worries if it's rejected, but I hope that the rejection can come with an alternative solution to the problems with the status quo.

Why?

The current API makes it look like certain operations are possible. However, for some implementations (e.g. dataframes as frontends to SQL backends, polars lazyframes), the following would either have to raise, or trigger an implicit join (a "performance footgun"):

df1: DataFrame
df2: DataFrame
# problematic call:
df1.filter(df1.get_column_by_name('a') > df2.get_column_by_name('b'))

The problem is that the current API makes it look like it should run fine.

What's the suggested solution?

The basic idea is that, instead of only having DataFrame and Column, there will now be 4 classes:

  • DataFrame
  • Column
  • PermissiveFrame
  • PermissiveColumn

DataFrame

Completely agnostic to execution details (so, some implementations may choose to use lazy execution for it).

Initialisation:

  • namespace.dataframe_from_dict
  • namespace.dataframe_from_2d_array
  • df.__dataframe_consortium_standard__()

Individual standalone columns cannot be extracted. All operations must be done on the dataframe level, using Columns to refer to individual columns.

Column

Can be used to create new columns in a dataframe, update a dataframe's column(s), or filter a dataframe's rows. Examples:

# keep rows where 'a' is greater than 3
df = df.filter(col('a') > 3)

# divide the values in column 'a' by 2
df = df.assign(col('a') / 2)

# insert a new column 'c' which is the sum of columns 'a' and 'b'
df = df.assign((col('a') + col('b')).rename('c'))

Formally, a Column is lazy and is evaluated within the context of a DataFrame.

The idea is that

        df: DataFrame
        col = df.__dataframe_namespace__().col
        df = df.filter(col('a') > col('b')*2)

should behave a bit like (pandas syntax):

        df: pd.DataFrame
        col_a = lambda df: df.loc[:, 'a']
        col_b = lambda df: df.loc[:, 'b']
        col_b_doubled = lambda df: col_b(df) * 2
        mask = lambda df: col_a(df) > col_b_doubled(df)
        df = df.loc[mask(df)]

This has two benefits:

  • no need to repeat the dataframe name over-and-over again (and this is a common complaint about pandas syntax)
  • the Column API restricts you to refer to columns from the given dataframe. If you want to insert a column from a different dataframe, you'll need to do a join first.

In particular, this resolves the "problematic call" from the "Why?" section as it would become impossible to write.
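
For illustration, the join-first alternative might look roughly like this (a sketch only; the join keyword arguments are an assumption for illustration, not taken from the spec):

# join on a shared key first, then filter within the single joined dataframe
joined = df1.join(df2, how='inner', left_on='id', right_on='id')
result = joined.filter(col('a') > col('b'))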

Initialisation:

  • namespace.col
  • other top-level Column functions (namespace.any_rowwise, namespace.all_rowwise, namespace.sorted_indices, namespace.unique_indices)

Supports all the functions that the eager column (PermissiveColumn, below) supports, but with the following differences:

  • reductions (e.g. col('a').mean()) return a length-1 Column rather than a Scalar
  • there is no to_array_object
  • no .dtype, as Columns are free-standing. If you want to know the dtype of a DataFrame's column, use DataFrame.schema[column_name]
  • no name - instead, there are:
    • root_names: the column names from the dataframe to consider when creating the output column
    • output_name: the name of the resulting column

Note that Column.len is supported, which (lazily) evaluates the number of rows
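
As a rough sketch of how these fit together (attribute access shown purely for illustration):

# hypothetical inspection of a column expression's metadata
expr = (col('a') + col('b')).rename('c')
expr.root_names   # ['a', 'b'] - the dataframe columns the expression reads from
expr.output_name  # 'c' - the name of the column it will produce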

PermissiveFrame

Like DataFrame, but must be able to support eager evaluation when evaluating to_array_object. Other operations can stay lazy - if it wishes - but it needs to be able to evaluate eagerly when required.

Initialisation: can only be done via:

  • DataFrame.collect()

Supports everything in DataFrame (except collect), but also:

  • to_array_object (to produce a 2D ndarray)
  • relax (to go back to DataFrame, which is totally agnostic to the execution mode and so might be lazy in some implementations)
  • get_column_by_name (to create a single PermissiveColumn)

I've documented that collect should be called as little and as late as possible.

PermissiveColumn

Initialisation:

  • PermissiveFrame.get_column_by_name
  • ser.__column_consortium_standard__()
  • namespace.column_from_sequence
  • namespace.column_from_1d_array

Just like Column, but:

  • reductions (e.g. Column.mean()) return scalars
  • it supports to_array_object (for creating a 1D array)
  • it has .dtype property
  • it has name, and no root_names nor output_name

Assorted

shape

I've removed shape as well to discourage df.collect().shape. It's not needed anyway:

  • if people want the number of columns, they can do len(df.column_names)
  • if people want the number of rows, they can evaluate that on the Column level, which will stay lazy, e.g. namespace.col('a').len()

If anyone has a use-case for which this is really necessary, we can always add it in again later (adding extra functionality is backwards-compatible)
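
For example (a sketch), both quantities remain available without shape:

namespace = df.__dataframe_namespace__()
n_columns = len(df.column_names)                            # number of columns
df = df.assign(namespace.col('a').len().rename('n_rows'))   # number of rows, stays lazy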

Column | PermissiveColumn

A PermissiveColumn can be thought of as a trivial Column. For example (pandas syntax) df.loc[pd.Series([True, False, True])] can be thought of as the following Column:

df: DataFrame

def mask(_df):
    return pd.Series([True, False, True])

df.loc[mask(df)]

and so functions which accept Columns (like .assign / .filter) should also accept PermissiveColumns.


A current goal of the Standard is to have the API be completely independent of execution details. This proposal would slightly relax that goal, so that only DataFrame and Column would be completely independent of execution details


Examples

I've included some examples in the PR, which I think would generally be good to include in the repo anyway to demonstrate how this is meant to be used.

1. center dataset based on mean from training set

Function just transforms a DataFrame, so everything can happen at that level. No need for collect

2. Group-by then plot

Function accepts Series, so PermissiveColumn is needed here

3. Split dataset and train model

In order to train an ML model, conversion to ndarray will have to take place. So,
DataFrame.collect is needed, but it only needs to be called once, after the more expensive calculations have been done
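
A condensed sketch of what example 3 boils down to (based on the code in the PR; the fit call is just an illustrative third-party call that needs concrete arrays):

df = expensive_feature_engineering(df)  # stays execution-agnostic, possibly lazy
train = df.collect()                    # single collect, as late as possible
x_train = train.drop_columns("y").to_array_object(namespace.Float64())
y_train = train.get_column_by_name("y").to_array_object(namespace.Float64())
model.fit(x_train, y_train)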

FAQ

Why namespace.col instead of DataFrame.col?

I've stated in #244 that I'd like the Standard to feel familiar. Multiple libraries (pyspark, polars, ibis, siuba) already have something namespace.col-like, and so this would feel familiar to their users.

If someone wants the dtype of a column of a DataFrame, they can do df.schema[column_name] (if #248 is accepted)

Why does PermissiveFrame need namespace.col? Isn't that just a lazy thing?

It can work for eager dataframes as well. If plant_statistics is a PermissiveFrame, you can write

    plant_statistics = plant_statistics.filter(
        (col("sepal_width") > col("sepal_height")) | (col("species") == "setosa")
    )

instead of

    plant_statistics = plant_statistics.filter(
        (
            plant_statistics.get_column_by_name("sepal_width")
            > plant_statistics.get_column_by_name("sepal_height")
        )
        | (plant_statistics.get_column_by_name("species") == "setosa")
    )

and maybe there'd be some scope for limited and basic query optimisation in libraries which up until now have been completely eager (like pandas)

Can I try this out?

By the time I mark this as ready-for-review, I'm hoping it will be possible to try out data-apis/dataframe-api-compat#13

Why do we need collect at all? Can't everything be agnostic to execution details?

Please see example 3

Why do we need PermissiveColumn at all?

Please see example 2

See Also

Some related discussions happened at:

@MarcoGorelli force-pushed the separate-eager-and-lazy branch 2 times, most recently from a02583d to a6ef75f on September 11, 2023
@MarcoGorelli marked this pull request as ready for review on September 13, 2023
@MarcoGorelli changed the title from WIP RFC: Separate eager and lazy APIs to RFC: Separate eager and lazy APIs on Sep 13, 2023
@rgommers
Member

Thanks @MarcoGorelli for the detailed proposal + prototype!

Here are some of the bigger picture questions that came to mind:

  1. Should EagerColumn and EagerDataframe be "full" APIs, i.e. a superset of Expression and DataFrame? The alternative would be to keep them limited to things that don't fit into the common subset. I think I agree with your take that it should be a superset, but it seems like that decision could go either way.

  2. example_2.py shows a .collect() on line 31 that's a bit early. You suggested that there could be double computing happening in case it's put at the last possible moment (which is right before the .to_array_object() calls). However, this depends on the underlying engine and how good it is at caching intermediate results. And whether those intermediate results are ever fully materialized to begin with. That example does show a single .collect() call but a fully-materialized dataframe; the alternative would be to use two .collect() calls and only materialize two columns. Which one is better is data-dependent and implementation-dependent. The optimal case would be to have great full-program optimization and zero .collect() calls, but as we discussed that is a huge engineering effort and there are currently no dataframe libraries that can actually do this. So practicality beats purity. That said, this is going to require some guidance for users I think.

  3. The expressions are still a little new/confusing to me (and I'm probably not alone in that). One thing that I think is missing is a specific name for "function that returns an expression". There are currently 5 such functions in the API (col, any/all_rowwise and sorted/unique_indices).

  4. It's slightly suboptimal that there are multiple ways of composing the same thing. E.g.: df.select(col('a')).mean() vs. df.select(col('a').mean()). The latter one expresses better what's actually happening in execution order, but the former one is used in the example (and perhaps more analogous to SQL?). When expressions are composed, the semantic order of operations isn't completely obvious. In the example given before in this question, it's "select column 'a' from dataframe, then take mean of column". For the .filter((col("sepal_width") > col("sepal_height")) ...) example in the PR description, there's more going on - this is actually nontrivial to write down; the expressions aren't chained but instead there are multiple "select column from dataframe" operations followed by two comparisons between columns, then a row-wise selection. What is allowed here and what isn't? E.g. is df.filter((col("sepal_width") allowed when that sepal_width column is not boolean?

  5. More a comment than a question: one thing I learned today from you is how this solves some of the issues around lazy implementations. A key design rule seems to be that there is only a single DataFrame method that takes another dataframe as input, namely .join(). That by design avoids problems with deterministic operations etc. that we identified in previous discussions on other issues/PRs.

Like DataFrame, but must support eager evaluation.

  1. Should it be "must" here? I'd think that any library which can implement this lazily should still be free to do so? The Eager in the name is more a hint that some methods for this column class may need to trigger execution, not that it's forbidden to stay lazy as long as possible and do execute-when-needed? I think that may be what you intended; it requires writing that down explicitly I'd say.

Readability matters.

Please remember that to a large extent this is in the eye of the beholder. To me, your preferred version uses both method names and syntax that I don't yet understand and therefore would have to look up. Whereas this, while more verbose, is much easier to understand for me - and arguably also for the average non-initiated reader:

col_width = plant_statistics.get_column_by_name("sepal_width")
col_height = plant_statistics.get_column_by_name("sepal_height")
col_species = plant_statistics.get_column_by_name("species")
rows = plant_statistics.get_rows_by_mask((col_width > col_height) | col_species)

Also, the plant_statistics name is unusually long here, so repeating it is an extra penalty on any syntax that repeats the dataframe name. If the variable was named df (or even df_plant) the difference in number of characters wouldn't be nearly as large.

So while I'm not at all opposed to changes to naming for those who have to read it more, I'd keep in mind that this is not at all clear cut and by itself not the strongest rationale for introducing new classes/objects.

That said, I don't think this particular rationale is needed at all here. The stronger rationale is that an expression is nicer to work with as an execution-independent abstraction, so it seems useful and preferable to a separate lazy-only expression object.

Also, it was encouraging to hear that something expression-like may be in the works for Pandas.


Current impression: this is quite promising. There may be some things to clarify regarding semantics, but I like it so far.

If any of the expression's expressions is not boolean.
"""

def any(self, *, skip_nulls: bool = True) -> Expression:
Member

Is it correct for reductions to be methods on Expression? The PR description suggests it isn't, because it calls out reductions to be specific to EagerColumn. Or if this is meant to be a "lazy scalar" type of return, doesn't that give problems with returning an expression object on which most methods cannot be considered valid/callable?

@MarcoGorelli
Contributor Author
Sep 13, 2023

Yup - an expression acts as a function from a dataframe to a column, and so here, it would just return a length-1 column.

Then, broadcasting rules (which I've added to the docs) define the rest

Here's some examples:

In [1]: df
Out[1]: 
     sepal_length  nice_plant
0             5.1       False
1             4.9       False
2             4.7       False
3             4.6       False
4             5.0       False
..            ...         ...
145           6.7        True
146           6.3        True
147           6.5        True
148           6.2       False
149           5.9       False

[150 rows x 2 columns]

# just select a single length-1 column, so the result is a length-1 dataframe
In [2]: df.select(col('nice_plant').any())
Out[2]: 
   nice_plant
0        True

# col('sepal_length') inserts a length-150 column, and
# col('nice_plant').any() inserts a length-1 column. So, the
# second one gets broadcasted to the same length as the first one
In [3]: df.select(col('sepal_length'), col('nice_plant').any())
Out[3]: 
     sepal_length  nice_plant
0             5.1        True
1             4.9        True
2             4.7        True
3             4.6        True
4             5.0        True
..            ...         ...
145           6.7        True
146           6.3        True
147           6.5        True
148           6.2        True
149           5.9        True

[150 rows x 2 columns]

# Both columns we insert are length-1, so the output is too
In [4]: df.select(col('sepal_length').mean(), col('nice_plant').any())
Out[4]: 
   sepal_length  nice_plant
0      5.843333        True

# Both columns are length-150, so the output is too
In [5]: df.select(col('sepal_length'), col('nice_plant'))
Out[5]: 
     sepal_length  nice_plant
0             5.1       False
1             4.9       False
2             4.7       False
3             4.6       False
4             5.0       False
..            ...         ...
145           6.7        True
146           6.3        True
147           6.5        True
148           6.2       False
149           5.9       False

[150 rows x 2 columns]

@MarcoGorelli
Contributor Author

Current impression: this is quite promising. There may be some things to clarify regarding semantics, but I like it so far.

Thanks! This is a huge relief, I kind of needed to hear something like that

For the .filter((col("sepal_width") > col("sepal_height")) ...) example in the PR description, there's more going on - this is actually nontrivial to write down

Let's try :) I'll try to write down what df.filter(col('a') > col('b')) would resolve to:

  • col('a'): lambda df: df.loc[:, 'a']
  • col('b'): lambda df: df.loc[:, 'b']
  • col('a') > col('b'): lambda df: (lambda df: df.loc[:, 'a'])(df) > (lambda df: df.loc[:, 'b'])(df)
  • df.filter(col('a') > col('b')): df.loc[(lambda df: (lambda df: df.loc[:, 'a'])(df) > (lambda df: df.loc[:, 'b'])(df))(df)].reset_index(drop=True)

We can try this out with a concrete pandas dataframe:

In [11]: df = pd.DataFrame({'a': [1, 4, 2, 3, 5], 'b': [9,1,4,2,5]})

In [12]: df.loc[(lambda df: (lambda df: df.loc[:, 'a'])(df) > (lambda df: df.loc[:, 'b'])(df))(df)].reset_index(drop=True)
Out[12]:
   a  b
0  4  1
1  3  2

The usual rules would still apply - filtering with something not boolean would still raise

That example does show a single .collect() call but a fully-materialized dataframe; the alternative would be to use two .collect() calls and only materialize two columns

I am indeed materialising the whole dataframe though:

  • x_train gets all columns except for 'y'
  • y_train gets only column 'y'

If I was only selecting two columns, then it would have been better to first select those two columns, then collect, and then split. But there would still just be a single collect, and it would have been as late as possible.

I expect the maxim "use collect as late as possible, and ideally just once" to be close-to-optimal for most cases - counterexamples would be very welcome (and very interesting!)

@MarcoGorelli
Contributor Author

However, this depends on the underlying engine and how good it is at caching intermediate results

Not sure I see how this would work at all actually, even with infinite engineering resources

Say I do

df: DataFrame
x = df.filter(col('score') > 0).select('a', 'b').collect().to_array_object()

I now have an array assigned to x. Should the underlying library have cached anything when calling .collect()?

It depends.

  • If I was then to call
    df: DataFrame
    y = df.filter(col('score') > 0).select('c').collect().to_array_object()
    then the optimal thing to do would have been to cache the result of df.filter(col('score') > 0).
  • But if df was not used again, then the optimal thing would have been to not cache anything.

Problem is, if we're defining a Python API, then we're not in a purely compiled language. And so when .collect is called, we have no idea which of the two scenarios above we are in.

Should it be "must" here? I'd think that any library which can implement this lazily should still be free to do so?

I'll try to make this clearer, thanks.
EagerFrame doesn't need to do everything eagerly; there just needs to be a way to:

  • get a concrete value for a column reduction
  • convert to a concrete ndarray

Not saying it needs to do everything eagerly, happy to let implementations figure out the details here - but it does need to have the ability to step out of lazy mode and produce something eager
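
Concretely, a sketch of those two escape hatches (the Float64 dtype argument follows the PR's examples):

permissive = df.collect()
mean_a = permissive.get_column_by_name('a').mean()       # concrete scalar from a column reduction
arr = permissive.to_array_object(namespace.Float64())    # concrete 2D array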

@rgommers
Member

But if df was not used again, then the optimal thing would have been to not cache anything

That's kind of the same argument as is sometimes used against Python's garbage collector, but it's a thin one. If df was already allocated, it costs very little to hold onto it until you actually need the memory for something better. Surely a good caching system with the appropriate cache eviction rules will be able to manage memory usage better than manual selection of what to cache and what not.

It depends. [...] And so when .collect is called, we have no idea which of the two scenarios above we are in.

I think this is why PyTorch/JAX/TF (and Numba for NumPy) have "lazy modes" using @jit, torch.compile et al. Then you can know which situation you are in and make the optimal decision. There's a lot of prior art here I'd say.

I'd summarize the decision of using a .collect() as:

  • it's a way to signal to users that an operation cannot stay lazy, and nudge them to reconsider whether they cannot rewrite their code with a different operation that does allow staying lazy;
  • if it's not possible to keep things lazy anyway, the .collect() is extra baggage and now forces the user to make decisions about when to explicitly trigger a computation - and in principle they cannot do a job as good as the execution engine itself, because it may well be the case that "cache or recompute" depends on both the shape of the data and the machine's hardware config.

I expect the maxim "use collect as late as possible, and ideally just once" to be close-to-optimal for most cases - counterexamples would be very welcome (and very interesting!)

I think the "just once" is important, but there's a little more to it - it's not hard to write code that's worse than eager mode would be. In your example_2.py, just reuse df at the end of the function (e.g. return df.std() as a second output), and it seems to me like that will run the expensive_feature_engineering call twice (at least in Polars, since it doesn't cache by default). I think to be "performance-safe", you should write:

eager_df = df.collect()
del df

Either that or do df = eager_df.relax() as soon as you're done with the operations that you needed eager_df for.

@MarcoGorelli
Contributor Author

If df was already allocated, it costs very little to hold onto it

Sure, but what if it's not been allocated?

E.g. if I did

x = df.select(col('a').sort().head(5)).collect()

then, the dataframe could optimise it to

x = df.select(col('a').top_k(5)).collect()

under-the-hood and never even allocate df.select(col('a').sort()) - so if that part was needed in another computation, it would have to be recomputed

(e.g. return df.std() as a second output), and it seems to me like that will run the expensive_feature_engineering call twice

OK yes, it would if you then called .collect() again on the output

Maybe we can be more precise in the guidance then - only call .collect() once per dataframe. And perhaps even introduce an optional "dataframe API linter" which guards against double-collection on the same dataframe

@rgommers
Member

Sure, but what if it's not been allocated?

If it's never been materialized in memory before there's nothing to cache. But that's not really a counterpoint, it's only a "there wasn't an opportunity to miss here, hence it wasn't missed".

Maybe we can be more precise in the guidance then - only call .collect() once per dataframe. And perhaps even introduce an optional "dataframe API linter" which guards against double-collection on the same dataframe

This sounds like a good rule to me. No need for a linter I'd say - it can be implemented simply by keeping track of an _already_called boolean variable that .collect() can check.
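
A minimal sketch of what such a guard might look like (hypothetical, not part of the standard or of any implementation):

class DataFrame:
    _already_called = False  # set the first time .collect() runs

    def collect(self):
        # refuse (or warn about) a second collection of the same dataframe
        if self._already_called:
            raise RuntimeError("collect() was already called on this dataframe")
        self._already_called = True
        ...  # materialise and return a PermissiveFrame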

@MarcoGorelli
Contributor Author

If I write

df: DataFrame
x = df.select(col('a').sort().head(5)).collect().to_array_object(Int64())

then should df['a'].sort() be cached, or should .sort().head(5) be optimized to .top_k(5)?

It depends on what comes next:

  • if I then wrote y = df.select(col('a').sort().abs()), then it would've been optimal to cache col('a').sort()
  • if I then finished the function and didn't do anything else with df, then the optimal thing would've been to optimise col('a').sort().head(5) into col('a').top_k(5)

But we don't know which case we're in - and so long as we're defining a Python API, then I don't know how we can.

Anyway, sounds like we're in agreement on just documenting that .collect should be used at most once per dataframe, and as late as possible.

it can be implemented simply by keeping track of an _already_called boolean variable

It would be a bit more complex, as it would need propagating - for example

df: DataFrame  # _already_called is False
x = some_function(df).collect()  # output of some_function(df) gets _already_called True
return df.std().collect()  # allowed, as _already_called on df is still False

In any case, I don't think we can enforce that all implementations add such a flag - but we could make a dataframe-api-linter independently of what implementations do


Anyway, we're getting a bit sidetracked here, and it sounds like we're in agreement on what to do with regards to collect anyway

@shwina any thoughts on this proposal? (no hurry)

@MarcoGorelli changed the title from RFC: Separate eager and lazy APIs to Separate eager and lazy APIs on Sep 27, 2023
@anmyachev
Contributor

The current API makes it look like certain operations are possible. However, for some implementations (e.g. dataframes as frontends to SQL backends, polars lazyframes), the following would raise:

df1: DataFrame
df2: DataFrame
# problematic call:
df1.filter(df1.get_column_by_name('a') > df2.get_column_by_name('b'))

@MarcoGorelli Is this problematic call still possible using the EagerFrame and EagerColumn classes? If so, and Polars, for example, also implements them, then the problem will remain.

@MarcoGorelli
Contributor Author

MarcoGorelli commented Sep 29, 2023

Polars already has an eager dataframe class (polars.DataFrame) from which you can extract eager columns (polars.Series). On those objects, that call has no issue - no, there's no problem here

The call is only problematic for polars.LazyFrame

@anmyachev
Contributor

The call is only problematic for polars.LazyFrame

polars.LazyFrame has a join operation - why not do it implicitly when comparing?

P.S. I can see there have already been a lot of discussions about lazy calculations, so I may easily have missed the one covering the question I asked - sorry about that. If such a discussion exists, feel free to just post a link to it (if that's more convenient).

@MarcoGorelli
Contributor Author

MarcoGorelli commented Sep 30, 2023

That would be a performance footgun:

In [1]: df1 = pl.LazyFrame({'a': np.random.randn(100_000_000)})

In [2]: df2 = pl.LazyFrame({'a': np.random.randn(100_000_000)})

In [3]: %timeit _ = df1.filter(pl.col('a')>0).collect()
146 ms ± 1.85 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [4]: %timeit df1.with_row_count('idx').join(df2.with_row_count('idx'), on='idx').filter(pl.col('a_right')>0).drop('a_right', 'idx').collect()
821 ms ± 33.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

PS. no worries, I don't think there's an existing discussion with this question (but it might have been asked during the latest call - recordings are at https://github.com/data-apis/workgroup/issues/10, let us know if you need access). Overall, please do feel welcome to ask all the questions you'd like! 🙏

@anmyachev
Contributor


  1. What if something like this:
def __gt__(self, other):
  if [some condition that checks whether `self` and `other` come from different dataframes]:
    result = self.join(other)
  [comparison implementation]

In the case where the columns do not require a join, the function is still as fast as before (unless the condition itself requires a lot of computational resources). In the case where a join is needed, we would execute it implicitly inside the function (without forcing the user to execute it explicitly before the comparison). In some situations this may be extremely inefficient, but for that purpose we could provide a warning mode (off by default) which warns that an implicit join is being made and that it may make sense to do it explicitly in order to reuse the result. This way the code remains more attractive/familiar for beginners, while there are hints (via warnings) on how to write the code more efficiently. However, I don't know whether the use of warnings can be documented in the standard in principle. (A rough sketch of this idea is shown after this list.)
Alternatively, prohibit (for all implementations) comparing columns that would first need to be joined. A stricter solution, but still simple. It is very likely this was also discussed somewhere (it would be great if this pull request linked to the outcome of that discussion).

  2. I like your idea in terms of flexibility regarding possible optimizations, but I worry that the new entities and the relationships between them greatly increase the barrier to entry for new users/developers. Have you considered adding new (lazy) methods to the existing DataFrame API entities instead?
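
A rough sketch of how the implicit-join idea from point 1 might look (all names here - _parent, _align_with, _binary_op - are hypothetical, purely for illustration):

import warnings

def __gt__(self, other):
    # hypothetical: each column keeps a reference to the dataframe it came from
    if isinstance(other, Column) and other._parent is not self._parent:
        warnings.warn(
            "Comparing columns from different dataframes triggers an implicit join; "
            "consider joining explicitly so the joined result can be reused.",
            stacklevel=2,
        )
        other = self._align_with(other)  # hypothetical helper performing the join/alignment
    return self._binary_op(other, ">")   # hypothetical comparison implementation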

@MarcoGorelli
Contributor Author

MarcoGorelli commented Oct 1, 2023

Thanks for your suggestion - it's true that we could do that, but I'd like to make the case that we shouldn't

The whole "it's fast in some cases and painfully slow in others and I have no idea why" thing of pandas is a common pain-point for users - for example, can you guess which of series.apply(np.sin) and series.apply(lambda x: np.sin(x)) will be faster? (OK, you probably can, but most users can't 😄 )

In general, the Consortium is serious about avoiding performance footguns (e.g., #131 (comment), forbidding iterating over the elements of Column, not sorting by default in group_by, ...)

If filtering one dataframe by another ends up being a common use-case, then we can always consider allowing it later on - but it's much easier to start strict and relax later on than to do the other way round

@MarcoGorelli mentioned this pull request on Oct 3, 2023
@MarcoGorelli
Contributor Author

Have updated, and done some renamings as discussed last time

cc @jorisvandenbossche in case you have an opinion here (getting this right here will help sort out some issues with pd.col before we try to get this into pandas itself)

spec/API_specification/dataframe_api/__init__.py
@@ -166,11 +191,117 @@ def dataframe_from_2d_array(array: Any, *, names: Sequence[str], dtypes: Mapping
"""
...

def any_rowwise(*columns: str | Column | PermissiveColumn, skip_nulls: bool = True) -> Column:
Collaborator

Having to type things as Column | PermissiveColumn feels like a design flaw to me and will make downstream usage potentially annoying. I.e., for code that agnostically handles columns and dataframes, there are now 4 objects that people will need to type check to understand what they're working with.

Given that PermissiveColumn is a superset of Column, maybe we could use Column as a base class and inherit from it in PermissiveColumn? Similarly for DataFrame and PermissiveFrame?

Contributor Author

Is type checking the only concern here? If so, we could define (and export) a type alias, like we do for DType?

code that agnostically handles columns and dataframes

if you have completely agnostic code, then I'd suggest just accepting DataFrame and leaving it up to the caller to convert to DataFrame

After all:

  • if an end user passes in df_non_standard to your function, then df_non_standard.__dataframe_consortium_standard__() returns a DataFrame
  • if the function is only used internally, then you can control what you pass it. If you have a PermissiveFrame, you can call .relax to convert it to DataFrame

@@ -50,6 +60,21 @@
implementation of the dataframe API standard.
"""

def col(name: str) -> Column:
Collaborator

Since this isn't bound to a DataFrame, for libraries other than polars and ibis this will be a new concept that will require implementation and maintenance. Do you have a sense for what this would look like for Pandas for example?

Contributor Author

Yup take a look here data-apis/dataframe-api-compat#13

It's surprisingly simple to just add the syntax

Collaborator

This doesn't really look simple to me: https://github.com/data-apis/dataframe-api-compat/blob/76284fa158ffe0f21ab1758f46caf523427077a3/dataframe_api_compat/pandas_standard/pandas_standard.py#L81-L417

My concern is that we went from pushing for changes in polars to now pushing for changes in most other dataframe libraries. I would love some thoughts from other dataframe library maintainers here.

@MarcoGorelli
Contributor Author
Oct 5, 2023

It's just a matter of recording some lambda calls and then unpacking them - e.g.

        df: DataFrame
        col = df.__dataframe_namespace__().col
        df = df.filter(col('a') > col('b')*2)

becomes

        df: pd.DataFrame
        col_a = lambda df: df.loc[:, 'a']
        col_b = lambda df: df.loc[:, 'b']
        col_b_doubled = lambda df: col_b(df) * 2
        mask = lambda df: col_a(df) > col_b_doubled(df)
        df = df.loc[mask(df)]

If this lives in a separate namespace, then there's not really any extra maintenance that needs doing in the main implementation
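
To make the "recording and unpacking" concrete, here is a self-contained sketch in plain pandas terms (illustrative only - this is not the actual dataframe-api-compat code, and only two operators are shown):

import pandas as pd

class Expr:
    # records a function from a pandas DataFrame to a pandas Series
    def __init__(self, call):
        self._call = call

    def __mul__(self, other):
        return Expr(lambda df: self._call(df) * other)

    def __gt__(self, other):
        if isinstance(other, Expr):
            return Expr(lambda df: self._call(df) > other._call(df))
        return Expr(lambda df: self._call(df) > other)

def col(name):
    return Expr(lambda df: df.loc[:, name])

df = pd.DataFrame({'a': [1, 4, 2], 'b': [3, 1, 5]})
mask = col('a') > col('b') * 2      # nothing is evaluated yet
df = df.loc[mask._call(df)]         # "unpacking": evaluate against the concrete dataframe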

I would love some thoughts from other dataframe library maintainers here.

For a start, it might be added to pandas (regardless of what the consortium does), check Joris' lightning talk from euroscipy: https://youtu.be/g2JsyNQgcoU?si=ax0ZINFQINf9a5jv&t=512

Contributor Author

My concern is that we went from pushing for changes in polars to now pushing for changes in most other dataframe libraries

Also, this isn't "apples to apples" : asking Polars to add complexity to the query optimiser isn't comparable to keeping track of lazy column calls in a separate namespace

Anyway, thanks for your input, and I hope you're keeping well on parental leave!

@cbourjau
Contributor

Sorry for the late response! I am looking at this proposal through the lens of an inherently lazy, ONNX-based implementation that can never evaluate eagerly.

I am worried the current proposal encourages users to write code with eager collections in order to optimize the performance of some particular implementations. The plotting example is an instance where the collection is unavoidable since the core business logic needs to process eager values. The other examples can technically remain lazy as far as I can tell.

It seems to me that we have two different kinds of "collections". One is a "collection hint" for the underlying implementation, and the other is an actual collection resulting in values in memory.

From our perspective, it would be very beneficial to express this difference in the API. We would be happy to ignore any "collection hints" and would throw exceptions for any true "collect" call.

Another concern for me is the predictability and semver stability of a library's API. It would be great if we could type-hint on function boundaries that a function taking a DataFrame never eagerly computes a value. This way, changing a lazy implementation to an eager one would require an explicit semver-breaking API change.

@@ -4,11 +4,12 @@

def my_dataframe_agnostic_function(df_non_standard: SupportsDataFrameAPI) -> Any:
    df = df_non_standard.__dataframe_consortium_standard__(api_version='2023.09-beta')
    xp = df.__dataframe_namespace__()
Contributor

For consistency with the other examples:

Suggested change
xp = df.__dataframe_namespace__()
namespace = df.__dataframe_namespace__()


for column_name in df.column_names:
    if column_name == 'species':
        continue
    new_column = df.get_column_by_name(column_name)
    new_column = xp.col(column_name)
Contributor

Suggested change
new_column = xp.col(column_name)
new_column = namespace.col(column_name)


df = expensive_feature_engineering(df)

df_permissive = df.collect()
@cbourjau
Contributor
Oct 12, 2023

My understanding of the current discussion is that a call to collect may but is not forced to do any eager computation. Is this correct?

Comment on lines +35 to +38
x_train = train.drop_columns("y").to_array_object(namespace.Float64())
y_train = train.get_column_by_name("y").to_array_object(namespace.Float64())
x_val = val.drop_columns("y").to_array_object(namespace.Float64())
y_val = val.get_column_by_name("y").to_array_object(namespace.Float64())
@cbourjau
Contributor
Oct 12, 2023

The proposal reads:

PermissiveFrame
Like DataFrame, but must be able to support eager evaluation when evaluating to_array_object

This would render this function unusable for fully lazy implementations such as the ONNX-based one that we are working on, but there is no technical reason why this function can't be lazy. Note that our lazy data frame implementation is built on top of a lazy Array API implementation. We would therefore like to return a lazy array from the to_array_object call.

Sorry, I overlooked the call to fit further down 😨 . Fitting will usually indeed require actual values. For the ONNX use case, we are primarily interested in lazily evaluating predict and transform calls.

]
)
results_df: DataFrame = namespace.dataframe_from_2d_array(
results,
Contributor

Unrelated to this proposal, but for our use case we would also need a way to instantiate a lazy data frame from a lazy results tensor.

@cbourjau
Contributor

I think the "just once" is important, but there's a little more to it - it's not hard to write code that's worse than eager mode would be. In your example_2.py, just reuse df at the end of the function (e.g. return df.std() as a second output), and it seems to me like that will run the expensive_feature_engineering call twice (at least in Polars, since it doesn't cache by default). I think to be "performance-safe", you should write:

A real-world example that would take advantage of "cached" results would be our ONNX-based implementation. Consider the following (pseudo-code) example:

# Construct a lazy data frame. 
# The input columns are of type int64 and will be provided as user input at inference time
df = lazy_standard_df_from_user_input({"a": "int64", "b": "int64"})
features = expensive_feature_engineering(df)
std = features.std()

# Build the ONNX model
model = build_onnx_artifact(inputs=df, outputs={"features": features, "std": std}) 
import onnx
onnx.save(model, "my_model.onnx")

Running this ONNX model with user provided input will run the feature engineering step only once in order to produce features and std.

@cbourjau
Contributor

Thanks for the very fruitful discussion we just had! I'd like to provide an example that illustrates how function signatures of the current proposal can be used to indicate that functions require eager implementations, but not to signify that they are compatible with lazy implementations.

Let's consider the eager case first:

def do_things(df: PermissiveDataFrame) -> PermissiveDataFrame:
    ...

A function of this type would make it very clear that whatever happens inside it may require eager computation. This is fine since some tasks such as plotting or fitting a model are usually inherently "eager".

However, the current proposal would also allow the following implementation:

def do_things_sneaky(df: DataFrame) -> DataFrame:
    return do_things(df.collect()).relax()

This is rather unfortunate for two reasons:

  1. It is unclear if do_things_sneaky works with a standard compliant lazy implementation until the user actually tries to call it
  2. Even if do_things_sneaky works today, it wouldn't be an (obvious) semver-breaking change to break it for lazy libraries with the next patch release.

Might it be possible to have an inheritance-based solution where PermissiveDataFrame inherits from DataFrame rather than allowing the user to switch back and forth? Both PermissiveDataFrame and DataFrame may still have a way to hint to the underlying implementation that a collection may be in order.

@MarcoGorelli
Contributor Author

Thanks for your comments and discussion @cbourjau , really productive and helpful!

From yesterday's call, I think (please correct me if I'm wrong) that we're in agreement that there needs to be a way to say "values need materialising" - in the onnx-based dataframe case, you'd be able to use that to determine whether:

  • you can keep the whole graph lazy and export it to an onnx model
  • you need to materialise some values to pass them to a 3rd party library

An example of when the latter is needed could be fitting an xgboost model (or indeed, as of 2023, any ML model...I'm not aware of any library which can train a lazy model with a lazy array, but please do correct me if I'm wrong).

.collect is here to try to address that second need - and it should only be used in those cases. Anything else should just use the DataFrame (execution-agnostic) API, and a library like your onnx-based dataframe could export the whole thing.

Might it be possible to have an inheritance-based solution where PermissiveDataFrame inherits from DataFrame rather than allowing the user to switch back and forth?

This is a good point. Would just removing .relax from PermissiveDataFrame have the same effect? If so, maybe we could just go with that.
If someone needs to create a DataFrame, they can always use the namespace.from_* constructors


Do you have any thoughts on the namespace.col syntax?

@cbourjau
Contributor

cbourjau commented Oct 18, 2023

From yesterday's call, I think (please correct me if I'm wrong) that we're in agreement that there needs to be a way to say "values need materialising" - the onnx-based dataframe case, you'd be able to use that to determine whether:

  • you can keep the whole graph lazy and export it to an onnx model
  • you need to materialise some values to pass them to a 3rd party library

Technically, the ONNX standard does have provisioning for training, but I'd say it is out of scope. I have never used it, but I think it would essentially require the Array/Dataframe API to standardize training functions such as def train_glm(X: Array, y: Array) -> Array: ....

Either way, your takeaway is correct. It would be great if it were possible to differentiate those two points by looking at a function signature.

This is a good point. Would just removing .relax from PermissiveDataFrame have the same effect? If so, maybe we could just go with that.

Going from PermissiveDataFrame to DataFrame is not an issue for our use case. Any function that takes a lazy DataFrame may also accept PermissiveDataFrame objects as far as I can tell. The other direction is the issue for us. I'd love a type hint that signals: "Whatever happens in this function cannot require any eager computation".


Do you have any thoughts on the namespace.col syntax?

I think it looks good and given that it works for Polars I don't see an issue for our lazy use case.

@MarcoGorelli
Contributor Author

I'd love a type hint that signals: "Whatever happens in this function cannot require any eager computation".

Yes, I would also love to have this. Gonna think about this one a little more 🤔

@MarcoGorelli
Contributor Author

@cbourjau just to clarify - would a type hint suffice, or would you need it to be different at runtime?

I'd really appreciate it if you could contribute an example to the repo (#281) of some functions you'd envisage people using the API for - I'm particularly interested in examples you may have of functions for which your library (what's it called, btw?) could stay completely lazy and have everything exported at the end

@shwina
Contributor

shwina commented Oct 18, 2023

Thank you @MarcoGorelli for the exploration. Admittedly, it is daunting to review a 2000+ line PR and I have not done a line-by-line review. It's my fault for not following this PR more closely since its earlier iterations.

That being said, thank you for the details in the PR description - it does help understand the motivation behind this change, and gives a good idea about the implementation.

While I can see how the changes introduced here solve the problem posed, I do feel they negatively impact the complexity and teachability of the standard.

From a cuDF perspective, our long-term hope is not to maintain a plethora of APIs (native, standard DataFrame, standard PermissiveDataFrame), but rather a single API that looks more and more like the standard over time. I worry that introducing lazy semantics into the standard will make that increasingly difficult/impossible.


No worries if it's rejected, but I hope that the rejection can come with an alternative solution to the problems with the status quo.

I am more than appreciative and awed at the effort that has gone into this solution, so I do not suggest an alternative lightly:

The goal is to disallow operations between columns originating from different DataFrames, a la Polars. As you say, it can be a performance footgun.

Are folks open to simply saying that the operation is undefined by the existing standard? Implementations can choose to raise. The "blessed" way to do this would be:

lhs = df1.select("a").to_array()
rhs = df2.select("b").to_array()
df.filter(column_from_1d_array(lhs > rhs))

It closes no doors for anyone and allows eager implementations to remain relatively untouched.

@MarcoGorelli
Contributor Author

Thanks for your feedback

The "blessed" way to do this would be

This wouldn't quite work - df1.select("a") returns a DataFrame, not a Column

@shwina
Contributor

shwina commented Oct 18, 2023

Ah, right you are. I meant get_column_by_name
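
Presumably the corrected snippet would then read something along these lines (a sketch, keeping the method names from the earlier comment):

lhs = df1.get_column_by_name("a").to_array()
rhs = df2.get_column_by_name("b").to_array()
df1.filter(column_from_1d_array(lhs > rhs))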

@MarcoGorelli
Contributor Author

rather a single API that looks more and more like the standard over time

Just for my understanding, are you expecting that future cuDF syntax will look like this

    lineitem = lineitem.assign(
        [
            (
                lineitem.get_column_by_name("l_extended_price")
                * (1 - lineitem.get_column_by_name("l_discount"))
            ).rename("l_disc_price"),
            (
                lineitem.get_column_by_name("l_extended_price")
                * (1 - lineitem.get_column_by_name("l_discount"))
                * (1 + lineitem.get_column_by_name("l_tax"))
            ).rename("l_charge"),
        ]
    )

even though pandas (main namespace) won't*?

*it's not my decision. You're welcome to propose this to pandas, I just thought it would be right to give you a heads-up that this is almost certainly going to be unanimously rejected by other pandas devs, please don't shoot the messenger

The goal is to disallow operations between columns originating from different DataFrames

There's more to it than that:

  • column reductions
  • if/when to materialise (see conversation above with @cbourjau - would a function stay completely lazy and be exported to onnx, or would some values need materialising because some library needs them now?)
  • not needing to compute things more times than necessary when filtering

Are folks open to simply saying that the operation is undefined by the existing standard? Implementations can choose to raise.

I'm open to that. In fact, I'd suggested it in #224, but the question went unanswered

@shwina
Contributor

shwina commented Oct 19, 2023

Just for my understanding, are you expecting that future cuDF syntax will look like this

There's no reason cuDF couldn't introduce the get_column_by_name method while also keeping its nice getitem interface. The end user would use the latter. Third party library developers who care about cross library compat could use the former.

@MarcoGorelli
Contributor Author

Sure, if you're happy to have both then that sounds fine


I don't think that having an API which makes certain operations look possible but which raise in some implementations makes for a particularly good user experience, but it seems that people are ok with that

We also don't solve the whole "when to materialise?" issue with the status quo, but people don't seem to consider that a real issue

Closing then - I'm skeptical about the current API's usability and usefulness, but I may well be wrong, so let's just wait for people to try using it and to see if they report issues

@cbourjau
Contributor

cbourjau commented Oct 23, 2023

Apologies for the delayed reply! I had some very busy days! I created a PR with an example of how we envision using lazy ONNX-based data frames to trace inference logic in sklearn-like pipelines.

Concerning the previous discussion, I hope that example may illustrate the benefits of an API that could signal that the transform methods are guaranteed not to require a collection. I don't have a good solution at the moment, but I hope that this discussion may be the seed of a future one!

@cbourjau just to clarify - would a type hint suffice, or would you need it to be a different at runtime?

I think a hint would go a long way since it would make semver-breaking change quite obvious.

I'd really appreciate it if you could contribute an example to the repo (#281) of some functions you'd envisage people using the API for - I'm particularly interested in examples you may have of functions for which your library (what's it called, btw?) could stay completely lazy and have everything exported at the end

I'm afraid our library is not yet open source, but I hope that will change in the next couple of months!

@rgommers
Member

Just a brief comment, since it's really tricky to collect (pun intended) and express all my thoughts. First, this is a hard and complex topic, and I think closing this PR was a bit abrupt and probably premature. There's a desire to move ahead and perhaps some time constraints to get to a "ready to adopt" state, but this is the single most difficult design topic so far, so we should be ready to spend a bit more time here. I don't want to hit the "reopen" now, but we should discuss and disentangle things here. Leaving this unresolved and then putting it in some of the library-specific implementations or compat layer isn't going to help us in the long run.

There are at least 3 different topics here:

  1. Syntax preferences. There's a focus on get_column_by_name (continued in Feature request: get_column_by_name impairs readability #287), but there are others. It's a conciseness vs. explicitness thing. I think this is the easiest one to resolve, and most folks are roughly aligned I think. Syntax / method naming is unrelated to lazy/eager.
  2. Execution "hints" vs. being prescriptive. We should be clear that, whatever we add here (if anything), these are hints that individual libraries may use as execution triggers, but they don't have to. The ONNX examples reinforced that point.
  3. Conceptual design. This is the hardest point. There's significant tension here between lazy implementations and eager ones. I don't think we have a solution yet that everyone is happy with. It should be possible to find one though, the differences aren't small, but also not insurmountably large.

My point of view is that we can resolve (1) and (2) quickly, and then with those distractions gone it may be easier to focus on (3).

@MarcoGorelli
Contributor Author

Thanks for your comments

My point of view is that we can resolve (1) and (2) quickly

How do you propose to resolve (2) quickly?

The use-case I'm particularly interested in addressing is the same one Christian had brought up - you need to materialise data for fit, but can stay lazy for transform / predict

@rgommers
Member

you need to materialise data for fit, but can stay lazy for transform / predict

None of those are actually in this API. What is happening there is that fit is going to be an iterative algorithm, which is going to calculate some loss metric that is being optimized. So what it's going to boil down to is a some_scalar_metric > something_else comparison - and at that point, the Python float you need is the one thing that cannot be kept lazy. Hence that is the place where an ONNX-based (or any maximally-lazy) dataframe implementation will raise.

All this is fully compatible with .collect being a hint rather than an execution directive.

@MarcoGorelli
Contributor Author

Would it be acceptable to add .collect and explicitly note that it's to be interpreted as a hint, rather than as an execution directive? If so, then together with #298 (comment), then we may have a solution here
