
feat: "carefully" allow for dask Expr that modify index #743

Open · wants to merge 15 commits into base: main
Conversation

FBruzzesi (Member)

What type of PR is this? (check all applicable)

  • πŸ’Ύ Refactor
  • ✨ Feature
  • πŸ› Bug Fix
  • πŸ”§ Optimization
  • πŸ“ Documentation
  • βœ… Test
  • 🐳 Other

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

If you have comments or can explain your changes, please do so below.

Pretty dangerous stuff to work around the Dask index.

To check that the implementation works as expected, I implemented both sort (different index but same length) and drop_nulls (different index due to a different length).
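
For reference, a rough usage sketch of the two cases exercised by the tests (hypothetical example data; narwhals with a Dask-backed frame):

import pandas as pd
import dask.dataframe as dd
import narwhals as nw

ddf = dd.from_pandas(pd.DataFrame({"a": [3.0, None, 1.0]}), npartitions=1)
df = nw.from_native(ddf)

# sort: output has the same length as the input, but a reordered index
df.select(nw.col("a").sort())

# drop_nulls: output has a different length, hence a different index
df.select(nw.col("a").drop_nulls())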

github-actions bot added the enhancement label on Aug 8, 2024
FBruzzesi changed the title from 'RFC feat: dask index workaround' to 'RFC feat: dask index "hacking"' on Aug 8, 2024
Comment on lines +41 to +42
result = df.select(nw.col("a").drop_nulls(), nw.col("d").drop_nulls())
expected = {"a": [1.0, 2.0], "d": [6, 6]}
FBruzzesi (Member Author)

Sadly this broadcast is not working, as drop_nulls does not return a scalar. I would consider this an edge case and focus on the broader support.

anopsy (Member) commented Aug 8, 2024

[image attachment]

MarcoGorelli (Member)

thanks for trying this - i'll test it out and see if there's a perf impact

FBruzzesi mentioned this pull request on Aug 9, 2024

col_order = list(new_series.keys())

left_most_series = next( # pragma: no cover
FBruzzesi (Member Author)

This is guaranteed not to end up in a StopIteration error: if everything were a scalar, the previous block would have been entered and returned early.
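
A toy illustration of that guarantee (hypothetical names, not the actual narwhals code): the all-scalar case returns before the generator is consumed, so next() always finds at least one non-scalar value.

def pick_left_most(new_series: dict) -> object:
    def is_scalar(v: object) -> bool:
        return v is None or isinstance(v, (int, float, bool, str))

    if all(is_scalar(v) for v in new_series.values()):
        # the "everything is a scalar" branch: handled and returned early,
        # so the next() below is never reached in that case
        return None
    return next(v for v in new_series.values() if not is_scalar(v))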

MarcoGorelli (Member)

we've got the notebooks in tpch/notebooks, the first two support Dask - fancy running them with this branch and seeing if there's any perf impact?

FBruzzesi (Member Author) commented Aug 21, 2024

Hey @MarcoGorelli, I have been giving this feature (which I would still love to see) another thought. Here is a simple idea to get partial support without a loss of performance:

  • sort is the only method that changes the index yet results in an output of the same length. Instead of changing the index of each series, we can handle this specifically in sort, namely by re-assigning the original index:
    def sort(self: Self, *, descending: bool = False, nulls_last: bool = False) -> Self:
        na_position = "last" if nulls_last else "first"

        def func(_input: Any, ascending: bool, na_position: str) -> Any:  # noqa: FBT001
            name = _input.name
            result = _input.to_frame(name=name).sort_values(
                by=name, ascending=ascending, na_position=na_position
            )[name]
            # re-attach the original index so the sorted series can still be
            # aligned with the rest of the frame
            return de._expr.AssignIndex(result, _input.index)

        return self._from_call(
            func,
            "sort",
            not descending,
            na_position,
            returns_scalar=False,
        )
  • All the other methods that change the index do so by reducing the length of the series. In my working experience and in the TPC-H queries they are mostly used before a reduction or in isolation, so we should not worry about changing their index. Example:
    df.select(
        head_sum=pl.col("a").head().sum(),
        tail_mean=pl.col("a").tail().mean(),
    )
  • What is left unsupported, you may ask? Multiple length-changing operations that end up with the same length, different from the original, won't be possible. Example:
    df.select(
        head=pl.col("a").head(),
        tail=pl.col("a").tail(),
    )

What do you think?

FBruzzesi marked this pull request as ready for review on August 27, 2024, 07:56
FBruzzesi (Member Author) commented Aug 27, 2024

@MarcoGorelli I am tagging this as ready for review as I re-worked it a bit more.

The TL;DR is:

  • sort is kind of special, as it modifies the index but returns a Series of the same length as the original one, so in that specific case I manually re-assign the index
  • for all other methods, I added a boolean flag to DaskExpr called modifies_index, and:
    • such an expr is not allowed in with_columns
    • in select it should be allowed only if there are no other exprs or a reduction follows (I still need to address both of these cases).

Yet before developing further, I would like some feedback on how viable this approach looks and whether we want to move forward with it 🙏🏼
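
To make the intent concrete, here is a rough sketch of the checks described above; the names (validate_exprs, _modifies_index, _returns_scalar) are hypothetical, not the actual narwhals internals:

def validate_exprs(exprs: list, *, context: str) -> None:
    # context is either "with_columns" or "select"
    for expr in exprs:
        if not getattr(expr, "_modifies_index", False):
            continue
        if context == "with_columns":
            # a length/index-changing expression cannot be aligned back
            # onto the original frame
            msg = "Expr that modifies the index is not allowed in with_columns"
            raise ValueError(msg)
        # in select: fine only as the sole expression, or when a reduction
        # collapses it to a scalar
        if len(exprs) > 1 and not getattr(expr, "_returns_scalar", False):
            msg = (
                "Expr that modifies the index must be the only selection "
                "or be followed by a reduction"
            )
            raise ValueError(msg)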

Comment on lines 713 to 724
def head(self: Self, n: int) -> Self:
    return self._from_call(
        lambda _input, _n: _input.head(_n, compute=False),
        "head",
        n,
        returns_scalar=False,
        modifies_index=True,
    )

def tail(self: Self, n: int) -> Self:
    return self._from_call(
        lambda _input, _n: _input.tail(_n, compute=False),
FBruzzesi (Member Author) commented Aug 27, 2024

So... head has an npartitions param which can be set to -1 to scan all partitions, while tail does not. This means that if we have multiple partitions, this implementation of tail may not return what we expect.

MarcoGorelli (Member)

I haven't seen cases where users actually want more than 1 partition if they call head or tail, tbh. Yes, this is technically an issue, but not something I've encountered in the wild.
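
To illustrate the asymmetry discussed in this thread, a small sketch with made-up data:

import pandas as pd
import dask.dataframe as dd

ddf = dd.from_pandas(pd.DataFrame({"a": range(10)}), npartitions=3)

# head can be asked to look across every partition via npartitions=-1
ddf["a"].head(8, npartitions=-1, compute=False)

# tail only ever looks at the last partition, so if that partition holds
# fewer than 8 rows this will not return 8 rows
ddf["a"].tail(8, compute=False)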

MarcoGorelli (Member)

thanks @FBruzzesi !

to be honest I don't know about using such private methods, it makes me feel slightly uneasy - @phofl do you have time/interest in taking a look? specifically the de._collection.Series(de._expr.AssignIndex(result, _input.index)) part in narwhals/_dask/expr.py

I think that for SQL engines (like DuckDB, which hopefully we can get to eventually), operations like df.select(nw.col('a').sort(), nw.col('b')) would be problematic anyway, so I don't think it'd be an issue to leave them outside the Narwhals area of support.

narwhals/_dask/dataframe.py: outdated review thread (resolved)
result = _input.to_frame(name=name).sort_values(
    by=name, ascending=ascending, na_position=na_position
)[name]
return de._collection.Series(de._expr.AssignIndex(result, _input.index))
phofl

Yeah I can second @MarcoGorelli here, please don't do this. We are sorting a Series (i.e. a single column from the df), correct?

I would:

tmp = _input.reset_index().sort_values(...)
result = tmp[_input.name]
result.index = tmp[_input.index.name]  # i.e. "use the index name of _input"
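
Spelled out end to end, that suggestion might look roughly like this (a sketch with made-up data, assuming an unnamed index so that reset_index yields a column literally called "index", and a Dask version whose Series.index setter accepts co-aligned values):

import pandas as pd
import dask.dataframe as dd

ddf = dd.from_pandas(pd.DataFrame({"a": [3.0, 1.0, 2.0]}), npartitions=2)
s = ddf["a"]

# move the original index into a regular column, then sort by the values
tmp = s.reset_index().sort_values(by="a")
result = tmp["a"]
# re-attach that column as the index; both sides are projections of the
# same `tmp`, which is what keeps them co-aligned
result.index = tmp["index"]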

phofl

Or do you want to keep the Index of the input?

If yes, this is fundamentally a bad idea in Dask: it will shoot you in the foot all over the place. You have zero guarantees that the partitions keep their lengths when sorting (it is a lot more likely that they do not), so this is bound to fail in all kinds of places.

FBruzzesi (Member Author) commented Aug 28, 2024

Hey @phofl thanks for taking the time.

We are sorting a Series (i.e. a single column from the df), correct?

Yes indeed, but with the final goal of potentially adding it as a new column to the original dataframe, and that's where the index misalignment comes into play.

I will try the approach you are suggesting, which is not too far off from what is already implemented, and see if everything else falls into place πŸ™πŸΌ

Edit: it just ends up raising:

AssertionError: value needs to be aligned with the index

Traceback
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[60], line 3
      1 tmp = df_dd["a"].to_frame(name="a").sort_values("a")
      2 result = tmp["a"]
----> 3 result.index = df_dd["a"].index

    655 @index.setter
    656 def index(self, value):
--> 657     assert expr.are_co_aligned(
    658         self.expr, value.expr
    659     ), "value needs to be aligned with the index"
    660     _expr = expr.AssignIndex(self, value)
    661     self._expr = _expr

AssertionError: value needs to be aligned with the index

FBruzzesi (Member Author)

Hi @phofl, apologies for pulling you into the mix once more.

I have a few questions, in order to make this work and to guarantee that we don't end up with something that is

fundamentally a bad idea in Dask, it will shoot you in the foot all over the place.

  • How can we test for when Dask will shoot us in the foot if we do something bad?
  • The TL;DR of the latest approach is that if a method changes the index, then it either has to be followed by a reduction or be a single selection. Examples (see also the sketch after this list):
    • Reductions:
      df.select(
          head_sum=pl.col("a").head().sum(),
          tail_mean=pl.col("a").tail().mean(),
      )
      which would translate to something like dd.concat([df["a"].head().sum(), df["a"].tail().mean()])
    • Single selection:
       df.select(
          head=pl.col("a").head(),
       )
    In light of the first question, what do you think about this approach? Is it a fundamentally bad idea?
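
A small sketch of the reduction case with made-up data, showing that each branch collapses to a scalar so its intermediate index never matters (this is not the dd.concat translation itself, just the individual branches):

import pandas as pd
import dask.dataframe as dd

ddf = dd.from_pandas(pd.DataFrame({"a": [1, 2, 3, 4, 5]}), npartitions=2)

# each expression ends in a reduction, so whatever index head/tail produce
# is immediately collapsed away
head_sum = ddf["a"].head(3, npartitions=-1, compute=False).sum()
tail_mean = ddf["a"].tail(2, compute=False).mean()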

FBruzzesi changed the title from 'RFC feat: dask index "hacking"' to 'feat: "carefully" allow for dask Expr that modify index' on Sep 16, 2024