Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Method to get underlying object #108

Closed
MarcoGorelli opened this issue Mar 14, 2023 · 17 comments · Fixed by #110
Closed

Method to get underlying object #108

MarcoGorelli opened this issue Mar 14, 2023 · 17 comments · Fixed by #110

Comments

@MarcoGorelli
Copy link
Contributor

Does there need to be a way to get back the underlying object?

I'm thinking about the pyjanitor clean_names example

Some user starts with a DataFrame (say, a pandas one) df, and calls clean_names(df). They would probably expect to get back what they started with, without caring that PyJanitor internally used the standard.

For example, PyJanitor could do

def clean_names(df, ...):
    df = dataframe_standard(df)  # or whatever the way to enable the standard will be
    df = ...  # clean names
    return df.dataframe  # return same type of DataFrame as was passed

So, should some .dataframe property be added, so that the library can "opt-out" of the standard once it has done all its work?

@rgommers
Copy link
Member

I think this is gh-85?

@rgommers
Copy link
Member

Or if it's about regular API methods/functions, then those should already give back instances of the correct dataframe type, right? There's no separate object to convert to/from.

@MarcoGorelli
Copy link
Contributor Author

MarcoGorelli commented Mar 14, 2023

My understanding was that the standard would be implemented as a separate class, with which would wrap the DataFrame, something like

class PandasStandardDataFrame:

    def __init__(self, df):
        _validate_df(df)  # check all columns are strings, no duplicate columns
        self.df = df

    def drop_column(self, label):
        return PandasDataFrame(self.df.drop(label, axis=1))

    def get_columns_by_name(self, names):
        if not isinstance(names, list) and not all(isinstance(name, str) for name in names):
            raise TypeError("Expected list of str")
        return PandasDataFrame(self.df.loc[:, names])
df  # pandas dataframe
df_standard = PandasStandardDataFrame(df)  # enable standard mode
df_standard = df_standard.drop_column("y")  # use some method from the standard
df_standard = df_standard.get_columns_by_name(["x_0", "x_1"])  # keep using methods from the standard
df = df_standard.df  # go back to having a pandas dataframe

If drop_column were to return a DataFrame of the correct type, then are you suggesting that the workflow be

df  # pandas dataframe
df = PandasStandardDataFrame(df).drop_column("y")  # use some method from the standard
df = PandasStandardDataFrame(df).get_columns_by_name(["x_0", "x_1"])  # use another method from the standard

?

@rgommers
Copy link
Member

My understanding was that the standard would be implemented as a separate class

Ah fair enough, you are right here. I was applying my array intuition too much - we really need the separate dataframe class because we design with methods not functions.

So the question is how to spell what you need here. You suggested df.dataframe or df_standard.df, so attribute access. I'm thinking that this isn't something that one expects on the standard dataframe (at least it won't be part of the standard itself) and this would make sense as a regular constructor. So for pandas, pd.DataFrame(df_standard)?

@jbrockmendel
Copy link
Contributor

Is there a viable way to do this going through the interchange protocol?

@rgommers
Copy link
Member

That's pretty expensive though, having to iterate through memory. This is within a single library so I'd just use a private ._df_pandasbase attribute and then the constructor like:

class DataFrame():  # pd.DataFrame
    def __init__(...):
        if hasattr(df, '_df_pandasbase'):
            return df._df_pandasbase

@jbrockmendel
Copy link
Contributor

id like to find an alternative that fits with the "assume pandas changes nothing" mantra

@jorisvandenbossche
Copy link
Member

So for pandas, pd.DataFrame(df_standard)?

I think accessing it from the object might be easier (with an attribute or method), because otherwise you need to know which namespace and function to use? For pandas it could be pd.DataFrame, but what is it for some standard dataframe from a library you don't know?

@jorisvandenbossche
Copy link
Member

Unless it would be a method in the "standard namespace" (if we will have something like that)

@rgommers
Copy link
Member

I think accessing it from the object might be easier (with an attribute or method),

That bakes in the assumption though that the "native dataframe" exists, and that there's a 1:1 relationship between any implementer of the standard and some other underlying dataframe object within the same library. I'm not sure that that assumption will hold - say you write a new library that only implements the standard, natively, plus the interchange protocol to transform itself into any other library's df object.

Or if you'd have a .df attribute in such cases, would you point it at self?

@jorisvandenbossche
Copy link
Member

Yes, I think if the standard dataframe is the "native" object itself, it can just return itself, that doesn't seem like a problem (similarly like the interchange object also returns itself in __dataframe__)

But maybe we should also first think about the question: as a user of the standard API, how do you get a "standard" object given a random dataframe?

@MarcoGorelli
Copy link
Contributor Author

Let's keep the question of opting into the standard for a separate issue

Or if you'd have a .df attribute in such cases, would you point it at self?

Sounds fine

at least it won't be part of the standard itself

In order for this to be usable, I don't see how it can not be part of the standard - otherwise how can a library implement a function like

def my_fancy_function(df: AnyDataFrame):
    standard_df = dataframe_standard(df)  # we still need to agree on how to opt-in to the standard

    standard_df = standard_df.get_columns_by_name(...)  # bunch of operations which use the standard

    return standard_df.df

and be guaranteed that it'll work for any DataFrame?

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Mar 15, 2023

Let's keep the question of opting into the standard for a separate issue

We can certainly discuss it separately, but I think the exact answer for this issue could depend on it. For example, if we define a namespace, we could also have a function in that namespace instead of a method or attribute.
(anyway, I also don't think the exact API for how to get the object is very important, the relevant discussion in this issue is probably about the concept that this can be useful essential to do)

@MarcoGorelli
Copy link
Contributor Author

we could also have a function in that namespace instead of a method or attribute.

sure, but it would still need to be the same for all DataFrame libraries taking part, right? Otherwise, in the example in #108 (comment) , how does one write DataFrame-agnostic code?

I'd have thought this was essential, not just useful

@jorisvandenbossche
Copy link
Member

I'd have thought this was essential, not just useful

Yes, to be clear I fully agree with that. Updated my comment above to not use the mere "useful" ;)

@rgommers
Copy link
Member

Okay, so there is agreement we do need this in some form. Do we think this is always an O(1) operation? If so, an attribute seems reasonable. If it can trigger computation, it should be either a method, or a way to retrieve a constructor function as in gh-85 (that's more complex, which is probably justified for the interchange protocol but not here).

@rgommers
Copy link
Member

Discussed today: folks agreed that this should exist and be cheap. Hence: an attribute .dataframe.

This was referenced Mar 20, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants