Method to get underlying object #108

MarcoGorelli · 2023-03-14T16:02:04Z

Does there need to be a way to get back the underlying object?

I'm thinking about the pyjanitor clean_names example

Some user starts with a DataFrame (say, a pandas one) df, and calls clean_names(df). They would probably expect to get back what they started with, without caring that PyJanitor internally used the standard.

For example, PyJanitor could do

def clean_names(df, ...):
    df = dataframe_standard(df)  # or whatever the way to enable the standard will be
    df = ...  # clean names
    return df.dataframe  # return same type of DataFrame as was passed

So, should some .dataframe property be added, so that the library can "opt-out" of the standard once it has done all its work?

rgommers · 2023-03-14T16:09:39Z

I think this is gh-85?

rgommers · 2023-03-14T16:11:19Z

Or if it's about regular API methods/functions, then those should already give back instances of the correct dataframe type, right? There's no separate object to convert to/from.

MarcoGorelli · 2023-03-14T17:25:04Z

My understanding was that the standard would be implemented as a separate class which would wrap the DataFrame, something like

class PandasStandardDataFrame:

    def __init__(self, df):
        _validate_df(df)  # check all columns are strings, no duplicate columns
        self.df = df

    def drop_column(self, label):
        # re-wrap the result so the caller stays in "standard mode"
        return PandasStandardDataFrame(self.df.drop(label, axis=1))

    def get_columns_by_name(self, names):
        # reject anything that is not a list, or a list containing non-str items
        if not isinstance(names, list) or not all(isinstance(name, str) for name in names):
            raise TypeError("Expected list of str")
        return PandasStandardDataFrame(self.df.loc[:, names])

df  # pandas dataframe
df_standard = PandasStandardDataFrame(df)  # enable standard mode
df_standard = df_standard.drop_column("y")  # use some method from the standard
df_standard = df_standard.get_columns_by_name(["x_0", "x_1"])  # keep using methods from the standard
df = df_standard.df  # go back to having a pandas dataframe

If drop_column were to return a DataFrame of the correct type, then are you suggesting that the workflow be

df  # pandas dataframe
df = PandasStandardDataFrame(df).drop_column("y")  # use some method from the standard
df = PandasStandardDataFrame(df).get_columns_by_name(["x_0", "x_1"])  # use another method from the standard

?

rgommers · 2023-03-14T21:21:51Z

My understanding was that the standard would be implemented as a separate class

Ah fair enough, you are right here. I was applying my array intuition too much - we really need the separate dataframe class because we design with methods not functions.

So the question is how to spell what you need here. You suggested df.dataframe or df_standard.df, so attribute access. I'm thinking that this isn't something that one expects on the standard dataframe (at least it won't be part of the standard itself) and this would make sense as a regular constructor. So for pandas, pd.DataFrame(df_standard)?

jbrockmendel · 2023-03-14T21:23:56Z

Is there a viable way to do this going through the interchange protocol?

rgommers · 2023-03-14T21:30:32Z

That's pretty expensive though, having to iterate through memory. This is within a single library so I'd just use a private ._df_pandasbase attribute and then the constructor like:

class DataFrame():  # pd.DataFrame
    def __init__(...):
        if hasattr(df, '_df_pandasbase'):
            return df._df_pandasbase

jbrockmendel · 2023-03-14T21:38:57Z

I'd like to find an alternative that fits with the "assume pandas changes nothing" mantra.

jorisvandenbossche · 2023-03-14T21:42:25Z

So for pandas, pd.DataFrame(df_standard)?

I think accessing it from the object might be easier (with an attribute or method), because otherwise you need to know which namespace and function to use? For pandas it could be pd.DataFrame, but what is it for some standard dataframe from a library you don't know?

jorisvandenbossche · 2023-03-14T21:43:31Z

Unless it would be a method in the "standard namespace" (if we will have something like that)

rgommers · 2023-03-14T22:07:07Z

I think accessing it from the object might be easier (with an attribute or method),

That bakes in the assumption though that the "native dataframe" exists, and that there's a 1:1 relationship between any implementer of the standard and some other underlying dataframe object within the same library. I'm not sure that that assumption will hold - say you write a new library that only implements the standard, natively, plus the interchange protocol to transform itself into any other library's df object.

Or if you'd have a .df attribute in such cases, would you point it at self?

jorisvandenbossche · 2023-03-15T09:10:09Z

Yes, I think if the standard dataframe is the "native" object itself, it can just return itself; that doesn't seem like a problem (similar to how the interchange object also returns itself in __dataframe__).
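
To make the "return itself" case concrete, here is a minimal sketch. It assumes a hypothetical library whose only dataframe class implements the standard directly; the class name `NativeStandardFrame` and its dict-of-lists storage are invented for illustration and do not come from any real library. The `.df` name follows the attribute spelling used in this thread.

```python
class NativeStandardFrame:
    """Hypothetical dataframe that implements the standard natively (no wrapper)."""

    def __init__(self, columns):
        # Assumed storage for this sketch: column name -> list of values.
        self._columns = dict(columns)

    @property
    def df(self):
        # There is no separate underlying object, so the frame points at itself.
        return self

    def get_columns_by_name(self, names):
        # Return a new frame holding only the requested columns, in order.
        return NativeStandardFrame({name: self._columns[name] for name in names})


frame = NativeStandardFrame({"a": [1], "b": [2]})
selected = frame.get_columns_by_name(["a"])
```

With this shape, consumer code can always finish with `.df` and does not need to know whether the library wraps another object or not.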

But maybe we should also first think about the question: as a user of the standard API, how do you get a "standard" object given a random dataframe?

MarcoGorelli · 2023-03-15T11:26:14Z

Let's keep the question of opting into the standard for a separate issue

Or if you'd have a .df attribute in such cases, would you point it at self?

Sounds fine

at least it won't be part of the standard itself

In order for this to be usable, I don't see how it cannot be part of the standard. Otherwise, how can a library implement a function like

def my_fancy_function(df: AnyDataFrame):
    standard_df = dataframe_standard(df)  # we still need to agree on how to opt-in to the standard

    standard_df = standard_df.get_columns_by_name(...)  # bunch of operations which use the standard

    return standard_df.df

and be guaranteed that it'll work for any DataFrame?

jorisvandenbossche · 2023-03-15T11:33:23Z

Let's keep the question of opting into the standard for a separate issue

We can certainly discuss it separately, but I think the exact answer for this issue could depend on it. For example, if we define a namespace, we could also have a function in that namespace instead of a method or attribute.
(anyway, I also don't think the exact API for how to get the object is very important, the relevant discussion in this issue is probably about the concept that this can be ~~useful~~ essential to do)

MarcoGorelli · 2023-03-15T11:35:51Z

we could also have a function in that namespace instead of a method or attribute.

Sure, but it would still need to be the same for all DataFrame libraries taking part, right? Otherwise, in the example in #108 (comment), how does one write DataFrame-agnostic code?

I'd have thought this was essential, not just useful

jorisvandenbossche · 2023-03-15T11:43:28Z

I'd have thought this was essential, not just useful

Yes, to be clear I fully agree with that. Updated my comment above to not use the mere "useful" ;)

rgommers · 2023-03-15T12:08:40Z

Okay, so there is agreement we do need this in some form. Do we think this is always an O(1) operation? If so, an attribute seems reasonable. If it can trigger computation, it should be either a method, or a way to retrieve a constructor function as in gh-85 (that's more complex, which is probably justified for the interchange protocol but not here).

rgommers · 2023-03-16T17:19:41Z

Discussed today: folks agreed that this should exist and be cheap. Hence: an attribute .dataframe.
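
A minimal sketch of what a cheap `.dataframe` attribute could look like for a wrapper-style implementation. Everything here is an assumption for illustration: `NativeFrame` stands in for a library's own dataframe (for example a pandas DataFrame), `StandardFrame` for its standard-compliant wrapper, and `drop_y` for a consuming library function. None of these names are part of the standard or of any real library.

```python
class NativeFrame:
    """Hypothetical stand-in for a library's own dataframe type."""

    def __init__(self, columns):
        self.columns = dict(columns)  # column name -> list of values

    def drop(self, label):
        # Return a new frame without the given column.
        return NativeFrame({k: v for k, v in self.columns.items() if k != label})


class StandardFrame:
    """Hypothetical standard-compliant wrapper around NativeFrame."""

    def __init__(self, native):
        self._native = native

    @property
    def dataframe(self):
        # O(1): hand back the wrapped object, with no copy and no computation.
        return self._native

    def drop_column(self, label):
        # Standard methods return the wrapper, so calls can be chained.
        return StandardFrame(self._native.drop(label))


def drop_y(df):
    """Hypothetical library function: opt in, use the standard, opt out."""
    std = StandardFrame(df)      # how to opt in is left to a separate issue
    std = std.drop_column("y")
    return std.dataframe         # caller gets back the type they passed in


native = NativeFrame({"x": [1, 2], "y": [3, 4]})
result = drop_y(native)
```

Because the property only returns a stored reference, exposing it as an attribute rather than a method does not hide any work from the caller.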

rgommers added the API design label Mar 14, 2023

This was referenced Mar 20, 2023

add dataframe property #110

Merged

How to enable the Standard? #115

Closed

rgommers closed this as completed in #110 Mar 28, 2023
