What should the metrics API look like? #12
Thanks for getting this started @hildeweerts! I have some concerns about the naming but I'll focus on the functionality since we can always adjust the naming. I'll walk through an example based on your file to figure out whether I understand it right. Let's say I care about accuracy. Then I can instantiate score_groups(y_true, y_pred, sensitive_features, metric='accuracy'), which would mean we need to map those metric strings to actual metric functions. Maybe it's easier to pass the function itself? score_groups(y_true, y_pred, sensitive_features, metric=sklearn.metrics.accuracy_score)

Regardless of that detail, I'd get a
This reminds me very much of our existing group_summary(metric_function, y_true, y_pred, *, sensitive_features, indexed_params=None, **metric_params), and there are individual variants per metric, e.g. false_positive_rate_grouped(y_true, y_pred, sensitive_features, disparity="difference", aggregation="max"). What I'm wondering is how that's different from score_groups(y_true, y_pred, sensitive_features, metric='false_positive_rate').disparity('difference'). I remember that the complex object (here called
Perhaps both should be possible? I don't think we'll be adding a lot of "base" metrics (we should probably think of a good name for those) on a regular basis, so we can just use scikit-learn's predefined values in addition to others that are not directly accessible from scikit-learn (e.g. false positive rate). For first-time users, I can imagine having strings to choose from is a bit less intimidating than having to find (or even define) the appropriate (scikit-learn) function yourself? But this is just speculation from my side :)
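For illustration, here is a minimal sketch (not existing Fairlearn code) of how a 'string or callable' metric argument could be resolved internally; the registry, the _resolve_metric helper, and the false-positive-rate implementation are hypothetical names invented for this example.

import sklearn.metrics as skm

def _false_positive_rate(y_true, y_pred):
    # Binary labels assumed; FPR = FP / (FP + TN).
    tn, fp, fn, tp = skm.confusion_matrix(y_true, y_pred).ravel()
    return fp / (fp + tn)

# Hypothetical registry of named base metrics, mixing sklearn metrics with extras.
_METRICS_BY_NAME = {
    'accuracy': skm.accuracy_score,
    'precision': skm.precision_score,
    'false_positive_rate': _false_positive_rate,
}

def _resolve_metric(metric):
    # Accept either a metric name or an sklearn-style callable.
    return metric if callable(metric) else _METRICS_BY_NAME[metric]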
This is a good question - I think my naming is not super clear. I was actually thinking that there should be one type of function for getting the "disparity" between all groups (i.e. ratio or difference); i.e.
From what you describe I think it's very similar indeed! I agree that writing something like scores.difference() is more natural (pandas style), especially because this pattern allows you to easily access all possible methods in most IDEs. If you use an attribute rather than a method, does that mean you have to precompute all of them? In that case, I'd be in favor of a method.
Yes, this redundancy is intended! Mostly because you can use something like
I would also be in favour of sticking to just passing around functions, rather than allowing for 'string or callable' all over the place. This may reflect the fact that I'm far more used to typed languages, and I'm seeing a lot of tests being required to make sure the string->fn mappings are set up correctly. I agree that

Can you clarify what the result of calling

Supporting different aggregation methods (beyond simple max-min) is something I'm sure would be useful. I think that the aggregation method might look something like:

class Scores:
    ...
    def aggregate(self, f):
        return f(self)

We could provide things such as

A note on attributes in Python... they're actually methods, and we used to use them in the
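To make the aggregate(f) hook concrete, here is a hedged sketch assuming a Scores object that stores per-group values in a pandas Series; the by_group attribute name and the sample numbers are invented for illustration.

import pandas as pd

class Scores:
    # Hypothetical container: per-group metric values kept in a pandas Series.
    def __init__(self, by_group):
        self.by_group = pd.Series(by_group)

    def aggregate(self, f):
        # Hand the whole object to a user-supplied aggregator.
        return f(self)

scores = Scores({'male': 0.75, 'female': 0.50})
# A user-defined max-min difference aggregator.
print(scores.aggregate(lambda s: s.by_group.max() - s.by_group.min()))  # 0.25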
From a developer perspective, I agree. But from a user perspective, I do think strings can be much easier to use. Particularly if you're familiar with scikit-learn.
It depends a bit on what we choose as an underlying data structure of the scores object. But it would allow (particularly novice) users to easily convert the current object to a data structure they are comfortable working with. E.g. if a user has called scores.difference() they can convert the result to a pandas dataframe and visualize it using e.g. matplotlib.

To clarify further: in my head 'disparity' and 'aggregation' are two different steps; in the disparity step you define what a disparity between two groups is (i.e. either 'ratio' or 'difference') and in the aggregation step you define how you aggregate those disparities across more than two groups (e.g. the min-max difference, average difference, max difference with one particular group of interest, etc.). Would it be clearer if we had something like

I didn't know that about attributes - interesting!
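Purely as an illustration of that two-step idea (not proposed API), the 'disparity' step and the 'aggregation' step can be written directly in pandas; the group names and scores below are made up.

import pandas as pd

group_scores = pd.Series({'A': 0.90, 'B': 0.75, 'C': 0.80})

# Disparity step: difference of every group against every other group.
values = group_scores.to_numpy()
disparities = pd.DataFrame(values[:, None] - values[None, :],
                           index=group_scores.index, columns=group_scores.index)

# Aggregation step: reduce the pairwise disparities to a single number.
max_min_difference = disparities.max().max()               # largest gap between any two groups (~0.15)
mean_abs_difference = disparities.abs().to_numpy().mean()  # average absolute gap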
You're probably right.... it just feels dirty and uncomfortable to me.
Could you propose a specific implementation? Right now, the returned object has an
Perhaps something like
Ah yes, I forgot about the overall/by_group pattern. Off the top of my head, I can see two possibilities. Either perform the call on the groups attribute only, i.e. you'd have something like

The second option would be to use a multi-index dataframe. I think this would make the most sense if we allow for multiple metrics in the same summary. Which, if I think about it, would actually be quite useful. Is passing multiple metrics possible in the current API? The resulting dataframe could look something like this:
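As a rough stand-in (my own layout, not necessarily what was originally shown), such a multi-index frame might look like this:

import pandas as pd

index = pd.MultiIndex.from_tuples(
    [('overall', ''), ('by_group', 'male'), ('by_group', 'female')])
summary = pd.DataFrame(
    {'accuracy': [0.81, 0.85, 0.72], 'recall': [0.66, 0.70, 0.55]},
    index=index)
# Rows: the overall value plus one row per group; columns: one per metric.
summary.loc['by_group']       # per-group sub-frame
summary.loc[('overall', '')]  # overall row as a Series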
I like this idea! 👍

I have seen

One thing which has just occurred to me: are disparities always symmetric (i.e. would you always define the function such that

@riedgar-ms disparities are not always symmetric. Think of the FPR ratio of two groups, for example. @adrinjalali and @MiroDudik should probably chime in here as well.

Agreed. For that particular case, one could adjust the specification of the ratio, much like we specify differences to be absolute values. However, I'm sure that there will be other cases not amenable to something like that (especially if someone tries defining disparities on confusion matrices).

Per a separate discussion which @MiroDudik and I had about a presentation I was putting together, rather than calling it

The

We could include these in the matrix of

I'm a bit confused where the

If it is the former, I agree something like

I think we should probably raise an error if 'overall' is used as a subgroup name. I think most issues can be dealt with by proper indexing, but there's always the issue of comparing the actual overall to subgroup 'overall'; i.e. in a table with all comparisons you'll have (overall, overall) twice and it's impossible to tell which 'direction' the disparity is going without looking up the original scores.
Let me comment on a couple of issues:

Base metrics

I don't mind allowing both strings as well as callable objects. This is a common pattern around metrics in sklearn, where you can pass a string or a scorer object to cross-validation routines. In fact our metrics engine already has some dictionaries of that kind, so it would be just a matter of adding another dict.

Aggregators

I am worried about over-engineering things. I don't think we should try to come up with generic aggregators (yet). I was thinking about using the following pattern for the

# the current behavior
scores.difference()
# some alternatives could be enabled with keyword arguments
scores.difference(relative_to='overall', aggregate='max_flip')
scores.difference(relative_to='fixed', fixed_group='male', aggregate='max')
scores.difference(relative_to='min', aggregate='max_flip')  # default
# similarly for ratio:
scores.ratio(relative_to='max', aggregate='min_flip')  # default
scores.ratio(relative_to='fixed', fixed_group='male', aggregate='min')
scores.ratio(relative_to='overall', aggregate='min_flip')

The

It's not clear to me that we need to have a mechanism for custom aggregators. My sense is that for any custom behavior, it should be enough for the users to do one of the following:

Exporting

I'm not yet sure here. There are a few other patterns that might be handy, e.g.,

by_group, overall = scores.to_frames()
difference = (by_group - overall).abs().max()
difference_to_male = (by_group - by_group['male']).min()

This is the one I was thinking about to avoid custom aggregators.
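A hedged mock-up of that export-then-aggregate pattern using plain pandas; to_frames itself is hypothetical, so its outputs are hard-coded here with made-up group scores.

import pandas as pd

by_group = pd.Series({'male': 0.80, 'female': 0.65, 'nonbinary': 0.70})
overall = 0.75

difference = (by_group - overall).abs().max()             # largest gap from the overall score
difference_to_male = (by_group - by_group['male']).min()  # most negative gap relative to 'male'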
I agree. It's probably best to keep things as simple as possible and allow for flexibility via easy exports to pandas. I like your suggested pattern to avoid generic aggregators. My hard-wired instinct to make things complicated says we might want to add

I'm wondering what would be a good way to translate this to the fairness metrics, e.g.

Fair enough on avoiding the custom aggregators - especially if the DataFrame export can reasonably allow users to define things to their hearts' content. My main concern is the complexity of the 'dispatch' logic at the top of the

I'm not a huge fan of the
hi yall! I just wanted to say that I love the way @hildeweerts framed the rationale for these decisions in the use cases of:
and
I didn't notice that earlier since it's in the attached

It'd be amazing to see that, wherever this discussion goes, it ultimately lands on "because of user need A, we are making this change, and this is how the design we chose helps meet that need" in the same way 👍

To expand on the second use case, I was mostly thinking about the 'optimization' part. I imagine a scenario where you first do some exploratory analysis on your initial data and build a simple model. If you find an issue you'd like to keep track of, you can use fairness metrics like

Circling back to the original API: How should we handle extra arguments to the metric function (especially when

We can add a

One further suggestion for the

I've been thinking a bit more about this and I think the pattern suggested by @MiroDudik is a very good candidate. As an alternative to the

I do think it would make sense to have a

Regarding arguments that need to be passed to a metric function, I'm not sure what the best approach would be. You could define two dicts, one for 'grouped_kwargs' and one for 'regular_kwargs', but I don't really like that solution either. If I recall correctly, @adrinjalali mentioned that the functionality to pass arguments such as

What would be returned in that case? The matrix of differences (or ratios) between all pairs of the subgroups (plus

If we only allow for
Re. aggregate=None

I'm worried about the function not always returning a scalar. One alternative is something like:

scores.differences(relative_to='fixed', fixed_group='male')        # returns pandas.Series
scores.differences(relative_to='fixed', fixed_group='male').max()  # aggregate as convenient
scores.differences(relative_to='overall').abs().max()

I think that this would actually be a really good pattern to support the "explore different metrics" scenario, but it is less well suited for mitigation / hyperparameter tuning scenarios, so those would still need to be of the form:

scores.difference(relative_to='fixed', fixed_group='male', aggregate='max')
scores.difference(relative_to='overall', absolute=True, aggregate='max')

Is that better or worse than

Re. exporting

@hildeweerts (and others), do you have a preference between the following options:

by_group, overall = scores.to_frame()              # overall is returned as a scalar
by_group = scores.to_frame()                       # overall is only accessed as scores.overall
combined = scores.to_frame(overall_key='overall')  # overall is included in the frame with a flat index
combined = scores.to_frame()                       # overall is included in the frame with MultiIndex
Re. aggregate=None

The

Wouldn't the scores object only be used 'under the hood' for the fairness metrics in the mitigation/tuning scenario? I.e. you'd call something like

Re. exporting

I like

Support multiple metrics

Do we want to support multiple metrics in a single scores object? In all industry projects that I've worked on, there's never been a case where only a single metric was relevant. I can imagine it'd be a bit annoying if you'd have to manually merge the scores to be able to easily compare them in a table.

My concern with the

They might be mutating the internal state of the

Alternatively, perhaps they return a different object type each time. However, that's really taking us off in a different direction.

The

Have more thoughts on the other issues; need to go right now.
For 'multiple metrics in a single object' do you mean something like:

metrics = MetricsObject(y_true, y_pred, sensitive_feature)
accuracy_scores = metrics.accuracy(sample_weights=sample_weights, extra_arg=something)
tpr = metrics.true_positive_rate()

Basically, the first call to a separate metric would calculate the confusion matrix, and it would be cached to generate the others (I believe AIF360 does this)? One immediate problem: how to handle the extra arguments (such as
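A minimal sketch of that caching idea for the overall scores only; the class and method names are hypothetical, binary labels are assumed, and the per-group split is omitted for brevity.

import numpy as np
from sklearn.metrics import confusion_matrix

class MetricsObject:
    # Hypothetical container that computes the confusion matrix once and reuses it.
    def __init__(self, y_true, y_pred, sensitive_features):
        self.y_true = np.asarray(y_true)
        self.y_pred = np.asarray(y_pred)
        self.sensitive_features = np.asarray(sensitive_features)
        self._cm = None  # cached confusion matrix

    def _confusion_matrix(self):
        if self._cm is None:
            self._cm = confusion_matrix(self.y_true, self.y_pred)
        return self._cm

    def accuracy(self):
        cm = self._confusion_matrix()
        return np.trace(cm) / cm.sum()

    def true_positive_rate(self):
        tn, fp, fn, tp = self._confusion_matrix().ravel()
        return tp / (tp + fn)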
Re. scores.differences().abs() pattern

If

Re. Support multiple metrics

For multiple metrics I actually meant a pattern like this:

scores = score_groups(y_true, y_pred, sensitive_features, metrics=['accuracy', 'precision', 'recall'])

So you'd still have to pass the base metrics from the start, but the resulting scores are conveniently stored in the same

Perhaps we can get some inspiration from scikit-learn for handling arguments of metric functions. For 'regular' arguments that need to be passed to a metric, there is

For the

For the multiple metrics, once you have called:

scores = score_groups(y_true, y_pred, sensitive_features, metrics=['accuracy', 'precision', 'recall'])

what does calling

What about

If we are going to support multiple metrics, I think I would be in favor of always returning a dataframe for the sake of consistency. This will be a bit weird in some cases, e.g. when you access 'overall' if there's only one metric, but it is probably better than returning different types all over the place.
Emerging consensus

Let me try to summarize what I think we are currently converging on! (But please correct me if I'm wrong here.) The actual names of classes / optional arguments are still to be decided, but I'll try to make the API format clear.

Re. grouped metric object constructor and properties

For a single metric, we consider something like the following:

class GroupedMetric():
"""
Object with the overall metric value and metric values in each group.
"""
def __init__(self, base_metric, *, indexed_params=None, **kwargs):
"""
Parameters
----------
base_metric : str or callable
the metric being grouped; if callable this needs to be sklearn style metric
indexed_params : str[] or None
names of arguments that take form of arrays and need to be split
when grouping (e.g., sample_weight)
**kwargs :
additional parameters to be passed to base_metric (e.g., beta=0.5 or pos_label=2
for fbeta_score)
"""
def eval(self, y_true, y_pred, *, sensitive_features, **kwargs):
"""
Calculate the metric. After that, the metric values can be accessed in
the fields `by_group` (dict) and `overall`.
Returns
-------
result : GroupedMetric
eval method returns `self` to allow chaining
"""
def difference(self, relative_to='min', group=None, abs=True, agg='max'):
"""
Difference style disparity (scalar).
Parameters
----------
relative_to : 'min', 'overall', 'max', 'rest', 'group', or str/int/float
if none of the string constants are used then the value is interpreted as
the name/id of the group relative to which the difference is measured, which
is equivalent to choosing 'group' and specifying the id as `group` parameter
group: str/int/float
if relative_to='group' then this parameter provides the relevant group id
abs: bool
if True, the absolute value of the differences is taken
agg : 'max', 'min'
how group-level differences are aggregated
"""
def differences(self, relative_to='min', group=None):
"""
Differences of individual groups rel. to some baseline.
Parameters
----------
relative_to, group :
the same as in the method `difference`
Returns
-------
result : pandas.Series
"""
def ratio(self, relative_to='max', group=None, numerator_choice='smaller', agg='max'):
"""
Ratio style disparity (scalar). Parameters used to refine the type.
Parameters
----------
relative_to, group, agg :
the same as in the method `difference`
numerator_choice : 'smaller' or 'greater' or None
if 'smaller' or 'greater' is selected, the ratio is taken with the specified numerator
"""
def ratios(self, relative_to='max', group=None):
"""
Ratios of individual groups rel. to some baseline.
Parameters
----------
relative_to, group :
the same as in the method `difference`
Returns
-------
result : pandas.Series
"""
def max(self):
"""
The maximum group-level metric.
"""
def min(self):
"""
The minimum group-level metric.
"""
def to_frame(self):
"""
Convert to pd.DataFrame (grouped metrics) and pd.Series (overall metric)
Returns
-------
        by_group, overall : pd.DataFrame, pd.Series
        """

Remaining questions
Discussion in progress

Re. multiple metrics

I see a lot of utility, but I don't think we should enable multiple metrics for now (but definitely in future). There are two complications that need solving:

Re. predefined fairness metrics

[I'll comment on this one later on.]
Re. grouped metric object constructor and properties

Thank you for this! I think it summarizes the conversation so far very well.

Re. multiple metrics

Is there a specific reason you want to handle 'static' metric arguments (like

I have to say that error bars would be on my 'nice-to-have' list rather than my 'must-have' list. In fact, if we allow for multiple metrics, you could define a callable that computes e.g. the lower error bar and pass it as a 'metric'. But perhaps I just don't understand exactly what you mean? What part of cross_validate are you referring to exactly?
That is a pretty big move away from what we've had to date - basically a

As for the proposal itself.....
Could you expand a bit more on this? I'm not sure if I understand what you mean but it seems an important comment.
In the current proposal the

I agree with your concerns regarding
To the first point, right now our group metrics are ultimately generated by `make_metric_group_summary()`. This takes in a metric function in the scikit-learn pattern, and returns a callable object (so it looks like a function to the user). So this is what lets us do:

>>> from fairlearn.metrics import recall_score_group_summary
>>> type(recall_score_group_summary)
<class 'fairlearn.metrics._metrics_engine._MetricGroupSummaryCallable'>
>>> y_t = [0,1,1,0,1,1,0]
>>> y_p = [0,1,0,1,1,0,0]
>>> s_f = [4,5,4,4,4,5,5]
>>> recall_score_group_summary(y_t, y_p, sensitive_features=s_f)
{'overall': 0.5, 'by_group': {4: 0.5, 5: 0.5}}
>>> recall_score_group_summary(y_t, y_p, sensitive_features=s_f, pos_label=0)
{'overall': 0.6666666666666666, 'by_group': {4: 0.5, 5: 1.0}}
>>> recall_score_group_summary(y_t, y_p, sensitive_features=s_f, pos_label=1)
{'overall': 0.5, 'by_group': {4: 0.5, 5: 0.5}}

We generate the functions like

This is then related to the

I suppose that we could use

On
I know I'm very late to the game and I haven't caught up on everything, but I wanted to point out some of the issues with the sklearn metric function interface that you might face. For example, the meaning of

You might decide that it's ok to keep the established sklearn-like interface and live with the shortcomings; I just wanted to mention them briefly:

At least 2 and 3 are relatively easily fixed by passing around all known classes, as computed on the full dataset. You might also want to do that for the protected attribute, as I assume you'll otherwise run into similar issues there. In some cases, sklearn will currently produce wrong / misleading results based on the issues above (for example when using a non-stratified split for classification and using macro-averaging; though I'm not even sure what the correct result should be in that case).
Actually, I stand corrected, what

@amueller I can see that we need to warn users that the exact meaning of the extra arguments they pass in (such as

@riedgar-ms I guess 4) is not as much of an issue, but I think the other items might be.

I was speaking to a colleague (who I can't @ mention, for some reason), and he raised the point in (2) - and also wondered what would happen if some of the groups lacked all the classes. He also raised a point about whether we want to handle multiclass explicitly.
I'll chime in and hopefully make myself @-able. While I don't think I have anything truly novel to add to this conversation, I would like to attempt to lay out some of the design questions raised more explicitly.

A. Wrapping sklearn vs. injecting sklearn

Fairlearn has already taken the stance that it will be unopinionated about the implementations (or even interfaces) of metrics that it wraps. The points @amueller raises all get at this question of whether Fairlearn should try to handle the validation and edge cases that sklearn fails to handle. There will be a tradeoff here between customer experience, Fairlearn codebase maintenance, and compatibility with user-defined metrics.

B. Results as an internal representation vs. open source representation

This is about what is returned by Fairlearn when fairness metrics are computed, the intermediate representation of metrics disaggregated by groups. The tension here seems to be between some Fairlearn-specific representation of results/scores and an open source representation, specifically some combination of Pandas dataframes and Python dictionaries. While an internal representation allows for method chaining of Fairlearn-specific aggregators, an existing representation gives the user more freedom to aggregate and inspect the results with their own tools. I should point back to the concerns of @hildeweerts and others that we want to make it easy for users to hook into this representation if they choose to.

C. Aggregation in one step vs. two steps

Another question is about how aggregation should be implemented. This is somewhat dependent on design question B. There will be Fairlearn-specific aggregators that Pandas doesn't support out of the box. In one design, aggregators are built into the metric computation itself so that only the aggregation is returned from metric computation and there is only one fairness metric interface. On the other hand, to separate the concerns of metric computation and aggregation and to simplify the computation interface, there could be separate aggregators that have an implicit/explicit contract through the intermediate disaggregated representation. The proposal from @MiroDudik around method chaining vs an

D. Metric functions vs. metric names (convenience)

Fairlearn can either remain unopinionated about metric implementations or it can accept metric names as a convenience to the majority of users who would default to sklearn metrics anyway.

E. Single metrics vs. multiple metrics (convenience)

Users will get results for more than one metric at the same time on the same data. That loop over metrics can either be left to the user, simplifying the Fairlearn interface, or provided as a convenience function that accepts a list/set of metric functions/names. The results format for multiple metrics is another matter.

I won't pretend that this is a full summary of the nuanced conversation above, but I did want to start to organize these different threads. Is this list missing any important decisions?
Maybe one question I missed is how multiple sensitive features should be handled. If each sensitive feature has a different number of groups, that might not be best represented as a pandas dataframe, but potentially a dictionary of feature name -> dataframe?
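One possible shape for that, sketched with plain pandas rather than any proposed Fairlearn API; the scores_per_feature helper and its signature are hypothetical.

import pandas as pd
from sklearn.metrics import accuracy_score

def scores_per_feature(y_true, y_pred, sensitive_features):
    # sensitive_features: a DataFrame with one column per sensitive feature.
    frame = pd.DataFrame({'y_true': list(y_true), 'y_pred': list(y_pred)})
    result = {}
    for name in sensitive_features.columns:
        grouped = frame.groupby(sensitive_features[name].to_numpy())
        result[name] = grouped.apply(lambda g: accuracy_score(g['y_true'], g['y_pred']))
    # e.g. {'sex': Series indexed by sex groups, 'race': Series indexed by race groups}
    return result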
I think that some of these discussions are mixing up the various levels of the API, and that's adding to the confusion. Right now, there are basically three layers to the metrics API:
I think (please correct me if I'm wrong) that our main concern is how to handle the aggregators, which in turn depends on the exact type returned by

The group_summary() function

I would like to keep this function more-or-less as-is. The main question is the return type. I would like to propose that, rather than a

For the most part,

However, as @amueller and @gregorybchris have both pointed out, things can go wrong if the subgroups identified by the sensitive features are missing classes in their

The Convenience Wrappers

I think the main issue here is whether these should exist or not. We could allow users to call

The Convenience Aggregators

If

One downside to this is that to have
Sorry, I'm still catching up, but do you allow metrics that require continuous scores, such as

It looks like you're mostly building on the function interface right now, and I was wondering if this issue came up at all.

@amueller all of our current fairness metrics act on (y_true, y_pred, sensitive_features). They do not take an estimator / predictor itself as an argument. So it's right now up to the user to call the relevant function.

@MiroDudik this will not work when used inside cross-validation, unfortunately. Though passing the sensitive feature during cross-validation might be a bigger issue.

The idea was to use them in combination with

That makes sense.

One small comment.... the meaning of
Let me try to regurgitate items again, indicate where we stand on each item, and add a couple of new items. I'll present this as a list of questions. Hopefully, that will make it easier to refer to, track, and add to as needed. Please respond below!

Q1: Should we represent metrics as functions or objects?

Our status quo was to have only functions with the signature

@hildeweerts pointed out that when tuning hyperparameters or monitoring fairness, it is useful to have functions returning scalars, but when working with various variants of disaggregated metrics, it is much more natural to have an object that feels more like a

So the consensus seems to be that we should have both objects and functions.

Q2: What should the metric objects look like?

The purpose of metric objects is to evaluate and store the results of disaggregated metrics. Inspired by
Q3: Should the metric objects subclass pandas.Series?

[Implicitly brought up by @riedgar-ms; this is an updated version of the comment. Another update: Richard does not think we need this, so let's resolve the answer as No.]

This seems interesting. I'm worried about the implementation complexity, maintenance overhead (deep dependence on another package), and the semantics of mutable Series operations like updates/additions/deletions of entries (how would this impact

I'm not sure what the exact proposal looks like, but one variant that I can imagine based on @riedgar-ms's description above is as follows. Consider calling

grouped = GroupedMetric(metric_function).eval(y_true, y_pred, sensitive_features=sf)

This call would return an immutable pandas.Series or pandas.DataFrame, extended with the following methods:

Ignoring implementation/maintenance complexity, I think the key is to make things immutable. Otherwise, I'm worried about the semantics of

Q4: Do we need group_summary()?

(Implicitly brought up by @riedgar-ms.) I don't think so. I think it is now effectively replaced by the constructor of the metric object. This is because we could define

def group_summary(metric_function, y_true, y_pred, *, sensitive_features, **kwargs):
    return GroupedMetric(metric_function).eval(y_true, y_pred, sensitive_features=sensitive_features, **kwargs)

Q5: What should the predefined metric functions look like?

I think we should keep the same signature for all the predefined functions as in the status quo:
<metric>_ratio(y_true, y_pred, *, sensitive_features, **kwargs)
<metric>_group_min(y_true, y_pred, *, sensitive_features, **kwargs)
<metric>_group_max(y_true, y_pred, *, sensitive_features, **kwargs)

But they would all be defined via the grouped metric object. For example,

def accuracy_score_difference(y_true, y_pred, *, sensitive_features,
                              relative_to='min', group=None, abs=True, agg='max', **kwargs):
    return GroupedMetric(skm.accuracy_score).eval(
        y_true, y_pred, sensitive_features=sensitive_features, **kwargs).difference(
            relative_to=relative_to, group=group, abs=abs, agg=agg)

Since I'm proposing to drop

Q6: Should we support strings in GroupedMetric?

I think we should allow strings for all the base metrics which appear in the predefined metrics and possibly some additional common cases.

Q7: What do we do about quirks and limitations of sklearn metrics?

As @gregorybchris said, we've decided to make all the arguments pass-through (modulo disaggregation according to
Q8: What should be the invocation pattern for evaluating multiple metrics?

As @gregorybchris said, evaluating multiple metrics is just a matter of convenience, but it's probably a common use case. I'd like to figure out the calling convention, not the return type.

Calling Convention 1

grouped = GroupedMetric([metric1, metric2]).eval(
y_true, y_pred, sensitive_features=sf, args_dict={metric1: kwargs1, metric2: kwargs2})
# or
grouped = GroupedMetric([metric1, metric2], args_dict={metric1: kwargs1, metric2: kwargs2}).eval(
y_true, y_pred, sensitive_features=sf)

Calling Convention 2

This was alluded to by @riedgar-ms above and is also similar to the AIF360 pattern:

grouped_pred = GroupedPredictions(y_true, y_pred, sensitive_features=sf)
grouped1 = grouped_pred.eval(metric1, **kwargs1) # a GroupedMetric object for single metric
grouped = grouped_pred.eval([metric1, metric2],
args_dict={metric1: kwargs1, metric2: kwargs2}) # a GroupedMetric object for two metrics

Q9: Do we want to include any of the following extensions in this proposal?

There are several extensions that are worth considering that impact the format of an evaluated

I think that all of these can be handled by hierarchical indices (i.e.,
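My own illustration of how a hierarchical (MultiIndex) column index could accommodate several of those extensions at once, e.g. multiple base metrics plus error bounds; the exact axes and column names are assumptions, not something settled in this thread, and the numbers are made up.

import pandas as pd

columns = pd.MultiIndex.from_product(
    [['accuracy', 'recall'], ['value', 'lower_bound', 'upper_bound']])
by_group = pd.DataFrame(
    [[0.85, 0.80, 0.90, 0.70, 0.63, 0.77],
     [0.72, 0.66, 0.78, 0.55, 0.47, 0.63]],
    index=['male', 'female'], columns=columns)

by_group['accuracy']                   # sub-frame for one metric
by_group.xs('value', axis=1, level=1)  # point estimates for all metrics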
OK, that has sort of cleared some things up. Definitely ignore my comment about subtyping

If we're going to make metrics into objects, I think that question 8 is actually the big one. Looking at the single metric case, there are two basic options:

Option 1: GroupedMetric holds functions

In this scenario, a user would write something like:

accuracy = GroupedMetric(sklearn.metrics.accuracy_score, index_params=['sample_weights'], pos_label="Approved")
accuracy.eval(y_true, y_pred, sample_weights=sw)
print(accuracy.by_groups)
>>> { 'male':0.7, 'female':0.5 } # Probably a DataFrame or Series rather than a dict()
print(accuracy.overall)
>>> 0.6
print(accuracy.function_name)
>>> 'sklearn.metrics.accuracy_score'

and so on. Some open questions:
Option 2: GroupedMetric holds data

This essentially reverses the constructor and eval() arguments:

my_data_for_metrics = GroupedMetric(y_true, y_pred, sample_weights=sw)
my_data_for_metrics.eval(sklearn.metrics.accuracy_score, index_params=['sample_weights'], pos_label='Approved')
print(my_data_for_metrics.by_groups)
>>> { 'male':0.7, 'female':0.5 } # Probably a DataFrame or Series rather than a dict()
print(my_data_for_metrics.overall)
>>> 0.6

And so on. Again we have the question of whether

If I'm still misunderstanding things, please let me know. I think that the above is a big decision, on which we need to be clear.

Other thoughts

For Question 7, I think the best option is to say that users need to know their tools. @gregorybchris can correct me if I'm wrong, but I believe that to 'save users from themselves' we would need to figure out the inner workings of the metric functions we're wrapping in great detail, and that would wind up tying us to particular versions of scikit-learn. Even then I'm sure we'll miss some extreme edge cases (and we wouldn't be able to do anything for users who brought their own metric functions - and they might then be even more shocked that the magic we were doing for the scikit-learn metrics didn't work on their own).

For Question 9(2), isn't providing an error estimate (or similar) the job of the underlying metric function? How would we provide that? Although we talk about

Not to be settled now, but for Question 9(3 & 4), I suggest we start thinking in terms of DataFrames with the columns being the underlying metrics, and have a
To get the conversation going on what the metrics API should look like and better explain my own view, I created a quick mock-up: metrics.txt. Please feel free to comment!
An issue @MiroDudik brought up is that this design does not provide a way to retrieve both "group" scores and "overall" scores, something that was possible with the previous Summary object. I am not sure what would be a nice way to incorporate this pattern, so if people have thoughts: please share!
Tagging @romanlutz @riedgar-ms @adrinjalali