preserve task attributes when creating reduced_task by modifying data using mlr3pipelines #11

Open
wants to merge 2 commits into master

Conversation

@mikoontz commented Dec 7, 2022

This PR fixes what I think could be a problematic approach to creating a new mlr3 task with knockoff data.

Currently, the reduced task is created by building a new data.frame with knockoff data substituted for a feature (or group of features), then creating a new task with mlr3::as_task_regr() or mlr3::as_task_classif(), using the new data.frame as the backend and the original task's target as the target.

This approach doesn't respect other settings of the Task, such as observation-level weights or coordinates (as when a user is setting up spatial cross-validation). It also requires logic that queries whether a task is a regression or classification task in order to call the correct function (mlr3::as_task_regr() or mlr3::as_task_classif()). That logic breaks for other task types that should still work, e.g. "classif_st", where a user sets up spatial cross-validation with coordinates attached to each observation via mlr3spatiotempcv::as_task_classif_st().

From here, the preferred way to modify the data in a task is to use mlr3pipelines. This PR implements that approach and adds a Suggests dependency on the mlr3pipelines package in order to use the PipeOpMutate operator. It works for both individual features and grouped features. An ancillary benefit is reduced logic complexity, because we no longer need to query whether the task is regression or classification. It also allows other task types that aren't strictly "classif" or "regr" (e.g. "classif_st" for spatial cross-validation).
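For illustration, here is a minimal sketch of the idea with PipeOpMutate; the mtcars task, the wt feature, and the rnorm() knockoff values are placeholders, not the actual code in this PR or the knockoff machinery in {cpi}:

  library(mlr3)
  library(mlr3pipelines)

  task <- tsk("mtcars")

  # Placeholder knockoff values for the 'wt' feature; in {cpi} these would
  # come from the knockoff generation step, not rnorm().
  knockoff_wt <- rnorm(task$nrow)

  # PipeOpMutate overwrites 'wt' with the knockoff values while leaving the
  # target, task type, weights, coordinates, etc. untouched.
  po_knockoff <- po("mutate", mutation = list(wt = ~ knockoff_wt))
  reduced_task <- po_knockoff$train(list(task))[[1]]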

All checks pass when building. The NAMESPACE and DESCRIPTION files were updated using roxygen2::roxygenize() and manual editing, respectively.

…koff data substituted for one feature or a group of features) using the mlr3pipelines mutate operation rather than creating a new task with the same target and a new data backend. This reduces the need to query what kind of task (regression or classification), and also preserves other task attributes (e.g., weights, coordinates) in the reduced task.
@mikoontz (Author) commented Dec 8, 2022

I just noticed that the logic of deciding whether the task_type is "regr" or "classif" exists in at least one other place in {cpi}. In the cpi() function, this block:

  if (is.null(measure)) {
    if (task$task_type == "regr") {
      measure <- msr("regr.mse")
    } else if (task$task_type == "classif") {
      measure <- msr("classif.logloss")
    } else {
      stop("Unknown task type.")
    }
  }

So this will break if a user has a "classif_st" task_type, even though the code block that would execute for the "classif" task_type should work fine.

Just noting that the "ancillary benefit" I point out above isn't fully realized if a user doesn't specify a value for the measure argument!

Using grepl could work, but maybe that overcorrects and is too permissive (happy to make a separate PR if helpful):

  if (is.null(measure)) {
    if (grepl(x = task$task_type, pattern = "regr")) {
      measure <- msr("regr.mse")
    } else if (grepl(x = task$task_type, pattern = "classif")) {
      measure <- msr("classif.logloss")
    } else {
      stop("Unknown task type.")
    }
  }
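A quick illustration that the prefix match also covers the spatiotemporal task types from mlr3spatiotempcv while still rejecting everything else:

  grepl(x = "classif_st", pattern = "classif")  # TRUE
  grepl(x = "classif", pattern = "classif")     # TRUE
  grepl(x = "regr", pattern = "classif")        # FALSE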

@mikoontz mikoontz changed the title preserve task features when creating reduced_task by modifying data using mlr3pipelines preserve task attributes when creating reduced_task by modifying data using mlr3pipelines Dec 8, 2022
@mnwright (Member) commented

Thanks for the PR! I think using pipelines for this makes perfect sense and I wonder why I didn't come up with that myself?

I will look more into this later.
