Use Paramspace to automate the file naming scheme based on wildcards #40
Open: kelly-sovacool wants to merge 89 commits into main from paramspace
Can't have hyphens. Better to use hyphens to separate params.
to give users control over paramspace wildcards
and write unit tests
Also fix instances_drop_wildcard() so it returns a unique set
Have to use output filepath, not rules.output, because the plot is blank when find_feat_imp is False
Otherwise, will get a ModuleNotFoundError when deploying this module with snakedeploy
Just use all columns that match
@sklucas Let me know your thoughts on this! In some ways it makes understanding the workflow more complicated, but with the goal of making execution more flexible for different use cases. We can iterate on this to find a balance between flexible execution with paramspaces without sacrificing too much understandability/readability of the workflow.
Paramspace() is a class provided by Snakemake that offers an automated way to build a file naming scheme with wildcards based on a data frame. I implemented a custom way to create a paramspace from the configfile so the user doesn't have to manually create a CSV file of their parameters, which would be somewhat redundant with the configfile anyway. This implementation assumes that for each list in your configfile, you will want all-vs-all pairwise combinations of parameters in the list. Users who would like to make their own paramspace some other way can bypass this custom config->paramspace implementation by providing paramspace_csv in their configfile.
Issues
Change(s) made
- Paramspace to define the wildcard pattern in the run_ml rule.
- workflow/scripts/functions.py:
  - get_paramspace_from_config() - takes a config dictionary and returns a Paramspace.
  - pattern_drop_wildcard() - get the wildcard pattern from paramspace without this wildcard. Needed by rule combine_hp_performance.
  - pattern_tame_wildcard() - get the wildcard pattern from paramspace with all wildcards escaped with curly braces except this wildcard. Needed by rule combine_hp_performance.
  - instances_drop_wildcard() - get a list of all wildcard instances from paramspace without this wildcard. Needed by rule plot_hp_performance.
  - set_default() - helper function to get a value from the config file if it exists, or return a default value. Reduces repetitive code when setting variables from the config.
- workflow/scripts/test_functions.py and set up GitHub Actions to test them.
- rules/config.smk.
- paramspace_csv is a new key in the configfile. If it exists and is not empty, the paramspace will be created by parsing this CSV file. Effectively this is a way for users to bypass this custom config->paramspace implementation. If this key doesn't exist or is empty, the parameters will include all keys listed in the configfile except those listed in exclude_param_keys.
- exclude_param_keys is a new key in the configfile that lists all keys in the configfile that should be excluded from the paramspace. For the default config, the keys excluded are all except dataset, method, and kfold. This way a new key added to the configfile is automatically included in the paramspace unless the user intentionally excludes it by adding it to the exclude_param_keys list.
- pandas Python package is now a dependency that must be installed alongside snakemake.
Checklist
(Strikethrough any points that are not applicable.)
README.md, config/README.md, & quick-start.md).
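For reviewers who want a concrete picture of the config->paramspace idea described above, here is a minimal sketch of the all-vs-all expansion with exclude_param_keys applied. It is not the code in workflow/scripts/functions.py: the helper name expand_config, the example config values, and the ncores key are all hypothetical, and it uses only the standard library so it runs without Snakemake installed.

```python
from itertools import product


def expand_config(config, exclude_keys=()):
    """Return one dict per all-vs-all combination of the non-excluded config values.

    Hypothetical helper for illustration only; not the PR's implementation.
    """
    # Keep only the keys that should become wildcards; wrap scalars in a list
    # so that every parameter can be fed to itertools.product.
    params = {
        key: value if isinstance(value, list) else [value]
        for key, value in config.items()
        if key not in exclude_keys
    }
    keys = list(params)
    return [dict(zip(keys, combo)) for combo in product(*(params[k] for k in keys))]


# Hypothetical config dictionary, as it might look after loading a configfile.
config = {
    "dataset": ["small", "large"],
    "method": ["glmnet", "rf"],
    "kfold": 5,
    "ncores": 4,
    "exclude_param_keys": ["ncores", "exclude_param_keys"],
}

rows = expand_config(config, exclude_keys=config["exclude_param_keys"])
for row in rows:
    print(row)

# Assumption: rows like these could then be handed to pandas.DataFrame(rows)
# and the resulting data frame to snakemake.utils.Paramspace.
```

With this example config, two datasets times two methods times one kfold value gives four parameter combinations, and ncores never appears because it is listed in exclude_param_keys.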
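The wildcard helpers listed under workflow/scripts/functions.py can be hard to picture from their descriptions alone, so here is a rough sketch of the same ideas on plain strings. It assumes the wildcard pattern looks like name~{name} segments joined by "/", which may not match the separator this PR ends up using. The sketch_* function names and the example instances are hypothetical, and the real functions take a Paramspace rather than a string.

```python
import re


def sketch_drop_wildcard(pattern, wildcard, sep="/"):
    """Remove the segment that contains {wildcard} from a pattern string."""
    token = "{" + wildcard + "}"
    return sep.join(part for part in pattern.split(sep) if token not in part)


def sketch_tame_wildcard(pattern, wildcard):
    """Double the braces of every wildcard except `wildcard`.

    Doubled braces survive one round of str.format-style expansion,
    so only `wildcard` gets filled in.
    """
    def escape(match):
        name = match.group(1)
        return match.group(0) if name == wildcard else "{{" + name + "}}"

    return re.sub(r"\{(\w+)\}", escape, pattern)


def sketch_instances_drop_wildcard(instances, wildcard, sep="/"):
    """Drop the `wildcard` segment from each instance, keeping unique results in order."""
    prefix = wildcard + "~"
    unique = {}
    for instance in instances:
        kept = sep.join(p for p in instance.split(sep) if not p.startswith(prefix))
        unique.setdefault(kept, None)
    return list(unique)


# Assumed pattern shape; kfold is used as the wildcard of interest for illustration.
pattern = "dataset~{dataset}/method~{method}/kfold~{kfold}"
dropped = sketch_drop_wildcard(pattern, "kfold")
tamed = sketch_tame_wildcard(pattern, "kfold")

# Hypothetical wildcard instances.
instances = [
    "dataset~a/method~rf/kfold~5",
    "dataset~a/method~rf/kfold~10",
    "dataset~b/method~rf/kfold~5",
]
unique = sketch_instances_drop_wildcard(instances, "kfold")

print(dropped)
print(tamed)
print(tamed.format(kfold=5))
print(unique)
```

The point of taming is that a rule can expand over one wildcard while leaving the others as wildcards, and the point of the unique set is that dropping a wildcard otherwise yields repeated instances.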