General deployment strategy for reference data files #11
The nf-core workflows have a common, mostly standardized way of handling reference genome data: the iGenomes reference sets. These sets, together with any special resources, could be packaged into their own reference container. Another option would be to implement the copy and delete steps as a separate module that is integrated into the Nextflow workflows. This would require more implementation effort, but could automate the installation process.
So like Kai, I also differentiate between generic reference sets and workflow-specific resources.
What I don't like is the part about "copying the required files into local directories and deleting them afterwards".
There is no such thing as a specialized reference when it is downloaded from an external resource and used as-is. For example, using VEP came up several times in the past couple of months in the lab. It is quite cumbersome to deploy offline, but commonly requested as part of the analysis. I don't see a compelling reason why VEP would be used in exactly one workflow only. That level of abstraction seems desirable, but potentially quite hard to achieve in a practical setting.

A genuinely special (as in: not expected to be useful in other contexts) reference file should be derived as part of the workflow itself. For example, if my workflow requires that the gene model (say, GENCODE) consists only of pseudogenes and nothing else, then filtering the downloaded GENCODE files down to pseudogenes should happen as part of the workflow. (Note that GENCODE provides dedicated files, e.g., for "basic" gene annotation or for protein-coding transcript sequences only; quite likely many people need just that and nothing else.)
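To make the "derive the special reference inside the workflow" idea concrete, here is a minimal shell sketch that filters a GENCODE-style GTF down to pseudogene records. The sample records and file names are made up for illustration; a real workflow step would operate on the downloaded GENCODE annotation instead:

```shell
# Sketch: derive a workflow-specific annotation from a generic GENCODE-style
# GTF by keeping only pseudogene records. Sample input is illustrative.
cat > gencode.sample.gtf <<'EOF'
chr1	HAVANA	gene	11869	14409	.	+	.	gene_id "ENSG00000223972"; gene_type "transcribed_unprocessed_pseudogene";
chr1	HAVANA	gene	29554	31109	.	+	.	gene_id "ENSG00000243485"; gene_type "lncRNA";
chr1	HAVANA	gene	65419	71585	.	+	.	gene_id "ENSG00000186092"; gene_type "protein_coding";
EOF

# Keep only records whose gene_type ends in "pseudogene"
grep -E 'gene_type "[A-Za-z_]*pseudogene"' gencode.sample.gtf > pseudogenes.gtf
```

Wrapped in a workflow step, this keeps the specialized file reproducible and out of any shared reference resource.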
What would be your proposed solution to achieve that?
Alright, so no more workflow-specialized references in local folders, because that is, as far as I know, how we use Nextflow currently. Concerning my solution: Kai and I talked about this briefly this week, and the idea came up to work inside a CUBI Singularity container that holds all references and then start Nextflow, which starts its own Singularity containers. However, I am not sure whether that works: a container within a container. So far I have only been using Singularity containers, not writing my own.
That is right, these resources are not workflow-specific. I rather meant that specific nf-core workflows require additional resources, while all of them accept the pre-built iGenomes sets.
Nextflow can work with Singularity containers and actually needs to download a number of containers with the required software tools to run properly (which raises another question: whether we should maintain a copy of these software containers to deploy the workflows offline). So it should be possible to implement direct access to a reference container, but I don't know yet how to code that, and it probably involves changing the base code of the workflows (which may hamper integration of future updates from nf-core).
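For the offline-deployment part, Nextflow's Singularity support can be pointed at a pre-populated image cache via its configuration, without touching the upstream workflow code. A minimal sketch (the cache path is illustrative):

```shell
# Write a minimal nextflow.config that enables Singularity and points the
# image cache at a local, pre-populated directory (path is illustrative).
cat > nextflow.config <<'EOF'
singularity {
    enabled    = true
    autoMounts = true
    cacheDir   = '/data/nxf_singularity_cache'
}
EOF
```

If the cache directory is filled in advance (e.g., via `singularity pull` on a machine with network access), Nextflow should find the images locally instead of downloading them at run time.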
It seems I need to see a hands-on example of using these iGenomes sets, and also some info about what these files/sets are, how they are identified and deployed, etc. If they are already properly packaged in some way and we could maintain a local (offline mirror) resource with those datasets as a basic starting point for the nf-core workflows, then that seems to check enough of the above points to be marked as done.

For the other reference files that are not derived (i.e., they are downloaded from an online resource and placed into the working directory of the workflow), a simple solution would be to define some metadata per workflow (in the CUBI fork) and add a mini setup workflow* that takes care of copying the files into place from the local (offline) resource. This would keep upstream compatibility and should fall into the domain of "automate the boring stuff". Kai, didn't you say you already do that, just in the form of a bash script at the moment?

*Realized as a Nextflow workflow, this would be a very simple way to gain some experience with writing Nextflow workflows.

Edit: following the link that Kai posted, I am greeted with a warning that the annotations in many of the iGenomes sets are totally outdated. Who is maintaining these iGenomes sets? It reads like no one is in charge?
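The "mini setup" step could look roughly like the following, shown here as a plain shell sketch rather than a Nextflow workflow. The metadata format (one required file per line) and all paths are hypothetical:

```shell
# Sketch: stage reference files from a local (offline) mirror into the
# workflow's working directory, driven by a per-workflow metadata file.
MIRROR=./ref_mirror          # stand-in for the local read-only mirror
STAGE=./work/references     # staging area inside the run directory

mkdir -p "$MIRROR" "$STAGE"
printf '>chrTest\nACGT\n' > "$MIRROR/genome.fa"   # dummy reference file

# Hypothetical metadata: one required file per line, relative to mirror root
cat > required_refs.txt <<'EOF'
genome.fa
EOF

while read -r ref; do
    cp "$MIRROR/$ref" "$STAGE/$ref"
done < required_refs.txt
```

Ported to a tiny Nextflow workflow, the same logic would keep the staging step versioned alongside the CUBI fork without changing the upstream workflow code.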
So Phil Ewels took a copy of the iGenomes, uploaded them to an AWS S3 bucket, and added additional files like indexes for STAR, all done with time-limited funding. It is 5 TB in size.
That means the S3 resource is vanishing at some point?
So, in its entirety, it seems like the iGenomes resource is a good candidate for local mirroring with an enforced read-only state. That way, all groups could access the reference data on the local infrastructure and a central authority (= Research IT) would manage/regulate that. I'll talk to them about this use case.
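The read-only state of such a mirror could be enforced at the filesystem level. A sketch, with the actual sync step shown commented out because it needs network access (the `aws s3 sync --no-sign-request` route and the `s3://ngi-igenomes/igenomes/` bucket path follow the AWS-iGenomes documentation; the local directory name is illustrative):

```shell
# Sketch: maintain a local mirror and strip write permissions so the data
# stays read-only for all groups. The stand-in file simulates synced content.
mkdir -p ./igenomes_mirror/demo
echo "reference data" > ./igenomes_mirror/demo/sample.txt

# Actual mirroring step (requires network access), per AWS-iGenomes docs:
# aws s3 sync --no-sign-request s3://ngi-igenomes/igenomes/ ./igenomes_mirror/

chmod -R a-w ./igenomes_mirror   # enforce read-only state
```

Periodic re-syncs would then temporarily restore write permission for the managing authority only, keeping the resource read-only for everyone else.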
I don't see an obvious license file. iGenomes is maintained by Illumina, so I guess their usual regulations apply there as well; all guessing on my part, however. On Phil's GitHub he says: "AWS has agreed to host up to 8TB data for AWS-iGenomes dataset until at least 28th October 2022. The resource has been renewed once so far and I hope that it will continue to be renewed for the foreseeable future." Yes, I think it might be a good idea to approach Research IT for a local read-only mirror.
We need to develop a general strategy to deploy reference data files - or potentially also database dumps - on any compute infrastructure.
Requirements:
Although the current Snakemake pipeline implementing the creation of reference containers needs refactoring to simplify usage, the reference containers themselves fulfill all of the above requirements.
The question:
The major point to discuss is how to find a solution that enables transparent and standardized integration into various existing Nextflow workflows. This might entail finding a different approach altogether.