State of the project #3653

lbeltrame · 2022-05-24T06:07:05Z

DISCLAIMER: This post is not meant to raise undue criticism and/or fuel anxiety about the project. It is meant as a way to kickstart discussion for those who are interested in improving it.

It is evident to anyone following this project that its development (for many, probably justified, reasons) has de facto stalled (no new commits for a month and a half). Low activity is not per se a problem, but it becomes one when PRs stay open for a long time, and some are eventually closed (for example support for additional BWA methods, or newer VEP), with no one actually reviewing them.

This post is meant, as I wrote above, to enquire about how bcbio stands today, and whether the community can do anything to help. I understand that the landscape in the past years has changed considerably, with WDL, CWL, Snakemake and Nextflow entering the fray, and containerization of analyses (more Apptainer / Singularity than Docker, but that's my personal opinion, and out of scope). However, bcbio gets some things right that other workflows don't yet:

Tight integration
Post processing and harmonization of outputs (PureCN, etc.)
Splitting and parallelizing on regions -> this is IMO a killer feature that I haven't seen anywhere else
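
To illustrate the general idea behind splitting on regions, here is a minimal sketch that cuts each contig into fixed-size windows that could then be processed in parallel. It is a simplified illustration only, not bcbio's actual implementation; the function name `split_regions`, the window size, and the toy contig lengths are assumptions made for the example.

```python
from typing import Dict, Iterator, Tuple

# (contig, start, end), 0-based and half-open, as in BED files
Region = Tuple[str, int, int]


def split_regions(contig_lengths: Dict[str, int], chunk_size: int) -> Iterator[Region]:
    """Yield fixed-size windows covering every contig (hypothetical helper)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for contig, length in contig_lengths.items():
        for start in range(0, length, chunk_size):
            # The last window of a contig is truncated at the contig end.
            yield contig, start, min(start + chunk_size, length)


if __name__ == "__main__":
    toy_genome = {"chr1": 250, "chrM": 16}  # made-up lengths for the demo
    for region in split_regions(toy_genome, chunk_size=100):
        print(region)
```

Each yielded region can be handed to a separate worker (for example one variant-calling job per region), with the per-region outputs concatenated afterwards.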

OTOH, it is clear that there's some technical debt accumulating over the years, and this can prevent further improvement. It would be sad to see this project wither away, so my question is: what can be done to help?

I'm aware of a roadmap issue but as far as I can see it's fairly high level. I'm also aware that multiple hands are needed because the codebase is large, and that also involves cloudbiolinux.

So, these questions go to @naumenko-sa and @roryk:

What are the largest low level issues in bcbio as of today?
What kind of help is needed?
Does it make sense to raise awareness in the community to increase the critical mass needed to go forward?

Of course I'd love input from some other contributors here.

sjhosui · 2022-05-24T14:06:47Z

@lbeltrame, thank you for your post and for being a long-term contributor/user of bcbio. We appreciate your concern and succinct summary of some of the challenges. Your questions are timely as we (the Harvard Chan Bioinformatics Core) are in the process of writing up a proposal for the CZI EOSS Cycle 5 awards and have been contemplating how best to continue to maintain and develop bcbio. I agree that bcbio has several advantages over other workflow systems, including the ones you have mentioned. Our challenge is that many of our developers have since moved on to other positions and can no longer support this project. @naumenko-sa has been doing a fantastic job trying to juggle ad hoc support with his research/consulting projects in the core, but he is now the sole contributor in our group and this work has not been funded since the start of the year (hence the application for funds to support this effort for the next two years). I’ll leave it to @naumenko-sa to comment on the largest low-level issues. In terms of help, we need developers who are familiar with the bcbio ecosystem to become more involved. We have attempted to recruit additional team members but have found it challenging to find the right combination of skills for this project. If we could identify someone who is already familiar with how bcbio works, understands the biological applications, is motivated and has the skills to participate, we’d be interested in speaking to that person (see also https://bioinformatics.sph.harvard.edu/careers; though it does not mention bcbio, if someone comes along who is a good fit for it, we can adjust). 
In the long term however, this project does need, as you so aptly put it, “to raise awareness in the community to increase the critical mass needed to go forward.” The application we’re putting together now shows that bcbio is widely used (367 published papers, 133000 downloads from bioconda the past three years, ~1400 unique visitors to this github over the past two weeks). Some questions for the community here:

* What would help to encourage others to contribute? Should we consider a hackathon, an annual meeting, other events that would bring this community together?
* What are the technical roadblocks for other contributors? How can we help make it easier?

I have further thoughts but would love to hear from the community. I encourage anyone interested in contributing more to contact us to help galvanize this effort.

naumenko-sa · 2022-05-24T17:14:38Z

Hi @lbeltrame !

Thanks for bringing this issue up and also for many years of your commitment to bcbio as a contributor!

If anybody can help by running another installation test, that would be helpful for testing the solution to the openssl issue.
https://github.com/bioconda/bioconda-recipes/blob/master/recipes/bcbio-nextgen/meta.yaml#L43

The next thing is to bring in the T2T reference, which I am also interested in doing.

Other than that, I don't think there are any burning issues; please post links here if I missed anything super important.
Most issues that have come in lately are educational. Again, anybody please feel free to chime in and help less experienced bcbio users with python/conda/PATH basics.

The time of other top bcbio contributors who have done fantastic work (Rory, Lorena, Michael, Brad) is not available now. Everyone in the bcbio community is grateful to them - they put in a lot of effort and created a very robust code base.

I am trying my best to take on the issues on GitHub, given the time/budget limitations (my major effort goes to bioinformatics consulting at https://bioinformatics.sph.harvard.edu/). Also, I can't ignore an elephant in the room: https://github.com/vshymanskyy/StandWithUkraine. The last 3 months have been devastating. Please support Ukraine in the way you can!

Overall, we are trying to make sure bcbio is working for major use cases and we are using it to process data in our projects, i.e. bcbio is now in maintenance rather than active development mode. Our projects lately included bulk RNA-seq, SV calling in WGS, WGBS, CHIP calling (T only somatic on germline data). If anybody could contribute more to the maintenance of the PureCN pipeline and the T/N, T only, and UMI pipelines, to complement the set of projects our group receives, that would be greatly appreciated.

Many groups are running new projects or production on Illumina Dragen or Broad Terra, which is understandable given the speed of Dragen (30min for a WGS, 10 min for RNA-seq), and the scalability of Terra workspaces.
I am personally involved in 3 projects that are using these platforms.

The remaining bcbio niche is small labs and projects, underfunded labs, and the specific use cases you have mentioned (some of the downstream analyses and integration), and we continue supporting bcbio for them.

The big topic that needs to be addressed in bcbio is bringing back container support for separate pipelines - help needed.

More specifically:

@matthdsm regularly updates VEP, and we usually merge it fast; I think there was some issue with the latest update and it was rolled back. Maybe it has now been fixed upstream?
I am not sure what happened to WIP: add bwa-mem2 aligner #3359; likely it needed more effort to get going. We can re-open it and get back to it if there is a stakeholder (a person who will actively push the agenda; having bwa-mem2 is not critical for our lab).
The latest dbNSFP pull requests (https://github.com/chapmanb/cloudbiolinux/pulls) need to be tested. While I understand that for the type of work @matthdsm is running having the latest annotation is critical, dbNSFP is the largest data installation in bcbio (currently 32G for me); the installation process requires 1T of tmp space and runs for a week for some users. Once I merge the request, the upgrade will be triggered for many users. At the least, I need to run the recipe myself to make sure it is working and to give users an estimate of how long it will run. If anybody could do this, please do so - it will speed up the merge of these PRs. This merge also depends on openssl installation testing - we don't need two issues in the installation at the same time.
Add support for germline resources (from gnomAD) in Mutect2 #2937 - sorry we dropped the ball on this one. I was not sure how it interferes with the germline sites calling in the purecn pipeline, and also it was important for production runs not to filter calls at the calling step. I think other large T/N and T users also opposed the change.

Please feel free to DM if you like at snaumenko[at]the same domain as our main contact if you have more questions or suggestions.

SN

matthdsm · 2022-05-24T19:28:46Z

Hi,
First of all, I would also like to thank all contributors large and small. bcbio has given us (and still gives us) many results over the years.

As for the comments:

@matthdsm regularly updates VEP, and we usually merge it fast; I think there was some issue with the latest update and it was rolled back. Maybe it has now been fixed upstream?

I'm not aware of any issues. We recently updated VEP in our bcbio installation without a hitch. I'll look into pushing another update if I find the time.

I am not sure what happened to #3359 likely it needed more effort to get going;

The PR had been open for a while, but there seemed to be little interest in supporting a new aligner, so I abandoned the effort. We just replaced bwa with bwa-mem2 for evaluation. Since it's a drop-in replacement, the effort was minimal.

the latest dbNSFP

We're using dbNSFP v4.3a, with the config in the PR. As you said, the preprocessing to make the dataset usable is immense, although I think a week is a bit of an overstatement. I ran the download and processing overnight, but this may largely be dependent on individual internet connections and disk speeds.
If there's funding, someone could host a processed version of dbNSFP on an S3 server to mitigate most of the issues.
We could even strip out unused columns to make the download easier to stomach.
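
As a rough sketch of what stripping columns could look like, the snippet below keeps only a named subset of columns from tab-delimited text with a single header line. It assumes that layout for the processed file; the helper name `keep_columns` and the column names in the demo are placeholders, not the actual dbNSFP fields that bcbio uses.

```python
import csv
import io
from typing import Iterable, Iterator, List, Sequence


def keep_columns(lines: Iterable[str], wanted: Sequence[str]) -> Iterator[List[str]]:
    """Yield the header and rows restricted to the wanted columns (hypothetical helper)."""
    # QUOTE_NONE: treat quote characters in annotation fields as plain text.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)
    missing = [name for name in wanted if name not in header]
    if missing:
        raise KeyError(f"columns not found in header: {missing}")
    indices = [header.index(name) for name in wanted]
    yield [header[i] for i in indices]
    for row in reader:
        yield [row[i] for i in indices]


if __name__ == "__main__":
    # Tiny made-up table standing in for the real annotation file.
    demo = io.StringIO(
        "#chr\tpos\tref\talt\tscore_a\tscore_b\n"
        "1\t100\tA\tG\t0.1\t0.9\n"
    )
    for row in keep_columns(demo, ["#chr", "pos", "ref", "alt", "score_b"]):
        print("\t".join(row))
```

Because the function streams row by row, it never holds the whole table in memory, which matters for a dataset of this size.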

As I see it, the paradigm has shifted from bringing the data to the analysis to the other way around. As it is now, bcbio is a huge monolith, which sometimes makes it difficult to use as a portable analysis platform.
As mentioned, workflow languages have made it very easy to create quality pipelines.

In my opinion, the following would pull bcbio back to the front page:

Make installation less error-prone. Right now, there's a million conda packages and a bunch of different envs to keep them from fighting. The resolver also makes version pinning very flaky (Feature request: pinned dependency versions for a specific bcbio release, to avoid future problems with faulty conda environment solving #3644). Containers are the way to go here. There's a biocontainer for every bioconda package.
clean up the toolchain, upgrade everything to the latest version and drop tools that are no longer supported.
clean up the codebase and remove legacy code. Keeping old code around makes it harder to maintain.
thorough (automated) testing of the code and the analysis

The pipeline offered by bcbio is rock solid, so I think the project should focus on that. There's a lot of workflow languages that can be used to make bcbio more portable. There's no longer a need to maintain a scheduling manager to track and run jobs. I realise this is not a trivial task, but embracing a workflow language could make the codebase way easier to maintain.

I also think cloudbiolinux should be pulled or forked into the bcbio organisation and stripped down to only handle the bcbio installation in its current form. There's no need for ansible, homebrew or all the other complexities (in my experience).

Anyways, just my 5 cents. Keep up the good work!
Cheers

chapmanb · 2022-05-25T01:38:15Z

Luca, Matthias, Sergey and Shannan;
I'm so appreciative of the work everyone does to support and extend bcbio, and thankful that it's helped so many groups do amazing science. As you've all mentioned, the landscape has changed a ton since bcbio was initially developed in terms of packaging, workflow systems, and the awesome open source bioinformatics community.
I totally trust the community, Sergey, Shannan and the Harvard core team in deciding how best to support and continue bcbio. It's always worth thinking about which components of bcbio are unique, widely used, and not a huge burden to extend and update. The hardest decisions for me were always in choosing not to do something or change functionality, so I know where everyone is at. Thanks to everyone for being so awesome and thoughtful, I miss building things with this community.

lbeltrame · 2022-05-25T06:34:20Z

Thanks a lot for the responses to this thread!

@sjhosui

Your questions are timely as we (the Harvard Chan Bioinformatics Core) are in the process of writing up a proposal for the CZI EOSS Cycle 5 awards and have been contemplating how best to continue to maintain and develop bcbio.

I figured as much: this is an endemic problem with maintaining software in the context of doing research and certainly not anyone's fault.

What would help to encourage others to contribute? Should we consider a hackathon, an annual meeting, other events that would bring this community together?

My opinion, as someone deeply involved in Free Software projects also outside my profession: something that would be truly useful, if possible, would be identifying "junior jobs" or low-hanging fruit for contributors to hack on. This helps in getting familiar with the codebase without getting stuck on overly large projects.

A hackathon or something like that would be useful as well for those wanting to get into bcbio, or at least to see what can be changed / improved. But even there, my suggestion for outsiders is: start small. To be honest, I didn't back in the day, but in retrospect, I should have. ;)

What are the technical roadblocks for other contributors? How can we help make it easier?

The code is quite large, and in particular the handling of the "outside" dependencies can be daunting. In addition, the "correct" way of doing things in 2013-2015 changed in later years.

As I said, the first step would be to identify (bite-sized) areas for improvement.

@naumenko-sa

If anybody could contribute more to the maintenance of PureCN pipeline, T/N, T only, UMI pipelines to complement the set of projects our group receives, that would be greatly appreciated.

While in "maintenance mode", there are a few areas where I think that new improvements can land. For example, sWGS pipelines (the gold standard is QDNAseq, although it is fairly old) or specific ctDNA methods like ichorCNA (in active development). Then, as you said, UMI pipelines are something worth looking into (e.g., dual UMI solutions like IDT's).

Other stuff, fairly lower level, would be benchmarking. There's already quite a lot of stuff for variants (very useful) but little else (that said, I have no idea how much benchmarking is around for, say, CNVs).

I wouldn't mind also dropping support for some software if maintainability becomes a problem.

Many groups are running new projects or production on Illumina Dragen or Broad Terra, which is understandable given the speed of Dragen (30min for a WGS, 10 min for RNA-seq), and the scalability of Terra workspaces.

One of the reasons I'm "pushing" for bcbio is because outside the US, even in a G7 country like mine, policies, availability of funding, and concerns ("the cloud") make relying on such platforms untenable. At my previous job there were zero resources to run analyses like that (and the connectivity wouldn't support it) and bcbio allowed us to run many (10+K jobs) analyses on inexpensive, old-generation hardware. This software is IMO invaluable for those who work on-premises.
Also, the results it provided improved my career considerably, so that is a plus, too. ;)
It's still important now, even though my current institution has far more resources, because by policy stuff has to be done on premises. IOW, there are even large institutions that cannot use Terra or similar platforms.

The big topic that needs to be addressed in bcbio is bringing back container support for separate pipelines - help needed.

You mean, kind of like what Nextflow does? Probably this is a worthy long-term goal. Splitting it into chunks of what "needs to be done" would help in prioritizing work.

@matthdsm

I also think cloudbiolinux should be pulled or forked into the bcbio organisation and stripped down to only handle the bcbio installation in its current form. There's no need for ansible, homebrew or all the other complexities (in my experience).

While perhaps I have a weaker opinion on this, I too honestly don't see the need of cloudbiolinux as a separate project at this stage. Clearly it grew organically, but perhaps a "diet" would help, and I would support folding it into bcbio.

Agreed on the scheduling, which is now handled better than when support was introduced, but as far as I can see this is deeply tied inside bcbio's internals, so that would probably be a longer term effort.

matthdsm · 2022-05-25T07:29:42Z

One of the reasons I'm "pushing" for bcbio is because outside the US, even in a G7 country like mine, policies, availability of funding, and concerns ("the cloud") make relying on such platforms untenable

This ⏫ . I work at a hospital, and absolutely NO data can leave the premises without possible legal issues.

amizeranschi · 2022-05-27T05:39:38Z

Hi everyone, happy to see renewed interest in bcbio!

@naumenko-sa could I also ask for your help regarding a recent issue, where we're unable to install the sacCer3 genome data in a fresh install? #3652

@marianastase0912 is building some wrappers for bcbio as part of her B.Sc. diploma project and we've been using sacCer3 data for building demos. We were able to install the genome data before the latest bcbio release, but this stopped working in the meantime, for some reason.

naumenko-sa · 2022-06-02T00:51:25Z

@amizeranschi please try installing sacCer3 again.
