This repository contains the source code for two interactive Shiny sites that support the variant review process. Reviewers do not need the files housed here to contribute to this project. That said, feel free to dig into the source code if you are interested! The repository also contains several thousand IGV screenshots from a single large cancer exome data set (Reddy et al., 2017). The screenshots are split between two directories (`false_positives` and `false_negatives`). Respectively, these represent sets of variants that were not identified in a reanalysis of this data set with modern pipelines (`false_positives`) or were newly identified by these pipelines (`false_negatives`). This division should be ignored while performing the review process to avoid biasing reviewers as they rate variants.
Thanks for your interest in this crowdsourced effort to curate a set of variant calls of the highest possible quality. If done with sufficient care and rigor, we hope this work will enable subsequent analyses to uncover common sources of error in individual analysis pipelines. The reviewing will follow the guidelines detailed in this paper along with a streamlined scoring system we devised to simplify the process. We use a scale with a maximum score of 5 to grade variants based on their quality. The maximum score is reserved for variants with an ideal amount of support and no issues that reduce the reviewer's confidence in their accuracy. Hence, a variant with good support but one or more confounders should be downgraded to a lower score.
- 0: The scale allows a variant to be given a score of 0 when it has absolutely no evidence in the data according to IGV. In theory, this scenario should never happen.
- 1: Reserved for variants that appear to have the bare minimum of support: either a single read supporting the variant or, in the case of short-insert libraries, two reads (F and R from the same molecule). Variants should rarely be downgraded to this category otherwise.
- 2: For variants with slightly more than the minimum possible support (i.e. 2-3 molecules from up to 6 total reads, in the case of short-insert libraries). You may downgrade a variant with higher support to this score based on the presence of other confounders.
- 3: For variants with less-than-ideal (i.e. modest) support that still exceeds the support described for a score of 2. You may downgrade a variant with stronger support to this score based on the presence of other confounders.
- 4: For variants with almost ideal support, or with ideal support and possibly one confounder.
- 5: For the ideal variant with no confounders.
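As a rough illustration only, the scale can be sketched as a small Python helper. The function name, the numeric cut-offs for "almost ideal" and "ideal" support (`near_ideal_min`, `ideal_min`), and the rule of dropping one point per confounder are all assumptions made for this example; the guidelines leave those calls to the reviewer's judgment.

```python
def suggest_score(molecules, confounders=0, near_ideal_min=8, ideal_min=12):
    """Suggest a 0-5 score from the number of distinct supporting molecules.

    near_ideal_min and ideal_min are hypothetical placeholders, not project
    rules. Each confounder lowers the score by one step (also an assumption).
    """
    if molecules < 0 or confounders < 0:
        raise ValueError("counts must be non-negative")
    if molecules == 0:
        return 0  # no evidence at all in IGV
    if molecules == 1:
        return 1  # bare minimum: a single supporting molecule
    if molecules <= 3:
        base = 2  # slightly above minimal support
    elif molecules < near_ideal_min:
        base = 3  # modest support
    elif molecules < ideal_min:
        base = 4  # almost ideal support
    else:
        base = 5  # ideal support
    # Variants should rarely be downgraded all the way to 1, so stop at 2.
    return max(2, base - confounders)
```

For example, under these assumed cut-offs, `suggest_score(20, confounders=1)` gives 4, matching "ideal support and possibly one confounder".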
Use this page to see how other users have applied our scoring system. This page will show you random examples of IGV screenshots and the score (or scores) given by individual participants.
When you are ready, use this page to review the variants. The interface should be relatively self-explanatory. The only requirement for scoring a variant is that you select a score between 0 and 5 (realistically, usually between 1 and 5). The tags in the section below the score are optional, but using them is encouraged, especially if you have downgraded a variant's score due to issues that the tags can describe. If you want to note something not readily captured with a tag, feel free to enter a brief comment in the box provided.
When you start, be sure to enter a user ID in the box to avoid your submissions being tracked under `anonymous`. We recommend using the first part of the email address associated with your GitHub account (everything before the `@`). If you have done this correctly, your ID should appear on the leaderboard at the bottom of the side pane after you have submitted at least one review. If we start seeing suspicious user IDs showing up, we will restrict the leaderboard to only show white-listed IDs. Hopefully this will not become necessary.
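If it helps, the recommended ID is simply the text before the `@` in your email address. A minimal Python sketch (the function name and the example address are hypothetical, not part of the site):

```python
def user_id_from_email(email):
    """Return the part of an email address that precedes the '@'."""
    local_part, separator, _domain = email.strip().partition("@")
    if not separator or not local_part:
        raise ValueError("expected an address of the form name@domain")
    return local_part


print(user_id_from_email("jane.doe@example.com"))  # prints "jane.doe"
```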
Things to note:
- None of the samples have a matched normal available. The page auto-selects the No count normal (NCN) tag for you to track this.
- Nearly all libraries have a short insert size. This means that many of the candidate variants will be supported by overlapping forward (F) and reverse (R) reads from the same pair. Take this into account when scoring and, where possible, consider how many distinct molecules (rather than reads) support the mutation.
- Some regions of certain genes have been noted to show evidence of contaminating PCR amplicons. Examples of what this looks like can be found in the PubPeer post describing them.
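To make the reads-versus-molecules distinction concrete, here is a minimal Python sketch. It assumes that the two mates of a pair share the same read name (as is conventional in paired-end alignments) and that you have already listed the reads carrying the variant; the function name and the input format are hypothetical.

```python
def count_supporting_molecules(supporting_reads):
    """Count distinct molecules among the reads that support a variant.

    supporting_reads: iterable of (read_name, strand) tuples, where strand
    is "F" or "R". Mates from the same pair are assumed to share a read
    name, so an overlapping F/R pair counts as one molecule.
    """
    return len({read_name for read_name, _strand in supporting_reads})


reads = [("read1", "F"), ("read1", "R"), ("read2", "F")]
print(len(reads))                         # 3 supporting reads
print(count_supporting_molecules(reads))  # 2 distinct molecules
```

In this example, three supporting reads come from only two molecules, which corresponds to the "slightly above minimal" level of support rather than anything stronger.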