# Add further details on EC ingestion #447

Status: Open. Wants to merge 13 commits into base branch `dev`.
`scripts/README.md`: 71 additions, 0 deletions
```shell
python parse_generic_metadata.py \
--search-path "gs://cpg-upload-bucket/collaborator" \
<manifest-path>
```

## Existing Cohorts Data Ingestion

Data is commonly received in the form of CSV files. Each CSV file can include data on a Participant, Sample, Sequencing Group, and other related attributes.

As described above, the generic metadata parser takes in a CSV file to be ingested. Each row corresponds to a sequencing group/sample, and each column corresponds to an attribute. While we try to ensure all input data is homogenized, some inputs may need further work before they can be ingested. To handle specific data formats or conditions, we have developed additional parsers that build upon our generic metadata parser.
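
For illustration only, a minimal manifest might look like the following. The column names and values here are hypothetical; actual manifests vary by collaborator:

```csv
Individual ID,Sample ID,Filenames,Sequence Type
P001,S001,S001.R1.fastq.gz|S001.R2.fastq.gz,genome
P002,S002,S002.R1.fastq.gz|S002.R2.fastq.gz,genome
```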

### Existing Cohorts Ingestion Workflow

1. Parse Manifest CSVs with `parse_existing_cohort.py`

Usage:

```shell
analysis-runner --dataset <DATASET> --access-level standard \
    --description <DESCRIPTION> -o parser-tmp \
    python3 -m scripts.parse_existing_cohort --project <PROJECT> \
    --search-location <BUCKET CONTAINING FASTQS> --batch-number <BATCH> \
    --include-participant-column <PATH TO CSV>
```

For example:

```shell
analysis-runner --dataset fewgenomes --access-level standard \
    --description "Parse Dummy Samples" -o parser-tmp \
    python3 -m scripts.parse_existing_cohort --project fewgenomes \
    --search-location gs://cpg-fewgenomes-main/ --batch-number "1" \
    --include-participant-column gs://cpg-fewgenomes-main-upload/fewgenomes_manifest.csv
```

The existing cohort parser behaves as follows:

- If the input CSV does not include a participant ID column, the external ID is used by default. If a participant column does exist, specify the `--include-participant-column` flag (which defaults to False) so the parser reads it.
- It assumes the batch number is not a column in the input CSV, so it must be supplied via the `--batch-number` parameter; this means manifests should be ingested one batch at a time. The batch number can often be found in the CSV filename.
- It assumes FASTQ URLs are not provided in the CSV, so it discovers them in the search bucket. Although each row of the CSV is keyed by a sample ID, those sample IDs do not appear in the FASTQ file paths; instead, the FASTQs are named according to their FluidX tube ID. The script matches samples to FASTQs via the tube ID, as sketched below.
- It skips the manifest's header row.
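
As a purely illustrative sketch of that matching (the tube ID and filename pattern below are hypothetical; the real naming convention depends on the sequencing facility):

```
Manifest row:    Sample ID = S001, FluidX tube ID = FD01234
Bucket listing:  gs://cpg-fewgenomes-main/FD01234_R1.fastq.gz
                 gs://cpg-fewgenomes-main/FD01234_R2.fastq.gz
Match:           S001 <- FD01234_R1.fastq.gz, FD01234_R2.fastq.gz
```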

2. Update Participant IDs with `fix_participant_ids.py`
If participant IDs were not ingested correctly, `fix_participant_ids.py` can be used to update the external participant IDs. It takes a map of `{old_external: new_external}` as input, as sketched below.
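
For example, a mapping like the following would rename two external participant IDs (the IDs here are placeholders):

```json
{
  "OLD_PARTICIPANT_1": "NEW_PARTICIPANT_1",
  "OLD_PARTICIPANT_2": "NEW_PARTICIPANT_2"
}
```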

3. Parse Pedigrees with `parse_ped.py`
Pedigree files are not handled by the existing cohort parser; use `parse_ped.py` for these instead. Note that step 2 must be completed first.
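
For reference, pedigree files follow the standard six-column PED layout: family ID, individual ID, paternal ID, maternal ID, sex (1 = male, 2 = female), and affected status. The rows below are illustrative only:

```
FAM01  P001  P002  P003  1  2
FAM01  P002  0     0     1  1
FAM01  P003  0     0     2  1
```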

### Post-Ingestion

To test new and existing workflows, `-test` projects should be used.
For development work, you can use `fewgenomes-test`, `thousandgenomes-test`, or `hgdp-test`.

Before running an existing workflow on a new dataset, the corresponding `-test` project should first be populated. The `create_test_subset.py` script handles this.

Usage:

```shell
analysis-runner --dataset <DATASET> --access-level standard \
    --description <DESCRIPTION> -o test-subset-tmp \
    python3 -m scripts.create_test_subset --project <PROJECT> \
    --samples <N_SAMPLES> --skip-ped
```

Parameters:

| Option        | Description                                                                   |
|---------------|-------------------------------------------------------------------------------|
| --project     | The sample-metadata project ($DATASET)                                        |
| -n, --samples | Number of samples to subset                                                   |
| --families    | Minimal number of families to include                                         |
| --skip-ped    | Flag to use when no pedigree/family information is available                  |
| --add-family  | Additional families to include; all samples from these families are included |
| --add-sample  | Additional samples to include                                                 |

In some cases a random subset of samples or families is sufficient; use the `--samples` or `--families` parameters for this.
In other cases a specific subset is more useful (for example, when a pipeline fails on a particular sample in production); in that instance, pass the `--add-family` or `--add-sample` parameters as many times as needed, as in the sketch below.
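
For example, a hypothetical invocation that pulls in two specific samples and one family (the sample and family IDs below are placeholders):

```shell
analysis-runner --dataset fewgenomes --access-level standard \
    --description "Subset specific samples" -o test-subset-tmp \
    python3 -m scripts.create_test_subset --project fewgenomes \
    --add-sample CPG12345 --add-sample CPG67890 --add-family FAM01
```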