Adding analyst tutorial markdown and jupyter notebook #143

Draft
wants to merge 8 commits into base: main


michael-harper
Contributor

Created a tutorial to help new analysts get on board. A few points:

  • Using the bioheart dataset at this stage, as there is no dataset set aside for those onboarding. Obviously this should change; it would be best to use publicly available genomes.
  • The Final_analyst_tutorial.ipynb cannot be completed as is, because I have not included an annotations.txt file: it references CPG IDs and I did not want them visible.
  • I'm still trying to figure out how to merge 1kg.mt with the tutorial's Matrix Table. At this stage 1kg.mt is built on GRCh37, whereas the Matrix Table from our pipeline uses GRCh38. If anyone has any ideas, please let me know!
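
One possible direction for the reference mismatch, offered as an untested sketch rather than a worked solution: lift the GRCh37 Matrix Table over to GRCh38 with Hail's `hl.liftover` before attempting any merge. The function name `lift_to_grch38` and its `chain_file` argument are hypothetical; the caller would need to supply a GRCh37-to-GRCh38 chain file that the Hail session can read.

```python
def lift_to_grch38(mt, chain_file):
    """Re-key a GRCh37 MatrixTable onto GRCh38 coordinates (sketch only)."""
    # Imported lazily so this module can be loaded where Hail is not installed.
    import hail as hl

    rg37 = hl.get_reference("GRCh37")
    rg38 = hl.get_reference("GRCh38")

    # Register the chain file once per session; `chain_file` is caller-supplied.
    if not rg37.has_liftover(rg38):
        rg37.add_liftover(chain_file, rg38)

    # include_strand=True yields a struct with `result` and `is_negative_strand`.
    mt = mt.annotate_rows(
        new_locus=hl.liftover(mt.locus, "GRCh38", include_strand=True)
    )

    # Drop variants that fail to lift or that land on the negative strand.
    mt = mt.filter_rows(
        hl.is_defined(mt.new_locus) & ~mt.new_locus.is_negative_strand
    )

    # Re-key rows on the lifted locus, then discard the temporary field.
    mt = mt.key_rows_by(locus=mt.new_locus.result, alleles=mt.alleles)
    return mt.drop("new_locus")
```

Once both tables share GRCh38 row keys, `MatrixTable.union_cols` could combine their samples, provided the column and entry schemas are first made to match (for example by selecting only `GT` on each side). Lifting only the locus does not re-check reference alleles, so the result would want a sanity check before being used in the tutorial.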

```
--access-level full \
scripts/create_test_subset.py --project bioheart --samples XPG280371 XPG280389 XPG280397 XPG280405 XPG280413 --skip-ped
```
**FOR REFERENCE: The above was taken from [this](https://centrepopgen.slack.com/archives/C03FA2M1MR9/p1700020527448029?thread_ts=1699935103.776929&cid=C03FA2M1MR9) Slack thread**
Contributor


Referencing is good in code, but I don't think it's necessary in this README.

Contributor Author


Agreed, this was for personal referencing so I could keep track!


#### Task: Write a config file
- To run the `large cohort` pipeline on the `bioheart-test` dataset, we need a config file set up for that dataset.
- Have a go at writing your own config file that runs the `large cohort` pipeline on `bioheart-test` up until the `Combiner` stage.
Contributor


The person reading this has no chance of successfully completing this; there is simply not enough information provided.

Contributor Author


I agree that asking the reader to recreate a config file from scratch might be a bit overwhelming, especially without a comprehensive guide. I'll revise this section to provide a step-by-step walkthrough of the parameters that need to be changed, emphasising that this is not exhaustive and the parameters needing to be changed will vary based on the requirements of the analysis.

#### Task: Write a config file
- To run the `large cohort` pipeline on the `bioheart-test` dataset, we need a config file set up for that dataset.
- Have a go at writing your own config file that runs the `large cohort` pipeline on `bioheart-test` up until the `Combiner` stage.
- You can use the default config file as a starting point, and then override the necessary parameters to run the pipeline on `bioheart-test`.
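
As a rough illustration of "start from the defaults, then override", the sketch below layers a small set of overrides on top of a default config using a recursive merge. Every key and value here (`workflow`, `dataset`, `access_level`, `sequencing_type`, `last_stages`) is an assumption chosen for illustration and is not a confirmed list of the pipeline's parameters; check the real default config for the actual names.

```python
from copy import deepcopy

# Hypothetical defaults, standing in for the pipeline's default config file.
DEFAULT_CONFIG = {
    "workflow": {
        "dataset": "example-dataset",
        "access_level": "standard",
        "sequencing_type": "genome",
    },
}

# Hypothetical overrides for this task: point at bioheart-test and
# stop once the Combiner stage has run.
BIOHEART_TEST_OVERRIDES = {
    "workflow": {
        "dataset": "bioheart-test",
        "access_level": "test",
        "last_stages": ["Combiner"],
    },
}


def merge_config(base, overrides):
    """Return a new dict where `overrides` is layered on top of `base`.

    Nested dicts are merged key by key; any other value is replaced.
    Neither input is modified.
    """
    merged = deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


config = merge_config(DEFAULT_CONFIG, BIOHEART_TEST_OVERRIDES)
```

Only the keys named in the overrides change; everything else (here, `sequencing_type`) is inherited from the defaults, which is why an override file can stay short.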
Contributor


> You can use the default config file as a starting point

I personally think the default is very confusing and missing lots of entries. Plus, most of it is not explained well. It would be very hard for the individual going through this pipeline to know what they need to keep and omit without any background

Contributor Author


Is there a full list of parameters available?


Instructions on setting up a Jupyter Notebook in the cloud can be found [here](https://github.com/populationgenomics/team-docs/blob/main/notebooks.md)

Please continue this tutorial once you have a Jupyter Notebook running in the cloud, good luck!
Contributor


> good luck!

Again, I don't think a task-based tutorial with very little context or supporting help is the best way to help a new hire.

@KatalinaBobowik
Contributor

Fundamentally, I don't think production-pipelines is a good place to host a tutorial for an analyst. Rather, the tutorial should work on a dataset in test. For example, you could run a PCA (or any other analysis) on a test dataset (e.g., bioheart-test or tob-wgs-test), iterating on it in a notebook. Then, once the individual is happy with exploring the data, they could send the script to the analysis runner (using `--access-level test`).

@KatalinaBobowik
Contributor

For the notebook, I think your two sections (the content taken from the production-pipelines README in the first half, and the PCA example at the bottom) are confusing. Both lack an explanation of what is happening (one of the benefits of having a notebook), context, and references to where the material came from.
