Adding analyst tutorial markdown and jupyter notebook #143

michael-harper · 2024-01-05T02:05:22Z

Created a tutorial for new analysts coming on board. A couple of points:

- Using the bioheart dataset for now, as there is no dedicated onboarding dataset yet. This should change; publicly available genomes would be the best option.
- `Final_analyst_tutorial.ipynb` cannot be run to completion because I have not included the `annotations.txt` file: it references CPG IDs, which I did not want to be visible.
- I'm still trying to figure out how to merge `1kg.mt` with the tutorial's MatrixTable. At this stage `1kg.mt` is built on GRCh37, while the MatrixTable from our pipeline uses GRCh38. If anyone has any ideas, please let me know!
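
On the GRCh37/GRCh38 mismatch, one possible approach is to lift `1kg.mt` over to GRCh38 with Hail's `hl.liftover` before attempting the merge. The following is a minimal sketch under assumptions, not a tested solution: it assumes `1kg.mt` is keyed by `locus` and `alleles`, that the public chain file at the path shown is readable from the notebook environment, and the function name `lift_to_grch38` is hypothetical.

```python
# Public chain file referenced in the Hail documentation (assumed to be accessible).
GRCH37_TO_GRCH38_CHAIN = "gs://hail-common/references/grch37_to_grch38.over.chain.gz"


def lift_to_grch38(mt, chain_path=GRCH37_TO_GRCH38_CHAIN):
    """Re-key a GRCh37 MatrixTable on GRCh38 loci, dropping rows that fail liftover."""
    import hail as hl  # imported lazily so this module loads without Hail installed

    rg37 = hl.get_reference("GRCh37")
    rg38 = hl.get_reference("GRCh38")
    # Register the chain file once; adding it a second time raises an error.
    if not rg37.has_liftover("GRCh38"):
        rg37.add_liftover(chain_path, rg38)

    mt = mt.annotate_rows(
        new_locus=hl.liftover(mt.locus, "GRCh38", include_strand=True),
        old_locus=mt.locus,
    )
    # Keep only loci that lifted successfully and stayed on the forward strand.
    mt = mt.filter_rows(hl.is_defined(mt.new_locus) & ~mt.new_locus.is_negative_strand)
    return mt.key_rows_by(locus=mt.new_locus.result, alleles=mt.alleles)
```

Usage would look like `lift_to_grch38(hl.read_matrix_table("1kg.mt"))`. Liftover only moves coordinates and does not rewrite alleles, so sites whose reference allele differs between builds would need separate handling, and the entry schemas of the two MatrixTables would still have to match before any column-wise union.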

KatalinaBobowik · 2024-01-07T23:19:25Z

onboarding_documentation/technical_documentation/analyst_tutorial/analyst_tutorial.md

+--access-level full \
+scripts/create_test_subset.py --project bioheart --samples XPG280371 XPG280389 XPG280397 XPG280405 XPG280413 --skip-ped
+```
+**FOR REFERENCE: The above was taken from [this](https://centrepopgen.slack.com/archives/C03FA2M1MR9/p1700020527448029?thread_ts=1699935103.776929&cid=C03FA2M1MR9) Slack thread**


Referencing is good in code, but I don't think it's necessary in this README.

Agreed, this was for personal referencing so I could keep track!

KatalinaBobowik · 2024-01-07T23:24:48Z

onboarding_documentation/technical_documentation/analyst_tutorial/analyst_tutorial.md

+
+#### Task: Write a config file
+- If we are wanting to run the `large cohort` pipeline on the `bioheart-test` dataset, we will need to create a config file that is capable of doing this.
+- Have a go at writing your own config file capable of running the `large cohort` pipeline on `bioheart-test` up until the `Combiner` stage. 


The person reading this has no chance of successfully completing this; there is simply not enough information provided.

I agree that asking the reader to recreate a config file from scratch might be a bit overwhelming, especially without a comprehensive guide. I'll revise this section to provide a step-by-step walkthrough of the parameters that need to be changed, emphasising that this is not exhaustive and the parameters needing to be changed will vary based on the requirements of the analysis.
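
As a possible starting point for that walkthrough, here is a minimal sketch of how an override config layers on top of a default one: only the keys that differ are supplied, and everything else is inherited. The dicts stand in for parsed TOML tables. Every key and value below (`workflow`, `dataset`, `access_level`, `last_stages`, `sequencing_type`, `n_pcs`) is an assumption for illustration and should be checked against the real default config; `merge_configs` is a hypothetical helper, not the pipeline's own loader.

```python
from copy import deepcopy


def merge_configs(base, override):
    """Return a new dict in which `override` values replace `base` values, table by table."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are tables: merge recursively instead of replacing wholesale.
            merged[key] = merge_configs(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical defaults, standing in for the pipeline's default TOML file.
default_config = {
    "workflow": {"sequencing_type": "genome", "access_level": "standard"},
    "large_cohort": {"n_pcs": 16},
}

# Hypothetical override: only the parameters that change for this run.
override_config = {
    "workflow": {
        "dataset": "bioheart",
        "access_level": "test",
        "last_stages": ["Combiner"],  # assumed way to stop after the Combiner stage
    },
}

config = merge_configs(default_config, override_config)
```

Here `config["workflow"]` ends up with the overridden `access_level`, the new `dataset` and `last_stages` entries, and the inherited `sequencing_type`, while `large_cohort` is left untouched.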

KatalinaBobowik · 2024-01-07T23:25:59Z

onboarding_documentation/technical_documentation/analyst_tutorial/analyst_tutorial.md

+#### Task: Write a config file
+- If we are wanting to run the `large cohort` pipeline on the `bioheart-test` dataset, we will need to create a config file that is capable of doing this.
+- Have a go at writing your own config file capable of running the `large cohort` pipeline on `bioheart-test` up until the `Combiner` stage. 
+- You can use the default config file as a starting point, and then override the necessary parameters to run the pipeline on `bioheart-test`.


> You can use the default config file as a starting point

I personally think the default is very confusing and is missing lots of entries. Plus, most of it is not explained well. Without any background, it would be very hard for someone going through this pipeline to know what to keep and what to omit.

Is there a full list of parameters available?

KatalinaBobowik · 2024-01-07T23:32:51Z

onboarding_documentation/technical_documentation/analyst_tutorial/analyst_tutorial.md

+
+Instructions on setting up a Jupyter Notebook in the cloud can be found [here](https://github.com/populationgenomics/team-docs/blob/main/notebooks.md)
+
+Please continue this tutorial once you have a Jupyter Notebook running in the cloud, good luck!


> good luck!

Again, a task-based tutorial with very little context or supporting help doesn't seem like the best way to help a new hire.

KatalinaBobowik · 2024-01-07T23:36:51Z

Fundamentally, I don't think production-pipelines is a good place to have a tutorial for an analyst. Rather, the tutorial should use a dataset in test. For example, you could run a PCA (or any other analysis) on a test dataset (e.g., bioheart-test or tob-wgs-test), iterating on it in a notebook. Then, once the individual is happy with exploring the data, they could send the script to the analysis runner (using `--access-level test`).
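
To make the suggestion concrete, a notebook-first PCA could look roughly like the sketch below. It is illustrative only: it assumes a MatrixTable with biallelic sites and a `GT` entry field, the 1% allele-frequency cutoff is arbitrary, and `run_pca` is a hypothetical name. The path to the test dataset is deliberately left as a parameter.

```python
def run_pca(mt_path, k=10):
    """Read a MatrixTable, keep common variants, and return PCA eigenvalues and sample scores."""
    import hail as hl  # imported lazily so this module loads without Hail installed

    mt = hl.read_matrix_table(mt_path)

    # Compute per-variant statistics and keep variants with alt allele frequency above 1%.
    mt = hl.variant_qc(mt)
    mt = mt.filter_rows(mt.variant_qc.AF[1] > 0.01)

    # PCA on HWE-normalised genotypes; loadings are not computed by default.
    eigenvalues, scores, _ = hl.hwe_normalized_pca(mt.GT, k=k)
    return eigenvalues, scores
```

In a notebook, `scores` can be inspected interactively with `scores.show()`; once the analysis settles, the same function can be moved into a script and submitted through the analysis runner with `--access-level test`.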

KatalinaBobowik · 2024-01-07T23:42:04Z

For the notebook, I think your two sections (the content taken from the production-pipelines README in the first half, and the PCA example at the bottom) are confusing. Both lack an explanation of what is happening (one of the benefits of having a notebook), context, and references to where the material came from.

… Also improved descriptions of tools and improved clarity

…e clarity and ease of access to newcomers

adding analyst tutorial markdown and jupyter notebook

f2e2de5

michael-harper requested a review from KatalinaBobowik January 5, 2024 02:05

changed from fewgenomes to bioheart in markdown for consistency until…

355f021

… there is an onboarding dataset