JupyterLab notebooks can be extremely useful for exploring data and prototyping code in an interactive way. Running them directly on GCP VMs has two big advantages:
- avoiding any egress costs for downloading data from GCP, and
- avoiding the audit and security challenges caused by additional copies of the data (e.g. on laptops).
The interactivity of notebooks is somewhat at odds with reproducibility and code reviews. To strike a balance, we grant the service accounts that are used to run the notebooks access to only a subset of buckets, namely the `test` and `temporary` buckets for a dataset (see the storage policies for context).
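For example, you can check what's available in a dataset's test bucket straight from a notebook terminal. A minimal sketch, assuming a hypothetical bucket name:

```bash
# List the contents of a dataset's test bucket.
# The bucket name below is hypothetical; substitute your dataset's
# actual test bucket.
gsutil ls gs://cpg-fewgenomes-test/
```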
All notebooks should be created in the `notebooks-314505` GCP project. Click here to create a User-Managed notebook instance. As of writing (July 2022), Managed Notebooks are not yet available in Australia, but once they are, they'll be preferable due to their automatic idle shutdown feature.
Note the following settings in the screenshot below (a command-line equivalent is shown after the list):
- Region: Make sure this is set to an Australian region, to avoid egress costs when accessing locally stored datasets.
- Environment: The "R" environments include both Python and R notebook options. To run Hail, select Custom Container and enter `australia-southeast1-docker.pkg.dev/cpg-common/images/hail-gcp-notebook:0.2.126` (or a later version of the `hail-gcp-notebook` image) in the Docker container image field.
- Machine type: These are standard VM types. Pick the smallest configuration that's not sluggish to work with -- you can see the impact on price in the upper right corner.
- Permission: If you pick "Single user only", only you will be able to access the instance. Otherwise you can share the instance with everybody who has access to the dataset that corresponds to the service account (see below).
- Identity and API access: Make sure to unselect "Use Compute Engine default service account" here. Use a service account of the form `notebook-<dataset>@notebooks-314505.iam.gserviceaccount.com`. Replace `<dataset>` with `fewgenomes`, `tob-wgs`, etc. as required.
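If you prefer the command line over the Cloud Console, the settings above translate roughly to the sketch below. The instance name is a placeholder, and the flags should be double-checked against `gcloud notebooks instances create --help`:

```bash
# Create a user-managed notebook instance running the Hail container image.
# Instance name, machine type, and dataset are placeholders.
gcloud notebooks instances create my-hail-notebook \
    --project=notebooks-314505 \
    --location=australia-southeast1-a \
    --machine-type=n1-standard-4 \
    --container-repository=australia-southeast1-docker.pkg.dev/cpg-common/images/hail-gcp-notebook \
    --container-tag=0.2.126 \
    --service-account=notebook-fewgenomes@notebooks-314505.iam.gserviceaccount.com
```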
As the notebook runs on a VM, the cost of keeping the notebook running is identical to that of keeping a VM running. It's therefore a good idea to stop notebooks when you're not using them and to delete them when they're no longer needed. To start or stop an instance, select the checkbox next to the corresponding notebook instance on the left side of the notebooks overview page, as shown in the screenshot below.
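Stopping and starting also works from the command line; a minimal sketch, reusing the placeholder instance name and zone from above:

```bash
# Stop the instance when you're done (compute billing stops, though disk
# storage still accrues); start it again when you need it.
gcloud notebooks instances stop my-hail-notebook --location=australia-southeast1-a
gcloud notebooks instances start my-hail-notebook --location=australia-southeast1-a
```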
Our Hail notebook image already has this set up, but in case you're using a different image: to be able to access GCS paths (`gs://...`) directly in Hail, you need to install the GCS Connector. To install, run the following command:
curl https://raw.githubusercontent.com/broadinstitute/install-gcs-connector/master/install_gcs_connector.py | python3
Then run the following after installing Hail:
cd $(find_spark_home.py)/jars && curl -O https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop2-2.0.1.jar && cd -
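To sanity-check either installation method, you can confirm that a connector jar ended up in Spark's jars directory (the path follows from the command above):

```bash
# Verify that a GCS connector jar is present on Spark's classpath.
ls "$(find_spark_home.py)"/jars/gcs-connector-*.jar
```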
Instead of the default Jupyter notebook IDE in the browser, you can also use Visual Studio Code on your local machine and connect to the notebook kernel running remotely.
There are a few steps to go through the first time you want to connect to a newly started notebook running on Google Cloud, but it's quick to reconnect afterwards.
The following instructions are a summary of this guide:
- Create a new notebook as explained above.
- Install the Google Cloud Code extension.
- In the status bar at the bottom, switch to the `notebooks-314505` project.
- Select the Cloud Code extension, navigate to Compute Engine, and connect to your notebook VM using SSH.
- On the remote machine, enter `whoami` and note the result, which should look like `jane_doe_population`.
- Run `sudo usermod -aG docker $USER` on the remote machine.
- Disconnect (`exit`) and back on your local machine, run `gcloud --project=notebooks-314505 compute config-ssh`.
- This should have populated your `~/.ssh/config` file. Open this file and find the `Host` associated with your notebook (based on its name). In that section, add the line `User=jane_doe_population`, copying the correct value from the `whoami` step above (see the example entry after this list).
- Back in VS Code, run the Remote-SSH: Connect to Host... command and select your notebook VM. This should open a new window.
- In the new window, run the Dev Containers: Attach to Running Container... command. Select the `/payload-container` entry. This will open up yet another window!
- In that window, attached to the container, install the Python and Jupyter extensions in the container.
- Run the Create: New Jupyter Notebook command.
- In the upper right, select the `python310` kernel.
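For reference, the `~/.ssh/config` entry mentioned in the list above ends up looking roughly like this; the `Host` name is derived by `config-ssh` from your instance name, zone, and project, and everything except the `User` line is generated for you (values shown are placeholders):

```
Host my-hail-notebook.australia-southeast1-a.notebooks-314505
    HostName <external IP of the VM>
    IdentityFile ~/.ssh/google_compute_engine
    # Added manually, using the value from the whoami step:
    User=jane_doe_population
```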