# Data backup {#data_backup}
The purpose of this chapter is to guide you through the available options for backup of data stored on the Ryten server. For backup of data on your personal/work computers, we recommend making use of cloud solutions, such as OneDrive for Business, available to [UCL staff and students](https://www.ucl.ac.uk/isd/services/communicate-collaborate/remote-working/file-storage).
> **Data backup is the responsibility of each Ryten server user**. We do not currently have any automated systems in place, so it is really important that you take the time to ensure you have all the proper backups in place.
## Why back up your data stored on the Ryten server?
There are several good reasons to back up your data stored on the Ryten server, including:
- Like any server, the Ryten server can be affected by server failure (this has not happened to us, \*touch wood\*, but it certainly could).
- There may be occasions when you cannot access the server (e.g. VPN does not permit access).
- Other users may accidentally delete or modify one of your files (yes, this has happened in the past).
In short, a lot of things can happen, which might mean you need to access your files via other means. Thus, it is important to have a backup of the raw files you use, any processed intermediate files, and the scripts you use both for data processing and analysis.
## What options are available to you as a Ryten server user?
Luckily, there are plenty of options available to you, including:
- **Amazon Web Services (AWS)**
- The Ryten lab has an account on AWS where it is possible to store data.
- We primarily use this for unprocessed data (e.g. RNA-sequencing FASTQs), although it can also be useful to store processed files (e.g. BAMs) that collaborators may ask to access.
- [**GitHub**](https://github.com/)
- Git is a version control system (think "Track Changes" in Word), while GitHub is a cloud-based hosting service for Git-based projects (think OneDrive).
- We primarily use this to store functions, scripts and workflows for our analyses.
- Importantly, as a student/researcher you can access a free/discounted version of GitHub Pro, which allows users to have private repositories. Private repositories are useful for ongoing projects that are not ready for publication just yet. Upon publication, these private repositories can be made public.
- If you're unfamiliar with GitHub or `git`, please refer to this handy [guide](https://happygitwithr.com/), also mentioned in Section \@ref(contribute-how).
- [**UCL RDS**](https://www.ucl.ac.uk/isd/services/research-it/research-data-storage-service)
- This is a UCL-based service, which allows UCL staff to store research data while projects are ongoing.
- We primarily use this to store processed intermediate files.
## What are good practices for data backup?
- It is a good idea to ensure that **really important** files that cannot be regenerated (e.g. raw sequencing files), or that would take days/weeks to regenerate, are stored in **two** backup locations.
- Make sure the scripts you need to re-generate analyses/files are on GitHub.
- Ensure that you have an easily accessible record of what you are backing up and where (see the example table after this list).
- Most importantly, make sure you regularly back up to these locations. For example, ongoing projects should be backed up to UCL RDS at least once a month, while scripts/workflows should preferably be committed to GitHub daily/weekly.
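Such a record could be as simple as a short table in a README or lab notebook; the entries below are purely hypothetical:

| Data | Backup location(s) | Backup frequency |
|------|--------------------|------------------|
| RNA-seq FASTQs | AWS (`s3://my-bucket/`) | Once, on receipt |
| Processed intermediate files | UCL RDS project space | Monthly |
| Analysis scripts | GitHub (private repo) | Weekly |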
## AWS
### Login on AWS using UCL SSO
Our AWS account is managed under the ICH organisation. To access our AWS S3 buckets, you first need to log in to the [UCL Cloud Portal](https://ucl-cloud.awsapps.com/start#/) using your UCL credentials. Once you have logged in successfully, you will see an interface similar to the one below.
```{r aws-sso-login, fig.cap = "Login on AWS using UCL SSO", echo = FALSE}
knitr::include_graphics('figures/04-AWS/aws_sso_login.PNG')
```
Click on *"AWS Account --> Rytenlab Storage --> S3 bucket access --> Management console"*, and this should give you access to the list of buckets we have on the AWS S3 service.
If you don't have access, contact **Sonia Garcia Ruiz** to arrange access to our AWS account.
### Accessing AWS using the AWS Command Line Interface
The AWS Command Line Interface (AWS CLI) is an open-source tool that enables you to interact with AWS services from your command-line shell. The AWS CLI is already installed on our server and is very easy to use.
The most commonly used interactions with AWS are described below.
#### Configure the AWS CLI service
<!-- To configure this service, you will need your AWS Key and Secret numbers that are unique to you. To obtain that, you will need to login on [UCL Cloud Portal](https://ucl-cloud.awsapps.com/start#/) using your UCL credentials. Once the login has been successful, click, on *"AWS Account > Rytenlab Storage > S3 bucket access > Command line or programmatic access"*. Scroll down until the **Option 3: Use individual values in your AWS service client** section and copy your AWS Access Key Id and AWS Secret access key. -->
Connect to our Rytenlab server and type `aws configure sso` in the console. This command is interactive, so the service will prompt you for two pieces of information ([more detailed information here](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-sso.html)). Please enter the information as shown below:
```{r aws-sso-configure, fig.cap = "Configure AWS-UCL SSO login", echo = FALSE}
knitr::include_graphics('figures/04-AWS/aws_sso_configure.PNG')
```
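For reference, the prompts look roughly like the sketch below. The exact wording varies between AWS CLI versions, and while the start URL is the UCL Cloud Portal address given above, the SSO region shown is an assumption, so defer to the values in the figure if they differ.
```console
$ aws configure sso
SSO start URL [None]: https://ucl-cloud.awsapps.com/start
SSO Region [None]: eu-west-2
```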
Open the URL returned by the `aws configure sso` command in a browser and enter the code it provides, following the on-screen instructions.
Once the login request has been approved, the command line will ask you to choose the role that you would like to use:
```{r aws-sso-roles, fig.cap = "Choose a role.", echo = FALSE}
knitr::include_graphics('figures/04-AWS/aws_sso_roles.PNG')
```
Once you have chosen your role, the command line will ask for 3 pieces of information. In all cases, AWS will provide default values in square brackets. These can be selected by simply pressing `<ENTER>`.
1. `CLI default client Region [None]`. Please use `eu-west-2`.
2. `CLI default output format [None]`. We recommend `json`.
3. `CLI profile name [ROLENAME-ACCOUNT_ID]`.
- AWS will suggest a profile name in square brackets, which is usually a combination of the role name and account ID.
- If you do not enter an alternative and simply press `<ENTER>`, your profile name will be the one suggested in square brackets.
- The point of the profile name is to allow you to maintain your own configuration, i.e. so you can reference this profile from among all those defined on the server.
- If you specify `default` as the profile name, this profile becomes the one used whenever you run an AWS CLI command and you do not specify a profile name.
If everything has gone as expected, you should now be able to list the set of buckets on AWS. **Remember: to use any AWS CLI command, your profile name must be specified, unless you chose to use the `default` profile.**
```console
$ aws s3 ls --profile your_profile_name
```
#### AWS SSO session duration
Any AWS session will be automatically signed out after 12 hours (this is the maximum session duration; for more information see [here](https://docs.aws.amazon.com/singlesignon/latest/userguide/howtosessionduration.html)). Please bear this in mind when downloading/uploading any files and checking data integrity. If you think your process may take longer than 12 hours, consider breaking up the process by using subfolders.
#### Upload a single file to AWS
Let's suppose we have a bucket on AWS called `my-bucket`, and that you have a file called `myfile.txt` stored locally that you would like to upload to AWS. To upload the file `myfile.txt` to the bucket `my-bucket`:
```console
$ aws s3 cp --profile your_profile_name myfile.txt s3://my-bucket/
```
Unless you have chosen to use the `default` profile, your profile name must be specified.
#### Download a single file from AWS
To download the file `myfile.txt` from the `s3://my-bucket/` AWS bucket into your local folder:
```console
$ aws s3 cp --profile your_profile_name s3://my-bucket/myfile.txt ./my_local_folder
```
Unless you have chosen to use the `default` profile, your profile name must be specified.
#### Upload multiple files to AWS
To upload multiple files, we can use the **sync** command, which syncs objects under a specified local folder to an AWS bucket by uploading them.
```console
$ aws s3 sync --profile your_profile_name my_local_folder/ s3://my-bucket/
```
Unless you have chosen to use the `default` profile, your profile name must be specified.
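If a folder is too large to transfer within the 12-hour session limit mentioned above, you might sync its subfolders one at a time; the subfolder names below are purely illustrative.
```console
$ aws s3 sync --profile your_profile_name my_local_folder/subfolder_01/ s3://my-bucket/subfolder_01/
$ aws s3 sync --profile your_profile_name my_local_folder/subfolder_02/ s3://my-bucket/subfolder_02/
```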
#### Download multiple files from AWS
To download multiple files from an AWS bucket to your local folder, we can also use the **sync** command, simply reversing the order of the source and destination arguments.
***Please be aware of the costs associated with downloading files: $0.090 per GB for the first 10 TB/month of data transfer out beyond the global free tier.***
```console
$ aws s3 sync --profile your_profile_name s3://my-bucket/my_remote_folder/ ./my_local_folder
```
Unless you have chosen to use the `default` profile, your profile name must be specified.
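Before running a large sync in either direction, it can be worth previewing what would be transferred with the `--dryrun` flag, which lists the operations without actually copying anything:
```console
$ aws s3 sync --dryrun --profile your_profile_name s3://my-bucket/my_remote_folder/ ./my_local_folder
```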
### Checking AWS file integrity
Continuing with the example from the previous section, let's check that the contents of the local folder `./my_local_folder` match those of the remote AWS folder `s3://my-bucket/my_local_folder/`.
**It is important that you use the `default` profile when running the integrity check; otherwise, AWS will expect your profile name to be specified with `--profile`, which the `aws-s3-integrity-check` repo is not currently configured for.**
1. First, clone the `aws-s3-integrity-check` repo.
```console
$ git clone https://github.com/SoniaRuiz/aws-s3-integrity-check.git
```
2. Clone the `s3md5` repo.
```console
$ git clone https://github.com/antespi/s3md5.git
```
3. Move the `s3md5` folder into the `aws-s3-integrity-check` folder:
```console
$ mv ./s3md5 ./aws-s3-integrity-check
```
4. Next, grant execute permission on the `s3md5` script file.
```console
$ chmod 755 ./aws-s3-integrity-check/s3md5/s3md5
```
5. The `aws-s3-integrity-check` folder should look similar to the following:
```console
total 16
-rw-r--r-- 1 your_user your_group 3573 date README.md
-rwxr-xr-x 1 your_user your_group 3301 date aws_check_integrity.sh
drwxr-xr-x 2 your_user your_group 4096 date s3md5
```
6. Execute the script `aws_check_integrity.sh` following this structure: **`aws_check_integrity.sh local_path bucket_name bucket_folder`**
```console
$ bash ./aws-s3-integrity-check/aws_check_integrity.sh ./my_local_folder my-bucket my_local_folder/
```
### Creating a new AWS bucket
To create a new AWS bucket, we recommend using the following configuration (an illustrative AWS CLI sketch applying the same settings is shown after the list):
1. Region: EU London (`eu-west-2`)
2. Block all public access: enabled
3. Bucket Versioning: enabled
4. Tags:
    1. Key = "data-owner" / Value = "your name"
    2. Key = "data-origin" / Value = "the origin of the data in one word (e.g. the name of a project or server)"
5. Default encryption: enabled - Amazon S3 key (SSE-S3)
6. Advanced settings
    1. Object Lock: enable/disable, depending on your type of data (more info [here](https://docs.aws.amazon.com/AmazonS3/latest/dev/object-lock.html))
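If you prefer the command line, the settings above can also be applied with the AWS CLI. The sketch below is illustrative rather than definitive: the bucket name and tag values are placeholders, and it assumes you have a configured profile as described earlier.
```console
# Create the bucket in the EU London region
$ aws s3api create-bucket --profile your_profile_name --bucket my-new-bucket \
    --region eu-west-2 --create-bucket-configuration LocationConstraint=eu-west-2
# Block all public access
$ aws s3api put-public-access-block --profile your_profile_name --bucket my-new-bucket \
    --public-access-block-configuration BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true
# Enable bucket versioning
$ aws s3api put-bucket-versioning --profile your_profile_name --bucket my-new-bucket \
    --versioning-configuration Status=Enabled
# Add the data-owner and data-origin tags
$ aws s3api put-bucket-tagging --profile your_profile_name --bucket my-new-bucket \
    --tagging 'TagSet=[{Key=data-owner,Value=your-name},{Key=data-origin,Value=project-name}]'
# Enable default encryption with an Amazon S3 key (SSE-S3)
$ aws s3api put-bucket-encryption --profile your_profile_name --bucket my-new-bucket \
    --server-side-encryption-configuration '{"Rules":[{"ApplyServerSideEncryptionByDefault":{"SSEAlgorithm":"AES256"}}]}'
```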
### Further info
Below are some extra useful links, also generated by Sonia. These links are hosted on the Ryten lab [Sharepoint](https://liveuclac.sharepoint.com/sites/rytenlab/Shared%20Documents/), which you can access provided you are a member.
- [Accessing AWS using the AWS Command Line Interface](https://liveuclac.sharepoint.com/:w:/r/sites/rytenlab/Shared%20Documents/Server%20and%20cloud%20computing/Amazon/AWS_CONFIGURATION.docx?d=w6195a43440b04e8cbff82e369e73befa&csf=1&web=1&e=vrFK5Q)
- [Checking AWS file integrity](https://liveuclac.sharepoint.com/:w:/r/sites/rytenlab/Shared%20Documents/Server%20and%20cloud%20computing/Amazon/AWS_ETag_MD5.docx?d=we739c02ec69947b6b4044e60204cc90b&csf=1&web=1&e=k2StC1)
- [Information about existing AWS buckets](https://liveuclac.sharepoint.com/:w:/r/sites/rytenlab/Shared%20Documents/Server%20and%20cloud%20computing/Amazon/AWS%20buckets%20info.docx?d=w3f931f44bea54aef98c08eb359fef169&csf=1&web=1&e=w8GNya)
## GitHub
A good explanation of how to get started with `git` using RStudio can be found in the book "Happy Git with R", linked [here](#beginner-r).
TO BE ADDED (MAYBE A SEPARATE PAGE WITH GITHUB TUTORIAL?). Link worth adding: <https://www.makeareadme.com/>
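In the meantime, a minimal sketch of a day-to-day workflow for backing up scripts to GitHub is shown below; the file name and commit message are placeholders, and your default branch may be called `main` or `master` depending on the repository.
```console
$ git add my_analysis_script.R
$ git commit -m "Add analysis script"
$ git push origin master
```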
## UCL RDS
### Registering a project
- Prior to using UCL RDS, you need to ensure that you have a project space allocated to you. This can be done [here](https://www.ucl.ac.uk/isd/services/research-it/research-data-storage-service).
- If you are not UCL staff, you will not be able to register your own project. If that is the case, reach out to one of us who is, and we can register a project on your behalf and add you as a project member.
### Accessing your project space
- When the project is generated, [researchdata-support\@ucl.ac.uk](mailto:researchdata-support@ucl.ac.uk){.email} will send you an e-mail with important details about the location of your storage area (an example is: `/mnt/gpfs/live/rd01__/path_to_storage_area/`)
- From the Ryten server, this project space can be accessed using the following command, substituting your UCL username for `ucl_username`.
```{bash ssh-ucl-rds, eval = FALSE}
ssh ucl_username@live.rd.ucl.ac.uk
```
- You will be prompted for a password, and upon entering it you will be redirected to your home space on UCL RDS. You can then move to your project area using the following command, substituting in the path to your storage area.
```{bash cd-storage-area, eval = FALSE}
cd /mnt/gpfs/live/path_to_storage_area/
```
### Using `rsync` to backup to UCL RDS
- `rsync` (remote sync), as the name suggests, permits file transfer across systems (e.g. from the Ryten server to UCL RDS).
- While the initial backup will take a while, `rsync` is a super handy tool for regular backups, as it will only transfer files that have been changed after the initial backup.
- Furthermore, as long as you do not enable the `--delete` flag, any files that have been deleted on the local server (Ryten server) are not deleted on the remote server (UCL RDS) when running `rsync` again.
#### Simple use case
Google is a great resource, so feel free to read up a bit more on `rsync`; you can also run the command `rsync --help` to see what options there are.
Below is a simple use case, with the various flags explained in the `#` comments.
```{bash rsync-simple, eval = FALSE}
# -a archive mode (preserves permissions, modification times, symlinks etc.)
# -r recursive
# -v verbose
# -h human readable figures
# -e select the protocol - ssh used here
# --progress display progress bar
# --relative preserves directory structure in the local backup location from /./ onwards
# Example
rsync -arvhe ssh --progress --relative /path_to_files/ ucl_username@live.rd.ucl.ac.uk:/mnt/gpfs/live/path_to_storage_area/
```
- Notice that the "source" location (`/path_to_files/`) is written first followed by the "target" location (`ucl_username@live.rd.ucl.ac.uk:/mnt/gpfs/live/path_to_storage_area/`).
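Before committing to a long transfer, it can be useful to preview what `rsync` would do using the `--dry-run` flag:
```{bash rsync-dry-run, eval = FALSE}
# --dry-run lists the files that would be transferred without actually copying anything
rsync -arvhe ssh --progress --relative --dry-run /path_to_files/ ucl_username@live.rd.ucl.ac.uk:/mnt/gpfs/live/path_to_storage_area/
```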
#### Backing up multiple files across multiple UCL RDS spaces
We all have multiple ongoing projects, which cannot all be stored in the same UCL RDS project space (these are typically restricted to 5 TB). Thus, there will come a time when you have to back up to multiple locations. With this in mind, it can be helpful to write a script to automate (or at least partially automate) some of this. Below is a bare-bones guide of how you might do this.
As you have probably noticed if you have been using `rsync` to transfer files from the Ryten server to UCL RDS, every time you use this command you are prompted for a password. Thus, partial automation requires that you:
1. Set up an SSH key pair, which allows automated user authentication.
2. Write a script, which you can add to every time you have another location to back up from/to.
##### Step 1: Set up SSH keys for authentication {#ssh-agent}
I used this [guide](https://upcloud.com/community/tutorials/use-ssh-keys-authentication/) to help me set this up.
**Preparing the remote server (e.g. UCL RDS)**
> Do this in your home directory on UCL RDS e.g. `/mnt/gpfs/home/ucl_username/`
```{bash generate-ssh-folder, eval = F}
# Create hidden folder in home directory
mkdir -p ~/.ssh
# Restrict permissions
chmod 700 ~/.ssh
```
**Generating the key pair**
> This is done in your home directory on the Ryten server, where files will be transferred from, e.g. `/home/rreynolds/`.
1. To generate the key pair, enter the command below. Don't enter a passphrase when prompted if you would like to set up fully password-less login.
```{bash generate-key-pair, eval = F}
# Command to generate key pair
ssh-keygen -t rsa
# Output will look like this
Generating public/private rsa key pair.
Enter file in which to save the key (/home/rreynolds/.ssh/id_rsa): /home/rreynolds/.ssh/rds_rsa # press enter, or name accordingly as was done here
Created directory '/home/rreynolds/.ssh'.
Enter passphrase (empty for no passphrase): # press enter
Enter same passphrase again: # press enter
Your identification has been saved in /home/rreynolds/.ssh/rds_rsa.
Your public key has been saved in /home/rreynolds/.ssh/rds_rsa.pub.
The key fingerprint is:
SHA256:EXAMPLE rreynolds@ion-dmn-hpc3
The key's randomart image is:
+---[RSA 2048]----+
| .. |
| EXAMPLE|
| EXAMPLE|
| EXAMPLE |
| EXAMPLE |
| EXAMPLE |
| EXAMPLE |
| EXAMPLE |
| EXAMPLE |
+----[SHA256]-----+
```
2. If no alternative file name was entered, the command will generate two files: `~/.ssh/id_rsa` (private key) and `~/.ssh/id_rsa.pub` (public key).
3. Copy the public half of the key pair to your cloud server using the following command. Replace the user and server with your username and the address of the server you wish to use key authentication on. **This assumes you saved the key pair using the default file name and location. If not, just replace the key path `~/.ssh/id_rsa.pub` with your own key name (as done below).**
```{bash cp-ssh-key, eval = F}
# Command
ssh-copy-id -i ~/.ssh/rds_rsa.pub ucl_username@live.rd.ucl.ac.uk
```
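At this point, it can be worth checking that key-based authentication works; assuming the key name used above, the following should log you in without a password prompt:
```{bash test-ssh-key, eval = F}
# Log in using the private key; no password should be requested if the setup worked
ssh -i ~/.ssh/rds_rsa ucl_username@live.rd.ucl.ac.uk
```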
4. Set up SSH Agent to store the keys, to avoid having to re-enter the passphrase at every login. **If you close the Ryten server terminal, you will have to repeat the command below when you open up a new terminal.**
```{bash ssh-agent, eval = F}
# Start a new shell under ssh-agent
ssh-agent $BASH
# Add the private key to the agent
ssh-add ~/.ssh/rds_rsa
```
##### Step 2: Write a script for *your* data backup
> The author of this script, Regina, is hoping to make this a more generalised script at some point, but for now, feel free to copy and modify the skeleton below.
Key arguments to modify include:
- `rserver`: change to your UCL username.
- `lbackuploc_*`: change to your local backup locations. *Note: because I am using the `--relative` flag with `rsync`, I add a `/./` to indicate from what point I want `rsync` to copy the directory structure on the Ryten server.*
- `rbackuploc_*`: change to your remote backup locations.
- Main body of backup text to suit your purposes.
```{bash backup-script, eval = F}
#!/bin/bash
# Remote RDS server
rserver=ucl_username@live.rd.ucl.ac.uk
# Local backup location
lbackuploc_data=/data/
lbackuploc_user=/home/./ryten_username/
# Remote RDS backup locations
rbackuploc_GWAS=/mnt/gpfs/live/rd01__/path_to_GWAS_backup/
rbackuploc_PDseq=/mnt/gpfs/live/path_to_PD_backup/
#---Backups to $rbackuploc_GWAS---------------
echo "Starting backup to $rserver:$rbackuploc_GWAS"
rsync -arvhe ssh --progress --relative $lbackuploc_user/projects/LDSC_Regression/ $rserver:$rbackuploc_GWAS
rsync -arvhe ssh --progress --relative $lbackuploc_user/data/MAGMA/ $rserver:$rbackuploc_GWAS
rsync -arvhe ssh --progress --relative $lbackuploc_user/misc_projects $rserver:$rbackuploc_GWAS
rsync -arvhe ssh --progress --relative $lbackuploc_data/LDScore/ $rserver:$rbackuploc_GWAS
rsync -arvhe ssh --progress --relative $lbackuploc_data/Alasoo2018_MacrophageQTLs/ $rserver:$rbackuploc_GWAS
rsync -arvhe ssh --progress --relative $lbackuploc_data/Fairfax2014/ $rserver:$rbackuploc_GWAS
rsync -arvhe ssh --progress --relative $lbackuploc_data/eQTLGen/ $rserver:$rbackuploc_GWAS
rsync -arvhe ssh --progress --relative $lbackuploc_data/GTEx_eQTLs/ $rserver:$rbackuploc_GWAS
rsync -arvhe ssh --progress --relative $lbackuploc_data/psychencode/ $rserver:$rbackuploc_GWAS
#---Backups to $rbackuploc_PDseq--------------
echo "Starting backup to $rserver:$rbackuploc_PDseq"
rsync -arvhe ssh --progress --relative $lbackuploc_data/RNAseq_PD/ $rserver:$rbackuploc_PDseq
rsync -arvhe ssh --progress --relative $lbackuploc_user/projects/Aim2_PDsequencing_wd/ $rserver:$rbackuploc_PDseq
rsync -arvhe ssh --progress --relative $lbackuploc_data/recount/SRP058181/ $rserver:$rbackuploc_PDseq
echo "Backup finished!"
```
##### Step 3: Running your script
After creating the script, remember to make it executable. Substitute `RDS_backup.sh` with the name of your script.
```{bash chmod-script, eval = F}
# Command
chmod +x RDS_backup.sh
```
The script can then be run with the following command. It is a good idea to keep a log of your backups, to check whether any errors were returned during the process. For this reason, the output of the backup script is redirected into a new log file named `20201102_RDSbackup.log`, where the name includes the date of transfer.
```{bash run-script, eval = F}
# Run script and save log file
bash ./path_to_script/RDS_backup.sh > ./path_to_log_directory/20201102_RDSbackup.log
```
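To avoid editing the log file name by hand each time, you can also let the shell insert today's date for you:
```{bash run-script-dated, eval = F}
# $(date +%Y%m%d) expands to today's date, e.g. 20201102, so each run gets its own log
bash ./path_to_script/RDS_backup.sh > ./path_to_log_directory/$(date +%Y%m%d)_RDSbackup.log
```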
Finally, remember that prior to running your backup script you will have to set up SSH Agent to store the keys (as mentioned in Section \@ref(ssh-agent), step 4), if you have not already done so.