
Setup CI/CD Pipeline for MSStats #143

Open · wants to merge 34 commits into base: devel

Conversation

@anshuman-raina (Contributor) commented Nov 3, 2024

User description

Motivation and Context

Part of Dataset Benchmarking Task

Changes


  • Added a Slurm file for job setup
  • Added a GitHub Actions workflow
  • Added a benchmarking script

Testing

Please describe any unit tests you added or modified to verify your changes.

Checklist Before Requesting a Review

  • I have read the MSstats contributing guidelines
  • My changes generate no new warnings
  • Any dependent changes have been merged and published in downstream modules
  • I have run the devtools::document() command after my changes and committed the added files

PR Type

enhancement, tests


Description

  • Added a GitHub Actions workflow to automate running R scripts on HPC using Slurm, including SSH setup, file transfer, job submission, and output retrieval.
  • Developed a comprehensive R benchmarking script leveraging MSstats and MSstatsConvert for data processing and analysis, with parallel execution for improved performance.
  • Created a Slurm configuration script to define job parameters and execute the benchmarking script on HPC.

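Taken together, the steps above amount to a submit-and-poll loop over SSH. A minimal sketch of that flow is below; `remote()` is a local stub standing in for the real `ssh user@host "..."` calls so the sketch runs anywhere, and the sbatch output line and job ID are illustrative, not taken from this PR:

```shell
# Minimal sketch of the submit-and-poll flow the workflow automates.
# remote() is a local stub for `ssh user@host "..."`; the job ID is made up.
set -eu

remote() {
  case "$1" in
    submit) echo "Submitted batch job 48213" ;;   # sbatch's stdout format
    queue)  echo "" ;;                            # empty squeue: job is done
  esac
}

# Submit the Slurm job and parse the numeric job ID from sbatch's output
job_id=$(remote submit | awk '/^Submitted batch job/ {print $4}')

# Poll until squeue no longer lists the job
while [ -n "$(remote queue)" ]; do
  sleep 30
done

echo "job $job_id finished"
```

In the real workflow each `remote` call would be an `ssh` invocation against the cluster login node, followed by an `scp` to fetch the output files.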
Changes walkthrough 📝

Enhancement
.github/workflows/slurm_poc.yml — Add GitHub Actions workflow for HPC integration using Slurm
  • Added a GitHub Actions workflow to run R scripts on HPC via Slurm.
  • Configured SSH access and file transfer to HPC.
  • Implemented job submission and monitoring using Slurm.
  • Set up artifact upload for job output.
  • +48/-0

Tests
benchmark/benchmark.R — Add benchmarking script with data processing and analysis
  • Introduced a benchmarking script using MSstats and MSstatsConvert.
  • Implemented data processing and result calculation functions.
  • Utilized parallel processing for efficiency.
  • Calculated and displayed performance metrics.
  • +178/-0

Configuration changes
benchmark/config.slurm — Create Slurm configuration script for benchmarking job
  • Created a Slurm configuration script for job submission.
  • Defined job parameters such as name, output, and resources.
  • Configured the execution of the R benchmarking script.
  • +13/-0
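Given the walkthrough above (job name, output files, resources, and execution of the R script), a `config.slurm` of roughly this shape would fit; the resource directives mirror the snippet quoted later in this thread, while the job name, output/error file names, and the `module load R` line are assumptions, not the file's actual contents:

```shell
#!/bin/bash
# Hypothetical sketch of benchmark/config.slurm; hedged reconstruction only.
#SBATCH --job-name=msstats-benchmark   # job name (assumed)
#SBATCH --output=job_output.txt        # stdout file the workflow fetches
#SBATCH --error=job_error.txt          # stderr file the workflow fetches
#SBATCH --time=01:00:00                # maximum run time
#SBATCH --ntasks=1                     # one process
#SBATCH --cpus-per-task=8              # CPU cores for parallel processing

module load R                          # assumed module name on the cluster
Rscript benchmark.R                    # run the benchmarking script
```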

💡 PR-Agent usage: Comment /help "your question" on any pull request to receive relevant information

github-actions bot commented Nov 3, 2024

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Hardcoded Values
The workflow contains hardcoded server addresses and usernames, which could lead to security risks and a lack of flexibility. Consider using environment variables or GitHub secrets to manage these values dynamically.

Error Handling
The script lacks error handling throughout the data processing and analysis phases. Adding tryCatch blocks or condition checks can help manage exceptions and ensure the robustness of the script.

Performance Concern
The script uses multiple instances of paste and str_split within loops, which can be optimized. Consider pre-computing these operations outside of loops or using vectorized operations to improve performance.

github-actions bot commented Nov 3, 2024

PR Code Suggestions ✨

Explore these optional code suggestions:

Security
Improve security by setting SSH key permissions before writing the key

Ensure that the SSH key permissions are set before writing the key, to prevent potential race conditions where the key file could be accessed with incorrect permissions.

.github/workflows/slurm_poc.yml [18-20]

 mkdir -p ~/.ssh
+touch ~/.ssh/id_rsa
+chmod 600 ~/.ssh/id_rsa
 echo "${{ secrets.SSH_PRIVATE_KEY }}" > ~/.ssh/id_rsa
-chmod 600 ~/.ssh/id_rsa

Suggestion importance [1-10]: 9

Why: This suggestion addresses a potential security issue by ensuring that the SSH key file has the correct permissions before writing the key, reducing the risk of unauthorized access. It is a critical improvement for maintaining security in the workflow.
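An equivalent way to close the same race, not from this PR but a common pattern: set a restrictive umask so the redirection creates the key file with 600 permissions in the first place. The sketch below uses a temp directory and dummy key material so it runs anywhere (it assumes GNU `stat`, as on the Ubuntu runners GitHub Actions provides):

```shell
# Sketch: with umask 077, the redirection itself creates the file as mode 600,
# so there is never a window where the key exists with loose permissions.
# The path and key material are dummies for illustration.
set -eu
keydir=$(mktemp -d)

(
  umask 077
  printf '%s\n' "dummy-key-material" > "$keydir/id_rsa"
)

perms=$(stat -c '%a' "$keydir/id_rsa")
echo "$perms"
```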
Possible bug
Add error handling to SSH and SCP commands to ensure workflow reliability

Add error handling for SSH and SCP commands to ensure that the workflow fails gracefully if these commands do not execute successfully.

.github/workflows/slurm_poc.yml [21-29]

-ssh-keyscan -H login-00.discovery.neu.edu >> ~/.ssh/known_hosts
-scp benchmark.R config.slurm raina.ans@login-00.discovery.neu.edu:/home/raina.ans/R
-ssh raina.ans@login-00.discovery.neu.edu "cd R && sbatch config.slurm"
+ssh-keyscan -H login-00.discovery.neu.edu >> ~/.ssh/known_hosts || exit 1
+scp benchmark.R config.slurm raina.ans@login-00.discovery.neu.edu:/home/raina.ans/R || exit 1
+ssh raina.ans@login-00.discovery.neu.edu "cd R && sbatch config.slurm" || exit 1

Suggestion importance [1-10]: 8

Why: Adding error handling to SSH and SCP commands enhances the robustness of the workflow by ensuring that failures in these commands are detected and handled gracefully, preventing potential issues in subsequent steps.
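A variant of the same fix, not from the PR: putting `set -e` at the top of the step's script makes the shell abort on the first failing command, covering every ssh/scp call without appending `|| exit 1` to each line. A local demonstration with a deliberately failing command:

```shell
# Run a small script under `set -e`: execution stops at the first failure,
# so the final echo inside it never runs and the script exits non-zero.
script='set -e; false; echo "never printed"'

if out=$(sh -c "$script"); then
  status=0
else
  status=$?
fi

echo "exit status: $status"
echo "captured output: <$out>"
```

In a GitHub Actions `run:` step this is spelled `set -e` (or `set -euo pipefail`) as the first line of the script block.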
Enhancement
Update the function used for group comparison to align with the latest MSstats package

Replace the deprecated groupComparison function with the updated groupComparisonTMT or groupComparisonLabelFree, depending on the experimental design, to ensure compatibility with the latest MSstats package.

benchmark/benchmark.R [10]

-model = groupComparison("pairwise", summarized)
+model = groupComparisonLabelFree("pairwise", summarized)

Suggestion importance [1-10]: 7

Why: Updating the function to align with the latest MSstats package ensures compatibility and future-proofing of the code. This enhancement is beneficial for maintaining the code's functionality with updated libraries.
Maintainability
Use descriptive variable names for clarity and maintainability

Consider using more descriptive variable names than TP, FP, TN, FN for better code readability and maintainability.

benchmark/benchmark.R [75-84]

-TP <- comparisonResult %>% filter(grepl(paste(proteins, collapse = "|"), Protein) & adj.pvalue < 0.05) %>% nrow()
-FP <- comparisonResult %>% filter(!grepl(paste(proteins, collapse = "|"), Protein) & adj.pvalue < 0.05) %>% nrow()
-TN <- comparisonResult %>% filter(!grepl(paste(proteins, collapse = "|"), Protein) & adj.pvalue >= 0.05) %>% nrow()
-FN <- comparisonResult %>% filter(grepl(paste(proteins, collapse = "|"), Protein) & adj.pvalue >= 0.05) %>% nrow()
+true_positives <- comparisonResult %>% filter(grepl(paste(proteins, collapse = "|"), Protein) & adj.pvalue < 0.05) %>% nrow()
+false_positives <- comparisonResult %>% filter(!grepl(paste(proteins, collapse = "|"), Protein) & adj.pvalue < 0.05) %>% nrow()
+true_negatives <- comparisonResult %>% filter(!grepl(paste(proteins, collapse = "|"), Protein) & adj.pvalue >= 0.05) %>% nrow()
+false_negatives <- comparisonResult %>% filter(grepl(paste(proteins, collapse = "|"), Protein) & adj.pvalue >= 0.05) %>% nrow()

Suggestion importance [1-10]: 5

Why: While more descriptive variable names can improve code readability and maintainability, the current names (TP, FP, TN, FN) are standard in statistical contexts. The suggestion is valid but offers only a minor improvement.

@tonywu1999 tonywu1999 self-requested a review November 5, 2024 20:54

Comment on lines +18 to +20

mkdir -p ~/.ssh
echo "${{ secrets.SSH_PRIVATE_KEY }}" > ~/.ssh/id_rsa
chmod 600 ~/.ssh/id_rsa
I would follow ChatGPT's suggestion here.

#SBATCH --time=01:00:00 # Set the maximum run time
#SBATCH --ntasks=1 # Number of tasks (one process)
#SBATCH --cpus-per-task=8 # Use 8 CPU cores for the task
#SBATCH --mem=128G # Request 256GB of memory
Comment says 256, but says 128 here.

mkdir -p ~/.ssh
echo "${{ secrets.SSH_PRIVATE_KEY }}" > ~/.ssh/id_rsa
chmod 600 ~/.ssh/id_rsa
ssh-keyscan -H login-00.discovery.neu.edu >> ~/.ssh/known_hosts
@tonywu1999 tonywu1999 commented Nov 6, 2024

I think we should follow ChatGPT's suggestion here (the error handling one).


- name: Fetch Output
  run: |
    scp raina.ans@login-00.discovery.neu.edu:/home/raina.ans/R/job_output.txt job_output.txt
There seems to be a lot of dependency on using your /home/raina.ans folder. Could we instead use a folder in the /work/VitekLab directory? I think there's already a benchmarking folder in there.

Comment on lines +34 to +37

# boxplot(human_comparisonResult$log2FC,
#         main = "Boxplot of log2FC for Human",
#         ylab = "log2FC",
#         col = "lightblue")
If you're not using the code, could you remove the comments?

Comment on lines +122 to +128

fragpipe_raw$Condition = unlist(lapply(fragpipe_raw$Run, function(x){
  paste(str_split(x, "\\_")[[1]][4:5], collapse="_")
}))

fragpipe_raw$BioReplicate = unlist(lapply(fragpipe_raw$Run, function(x){
  paste(str_split(x, "\\_")[[1]][4:7], collapse="_")
}))
I thought the Fragpipe files already have BioReplicate and Condition information.

Could you use the dataset stored in the /work/VitekLab/Data/MS/Benchmarking directory? It should have the name MSstats.csv.

Comment on lines +145 to +151

  label = "Data process without Normalization",
  result = function() dataProcess(msstats_format, normalization = "FALSE", n_top_feature = 20)
),
list(
  label = "Data process without Normalization with MBImpute False",
  result = function() dataProcess(msstats_format, normalization = "FALSE", n_top_feature = 20, MBimpute = FALSE)
)
Does the n_top_feature parameter need to be initialized to anything here? I thought it's only needed if featureSubset = "topN".

- name: Fetch Output
  run: |
    scp raina.ans@login-00.discovery.neu.edu:/home/raina.ans/R/job_output.txt job_output.txt
    scp raina.ans@login-00.discovery.neu.edu:/home/raina.ans/R/job_error.txt job_error.txt
I'm wondering if it makes sense to use login info for someone like Olga. I'm not sure if she has an OOD account though. Or maybe use my login.

How difficult would it be to use someone else's login? What would we need to adjust?
