Setup CI/CD Pipeline for MSStats #143

Open · wants to merge 34 commits into base: devel

Changes from all commits · 34 commits
6874be2  Setup CI/CD Pipeline for MSStats (Nov 3, 2024)
7eb8b32  Added SSH private key (Nov 3, 2024)
5411a18  Change #3 (Nov 3, 2024)
d01ece8  Added changes #4 (Nov 3, 2024)
e1d5963  Added changes #5 (Nov 3, 2024)
caa633d  Changes #6 (Nov 3, 2024)
c146ae8  Added changes #7 (Nov 3, 2024)
508f32f  Added changes #8 (Nov 3, 2024)
b9d4568  Added changes #8 (Nov 3, 2024)
bc1bffc  Added changes for cmake issues (Nov 3, 2024)
b9cef8e  Added changes #9 (Nov 3, 2024)
ae465d1  Added changes (Nov 3, 2024)
811089f  Added changes (Nov 3, 2024)
6725f98  Added changes #10 (Nov 3, 2024)
0accfc7  Added changes #11 (Nov 3, 2024)
e19933b  Added changes #12 (Nov 3, 2024)
5a86ee6  Changes for script run added - monitoring (Nov 3, 2024)
a61a9ad  Added changes with diff slurm config (Nov 4, 2024)
751efa5  Changes for slurm spec (Nov 4, 2024)
e84ac20  Changes for slurm spec #2 (Nov 4, 2024)
b25cb2e  Changes for slurm spec #3 (Nov 4, 2024)
c5ae18c  Changes for slurm spec #4 (Nov 4, 2024)
2a0539d  Added changes for triggering pipeline (Nov 5, 2024)
58386d4  Added changes for slurm job #5 (Nov 5, 2024)
62235e2  Added changes for R version change (Nov 5, 2024)
4f0737e  Added changes (Nov 5, 2024)
4f00a86  Added changes for slurm job (Nov 5, 2024)
8e57590  Added changes for R script (Nov 5, 2024)
d1ddb0f  Added changes for slurm job #6 (Nov 5, 2024)
f7b835d  Added changes for job name (Nov 5, 2024)
0f42743  Added changes for getting id of job to be monitored (Nov 5, 2024)
b2f427e  Typo in file name (Nov 5, 2024)
0e35bc9  Added changes (Nov 5, 2024)
8237878  Added changes for R version explicit definition in slurm config (Nov 5, 2024)
56 changes: 56 additions & 0 deletions .github/workflows/benchmark.yml
@@ -0,0 +1,56 @@
name: Run Simple R Script on HPC via Slurm

on:
  push:
    branches:
      - feature/ci-cd-pipeline

jobs:
  test-hpc:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout Repository
        uses: actions/checkout@v3

      - name: Set Up SSH Access
        run: |
          mkdir -p ~/.ssh
          echo "${{ secrets.SSH_PRIVATE_KEY }}" > ~/.ssh/id_rsa
          chmod 600 ~/.ssh/id_rsa
Comment on lines +18 to +20
Contributor:
I would follow ChatGPT's suggestion here.

          ssh-keyscan -H login-00.discovery.neu.edu >> ~/.ssh/known_hosts
Contributor @tonywu1999 (Nov 6, 2024):
I think we should follow ChatGPT's suggestion here (the error handling one).
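A minimal sketch of the error handling both comments point at (hypothetical; not the exact snippet from the review):

```yaml
      - name: Set Up SSH Access
        run: |
          mkdir -p ~/.ssh
          echo "${{ secrets.SSH_PRIVATE_KEY }}" > ~/.ssh/id_rsa
          chmod 600 ~/.ssh/id_rsa
          # Hypothetical hardening: fail the job if the host-key scan does not succeed
          if ! ssh-keyscan -H login-00.discovery.neu.edu >> ~/.ssh/known_hosts; then
            echo "ssh-keyscan failed; cannot verify the HPC host key" >&2
            exit 1
          fi
```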


      - name: Transfer Files to HPC
        run: |
          scp benchmark/benchmark.R benchmark/config.slurm raina.ans@login-00.discovery.neu.edu:/home/raina.ans/R

      - name: Submit Slurm Job and Capture Job ID
        id: submit_job
        run: |
          ssh raina.ans@login-00.discovery.neu.edu "cd R && sbatch config.slurm" | tee slurm_job_id.txt
          slurm_job_id=$(grep -oP '\d+' slurm_job_id.txt)
          echo "Slurm Job ID is $slurm_job_id"
          echo "slurm_job_id=$slurm_job_id" >> $GITHUB_ENV
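A possible simplification (an assumption, not part of the PR): Slurm's `sbatch --parsable` prints only the job ID, so the `tee`/`grep` round-trip could be dropped.

```yaml
      # Hypothetical alternative: --parsable makes sbatch print just "jobid[;cluster]"
      - name: Submit Slurm Job and Capture Job ID
        run: |
          slurm_job_id=$(ssh raina.ans@login-00.discovery.neu.edu "cd R && sbatch --parsable config.slurm")
          echo "slurm_job_id=$slurm_job_id" >> $GITHUB_ENV
```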

      - name: Monitor Slurm Job
        run: |
          ssh raina.ans@login-00.discovery.neu.edu "
            while squeue -j ${{ env.slurm_job_id }} | grep -q ${{ env.slurm_job_id }}; do
              echo 'Job Id : ${{ env.slurm_job_id }} is still running...'
              sleep 10
            done
            echo 'Job has completed.'
          "

      - name: Fetch Output
        run: |
          scp raina.ans@login-00.discovery.neu.edu:/home/raina.ans/R/job_output.txt job_output.txt
Contributor:
There seems to be a lot of dependency on using your /home/raina.ans folder. Could we instead use a folder in the /work/VitekLab directory? I think there's already a benchmarking folder in there.
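If the workflow does move to the shared area, one hypothetical way to keep the path in a single place (folder name assumed, unverified):

```yaml
env:
  # Assumed shared location per the review comment; confirm the exact folder on the cluster
  HPC_DIR: /work/VitekLab/benchmarking

# ...each step would then refer to ${{ env.HPC_DIR }} instead of /home/raina.ans/R, e.g.:
#   scp raina.ans@login-00.discovery.neu.edu:${{ env.HPC_DIR }}/job_output.txt job_output.txt
```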

          scp raina.ans@login-00.discovery.neu.edu:/home/raina.ans/R/job_error.txt job_error.txt
Contributor:
I'm wondering if it makes sense to use login info for someone like Olga. I'm not sure if she has an OOD account, though. Or maybe use my login.

Contributor:
How difficult would it be to use someone else's login? What would we need to adjust?
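A hypothetical adjustment (secret name assumed): store the account name in a repository secret next to the key, so switching users only means updating secrets, not editing the workflow.

```yaml
      - name: Fetch Output
        run: |
          # Hypothetical: HPC_USER is a new repository secret holding the account name;
          # remote paths would move with it (or to the shared /work area discussed above)
          scp "${{ secrets.HPC_USER }}@login-00.discovery.neu.edu:~/R/job_output.txt" job_output.txt
```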


      - name: Upload Output as Artifact
        uses: actions/upload-artifact@v4
        with:
          name: benchmark-output
          path: |
            job_output.txt
            job_error.txt
179 changes: 179 additions & 0 deletions benchmark/benchmark.R
@@ -0,0 +1,179 @@
library(MSstatsConvert)
library(MSstats)
library(ggplot2)
library(dplyr)
library(stringr)
library(parallel)


calculateResult <- function(summarized, label) {

    model <- groupComparison("pairwise", summarized)
    comparisonResult <- model$ComparisonResult

    human_comparisonResult <- comparisonResult %>% filter(grepl("_HUMAN$", Protein))
    ecoli_comparisonResult <- comparisonResult %>% filter(grepl("_ECOLI$", Protein))
    yeast_comparisonResult <- comparisonResult %>% filter(grepl("_YEAST$", Protein))

    human_median <- median(human_comparisonResult$log2FC, na.rm = TRUE)
    ecoli_median <- median(ecoli_comparisonResult$log2FC, na.rm = TRUE)
    yeast_median <- median(yeast_comparisonResult$log2FC, na.rm = TRUE)

    cat("Expected Log Change Human:", human_median, "\n")
    cat("Expected Log Change Ecoli:", ecoli_median, "\n")
    cat("Expected Log Change Yeast:", yeast_median, "\n")

    # TODO: calculate SD and mean

    # Kept the code for individual boxplots

    # boxplot(human_comparisonResult$log2FC,
    #         main = "Boxplot of log2FC for Human",
    #         ylab = "log2FC",
    #         col = "lightblue")
Comment on lines +34 to +37
Contributor:
If you're not using the code, could you remove the comments?

    boxplot(ecoli_comparisonResult$log2FC,
            main = "Boxplot of log2FC for E. coli",
            ylab = "log2FC",
            col = "lightgreen")

    # boxplot(yeast_comparisonResult$log2FC,
    #         main = "Boxplot of log2FC for Yeast",
    #         ylab = "log2FC",
    #         col = "lightpink")

    combined_data <- list(
        Human = human_comparisonResult$log2FC,
        Ecoli = ecoli_comparisonResult$log2FC,
        Yeast = yeast_comparisonResult$log2FC
    )
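A side note on the active boxplot above (an observation, not in the PR): under `Rscript` the plot goes to a default `Rplots.pdf`; a hypothetical explicit graphics device would make the output file predictable.

```r
# Hypothetical: write the plot to a named file rather than the default Rplots.pdf
png("ecoli_log2FC_boxplot.png")
boxplot(ecoli_comparisonResult$log2FC,
        main = "Boxplot of log2FC for E. coli",
        ylab = "log2FC",
        col = "lightgreen")
dev.off()
```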


    unique_ecoli_proteins <- unique(ecoli_comparisonResult$Protein)
    unique_yeast_proteins <- unique(yeast_comparisonResult$Protein)

    # The significant (truly changing) proteins in FragData
    all_proteins <- union(unique_ecoli_proteins, unique_yeast_proteins)

    extracted_proteins <- sapply(all_proteins, function(x) {
        split_string <- strsplit(x, "\\|")[[1]]  # Split the string by '|'
        if (length(split_string) >= 2) {
            return(split_string[2])  # Return the second element (the accession)
        } else {
            return(NA)  # Return NA if there's no second element
        }
    })
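For illustration (hypothetical ID): `strsplit("sp|P0A7G6|RECA_ECOLI", "\\|")[[1]][2]` returns `"P0A7G6"`, i.e. the UniProt accession between the first and second pipes.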

    extracted_proteins <- unname(unlist(extracted_proteins))
    proteins <- extracted_proteins

    # Classify each comparison: is the protein in the expected-change set,
    # and is it called significant at adj.pvalue < 0.05?
    TP <- comparisonResult %>% filter(grepl(paste(proteins, collapse = "|"), Protein) & adj.pvalue < 0.05) %>% nrow()
    FP <- comparisonResult %>% filter(!grepl(paste(proteins, collapse = "|"), Protein) & adj.pvalue < 0.05) %>% nrow()
    TN <- comparisonResult %>% filter(!grepl(paste(proteins, collapse = "|"), Protein) & adj.pvalue >= 0.05) %>% nrow()
    FN <- comparisonResult %>% filter(grepl(paste(proteins, collapse = "|"), Protein) & adj.pvalue >= 0.05) %>% nrow()

    cat("True Positives (Yeast and E. coli):", TP, "\n")
    cat("False Positives (Human Samples):", FP, "\n")
    cat("True Negatives:", TN, "\n")
    cat("False Negatives:", FN, "\n")

    # False Positive Rate
    FPR <- FP / (FP + TN)

    # Accuracy
    accuracy <- (TP + TN) / (TP + TN + FP + FN)

    # Recall
    recall <- TP / (TP + FN)

    results <- data.frame(
        Label = label,
        TP = TP,
        FP = FP,
        TN = TN,
        FN = FN,
        FPR = FPR,
        Accuracy = accuracy,
        Recall = recall
    )

    return(results)
}


# Use fread directly to read the CSV
fragpipe_raw = data.table::fread("../data/FragPipeMsStatsBenchmarking.csv")

head(fragpipe_raw)

fragpipe_raw$Condition = unlist(lapply(fragpipe_raw$Run, function(x){
    paste(str_split(x, "\\_")[[1]][4:5], collapse="_")
}))

fragpipe_raw$BioReplicate = unlist(lapply(fragpipe_raw$Run, function(x){
    paste(str_split(x, "\\_")[[1]][4:7], collapse="_")
}))
Comment on lines +122 to +128
Contributor:
I thought the Fragpipe files already have BioReplicate and Condition information

Contributor:
Could you use the dataset stored in the /work/VitekLab/Data/MS/Benchmarking directory? It should have the name MSstats.csv
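A minimal sketch of that suggested change (path and filename taken from the comment, unverified):

```r
# Hypothetical: read the shared benchmarking dataset suggested in review
fragpipe_raw <- data.table::fread("/work/VitekLab/Data/MS/Benchmarking/MSstats.csv")
```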


# Convert to MSstats format
msstats_format = MSstatsConvert::FragPipetoMSstatsFormat(fragpipe_raw, use_log_file = FALSE)


# Define the tasks with descriptive labels
data_process_tasks <- list(
    list(
        label = "Data process with Normalized Data",
        result = function() dataProcess(msstats_format, featureSubset = "topN", n_top_feature = 20)
    ),
    list(
        label = "Data process with Normalization and MBImpute False",
        result = function() dataProcess(msstats_format, featureSubset = "topN", n_top_feature = 20, MBimpute = FALSE)
    ),
    list(
        label = "Data process without Normalization",
        result = function() dataProcess(msstats_format, normalization = "FALSE", n_top_feature = 20)
    ),
    list(
        label = "Data process without Normalization with MBImpute False",
        result = function() dataProcess(msstats_format, normalization = "FALSE", n_top_feature = 20, MBimpute = FALSE)
    )
Comment on lines +145 to +151
Contributor:
Does the n_top_feature parameter need to be initialized to anything here? I thought it's only needed if featureSubset = "topN".

)
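Following up on the comment above, a hypothetical rewrite of one task without the parameter (assuming `dataProcess` only consults `n_top_feature` when `featureSubset = "topN"`):

```r
# Hypothetical: n_top_feature omitted where featureSubset is not "topN"
list(
    label = "Data process without Normalization",
    result = function() dataProcess(msstats_format, normalization = "FALSE")
)
```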

# Start the timer
start_time <- Sys.time()

# Use mclapply to run the dataProcess tasks in parallel
num_cores <- detectCores() - 1 # Use one less than the total cores available

# Run data processing tasks in parallel and collect results with labels
summarized_results <- mclapply(data_process_tasks, function(task) {
    list(label = task$label, summarized = task$result())
}, mc.cores = num_cores)

# Run calculateResult on each summarized result in parallel
results_list <- mclapply(summarized_results, function(res) {
    calculateResult(res$summarized, res$label)
}, mc.cores = num_cores)

# Combine all results into a single data frame
final_results <- do.call(rbind, results_list)

# End the timer
end_time <- Sys.time()
total_time <- end_time - start_time

# Display the final results and execution time
print(final_results)
print(paste("Total Execution Time:", total_time))
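One small caveat (not raised in review): the difference of two `Sys.time()` values carries an auto-chosen unit that `paste()` drops; a hypothetical fix pins the unit explicitly.

```r
# Hypothetical: report elapsed time in a fixed unit so the printed number is unambiguous
total_time <- difftime(end_time, start_time, units = "mins")
print(paste("Total Execution Time:", round(as.numeric(total_time), 2), "minutes"))
```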
29 changes: 29 additions & 0 deletions benchmark/config.slurm
@@ -0,0 +1,29 @@
#!/bin/bash
#SBATCH --job-name=msstats_benchmark_job
#SBATCH --output=job_output.txt
#SBATCH --error=job_error.txt
#SBATCH --time=01:00:00 # Set the maximum run time
#SBATCH --ntasks=1 # Number of tasks (one process)
#SBATCH --cpus-per-task=8 # Use 8 CPU cores for the task
#SBATCH --mem=128G # Request 256GB of memory
Contributor:
Comment says 256, but says 128 here.

#SBATCH --partition=short # Use the 'short' partition (or change as needed)

module load R-geospatial


module load gcc/11.1.0
module load cmake/3.23.2

export LC_ALL=C
export R_LIBS_USER=/home/raina.ans/R/x86_64-pc-linux-gnu-library/4.2-geospatial


mkdir -p $R_LIBS_USER

module load R
Rscript -e "if (!requireNamespace('BiocManager', quietly = TRUE)) install.packages('BiocManager', lib = Sys.getenv('R_LIBS_USER'), repos = 'https://cloud.r-project.org'); \
BiocManager::install('MSstats', lib = Sys.getenv('R_LIBS_USER'), update = FALSE); \
BiocManager::install('MSstatsConvert', lib = Sys.getenv('R_LIBS_USER'), update = FALSE); \
install.packages(c('dplyr', 'stringr', 'ggplot2'), lib = Sys.getenv('R_LIBS_USER'), repos = 'https://cloud.r-project.org')"
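A possible refinement (an assumption, not part of the PR): guard the heavier installs so repeated pipeline runs skip packages already present in `R_LIBS_USER`.

```bash
# Hypothetical guard: install MSstats only when it is missing from the user library
Rscript -e "if (!requireNamespace('MSstats', quietly = TRUE)) \
  BiocManager::install('MSstats', lib = Sys.getenv('R_LIBS_USER'), update = FALSE)"
```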

Rscript benchmark.R