
Setup CI/CD Pipeline for MSStats #143

Open · wants to merge 34 commits into base: devel

Conversation

@anshuman-raina (Contributor) commented Nov 3, 2024

User description

Motivation and Context

Part of Dataset Benchmarking Task

Changes


  • Added a Slurm file for job setup
  • Added a GitHub Actions workflow
  • Added a benchmarking script

Testing

Please describe any unit tests you added or modified to verify your changes.

Checklist Before Requesting a Review

  • I have read the MSstats contributing guidelines
  • My changes generate no new warnings
  • Any dependent changes have been merged and published in downstream modules
  • I have run the devtools::document() command after my changes and committed the added files

PR Type

enhancement, tests


Description

  • Added a GitHub Actions workflow to automate running R scripts on HPC using Slurm, including SSH setup, file transfer, job submission, and output retrieval.
  • Developed a comprehensive R benchmarking script leveraging MSstats and MSstatsConvert for data processing and analysis, with parallel execution for improved performance.
  • Created a Slurm configuration script to define job parameters and execute the benchmarking script on HPC.

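Taken together, the steps above amount to a submit-and-poll loop over SSH. A minimal sketch of that flow is below; `remote()` is a local stub standing in for the real `ssh user@host "..."` calls so the sketch runs anywhere, and the sbatch output line and job ID are illustrative, not taken from this PR:

```shell
# Minimal sketch of the submit-and-poll flow the workflow automates.
# remote() is a local stub for `ssh user@host "..."`; the job ID is made up.
set -eu

remote() {
  case "$1" in
    submit) echo "Submitted batch job 48213" ;;   # sbatch's stdout format
    queue)  echo "" ;;                            # empty squeue: job is done
  esac
}

# Submit the Slurm job and parse the numeric job ID from sbatch's output
job_id=$(remote submit | awk '/^Submitted batch job/ {print $4}')

# Poll until squeue no longer lists the job
while [ -n "$(remote queue)" ]; do
  sleep 30
done

echo "job $job_id finished"
```

In the real workflow each `remote` call would be an `ssh` invocation against the cluster login node, followed by an `scp` to fetch the output files.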
Changes walkthrough 📝

Enhancement
.github/workflows/slurm_poc.yml — Add GitHub Actions workflow for HPC integration using Slurm
  • Added a GitHub Actions workflow to run R scripts on HPC via Slurm.
  • Configured SSH access and file transfer to HPC.
  • Implemented job submission and monitoring using Slurm.
  • Set up artifact upload for job output.
  • +48/-0

Tests
benchmark/benchmark.R — Add benchmarking script with data processing and analysis
  • Introduced a benchmarking script using MSstats and MSstatsConvert.
  • Implemented data processing and result calculation functions.
  • Utilized parallel processing for efficiency.
  • Calculated and displayed performance metrics.
  • +178/-0

Configuration changes
benchmark/config.slurm — Create Slurm configuration script for benchmarking job
  • Created a Slurm configuration script for job submission.
  • Defined job parameters such as name, output, and resources.
  • Configured the execution of the R benchmarking script.
  • +13/-0
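Given the walkthrough above (job name, output files, resources, and execution of the R script), a `config.slurm` of roughly this shape would fit; the resource directives mirror the snippet quoted later in this thread, while the job name, output/error file names, and the `module load R` line are assumptions, not the file's actual contents:

```shell
#!/bin/bash
# Hypothetical sketch of benchmark/config.slurm; hedged reconstruction only.
#SBATCH --job-name=msstats-benchmark   # job name (assumed)
#SBATCH --output=job_output.txt        # stdout file the workflow fetches
#SBATCH --error=job_error.txt          # stderr file the workflow fetches
#SBATCH --time=01:00:00                # maximum run time
#SBATCH --ntasks=1                     # one process
#SBATCH --cpus-per-task=8              # CPU cores for parallel processing

module load R                          # assumed module name on the cluster
Rscript benchmark.R                    # run the benchmarking script
```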

💡 PR-Agent usage: Comment /help "your question" on any pull request to receive relevant information

github-actions bot commented Nov 3, 2024

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Hardcoded Values
The workflow contains hardcoded server addresses and usernames, which could lead to security risks and a lack of flexibility. Consider using environment variables or GitHub secrets to manage these values dynamically.

Error Handling
The script lacks error handling throughout the data processing and analysis phases. Adding tryCatch blocks or condition checks can help manage exceptions and ensure the robustness of the script.

Performance Concern
The script uses multiple instances of paste and str_split within loops, which can be optimized. Consider pre-computing these operations outside of loops or using vectorized operations to improve performance.

github-actions bot commented Nov 3, 2024

PR Code Suggestions ✨

Explore these optional code suggestions:

Security
Improve security by setting SSH key permissions before writing the key

Ensure that the SSH key permissions are set before writing the key, to prevent potential race conditions where the key file could be accessed with incorrect permissions.

.github/workflows/slurm_poc.yml [18-20]

 mkdir -p ~/.ssh
+touch ~/.ssh/id_rsa
+chmod 600 ~/.ssh/id_rsa
 echo "${{ secrets.SSH_PRIVATE_KEY }}" > ~/.ssh/id_rsa
-chmod 600 ~/.ssh/id_rsa

Suggestion importance [1-10]: 9

Why: This suggestion addresses a potential security issue by ensuring that the SSH key file has the correct permissions before writing the key, reducing the risk of unauthorized access. It is a critical improvement for maintaining security in the workflow.
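An equivalent way to close the same race, not from this PR but a common pattern: set a restrictive umask so the redirection creates the key file with 600 permissions in the first place. The sketch below uses a temp directory and dummy key material so it runs anywhere (it assumes GNU `stat`, as on the Ubuntu runners GitHub Actions provides):

```shell
# Sketch: with umask 077, the redirection itself creates the file as mode 600,
# so there is never a window where the key exists with loose permissions.
# The path and key material are dummies for illustration.
set -eu
keydir=$(mktemp -d)

(
  umask 077
  printf '%s\n' "dummy-key-material" > "$keydir/id_rsa"
)

perms=$(stat -c '%a' "$keydir/id_rsa")
echo "$perms"
```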
Possible bug
Add error handling to SSH and SCP commands to ensure workflow reliability

Add error handling for SSH and SCP commands to ensure that the workflow fails gracefully if these commands do not execute successfully.

.github/workflows/slurm_poc.yml [21-29]

-ssh-keyscan -H login-00.discovery.neu.edu >> ~/.ssh/known_hosts
-scp benchmark.R config.slurm raina.ans@login-00.discovery.neu.edu:/home/raina.ans/R
-ssh raina.ans@login-00.discovery.neu.edu "cd R && sbatch config.slurm"
+ssh-keyscan -H login-00.discovery.neu.edu >> ~/.ssh/known_hosts || exit 1
+scp benchmark.R config.slurm raina.ans@login-00.discovery.neu.edu:/home/raina.ans/R || exit 1
+ssh raina.ans@login-00.discovery.neu.edu "cd R && sbatch config.slurm" || exit 1

Suggestion importance [1-10]: 8

Why: Adding error handling to SSH and SCP commands enhances the robustness of the workflow by ensuring that failures in these commands are detected and handled gracefully, preventing potential issues in subsequent steps.
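A variant of the same fix, not from the PR: putting `set -e` at the top of the step's script makes the shell abort on the first failing command, covering every ssh/scp call without appending `|| exit 1` to each line. A local demonstration with a deliberately failing command:

```shell
# Run a small script under `set -e`: execution stops at the first failure,
# so the final echo inside it never runs and the script exits non-zero.
script='set -e; false; echo "never printed"'

if out=$(sh -c "$script"); then
  status=0
else
  status=$?
fi

echo "exit status: $status"
echo "captured output: <$out>"
```

In a GitHub Actions `run:` step this is spelled `set -e` (or `set -euo pipefail`) as the first line of the script block.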
Enhancement
Update the function used for group comparison to align with the latest MSstats package

Replace the deprecated groupComparison function with the updated groupComparisonTMT or groupComparisonLabelFree, depending on the experimental design, to ensure compatibility with the latest MSstats package.

benchmark/benchmark.R [10]

-model = groupComparison("pairwise", summarized)
+model = groupComparisonLabelFree("pairwise", summarized)

Suggestion importance [1-10]: 7

Why: Updating the function to align with the latest MSstats package ensures compatibility and future-proofing of the code. This enhancement is beneficial for maintaining the code's functionality with updated libraries.
Maintainability
Use descriptive variable names for clarity and maintainability

Consider using more descriptive variable names than TP, FP, TN, FN for better code readability and maintainability.

benchmark/benchmark.R [75-84]

-TP <- comparisonResult %>% filter(grepl(paste(proteins, collapse = "|"), Protein) & adj.pvalue < 0.05) %>% nrow()
-FP <- comparisonResult %>% filter(!grepl(paste(proteins, collapse = "|"), Protein) & adj.pvalue < 0.05) %>% nrow()
-TN <- comparisonResult %>% filter(!grepl(paste(proteins, collapse = "|"), Protein) & adj.pvalue >= 0.05) %>% nrow()
-FN <- comparisonResult %>% filter(grepl(paste(proteins, collapse = "|"), Protein) & adj.pvalue >= 0.05) %>% nrow()
+true_positives <- comparisonResult %>% filter(grepl(paste(proteins, collapse = "|"), Protein) & adj.pvalue < 0.05) %>% nrow()
+false_positives <- comparisonResult %>% filter(!grepl(paste(proteins, collapse = "|"), Protein) & adj.pvalue < 0.05) %>% nrow()
+true_negatives <- comparisonResult %>% filter(!grepl(paste(proteins, collapse = "|"), Protein) & adj.pvalue >= 0.05) %>% nrow()
+false_negatives <- comparisonResult %>% filter(grepl(paste(proteins, collapse = "|"), Protein) & adj.pvalue >= 0.05) %>% nrow()

Suggestion importance [1-10]: 5

Why: While more descriptive variable names can improve code readability and maintainability, the current names (TP, FP, TN, FN) are standard in statistical contexts. The suggestion is valid but offers only a minor improvement.

@tonywu1999 tonywu1999 self-requested a review November 5, 2024 20:54

Comment on lines +18 to +20

mkdir -p ~/.ssh
echo "${{ secrets.SSH_PRIVATE_KEY }}" > ~/.ssh/id_rsa
chmod 600 ~/.ssh/id_rsa
I would follow ChatGPT's suggestion here.

#SBATCH --time=01:00:00 # Set the maximum run time
#SBATCH --ntasks=1 # Number of tasks (one process)
#SBATCH --cpus-per-task=8 # Use 8 CPU cores for the task
#SBATCH --mem=128G # Request 256GB of memory
Comment says 256, but says 128 here.

mkdir -p ~/.ssh
echo "${{ secrets.SSH_PRIVATE_KEY }}" > ~/.ssh/id_rsa
chmod 600 ~/.ssh/id_rsa
ssh-keyscan -H login-00.discovery.neu.edu >> ~/.ssh/known_hosts
@tonywu1999 tonywu1999 commented Nov 6, 2024

I think we should follow ChatGPT's suggestion here (the error handling one).


- name: Fetch Output
  run: |
    scp raina.ans@login-00.discovery.neu.edu:/home/raina.ans/R/job_output.txt job_output.txt
There seems to be a lot of dependency on using your /home/raina.ans folder. Could we instead use a folder in the /work/VitekLab directory? I think there's already a benchmarking folder in there.

Comment on lines +34 to +37

# boxplot(human_comparisonResult$log2FC,
#         main = "Boxplot of log2FC for Human",
#         ylab = "log2FC",
#         col = "lightblue")
If you're not using the code, could you remove the comments?

Comment on lines +122 to +128

fragpipe_raw$Condition = unlist(lapply(fragpipe_raw$Run, function(x){
  paste(str_split(x, "\\_")[[1]][4:5], collapse="_")
}))

fragpipe_raw$BioReplicate = unlist(lapply(fragpipe_raw$Run, function(x){
  paste(str_split(x, "\\_")[[1]][4:7], collapse="_")
}))
I thought the Fragpipe files already have BioReplicate and Condition information.

Could you use the dataset stored in the /work/VitekLab/Data/MS/Benchmarking directory? It should have the name MSstats.csv.

Comment on lines +145 to +151

  label = "Data process without Normalization",
  result = function() dataProcess(msstats_format, normalization = "FALSE", n_top_feature = 20)
),
list(
  label = "Data process without Normalization with MBImpute False",
  result = function() dataProcess(msstats_format, normalization = "FALSE", n_top_feature = 20, MBimpute = FALSE)
)
Does the n_top_feature parameter need to be initialized to anything here? I thought it's only needed if featureSubset = "topN".

- name: Fetch Output
  run: |
    scp raina.ans@login-00.discovery.neu.edu:/home/raina.ans/R/job_output.txt job_output.txt
    scp raina.ans@login-00.discovery.neu.edu:/home/raina.ans/R/job_error.txt job_error.txt
I'm wondering if it makes sense to use login info for someone like Olga. I'm not sure if she has an OOD account though. Or maybe use my login.

How difficult would it be to use someone else's login? What would we need to adjust?
