Better files download and replication configurability #543

aaaditij · 2024-09-11T23:18:31Z

Problem

The createjob API accepts a boolean flag called package_input_folder. When set to true, job creation packages the input folder (the folder containing the input notebook) together with all nested files and subfolders. This introduces the following problems:

The download_files API copies the entire input folder from the staging area to the output folder. This is currently done so that the notebook downloaded alongside the other output files has access to the same files as the original, and so that running the notebook, in whole or cell by cell, can be replicated when cells refer to files via local paths. In effect the input folder is copied twice, once to the staging area and again to the output folder, which can quickly exhaust storage if the input folder is large.
The files in the staging area are never cleaned up, so they keep consuming storage space.

Proposed solution

Add a boolean flag to the download_files API that lets the user request that only the output files be copied to the output folder.
Add a boolean flag to the download_files API that deletes all files belonging to an execution from the staging area after they have been copied to the output folder.
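A minimal sketch of the second proposal, under stated assumptions: `copy_then_cleanup` and the `delete_staging_files` flag are hypothetical names that do not exist in jupyter-scheduler, and the sketch works on the local filesystem only. It copies the listed files out of a staging directory and removes that directory only after every copy has succeeded, so a failed download cannot lose data.

```python
import os
import shutil
import tempfile
from typing import List


def copy_then_cleanup(
    staging_dir: str,
    output_dir: str,
    relative_paths: List[str],
    delete_staging_files: bool = False,  # hypothetical flag name
) -> List[str]:
    """Copy the listed files from staging_dir into output_dir.

    If delete_staging_files is True, remove staging_dir once all copies
    have completed. Returns the paths written under output_dir.
    """
    copied = []
    for rel_path in relative_paths:
        src = os.path.join(staging_dir, rel_path)
        dst = os.path.join(output_dir, rel_path)
        os.makedirs(os.path.dirname(dst), exist_ok=True)
        shutil.copy2(src, dst)
        copied.append(dst)

    # Only reached if no copy raised, so staging data is never deleted
    # before it is safely in the output folder.
    if delete_staging_files:
        shutil.rmtree(staging_dir)
    return copied


# Usage: stage one file, copy it out, and drop the staging directory.
staging = tempfile.mkdtemp()
output = tempfile.mkdtemp()
os.makedirs(os.path.join(staging, "results"))
with open(os.path.join(staging, "results", "out.csv"), "w") as f:
    f.write("a,b\n")
copied = copy_then_cleanup(
    staging, output, [os.path.join("results", "out.csv")], delete_staging_files=True
)
```

Deleting only after the last successful copy is a deliberate choice here; a real implementation would also have to decide what happens when the same job is downloaded a second time after its staging files are gone.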


andrii-i · 2024-09-11T23:38:14Z

Hi @aaaditij. Thank you for creating this enhancement request. These options make sense to me as traitlet-configurable options. At the same time we are not working on any of them now and are not planning to as there are some high-priority deliverables in the pipeline.

andrii-i · 2024-09-23T18:08:05Z

Implementation overview for "1.) Add a boolean flag to download_files api to allow the user to specify if they only want the output files to be copied over to the output folder." based on discussion with @aaaditij:

Add an optional side_effects: Optional[List[str]] = [] field to all Job-related data models https://github.com/jupyter-server/jupyter-scheduler/blob/main/jupyter_scheduler/models.py
Use side_effects, instead of packaged_files, to store the side effect files created during the job run, changing DefaultExecutionManager.add_side_effects_files accordingly

jupyter-scheduler/jupyter_scheduler/executors.py

Line 147 in 1af9903

def add_side_effects_files(self, staging_dir: str):
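The first two steps could look roughly like the sketch below. `JobSketch` and `find_side_effects` are hypothetical names, not part of jupyter-scheduler: a dataclass stands in for the real Job models, and the scan assumes that a side effect is any file in the staging directory that is not already known (here, the input notebook and the packaged files). Output files written to the same directory would have to be added to the known set as well.

```python
import os
import tempfile
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class JobSketch:
    """Hypothetical stand-in for the Job models; only relevant fields shown."""

    packaged_files: List[str] = field(default_factory=list)
    side_effects: List[str] = field(default_factory=list)


def find_side_effects(staging_dir: str, known_files: Set[str]) -> List[str]:
    """Return sorted relative paths of files under staging_dir not in known_files."""
    found = []
    for root, _, files in os.walk(staging_dir):
        for name in files:
            rel_path = os.path.relpath(os.path.join(root, name), staging_dir)
            if rel_path not in known_files:
                found.append(rel_path)
    return sorted(found)


# Usage: the notebook and data.csv were packaged; plot.png appeared during the run.
staging_dir = tempfile.mkdtemp()
for name in ("notebook.ipynb", "data.csv", "plot.png"):
    open(os.path.join(staging_dir, name), "w").close()

job = JobSketch(packaged_files=["data.csv"])
job.side_effects = find_side_effects(
    staging_dir, known_files={"notebook.ipynb", *job.packaged_files}
)
```

Keeping the two lists separate is what later lets the download step tell packaged inputs apart from files the run produced.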
Add a config option (let's call it download_output_files_only) or an API option, modifying the download endpoint by changing FilesDownloadHandler to accept an output-files-only option

jupyter-scheduler/jupyter_scheduler/handlers.py

Line 397 in 1af9903

class FilesDownloadHandler(ExtensionHandlerMixin, APIHandler):

Pass the config option to JobFilesManager and Downloader as a parameter, adding it as an argument to both and to any intermediary classes

jupyter-scheduler/jupyter_scheduler/job_files_manager.py

Lines 27 to 34 in 1af9903

    
    target=Downloader(
        output_formats=job.output_formats,
        output_filenames=output_filenames,
        staging_paths=staging_paths,
        output_dir=output_dir,
        redownload=redownload,
        include_staging_files=job.package_input_folder,
    ).download

Change the Downloader.generate_filepaths function to return only side effects and outputs, not packaged files, if download_output_files_only is set

jupyter-scheduler/jupyter_scheduler/job_files_manager.py

Lines 56 to 70 in 1af9903

    
    def generate_filepaths(self):
        """A generator that produces filepaths"""
        output_formats = self.output_formats + ["input"]
        for output_format in output_formats:
            input_filepath = self.staging_paths[output_format]
            output_filepath = os.path.join(self.output_dir, self.output_filenames[output_format])
            if not os.path.exists(output_filepath) or self.redownload:
                yield input_filepath, output_filepath
        if self.include_staging_files:
            staging_dir = os.path.dirname(self.staging_paths["input"])
            for file_relative_path in self.output_filenames["files"]:
                input_filepath = os.path.join(staging_dir, file_relative_path)
                output_filepath = os.path.join(self.output_dir, file_relative_path)
                if not os.path.exists(output_filepath) or self.redownload:
                    yield input_filepath, output_filepath

Add a test for job files manager using the existing fixtures https://github.com/jupyter-server/jupyter-scheduler/blob/main/jupyter_scheduler/tests/test_job_files_manager.py#L137
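The change to generate_filepaths described above might be sketched as follows. This is not the project's implementation: `DownloaderSketch` is a simplified stand-in that models path generation only, and the `side_effects` and `download_output_files_only` arguments are hypothetical. It assumes that side effects arrive as a separate list of paths relative to the staging directory, that they stay gated by include_staging_files as packaged files are today, and that the input notebook is still downloaded as in the current code.

```python
import os
from typing import Any, Dict, Iterator, List, Optional, Tuple


class DownloaderSketch:
    """Simplified stand-in for Downloader; models path generation only."""

    def __init__(
        self,
        output_formats: List[str],
        output_filenames: Dict[str, Any],
        staging_paths: Dict[str, str],
        output_dir: str,
        redownload: bool = False,
        include_staging_files: bool = False,
        side_effects: Optional[List[str]] = None,  # hypothetical argument
        download_output_files_only: bool = False,  # hypothetical argument
    ):
        self.output_formats = output_formats
        self.output_filenames = output_filenames
        self.staging_paths = staging_paths
        self.output_dir = output_dir
        self.redownload = redownload
        self.include_staging_files = include_staging_files
        self.side_effects = side_effects or []
        self.download_output_files_only = download_output_files_only

    def _needs_download(self, output_filepath: str) -> bool:
        return self.redownload or not os.path.exists(output_filepath)

    def generate_filepaths(self) -> Iterator[Tuple[str, str]]:
        """Yield (staging_path, output_path) pairs that need downloading."""
        # Same as the current behavior: outputs plus the input notebook.
        for output_format in self.output_formats + ["input"]:
            input_filepath = self.staging_paths[output_format]
            output_filepath = os.path.join(
                self.output_dir, self.output_filenames[output_format]
            )
            if self._needs_download(output_filepath):
                yield input_filepath, output_filepath

        if not self.include_staging_files:
            return

        # Side effects are always included; packaged input files are
        # skipped when only output files were requested.
        relative_paths = list(self.side_effects)
        if not self.download_output_files_only:
            relative_paths = list(self.output_filenames["files"]) + relative_paths

        staging_dir = os.path.dirname(self.staging_paths["input"])
        for file_relative_path in relative_paths:
            input_filepath = os.path.join(staging_dir, file_relative_path)
            output_filepath = os.path.join(self.output_dir, file_relative_path)
            if self._needs_download(output_filepath):
                yield input_filepath, output_filepath


# Usage: data.csv was packaged with the job, plot.png was created by the run.
downloader = DownloaderSketch(
    output_formats=["html"],
    output_filenames={"html": "nb.html", "input": "nb.ipynb", "files": ["data.csv"]},
    staging_paths={"html": "/staging/job/nb.html", "input": "/staging/job/nb.ipynb"},
    output_dir="no_such_output_dir",
    include_staging_files=True,
    side_effects=["plot.png"],
    download_output_files_only=True,
)
for src, dst in downloader.generate_filepaths():
    print(src, "->", dst)
```

A test built on the existing fixtures could assert the same property this usage shows: with the flag set, packaged files such as data.csv are absent from the generated pairs while outputs and side effects are present.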

aaaditij added the enhancement New feature or request label Sep 11, 2024

andrii-i added this to the Future milestone Sep 19, 2024
