
Memory leakage during multi-processes/multi-gpus inference #430

Open
EricLina opened this issue Oct 31, 2024 · 0 comments

EricLina commented Oct 31, 2024

Hello! 🤗

I attempted to use multi-processing to speed up inference but ran into memory leakage issues. At the end of each task, memory is not fully reclaimed, which leaves residual processes and GPU memory that is never released. Here's a snippet of my code:

Code

import multiprocessing
import gc
import torch
# ...

def process_sequence(n_video, video_name, video_names, gpu_id, args):
    predictor = None  # defined up front so the finally block cannot raise NameError
    try:
        print(f"\n{n_video + 1}/{len(video_names)} - running on {video_name} with GPU {gpu_id}")
        torch.cuda.set_device(gpu_id)
        # if we use per-object PNG files, they could possibly overlap in inputs and outputs
        hydra_overrides_extra = [
            "++model.non_overlap_masks=" + ("false" if args.per_obj_png_file else "true")
        ]
        predictor = build_sam2_video_predictor(
            config_file=args.sam2_cfg,
            ckpt_path=args.sam2_checkpoint,
            apply_postprocessing=args.apply_postprocessing,
            hydra_overrides_extra=hydra_overrides_extra,
        )
        if not args.track_object_appearing_later_in_video:
            vos_inference(
                predictor=predictor,
                base_video_dir=args.base_video_dir,
                input_mask_dir=args.input_mask_dir,
                output_mask_dir=args.output_mask_dir,
                video_name=video_name,
                score_thresh=args.score_thresh,
                use_all_masks=args.use_all_masks,
                per_obj_png_file=args.per_obj_png_file,
                lower_gpu_memory=args.lower_gpu_memory,
            )
        else:
            vos_separate_inference_per_object(
                predictor=predictor,
                base_video_dir=args.base_video_dir,
                input_mask_dir=args.input_mask_dir,
                output_mask_dir=args.output_mask_dir,
                video_name=video_name,
                score_thresh=args.score_thresh,
                use_all_masks=args.use_all_masks,
                per_obj_png_file=args.per_obj_png_file,
                lower_gpu_memory=args.lower_gpu_memory,
            )
    finally:
        # best-effort cleanup so the worker does not carry allocations into its next task
        if predictor is not None:
            del predictor
        torch.cuda.empty_cache()
        gc.collect()
        
def main():
    # ...

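    # Original single-process loop, kept here for reference: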
    # for n_video, video_name in enumerate(video_names):
    #     print(f"\n{n_video + 1}/{len(video_names)} - running on {video_name}")
    #     if not args.track_object_appearing_later_in_video:
    #         vos_inference(
    #             predictor=predictor,
    #             base_video_dir=args.base_video_dir,
    #             input_mask_dir=args.input_mask_dir,
    #             output_mask_dir=args.output_mask_dir,
    #             video_name=video_name,
    #             score_thresh=args.score_thresh,
    #             use_all_masks=args.use_all_masks,
    #             per_obj_png_file=args.per_obj_png_file,
    #             lower_gpu_memory=args.lower_gpu_memory,
    #         )
    #     else:
    #         vos_separate_inference_per_object(
    #             predictor=predictor,
    #             base_video_dir=args.base_video_dir,
    #             input_mask_dir=args.input_mask_dir,
    #             output_mask_dir=args.output_mask_dir,
    #             video_name=video_name,
    #             score_thresh=args.score_thresh,
    #             use_all_masks=args.use_all_masks,
    #             per_obj_png_file=args.per_obj_png_file,
    #             lower_gpu_memory=args.lower_gpu_memory,
    #         )  
       

    # one task per video; GPUs are assigned round-robin via n_video % device_count
    param_list = [(
        n_video,
        video_name,
        video_names,
        n_video % torch.cuda.device_count(),
        args
    ) for n_video, video_name in enumerate(video_names)]

    # 'spawn' is required so that each worker process gets its own CUDA context
    multiprocessing.set_start_method('spawn')
    num_threads = 4
    with multiprocessing.Pool(processes=num_threads) as pool:
        results = [pool.apply_async(process_sequence, param) for param in param_list]

        for result in results:
            result.wait()

Issue Details:

When running this script on 2 NVIDIA RTX 3090 GPUs with num_threads set to 4 and the small model, gpustat shows that leftover worker processes each keep holding ~338 MB of GPU memory, even after attempting to clear it with del predictor, torch.cuda.empty_cache(), and gc.collect():

[0] NVIDIA GeForce RTX 3090 | 43°C,   0 % |  4171 / 24576 MB | usr(512M) usr(2968M) usr(338M) usr(338M)
[1] NVIDIA GeForce RTX 3090 | 55°C,   0 % |  3619 / 24576 MB | usr(338M) usr(338M) usr(512M) usr(2416M)

The residual GPU memory is probably caused by the predictor object (or state created inside vos_inference / vos_separate_inference_per_object) not being fully released. Calling del predictor, torch.cuda.empty_cache(), and gc.collect() has not resolved the issue.
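For reference, the most aggressive generic cleanup I can think of looks like the sketch below (the helper name release_cuda_memory is just for illustration; torch.cuda.synchronize() and torch.cuda.ipc_collect() are standard PyTorch calls). Even this is unlikely to free everything: my understanding is that the leftover ~338 MB per process is mostly the CUDA context itself, which is only released when the process exits.

import gc
import torch

def release_cuda_memory():
    # Collect Python garbage first so tensors kept alive only by reference
    # cycles are freed, then return the caching allocator's blocks and any
    # IPC handles to the CUDA driver.
    gc.collect()
    torch.cuda.synchronize()   # wait for outstanding kernels before freeing
    torch.cuda.empty_cache()   # release cached blocks held by the allocator
    torch.cuda.ipc_collect()   # release memory kept alive for inter-process sharing

# usage inside process_sequence's finally block, after `del predictor`:
#     release_cuda_memory()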

Solution Attempt

Setting maxtasksperchild=1 ensures that each worker process handles only one task before it is terminated and replaced with a fresh one.

with multiprocessing.Pool(processes=num_threads, maxtasksperchild=1) as pool:

This forces each worker to exit after a single video, which reliably releases its GPU memory. The number of live processes now matches num_threads=4, and there are no leaked processes anymore:

[0] NVIDIA GeForce RTX 3090 | 51°C,  40 % |  5791 / 24576 MB | usr(3432M) usr(2350M)
[1] NVIDIA GeForce RTX 3090 | 53°C,   0 % |  5801 / 24576 MB | usr(3432M) usr(2360M)

However, this approach adds significant overhead, because every replacement worker has to re-initialize from scratch (new CUDA context, predictor rebuilt), which hurts inference speed when processing a large number of videos.
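One alternative I am considering (just a sketch, not something from this repo: the queue-based layout and the gpu_worker / run_all names are mine, and the hydra overrides plus the per-object branch are omitted for brevity) is to keep one long-lived worker per GPU that builds the predictor once and pulls video names from a shared queue, so the model and the CUDA context are reused instead of re-created for every task:

import multiprocessing as mp
import torch
# build_sam2_video_predictor and vos_inference are the same functions used in the snippet above

def gpu_worker(gpu_id, task_queue, args):
    # Build the predictor once and reuse it for every video assigned to this GPU.
    torch.cuda.set_device(gpu_id)
    predictor = build_sam2_video_predictor(
        config_file=args.sam2_cfg,
        ckpt_path=args.sam2_checkpoint,
        apply_postprocessing=args.apply_postprocessing,
    )
    while True:
        video_name = task_queue.get()
        if video_name is None:  # sentinel value: no more work
            break
        vos_inference(
            predictor=predictor,
            base_video_dir=args.base_video_dir,
            input_mask_dir=args.input_mask_dir,
            output_mask_dir=args.output_mask_dir,
            video_name=video_name,
            score_thresh=args.score_thresh,
            use_all_masks=args.use_all_masks,
            per_obj_png_file=args.per_obj_png_file,
            lower_gpu_memory=args.lower_gpu_memory,
        )

def run_all(video_names, args):
    mp.set_start_method("spawn", force=True)
    num_gpus = torch.cuda.device_count()
    task_queue = mp.Queue()
    for name in video_names:
        task_queue.put(name)
    for _ in range(num_gpus):
        task_queue.put(None)  # one sentinel per worker so each one shuts down
    workers = [
        mp.Process(target=gpu_worker, args=(gpu_id, task_queue, args))
        for gpu_id in range(num_gpus)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()

With enough VRAM headroom this could be generalized to a few workers per GPU; since each worker lives for the whole run, there is nothing to reclaim between videos beyond what the predictor allocates per sequence.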

My Questions:

  • Is there an issue with the way I'm using multiprocessing in my code? (It seems that num_threads has effectively become the number of processes per GPU rather than the total number of processes.)
  • Besides using maxtasksperchild=1, is there a way to completely clear the GPU memory used in process_sequence?

Any suggestions for fully resolving memory release in this context would be very helpful. Thanks.🙌

@EricLina EricLina changed the title Memory leakage during multi-process inference when using multiprocessing.Pool Memory leakage during multi-processes/multi-gpus inference Oct 31, 2024