
Memory leakage during multi-processes/multi-gpus inference #430

Open
EricLina opened this issue Oct 31, 2024 · 0 comments

EricLina commented Oct 31, 2024

Hello! 🤗

I attempted to use multi-processing to speed up inference but ran into memory leakage issues. At the end of each task, memory is not fully reclaimed, which leaves residual processes and GPU memory that is never released. Here's a snippet of my code:

Code

import multiprocessing
import gc
import torch
# ...

def process_sequence(n_video, video_name, video_names, gpu_id, args):
    predictor = None  # defined up front so the finally block cannot raise NameError
    try:
        print(f"\n{n_video + 1}/{len(video_names)} - running on {video_name} with GPU {gpu_id}")
        torch.cuda.set_device(gpu_id)
        # if we use per-object PNG files, they could possibly overlap in inputs and outputs
        hydra_overrides_extra = [
            "++model.non_overlap_masks=" + ("false" if args.per_obj_png_file else "true")
        ]
        predictor = build_sam2_video_predictor(
            config_file=args.sam2_cfg,
            ckpt_path=args.sam2_checkpoint,
            apply_postprocessing=args.apply_postprocessing,
            hydra_overrides_extra=hydra_overrides_extra,
        )
        if not args.track_object_appearing_later_in_video:
            vos_inference(
                predictor=predictor,
                base_video_dir=args.base_video_dir,
                input_mask_dir=args.input_mask_dir,
                output_mask_dir=args.output_mask_dir,
                video_name=video_name,
                score_thresh=args.score_thresh,
                use_all_masks=args.use_all_masks,
                per_obj_png_file=args.per_obj_png_file,
                lower_gpu_memory=args.lower_gpu_memory,
            )
        else:
            vos_separate_inference_per_object(
                predictor=predictor,
                base_video_dir=args.base_video_dir,
                input_mask_dir=args.input_mask_dir,
                output_mask_dir=args.output_mask_dir,
                video_name=video_name,
                score_thresh=args.score_thresh,
                use_all_masks=args.use_all_masks,
                per_obj_png_file=args.per_obj_png_file,
                lower_gpu_memory=args.lower_gpu_memory,
            )
    finally:
        # best-effort cleanup so the worker does not carry allocations into its next task
        if predictor is not None:
            del predictor
        torch.cuda.empty_cache()
        gc.collect()
        
def main():
    # ...

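    # Original single-process loop, kept here for reference: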
    # for n_video, video_name in enumerate(video_names):
    #     print(f"\n{n_video + 1}/{len(video_names)} - running on {video_name}")
    #     if not args.track_object_appearing_later_in_video:
    #         vos_inference(
    #             predictor=predictor,
    #             base_video_dir=args.base_video_dir,
    #             input_mask_dir=args.input_mask_dir,
    #             output_mask_dir=args.output_mask_dir,
    #             video_name=video_name,
    #             score_thresh=args.score_thresh,
    #             use_all_masks=args.use_all_masks,
    #             per_obj_png_file=args.per_obj_png_file,
    #             lower_gpu_memory=args.lower_gpu_memory,
    #         )
    #     else:
    #         vos_separate_inference_per_object(
    #             predictor=predictor,
    #             base_video_dir=args.base_video_dir,
    #             input_mask_dir=args.input_mask_dir,
    #             output_mask_dir=args.output_mask_dir,
    #             video_name=video_name,
    #             score_thresh=args.score_thresh,
    #             use_all_masks=args.use_all_masks,
    #             per_obj_png_file=args.per_obj_png_file,
    #             lower_gpu_memory=args.lower_gpu_memory,
    #         )  
       

    # one task per video; GPUs are assigned round-robin via n_video % device_count
    param_list = [(
        n_video,
        video_name,
        video_names,
        n_video % torch.cuda.device_count(),
        args
    ) for n_video, video_name in enumerate(video_names)]

    # 'spawn' is required so that each worker process gets its own CUDA context
    multiprocessing.set_start_method('spawn')
    num_threads = 4
    with multiprocessing.Pool(processes=num_threads) as pool:
        results = [pool.apply_async(process_sequence, param) for param in param_list]

        for result in results:
            result.wait()

Issue Details:

When running this script on 2 NVIDIA RTX 3090 GPUs with num_threads set to 4 and the small model, gpustat shows that leftover worker processes each keep holding ~338 MB of GPU memory, even after attempting to clear it with del predictor, torch.cuda.empty_cache(), and gc.collect():

[0] NVIDIA GeForce RTX 3090 | 43°C,   0 % |  4171 / 24576 MB | usr(512M) usr(2968M) usr(338M) usr(338M)
[1] NVIDIA GeForce RTX 3090 | 55°C,   0 % |  3619 / 24576 MB | usr(338M) usr(338M) usr(512M) usr(2416M)

The residual GPU memory is probably caused by the predictor object (or state created inside vos_inference / vos_separate_inference_per_object) not being fully released. Calling del predictor, torch.cuda.empty_cache(), and gc.collect() has not resolved the issue.
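For reference, the most aggressive generic cleanup I can think of looks like the sketch below (the helper name release_cuda_memory is just for illustration; torch.cuda.synchronize() and torch.cuda.ipc_collect() are standard PyTorch calls). Even this is unlikely to free everything: my understanding is that the leftover ~338 MB per process is mostly the CUDA context itself, which is only released when the process exits.

import gc
import torch

def release_cuda_memory():
    # Collect Python garbage first so tensors kept alive only by reference
    # cycles are freed, then return the caching allocator's blocks and any
    # IPC handles to the CUDA driver.
    gc.collect()
    torch.cuda.synchronize()   # wait for outstanding kernels before freeing
    torch.cuda.empty_cache()   # release cached blocks held by the allocator
    torch.cuda.ipc_collect()   # release memory kept alive for inter-process sharing

# usage inside process_sequence's finally block, after `del predictor`:
#     release_cuda_memory()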

Solution Attempt

Setting maxtasksperchild=1 ensures that each worker process handles only one task before it is terminated and replaced with a fresh one.

with multiprocessing.Pool(processes=num_threads, maxtasksperchild=1) as pool:

This forces each worker to exit after a single video, which reliably releases its GPU memory. The number of live processes now matches num_threads=4, and there are no leaked processes anymore:

[0] NVIDIA GeForce RTX 3090 | 51°C,  40 % |  5791 / 24576 MB | usr(3432M) usr(2350M)
[1] NVIDIA GeForce RTX 3090 | 53°C,   0 % |  5801 / 24576 MB | usr(3432M) usr(2360M)

However, this approach adds significant overhead, because every replacement worker has to re-initialize from scratch (new CUDA context, predictor rebuilt), which hurts inference speed when processing a large number of videos.
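One alternative I am considering (just a sketch, not something from this repo: the queue-based layout and the gpu_worker / run_all names are mine, and the hydra overrides plus the per-object branch are omitted for brevity) is to keep one long-lived worker per GPU that builds the predictor once and pulls video names from a shared queue, so the model and the CUDA context are reused instead of re-created for every task:

import multiprocessing as mp
import torch
# build_sam2_video_predictor and vos_inference are the same functions used in the snippet above

def gpu_worker(gpu_id, task_queue, args):
    # Build the predictor once and reuse it for every video assigned to this GPU.
    torch.cuda.set_device(gpu_id)
    predictor = build_sam2_video_predictor(
        config_file=args.sam2_cfg,
        ckpt_path=args.sam2_checkpoint,
        apply_postprocessing=args.apply_postprocessing,
    )
    while True:
        video_name = task_queue.get()
        if video_name is None:  # sentinel value: no more work
            break
        vos_inference(
            predictor=predictor,
            base_video_dir=args.base_video_dir,
            input_mask_dir=args.input_mask_dir,
            output_mask_dir=args.output_mask_dir,
            video_name=video_name,
            score_thresh=args.score_thresh,
            use_all_masks=args.use_all_masks,
            per_obj_png_file=args.per_obj_png_file,
            lower_gpu_memory=args.lower_gpu_memory,
        )

def run_all(video_names, args):
    mp.set_start_method("spawn", force=True)
    num_gpus = torch.cuda.device_count()
    task_queue = mp.Queue()
    for name in video_names:
        task_queue.put(name)
    for _ in range(num_gpus):
        task_queue.put(None)  # one sentinel per worker so each one shuts down
    workers = [
        mp.Process(target=gpu_worker, args=(gpu_id, task_queue, args))
        for gpu_id in range(num_gpus)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()

With enough VRAM headroom this could be generalized to a few workers per GPU; since each worker lives for the whole run, there is nothing to reclaim between videos beyond what the predictor allocates per sequence.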

My Questions:

  • Is there an issue with the way I'm using multiprocessing in my code? (It seems that num_threads has effectively become the number of processes per GPU rather than the total number of processes.)
  • Besides using maxtasksperchild=1, is there a way to completely clear the GPU memory used in process_sequence?

Any suggestions for fully resolving memory release in this context would be very helpful. Thanks.🙌

@EricLina EricLina changed the title Memory leakage during multi-process inference when using multiprocessing.Pool Memory leakage during multi-processes/multi-gpus inference Oct 31, 2024