I attempted to use multi-processing to speed up inference but ran into memory leakage issues: when a task finishes, its memory is not fully reclaimed, which leaves lingering processes and unused GPU memory allocations. Here's a snippet of my code:
Code
import multiprocessing
import gc

import torch
from sam2.build_sam import build_sam2_video_predictor

# ... (argument parsing, vos_inference, vos_separate_inference_per_object, etc.)

def process_sequence(n_video, video_name, video_names, gpu_id, args):
    try:
        print(f"\n{n_video + 1}/{len(video_names)} - running on {video_name} with GPU {gpu_id}")
        torch.cuda.set_device(gpu_id)
        # if we use per-object PNG files, they could possibly overlap in inputs and outputs
        hydra_overrides_extra = [
            "++model.non_overlap_masks=" + ("false" if args.per_obj_png_file else "true")
        ]
        predictor = build_sam2_video_predictor(
            config_file=args.sam2_cfg,
            ckpt_path=args.sam2_checkpoint,
            apply_postprocessing=args.apply_postprocessing,
            hydra_overrides_extra=hydra_overrides_extra,
        )
        if not args.track_object_appearing_later_in_video:
            vos_inference(
                predictor=predictor,
                base_video_dir=args.base_video_dir,
                input_mask_dir=args.input_mask_dir,
                output_mask_dir=args.output_mask_dir,
                video_name=video_name,
                score_thresh=args.score_thresh,
                use_all_masks=args.use_all_masks,
                per_obj_png_file=args.per_obj_png_file,
                lower_gpu_memory=args.lower_gpu_memory,
            )
        else:
            vos_separate_inference_per_object(
                predictor=predictor,
                base_video_dir=args.base_video_dir,
                input_mask_dir=args.input_mask_dir,
                output_mask_dir=args.output_mask_dir,
                video_name=video_name,
                score_thresh=args.score_thresh,
                use_all_masks=args.use_all_masks,
                per_obj_png_file=args.per_obj_png_file,
                lower_gpu_memory=args.lower_gpu_memory,
            )
    finally:
        # try to release the model and any cached CUDA allocations before the worker takes the next task
        del predictor
        torch.cuda.empty_cache()
        gc.collect()
def main():
    # ... (argument parsing that builds `args` and the list `video_names`)

    # previous single-process loop, kept for reference:
    # for n_video, video_name in enumerate(video_names):
    #     print(f"\n{n_video + 1}/{len(video_names)} - running on {video_name}")
    #     if not args.track_object_appearing_later_in_video:
    #         vos_inference(
    #             predictor=predictor,
    #             base_video_dir=args.base_video_dir,
    #             input_mask_dir=args.input_mask_dir,
    #             output_mask_dir=args.output_mask_dir,
    #             video_name=video_name,
    #             score_thresh=args.score_thresh,
    #             use_all_masks=args.use_all_masks,
    #             per_obj_png_file=args.per_obj_png_file,
    #             lower_gpu_memory=args.lower_gpu_memory,
    #         )
    #     else:
    #         vos_separate_inference_per_object(
    #             predictor=predictor,
    #             base_video_dir=args.base_video_dir,
    #             input_mask_dir=args.input_mask_dir,
    #             output_mask_dir=args.output_mask_dir,
    #             video_name=video_name,
    #             score_thresh=args.score_thresh,
    #             use_all_masks=args.use_all_masks,
    #             per_obj_png_file=args.per_obj_png_file,
    #             lower_gpu_memory=args.lower_gpu_memory,
    #         )

    # one task per video; GPUs are assigned round-robin by video index
    param_list = [
        (
            n_video,
            video_name,
            video_names,
            n_video % torch.cuda.device_count(),
            args,
        )
        for n_video, video_name in enumerate(video_names)
    ]
    multiprocessing.set_start_method('spawn')
    num_threads = 4
    with multiprocessing.Pool(processes=num_threads) as pool:
        results = [pool.apply_async(process_sequence, param) for param in param_list]
        for result in results:
            result.wait()


if __name__ == "__main__":
    main()
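One side note on the result-collection loop above: AsyncResult.wait() only blocks until a task finishes and silently discards any exception raised inside the worker, whereas AsyncResult.get() re-raises it in the parent process. A minimal variant (same pool and param_list as above) that surfaces worker failures:

    results = [pool.apply_async(process_sequence, param) for param in param_list]
    for result in results:
        try:
            result.get()  # re-raises exceptions from the worker, unlike wait()
        except Exception as exc:
            print(f"worker task failed: {exc}")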
Issue Details:
When running this script on 2 NVIDIA RTX 3090 GPUs with num_threads = 4 and the small-size model, gpustat shows that each worker process keeps a residual GPU memory allocation of at least ~338 MB, even after attempting to clear memory with del predictor, torch.cuda.empty_cache(), and gc.collect().
The residual GPU memory may be due to the predictor object (or state created inside vos_inference / vos_separate_inference_per_object) not being fully released, but del predictor, torch.cuda.empty_cache(), and gc.collect() have not resolved the issue.
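One way to narrow down where the ~338 MB lives (a diagnostic sketch with a hypothetical helper, not part of the original script) is to print what the PyTorch caching allocator still holds at the end of process_sequence, after gc.collect():

    def report_gpu_memory(tag=""):
        # live tensor memory vs. memory still cached by PyTorch's allocator, in MiB
        allocated = torch.cuda.memory_allocated() / 1024**2
        reserved = torch.cuda.memory_reserved() / 1024**2
        print(f"[{tag}] allocated={allocated:.1f} MiB, reserved={reserved:.1f} MiB")

If both counters are near zero but gpustat still reports ~338 MB for the process, the remainder is presumably not leaked tensors but the per-process CUDA context, which torch.cuda.empty_cache() cannot release; it is only freed when the process exits.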
Solution Attempt
Setting maxtasksperchild=1 ensures that each worker process handles only one task before it is terminated and replaced with a new one.
with multiprocessing.Pool(processes=num_threads, maxtasksperchild=1) as pool:
This helps with memory release by forcing the worker process to terminate, which does clear its GPU memory completely. The number of live processes then matches num_threads = 4, and there are no leaked processes holding GPU memory anymore.
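For context, this is how the flag slots into the pool setup from the snippet above (same param_list and process_sequence, nothing else changes):

    multiprocessing.set_start_method('spawn')
    num_threads = 4
    # each worker exits after one video, so its CUDA context is torn down along with it
    with multiprocessing.Pool(processes=num_threads, maxtasksperchild=1) as pool:
        results = [pool.apply_async(process_sequence, param) for param in param_list]
        for result in results:
            result.wait()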
However, this approach can lead to significant overhead due to constant process reinitialization, which impacts inference speed when dealing with a large number of videos.
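One pattern I have been considering to avoid the per-task restart cost (only a sketch, with hypothetical helpers gpu_worker / run_with_persistent_workers, not something I have validated) is to start one long-lived process per GPU and feed it videos through a queue, so each process builds the predictor once and reuses it until the queue is drained:

    def gpu_worker(gpu_id, task_queue, args):
        # one process per GPU, one predictor per process
        torch.cuda.set_device(gpu_id)
        predictor = build_sam2_video_predictor(
            config_file=args.sam2_cfg,
            ckpt_path=args.sam2_checkpoint,
            apply_postprocessing=args.apply_postprocessing,
        )
        while True:
            item = task_queue.get()
            if item is None:  # sentinel: no more videos
                break
            n_video, video_name = item
            # the same vos_inference / vos_separate_inference_per_object branching
            # as in process_sequence would apply here; only one branch is shown
            vos_inference(
                predictor=predictor,
                base_video_dir=args.base_video_dir,
                input_mask_dir=args.input_mask_dir,
                output_mask_dir=args.output_mask_dir,
                video_name=video_name,
                score_thresh=args.score_thresh,
                use_all_masks=args.use_all_masks,
                per_obj_png_file=args.per_obj_png_file,
                lower_gpu_memory=args.lower_gpu_memory,
            )

    def run_with_persistent_workers(video_names, args):
        num_gpus = torch.cuda.device_count()
        task_queue = multiprocessing.Queue()
        for item in enumerate(video_names):
            task_queue.put(item)
        for _ in range(num_gpus):
            task_queue.put(None)  # one sentinel per worker
        workers = [
            multiprocessing.Process(target=gpu_worker, args=(gpu_id, task_queue, args))
            for gpu_id in range(num_gpus)
        ]
        for w in workers:
            w.start()
        for w in workers:
            w.join()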
My Questions:
Is there an issue with the way I'm using multiprocessing in my code? (It seems that processes=num_threads has effectively become the number of processes per GPU rather than the total number of processes.)
Besides using maxtasksperchild=1, is there a way to completely clear the GPU memory used in process_sequence?
Any suggestions for fully releasing the GPU memory in this context would be very helpful. Thanks! 🙌