
[Feature Request] implement MultiprocessingLauncher #2187

Open
Jasha10 opened this issue May 2, 2022 · 6 comments
Labels
enhancement (Enhancement request), launchers

Comments

@Jasha10
Collaborator

Jasha10 commented May 2, 2022

We might consider implementing a launcher that employs Python's multiprocessing module (or some other simple mechanism from Python's standard library) for launching jobs in separate processes. This would be useful in multirun mode for the following reasons:

  • concurrency: such a launcher would enable process parallelism (unlike Hydra's BasicLauncher, which runs jobs sequentially)
  • isolation: running each job in a separate process would prevent state from leaking between jobs (unlike BasicLauncher, which runs all jobs in the same process)

Concurrency and isolation are already possible via e.g. the Joblib launcher (as pointed out in discussion #2186) or the other advanced launchers (RayLauncher, SubmititLauncher, RQLauncher). As I see it, the main advantages of a MultiprocessingLauncher over something like JoblibLauncher would be (1) simplicity and (2) less reliance on third-party packages (joblib).
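
A minimal sketch of what such a launcher's core could look like (the names `launch_isolated` and `_run_job` are hypothetical; a real implementation would subclass Hydra's Launcher plugin interface and run the task function per job):

```python
import multiprocessing as mp

def _run_job(overrides):
    # Stand-in for the real per-job entry point; a real launcher would
    # compose a job config from `overrides` and run the task function here.
    return f"ran job with overrides={overrides}"

def launch_isolated(job_overrides, max_workers=4):
    # maxtasksperchild=1 gives every job a fresh worker process (isolation);
    # "spawn" avoids inheriting parent state via fork.
    ctx = mp.get_context("spawn")
    with ctx.Pool(processes=max_workers, maxtasksperchild=1) as pool:
        return pool.map(_run_job, job_overrides)

if __name__ == "__main__":
    print(launch_isolated([["db=mysql"], ["db=postgresql"]]))
```

The Pool provides the concurrency, and maxtasksperchild=1 provides the isolation: each job runs in a process created for it and torn down after it.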

@maxfrei750
Contributor

maxfrei750 commented May 16, 2022

I'd like to add that the current workaround (see discussion #2186) does not result in perfect run isolation, but requires jobs to be paired, due to a limitation of joblib (see joblib/joblib#1294).

@jbaczek
Contributor

jbaczek commented Nov 10, 2022

Hi, what is the status of this FR?
In my case I have some objects created by PyTorch, like the CUDA context or the caching allocator, which are destroyed only at the end of a process. According to this discussion it is impossible to destroy the CUDA context programmatically. The caching allocator and the garbage collector are hard to control, and they sometimes leave memory allocated even when torch.cuda.empty_cache() is called.
I need to run my trainings in isolation to prevent memory leaks and OOMs in subsequent trainings.
The joblib documentation claims that it can save tenths of a second per process spawn, which is irrelevant in my case.
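
To make the isolation argument concrete, here is a minimal sketch (`train_once` and `run_isolated` are hypothetical names): wrapping each training in a spawned child process guarantees that the CUDA context and allocator state die with the process, which no amount of in-process cleanup can.

```python
import multiprocessing as mp

def train_once(overrides):
    # Hypothetical training entry point. Any CUDA context or caching-
    # allocator state created here is destroyed when the child exits,
    # which torch.cuda.empty_cache() cannot guarantee in-process.
    print(f"training with {overrides}")

def run_isolated(overrides):
    ctx = mp.get_context("spawn")
    p = ctx.Process(target=train_once, args=(overrides,))
    p.start()
    p.join()
    if p.exitcode != 0:
        raise RuntimeError(f"training exited with code {p.exitcode}")

if __name__ == "__main__":
    for cfg in (["model=small"], ["model=large"]):
        run_isolated(cfg)
```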

@Jasha10
Collaborator Author

Jasha10 commented Nov 10, 2022

Hi @jbaczek,

Deallocating pytorch CUDA memory is a compelling use-case, and this is an issue I've faced as well.

I'm not sure if we'll have the bandwidth to implement this feature in time for Hydra 1.3. I'll add a tentative v1.3 milestone.

If you have time to work on this, a PR would be welcome.
I think that, since multiprocessing is part of the standard library, we can include the proposed multiprocessing_launcher in the hydra/_internal/core_plugins directory instead of creating a new folder under the plugins directory.

@Zhylkaaa

Zhylkaaa commented Nov 10, 2022

Hi @Jasha10, I was experimenting with ProcessPoolExecutor during my work on the above-mentioned PR. You can't have a purely multiprocessing-based launcher without any 3rd-party dependencies, because you need to pickle functions defined in __main__ and lambdas.
This can be achieved by patching the pickler used by multiprocessing to use dill or cloudpickle, e.g. https://stackoverflow.com/questions/19984152/what-can-multiprocessing-and-dill-do-together
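
A minimal sketch of that workaround, assuming cloudpickle is installed (the helper names `_call_pickled` and `submit` are illustrative): the parent serializes the callable with cloudpickle, so multiprocessing's standard pickler only ever sees bytes plus a module-level function.

```python
import multiprocessing as mp
import cloudpickle

def _call_pickled(payload):
    # Runs in the worker: rebuild the callable and its args from bytes.
    fn, args = cloudpickle.loads(payload)
    return fn(*args)

def submit(pool, fn, *args):
    # Only bytes cross the process boundary, so even lambdas and functions
    # defined in __main__ survive the trip.
    return pool.apply_async(_call_pickled, (cloudpickle.dumps((fn, args)),))

if __name__ == "__main__":
    with mp.get_context("spawn").Pool(2) as pool:
        result = submit(pool, lambda x: x * x, 7)
        print(result.get())  # 49
```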

EDIT: after some experiments, I think the best chance of doing this is using cloudpickle (used internally by joblib) together with multiprocessing. The only open issue is how to implement waiting for AsyncResults, but I have a few ideas. Anyway, is it OK with you to make Hydra depend on cloudpickle? We can probably write this plugin by Monday (including integration with the proposal from #2461).

EDIT2: we are testing some solutions; are you interested, @Jasha10?

@Zhylkaaa

Hi, I was experimenting with multiprocessing, its capabilities and limits, and also with our implementation of it, and I found a few issues you might want to know about before considering adding it to core (or even to the plugins).
In short:

  • you need to maintain your own version of Pool that doesn't spawn daemonic workers, which is not that bad (see the sketch below)
  • multiprocessing has known issues with handling out-of-RAM errors (or even a simple pool.apply_async(sys.exit, (1,))), which you can overcome, but at a great price
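
A sketch of the widely used "non-daemonic pool" workaround for the first point (class names are illustrative): multiprocessing.Pool marks its workers as daemons, and daemonic processes are not allowed to have children of their own, so jobs that spawn their own subprocesses need non-daemonic workers.

```python
import multiprocessing
import multiprocessing.pool

class NoDaemonProcess(multiprocessing.Process):
    # Always report daemon=False and silently ignore the pool's attempt
    # to set daemon=True on its workers.
    @property
    def daemon(self):
        return False

    @daemon.setter
    def daemon(self, value):
        pass

class NoDaemonContext(type(multiprocessing.get_context())):
    Process = NoDaemonProcess

class NoDaemonPool(multiprocessing.pool.Pool):
    # Drop-in Pool replacement whose workers may spawn children.
    def __init__(self, *args, **kwargs):
        kwargs["context"] = NoDaemonContext()
        super().__init__(*args, **kwargs)
```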

If you want more details, I can offer an online meeting. (You probably have my contact info from the CLA :))

@Jasha10
Collaborator Author

Jasha10 commented Dec 2, 2022

Hi @Zhylkaaa,

Thanks for looking into this! I sincerely apologize for the delayed reply.

after some experiments, I think the best chance of doing this is using cloudpickle (used internally by joblib)

This is consistent with the solution used by several of Hydra's other launcher plugins.

Anyway, is it OK with you to make Hydra depend on cloudpickle?

To guarantee long-term stability, I think it's best for hydra-core not to depend on cloudpickle. Currently we only depend on packages like antlr/pyyaml/omegaconf that are essential to Hydra's business logic.

We can create a new folder under hydra's plugins directory, however.

If you want more details, I can offer an online meeting. (You probably have my contact info from the CLA :))

Thanks! Unfortunately I don't have access to the CLA -- I suspect I'd need to go through Meta's legal department to get that :P
Could you please send me an email at jasha10@meta.com ?
