[WIP] Working prototype of experiment sequence #2461

Draft
wants to merge 7 commits into
base: main

Conversation

Zhylkaaa

Motivation

This PR moves logic of batching and creating jobs to launcher, so resources can be utilized better. Boosts GPU utilization significantly.

Have you read the Contributing Guidelines on pull requests?

Yes

Test Plan

Not all launchers support the new feature yet, but if this change is worth adding we will work on adapting all launchers to it.

One Optuna test still fails; I will debug it in the near future.

Related Issues and PRs

PR is the result of #2435
@Jasha10 can you please take a look

@lgtm-com
Contributor

lgtm-com bot commented Nov 10, 2022

This pull request introduces 5 alerts when merging 421293e into d88aca2 - view on LGTM.com

new alerts:

  • 2 for Unused import
  • 1 for Unused local variable
  • 1 for Module is imported with 'import' and 'import from'
  • 1 for Nested loops with same variable

@Zhylkaaa
Author

Also, I saw FR #2187, and I think that with some tricks on pickling it's possible to adapt the loky launcher to what is described there (or substitute one for the other, as they do the same thing).

@Jasha10
Collaborator

Jasha10 commented Nov 11, 2022

Thanks @Zhylkaaa. I'll give this a review shortly.

@facebook-github-bot
Contributor

Hi @Zhylkaaa!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!

@Zhylkaaa
Author

@Jasha10 I have added my take on implementing a multiprocessing launcher for Hydra (I can open a separate PR with that launcher, without the experiment sequence part).

@lgtm-com
Contributor

lgtm-com bot commented Nov 18, 2022

This pull request introduces 12 alerts when merging 587c509 into 035ffb5 - view on LGTM.com

new alerts:

  • 5 for Unused import
  • 4 for Nested loops with same variable
  • 2 for Module is imported with 'import' and 'import from'
  • 1 for Unused local variable

Heads-up: LGTM.com's PR analysis will be disabled on the 5th of December, and LGTM.com will be shut down ⏻ completely on the 16th of December 2022. Please enable GitHub code scanning, which uses the same CodeQL engine ⚙️ that powers LGTM.com. For more information, please check out our post on the GitHub blog.

@facebook-github-bot added the CLA Signed label Nov 22, 2022
@facebook-github-bot
Contributor

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

Collaborator

@Jasha10 Jasha10 left a comment

Thanks @Zhylkaaa. I'm going to push a few minor changes and will follow up with some comments / questions.

@@ -65,6 +66,7 @@ def launch(
idx = initial_job_idx + idx
lst = " ".join(filter_overrides(overrides))
log.info(f"\t#{idx} : {lst}")
+print(overrides)
Collaborator

Suggested change
-print(overrides)

@Jasha10 Jasha10 marked this pull request as draft December 5, 2022 21:43

@github-advanced-security github-advanced-security bot left a comment

CodeQL found more than 10 potential problems in the proposed changes. Check the Files changed tab for more details.

@Jasha10 Jasha10 linked an issue Dec 5, 2022 that may be closed by this pull request
@lgtm-com
Contributor

lgtm-com bot commented Dec 5, 2022

This pull request introduces 6 alerts when merging 162249f into afde761 - view on LGTM.com

new alerts:

  • 4 for Nested loops with same variable
  • 1 for Unused local variable
  • 1 for Unused import

Heads-up: LGTM.com's PR analysis will be disabled on the 5th of December, and LGTM.com will be shut down ⏻ completely on the 16th of December 2022. It looks like GitHub code scanning with CodeQL is already set up for this repo, so no further action is needed 🚀. For more information, please check out our post on the GitHub blog.

Collaborator

@Jasha10 Jasha10 left a comment

This looks good. I like your idea of delegating scheduling to the launcher.

My main concern is backwards compatibility. Facebook/Meta has a pretty strong internal requirement for backwards compat, so I don't think we can merge this unless the below issues are addressed:

Comment on lines 173 to 174
# Number of parallel workers
n_jobs: int = 2

Collaborator

This is a breaking change.

Instead of changing the API for OptunaSweeperConf, what if we call ConfigStore.store twice? We can do something like store(node=OptunaSweeperConfV2, name="optuna_v2") for the new API and store(node=OptunaSweeperConf, name="optuna") for backward compatibility.
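
The double-registration idea can be sketched without Hydra installed. The block below is a minimal illustration only: `_REGISTRY`, `store` and `resolve` are stand-ins written for this example and are not Hydra's ConfigStore, and the fields of both config classes (apart from `n_jobs: int = 2`, which appears in the diff) are assumptions.

```python
from dataclasses import dataclass
from typing import Dict, Type

# Stand-in for a config store: a plain name -> config-class registry.
# This is NOT Hydra's ConfigStore; it only mimics store(node=..., name=...).
_REGISTRY: Dict[str, Type] = {}


def store(node: Type, name: str) -> None:
    """Register a config class under a unique name."""
    if name in _REGISTRY:
        raise ValueError(f"config name already registered: {name}")
    _REGISTRY[name] = node


@dataclass
class OptunaSweeperConf:
    # Legacy schema, left untouched so existing user configs keep working.
    n_trials: int = 20  # illustrative field
    n_jobs: int = 2     # number of parallel workers (from the diff)


@dataclass
class OptunaSweeperConfV2:
    # New schema (assumed shape): batching is delegated to the launcher,
    # so this sketch drops n_jobs.
    n_trials: int = 20


store(node=OptunaSweeperConf, name="optuna")       # backward compatible
store(node=OptunaSweeperConfV2, name="optuna_v2")  # opt-in new API


def resolve(name: str):
    """Instantiate the config registered under `name` with its defaults."""
    return _REGISTRY[name]()
```

Users who never mention `optuna_v2` keep getting the old schema, which is the backward-compatibility property being asked for.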

Comment on lines 33 to 49
-self, job_overrides: Sequence[Sequence[str]], initial_job_idx: int
+self,
+job_overrides: Union[Sequence[Sequence[str]], ExperimentSequence],
+initial_job_idx: int,
Collaborator

@Jasha10 Jasha10 Dec 5, 2022

This is a breaking change. Type checkers (e.g. mypy) will complain about downstream launchers (including custom launcher plugins that users have written).

Instead of changing the API of Launcher.launch, what if we define a new method Launcher.launch_experiment_sequence? We can provide a default implementation that raises NotImplementedError.
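
A minimal sketch of that suggestion follows. The `Launcher` class here is a simplified stand-in for Hydra's plugin base class, the string return values replace real job results, and `LegacyLauncher` is a hypothetical downstream plugin; only the method names `launch` and `launch_experiment_sequence` come from the review comment.

```python
from abc import ABC, abstractmethod
from typing import Iterable, List, Sequence


class Launcher(ABC):
    """Simplified stand-in for a launcher plugin base class."""

    @abstractmethod
    def launch(
        self, job_overrides: Sequence[Sequence[str]], initial_job_idx: int
    ) -> List[str]:
        """Existing batch API; its signature stays unchanged."""

    def launch_experiment_sequence(
        self, experiment_sequence: Iterable[Sequence[str]], initial_job_idx: int
    ) -> None:
        """New opt-in API; launchers that support it override this method."""
        raise NotImplementedError(
            f"{type(self).__name__} does not support experiment sequences"
        )


class LegacyLauncher(Launcher):
    """Hypothetical third-party launcher written against the old API only."""

    def launch(self, job_overrides, initial_job_idx):
        return [
            f"#{initial_job_idx + i}: {' '.join(overrides)}"
            for i, overrides in enumerate(job_overrides)
        ]
```

Because `launch` keeps its signature, existing subclasses still type-check and run; they only fail, with an explicit error, if a sweeper asks them for the new behavior.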

@lgtm-com
Contributor

lgtm-com bot commented Dec 8, 2022

This pull request introduces 14 alerts when merging f59ec7b into c48ef19 - view on LGTM.com

new alerts:

  • 7 for Unused import
  • 4 for Nested loops with same variable
  • 2 for Module is imported with 'import' and 'import from'
  • 1 for Unused local variable

@Zhylkaaa
Author

Zhylkaaa commented Dec 8, 2022

Hi @Jasha10, I've edited this PR according to what we were talking about and it seems to work. The only issue is the ax sweeper and the aws launcher. I can't figure out what the issue is and whether it's me who caused it.
However, I wanted to ask whether it looks the way you envisioned it. If so, I will do a small refactor and clean up the cosmetic issues.

@Jasha10
Collaborator

Jasha10 commented Dec 13, 2022

Thanks @Zhylkaaa. I'll take a look shortly.

Collaborator

@Jasha10 Jasha10 left a comment

The only issue is the ax sweeper and the aws launcher. I can't figure out what the issue is and whether it's me who caused it.

No, this is not your fault. The ax sweeper is failing on the main branch too.

@@ -193,6 +193,47 @@ def test_optuna_example(with_commandline: bool, tmpdir: Path) -> None:
assert returns["best_value"] <= 2.27


+@mark.parametrize("with_commandline", (True, False))
+def test_optuna_example(with_commandline: bool, tmpdir: Path) -> None:
Collaborator

Suggested change
-def test_optuna_example(with_commandline: bool, tmpdir: Path) -> None:
+def test_optuna_v2_example(with_commandline: bool, tmpdir: Path) -> None:

This prevents name-collision with the other test_optuna_example function above.

Comment on lines 201 to 203
"example/sphere_sequence.py",
"--multirun",
"hydra.sweep.dir=" + str(tmpdir),
Collaborator

Suggested change
-"example/sphere_sequence.py",
-"--multirun",
-"hydra.sweep.dir=" + str(tmpdir),
+"example/sphere_sequence.py",
+"--multirun",
+"hydra/sweeper=optuna_v2",
+"hydra.sweep.dir=" + str(tmpdir),

Adding the override hydra/sweeper=optuna_v2 makes sure the new OptunaSweeperConfV2 gets used.

@Zhylkaaa
Author

Also, sorry @Jasha10 for not bringing this up earlier, but in optuna_v2 we actually change the way max_failure_rate works. Because we remove the notion of a batch, we treat max_failure_rate as a global fraction of failed runs, in the sense that out of n_trials, floor(n_trials * max_failure_rate) can fail without raising an error.
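
As a worked illustration of that rule (a sketch only; the function names are made up for this example and are not taken from the PR):

```python
import math


def allowed_failures(n_trials: int, max_failure_rate: float) -> int:
    """Global failure budget: floor(n_trials * max_failure_rate)."""
    return math.floor(n_trials * max_failure_rate)


def within_budget(n_failed: int, n_trials: int, max_failure_rate: float) -> bool:
    """True while the sweep as a whole has not exceeded its failure budget."""
    return n_failed <= allowed_failures(n_trials, max_failure_rate)
```

With `n_trials=20` and `max_failure_rate=0.25`, five trials may fail over the whole sweep, regardless of how those failures would have been distributed across batches under the old per-batch interpretation.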

@Jasha10
Collaborator

Jasha10 commented Dec 21, 2022

Thanks @Zhylkaaa. I'll review this shortly.

Collaborator

@Jasha10 Jasha10 left a comment

Sorry I've been slow on this.

Comment on lines +1 to 15
# Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved
Collaborator

@Jasha10 Jasha10 Dec 18, 2022

Suggested change
-# Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved
+# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved

No need to add the license to sweeper.py since sweeper.py is not otherwise modified.

Author

let me check, maybe I have added some changes and haven't committed them.

Author

yes, I originally had ExperimentSequence in sweeper file and forgot to remove license, sorry. I think this is otherwise good to go (except we can refactor multiprocessing launcher, but it would take too much time, so I think it will be next PR)

@@ -2,7 +2,7 @@
import logging
from dataclasses import dataclass
from pathlib import Path
-from typing import List, Optional, Sequence
+from typing import List, Optional, Sequence, Union
Collaborator

Suggested change
-from typing import List, Optional, Sequence, Union
+from typing import List, Optional, Sequence

@omry
Collaborator

omry commented Jan 26, 2023

Can you explain the motivation?
"This PR moves logic of batching and creating jobs to launcher, so resources can be utilized better. Boosts GPU utilization significantly."

Can you explain why moving the batching logic to the launchers as opposed to implementing it in the sweepers boosts GPU utilization?

@Zhylkaaa
Author

Hey @omry, I would like to illustrate it with an example:
suppose you are running hyperparameter optimization for DL models and your sweep config contains architectural parameters like the number of layers and/or the hidden layer size.
Now hidden layer size influences the evaluation time of the function because for DL models we typically have to perform (batch_size, hidden_size) x (hidden_size, hidden_size) matrix multiplications with ~O(hidden_size^2 * batch_size) complexity
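
To make the scaling concrete, here is a small sketch that counts the operations of one such multiplication. The factor of 2 (one multiply and one add per term) is a common convention assumed here, and `matmul_flops` is a name invented for this example.

```python
def matmul_flops(batch_size: int, hidden_size: int) -> int:
    """Approximate FLOPs of (batch_size, hidden_size) x (hidden_size, hidden_size)."""
    # Each of the batch_size * hidden_size outputs needs hidden_size multiply-adds.
    return 2 * batch_size * hidden_size * hidden_size


small = matmul_flops(batch_size=32, hidden_size=512)
large = matmul_flops(batch_size=32, hidden_size=1024)
ratio = large / small  # doubling hidden_size quadruples the cost
```

So two trials in the same batch whose hidden sizes differ by a factor of two can already differ by roughly a factor of four in per-step compute.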

Everything is fine when hidden_size values are close to each other, but for wide sweeps (with many different values to search through) we might quite possibly end up with a batch of, let's say, 8 jobs, 7 of which will finish in an hour (because of the same size or early stopping), while one job will run for the next 3 hours. The current implementation will just wait for 3 hours before returning the batch of jobs to the Optuna study and drawing the next batch, which leaves 7 executors idle waiting for 1 (in fact we saw >25% idle time for 24+ hour experiments). In the case of GPUs this is a big waste, and the idle time costs a lot in terms of infrastructure (an hour of AWS 8xV100 costs ~$18) and overall experiment duration (consider man-hours).
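
The waste in a single uneven batch can be estimated with a few lines of arithmetic. This sketch assumes one reading of the example above, namely that the slow job takes 3 hours in total; `idle_fraction` is a helper written for this illustration.

```python
from typing import Sequence


def idle_fraction(durations_h: Sequence[float]) -> float:
    """Fraction of executor time spent idle when a batch waits for its slowest job."""
    wall_clock = max(durations_h)  # the batch returns only when the slowest job ends
    busy = sum(durations_h)        # executor-hours actually spent computing
    return 1.0 - busy / (wall_clock * len(durations_h))


# 7 jobs of 1 hour each plus one 3-hour job on 8 executors.
batch = [1.0] * 7 + [3.0]
wasted = idle_fraction(batch)  # 14 idle executor-hours out of 24
```

Under these assumptions a bit more than half of the reserved executor time in that batch is idle; the >25% figure quoted above is a measured average over long experiments, not the output of this toy calculation.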

One possible way to solve this is to overdraw experiments (instead of sampling 8 jobs, let's sample 16) and hope that this smooths things out a bit. This is a viable approach, but considering that 8 points are sometimes sufficient to discard a relatively big region of the search space, the next 8 evaluated points are quite possibly outdated, so I would call this approach sub-optimal.

So we propose this solution that introduces the ExperimentSequence object that serves as a proxy between study and launcher and enables users to customize runs depending on current infrastructure (assign GPUs to jobs for example) and enables launchers to start experiments asynchronously and report results as they arrive to draw more meaningful samples. In the case of the joblib launcher, it won't make any difference, but for new and a few existing launchers this reduces overall experiment time significantly and utilizes resources better.
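
The PR's actual ExperimentSequence class is not shown in this thread, so the following is only a guess at the shape of such a proxy: an iterator that hands out configurations lazily and accepts results through an `update` method. The class name, the search space and the random sampler are all assumptions made for this sketch.

```python
import random
from typing import Dict, Iterator, List, Tuple

Config = Dict[str, int]


class RandomSearchSequence:
    """Hypothetical ExperimentSequence-like proxy between a study and a launcher."""

    def __init__(self, n_trials: int, seed: int = 0) -> None:
        self._remaining = n_trials
        self._rng = random.Random(seed)
        self.results: List[Tuple[Config, float]] = []

    def __iter__(self) -> Iterator[Config]:
        return self

    def __next__(self) -> Config:
        # Configurations are drawn one at a time, at the moment the launcher
        # asks for them. A real sampler (e.g. TPE) would condition on
        # self.results here; plain random search ignores them.
        if self._remaining == 0:
            raise StopIteration
        self._remaining -= 1
        return {"hidden_size": self._rng.choice([128, 256, 512, 1024])}

    def update(self, config: Config, value: float) -> None:
        """Called by the launcher as soon as a single job finishes."""
        self.results.append((config, value))
```

The launcher only sees "something iterable with an `update` method", which is what lets it schedule jobs asynchronously without knowing anything about the sweeper behind it.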

@omry
Collaborator

omry commented Jan 27, 2023

The Optuna Sweeper is really expecting the launcher to be asynchronous.
In principle, defining an interface for async launching support feels like a more productive course of action here.

Something along these lines:

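class AsyncLauncher(Plugin):
    def submit(self, job_overrides: Sequence[str]) -> int: ...  # returns a job id
    def wait(self, job_id: int) -> JobReturn: ...  # named wait() because await is a reserved word in Python
    def cancel(self, job_id: int) -> None: ...
    def wait_all(self) -> None: ...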

Synchronous launchers could be implemented in terms of asynchronous operations.
I am personally no longer involved with Hydra, but I am willing to review such a diff and help get it landed.
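
The remark that synchronous launchers could be implemented in terms of asynchronous operations can be sketched with the standard library. This is a toy built on a thread pool, not a Hydra plugin: the class name is invented, results are plain floats instead of job-return objects, and the methods are called `wait` and `wait_all` because `await` cannot be used as a method name in Python.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict, List, Sequence


class ThreadAsyncLauncher:
    """Toy async launcher exposing submit / wait / cancel / wait_all."""

    def __init__(
        self, task: Callable[[Sequence[str]], float], max_workers: int = 2
    ) -> None:
        self._task = task
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._jobs: Dict[int, Future] = {}

    def submit(self, job_overrides: Sequence[str]) -> int:
        job_id = len(self._jobs)
        self._jobs[job_id] = self._pool.submit(self._task, job_overrides)
        return job_id

    def wait(self, job_id: int) -> float:
        return self._jobs[job_id].result()  # blocks until this one job is done

    def cancel(self, job_id: int) -> bool:
        return self._jobs[job_id].cancel()  # succeeds only if the job has not started

    def wait_all(self) -> List[float]:
        return [f.result() for f in self._jobs.values() if not f.cancelled()]

    def launch(self, job_overrides: Sequence[Sequence[str]]) -> List[float]:
        """Synchronous batch API expressed through the async primitives."""
        job_ids = [self.submit(overrides) for overrides in job_overrides]
        return [self.wait(job_id) for job_id in job_ids]
```

In this design the sweeper would own the scheduling loop (deciding when to submit and which job to wait for), which is the point of disagreement in the rest of the thread.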

@Zhylkaaa
Author

Zhylkaaa commented Jan 27, 2023

Thanks for the feedback @omry.
So, you propose to make launchers asynchronous and move all the scheduling and awaiting logic to the sweeper? We were actually trying to avoid this, in order to separate sweeper and launcher functionality as much as possible.
I think this removes room for user customization (inheriting from ExperimentSequence and keeping track of GPUs, for example)?
I am not sure how much time it would take to rewrite it all again :)

EDIT:
@Jasha10 do you have any thoughts?

@jbaczek
Contributor

jbaczek commented Feb 8, 2023

Hi @omry,
The problem that we wanted to solve is how to properly encapsulate the mechanisms of sweeping and launching. The Optuna sweeper requires feedback from the jobs to schedule the next experiments (it performs TPE optimization). Thus, it produces batches of experiments. Launchers in their current form consume the batch, launch the experiments and return the batch of results.

As @Zhylkaaa mentioned, the batch can be extremely uneven, leading to wasted resources. The proposed solution is meant to solve this feedback loop problem for uneven batches. We don't expect a GPU ordinal to influence the final accuracy of a model, so we don't think that scheduling should take place in a sweeper. Also, we shouldn't expect sweeper developers to take a variety of different system architectures into consideration while developing these plugins.
We don't want to defer scheduling solely to launchers, because it leads to wasted resources.

So this PR is meant to provide a further abstraction for scheduling/feedback loop, which we believe should be a layer between the sweeper and the launcher. Launchers are asynchronous right now and we don't want to mess with them too much. We discussed this approach with @Jasha10 , and came to the conclusion that our solution is decent enough. We probably won't have any more time to start with this from the ground up.

@omry
Collaborator

omry commented Feb 10, 2023

  1. I understand that you are looking for the simplest solution to your problem. However, introducing this solution would make subsequent improvements more difficult (the fact that you treat the Sequence as a kind of extension point makes this more obvious). As I am no longer actively working on this project, I am not in a position to accept or reject this PR. In my opinion it's not a great idea, because it's not a complete solution and it will make subsequent fixes harder.

  2. Can you tell me how this solution works when the workers are in a different process or even machine than the sweeping process, hidden behind a particular Launcher implementation (For example they could be running on AWS instances via the Ray Launcher)?

@Zhylkaaa
Author

Zhylkaaa commented Feb 10, 2023

Hi @omry,

  1. Well, I would argue that it took some time to arrive at this solution... and I am interested in how you envision the complete solution.
  2. In exactly the same way as before, because the current Hydra implementation of the Ray (AWS) launcher is inherently batched, and I don't see an easy way to decouple it enough to report results as they arrive. I am sure there is a way, though, if someone is willing to incorporate the Sequence behavior into the Ray launcher (maybe @Jasha10 has some insight into how it works and can comment on that). It's much easier to explain for the submitit launcher, which also submits jobs to other nodes on the cluster:
    The current implementation does something like this:
    return [j.results()[0] for j in jobs]  # results() is blocking, so we wait for the whole batch of jobs
    Now if we decouple that into
unfinished_jobs = jobs
while unfinished_jobs:
    finished_jobs, unfinished_jobs = wait_for_first(unfinished_jobs)
    experiment_sequence.update([finished_job.results() for finished_job in finished_jobs])

^ This will report results immediately as they come. And you also can launch new jobs with configurations sampled from updated study (with new results taken into account) by additionally writing something like:

unfinished_jobs = jobs
while unfinished_jobs:
    finished_jobs, unfinished_jobs = wait_for_first(unfinished_jobs)
    experiment_sequence.update([finished_job.results() for finished_job in finished_jobs])
    # Refill the free slots. range() comes first in zip() so that no configuration
    # is drawn from experiment_sequence (and lost) once the slots are full.
    for _, next_job_config in zip(range(batch_size - len(unfinished_jobs)), experiment_sequence):
        job = _launch_job(next_job_config)
        unfinished_jobs.append(job)
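
The refill loop above can be made concrete with the standard library alone. In this sketch `concurrent.futures.wait` with `FIRST_COMPLETED` plays the role of the hypothetical `wait_for_first`, a plain iterator stands in for the experiment sequence, and the `report` callback stands in for `experiment_sequence.update`; none of these names come from submitit or Hydra.

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from typing import Callable, Iterator, List


def run_streaming(
    configs: Iterator[float],
    run_job: Callable[[float], float],
    report: Callable[[float], None],
    n_workers: int,
) -> None:
    """Keep up to n_workers jobs in flight and report each result as it arrives."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        unfinished = set()
        exhausted = False
        while True:
            # Refill free slots; next() is only called when a slot is free,
            # so no sampled configuration is silently dropped.
            while not exhausted and len(unfinished) < n_workers:
                try:
                    config = next(configs)
                except StopIteration:
                    exhausted = True
                    break
                unfinished.add(pool.submit(run_job, config))
            if not unfinished:
                break
            finished, unfinished = wait(unfinished, return_when=FIRST_COMPLETED)
            for future in finished:
                report(future.result())


def sleepy_job(duration: float) -> float:
    """Toy job whose configuration is simply how long it runs."""
    time.sleep(duration)
    return duration


reported: List[float] = []
# One slow job and three fast ones on two workers: the fast jobs all run
# on the second worker while the slow one is still in progress.
run_streaming(iter([0.3, 0.01, 0.01, 0.01]), sleepy_job, reported.append, n_workers=2)
```

With a batched launcher the same four jobs would need two rounds, and the second round could not start until the slow job of the first round had finished.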

Probably another added benefit is that you can add a custom class for ExperimentSequence and tailor the Slurm job config so as not to over-allocate resources and to utilize cluster nodes better (this also influences how fast your tasks get scheduled). At least I think this is possible on the job config side. (@Jasha10, please correct me if you can, because I have never worked with submitit, only with CLI sbatch.)

We can add this feature to the launchers that we know how to update, but we need some kind of reassurance that this effort is worth something.
If you want to rebuild the whole launcher+sweeper paradigm that exists now, I am afraid we cannot help you with that.

# Conflicts:
#	plugins/hydra_joblib_launcher/hydra_plugins/hydra_joblib_launcher/_core.py
@Zhylkaaa
Author

Hi @Jasha10 @omry
I would like to get back to the question of asynchronous runs: I have changed the lab I am working in, but the problem of wasted time remains. Would you be interested in discussing this feature and possible solutions?
I am currently thinking about a winter project, and this seems like a very good candidate. Can you please let me know what you think?
I remember that there were different views on how this issue should be solved, and I don't really see a more backwards-compatible way than the one proposed in this PR, meaning an additional abstraction over trial suggestion that makes it appear to the launcher as just a sequence of experiments, instead of managing launcher logic in the sweeper.

I was also considering a major refactor of the multiprocessing launcher, but I am not sure whether that makes sense.
Best regards

@Jasha10
Collaborator

Jasha10 commented Nov 30, 2023

Hi @Zhylkaaa, I'm no longer working at Meta -- Sorry to say that I don't have the bandwidth to give this feature the attention that it deserves.

@Zhylkaaa
Author

Zhylkaaa commented Dec 1, 2023

Hi @Jasha10, I am sorry to hear that. Is there any chance we can get back to it in the foreseeable future?
I will maintain my fork of Hydra with this functionality and try to keep it up to date as long as I can.
In terms of functionality left to implement, I think there are 2 sweepers left as well as the Ray and Slurm launchers. I will try to work on the launchers within a month, because this should be relatively easy, and start on the sweepers as well.
Can you give your opinion on the general idea of the ExperimentSequence abstraction? If the abstraction itself is acceptable, we can work out the details and exact implementations later.

Thank you for your time.
Best regards,
Dima

@Jasha10
Collaborator

Jasha10 commented Dec 5, 2023

Can you give your opinion on the general idea of the ExperimentSequence abstraction?

I seem to recall feeling that the abstraction was acceptable last time I looked at this PR. That being said, I do not completely understand the tradeoffs around @omry's AsyncLauncher idea. I think his most recent comment is suggesting that AsyncLauncher would be harder to implement later if ExperimentSequence is introduced.

You said earlier:

So, you propose to make launchers asynchronous and move all the scheduling and awaiting logic to the sweeper? We were actually trying to avoid this, in order to separate sweeper and launcher functionality as much as possible.

I will have to think about this... I am not clear at the moment about the advantages and disadvantages of the async API.

Development

Successfully merging this pull request may close these issues.

[Feature Request] Optuna experiment stream processing
5 participants