Add Check for Different Front End #883

Open
rstyd opened this issue Jul 18, 2024 · 2 comments · May be fixed by #933
Labels: bug, High Priority

Comments

rstyd (Collaborator) commented Jul 18, 2024

Currently, if a user starts BEE on one front end of an HPC cluster (call it clusterfe1) and then tries to use beeflow commands on another front end of the same cluster, the commands will fail, since processes on different front ends usually can't communicate with each other.

If the user runs a beeflow core command, they'll get this error:
Cannot connect to the beeflow daemon, is it running? Check the log at ".beeflow/logs/beeflow.log".

If the user runs any other beeflow command, they'll get a message like:
Submit: Could not reach WF Manager.

We should add a check in core.py and client.py to make sure the host the user is running on is the same as the host beeflow is currently running on.
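A minimal sketch of what that comparison could look like, assuming the recorded host name is already available from somewhere. The function names `short_hostname` and `check_front_end` are hypothetical and not part of beeflow; comparing only the part before the first dot is also an assumption, made so that cluster-fe1 and a fully qualified form of the same name are treated as equal.

```python
import socket
from typing import Optional, Tuple


def short_hostname(name: str) -> str:
    """Return the part of a host name before the first dot, lowercased."""
    return name.split(".", 1)[0].lower()


def check_front_end(recorded_host: str,
                    current_host: Optional[str] = None) -> Tuple[bool, str]:
    """Compare the host beeflow was started on with the current host.

    Returns (True, "") when they match, otherwise (False, message).
    Hypothetical helper: how recorded_host is obtained is left open.
    """
    if current_host is None:
        current_host = socket.gethostname()
    if short_hostname(recorded_host) == short_hostname(current_host):
        return True, ""
    message = (
        f"beeflow was started on {recorded_host}, but this command is "
        f"running on {current_host}. Log in to {recorded_host} and try again."
    )
    return False, message
```

Both core.py and client.py could call something like this before trying to connect, and print the message instead of the generic connection error.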

One big issue is that there isn't currently a clean way to get this information.

We have several options:

  1. The beeflow log at .beeflow/logs/beeflow.log records the front end on which beeflow was last started, in this format:
Running on cluster-fe1
Launching components in order: ['redis', 'scheduler', 'celery', 'slurmrestd', 'wf_manager', 'task_manager']

We could grep the last Running message out of the log (and verify there wasn't a Kill operation afterwards) to get this info. This is brittle, though: it could break whenever we change the beeflow log.

  2. Alternatively, we could add the hostname where beeflow is running to the workflow DB and read it from the beeflow client. Currently, only the wf_manager uses the workflow DB, so this would add another piece of code that depends on it, which breaks our modularity somewhat. Another issue is that this won't work if we later allow a client to run on a separate system from the one where the workflow manager is running. That situation wouldn't be affected by this problem, though, so we'd just need to skip the check when connecting to the workflow manager from another machine.
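For reference, the log-scanning approach from the first option might look roughly like the sketch below. It only relies on the "Running on" line shown above. The text that marks a Kill operation in the log is not given in this issue, so `stop_marker` is a caller-supplied assumption, and the function name `last_running_host` is hypothetical.

```python
import re
from typing import Iterable, Optional

# Matches the "Running on <host>" line quoted from the beeflow log.
RUNNING_RE = re.compile(r"Running on (\S+)")


def last_running_host(lines: Iterable[str], stop_marker: str) -> Optional[str]:
    """Return the host from the last 'Running on' line, or None.

    ASSUMPTION: stop_marker is whatever text the log writes when beeflow
    is killed; the real wording is not known here. A stop line seen after
    the last 'Running on' line clears the result.
    """
    host = None
    for line in lines:
        match = RUNNING_RE.search(line)
        if match:
            host = match.group(1)
        elif stop_marker in line:
            host = None
    return host
```

Because an open file object iterates line by line, the log file can be passed in directly. The dependence on exact log wording is what makes this option fragile.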

I think option 2 is the best solution at the moment.
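A rough sketch of the second option, assuming the workflow DB can be reached through Python's sqlite3 module (this issue does not say what backs it). The table name `bee_host` and the functions `record_host` and `read_host` are made up for illustration and do not exist in beeflow.

```python
import socket
import sqlite3
from typing import Optional


def record_host(conn: sqlite3.Connection, host: Optional[str] = None) -> None:
    """Store the host beeflow is starting on (hypothetical single-row table)."""
    if host is None:
        host = socket.gethostname()
    conn.execute(
        "CREATE TABLE IF NOT EXISTS bee_host ("
        "id INTEGER PRIMARY KEY CHECK (id = 1), hostname TEXT NOT NULL)"
    )
    # id is fixed to 1, so a restart on another front end overwrites the row.
    conn.execute(
        "INSERT OR REPLACE INTO bee_host (id, hostname) VALUES (1, ?)", (host,)
    )
    conn.commit()


def read_host(conn: sqlite3.Connection) -> Optional[str]:
    """Return the recorded host, or None if nothing has been recorded."""
    try:
        row = conn.execute(
            "SELECT hostname FROM bee_host WHERE id = 1"
        ).fetchone()
    except sqlite3.OperationalError:
        # The table does not exist yet, so beeflow has never recorded a host.
        return None
    return row[0] if row else None
```

The start-up code would call `record_host` once, and the client would call `read_host` and compare the result with its own host name before contacting the workflow manager. When the client is deliberately on another machine, the comparison would simply be skipped.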

@rstyd rstyd added the bug Something isn't working label Jul 18, 2024
@kchilleri kchilleri self-assigned this Aug 6, 2024
pagrubel (Collaborator) commented

@kchilleri When you check which host beeflow is being started from, can you also check whether the environment variable SLURMD_NODENAME exists? If it does, print it out with a warning that the user is on a compute node, and don't allow beeflow to start.
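Something along these lines might do for the compute node check. This is only a sketch: the function name `compute_node_warning` and the wording of the message are placeholders, and treating an empty SLURMD_NODENAME the same as an unset one is an assumption.

```python
import os
from typing import Mapping, Optional


def compute_node_warning(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return a warning if SLURMD_NODENAME is set, otherwise None.

    Hypothetical helper: env defaults to os.environ and can be replaced
    with a plain dict for testing.
    """
    if env is None:
        env = os.environ
    node = env.get("SLURMD_NODENAME")
    if not node:
        return None
    return (
        f"Warning: SLURMD_NODENAME is set to {node}, so this looks like a "
        "compute node. beeflow must be started from a front end."
    )
```

At start-up, a non-None result would be printed and followed by a non-zero exit, so beeflow never launches its components on the compute node.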

kchilleri (Collaborator) commented Oct 1, 2024

> @kchilleri When you check for the location that beeflow is starting from can you also check if the environment variable SLURMD_NODENAME exists and print it out with a warning that they are on a compute node and not allow beeflow to start.

Issue #932 addresses this.

@kchilleri kchilleri linked a pull request Oct 1, 2024 that will close this issue