Currently, if a user starts BEE on one front end of an arbitrary HPC cluster, let's call it clusterfe1, and then tries to use beeflow commands on another front end of the same cluster, the commands will fail, since processes on different front ends usually can't communicate with each other.
If the user tries to run a beeflow core command, they'll get the error: Cannot connect to the beeflow daemon, is it running? Check the log at ".beeflow/logs/beeflow.log".
If the user tries to use any beeflow commands, they'll get a message like Submit: Could not reach WF Manager.
We should add a check in core.py and client.py to make sure the host that the user is running on is the same as the host beeflow is currently running on.
One big issue is that there isn't currently a clean way to get this information.
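Whatever mechanism we use to look the hostname up, the check itself would be simple. Here's a minimal sketch; the get_beeflow_hostname callable is hypothetical and stands in for whichever lookup we end up choosing from the options below:

```python
import socket
import sys


def check_same_host(get_beeflow_hostname):
    """Bail out if this command isn't running on beeflow's front end.

    `get_beeflow_hostname` is a hypothetical callable standing in for
    whichever lookup mechanism we pick (see the options below); it should
    return the front end's hostname, or None if it can't be determined.
    """
    beeflow_host = get_beeflow_hostname()
    current_host = socket.gethostname()
    if beeflow_host is not None and beeflow_host != current_host:
        sys.exit(f'beeflow was started on "{beeflow_host}" but this command '
                 f'is running on "{current_host}"; please log in to '
                 f'"{beeflow_host}" and retry.')
```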
We have a couple of options:
Option 1: The beeflow log at .beeflow/logs/beeflow.log records the front end on which beeflow was last started, in this format:
Running on cluster-fe1
Launching components in order: ['redis', 'scheduler', 'celery', 'slurmrestd', 'wf_manager', 'task_manager']
We could grep the last Running message out of the log (and verify there wasn't a Kill operation afterwards) to get this info, as in the sketch below. This approach is kind of brittle, though: it could break any time we make changes to the beeflow log.
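Something along these lines, assuming the "Running on" wording from the excerpt above and some kind of kill/shutdown entry in the log (the exact wording of that entry is an assumption here, which is part of the brittleness):

```python
import re
from pathlib import Path

# Path as reported in the client error message; assumed relative to $HOME.
LOG_PATH = Path.home() / '.beeflow/logs/beeflow.log'


def hostname_from_log():
    """Return the front end from the last 'Running on ...' line, or None.

    Returns None if the log doesn't exist or if a shutdown entry appears
    after the last start message.
    """
    if not LOG_PATH.exists():
        return None
    host = None
    for line in LOG_PATH.read_text().splitlines():
        match = re.search(r'Running on (\S+)', line)
        if match:
            host = match.group(1)
        elif 'Kill' in line:  # assumed marker for a beeflow shutdown entry
            host = None
    return host
```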
Option 2: Alternatively, we could add the hostname where beeflow is running to the workflow DB and read that information back in the beeflow client. Currently we only use the workflow DB in the wf_manager, so this would add another piece of code that depends on it, which breaks our modularity somewhat. Another issue is that this won't work if we ever let a client run on a separate system from the one where the workflow manager is running; but a remote client wouldn't be affected by this problem anyway, so we'd just need to skip the check when connecting to the workflow manager from another machine.
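If the workflow DB is SQLite, option 2 could look something like this; the beeflow_info table and its schema are made up for illustration, and the real code would live alongside the existing workflow DB layer:

```python
import socket
import sqlite3


def record_hostname(db_path):
    """Startup side: store the front end that beeflow is running on.

    The `beeflow_info` table is hypothetical; the real workflow DB code
    would define it alongside its existing tables.
    """
    with sqlite3.connect(db_path) as conn:
        conn.execute('CREATE TABLE IF NOT EXISTS beeflow_info (hostname TEXT)')
        conn.execute('DELETE FROM beeflow_info')  # keep a single row
        conn.execute('INSERT INTO beeflow_info VALUES (?)',
                     (socket.gethostname(),))


def recorded_hostname(db_path):
    """Client side: return the stored front end, or None if not recorded."""
    with sqlite3.connect(db_path) as conn:
        try:
            row = conn.execute('SELECT hostname FROM beeflow_info').fetchone()
        except sqlite3.OperationalError:  # table doesn't exist yet
            return None
    return row[0] if row else None
```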
I think option 2 is the best solution at the moment.
@kchilleri When you check for the location that beeflow is starting from, can you also check whether the environment variable SLURMD_NODENAME exists? If it does, print it with a warning that the user is on a compute node, and don't allow beeflow to start.
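A possible sketch of that guard, assuming the presence of SLURMD_NODENAME is a reliable indicator that we're inside a Slurm job on a compute node:

```python
import os
import sys


def refuse_to_start_on_compute_node():
    """Refuse to start beeflow when running on a Slurm compute node.

    Slurm sets SLURMD_NODENAME in job environments, so if it's set we're
    almost certainly on a compute node rather than a front end.
    """
    nodename = os.environ.get('SLURMD_NODENAME')
    if nodename is not None:
        sys.exit(f'SLURMD_NODENAME is set to "{nodename}"; it looks like you '
                 'are on a compute node. beeflow must be started from a '
                 'front end, so refusing to start.')
```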