Help connecting to existing KubeCluster using the built-in Discovery Mechanism #255

Open
jerrygb opened this issue Jan 10, 2023 · 2 comments

Comments

@jerrygb

jerrygb commented Jan 10, 2023

Describe the issue:

I am able to create clusters, connect using Dask clients, and perform Dask operations without issues using the KubeCluster operator from a notebook. I am also able to reach the status dashboard by port-forwarding to the scheduler.

However, I am not able to connect to these clusters when using the lab extension. When I move to an active notebook and click search in the Dask lab extension, it does pick up a remote cluster address. The dashboard URLs picked up by the extension code look like:

http://internal-scheduler.namespace:8787/

But the extension does not seem to be able to connect to it, and I do not see any logs pertaining to this action.

Do these dashboards need to be externally reachable (i.e., are these connections made from the browser or from a backend service)?
Since I was not sure about this, I set up an AWS NLB and tried connecting to the NLB address using the Client, as shown in the second snippet below.
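
For context, this is the kind of check I would expect the backend to be able to make if the connection happens server-side (a hypothetical snippet using requests, not something the extension itself runs):

# Hypothetical reachability check from the notebook/backend side, not from the browser
import requests

resp = requests.get("http://internal-scheduler.namespace:8787/status", timeout=5)
print(resp.status_code)  # 200 would mean the in-cluster dashboard address is reachable from the backend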

Minimal Complete Verifiable Example:

All of the following code snippets work fine from the notebook.

# Create a cluster
from dask_kubernetes.operator import make_cluster_spec, make_worker_spec
from dask_kubernetes.operator import KubeCluster
from dask.distributed import Client
import dask.dataframe as dd
import os

profile_name = namespace_name  # namespace_name is defined earlier in the notebook

custom_spec = make_cluster_spec(
    name=profile_name,
    image='ghcr.io/dask/dask:latest',
    resources={"requests": {"memory": "512Mi"}, "limits": {"cpu": "4", "memory": "8Gi"}},
)

custom_spec['spec']['scheduler']['spec']['serviceAccount'] = 'default-editor'
custom_spec['spec']['worker']['spec']['serviceAccount'] = 'default-editor'

custom_worker_spec = make_worker_spec(
    image='ghcr.io/dask/dask:latest',
    n_workers=6,
    resources={"requests": {"memory": "512Mi"}, "limits": {"memory": "12Gi"}},
)
custom_worker_spec['spec']['serviceAccount'] = 'default-editor'
custom_worker_spec

cluster = KubeCluster(custom_cluster_spec=custom_spec, n_workers=0)
cluster.add_worker_group(name='highmem', custom_spec=custom_worker_spec)
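
For completeness, this is roughly how the cluster is used from the same notebook (everything here works fine; the small dataframe is just an illustrative workload):

# Connect a client to the cluster created above and run a trivial computation
from dask.distributed import Client
import dask.dataframe as dd
import pandas as pd

client = Client(cluster)
print(cluster.dashboard_link)  # the same URL the lab extension later picks up

df = dd.from_pandas(pd.DataFrame({"x": range(100)}), npartitions=4)
print(df.x.sum().compute())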

As mentioned, assume I have an AWS NLB-type LoadBalancer/Ingress service in front of the scheduler. The Dask Client is then able to interact with ports 8786 and 8787 on the scheduler to manage workers and jobs externally.

# Connect to external endpoint works fine
import dask; from dask.distributed import Client
dask.config.set({'scheduler-address': 'tcp://nlb-address.region.elb.amazonaws.com:8786'})
client = Client()
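
Passing the address to the Client directly works as well (same NLB address as above, shown only for reference):

# Equivalent to the config-based approach above
from dask.distributed import Client

client = Client("tcp://nlb-address.region.elb.amazonaws.com:8786")
print(client.dashboard_link)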

Anything else we need to know?:

Another thing I noticed is that the dask-labextension relies on the testDaskDashboard function to pick up the URL info (defined in https://github.com/dask/dask-labextension/blob/main/src/dashboard.tsx#L588).

In the console, I can see,

Found the dashboard link at 'http://internal-scheduler.namespace:8787/status'

However, the subsequent dashboard-check request to the backend is missing a / in the protocol.

See the GET request below,

GET https://website/notebook/internal/test-dask-1/dask/dashboard-check/http%3A%2Finternal-scheduler.namespace%3A8787%2F?1673363416491

To be a bit more verbose,

http%3A%2Finternal-scheduler.namespace%3A8787%2F?1673363416491 decodes to http:/internal-scheduler.namespace:8787/?1673363416491, i.e. the // after the scheme has been collapsed into a single /.
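
A minimal sketch that reproduces the decoding above (plain urllib, just to show that a correctly encoded URL should contain two %2F after the scheme):

from urllib.parse import quote, unquote

dashboard = "http://internal-scheduler.namespace:8787/"
print(quote(dashboard, safe=""))
# -> http%3A%2F%2Finternal-scheduler.namespace%3A8787%2F  (two %2F after the scheme)

print(unquote("http%3A%2Finternal-scheduler.namespace%3A8787%2F"))
# -> http:/internal-scheduler.namespace:8787/  (only one slash, as seen in the request)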

I am not sure if this is expected or a bug.

Environment:

  • Dask version: 2022.12.1
  • Dask Kubernetes: 2022.12.0
  • @dask-labextension: v6.0.0
  • @jupyterlab/server-proxy: v3.2.2
  • Python version: 3.8.10
  • Platform: Kubeflow
  • Install method (conda, pip, source): pip
@thedeg123

I'm having the same issue. I can connect to the dashboard fine from the notebook, but not from the lab extension. Looking at the failed network request I see ERR_NAME_NOT_RESOLVED for a GET request to my-dask-scheduler.namespace/cluster-map. Did you ever solve this issue?

@jacobtomlinson
Member

You will need to configure the dashboard address to use the Jupyter proxy; this varies between setups, so it's hard for us to set a sane default.

If you create the cluster with KubeCluster then the dashboard port will be proxied to the node where Jupyter is running. You likely need to set DASK_DISTRIBUTED__DASHBOARD__LINK="proxy/{host}:{port}/status" for this to work correctly, but this will vary depending on how you've configured your Jupyter environment.
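
For example, a minimal sketch of setting this from Python before creating the cluster (the proxy/ mount point is an assumption and depends on how jupyter-server-proxy is configured in your setup):

import dask

# Equivalent to the DASK_DISTRIBUTED__DASHBOARD__LINK environment variable above;
# {host} and {port} are filled in by distributed when it renders the dashboard link.
dask.config.set({"distributed.dashboard.link": "proxy/{host}:{port}/status"})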
