
Connection stuck without any error message #84

Open
Janus-Shiau opened this issue Aug 1, 2022 · 5 comments

@Janus-Shiau

I have used W&B Local for years. Recently, the client sometimes gets stuck when a training run finishes.

There are no error or warning messages at all; the last terminal output I get on the client side is:

Terminal Messages

wandb: Synced RUN_NAME: SERVER_ADDRESS
wandb: Synced 7 W&B file(s), 58800 media file(s), 0 artifact file(s) and 3 other file(s)
wandb: Find logs at: ../artifacts/wandb/run-20220730_190417-2ql6zqpk/logs

Environment & Version

My local instance runs on Ubuntu 16.04, and the server version is 0.15.0.
My clients run on Ubuntu 16.04 or 18.04, and the wandb client version is 0.12.21.

I really enjoy using W&B Local; thank you for developing this awesome MLOps tool.
I hope this issue can be reproduced and solved soon.

@vanpelt
Contributor

vanpelt commented Aug 1, 2022

Looks like you logged 58800 media files, like images or video. That will take a long time to upload and might fill up or overwhelm your disk. You should reduce the number of media files you log, or purchase a license for a commercial version that can connect to cloud storage.
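
For example, something along these lines logs scalar metrics every step but media only every N steps (the project name, interval, and image source here are just placeholders):

import numpy as np
import wandb

run = wandb.init(project="my-project")  # placeholder project name
LOG_MEDIA_EVERY = 100  # upload media on every 100th step instead of every step

for step in range(1000):
    metrics = {"loss": 1.0 / (step + 1)}  # scalars are cheap to log every step
    if step % LOG_MEDIA_EVERY == 0:
        img = np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8)  # stand-in image
        metrics["example"] = wandb.Image(img)  # each wandb.Image becomes one media file
    wandb.log(metrics, step=step)

run.finish()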

@Janus-Shiau
Author

The total size of these media files is not large, only about 75 MB.

The synchronization also gets stuck on runs without any media files.

@vanpelt
Contributor

vanpelt commented Aug 5, 2022

You can find details about what our process is doing by looking at the wandb/debug-internal.log file, relative to your script. We would need to see that to understand what's making the process stall.
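
If you just want the tail of it, something like this prints the last lines (the path is relative to where your script ran; adjust it if you set a custom wandb dir):

from pathlib import Path

# wandb/debug-internal.log points at the internal-process log of the
# most recent run; per-run copies live under wandb/run-*/logs/.
log_path = Path("wandb") / "debug-internal.log"
for line in log_path.read_text().splitlines()[-50:]:
    print(line)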

@Janus-Shiau
Author

This is the log from wandb/debug-internal.log. I copied the INFO and DEBUG entries right after the last scan save was logged.

2022-08-16 12:36:41,934 INFO    SenderThread:4422 [sender.py:transition_state():459] send defer: 8
2022-08-16 12:36:41,935 DEBUG   SenderThread:4422 [sender.py:send_request():316] send_request: poll_exit
2022-08-16 12:36:41,941 DEBUG   HandlerThread:4422 [handler.py:handle_request():141] handle_request: defer
2022-08-16 12:36:41,941 INFO    HandlerThread:4422 [handler.py:handle_request_defer():164] handle defer: 8
2022-08-16 12:36:41,942 DEBUG   SenderThread:4422 [sender.py:send_request():316] send_request: defer
2022-08-16 12:36:41,942 INFO    SenderThread:4422 [sender.py:send_request_defer():455] handle sender defer: 8
2022-08-16 12:36:41,942 INFO    SenderThread:4422 [file_pusher.py:finish():171] shutting down file pusher
2022-08-16 12:36:42,044 DEBUG   HandlerThread:4422 [handler.py:handle_request():141] handle_request: poll_exit
2022-08-16 12:36:42,044 DEBUG   SenderThread:4422 [sender.py:send_request():316] send_request: poll_exit
2022-08-16 12:36:42,142 INFO    Thread-11 :4422 [sender.py:transition_state():459] send defer: 9
2022-08-16 12:36:42,143 DEBUG   HandlerThread:4422 [handler.py:handle_request():141] handle_request: defer
2022-08-16 12:36:42,143 INFO    HandlerThread:4422 [handler.py:handle_request_defer():164] handle defer: 9
2022-08-16 12:36:42,143 DEBUG   SenderThread:4422 [sender.py:send_request():316] send_request: defer
2022-08-16 12:36:42,143 INFO    SenderThread:4422 [sender.py:send_request_defer():455] handle sender defer: 9
2022-08-16 12:36:42,154 DEBUG   HandlerThread:4422 [handler.py:handle_request():141] handle_request: poll_exit
2022-08-16 12:36:42,213 INFO    SenderThread:4422 [sender.py:transition_state():459] send defer: 10
2022-08-16 12:36:42,214 DEBUG   SenderThread:4422 [sender.py:send_request():316] send_request: poll_exit
2022-08-16 12:36:42,229 DEBUG   HandlerThread:4422 [handler.py:handle_request():141] handle_request: defer
2022-08-16 12:36:42,229 INFO    HandlerThread:4422 [handler.py:handle_request_defer():164] handle defer: 10
2022-08-16 12:36:42,229 DEBUG   SenderThread:4422 [sender.py:send_request():316] send_request: defer
2022-08-16 12:36:42,230 INFO    SenderThread:4422 [sender.py:send_request_defer():455] handle sender defer: 10
2022-08-16 12:36:42,230 INFO    SenderThread:4422 [sender.py:transition_state():459] send defer: 11
2022-08-16 12:36:42,231 DEBUG   SenderThread:4422 [sender.py:send():302] send: final
2022-08-16 12:36:42,231 DEBUG   HandlerThread:4422 [handler.py:handle_request():141] handle_request: defer
2022-08-16 12:36:42,231 DEBUG   SenderThread:4422 [sender.py:send():302] send: footer
2022-08-16 12:36:42,231 INFO    HandlerThread:4422 [handler.py:handle_request_defer():164] handle defer: 11
2022-08-16 12:36:42,232 DEBUG   SenderThread:4422 [sender.py:send_request():316] send_request: defer
2022-08-16 12:36:42,232 INFO    SenderThread:4422 [sender.py:send_request_defer():455] handle sender defer: 11
2022-08-16 12:36:42,332 DEBUG   HandlerThread:4422 [handler.py:handle_request():141] handle_request: poll_exit
2022-08-16 12:36:42,335 DEBUG   SenderThread:4422 [sender.py:send_request():316] send_request: poll_exit
2022-08-16 12:36:42,355 INFO    SenderThread:4422 [file_pusher.py:join():176] waiting for file pusher
2022-08-16 12:36:42,474 DEBUG   HandlerThread:4422 [handler.py:handle_request():141] handle_request: sampled_history
2022-08-16 12:36:42,489 DEBUG   HandlerThread:4422 [handler.py:handle_request():141] handle_request: get_summary
2022-08-16 12:36:42,491 DEBUG   HandlerThread:4422 [handler.py:handle_request():141] handle_request: shutdown
2022-08-16 12:36:42,491 INFO    HandlerThread:4422 [handler.py:finish():810] shutting down handler
2022-08-16 12:36:43,231 INFO    WriterThread:4422 [datastore.py:close():279] close: ../artifacts/wandb/run-20220815_150813-2jk5mgtn/run-2jk5mgtn.wandb
2022-08-16 12:36:43,372 INFO    SenderThread:4422 [sender.py:finish():1312] shutting down sender
2022-08-16 12:36:43,372 INFO    SenderThread:4422 [file_pusher.py:finish():171] shutting down file pusher
2022-08-16 12:36:43,372 INFO    SenderThread:4422 [file_pusher.py:join():176] waiting for file pusher
2022-08-16 12:36:48,400 INFO    MainThread:4422 [internal.py:handle_exit():80] Internal process exited
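
For reference, my training scripts end the run roughly like this (a minimal sketch with placeholder names, not the exact code):

import wandb

# "dir" matches the ../artifacts/wandb paths in the logs above.
run = wandb.init(project="my-project", dir="../artifacts")

for step in range(100):
    wandb.log({"loss": 1.0 / (step + 1)}, step=step)

run.finish()  # the stall happens during this shutdown, after the "Synced ..." lines print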

Thank you for your time; I hope this issue can be solved soon.

@Janus-Shiau
Author

I got a different message today, as follows, just for your reference.

2022-08-23 10:29:30,880 INFO    SenderThread:15179 [sender.py:transition_state():459] send defer: 8
2022-08-23 10:29:30,880 INFO    SenderThread:15179 [sender.py:finish():1312] shutting down sender
2022-08-23 10:29:30,880 INFO    SenderThread:15179 [file_pusher.py:finish():171] shutting down file pusher
2022-08-23 10:29:30,880 INFO    SenderThread:15179 [file_pusher.py:join():176] waiting for file pusher
2022-08-23 10:29:31,006 INFO    WriterThread:15179 [datastore.py:close():279] close: ../artifacts/wandb/run-20220822_185928-2l3ms0qk/run-2l3ms0qk.wandb
2022-08-23 10:29:31,452 ERROR   StreamThr :15179 [internal.py:wandb_internal():165] Thread HandlerThread:
Traceback (most recent call last):
  File "/home/jiayau/anaconda3/envs/tf-2.7/lib/python3.8/site-packages/wandb/sdk/internal/internal_util.py", line 51, in run
    self._run()
  File "/home/jiayau/anaconda3/envs/tf-2.7/lib/python3.8/site-packages/wandb/sdk/internal/internal_util.py", line 98, in _run
    record = self._input_record_q.get(timeout=1)
  File "/home/jiayau/anaconda3/envs/tf-2.7/lib/python3.8/multiprocessing/queues.py", line 111, in get
    res = self._recv_bytes()
  File "/home/jiayau/anaconda3/envs/tf-2.7/lib/python3.8/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/home/jiayau/anaconda3/envs/tf-2.7/lib/python3.8/multiprocessing/connection.py", line 414, in _recv_bytes
    buf = self._recv(4)
  File "/home/jiayau/anaconda3/envs/tf-2.7/lib/python3.8/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError
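
The EOFError above comes from the internal process losing its multiprocessing pipe during shutdown. If that pipe is the problem, would forcing the internal process to run as a thread be a reasonable experiment? Something like this (I'm not sure this is a recommended setting):

import wandb

# Experiment: run the wandb internal process in a thread instead of a
# separate process, to rule out the multiprocessing pipe from the
# traceback above. (Unsure whether this is the recommended approach.)
run = wandb.init(
    project="my-project",  # placeholder
    settings=wandb.Settings(start_method="thread"),
)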
