
dask distributed workers die #103

Open
siebrenf opened this issue Jul 5, 2021 · 6 comments

@siebrenf
Member

siebrenf commented Jul 5, 2021

Running ananse network on the develop branch, on cn106. It seems that hitting the memory limit kills and restarts the workers, rather than making them behave?

command:

nice -n 10 \
ananse network -g GRCz11.fa -a GRCz11.annotation.bed -n 12 \
-b binding_24hpf.tsv -e 24hpf_rep1.tsv 24hpf_rep2.tsv -o network_24hpf.txt

full log

log snippet (this error, and seemingly similar errors, keep repeating):

...
2021-07-05 20:29:19 | INFO | Computing network
[##########                              ] | 26% Completed | 58.7sdistributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
[##########                              ] | 26% Completed | 59.3sdistributed.nanny - WARNING - Restarting worker
[##########                              ] | 26% Completed | 60.0sdistributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
[##########                              ] | 26% Completed |  1min  0.8sdistributed.nanny - WARNING - Restarting worker
[###########                             ] | 28% Completed |  1min  9.5sdistributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
[###########                             ] | 28% Completed |  1min 10.0sdistributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
[###########                             ] | 28% Completed |  1min 10.2sdistributed.nanny - WARNING - Restarting worker
[###########                             ] | 28% Completed |  1min 10.8sdistributed.nanny - WARNING - Restarting worker
[###########                             ] | 28% Completed |  1min 12.3sdistributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
[###########                             ] | 28% Completed |  1min 13.0sdistributed.nanny - WARNING - Restarting worker
[###########                             ] | 28% Completed |  1min 13.1sdistributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:40089
Traceback (most recent call last):
  File "/vol/mbconda/siebrenf/envs/ananse/lib/python3.9/site-packages/tornado/iostream.py", line 867, in _read_to_buffer
    bytes_read = self.read_from_fd(buf)
  File "/vol/mbconda/siebrenf/envs/ananse/lib/python3.9/site-packages/tornado/iostream.py", line 1140, in read_from_fd
    return self.socket.recv_into(buf, len(buf))
ConnectionResetError: [Errno 104] Connection reset by peer

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/vol/mbconda/siebrenf/envs/ananse/lib/python3.9/site-packages/distributed/comm/core.py", line 320, in connect
    handshake = await asyncio.wait_for(comm.read(), time_left())
  File "/vol/mbconda/siebrenf/envs/ananse/lib/python3.9/asyncio/tasks.py", line 481, in wait_for
    return fut.result()
  File "/vol/mbconda/siebrenf/envs/ananse/lib/python3.9/site-packages/distributed/comm/tcp.py", line 206, in read
    convert_stream_closed_error(self, e)
  File "/vol/mbconda/siebrenf/envs/ananse/lib/python3.9/site-packages/distributed/comm/tcp.py", line 124, in convert_stream_closed_error
    raise CommClosedError(
distributed.comm.core.CommClosedError: in <closed TCP>: ConnectionResetError: [Errno 104] Connection reset by peer

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/vol/mbconda/siebrenf/envs/ananse/lib/python3.9/site-packages/distributed/worker.py", line 2334, in gather_dep
    response = await get_data_from_worker(
  File "/vol/mbconda/siebrenf/envs/ananse/lib/python3.9/site-packages/distributed/worker.py", line 3753, in get_data_from_worker
    return await retry_operation(_get_data, operation="get_data_from_worker")
  File "/vol/mbconda/siebrenf/envs/ananse/lib/python3.9/site-packages/distributed/utils_comm.py", line 385, in retry_operation
    return await retry(
  File "/vol/mbconda/siebrenf/envs/ananse/lib/python3.9/site-packages/distributed/utils_comm.py", line 370, in retry
    return await coro()
  File "/vol/mbconda/siebrenf/envs/ananse/lib/python3.9/site-packages/distributed/worker.py", line 3730, in _get_data
    comm = await rpc.connect(worker)
  File "/vol/mbconda/siebrenf/envs/ananse/lib/python3.9/site-packages/distributed/core.py", line 1012, in connect
    comm = await connect(
  File "/vol/mbconda/siebrenf/envs/ananse/lib/python3.9/site-packages/distributed/comm/core.py", line 325, in connect
    raise IOError(
OSError: Timed out during handshake while connecting to tcp://127.0.0.1:40089 after 10 s
[###########                             ] | 28% Completed |  1min 14.0sdistributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
[###########                             ] | 28% Completed |  1min 14.6sdistributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:46609
...
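
For context: the "Worker exceeded 95% memory budget" lines come from dask's nanny process, which kills and restarts a worker once it crosses the terminate fraction of its per-worker memory limit (95% by default). A minimal sketch of where those thresholds live when a dask cluster is set up by hand; the worker count and memory limit below are illustrative, not what ananse network configures internally:

import dask
from dask.distributed import Client, LocalCluster

# The nanny restarts a worker when it exceeds the "terminate" fraction of
# memory_limit; the lower thresholds trigger spilling and pausing first.
# These values are dask's defaults, shown only to explain the log messages.
dask.config.set({
    "distributed.worker.memory.target": 0.60,     # start spilling to disk
    "distributed.worker.memory.spill": 0.70,      # spill more aggressively
    "distributed.worker.memory.pause": 0.80,      # stop accepting new tasks
    "distributed.worker.memory.terminate": 0.95,  # nanny kills the worker
})

# Illustrative cluster: 12 workers with 4 GB each (not ANANSE's actual settings).
cluster = LocalCluster(n_workers=12, memory_limit="4GB")
client = Client(cluster)
print(client)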
@siebrenf
Member Author

siebrenf commented Jul 5, 2021

Without niceness and with 1 core: same issue.

@simonvh
Member

simonvh commented Jul 7, 2021

I think there are two issues:

  1. ananse network still seems to take quite a lot of memory, especially if there are many TFs. This needs to be refactored (I have some ideas there).
  2. The binding file contains all zebrafish TFs, but also all human TFs. The result is that this file becomes too big, with quite a lot of unnecessary information. What were the arguments for ananse binding? This is also something that we should look at (a quick check is sketched below).
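
One way to check point 2 is to count the distinct factors in the binding file and see how many look like human gene symbols. A sketch, assuming a tab-separated file with a "factor" column; the column name is a guess, adjust it to the real header:

import pandas as pd

# Count distinct TFs in the binding file without loading it all at once.
# The "factor" column name is an assumption; adjust to the actual header.
factors = set()
for chunk in pd.read_csv("binding_24hpf.tsv", sep="\t", usecols=["factor"], chunksize=1_000_000):
    factors.update(chunk["factor"].unique())

# All-caps symbols (e.g. SOX15) are typically human-style gene names.
human_style = {f for f in factors if f.isupper()}
print(f"{len(factors)} distinct factors, of which {len(human_style)} look like human symbols")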

@siebrenf
Member Author

siebrenf commented Jul 7, 2021

Angela ran ananse binding on the student server. I think it was something like this:

ananse binding \
-g GRCz11.fa \
-A GRCz11-GSM3396550.samtools-coordinate.bam \
-r GRCz11_raw.tsv \
-n 12 \
-o out_binding ;

head of GRCz11_raw.tsv:

1:5842-6042
1:11142-11342
1:16764-16964
1:18603-18803
1:21696-21896
1:27134-27334
1:27574-27774
1:29781-29981
1:36577-36777
1:42716-42916

The head of the output factor_activity.tsv indeed contains a mix of upper- and lowercase factor names:

factor  activity
SOX15   0.7532033426183844
SRY     0.7532033426183844
SOX13   0.652924791086351
SOX9    0.886908077994429
Sox7    0.4284122562674095
Sox15   0.6412256267409471
SOX30   0.4284122562674095
Sox18   0.4284122562674095
Sox12   0.4284122562674095

The binding.txt output contains 315,961,873 lines, which also includes hits for both the upper- and lowercase version of each TF.

I ran the top command to compare the outputs, and they seem very similar. The issue seems to be on the ANANSE side.
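
To make the case mixing concrete, a small check like the one below lists factors that occur in factor_activity.tsv under more than one capitalisation (a sketch; it assumes a tab-separated file with the header shown above):

import pandas as pd

# List factors that appear under more than one capitalisation
# (e.g. SOX15 and Sox15) in factor_activity.tsv.
df = pd.read_csv("factor_activity.tsv", sep="\t")
grouped = df.assign(upper=df["factor"].str.upper()).groupby("upper")["factor"].agg(list)
dupes = grouped[grouped.map(len) > 1]
print(dupes)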

@Maarten-vd-Sande
Member

  • Did she use a custom motif2factors file (she should for zebrafish!)?
  • Also, maybe I am misunderstanding, but how can Sox15 and SOX15 have different activity scores?

@simonvh
Member

simonvh commented Jul 9, 2021

Also, maybe I am misunderstanding, but how can Sox15 and SOX15 have different activity scores?

The default uses both mouse and human TFs (from the messy motif db). These may be assigned to different motifs.
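
As a toy illustration of why that gives different scores (the motif names, mapping, and numbers below are made up, and the aggregation is simplified, not ANANSE's actual method): if SOX15 and Sox15 are linked to different motifs, each spelling simply inherits the activity of its own motif(s).

# Made-up motif activities and motif-to-factor mapping, only to illustrate
# how SOX15 and Sox15 can end up with different activity scores.
motif_activity = {"MotifA": 0.75, "MotifB": 0.64}
motif2factors = {"MotifA": ["SOX15", "SRY"], "MotifB": ["Sox15"]}

factor_activity = {}
for motif, factors in motif2factors.items():
    for factor in factors:
        # simplified aggregation (max over linked motifs); not ANANSE's actual rule
        factor_activity[factor] = max(factor_activity.get(factor, 0.0), motif_activity[motif])

print(factor_activity)  # {'SOX15': 0.75, 'SRY': 0.75, 'Sox15': 0.64}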

In any case, the develop branch with the h5 storage integrated should now run without memory problems, regardless of the number of TFs.

@siebrenf
Member Author

I think the issue was resolved by rerunning binding with the zebrafish motif2factors, passing it via --pfmfile.

Expanding documentation on --pfmfile and --pfmscorefile would help here. (I'll add it to my TODOs)
