
dask distributed workers die #103

Open
siebrenf opened this issue Jul 5, 2021 · 6 comments

@siebrenf
Member

siebrenf commented Jul 5, 2021

Running ananse network on the develop branch, on cn106. It seems that hitting the memory limit kills and restarts the workers, rather than making them behave?

command:

nice -n 10 \
ananse network -g GRCz11.fa -a GRCz11.annotation.bed -n 12 \
-b binding_24hpf.tsv -e 24hpf_rep1.tsv 24hpf_rep2.tsv -o network_24hpf.txt

full log

log snippet (this error, and seemingly similar errors, keep repeating):

...
2021-07-05 20:29:19 | INFO | Computing network
[##########                              ] | 26% Completed | 58.7sdistributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
[##########                              ] | 26% Completed | 59.3sdistributed.nanny - WARNING - Restarting worker
[##########                              ] | 26% Completed | 60.0sdistributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
[##########                              ] | 26% Completed |  1min  0.8sdistributed.nanny - WARNING - Restarting worker
[###########                             ] | 28% Completed |  1min  9.5sdistributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
[###########                             ] | 28% Completed |  1min 10.0sdistributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
[###########                             ] | 28% Completed |  1min 10.2sdistributed.nanny - WARNING - Restarting worker
[###########                             ] | 28% Completed |  1min 10.8sdistributed.nanny - WARNING - Restarting worker
[###########                             ] | 28% Completed |  1min 12.3sdistributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
[###########                             ] | 28% Completed |  1min 13.0sdistributed.nanny - WARNING - Restarting worker
[###########                             ] | 28% Completed |  1min 13.1sdistributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:40089
Traceback (most recent call last):
  File "/vol/mbconda/siebrenf/envs/ananse/lib/python3.9/site-packages/tornado/iostream.py", line 867, in _read_to_buffer
    bytes_read = self.read_from_fd(buf)
  File "/vol/mbconda/siebrenf/envs/ananse/lib/python3.9/site-packages/tornado/iostream.py", line 1140, in read_from_fd
    return self.socket.recv_into(buf, len(buf))
ConnectionResetError: [Errno 104] Connection reset by peer

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/vol/mbconda/siebrenf/envs/ananse/lib/python3.9/site-packages/distributed/comm/core.py", line 320, in connect
    handshake = await asyncio.wait_for(comm.read(), time_left())
  File "/vol/mbconda/siebrenf/envs/ananse/lib/python3.9/asyncio/tasks.py", line 481, in wait_for
    return fut.result()
  File "/vol/mbconda/siebrenf/envs/ananse/lib/python3.9/site-packages/distributed/comm/tcp.py", line 206, in read
    convert_stream_closed_error(self, e)
  File "/vol/mbconda/siebrenf/envs/ananse/lib/python3.9/site-packages/distributed/comm/tcp.py", line 124, in convert_stream_closed_error
    raise CommClosedError(
distributed.comm.core.CommClosedError: in <closed TCP>: ConnectionResetError: [Errno 104] Connection reset by peer

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/vol/mbconda/siebrenf/envs/ananse/lib/python3.9/site-packages/distributed/worker.py", line 2334, in gather_dep
    response = await get_data_from_worker(
  File "/vol/mbconda/siebrenf/envs/ananse/lib/python3.9/site-packages/distributed/worker.py", line 3753, in get_data_from_worker
    return await retry_operation(_get_data, operation="get_data_from_worker")
  File "/vol/mbconda/siebrenf/envs/ananse/lib/python3.9/site-packages/distributed/utils_comm.py", line 385, in retry_operation
    return await retry(
  File "/vol/mbconda/siebrenf/envs/ananse/lib/python3.9/site-packages/distributed/utils_comm.py", line 370, in retry
    return await coro()
  File "/vol/mbconda/siebrenf/envs/ananse/lib/python3.9/site-packages/distributed/worker.py", line 3730, in _get_data
    comm = await rpc.connect(worker)
  File "/vol/mbconda/siebrenf/envs/ananse/lib/python3.9/site-packages/distributed/core.py", line 1012, in connect
    comm = await connect(
  File "/vol/mbconda/siebrenf/envs/ananse/lib/python3.9/site-packages/distributed/comm/core.py", line 325, in connect
    raise IOError(
OSError: Timed out during handshake while connecting to tcp://127.0.0.1:40089 after 10 s
[###########                             ] | 28% Completed |  1min 14.0sdistributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
[###########                             ] | 28% Completed |  1min 14.6sdistributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:46609
...
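
For context: the "Worker exceeded 95% memory budget" lines come from dask's nanny process, which kills and restarts a worker once it crosses the terminate fraction of its per-worker memory limit (95% by default). A minimal sketch of where those thresholds live when a dask cluster is set up by hand; the worker count and memory limit below are illustrative, not what ananse network configures internally:

import dask
from dask.distributed import Client, LocalCluster

# The nanny restarts a worker when it exceeds the "terminate" fraction of
# memory_limit; the lower thresholds trigger spilling and pausing first.
# These values are dask's defaults, shown only to explain the log messages.
dask.config.set({
    "distributed.worker.memory.target": 0.60,     # start spilling to disk
    "distributed.worker.memory.spill": 0.70,      # spill more aggressively
    "distributed.worker.memory.pause": 0.80,      # stop accepting new tasks
    "distributed.worker.memory.terminate": 0.95,  # nanny kills the worker
})

# Illustrative cluster: 12 workers with 4 GB each (not ANANSE's actual settings).
cluster = LocalCluster(n_workers=12, memory_limit="4GB")
client = Client(cluster)
print(client)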
@siebrenf
Member Author

siebrenf commented Jul 5, 2021

Without niceness and with 1 core: same issue.

@simonvh
Member

simonvh commented Jul 7, 2021

I think there are two issues:

  1. ananse network still seems to take quite a lot of memory, especially if there are many TFs. This needs to be refactored (I have some ideas there).
  2. The binding file contains all zebrafish TFs, but also all human TFs. The result is that this file becomes too big, with quite a lot of unnecessary information. What were the arguments for ananse binding? This is also something that we should look at (a quick check is sketched below).
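
One way to check point 2 is to count the distinct factors in the binding file and see how many look like human gene symbols. A sketch, assuming a tab-separated file with a "factor" column; the column name is a guess, adjust it to the real header:

import pandas as pd

# Count distinct TFs in the binding file without loading it all at once.
# The "factor" column name is an assumption; adjust to the actual header.
factors = set()
for chunk in pd.read_csv("binding_24hpf.tsv", sep="\t", usecols=["factor"], chunksize=1_000_000):
    factors.update(chunk["factor"].unique())

# All-caps symbols (e.g. SOX15) are typically human-style gene names.
human_style = {f for f in factors if f.isupper()}
print(f"{len(factors)} distinct factors, of which {len(human_style)} look like human symbols")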

@siebrenf
Member Author

siebrenf commented Jul 7, 2021

Angela ran ananse binding on the student server. I think it was something like this:

ananse binding \
-g GRCz11.fa \
-A GRCz11-GSM3396550.samtools-coordinate.bam \
-r GRCz11_raw.tsv \
-n 12 \
-o out_binding ;

head of GRCz11_raw.tsv:

1:5842-6042
1:11142-11342
1:16764-16964
1:18603-18803
1:21696-21896
1:27134-27334
1:27574-27774
1:29781-29981
1:36577-36777
1:42716-42916

The head of the output factor_activity.tsv indeed contains a mix of upper- and lowercase factor names:

factor  activity
SOX15   0.7532033426183844
SRY     0.7532033426183844
SOX13   0.652924791086351
SOX9    0.886908077994429
Sox7    0.4284122562674095
Sox15   0.6412256267409471
SOX30   0.4284122562674095
Sox18   0.4284122562674095
Sox12   0.4284122562674095

The binding.txt output contains 315,961,873 lines, which also includes hits for both the upper- and lowercase version of each TF.

I ran the top command to compare the outputs, and they seem very similar. The issue seems to be on the ANANSE side.
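
To make the case mixing concrete, a small check like the one below lists factors that occur in factor_activity.tsv under more than one capitalisation (a sketch; it assumes a tab-separated file with the header shown above):

import pandas as pd

# List factors that appear under more than one capitalisation
# (e.g. SOX15 and Sox15) in factor_activity.tsv.
df = pd.read_csv("factor_activity.tsv", sep="\t")
grouped = df.assign(upper=df["factor"].str.upper()).groupby("upper")["factor"].agg(list)
dupes = grouped[grouped.map(len) > 1]
print(dupes)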

@Maarten-vd-Sande
Member

  • Did she use a custom motif2factors file (she should for zebrafish!)?
  • Also, maybe I am misunderstanding, but how can Sox15 and SOX15 have different activity scores?

@simonvh
Member

simonvh commented Jul 9, 2021

Also, maybe I am misunderstanding, but how can Sox15 and SOX15 have different activity scores?

The default uses both mouse and human TFs (from the messy motif db). These may be assigned to different motifs.
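
As a toy illustration of why that gives different scores (the motif names, mapping, and numbers below are made up, and the aggregation is simplified, not ANANSE's actual method): if SOX15 and Sox15 are linked to different motifs, each spelling simply inherits the activity of its own motif(s).

# Made-up motif activities and motif-to-factor mapping, only to illustrate
# how SOX15 and Sox15 can end up with different activity scores.
motif_activity = {"MotifA": 0.75, "MotifB": 0.64}
motif2factors = {"MotifA": ["SOX15", "SRY"], "MotifB": ["Sox15"]}

factor_activity = {}
for motif, factors in motif2factors.items():
    for factor in factors:
        # simplified aggregation (max over linked motifs); not ANANSE's actual rule
        factor_activity[factor] = max(factor_activity.get(factor, 0.0), motif_activity[motif])

print(factor_activity)  # {'SOX15': 0.75, 'SRY': 0.75, 'Sox15': 0.64}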

In any case, the develop branch with the h5 storage integrated should now run without memory problems, regardless of the number of TFs.

@siebrenf
Member Author

I think the issue was resolved by rerunning binding with the zebrafish motif2factors, passing it via --pfmfile.

Expanding documentation on --pfmfile and --pfmscorefile would help here. (I'll add it to my TODOs)
