xpmem_close_handler is forcing a SIGKILL of the current thread group. In certain cases that means it kills the PBS_MOM. Our initial guess is that there is a race to free memory when user job processes are exiting, and the appropriate xpmem_detach is losing that race.
Original stack trace recovered via SystemTap:
0xffffffffa1170e57 [stap_7ea83ef4a45a84471fe8e1b5f7f3ed52_33201+0x8e57/0x0]
0xffffffffa117297e [stap_7ea83ef4a45a84471fe8e1b5f7f3ed52_33201+0xa97e/0x0]
0xffffffffa1172f8e [stap_7ea83ef4a45a84471fe8e1b5f7f3ed52_33201+0xaf8e/0x0]
0xffffffffa1174295 [stap_7ea83ef4a45a84471fe8e1b5f7f3ed52_33201+0xc295/0x0]
0xffffffffa116801d [stap_7ea83ef4a45a84471fe8e1b5f7f3ed52_33201+0x1d/0x0]
0xffffffff810932b5 : __send_signal+0x245/0x450 [kernel]
0xffffffff8101bfe4 : try_stack_unwind+0x194/0x1b0 [kernel]
0xffffffff8101ae04 : dump_trace+0x64/0x3b0 [kernel]
0xffffffffa1172e88 [stap_7ea83ef4a45a84471fe8e1b5f7f3ed52_33201+0xae88/0x0]
0xffffffffa1172f8e [stap_7ea83ef4a45a84471fe8e1b5f7f3ed52_33201+0xaf8e/0x0]
0xffffffffa1174295 [stap_7ea83ef4a45a84471fe8e1b5f7f3ed52_33201+0xc295/0x0]
0xffffffffa116801d [stap_7ea83ef4a45a84471fe8e1b5f7f3ed52_33201+0x1d/0x0]
0xffffffff810932b5 : __send_signal+0x245/0x450 [kernel]
0xffffffff810934fe : send_signal+0x3e/0x80 [kernel]
0xffffffff81093d30 : force_sig_info+0xb0/0xe0 [kernel]
0xffffffff81093d76 : force_sig+0x16/0x20 [kernel]
0xffffffffa0238a01 : xpmem_close_handler+0x151/0x270 [xpmem]
0xffffffff811d774d : remove_vma+0x2d/0x70 [kernel]
0xffffffff811db09a : exit_mmap+0xea/0x150 [kernel]
0xffffffff81082edf : mmput+0x4f/0x110 [kernel]
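To make the call path easier to read, the trace can be reduced to its symbolized frames, skipping the unsymbolized frames that belong to the SystemTap probe module itself. The following is only a rough sketch: the function name `symbolized_frames` and the regular expression are assumptions for illustration, based on the line format shown in the trace above.

```python
import re

# Matches symbolized frames such as:
#   0xffffffffa0238a01 : xpmem_close_handler+0x151/0x270 [xpmem]
# Frames from the stap_* module have no " : symbol" part and are skipped.
FRAME_RE = re.compile(
    r"^0x[0-9a-f]+\s*:\s*(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+\s*\[(\w+)\]$"
)

def symbolized_frames(trace_text):
    """Return (symbol, module) pairs for every symbolized frame, in trace order."""
    frames = []
    for line in trace_text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            frames.append((match.group(1), match.group(2)))
    return frames

if __name__ == "__main__":
    # Three lines copied from the trace above.
    excerpt = """\
0xffffffffa116801d [stap_7ea83ef4a45a84471fe8e1b5f7f3ed52_33201+0x1d/0x0]
0xffffffffa0238a01 : xpmem_close_handler+0x151/0x270 [xpmem]
0xffffffff811d774d : remove_vma+0x2d/0x70 [kernel]
"""
    for symbol, module in symbolized_frames(excerpt):
        print(f"{symbol} [{module}]")
```

Read from the bottom up, the symbolized frames at the end of the trace give the path mmput, exit_mmap, remove_vma, xpmem_close_handler, force_sig.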
We enabled xpmem_debug and captured the following trace along with the dmesg log.
The job was started at 16:30, so you can extract the relevant log entries with: grep "Jun 11 16:3" r1i6n18.gbe.ice.issp.u-tokyo.ac.jp
20180611.tar.gz
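If grep is not at hand, the same filtering can be sketched in Python. This is a minimal sketch that assumes the log has already been unpacked from the archive into a plain text file; the helper name `lines_in_window` is hypothetical, and the sample lines are made up purely to show the matching behaviour.

```python
def lines_in_window(lines, needle="Jun 11 16:3"):
    """Return the lines containing `needle`, like `grep needle file`."""
    return [line for line in lines if needle in line]

if __name__ == "__main__":
    # Made-up sample lines, not taken from the attached log.
    sample = [
        "Jun 11 16:29:58 host kernel: before the job",
        "Jun 11 16:30:02 host kernel: job start",
        "Jun 11 16:41:10 host kernel: after the window",
    ]
    for line in lines_in_window(sample):
        print(line)
```

Because the needle ends in "16:3", it selects everything from 16:30 through 16:39. An open file object can be passed directly as `lines`.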