Where did the communication graph topology go? #4

Open
ccaamad opened this issue Oct 17, 2014 · 1 comment

ccaamad commented Oct 17, 2014

Hi there,

Great to see development of IPM continuing, thanks for such a great tool :)

Version 0.983 could measure how much each rank talked to each other rank, which ended up in the Communication Topology graph. This seems to have disappeared in 2.x, but I see that the old code is still kicking around in there.

Are there any plans to get this working again, please?

Cheers,

Mark


ccaamad commented Oct 17, 2014

Yikes! Just noticed that, even in 0.983, it only reports point-to-point communications, so collectives don't end up in there anyway.

I'm trying to figure out if there are "hot" patterns I can exploit to influence rank placement. Seems exactly the sort of thing IPM could be good for, but I'm not sure if it can be done?
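To make the question concrete, here is a minimal sketch of the kind of bookkeeping I have in mind: a dense rank-by-rank byte counter and a scan for the "hottest" pair. Everything in it (`comm_matrix`, `comm_matrix_record`, `comm_matrix_hottest`) is a hypothetical name for illustration, not IPM's data structure or API, and it assumes something else (for example a point-to-point wrapper) supplies the source rank, destination rank and message size.

```c
#include <stdlib.h>

/* Hypothetical dense rank-by-rank byte counter; not IPM's data structure. */
typedef struct {
    int nranks;
    unsigned long long *bytes;   /* nranks * nranks entries, row = sender */
} comm_matrix;

/* Allocates a zeroed matrix. Returns 0 on success, -1 on allocation failure. */
int comm_matrix_init(comm_matrix *m, int nranks)
{
    m->nranks = nranks;
    m->bytes = calloc((size_t)nranks * (size_t)nranks, sizeof *m->bytes);
    return m->bytes ? 0 : -1;
}

/* Adds nbytes to the (src -> dst) entry. */
void comm_matrix_record(comm_matrix *m, int src, int dst,
                        unsigned long long nbytes)
{
    m->bytes[(size_t)src * m->nranks + dst] += nbytes;
}

/* Finds the unordered pair (a < b) with the largest combined traffic in
   both directions. Returns that byte count; a and b are -1 if the
   matrix holds no traffic at all. */
unsigned long long comm_matrix_hottest(const comm_matrix *m, int *a, int *b)
{
    unsigned long long best = 0;
    *a = *b = -1;
    for (int i = 0; i < m->nranks; i++) {
        for (int j = i + 1; j < m->nranks; j++) {
            unsigned long long t = m->bytes[(size_t)i * m->nranks + j]
                                 + m->bytes[(size_t)j * m->nranks + i];
            if (t > best) {
                best = t;
                *a = i;
                *b = j;
            }
        }
    }
    return best;
}

void comm_matrix_free(comm_matrix *m)
{
    free(m->bytes);
    m->bytes = NULL;
}
```

Pairs that come out on top would be candidates for placing on the same node. A dense matrix grows as the square of the rank count, so a sparse structure would probably be needed at scale.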

Mark

cdaley added a commit to cdaley/IPM that referenced this issue Oct 24, 2016
We now check the return value of fopen. This prevents passing a
possible null pointer to md5_stream. The reason why the executable is
missing is unclear; however, it is always good practice to check the
return value of fopen.
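
The pattern can be sketched as follows. This is an illustrative sketch only, not the code from the commit: toy_digest_stream is a hypothetical stand-in for md5_stream (a trivial additive checksum, so the example compiles on its own), and get_exec_digest is an assumed name that only loosely mirrors ipm_get_exec_md5sum.

```c
#include <stdio.h>

#define BLOCKSIZE 4096

/* Hypothetical stand-in for md5_stream(): a trivial additive checksum.
   Returns 0 on success and 1 on a read error. */
int toy_digest_stream(FILE *stream, unsigned long *result)
{
    unsigned char buffer[BLOCKSIZE];
    unsigned long sum = 0;
    size_t n;

    while ((n = fread(buffer, 1, sizeof buffer, stream)) > 0) {
        for (size_t i = 0; i < n; i++)
            sum += buffer[i];
    }
    if (ferror(stream))
        return 1;
    *result = sum;
    return 0;
}

/* Writes a hex digest of the file at rpath into out (outlen bytes).
   Returns 0 on success and -1 if the file cannot be opened or read,
   in which case out is left as an empty string. */
int get_exec_digest(char *out, size_t outlen, const char *rpath)
{
    unsigned long sum = 0;
    FILE *fh;
    int rc;

    if (outlen == 0)
        return -1;
    out[0] = '\0';

    fh = fopen(rpath, "rb");
    if (fh == NULL)
        return -1;   /* never hand a null stream to the digest routine */

    rc = toy_digest_stream(fh, &sum);
    fclose(fh);
    if (rc != 0)
        return -1;

    snprintf(out, outlen, "%08lx", sum);
    return 0;
}
```

With a check like this, a path that cannot be opened yields an empty digest string instead of a crash inside fread.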

The stack trace from the observed error on Cori Phase 1 at NERSC is
shown below. Notice that the string in t->exec_realpath contains
"/global/cscratch1/sd/csdaley/software_packages/bdcats/tests/ipm-new/bdats
(deleted)". The "(deleted)" text appears to indicate that the
executable no longer exists on disk.

Program terminated with signal SIGSEGV, Segmentation fault.
#0  _IO_fread (buf=0x21f6620, size=1, count=32768, fp=0x0) at iofread.c:41
41	iofread.c: No such file or directory.

(gdb) bt
#0  _IO_fread (buf=0x21f6620, size=1, count=32768, fp=0x0) at iofread.c:41
#1  0x000000000067e4f5 in __wrap_fread (ptr=0x21f6620, size=1, nmemb=32768, stream=0x0)
    at GEN.wrapper_posixio.c:796
#2  0x0000000000686ac8 in md5_stream (stream=stream@entry=0x0, resblock=resblock@entry=0x7fffffff2780)
    at md5.c:160
#3  0x0000000000685d1c in ipm_get_exec_md5sum (exec_md5sum=exec_md5sum@entry=0x1340a50 <task+8464> "",
    rpath=<optimized out>,
    rpath@entry=0x133fa50 <task+4368> "/global/cscratch1/sd/csdaley/software_packages/bdcats/tests/ipm-new/bdats (deleted)") at jobdata.c:137
#4  0x0000000000686d2b in taskdata_init (t=0x133e940 <task>) at perfdata.c:54
#5  0x0000000000684b24 in ipm_init (flags=flags@entry=0) at ipm_core.c:132
#6  0x0000000000671983 in MPI_Init (argc=argc@entry=0x7fffffff488c, argv=argv@entry=0x7fffffff4880)
    at mpi_init.c:125
#7  0x000000000040a88c in main (argc=20, argv=0x7fffffff6858) at main.cpp:82

(gdb) f 4
#4  0x0000000000686d2b in taskdata_init (t=0x133e940 <task>) at perfdata.c:54
54	  ipm_get_exec_md5sum(t->exec_md5sum, t->exec_realpath);

(gdb) p t->exec_realpath
$1 = "/global/cscratch1/sd/csdaley/software_packages/bdcats/tests/ipm-new/bdats (deleted)", '\000' <repeats 4012 times>
(gdb) printf "%s\n", t->exec_realpath
/global/cscratch1/sd/csdaley/software_packages/bdcats/tests/ipm-new/bdats (deleted)

(gdb) f 3
#3  0x0000000000685d1c in ipm_get_exec_md5sum (exec_md5sum=exec_md5sum@entry=0x1340a50 <task+8464> "",
    rpath=<optimized out>,
    rpath@entry=0x133fa50 <task+4368> "/global/cscratch1/sd/csdaley/software_packages/bdcats/tests/ipm-new/bdats (deleted)") at jobdata.c:137
137	   md5_stream(fh,sbuf);

(gdb) p fh
$2 = (FILE *) 0x0

(gdb) f 2
#2  0x0000000000686ac8 in md5_stream (stream=stream@entry=0x0, resblock=resblock@entry=0x7fffffff2780)
    at md5.c:160

160		  n = fread (buffer + sum, 1, BLOCKSIZE - sum, stream);
(gdb) p stream
$3 = (FILE *) 0x0
cdaley added a commit that referenced this issue Aug 24, 2017
This commit wraps the functions `writev`, `readv`, `pwritev` and `preadv`. Only the `writev` function has been tested so far. The `writev` function is used by `std::ostream::write` in C++:

```
(gdb) bt
#0  0x00007ffff68d3c90 in writev () from /lib64/libc.so.6
#1  0x00007ffff75acda5 in std::__basic_file<char>::xsputn_2(char const*, long, char const*, long) () from /usr/lib64/libstdc++.so.6
#2  0x00007ffff75e4632 in std::basic_filebuf<char, std::char_traits<char> >::xsputn(char const*, long) () from /usr/lib64/libstdc++.so.6
#3  0x00007ffff76069f3 in std::ostream::write(char const*, long) () from /usr/lib64/libstdc++.so.6
#4  0x00000000004c7aed in amrex::VisMF::Write (mf=..., mf_name=..., how=how@entry=amrex::VisMF::NFiles, set_ghost=set_ghost@entry=false) at ../../../amrex/Src/Base/AMReX_VisMF.cpp:1012
#5  0x000000000056000e in amrex::StateData::checkPoint (this=0xb61c70, name=..., fullpathname=..., os=..., how=how@entry=amrex::VisMF::NFiles, dump_old=dump_old@entry=false) at ../../../amrex/Src/Amr/AMReX_StateData.cpp:749
#6  0x000000000054fb76 in amrex::AmrLevel::checkPoint (this=this@entry=0xb512a0, dir=..., os=..., how=amrex::VisMF::NFiles, dump_old=false) at ../../../amrex/Src/Amr/AMReX_AmrLevel.cpp:420
#7  0x00000000004253be in Nyx::checkPoint (this=0xb512a0, dir=..., os=..., how=<optimized out>, dump_old_default=<optimized out>) at ../../Source/Nyx_output.cpp:724
#8  0x0000000000544014 in amrex::Amr::checkPoint (this=0xb5b0b0) at ../../../amrex/Src/Amr/AMReX_Amr.cpp:1667
#9  0x000000000054a84f in amrex::Amr::init (this=0xb5b0b0, strt_time=0, stop_time=-1) at ../../../amrex/Src/Amr/AMReX_Amr.cpp:1022
#10 0x0000000000408ade in main (argc=2, argv=0x7fffffffd1a8) at ../../Source/main.cpp:386
```
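
As a rough illustration of what wrapping `writev` involves, the sketch below forwards the call to libc and accumulates a call count and byte total. It is not the wrapper added by this commit: `counted_writev`, `writev_calls` and `writev_bytes` are hypothetical names. The `__wrap_fread` frame in the earlier trace suggests link-time wrapping; assuming GNU ld's `--wrap=writev` option is in use, the function would instead be named `__wrap_writev` and would forward to `__real_writev`.

```c
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Hypothetical per-process counters; IPM's real bookkeeping is not shown. */
unsigned long long writev_calls = 0;
unsigned long long writev_bytes = 0;

/* Forwards to writev() and records how many bytes were actually written.
   Under GNU ld's --wrap=writev this would be __wrap_writev and the inner
   call would go to __real_writev instead (an assumption, see above). */
ssize_t counted_writev(int fd, const struct iovec *iov, int iovcnt)
{
    ssize_t n = writev(fd, iov, iovcnt);

    writev_calls++;
    if (n > 0)
        writev_bytes += (unsigned long long)n;   /* ignore failed calls */
    return n;
}
```

The byte total comes from the return value rather than from summing `iov_len`, because `writev` may write fewer bytes than requested.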