Where did the communication graph topology go? #4
Comments
Yikes! Just noticed that, even in 0.983, it only reports point-to-point communications, so collectives don't end up in there anyway. I'm trying to figure out if there are "hot" patterns I can exploit to influence rank placement. Seems exactly the sort of thing IPM could be good for, but I'm not sure if it can be done?
Mark
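To make the "hot pattern" idea concrete, here is a minimal sketch of the bookkeeping involved: accumulate point-to-point bytes per (source, destination) rank pair, then pick the pair with the most combined traffic as a candidate for placement on the same node. This is not IPM's data structure or API; `comm_matrix`, `comm_record`, `comm_hottest_pair` and the fixed `NRANKS` are hypothetical names for illustration, and collectives are not counted, which matches the limitation noted above.

```c
#include <stddef.h>
#include <stdint.h>

#define NRANKS 4  /* hypothetical job size, fixed only to keep the sketch short */

/* bytes[src][dst]: total point-to-point payload sent from src to dst. */
typedef struct {
    uint64_t bytes[NRANKS][NRANKS];
} comm_matrix;

/* Record one point-to-point message; out-of-range ranks are ignored. */
static void comm_record(comm_matrix *m, int src, int dst, uint64_t nbytes)
{
    if (src < 0 || dst < 0 || src >= NRANKS || dst >= NRANKS) {
        return;
    }
    m->bytes[src][dst] += nbytes;
}

/* Find the unordered rank pair with the largest combined traffic
 * (both directions summed). Returns that traffic in bytes and stores
 * the pair in *a and *b, or -1/-1 if no traffic was recorded. */
static uint64_t comm_hottest_pair(const comm_matrix *m, int *a, int *b)
{
    uint64_t best = 0;
    *a = -1;
    *b = -1;
    for (int i = 0; i < NRANKS; i++) {
        for (int j = i + 1; j < NRANKS; j++) {
            uint64_t total = m->bytes[i][j] + m->bytes[j][i];
            if (total > best) {
                best = total;
                *a = i;
                *b = j;
            }
        }
    }
    return best;
}
```

In a real profiler the `comm_record` call would sit inside the wrappers for the point-to-point send routines, and the matrix would have to be sized from the communicator at run time rather than from a compile-time constant.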
cdaley added a commit to cdaley/IPM that referenced this issue Oct 24, 2016
We now check the return value of fopen. This prevents passing a possible null pointer to md5_stream. The reason why the executable is missing is unclear, however it is always good practice to check the return value of fopen.

The stack trace from the observed error on Cori Phase 1 at NERSC is shown below. Notice that the string in t->exec_realpath contains "/global/cscratch1/sd/csdaley/software_packages/bdcats/tests/ipm-new/bdats (deleted)". The "(deleted)" text indicates that the executable no longer exists on disk???

Program terminated with signal SIGSEGV, Segmentation fault.
#0 _IO_fread (buf=0x21f6620, size=1, count=32768, fp=0x0) at iofread.c:41
41 iofread.c: No such file or directory.
(gdb) bt
#0 _IO_fread (buf=0x21f6620, size=1, count=32768, fp=0x0) at iofread.c:41
#1 0x000000000067e4f5 in __wrap_fread (ptr=0x21f6620, size=1, nmemb=32768, stream=0x0) at GEN.wrapper_posixio.c:796
#2 0x0000000000686ac8 in md5_stream (stream=stream@entry=0x0, resblock=resblock@entry=0x7fffffff2780) at md5.c:160
#3 0x0000000000685d1c in ipm_get_exec_md5sum (exec_md5sum=exec_md5sum@entry=0x1340a50 <task+8464> "", rpath=<optimized out>, rpath@entry=0x133fa50 <task+4368> "/global/cscratch1/sd/csdaley/software_packages/bdcats/tests/ipm-new/bdats (deleted)") at jobdata.c:137
#4 0x0000000000686d2b in taskdata_init (t=0x133e940 <task>) at perfdata.c:54
#5 0x0000000000684b24 in ipm_init (flags=flags@entry=0) at ipm_core.c:132
#6 0x0000000000671983 in MPI_Init (argc=argc@entry=0x7fffffff488c, argv=argv@entry=0x7fffffff4880) at mpi_init.c:125
#7 0x000000000040a88c in main (argc=20, argv=0x7fffffff6858) at main.cpp:82
(gdb) f 4
#4 0x0000000000686d2b in taskdata_init (t=0x133e940 <task>) at perfdata.c:54
54 ipm_get_exec_md5sum(t->exec_md5sum, t->exec_realpath);
(gdb) p t->exec_realpath
$1 = "/global/cscratch1/sd/csdaley/software_packages/bdcats/tests/ipm-new/bdats (deleted)", '\000' <repeats 4012 times>
(gdb) printf "%s\n", t->exec_realpath
/global/cscratch1/sd/csdaley/software_packages/bdcats/tests/ipm-new/bdats (deleted)
(gdb) f 3
#3 0x0000000000685d1c in ipm_get_exec_md5sum (exec_md5sum=exec_md5sum@entry=0x1340a50 <task+8464> "", rpath=<optimized out>, rpath@entry=0x133fa50 <task+4368> "/global/cscratch1/sd/csdaley/software_packages/bdcats/tests/ipm-new/bdats (deleted)") at jobdata.c:137
137 md5_stream(fh,sbuf);
(gdb) p fh
$2 = (FILE *) 0x0
(gdb) f 2
#2 0x0000000000686ac8 in md5_stream (stream=stream@entry=0x0, resblock=resblock@entry=0x7fffffff2780) at md5.c:160
160 n = fread (buffer + sum, 1, BLOCKSIZE - sum, stream);
(gdb) p stream
$3 = (FILE *) 0x0
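The pattern this commit describes, checking the result of fopen before handing the stream to a reader, can be sketched as follows. This is not the IPM code: `checksum_file` is a hypothetical helper, and a plain byte sum stands in for the MD5 computation so the example stays self-contained.

```c
#include <stdio.h>

/* Hypothetical stand-in for the checksum step. Returns 0 on success
 * and -1 if the file cannot be opened, in which case *sum_out is left
 * untouched and no NULL stream ever reaches fread. */
static int checksum_file(const char *path, unsigned long *sum_out)
{
    FILE *fh = fopen(path, "rb");
    if (fh == NULL) {
        /* e.g. the executable was removed after the process started */
        return -1;
    }

    unsigned char buf[4096];
    unsigned long sum = 0;
    size_t n;
    while ((n = fread(buf, 1, sizeof buf, fh)) > 0) {
        for (size_t i = 0; i < n; i++) {
            sum += buf[i];  /* simple byte sum, NOT MD5 */
        }
    }

    fclose(fh);
    *sum_out = sum;
    return 0;
}
```

A caller can then treat a failed open as "checksum unavailable" and carry on, rather than crashing inside the reader as in the trace above.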
cdaley added a commit that referenced this issue Aug 24, 2017
This commit wraps the functions `writev`, `readv`, `pwritev` and `preadv`. Only the `writev` function has been tested so far. The `writev` function is used by `std::ostream::write` in C++:
```
(gdb) bt
#0 0x00007ffff68d3c90 in writev () from /lib64/libc.so.6
#1 0x00007ffff75acda5 in std::__basic_file<char>::xsputn_2(char const*, long, char const*, long) () from /usr/lib64/libstdc++.so.6
#2 0x00007ffff75e4632 in std::basic_filebuf<char, std::char_traits<char> >::xsputn(char const*, long) () from /usr/lib64/libstdc++.so.6
#3 0x00007ffff76069f3 in std::ostream::write(char const*, long) () from /usr/lib64/libstdc++.so.6
#4 0x00000000004c7aed in amrex::VisMF::Write (mf=..., mf_name=..., how=how@entry=amrex::VisMF::NFiles, set_ghost=set_ghost@entry=false) at ../../../amrex/Src/Base/AMReX_VisMF.cpp:1012
#5 0x000000000056000e in amrex::StateData::checkPoint (this=0xb61c70, name=..., fullpathname=..., os=..., how=how@entry=amrex::VisMF::NFiles, dump_old=dump_old@entry=false) at ../../../amrex/Src/Amr/AMReX_StateData.cpp:749
#6 0x000000000054fb76 in amrex::AmrLevel::checkPoint (this=this@entry=0xb512a0, dir=..., os=..., how=amrex::VisMF::NFiles, dump_old=false) at ../../../amrex/Src/Amr/AMReX_AmrLevel.cpp:420
#7 0x00000000004253be in Nyx::checkPoint (this=0xb512a0, dir=..., os=..., how=<optimized out>, dump_old_default=<optimized out>) at ../../Source/Nyx_output.cpp:724
#8 0x0000000000544014 in amrex::Amr::checkPoint (this=0xb5b0b0) at ../../../amrex/Src/Amr/AMReX_Amr.cpp:1667
#9 0x000000000054a84f in amrex::Amr::init (this=0xb5b0b0, strt_time=0, stop_time=-1) at ../../../amrex/Src/Amr/AMReX_Amr.cpp:1022
#10 0x0000000000408ade in main (argc=2, argv=0x7fffffffd1a8) at ../../Source/main.cpp:386
```
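As a rough illustration of what wrapping `writev` involves, the sketch below times the call and accumulates call count, bytes and elapsed seconds. It is an assumption-laden stand-in, not the generated IPM wrapper: `traced_writev`, `io_stats` and `writev_stats` are hypothetical names. The `__wrap_fread` frame in the earlier trace suggests the real wrappers are linked with the GNU linker's `--wrap` option, in which case the function would be named `__wrap_writev` and would call `__real_writev` instead of `writev`.

```c
#define _POSIX_C_SOURCE 200809L  /* for clock_gettime under strict -std modes */

#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <time.h>

/* Hypothetical per-function counters; the real IPM bookkeeping differs. */
typedef struct {
    unsigned long calls;
    unsigned long long bytes;
    double seconds;
} io_stats;

static io_stats writev_stats;

/* Monotonic wall-clock time in seconds. */
static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Forward to writev, recording time spent and bytes actually written.
 * Failed calls (ret < 0) are counted but add no bytes. */
ssize_t traced_writev(int fd, const struct iovec *iov, int iovcnt)
{
    double t0 = now_seconds();
    ssize_t ret = writev(fd, iov, iovcnt);
    double t1 = now_seconds();

    writev_stats.calls++;
    writev_stats.seconds += t1 - t0;
    if (ret > 0) {
        writev_stats.bytes += (unsigned long long)ret;
    }
    return ret;
}
```

The same shape would apply to `readv`, `pwritev` and `preadv`, with the extra offset argument passed through for the positional variants.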
Hi there,
Great to see development of IPM continuing, thanks for such a great tool :)
Version 0.983 could measure how much each rank talked to each other rank, which ended up in the Communication Topology graph. This seems to have disappeared in 2.x, but I see that the old code is still kicking around in there.
Are there any plans to get this working again, please?
Cheers,
Mark