
Issues with cA2.09.48 on JUWELS #109

Closed
martin-ueding opened this issue Oct 15, 2019 · 54 comments

@martin-ueding
Contributor

For lack of a better place I am putting this here. I am trying to run the contractions for cA2.09.48 on JUWELS. I got it compiled and I think that I have set it up correctly in the input file. The paths might be wrong, as the perambulators are not in one subdirectory per random vector. I get the following:

terminate called after throwing an instance of 'std::out_of_range'
  what():  vector::_M_range_check: __n (which is 19) >= this->size() (which is 19)
@kostrzewa
Member

Yeah, one of the inconsistencies... Should be easy enough to account for -> https://github.com/HISKP-LQCD/sLapH-contractions/blob/master/src/global_data_build_IO_names.cpp

@martin-ueding
Contributor Author

Oh, you know where this hits?

@kostrzewa
Member

I would have expected a perambulator or random vector reading failure, however...

@martin-ueding
Contributor Author

Guess I will be looking into the traceback stuff that we had discussed a few times.

@kostrzewa
Member

The IO names are constructed in src/global_data_build_IO_names.cpp. At the start of contract (or VdaggerV), build_IO_names(gd, config_i) is called, and this populates the various members

gd.rnd_vec_construct.filename_list
gd.peram_construct.filename_list
gd.filename_eigenvectors

Just to be sure:

  1. you did not reorganize the perambulators into the expected directory structure, right?
  2. you chose appropriate dilution sizes for this lattice?

@kostrzewa
Member

I've answered my second question:

quark = u:6:TB:2:EI:6:DF:4:/p/scratch/chbn28/project/sLapH_perambulators/cA2a.09.48/light

should be

quark = u:6:TB:3:EI:4:DF:4:/p/scratch/chbn28/project/sLapH_perambulators/cA2a.09.48/light

@kostrzewa
Member

Guess I will be looking into the traceback stuff that we had discussed a few times.

Indeed, it would be useful to know where this was triggered without having to fire up a debugger...

@martin-ueding
Contributor Author

martin-ueding commented Oct 15, 2019

I had tried to compile the software with the -traceback flag for the ICC, but it did not magically print the backtrace.

I'm back from lunch and I got this backtrace from gdb:

#0  0x00002aaaabe76207 in raise () from /lib64/libc.so.6
#1  0x00002aaaabe778f8 in abort () from /lib64/libc.so.6
#2  0x00002aaaabd45993 in __gnu_cxx::__verbose_terminate_handler ()
    at ../../../../libstdc++-v3/libsupc++/vterminate.cc:95
#3  0x00002aaaabd4bb26 in __cxxabiv1::__terminate(void (*)()) ()
    at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:47
#4  0x00002aaaabd4bb61 in std::terminate () at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:57
#5  0x00002aaaabd4bd93 in __cxxabiv1::__cxa_throw (obj=obj@entry=0x12e3b60, 
    tinfo=0x2aaaabe31998 <typeinfo for std::out_of_range>, dest=0x2aaaabd60a20 <std::out_of_range::~out_of_range()>)
    at ../../../../libstdc++-v3/libsupc++/eh_throw.cc:95
#6  0x00002aaaabd478a1 in std::__throw_out_of_range_fmt (__fmt=<optimized out>)
    from /gpfs/software/juwels/stages/2019a/software/GCCcore/8.3.0/lib64/libstdc++.so.6
#7  0x0000000000815b73 in std::vector<DilutedFactorIndex, std::allocator<DilutedFactorIndex> >::_M_range_check (
    this=0x12d7b70, __n=19)
    at /gpfs/software/juwels/stages/2019a/software/GCCcore/8.3.0/include/c++/8.3.0/bits/stl_vector.h:960
#8  0x0000000000815bab in std::vector<DilutedFactorIndex, std::allocator<DilutedFactorIndex> >::at (this=0x12d7b70, 
    __n=19) at /gpfs/software/juwels/stages/2019a/software/GCCcore/8.3.0/include/c++/8.3.0/bits/stl_vector.h:981
#9  0x00000000008082a3 in read_parameters (gd=..., ac=7, av=0x7ffffffeeeb8)
    at /p/project/chbn28/ueding1/Code/sLapH-contractions/src/global_data.cpp:393
#10 0x0000000000710eb8 in main (ac=7, av=0x7ffffffeeeb8)
    at /p/project/chbn28/ueding1/Code/sLapH-contractions/main/contract.cpp:41

The offending line (global_data.cpp:393) is this one:

      auto const size2 = gd.quarkline_lookuptable.at("Q2").at(q[1]).rnd_vec_ids.size();

The type of that field is this:

  DilutedFactorIndicesCollection quarkline_lookuptable;

And that type is:

using DilutedFactorIndicesCollection =
    std::map<std::string, std::vector<DilutedFactorIndex>>;

So it is the second .at that fails the range check. The q come from TraceIndicesCollection trace_indices_map, where typedef std::map<std::string, std::vector<Indices>> TraceIndicesCollection;. So each q is an Indices, which is just a std::vector<ssize_t>.
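
For reference, a minimal sketch (not the project code) of the pattern that throws: the map lookup for "Q2" succeeds, but the vector behind it is indexed past its end by the second .at, which gives exactly the vector::_M_range_check message above.

#include <map>
#include <string>
#include <vector>

struct DilutedFactorIndex {};

int main() {
  std::map<std::string, std::vector<DilutedFactorIndex>> quarkline_lookuptable;
  quarkline_lookuptable["Q2"].resize(19);  // valid indices are 0..18
  // q[1] == 19 here, so the second .at throws std::out_of_range
  // ("vector::_M_range_check: __n (which is 19) >= this->size() (which is 19)").
  quarkline_lookuptable.at("Q2").at(19);
}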

It could be the dilution specification, I will check that.

@martin-ueding
Contributor Author

I have updated the dilution specification, it shows up in the logs, but there still is the same exception. I'll run it in GDB again to make sure that it is the same spot.

@kostrzewa
Member

The offending line (global_data.cpp:393) is this one:

Did you add something to the function? We seem to have different line numbers (Q0Q2-optimization branch):

[...]
389     int total_number_of_random_combinations_in_trQ1 = 0;
390     for (auto const &q : gd.trace_indices_map.at("trQ1")) {
391       total_number_of_random_combinations_in_trQ1 +=
392           gd.quarkline_lookuptable.at("Q1").at(q[0]).rnd_vec_ids.size();
393     }                                                                                                                                                    
[...]

@martin-ueding
Contributor Author

I have moved the perambulators like they are on QBIG, but that also did not resolve the issue:

./cnfg0240/rnd_vec_00/perambulator.rndvecnb00.u.TsoB0032.VsoI0004.DsoF4.TsiF0096.SsiF110592.DsiF4.CsiF3.smeared0.00240
./cnfg0240/rnd_vec_00/randomvector.rndvecnb00.u.nbev0660.0240
./cnfg0240/rnd_vec_01/perambulator.rndvecnb01.u.TsoB0032.VsoI0004.DsoF4.TsiF0096.SsiF110592.DsiF4.CsiF3.smeared0.00240
./cnfg0240/rnd_vec_01/randomvector.rndvecnb01.u.nbev0660.0240
./cnfg0240/rnd_vec_02/perambulator.rndvecnb02.u.TsoB0032.VsoI0004.DsoF4.TsiF0096.SsiF110592.DsiF4.CsiF3.smeared0.00240
./cnfg0240/rnd_vec_02/randomvector.rndvecnb02.u.nbev0660.0240
./cnfg0240/rnd_vec_03/perambulator.rndvecnb03.u.TsoB0032.VsoI0004.DsoF4.TsiF0096.SsiF110592.DsiF4.CsiF3.smeared0.00240
./cnfg0240/rnd_vec_03/randomvector.rndvecnb03.u.nbev0660.0240
./cnfg0240/rnd_vec_04/perambulator.rndvecnb04.u.TsoB0032.VsoI0004.DsoF4.TsiF0096.SsiF110592.DsiF4.CsiF3.smeared0.00240
./cnfg0240/rnd_vec_04/randomvector.rndvecnb04.u.nbev0660.0240
./cnfg0240/rnd_vec_05/perambulator.rndvecnb05.u.TsoB0032.VsoI0004.DsoF4.TsiF0096.SsiF110592.DsiF4.CsiF3.smeared0.00240
./cnfg0240/rnd_vec_05/randomvector.rndvecnb05.u.nbev0660.0240

@pittlerf
Collaborator

Are we sure that it is connected to the perambulators? I remember having an issue where the sign of the required momentum in the VdaggerV was different from what was actually written to disk. Could that be the case here?

@martin-ueding
Contributor Author

It could also be that I have set up the VdaggerV incorrectly. I have never used pre-generated VdaggerV, so perhaps I am doing it wrong.

So in /p/scratch/chbn28/project/vdaggerv/nf2/cA2.09.48 I have this structure:

cnfg0240/operators.0240.p_00-1.d_000.t_000
cnfg0240/operators.0240.p_00-1.d_000.t_001
cnfg0240/operators.0240.p_00-1.d_000.t_002
cnfg0240/operators.0240.p_00-1.d_000.t_003
cnfg0240/operators.0240.p_00-1.d_000.t_004
cnfg0240/operators.0240.p_00-1.d_000.t_005
cnfg0240/operators.0240.p_00-1.d_000.t_006
cnfg0240/operators.0240.p_00-1.d_000.t_007
cnfg0240/operators.0240.p_00-1.d_000.t_008
cnfg0240/operators.0240.p_00-1.d_000.t_009

And then in the file I have these options:

delta_config = 1
#path_config = /hiskp4/gauges/quenched/wilson_b5.85_L12T24

# eigenvector handling:
number_of_eigen_vec = 660
#path_eigenvectors   = /hiskp4/eigensystems/nf211/A40.24/
#name_eigenvectors   = eigenvectors

handling_vdaggerv   = read
path_vdaggerv = /p/scratch/chbn28/project/vdaggerv/nf2/cA2.09.48/cnfg0240

From the code I have these snippets that make up the file names:

  auto const full_path =
      (boost::format("/%s/cnfg%04d/operators.%04d") % path_vdaggerv % config % config)
          .str();

  std::string dummy = full_path + ".p_" + std::to_string(op.momentum[0]) +
                      std::to_string(op.momentum[1]) +
                      std::to_string(op.momentum[2]) + ".d_" +
                      to_string(op.displacement);

  auto const infile = (boost::format("%s.t_%03d") % dummy % t).str();

If these files could not be opened, errors would be thrown. So either they are being built or they can be found.
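
If I read the snippet correctly, with the path_vdaggerv from my input file above the constructed name would come out as something like this (purely for illustration, not actual program output):

    //p/scratch/chbn28/project/vdaggerv/nf2/cA2.09.48/cnfg0240/cnfg0240/operators.0240.p_00-1.d_000.t_000

That is, cnfg0240 would appear twice, because the format string already inserts cnfg%04d.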

@martin-ueding
Contributor Author

Let's try that without the configuration number in the path!

path_vdaggerv = /p/scratch/chbn28/project/vdaggerv/nf2/cA2.09.48

But that on the other hand means that there is no clear error when the file cannot be found!

@martin-ueding
Contributor Author

It seems that the path to the perambulators is not the real problem. Taking the output from some working contraction we can see what happens:

Memory consumption:
        OperatorFactory:        0.18 Gb
        RandomVector:   0.00 Gb
        Perambulator:   1.19 Gb
        DiagramParts:
                Q0:     0.10 Gb
                Q1:     0.00 Gb
                Q2:     0.10 Gb
        trQ1Q1: 0.00 Gb

And this is missing in the new case here:

        trQ0Q2: 17179869153.22 Gb
        trQ1:   0.00 Gb
        Diagrams:
        Perambulators initialised
        Random vectors initialised
        Meson operators initialised

processing configuration: 1430


[Start    ] Perambulator I/O
        Reading perambulator from file:
                /hiskp4/perambulators/nf211/A40.24/light/cnfg1430/rnd_vec_00/perambulator.rndvecnb00.u.TsoB0024.VsoI0006.DsoF4.TsiF00
[   Finish] Perambulator I/O. Duration: 0.7837574035 seconds. Threads: 1
…
[Start    ] Eigenvector and Gauge I/O
        Reading eigenvectors from files:/hiskp4/eigensystems/nf211/A40.24//eigenvectors.1430.000
…

So the actual reading of the files happens later on, and it seems that it would crash regardless of whether the actual files are present. I am not sure how that works with the VdaggerV “read” mode here; the other contractions were run in “build” mode.

I will just try to run it with “build” and see whether that crashes later.

@pittlerf
Collaborator

Hi Martin,

The VdaggerV reading happens after the perambulator reading:

     Meson operators initialised

processing configuration: 468


[Start    ] Perambulator I/O
        Reading perambulator from file:
                /p/scratch/chbn28/hbn28d/peram_generation/nf2/cA2.09.48/light/cnfg0468/rnd_vec_00/perambu
lator.rndvecnb00.u.TsoB0032.VsoI0004.DsoF4.TsiF0096.SsiF110592.DsiF4.CsiF3.smeared0.00468
[   Finish] Perambulator I/O. Duration: 3.9700660706 seconds. Threads: 1
[Start    ] Perambulator I/O
        Reading perambulator from file:
                /p/scratch/chbn28/hbn28d/peram_generation/nf2/cA2.09.48/light/cnfg0468/rnd_vec_01/perambu
lator.rndvecnb01.u.TsoB0032.VsoI0004.DsoF4.TsiF0096.SsiF110592.DsiF4.CsiF3.smeared0.00468
[   Finish] Perambulator I/O. Duration: 5.1294889450 seconds. Threads: 1
[Start    ] Perambulator I/O
        Reading perambulator from file:
                /p/scratch/chbn28/hbn28d/peram_generation/nf2/cA2.09.48/light/cnfg0468/rnd_vec_02/perambu
lator.rndvecnb02.u.TsoB0032.VsoI0004.DsoF4.TsiF0096.SsiF110592.DsiF4.CsiF3.smeared0.00468
[   Finish] Perambulator I/O. Duration: 5.1566550732 seconds. Threads: 1
[Start    ] Perambulator I/O
        Reading perambulator from file:
                /p/scratch/chbn28/hbn28d/peram_generation/nf2/cA2.09.48/light/cnfg0468/rnd_vec_03/perambu
lator.rndvecnb03.u.TsoB0032.VsoI0004.DsoF4.TsiF0096.SsiF110592.DsiF4.CsiF3.smeared0.00468
[   Finish] Perambulator I/O. Duration: 5.1603319645 seconds. Threads: 1
[Start    ] Perambulator I/O
        Reading perambulator from file:
                /p/scratch/chbn28/hbn28d/peram_generation/nf2/cA2.09.48/light/cnfg0468/rnd_vec_04/perambulator.rndvecnb04.u.TsoB0032.VsoI0004.DsoF4.TsiF0096.SsiF110592.DsiF4.CsiF3.smeared0.00468
[   Finish] Perambulator I/O. Duration: 4.8916890621 seconds. Threads: 1
[Start    ] Perambulator I/O
        Reading perambulator from file:
                /p/scratch/chbn28/hbn28d/peram_generation/nf2/cA2.09.48/light/cnfg0468/rnd_vec_05/perambulator.rndvecnb05.u.TsoB0032.VsoI0004.DsoF4.TsiF0096.SsiF110592.DsiF4.CsiF3.smeared0.00468
[   Finish] Perambulator I/O. Duration: 4.8554379940 seconds. Threads: 1
        Reading random vector from file:
                /p/scratch/chbn28/hbn28d/peram_generation/nf2/cA2.09.48/light/cnfg0468/rnd_vec_00/randomvector.rndvecnb00.u.nbev0660.0468
        Reading random vector from file:
                /p/scratch/chbn28/hbn28d/peram_generation/nf2/cA2.09.48/light/cnfg0468/rnd_vec_01/randomvector.rndvecnb01.u.nbev0660.0468
        Reading random vector from file:
                /p/scratch/chbn28/hbn28d/peram_generation/nf2/cA2.09.48/light/cnfg0468/rnd_vec_02/randomvector.rndvecnb02.u.nbev0660.0468
        Reading random vector from file:
                /p/scratch/chbn28/hbn28d/peram_generation/nf2/cA2.09.48/light/cnfg0468/rnd_vec_03/randomvector.rndvecnb03.u.nbev0660.0468
        Reading random vector from file:
                /p/scratch/chbn28/hbn28d/peram_generation/nf2/cA2.09.48/light/cnfg0468/rnd_vec_04/randomvector.rndvecnb04.u.nbev0660.0468
        Reading random vector from file:
                /p/scratch/chbn28/hbn28d/peram_generation/nf2/cA2.09.48/light/cnfg0468/rnd_vec_05/randomvector.rndvecnb05.u.nbev0660.0468
[Start    ] VdaggerV I/O
        reading VdaggerV from file://p/scratch/chbn28/hbn28d/vdaggerv/nf2/cA2.09.48/data//cnfg0468/operators.0468.p_-100.d_000.t_009
        reading VdaggerV from file://p/scratch/chbn28/hbn28d/vdaggerv/nf2/cA2.09.48/data//cnfg0468/operators.0468.p_-100.d_000.t_006

@pittlerf
Collaborator

at least in the case of the rho

@martin-ueding
Contributor Author

Okay, thank you! This means that something apart from the files is still wrong; I will try to figure that out.

@pittlerf
Collaborator

I have tried your code with the three-pion and the rho input files as well; the three-pion one crashed with the same error and the rho one actually did not crash.

Shell debugging restarted
This is sLapH-contractions:
  git branch refs/heads/Q0Q2-optimization
  git revision 426b708b50927b22428ec50958aa104bbff2ab4a
  git state CLEAN
  compiled by pittler1 on juwels08.ib.juwels.fzj.de
  running with up to 24 OpenMP threads

@pittlerf
Collaborator

I have tested that

operator_list = g5.d0.p0

works

@martin-ueding
Contributor Author

Okay, thank you very much for testing this! That means that it is independent of any perambulator or VdaggerV files but there is something wrong prior to that. I will try to triage that further.

@martin-ueding
Contributor Author

I just realized that you were using a somewhat older version of the code from 2019-07-22. Since then a few things have happened:

$ git diff 426b708b50927b22428ec50958aa104bbff2ab4a.. --stat
 CMakeLists.txt                                                  |  26 ++---
 deprecated/VdaggerV.cpp                                         | 253 +++++++++++++++++++++++++++++++++++++++++++
 include/EigenVector.hpp                                         |   4 +
 include/OperatorsForMesons.hpp                                  |   9 +-
 include/global_data.hpp                                         |   5 +
 include/h5-wrapper.hpp                                          |   2 +-
 include/timings.hpp                                             |  53 +++++----
 job-script-generator/generate-contraction-jobs                  |   6 +-
 job-script-generator/job_script_juwels.sh.j2                    |  38 +++++++
 job-script-generator/job_script_qbig_slurm.sh.j2                |   2 +-
 job-script-generator/kill_runs.sh                               |   8 --
 main/VdaggerV.cpp                                               | 275 ++++++++++-------------------------------------
 main/contract.cpp                                               |  26 +++--
 make-jureca-build.sh                                            |  44 --------
 make-jurecabooster-build.sh                                     |  46 --------
 make-qbig-build.sh                                              |  31 ------
 migrate-plus                                                    |  82 --------------
 src/Correlators.cpp                                             |   2 +-
 src/DilutedFactorFactory.cpp                                    |   1 -
 src/DilutedTraceFactory.cpp                                     |  22 ++--
 src/EigenVector.cpp                                             |  49 ++++++++-
 src/OperatorFactory.cpp                                         | 234 +++++++++++++++++++++++++++++++---------
 src/global_data.cpp                                             | 176 ++++++++++++++++++------------
 src/init_lookup_tables.cpp                                      |   4 +
 tests/integration-L4-new/correlators-reference/C6cC_cnfg1000.h5 | Bin 195608 -> 195608 bytes
 tests/integration-L4/correlators-reference/C6cC_cnfg1000.h5     | Bin 195608 -> 195608 bytes
 utility/make-qbig-build.sh                                      |   6 +-
 utility/module_load_juwels_icc.sh                               |   8 ++
 utility/reverse-time-in-C6cC                                    |  79 ++++++++++++++
 29 files changed, 862 insertions(+), 629 deletions(-)

I have tried with your version and it also crashes. So I would say that the error has been there all along. On my laptop it takes 384 seconds until the program crashes. I realized that I did not have the cutoffs in place. Then all of a sudden it crashed after 6 seconds. The range check now fails because it hits the index 7, but the same principle applies.

Just running it with operator_list = g5.d0.p0,1 works. More momenta do not work. On my laptop it tried to allocate the memory and unfortunately I have a swap partition and it succeeded 🙄.

Okay, then we have it pinned down pretty well, I believe.

@martin-ueding
Contributor Author

These combinations work:

momentum_cutoff_0 = 2
operator_list = g5.d0.p0,1
momentum_cutoff_0 = 4
operator_list = g5.d0.p0,1,2

But these do not:

momentum_cutoff_0 = 2
operator_list = g5.d0.p0,1,2
momentum_cutoff_0 = 4
operator_list = g5.d0.p0,1,2,3

This should not be the case, of course. The cutoffs should just remove all the operators that are not needed. So perhaps we have the problem that some operators are not used at all with the given cutoffs and therefore do not appear in some list, letting the indexing run past the end of the vector.

I'm reaching the conclusion that I do not want to do anything with that code and just outright replace it now.

@kostrzewa
Member

kostrzewa commented Oct 16, 2019

Looking just at your output file on JUWELS for guidance,

[...]
    gd.path_config: 
    gd.handling_vdaggerv: read
    gd.path_vdaggerv: /p/scratch/chbn28/project/vdaggerv/nf2/cA2.09.48
  Momentum Cutoff: 0->4 1->5 2->6 3->7 4->4 

Memory consumption:
        OperatorFactory:        10.59 Gb
        RandomVector:   0.02 Gb
        Perambulator:   11.60 Gb
        DiagramParts:
                Q0:     0.12 Gb
                Q1:     0.00 Gb
                Q2:     0.07 Gb
        trQ1Q1: 0.00 Gb
Di 15. Okt 16:46:41 CEST 2019

It seems to me that the problem occurs in the iteration over the "quantum number indices" for trQ0Q2 accessing the indices of quark lines "Q0" or "Q2". Looking at the corresponding lines of global_data.cpp (around line 380), this seems to be the spot where things go wrong, hence my question above regarding line numbers in your backtrace.

I'm reaching the conclusion that I do not want to do anything with that code and just outright replace it now.

Yes, although one should keep in mind that all the lookup table stuff in the code is solving a rather complicated organisational problem and that replacing it is similarly not so simple. Going from the output of the subduction code back to VdaggerV objects and perambulators is not totally straightforward, I think.... Also, one doesn't just have to generate the required dependencies on these basic objects at this stage, but also on intermediate results such as Q0Q2.

As it is right now, it seems that objects are requested which are not added to some list at an appropriate stage, most likely in the construction of the dependencies for trQ0Q2, which themselves come from C4cC and C6cC, correct?

If a complete redesign is not feasible, one might be able to reshape the construction of the lookup tables by making use of maps with clear names for each required object instead of having to manually ensure unique indexing (as exemplified by the requirement of the unique_push_back used all over the place). Having unique (and human-readable) object names would also mean that it would be completely clear at this stage which object is missing, perhaps even allowing one to quickly identify why this is so. Instead, all the information we have right now is an index number...
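
A rough sketch of what I mean (hypothetical names and types, not the current code): key the lookup by a readable string built from the object's quantum numbers, so that a missing entry produces a self-explanatory error instead of an out-of-range index:

#include <map>
#include <stdexcept>
#include <string>

struct DilutedFactorIndex {};  // stand-in for the real struct

using NamedLookup = std::map<std::string, DilutedFactorIndex>;

DilutedFactorIndex const &get(NamedLookup const &lookup, std::string const &name) {
  auto it = lookup.find(name);
  if (it == lookup.end()) {
    // The error now says *which* object is missing, not just an index.
    throw std::runtime_error("Missing lookup entry: " + name);
  }
  return it->second;
}

// Usage (hypothetical key format): get(q2_lookup, "Q2_p011_g5_d000");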

@martin-ueding
Contributor Author

Sorry, I did not see that comment before.

Did you add something to the function? We seem to have different line numbers (Q0Q2-optimization branch):

Yes, a simple print statement which I later removed again.

The spot that you name certainly is the one where the code crashes.

Yes, although one should keep in mind that all the lookup table stuff in the code is solving a rather complicated organisational problem and that replacing it is similarly not so simple. Going from the output of the subduction code back to VdaggerV objects and perambulators is not totally straightforward, I think...

I still have not dug into it too deeply, so I likely underestimate the effort required. But in principle the correlators have all the information in them. Also I have the diagram specifications that tell me which Q0/Q1/Q2 objects are involved at which positions.

  map["C6cCD"] = {Vertices({0, 2, 4}, {1, 3, 5}),
                  TraceSpecs{{{"Q2", 3, 0}, {"Q0", 0, 1}, {"Q2", 1, 2}, {"Q0", 2, 3}},
                             {{"Q0", 4, 5}, {"Q2", 5, 4}}}};

Here you see that we have a Q2Q0Q2Q0 involving the vertices [0, 1, 2, 3] and another Q0Q2 involving the vertices [4, 5]. Then I split the HDF5 name into chunks and pick out the required momenta and gamma structures. I can also see whether they are source or sink indices and take into account that I might need to negate the momentum.

Using this I should be able to populate the operators, the Q objects and the trace(Q…) objects that are needed bottom up. And then use the indices from these to fill the index tables of the more complex objects.
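
A rough sketch of that bottom-up pass (hypothetical types and field names, just to fix the idea): walk each trace of a diagram specification and register, by name, every quark line object it needs; the more complex objects then only refer to entries that are guaranteed to exist.

#include <map>
#include <string>
#include <vector>

// Hypothetical shape of one entry of a trace specification: quark line type
// plus the two vertex positions it connects.
struct TraceSpecEntry {
  std::string q_type;  // "Q0", "Q1" or "Q2"
  int source_vertex;
  int sink_vertex;
};

// Register every quark line object required by one trace of a diagram.
void register_trace(std::vector<TraceSpecEntry> const &trace,
                    std::map<std::string, int> &requested_objects) {
  for (auto const &entry : trace) {
    // Momenta and gamma structures parsed from the HDF5 dataset name would be
    // appended to this key as well.
    std::string key = entry.q_type + "_v" + std::to_string(entry.source_vertex) +
                      "_v" + std::to_string(entry.sink_vertex);
    ++requested_objects[key];
  }
}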

@kostrzewa
Member

I still have not dug into it too deeply, so I likely underestimate the effort required. But in principle the correlators have all the information in them. Also I have the diagram specifications that tells me which Q0/Q1/Q2 objects are involved at which positions.

I agree, the diagram spec should provide the necessary handle.

@kostrzewa
Member

Once the JSON-based correlator input is merged, what's (roughly) the modus operandi if I wanted to run contractions for some new (or old) system?

@martin-ueding
Contributor Author

For a new system you would run my projection code to obtain the prescriptions. The contraction code contains a utility script which goes through all these prescriptions, extracts the dataset names and compiles them into the list of correlator names.

For an existing system, like the integration tests, I used h5ls and then created the file myself using some Vim commands. But here, too, it would be easy to write a script which does this for you. I can write that eventually, but skipped it for now.

@kostrzewa
Member

For a new system you would run my projection code to obtain the prescriptions. The contraction code contains a utility script which goes through all these prescriptions, extracts the dataset names and compiles them into the list of correlator names.

What if I don't have Mathematica? Just playing devil's advocate here to understand all steps.

What about the fact that QCT does not have real support for the twisted mass symmetries, does that matter?

@kostrzewa
Member

Supporting new types of contractions then necessarily also means having to touch the subduction code, correct?

@martin-ueding
Contributor Author

A third option I forgot is to just generate such a list of diagram names by some other means. Basically the part of the code that I removed (and which had some bug in it) could be rewritten in some other language to output such a list.

What if I don't have Mathematica? Just playing devil's advocate here to understand all steps.

You could just personally buy an academically discounted version if your institution does not have it. I mean this has happened with Maple here once already.

But if you don't have Mathematica, you cannot use the projection at all. This would mean that you would need to either acquire a license or rewrite all this in some other product.

Even if you use the old code and just let it generate a bunch of correlation functions, you will need to know how these are to be combined. And once you have figured that out, you can also build a list. Unless you write code which never explicitly constructs this list but uses the numeric data too soon.

What about the fact that QCT does not have real support for the twisted mass symmetries, does that matter?

It does not know anything about the Dirac matrices. I explicitly do the gamma-5 trick with pattern matching in the code. But there might be more opportunities to compute fewer correlators.

Supporting new types of contractions then necessarily also means having to touch the subduction code, correct?

I would rather say that once you have new operators you need to touch the projection code. The fact that you get different contractions from that is just a consequence of that.


But you are right, the projection code becomes somewhat of a dependency of the contraction code. On the other hand there is a clear interface, the JSON file with the correlator names, so one could substitute it with a different implementation.

@kostrzewa
Member

Thanks for the detailed explanation.

I would rather say that once you have new operators you need to touch the projection code. The fact that you get different contractions from that is just a consequence of that.

Yes, that is what I meant.

A third option I forgot is to just generate such a list of diagram names with some other means. Basically the part of the code that I removed (and had some bug in it) could be rewritten in some other language to output such a list.

For simple systems (i.e., stuff at zero momentum basically) this will be necessary to have I believe. I agree that there's a clear interface now and it should be relatively straightforward to bang something together.

@martin-ueding
Contributor Author

For simple systems (i.e., stuff at zero momentum basically) this will be necessary to have I believe.

Well, for zero momentum you can just write them by hand. There is just one correlator per diagram type.


On JUWELS I advanced further, now I am at this error message:

terminate called after throwing an instance of 'std::runtime_error'
  what():  Can't open //p/scratch/chbn28/project/vdaggerv/nf2/cA2.09.48/cnfg0240/operators.0240.p_011.d_000.t_003

Taking a look into that directory shows that not all momenta are actually built.

[ueding1 juwels03 .../vdaggerv/nf2/cA2.09.48/cnfg0240]
$ ls | cut -d . -f 3 | sort -u
p_00-1
p_00-2
p_0-10
p_0-11
p_0-1-1
p_0-20
p_-100
p_-101
p_-10-1
p_-110
p_-1-10
p_-111
p_-1-11
p_-11-1
p_-1-1-1
p_-200

The one that I want (0, 1, 1) is not available. But there is the conjugated (0, -1, -1) which I could use. I presume that somehow the conjugation of VdaggerV objects has not been resolved in the same fashion as it has been for the generation of these files.

If a file cannot be found I could take the opposite-parity file and do a Hermitian conjugation. Or, when resolving the VdaggerV that I need, I could make sure that I always have a minus sign in the first non-zero momentum component. That is likely the easier method and will use a bit less memory.
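
A minimal sketch of the second option (only the sign convention; the real code would additionally have to record that the stored VdaggerV then enters in Hermitian-conjugated form):

#include <array>

// Flip a momentum so that its first non-zero component is negative.
// Returns true if it was flipped, i.e. if the matching VdaggerV file has to
// be used in Hermitian-conjugated form.
bool canonicalize(std::array<int, 3> &p) {
  for (int c : p) {
    if (c == 0) continue;
    if (c < 0) return false;   // already canonical
    for (auto &x : p) x = -x;  // flip all components
    return true;
  }
  return false;  // zero momentum, nothing to do
}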

@kostrzewa
Member

kostrzewa commented Oct 18, 2019

Well, for zero momentum you can just write them by hand. There is just one correlator per diagram type.

Yes, I agree completely, this is a very good input interface for the purpose, generalizing straightforwardly to multiple flavours and simple enough to write by hand for simple cases.

This approach is on the way towards a more general "diagram description language" which I thought might have been optimal for this type of rather general contraction code. At the time, I had the idea of a slightly more elaborate format which also included different modifiers for the quark lines (to support multiple smearing types, for example) and with an explicit listing of temporal and spatial coordinates, providing the ordering that is now provided by the diagram specifications internally. I dropped the idea to pursue this as there seemed to be no interest in the group at the time.

The one that I want (0, 1, 1) is not available. But there is the conjugated (0, -1, -1) which I could use. I presume that somehow the conjugation of VdaggerV objects has not been resolved in the same fashion as it has been for the generation of these files.

Yes, there is no way around checking both +p and -p VdaggerV files, taking the one that exists and tagging that it has to be used in conjugated form when multiplied.

@kostrzewa
Member

kostrzewa commented Oct 18, 2019

Yes, there is no way around checking both +p and -p VdaggerV files, taking the one that exists and tagging that it has to be used in conjugated form when multiplied.

It might make sense to create a map of available VdaggerV objects at startup (when in "read" mode) in order to avoid overly frequent requests to the file system.
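
Something along these lines (a sketch using std::filesystem, assuming C++17; the real code would parse the file names more carefully):

#include <filesystem>
#include <set>
#include <string>

// Scan the per-configuration VdaggerV directory once at startup and remember
// which operator files exist, so that the +p/-p decision becomes a lookup in
// memory instead of repeated requests to the file system.
std::set<std::string> scan_vdaggerv(std::string const &cnfg_dir) {
  std::set<std::string> available;
  for (auto const &entry : std::filesystem::directory_iterator(cnfg_dir)) {
    available.insert(entry.path().filename().string());
  }
  return available;
}

// Usage: auto files = scan_vdaggerv("/p/scratch/chbn28/project/vdaggerv/nf2/cA2.09.48/cnfg0240");
//        bool have_it = files.count("operators.0240.p_011.d_000.t_003") > 0;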

@kostrzewa
Member

On JUWELS I advanced further, now I am at this error message:

at least it throws now :)

@martin-ueding changed the title from "Issues with cA2.09.48 on JEWELS" to "Issues with cA2.09.48 on JUWELS" on Oct 18, 2019
@martin-ueding
Contributor Author

Okay, we are all set to tackle the next problem 😄😫. The code reads in the perambulators and the VdaggerV (thanks for your help in the meeting!). It starts the contractions:

[Start    ] All contractions
Thread   0 of   4 starts with block pair     1 of   528.
Thread   2 of   4 starts with block pair     0 of   528.
Thread   3 of   4 starts with block pair     2 of   528.
Thread   1 of   4 starts with block pair     3 of   528.

But then of course, as we all had guessed, it did not get any further:

srun: error: jwc00n001: task 0: Killed
srun: Force Terminated job step 1739254.0

I noticed that I did not ask for any memory explicitly, so I have no idea how much I did get.

@martin-ueding
Contributor Author

I guess I got the full 90 GB. I logged into the node and, using watch free -h, I could see how the memory just got consumed and the job was killed. And I was only running 4 threads! On this machine we should be running with 48 threads.

We likely waste a bunch of memory for the reduction at the end; I wanted to switch that over to atomics. But even then the per-thread cache is just an idiotic design for machines with this core/memory ratio. I see these options, none of them great:

  1. Use atomics for the reduction and therefore reduce the memory load and then it will magically fit into the memory with four threads. We will still waste almost all of the computing resources.

  2. Clear the cache much more often. We already only have cross-like diagrams that we compute and also only the gamma-five operator. So we are in a good situation but still it does not work. This would allow us to go for many more threads, hopefully outperforming the cache with fewer threads.

  3. Use fewer and fewer threads until it fits. Shudder.

  4. Try some other parallelization direction. Perhaps over the quantum numbers? We have like 250k different correlators that we want to compute, so we ought to be able to parallelize over them and just have one cache. That of course will take time and delay production even further.

  5. Start with some subset, like discarding the higher moving frames or something.

@kostrzewa
Member

Try the large memory nodes and see what happens to begin with.

@kostrzewa
Member

Wasting resources is not really a problem if the calculation fits into our compute time budget

@martin-ueding
Contributor Author

Nope, 180 GB of memory are not enough with four threads either 🤷‍♂️. I have just implemented the atomics stuff, but that will not save much. We have a result variable for every correlator and every time slice. Before, we also had another one per thread, so basically it has been this:

sizeof(Complex) * num_correlators * (extent_time + num_threads)

With 96 time slices and four threads the change seemed like it would be large, but that is only part of the memory usage. With like 250k correlators that is 4 MB per time slice or thread. Hmm, so I have only saved a couple of MB with that change, I guess.
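
Rough numbers, assuming 16-byte complex doubles: 250 000 correlators × 16 B ≈ 4 MB per time slice or per thread, so the per-time-slice results take 96 × 4 MB ≈ 384 MB in any case, and dropping the four per-thread copies only saves about 4 × 4 MB = 16 MB.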

@martin-ueding
Contributor Author

Besides that, I looked at it when it had already run for about 40 minutes and it was still on the first time slice. Start and end are 14:24:49 and 15:33:58, so it ran for over an hour before filling up the memory. But each thread did not even finish its first time slice combination out of 528. So, optimistically assuming that it missed finishing by only a few GB of memory, we would have about an hour for 4 time slice combinations. That would be a fun 132 hours (5.5 days) for the whole configuration. Completely infeasible.

But we are using just 4 of the 48 cores that the machine has. So if we managed to use all of them without additional overhead it would be ×12 faster. JUWELS does not have a walltime limit, but I guess we have the usual 24h with quota and the 6h without quota? So we need to achieve ×5.5 to fit into the walltime and potentially have ×12 in resources available. Perhaps we can clear the caches more often; then it might fit and the recomputation might not kill us.

@martin-ueding
Contributor Author

I have now added a little change which will clear the QQ cache right after each trace. I have the feeling that this will not really change anything, because everything gets reused and therefore the stuff that is produced in each trace for a given time slice combination should just be reused completely. The second trace should not add anything to it.

I have submitted that on a large memory node on JUWELS, but I expect it to crash. And even if that helps, it still will not allow us to fit 48 threads in there. We need a different strategy for parallelization; this just does not scale.

@kostrzewa
Member

JUWELS does not have a walltime limit, but I guess we have the usual 24h with quota and the 6h without quota?

Yes, the time limits are applied via QOS at the submit stage.

@kostrzewa
Member

Nope, 180 GB of memory are not enough with four threads either 🤷‍♂️

perhaps you could try to run on QBIG just to see the actual memory load as a test (using the devel QOS)

the other thing that might be worth doing is to actually fix the memory load output that Markus started, that way you will be able to predict exactly how much memory you require.

I have now added a little change which will clear the QQ cache right after each trace. I have the feeling that this will not really change anything, because everything gets reused and therefore the stuff that is produced in each trace for a given time slice combination should just be reused completely. The second trace should not add anything to it.

and the relevant part of the cache is cleared anyway once a particular time slice pair has been computed, correct?

@kostrzewa
Member

kostrzewa commented Oct 18, 2019

There are relatively straightforward memory savings that one can obtain, although only for small numbers of concurrent threads.

As the correlators are computed cumulatively on paired time slices, one might load perambulator columns and VdaggerV on given time slices on the fly only when required and then discard them. (This needs some synchronisation between threads, as there will be common elements, consider the pairs (0,0), (0,1), (0,2), (0,3), for example, when running with four threads).

Even with 48 threads, this part of the memory load would be reduced markedly as one would have only a subset of the VdaggerV and perambulator data in memory at any one time.

Of course, this will mean that there will be some overhead, although the I/O time for perams and VdaggerV is quite acceptable and there would be at most a couple of hundred read operations (rather than one big one).

It does mean, however, that functions like create_operators and contract would have to be modified substantially.
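
Very roughly, the on-demand loading could look like this (hypothetical helper names, showing only the synchronisation pattern; evicting a block once no thread needs it any more is not shown):

#include <map>

struct Block {};                                        // stand-in for the peram columns + VdaggerV of one time slice block
Block load_block_from_disk(int t) { return Block{}; }   // placeholder I/O

std::map<int, Block> block_cache;                       // shared between all threads

// Fetch the data of time slice block t, reading it from disk on first use only.
Block const &get_block(int t) {
  Block const *result = nullptr;
#pragma omp critical(block_cache)
  {
    auto it = block_cache.find(t);
    if (it == block_cache.end()) {
      it = block_cache.emplace(t, load_block_from_disk(t)).first;
    }
    result = &it->second;
  }
  return *result;
}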

@kostrzewa
Member

Or rather, VdaggerV and perambulator submatrices corresponding to sets of blocks would need to be read on the fly.

@kostrzewa
Member

At the level of diagrams, one could use nested parallelism, threading over the correlator requests.

@kostrzewa
Member

At the level of diagrams, one could use nested parallelism, threading over the correlator requests.

In fact, how long are the lists of correlator requests on average? These should be huge, right?

@kostrzewa
Member

kostrzewa commented Oct 18, 2019

At the level of diagrams, one could use nested parallelism, threading over the correlator requests.

In fact, how long are the lists of correlator requests on average? These should be huge, right?

I think this might solve all of our problems, it might even lead to better load balancing. But only if these lists are actually as long as I naively assume that they might be.

In practice, for systems with many momenta, one would use few threads over block pairs, but many threads over correlator requests, while for systems with few or a single momentum, one would thread over the block pairs only instead (as done in the past).

@martin-ueding
Contributor Author

I was thinking about parallelizing over the correlators as well. There are cheap diagrams and there are expensive ones. But in each case there are lots of them:

{'C4cC': 4709, 'C4cD': 4709, 'C6cC': 81391, 'C6cCD': 124021, 'C6cD': 47833}

And I would like this parallelization direction because it would mean that memory consumption would stay constant with the number of threads. The cache would need to be synchronized, but with omp critical this should not be too hard actually.
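
Schematically, the threading pattern I have in mind (not the real data structures, just a sketch):

#include <cstddef>
#include <vector>

struct CorrelatorRequest {};                     // stand-in for a real request
void contract_one(CorrelatorRequest const &) {}  // placeholder for the actual work

void contract_block_pair(std::vector<CorrelatorRequest> const &requests) {
  // One shared cache for all threads, so the memory stays constant with the
  // number of threads; only the cache updates are serialized.
#pragma omp parallel for schedule(dynamic)
  for (std::ptrdiff_t i = 0; i < static_cast<std::ptrdiff_t>(requests.size()); ++i) {
#pragma omp critical(qq_cache)
    {
      // ... fetch or insert the Q/QQ objects needed by this request in the
      //     shared cache ...
    }
    contract_one(requests[i]);
  }
}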

@martin-ueding
Contributor Author

Shoot, I just realized that there is no C2c included in it, so we won't get the pion out of there. Do we just want zero momentum there or do we want to verify the dispersion relation? So I'd need to add a couple of these manually because my projection code does not cover the single-particle case yet.

@martin-ueding
Contributor Author

I did run the version with the Q2Q0 optimization with a single thread and a single time slice combination on JUWELS. So far only the job in the devel QOS has run. It exhausted the 2 hours of walltime and was killed; I don't have memory or performance readings. Another job is queued in the batch partition, but it has not started running yet.

It was built to keep the caches, and the memory load did not exceed the 90 GB within the two hours. That does not necessarily mean that it would not do so later on, but at least there is some hope for what the 24 hour jobs in the batch partition will give us.

We can get a speed-up of up to 48 on JUWELS. But there are 528 time slice combinations to run. This gives us a lower bound of 22 hours per configuration. Fortunately we can run the time slice combinations individually; it is just the IO that we would have to re-do every single time. We could also tweak it such that it does not load everything but only what it needs. This way we can split the work into more, smaller jobs at the cost of having to merge and accumulate the HDF5 files that come out.
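
(For the arithmetic, assuming the roughly 2 hours per combination seen on a single thread in the devel run is typical and scaling is perfect: 528 combinations × 2 h / 48 threads ≈ 22 hours.)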

I will watch the jobs on JUWELS and at the same time proceed with Issue #111 to realize that speedup.

@martin-ueding
Contributor Author

I'll close this ticket because running it on JUWELS now works in principle; it is just too slow, and we have #111 for that issue.
