
Change the parallelization axis #111

Open
martin-ueding opened this issue Oct 22, 2019 · 59 comments

@martin-ueding
Contributor

Currently we parallelize over the time slices and keep a cache per thread. This means that the memory usage depends linearly on the number of threads with a pretty big slope. It is so steep that we cannot use it on machines like JUWELS where we just have a few GB of memory per core.

The solution could be the full graph treatment, see #102. But instead we could just try to parallelize over the correlators first. For the 3pi project we have of the order of 250k correlators that we want to compute. We could just parallelize over these instead. This would mean moving a few #pragma omp for directives and also making sure that the cache is synchronized. I will first try that with #pragma omp critical and see how bad the performance is. Perhaps we can get a synchronized map from Boost or something.
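For illustration, a minimal sketch (not actual code from this repository) of what parallelizing over the correlator requests with a critical-protected cache could look like; correlator_requests, cache and accumulate are hypothetical stand-ins:

#pragma omp parallel for
for (ssize_t i = 0; i < ssize(correlator_requests); ++i) {
  auto const &request = correlator_requests[i];

  std::vector<DilutedFactor> const *parts;
#pragma omp critical(shared_cache)
  {
    // Whole cache access serialized for now; a synchronized map could relax this.
    parts = &cache.get(request);
  }

  accumulate(request, *parts);  // per-correlator contraction, hypothetical helper
}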

I'll first go through the code and see what we would need to do.

@kostrzewa
Member

Sounds good. Just to keep it on record: this should definitely be nested threading, such that one can also parallelize over the block pairs if possible. Even in the use-case at hand, having a 2x24 parallelisation is probably beneficial...

@martin-ueding
Contributor Author

I took out all the parallelization in the actual contractions for now and will re-introduce it as focused parallel sections from the bottom up. I thought that we could pull the parallel regions out, do some things redundantly, and only collaborate on the for loops. But that would require a bit more synchronization in the data structures, so I want to do it in small steps.

In OpenMP terms we would have a parallel section outside and then use teams for the loops over the correlators? And one would choose the team size to be the total number of threads to get minimal memory usage? And a team size of 1 to have the current state of the code?

@kostrzewa
Member

I took out all the parallelization in the actual contractions for now and will re-introduce it as focused parallel sections from the bottom up. I thought that we could pull the parallel regions out, do some things redundantly, and only collaborate on the for loops. But that would require a bit more synchronization in the data structures, so I want to do it in small steps.

Agreed.

In OpenMP terms we would have a parallel section outside and then use teams for the loops over the correlators? And one would choose the team size to be the total number of threads to get minimal memory usage? And a team size of 1 to have the current state of the code?

I think that teams might be overly complicated for this. Instead, one would have nested parallel regions: the one at the block pair level would spawn nsockets threads (for example) while the one at the correlator request level would spawn ncores/nsockets threads.

#pragma omp parallel num_threads(nsockets)
{
  #pragma omp for
  for( auto const & block_pair : block_pairs )
  {
    #pragma omp parallel num_threads(ncores/nsockets)
    {
      #pragma omp for
      for( auto const & correlator_request : requests ){
      }
    }
  }
}

The above, suitably modified to have integer-valued loop counters, should be perfectly valid code. Of course, nsockets and ncores may be user-configurable. The current behaviour would be recovered when the first num_threads is given the number of available cores and the second is taken as num_threads(1).
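One detail worth keeping in mind (a sketch, assuming the default OpenMP runtime settings): nested parallelism is disabled by default, so the inner num_threads clause only takes effect after enabling a second level of parallel regions, for example like this:

#include <omp.h>

void setup_nested_threading(int nsockets) {
  omp_set_max_active_levels(2);   // allow two nested parallel regions
  // Older runtimes: omp_set_nested(1) or the OMP_NESTED environment variable.
  omp_set_num_threads(nsockets);  // default thread count for the outer region
}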

@kostrzewa
Member

Disregarding the complication of synchronizing the cache(s) [which may prevent this simple approach, not sure], it should be sufficient to modify the top-level parallel section and to parallelise the assemble function.

@kostrzewa
Member

The above, suitably modified to have integer-valued loop counters, should be perfectly valid code. Of course, nsockets and ncores may be user-configurable. The current behaviour would be recovered when the first num_threads is given the number of available cores and the second is taken as num_threads(1).

Note that the second parallel section can also be inside a function, no problems there.

@kostrzewa
Member

kostrzewa commented Oct 22, 2019

pps: nsockets is of course only a stand-in name, as this kind of parallelisation would also be beneficial on a KNL-like architecture or basically anything with multiple physical or virtual NUMA regions. For KNL, one might have 4 threads on the outside and 17 on the inside, for example.

@martin-ueding
Contributor Author

I think I got the basic part right, but the tests are slower on my laptop now:

Test      Before [s]   After [s]
All            9.37       15.70
Charged        7.70       12.53
Neutral        1.06        2.48
All New       13.88       23.48

Caveats:

  • It was running with nice while my cloud backup just kicked in at 15:00. But I ran them after each other, so it should be similar.
  • It uses four threads on my dual core SMT Intel laptop CPU while another thread was running, so the four threads were likely confined to three virtual cores, so one and a half physical cores.

But I guess with all the finer parallelization and the endless locks this is no surprise. We can change the final accumulation to atomics rather easily, I had tried that already. Actually it might still be in there and now have locks around atomics.

The memory scaling should be much better now, additional threads would not consume more memory in the cache. But the parallelization is horrible, still:

  • I just put the whole operator[] of the caches into a critical region. This means that whenever some thread is building something, no other thread can start building something else, nor can any thread look up something that exists. That might not scale at all.

  • Similarly the get for the trace factory. These also completely block.

I have the impression that one should go through the list of correlators without doing anything and just collect the requests. And then build them eagerly upfront with many threads. Then the cache is effectively read-only and every thread can happily retrieve things. Does that sound sensible? I guess there is no real alternative if we want to do it in parallel.

@martin-ueding
Contributor Author

You can see the difference here: Q0Q2-optimization...other-parallelization

@kostrzewa
Member

Is it possible to sort the correlator requests into disjoint sets according to which intermediates they need? Of course, this again goes into the direction of building a dependency tree...

@martin-ueding
Contributor Author

On Travis CI it fails because the older OpenMP cannot do != in loops but only < and <=. My laptop has GCC 9.2.1 which is just fine with that. I'll need to change that eventually.
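For illustration, the change would just replace the loop condition in the worksharing loops (taking a loop like the ones shown further below as an example), since pre-5.0 OpenMP only allows <, <=, > and >= in the canonical loop form:

#pragma omp parallel for
for (ssize_t i = 0; i < ssize(diagram_index_collection); ++i) {  // was: i != ssize(...)
  // loop body unchanged
}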

@martin-ueding
Contributor Author

I do not think that one can really make disjoint sets out of them in a trivial way. The dependency tree would be needed here. But going through the list twice, once triggering the cache to build and then actually retrieving them does not sound too hard, actually.

@kostrzewa
Member

I do not think that one can really make disjoint sets out of them in a trivial way. The dependency tree would be needed here. But going through the list twice, once triggering the cache to build and then actually retrieving them does not sound too hard, actually.

But this would still involve locking at the building stage, right? Well, I guess it might help somewhat.

@martin-ueding
Contributor Author

No, it would not. Take this one here:

  std::vector<DilutedFactor> const &get(Key const &time_key,
                                        std::array<ssize_t, 2> const &key) {
    TimingScope<4> timing_scope("DilutedProductFactoryQ0Q2::get");

    std::vector<DilutedFactor> *result;
#pragma omp critical(DilutedProductFactoryQ0Q1_get)
    {
      if (Q0Q2_.count(time_key) == 0 || Q0Q2_.at(time_key).count(key) == 0) {
        build(time_key, key);
      }

      result = &Q0Q2_.at(time_key).at(key);
    }
    return *result;
  }

Instead I picture something like this:

  • Call a function request(time_key, key) that basically just puts that into a set. This is done single-threaded.
  • Convert the set into a vector.
  • Then we use OpenMP to call the build(time_key, key) for all elements in the set and store the results into a vector as well.
  • A single thread then converts the vector with keys and the vector with values into a map.
  • The map should be safe for parallel read access and then the cache will just hand out what it needs.

Then the products could run just as they are, with OpenMP added:

  for (ssize_t i = 0; i != ssize(diagram_index_collection); ++i) {
    const auto &c_look = diagram_index_collection[i];
    Tr[{t1, t2}][i] =
        factor_to_trace(df1[{t1, b2}].at({c_look[0]}), df2[{t2, b1}].at({c_look[1]}));
  }

The cache would just serve whatever it has available and it would be everything because everything had already been built.

@martin-ueding
Contributor Author

This is how that would look with OpenMP:

#pragma omp parallel for
  for (ssize_t i = 0; i != ssize(diagram_index_collection); ++i) {
    auto const &c_look = diagram_index_collection[i];
    auto const &value = 
        factor_to_trace(df1[{t2}].at({c_look[1]}), df2[{b2, t1, b2}].at({c_look[0]}));
#pragma omp critical(DilutedTrace2Factory_build)
    Tr[{t1, t2}][i] = value;
  }

And even there we could think of replacing the map with something that would not need to be locked. Or pre-populate the map.

@martin-ueding
Contributor Author

Wait, I'm shooting myself in the foot as I don't have a clear plan. I just parallelized the building of the cache elements. So I can remove a bunch of synchronization stuff again.

@kostrzewa
Member

No, it would not. Take this one here.

I agree, I misunderstood what you were planning to do. In some sense, you are processing the dependency tree "depth-by-depth", ignoring disjointness (which is fine).

@martin-ueding
Contributor Author

We have the problem that I have introduced only a Q0Q2 cache but not a Q1Q1 cache. This means that only the charged diagrams have this optimization and the neutral ones are not included in it. The code therefore is in a similar state as before I started working on it: only the things that the current PhD work needed were fleshed out, the rest was left aside. But I will keep it this way, focus on the changes needed for the I=3 project, and worry about generalizations later. I'll just assume they are straightforward.

There is one more layer that I missed so far in our discussion an hour ago, and it is exactly this cache for the relevant diagrams. The call chain is the following, from top to bottom level.

  • Diagram::assemble(slice_pair) is called on each diagram and it will construct the full diagram. Internally this will gather the needed traces and multiply them together and accumulate the result.

  • DilutedTraceFactory::get(time_key) is used to get a trace from the cache. This is itself a map with all the quantum numbers. The access to this cache is not fine grained but one time slice pair at a time.

  • DilutedTraceFactory::build(time_key) is called when something is not in the cache. Internally it has a loop over the diagram_index_collection which was set up as part of the lookup tables and contains an eager list of everything that it needs to build. I believe that I wanted to change that to lazy at some point as well but never actually did that.

    This loop for the diagrams relevant in 3pi will then work by multiplying Q0Q2 products together that are also taken from the following cache.

  • DilutedProductFactory::get(time_key, key[2]) is called to produce the Q0Q2 objects. This cache is fine grained as we supply both the time key as well as the quantum number (momenta and Dirac structure) index keys for the Q0 and Q2. Internally this cache is organized in a hierarchy of two std::map objects.

  • DilutedProductFactory::build(time_key, key[2]) then does the actual job of computing that element. It uses the multiply function which takes two full maps and a key of result quantum numbers, taking from the two maps just what it needs to build the key. This function feels strange because it gets two containers and is told what to retrieve from them instead of just receiving the two elements.

  • DilutedFactorFactory::operator[](time_key) is called to retrieve the needed Q factors.

  • DilutedFactorFactory::build(time_key) does the actual building. The strange thing is that this again uses the lookup tables to eagerly produce a full set with all the quantum numbers that can occur, but the calling function (DilutedProductFactory::build(time_key, key[2])) will only pick one of them at a time while it builds its cache.

We have three levels of cache and the middle one also needs the quantum number indices as keys. There are the lookup tables which are not exactly part of this cache. So thinking in terms of the graph goal, we have the dependency information spread out over our cache std::map instances as well as these lookup tables.

The plan as discussed with Bartek today is to try to make the caches first gather all the requests in one iteration of the whole diagram assembly code. Then to eagerly build all the needed things from the bottom to the top levels using OpenMP concurrency.

I sketched this in my head so far, I'll add a second request function which has the same signature as the build functions. These will be called first and the requests are just stored in a std::set. Then we call build_all and it will convert the std::set into an std::vector and populate the resulting std::map with the keys and empty values. We then call build concurrently which populates the values of the map. The get/operator[] will just take from the map with range checking, everything should be available at that point.

This brings us also further towards the dependency graph, so this is incremental work towards the full graph.
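For illustration, a minimal sketch (an assumption about the shape, not the actual factory code) of the request/build_all pattern described above; Key and Value stand in for the real key and cache types, and the builder callable plays the role of build():

#include <cstddef>
#include <functional>
#include <map>
#include <set>
#include <utility>
#include <vector>

template <typename Key, typename Value>
class EagerCache {
 public:
  explicit EagerCache(std::function<Value(Key const &)> builder)
      : builder_(std::move(builder)) {}

  // Called single-threaded while walking through the correlator requests.
  void request(Key const &key) { requests_.insert(key); }

  // Builds every requested entry once, in parallel, then clears the requests.
  void build_all() {
    std::vector<Key> todo;
    for (auto const &key : requests_) {
      if (cache_.count(key) == 0) {
        todo.push_back(key);
        cache_[key];  // pre-populate so the map structure stays fixed below
      }
    }
    requests_.clear();

#pragma omp parallel for
    for (std::ptrdiff_t i = 0; i < static_cast<std::ptrdiff_t>(todo.size()); ++i) {
      cache_.at(todo[i]) = builder_(todo[i]);  // threads write to distinct elements
    }
  }

  // After build_all() the cache is only read; concurrent lookups are safe.
  Value const &at(Key const &key) const { return cache_.at(key); }

 private:
  std::function<Value(Key const &)> builder_;
  std::set<Key> requests_;
  std::map<Key, Value> cache_;
};

Writing into distinct, pre-existing map elements from different threads is fine here because the map structure itself is no longer modified inside the parallel loop.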

@martin-ueding
Contributor Author

I have added the explicit request and build_all to the DilutedFactorFactory. Now in the DilutedTraceFactory::build() I just use this:

  df1.request({t0, b1});
  df2.request({t1, b2});
  df3.request({t2, b3});
  df4.request({t3, b0});
  df1.build_all();
  df2.build_all();
  df3.build_all();
  df4.build_all();

This of course does not change the performance of the code, but I can now work on splitting that further up the chain.

@martin-ueding
Contributor Author

During the train rides I added request, get and build_all member functions to these factories. Now the request of the trace factory will call request on the product factory and the factor factory. The build_all will also delegate that down.

One of the troubles that I found is that the trace and factor factories just receive the time slices that the various operators should sit on. The build functions then take the global list of operators or correlators and just build all of them. The return value of the get functions is usually a map which has all the Dirac structures and momenta in it.

There seem to be only a few time combinations for the various objects. For the QQ products you have three times, for the tr(QQ) there are two times, and for the tr(QQQQ) there are four times. But as we only have one source and one sink time slice we actually only have a handful of combinations there. This means that just parallelizing over the internal time degrees of freedom with threads will not saturate a machine like JUWELS with 48 cores per node. We need to parallelize over the Dirac structures (only one for 3pi) and the momenta (lots for 3pi). But as it is a two-level thing, I would have a parallel for loop for each of these time slice combinations. I'd wish that they were more easily accessible, but that will be another round of reworking.
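For illustration, a sketch (stand-in names, not the actual factory interfaces) of what that two-level layout could look like: the few time-slice combinations stay in an outer serial loop, while the threads are spread over the many momentum/Dirac combinations per time key:

#include <cstddef>
#include <vector>

template <typename TimeKey, typename QnKey, typename BuildOne>
void build_per_time_key(std::vector<TimeKey> const &time_keys,
                        std::vector<QnKey> const &qn_keys,
                        BuildOne build_one) {
  for (auto const &time_key : time_keys) {  // only a handful of iterations
#pragma omp parallel for
    for (std::ptrdiff_t i = 0; i < static_cast<std::ptrdiff_t>(qn_keys.size()); ++i) {
      build_one(time_key, qn_keys[i]);  // many iterations, shared over threads
    }
  }
}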

In the process I have broken the diagrams not needed for 3pi and I don't even recall what exactly I did there. But they use the multiply function for "manual QQ optimization" whereas the other ones already use the cache. I could also add a Q1Q1 product factory to make it more uniform, but I'd like to defer that.

Also it hit me that before I started all this refactoring we basically had everything in lookup tables which told us what we need to compute. Now I have spent effort in making half of the code lazy only to find that we want to make it eager again. I know that as part of this we have gained more unification and abstraction, but it feels like I am going backward again.

Regarding the general strategy, I'd like to discuss that with you, @kostrzewa, when you have some time.

@martin-ueding
Contributor Author

    // Build the diagrams.
    for (auto &diagram : diagrams) {
      if (diagram.correlator_requests().empty()) {
        continue;
      }
      TimingScope<1> timing_scope("request diagram", diagram.name());

      for (auto const slice_pair : block_pair) {
        int const t = get_time_delta(slice_pair, Lt);

        diagram.request(t, slice_pair, q);
      }  // End of slice pair loop.
    }    // End of diagram loop.

    q.build_all();

    for (auto &diagram : diagrams) {
      if (diagram.correlator_requests().empty()) {
        continue;
      }
      TimingScope<1> timing_scope("contract diagram", diagram.name());

      for (auto const slice_pair : block_pair) {
        int const t = get_time_delta(slice_pair, Lt);

        diagram.assemble(t, slice_pair, q);
      }  // End of slice pair loop.
    }    // End of diagram loop.

    q.clear();

I now have the delegation pulled out, as far as I can tell.

  1. You first have the diagrams just request what they need. This in turn calls DilutedTraceFactory::request, which then calls DilutedProductFactory::request and (directly and indirectly) DilutedFactorFactory::request.

  2. The q.build_all() will call the build_all of the three layers of factories, bottom up of course.

  3. Then the diagrams are assembled and nothing new is built. In principle the factories now know what they need and as this is done via time keys and not indices as with the lookup tables, we should be somewhat closer to what we initially had.

Next I will see where I can insert the concurrency. This time I will make a plan first, though 😄.

@martin-ueding
Contributor Author

This is the call graph:

[call graph image: IMG_20191028_120045]

Using the 4⁴ test lattice on the charged diagrams only I have this profile with a code which is not parallelized at this moment: profile.pdf

The time spent in contract() is 13.9 seconds. Now this is split up into these parts:

Part                Seconds
Requesting             4.58
Building parts         8.91
Diagram assembly       0.20
Writing                0.18

The correlator names file has 2223 lines, so removing a bit of JSON stuff leaves us with around 2200 correlators that were computed.

I had hoped that building the parts would be clearly the most expensive part but it seems that the scheduling needs to be either parallelized or be pulled out such that it is re-used for every time slice combination. At the moment it is not, although not everything depends on it.

@martin-ueding
Contributor Author

I've added some more grouping elements into the profile such that we can see the time it took for the factors (Q), the products (QQ) and the traces (tr(Q…)) to build:

[screenshot: Bildschirmfoto_030]

As there are many more combinatoric possibilities for the traces I think that it is reasonable that they take the most time whereas there are limited options for the factors and products. Also the TraceFactory::build function is called just a reasonable number of times. I think that opening a new parallel section within each function would not be that bad. But likely we could do better than that even.

@martin-ueding
Contributor Author

I have added parallelization to just the build function of the six-factor trace like so:

  // We populate the whole map such that we can change its elements in a parallel way
  // later..
  for (ssize_t i = 0; i != ssize(diagram_index_collection); ++i) {
    Tr[time_key][i];
  }

#pragma omp parallel for
  for (ssize_t i = 0; i != ssize(diagram_index_collection); ++i) {
    auto const &c_look = diagram_index_collection[i];

    auto const &l01 = dpf_.get({b0, t1, b2, t2}, {c_look[1], c_look[2]});
    auto const &l23 = dpf_.get({b2, t3, b4, t4}, {c_look[3], c_look[4]});
    auto const &l45 = dpf_.get({b4, t5, b0, t0}, {c_look[5], c_look[0]});

    auto const &value = factor_to_trace(l01 * l23, l45);
    Tr[time_key][i] = value;
  }

This directly makes it faster:

[screenshot: Bildschirmfoto_031]

But then we can also pull out the parallel region a bit into the build_all function:

  void build_all() {
    std::vector<Key> unique_requests;
    unique_requests.reserve(requests_.size());
    for (auto const &time_key : requests_) {
      if (Tr.count(time_key) == 0) {
        unique_requests.push_back(time_key);
        Tr[time_key];
      }
    }

    requests_.clear();

#pragma omp parallel
    {
      // Every thread runs the full loop redundantly; the orphaned
      // '#pragma omp for' inside build() then shares the work of each call.
      for (ssize_t i = 0; i < ssize(unique_requests); ++i) {
        build(unique_requests[i]);
      }
    }
  }

And then in the build function we just have the #pragma omp for, the population of the map is done with #pragma omp master. I am not sure whether I need a barrier before the for loop to make sure that the map is fully populated.

@kostrzewa
Member

We need to parallelize over the Dirac structures (only one for 3pi) and the momenta (lots for 3pi).

Yes, that was what I had thought was the plan from the beginning. I realize that the data structures are not ideal for this.

Also it hit me that before I started all this refactoring we basically had everything in lookup tables which told us what we need to compute. Now I have spent effort in making half of the code lazy only to find that we want to make it eager again. I know that as part of this we have gained more unification and abstraction, but it feels like I am going backward again.

Sure, lazy evaluation kind of necessarily comes with a memory cost and a loss of control unless one can pre-compute how to partition the problem and the "evaluator" is aware of this partitioning.

The nice thing about having a lazy "base" is that it can be used straightforwardly to build dependency lists, as it does now. In other words: in the loop over block pairs, we want to work lazily to derive the dependencies and then eagerly fulfill them.

In the process I have broken the diagrams not needed for 3pi and I don't even recall what exactly I did there. But they use the multiply function for "manual QQ optimization" whereas the other ones already use the cache. I could also add a Q1Q1 product factory to make it more uniform, but I'd like to defer that.

That's okay, but depending on how many combinations we need to compute for the <P(t+t_i) J_mu(t_i) J_nu(t_i)> correlator (@pittlerf, @marcuspetschlies), it might be beneficial to attempt to keep all of the code running. If there are not too many combinations there, we can always use an old version of the code and use this one for 3pi only for now.

I had hoped that building the parts would be clearly the most expensive part but it seems that the scheduling needs to be either parallelized or be pulled out such that it is re-used for every time slice combination. At the moment it is not, although not everything depends on it.

The balance might be a little different on a real configuration, it might be worth running a timing on one of the 24c48 lattices to get an idea.

@kostrzewa
Member

And then in the build function we just have the #pragma omp for, the population of the map is done with #pragma omp master. I am not sure whether I need a barrier before the for loop to make sure that the map is fully populated.

#pragma omp master does not have an implicit barrier, so you definitely need one. Alternatively, use #pragma omp single, which includes a barrier.
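A minimal sketch of the two variants (the vector here is just a placeholder for the map population):

#include <vector>

void populate_then_fill(std::vector<double> &values) {
#pragma omp parallel
  {
#pragma omp master
    {
      values.assign(100, 0.0);  // pre-populate, master thread only
    }
#pragma omp barrier  // required: 'master' carries no implied barrier

    // Alternative: '#pragma omp single' would imply this barrier automatically.

#pragma omp for
    for (int i = 0; i < 100; ++i) {
      values[i] = 2.0 * i;  // worksharing loop over the pre-populated slots
    }
  }
}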

@martin-ueding
Contributor Author

I managed to get a race condition, I think. So in release mode I just had this here:

[Start    ] All contractions
Starts with block pair     0 of     3.
terminate called after throwing an instance of 'std::out_of_range'
  what():  map::at
/home/mu/Lattice/Code/sLapH-contractions/tests/integration-L4/run-integration-test: Zeile 14:  5389 Abgebrochen             (Speicherabzug geschrieben) "$1/contract" -i test_all.ini --start_config 1000 --end_config 1000

Then I ran it with debug mode in GDB and got this here:

[Start    ] All contractions
Starts with block pair     0 of     3.
[New Thread 0x7ffff17ff700 (LWP 6144)]
[New Thread 0x7ffff0ffe700 (LWP 6148)]
[New Thread 0x7ffff07fd700 (LWP 6149)]
Starts with block pair     1 of     3.
contract: /home/mu/Lattice/Code/sLapH-contractions/src/DilutedTrace.cpp:9: Complex inner_product(const DilutedTraces&, const DilutedTraces&): Assertion `right_vec.traces.size() > 0' failed.

Thread 1 "contract" received signal SIGABRT, Aborted.
0x00007ffff6af2e35 in raise () from /lib64/libc.so.6
#0  0x00007ffff6af2e35 in raise () from /lib64/libc.so.6
#1  0x00007ffff6add895 in abort () from /lib64/libc.so.6
#2  0x00007ffff6add769 in __assert_fail_base.cold () from /lib64/libc.so.6
#3  0x00007ffff6aeb566 in __assert_fail () from /lib64/libc.so.6
#4  0x00000000008e6925 in inner_product (left_vec=..., right_vec=...)
    at /home/mu/Lattice/Code/sLapH-contractions/src/DilutedTrace.cpp:9
#5  0x000000000089c7fd in resolve_request (trace_requests=std::vector of length 2, capacity 2 = {...}, 
    slice_pair=..., q=...) at /home/mu/Lattice/Code/sLapH-contractions/src/Diagram.cpp:43
#6  0x000000000089d463 in Diagram::assemble_impl (this=0x61a000004a38, t=2, slice_pair=..., q=...)
    at /home/mu/Lattice/Code/sLapH-contractions/src/Diagram.cpp:81
#7  0x000000000089d189 in Diagram::assemble (this=0x61a000004a38, t=2, slice_pair=..., q=...)
    at /home/mu/Lattice/Code/sLapH-contractions/src/Diagram.cpp:71
#8  0x000000000084a7a5 in contract (Lt=4, dilT=2, dilE=2, nev=16, meson_operator=..., randomvectors=..., 
    perambulators=..., operator_lookup=..., trace_indices_map=std::map with 8 elements = {...}, 
    correlator_requests_map=std::map with 9 elements = {...}, quark_lookup=std::map with 3 elements = {...}, 
    output_path="correlators", output_filename="_cnfg1000.h5", single_time_slice_combination=-1)
    at /home/mu/Lattice/Code/sLapH-contractions/src/Correlators.cpp:120
#9  0x000000000083729c in main (ac=7, av=0x7fffffffe478)
    at /home/mu/Lattice/Code/sLapH-contractions/main/contract.cpp:96

I also let it run in debug mode without GDB and it just took very long, but it also produced the same failure.

As discussed with Bartek I have changed the omp master to omp single to benefit from the implicit barrier. But that does not seem to be sufficient. It sometimes crashes in both release and debug mode and at different time slices. So there is still something going on.

@kostrzewa
Member

kostrzewa commented Oct 28, 2019

Could this simply be an ordering problem? There's currently no way to know which diagram triggered this, but I have a feeling it might have been one of the *D ones, for which the smaller traces must already have been built for it to work. This would have of course been taken care of automatically in the lazy implementation.

@martin-ueding
Contributor Author

martin-ueding commented Nov 2, 2019

My run with a single time slice and single thread has finished on JUWELS. I have used this version here:

jwc08n219.adm08.juwels.fzj.de
This is sLapH-contractions:
  git branch refs/heads/Q0Q2-optimization
  git revision 778af83af02aba90b3aafb857cac10df14b12514
  git state CLEAN
  compiled by ueding1 on juwels02.ib.juwels.fzj.de
  running with up to 1 OpenMP threads

It has taken 2.17 hours to do one source-sink combination:

[Start    ] All contractions
Thread   0 of   1 starts with block pair    10 of   528.
[   Finish] All contractions. Duration: 7009.8596739769 seconds. Threads: 1

We learn that it has taken 77.496 GB of memory at its peak:

7438.49user 266.94system 2:10:11elapsed 98%CPU (0avgtext+0avgdata 77496076maxresident)k
0inputs+0outputs (0major+25397400minor)pagefaults 0swaps

This is great news because the planned parallelization in the diagram assembly will not increase the RAM usage, because the threads will share everything. This means that we can get the job done with 90 GB on JUWELS. Together with the 20% reduction that I found on Friday we are likely in good shape.

For some reason I have an assertion failure in the timing code but still got a timing output. I don't fully understand why, but I'm happy about it. The following is the profile output (profile-dot.pdf):

[profile image: profile-dot]

This is the branch where we have the Q₂Q₀ optimization in place, but the parallelization is not changed with the requests and build_all. That might use a bit more memory for all the requests, I would need to check that.

Unfortunately the requesting also takes some time and at the moment we do that once per time slice. Just taking the numbers from the above, 2 hours for one of the 528 combinations will indeed give us 22 hours in total assuming perfect parallelization. But with the requesting and imperfect parallelization we likely need to split up either the correlators or the time slices. I think that the latter makes more sense because then we can compute all the diagrams for a time slice combination in one go.

Doing the requesting only once would make sense, but that might be a bit more work with the code. I would suggest that I first try my best with the new parallelization and gather timings as well. Then I adjust it such that one can give a time slice range and adjust the job generator to produce a few batches. This should allow us to finally start production and let me improve the performance without delaying getting results.

@kostrzewa
Member

kostrzewa commented Nov 3, 2019

Excellent news, I agree with all your conclusions. One more thing that is not clear to me: the intermediate Q2,Q0,Q2,Q0 is not explicitly cached, is it? Explicitly doing so might help reduce the cost for C6cC

@kostrzewa
Member

This is great news because the planned parallelization in the diagram assembly will not increase the RAM usage, because the threads will share everything. This means that we can get the job done with 90 GB on JUWELS. Together with the 20% reduction that I found on Friday we are likely in good shape.

Looks great, for maximum efficiency we might want to run with 2xN threads, however (where N will be between 16 and 24, I would think). This might mean having to run on the 192 GB nodes, but we have to see.

I'll try to find some time tomorrow to look at the logic you mention in HISKP-LQCD/sLapH-projection-NG#26

@martin-ueding
Contributor Author

One more thing that is not clear to me: the intermediate Q2,Q0,Q2,Q0 is not explicitly cached, is it? Explicitly doing so might help reduce the cost for C6cC

At the moment it is not cached. I would think that it would require even more memory, so I'd like to postpone this issue for a bit.

There is not much parallelization going on but we already have a race condition yielding sporadic segmentation faults. I have changed the profiling such that the calls made within parallel regions are colored in red. Also for the unit test it makes sense to output the profile in a vertical fashion, there is a command line option for that as well now.

profile-dot.pdf

[profile image: profile-dot]

I was going to add more parallelization there, but I first need to find the cause of that race before complicating the problem.

@martin-ueding
Contributor Author

I think I understood and fixed that race. But I'd appreciate some code review at some point. I have parallelized the whole building part now, as you can see in the red marked regions here:

[profile image: profile-dot]

The assembly can be parallelized as well; I have already started with an AtomicAccumulator. For Kahan summation we would need to think about how to do that atomically, but I'll skip that for the moment. That profile was taken on my laptop with 16 threads to make races more likely to surface. It is not representative of the real thing, but it seems that the building phase actually gets cheaper this way. I've submitted this state of the code to JUWELS, hopefully I will get some results by tomorrow.
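For reference, a minimal sketch (an assumption, not the actual implementation) of what an AtomicAccumulator for complex values could look like; plain summation works with '#pragma omp atomic' on the real and imaginary parts, whereas Kahan summation would need per-thread compensation terms instead:

#include <complex>

class AtomicAccumulator {
 public:
  void add(std::complex<double> const &value) {
#pragma omp atomic
    real_ += value.real();
#pragma omp atomic
    imag_ += value.imag();
  }

  std::complex<double> value() const { return {real_, imag_}; }

 private:
  double real_ = 0.0;
  double imag_ = 0.0;
};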

@martin-ueding
Contributor Author

The job with the devel QOS has not been able to finish within 2 hours. The other one has been at it for 13:16:19 hours now and is currently in a single-threaded section of the code. That could mean either requesting or assembling. But since it is only at 33 GB of memory usage on that node, I fear that it is still in the requesting phase.

The timing level is set to 3, so it should not busy itself with too many timings. The NDEBUG flag is missing, so it does a couple of runtime checks such as range checks or checks for empty results. Though with potentially lurking race conditions I think that this might actually be nice and should not be that expensive.

I would think that the requesting phase needs to be sped up as the next action, right?

@kostrzewa
Member

I would think that the requesting phase needs to be sped up as the next action, right?

Seems so... I'm still surprised at how slow this seems to be..

@kostrzewa
Member

How can it be, however, that the requests don't take very much time at all for the test lattice?

@kostrzewa
Member

In request_impl, is it necessary to go via a reference to the value?

@kostrzewa
Member

kostrzewa commented Nov 5, 2019

Finally, timing level 3 suggests to me that tens to hundreds of millions of timings will be performed, judging by the right-most number of calls in your call tree above.

strike that..

@martin-ueding
Contributor Author

As you can see from the last profile from the test lattice on my laptop, the requesting phase is the most expensive part of the program. The number of the various entities differs between the test and the 3pi project, but we have around 2100 correlators in the test and 250k in the real thing. So it is a factor of 120 more work to do.

On my laptop the test takes 11 seconds there. Assuming O(N) scaling, it should take 21 minutes on JUWELS. Assuming O(N log(N)) we would have 35 minutes. If for some reason the whole thing scales as O(N²) then the expectation is 43 hours.

I don't really see where the algorithm should be quadratic. We just iterate through all the correlators, parse them, and schedule the parts into std::map objects which have O(log N) member access and insertion. We could use std::unordered_map, as somebody suggested on Stack Overflow, which has O(1) access times. Then the requesting phase should actually be linear.
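One caveat with std::unordered_map: the composite keys would need a hash. A sketch of what that could look like, assuming a key such as std::array<int, 4> (the actual key types in the code differ):

#include <array>
#include <cstddef>
#include <functional>
#include <unordered_map>

struct ArrayHash {
  std::size_t operator()(std::array<int, 4> const &key) const {
    std::size_t seed = 0;
    for (int const x : key) {
      // boost::hash_combine-style mixing of the indices
      seed ^= std::hash<int>{}(x) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
    }
    return seed;
  }
};

template <typename Value>
using TimeKeyMap = std::unordered_map<std::array<int, 4>, Value, ArrayHash>;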

As far as I recall I have compiled it with -O3 on JUWELS:

./CMakeFiles/lcontract.dir/flags.make:CXX_FLAGS = -O3 -traceback   -Wall -pedantic -qopenmp -std=gnu++11

Perhaps the missing -DNDEBUG and the presence of -traceback really spoil the performance, I could try with just the regular flags.

@martin-ueding
Contributor Author

Now that we fixed that stupid O(N²) thing and made it O(N log(N)), it ran through! This is a single block pair (so a few time slices given the dilution), and it took just ten minutes and 74 GB of memory:

10125.64user 156.02system 10:09.27elapsed 1687%CPU (0avgtext+0avgdata 73614600maxresident)k
0inputs+0outputs (0major+33098413minor)pagefaults 0swaps

And this is the profile for this thing: profile-dot.pdf

[profile image: profile-dot]

We can see that the time spent in contract is 0:05:24 per block pair. That means that taking together all the block pairs will give us 1 day, 23:31:12. With the current code we likely can do it on the small nodes, and we just need to cut it in half. Also we can add another 20% reduction which I have not deployed yet.

The speed-up of the parallel parts can perhaps be increased still, perhaps there are really silly things in there. But it seems to be in reach now, without reducing the amount of physics that we get.

@martin-ueding
Contributor Author

I have recompiled with -O3 -DNDEBUG now and increased the timing level to 4. The needed time has not changed much, but we now have one more level:

[profile image: profile-dot]

@kostrzewa
Member

Looking very good. This means that we can do the physical point lattice with about 2 million core-hours if things go as planned, so we should count on about two months of real runtime, although perhaps it can be done in one month by overdrawing our compute time budget a little.

@martin-ueding
Contributor Author

I have just tried it with and without SMT on JUWELS. For some reason the self time of main has increased from 30 to 260 seconds when I used SMT. Using htop on the node I saw that it was in disk sleep for a pretty long time. Perhaps there is a bottleneck with the file system when using this many threads, or there was a random performance glitch.

It does not really matter, though. The time spent in the building of the diagram is 78 seconds with SMT and 80 seconds without SMT, so we can just ignore the SMT. Also with 78 seconds per block pair we aim for 14 hours per configuration.

I'll submit a job for production now! 🎉

@martin-ueding
Contributor Author

Since the bars are green and not red, I presume that we do not have much thread synchronization overhead but that it is doing actual work:

[screenshot: Bildschirmfoto_007]

And memory seems to be pretty fine.

And I still have the 20% reduction of the number of correlators to implement, that will make it even better.

@kostrzewa
Member

I would guess that SMT will almost certainly hurt you on Skylake for this problem type. Definitely just one thread per core.

@kostrzewa
Member

I would guess that SMT will almost certainly hurt you on Skylake for this problem type.

Actually, disregard that comment. It's not at all clear cut, as some of the memory traffic involves very small objects, it might be that using SMT actually allows one to saturate memory bandwidth better. In any case, the difference, up to the self time, is small.

@martin-ueding
Contributor Author

We could measure it again, but I think that from the profile it is rather clear to see. I forgot to post them. So here is the one without SMT:

[profile image: profile-juwels-I3-noSMT]

And here is the one with SMT:

[profile image: profile-jules-I3-SMT]

@martin-ueding
Contributor Author

The run using every third source time slice on JUWELS has finished. This is the profile from the run without the 20% reduction, taking around 13 hours:

[profile image: profile-dot]

It has taken 11 hours with the 20% reduction in the correlator names. I will have to adjust the analytic prescriptions and then I can generate an overview of the correlator matrices that have been generated.

The assembly of the diagrams has taken several hours now. It is good enough such that we can start production on the regular nodes on JUWELS. I could then also parallelize the assembly and reduction such that we get a little more gains, but I will do that when the contractions are running with the current version.

Memory usage is 38 GB with the reduced one. From the profile it seems that caching the QQQQ objects could be beneficial as well. I am not sure whether they would fit into the memory, though.

@kostrzewa
Member

Looks good!

The assembly of the diagrams has taken several hours now. It is good enough such that we can start production on the regular nodes on JUWELS. I could then also parallelize the assembly and reduction such that we get a little more gains, but I will do that when the contractions are running with the current version.

Agreed, production can proceed now with improvements in the future.

Memory usage is 38 GB with the reduced one. From the profile it seems that caching the QQQQ objects could be beneficial as well. I am not sure whether they would fit into the memory, though.

Perhaps one could churn through the combinatorics and estimate the memory load?

Another avenue for slight optimisation might be to run with 2x24 threads (rather than 48 threads), parallelising in a nested fashion over the block pairs and the combinatorics, respectively. I would hope that this would improve memory locality, reducing cross-socket accesses.

@martin-ueding
Contributor Author

martin-ueding commented Nov 6, 2019

Agreed, production can proceed now with improvements in the future.

Jobs are queued, I'll see about space to line up more already.

Perhaps one could churn through the combinatorics and estimate the memory load?

The numbers are easy. I just took all the C6cC correlators, split them into pairs of operators and counted how many multiplications are needed. Without the cache, we need two multiplications per correlator from the already available QQ objects. Then I counted how many unique QQQQ objects I can form from the first two pairs of the correlators. These need one multiplication each, and then one multiplication with the remaining QQ object.

The results are then these:

  • Cache entries: 30596
  • Multiplications without QQQQ cache: 129326
  • Multiplications with QQQQ cache: 95259

This is a 26 % decrease in the number of multiplications. We cannot reuse anything from the C4cC because there the outer random vector indices are the same, since we want a trace.

I am not sure how large these intermediate QQQQ objects are each. They contain a vector of DilutedFactor which is a matrix and tracks the random vectors used. We are at six random vectors and four terms, so there should be 6×5×4×3 = 360 combinations there. So there are 11,014,560 of these matrices. We have 40 to 50 GB more memory available that we could use. Using 40 GB for an estimate, we have space for 226 complex numbers in that matrix. This means that the side length must not exceed 15. But I think that the side length of the matrix actually is the product of the dilutions, right? So the number of eigenvectors alone exceeds 15, so I would think that we cannot afford this cache.
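Making the arithmetic above explicit (numbers taken from this comment; the 40 GB budget is the stated assumption):

constexpr long cache_entries = 30596;         // unique QQQQ cache keys
constexpr long combinations = 6 * 5 * 4 * 3;  // = 360 random vector combinations
constexpr long n_matrices = cache_entries * combinations;  // = 11,014,560 matrices
constexpr double budget_bytes = 40e9;                      // assumed spare memory
constexpr double bytes_per_matrix = budget_bytes / n_matrices;  // ≈ 3632 bytes
constexpr double complex_per_matrix = bytes_per_matrix / 16.0;  // ≈ 226 complex<double>
// A square matrix therefore cannot have a side length beyond about 15.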

@martin-ueding
Contributor Author

We actually do not need a cache. We just need to order the requests by the QQ objects that they use. Then we can multiply the first two and keep the product, and do all the multiplications with the third. When the second (or the first) changes, we form a new intermediate product. This needs only constant additional memory. Yet another step towards the graph.
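For illustration, a sketch (hypothetical types and accessors, not the actual code) of the reuse-by-ordering idea: sort the requests by their first two QQ factors, rebuild the QQQQ intermediate only when that pair changes, and do one multiplication with the third factor per request:

#include <algorithm>
#include <tuple>
#include <vector>

struct Request {
  int qq01;  // index of the first QQ factor (hypothetical)
  int qq23;  // index of the second QQ factor (hypothetical)
  int qq45;  // index of the third QQ factor (hypothetical)
};

template <typename Matrix, typename GetQQ, typename Emit>
void process_sorted(std::vector<Request> requests, GetQQ get_qq, Emit emit) {
  std::sort(requests.begin(), requests.end(), [](Request const &a, Request const &b) {
    return std::tie(a.qq01, a.qq23) < std::tie(b.qq01, b.qq23);
  });

  Matrix qqqq;  // the current QQQQ intermediate, constant additional memory
  int last01 = -1, last23 = -1;
  for (auto const &r : requests) {
    if (r.qq01 != last01 || r.qq23 != last23) {
      qqqq = get_qq(r.qq01) * get_qq(r.qq23);  // rebuild only when the pair changes
      last01 = r.qq01;
      last23 = r.qq23;
    }
    emit(r, qqqq * get_qq(r.qq45));  // one further multiplication per request
  }
}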

@kostrzewa
Member

@martin-ueding can you remind me what the status of this is? Is it generally applicable now for any type of diagram?

@martin-ueding
Contributor Author

As there is only a Q0Q2 optimization, this will not work for any sort of neutral diagram, and I believe that the C3c and C3cV won't work either. It should not be too hard to extend this, but at the moment only the work needed for three pions is done.

@kostrzewa
Member

Alright, thanks. So for the rho contractions we should use the tagged version as we did a few months ago.

@kostrzewa
Member

Alright, thanks. So for the rho contractions we should use the tagged version as we did a few months ago.

Actually, which version was that? There's no obviously named tag. @pittlerf, @matfischer , do you remember the exact commit that you used for the rho contractions?

@martin-ueding
Contributor Author

Exactly. With the smaller number of intermediate objects that should still fit into memory with a reasonable number of threads.

Isn't the master version just the one for the rho? And the Q0Q2-optimization branch has all the additional changes for the 3pi?

@kostrzewa
Member

Isn't the master version just the one for the rho? And the Q0Q2-optimization branch has all the additional changes for the 3pi?

I see, I thought you had partially merged the changes already. Noted.

@pittlerf
Collaborator

pittlerf commented Feb 6, 2020

Alright, thanks. So for the rho contractions we should use the tagged version as we did a few months ago.

Actually, which version was that? There's no obviously named tag. @pittlerf, @matfischer , do you remember the exact commit that you used for the rho contractions?

Hi, I used the last commit from the rho_nf2 branch
https://github.com/HISKP-LQCD/sLapH-contractions/tree/rho_nf2

@martin-ueding
Contributor Author

I have merged the Q0Q2 optimization and the change of the parallelization axis as just the Q0Q2 optimization alone was not usable due to the memory scaling.

@kostrzewa
Member

Hi, I used the last commit from the rho_nf2 branch

Ah, yes, we have branches (duh...)
